alcr, posted December 5, 2009
I'm scraping URLs from a Google search, using "choose by attribute" on outerhtml with wildcards: <A class=l onmousedown="returnclk*</A>. But this gives me the whole URL, and I don't want the www in front of it. Does anyone have an idea how to scrape just the URL without the www?
random, posted December 5, 2009
I'm really new to UBot, so this probably isn't the answer you're looking for, but one way to do it would be to scrape the whole URL, then iterate through your URL list afterwards, replacing the www with nothing.
alcr, posted December 5, 2009
Well, I don't really want to do it manually. That would be a huge pain in the ass. : /
Guest Jim, posted December 5, 2009
UBot needs more string manipulation functions. In the meantime, here's a solution: execute this JavaScript before scraping the page. It rewrites every link on the page so that "http://www." becomes "http://":

    // Collect every anchor element on the page.
    var links = document.getElementsByTagName("a");
    for (var i = 0; i < links.length; i++) {
        // Drop the "www." that directly follows "http://" in each link target.
        links[i].href = links[i].href.replace("http://www.", "http://");
    }
alcr, posted December 5, 2009
Awesome! Solved the problem! Thanks, Jim~
random, posted December 5, 2009
Quoting alcr: "Well I don't want to do it manually really. That would be a huge pain in the ass. : /"
I didn't mean that you should do it manually. I meant a loop that goes through your entire scraped URL list, changing every http://www. to http://, in the same way as what Jim posted. His is just cooler.
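For reference, the loop over an already scraped list could look roughly like the sketch below. It is plain JavaScript, not UBot commands, and the function name stripWww and the example URLs are made up for illustration. Unlike the snippet Jim posted, it is anchored to the start of each URL and also accepts https, which is an assumption beyond what the thread asked for.

```javascript
// Hypothetical helper: remove a leading "www." from the host part of each URL.
function stripWww(urls) {
  // The pattern is anchored at the start of the string, so only the host
  // prefix is changed; a "www." later in the path or query is left alone.
  return urls.map(function (url) {
    return url.replace(/^(https?:\/\/)www\./i, "$1");
  });
}

// Made-up URLs standing in for a scraped list.
var scraped = [
  "http://www.example.com/page",
  "https://www.example.org/",
  "http://example.net/?ref=http://www.example.com"
];

console.log(stripWww(scraped));
```

The function returns a new list and leaves the scraped one untouched, so the original URLs are still available if they are needed later.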