Not scraping the whole url?

alcr · December 5, 2009

I'm scraping URLs from Google search results.

Choose by attribute

outerhtml

wildcards

But this gives me the whole URL, and I don't want the www in front of it. Anyone got any idea how to just scrape the URL without the www?

random random · December 5, 2009

Really new to UBot here, so this probably isn't the answer you're looking for, but one way to do it would be to scrape the whole URL, then iterate through your URL list afterwards, replacing the www. with nothing.

alcr · December 5, 2009

Well I don't want to do it manually really. That would be a huge pain in the ass. : /

Jim · December 5, 2009

UBot needs more string manipulation functions.

In the meantime, here's a solution:

Execute this JavaScript before scraping the page:

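// Strip the "www." prefix from every link on the page before scraping.
var links = document.getElementsByTagName("a");
for (var i = 0; i < links.length; i++)
{
  // replace() with a string pattern changes only the first occurrence.
  links[i].href = links[i].href.replace("http://www.", "http://");
}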

alcr · December 5, 2009

Awesome! Solved the problem! Thanks Jim~

random random · December 5, 2009

alcr said: "Well I don't want to do it manually really. That would be a huge pain in the ass. : /"

Didn't mean that you should do it manually. I meant a loop that goes through your entire scraped URL list, changing every http://www. to http://, the same way Jim's script does. His is just cooler.
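For anyone who wants the list-based version, here is a minimal sketch of that loop in plain JavaScript. It assumes the scraped URLs are already sitting in an array. The names stripWww, urls and cleaned are made up for illustration and are not UBot functions, and the sample URLs are placeholders. Unlike a plain string replace, the pattern is anchored to the start of the URL and also covers https.

```javascript
// Hypothetical helper: remove a leading "www." from the host part of a URL.
function stripWww(url) {
  // Anchored with ^ so a "www." appearing later in the URL is left alone.
  return url.replace(/^(https?:\/\/)www\./i, "$1");
}

// Placeholder data standing in for the scraped URL list.
var urls = [
  "http://www.example.com/page",
  "https://www.example.org/",
  "http://example.net/?ref=http://www.example.com"
];

// Walk the list once and collect the cleaned URLs.
var cleaned = [];
for (var i = 0; i < urls.length; i++) {
  cleaned.push(stripWww(urls[i]));
}
```

The third sample URL stays as it is, because the www. there is inside the query string rather than at the front.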
