Scraping multiple pages in Google

Martin · August 22, 2010

Hello,

I'm doing some scraping in Google, and for certain search phrases I get more than 100 results. That means I have to navigate to the next page and keep scraping as long as the 'Next' link exists on the page. However, I can't figure out how to identify the 'Next' link. The identifiers I find when using 'Choose by attribute' in UBot are not actually present in the page source. Does anyone know how to do this?

Also, how do I create the flow? I want to keep navigating to the next page and scraping the URLs until I hit the last page, then go to the next search phrase and repeat the process. I'm not sure how to create that loop.

Any help I can get is much appreciated.

Best regards,

Martin

meter · August 22, 2010

You can select by outerhtml, but I think it changes randomly. In a bot I coded I scraped the outerhtml of the button via "Page scrape", and then did "Choose by attribute <scrapedOuterHtml>". Works fine.
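For readers who want to see the idea outside UBot, here is a minimal Python sketch of locating a 'Next' link by its visible text rather than by attributes that may change. It is not meter's bot. The class name `NextLinkFinder`, the helper `find_next_link`, and the assumption that the link's visible text is exactly "Next" are all illustrative; the real page markup may differ.

```python
from html.parser import HTMLParser


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a> whose visible text equals a label."""

    def __init__(self, label="Next"):
        super().__init__()
        self.label = label          # assumed link text; adjust to the real page
        self.next_href = None       # result: href of the matching link, if any
        self._current_href = None   # href of the <a> we are currently inside
        self._text_parts = []       # text collected inside the current <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Collect text only while inside an <a> that has an href
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            if text == self.label and self.next_href is None:
                self.next_href = self._current_href
            self._current_href = None


def find_next_link(html, label="Next"):
    """Return the href of the 'Next' link, or None if the page has none."""
    finder = NextLinkFinder(label)
    finder.feed(html)
    return finder.next_href
```

A return value of `None` can then serve as the "last page reached" signal for the surrounding loop.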

Martin · August 22, 2010

Can you show me a screen dump of how you made that work?

Thanks,

Martin

Gogetta · August 23, 2010

Here you go. It will only scrape 7 pages, but you can easily change that if you want to scrape more.

Edit: This wasn't working in IE8 so I fixed it. I posted a new attachment below.

Google Url Scraper.ubot

meter · August 23, 2010

Gogetta's script will work fine. I just find that Google bans me if I navigate directly to a search URL when I'm doing a lot of scraping (e.g. looking for WordPress blogs). Using random delays and actually clicking the Next button appears to let me scrape a lot more from one IP.
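The flow discussed in this thread (for each search phrase, scrape, follow 'Next' until it disappears, pause a random interval between pages) can be sketched in Python roughly as follows. This is an illustration under assumptions, not the attached bot: `scrape_all`, `fetch_page`, `extract_urls`, and `find_next` are hypothetical names, and the three callables have to be supplied by whatever browser automation you use. The 5 to 30 second delay range is an arbitrary default.

```python
import random
import time


def scrape_all(phrases, fetch_page, extract_urls, find_next,
               min_delay=5.0, max_delay=30.0, sleep=time.sleep):
    """For each phrase, follow 'Next' links until none remain.

    Hypothetical callables supplied by the caller:
      fetch_page(phrase, next_ref) -> page  (next_ref is None for page one)
      extract_urls(page)           -> list of result URLs on that page
      find_next(page)              -> reference to the next page, or None
    """
    results = {}
    for phrase in phrases:
        urls = []
        next_ref = None
        while True:
            page = fetch_page(phrase, next_ref)
            urls.extend(extract_urls(page))
            next_ref = find_next(page)
            if next_ref is None:
                break  # no 'Next' link: last page for this phrase
            # Random pause before moving on to the next page
            sleep(random.uniform(min_delay, max_delay))
        results[phrase] = urls
    return results
```

The `sleep` parameter is only there so the pause can be replaced when trying the loop out without waiting.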

Martin · August 23, 2010

Thank you both for your help and advice so far.

@Gogetta, the script isn't producing the result I expected: the text file I specified to hold the scraped URLs is empty after the run. Here's what I did:

Search term: "Powered by vBulletin" AND inurl:register.php

Pages to Scrape: 2

Search within: Anytime

Delay between pages: 30

In 'Save Location' I specified a text file on my hard drive.

Gogetta, is it working for you if you do the same?

Best regards,

Martin

Gogetta · August 23, 2010

Yeah, it's working for me, and I am on IE7. Maybe just edit the Choose by attribute, the one inside the loop that selects the Google links.

Edit: OK, I fixed it for IE8. The layout seems to differ between IE7 and IE8.

It should now work in both IE7 and IE8. Let me know. Thanks!

Martin · August 23, 2010

Sounds great. When you get a chance, can you please attach the updated version of your bot? Thank you very much for your help.

- Martin

Gogetta · August 23, 2010

I actually just edited the old post and attached it there, but I'll upload it in this one also.

Edit: Hey Martin, take out the Choose by attribute and the add to list right under it; I forgot to remove them. If you leave them in, you will get blank lines. Sorry man, just rushing today. lol!

Take this one out:

<*l onmousedown="return clk(this.href,'','','','*','','*</*>

...and the add to list.

Google Url Scraper.ubot

Martin · August 23, 2010

Gogetta, thank you so much. It's working. Now for some serious scraping :-)

- Martin

Rise · September 4, 2010

Thank you Gogetta,

Your bot works like a charm.

It would be nice to keep only one URL per domain name.

How do I remove the duplicate domain names?

Rise
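The thread does not answer this inside UBot, but if the scraped URLs end up in a text file, one possible approach is to post-process the list. The Python sketch below keeps the first URL seen for each hostname. `one_url_per_domain` is an illustrative name, and treating `www.example.com` and `example.com` as the same site is an assumption; other subdomains are counted as separate domains.

```python
from urllib.parse import urlparse


def one_url_per_domain(urls):
    """Keep the first URL seen for each hostname, preserving input order."""
    seen = set()
    kept = []
    for url in urls:
        # hostname is None for strings that are not absolute URLs
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]  # assumption: www.example.com == example.com
        if host and host not in seen:
            seen.add(host)
            kept.append(url)
    return kept
```

Lines that do not parse as absolute URLs (such as blank lines) are dropped rather than kept.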
