UBot Underground

Scraping multiple pages in Google



Hello,

 

I'm doing some scraping in Google, and for certain search phrases I get more than 100 results. That means I have to navigate to the next page and keep scraping for as long as the 'Next' link exists on the page. However, I can't figure out how to identify the 'Next' link: the identifiers I find when using 'Choose by attribute' in UBot are not actually present in the page source. Does anyone know how to do this?

 

Also, how do I create the flow? Obviously, I want to keep navigating to the next page and scraping URLs until I hit the last page, then move on to the next search phrase and repeat the process. I'm not sure how to build that loop.
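The flow being asked about here amounts to two nested loops: an outer loop over the search phrases and an inner loop that scrapes the current page and follows the 'Next' link until it disappears. A minimal Python sketch of that logic, assuming hypothetical stand-ins for the bot's navigate and scrape steps (`fetch_page`, `RESULT_RE`, and `NEXT_RE` are illustrative, not part of UBot):

```python
import re

# Illustrative patterns standing in for UBot's scrape/choose steps:
# a result link inside an <h3>, and the 'Next' pagination link.
RESULT_RE = re.compile(r'<h3[^>]*><a href="([^"]+)"')
NEXT_RE = re.compile(r'<a href="([^"]+)"[^>]*>Next</a>')

def scrape_all(phrases, fetch_page):
    """Outer loop over search phrases; inner loop follows 'Next' links.

    fetch_page(phrase, url) returns the HTML of a results page;
    url=None means the first results page for that phrase.
    """
    urls = []
    for phrase in phrases:
        url = None
        while True:
            html = fetch_page(phrase, url)
            urls.extend(RESULT_RE.findall(html))  # scrape this page
            nxt = NEXT_RE.search(html)            # is there a Next link?
            if nxt is None:
                break                             # last page reached
            url = nxt.group(1)                    # follow it
    return urls
```

In UBot terms, the inner `while` maps to a "loop while element exists" around the scrape-and-click steps, and the outer `for` to a loop over the list of search phrases.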

 

Any help I can get is much appreciated.

 

Best regards,

Martin


You can select by outer HTML, but I think it changes randomly between requests. In a bot I coded, I scraped the outer HTML of the button with "Page scrape" and then did "Choose by attribute <scrapedOuterHtml>". Works fine.
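The trick described above, scraping the element's outer HTML at runtime and feeding it back as the selector, can be sketched in Python. This is a loose analogue of the UBot approach, not its actual implementation; `scrape_outerhtml` and its regex are assumptions for illustration:

```python
import re

def scrape_outerhtml(html, visible_text="Next"):
    """Scrape the full outer HTML of the link whose visible text is
    `visible_text`, even when its id/class are randomized per request."""
    pattern = r'<a\b[^>]*>\s*%s\s*</a>' % re.escape(visible_text)
    match = re.search(pattern, html)
    return match.group(0) if match else None
```

The scraped string can then serve as the exact selector for the later step (the equivalent of Choose by attribute <scrapedOuterHtml>), e.g. via `html.find(scraped)`.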


The Gogetta script will work fine; I just find that Google bans me if I navigate directly to a search URL when I'm doing a lot of scraping (e.g. looking for WordPress blogs). Using random delays and actually clicking the Next button appears to let me scrape a lot more from one IP.
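The randomized-delay idea mentioned above can be sketched in a few lines of Python; `base_seconds` and `jitter` are illustrative parameters, not anything taken from the bot itself:

```python
import random
import time

def human_delay(base_seconds=30, jitter=0.5):
    """Sleep for a randomized interval around base_seconds.

    With jitter=0.5 the actual wait lands anywhere between 50% and
    150% of the base, so successive page clicks are never evenly
    spaced the way a naive fixed delay would be.
    """
    wait = base_seconds * random.uniform(1 - jitter, 1 + jitter)
    time.sleep(wait)
    return wait
```

Calling `human_delay(30)` between page clicks gives waits of roughly 15-45 seconds, which matches the "Delay between pages: 30" setting discussed later in the thread.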


Thank you both for your help and advice so far.

 

@Gogetta, the script isn't producing the result I expected: the text file I specify to hold the scraped URLs is empty after the run. Here's what I did:

 

Search term: "Powered by vBulletin" AND inurl:register.php

Pages to Scrape: 2

Search within: Anytime

Delay between pages: 30

In 'Save Location' I specified a text file on my hard drive.

 

Gogetta, is it working for you if you do the same?

 

Best regards,

Martin



Yeah, it's working for me, and I am on IE7. Maybe just edit the 'Choose by attribute', you know, the one that selects the Google links inside the loop.

 

Edit: OK, I fixed it for IE8; it seems the layout differs between IE7 and IE8.

It should now work in both IE7 and IE8. Let me know, thanks!


Sounds great. When you get a chance, can you please attach the updated version of your bot? Thank you very much for your help.

 

- Martin

 

I actually just edited the old post and attached it there, but I'll upload it in this one as well.

 

Edit: Hey Martin, take out the 'Choose by attribute' and the 'Add to list' right under it; I forgot to remove them. If you leave them in, you will get blank lines. Sorry man, just rushing today. lol!

 

Take this one out:

<*l onmousedown="return clk(this.href,'','','','*','','*</*>

 

...and the add to list.
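Alternatively, if the stray step is left in, the blank entries it produces could be filtered out of the list before saving. A small Python sketch of that cleanup (`drop_blank_lines` is a hypothetical helper, not part of the attached .ubot file):

```python
def drop_blank_lines(urls):
    """Filter out the empty or whitespace-only entries that the stray
    'Add to list' step would leave in the scraped-URL list."""
    return [u for u in urls if u.strip()]
```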

Google Url Scraper.ubot
