Martin, posted August 22, 2010:
Hello,
I'm scraping Google, and certain search phrases return more than 100 results. That means I have to navigate to the next page and keep scraping while the 'Next' link exists on the page. However, I can't figure out how to identify the 'Next' link: the identifiers I find when using 'Choose by attribute' in UBot are not actually present in the page's code. Does anyone know how to do this?
Also, how do I create the flow? I want to keep navigating to the next page and scraping the URLs until I hit the last page, then move on to the next search phrase and repeat the process. I'm not sure how to create that loop.
Any help is much appreciated.
Best regards,
Martin
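The flow described in the question is two nested loops: an outer loop over search phrases and an inner loop that keeps scraping until the 'Next' link is gone. The following is only a minimal Python sketch of that control flow, not UBot code. The `fetch_page` callable and the `max_pages` cap are hypothetical stand-ins for whatever loads a results page and reports whether a 'Next' link is present.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical shape of one results page: (urls_on_page, has_next_link).
Page = Tuple[List[str], bool]


def scrape_all(
    phrases: Iterable[str],
    fetch_page: Callable[[str, int], Page],
    max_pages: Optional[int] = None,
) -> List[str]:
    """Collect URLs for every phrase, paging until 'Next' disappears."""
    collected: List[str] = []
    for phrase in phrases:  # outer loop: one pass per search phrase
        page_number = 0
        while True:  # inner loop: one pass per results page
            urls, has_next = fetch_page(phrase, page_number)
            collected.extend(urls)
            page_number += 1
            if not has_next:
                break  # last page reached, move on to the next phrase
            if max_pages is not None and page_number >= max_pages:
                break  # optional safety cap on pages per phrase
    return collected
```

In a real bot, `fetch_page` would be the part that scrapes the current page and clicks 'Next'; keeping it separate makes the loop itself easy to check with fake pages.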
meter, posted August 22, 2010:
You can select by outerhtml, but I think it changes randomly. In a bot I coded, I scraped the outerhtml of the button via "Page scrape" and then did "Choose by attribute <scrapedOuterHtml>". Works fine.
Martin, posted August 22, 2010:
Can you show me a screen dump of how you made that work?
Thanks,
Martin
Gogetta, posted August 23, 2010:
Here you go. It will only scrape 7 pages, but you can easily change that if you want to scrape more.
Edit: This wasn't working in IE8, so I fixed it. I posted a new attachment below.
Attachment: Google Url Scraper.ubot
meter, posted August 23, 2010:
Gogetta's script will work fine. I just find that Google bans me if I navigate directly to a search URL when I'm doing a lot of scraping (e.g. looking for WordPress blogs). Using random delays and actually clicking the Next button appears to let me scrape a lot more from one IP.
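The "random delays" idea can be sketched in Python as follows. This is an illustration only, not part of any script in this thread: the function names and the 20 to 40 second default bounds are assumptions chosen for the example.

```python
import random
import time


def random_delay(min_seconds: float, max_seconds: float) -> float:
    """Pick a delay uniformly at random between the two bounds."""
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("need 0 <= min_seconds <= max_seconds")
    return random.uniform(min_seconds, max_seconds)


def pause_between_pages(min_seconds: float = 20.0, max_seconds: float = 40.0) -> float:
    """Sleep for a random interval and return how long the pause was."""
    delay = random_delay(min_seconds, max_seconds)  # bounds are arbitrary defaults
    time.sleep(delay)
    return delay
```

Calling `pause_between_pages()` before each click on the Next button makes the interval between requests vary instead of repeating a fixed value.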
Martin, posted August 23, 2010:
Thank you both for your help and advice so far.
@Gogetta, the script isn't producing the result I expected: the text file I specified to hold the scraped URLs is empty after the run. Here's what I did:
Search term: "Powered by vBulletin" AND inurl:register.php
Pages to Scrape: 2
Search within: Anytime
Delay between pages: 30
In 'Save Location' I specified a text file on my hard drive.
Gogetta, is it working for you if you do the same?
Best regards,
Martin
Gogetta, posted August 23, 2010:
Yeah, it's working for me, and I am on IE7. Maybe just edit the Choose by attribute, you know, the one that selects the Google links inside the loop.
Edit: OK, I fixed it for IE8. The layout seems to be different between IE7 and IE8. It should now work in both. Let me know, thanks!
Martin, posted August 23, 2010:
Sounds great. When you get a chance, can you please attach the updated version of your bot? Thank you very much for your help.
- Martin
Gogetta, posted August 23, 2010:
I actually just edited the old post and attached it there, but I'll upload it in this one also.
Edit: Hey Martin, remove the Choose by attribute shown below and the add to list right under it; I forgot to take them out. If you leave them in, you will get blank lines. Sorry man, just rushing today. lol!
Take this one out:
<*l onmousedown="return clk(this.href,'','','','*','','*</*>
...and the add to list.
Attachment: Google Url Scraper.ubot
Martin, posted August 23, 2010:
Gogetta, thank you so much. It's working. Now for some serious scraping :-)
- Martin
Rise, posted September 4, 2010:
Thank you, Gogetta. Your bot works like a charm. It would be nice to keep only one URL per domain name. How do I delete duplicate domain names?
Rise
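One way to keep a single URL per domain is to post-process the saved list outside the bot. The sketch below is a hedged Python illustration, not a UBot feature: the function name is hypothetical, it assumes each URL includes a scheme such as `http://`, and it treats a leading `www.` as the same site while leaving other subdomains distinct.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def one_url_per_domain(urls: Iterable[str]) -> List[str]:
    """Keep the first URL seen for each host, preserving input order."""
    seen = set()
    kept = []
    for raw in urls:
        url = raw.strip()
        host = urlsplit(url).hostname  # lowercased; None when no host is found
        if host is None:
            continue  # skip blank lines and entries without a scheme
        if host.startswith("www."):
            host = host[4:]  # assumption: www.example.com equals example.com
        if host not in seen:
            seen.add(host)
            kept.append(url)
    return kept
```

Reading the bot's output file line by line, passing the lines through `one_url_per_domain`, and writing the result back would leave one entry per host.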