UBot Underground

Scraping content



Hi everybody,

 

I bought UBot a few days ago, and now I am trying to create my first bot based on what I learned from the tutorial videos. However, I am a bit lost.

 

What I would like to accomplish is the following:

 

1) Navigating to www.google.com and searching for a keyword I type in. - I have solved this already, as it is very easy.

 

2) Getting the bot to collect the first 30 organic search results while ignoring Google Shopping results as well as Google AdWords ads. - I have not found a solution for this.

 

3) Getting the following data from each of these websites:

 

a) URL

b) Meta Title

c) Meta Description

d) Meta Keywords

 

4) Putting this information into a CSV or any other file I can use with Excel. The data for each website should be in one row, with the first column being "URL", the second column "Meta Title", etc.

 

Could anybody point me in the right direction?

 

Many thanks to everybody and happy botting,

Josef

2) Getting the bot to collect the first 30 organic search results while ignoring Google Shopping results as well as Google AdWords ads. - I have not found a solution for this.

 

Typically with Google you have the title, the description, and then the site in the results. You want to scrape the URL between the <cite></cite> tags.
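
If you wanted to prototype that idea outside UBot, a rough Python sketch might look like this (Google's markup changes constantly, so treat the <cite> pattern and the query URL as illustrative assumptions, not gospel):

```python
import re
import requests

# Hypothetical query; a real bot would plug in the user's keyword.
html = requests.get(
    "https://www.google.com/search?q=example+keyword",
    headers={"User-Agent": "Mozilla/5.0"},  # Google rejects blank user agents
).text

# Organic results show the display URL inside <cite>...</cite>; the ad and
# shopping blocks use different markup, so they fall out automatically.
cites = re.findall(r"<cite[^>]*>(.*?)</cite>", html, re.I | re.S)
urls = [re.sub(r"<[^>]+>", "", c).strip() for c in cites][:30]
print(urls)
```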

 

3) Getting the following data from each of these websites:

 

a) URL

b) Meta Title

c) Meta Description

d) Meta Keywords

 

Just set up a loop that runs through each URL, and when the page loads, use "add to list" and insert the document constants.
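
For comparison, here is a minimal Python sketch of that loop (assuming requests and BeautifulSoup stand in for UBot's nav and document constants; error handling omitted):

```python
import requests
from bs4 import BeautifulSoup

def scrape_meta(url):
    """Return [url, title, description, keywords] for one page."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

    def meta(name):
        tag = soup.find("meta", attrs={"name": name})
        return tag.get("content", "") if tag else ""

    title = soup.title.get_text(strip=True) if soup.title else ""
    return [url, title, meta("description"), meta("keywords")]

# Loop over the scraped result URLs and build one row per site.
rows = [scrape_meta(u) for u in ["http://example.com/"]]
```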

 

4) Putting this information into a CSV or any other file I can use with Excel. The data for each website should be in one row, with the first column being "URL", the second column "Meta Title", etc.

 

Watch this video to help you understand the scraping and CSV creation.
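
The CSV half is a few lines in most languages; in Python, for instance, it is just the csv module (placeholder rows here; in practice they'd come from the meta-scraping loop):

```python
import csv

# Placeholder data; a real run would use the rows built by the scraper.
rows = [
    ["http://example.com/", "Example Domain", "An example page", "example, demo"],
]

with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["URL", "Meta Title", "Meta Description", "Meta Keywords"])
    writer.writerows(rows)  # one row per website, ready for Excel
```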

 

Give it your best shot, and if you are still having trouble in a day or so, let us know. Honestly, the best way to learn UBot is to watch and follow the videos and to study bot source. There is a bunch of bot source floating around here.


This is exactly what I have been working on. Right now I have two bots: one scrapes the SERP, and the other scrapes the head attributes and URL, just as you want.

 

The issue is that the second scraper (for keywords, description, title, and URL) doesn't navigate consistently over a large block of URLs. Until that is resolved, you will be limited in how large a sample you can scrape.

 

I'm considering lowering mine to 20 results because doing 100 is not stable.


Here is a bot I made. I have only tested it on 20 results max, but I am using a delay instead of a wait in the nav. It's set to 3 seconds, but if you are having trouble with large lists, you may need to increase the delay.

 

Oh, and I was wrong about the <cite> tag. I am scraping the <A class=l*</A> area as a wildcard instead.
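
For anyone translating that wildcard out of UBot, it is roughly this regex in Python (tied to the Google markup of the day, so purely illustrative):

```python
import re

# Sample snippet standing in for a real results page.
html = '<h3><a class="l" href="http://example.com/">Example</a></h3>'

# UBot's <A class=l*</A> wildcard ~ anchor tags whose class is "l".
links = re.findall(r'<a\s+class="?l"?[^>]*href="([^"]+)"', html, re.I)
print(links)  # ['http://example.com/']
```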

meta_harvester.ubot


That's what I was going to say... those delays help out in ways you wouldn't believe. The bot flies through commands so fast that it's already on to the next thing before the last page has even finished loading.


I'm going to use your bot and see if it will do my job.

 

Thanks. A lot. +Rep

 

One note: if you scrape PDF links in Google, they will throw nasty errors when you try to get the meta information. I manually removed the PDF links between bot 1 and bot 2. Some sort of URL checking would need to be added if something like this were used heavily.
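
A small guard like this (my own sketch, not part of the attached bot) would cover the PDF case and similar non-HTML documents:

```python
from urllib.parse import urlparse

def looks_like_html(url):
    # PDFs and other binary documents have no meta tags to scrape and
    # just throw errors, so drop them before the second bot runs.
    path = urlparse(url).path.lower()
    return not path.endswith((".pdf", ".doc", ".xls", ".ppt"))

urls = ["http://example.com/", "http://example.com/report.pdf"]
urls = [u for u in urls if looks_like_html(u)]  # the PDF link is dropped
```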


It would be handy to be able to dial down the playback speed. It's one of the few features I miss from iMacros.

 

This is another reason I like to use "wait for" instead of a timed delay, "wait finish", etc. It doesn't work in every instance, but it does in the vast majority. You can even use "wait for" followed by a timed delay to get just a hair of extra pause, but never too much. I hate having a bot sit there when I can clearly see it's ready to move on! LOL
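
The same pattern exists outside UBot too; in Python with Selenium, for example, it is an explicit wait plus a token sleep (the "search" element ID is just an assumption for the example):

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://www.google.com/search?q=example")

# "Wait for" a concrete element instead of sleeping a fixed 3 seconds...
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "search"))  # assumed container ID
)
time.sleep(0.5)  # ...then just a hair of extra pause, never too much
```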

 

Jonathan


Right, but when I'm hitting a list scraped from Google, there is nothing consistent to wait for. I suppose I could do an IF/EITHER with a bunch of WAIT FORs.

 

I don't know if it is UBot or IE, but surfing is not robust. I can't feed it 500 sites, go out for the night, and reasonably expect it to complete even 250 without locking up.


When I know there could be one of several possible things on the next page, I've had good luck using a while loop with the 'either' eval and a bunch of search page nodes inside the eval. Don't know if that'll help you at all, though...
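
In generic terms, the 'either' trick is just polling for whichever of several candidate elements shows up first; a rough Python/Selenium sketch (the selectors are made up):

```python
import time
from selenium.webdriver.common.by import By

def wait_for_any(driver, selectors, timeout=15):
    """Return the first selector that appears, like an 'either' of waits."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        for sel in selectors:
            if driver.find_elements(By.CSS_SELECTOR, sel):
                return sel
        time.sleep(0.25)
    raise TimeoutError("none of the expected elements appeared")

# e.g. wait_for_any(driver, ["#search", "#captcha", "form#login"])
```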

 

Jonathan



It helps a little. It just seems like a lot of extra coding to achieve what should be a reasonable expectation out of the box. We should be able to nav across a variety of pages (from the top 100 in the Google SERPs, no less) with a slight delay between each, without freezing up and without requiring nodes and nodes of error checking.


One thing I'm starting to find works quite reliably is waiting for "</body>". 99% (maybe even 99.999%) of HTML pages should have one, and it's a good indication that everything worth scraping has loaded, since </body> tends to come at the end of a page's code.
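
In Selenium terms, for instance, that is a one-line custom wait condition:

```python
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://example.com/")

# </body> sits at the very end of the document, so once it has arrived,
# everything worth scraping should have loaded too.
WebDriverWait(driver, 10).until(lambda d: "</body>" in d.page_source.lower())
```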

