UBot Underground

Scraping content



Hi everybody,

 

I bought UBot a few days ago, and now I am trying to create my first bot based on what I learned from the tutorial videos. However, I am a bit lost.

 

What I would like to accomplish is the following:

 

1) Navigating to www.google.com and searching for a keyword I type in. - I have solved this already, as it is very easy.

 

2) Getting the bot to collect the first 30 organic search results while ignoring Google Shopping results as well as Google AdWords ads. - I have not found a solution for this.

 

3) Getting the following data from each of these websites:

 

a) URL

b) Meta Title

c) Meta Description

d) Meta Keywords

 

4) Putting this information into a CSV or any other file I can use with Excel. The data for each website should be in one row, with the first column being "URL", the second column "Meta Title", etc.

 

Could anybody point me in the right direction?

 

Many thanks to everybody and happy botting,

Josef

2) Getting the bot to collect the first 30 organic search results while ignoring Google Shopping results as well as Google AdWords ads. - I have not found a solution for this.

 

Typically with Google you have the title, the description, and then the site in the results. You want to scrape the URL between the <cite></cite> tags.
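
If you wanted to prototype that idea outside UBot, a rough Python sketch might look like this (Google's markup changes constantly, so treat the <cite> pattern and the query URL as illustrative assumptions, not gospel):

```python
import re
import requests

# Hypothetical query; a real bot would plug in the user's keyword.
html = requests.get(
    "https://www.google.com/search?q=example+keyword",
    headers={"User-Agent": "Mozilla/5.0"},  # Google rejects blank user agents
).text

# Organic results show the display URL inside <cite>...</cite>; the ad and
# shopping blocks use different markup, so they fall out automatically.
cites = re.findall(r"<cite[^>]*>(.*?)</cite>", html, re.I | re.S)
urls = [re.sub(r"<[^>]+>", "", c).strip() for c in cites][:30]
print(urls)
```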

 

3) Getting the following data from each of these websites:

 

a) URL

b) Meta Title

c) Meta Description

d) Meta Keywords

 

Just set up a loop that runs through each URL, and when the page loads, use "add to list" and insert the document constants.
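
For comparison, here is a minimal Python sketch of that loop (assuming requests and BeautifulSoup stand in for UBot's nav and document constants; error handling omitted):

```python
import requests
from bs4 import BeautifulSoup

def scrape_meta(url):
    """Return [url, title, description, keywords] for one page."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

    def meta(name):
        tag = soup.find("meta", attrs={"name": name})
        return tag.get("content", "") if tag else ""

    title = soup.title.get_text(strip=True) if soup.title else ""
    return [url, title, meta("description"), meta("keywords")]

# Loop over the scraped result URLs and build one row per site.
rows = [scrape_meta(u) for u in ["http://example.com/"]]
```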

 

4) Putting this information into a CSV or any other file I can use with Excel. The data for each website should be in one row, with the first column being "URL", the second column "Meta Title", etc.

 

Watch this video to help you understand the scraping and CSV creation.
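
The CSV half is a few lines in most languages; in Python, for instance, it is just the csv module (placeholder rows here; in practice they'd come from the meta-scraping loop):

```python
import csv

# Placeholder data; a real run would use the rows built by the scraper.
rows = [
    ["http://example.com/", "Example Domain", "An example page", "example, demo"],
]

with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["URL", "Meta Title", "Meta Description", "Meta Keywords"])
    writer.writerows(rows)  # one row per website, ready for Excel
```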

 

Give it your best shot, and if you are still having trouble in a day or so, let us know. Honestly, the best way to learn UBot is to watch and follow the videos and to study bot source. There is a bunch of bot source floating around here.


This is exactly what I have been working on. Right now I have two bots: one scrapes the SERP, and the other scrapes the head attributes and URL, just as you want.

 

The issue is that the second scraper (for keywords, description, title, and URL) doesn't navigate consistently over a large block of URLs. Until that is resolved, you will be limited in how large a sample you can scrape.

 

I'm considering lowering mine to 20 results because doing 100 is not stable.


Here is a bot I made. I have only tested it on 20 results max, but I am using a delay instead of a wait in the nav. It's set to 3 seconds, but if you are having trouble with large lists, you may need to increase the delay.

 

Oh, and I was wrong about the <cite> tag. I am scraping the <A class=l*</A> area as a wildcard instead.
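
For anyone translating that wildcard out of UBot, it is roughly this regex in Python (tied to the Google markup of the day, so purely illustrative):

```python
import re

# Sample snippet standing in for a real results page.
html = '<h3><a class="l" href="http://example.com/">Example</a></h3>'

# UBot's <A class=l*</A> wildcard ~ anchor tags whose class is "l".
links = re.findall(r'<a\s+class="?l"?[^>]*href="([^"]+)"', html, re.I)
print(links)  # ['http://example.com/']
```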

meta_harvester.ubot


That's what I was going to say... those delays help out in ways you wouldn't believe. The bot flies through commands so fast that it's already on to the next thing before the last page has even finished loading.


I'm going to use your bot and see if it will do my job.

 

Thanks. A lot. +Rep

 

One note: if you scrape PDF links in Google, they will throw nasty errors when you try to get the meta information. I manually removed the PDF links between bot 1 and bot 2. Some sort of URL checking would need to be added if something like this were used heavily.
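
A small guard like this (my own sketch, not part of the attached bot) would cover the PDF case and similar non-HTML documents:

```python
from urllib.parse import urlparse

def looks_like_html(url):
    # PDFs and other binary documents have no meta tags to scrape and
    # just throw errors, so drop them before the second bot runs.
    path = urlparse(url).path.lower()
    return not path.endswith((".pdf", ".doc", ".xls", ".ppt"))

urls = ["http://example.com/", "http://example.com/report.pdf"]
urls = [u for u in urls if looks_like_html(u)]  # the PDF link is dropped
```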


It would be handy to be able to dial down the playback speed. It's one of the few features I miss from iMacros.

 

This is another reason I like to use "wait for" instead of a timed delay, "wait finish", etc. It doesn't work in every instance, but it does in the vast majority. You can even use "wait for" followed by a timed delay to get just a hair of extra pause, but never too much. I hate having a bot sit there when I can clearly see it's ready to move on! LOL
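
The same pattern exists outside UBot too; in Python with Selenium, for example, it is an explicit wait plus a token sleep (the "search" element ID is just an assumption for the example):

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://www.google.com/search?q=example")

# "Wait for" a concrete element instead of sleeping a fixed 3 seconds...
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "search"))  # assumed container ID
)
time.sleep(0.5)  # ...then just a hair of extra pause, never too much
```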

 

Jonathan


Right, but when I'm hitting a list scraped from Google, there is nothing consistent to wait for. I suppose I could do an IF/EITHER with a bunch of WAIT FORs.

 

I don't know if it is UBot or IE, but surfing is not robust. I can't feed it 500 sites, go out for the night, and reasonably expect it to complete even 250 without locking up.


When I know there could be one of several possible things on the next page, I've had good luck using a while loop with the 'either' eval and a bunch of search page nodes inside the eval. Don't know if that'll help you at all, though...
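
In generic terms, the 'either' trick is just polling for whichever of several candidate elements shows up first; a rough Python/Selenium sketch (the selectors are made up):

```python
import time
from selenium.webdriver.common.by import By

def wait_for_any(driver, selectors, timeout=15):
    """Return the first selector that appears, like an 'either' of waits."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        for sel in selectors:
            if driver.find_elements(By.CSS_SELECTOR, sel):
                return sel
        time.sleep(0.25)
    raise TimeoutError("none of the expected elements appeared")

# e.g. wait_for_any(driver, ["#search", "#captcha", "form#login"])
```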

 

Jonathan



It helps a little. It just seems like a lot of extra coding to achieve what should be a reasonable expectation out of the box. We should be able to nav across a variety of pages (from the top 100 in the Google SERPs, no less) with a slight delay between each, without freezing up and without requiring nodes and nodes of error checking.


One thing I'm starting to find works quite reliably is waiting for "</body>". 99% (maybe even 99.999%) of HTML pages should have one, and it's a good indication that everything worth scraping has loaded, since </body> tends to come at the end of a page's code.
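
In Selenium terms, for instance, that is a one-line custom wait condition:

```python
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://example.com/")

# </body> sits at the very end of the document, so once it has arrived,
# everything worth scraping should have loaded too.
WebDriverWait(driver, 10).until(lambda d: "</body>" in d.page_source.lower())
```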

