UBot Underground

Could someone please explain how the scraping works for me?



Hi,

I've just bought a copy of UBot Studio and, to begin with, I'm looking to write a Yellow Pages scraper and a Google Places scraper.

 

I had a play around with writing a Yellow Pages scraper and managed to get a program that creates a nice text file. There are a few issues, though, and to resolve them I think I need to understand how the scraping works.

 

First problem: the entries are not in the same order as they appear on the page. I would have expected them to match, since I'd anticipate that the scraper just works its way down the page. Is this not how it works? Or are my lists in UBot Studio being stored unordered?

 

Second issue: on Yell.com not every entry has the same fields, so I get unsynchronised data. Again, if it's just scanning down the page then I would expect this, but I can't confirm it because my entries aren't in the same order as they are on the page.

 

I notice that each record is wrapped in a top-level div which looks something like this.

 

<div class="parentListing ui-draggable" id="ad232264_22331508_T2" data-shortlistid="ad100005182203000030" data-natid="232264"> 

 

I was thinking that I could read the id from this div, store it in a variable, and then scrape the other elements only while the variable is still set to the same id. Would this work? This of course assumes the answer to my first issue is that it does scrape the page in order.
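To make the idea concrete, here is a minimal Python sketch (not UBot script) of scraping one record per parent div, so that a missing field becomes an empty value instead of shifting every later entry. The child class names `businessName` and `phoneNumber`, the ids and the sample markup are all made up for illustration; only the `parentListing` class comes from the snippet above.

```python
from html.parser import HTMLParser

# Hypothetical field classes: the real Yell.com markup is not shown in the post.
FIELD_CLASSES = {"businessName": "name", "phoneNumber": "phone"}
# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class ListingParser(HTMLParser):
    """Collects one dict per <div class="parentListing ..."> container."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record being filled, or None when outside a listing
        self._depth = 0       # nesting depth inside the current listing
        self._field = None    # field that the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._current is None:
            if tag == "div" and "parentListing" in classes:
                # Every field starts empty, so a missing one stays empty.
                self._current = {"id": attrs.get("id"), "name": "", "phone": ""}
                self._depth = 1
            return
        self._depth += 1
        for cls in classes:
            if cls in FIELD_CLASSES:
                self._field = FIELD_CLASSES[cls]

    def handle_endtag(self, tag):
        if self._current is None or tag in VOID_TAGS:
            return
        self._field = None
        self._depth -= 1
        if self._depth == 0:  # the parent div has closed: the record is complete
            self.records.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data.strip()


def scrape_listings(html):
    parser = ListingParser()
    parser.feed(html)
    return parser.records


# Made-up sample: the second listing has no phone number.
SAMPLE = """
<div class="parentListing ui-draggable" id="ad1">
  <span class="businessName">Alpha Plumbing</span>
  <span class="phoneNumber">01234 567890</span>
</div>
<div class="parentListing ui-draggable" id="ad2">
  <span class="businessName">Beta Builders</span>
</div>
"""

for record in scrape_listings(SAMPLE):
    print(record)
```

Because each record is built inside its own container, the name and phone number can never drift apart, whatever order the containers come in.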

 

My final question is about running one script from within another. I tried to find a command for it but couldn't manage it. I like to write in a compartmentalised way, so for example I have a script that opens the webpage and queries Yell, then a script that scrapes the data. Is there a way to call the second script from within the first one? It's not a biggie but it would be nice to be able to do something like that.

 

I'm using Windows 7, 64 bit and Chrome as a browser - if this makes a difference.

 

Any help or advice would be greatly appreciated. I've attached my current program so you can take a look.

 

Best regards Steve

 

Yell Bot.ubot

Edited by smb1970

Use the define command to run one script from within another: put the content of the second bot inside a define command and call it from the main bot as a custom command.
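As a rough analogy only, written in Python rather than UBot script, the structure looks like the sketch below: each former script becomes a named block, and the main block calls them in turn. The function names and return values are placeholders invented for this example, not UBot commands.

```python
def open_and_query(search_term):
    """Stand-in for the script that opens the web page and queries Yell."""
    # Placeholder: a real bot would navigate and submit the search here.
    return f"results page for {search_term}"


def scrape_results(page):
    """Stand-in for the script that scrapes the data."""
    # Placeholder: a real bot would read the listings from the page here.
    return [f"record scraped from {page}"]


def main():
    # The main bot simply calls the two compartmentalised pieces in order.
    page = open_and_query("plumbers")
    return scrape_results(page)


print(main())
```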

Worked perfectly this, thanks

Best regards Steve

Content on a web page is often arranged differently in the underlying page code than it appears to a human visitor once the browser has rendered it, because HTML elements can be floated or positioned however the page's coder wishes.

 

In fact, one smart trick for SEO is to place all the navigational code, for instance, at the bottom of the page code and the content itself at the top, so that search engines read the content first and give it more importance, instead of it being diluted by the navigational links and so on. The navigation still shows at the top of the page for human use, which is a MUST for good web design.

This is achieved by floating or positioning the DIV element that wraps the navigation so that it stays at the top of the page, above the other DIVs, even though its code is written last on the page.
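A small Python sketch of the point, using a made-up two-element page: anything that reads the page code sees elements in source order, whatever the styling does to their position on screen. The ids and the inline style are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Made-up page: the navigation is written last but styled to sit at the top.
PAGE = """
<div id="content">Main article text</div>
<div id="nav" style="position:absolute; top:0">Home | About</div>
"""


class IdOrder(HTMLParser):
    """Records element ids in the order they occur in the page code."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id:
            self.ids.append(element_id)


def ids_in_source_order(html):
    parser = IdOrder()
    parser.feed(html)
    return parser.ids


# The parser meets "content" before "nav", although a browser draws nav on top.
print(ids_in_source_order(PAGE))
```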

 

Apart from SEO tricks and such, web designers (for large companies) will often use lots of tricks to discourage us from scraping their content. They change the classes and ids programmatically, quite frequently, so bots stop working very quickly.

 

Anyway, this is an advanced topic, so maybe not of interest to you right now...

Back to your issue.

The lists that result from your scraping can easily be sorted: there is a sort list function.
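One caveat, sketched in Python rather than UBot script: if the names and phone numbers sit in separate lists, sorting each list on its own would break the row-by-row match between them. I have not checked how UBot's own sort handles this, so the sketch only shows the general approach of pairing the values up first and sorting the pairs. The sample data is made up.

```python
# Made-up parallel lists, as a scraper might produce them.
names = ["Beta Builders", "Alpha Plumbing"]
phones = ["020 7946 0000", "01234 567890"]


def sort_together(names, phones):
    """Sort by name while keeping each phone number with its own name."""
    records = sorted(zip(names, phones))  # pair up first, then sort the pairs
    if not records:
        return [[], []]
    return [list(column) for column in zip(*records)]


sorted_names, sorted_phones = sort_together(names, phones)
print(sorted_names)
print(sorted_phones)
```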

 

As for your second question, I am not sure whether the Standard license has access to it, but there is an Advanced scraping selector that allows logical constructs, using AND and OR logic to require that multiple conditions are met. Maybe that is what you need to set up.

 

Hope this helps you.
