smb1970, posted March 18, 2013 (edited March 18, 2013 by smb1970)

Hi,

I just bought a copy of UBot Studio and I'm looking to write a Yellow Pages scraper and a Google Places scraper initially. I had a play around with writing a Yellow Pages scraper and managed to get a program that creates a nice text file, but there are a few issues, and to resolve them I think I need to understand how the scraping works.

First problem: the entries are not in the same order as they are on the page. I would expect them to be, as I'd anticipate that the scraper just goes down the page. Is this not how it works? Or is it a case that my lists in UBot Studio are being stored unordered?

Second issue: on Yell.com, not every entry has the same fields, so I get unsynchronised data. Again, if it's just scanning down the page then I would expect this. I can't confirm it, as my entries aren't in the same order they are on the page. I notice that each record has a top-level element which goes something like this:

<div class="parentListing ui-draggable" id="ad232264_22331508_T2" data-shortlistid="ad100005182203000030" data-natid="232264">

I was thinking that if I check this element for the id and set a variable, and then scrape all the other class elements only while the variable is still set to the same id, would this work? This of course assumes the answer to my first issue is that it does scrape the page in order.

My final question is about the ability to run one script from within another. I tried to find a command for it but couldn't manage it. I like to write in a compartmentalised way, so for example I have a script that opens the webpage and queries Yell, then a script that scrapes the data. Is there a way to call the second script from within the first one? It's not a biggie, but it would be nice to be able to do something like that.

I'm using Windows 7, 64 bit, and Chrome as a browser, if this makes a difference.

Any help or advice would be greatly appreciated. I've attached my current program so you can take a look.

Best regards
Steve

Attachment: Yell Bot.ubot

bestmacros, posted March 18, 2013

smb1970 said: My final question is about the ability to run one script from within another. [...] Is there a way to call the second script from within the first one?

Use the define command for this: put the content of the second bot inside a define command and call it from the main bot as a custom command.

smb1970 (author), posted March 19, 2013

bestmacros said: use define command for this - put content of second bot inside define command and call it from main bot using custom command

This worked perfectly, thanks.

Best regards
Steve

VaultBoss, posted March 19, 2013

smb1970 said: First problem, the entries are not in the same order they are on the page - which I would expect as I'd anticipate that the scraper just goes down the page. Is this not how it works? Or is it a case that my lists in Ubot studio are being stored unordered? [...] I notice that each record has a top line class which goes something like this. <div class="parentListing ui-draggable" id="ad232264_22331508_T2" data-shortlistid="ad100005182203000030" data-natid="232264"> I was thinking if I check the class for the id and set a variable and then try to scrape all the other class elements only if the variable is still set to the same id, would this work?

Content on web pages is often arranged differently in the underlying page code than it appears to the human visitor once the browser has rendered it, because HTML elements can be floated wherever the page's coder wishes.

In fact, one smart SEO trick is to place all the navigational code, for instance, at the bottom of the page code and the content itself at the top, so that search engines read the content first and give it more weight, rather than having it diluted by the navigational links. The navigation still shows at the top of the page for human visitors, which is a must for good web design. This is achieved by floating the DIV element that wraps the navigation so that it stays at the top of the page, above the other DIVs, even though its code is written last on the page.

Apart from SEO tricks, web designers (for large companies) often use tricks of their own to discourage us from scraping their content and to make bots stop working very quickly, for example by changing the classes and ids programmatically, quite frequently.

Anyway, this is an advanced topic, so maybe not of interest to you right now. Back to your issue.

The lists that result from your scraping can easily be sorted: there is a sort list function.

Regarding your second question, I am not sure whether the Standard (STD) license has access to it, but there is an advanced scraping selector that allows logical constructs, using AND and OR logic for multiple conditions to be met. Maybe that is what you need to set up.

Hope this helps you.
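
To illustrate the idea raised in the first post (tie every scraped field to the id of its enclosing listing so that entries with missing fields stay in sync), here is a minimal sketch in Python rather than UBot script. It is not how UBot Studio works internally. Only the "parentListing" wrapper and the first id come from the thread; the inner class names ("listing-name", "listing-phone"), the second id, and the sample business names are assumptions made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical markup: only the "parentListing" wrapper and the first id come
# from the thread; the inner class names and the second record are assumptions.
SAMPLE_HTML = """
<div class="parentListing ui-draggable" id="ad232264_22331508_T2">
  <span class="listing-name">Acme Plumbing</span>
  <span class="listing-phone">01234 567890</span>
</div>
<div class="parentListing ui-draggable" id="ad000001_00000001_T2">
  <span class="listing-name">Brown Boilers</span>
</div>
"""

# Maps an (assumed) CSS class to the field name stored in each record.
FIELDS = {"listing-name": "name", "listing-phone": "phone"}


class ListingParser(HTMLParser):
    """Collects one record per parentListing div, in source order."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record for the listing we are inside, if any
        self._depth = 0       # div nesting depth inside the current listing
        self._field = None    # field that the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            if self._current is not None:
                self._depth += 1
            elif "parentListing" in classes:
                # New listing: every field starts as None, so a missing
                # phone number cannot shift the later entries out of sync.
                self._current = {"id": attrs.get("id"), "name": None, "phone": None}
                self._depth = 1
                self.records.append(self._current)
        if self._current is not None:
            for cls in classes:
                if cls in FIELDS:
                    self._field = FIELDS[cls]

    def handle_data(self, data):
        text = data.strip()
        if self._current is not None and self._field and text:
            self._current[self._field] = text
            self._field = None

    def handle_endtag(self, tag):
        self._field = None  # an empty element must not capture later text
        if tag == "div" and self._current is not None:
            self._depth -= 1
            if self._depth == 0:
                self._current = None  # left the listing wrapper


def scrape_listings(html):
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return parser.records


records = scrape_listings(SAMPLE_HTML)
for record in records:
    print(record)
```

Because each record is built while the parser is inside a single listing wrapper, a listing without a phone number simply keeps None in that field instead of borrowing the next listing's value, and the records come out in the order the wrappers appear in the page source (which, as noted above, may differ from the rendered order).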