Andy 0 Posted July 14, 2010
Hi, I am having an unusual problem at Ezine Articles. I want to scrape the resource box, so I tried scraping by the ID "Sig". However, I get more than just the HTML code found on the page. Look at this image: http://andyjwilliams.co.uk/resource-code.gif
The top image is taken from the source code of the page. The bottom code is taken from within UBot, when I choose by attribute the ID "Sig". Scraping the innerhtml of ID = Sig returns the extra code in the lower screenshot, which seems to have jQuery code inserted. Is UBot inserting this? Does anyone know how I can get the resource box with correctly formatted text, without the jQuery code?
TommyTx 5 Posted July 14, 2010
Ezine Articles goes out of its way to keep you from scraping. It can be done, but it requires some study. You may be picking up a frame in one section while UBot is picking up another frame. If you view the source from the drop-down menus, you will see only the main page's code if frames are used. If you scroll around the page, right-click, and then view source (in IE, that is), you will see the code of whatever frame is under the mouse. If the code is the same when viewed via the drop-down menu and when viewed by right-clicking on the page, then frames are not being used. If the code differs, then frames are being used, and scraping must be aimed at the desired frame areas to gather the correct data. Sometimes the frames are actually part of the CSS and are difficult to detect by viewing the code.
Andy 0 Posted July 14, 2010 Author
Thanks for the reply Tommy, but I am not totally sure I understand. Please forgive my ignorance, but I am new to UBot. If I view the source code in ANY web browser (IE, Firefox and Chrome), the code is the same as the upper part of the screenshot above. This is true whether I use the View -> Source menu at the top or click around the document and right-click -> View Source. If I do a scrape page or scrape chosen attribute in UBot, the code that I get for the webpage is as seen in the lower part of the screenshot. If frames were the issue, would they not be visible in the source code in the web browsers?
TommyTx 5 Posted July 14, 2010
To determine if frames exist, using IE:
1. Click View.
2. Click Source.
3. Remember the code.
Then:
1. Move the mouse to areas on the screen that look blocked or boxed.
2. Right-click.
3. View source.
If the code is different, then frames are used, and gathering data requires making sure UBot is looking at the same area that you want to extract. However, I am not sure frames are being used; it is just a thought for something to look out for.
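The manual check above can also be approximated outside the browser. The following is a minimal Python sketch, not something from UBot: it assumes you have already saved the page's HTML as a string, and the names `FrameFinder` and `find_frames` are made up for illustration. It only lists `<frame>` and `<iframe>` tags; it cannot detect the CSS-based "frames" mentioned earlier in the thread.

```python
from html.parser import HTMLParser


class FrameFinder(HTMLParser):
    """Collects every <frame> and <iframe> tag seen in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.frames = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase, so <IFRAME> matches too.
        if tag in ("frame", "iframe"):
            self.frames.append((tag, dict(attrs).get("src")))


def find_frames(html):
    """Return a list of (tag, src) pairs for each frame element in the HTML string."""
    finder = FrameFinder()
    finder.feed(html)
    return finder.frames


if __name__ == "__main__":
    # Illustrative input only, not taken from the real page.
    sample = '<html><body><IFRAME src="ad.html"></IFRAME><p>Article</p></body></html>'
    print(find_frames(sample))
```

An empty list suggests the page uses no frame elements, which would match what Andy reports seeing in every browser.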
TommyTx 5 Posted July 14, 2010
It might be helpful if you gave the exact URL of the Ezine article you are working on. That way others here may be able to offer assistance. If that gets you nowhere, just post the basic UBot code you are using and lickety-split someone will fix it for you... unless it's top secret. However, if you are a beginner, I doubt it's very top secret. These guys have seen it all before, and then some.
Andy 0 Posted July 14, 2010 Author
Replying to TommyTx's frame-check steps: the code is the same no matter where I click and view source. Also, I have checked lots of Ezine Articles pages and they are all the same. If you want to try a specific one, try this: http://ezinearticles.com/?What-You-Need-to-Know-to-Become-a-Nurse-Educator-and-Every-Level-of-Practice&id=4657225
Andy 0 Posted July 14, 2010 Author
Here is an example bot that tries to scrape the resource box (attached). It saves two files to the root folder of the application: one of the innertext and one of the innerhtml.
Attachment: resource-box-scraper.ubot
TommyTx 5 Posted July 14, 2010
<!-- google_ad_section_start --> <div id="body"> <p>Nurse educators are very crucial to the field of nursing and are needed at just about every level of </table> </div> <!--UdmComment--> <!-- google_ad_section_end --> <!--/UdmComment--></div>
Scraping all the text between the start and stop markers below is something you might work on:
<!-- google_ad_section_start -->
<!-- google_ad_section_end -->
I am on my way to work right now, but I might get a chance to play with it some more later tonight. Anyway, with the code you sent, maybe someone can play with it today for you. Lots of folks pitch in here, so I am sure someone will find a solution for you. Ezine scraping is popular and there are probably a ton of scrapers here and on the web. If you don't want to do all the work, ask if someone has one to share. It's much easier to modify one to do what you want than to design it from scratch. I do all my Ezine scraping using VB6, which has a Microsoft INET module. It works great, but you would need to learn Visual Basic 6, so stick with UBot. It can do a great Ezine scraper job, and I have seen some here already working. Maybe search the forum for "Ezine Ubot Scraper", or even the web; they are all over. Good luck. Maybe someone will have your bot all repaired when I get back later today.
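The "scrape everything between the two markers" idea can be sketched in a few lines of Python. This is only an illustration of the technique, not UBot code: the function name `between_markers` is hypothetical, and it assumes the page HTML is already available as a string and that both comment markers appear exactly as quoted in the post above.

```python
START = "<!-- google_ad_section_start -->"
END = "<!-- google_ad_section_end -->"


def between_markers(html, start=START, end=END):
    """Return the text between the first start marker and the next end marker.

    Returns None if either marker is missing.
    """
    i = html.find(start)
    if i == -1:
        return None
    i += len(start)
    j = html.find(end, i)
    if j == -1:
        return None
    return html[i:j].strip()


if __name__ == "__main__":
    # Shortened, made-up page used only to show the behaviour.
    page = (
        "<div>nav</div>"
        "<!-- google_ad_section_start --> "
        '<div id="body"><p>Nurse educators...</p></div> '
        "<!-- google_ad_section_end -->"
        "<div>footer</div>"
    )
    print(between_markers(page))
```

Plain string searching is used here instead of an HTML parser because the markers are comments, which makes them easy to find verbatim even when the surrounding markup is messy.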
TommyTx 5 Posted July 14, 2010
Here is an example of the search I was referring to. Look through the results while waiting for help with your specific problem. I find that if you find your own fix, you learn 10 times more, plus lots of other stuff, while you search for an answer. You will also see that many folks solve their own problem while looking for answers.
Search for Ezine
I can't get the link to work right. Just type "ezine scraper" into the search box and it pulls up tons of stuff on Ezine scraping.
Andy 0 Posted July 14, 2010 Author
Well, I am stuck here. I tried the Ezine Article to Wordpress bot, but it didn't work, and it didn't scrape the resource box properly either. I think I'll have another look tomorrow when I am fresh.
Andy 0 Posted July 15, 2010 Author
I am tearing my hair out here. I just cannot get the info I want. I was wondering: when you scrape something, is there a way to modify it within UBot? For example, scrape the resource box as innerhtml, but strip out all the jQuery crap?
Andy 0 Posted July 15, 2010 Author
OK, I am beginning to think this is my computer. I moved on to another project, scraping Amazon product names, and when I went to choose the attribute of the hyperlink on the product results page, I got this:
<A class=title href="http://www.amazon.com/JBuds-Hi-Fi-Noise-Reducing-Buds-Black/dp/B000IG66VS/ref=sr_1_1?s=electronics&ie=UTF8&qid=1279185745&sr=1-1" jQuery1279185742886="75">JBuds Hi-Fi Noise-Reducing Ear Buds (Black)</A>
Notice the jQuery crap again. Do other people get this as well, or is it just my computer?
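The `jQuery1279185742886="75"` attribute looks like the bookkeeping marker that older jQuery versions attach to elements they have touched, which an IE-based engine would then write out as part of innerhtml. That explanation is an assumption, not something confirmed in this thread. Whatever the cause, attributes of that shape can be removed after scraping. The sketch below is plain Python rather than UBot, `strip_jquery_expando` is a hypothetical name, and the sample link is a shortened stand-in for the Amazon one above.

```python
import re

# Matches attributes shaped like  jQuery1279185742886="75"
# (double-quoted, single-quoted, or unquoted value).
JQUERY_EXPANDO = re.compile(r"""\s+jQuery\d+=(?:"[^"]*"|'[^']*'|[^\s>]+)""")


def strip_jquery_expando(html):
    """Remove jQuery<digits>=... attributes from an HTML string."""
    return JQUERY_EXPANDO.sub("", html)


if __name__ == "__main__":
    # Shortened, illustrative version of the scraped link.
    scraped = (
        '<A class=title href="http://www.amazon.com/x" '
        'jQuery1279185742886="75">JBuds Hi-Fi Noise-Reducing Ear Buds (Black)</A>'
    )
    print(strip_jquery_expando(scraped))
```

A regular expression is enough here because the unwanted attribute has a very regular shape. It would not remove any inline `<script>` blocks, which would need separate handling.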
Andy 0 Posted July 15, 2010 Author
OK, it's not my computer. I installed UBot on another machine and I get the same thing there. This is really, really frustrating.