How to scrape non-unique data?

tooltrainer · December 26, 2009

I can't for the life of me figure out how to do this...

I want to scrape keywords from www.shopping.com. The pages look like this:

http://www.shopping.com/ts-car_audio_and_electronics~FD-411

Problem is, the entire page is just laid out in simple tables with no classes or other identifying marks. There's NOTHING to go on that I can find... so I can't just scrape by attribute. I've got my bot currently scraping by position since position always starts at 85 on each page, but some pages have 100 results and some less, and I can't figure out how to correctly identify the number of keywords on the page so that it doesn't over-scrape.

Anyone know how to pull this one off?

Thanks!

Jonathan

1nspire · December 26, 2009

This works. I tried it for other categories and it seemed to work.

tooltrainer · December 26, 2009

Interesting, I'll give that a spin thanks!!

Question, what's with the string "wildcards" in the Choose By Attribute?

Thanks!

Jonathan

1nspire · December 26, 2009

When you use choose by attribute you have 3 options for how UBot matches the data: exact match, wildcards, and regex. I am not too familiar with regex, but with exact match UBot will look for the exact string of text on a web page. With wildcards UBot will look for the exact match, with the asterisk (*) standing in for whatever text appears at that spot. In your example the asterisk sits where the HTML of the keyword is, which in this case is the URL and outer text. I hope I explained that well enough. Basically, the asterisk goes where the data you want to scrape is, when, as in your case, you have a list of changing text (keywords).

Now, when I attempted to choose by attribute on the link alone, I scraped all the anchor text on the page. By widening my choose by attribute to include the table, I was able to tell the keyword anchors apart from the other anchors. Look at the image and you will see how you can widen your range when using choose by attribute.
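
For readers who think better in code, here is a rough Python sketch of the wildcard idea. It is only an illustration, not how UBot works internally: the helper name wildcard_to_regex, the sample table cells, and the second URL are made up for the example, and the pattern <a href="*</a> is assumed from the description above.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Hypothetical helper: treat each * as 'any run of characters'."""
    # Escape the literal pieces so characters like < and " match themselves.
    parts = [re.escape(piece) for piece in pattern.split("*")]
    # A lazy group stops at the first closing literal instead of the last one.
    return re.compile("(.*?)".join(parts), re.DOTALL)

# Made-up sample markup; the second link is hypothetical.
html = (
    '<td><a href="/xCH-car_audio_and_video-car_radio_types">car radio types</a></td>'
    '<td><a href="/xCH-car_audio_and_video-car_speakers">car speakers</a></td>'
)

matcher = wildcard_to_regex('<a href="*</a>')
captured = matcher.findall(html)  # the text each * stood in for

for piece in captured:
    print(piece)
```

Each captured piece still contains both the URL and the anchor text, so the wildcard only picks out which elements match. Pulling a single attribute out of them is a separate step.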

tooltrainer · December 26, 2009

Yep I figured it out right after I made my post... I was thinking you were using wildcards in a page scrape not a choose by... my mistake.

What I still don't quite follow though is this html:

This means you're grabbing everything from after the opening href quote, to the closing </a> tag, which to my understanding means you should end up with a string like this:

/xCH-car_audio_and_video-car_radio_types">car radio types

How is it getting reduced down to just the anchor text? It works, but I'm not following how...

Thanks!!

Jonathan

tooltrainer · December 26, 2009

I think I just answered my own question... am I right that by defining "outertext" for the Scrape Chosen Attribute, it's telling UBot to only grab the anchor text of the resulting value that has been identified by the Choose by Attribute?

I think I just had a major realization about how UBot works if this is correct...

Jonathan

1nspire · December 26, 2009

That is correct. Depending on what data the choose by attribute has selected, the scrape by attribute will give you some options for what you want to scrape. By choosing the href option I could scrape the URL instead of the anchor text. Put simply, the choose by attribute tells UBot what to focus on. What you do with the focused data can be handled in a variety of ways: working with forms, scraping data, and clicking buttons are just a few.
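
The same "choose first, then pick which attribute to scrape" split can be sketched in Python with the standard html.parser module. This is an analogy under stated assumptions, not UBot's actual mechanism: the AnchorCollector class is a hypothetical name, and "outertext" is approximated here as the visible text inside the link.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Hypothetical helper: collects (href, text) for every <a> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None

sample = '<a href="/xCH-car_audio_and_video-car_radio_types">car radio types</a>'

collector = AnchorCollector()
collector.feed(sample)
collector.close()

# "Choosing" found the element; now decide which part of it to scrape.
hrefs = [href for href, _ in collector.anchors]   # like scraping href
texts = [text for _, text in collector.anchors]   # like scraping the anchor text
print(hrefs, texts)
```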

tooltrainer · December 26, 2009

OK then, just to make things more fun... I'm trying to add one more step to this puppy. I'd like to crawl and grab all the URLs for all the available categories. Should be really simple but I'm having issues...

Here's the page I want to scrape URLs from:

http://www.shopping.com/top_searches

But again the HTML is crappy. All URLs look like this:

<a href="/ts-2_way_radios~FD-409">2 Way Radios</a>
<br/>

<a href="/ts-~FD-"></a>
<br/>

<a href="/ts-air_conditioner_accessories~FD-96482">Air Conditioner Accessories</a>
<br/>

etc. There's really nothing to differentiate them from the URLs in the footer, navigation area, etc. The best I've managed so far is scraping EVERY link on the entire page. I can't seem to limit it to just the area I want.

I thought I could simply use the same trick you gave previously, something like:

choose by attribute, outer html:
<a href="*</a>
<br/>

and then scrape href of chosen... but it doesn't return anything when I do. Where am I going wrong?

Thanks!!

Jonathan

tooltrainer · December 26, 2009

OK I'm almost there... turns out on closer inspection that ALL the URLs I want start with /ts in them. So now I'm just using choose by attribute, href, http://www.shopping.com/ts* and it's working almost perfectly.
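
As a rough sketch of the same prefix filter outside UBot, assuming the hrefs have already been pulled off the page: the names BASE, PREFIX, and category_urls are illustrative, and the /about link is a made-up stand-in for a navigation or footer link.

```python
from urllib.parse import urljoin

BASE = "http://www.shopping.com/top_searches"
PREFIX = "http://www.shopping.com/ts"

# Sample hrefs in the form shown earlier in the thread.
hrefs = [
    "/ts-2_way_radios~FD-409",
    "/ts-~FD-",
    "/ts-air_conditioner_accessories~FD-96482",
    "/about",  # hypothetical non-category link
]

# Resolve relative hrefs against the page URL, then keep only /ts links.
absolute = [urljoin(BASE, href) for href in hrefs]
category_urls = [url for url in absolute if url.startswith(PREFIX)]

for url in category_urls:
    print(url)
```

Note that the empty spacer link still passes this filter, since it also starts with /ts.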

Only remaining issue is that the page has spacers on it, that are actually links to nowhere. This is some LAME page design! The offending URLs look like this:

http://www.shopping.com/ts-~FD-

I end up with 24 of these in the list, and I'm not sure how to either prevent them from being scraped or remove them afterwards.

Jonathan

tooltrainer · December 26, 2009

Well damn if I ain't on a roll today.

Solution was to create a sub that checks the currently selected URL against the known "bad" URL. If it matches, then it sets the current URL to the next in the list and checks again, etc. When a "good" URL is next, then it continues about its business. Beauty eh?
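
The skip-ahead logic of that sub might look something like this in Python. It is a sketch of the idea only, not the actual UBot sub: next_good_url and the sample list are hypothetical, and the absolute URLs are assumed from the /ts pattern above.

```python
BAD_URL = "http://www.shopping.com/ts-~FD-"

def next_good_url(urls, start):
    """Return the index of the first non-spacer URL at or after start, or None."""
    i = start
    while i < len(urls) and urls[i] == BAD_URL:
        i += 1  # current URL is a spacer, move on to the next one
    return i if i < len(urls) else None

# Sample list with spacers mixed in, as on the top_searches page.
urls = [
    BAD_URL,
    "http://www.shopping.com/ts-2_way_radios~FD-409",
    BAD_URL,
    BAD_URL,
    "http://www.shopping.com/ts-air_conditioner_accessories~FD-96482",
    BAD_URL,
]

# Alternative: strip the spacers from the list once, up front.
cleaned = [url for url in urls if url != BAD_URL]

print(next_good_url(urls, 0), len(cleaned))
```

Filtering the list once is simpler when the whole list is available before the crawl starts; the skip-ahead version fits better when URLs are checked one at a time inside a loop.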

So now I have a new weird problem... one particular page at shopping.com appears to be laid out differently than the others. Go figure... so this is causing the scraper to break because the attributes aren't what it expects. Here's the bad page:

http://www.shopping.com/ts-camping_equipment~FD-59719

It has only a single result on it but the part I want is laid out like this:

<tr>
<td width="10%">1</td>
<td align="left" width="90%">
<a href="-ultima_ii">ultima ii</a>
</td>
</tr>

as opposed to all the others which are laid out like this:

<a href="/ts-2_way_radios~FD-409">2 Way Radios</a>
<br/>

<a href="/ts-~FD-"></a>
<br/>

<a href="/ts-air_conditioner_accessories~FD-96482">Air Conditioner Accessories</a>
<br/>

What's the best way to account for isolated fringe cases like this?

Jonathan

tooltrainer · December 26, 2009

Wow this page is just plain weird... it doesn't even appear if you try to load it with Firefox, and in IE it appears but is broken and doesn't have valid "Top ranking keyword searches for " text. I dunno what to make of this but I suppose at a minimum I can trap for this text missing when I try to grab it as my category header.

Very weird.

Jonathan

tooltrainer · December 27, 2009

OK what was screwing me up was that Firefox and IE return totally different pages for MANY of the shopping.com URLs... I have no idea why! But what this meant was that what *I* was seeing in FF was not what UBot was seeing in IE, and hence I was really confused about what kept happening.

Now that I'm checking the source in IE I can see what UBot is seeing and it was very easy to trap for the bad URLs just by searching the page for "Sorry, the page you requested was not found".
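
In code terms the trap is just a substring check on the page source. A minimal Python sketch, assuming the source is already in hand as a string (is_missing_page and the sample pages are made up for illustration):

```python
NOT_FOUND_MARKER = "Sorry, the page you requested was not found"

def is_missing_page(page_source: str) -> bool:
    """True when the page source contains the site's not-found message."""
    return NOT_FOUND_MARKER in page_source

# Made-up page sources for illustration.
bad_page = "<html><body><p>Sorry, the page you requested was not found</p></body></html>"
good_page = '<html><body><a href="/xCH-car_audio_and_video-car_radio_types">car radio types</a></body></html>'

for source in (bad_page, good_page):
    if is_missing_page(source):
        print("skip this URL")
    else:
        print("scrape this page")
```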

Still would sure love to know why the difference between browser types... have started another thread on that over here:

http://ubotstudio.com/forum/index.php?/topic/2361-ubot-browser-ie-vs-firefox-different-data/

Thanks 1nspire for all the help!

Jonathan

1nspire · December 27, 2009

Glad you got it sorted out. But yes, FF and IE often get different styles on the same website. Since shopping.com is another eBay site, you would think they have a strong buyer database, so they may even go as far as showing different product listings based on the browser used.
