Scrape Chosen question

gtsitour · September 11, 2010

Hi everybody,

I'd like to know how you would handle a situation like the one described below.

Imagine the following HTML structure:

<div id="content">

<h2>First Group of Categories</h2>

<ul class=grayul>
<li><a href="/folder/category47-page1.html"><b>Building</b></a> <span class=black>(554)</SPAN> – Found in architecture and building, such as bricks, pavements, tiles etc.
<li><a href="/folder/category55-page1.html"><b>Frames</b></a> <span class=black>(86)</SPAN> – Picture frame products.
<li><a href="/folder/category14-page1.html"><b>Misc</b></a> <span class=black>(1113)</SPAN> – Products that do not fit into any other category.
</ul>

<h2>Second Group of Categories</h2>

<ul class=grayul>
<li><a href="/folder/category44-page1.html"><b>Creative</b></a> <span class=black>(663)</SPAN> – Artistic and Creative products.
<li><a href="/folder/category45-page1.html"><b>Misc</b></a> <span class=black>(452)</SPAN> – Products that do not fit into any other category.
</ul>

</div>

Let's say that we want to scrape the href, the title, the count and the description from the lis of only the first ul.

As you can see, both uls have the same class name and follow exactly the same structure.

The only thing that makes them different is the h2 heading before each of them.

How would you handle such a situation?

Thanks in advance

JohnB · September 11, 2010

It looks like the number in the class can be used to delineate between items.

gtsitour · September 11, 2010

Unfortunately this is not a number, it's the lowercase L.

The class name is the same for both uls and it's "grayul".

The only thing that is different between the uls is the leading <h2> tag.

Any ideas?

IRobot · September 11, 2010

Use $page scrape.

Add the items to a $list.

Then process the first item in the list (i.e. the first URL).

gtsitour · September 11, 2010

I tried that, but the $page scrape gets all the lis from both uls and not just the lis from the 1st ul that we need.

I also tried to choose the 1st ul by attribute and then use the $scrape chosen command, but no luck.

I can't find a way to process/scrape just the first ul.

Seth Turin · September 11, 2010

Here's a piece of UBot ninjitsu to try:

Choose the first ul by position.

Scrape its outer HTML to a variable; we'll call it 'u'.

Choose the page's body.

Change the body's outer HTML attribute to the variable u.

This leaves only the first ul on the page, so you can choose and scrape its lis the way you normally would.
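
For anyone working outside UBot, the same isolate-then-scrape idea can be sketched in Python. This is only an illustration and not UBot script: the names FirstListScraper and scrape_first_group, and the trimmed HTML sample, are made up for the example. It assumes the markup looks like the snippet in the first post, and it keys on the h2 heading text instead of on position, using only the standard-library html.parser.

```python
from html.parser import HTMLParser

# Trimmed copy of the markup from the first post (unclosed <li>, </SPAN> in caps).
HTML = """<div id="content">
<h2>First Group of Categories</h2>
<ul class=grayul>
<li><a href="/folder/category47-page1.html"><b>Building</b></a> <span class=black>(554)</SPAN> – Found in architecture and building, such as bricks, pavements, tiles etc.
<li><a href="/folder/category55-page1.html"><b>Frames</b></a> <span class=black>(86)</SPAN> – Picture frame products.
<li><a href="/folder/category14-page1.html"><b>Misc</b></a> <span class=black>(1113)</SPAN> – Products that do not fit into any other category.
</ul>
<h2>Second Group of Categories</h2>
<ul class=grayul>
<li><a href="/folder/category44-page1.html"><b>Creative</b></a> <span class=black>(663)</SPAN> – Artistic and Creative products.
<li><a href="/folder/category45-page1.html"><b>Misc</b></a> <span class=black>(452)</SPAN> – Products that do not fit into any other category.
</ul>
</div>"""


class FirstListScraper(HTMLParser):
    """Collects li data only from the ul that follows a given h2 heading."""

    def __init__(self, heading):
        super().__init__()
        self.heading = heading
        self.in_h2 = False       # currently inside an <h2>
        self.h2_text = ""        # text of the h2 being read
        self.armed = False       # matching h2 seen, waiting for its <ul>
        self.in_target = False   # currently inside the wanted <ul>
        self.field = None        # which field incoming text belongs to
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2, self.h2_text = True, ""
        elif tag == "ul" and self.armed:
            self.in_target, self.armed = True, False
        elif self.in_target:
            if tag == "li":
                self.items.append({"href": "", "title": "", "count": "", "description": ""})
                self.field = "description"
            elif tag == "a" and self.items:
                self.items[-1]["href"] = dict(attrs).get("href", "")
                self.field = "title"
            elif tag == "span" and self.items:
                self.field = "count"

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False
            self.armed = self.h2_text.strip() == self.heading
        elif tag == "ul":
            self.in_target = False
        elif tag in ("a", "span") and self.in_target:
            self.field = "description"  # text after the link/count is description

    def handle_data(self, data):
        if self.in_h2:
            self.h2_text += data
        elif self.in_target and self.items and self.field:
            self.items[-1][self.field] += data


def scrape_first_group(html, heading="First Group of Categories"):
    """Return href, title, count and description for each li under the heading."""
    parser = FirstListScraper(heading)
    parser.feed(html)
    parser.close()
    rows = []
    for item in parser.items:
        rows.append({
            "href": item["href"],
            "title": item["title"].strip(),
            "count": int(item["count"].strip("() \n")),
            # Drop the leading dash that separates the count from the description.
            "description": item["description"].strip().lstrip("–-").strip(),
        })
    return rows


if __name__ == "__main__":
    for row in scrape_first_group(HTML):
        print(row)
```

Passing "Second Group of Categories" as the heading would pick up the other ul instead, which is the one difference between the two lists that the markup offers.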

gtsitour · September 11, 2010

This is indeed ninjitsu! :-) LOL

...but it sounds absolutely logical!

I can't wait to try it out. I'll report back the results.

Thanks
