UBot Underground

Scrape Chosen question



Hi everybody,

I'd like to know how you would handle a situation like the one described below.

Imagine the following HTML structure:

 

<div id="content">

<h2>First Group of Categories</h2>

<ul class=grayul>
<li><a href="/folder/category47-page1.html"><b>Building</b></a> <span class=black>(554)</SPAN> – Found in architecture and building, such as bricks, pavements, tiles etc.
<li><a href="/folder/category55-page1.html"><b>Frames</b></a> <span class=black>(86)</SPAN> – Picture frame products.
<li><a href="/folder/category14-page1.html"><b>Misc</b></a> <span class=black>(1113)</SPAN> – Products that do not fit into any other category.
</ul>

<h2>Second Group of Categories</h2>

<ul class=grayul>
<li><a href="/folder/category44-page1.html"><b>Creative</b></a> <span class=black>(663)</SPAN> – Artistic and Creative products.
<li><a href="/folder/category45-page1.html"><b>Misc</b></a> <span class=black>(452)</SPAN> – Products that do not fit into any other category.
</ul>

</div>

 

Let's say we want to scrape the href, the title, the count and the description of each li in the first ul only.

As you can see, both uls have the same class name and follow exactly the same structure.

The only thing that distinguishes them is the h2 heading before each one.

 

How would you handle such a situation?

 

Thanks in advance


Use $page scrape and add the items to a $list.

 

Then process the first item in the list (i.e. the first URL).
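
For anyone who wants to see the "scrape everything into a list, then take the first item" idea outside UBot, here is a rough Python sketch. It is not UBot code, and it assumes each ul block is scraped as one list item. The `html` string is a trimmed copy of the markup from the question, and the names `ul_blocks` and `first_block` are made up for the example.

```python
import re

# Trimmed copy of the markup from the question (descriptions left out).
html = """<div id="content">
<h2>First Group of Categories</h2>
<ul class=grayul>
<li><a href="/folder/category47-page1.html"><b>Building</b></a>
<li><a href="/folder/category55-page1.html"><b>Frames</b></a>
<li><a href="/folder/category14-page1.html"><b>Misc</b></a>
</ul>
<h2>Second Group of Categories</h2>
<ul class=grayul>
<li><a href="/folder/category44-page1.html"><b>Creative</b></a>
<li><a href="/folder/category45-page1.html"><b>Misc</b></a>
</ul>
</div>"""

# Rough equivalent of a page scrape: capture whatever sits between each
# opening <ul class=grayul> and the next </ul>, one list item per block.
ul_blocks = re.findall(
    r"<ul class=grayul>(.*?)</ul>", html, flags=re.DOTALL | re.IGNORECASE
)

# Process only the first item in the list, i.e. the first ul.
first_block = ul_blocks[0]
hrefs = re.findall(r'href="([^"]+)"', first_block)
print(hrefs)
```

Because the list keeps the blocks in page order, index 0 is always the ul under the first h2, even though both uls share the same class.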


Here's a piece of UBot ninjitsu to try:

Choose the first ul by position.

Scrape its outer html to a variable; we'll call it 'u'.

Choose the page's body.

Change its outer html attribute to the variable u.

This isolates only the items you want, so that you can choose them and scrape them the way you normally would.
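
The same "isolate first, then scrape normally" trick can be sketched in Python for comparison. This is only an illustration, not UBot code: the `page` string is the sample markup from the question, `u` mirrors the variable name used above, and `item_pattern` and `rows` are names made up for the example. The regular expression assumes every li follows exactly the layout shown in the question.

```python
import re

# Sample markup copied from the question.
page = """<div id="content">
<h2>First Group of Categories</h2>
<ul class=grayul>
<li><a href="/folder/category47-page1.html"><b>Building</b></a> <span class=black>(554)</SPAN> – Found in architecture and building, such as bricks, pavements, tiles etc.
<li><a href="/folder/category55-page1.html"><b>Frames</b></a> <span class=black>(86)</SPAN> – Picture frame products.
<li><a href="/folder/category14-page1.html"><b>Misc</b></a> <span class=black>(1113)</SPAN> – Products that do not fit into any other category.
</ul>
<h2>Second Group of Categories</h2>
<ul class=grayul>
<li><a href="/folder/category44-page1.html"><b>Creative</b></a> <span class=black>(663)</SPAN> – Artistic and Creative products.
<li><a href="/folder/category45-page1.html"><b>Misc</b></a> <span class=black>(452)</SPAN> – Products that do not fit into any other category.
</ul>
</div>"""

# Step 1: grab the outer HTML of the first ul by position.
# re.search returns the first match, and .*? stops at the nearest </ul>.
u = re.search(r"<ul\b.*?</ul>", page, flags=re.DOTALL | re.IGNORECASE).group(0)

# Step 2: scrape "normally", but only inside the isolated fragment.
# IGNORECASE lets </span> in the pattern match the </SPAN> in the markup.
item_pattern = re.compile(
    r'<li><a href="(?P<href>[^"]+)"><b>(?P<title>[^<]+)</b></a>\s*'
    r"<span class=black>\((?P<count>\d+)\)</span>\s*[–-]\s*"
    r"(?P<description>[^<\n]+)",
    flags=re.IGNORECASE,
)

rows = []
for m in item_pattern.finditer(u):
    row = m.groupdict()
    row["count"] = int(row["count"])
    row["description"] = row["description"].strip()
    rows.append(row)

for row in rows:
    print(row)
```

If the order of the groups might change, the same idea should still work by starting the search at the "First Group of Categories" heading text instead of relying on position.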

