UBot Underground

How do you use regex to scrape a url if a string exists within a <td>



Hi Guys

 

There are a number of rows in a table.

How do I scrape the URL into one list and the anchor text into another list?
 

The rows I want to scrape contain a <span title="Number of active keywords">(*)</span>.

 

Rows to be ignored don't have the span <span title="Number of active keywords">(*)</span>.

 

The number in parentheses, e.g. (20), changes from row to row.
 

Scrape URL

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

<td class="cluster-name" style="text-align: left">

<span title="Number of active keywords">(20)</span>
 

<td class="cluster-name" style="text-align: left">

<span title="Number of active keywords">(15)</span>
 

<td class="cluster-name" style="text-align: left">

<span title="Number of active keywords">(5)</span>
 

<td class="cluster-name" style="text-align: left">

<span title="Number of active keywords">(18)</span>
------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 

 

Ignore URL

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

<td class="cluster-name" style="text-align: left">

<a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making kits online</a>
</td>
 
<td class="cluster-name" style="text-align: left">
<a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making hobs online</a>
</td>
 
<td class="cluster-name" style="text-align: left">
<a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making yeast online</a>
</td>
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 
I have tried the following but can't seem to get it to work:
 
    clear list(%Silo_article_keyword_list)
    set(#Silo_article_keyword_list, $nothing, "Global")
    set(#Silo_article_keyword_list, $trim($replace regular expression($find regular expression($scrape attribute(<class="cluster-name">, "outerhtml"), "(?:<td.*?<span title=\"Number of active keywords\">\\(\\d+\\)</span>.*?<a href=\"(.*?)\".*?</td>)|(?:<td.*?<a href=\"(.*?)\".*?<span title=\"Number of active keywords\">\\(\\d+\\)</span>.*?</td>)"), "([^aA-zZ\\s]+)", $nothing)), "Global")
    add list to list(%Silo_article_keyword_list, $list from text(#Silo_article_keyword_list, $new line), "Don\'t Delete", "Global")

 


This should work:

 

add list to list(%links, $find regular expression($scrape attribute(<(tagname=td AND class="cluster-name")>,"innerhtml"), "(?<=href=\\\").*(?=\\\")"), "Delete", "Global")


Hi, thanks for the response.

 

The code is working 95%!

 

However, it also grabs the first row of the ignore list:

<td class="cluster-name" style="text-align: left">

<a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making kits online</a>
</td>
 
Any idea why this is happening?

All the rows in your ignore list have the same URL, and the "add list to list" command has a setting that deletes duplicate items from the list. You can change the "Delete" setting to "Don't Delete". This will keep all three links in the list.
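
To make that concrete, here is a rough sketch in Python (not UBot script, purely for illustration). The cells are the three "Ignore URL" rows from the question. `dedupe` is a hypothetical stand-in for what the "Delete" setting does, not UBot's actual implementation. It shows that a pattern looking only at href="..." matches every row, with or without the span, and that removing duplicates then collapses the three identical URLs into one.

```python
import re

# The three "Ignore URL" cells from the question (inner HTML of each <td>).
ignore_cells = [
    '<a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making kits online</a>',
    '<a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making hobs online</a>',
    '<a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making yeast online</a>',
]

# Same idea as the first suggested pattern: it only looks at href="...",
# so it matches whether or not the cell contains the <span>.
href_pattern = re.compile(r'(?<=href=").*(?=")')


def scrape_links(cells):
    """Collect every href match from every cell, duplicates included."""
    links = []
    for cell in cells:
        links.extend(href_pattern.findall(cell))
    return links


def dedupe(items):
    """Hypothetical stand-in for the "Delete" duplicates setting (keeps first occurrence)."""
    return list(dict.fromkeys(items))


all_links = scrape_links(ignore_cells)
kept_links = dedupe(all_links)
print(len(all_links), len(kept_links))
```

So the single "ignore" row showing up in the list would be consistent with all three ignored rows being matched and then merged by the duplicate filter.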


Sorry, I hadn't read your question carefully. This code should work:

 

clear list(%links)
clear list(%anchors)
add list to list(%links, $find regular expression($scrape attribute(<(tagname="td" AND class="cluster-name")>, "innerhtml"), "(?<=<a href=\\\").*?(?=\\\">.*?</a>\\s+<span)"), "Don't Delete", "Global")


add list to list(%anchors, $find regular expression($scrape attribute(<(tagname="td" AND class="cluster-name")>, "innerhtml"), "(?<=<a href=\\\".*?\\\">).*?(?=</a>\\s+<span)"), "Don't Delete", "Global")

 

This will keep your duplicate items in the list.
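
For anyone who wants to test the idea outside UBot, here is a minimal Python sketch of the same filtering logic. A few assumptions: the question does not show the links in the wanted rows, so the URLs and anchor texts below are made up, and the link is assumed to be followed by whitespace and then the span (which is what `</a>\s+<span` in the patterns above expects). Python's `re` module does not accept variable-width lookbehind, so this uses capture groups instead of the lookarounds, and it checks the span's title as well, which makes it slightly stricter than the patterns above.

```python
import re

# Hypothetical cells: these URLs and anchor texts are invented for illustration.
cells = [
    '<a href="http://example.com/nodes/1">beer kits</a>\n'
    '<span title="Number of active keywords">(20)</span>',
    '<a href="http://example.com/nodes/2">beer hops</a>',  # no span: should be ignored
    '<a href="http://example.com/nodes/3">beer yeast</a>\n'
    '<span title="Number of active keywords">(5)</span>',
]

# Group 1 is the URL, group 2 is the anchor text. The match only succeeds
# when the link is followed by whitespace and the keyword-count span.
row_pattern = re.compile(
    r'<a href="([^"]*)">(.*?)</a>\s+'
    r'<span title="Number of active keywords">\(\d+\)</span>'
)

links, anchors = [], []
for cell in cells:
    match = row_pattern.search(cell)
    if match:
        links.append(match.group(1))
        anchors.append(match.group(2))

print(links)
print(anchors)
```

Both lists are filled in the same pass, so the URL at a given position always belongs to the anchor text at the same position, and nothing is deduplicated.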


@YuraB

That worked perfectly, thanks for all the help. You're a legend.

I was not aware one could place conditions within a scrape.

(tagname="td" AND class="cluster-name")

I am assuming this has to be hard coded?
