How do you use regex to scrape a url if a string exists within a <td>

mdc101 · May 20, 2014

Hi guys,

There are a number of rows in a table.

How do I scrape the URLs into one list and the anchor text into another list?

The rows I want to scrape contain a span, shown here as (*).

Rows to be ignored don't have the (*) span.

(20) is a number that changes.

Scrape URL

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

(20)

(15)

<a href="http://subdomain.domin.com/blueprints/2088/nodes/40940">beer kits recipes</a>

(5)

(18)

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Ignore URL

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

<a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making kits online</a>

</td>

<a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making hobs online</a>

</td>

<a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making yeast online</a>

</td>

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

I have tried the following but can't seem to get it to work:

clear list(%Silo_article_keyword_list)

set(#Silo_article_keyword_list, $nothing, "Global")
 set(#Silo_article_keyword_list, $trim($replace regular expression($find regular expression($scrape attribute(<class="cluster-name">, "outerhtml"), "(?:<td.*?\$\\d+\$.*?<a href=\"(.*?)\".*?</td>)|(?:<td.*?<a href=\"(.*?)\".*?\$\\d+\$.*?</td>)"), "([^aA-zZ\\s]+)", $nothing)), "Global")
 add list to list(%Silo_article_keyword_list, $list from text(#Silo_article_keyword_list, $new line), "Don\'t Delete", "Global")

YuraB · May 21, 2014

This should work:

add list to list(%links, $find regular expression($scrape attribute(<(tagname=td AND class="cluster-name")>,"innerhtml"), "(?<=href=\\\").*(?=\\\")"), "Delete", "Global")

mdc101 · May 21, 2014

Hi, thanks for the response.

The code is 95% working!

However, it also grabs the first row of the ignore list:

<a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making kits online</a>

</td>

Any idea why this is happening?

YuraB · May 21, 2014

All the URLs in your ignore list are the same, and the "add list to list" command has a setting that deletes duplicate items from the list. You can change the "Delete" setting to "Don't Delete". This will keep all three links in the list.
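One plausible reading of the symptom is that the first pattern matches every href, whether or not a span follows, and the duplicate removal then collapses the three identical ignored URLs into one. The sketch below reproduces that in Python rather than UBot script so it can be run on its own. The row markup and the names `rows`, `ANY_HREF`, `all_links`, and `deduped` are assumptions for illustration, since the original HTML was stripped from the post, and the order-preserving de-duplication is only a stand-in for the "Delete" setting.

```python
import re

# Hypothetical rows: the real markup is not shown in the thread.
rows = [
    '<a href="http://subdomain.domin.com/blueprints/2088/nodes/40940">beer kits recipes</a> <span>(20)</span>',
    '<a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making kits online</a>',
    '<a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making hobs online</a>',
    '<a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making yeast online</a>',
]

# Matches any href value, with no check for a following <span>.
ANY_HREF = re.compile(r'href="([^"]*)"')

# Every row contributes its URL, wanted or not.
all_links = [url for row in rows for url in ANY_HREF.findall(row)]

# Stand-in for the "Delete" duplicates setting: keep the first copy of each URL.
deduped = list(dict.fromkeys(all_links))

print(len(all_links))  # four URLs before duplicates are removed
print(deduped)         # the wanted URL plus one copy of the ignored URL
```

If this reading is right, switching to "Don't Delete" would show all three ignored URLs rather than remove them, so the pattern itself still needs a condition on the span.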

YuraB · May 22, 2014

Sorry, I hadn't read your question carefully. This code should work:

clear list(%links)
clear list(%anchors)
add list to list(%links, $find regular expression($scrape attribute(<(tagname="td" AND class="cluster-name")>, "innerhtml"), "(?<=<a href=\\\").*?(?=\\\">.*?</a>\\s+<span)"), "Don't Delete", "Global")

add list to list(%anchors, $find regular expression($scrape attribute(<(tagname="td" AND class="cluster-name")>, "innerhtml"), "(?<=<a href=\\\".*?\\\">).*?(?=</a>\\s+<span)"), "Don't Delete", "Global")

This will keep your duplicate items in the list
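For anyone who wants to try the same idea outside UBot Studio, here is a minimal Python sketch of the "link followed by a span" condition. It is not a translation of the UBot code: the lookbehind in the anchor pattern above is variable-width, which Python's `re` module rejects, so capture groups stand in for it, and negated character classes replace the lazy `.*?`. The sample markup, `SAMPLE_HTML`, `LINK_WITH_SPAN`, and `scrape_marked_links` are all hypothetical; the structure is inferred only from the `</a>\s+<span` part of the patterns.

```python
import re

# Hypothetical markup, inferred from the "</a>\s+<span" part of the patterns.
SAMPLE_HTML = """
<td class="cluster-name">
<a href="http://subdomain.domin.com/blueprints/2088/nodes/40940">beer kits recipes</a>
<span>(20)</span>
</td>
<td class="cluster-name">
<a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making kits online</a>
</td>
"""

# Group 1 is the URL, group 2 is the anchor text. The match only succeeds
# when the closing </a> is followed by whitespace and then a <span.
LINK_WITH_SPAN = re.compile(r'<a href="([^"]*)">([^<]*)</a>\s+<span')


def scrape_marked_links(html):
    """Return (links, anchors) for links that are followed by a span."""
    pairs = LINK_WITH_SPAN.findall(html)
    links = [url for url, _ in pairs]
    anchors = [text for _, text in pairs]
    return links, anchors


links, anchors = scrape_marked_links(SAMPLE_HTML)
print(links)
print(anchors)
```

Because both lists come from the same matches, the URL at a given index always belongs to the anchor text at that index, and duplicates are kept, as with "Don't Delete".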

mdc101 · May 22, 2014

@YuraB

That worked perfectly, thanks for all the help. You're a legend.

I was not aware one could place conditions within a scrape:

(tagname="td" AND class="cluster-name")

I am assuming this has to be hard-coded?
