mdc101 | Posted May 20, 2014

Hi guys,

There are a number of rows in a table. How do I scrape the URLs into one list and the anchor text into another list?

The rows I want to scrape contain a <span title="Number of active keywords">(*)</span>, where (*) is a number in parentheses that changes from row to row, e.g. (20). Rows to be ignored don't have that span.

Scrape URL:

<td class="cluster-name" style="text-align: left"><a href="http://subdomain.domin.com/blueprints/2088/nodes/40938">beer kits</a><span title="Number of active keywords">(20)</span>
<td class="cluster-name" style="text-align: left"><a href="http://subdomain.domin.com/blueprints/2088/nodes/40939">buy beer kits</a><span title="Number of active keywords">(15)</span>
<td class="cluster-name" style="text-align: left"><a href="http://subdomain.domin.com/blueprints/2088/nodes/40940">beer kits recipes</a><span title="Number of active keywords">(5)</span>
<td class="cluster-name" style="text-align: left"><a href="http://subdomain.domin.com/blueprints/2088/nodes/40941">beer kits</a><span title="Number of active keywords">(18)</span>

Ignore URL:

<td class="cluster-name" style="text-align: left"><a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making kits online</a></td>
<td class="cluster-name" style="text-align: left"><a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making hobs online</a></td>
<td class="cluster-name" style="text-align: left"><a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making yeast online</a></td>

I have tried the following but can't seem to get it to work:

clear list(%Silo_article_keyword_list)
set(#Silo_article_keyword_list, $nothing, "Global")
set(#Silo_article_keyword_list, $trim($replace regular expression($find regular expression($scrape attribute(<class="cluster-name">, "outerhtml"), "(?:<td.*?<span title=\"Number of active keywords\">\\(\\d+\\)</span>.*?<a href=\"(.*?)\".*?</td>)|(?:<td.*?<a href=\"(.*?)\".*?<span title=\"Number of active keywords\">\\(\\d+\\)</span>.*?</td>)"), "([^aA-zZ\\s]+)", $nothing)), "Global")
add list to list(%Silo_article_keyword_list, $list from text(#Silo_article_keyword_list, $new line), "Don\'t Delete", "Global")

YuraB | Posted May 21, 2014

This should work:

add list to list(%links, $find regular expression($scrape attribute(<(tagname=td AND class="cluster-name")>,"innerhtml"), "(?<=href=\\\").*(?=\\\")"), "Delete", "Global")

mdc101 | Author | Posted May 21, 2014

Hi,

Thanks for the response. The code is 95% working! The problem is that it also grabs the first row of the ignore list:

<td class="cluster-name" style="text-align: left"><a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making kits online</a></td>

Any idea why this is happening?

YuraB | Posted May 21, 2014

All the rows in your ignore list have the same URL, and the "add list to list" command is set to delete duplicate items, so only one of them ends up in the list. You can change the "Delete" setting to "Don't Delete". This will keep all three links in the list.

YuraB | Posted May 22, 2014

Sorry, I hadn't read your question carefully. This code should work:

clear list(%links)
clear list(%anchors)
add list to list(%links, $find regular expression($scrape attribute(<(tagname="td" AND class="cluster-name")>, "innerhtml"), "(?<=<a href=\\\").*?(?=\\\">.*?</a>\\s+<span)"), "Don't Delete", "Global")
add list to list(%anchors, $find regular expression($scrape attribute(<(tagname="td" AND class="cluster-name")>, "innerhtml"), "(?<=<a href=\\\".*?\\\">).*?(?=</a>\\s+<span)"), "Don't Delete", "Global")

Because the lookaheads require a <span after the closing </a>, rows without the span are skipped. "Don't Delete" keeps any duplicate items in the lists.
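
For anyone who wants to check the pattern logic outside UBot, here is a minimal Python sketch of the same idea: only keep a link when its </a> is followed by a <span. It is not UBot code and makes a few assumptions: the sample cells are modelled on the rows from the first post, the names CELLS, ROW and split_links_and_anchors are made up for illustration, and \s* is used instead of \s+ so that it also matches when there is no whitespace between </a> and <span. Python's re module only allows fixed-width lookbehinds, so the sketch uses capture groups rather than the lookarounds above.

```python
import re

# Sample cell contents modelled on the rows in the first post.
# The last cell has no <span>, so it should be ignored.
CELLS = [
    '<a href="http://subdomain.domin.com/blueprints/2088/nodes/40938">beer kits</a>'
    '<span title="Number of active keywords">(20)</span>',
    '<a href="http://subdomain.domin.com/blueprints/2088/nodes/40939">buy beer kits</a>\n'
    '<span title="Number of active keywords">(15)</span>',
    '<a href="http://subdomain.domin.com/blueprints/2088/nodes/40941">beer kits</a>'
    '<span title="Number of active keywords">(18)</span>',
    '<a href="http://subdomain.domin.com/blueprints/2088/nodes/40937">buy beer making kits online</a>',
]

# Group 1 is the URL, group 2 is the anchor text.
# The trailing "\s*<span" makes the match fail for cells without the span.
ROW = re.compile(r'<a href="(.*?)">(.*?)</a>\s*<span')


def split_links_and_anchors(cells):
    """Return (links, anchors) for cells whose link is followed by a <span>."""
    links, anchors = [], []
    for cell in cells:
        match = ROW.search(cell)
        if match:
            links.append(match.group(1))
            anchors.append(match.group(2))
    return links, anchors


links, anchors = split_links_and_anchors(CELLS)
for link, anchor in zip(links, anchors):
    print(anchor, "->", link)
```

Duplicates are kept on purpose (the anchor text "beer kits" appears twice), which mirrors the "Don't Delete" setting.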

mdc101 | Author | Posted May 22, 2014

@YuraB That worked perfectly. Thanks for all the help, you're a legend.

I was not aware one could place conditions within a scrape selector:

(tagname="td" AND class="cluster-name")

I am assuming this has to be hard-coded?