Need Help with Extracting HTML Tags

grantwood · September 27, 2013

Hello,

I am using the Page Scrape command to extract the text below:

<td colspan="3" class="heading small"><strong>Product charges</strong></td>
</tr>

<tr>
<td class="title small">Tuneband for iPhone 4 & iPhone 4S, Black, Grantwood Technology's Armband, Silicone Skin, and Front/Back Screen Protector</td>
<td class="space"> </td>
<td class="quantity small">Qty:  1</td>
<td class="small"></td>
<td class="amount small">$21.99</td>
</tr>

<tr>
<td class="title small">Tuneband for iPhone 5 (NOT FOR IPHONE 5C OR IPHONE 5S), Black, Grantwood Technology's Armband, Silicone Skin, and Front Screen Protector</td>
<td class="space"> </td>
<td class="quantity small">Qty:  1</td>
<td class="small"></td>
<td class="amount small">$22.99</td>
</tr>
<tr>
<td colspan="5" height="25px"><hr></td>
</tr>

I want to use the Find Regular Expression function to extract all occurrences of the tag <td class="title small">, which should be (2) occurrences in this example. There is always a "title" class name, and sometimes there is more than one, like "title small".

However, when I use the following regex, only (1) occurrence is returned. Any ideas?

add list to list(%temp_list, $find regular expression(#temp, "<td class=\"title.*\">"), "Delete", "Global")

HelloInsomnia · September 27, 2013

Here you go:

add list to list(%temp_list, $find regular expression(#temp, "(?<=title\\ssmall\\\"\\>).*?(?=\\<)"), "Delete", "Global")
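For anyone reading along, here is a minimal sketch of what that pattern does, written in Python rather than uBot script so it can be run on its own. The `html` string is a shortened stand-in for the scraped page, and the variable names are just for illustration. The lookbehind `(?<=...)` requires the opening tag to sit immediately before the match without including it, `.*?` takes as few characters as possible, and the lookahead `(?=<)` stops at the next `<`.

```python
import re

# Shortened stand-in for the scraped table (assumed sample data).
html = '''<tr>
<td class="title small">Tuneband for iPhone 4</td>
<td class="quantity small">Qty:  1</td>
<td class="amount small">$21.99</td>
</tr>
<tr>
<td class="title small">Tuneband for iPhone 5</td>
<td class="quantity small">Qty:  1</td>
<td class="amount small">$22.99</td>
</tr>'''

# Same pattern as the uBot call, minus the uBot string escaping:
#   (?<=title\ssmall">)  text must be preceded by the opening tag
#   .*?                  lazy match, as short as possible
#   (?=<)                stop right before the next "<"
pattern = r'(?<=title\ssmall">).*?(?=<)'

titles = re.findall(pattern, html)
print(titles)
```

Because the tag itself is only in the lookbehind, each result is just the cell text, one entry per row.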

grantwood · September 27, 2013

Wow! That works perfectly. If I want to extract the quantities and amounts, would I use:

add list to list(%temp_list, $find regular expression(#temp, "(?<=quantity\\ssmall\\\"\\>).*?(?=\\<)"), "Delete", "Global")

add list to list(%temp_list, $find regular expression(#temp, "(?<=amount\\ssmall\\\"\\>).*?(?=\\<)"), "Delete", "Global")

HelloInsomnia · September 27, 2013

Yes, that looks right.

grantwood · September 27, 2013

The regex for the quantity only returns (1) occurrence. Is that because the (2) occurrences have the same value?

grantwood · September 27, 2013

Never mind. The list was configured to delete duplicates. Duh!
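For readers who run into the same thing, a small Python sketch of the effect (not uBot code; the sample `html` and the variable names are assumptions for illustration). The regex finds both quantity cells, but since they hold the same text, removing duplicates leaves a single entry, which is what the "Delete" option on the list did here.

```python
import re

# Two rows with identical quantity text (assumed sample data).
html = '''<td class="quantity small">Qty:  1</td>
<td class="amount small">$21.99</td>
<td class="quantity small">Qty:  1</td>
<td class="amount small">$22.99</td>'''

# The regex itself finds both quantity cells.
quantities = re.findall(r'(?<=quantity\ssmall">).*?(?=<)', html)

# Dropping duplicates (order preserved) collapses them into one entry.
unique_quantities = list(dict.fromkeys(quantities))

print(len(quantities), len(unique_quantities))
```

So if every row needs to be kept, the list has to be set to keep duplicates, as in the "Don't Delete" calls later in this thread.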

grantwood · June 22, 2015

If there are returns embedded in the text, then the current regex does not produce any matches. For example:

<td class="amount small">
      $21.99
</td>

How would you modify the regex to extract the text (including any returns, tabs, spaces, etc.)? Also, how would you strip all of these characters (uBot's $trim command only strips spaces), leaving just the amount?

add list to list(%temp_list, $find regular expression(#temp, "(?<=amount\\ssmall\\\"\\>).*?(?=\\<)"), "Don\'t Delete", "Global")

HelloInsomnia · June 24, 2015

You can start with this:

set(#temp, "<td class=\"amount small\">
      $21.99
</td>", "Global")
add list to list(%temp_list, $find regular expression($trim($replace(#temp, $new line, $nothing)), "(?<=amount\\ssmall\\\"\\>).*?(?=\\<)"), "Don\'t Delete", "Global")

And then when you use each list item you can call $trim to get rid of any extra spaces. That should be able to do it all for you.
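As a rough illustration of why the original pattern fails and why removing the line breaks fixes it, here is a Python sketch (not uBot code; variable names are hypothetical). By default `.` does not match a newline, so the match cannot get from the opening tag to the `<` of `</td>`. The last part uses Python's `re.DOTALL` flag as an alternative; whether uBot's `$find regular expression` offers an equivalent is not confirmed in this thread.

```python
import re

# The cell with embedded line breaks, as in the example above.
html = '<td class="amount small">\n      $21.99\n</td>'
pattern = r'(?<=amount\ssmall">).*?(?=<)'

# 1) Unmodified input: "." stops at the newline, so nothing matches.
no_match = re.findall(pattern, html)

# 2) The approach from this post: remove line breaks first, then match,
#    then trim the leftover spaces from each result.
flattened = html.replace('\r', '').replace('\n', '')
amounts_flat = [m.strip() for m in re.findall(pattern, flattened)]

# 3) Python-only alternative: let "." match newlines, then strip
#    all surrounding whitespace (spaces, tabs, line breaks).
amounts_dotall = [m.strip() for m in re.findall(pattern, html, flags=re.DOTALL)]

print(no_match, amounts_flat, amounts_dotall)
```

Both working variants end up with just the amount once the surrounding whitespace is stripped.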

grantwood · June 25, 2015

That will work. Thank you!

I wish uBot Studio would add a multiline option.
