Need help getting first instance! :O

Pizza Pro · October 3, 2013

Hi guys, I need help with this. Been playing with it for hours, but I can't seem to complete it. I should have asked for help earlier.

For this block of text:

<div class="panel entry-content" id="tab-description" style="display: block; ">
<p>Text 1</p>
<p>Text 2</p>
<p>Text 3</p>
<h3>Headline</h3>
<p>Text 4</p>
<p>Text 5</p>
<p>Text 6</p>
<h3>Headline</h3>
<p>Text 5</p>
<p>Text 4</p>

I want to extract everything from <p> until the first <h3>. So in this case, it would be:

<p>Text 1</p>
<p>Text 2</p>
<p>Text 3</p>

I tried this:

<p>[\s\S]*?(?=<h3>)

But it selects all instances.

I can't get any farther than that no matter what I do. Help!

Kreatus (Ubot Ninja) · October 3, 2013

Check this code below:

set(#data, "<div class=\"panel entry-content\" id=\"tab-description\" style=\"display: block; \">
<p>Text 1</p>
<p>Text 2</p>
<p>Text 3</p>
<h3>Headline</h3>
<p>Text 4</p>
<p>Text 5</p>
<p>Text 6</p>
<h3>Headline</h3>
<p>Text 5</p>
<p>Text 4</p>", "Global")
set(#data, $replace(#data, "
", ""), "Global")
set(#extract, $find regular expression(#data, "(?<=\">).*?(?=<h3>)"), "Global")
set(#extract, $replace(#extract, "</p>", "</p>
"), "Global")

You need to remove the newlines before you can regex it.
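
For anyone who wants to test the same idea outside UBot, here is a rough sketch of those steps in Python. It assumes Python's re module is close enough to UBot's regex engine for this pattern, which may not hold in every detail. The variable names are made up for illustration. The pattern uses `.`, which does not match a newline by default, so it finds nothing until the newlines are stripped.

```python
import re

# Sample markup from the thread; the newlines are part of the string.
data = (
    '<div class="panel entry-content" id="tab-description" style="display: block; ">\n'
    "<p>Text 1</p>\n<p>Text 2</p>\n<p>Text 3</p>\n"
    "<h3>Headline</h3>\n"
    "<p>Text 4</p>\n<p>Text 5</p>\n<p>Text 6</p>\n"
    "<h3>Headline</h3>\n"
    "<p>Text 5</p>\n<p>Text 4</p>"
)

# Same pattern as in the UBot snippet: start right after `">`,
# stop right before the first <h3>.
pattern = r'(?<=">).*?(?=<h3>)'

# With the newlines still in place, `.` cannot cross a line break,
# so there is no match at all.
with_newlines = re.search(pattern, data)

# Step 1: remove the newlines.
flat = data.replace("\n", "")

# Step 2: take the first (lazy) match up to the first <h3>.
match = re.search(pattern, flat)

# Step 3: put a newline back after each closing </p>.
extract = match.group(0).replace("</p>", "</p>\n")

print(with_newlines)
print(extract)
```

The lookbehind depends on the `">` at the end of the opening div line, so this version needs that line to be present.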

Pizza Pro · October 3, 2013

Thanks a lot for the help, Kreatus. A few questions. What is the reason why I need to remove the new lines in UBot? Why is it that I can't use:

[\s\S]*?

Also, what if I don't have this line in front, because it changes with each page:

<div class="panel entry-content" id="tab-description" style="display: block; ">

If for some pages, I only had this:

<p>Text 1</p>
<p>Text 2</p>
<p>Text 3</p>
<h3>Headline</h3>
<p>Text 4</p>
<p>Text 5</p>
<p>Text 6</p>
<h3>Headline</h3>
<p>Text 5</p>
<p>Text 4</p>

Then, how would I extract the paragraphs before the first <h3>?

Thanks for the help.

Kreatus (Ubot Ninja) · October 3, 2013

That is going to be a tough one without seeing the whole code.

Can you send me a link to the exact page you're trying to scrape?

HelloInsomnia · October 3, 2013

UBot and regex are a bit weird sometimes. You have to do silly workarounds, and this is one of those, but it does work.

set(#var, "<div class=\"panel entry-content\" id=\"tab-description\" style=\"display: block; \">
<p>Text 1</p>
<p>Text 2</p>
<p>Text 3</p>
<h3>Headline</h3>
<p>Text 4</p>
<p>Text 5</p>
<p>Text 6</p>
<h3>Headline</h3>
<p>Text 5</p>
<p>Text 4</p>", "Global")
clear list(%regex)
add list to list(%regex, $find regular expression(#var, "\\<p[^~]+?(?=\\<h3)"), "Delete", "Global")
set(#result, $list item(%regex, 0), "Global")
alert(#result)

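Basically, it matches every character except ~ after the <p and before the <h3. A negated class like [^~] also matches newlines, so it works as long as the text has no ~ in it. I do it this way because it's not easy to select multiple lines using regex in UBot. The regex returns every match, so the script puts them in a list and takes item 0, which is the first instance.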
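
A quick way to check this behavior outside UBot is the Python sketch below. It assumes Python's re module behaves like UBot's engine for these patterns, and the variable names are invented for the example. It runs on the shorter sample without the div line, and compares the [^~] pattern with the [\s\S] pattern from the first post. Both return one match per block that ends in an <h3>, so the first list item is the part wanted.

```python
import re

# The shorter sample from the thread, without the leading <div> line.
html = (
    "<p>Text 1</p>\n<p>Text 2</p>\n<p>Text 3</p>\n"
    "<h3>Headline</h3>\n"
    "<p>Text 4</p>\n<p>Text 5</p>\n<p>Text 6</p>\n"
    "<h3>Headline</h3>\n"
    "<p>Text 5</p>\n<p>Text 4</p>"
)

# [^~] matches any character except ~, newlines included.
tilde_matches = re.findall(r"<p[^~]+?(?=<h3)", html)

# [\s\S] matches any character at all, newlines included.
any_char_matches = re.findall(r"<p>[\s\S]*?(?=<h3>)", html)

# Both patterns find every block that is followed by an <h3>;
# the first list item is the block before the first headline.
first = tilde_matches[0].strip()

print(len(tilde_matches))
print(first)
```

Because both patterns start at <p rather than at the div, the same approach works whether or not the div line is on the page.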
