UBot Underground

Need help getting first instance! :O



Hi guys, I need help with this. Been playing with it for hours, but I can't seem to complete it. I should have asked for help earlier.

 

For this block of text:

<div class="panel entry-content" id="tab-description" style="display: block; ">
<p>Text 1</p>
<p>Text 2</p>
<p>Text 3</p>
<h3>Headline</h3>
<p>Text 4</p>
<p>Text 5</p>
<p>Text 6</p>
<h3>Headline</h3>
<p>Text 5</p>
<p>Text 4</p>

I want to extract everything from the first <p> up to, but not including, the first <h3>. So in this case, it would be:

<p>Text 1</p>
<p>Text 2</p>
<p>Text 3</p>

I tried this:

<p>[\s\S]*?(?=<h3>)

But it selects all instances, not just the first one. :( I can't get any further than that no matter what I do. Help!
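
For readers who want to see what the pattern above does outside UBot, here is a minimal sketch in Python. Python's re module is an assumption made purely for illustration, and UBot's own regex engine may differ in details. The sketch suggests the pattern itself is fine: it matches once before each <h3>, and the first match is the wanted block.

```python
import re

# Sample text copied from the post above.
HTML = """<div class="panel entry-content" id="tab-description" style="display: block; ">
<p>Text 1</p>
<p>Text 2</p>
<p>Text 3</p>
<h3>Headline</h3>
<p>Text 4</p>
<p>Text 5</p>
<p>Text 6</p>
<h3>Headline</h3>
<p>Text 5</p>
<p>Text 4</p>"""

# The pattern from the post: lazily match from a <p> up to the next <h3>.
pattern = re.compile(r"<p>[\s\S]*?(?=<h3>)")

# findall returns every non-overlapping match, one per <h3> in the text.
all_matches = pattern.findall(HTML)

# search stops at the first match, which is the block before the first <h3>.
first_match = pattern.search(HTML)
first_block = first_match.group(0) if first_match else ""

print(len(all_matches))
print(first_block)
```

So the remaining task is to keep only the first match rather than to change the pattern.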
 

Check this code below:

set(#data, "<div class=\"panel entry-content\" id=\"tab-description\" style=\"display: block; \">
<p>Text 1</p>
<p>Text 2</p>
<p>Text 3</p>
<h3>Headline</h3>
<p>Text 4</p>
<p>Text 5</p>
<p>Text 6</p>
<h3>Headline</h3>
<p>Text 5</p>
<p>Text 4</p>", "Global")
set(#data, $replace(#data, "
", ""), "Global")
set(#extract, $find regular expression(#data, "(?<=\">).*?(?=<h3>)"), "Global")
set(#extract, $replace(#extract, "</p>", "</p>
"), "Global")

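The . in the pattern does not match new lines by default, so you need to remove the new lines before running the regex. The last line puts a new line back after each </p>.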
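
The same three steps (strip the new lines, match, put the new lines back) can be sketched in Python for comparison. This is only an illustration under the assumption that Python's re module behaves like UBot's engine for this pattern. The function name extract_before_first_h3 is hypothetical and not part of UBot.

```python
import re

# Sample text copied from the post above.
DATA = """<div class="panel entry-content" id="tab-description" style="display: block; ">
<p>Text 1</p>
<p>Text 2</p>
<p>Text 3</p>
<h3>Headline</h3>
<p>Text 4</p>
<p>Text 5</p>
<p>Text 6</p>
<h3>Headline</h3>
<p>Text 5</p>
<p>Text 4</p>"""

# Same pattern as the UBot code: start right after "> and stop before <h3>.
PATTERN = r'(?<=">).*?(?=<h3>)'


def extract_before_first_h3(html: str) -> str:
    """Hypothetical helper mirroring the UBot steps above."""
    # Step 1: remove new lines so that . can run across the whole text.
    flat = html.replace("\r", "").replace("\n", "")
    # Step 2: find the first match of the pattern.
    match = re.search(PATTERN, flat)
    if match is None:
        return ""
    # Step 3: put a new line back after each closing </p>.
    return match.group(0).replace("</p>", "</p>\n")


# Without step 1 there is no match, because . stops at the first new line.
unflattened = re.search(PATTERN, DATA)

result = extract_before_first_h3(DATA)
print(result)
```

Note that this version depends on the "> at the end of the <div> line, so it needs that line to be present.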


Thanks a lot for the help, Kreatus. :) A few questions. Why do I need to remove the new lines in UBot? Why can't I use:

[\s\S]*?

Also, what if I don't have this line in front, because it changes with each page:

<div class="panel entry-content" id="tab-description" style="display: block; ">

If, for some pages, I only had this:

<p>Text 1</p>
<p>Text 2</p>
<p>Text 3</p>
<h3>Headline</h3>
<p>Text 4</p>
<p>Text 5</p>
<p>Text 6</p>
<h3>Headline</h3>
<p>Text 5</p>
<p>Text 4</p>

Then how would I extract just this?

<p>Text 1</p>
<p>Text 2</p>
<p>Text 3</p>

Thanks for the help. :)


UBot and regex can be a bit weird together. Sometimes you have to do silly workarounds, and this is one of those, but it does work.

set(#var, "<div class=\"panel entry-content\" id=\"tab-description\" style=\"display: block; \">
<p>Text 1</p>
<p>Text 2</p>
<p>Text 3</p>
<h3>Headline</h3>
<p>Text 4</p>
<p>Text 5</p>
<p>Text 6</p>
<h3>Headline</h3>
<p>Text 5</p>
<p>Text 4</p>", "Global")
clear list(%regex)
add list to list(%regex, $find regular expression(#var, "\\<p[^~]+?(?=\\<h3)"), "Delete", "Global")
set(#result, $list item(%regex, 0), "Global")
alert(#result)

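Basically, [^~] matches any character except ~, and that includes new lines, so the pattern grabs everything from <p up to the next <h3 (this assumes the text contains no ~). $find regular expression returns every match, so I put the matches in a list and take item 0, which is the first block. It also works when the <div> line is missing, because the pattern starts at <p. This workaround is needed because it's not easy to select multiple lines using regex in UBot.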
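
As a rough cross-check of this workaround, the sketch below runs the same pattern in Python on the sample text with and without the <div> line. Python's re module is an assumption made for illustration, and the helper name first_block is hypothetical. The list of matches stands in for the %regex list, and index 0 stands in for $list item(%regex, 0).

```python
import re

# Sample text copied from the post above.
PAGE_WITH_DIV = """<div class="panel entry-content" id="tab-description" style="display: block; ">
<p>Text 1</p>
<p>Text 2</p>
<p>Text 3</p>
<h3>Headline</h3>
<p>Text 4</p>
<p>Text 5</p>
<p>Text 6</p>
<h3>Headline</h3>
<p>Text 5</p>
<p>Text 4</p>"""

# The same page without its first line, as asked about earlier in the thread.
PAGE_WITHOUT_DIV = PAGE_WITH_DIV.split("\n", 1)[1]

# [^~] matches any character except ~, new lines included.
PATTERN = re.compile(r"<p[^~]+?(?=<h3)")


def first_block(html: str) -> str:
    """Hypothetical helper: collect all matches, then keep only the first."""
    matches = PATTERN.findall(html)
    return matches[0] if matches else ""


print(first_block(PAGE_WITH_DIV))
print(first_block(PAGE_WITHOUT_DIV))
```

Both calls should give the same three <p> lines, since the pattern anchors on <p rather than on the <div> line.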
