Pizza Pro 11 Posted October 3, 2013

Hi guys, I need help with this. I've been playing with it for hours, but I can't seem to complete it. I should have asked for help earlier.

For this block of text:

    <div class="panel entry-content" id="tab-description" style="display: block; ">
    <p>Text 1</p>
    <p>Text 2</p>
    <p>Text 3</p>
    <h3>Headline</h3>
    <p>Text 4</p>
    <p>Text 5</p>
    <p>Text 6</p>
    <h3>Headline</h3>
    <p>Text 5</p>
    <p>Text 4</p>

I want to extract everything from <p> until the first <h3>. So in this case, it would be:

    <p>Text 1</p>
    <p>Text 2</p>
    <p>Text 3</p>

I tried this:

    <p>[\s\S]*?(?=<h3>)

But it selects all instances. I can't get any farther than that no matter what I do. Help!
Kreatus (Ubot Ninja) 422 Posted October 3, 2013

Check this code below:

    set(#data, "<div class=\"panel entry-content\" id=\"tab-description\" style=\"display: block; \">
    <p>Text 1</p>
    <p>Text 2</p>
    <p>Text 3</p>
    <h3>Headline</h3>
    <p>Text 4</p>
    <p>Text 5</p>
    <p>Text 6</p>
    <h3>Headline</h3>
    <p>Text 5</p>
    <p>Text 4</p>", "Global")
    set(#data, $replace(#data, "
    ", ""), "Global")
    set(#extract, $find regular expression(#data, "(?<=\">).*?(?=<h3>)"), "Global")
    set(#extract, $replace(#extract, "</p>", "</p>
    "), "Global")

You need to remove the new lines first before you can regex it.
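The three steps in the post above (strip the line breaks, match between the div's closing `">` and the first `<h3>`, then put a line break back after each paragraph) can be sketched outside UBot. This is only an illustration in Python: the `re` module stands in for UBot's regex engine, the variable names are made up for the example, and UBot's own behaviour may differ.

```python
import re

# Sample markup from the thread, with a line break after each tag (assumed layout).
data = (
    '<div class="panel entry-content" id="tab-description" style="display: block; ">\n'
    "<p>Text 1</p>\n<p>Text 2</p>\n<p>Text 3</p>\n"
    "<h3>Headline</h3>\n"
    "<p>Text 4</p>\n<p>Text 5</p>\n<p>Text 6</p>\n"
    "<h3>Headline</h3>\n"
    "<p>Text 5</p>\n<p>Text 4</p>"
)

pattern = r'(?<=">).*?(?=<h3>)'

# With the line breaks still in place, "." cannot cross them, so nothing matches.
with_newlines = re.search(pattern, data)

# Step 1: remove the line breaks.
flat = data.replace("\n", "")

# Step 2: take the text between the div's closing '">' and the first <h3>.
extract = re.search(pattern, flat).group(0)

# Step 3: put a line break back after each paragraph.
extract_lines = extract.replace("</p>", "</p>\n")

print(with_newlines)
print(extract_lines)
```

In this sketch the pattern finds nothing until the line breaks are removed, because `.` does not match a line break by default in Python. That is one plausible reason the line breaks have to go first, though the thread does not confirm that UBot behaves the same way.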
Pizza Pro 11 Posted October 3, 2013

Thanks a lot for the help, Kreatus. A few questions.

Why do I need to remove the new lines in UBot? Why is it that I can't use:

    [\s\S]*?

Also, what if I don't have this line in front, because it changes with each page:

    <div class="panel entry-content" id="tab-description" style="display: block; ">

If for some pages, I only had this:

    <p>Text 1</p><p>Text 2</p><p>Text 3</p><h3>Headline</h3><p>Text 4</p><p>Text 5</p><p>Text 6</p><h3>Headline</h3><p>Text 5</p><p>Text 4</p>

Then, how would I extract

    <p>Text 1</p><p>Text 2</p><p>Text 3</p>

?

Thanks for the help.
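For the case without the leading div, one option is to anchor on the first `<p>` instead of on the div's closing `">` and keep only the first match. The sketch below uses Python's `re` module as a stand-in, with hypothetical variable names; it shows how the pattern from the first post behaves in Python, not necessarily how it behaves inside UBot.

```python
import re

# The single-line markup from the post above, without the surrounding div.
page = (
    "<p>Text 1</p><p>Text 2</p><p>Text 3</p><h3>Headline</h3>"
    "<p>Text 4</p><p>Text 5</p><p>Text 6</p><h3>Headline</h3>"
    "<p>Text 5</p><p>Text 4</p>"
)

# Pattern from the first post: start at <p>, stop lazily before the next <h3>.
pattern = r"<p>[\s\S]*?(?=<h3>)"

# findall returns every block that is followed by an <h3>, hence "all instances".
all_blocks = re.findall(pattern, page)

# search stops at the first match, which is the block wanted here.
first = re.search(pattern, page)
first_block = first.group(0) if first else ""

print(len(all_blocks))
print(first_block)
```

The last two paragraphs are never matched because no `<h3>` follows them, so the lookahead cannot succeed there.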
Kreatus (Ubot Ninja) 422 Posted October 3, 2013

That is going to be a tough one without seeing the whole code. Can you send me the link where I can find the exact code that you're trying to scrape?
HelloInsomnia 1103 Posted October 3, 2013

UBot and regex can be a bit weird. Sometimes you have to use silly workarounds, and this is one of those, but it does work.

    set(#var, "<div class=\"panel entry-content\" id=\"tab-description\" style=\"display: block; \">
    <p>Text 1</p>
    <p>Text 2</p>
    <p>Text 3</p>
    <h3>Headline</h3>
    <p>Text 4</p>
    <p>Text 5</p>
    <p>Text 6</p>
    <h3>Headline</h3>
    <p>Text 5</p>
    <p>Text 4</p>", "Global")
    clear list(%regex)
    add list to list(%regex, $find regular expression(#var, "\\<p[^~]+?(?=\\<h3)"), "Delete", "Global")
    set(#result, $list item(%regex, 0), "Global")
    alert(#result)

Basically, it matches any character except ~ after the <p, as few as possible, up to the next <h3. The regex finds every such block, so the list holds all of them and item 0 is the first one. The [^~] part is there because it's not easy to select across multiple lines using regex in UBot.
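The workaround above can be tried outside UBot as well. The following is a rough Python equivalent, assuming the sample text has a line break after each tag. The names `matches` and `result` are illustrative, and the `\\<` escapes from the UBot string are written as plain `<` here.

```python
import re

# Sample markup from the thread, with a line break after each tag (assumed layout).
var = (
    '<div class="panel entry-content" id="tab-description" style="display: block; ">\n'
    "<p>Text 1</p>\n<p>Text 2</p>\n<p>Text 3</p>\n"
    "<h3>Headline</h3>\n"
    "<p>Text 4</p>\n<p>Text 5</p>\n<p>Text 6</p>\n"
    "<h3>Headline</h3>\n"
    "<p>Text 5</p>\n<p>Text 4</p>"
)

# [^~] matches any character except "~", line breaks included,
# so the lazy +? can run across lines until the next "<h3".
matches = re.findall(r"<p[^~]+?(?=<h3)", var)

# Equivalent of $list item(%regex, 0): keep only the first block.
result = matches[0] if matches else ""

print(result)
```

One limit of this trick is that the match cannot extend past a literal `~`, so a block that contains that character would be cut short or missed. A class such as `[\s\S]` avoids that limit in engines that support it.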