Using Regex To catch text between sections

Frank · June 24, 2011

There are many times when I want to capture text between known characters in some text. Here's how it's done.

Let's say I have this: <td>I want this text here.</td>

I only want the text in between the HTML td tags, but not the tags themselves. You should be able to do a pre and post search for information around the tags like this:

(?<=<td>).*?(?=</td>)

The (?<=...) is the presearcher (a lookbehind) and the (?=...) is the post searcher (a lookahead). The stuff in the middle, .*?, just tells the regex to grab everything BUT don't be greedy. Because we specify not to be greedy, it's done once it hits the first end tag.

Frank
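
For anyone trying this outside uBot, here is a minimal sketch of the same lookbehind/lookahead pattern using Python's `re` module. The regex is the one from the post above; the two-cell sample string and the variable names are assumptions added for illustration, to show what "don't be greedy" changes.

```python
import re

# First string is the sample from the post; the second is an assumed
# two-cell example used to compare greedy and non-greedy matching.
single = "<td>I want this text here.</td>"
double = "<td>first</td><td>second</td>"

# (?<=<td>) is a lookbehind: the match must be preceded by <td>.
# (?=</td>) is a lookahead: the match must be followed by </td>.
# Neither tag becomes part of the matched text.
lazy = re.compile(r"(?<=<td>).*?(?=</td>)")
greedy = re.compile(r"(?<=<td>).*(?=</td>)")

inner = lazy.search(single).group(0)
lazy_hit = lazy.search(double).group(0)      # stops at the first </td>
greedy_hit = greedy.search(double).group(0)  # runs on to the last </td>

print(inner)
print(lazy_hit)
print(greedy_hit)
```

With the non-greedy `.*?` the match ends at the first closing tag; the greedy `.*` swallows the tags in the middle as well.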

JohnB · June 24, 2011

Hey Frank, thank you! Is there a way to capture multiple instances of text between like tags?

John

LoWrIdErTJ - BotGuru · June 24, 2011

Couldn't you just use a while or loop node, search the page (regex), and capture it in the while or loop?

TJ

Kreatus (Ubot Ninja) · June 24, 2011

Nice Frank! This code will be very useful in the future.

AutoIM · June 24, 2011

Thanks Frank, another useful regex snippet to put in the scrapbook.

I did a bit of research on Google for guides and tutorials on regex a while ago, and was surprised to find that there isn't just one 'standard' regex format. It comes in several flavours, a bit like the subtle differences between versions of other programming languages from different companies. No wonder it's not always possible to copy and paste a piece of regex code and expect it to work first time out of the box.

I also found a website where you could paste in the text from say a page scrape, highlight the part of it you want, and the regex code was generated automatically as you did that. I'll have a rummage around and see if I can find it and post the link here.

Phil
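
To make the point about flavours concrete, here is a small sketch in Python. Python's `re` module only accepts fixed-width lookbehinds, whereas the .NET flavour also accepts variable-length ones, so a pattern copied from one tool can be rejected by another. The helper name `compiles` and the second pattern are assumptions made up for this illustration.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Fixed-width lookbehind (exactly the four characters "<td>"): accepted.
fixed_ok = compiles(r"(?<=<td>).*?(?=</td>)")

# Variable-width lookbehind (a <td ...> tag with any attributes):
# rejected by Python's re, although some other flavours allow it.
variable_ok = compiles(r"(?<=<td[^>]*>).*?(?=</td>)")

print(fixed_ok, variable_ok)
```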

Frank · June 24, 2011

Hey Frank, thank you! Is there a way to capture multiple instances of text between like tags?

John

I'm pretty sure that you can use 'find regular expression' and save the results to a list. Just select the list option and make sure you are adding to a list.

Frank
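
Outside uBot, the "save every match to a list" idea can be sketched in Python like this. The sample row and variable names are assumptions for illustration; `findall` returns all matches as a list, and `finditer` is the loop-style equivalent of the while/loop suggestion earlier in the thread.

```python
import re

# Assumed sample input with several like tags.
html = "<tr><td>alpha</td><td>beta</td><td>gamma</td></tr>"
pattern = re.compile(r"(?<=<td>).*?(?=</td>)")

# One call: every non-overlapping match, collected into a list.
cells = pattern.findall(html)

# Loop form: visit each match in turn and add it to a list yourself.
looped = []
for match in pattern.finditer(html):
    looped.append(match.group(0))

print(cells)
print(looped)
```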

Frank · June 24, 2011

You hit the nail on the head, Phil. If there's one way to accomplish a task, then there are 10 ways, lol.

rumen · June 24, 2011

Frank, thanks for the expression. Until now I used find regular expression and replace regular expression for this. Yours is much quicker.

Thanks again.

LoWrIdErTJ - BotGuru · June 25, 2011

I actually used this on a link like

http://www.fiverr.com/users/wpbuzz/gigs/invite

regex

(?<=http://www.fiverr.com/users/).*?(?=/gigs/)

Wanting to pull the username "wpbuzz"

However on the saved list it actually removed the username, and saved everything to the left and right..

Am I doing something wrong here?

TJ
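
As a reference point, here is a minimal Python sketch of the extraction being attempted, run against the URL from the post. It only shows what the pattern is expected to return in a plain regex search, not what uBot did with it. Escaping the dots and the capture-group variant are additions for illustration, not something from the thread.

```python
import re

url = "http://www.fiverr.com/users/wpbuzz/gigs/invite"

# Same idea as the posted regex, with the dots escaped so they only
# match a literal "." instead of any character.
lookaround = re.compile(r"(?<=http://www\.fiverr\.com/users/).*?(?=/gigs/)")
username = lookaround.search(url).group(0)

# Alternative without lookarounds: match the whole URL fragment and
# pull the username out of capture group 1.
grouped = re.compile(r"http://www\.fiverr\.com/users/(.*?)/gigs/")
username_from_group = grouped.search(url).group(1)

print(username)
print(username_from_group)
```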

Pete · June 25, 2011

You are using http://www.fiverr try http://fiverr

LoWrIdErTJ - BotGuru · June 25, 2011

That's just the first portion of the regex, when the purpose is to pull the user name out of the entire string only.

Pete · June 25, 2011

And that's also the reason your regex is failing: htaccess settings.

(?<=http://fiverr.com/users/).*?(?=/gigs/)

JohnB · June 25, 2011

That wasn't the reason at all. The names of the gigs have non-word characters that were not accounted for in the regex. Once adjusted, the regex worked fine.

John
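
The adjusted regex isn't shown in the thread, so the following is only a sketch of the general problem: a `\w+` in the middle stops at the first non-word character, while a negated class such as `[^/]+` accepts anything up to the next slash. The URL with a hyphen and a dot in it is a made-up example, and both patterns are assumptions for illustration.

```python
import re

# Hypothetical URL whose middle segment contains non-word characters.
url = "http://www.fiverr.com/users/wp-buzz.99/gigs/invite"

# \w+ only matches letters, digits and underscore, so it cannot
# bridge the "-" and "." before /gigs/ and the search finds nothing.
word_only = re.search(r"(?<=/users/)\w+(?=/gigs/)", url)

# [^/]+ matches any run of characters that are not a slash.
any_segment = re.search(r"(?<=/users/)[^/]+(?=/gigs/)", url)

print(word_only)
print(any_segment.group(0))
```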

LoWrIdErTJ - BotGuru · June 26, 2011

John helped me out on it. Thank you...

Pete · June 26, 2011

That wasn't the reason at all

lol then go to http://www.fiverr.com and tell me what URL you land on?

So you have two options: change it or remove it.

JohnB · June 26, 2011

That would be great if the idea was to navigate to a URL. We were parsing text from within a file... As far as I know, .htaccess can't block text parsing in a file.

Pete · June 26, 2011

Sorry, I'm not so good at mind reading. I was going by the information posted.

LoWrIdErTJ - BotGuru · June 26, 2011

Guess you wouldn't need to be, with my statement:

However on the saved list it actually removed the username, and saved everything to the left and right..

Nevertheless, no big deal; it's been taken care of.

Kreatus (Ubot Ninja) · June 30, 2011

Hi Frank, I've got a question. How do I make this code work if I want to scrape code that contains line breaks?

For example:

    <div class="content">I wrote a really good college application essay. It's under a page and a half, yet it manages to use 407 more words than the suggested maximum. I'm attempting to trim fat, but there isn't much. I fear that if I become to obsessed with length, the content will suffer. <br>
<br>
Any advice?</div>

I want to get the content inside <div class="content"> and </div>

Want to get this

I wrote a really good college application essay. It's under a page and a half, yet it manages to use 407 more words than the suggested maximum. I'm attempting to trim fat, but there isn't much. I fear that if I become to obsessed with length, the content will suffer. <br>
<br>
Any advice?

Thanks!
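
For comparison, in many regex flavours the usual reason a pattern like this fails across line breaks is that `.` does not match a newline by default. A minimal Python sketch, assuming a shortened stand-in for the sample above: the inline `(?s)` flag (dot-all / single-line mode) lets `.` cross line breaks. Whether uBot's regex node accepts `(?s)` is not confirmed anywhere in this thread.

```python
import re

# Shortened, assumed stand-in for the multi-line <div> in the post.
html = '<div class="content">First line. <br>\n<br>\nAny advice?</div>'

# Without dot-all mode, "." stops at the first newline, so the
# closing </div> two lines down is never reached.
plain = re.search(r'(?<=<div class="content">).*?(?=</div>)', html)

# (?s) at the start of the pattern makes "." match newlines too.
dotall = re.search(r'(?s)(?<=<div class="content">).*?(?=</div>)', html)

print(plain)
print(dotall.group(0))
```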

Pete · June 30, 2011

The only way I know is by removing them, Kreatus, but the regex I use for that I have never tested in uBot.

Kreatus (Ubot Ninja) · June 30, 2011

That won't work, zap, since I need to scrape from a page. This page, to be specific: http://answers.yahoo.com/question/index;_ylt=AswTHU808AuRWqCqhKCoK5cjzKIX;_ylv=3?qid=20080407124721AAA2kow

Thanks

Pete · June 30, 2011

Is this what you are after? If not, it may give you an idea of how to get what you're after.

Yahoo.ubot

Kreatus (Ubot Ninja) · June 30, 2011

That's right, zap! Thanks for the workaround! +1 for you.

Bob The Builder · August 21, 2011

How do you go about making the search for <td> case insensitive? I couldn't figure out where to put the 'i', as it doesn't seem to be accepted anywhere inside the ().

I think I have it figured out, (?i) at the beginning seems to do the trick.
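
A quick Python sketch of the same trick, assuming an upper-case sample string made up for illustration: the inline `(?i)` flag at the start of the pattern makes the whole expression, lookarounds included, ignore case.

```python
import re

# Assumed sample with upper-case tags.
html = "<TD>I want this text here.</TD>"

# Case-sensitive: the lower-case <td> in the lookbehind never matches.
sensitive = re.search(r"(?<=<td>).*?(?=</td>)", html)

# (?i) at the very start switches on case-insensitive matching.
insensitive = re.search(r"(?i)(?<=<td>).*?(?=</td>)", html)

print(sensitive)
print(insensitive.group(0))
```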

itexspert · December 10, 2014

Ugh, how many times I was stuck on the <td> tags. Thanks for the solution, mate!
