UBot Underground

Using Regex To catch text between sections



There are many times when I want to capture text between known characters in some text. Here's how it's done.

 

Let's say I have this: <td>I want this text here.</td>

 

I only want the text in between the HTML td tags, but not the tags themselves. You should be able to do a pre and post search for information around the tags like this:

 

 

(?<=<td>).*?(?=</td>)

 

The (?<=...) is the presearcher (a lookbehind) and the (?=...) is the post searcher (a lookahead). The stuff in the middle, .*?, just tells the regex to grab everything BUT not be greedy. Because we specify not to be greedy, it's done once it hits the first end tag.
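
For anyone who wants to try the pattern outside UBot, here is a minimal sketch using Python's re module. Python is my assumption here, not something from the original post, and UBot's regex flavour may behave differently. The two-cell sample string is made up to show what non-greedy changes.

```python
import re

# Made-up sample with two cells so the greedy/non-greedy difference is visible.
html = "<td>I want this text here.</td><td>Second cell</td>"

# (?<=<td>) is a lookbehind: the match must be preceded by <td>.
# (?=</td>) is a lookahead: the match must be followed by </td>.
# Neither one is included in the matched text.
lazy = re.search(r"(?<=<td>).*?(?=</td>)", html)

# Same pattern without the ?, so .* grabs as much as it can.
greedy = re.search(r"(?<=<td>).*(?=</td>)", html)

print(lazy.group(0))    # I want this text here.
print(greedy.group(0))  # I want this text here.</td><td>Second cell
```

The greedy version runs on to the last end tag, which is why the ? after .* matters.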

 

Frank


Thanks Frank, another useful regex snippet to put in the scrapbook.

 

I did a bit of research on google for guides and tutorials on regex a while ago, and was surprised to find out that there isn't just one 'standard' regex format, but it comes in several flavours, a bit like subtle differences in other programming languages coming from different companies. No wonder it's not always possible to copy and paste a piece of regex code and expect it to work first time out of the box.

I also found a website where you could paste in the text from say a page scrape, highlight the part of it you want, and the regex code was generated automatically as you did that. I'll have a rummage around and see if I can find it and post the link here.

 

Phil


Hey Frank, thank you! Is there a way to capture multiple instances of text between like tags?

 

John

 

I'm pretty sure that you can use the find regular expression and save it to a list. Just use the 'find regular expression' and select the list option and make sure you are adding to a list.

 

Frank
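
As a rough equivalent of the "save it to a list" idea outside UBot, Python's re.findall returns every non-overlapping match as a list. This is only a sketch under the assumption that Python is acceptable here; the table row is invented sample data.

```python
import re

# Invented sample row with several cells.
html = "<tr><td>first</td><td>second</td><td>third</td></tr>"

# findall collects every match of the same lookbehind/lookahead pattern.
cells = re.findall(r"(?<=<td>).*?(?=</td>)", html)

print(cells)  # ['first', 'second', 'third']
```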


You hit the nail on the head, Phil. If there's one way to accomplish a task, then there are 10 ways, lol.


Frank, thanks for the expression. Until now I used find regular expression and replace regular expression for this. Yours is much quicker.

Thanks again.


I actually used this on a link like

http://www.fiverr.com/users/wpbuzz/gigs/invite

 

regex

(?<=http://www.fiverr.com/users/).*?(?=/gigs/)

 

Wanting to pull the username "wpbuzz"

 

 

However on the saved list it actually removed the username, and saved everything to the left and right..

 

 

Am I doing something wrong here?

 

TJ

That's just the first portion of the regex.

Also, the reason your regex is failing is the htaccess settings:

(?<=http://fiverr.com/users/).*?(?=/gigs/)


That wasn't the reason at all. The names of the gigs have non-word characters that were not accounted for in the regex. Once adjusted, the regex worked fine.

 

John
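
Here is a small sketch of how the patterns posted in this thread behave on TJ's link, again in Python's re module (my assumption, UBot may differ). The hyphenated username and the \w+ variant are hypothetical, added only to illustrate what "non-word characters not accounted for" can look like. The dots are escaped so they match a literal dot.

```python
import re

link = "http://www.fiverr.com/users/wpbuzz/gigs/invite"
# Hypothetical link with a non-word character (the hyphen) in the name.
hyphenated = "http://www.fiverr.com/users/wp-buzz/gigs/invite"

# .*? accepts any characters between the two anchors.
any_chars = re.compile(r"(?<=http://www\.fiverr\.com/users/).*?(?=/gigs/)")
# \w+ only accepts letters, digits and underscore.
word_only = re.compile(r"(?<=http://www\.fiverr\.com/users/)\w+(?=/gigs/)")
# The variant posted without "www." in the lookbehind.
no_www = re.compile(r"(?<=http://fiverr\.com/users/).*?(?=/gigs/)")

print(any_chars.findall(link))        # ['wpbuzz']
print(any_chars.findall(hyphenated))  # ['wp-buzz']
print(word_only.findall(hyphenated))  # []
print(no_www.findall(link))           # []
```

The last line shows that the lookbehind text has to match the link exactly, so dropping "www." from the pattern finds nothing in a link that contains it.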

 

 


That would be great if the idea was to navigate to a URL. We were parsing text from within a file. As far as I know, .htaccess can't block text parsing in a file.


Sorry, I'm not so good at mind reading. I was going by the information posted.


Guess you wouldn't need to be, given my statement:

However on the saved list it actually removed the username, and saved everything to the left and right..

 

 

Nevertheless, no big deal, it's been taken care of.


Hi Frank, I've got a question. How do I make this code work if I want to scrape code that contains line breaks?

 

For example:

    <div class="content">I wrote a really good college application essay. It's under a page and a half, yet it manages to use 407 more words than the suggested maximum. I'm attempting to trim fat, but there isn't much. I fear that if I become to obsessed with length, the content will suffer. <br>
<br>
Any advice?</div>

 

I want to get the content inside <div class="content"> and </div>

 

Want to get this

I wrote a really good college application essay. It's under a page and a half, yet it manages to use 407 more words than the suggested maximum. I'm attempting to trim fat, but there isn't much. I fear that if I become to obsessed with length, the content will suffer. <br>
<br>
Any advice?

 

Thanks!
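
One thing that may help here: in many regex flavours the dot does not match a line break by default, so .*? stops at the end of the first line. Below is a sketch in Python's re module (an assumption on my part, and I have not checked how UBot handles this). The sample text is a shortened version of the div above.

```python
import re

# Shortened version of the sample div, keeping the line breaks.
page = '''<div class="content">I wrote a really good essay. <br>
<br>
Any advice?</div>'''

pattern = r'(?<=<div class="content">).*?(?=</div>)'

# By default "." does not match a newline, so nothing is found.
default_hits = re.findall(pattern, page)

# re.DOTALL lets "." match newlines as well.
dotall_hits = re.findall(pattern, page, flags=re.DOTALL)

# [\s\S] matches any character, newline included, without needing a flag.
class_hits = re.findall(r'(?<=<div class="content">)[\s\S]*?(?=</div>)', page)

print(default_hits)  # []
print(dotall_hits)
print(class_hits)
```

The [\s\S]*? form is handy when the tool you paste the regex into gives you no place to set flags.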


The only way I know is by removing them, Kreatus, but I have never tested the regex I use for that in UBot.

That won't work, zap, since I need to scrape from a page. This page, to be specific: http://answers.yahoo.com/question/index;_ylt=AswTHU808AuRWqCqhKCoK5cjzKIX;_ylv=3?qid=20080407124721AAA2kow

 

Thanks

  • 1 month later...


How do you go about making the search for <td> case insensitive? I couldn't figure out where to put the 'i', as it doesn't seem to be accepted anywhere in the ().

 

I think I have it figured out, (?i) at the beginning seems to do the trick.
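
A quick sketch of the (?i) prefix, using Python's re module as an assumed stand-in for UBot's regex engine. The mixed-case sample string is made up.

```python
import re

# Made-up sample with one upper-case and one lower-case tag pair.
html = "<TD>Upper case cell</TD><td>lower case cell</td>"

# Without the flag only the lower-case tags match.
case_sensitive = re.findall(r"(?<=<td>).*?(?=</td>)", html)

# (?i) at the very start makes the whole pattern case insensitive.
case_insensitive = re.findall(r"(?i)(?<=<td>).*?(?=</td>)", html)

print(case_sensitive)    # ['lower case cell']
print(case_insensitive)  # ['Upper case cell', 'lower case cell']
```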
