UBot Underground


Hello.

 

I want to scrape stuff that resides between two tags (all in one row!):

<TAG1>asidug87G(&r67%$<>/ <&547gasdg8g!"§!"3.<TAG2><TAG1>asduha9h9h9h98h<TAG2><TAG1>asd24r32asd.<TAG2><TAG1>asf54z45t35"§$"§$<ypja898H<TAG2><TAG1>asidug87G(&r67%$<>/ <&547gasdg8g!"§!"3.<TAG2><TAG1>asidug87G(&r67%$<>/ <&547gasdg8g!"§!"3.<TAG2>

 

 

But the stuff in between those tags could be anything. 

 

So when I try something like:

(?<=\<TAG1\>).+(?=\<TAG2\>)

 

Of course that's not going to work, because .+ also swallows every <TAG2> in between. It only leaves out the one at the very end.

 

Questions:

1. Can this be done in a single expression? The look ahead should stop when it finds the first match of <TAG2>.

2. Is it possible to return just a specific position? Let's say, return only what is between the second <TAG1>****<TAG2> match, but ignore all the others?

Like: Start at the second <TAG1> you find and stop at the second <TAG2> you find.

 

 

Is stuff like this possible? 

 

 

Thanks in advance for your help

Dan


I suck at explaining things, but this may help you out: http://stackoverflow.com/questions/2503413/regular-expression-to-stop-at-first-match

Here's the code for getting the first match between those tags.

set(#var, "<TAG1>asidug87G(&r67%$<>/ <&547gasdg8g!\"§!\"3.<TAG2><TAG1>asduha9h9h9h98h<TAG2><TAG1>asd24r32asd.<TAG2><TAG1>asf54z45t35\"§$\"§$<ypja898H<TAG2><TAG1>asidug87G(&r67%$<>/ <&547gasdg8g!\"§!\"3.<TAG2><TAG1>asidug87G(&r67%$<>/ <&547gasdg8g!\"§!\"3.<TAG2>", "Global")
set(#firstTag, $list item($find regular expression(#var, "(?<=\\<TAG1\\>).*?(?=\\<TAG2\\>)"), 0), "Global")
alert(#firstTag)
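On question 2 (returning only the second match): the UBot code above takes list item 0 from the list of matches, so picking a different index should give a different match. Here is a rough sketch of the same idea in Python, not UBot script. The sample string is shortened and made up, the variable names are only for illustration, and UBot's regex engine may differ in small details.

```python
import re

# Shortened, made-up sample in the same shape as the original one-line input.
text = "<TAG1>first<TAG2><TAG1>sec<>ond<TAG2><TAG1>third<TAG2>"

# Same idea as the UBot pattern: lazy .*? between a look behind and a look ahead.
# < and > need no escaping here.
pattern = re.compile(r"(?<=<TAG1>).*?(?=<TAG2>)")

# findall returns every non-overlapping match, left to right.
matches = pattern.findall(text)

first = matches[0]   # counterpart of list item index 0 in the UBot code
second = matches[1]  # index 1 picks the second <TAG1>...<TAG2> pair

print(matches)
print(second)
```

In the UBot version the equivalent change would presumably be swapping the 0 in $list item for 1, since list positions there start at 0 as well.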


Thanks a lot. 

Greedy vs. non-greedy... OK, that's something I didn't know. I still don't understand it completely, but I'm getting closer :-)

 

Dan


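The simple way to explain greedy is that it tries to match as much as possible, so instead of ending the match at the first <TAG2> occurrence, it ends the match at the last one (since it's greedy and hungry for more :) ). Non-greedy (the ? after the quantifier) does the opposite and stops at the first place where the rest of the pattern can match.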
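To see greedy and non-greedy side by side, here is a small sketch in Python (not UBot script; the short sample string is made up for illustration). The only difference between the two patterns is the ? after the +.

```python
import re

# Two tag pairs on one line, as in the original question (shortened).
text = "<TAG1>aaa<TAG2><TAG1>bbb<TAG2>"

# Greedy: .+ grabs as much as it can, then backs off only until
# the look ahead finds a <TAG2>, which is the LAST one on the line.
greedy = re.search(r"(?<=<TAG1>).+(?=<TAG2>)", text).group(0)

# Non-greedy: .+? grows one character at a time and stops as soon as
# the look ahead finds a <TAG2>, which is the FIRST one.
lazy = re.search(r"(?<=<TAG1>).+?(?=<TAG2>)", text).group(0)

print(greedy)  # aaa<TAG2><TAG1>bbb
print(lazy)    # aaa
```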


But that always requires a look ahead or look behind, right? Without that it wouldn't know where to stop?

 

Dan


No, it doesn't always require it, since this regex also works:

\<TAG1\>(.*?)\<TAG2\>

The difference is that with look ahead/behind, the strings you put inside them are not included in the match; they are only there as a rule. The regex above will contain <TAG1> and <TAG2> as part of the match (the content on its own ends up in the capture group), while with look ahead/behind the match contains only the content between them, as in the example that Kreatues posted above.
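A quick way to compare the two approaches is to run both patterns over the same string. This is a sketch in Python rather than UBot script, with a made-up short sample; how UBot exposes capture groups is not covered in this thread, so only the Python side is shown.

```python
import re

text = "<TAG1>aaa<TAG2><TAG1>bbb<TAG2>"

# Without look ahead/behind: the tags are consumed, so they are part of the match.
tag_pattern = re.compile(r"<TAG1>(.*?)<TAG2>")
with_tags = [m.group(0) for m in tag_pattern.finditer(text)]

# The parentheses form capture group 1, which holds only the content.
group_only = [m.group(1) for m in tag_pattern.finditer(text)]

# With look behind/look ahead: the tags are checked but not consumed,
# so the whole match is already just the content.
lookaround = re.findall(r"(?<=<TAG1>).*?(?=<TAG2>)", text)

print(with_tags)   # ['<TAG1>aaa<TAG2>', '<TAG1>bbb<TAG2>']
print(group_only)  # ['aaa', 'bbb']
print(lookaround)  # ['aaa', 'bbb']
```

So both routes get to the same content; the look ahead/behind version just saves the step of pulling out the group.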


Awesome guys!! Thanks a lot. That was really helpful!

 

So besides the regular match types, look ahead and behind and greedy.... What would you say are the next three most important regex options you use?

Something that everyone should have in their toolbox?

 

Dan


I think those are the most important ones, since they "simulate" scraping when you are extracting data from a string that's already in memory.

I actually think that everyone who is serious about scraping should know regular expressions, since they make the job easier, and sometimes they are the only thing that works when nothing else will.
