Bot-Factory 602 Posted April 4, 2014

Hello. I want to scrape the text that sits between two tags (all on one line!):

<TAG1>asidug87G(&r67%$<>/ <&547gasdg8g!"§!"3.<TAG2><TAG1>asduha9h9h9h98h<TAG2><TAG1>asd24r32asd.<TAG2><TAG1>asf54z45t35"§$"§$<ypja898H<TAG2><TAG1>asidug87G(&r67%$<>/ <&547gasdg8g!"§!"3.<TAG2><TAG1>asidug87G(&r67%$<>/ <&547gasdg8g!"§!"3.<TAG2>

But the text between those tags could be anything. So when I try something like:

(?<=\<TAG1\>).+(?=\<TAG2\>)

it of course doesn't work, because .+ also matches every <TAG2> in between. Only the <TAG2> at the very end is left out.

Questions:
1. Can this be done in a single expression? The match should stop at the first <TAG2> it finds.
2. Is it possible to return just a specific position? Let's say, return only what is between the second <TAG1>****<TAG2> pair, but ignore all the others? Like: start at the second <TAG1> you find and stop at the second <TAG2> you find. Is stuff like this possible?

Thanks in advance for your help
Dan
Kreatus (Ubot Ninja) 422 Posted April 4, 2014

I suck at explaining things, but this may help you out: http://stackoverflow.com/questions/2503413/regular-expression-to-stop-at-first-match

Here's the code for getting the first match:

set(#var, "<TAG1>asidug87G(&r67%$<>/ <&547gasdg8g!\"§!\"3.<TAG2><TAG1>asduha9h9h9h98h<TAG2><TAG1>asd24r32asd.<TAG2><TAG1>asf54z45t35\"§$\"§$<ypja898H<TAG2><TAG1>asidug87G(&r67%$<>/ <&547gasdg8g!\"§!\"3.<TAG2><TAG1>asidug87G(&r67%$<>/ <&547gasdg8g!\"§!\"3.<TAG2>", "Global")
set(#firstTag, $list item($find regular expression(#var, "(?<=\\<TAG1\\>).*?(?=\\<TAG2\\>)"), 0), "Global")
alert(#firstTag)
UBotDev 276 Posted April 4, 2014

The easiest way to do 2. is to add all matches to a UBot %list and pick the one you want by index with "$list item". The index is zero-based, as in Kreatus's code where 0 returns the first match, so the second match is index 1.
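The "collect every match, then pick one by position" idea is not specific to UBot. Below is a minimal sketch of it in Python (not UBot script), using Python's standard `re` module. The sample string is a shortened stand-in for the input in the thread, and `nth_match` is a hypothetical helper introduced only for this illustration. Unlike the UBot version, `<` and `>` are left unescaped, since they are not special characters in Python's regex syntax.

```python
import re

# Shortened stand-in for the thread's input (an assumption, not the full original string).
text = "<TAG1>first<&/ x<TAG2><TAG1>asduha9h9h9h98h<TAG2><TAG1>asd24r32asd.<TAG2>"

# Lazy quantifier (.*?) stops at the nearest <TAG2>.
# re.DOTALL lets '.' also match newlines, in case the content spans lines.
pattern = re.compile(r"(?<=<TAG1>).*?(?=<TAG2>)", re.DOTALL)

# Every non-overlapping match, in order of appearance.
matches = pattern.findall(text)


def nth_match(items, n):
    """Return the n-th match (1-based), or None if there are fewer than n matches."""
    return items[n - 1] if 1 <= n <= len(items) else None


second = nth_match(matches, 2)
print(matches)
print(second)
```

Picking the match after the fact keeps the pattern simple. Trying to encode "the second pair only" inside the regex itself is possible, but it tends to be harder to read and to change later.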
Bot-Factory 602 Posted April 4, 2014 Author

Thanks a lot. Greedy vs. non-greedy... OK, that's something I didn't know. I still don't understand it completely, but I'm getting closer :-)

Dan
UBotDev 276 Posted April 4, 2014

The simple way to explain greedy is that it tries to match as much as possible, so instead of ending the match at the first <TAG2> occurrence, it ends the match at the last one (since it's greedy and hungry for more). Adding ? after the quantifier (.*? or .+?) makes it non-greedy, so it matches as little as possible and stops at the first <TAG2>.
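The greedy behaviour described here is easy to see side by side. This is a small sketch in Python (not UBot script), with a made-up two-pair input chosen so the difference is obvious. The only change between the two patterns is the `?` after `.+`.

```python
import re

# Made-up input with two tag pairs (an assumption for illustration).
text = "<TAG1>one<TAG2><TAG1>two<TAG2>"

# Greedy: .+ grabs as much as it can, then backs off only until
# the lookahead for <TAG2> succeeds, which happens at the LAST <TAG2>.
greedy = re.search(r"(?<=<TAG1>).+(?=<TAG2>)", text).group(0)

# Lazy: .+? grows one character at a time and stops as soon as
# the lookahead for <TAG2> succeeds, which happens at the FIRST <TAG2>.
lazy = re.search(r"(?<=<TAG1>).+?(?=<TAG2>)", text).group(0)

print(greedy)  # one<TAG2><TAG1>two
print(lazy)    # one
```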
Bot-Factory 602 Posted April 4, 2014 Author

But that always requires a look ahead or look behind, right? Without that it wouldn't know where to stop?

Dan
UBotDev 276 Posted April 4, 2014

No, it doesn't always require it, since this regex also works:

\<TAG1\>(.*?)\<TAG2\>

The difference is that with look ahead/behind, the string you put there isn't included in the match; it only acts as a condition. The example above will contain <TAG1> and <TAG2> as part of the match, whereas the look ahead/behind version that Kreatus posted above contains only the content between them.
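To make that difference concrete, here is a minimal sketch in Python (not UBot script) with a made-up input. It shows that the plain pattern's full match includes the tags, that the capture group holds only the inner text, and that the lookaround version returns the inner text as the full match. How a given tool exposes capture groups varies, so treat the group access below as Python-specific.

```python
import re

# Made-up input with two tag pairs (an assumption for illustration).
text = "<TAG1>one<TAG2><TAG1>two<TAG2>"

# Plain pattern with a capture group around the lazy part.
m = re.search(r"<TAG1>(.*?)<TAG2>", text)
whole = m.group(0)   # full match, tags included
inner = m.group(1)   # capture group only, tags excluded

# Lookaround pattern: the tags are required but never consumed,
# so the full match is already just the inner text.
look = re.search(r"(?<=<TAG1>).*?(?=<TAG2>)", text).group(0)

print(whole)  # <TAG1>one<TAG2>
print(inner)  # one
print(look)   # one
```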
Bot-Factory 602 Posted April 4, 2014 Author

Awesome guys!! Thanks a lot. That was really helpful!

So besides the regular match types, look ahead and look behind, and greedy vs. non-greedy... what would you say are the next three most important regex features you use? Something that everyone should have in their toolbox?

Dan
UBotDev 276 Posted April 8, 2014

I think those are the most important ones, since they "simulate" scraping when you are extracting data from a string that's already in memory. I actually think everyone who is serious about scraping should know regular expressions, since they make the job easier, and sometimes they make it possible when nothing else will work.