Using Regex To catch text between sections

Jaro · November 6, 2015

Please help, I'm trying to scrape hehehe from the text: name=bla value=o>bla - bla <hehehe></td>

I'm using the regex (?<=value=o>.*?<).*?(?=>) and it works just fine in EditPad Pro, exactly how I want... but of course uBot has problems with my regex again!

Please tell me how to make it work in uBot as well. Thanks!

Bill · November 7, 2015

These all work

(?<=\).*?(?=\&)

(?<=\&lt\;).*?(?=\&gt\;)

Jaro · November 7, 2015

Thank you Bill, but no, that doesn't work for what I need. It matches more characters, which is why I really need the value=o... condition in there.

Buddy S. from uBot support suggested ^.*value=o>.*<(.*)&gt.* which works on Rubular, his favorite website, but it doesn't work in uBot either.

Jaro · November 7, 2015

I've just found out that

(?<=value\=o\>.*\&lt\;).*?(?=\&gt\;)

finally works in the uBot regex editor, but the Rubular website says it's an 'Invalid pattern in look-behind.'

Although it works in the regex editor, it doesn't work in uBot scripts, and again it adds empty values, unfortunately.

UPDATE 1: And here seems to be the reason:
https://www.ruby-forum.com/topic/4483308

UPDATE 2: I've just solved the problem by changing the text on either side of the pattern to be selected:

(?<=&lt\;).*?(?=\&gt\;\<\/td\>\<td width)
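
The 'Invalid pattern in look-behind' message is consistent with the look-behind containing .*, which makes its width variable; some regex engines accept that and others reject it. Below is a minimal Python sketch of one way around it: match the surrounding context normally and take only a capture group. Assumptions: Python's re module is used purely for illustration and is not uBot's regex engine, the sample text is the one from the first post, and the function names are made up.

```python
import re

# Sample text from the first post.
TEXT = "name=bla value=o>bla - bla <hehehe></td>"

# The original pattern has ".*?" inside the look-behind, so its width varies.
VARIABLE_LOOKBEHIND = r"(?<=value=o>.*?<).*?(?=>)"

# Alternative: match the context normally and capture only the wanted part.
CAPTURE_GROUP = r"value=o>.*?<(.*?)>"


def compiles(pattern):
    """Return True if Python's re engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def extract_between(text):
    """Return the text between '<' and '>' that follows 'value=o>', or None."""
    match = re.search(CAPTURE_GROUP, text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(compiles(VARIABLE_LOOKBEHIND))
    print(extract_between(TEXT))
```

Whether a capture group can be read back from a uBot script is not confirmed anywhere in this thread, so treat this only as a way to test the pattern idea outside uBot.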

maBOT · December 9, 2015

Hello guys,

I'm struggling with something pretty similar and I'd like someone to help me get past it. The value I want appears in two places in the source, but I simply can't get the scrape to work.

Here are both:

<a href="javascript:void(0);" onclick="checkclosed('EcqodmiMOWM');" class="likebutton">Like Video</a>

<img src="http://i.ytimg.com/vi/EcqodmiMOWM/default.jpg">

I'm trying to scrape the "EcqodmiMOWM" part (from either of the two), put it into a variable, and use the "navigate" function to open it in uBot as a normal YouTube URL. Here is the entire code I'm using:

}
set(#youtube,$scrape attribute(<src="http://i.ytimg.com/vi/EcqodmiMOWM/default.jpg">,$find regular expression("","(?<=<img src=\\\"http://i.ytimg.com/vi/).*?(?=/default.jpg\\\">)")),"Global")
in shared browser {
    navigate("https://www.youtube.com/watch?v={#youtube}","Wait")
    wait for browser event("Everything Loaded","")
    wait(10)
}

With this setup the debugger returns nothing. Besides regex, I've tried the various methods I read about here, and none of them worked for me. In addition, the page contains only this single a href / img src; there are no multiple attributes on the page.

Any suggestions for plugins, free or commercial, are welcome; just mention them here.

Thanks a lot in advance,

P.S.

This is my first forum post ever. I wanted to wait to ask for help until I really needed it ;-)

Pete · December 9, 2015

try this

set(#youtube,$find regular expression("<src=\"http://i.ytimg.com/vi/EcqodmiMOWM/default.jpg\">","(?<=ytimg\\.com\\/vi\\/).*?(?=\\/default\\.jpg)"),"Global")
navigate("https://www.youtube.com/watch?v={#youtube}","Wait")
wait for browser event("Everything Loaded","")
wait(10)

I think you don't see it in the debugger because you failed to click the quotation marks for the text in your regex.

maBOT · December 9, 2015

Hi Zap and thanks very much. That really did the trick...

However, what if I get the same src format to scrape from each time, but with a different ID? Basically, EcqodmiMOWM will change to a new value each time I scrape.

It seems that a "wildcard" can't be combined with "regex"..

Do you have any good suggestions for this? I probably need to use a wider selection like "outerhtml" or another selector..

Pete · December 9, 2015

If I knew which URLs you needed, it would be easier.

navigate("https://www.youtube.com/channel/UCgkY2u5AprRNiIX4JHdIuHA/videos","Wait")
clear list(%urls)
add list to list(%urls,$list from text($scrape attribute(<href=w"/watch?v=*">,"fullhref"),$new line),"Delete","Global")
loop while($comparison($list total(%urls),"> Greater than",0)) {
    navigate($list item(%urls,0),"Wait")
    wait for browser event("Everything Loaded",15)
    wait(10)
    remove from list(%urls,0)
}

maybe this is what you need

maBOT · December 9, 2015

Hello,

Thanks, Zap! Now I've gotten what I wanted...

I needed the (old) code below in a loop where part of the src ('EcqodmiMOWM') is a unique value on each pass. I did it with wildcard + regex. Look at the code:

loop($rand(2,5)) {
    set(#youtube,$find regular expression($scrape attribute(<src=w"http://i.ytimg.com/vi/*/default.jpg">,"src"),"(?<=ytimg\\.com\\/vi\\/).*?(?=\\/default\\.jpg)"),"Global")
}

Thanks a lot for your support!!
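
The wildcard + regex idea above can be checked outside uBot. Here is a minimal Python sketch, assuming the page holds several thumbnails of the same form. The second video ID and the helper name are made up for illustration, and Python's re module stands in for uBot's regex engine, which may behave differently.

```python
import re

# Hypothetical page fragment: the first src is from this thread,
# the second ID is invented to show the pattern is not tied to one value.
HTML = (
    '<img src="http://i.ytimg.com/vi/EcqodmiMOWM/default.jpg">'
    '<img src="http://i.ytimg.com/vi/AAAAAAAAAAA/default.jpg">'
)

# Same look-behind / look-ahead idea as the uBot regex in the thread.
VIDEO_ID = re.compile(r"(?<=ytimg\.com/vi/).*?(?=/default\.jpg)")


def watch_urls(html):
    """Build a YouTube watch URL for every thumbnail ID found in html."""
    return [
        "https://www.youtube.com/watch?v=" + video_id
        for video_id in VIDEO_ID.findall(html)
    ]


if __name__ == "__main__":
    for url in watch_urls(HTML):
        print(url)
```

The look-behind here has a fixed width, so it should be accepted even by engines that reject variable-length look-behinds.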

BlackHatMon3yMaker · March 29, 2016

I'm in the same position, any input is greatly appreciated!

maBOT · March 29, 2016

Hi, Mony3Maker

Have you seen my latest reply in this thread, which did the trick? Please see below:

loop($rand(2,5)) {
    set(#youtube,$find regular expression($scrape attribute(<src=w"http://i.ytimg.com/vi/*/default.jpg">,"src"),"(?<=ytimg\\.com\\/vi\\/).*?(?=\\/default\\.jpg)"),"Global")
}

Hope it helped.

BlackHatMon3yMaker · March 29, 2016

Hey, thanks for the response! I tried plugging in the info I needed and still can't get it to extract what I want. Am I missing something obvious? I changed some things, but nothing that should cause it not to work.

loop(1) {
    set(#picture,$find regular expression($scrape attribute(<src=w"http://ecx.images-amazon.com/images/I/*">,"src"),"(?<=http://ecx.images-amazon.com/images/I/).*?(?<=._SL1500_.jpg)"),"Global")
}

This is the HTML with the full link I'm trying to extract from:

<img src="http://ecx.images-amazon.com/images/I/919sFzge2iL._SL1500_.jpg" class="fullScreen" style="height: 471px; width: 818.333px; margin-top: 10px; margin-left: 93px;">
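
One thing that may explain the result: the trailing condition in the pattern above is written as a look-behind, (?<=...), where a look-ahead, (?=...), is probably intended, and the dots in ._SL1500_.jpg are unescaped. With a look-behind at the end, the match runs up to and including the suffix. Below is a minimal Python sketch comparing both forms on the src value above, assuming the goal is only the image ID. Python's re module stands in for uBot's engine here, and the names are made up for illustration.

```python
import re

# The src value from the img tag in the post above.
SRC = "http://ecx.images-amazon.com/images/I/919sFzge2iL._SL1500_.jpg"

# As posted: the trailing condition is a look-behind,
# so the suffix becomes part of the match.
AS_POSTED = r"(?<=http://ecx.images-amazon.com/images/I/).*?(?<=._SL1500_.jpg)"

# Adjusted: trailing look-ahead with escaped dots,
# so the match stops just before the suffix.
ADJUSTED = r"(?<=http://ecx\.images-amazon\.com/images/I/).*?(?=\._SL1500_\.jpg)"


def first_match(pattern, text):
    """Return the first match of pattern in text, or None if there is none."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(first_match(AS_POSTED, SRC))
    print(first_match(ADJUSTED, SRC))
```

If the whole link is wanted instead of the ID, the src attribute returned by the wildcard scrape may already be enough without any regex.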
