UBot Underground


Hey fellow UBot users,

 

This is my first post here. I've hit a problem in my first bot, which I am dying to finish. I have been working on this little project for a few days, so it's time to ask for help. Please help!

I am using the RegexBuddy trial, but since I am new to both UBot and this tool, I am not quite there yet. Here is my problem: I am trying to get test1, test2, test3, and test4 from the following HTML text:

<div test="asdf" asd="adfasdf;:"><div test><div test>test1</div></div></div>  -->fail
<div test>test2</div>   --> regex found test2 (good)
<div test='dsf'>test3</div>  --> regex found test3 (good)
<div test1="dasdf" test2='asdf'>test4</div>  --> regex found test4 (good)
<div test></div>  --> regex match empty text (good)

So far, the regex finds test2, test3, and test4, but not test1. For the first line it returns:

<div test><div test>test1</div></div> instead of "test1"

 

Here is my regex:

(?<=<(div)[-a-zA-Z0-9+&@#/%=~_|!:,.;\"\'\s]*>).*(?=</\1>)

1. I need your help extracting "test1".

2. Also, for RegexBuddy users: which regex engine (.NET, Ruby, JavaScript) do you select to get a regex that is fully compatible with UBot?
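
A note on why test1 fails: the `.*` in the pattern is greedy, so on the first line it runs to the last `</div>` and takes the nested tags with it. Below is a minimal Python sketch of that behaviour, not UBot code. Python's `re` module rejects variable-width lookbehind, so the sketch swaps the lookbehind/lookahead for a capture group. `GREEDY`, `INNERMOST` and `inner_texts` are names made up for illustration, and `[^<]*` is only one possible replacement for `.*`, assuming the wanted text never contains `<`.

```python
import re

# The five sample lines from the post, without the "-->" annotations.
SAMPLE = '''<div test="asdf" asd="adfasdf;:"><div test><div test>test1</div></div></div>
<div test>test2</div>
<div test='dsf'>test3</div>
<div test1="dasdf" test2='asdf'>test4</div>
<div test></div>'''

# Same character class and greedy .* as the posted regex, but with the
# inner text in capture group 2 instead of lookbehind/lookahead.
GREEDY = re.compile(r'''<(div)[-a-zA-Z0-9+&@#/%=~_|!:,.;"'\s]*>(.*)</\1>''')

# Assumed fix: the inner text may not contain "<", so only the innermost
# div (the one that directly wraps text) can match.
INNERMOST = re.compile(r'<(div)[^<>]*>([^<]*)</\1>')


def inner_texts(pattern, html):
    """Return capture group 2 of every match, in document order."""
    return [m.group(2) for m in pattern.finditer(html)]


print(inner_texts(GREEDY, SAMPLE)[0])   # <div test><div test>test1</div></div>
print(inner_texts(INNERMOST, SAMPLE))   # ['test1', 'test2', 'test3', 'test4', '']
```

The greedy version reproduces the result reported above for the first line, while the `[^<]*` version returns test1 and still gives an empty match for the empty div.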

 

PS

I also tried the "scrape attribute" function with "inner text" as the attribute to scrape, but it doesn't seem to work on a variable. It only scrapes the current web page.


add list to list(%test, $scrape attribute(<outerhtml="<body><div test=\"asdf\" asd=\"adfasdf;:\"><div test=\"\"><div test=\"\">test1</div></div></div>
<div test=\"\">test2</div>
<div test=\"dsf\">test3</div>
<div test1=\"dasdf\" test2=\"asdf\">test4</div>
<div test=\"\"></div> </body>">, "innertext"), "Delete", "Global")

Hey pftg4,

 

Thanks for your speedy response.

That code doesn't seem to work in my UBot. Does it work in yours? I got 0 items in %test.

I modified it a bit, since the HTML code will be stored in a variable such as #codes or in a list item.

set(#codes, "<body><div test=\"asdf\" asd=\"adfasdf;:\"><div test><div test>test1</div></div></div>
<div test>test2</div>
<div test=\'dsf\'>test3</div>
<div test1=\"dasdf\" test2=\'asdf\'>test4</div>
<div test></div></body>", "Global")
add list to list(%test, $scrape attribute(<outerhtml=#codes>, "innertext"), "Delete", "Global")

Actually, I just found an old post on another site:

http://forums.codewalkers.com/php-coding-7/get-innerhtml-with-php-regex-936783.html

regex: <([^<> ]*)([^<>]*)?>([^>]*)<\/([^<>]*)>

The problem with this regex is that it matches when the opening and closing tags are not the same. For example, it matches both <div>inner text1</div> and <div>inner text2</a>
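
To see the mismatch concretely, here is a small Python sketch, offered as an illustration rather than UBot code. It runs the quoted regex on both examples and reports the opening tag, the closing tag and the inner text. `LOOSE` and `open_close_text` are hypothetical names, and the `\/` from the quoted regex is written as a plain `/`, which means the same thing.

```python
import re

# The regex quoted from the old post: group 1 is the opening tag name,
# group 3 the inner text, group 4 whatever the closing tag happens to be.
LOOSE = re.compile(r'<([^<> ]*)([^<>]*)?>([^>]*)</([^<>]*)>')


def open_close_text(html):
    """Return (opening tag, closing tag, inner text) for the first match."""
    m = LOOSE.search(html)
    if m is None:
        return None
    return (m.group(1), m.group(4), m.group(3))


print(open_close_text('<div>inner text1</div>'))  # ('div', 'div', 'inner text1')
print(open_close_text('<div>inner text2</a>'))    # ('div', 'a', 'inner text2')
```

Nothing in the pattern ties group 4 to group 1, which is why the second, mismatched example is accepted.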

I changed the regex a bit to extract just the inner text. Here is the modified version, which adds a backreference (\1) to address the opening and closing tags issue:

(?<=<([^<>]*)([^<>]*)?>)([^>]*?)(?=</\1)

It finds all the inner text now, and it works with any tag, not just div.
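
For anyone trying the same idea outside UBot, here is a hedged Python sketch of the backreference approach. Python's `re` does not allow variable-width lookbehind, so this is a capture-group variant and not the exact regex above. It also tightens two things as assumptions of the sketch: the tag name must be non-empty and end at whitespace or `>`, and the closing tag must end with `>`. `LEAF` and `leaf_texts` are illustrative names.

```python
import re

# Group 1: tag name (non-empty, no whitespace, "/" or angle brackets).
# (?:\s[^<>]*)? : optional attributes, which must start with whitespace.
# Group 2: inner text containing no tags at all.
# </\1\s*> : the closing tag must repeat the name captured in group 1.
LEAF = re.compile(r'<([^<>\s/]+)(?:\s[^<>]*)?>([^<>]*)</\1\s*>')


def leaf_texts(html):
    """Return (tag, inner text) for every element that directly wraps text."""
    return [(m.group(1), m.group(2)) for m in LEAF.finditer(html)]


html = ('<div a="1"><div test><span class=\'x\'>test1</span></div></div>'
        '<p>test2</p><div>bad</a><b></b>')
print(leaf_texts(html))  # [('span', 'test1'), ('p', 'test2'), ('b', '')]
```

The mismatched `<div>bad</a>` is skipped because `</\1` can only be satisfied by `</div`, and the outer divs are skipped because their content starts with another tag.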

 

Thank you pftg4 and everyone else who has taken the time to help with my problem.
