blueBottle, posted February 6, 2013 (edited)

Hey fellow UBot users,

This is my first post here. I have a problem in my first bot, which I am dying to finish. I have been working on this little project for a few days, so it's time to ask for help. Please help! I use the RegexBuddy trial, but since I am new to both UBot and this tool, I am not quite there yet.

Here is my problem. I am trying to get test1, test2, test3, and test4 from the following HTML text:

<div test="asdf" asd="adfasdf;:"><div test><div test>test1</div></div></div> --> fail
<div test>test2</div> --> regex found test2 (good)
<div test='dsf'>test3</div> --> regex found test3 (good)
<div test1="dasdf" test2='asdf'>test4</div> --> regex found test4 (good)
<div test></div> --> regex matched empty text (good)

So far, the regex finds test2, test3, and test4 but not test1. For the first line it returns <div test><div test>test1</div></div> instead of "test1".

Here is my regex:

(?<=<(div)[-a-zA-Z0-9+&@#/%=~_|!:,.;\"\'\s]*>).*(?=</\1>)

1. I need your help extracting "test1".
2. Also, for RegexBuddy users: which regex engine (.NET, Ruby, JavaScript) do you use to get a fully UBot-compliant regex?

PS: I also tried the "scrape attribute" function with "inner text" as the attribute to scrape, but it doesn't seem to work with a variable. It only scrapes the current web page.
pftg4, posted February 6, 2013

add list to list(%test, $scrape attribute(<outerhtml="<body><div test=\"asdf\" asd=\"adfasdf;:\"><div test=\"\"><div test=\"\">test1</div></div></div> <div test=\"\">test2</div> <div test=\"dsf\">test3</div> <div test1=\"dasdf\" test2=\"asdf\">test4</div> <div test=\"\"></div> </body>">, "innertext"), "Delete", "Global")
blueBottle (author), posted February 6, 2013

Hey pftg4,

Thanks for your speedy response. That code doesn't seem to work in my UBot; does it work in yours? I got 0 items in %test. I modified it a bit, since the HTML code will be stored in a variable such as #codes or as a list item.

set(#codes, "<div test=\"asdf\" asd=\"adfasdf;:\"><body><div test><div test>test1</div></div></div> <div test>test2</div> <div test=\'dsf\'>test3</div> <div test1=\"dasdf\" test2=\'asdf\'>test4</div> <div test></div></body>", "Global")
add list to list(%test, $scrape attribute(<outerhtml=#codes>, "innertext"), "Delete", "Global")
blueBottle (author), posted February 6, 2013

Actually, I just found an old post elsewhere: http://forums.codewalkers.com/php-coding-7/get-innerhtml-with-php-regex-936783.html

Regex:

<([^<> ]*)([^<>]*)?>([^>]*)<\/([^<>]*)>

The problem with this regex is that it still matches when the opening and closing tags are not the same. For example, it matches both <div>inner text1</div> and <div>inner text2</a>.

I changed the regex a bit to extract just the inner text. Here is the modified version, with a fix for the mismatched opening and closing tags:

(?<=<([^<>]*)([^<>]*)?>)([^>]*?)(?=</\1)

It can find all the inner text now, and it works with all tags too. Thank you, pftg4 and everyone else who has taken the time to solve my problem.
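For anyone trying the same idea outside UBot: the modified pattern in the post above relies on a variable-length lookbehind, which engines such as .NET accept but Python's built-in re module does not. Below is a minimal Python sketch of the same approach, assuming a capture group is an acceptable substitute for the lookarounds. The names INNER_TEXT, html, and inner_texts are made up for this illustration, and the tag-name group is narrowed to [^<>\s]+ (an assumption, not the poster's exact pattern) so that the backreference compares only the tag name and not the attributes.

```python
import re

# Capture-group variant of the inner-text idea from the thread (illustrative only).
# Group 1: tag name (no whitespace), so \1 must match the closing tag's name.
# Group 2: innermost text, which may not contain '<' or '>'.
INNER_TEXT = re.compile(r"<([^<>\s]+)[^<>]*>([^<>]*)</\1\s*>")

# The sample HTML from the first post, one element per line.
html = (
    '<div test="asdf" asd="adfasdf;:"><div test><div test>test1</div></div></div>\n'
    '<div test>test2</div>\n'
    "<div test='dsf'>test3</div>\n"
    '<div test1="dasdf" test2=\'asdf\'>test4</div>\n'
    '<div test></div>'
)

# Collect group 2 (the inner text) from every match.
inner_texts = [m.group(2) for m in INNER_TEXT.finditer(html)]
print(inner_texts)  # ['test1', 'test2', 'test3', 'test4', '']

# Mismatched opening and closing tags are rejected by the backreference.
print(INNER_TEXT.search("<div>inner text2</a>"))  # None
```

Because group 2 excludes angle brackets, only the innermost element of a nested structure is matched, which is what the test1 case needs. Like any regex-based approach, this is only suitable for simple, well-formed snippets such as the ones in this thread.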
pftg4, posted February 6, 2013

It worked fine at this end. Glad you sorted out your problem. Welcome to the forum!