solidrockguy (posted September 24, 2013):
Is there anyone who can help me with this bit of regex? I am using Aymen's http post command to pull all of the text, and then I want to gather the innertext from links like this one:
<h1 style="margin: 2px 0 0 0;"><b></b> <a href="http://www.rvwholesalers.com/design/AmeriLite/AmeriLite.php?floorplan=21MB">AmeriLite - 21MB</a></h1>
It needs to distinguish by using the h1 tags, since there are other similar links on the same page. I just want to gather the innertext from each of these results; in this case it would be AmeriLite - 21MB. There are typically 10 results per page, all in a similar format, and the innertext is different on every listing. Any help is much appreciated. I can't wait for the regex builder in ubot5.
Kreatus (Ubot Ninja) (posted September 24, 2013):
Here you go:
(?<==[a-zA-Z0-9]{3,6}">).*?(?=</a></h1>)
solidrockguy (posted September 24, 2013):
Thanks Kreatus. That works to capture the first one, but it also catches some other links on the page. Here is the URL I'm trying to gather the list from:
http://www.rvwholesalers.com/design/rvsearch.php?SEARCHRVTYPE=&Sleeps=0|50&Fiberglass=Z&pricerange=0|500000&exteriorkitchen=Z&Bunks=Z&searchrvlengthrange=0|600&manufacturer=Z&exteriordoors=Z&searchrvweight=0|50000&brand=Z&Slides=Z
The extra ones it captures are the similar ones that have <img src... after the opening h1 tag.
Kreatus (Ubot Ninja) (posted September 24, 2013):
OK, try the one below. It should work:
(?<==[a-zA-Z0-9]{3,8}">)[a-zA-Z0-9\s\W]{3,50}?(?=</a></h1>)
solidrockguy (posted September 24, 2013):
Thanks a lot for your help. That one does not seem to gather anything on that page.
Kreatus (Ubot Ninja) (posted September 24, 2013):
It works for me. Here's the add to list code. Make sure the page is loaded first.
add list to list(%h1 tags, $find regular expression($document text, "(?<==[a-zA-Z0-9]\{3,8\}\">)[a-zA-Z0-9\\s\\W]\{3,50\}?(?=</a></h1>)"), "Delete", "Global")
solidrockguy (posted September 24, 2013):
That worked great! Thanks a ton.
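
For anyone who wants to try the same extraction outside uBot, here is a minimal Python sketch. It is not the pattern from the thread: Python's built-in re module rejects variable-width lookbehinds such as (?<==[a-zA-Z0-9]{3,8}">), so this version swaps in a capture group and anchors on the h1 tag, as the original question asked. The first h1 in the sample is copied from the first post; the second h1 is a hypothetical stand-in for the "<img src..." variant, since the real markup of those links is not shown in the thread. The names SAMPLE_HTML, LINK_TEXT and extract_link_texts are made up for this example.

```python
import re

# Sample markup. The first <h1> is copied from the thread; the second is a
# HYPOTHETICAL stand-in for the "<img src..." links that should be skipped.
SAMPLE_HTML = (
    '<h1 style="margin: 2px 0 0 0;"><b></b> '
    '<a href="http://www.rvwholesalers.com/design/AmeriLite/AmeriLite.php'
    '?floorplan=21MB">AmeriLite - 21MB</a></h1>\n'
    '<h1><img src="thumb.jpg"> <a href="other.php?floorplan=21MB">'
    'Other link</a></h1>\n'
)

# Capture group instead of a lookbehind (assumption: the wanted links follow
# the opening h1 directly, apart from an optional empty <b></b>).
LINK_TEXT = re.compile(
    r'<h1[^>]*>\s*'            # opening h1 tag with any attributes
    r'(?:<b>\s*</b>\s*)?'      # optional empty <b></b>, as in the sample
    r'<a\s+href="[^"]*">'      # opening tag of the link
    r'([^<]+)'                 # group 1: inner text containing no tags
    r'</a>\s*</h1>',           # link must close right before </h1>
    re.IGNORECASE,
)


def extract_link_texts(html):
    """Return the inner text of every link that matches LINK_TEXT."""
    return [text.strip() for text in LINK_TEXT.findall(html)]


print(extract_link_texts(SAMPLE_HTML))
```

On the sample above, only the first link's text is returned, because the hypothetical second h1 has an img tag where the pattern expects the link to start. Against the live page the pattern may need adjusting to whatever markup the site actually serves.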