Anonym 53 Posted January 30, 2012

I am scraping some webpages for <h1> tags, using the regexp <h1\b[^>]*>(.*?)<\/h1>. On http://www.rubular.com/ the expression seems OK and only what is between the tags is "identified", but the output of my script actually includes the <h1> and </h1> tags too. I would prefer it not to. Does anyone see something obvious that a regexp noob like me doesn't? (I tried to learn this yesterday.)

clear list(%h1)
add list to list(%h1, $find regular expression($page scrape("<html ", "</html>"), "<h1\\b[^>]*>(.*?)<\\/h1>"), "Delete", "Global")

Thanks!
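One possible explanation for the tags showing up, assuming $find regular expression returns the whole match rather than the first capture group (the thread does not confirm this): the parentheses only mark a group inside the match, they do not shrink the match itself. A minimal sketch in Python, which is a different regex flavor from UBot's, using a made-up sample string:

```python
import re

# Hypothetical sample input, not taken from the page being scraped.
html = '<h1>Plain header</h1><h1 class="x">Header with attributes</h1>'

# Same pattern as in the post, written with single backslashes in a raw string.
pattern = re.compile(r"<h1\b[^>]*>(.*?)<\/h1>", re.DOTALL)

# group(0) is the whole match, tags included.
whole_matches = [m.group(0) for m in pattern.finditer(html)]

# group(1) is only what the parentheses captured, i.e. the text between the tags.
inner_texts = [m.group(1) for m in pattern.finditer(html)]

print(whole_matches)
print(inner_texts)
```

Testers such as Rubular highlight the capture groups separately, which may be why the expression looked right there.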
Anonym 53 Posted January 30, 2012 Author

Thank you! So I guess that, to also cover possible attributes on the <h1> tag, it should be (?<=<h1\b[^>]*).*?(?=</h1>) ?
Anonym 53 Posted January 30, 2012 Author

It could look like:

<html>
<head>
<title>This is the title I am interested in</title>
</head>
<body>
<h1>Here is a header 1 text without attributes</h1>
<h1 id="no-interesting-id" class="no-interested-class-either">Here is a header 1 text with attributes that I do not care about</h1>
</body>
</html>
Anonym 53 Posted January 30, 2012 Author

Sorry, no it didn't work.
JohnB 255 Posted January 31, 2012

Rubular can be deceptive because I believe it uses the Ruby flavor, whereas UBot uses the .NET flavor. Sometimes what works there works for us, and sometimes it doesn't.

John
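Look-behind is one place where flavors are known to differ. As an illustration only, here is how Python's re module (neither Ruby nor .NET) treats the look-around idea from earlier in the thread: it accepts a fixed-width look-behind but refuses a variable-width one at compile time. The .NET engine is documented as allowing variable-length look-behind, so the same pattern may compile there. The sample string is hypothetical.

```python
import re

html = '<h1 id="a">Title</h1>'  # hypothetical sample with an attribute

# Fixed-width look-behind: accepted, but it only matches a bare <h1> without attributes.
fixed = re.compile(r"(?<=<h1>).*?(?=</h1>)")
print(fixed.findall("<h1>Plain</h1>"))  # the bare tag is matched
print(fixed.findall(html))              # the tag with attributes is not

# Variable-width look-behind, as posted in the thread: Python's re rejects it.
try:
    re.compile(r"(?<=<h1\b[^>]*).*?(?=</h1>)")
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("rejected:", exc)
```

Note also that the pattern as posted has no > after [^>]*, so even in an engine that accepts it, the match could start right after <h1 and pull in the rest of the opening tag.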
Anonym 53 Posted January 31, 2012 Author

Eureka! It works! Instead of the code in my first post, it should be:

clear list(%h1)
add list to list(%h1, $scrape attribute(<outerhtml=r"<h1\\b[^>]*>(.*?)<\\/h1>">, "innertext"), "Delete", "Global")

Problem solved. So there was nothing wrong with my first regexp; it was the scraping. Yeeehaaa! (I still think it was odd that it didn't work before.) Regular expressions are REALLY powerful and not nearly as impossible as they looked at first glance (OK, I still have a lot to learn, but...).
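The fix works by asking for the inner text of the matched elements instead of the raw regex match. A rough Python analogue of that idea, not UBot's implementation, is to let an HTML parser find the <h1> elements and collect only their text. The class name H1TextCollector is made up for this sketch, and the sample is the page posted earlier in the thread.

```python
from html.parser import HTMLParser


class H1TextCollector(HTMLParser):
    """Collects the text content of every <h1> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.headers = []    # finished <h1> texts
        self._in_h1 = False  # True while the parser is inside an <h1>
        self._parts = []     # text chunks of the current <h1>

    def handle_starttag(self, tag, attrs):
        # Attributes such as id or class are simply ignored.
        if tag == "h1":
            self._in_h1 = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            self.headers.append("".join(self._parts).strip())

    def handle_data(self, data):
        if self._in_h1:
            self._parts.append(data)


sample = """<html> <head> <title>This is the title I am interested in</title> </head>
<body>
<h1>Here is a header 1 text without attributes</h1>
<h1 id="no-interesting-id" class="no-interested-class-either">Here is a header 1 text with attributes that I do not care about</h1>
</body> </html>"""

collector = H1TextCollector()
collector.feed(sample)
collector.close()
print(collector.headers)
```

Because the parser handles the opening tag itself, attributes on <h1> need no special treatment in the pattern.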
JohnB 255 Posted January 31, 2012

Nice job! :)

John