RegExp error

Anonym · January 30, 2012

I am scraping some webpages for <h1> tags using the regexp <h1\b[^>]*>(.*?)<\/h1> .

Using http://www.rubular.com/ the expression seems OK and only what is between the tags is "identified", but the output of my script actually includes the <h1> and </h1> tags too. I would prefer it not to. Does anyone see something obvious that a regexp noob like me doesn't? (I tried to learn this yesterday.)


clear list(%h1)
add list to list(%h1, $find regular expression($page scrape("<html ", "</html>"), "<h1\\b[^>]*>(.*?)<\\/h1>"), "Delete", "Global")

Thanks!

Anonym · January 30, 2012

Thank you!

So I guess that, to also cover possible attributes on the <h1> tag, it should be

(?<=<h1\b[^>]*).*?(?=</h1>)

?

Anonym · January 30, 2012

It could look like:


<html>
<head>
	<title>This is the title I am interested in</title>
</head>

<body>
	<h1>Here is a header 1 text without attributes</h1>
	<h1 id="no-interesting-id" class="no-interested-class-either">Here is a header 1 text with attributes that I do not care about</h1>
</body>
</html>
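
For anyone following along outside UBot, here is a minimal Python sketch of the difference between the whole match and the capture group, run against a shortened copy of the page above. It assumes that $find regular expression returns the whole match rather than group 1, which would explain the tags in the output; I have not confirmed that. PAGE, whole_matches and inner_texts are names made up for the example, and Python's re module is not the .NET flavor UBot uses, although as far as I know this particular pattern does not rely on anything flavor-specific.

```python
import re

# Shortened copy of the sample page, for illustration only.
PAGE = """<body>
<h1>Here is a header 1 text without attributes</h1>
<h1 id="no-interesting-id" class="no-interested-class-either">Header with attributes</h1>
</body>"""

# Same idea as the pattern in the first post: group 1 is the text between the tags.
H1_PATTERN = re.compile(r"<h1\b[^>]*>(.*?)</h1>", re.DOTALL)


def whole_matches(html):
    """Return everything the pattern matched, tags included (group 0)."""
    return [m.group(0) for m in H1_PATTERN.finditer(html)]


def inner_texts(html):
    """Return only what the parentheses captured (group 1)."""
    return [m.group(1) for m in H1_PATTERN.finditer(html)]


if __name__ == "__main__":
    print(whole_matches(PAGE))
    print(inner_texts(PAGE))
```

inner_texts(PAGE) gives the two heading texts without tags, while whole_matches(PAGE) gives the same headings with the <h1> and </h1> still attached. A tester that highlights groups can make the second case look like the first.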

Anonym · January 30, 2012

Sorry, no, it didn't work.

JohnB · January 31, 2012

Rubular can be deceptive because I believe it uses the Ruby flavor, whereas UBot uses the .NET flavor. Sometimes an expression that works there works for us and sometimes it doesn't.

john
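
To make the point about flavors concrete, here is a small sketch in a third flavor, Python's re module, chosen only because it is easy to run. .NET accepts a lookbehind of variable length, while Python's re refuses to compile one, so the same expression can be fine in one engine and an error in another. The two patterns and the compiles helper are made up for the example; they are not taken from UBot or Rubular.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Lookbehind containing [^>]*, so its length is not fixed.
VARIABLE_LOOKBEHIND = r"(?<=<h1\b[^>]*>).*?(?=</h1>)"

# Lookbehind with a fixed length; only covers <h1> without attributes.
FIXED_LOOKBEHIND = r"(?<=<h1>).*?(?=</h1>)"

if __name__ == "__main__":
    print(compiles(VARIABLE_LOOKBEHIND))
    print(compiles(FIXED_LOOKBEHIND))
    print(re.findall(FIXED_LOOKBEHIND, "<h1>Hi</h1>"))
```

The takeaway is only that an online tester tells you what its own engine does, so it is worth re-checking an expression in the tool that will actually run it.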

Anonym · January 31, 2012

Eureka! It works!

Instead of the above code it should be:


clear list(%h1)
add list to list(%h1, $scrape attribute(<outerhtml=r"<h1\\b[^>]*>(.*?)<\\/h1>">, "innertext"), "Delete", "Global")

Problem solved. So there was nothing wrong with my first regexp; it was the scraping...

Yeeehaaa! (I still think it was odd that it didn't work before.)

Regular expressions are REALLY powerful, and not at all as impossible as they looked at first glance (OK, I still have a lot to learn, but...)
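
For comparison, the same idea of asking for an element's inner text instead of regex-matching the raw page can be sketched in Python with the standard library's html.parser. This is not what UBot does internally (I don't know how $scrape attribute is implemented); H1TextCollector and h1_texts are names made up for the example.

```python
from html.parser import HTMLParser


class H1TextCollector(HTMLParser):
    """Collect the text inside each <h1>, ignoring attributes and nested tags."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._inside = False  # True while we are between <h1> and </h1>
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._inside = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "h1" and self._inside:
            self.headings.append("".join(self._parts).strip())
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._parts.append(data)


def h1_texts(html):
    """Return the inner text of every <h1> in the given HTML string."""
    collector = H1TextCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


if __name__ == "__main__":
    print(h1_texts('<h1 id="x">A <em>b</em> c</h1><p>no</p><h1>Second</h1>'))
```

Because the parser deals with the tags, attributes on the <h1> or markup nested inside it do not need any special handling in a pattern.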

JohnB · January 31, 2012

Nice job! :)

John
