UBot Underground

I am scraping some webpages for <h1> tags and I am doing this using the regexp <h1\b[^>]*>(.*?)<\/h1> .

Using http://www.rubular.com/ the expression seems OK and only what is between the tags is "identified", but the output of my script actually includes the <h1> and </h1> tags too. I would prefer it not to. Does anyone see something obvious that a regexp noob like me is missing? (I tried to learn this yesterday :) )

 


clear list(%h1)
add list to list(%h1, $find regular expression($page scrape("<html ", "</html>"), "<h1\\b[^>]*>(.*?)<\\/h1>"), "Delete", "Global")

 

 

Thanks!


It could look like:

 


<html>
<head>
	<title>This is the title I am interested in</title>
</head>

<body>
	<h1>Here is a header 1 text without attributes</h1>
	<h1 id="no-interesting-id" class="no-interested-class-either">Here is a header 1 text with attributes that I do not care about</h1>
</body>
</html>



Rubular can be deceptive because I believe it uses the Ruby flavor, whereas UBot uses the .NET flavor. A pattern that works on Rubular sometimes works for us in UBot and sometimes doesn't.

 

 

john
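
Besides the flavor difference, there is another possible reason the tags show up. This is a minimal sketch in Python (not UBot script), assuming that $find regular expression returns the whole match rather than the first capture group; that assumption is not confirmed anywhere in this thread. Rubular highlights capture groups separately, so the pattern can look right there while the whole match still includes the tags. The sample HTML is the one posted above; the variable names are made up for illustration.

```python
import re

# Sample page taken from the post above.
html = """<html>
<body>
	<h1>Here is a header 1 text without attributes</h1>
	<h1 id="no-interesting-id" class="no-interested-class-either">Here is a header 1 text with attributes that I do not care about</h1>
</body>
</html>"""

# Same pattern as in the original post.
H1_PATTERN = re.compile(r"<h1\b[^>]*>(.*?)<\/h1>")

# Group 0 is the whole match: it includes the <h1> and </h1> tags.
whole_matches = [m.group(0) for m in H1_PATTERN.finditer(html)]

# Group 1 is only what the parentheses captured: the text between the tags.
inner_texts = [m.group(1) for m in H1_PATTERN.finditer(html)]

print(whole_matches[0])
print(inner_texts[0])
```

If a tool only hands back group 0, the parentheses in the pattern make no difference to its output, which would match what the original poster saw.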


Eureka! It works!

 

Instead of the above code it should be:

 


clear list(%h1)
add list to list(%h1, $scrape attribute(<outerhtml=r"<h1\\b[^>]*>(.*?)<\\/h1>">, "innertext"), "Delete", "Global")

 

 

Problem solved. So there was nothing wrong with my first regexp; it was the scraping...

Yeeehaaa! :) (I still think it was odd that it didn't work before.)

 

Regular expressions are REALLY powerful, and not nearly as impossible as they looked at first glance (OK, I still have a lot to learn, but...)
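
For anyone who wants to stay with a plain regex search, one common workaround is to move the tags into lookarounds so that the match itself contains only the text. The sketch below is Python, not UBot script, and has not been tested in UBot; the sample string and names are hypothetical. Python only allows fixed-width lookbehind, so this version checks for a single ">" before the text. The .NET flavor also accepts variable-length lookbehind, so a stricter check on the opening <h1 ...> tag should be possible there.

```python
import re

# Hypothetical sample input with one plain and one attributed header.
html = '<h1>Plain header</h1>\n<h1 id="x" class="y">Header with attributes</h1>'

# (?<=>)     lookbehind: a ">" must sit right before the match
# [^<>]+     the match itself: text containing no tag characters
# (?=</h1>)  lookahead: "</h1>" must follow right after the match
INNER_ONLY = re.compile(r"(?<=>)[^<>]+(?=</h1>)")

# findall returns whole matches here because the pattern has no capture groups.
inner_texts = INNER_ONLY.findall(html)
print(inner_texts)
```

Note the limits of this version: it does not verify that the opening tag is an <h1>, and it finds nothing when the header contains nested tags such as <b>.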
