UBot Underground

Best Way To "Clean" A Scraped innerHTML Attribute?



Hello guys,

 

What is the best way to "clean" a scraped innerHTML attribute?

 

Basically, I'm scraping innerHTML that contains an empty div with an inline style, and I need to extract the background-image URL from that style:

 

<div class="blablah" style="height:120px;background-image:url(http://somewhere.com/image.jpeg)"></div>

 

I scraped the innerHTML of the parent div of blablah because otherwise I didn't get what I needed, but now I need to clean it up a bit.

 

Any tips are welcome!

 

Thanks a lot,

 

Cheers,


Regex is the answer for this. Here is an example:

set(#html,"<div class=\"blablah\" style=\"height:120px;background-image:url(http://somewhere.com/image.jpeg)\"></div>","Global")
set(#image_url,$find regular expression(#html,"(?<=image\\:url\\().*?(?=\\))"),"Global")
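For anyone who wants to check the pattern outside UBot, here is a rough Python sketch of the same idea: a lookbehind for `image:url(`, a lazy match, and a lookahead for the closing `)`. The function name `extract_background_urls` is made up for illustration, and Python's raw strings mean the backslashes are not doubled as they are in the UBot script above.

```python
import re

# Lookbehind anchors on "image:url(", the lazy .*? stops at the first ")".
URL_PATTERN = re.compile(r"(?<=image:url\().*?(?=\))")


def extract_background_urls(html: str) -> list[str]:
    """Return every background-image URL found in the given HTML string."""
    return URL_PATTERN.findall(html)


html = (
    '<div class="blablah" style="height:120px;'
    'background-image:url(http://somewhere.com/image.jpeg)"></div>'
)
print(extract_background_urls(html))  # ['http://somewhere.com/image.jpeg']
```

Note that this pattern assumes the style is written exactly as `image:url(...)`. If the page writes `url('...')` with quotes, the quotes would be captured too, and a space after the colon would prevent a match.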

Thanks !

 

I forgot to mention that I'm adding the scraped attributes to a list.

 

So after adding to the list, would I need to loop through it to replace each dirty innerHTML entry with the cleaned, regexed text? Or is there a more elegant solution?

 

Cheers,


Here are two examples of how you can add the cleaned result to a list without a second pass. Either one at a time, like this:

clear list(%image_urls)
set(#html,"<div class=\"blablah\" style=\"height:120px;background-image:url(http://somewhere.com/image.jpeg)\"></div>","Global")
add item to list(%image_urls,$find regular expression(#html,"(?<=image\\:url\\().*?(?=\\))"),"Don\'t Delete","Global")

Or if you need to add more than one at a time:

clear list(%image_urls)
set(#html,"<div class=\"blablah\" style=\"height:120px;background-image:url(http://somewhere.com/image1.jpeg)\"></div>
<div class=\"blablah\" style=\"height:120px;background-image:url(http://somewhere.com/image2.jpeg)\"></div>
<div class=\"blablah\" style=\"height:120px;background-image:url(http://somewhere.com/image3.jpeg)\"></div>
<div class=\"blablah\" style=\"height:120px;background-image:url(http://somewhere.com/image4.jpeg)\"></div>","Global")
add list to list(%image_urls,$list from text($find regular expression(#html,"(?<=image\\:url\\().*?(?=\\))"),$new line),"Delete","Global")
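If the dirty innerHTML is already sitting in a list, the same regex can still clean it in a single pass rather than replacing items in place. The Python sketch below is only an illustration of that idea, not UBot code. The name `clean_scraped_items` and the sample data are assumptions.

```python
import re

# Same pattern as in the UBot examples, written as a Python raw string.
URL_PATTERN = re.compile(r"(?<=image:url\().*?(?=\))")


def clean_scraped_items(dirty_items: list[str]) -> list[str]:
    """Build a new list holding only the URLs found in each scraped chunk."""
    cleaned = []
    for chunk in dirty_items:
        # A chunk may hold zero, one, or several matching divs.
        cleaned.extend(URL_PATTERN.findall(chunk))
    return cleaned


dirty = [
    '<div style="height:120px;background-image:url(http://somewhere.com/image1.jpeg)"></div>',
    '<div class="no-image"></div>',
    '<div style="background-image:url(http://somewhere.com/image2.jpeg)"></div>',
]
print(clean_scraped_items(dirty))
```

Building a fresh list avoids editing the list while looping over it, and entries with no background image simply drop out.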
