DjProg (posted April 7, 2016)

Hello guys,

What is the best way to "clean" a scraped innerHTML attribute? I'm scraping an innerHTML that contains an empty div with an inline style, and I need to extract the background-image URL from that style:

<div class="blablah" style="height:120px;background-image:url(http://somewhere.com/image.jpeg)"></div>

I scraped the innerHTML of the parent div of "blablah" because scraping the div itself didn't give me what I needed, but now I need to clean the result up a bit. Any tip is welcome!

Thanks a lot,
Cheers,
HelloInsomnia (posted April 7, 2016)

Regex is the answer for this. Here is an example:

set(#html,"<div class=\"blablah\" style=\"height:120px;background-image:url(http://somewhere.com/image.jpeg)\"></div>","Global")
set(#image_url,$find regular expression(#html,"(?<=image\\:url\\().*?(?=\\))"),"Global")
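For anyone who wants to try the pattern outside UBot Studio, here is a rough Python equivalent. It is only a sketch: Python's `re` module stands in for UBot's `$find regular expression`, and the function name `extract_background_url` is made up for illustration. The pattern is the same idea: a lookbehind for `image:url(`, a lazy match, and a lookahead for the closing `)`.

```python
import re

# Lookbehind for "image:url(", lazy match of the URL, lookahead for ")".
# The lookarounds are not part of the match, so only the URL is returned.
PATTERN = re.compile(r"(?<=image:url\().*?(?=\))")


def extract_background_url(html):
    """Return the first inline background-image URL in html, or None."""
    match = PATTERN.search(html)
    return match.group(0) if match else None


# Sample markup taken from the question above.
html = (
    '<div class="blablah" '
    'style="height:120px;background-image:url(http://somewhere.com/image.jpeg)">'
    "</div>"
)
print(extract_background_url(html))
```

Note that the pattern assumes the URL is written without quotes inside `url(...)`, as in the sample; a quoted URL would come back with its quotes attached.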
DjProg (posted April 8, 2016)

Thanks!

I forgot to say that I'm adding the scraped attributes to a list. So after adding them to the list, would I need to loop through the list to replace each "dirty" innerHTML with the "cleaned", regexed text? Or is there a more elegant solution?

Cheers,
HelloInsomnia (posted April 8, 2016)

Here are two examples of how you can add it to a list. The regex is applied as the item is added, so there is no need to loop over the list afterwards. Either add one at a time, like this:

clear list(%image_urls)
set(#html,"<div class=\"blablah\" style=\"height:120px;background-image:url(http://somewhere.com/image.jpeg)\"></div>","Global")
add item to list(%image_urls,$find regular expression(#html,"(?<=image\\:url\\().*?(?=\\))"),"Don\'t Delete","Global")

Or, if you need to add more than one at a time:

clear list(%image_urls)
set(#html,"<div class=\"blablah\" style=\"height:120px;background-image:url(http://somewhere.com/image1.jpeg)\"></div>
<div class=\"blablah\" style=\"height:120px;background-image:url(http://somewhere.com/image2.jpeg)\"></div>
<div class=\"blablah\" style=\"height:120px;background-image:url(http://somewhere.com/image3.jpeg)\"></div>
<div class=\"blablah\" style=\"height:120px;background-image:url(http://somewhere.com/image4.jpeg)\"></div>","Global")
add list to list(%image_urls,$list from text($find regular expression(#html,"(?<=image\\:url\\().*?(?=\\))"),$new line),"Delete","Global")
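The "several matches at once" case can also be sketched in Python, again with `re` swapped in for the UBot commands. The helper name `extract_all_background_urls` and its `drop_duplicates` flag are hypothetical; the flag assumes that the "Delete" argument in the UBot example refers to dropping duplicate entries.

```python
import re

# Same pattern as in the UBot example: everything between "image:url(" and ")".
PATTERN = re.compile(r"(?<=image:url\().*?(?=\))")


def extract_all_background_urls(html, drop_duplicates=True):
    """Return every inline background-image URL in html, in document order."""
    # With no capturing groups, findall returns the full text of each match.
    urls = PATTERN.findall(html)
    if drop_duplicates:
        # dict.fromkeys keeps the first occurrence and preserves order.
        urls = list(dict.fromkeys(urls))
    return urls


# Sample markup modeled on the thread, with one deliberate duplicate.
sample_html = "\n".join(
    '<div class="blablah" style="height:120px;'
    f'background-image:url(http://somewhere.com/{name})"></div>'
    for name in ("image1.jpeg", "image2.jpeg", "image1.jpeg")
)

image_urls = extract_all_background_urls(sample_html)
for url in image_urls:
    print(url)
```

Because the cleaning happens before the values go into the list, the list never holds the raw innerHTML, which is the same design choice as in the UBot snippets.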
DjProg (posted April 9, 2016)

Thanks!!