allcapone1912, posted January 16, 2016:

I just bought the HTTP POST plugin and am trying to update my old email-scraping script. In the old script everything worked fine, but with HTTP POST there are some problems. One of them is emails hidden by JavaScript.

add item to list(%http second url,$plugin function("HTTP post.dll", "$http get", "http://www.jc-design.com/contact-us.html", $plugin function("HTTP post.dll", "$http useragent string", "Random"), "http://google.com", "", 10),"Delete","Global")
set(#http second url,%http second url,"Global")
add list to list(%emails,$find regular expression(#http second url,"(?i)\\b[!#$%&\'*+./0-9=?_`a-z\{|\}~^-]+@[.0-9a-z-]+\\.[a-z]\{2,6\}\\b"),"Delete","Global")

For this URL, HTTP POST doesn't scrape the email. Can someone give me an idea how to scrape it? I am not an expert in JavaScript...
itexspert, posted January 16, 2016:

If the emails are hidden in JavaScript, I doubt you can get them using HTTP. JavaScript usually generates new code at runtime, which is why you need a browser!
HelloInsomnia, posted January 16, 2016:

Just an FYI, there's no need to add it to a list; you can put the GET request straight into the variable.

set(#http second url,$plugin function("HTTP post.dll", "$http get", "http://www.jc-design.com/contact-us.html", $plugin function("HTTP post.dll", "$http useragent string", "Random"), "http://google.com", "", 10),"Global")
add list to list(%emails,$find regular expression(#http second url,"(?i)\\b[!#$%&\'*+./0-9=?_`a-z\{|\}~^-]+@[.0-9a-z-]+\\.[a-z]\{2,6\}\\b"),"Delete","Global")

As for the JS part, it appears that JavaScript is creating the line with the email address. I'm not sure why, but you can see this line in the response to the GET request:

<br /><script type="text/javascript">insertEmailAddress('','webinfo','jc-design.com','');</script>

Now, if you had to scrape many pages with this same kind of thing, it would be possible to scrape out that info, because the script just joins webinfo and jc-design.com with an @ symbol. Give this a shot:

clear list(%emails)
set(#http second url,$plugin function("HTTP post.dll", "$http get", "http://www.jc-design.com/contact-us.html", $plugin function("HTTP post.dll", "$http useragent string", "Random"), "http://google.com", "", 10),"Global")
set(#email_line,$plugin function("HTTP post.dll", "$xpath parser", #http second url, "//div[@id=\'main-wrapper\']/div/script", "InnerHtml", "HTML"),"Global")
set(#email_line_cleanup,$replace($replace(#email_line,"insertEmailAddress(\'\',",$nothing),",\'\');",$nothing),"Global")
clear list(%split_email)
add list to list(%split_email,$list from text($replace(#email_line_cleanup,"\'",$nothing),","),"Delete","Global")
add item to list(%emails,"{$list item(%split_email,0)}@{$list item(%split_email,1)}","Don\'t Delete","Global")
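For readers who want to see the same idea outside UBot, here is a rough sketch in Python rather than UBot script. It assumes the page always emits calls shaped like the insertEmailAddress('','webinfo','jc-design.com',''); line quoted above, with the user part in the second argument and the domain in the third. The names CALL_PATTERN and extract_emails are made up for this illustration and are not part of the HTTP POST plugin. Unlike the XPath version, it does not depend on where the script tag sits in the page, so it also picks up several addresses on one page.

```python
import re
from typing import List

# Matches calls such as insertEmailAddress('','webinfo','jc-design.com','');
# Assumption: argument 2 is the user part and argument 3 is the domain.
# The purpose of arguments 1 and 4 is unknown, so they are matched but ignored.
CALL_PATTERN = re.compile(
    r"insertEmailAddress\(\s*'[^']*'\s*,\s*'([^']+)'\s*,\s*'([^']+)'\s*,\s*'[^']*'\s*\)"
)


def extract_emails(html: str) -> List[str]:
    """Rebuild an address from every insertEmailAddress(...) call in the HTML."""
    return [f"{user}@{domain}" for user, domain in CALL_PATTERN.findall(html)]


if __name__ == "__main__":
    sample = (
        '<br /><script type="text/javascript">'
        "insertEmailAddress('','webinfo','jc-design.com','');</script>"
    )
    print(extract_emails(sample))  # ['webinfo@jc-design.com']
```

Pages that hide addresses with a different function name or argument order would need their own pattern.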
allcapone1912 (author), posted January 16, 2016:

Thanks for your reply and for both pieces of advice. I just bought the HTTP plugin and am not totally familiar with it yet, so by mistake I used a useless add-to-list step. Also, your code for scraping the email is great. I just hope there will be many URLs like this one so I can get all possible emails.