botmaker7 Posted November 3, 2017

Hey guys, I can't figure out how to solve this problem. I have a list of hundreds of URLs. Example:

http://dailycaller.com/2017/11/02/trump-pick-for-top-agriculture-post-withdraws-name-following-russia-probe-revelations/
https://www.newsmax.com/politics/jeff-sessions-vladimir-putin-court-filings-donald-trump/2017/11/02/id/823762
https://www.wthr.com/article/ship-to-attempt-raising-russian-chopper-wreckage-in-arctic
http://www.motherjones.com/kevin-drum/2017/11/a-little-bit-of-pushback-on-the-jeff-sessions-story/
https://pjmedia.com/trending/nunes-dems-suddenly-interested-viewing-doj-dossier-docs-didnt-want-subpoenaed/
https://boingboing.net/2017/11/02/kgb-killed-jfk-celebrity-weig.html
https://www.rt.com/business/408536-rosneft-iran-energy-investments/
https://jingtravel.com/russia-mulls-easing-visa-requirements-as-chinese-tourist-numbers-grow/
http://forward.com/fast-forward/386807/billionaire-trump-backer-robert-mercer-sells-breitbart-stake-over-racism-cl/
https://finance.yahoo.com/news/hillary-clinton-defends-her-campaign-120530439.html

I want to scan them for foreign links and remove those URLs from the list. Example: .co.uk/.in/.ru/

ronaldod Posted November 3, 2017

Why not match on the links you would like to keep instead, such as .com/ and so on?
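
The allow-list idea above can be sketched outside UBot as well. The following is a minimal Python illustration, not UBot code: the `ALLOWED_TLDS` set and the `keep_url` helper are hypothetical names chosen for this example, and the sample URLs are taken or adapted from the thread. It parses each URL's hostname and keeps the URL only when the last label is an allowed TLD.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; extend it with whichever TLDs you want to keep.
ALLOWED_TLDS = {"com", "net", "org"}


def keep_url(url: str) -> bool:
    """Return True when the URL's hostname ends in an allowed TLD."""
    host = urlsplit(url).hostname or ""
    # The last dot-separated label is the TLD, so "jingtravel.co.uk" yields "uk".
    return host.rsplit(".", 1)[-1] in ALLOWED_TLDS


urls = [
    "https://www.wthr.com/article/ship-to-attempt-raising-russian-chopper-wreckage-in-arctic",
    "https://boingboing.net/2017/11/02/kgb-killed-jfk-celebrity-weig.html",
    "http://www.motherjones.ru/kevin-drum/2017/11/a-little-bit-of-pushback-on-the-jeff-sessions-story/",
    "https://jingtravel.co.uk/russia-mulls-easing-visa-requirements-as-chinese-tourist-numbers-grow/",
]

# Keep only the URLs whose TLD is on the allow-list.
kept = [u for u in urls if keep_url(u)]
for u in kept:
    print(u)
```

Checking the parsed hostname rather than the whole string means a ".com" that appears somewhere in the path cannot influence the decision.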

HelloInsomnia Posted November 3, 2017 (Solution)

You can use something like this:

http(|s)\:\/\/(|www\.)(|[a-zA-Z0-9-]+\.)[a-zA-Z0-9-]+\.(com|net|org).*

Depending on where the links come from, the code may need to be modified, but if it's just a list like that, it should work. You can add more TLDs near the end in the same format, like (com|net|org|us|ca), and so on.

Here is some example code:

set(#links,"http://www.stuff.dailycaller.com/2017/11/02/trump-pick-for-top-agriculture-post-withdraws-name-following-russia-probe-revelations/
https://www.newsmax.com/politics/jeff-sessions-vladimir-putin-court-filings-donald-trump/2017/11/02/id/823762
https://www.wthr.com/article/ship-to-attempt-raising-russian-chopper-wreckage-in-arctic
http://www.motherjones.com/kevin-drum/2017/11/a-little-bit-of-pushback-on-the-jeff-sessions-story/
http://www.motherjones.ru/kevin-drum/2017/11/a-little-bit-of-pushback-on-the-jeff-sessions-story/
https://pjmedia.com/trending/nunes-dems-suddenly-interested-viewing-doj-dossier-docs-didnt-want-subpoenaed/
https://boingboing.net/2017/11/02/kgb-killed-jfk-celebrity-weig.html
https://www.rt.com/business/408536-rosneft-iran-energy-investments/
https://jingtravel.com/russia-mulls-easing-visa-requirements-as-chinese-tourist-numbers-grow/
https://jingtravel.co.uk/russia-mulls-easing-visa-requirements-as-chinese-tourist-numbers-grow/
http://forward.com/fast-forward/386807/billionaire-trump-backer-robert-mercer-sells-breitbart-stake-over-racism-cl/
https://finance.yahoo.com/news/hillary-clinton-defends-her-campaign-120530439.html","Global")
clear list(%links)
add list to list(%links,$find regular expression(#links,"http(|s)\\:\\/\\/(|www\\.)(|[a-zA-Z0-9-]+\\.)[a-zA-Z0-9-]+\\.(com|net|org).*"),"Delete","Global")
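
To see what the pattern keeps and drops without opening UBot, here is a rough Python translation of the same idea. It is only a sketch: the names `URL_PATTERN`, `text`, and `kept` are made up for this example, the four sample URLs come from the post above, and it assumes one URL per line, because the trailing `.*` consumes everything up to the next line break.

```python
import re

# The pattern from the post, unchanged, written as a Python raw string.
URL_PATTERN = re.compile(
    r"http(|s)\:\/\/(|www\.)(|[a-zA-Z0-9-]+\.)[a-zA-Z0-9-]+\.(com|net|org).*"
)

# One URL per line; "." does not match a newline by default,
# so each match ends at the end of its own line.
text = "\n".join([
    "http://www.stuff.dailycaller.com/2017/11/02/trump-pick-for-top-agriculture-post-withdraws-name-following-russia-probe-revelations/",
    "http://www.motherjones.ru/kevin-drum/2017/11/a-little-bit-of-pushback-on-the-jeff-sessions-story/",
    "https://boingboing.net/2017/11/02/kgb-killed-jfk-celebrity-weig.html",
    "https://jingtravel.co.uk/russia-mulls-easing-visa-requirements-as-chinese-tourist-numbers-grow/",
])

# finditer + group(0) returns the full matched URL;
# findall would return the capture groups instead.
kept = [m.group(0) for m in URL_PATTERN.finditer(text)]
for url in kept:
    print(url)
```

With this sample, the .ru and .co.uk lines are skipped and the .com and .net lines are kept. One thing to watch: the pattern does not require a slash or the end of the line right after the TLD, so a host such as example.com.ru (a made-up example) would still match. It also allows at most one subdomain besides www. If either case shows up in your list, the pattern would need tightening.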

botmaker7 Posted November 4, 2017 (Author)

Thanks! Works like a charm. If I had to write that regex myself, it would have taken me 3 days to figure out, lol. Really appreciate it!