Mende, posted October 19, 2012 (edited October 20, 2012)

Hello,

I'm not very good with regex and wanted to ask if you can help me with my problem. Before posting, I read every topic about scraping and regex I could find and searched the forum and the internet, but that didn't solve it.

I want to scrape every href on, for example, "my-domain.xx", and while scraping, ignore all hrefs that contain the domain name itself, treated as a wildcard (that is, ignore every href matching *my-domain.xx/*).

Here is my code to scrape the hrefs, but I can't work out the regular expression:

set(#mydomain, $random list item(%mydomain), "Global")
add list to list(%hrefs, $find regular expression($scrape attribute(<href=w"http://*">, "href"), ""), "Delete", "Global")

[EDIT: I changed my code. I had accidentally made a mistake in it; it is correct now.]

Here are two ideas to solve this:

[1] I have an expression that validates any href:

^((http|https|ftp):\/\/(www\.)?|www\.)[a-zA-Z0-9\_\-]+\.([a-zA-Z]{2,4}|[a-zA-Z]{2}\.[a-zA-Z]{2})(\/[a-zA-Z0-9\-\._\?\&=,'\+%\$#~]*)*$

but I can't work out the additional expression to exclude the hrefs that contain the domain name.

[2] My other idea was to use the domain as a negation, to avoid hrefs that contain the domain name.

I hope this is understandable, and thank you for your help.

Regards,
Mende
LoWrIdErTJ - BotGuru, posted October 19, 2012

I'm not the greatest with regex, but for the domain portion you can add a negative lookahead: (?!DOMAINNAMEwithoutTLD). Replace DOMAINNAMEwithoutTLD with the domain name without the TLD (for example.com, that would be example), and it should filter those out.
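For readers who want to try the lookahead idea outside UBot Studio, here is a minimal sketch in Python. It is only an illustration, not the code from this thread: the names EXCLUDED_DOMAIN and external_hrefs are made up for the example, and it assumes the hrefs have already been scraped into a list of absolute http(s) URLs. It follows the "ignore every href containing the domain" wording of the question, so a URL that merely mentions the domain in its path or query string is dropped as well.

```python
import re

# Placeholder domain taken from the question; swap in the real one.
EXCLUDED_DOMAIN = "my-domain.xx"

# (?!...) is a negative lookahead. Anchored at the start of the string, it
# asserts that the excluded domain does not occur anywhere ahead, and only
# then lets the rest of the pattern match an absolute http(s) URL.
PATTERN = re.compile(
    r"^(?!.*" + re.escape(EXCLUDED_DOMAIN) + r")https?://\S+$",
    re.IGNORECASE,
)


def external_hrefs(hrefs):
    """Return the hrefs that do not contain EXCLUDED_DOMAIN (hypothetical helper)."""
    return [href for href in hrefs if PATTERN.match(href)]


if __name__ == "__main__":
    sample = [
        "http://my-domain.xx/page",
        "http://www.my-domain.xx/other",
        "https://example.com/",
    ]
    print(external_hrefs(sample))
```

Note that the lookahead sits at the very start of the pattern and scans ahead with .* first. Placing it only directly in front of the host name would not catch subdomains such as www.my-domain.xx.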
Mende (author), posted October 20, 2012

I think it could probably be easier to scrape every URL from the page and then clean your list.

First, thanks to LoWrIdErTJ - BotGuru. I will try it soon and give you some short feedback!

What is your solution to my problem?

Thanks!
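The "scrape every URL, then clean the list" approach can be sketched without any regex at all. The snippet below is a hedged Python illustration, not UBot Studio script and not the solution actually used in this thread; drop_own_domain is a hypothetical helper name. It compares the parsed host name instead of searching the whole string, so it is stricter than a plain "contains" check: it removes my-domain.xx and its subdomains but keeps unrelated hosts such as not-my-domain.xx.

```python
from urllib.parse import urlparse


def drop_own_domain(hrefs, own_domain):
    """Keep only hrefs whose host is neither own_domain nor one of its subdomains."""
    own_domain = own_domain.lower()
    kept = []
    for href in hrefs:
        # hostname is None for relative links, so fall back to an empty string.
        host = (urlparse(href).hostname or "").lower()
        if host == own_domain or host.endswith("." + own_domain):
            continue  # link points back at our own site, skip it
        kept.append(href)
    return kept


if __name__ == "__main__":
    scraped = [
        "http://my-domain.xx/a",
        "http://www.my-domain.xx/b",
        "https://example.com/",
        "http://not-my-domain.xx/",
    ]
    print(drop_own_domain(scraped, "my-domain.xx"))
```

Relative hrefs have no host and are therefore kept by this sketch; that is an assumption that fits a scrape limited to absolute http:// links, and would need changing if relative links are collected too.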
Mende (author), posted October 22, 2012

@LoWrIdErTJ - BotGuru: I tried it but didn't get any results. Thank you for your help, though.

@willywonka: Nice, thank you, it worked perfectly.

Problem solved =)

Mende