UBot Underground

RegEx: Exclude certain hrefs



Hello,

 

I'm not very good with regex and wanted to ask if you can help me with my problem.

Before posting, I read every topic about scraping and regex I could find and searched the forum and the internet, but that didn't solve my problem.

 

I want to scrape every href on a page of, for example, "my-domain.xx", and while scraping the bot should ignore all hrefs that contain the domain name itself, treated as a wildcard (=> ignore every href containing *my-domain.xx/*).

 

 

Here's my code to scrape the hrefs, but I can't work out the regular expression:

 

set(#mydomain, $random list item(%mydomain), "Global")

 

add list to list(%hrefs, $find regular expression($scrape attribute(<href=w"http://*">, "href"), ""), "Delete", "Global")

 

[EDIT: I changed my code; I had accidentally made a mistake in it, and it is correct now.]

 

Here are two ideas to solve this:

 

-----------------------------

[1]

 

I have this expression to validate any href:

 

^((http|https|ftp):\/\/(www\.)?|www\.)[a-zA-Z0-9\_\-]+\.([a-zA-Z]{2,4}|[a-zA-Z]{2}\.[a-zA-Z]{2})(\/[a-zA-Z0-9\-\._\?\&=,'\+%\$#~]*)*$

 

but I can't work out the additional expression to exclude the hrefs that contain the domain's name.

 

-------------------------------

 

[2]

My other idea was to use the domain as a negation, to avoid hrefs that contain the domain's name.
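
For illustration, the negation idea can be sketched with a negative lookahead. This is a minimal sketch in Python rather than UBot script, and it assumes the regex engine in use supports `(?!...)` lookaheads. The function name `build_external_href_pattern` and the sample URLs are hypothetical. It rejects any href that contains the domain string anywhere, which is slightly broader than requiring a trailing slash after the domain.

```python
import re

def build_external_href_pattern(domain: str) -> re.Pattern:
    """Build a pattern that matches URLs NOT containing `domain` (hypothetical helper)."""
    # re.escape makes the dots in the domain match literally.
    # (?!.*domain) fails the match if the domain appears anywhere in the URL.
    return re.compile(
        r"^(?!.*" + re.escape(domain) + r")(?:https?|ftp)://\S+$",
        re.IGNORECASE,
    )

pattern = build_external_href_pattern("my-domain.xx")

# Sample data, made up for this sketch.
hrefs = [
    "http://my-domain.xx/page",
    "http://www.my-domain.xx/",
    "http://other-site.com/a",
    "https://example.org/",
]

# Keep only the hrefs the pattern accepts, i.e. the external ones.
external = [h for h in hrefs if pattern.match(h)]
print(external)
```

The lookahead sits directly after `^`, so it is checked once against the whole URL before the rest of the pattern runs.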

 

-------------------------------

 

I hope I wrote this understandably, and thank you for your help.

 

 

Regards

Mende


I think it would probably be easier to scrape every URL from the page and then clean your list.
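
As a rough sketch of that "scrape everything, then clean" approach, here it is in Python rather than UBot script. The helper name `drop_internal_hrefs` and the sample URLs are made up for illustration, and a plain case-insensitive substring test is assumed to be good enough for spotting the domain.

```python
def drop_internal_hrefs(hrefs, domain):
    """Return hrefs that do not contain `domain`, without duplicates (hypothetical helper)."""
    needle = domain.lower()
    # Keep only URLs that do not mention the domain at all.
    kept = [h for h in hrefs if needle not in h.lower()]
    # dict.fromkeys removes duplicates while preserving the original order.
    return list(dict.fromkeys(kept))

# Sample scrape result, made up for this sketch.
scraped = [
    "http://my-domain.xx/about",
    "http://other-site.com/a",
    "http://www.my-domain.xx/contact",
    "http://other-site.com/a",
    "https://example.org/",
]

cleaned = drop_internal_hrefs(scraped, "my-domain.xx")
print(cleaned)
```

Filtering after the scrape keeps the scraping step simple and avoids having to fit the exclusion into one regular expression.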

 

First, thanks to LoWrIdErTJ - BotGuru; I will try it soon and give you some short feedback!

 

What is your solution to solve my problem?

 

Thanks!


@ LoWrIdErTJ - BotGuru

 

I tried it but didn't get any results, but thank you for your help.

 

@ willywonka

 

Nice, thank you, it worked perfectly.

 

Problem solved =)

 

mende

