UBot Underground

Is there a way to delete duplicate domains in a list using regex?



I have a large list of URLs in a .txt file, and I need to remove duplicate DOMAINS (along with the entire URL belonging to each duplicate) while keeping the first occurrence of each domain. For example, given this input:

 

http://www.exampleurl.com/something.php
http://exampleurl.com/somethingelse.htm  
http://exampleurl2.com/another-url  
http://www.exampleurl2.com/a-url.htm  
http://exampleurl2.com/yet-another-url.html  
http://exampleurl.com/  
http://www.exampleurl3.com/here_is_a_url  
http://www.exampleurl5.com/something

 

Whatever the solution is, the output file for the input above should be this:

 

http://www.exampleurl.com/something.php  
http://exampleurl2.com/another-url  
http://www.exampleurl3.com/here_is_a_url  
http://www.exampleurl5.com/something

 

Notice that there are no duplicate domains now, and that the first occurrence of each domain was kept.
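For readers working outside UBot, the requirement above can be sketched in Python. This is only an illustration under stated assumptions, not a UBot solution: the names `HOST_RE` and `dedupe_by_domain` are hypothetical, and the regex assumes plain `http://` or `https://` URLs with an optional leading `www.`. It treats other subdomains as separate domains.

```python
import re

# Assumption: each line is an http(s) URL; a leading "www." is ignored
# so that www.exampleurl.com and exampleurl.com count as the same domain.
HOST_RE = re.compile(r"^(?:https?://)?(?:www\.)?([^/:?#\s]+)", re.IGNORECASE)


def dedupe_by_domain(urls):
    """Return the URLs in order, keeping only the first URL per domain."""
    seen = set()
    kept = []
    for url in urls:
        url = url.strip()  # the sample lines carry trailing spaces
        if not url:
            continue
        match = HOST_RE.match(url)
        if match is None:
            continue
        domain = match.group(1).lower()
        if domain not in seen:
            seen.add(domain)
            kept.append(url)
    return kept


sample = [
    "http://www.exampleurl.com/something.php",
    "http://exampleurl.com/somethingelse.htm  ",
    "http://exampleurl2.com/another-url  ",
    "http://www.exampleurl2.com/a-url.htm  ",
    "http://exampleurl2.com/yet-another-url.html  ",
    "http://exampleurl.com/  ",
    "http://www.exampleurl3.com/here_is_a_url  ",
    "http://www.exampleurl5.com/something",
]

result = dedupe_by_domain(sample)
```

Here `result` holds the four URLs listed in the expected output above. A set of exact domain strings is used so that each lookup compares whole domains rather than substrings.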

  • 3 months later...
  • 10 months later...

You are going to have to loop through them, I think.

 

set(#List, "http://www.exampleurl.com/something.php
http://exampleurl.com/somethingelse.htm
http://exampleurl2.com/another-url
http://www.exampleurl2.com/a-url.htm
http://exampleurl2.com/yet-another-url.html
http://exampleurl.com/
http://www.exampleurl3.com/here_is_a_url
http://www.exampleurl5.com/something", "Global")
add list to list(%RawUrls, $list from text(#List, $new line), "Delete", "Global")
add item to list(%Clean Url List, $next list item(%RawUrls), "Delete", "Global")
loop($subtract($list total(%RawUrls), 1)) {
    set(#Current Url, $find regular expression($next list item(%RawUrls), "[a-z-A-Z0-9]\{1,99\}\\.((com|org|net|eu|pt|uk|es|br|co|cz|fn)|\\.(uk|vu|cz|en|br|es))"), "Global")
    set(#Compair, %Clean Url List, "Global")
    if($contains(#Compair, #Current Url)) {
        then {
        }
        else {
            add item to list(%Clean Url List, $list item(%RawUrls, $subtract($list position(%RawUrls), 1)), "Delete", "Global")
        }
    }
}
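One thing to watch with a "contains" style comparison: if it is a plain substring search over the text of the clean list (an assumption about how `$contains` behaves here), a domain can match inside a longer, different domain. The Python sketch below illustrates that pitfall only; `substring_seen` and `exact_seen` are hypothetical helper names, and the URLs are made up for the example.

```python
def substring_seen(domain, kept_urls):
    # Mirrors a plain substring check against the joined text of the kept URLs.
    return domain in "\n".join(kept_urls)


def exact_seen(domain, seen_domains):
    # Compares whole domain strings only.
    return domain in seen_domains


kept_urls = ["http://myexampleurl.com/page"]
seen_domains = {"myexampleurl.com"}

# "exampleurl.com" is a different domain, but it is a substring of the kept URL.
substring_result = substring_seen("exampleurl.com", kept_urls)
exact_result = exact_seen("exampleurl.com", seen_domains)
```

With these inputs `substring_result` is `True` (a false positive) while `exact_result` is `False`. Keeping a separate list of extracted domains and comparing against that is one way to avoid the mismatch.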

  • 4 months later...

Be aware that list indexes are 0-based; a list with 3 items has them at the following positions:

 

0 -- for first list item

1 -- for second list item

2 -- for third list item

 

The error you reported most likely comes from looping over the list indexes starting at 1 instead of 0, so the last item you try to read is not there: your $next list item call asks for an index outside the bounds of the list.

Make sure to loop the list within its boundaries.

 

Use $list item instead of $next list item, together with an index counter, so that the list position is always set explicitly.

 

With $next list item you also have to make sure the list position counter is reset to 0 before you start looping, which may be another cause of the failure.
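The overrun described above can be reproduced in a few lines of Python. This is a rough analogy, not UBot code: `make_next_item` is a hypothetical stand-in for a "next item" reader with an internal position counter, and Python raises `IndexError` where UBot may report the problem differently.

```python
def make_next_item(items):
    """Return a reader that hands out items one by one, tracking its own position."""
    position = 0

    def next_item():
        nonlocal position
        item = items[position]  # raises IndexError once position passes the end
        position += 1
        return item

    return next_item


urls = ["first", "second", "third"]

# Overrun: one item is read before the loop, yet the loop still runs len(urls) times.
next_item = make_next_item(urls)
first = next_item()
overran = False
try:
    for _ in range(len(urls)):
        next_item()
except IndexError:
    overran = True

# Within bounds: an explicit 0-based index from 0 to len(urls) - 1.
visited = [urls[index] for index in range(len(urls))]
```

Here `overran` ends up `True` because the fourth read asks for index 3 of a 3-item list, while the explicit index loop visits every item exactly once.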

 

Hope this helps you...


You're welcome - feel free to hit the LIKE THIS button on the bottom-right corner of any post of anyone who helps you on the forum.

 

Cheers!
