UBot Underground


I thought this may be interesting for many applications.

 

Let's build together a REGEX to grab ALL URLs (links) from a page or text.

 

Links may start with http (and sometimes https) or with whatever else you may need, for instance ftp, OR they may lack the protocol altogether and be written the old-fashioned way with only a www. in front.

 

So our expression should start looking to match this at the beginning of the substring:

 

(ftp|http) --> (the | means OR) so basically the link either starts with ftp OR http, and the parentheses () enclose them so they are treated as a group;

however, we may also have https, so we can code that in two ways: either add another alternative with |, or add an optional s at the end, like this:

(http|https) OR https? where the ? means the previous character/group is repeated zero or one times (in other words, it's optional)

 

So the starting code becomes now:
(ftp|https?)

that should be followed by the common (for all variants) :// which translates to:

((ftp|https?):\/\/) <-- where you can see the forward slashes were escaped so the regex engine reads them as plain characters (engines that use / as a pattern delimiter require this; in most other engines the backslash is simply harmless).  Of course, the whole composition had to be grouped again (with the extra outer set of parentheses) to keep it all together.

 

The string continues with or without a www. as we all know (a few years ago it was always present), which in fact is nothing else but a subdomain of the main domain, but never mind that now.

 

So we need to add the optional www. like this:

(www\.)?  <-- Notice the period (dot) had to be escaped to tell the regex engine we mean it literally, or else it would have been interpreted as 'any character'.

Also note that the whole thing is optional, so in order to apply the question mark ? (which means 'previous is optional') we needed to group the whole thing with parentheses (), or else the ? would have applied only to the last character (the dot in this case), which would have been wrong...

 

So far, the expression would find any string that starts with any of the following:

  • ftp://  or  ftp://www.
  • http://  or  http://www.
  • https://  or  https://www.

There is one small problem though - some links may start with only www. (no http, etc..) so we need to address that too before we move on along the string.

 

Making the whole protocol part optional with the ? again would have been easy, and it would have worked whenever the string did start with www., but the issue is that such a regex would return many unwanted substrings whenever the www. is missing altogether.

 

So basically, this time we need to use the | operator to alternate the options, like this...

  • on the left goes everything we built before: ((ftp|https?):\/\/)(www\.)?
  • on the right only the www. alone: (www\.)
  • ultimately giving us: (((ftp|https?):\/\/)(www\.)?|www\.)
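
A quick way to sanity-check the prefix on its own is to run it against a few sample strings. The sketch below uses Python's re module as a stand-in (an assumption, since this thread targets UBot's own regex node); the sample strings and the helper name leading_prefix are made up for illustration.

```python
import re

# Prefix part only, exactly as assembled in the list above.
PREFIX = re.compile(r"(((ftp|https?):\/\/)(www\.)?|www\.)")

def leading_prefix(text):
    """Return the prefix matched at the start of text, or None."""
    m = PREFIX.match(text)
    return m.group(0) if m else None

# Hypothetical sample strings, one per variant discussed above.
samples = [
    "http://example.com",
    "https://www.example.com",
    "ftp://files.example.com",
    "www.example.com",
    "example.com",
]

results = {s: leading_prefix(s) for s in samples}
for s, prefix in results.items():
    print(s, "->", prefix)
```

The last sample has neither a protocol nor a www., so the prefix does not match it at all.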

 

---------

 

Now that we have the starting part of the substring, let's build the part that runs up to and including the tld.

 

Here we can have either a simple domain name followed by its tld (.com .biz etc) OR maybe we have subdomains before the domain (such as www is, but could be many other cases)

 

The domain and subdomain parts may have alphanumeric characters, and they may also contain signs like - (hyphen) or _ (underscore), but they must not have spaces, or else we would consider the string (link) has ended there.

The various subdomains, the actual domain name and the last piece, the tld may also contain (be separated by) dots.

 

Here is how we tell the regex engine all about this...

 

Let's first build a class that has all the above mentioned elements in it:

  • digits -----------> for regex, that is an escaped d like so:  \d
  • letters ---------> the a-z group <-- notice in this case the - hyphen between a and z has another meaning (everything FROM a to z)
  • hyphens -------> literally, as in - sign
  • underscores --> _
  • dots ------------> .

Everything gets surrounded by [] square brackets, which (unlike the round ones, which simply mean logical grouping) define a character class instead:

[\da-z-_\.]

 

However, a character from this class may appear once or more times, but can never be absent, so this time we cannot use the ? and we will use the + modifier instead (meaning one or more times):

([\da-z-_\.]+)

 

Once we finish with the subdomain(s).domain construction, the last piece will be the tld (meaning 'top level domain'), something like:

  • com
  • org
  • co.uk
  • co.mobi

etc...

 

Notice the previous piece can already end with the dot, so we don't need to add it here.  We start with what the substring may contain: in this case no more digits, underscores, hyphens and the like, but ONLY letters and the occasional dot in composite multilevel tlds combined with cctlds (country code top level domains, like .us, .uk, etc.)

 

Our new class for the (cc)tld should be:

[a-z\.]

However, there is a minimum and a maximum number of characters (at least for now): no tld is shorter than 2 characters (.co), most have 3 (.com), some even 4 (.mobi), and the combos (.co.uk OR .co.mobi) can raise that limit up to 7 characters (including the dot in the middle).

 

So our code will limit the number of repetitions of that class to between 2 and 7, using the curly brackets {} this way:

([a-z\.]{2,7})

 

Basically, our regex expression now changed to this:

 

(((ftp|https?):\/\/)(www\.)?|www\.)([\da-z-_\.]+)([a-z\.]{2,7})
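
To check what the expression covers at this stage, here is a minimal sketch in Python's re (a stand-in for UBot's regex node, so treat engine details as an assumption). The helper name host_part and the sample sentences are invented for the example.

```python
import re

# Prefix + domain class + tld class, as built so far.
HOST = re.compile(r"(((ftp|https?):\/\/)(www\.)?|www\.)([\da-z-_\.]+)([a-z\.]{2,7})")

def host_part(text):
    """Return the first protocol/www + host match found in text, or None."""
    m = HOST.search(text)
    return m.group(0) if m else None

print(host_part("see http://www.example.com/page for details"))
print(host_part("mirror: ftp://files.example.co.uk"))
print(host_part("no links here"))

# Inspect how the two trailing groups divide the host between them.
m = HOST.search("see http://www.example.com/page for details")
domain_group, tld_group = m.group(5), m.group(6)
print(domain_group, tld_group)
```

The overall match stops where the host ends, which is what we want. Note that the split between the last two groups is decided by backtracking, not by the real tld boundary: for example.com the greedy domain class keeps example.c and leaves only om for the tld group. Also, since the classes list only a-z, upper-case hosts need a case-insensitive flag (re.IGNORECASE in Python).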

 

 

-----------

 

Next we need to add the expressions to match the actual structure of subfolders+files+parameters of the link itself.

 

We will have groups containing basically anything except for a space now, separated with / slashes now and then, so let's start building it...

 

  • starts with a forward slash (needs to be escaped): \/
    followed by
  • any 'word' character/group written as an escaped w:  \w
    could be followed by (or contain):
    • \.   a . dot (escaped)
    • -    a - hyphen
    • _   a _ underscore
    • \?  a ? question mark (escaped)
    • \&  a & ampersand (escaped)

 

[\/\w\.-_\?\&]

 

The thing is, any character in this class, as well as the group itself, can appear once or a number of times... but it can also be absent altogether (as we might match URLs ending just after the domain name, with no subfolders/files whatsoever, right?)

 

Therefore this time, we repeat both the character class AND the group using the * instead of either ? or + as before, as * means repeat zero or more times the previous character/group:

 

[\/\w\.-_\?\&]* --- >  ([\/\w\.-_\?\&]*)*

 

The & (ampersand) and ? (question mark) as literal signs are included because there are tracking links using parameters, like this:  ~~~whatever domain here followed by~~~.com/index.php?a=ablatss&u=gyegye_123
that we want to match as well.

 

And finally, as sometimes these links end with a / forward slash (escaped as \/), meaning it's a subfolder that probably contains an index.html (or .php etc) that will be loaded automatically there, we need to add it at the end, but only as an option.

So here we shall use the ? modifier once again:

 

([\/\w\.-_\?\&]*)*\/?

 

-----------

 

Let's put together all the code again, one more (last) time:

 

(((ftp|https?):\/\/)(www\.)?|www\.)([\da-z-_\.]+)([a-z\.]{2,7})([\/\w\.-_\?\&]*)*\/?
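
For anyone who wants to try the finished expression outside UBot, here is a small sketch using Python's re module (an assumption on my side; the sample text and the variable names are invented). It uses finditer and group(0) because the pattern contains capture groups, and findall would return the groups instead of the whole link.

```python
import re

# The full expression from the tutorial, unchanged.
URL = re.compile(
    r"(((ftp|https?):\/\/)(www\.)?|www\.)([\da-z-_\.]+)([a-z\.]{2,7})([\/\w\.-_\?\&]*)*\/?"
)

# Hypothetical text containing one link of each starting style.
text = (
    "Docs at http://www.example.com/index.php?a=1&b=2 and a mirror at "
    "ftp://files.example.org/pub/ or www.example.net"
)

# group(0) is the whole match, regardless of the inner groups.
found = [m.group(0) for m in URL.finditer(text)]
for link in found:
    print(link)
```

Each match stops at the first space, as intended.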

 

 

I truly think this covers all bases, but you guys give it a try too and if there are unmatched exceptions, let me know, maybe we can improve it together.

 

Enjoy!

Link to post
Share on other sites

For UBot code, some things (the escaping backslash itself) will be double escaped in Code View, so here is the proper code:

 

set(#var_URL_Extract_URL, $find regular expression(#var_INP_Input_TXT, "(((ftp|https?):\\/\\/)(www\\.)?|www\\.)([\\da-z-_\\.]+)([a-z\\.]\{2,7\})([\\/\\w\\.-_\\?\\&]*)*\\/?"), "Global")
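
Comparing the plain expression with the Code View line above, two things change: every backslash is doubled and the curly braces get a backslash in front. The Python sketch below reproduces just those two visible differences; it is not an official UBot escaping routine, and the function name to_ubot_code_view is made up.

```python
def to_ubot_code_view(pattern: str) -> str:
    """Apply the two differences visible in the Code View line above.

    Assumption: doubling backslashes and escaping { } is all that is needed;
    other characters may need care that this sketch does not cover.
    """
    doubled = pattern.replace("\\", "\\\\")  # \ becomes \\
    return doubled.replace("{", "\\{").replace("}", "\\}")  # {2,7} becomes \{2,7\}

plain = r"(((ftp|https?):\/\/)(www\.)?|www\.)([\da-z-_\.]+)([a-z\.]{2,7})([\/\w\.-_\?\&]*)*\/?"
code_view = to_ubot_code_view(plain)
print(code_view)
```

The backslashes are doubled first so that the backslashes added in front of the braces are not doubled again.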
Link to post
Share on other sites

Nope, I don't think so.

 

The \s matches any whitespace character (blank, tab, newline, etc...), so basically it will extend the match beyond the end of the URL itself, and then it will maybe find another word of text and add it to the result (and we don't want that)

 

As soon as a space is encountered, we need to stop.
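
To illustrate, here is a small sketch in Python's re (used as a stand-in, since the thread targets UBot). It assumes the suggestion was to add \s to the last character class; the sample sentence is invented.

```python
import re

HOST = r"(((ftp|https?):\/\/)(www\.)?|www\.)([\da-z-_\.]+)([a-z\.]{2,7})"

# Same tail as the tutorial, without and with \s in the class.
WITHOUT_SPACE = re.compile(HOST + r"([\/\w\.-_\?\&]*)*\/?")
WITH_SPACE = re.compile(HOST + r"([\/\w\.-_\?\&\s]*)*\/?")

text = "visit http://www.example.com/page and read on"

stops_at_space = WITHOUT_SPACE.search(text).group(0)
runs_past_space = WITH_SPACE.search(text).group(0)

print(repr(stops_at_space))
print(repr(runs_past_space))
```

With \s in the class the match swallows the words that follow the link, which is exactly the unwanted result described above.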

Link to post
Share on other sites
  • 2 months later...

OMG

I'm pretty sure by learning this i will be able to solve my confirmation email problems when it's only showing up as plain text in mails !!!!!!

 

I will try this later today WOW THANKS SO MUCH for the time you took to put this out !!!

Link to post
Share on other sites

I must be missing something as the general code isn't working for me

I pasted this code into

$find regular expression

(((ftp|https?):\\/\\/)(www\\.)?|www\\.)([\\da-z-_\\.]+)([a-z\\.]\{2,7\})([\\/\\w\\.-_\\?\\&]*)*\\/?

also this

(ftp|https?):\\/\\/)(www\\.)?|www\\.)([\\da-z-_\\.]+)([a-z\\.]\{2,7\})([\\/\\w\\.-_\\?\\&]*)*\\/?

 

got zero return in debugger.

 


This works by the way

(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?

but it adds amp; in the url and, of course, that gives an invalid url when followed.
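
The stray amp; usually means the page source contains the HTML entity &amp; instead of a bare &. A possible way around it, sketched here in Python (an assumption, as UBot has its own commands for this), is to decode the entities before or after extracting the link. The sample anchor tag is invented.

```python
import html
import re

# The alternative expression quoted above, unchanged.
URL = re.compile(
    r"(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?"
)

# Hypothetical page source where & is written as the entity &amp;
raw = '<a href="http://example.com/page.php?a=1&amp;b=2">link</a>'

# Matching the raw source leaves the entity text inside the link.
raw_match = URL.search(raw).group(0)

# Decoding entities first turns &amp; back into a plain &.
decoded = html.unescape(raw)
clean_url = URL.search(decoded).group(0)

print(raw_match)
print(clean_url)
```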

 

 

I guess i have no choice but to follow your guide from start and learn/build the code i need

Now i need the pro version of editpad first...

Link to post
Share on other sites

When you read ALL the above tutorial, you'll understand WHY the code you pasted didn't work :) 


Tip: Beware the escaping characters!

Link to post
Share on other sites
  • 11 months later...

This thread was instrumental in helping me on my bot, I did however run into a problem with the associated regex

 

The url that I wanted to get was formatted as: http://www.domain.com/member-login

 

And when I ran the regex code, it produced http://www.domain.com/member

 

Cutting off the -login from the full domain/file path.

 

I read this page multiple times and (thanks for breaking it down in bite sized chunks) everything looked like it should be working, yet it was cutting off the end portion of the url.

 

Anyway I made a small modification:

 

Original:    (((ftp|https?):\/\/)(www\.)?|www\.)([\da-z-_\.]+)([a-z\.]{2,7})([\/\w\.-_\?\&]*)*\/?

Modified:   (((ftp|https?):\/\/)(www\.)?|www\.)([\da-z-_\.]+)([a-z\.]{2,7})([\/\w\.-\-_\?\&]*)*\/?

 

I have run several types/styles of url's through it and now it catches everything as expected.

(by the way, the modification is the \- added inside the last character class)

 

Now, being an absolute and complete beginner with Regex, I hope that my little mod to the code didn't break anything and would love for anyone who has a good understanding of regex correct me if my thinking was wrong on this.
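
A likely explanation for the cut-off, checked here with Python's re (which may not behave exactly like UBot's engine): inside the last class, \.-_ is read as a range running from . to _ rather than as three separate characters, so the hyphen itself is not in the class. That range also happens to cover = and :, which is why query strings still worked. The sketch compares the original tail with an adjusted one where the hyphen sits at the end of the class and = is listed explicitly; the adjusted class is my own suggestion, not something from the thread.

```python
import re

HOST = r"(((ftp|https?):\/\/)(www\.)?|www\.)([\da-z-_\.]+)([a-z\.]{2,7})"

# Original tail: \.-_ forms a range from '.' to '_', which excludes '-'.
ORIGINAL = re.compile(HOST + r"([\/\w\.-_\?\&]*)*\/?")

# Adjusted tail (suggestion): hyphen last so it is literal, '=' listed explicitly.
ADJUSTED = re.compile(HOST + r"([\/\w.?&=-]*)*\/?")

url = "http://www.domain.com/member-login"
original_match = ORIGINAL.search(url).group(0)
adjusted_match = ADJUSTED.search(url).group(0)
print(original_match)
print(adjusted_match)

# The modified class from the post above is rejected by Python's re,
# because it reads \.-\- as a range that runs backwards.
try:
    re.compile(HOST + r"([\/\w\.-\-_\?\&]*)*\/?")
    modified_compiles = True
except re.error:
    modified_compiles = False
print(modified_compiles)
```

Engines differ in how they read a hyphen in the middle of a class, so placing it last (or first) is the spelling that travels best between them.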

Link to post
Share on other sites
  • 8 months later...
  • 2 months later...

Hoping someone can point me in the right direction on this.  I posted here as I'm using the regex above and it's the best one so far. However, I think I need a modification of it.

 

Examples of the urls Im trying to get

 

http://www.businesswire.com/news/home/20150130005083/en/Alcoa-Opens-Expanded-Wheels-Manufacturing-Plant-Hungary&ct=ga&cd=CAEYCCoUMTEzNDc0OTgyMjMyODg4MDQ2NjYyGjE3NDdmNDZhN2Y3MjI1NWY6Y29tOmVuOlVT&usg=AFQjCNGWQZ69UMcg0wFhQSvuELRIEvs96Q%3E#.VON9J_nF98E

 

http://www.bloomberg.com/news/articles/2015-01-30/rich-indians-add-to-record-sales-of-structured-notes&ct=ga&cd=CAEYAioUMTEzNDc0OTgyMjMyODg4MDQ2NjYyGjE3NDdmNDZhN2Y3MjI1NWY6Y29tOmVuOlVT&usg=AFQjCNEuB53O5xkHdDMfx5IW9shW5JTznQ%3E

 

http://www.dallasnews.com/business/columnists/steve-brown/20150129-building-boom-could-add-more-than-1000-hotel-rooms-in-downtown-dallas.ece&ct=ga&cd=CAEYAyoUMTEzNDc0OTgyMjMyODg4MDQ2NjYyGjE3NDdmNDZhN2Y3MjI1NWY6Y29tOmVuOlVT&usg=AFQjCNEGyq_XYEK_URZN8-QUvrDlFiBe6w%3E

 

http://www.businessinsider.com/r-dunkin-donuts-2015-drink-expansion-includes-frozen-dunkaccinos-2015-2&ct=ga&cd=CAEYBioTNDQ4OTYwMzk4MTg1NzIwMTg0MjIaMTc0N2Y0NmE3ZjcyMjU1Zjpjb206ZW46VVM&usg=AFQjCNFYZ-Da-ikehk0bH-mYHfx0XGsc_Q%3E

 

Grabbing the url works great with the above regex, but it won't work unless I strip off everything after &ct=ga. I tried doing it in 2 steps with a replace afterwards, but I'm just not getting it. Hopefully someone can help me out here.. Currently I just replace it with a break and run the regex for the url again, but there has to be a better way..

Link to post
Share on other sites


I hope I don't understand it wrong, but you could just use:

http://.+?(?=\/)

Dan

Link to post
Share on other sites

This works great for trimming it to the last / but it loses the post url.

 

&ct=ga) then finally

 

(((ftp|https?):\/\/)(www\.)?|www\.).+?(?=\&ct=ga)

 

and it's perfect now, chopping two other commands off the routine. Dan the man.. Thx bud
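
For reference, this is how the lookahead version behaves on one of the sample links, sketched in Python's re (a stand-in for UBot's regex node). The lazy .+? grows one character at a time until the lookahead (?=\&ct=ga) sees the tracking parameters, and the lookahead itself is not part of the match.

```python
import re

# Prefix from the tutorial, then "as little as possible" up to &ct=ga.
TRIM = re.compile(r"(((ftp|https?):\/\/)(www\.)?|www\.).+?(?=\&ct=ga)")

# One of the example links posted above.
alert_url = (
    "http://www.bloomberg.com/news/articles/2015-01-30/"
    "rich-indians-add-to-record-sales-of-structured-notes"
    "&ct=ga&cd=CAEYAioUMTEzNDc0OTgyMjMyODg4MDQ2NjYyGjE3NDdmNDZhN2Y3MjI1NWY6Y29tOmVuOlVT"
    "&usg=AFQjCNEuB53O5xkHdDMfx5IW9shW5JTznQ%3E"
)

m = TRIM.search(alert_url)
article_url = m.group(0) if m else None
print(article_url)
```

A link without &ct=ga gives no match at all with this pattern, so it only suits sources where that marker is always present.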

 


Link to post
Share on other sites


Ah ok. I thought you just need the domain :-)

 

Dan

Link to post
Share on other sites
  • 1 year later...

So I have been trying to figure this out for the better half of the day, and I'm slowly losing my freakin mind lol..

 

Using OP regex:

 

(((ftp|https?):\/\/)(www\.)?|www\.)([\da-z-_\.]+)([a-z\.]{2,7})([\/\w\.-_\?\&]*)*\/?

 

 

Scraping this block of Code:

					<a href="http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=31192499_214027857&name=Lamb_Of_God__-_Laid_To_Rest_(instrumental)" rel="nofollow" download="Lamb_Of_God__-_Laid_To_Rest_(instrumental)" onclick="return !window.open(this.href);">
						<div id="download"></div>
					</a>
				

					<a href="http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=67810665_73273966&name=Lamb_Of_God_-_Laid_To_Rest" rel="nofollow" download="Lamb_Of_God_-_Laid_To_Rest" onclick="return !window.open(this.href);">
						<div id="download"></div>
					</a>
				

					<a href="http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=96523707_456239030&name=__-_Laid_To_Rest_(lamb_Of_God_Cover)" rel="nofollow" download="__-_Laid_To_Rest_(lamb_Of_God_Cover)" onclick="return !window.open(this.href);">
						<div id="download"></div>
					</a>
				

					<a href="http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=2725384_69571584&name=Lamb_Of_God_-_Laid_To_Rest_(drum_Track)" rel="nofollow" download="Lamb_Of_God_-_Laid_To_Rest_(drum_Track)" onclick="return !window.open(this.href);">
						<div id="download"></div>
					</a>
				

					<a href="http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=14147597_266488554&name=Council_Of_Sinners_-_Laid_To_Rest_(lamb_Of_God_Cover)" rel="nofollow" download="Council_Of_Sinners_-_Laid_To_Rest_(lamb_Of_God_Cover)" onclick="return !window.open(this.href);">
						<div id="download"></div>
					</a>
				

					<a href="http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=19170398_156883225&name=Lamerson_-_Laid_To_Rest_(lamb_Of_God_Cover)" rel="nofollow" download="Lamerson_-_Laid_To_Rest_(lamb_Of_God_Cover)" onclick="return !window.open(this.href);">
						<div id="download"></div>
					</a>
				

					<a href="http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=44427398_258156765&name=Lamb_Of_God_-_Laid_To_Rest_(guitar_Backing_Track)" rel="nofollow" download="Lamb_Of_God_-_Laid_To_Rest_(guitar_Backing_Track)" onclick="return !window.open(this.href);">
						<div id="download"></div>
					</a>
				

					<a href="http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=6536837_97966760&name=Lamb_Of_God_-_Laid_To_Rest_(andy's_Ill_Dubstep_Remix)_" rel="nofollow" download="Lamb_Of_God_-_Laid_To_Rest_(andy's_Ill_Dubstep_Remix)_" onclick="return !window.open(this.href);">
						<div id="download"></div>
					</a>
				

					<a href="http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=9186025_116877841&name=__-_Laid_To_Rest_(lamb_Of_God_Cover_Drunk_Version_A#)" rel="nofollow" download="__-_Laid_To_Rest_(lamb_Of_God_Cover_Drunk_Version_A#)" onclick="return !window.open(this.href);">
						<div id="download"></div>
					</a>
				

					<a href="http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=-12053695_76121817&name=__Sicktale_-_Laid_To_Rest(lamb_Of_God_Cover)" rel="nofollow" download="__Sicktale_-_Laid_To_Rest(lamb_Of_God_Cover)" onclick="return !window.open(this.href);">
						<div id="download"></div>
					</a>
				

					<a href="http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=8857796_57007666&name=Lamb_Of_God_-_Laid_To_Rest" rel="nofollow" download="Lamb_Of_God_-_Laid_To_Rest" onclick="return !window.open(this.href);">
						<div id="download"></div>
					</a>
				

					<a href="http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=2141879_158421996&name=Lamb_Of_God_-_Laid_To_Rest_(live)" rel="nofollow" download="Lamb_Of_God_-_Laid_To_Rest_(live)" onclick="return !window.open(this.href);">
						<div id="download"></div>
					</a>
				

					<a href="http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=134370700_110461251&name=Lamb_Of_God_-_Laid_To_Rest_-_Lamb_Of_God_-_Laid_To_Rest" rel="nofollow" download="Lamb_Of_God_-_Laid_To_Rest_-_Lamb_Of_God_-_Laid_To_Rest" onclick="return !window.open(this.href);">
						<div id="download"></div>
					</a>
				

					<a href="http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=214484670_323903213&name=Demented_Dimensions_-_Laid_To_Rest_(lamb_Of_God'_Rmx)" rel="nofollow" download="Demented_Dimensions_-_Laid_To_Rest_(lamb_Of_God'_Rmx)" onclick="return !window.open(this.href);">
						<div id="download"></div>
					</a>
				

					<a href="http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=22243705_167509174&name=Lamb_Of_God_-_Laid_To_Rest_(kabz_Drumstep_Vip)_[_Rockstep]" rel="nofollow" download="Lamb_Of_God_-_Laid_To_Rest_(kabz_Drumstep_Vip)_[_Rockstep]" onclick="return !window.open(this.href);">
						<div id="download"></div>
					</a>
				

					<a href="http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=15878466_155097541&name=J.zubkov_Lamb_Of_God_-_Laid_To_Rest_" rel="nofollow" download="J.zubkov_Lamb_Of_God_-_Laid_To_Rest_" onclick="return !window.open(this.href);">
						<div id="download"></div>
					</a>
				

					<a href="http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=14115593_219755748&name=Lamb_Of_God_-_Laid_To_Rest_Minus" rel="nofollow" download="Lamb_Of_God_-_Laid_To_Rest_Minus" onclick="return !window.open(this.href);">
						<div id="download"></div>
					</a>
				

					<a href="http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=2000212912_272012340&name=Lamb_Of_God_-_Laid_To_Rest_(demo)" rel="nofollow" download="Lamb_Of_God_-_Laid_To_Rest_(demo)" onclick="return !window.open(this.href);">
						<div id="download"></div>
					</a>
				

					<a href="http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=16413909_110076554&name=Demented_Dimensions_-_Laid_To_Rest.(lamb_Of_God_Remix)" rel="nofollow" download="Demented_Dimensions_-_Laid_To_Rest.(lamb_Of_God_Remix)" onclick="return !window.open(this.href);">
						<div id="download"></div>
					</a>
				

and ending up with these results:
 

 

I've set a piece of code in BOLD and increased the font size of the same (or what is supposed to be the same) link, one pre-regex and one post-regex.  I'm trying to automate the downloading of these files and need the regex pass to pick up the entire URL, but this piece of regex (and several others I've been tinkering with ALL DAY) keeps cutting off the end of the link (and if you look, this is the actual name of the file on the CMS, to make things more inconvenient)

 

I've tried lookaheads, hard-ending the string with every character type and whitespace I can, etc etc, but can't seem to get it to grab the part of the link that sits between the two quotation marks!  Could someone wiser in the ways of living hell - I mean, regex - give me a little hand here?   Thanks!

 

What regex is spitting out (end of the link): Lamb_Of_God__

What I NEED it to spit out: Lamb_Of_God__-_Laid_To_Rest_(instrumental)

 

HALP

Link to post
Share on other sites


You may not need regex for this but if you just want a regex to match most links try this:

 

 

(HTTP|http)(|S|s)\:\/\/(|WWW\.|www\.)[a-zA-Z\.\d-\/-_?&\(\)\=\+\#\%]+

 

It should pick up what you need
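
Since the links in the question always sit between href=" and the closing quote, another option is to let the quotes mark the boundaries instead of listing every allowed character. Here is a minimal sketch in Python's re (an assumption, not UBot code), using one anchor tag from the question with its other attributes trimmed.

```python
import re

# One anchor from the block quoted in the question (attributes trimmed).
snippet = (
    '<a href="http://mp3clan.com/dl.php?type=get'
    "&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=31192499_214027857"
    '&name=Lamb_Of_God__-_Laid_To_Rest_(instrumental)" rel="nofollow">'
)

# Take everything between href=" and the next double quote.
HREF = re.compile(r'href="([^"]+)"')

# With a single capture group, findall returns just the captured text.
links = HREF.findall(snippet)
print(links[0])
```

This keeps the hyphens and parentheses in the file name, because the only character that ends the match is the closing quote.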

Link to post
Share on other sites


I see no reason why you need regex for this in the first place. I think you downloaded my plugin, so use that, or use xpath. You can replace the document text variable with whatever variable holds the document text or the http get response.

 

I have three examples below: one just scrapes all hrefs, another shows how to use a basic regular expression within the CSS selector to fine-tune matches, and the last just makes it easier to write by filling out some parameters for the easy html parser.

load html("	<a href=\"http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=31192499_214027857&name=Lamb_Of_God__-_Laid_To_Rest_(instrumental)\" rel=\"nofollow\" download=\"Lamb_Of_God__-_Laid_To_Rest_(instrumental)\" onclick=\"return !window.open(this.href);\">
						<div id=\"download\"></div>
					</a>
				

					<a href=\"http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=67810665_73273966&name=Lamb_Of_God_-_Laid_To_Rest\" rel=\"nofollow\" download=\"Lamb_Of_God_-_Laid_To_Rest\" onclick=\"return !window.open(this.href);\">
						<div id=\"download\"></div>
					</a>
				

					<a href=\"http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=96523707_456239030&name=__-_Laid_To_Rest_(lamb_Of_God_Cover)\" rel=\"nofollow\" download=\"__-_Laid_To_Rest_(lamb_Of_God_Cover)\" onclick=\"return !window.open(this.href);\">
						<div id=\"download\"></div>
					</a>
				

					<a href=\"http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=2725384_69571584&name=Lamb_Of_God_-_Laid_To_Rest_(drum_Track)\" rel=\"nofollow\" download=\"Lamb_Of_God_-_Laid_To_Rest_(drum_Track)\" onclick=\"return !window.open(this.href);\">
						<div id=\"download\"></div>
					</a>
				

					<a href=\"http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=14147597_266488554&name=Council_Of_Sinners_-_Laid_To_Rest_(lamb_Of_God_Cover)\" rel=\"nofollow\" download=\"Council_Of_Sinners_-_Laid_To_Rest_(lamb_Of_God_Cover)\" onclick=\"return !window.open(this.href);\">
						<div id=\"download\"></div>
					</a>
				

					<a href=\"http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=19170398_156883225&name=Lamerson_-_Laid_To_Rest_(lamb_Of_God_Cover)\" rel=\"nofollow\" download=\"Lamerson_-_Laid_To_Rest_(lamb_Of_God_Cover)\" onclick=\"return !window.open(this.href);\">
						<div id=\"download\"></div>
					</a>
				

					<a href=\"http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=44427398_258156765&name=Lamb_Of_God_-_Laid_To_Rest_(guitar_Backing_Track)\" rel=\"nofollow\" download=\"Lamb_Of_God_-_Laid_To_Rest_(guitar_Backing_Track)\" onclick=\"return !window.open(this.href);\">
						<div id=\"download\"></div>
					</a>
				

					<a href=\"http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=6536837_97966760&name=Lamb_Of_God_-_Laid_To_Rest_(andy\'s_Ill_Dubstep_Remix)_\" rel=\"nofollow\" download=\"Lamb_Of_God_-_Laid_To_Rest_(andy\'s_Ill_Dubstep_Remix)_\" onclick=\"return !window.open(this.href);\">
						<div id=\"download\"></div>
					</a>
				

					<a href=\"http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=9186025_116877841&name=__-_Laid_To_Rest_(lamb_Of_God_Cover_Drunk_Version_A#)\" rel=\"nofollow\" download=\"__-_Laid_To_Rest_(lamb_Of_God_Cover_Drunk_Version_A#)\" onclick=\"return !window.open(this.href);\">
						<div id=\"download\"></div>
					</a>
				

					<a href=\"http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=-12053695_76121817&name=__Sicktale_-_Laid_To_Rest(lamb_Of_God_Cover)\" rel=\"nofollow\" download=\"__Sicktale_-_Laid_To_Rest(lamb_Of_God_Cover)\" onclick=\"return !window.open(this.href);\">
						<div id=\"download\"></div>
					</a>
				

					<a href=\"http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=8857796_57007666&name=Lamb_Of_God_-_Laid_To_Rest\" rel=\"nofollow\" download=\"Lamb_Of_God_-_Laid_To_Rest\" onclick=\"return !window.open(this.href);\">
						<div id=\"download\"></div>
					</a>
				

					<a href=\"http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=2141879_158421996&name=Lamb_Of_God_-_Laid_To_Rest_(live)\" rel=\"nofollow\" download=\"Lamb_Of_God_-_Laid_To_Rest_(live)\" onclick=\"return !window.open(this.href);\">
						<div id=\"download\"></div>
					</a>
				

					<a href=\"http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=134370700_110461251&name=Lamb_Of_God_-_Laid_To_Rest_-_Lamb_Of_God_-_Laid_To_Rest\" rel=\"nofollow\" download=\"Lamb_Of_God_-_Laid_To_Rest_-_Lamb_Of_God_-_Laid_To_Rest\" onclick=\"return !window.open(this.href);\">
						<div id=\"download\"></div>
					</a>
				

					<a href=\"http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=214484670_323903213&name=Demented_Dimensions_-_Laid_To_Rest_(lamb_Of_God\'_Rmx)\" rel=\"nofollow\" download=\"Demented_Dimensions_-_Laid_To_Rest_(lamb_Of_God\'_Rmx)\" onclick=\"return !window.open(this.href);\">
						<div id=\"download\"></div>
					</a>
				

					<a href=\"http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=22243705_167509174&name=Lamb_Of_God_-_Laid_To_Rest_(kabz_Drumstep_Vip)_[_Rockstep]\" rel=\"nofollow\" download=\"Lamb_Of_God_-_Laid_To_Rest_(kabz_Drumstep_Vip)_[_Rockstep]\" onclick=\"return !window.open(this.href);\">
						<div id=\"download\"></div>
					</a>
				

					<a href=\"http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=15878466_155097541&name=J.zubkov_Lamb_Of_God_-_Laid_To_Rest_\" rel=\"nofollow\" download=\"J.zubkov_Lamb_Of_God_-_Laid_To_Rest_\" onclick=\"return !window.open(this.href);\">
						<div id=\"download\"></div>
					</a>
				

					<a href=\"http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=14115593_219755748&name=Lamb_Of_God_-_Laid_To_Rest_Minus\" rel=\"nofollow\" download=\"Lamb_Of_God_-_Laid_To_Rest_Minus\" onclick=\"return !window.open(this.href);\">
						<div id=\"download\"></div>
					</a>
				

					<a href=\"http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=2000212912_272012340&name=Lamb_Of_God_-_Laid_To_Rest_(demo)\" rel=\"nofollow\" download=\"Lamb_Of_God_-_Laid_To_Rest_(demo)\" onclick=\"return !window.open(this.href);\">
						<div id=\"download\"></div>
					</a>
				

					<a href=\"http://mp3clan.com/dl.php?type=get&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=16413909_110076554&name=Demented_Dimensions_-_Laid_To_Rest.(lamb_Of_God_Remix)\" rel=\"nofollow\" download=\"Demented_Dimensions_-_Laid_To_Rest.(lamb_Of_God_Remix)\" onclick=\"return !window.open(this.href);\">
						<div id=\"download\"></div>
					</a>")
add list to list(%hrefs,$plugin function("DeliterCSS.dll", "Deliter CSS Selector", $document text, "a", "href"),"Delete","Global")
add list to list(%SpecificHrefs,$plugin function("DeliterCSS.dll", "Deliter CSS Selector", $document text, "a[href^=\"http://mp3clan.com/dl.php?type=get\"]", "href"),"Delete","Global")
add list to list(%easyWay,$plugin function("DeliterCSS.dll", "Easy HTML Parser", $document text, "a", "href", "^= (attribute value begins with example \'https\')", "http://mp3clan.com/dl.php?type=get", "href"),"Delete","Global")
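
For readers without the plugin, the same idea (parse the HTML and read the href attributes instead of matching with regex) can be sketched in Python with only the standard library. This is not the DeliterCSS plugin's implementation; HrefCollector and collect_hrefs are hypothetical names, and the second link in the sample is invented to show the prefix filter:

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects href values of <a> tags, optionally filtered by a prefix."""

    def __init__(self, prefix=""):
        super().__init__()
        self.prefix = prefix
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Keep the href only if it begins with the requested prefix,
            # similar in spirit to the a[href^="..."] selector above.
            if name == "href" and value and value.startswith(self.prefix):
                self.hrefs.append(value)


def collect_hrefs(html, prefix=""):
    parser = HrefCollector(prefix)
    parser.feed(html)
    parser.close()
    return parser.hrefs


sample = (
    '<a href="http://mp3clan.com/dl.php?type=get'
    '&s=7b3b7694d83efd04685d3712aa6a3d8c&tid=31192499_214027857'
    '&name=Lamb_Of_God__-_Laid_To_Rest_(instrumental)" rel="nofollow">'
    '<div id="download"></div></a>'
    '<a href="http://example.com/other">unrelated link (made up)</a>'
)

all_links = collect_hrefs(sample)
download_links = collect_hrefs(sample, "http://mp3clan.com/dl.php?type=get")
print(download_links)
```

Calling collect_hrefs with no prefix corresponds to the first example (all hrefs), and passing the download prefix corresponds to the second. Because the parser reads the whole attribute value, hyphens and parentheses in the file name are not a problem.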