UBot Underground

[SOLVED] Anyone help me out with some Regex? Need to scrape a URL from block text..



I need to scrape a URL out of a large block of text. I thought the easiest way would be to find it with a regular expression, since the block of text is huge and I can't find the URL as its own element.

 

The first part of the URL, which always stays the same, is:

 

https://jigsy.com/account/activate?u=

 

Then comes a part that may vary in length and is made up of random numbers and letters, for example:

 

1124290&k=8d8b6f77c0855c6936544174d77098a0a601c6b2

 

What would the regex be for that URL?

 

Or maybe there's an easier way to scrape it..

 

The text that surrounds it is:

 

Return-Path: srs0=efi1eu=eb=jigsy.com=feedback@XXXXXXXXXXXXXXXX.com

Received: from mip.hushmail.com (LHLO smtp5.hushmail.com) (65.39.178.78) by

server with LMTP; Tue, 29 May 2012 16:51:17 +0000 (UTC)

Received: from smtp5.hushmail.com (localhost.localdomain [127.0.0.1])

by smtp5.hushmail.com (Postfix) with SMTP id 51EBB50149

for <XXXXXXXXXXXXXXXX@hush.com>; Tue, 29 May 2012 16:51:17 +0000 (UTC)

Received: from m1.dnsix.com (m1.dnsix.com [66.11.225.176])

by smtp5.hushmail.com (Postfix) with ESMTP

for <XXXXXXXXXXXXXXXX@hush.com>; Tue, 29 May 2012 16:51:08 +0000 (UTC)

Received: from [65.39.176.60] (helo=viviti.com)

by m1.dnsix.com with esmtp (Exim 4.72)

(envelope-from <feedback@jigsy.com>)

id 1SZPdU-0005CN-HT

for preciousverse41@XXXXXXXXXXXXXXXXXX.com; Tue, 29 May 2012 09:51:08 -0700

Date: Tue, 29 May 2012 09:51:07 -0700

From: feedback@jigsy.com

To: preciousverse41@XXXXXXXXXXXXXXXXXXX.com

Message-Id: <4fc4fe7bd6dcf_2531159386d6bf0881a@electra.vc.bravenet.com.tmail>

Subject: Welcome to Jigsy

Mime-Version: 1.0

Content-Type: multipart/alternative; boundary=mimepart_4fc4fe7bd7628_2531159386d6bf08983

 

Welcome to Jigsy!

 

Thanks for choosing to build your website with us! We hope you enjoy the experience,

and would love to hear any feedback you might have. You can get in touch with us as well

as other members on our message forums at https://forums.jigsy.com.

 

---------------------------------------------------------------------

 

In order to activate your account, please follow the link below:

 

https: //jigsy.com/account/activate?u=1124290&k=8d8b6f77c0855c6936544174d77098a0a601c6b2

 

 

If you did not request this, somebody else did using your email address. If so,

we apologize for the mailing!

 

 

 

------------

 

I've blanked out some addresses with XXXXXXXXXXXX for privacy reasons..

Also I've added a space after https: otherwise the forum was shortening the URL once it hyperlinked it..
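
One possible pattern for the activation link described above, sketched in Python because the regex itself carries over to most engines. The character classes are an assumption based on the single sample in the post (letters and digits only), and `find_activation_url` is a hypothetical helper name, not something from the thread.

```python
import re

# Fixed prefix from the post, then two runs of letters and digits.
# The character classes are an assumption drawn from the one sample URL.
ACTIVATION_URL = re.compile(
    r"https://jigsy\.com/account/activate\?u=[0-9A-Za-z]+&k=[0-9A-Za-z]+"
)


def find_activation_url(text):
    """Return the first activation URL found in text, or None."""
    match = ACTIVATION_URL.search(text)
    return match.group(0) if match else None


sample = """as other members on our message forums at https://forums.jigsy.com.

In order to activate your account, please follow the link below:

https://jigsy.com/account/activate?u=1124290&k=8d8b6f77c0855c6936544174d77098a0a601c6b2

If you did not request this, somebody else did using your email address.
"""

print(find_activation_url(sample))
```

The dots and the question mark are escaped so they match literally, and the match stops at the first character that is not a letter or digit, which here is the line break after the key.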


Hey, that sounds like an awesome solution. Sorry to sound ignorant, but what is a 'regular scrape'? I'd already looked at page scrape and scrape attribute and couldn't get them to work with that.


This is Hushmail, so try an add to list page scrape with:

clear list: xxx

addto list: xxx

Leftside: open?http://

Rightside: "

 

That should pull all the outbound links on the page.

Then loop through the list total and use a regex to find the one you need:

http(.*)account(.*)\d

 

Then clear the list for the next one.

 

Also remember that Hushmail reformats some links, like http://whateverdomain/&uid=whatevertext

when the real address is http://whateverdomain/uid=whatevertext

So you may have to use a replace text as well.

 

EDIT

 

Your post below is not showing the body text code, as it's inside an iframe.

 

Last Edit

Easier to make it than explain it:

clear list(%temp)
add list to list(%temp, $page scrape("open?http://", "\""), "Delete", "Global")
loop($list total(%temp)) {
    add item to list(%accounts, $find regular expression($next list item(%temp), "http(.*)account(.*)\\d"), "Delete", "Global")
}
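
For anyone who wants to test the filtering logic outside UBot, here is a rough Python sketch of the same idea: loop over the scraped links, keep what the regex matches, and undo the Hushmail link reformatting. The link list is made up for illustration, and the helper names `pick_account_links` and `fix_hushmail_link` are hypothetical. The replacement rule is only inferred from the one example given above.

```python
import re

# Pattern from the post: "http", anything, "account", anything, then a digit.
ACCOUNT_LINK = re.compile(r"http(.*)account(.*)\d")


def pick_account_links(links):
    """Return the matched portion of every link that fits the pattern."""
    accounts = []
    for link in links:
        match = ACCOUNT_LINK.search(link)
        if match:
            accounts.append(match.group(0))
    return accounts


def fix_hushmail_link(link):
    """Undo the '/&uid=' rewrite described in the post (assumed rule)."""
    return link.replace("/&uid=", "/uid=")


# Hypothetical scraped links, for illustration only.
scraped = [
    "https://forums.jigsy.com",
    "https://jigsy.com/account/activate?u=1124290&k=8d8b6f77c0855c6936544174d77098a0a601c6b2",
]

print(pick_account_links(scraped))
print(fix_hushmail_link("http://whateverdomain/&uid=whatevertext"))
```

One thing to watch: because `(.*)` is greedy and the pattern ends in `\d`, the match runs to the last digit in the string. A key that ends in a letter would be cut short, so a tighter pattern that spells out the `u=` and `k=` parts is probably safer.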


Well this is what I've got so far:

 

$scrape attribute(<readonly=1>, "<value=w\"https://jigsy.com/account/activate?u=*&k=*\">")

 

I changed innertext for value as the innertext of the element is blank.. The outerhtml and innerhtml return all the text as well as value however none of these seemed to work when I tried them..

 

Here's the whole thing in HTML:

 

<textarea readonly="1" style="width:100%;" rows="37">Return-Path: srs0=efi1eu=eb=jigsy.com=feedback@XXXXXXXXXXXX.com

Received: from mip.hushmail.com (LHLO smtp5.hushmail.com) (65.39.178.78) by

server with LMTP; Tue, 29 May 2012 16:51:17 +0000 (UTC)

Received: from smtp5.hushmail.com (localhost.localdomain [127.0.0.1])

by smtp5.hushmail.com (Postfix) with SMTP id 51EBB50149

for <XXXXXXXXXXXXXXX@hush.com>; Tue, 29 May 2012 16:51:17 +0000 (UTC)

Received: from m1.dnsix.com (m1.dnsix.com [66.11.225.176])

by smtp5.hushmail.com (Postfix) with ESMTP

for <XXXXXXXXXXXXXX@hush.com>; Tue, 29 May 2012 16:51:08 +0000 (UTC)

Received: from [65.39.176.60] (helo=viviti.com)

by m1.dnsix.com with esmtp (Exim 4.72)

(envelope-from <feedback@jigsy.com>)

id 1SZPdU-0005CN-HT

for preciousverse41@XXXXXXXXXXXXXXX.com; Tue, 29 May 2012 09:51:08 -0700

Date: Tue, 29 May 2012 09:51:07 -0700

From: feedback@jigsy.com

To: preciousverse41@XXXXXXXXXXXXXX.com

Message-Id: <4fc4fe7bd6dcf_2531159386d6bf0881a@electra.vc.bravenet.com.tmail>

Subject: Welcome to Jigsy

Mime-Version: 1.0

Content-Type: multipart/alternative; boundary=mimepart_4fc4fe7bd7628_2531159386d6bf08983

 

Welcome to Jigsy!

 

Thanks for choosing to build your website with us! We hope you enjoy the experience,

and would love to hear any feedback you might have. You can get in touch with us as well

as other members on our message forums at https://forums.jigsy.com.

 

---------------------------------------------------------------------

 

In order to activate your account, please follow the link below:

 

https://jigsy.com/account/activate?u=1124290&k=8d8b6f77c0855c6936544174d77098a0a601c6b2

 

 

If you did not request this, somebody else did using your email address. If so,

we apologize for the mailing!

</textarea>

 

 

hope that helps
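
Given the HTML above, one way to get at the link is to take the text inside the textarea, decode any HTML entities, and then run a regex over it. This is a sketch in Python rather than UBot, and it assumes the page source may contain `&amp;` in place of `&`. The function name `activation_url_from_html` and the trimmed sample page are made up for illustration.

```python
import html
import re

# Grab everything between the textarea tags; DOTALL lets "." span newlines.
TEXTAREA = re.compile(r"<textarea[^>]*>(.*?)</textarea>", re.DOTALL | re.IGNORECASE)
ACTIVATION_URL = re.compile(r"https://jigsy\.com/account/activate\?u=\w+&k=\w+")


def activation_url_from_html(page_html):
    """Return the activation URL inside the first textarea, or None."""
    area = TEXTAREA.search(page_html)
    if not area:
        return None
    # Turn entities such as &amp; back into plain characters first.
    text = html.unescape(area.group(1))
    match = ACTIVATION_URL.search(text)
    return match.group(0) if match else None


# Trimmed, hypothetical version of the page source shown above.
page = (
    '<textarea readonly="1" style="width:100%;" rows="37">'
    "In order to activate your account, please follow the link below:\n\n"
    "https://jigsy.com/account/activate?u=1124290&amp;k=8d8b6f77c0855c6936544174d77098a0a601c6b2\n"
    "</textarea>"
)

print(activation_url_from_html(page))
```

A regex is a blunt tool for HTML in general, but for a single textarea with a known shape it keeps the sketch short.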


You do realize you can scrape it with the &amp; in the URL, and that when navigated to, the URL encoding gets decoded?

 

I don't know what you mean by that.. not sure if it's a suggestion or a question..

 

Why wouldn't this be giving me results:

 

$scrape attribute(<readonly=1>, "<value=w\"https://jigsy.com/account/activate?u=*&k=*\">")

You do realize you can scrape it with the &amp; in the URL, and that when navigated to, the URL encoding gets decoded?

 

That’s correct TJ most times

But I’ve had many fail also due to poor hosts or badly installed scripts


That’s correct TJ most times

But I’ve had many fail also due to poor hosts or badly installed scripts

 

Thanks for the help Zap - but I'm not quite getting what you're saying..

 

I'm sending you a PM TJ - thanks!


That’s correct TJ most times

But I’ve had many fail also due to poor hosts or badly installed scripts

 

Zap, I have a UBot script here in the tips and tutorials area on how to encode or decode URLs as well, using a function I built in UBot with JavaScript.

http://ubotstudio.com/forum/index.php?/topic/9828-url-encoding-decoding-quick-code-sample
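
The linked sample is UBot plus JavaScript and is not reproduced here. As a rough illustration of what URL encoding and decoding do, here is a small Python sketch using the standard library. Note that `&amp;` is an HTML entity rather than URL (percent) encoding, so the two are decoded by different functions. The example strings are made up.

```python
import html
from urllib.parse import quote, unquote

query = "u=1124290&k=8d8b"

# Percent-encode every reserved character, including "=" and "&".
encoded = quote(query, safe="")
print(encoded)           # u%3D1124290%26k%3D8d8b

# Percent-decoding restores the original string.
print(unquote(encoded))  # u=1124290&k=8d8b

# HTML entity decoding is a separate step from percent-decoding.
print(html.unescape("activate?u=1124290&amp;k=8d8b"))  # activate?u=1124290&k=8d8b
```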

