UBot Underground

Scraping url from text?



I have a list of 100k+ urls in a text file, and what I'd like to do is pull a URL from that txt file using a specific string (the root url). I likely have 15-20 urls matching the root but would like to select one at random. I'm new at this, and the only thing I've come up with so far is to run them through a text field until I find one, but obviously this would destroy any speed the program has. Does anyone have any ideas on how to do this? I'm pulling my hair out trying to figure it out.

Edited by okuma31

Loop through the whole list, compare each item against the root url, and build a new list with the links you need, then use the new list for your activity. A list with 100k items might be too big to handle, so split it into lists of 50k links each.
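A minimal sketch of the single-pass filter described above, written in Python rather than UBot script purely for illustration. The function name `urls_under_root` and the sample data are hypothetical; the sketch assumes one URL per line and treats "matching" as "the line starts with the root url".

```python
from typing import Iterable, List


def urls_under_root(lines: Iterable[str], root: str) -> List[str]:
    """Return every URL in `lines` that starts with `root`, in one pass."""
    matches = []
    for line in lines:
        url = line.strip()  # drop the trailing newline and stray spaces
        if url.startswith(root):
            matches.append(url)
    return matches


# Hypothetical sample data standing in for the 100k-line text file.
sample = [
    "http://aol.com/311234\n",
    "http://example.com/page\n",
    "http://aol.com/33424\n",
]
filtered = urls_under_root(sample, "http://aol.com/")
```

Because the function accepts any iterable, an open file handle could be passed in directly, so the file would be read line by line instead of being loaded whole.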


That's what I was planning on doing, but I see two issues with this:

1) Let's say it takes 15 seconds to find the url; that's going to kill my links per minute.

2) In theory it will always select the first target, because that's the first link it will uncover.
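One way to think about both concerns, sketched in Python for illustration only (not UBot script): scan the big list once up front and group the urls by root, so each later lookup is a dictionary access, then pick from the whole group at random instead of stopping at the first hit. The names `build_root_index` and `pick_random` are hypothetical, and the sketch assumes "root" means scheme plus host, so `aol.com` and `www.aol.com` would count as different roots.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Optional
from urllib.parse import urlsplit


def build_root_index(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Group URLs by scheme and host, e.g. 'http://aol.com/'."""
    index = defaultdict(list)
    for raw in urls:
        url = raw.strip()
        parts = urlsplit(url)
        if parts.scheme and parts.netloc:  # skip blank or malformed lines
            index[f"{parts.scheme}://{parts.netloc}/"].append(url)
    return index


def pick_random(index: Dict[str, List[str]], root: str) -> Optional[str]:
    """Return one random URL under `root`, or None if there are none."""
    candidates = index.get(root)
    return random.choice(candidates) if candidates else None
```

The one-time cost of building the index is paid before the run starts, and `random.choice` gives every matching url an equal chance of being selected.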


Okuma, have you used the variable function $common list items? It returns a new list containing the items common to the first list (root url) and the second list (100K+ urls text file). Using this UBot variable function might help with efficiency. Once you have the new list of common items, you can then select one at random.

Edited by wilriv21
Okuma have you used the variable function $common list items?  This variable function returns a new list containing the common items between the first list (100+K urls text file) and the second list (root url).  Possibly using this UBot variable function can help with efficiency?  Once you have the new list containing common items you can then select randomly.

This is a great idea and I'm playing with it now, but it appears to only pick up exact-match urls.

As an example, the item I'll use is a root domain:

http://aol.com/

Now my list of urls is filled with all kinds of crazy stuff, but the specific 15 or so I'm looking for contain extra strings on the root, such as:

http://aol.com/311234

http://aol.com/33424

http://aol.com/76876

http://aol.com/12321

http://aol.com/8978978

Regardless, I feel like this is getting closer. If this ends up working, be sure to send your paypal and I'll send you a fiver for a beer on me. =]


I edited the first list to be the much smaller list of root urls and the second list to be the much larger 100K+ list. This change should make the process more efficient.

clear list(%root domain urls)
clear list(%master url list)
add list to list(%master url list, $list from file("C:\\Users\\Tdub\\Desktop\\extracted from thoughthappiness.txt"), "Delete", "Global")
add item to list(%root domain urls, "http://7minutegarden.com/", "Delete", "Global")
add list to list(%final output list, $common list items(%root domain urls, %master url list), "Delete", "Global")
save to file("C:\\Users\\Tdub\\Desktop\\the final output list.txt", %final output list)

 

The root domain list contains one url, as you can see in the above code, but when I open the final output list.txt it contains 0 items. When I add a url that matches exactly in both lists, the output contains that single matching url. It's a bit bizarre, seeing as in my mind the root domain is contained in all of the items in the master list.
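The result described above is what a whole-item comparison would produce: the root url is a prefix of the longer urls but is not equal to any of them. The Python sketch below contrasts the two kinds of match. It is illustrative only, not UBot code; the sample urls are made up, and it assumes (as the output above suggests) that $common list items compares complete items.

```python
root_urls = ["http://aol.com/"]
master_urls = [
    "http://aol.com/311234",
    "http://aol.com/33424",
    "http://example.com/76876",
]

# Exact-match comparison: an item counts only if the whole string is equal.
root_set = set(root_urls)
exact_common = [url for url in master_urls if url in root_set]

# Prefix comparison: an item counts if it starts with any of the root urls.
# str.startswith accepts a tuple of candidate prefixes.
prefix_matches = [url for url in master_urls if url.startswith(tuple(root_urls))]
```

Here `exact_common` ends up empty, while `prefix_matches` holds the two aol.com urls, which is the kind of "contains the root" match the question is after.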
