UBot Underground

scraping all URLs of a particular website?



Hi all,

Just wondering if anyone has done this before and could share a few tips on how to do it...

 

Let's say my site is xyz.com and I want to use ubot to scrape ALL the URLs of that site. Nothing else, I just want the URLs (no meta description, no text, no images, no videos, etc.) saved to a text file.

 

Any help would be much appreciated!


If the website has a sitemap, such as yourdomain.com/sitemap.xml, you can have ubot navigate to that URL and then scrape all the URLs like this:

 

navigate("http://yourdomain.com/sitemap.xml", "Wait")
add list to list(%scrapedurls, $scrape attribute(<href=w"http://yourdomain.com/*">, "href"), "Delete", "Global")

 

(Change yourdomain to the actual domain name, obviously.)

That will add all the URLs listed in the sitemap to a list; then just save that list out as txt or csv.
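For anyone who wants to try the same sitemap idea outside ubot, here is a minimal Python sketch. It assumes the sitemap follows the standard sitemaps.org layout, where each URL sits in a `<loc>` element; the names `urls_from_sitemap`, `save_urls`, and `SAMPLE` are made up for illustration and are not part of ubot.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol for <urlset>, <url> and <loc>.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def save_urls(urls, path):
    """Write one URL per line to a plain text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")


# Hypothetical sitemap content, standing in for yourdomain.com/sitemap.xml.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://yourdomain.com/</loc></url>
  <url><loc>http://yourdomain.com/about</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in urls_from_sitemap(SAMPLE):
        print(url)
```

Calling `save_urls(urls_from_sitemap(SAMPLE), "urls.txt")` would write the list to a text file, one URL per line.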

 

Hope that's what you wanted to know.

 

Mark


Thanks for your help!

 

 

Well, the thing is that the target sites don't always have a sitemap. Also, not all pages are indexed in Google, and I want to be able to scrape unindexed URLs as well.

E.g. if a site has a simple opt-in form with "name" and "email address" fields, then once it is submitted, visitors are taken to another URL (a thank-you page or whatever), which may or may not be indexed in Google or listed in the sitemap.

 

Any further help would be much appreciated!


Nope, not impossible. I did the following recently (pseudocode, obviously not copy-and-paste ready):

 

if list1 is empty: navigate to main page, add list to list list1 with a scrape attribute of urls containing domain
set list position of list1 to 0
loop (list total of list1) {
    set currentitem to next list item of list1
    navigate to currentitem
    add list to list todolist with a scrape attribute of urls containing domain
    add item currentitem to visitedlist
}
clear list list1
add list to list list1 (subtract lists todolist visitedlist)

 

Just run this five times (or loop it); each run goes one level deeper into the site.
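The same level-by-level idea can be sketched in Python, using only the standard library. This is an illustration under assumptions, not the ubot code described above: `crawl`, `same_site_links`, `fetch_html`, and `LinkParser` are hypothetical names, and `frontier`, `todo`, and `visited` stand in for list1, todolist, and visitedlist. Note that it only follows `<a href>` links, so a page reached solely by submitting a form would not be found this way.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch_html(url):
    """Download a page as text (assumption: pages decode acceptably as UTF-8)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def same_site_links(page_url, html, domain):
    """Return absolute, fragment-free links on the page that stay on the domain."""
    parser = LinkParser()
    parser.feed(html)
    links = set()
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).netloc == domain:
            links.add(absolute)
    return links


def crawl(start_url, max_depth=5, fetch=fetch_html):
    """Visit the site one level per pass, for at most max_depth passes."""
    domain = urlparse(start_url).netloc
    visited = set()
    frontier = {start_url}            # plays the role of list1
    for _ in range(max_depth):
        todo = set()                  # plays the role of todolist
        for url in sorted(frontier):
            try:
                html = fetch(url)
            except OSError:
                html = ""             # unreachable page: keep the URL, find no links
            todo |= same_site_links(url, html, domain)
            visited.add(url)
        frontier = todo - visited     # subtract lists todolist visitedlist
        if not frontier:
            break
    # URLs found on the last pass but not yet visited are still part of the result.
    return sorted(visited | frontier)
```

Something like `crawl("http://xyz.com/", max_depth=5)` would return the sorted list of URLs found within five levels of the start page; writing it out with `"\n".join(...)` gives the one-URL-per-line text file asked for. The `fetch` parameter exists so the crawl logic can be tried against canned HTML without touching the network.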


Thanks Lombi, at least it's possible now...

 

Are you selling it by any chance? I'm new to ubot and haven't got the time or expertise to figure this out myself. I'd be happy to pay for it.

 

Thanks again!
