scraping all URLs of a particular website?

hissho · July 29, 2012

Hi all,

Just wondering if anyone has done this before and could share a few tips on how to do it...

Let's say my site is xyz.com and I want to use ubot to scrape ALL the URLs of that site. Nothing else, I just want the URLs (no meta descriptions, text, images, videos, etc.) and to save them to a text file.

Any help would be much appreciated!

bu11d0g · July 29, 2012

If the website has a sitemap, like yourdomain.com/sitemap.xml, then you can get ubot to navigate to that URL and scrape all the URLs like this:

navigate("http://yourdomain.com/sitemap.xml", "Wait")
add item to list(%scrapedurls, $scrape attribute(<href=w"http://yourdomain.com/*">, "href"), "Delete", "Global")

(Change yourdomain to the actual domain name, obviously.)

That will add every matching URL listed in the sitemap to a list; then just save that list out as txt or csv.
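For anyone doing this outside ubot, here is a rough Python sketch of the same sitemap idea, using only the standard library. It assumes the sitemap is a plain XML file whose page addresses sit in `<loc>` elements, as in the sitemaps.org format. The function names `extract_sitemap_urls` and `save_sitemap_urls` are made up for this example and are not part of ubot.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def extract_sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap document, in order."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Tags arrive as "{namespace}loc"; compare only the local name.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


def save_sitemap_urls(sitemap_url, out_path):
    """Download a sitemap and write one URL per line to out_path."""
    with urlopen(sitemap_url) as response:
        urls = extract_sitemap_urls(response.read())
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls))
    return urls
```

Calling `save_sitemap_urls("http://yourdomain.com/sitemap.xml", "urls.txt")` would produce the text file. If the file turns out to be a sitemap index, the `<loc>` values point at further sitemaps rather than pages, so each of those would need the same treatment.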

Hope that's what you wanted to know.

Mark

hissho · July 31, 2012

Thanks for your help!

Well, the thing is that the target sites don't always have a sitemap. Also, not all pages are indexed in Google, and I want to be able to scrape unindexed URLs as well.

E.g. if a site has a simple opt-in form with "name" and "email address" fields, once it is submitted, visitors will be taken to another URL (a thank-you page or whatever), which may or may not be indexed in Google or listed in the sitemap.

Any further help would be much appreciated!

hissho · August 1, 2012

Bump... any further help, please? Or is it just impossible for ubot to do this?

Lombi · August 1, 2012

Nope, not impossible. I did the following recently (this is an outline, not copy-and-paste code):

if list1 is empty: navigate to the main page, then add list to list list1 with a scrape attribute of URLs containing the domain

set list position of list1 to 0

loop (list total of list1) {
    set currentitem to next list item of list1
    navigate to currentitem
    add list to list todolist with a scrape attribute of URLs containing the domain
    add currentitem to visitedlist
}

clear list list1

add list to list list1 (subtract lists todolist visitedlist)

Just run this five times (or loop it) and each run will go one level deeper.
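The outline above can be sketched in Python roughly as follows, for anyone who wants to try the idea outside ubot. This is only an illustration under a few assumptions: links are plain `<a href>` tags, "same site" means the same host name as the start URL, and the names `LinkParser`, `fetch_html`, `same_site_links` and `crawl` are invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch_html(url):
    """Download a page as text; returns "" on any network error."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def same_site_links(page_url, html, domain):
    """Absolute, fragment-free links on the page that stay on `domain`."""
    parser = LinkParser()
    parser.feed(html)
    links = set()
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).netloc == domain:
            links.add(absolute)
    return links


def crawl(start_url, max_depth=5, fetch=fetch_html):
    """Visit pages level by level, at most `max_depth` levels from the start."""
    domain = urlparse(start_url).netloc
    visited = set()
    to_visit = {start_url}
    for _ in range(max_depth):
        found = set()
        for url in sorted(to_visit):
            found |= same_site_links(url, fetch(url), domain)
            visited.add(url)
        # Mirrors "subtract lists todolist visitedlist" in the outline.
        to_visit = found - visited
        if not to_visit:
            break
    return sorted(visited | to_visit)
```

Something like `open("urls.txt", "w").write("\n".join(crawl("http://xyz.com/")))` would then save the result. Note that a crawler of this kind only finds pages that are linked from somewhere; a thank-you page reached solely by submitting a form would not show up unless some page links to it.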

hissho · August 2, 2012

Thanks Lombi, at least I know it's possible now...

Are you selling it by any chance? I'm new to ubot and haven't got the time or expertise to figure this out myself. I'd be happy to pay for it.

Thanks again!

hissho · August 4, 2012

Or would anyone here be interested in putting what Lombi said into code and selling it to me?
