hissho, posted July 29, 2012:
Hi all, just wondering if anyone has done this before and could share a few tips on how to do it. Let's say my site is xyz.com and I want to use ubot to scrape ALL the URLs of that site. Nothing else, I just want the URLs (no meta descriptions, no text, no images, no videos, etc.) saved to a text file. Any help would be much appreciated!
bu11d0g, posted July 29, 2012:
If the website has a sitemap, like yourdomain.com/sitemap.xml, then you can get ubot to navigate to that URL and scrape all the URLs like this:

    navigate("http://yourdomain.com/sitemap.xml", "Wait")
    add item to list(%scrapedurls, $scrape attribute(<href=w"http://yourdomain.com/*">, "href"), "Delete", "Global")

(Change yourdomain to the actual domain name, obviously.) That will add all URLs on the domain to a list; then just save that list out as txt or csv. Hope that's what you wanted to know.
Mark
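For readers working outside UBot, the sitemap idea above can be sketched roughly in Python. This is not UBot code and not Mark's method, only an illustration under stated assumptions: it assumes the file follows the sitemaps.org 0.9 format, where each URL sits in a `<loc>` element, and the function names `urls_from_sitemap` and `save_urls` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 format (assumed here).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return every <loc> URL found in a sitemap document, in document order."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]


def save_urls(urls, path):
    """Write one URL per line to a plain text file."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(urls) + "\n")


if __name__ == "__main__":
    # Inline sample instead of a download, so the sketch runs offline.
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>http://yourdomain.com/</loc></url>"
        "<url><loc>http://yourdomain.com/about</loc></url>"
        "</urlset>"
    )
    save_urls(urls_from_sitemap(sample), "scrapedurls.txt")
```

For a real site the XML text would be downloaded first. Only pages the site owner chose to list in the sitemap will show up this way.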
hissho, posted July 31, 2012:
Thanks for your help! The thing is that the target sites don't always have a sitemap. Also, not all pages are indexed in Google, and I want to be able to scrape unindexed URLs as well. For example, if a site has a simple opt-in form with "name" and "email address" fields, visitors who submit it are taken to another URL (a thank-you page or whatever), which may or may not be indexed in Google or listed in the sitemap. Any further help would be much appreciated!
hissho, posted August 1, 2012:
Bump... any further help, please? Or is it just impossible for ubot to do this?
Lombi, posted August 1, 2012:
Nope, not impossible. I did the following recently (pseudocode, obviously not copy and paste):

    if list1 is empty: navigate to main page, add list to list list1 with a scrape attribute of urls containing domain
    set list position of list1 to 0
    loop (list total of list1) {
        set currentitem to next list item of list1
        navigate to currentitem
        add list to list todolist with a scrape attribute of urls containing domain
        add currentitem to visitedlist
    }
    clear list list1
    add list to list list1 (subtract lists todolist visitedlist)

Just run this five times (or loop it) and it will always go one level deeper.
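The level-by-level idea above can be sketched in Python for anyone who wants to try it outside UBot. This is an illustration under assumptions, not Lombi's actual bot: the names `LinkParser`, `fetch_html`, `same_site_links` and `crawl` are invented for the example, "urls containing domain" is interpreted as "links whose host equals the start page's host", and only `<a href>` links are followed. A page reachable only by submitting a form (such as a thank-you page) will not be found unless some page links to it.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch_html(url):
    """Download a page and return its HTML as text (default fetcher, an assumption)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def same_site_links(page_url, html, domain):
    """Return the absolute, fragment-free links on a page whose host is `domain`."""
    parser = LinkParser()
    parser.feed(html)
    links = set()
    for href in parser.hrefs:
        absolute = urldefrag(urljoin(page_url, href)).url
        if urlparse(absolute).netloc == domain:
            links.add(absolute)
    return links


def crawl(start_url, max_depth=5, fetch=fetch_html):
    """Visit the site one link level at a time, up to max_depth levels."""
    domain = urlparse(start_url).netloc
    visited = set()                 # plays the role of "visitedlist"
    current = {start_url}           # plays the role of "list1"
    for _ in range(max_depth):
        todo = set()                # plays the role of "todolist"
        for url in current:
            visited.add(url)
            try:
                todo |= same_site_links(url, fetch(url), domain)
            except OSError:
                continue            # skip pages that fail to load
        current = todo - visited    # subtract lists: todolist minus visitedlist
        if not current:
            break
    # Include links discovered at the last level even if they were not visited.
    return sorted(visited | current)
```

Calling `crawl("http://xyz.com/")` returns a sorted list of URLs that can be written to a text file one per line. The `fetch` parameter exists so the crawl logic can be tried against canned pages without touching the network.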
hissho, posted August 2, 2012:
Thanks Lombi, at least I know it's possible now. Are you selling it, by any chance? I'm new to ubot and don't have the time or expertise to figure this out myself. I'd be happy to pay for it. Thanks again!
hissho, posted August 4, 2012:
Or would anyone here be interested in putting what Lombi said into code and selling it to me?