Scrape and Browse same domain

smartquin · March 7, 2013

Hi Guys

I'm trying to write a bot that will go to a webpage, scrape the <a> tags to get the hrefs, and visit each page (to simulate a real person). However, I don't want to be directed to the website's RSS feed, to another domain (like their Facebook/Twitter page), to a javascript link, etc. This is what I have so far:

set(#rootdomain, $find regular expression($url, "(?![/|/www.])[a-zA-Z0-9\\-\\.]+\\.[a-zA-Z]\{2,4\}(?=/)"), "Global")
clear list(%urls)
clear list(%cleanedurls)
add list to list(%urls, $list from text($scrape attribute(<tagname="a">, "href"), $new line), "Delete", "Global")
set list position(%urls, 0)
loop($list total(%urls)) {
    set(#temp, $next list item(%urls), "Global")
    if($contains(#temp, #rootdomain)) {
        then {
            add item to list(%cleanedurls, #temp, "Delete", "Global")
        }
        else {
        }
    }
}
add list to list(%cleanedurls, $scrape attribute(<(href=w"/*" OR href=w"..*")>, "href"), "Delete", "Global")
loop($rand(0, $list total(%cleanedurls))) {
    click(<href=$random list item(%cleanedurls)>, "Left Click", "No")
    wait($rand(20, 180))
}

The problem I have is that some links are relative and others absolute, and this bot still adds the addresses for Google ads, XML feeds, etc.

Any help would be greatly appreciated!

Steve · March 13, 2013

Instead of scraping <a> tags from the site directly, what about scraping all of the site's URLs from a search engine, possibly adding a negative term to exclude RSS? For example, something like: site:domain.com -/rss/

Not sure if that will work for you, but just a thought.

smartquin · March 14, 2013

Hmmm, possible. However, I was more after something that will scrape the relative and absolute paths of whatever page of the site you are currently on. Essentially I just need something to exclude RSS/XML feeds and anything that leads off-site. I'm not too good at regex though, so any help would be great :-D
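
The filtering asked for here (keep relative and absolute links that stay on the current site, drop feeds, javascript links, and anything off-site) may be easier to reason about without regex. Below is a minimal Python sketch of that logic, not UBot script. The function names, the feed markers, and the choice to treat "www.example.com" and "example.com" as the same site (while excluding other subdomains) are all assumptions made for illustration. It compares the link's hostname rather than checking whether the URL merely contains the domain, since ad or tracking URLs can carry the domain in a query string.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical feed markers; adjust them for the site being browsed.
FEED_SEGMENTS = {"rss", "feed", "feeds", "atom"}
FEED_EXTENSIONS = (".xml", ".rss", ".atom")


def bare_host(host):
    """Lower-case a hostname and drop a leading 'www.' for comparison."""
    host = (host or "").lower()
    return host[4:] if host.startswith("www.") else host


def looks_like_feed(path):
    """Guess whether a URL path points at an RSS/XML feed."""
    path = path.lower()
    segments = [s for s in path.split("/") if s]
    return path.endswith(FEED_EXTENSIONS) or any(s in FEED_SEGMENTS for s in segments)


def same_site_links(page_url, hrefs):
    """Return absolute, de-duplicated links that stay on page_url's host."""
    root = bare_host(urlparse(page_url).hostname)
    kept = []
    for href in hrefs:
        href = href.strip()
        if not href or href.startswith("#"):
            continue  # empty hrefs and in-page anchors
        absolute = urljoin(page_url, href)  # resolves "/x" and "../x"
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # javascript:, mailto:, and similar
        if bare_host(parts.hostname) != root:
            continue  # off-site: social pages, ad networks, etc.
        if looks_like_feed(parts.path):
            continue
        if absolute not in kept:
            kept.append(absolute)
    return kept
```

Called with the current page URL and the scraped href values, `same_site_links` returns a cleaned list that the bot could pick from at random, in the same role as %cleanedurls above.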
