UBot Underground

Scrape and Browse same domain



Hi Guys

 

I'm trying to write a bot that goes to a webpage, scrapes the <a> tags to get their hrefs, and visits each page (to simulate a real person). However, I don't want to be directed to the website's RSS feed, to another domain (like their Facebook/Twitter page), to a JavaScript link, etc. This is what I have so far:

 

set(#rootdomain, $find regular expression($url, "(?![/|/www.])[a-zA-Z0-9\\-\\.]+\\.[a-zA-Z]\{2,4\}(?=/)"), "Global")
clear list(%urls)
clear list(%cleanedurls)
add list to list(%urls, $list from text($scrape attribute(<tagname="a">, "href"), $new line), "Delete", "Global")
set list position(%urls, 0)
loop($list total(%urls)) {
    set(#temp, $next list item(%urls), "Global")
    if($contains(#temp, #rootdomain)) {
        then {
            add item to list(%cleanedurls, #temp, "Delete", "Global")
        }
        else {
        }
    }
}
add list to list(%cleanedurls, $scrape attribute(<(href=w"/*" OR href=w"..*")>, "href"), "Delete", "Global")
loop($rand(0, $list total(%cleanedurls))) {
    click(<href=$random list item(%cleanedurls)>, "Left Click", "No")
    wait($rand(20, 180))
}

 

The problem is that some links are relative and others absolute, and this bot still adds the addresses for Google ads, XML feeds, etc.

 

Any help would be greatly appreciated!


Instead of scraping <a> tags from the site directly, what about scraping all of the site's URLs from a search engine, possibly adding a negative term to exclude RSS? For example, something like: site:domain.com -/rss/

 

Not sure if that will work for you, but just a thought.  


Hmmm, possible. However, I was more after something that will scrape the relative and absolute paths of whatever page of the site you are currently on. Essentially, I just need something to exclude RSS/XML feeds and anything that leads off-site. I'm not too good at regex though, so any help would be great :-D
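
For reference, the filtering described here does not strictly need a regex. Below is a minimal sketch in Python (not UBot script, so it would have to be translated by hand) of one possible approach: resolve every href against the current page URL so relative and absolute links look the same, then keep only http/https links on the same host that do not look like feeds. The function names `same_site_links` and `strip_www`, the `FEED_HINTS` list, and the example.com URLs are all assumptions made for illustration; the feed markers in particular are guesses that would need tuning per site.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical markers for feed links; adjust these for the target site.
FEED_HINTS = ("/rss", "/feed", ".xml", ".rss", ".atom")


def strip_www(host):
    """Treat www.example.com and example.com as the same site."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def same_site_links(page_url, hrefs):
    """Return absolute, de-duplicated links that stay on page_url's host."""
    root = strip_www(urlparse(page_url).hostname or "")
    kept = []
    for href in hrefs:
        # urljoin resolves "/x", "../x" and "x.html" against the current page
        # and leaves already-absolute URLs unchanged.
        absolute = urljoin(page_url, href.strip())
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # drops javascript:, mailto: and similar links
        if strip_www(parts.hostname or "") != root:
            continue  # drops off-site links (social pages, ad servers)
        if any(hint in parts.path.lower() for hint in FEED_HINTS):
            continue  # drops links whose path looks like a feed
        if absolute not in kept:
            kept.append(absolute)
    return kept


if __name__ == "__main__":
    scraped = ["/about", "../contact.html", "http://example.com/feed/",
               "https://www.facebook.com/example", "javascript:void(0)"]
    for link in same_site_links("http://example.com/blog/index.html", scraped):
        print(link)
```

Comparing hosts rather than searching for the domain as a substring avoids keeping off-site URLs that merely mention the domain somewhere in their path or query string. Note that this sketch treats subdomains other than www as off-site, which may or may not be what you want.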
