smartquin, posted March 7, 2013

Hi Guys

I'm trying to write a bot that will go to a webpage, scrape the <a> tags and get the hrefs, and visit each page (to simulate a real person). However, I don't want to be directed to the website's rss feed, to another domain (like their facebook/twitter page), a javascript link etc.

This is what I have so far:

    set(#rootdomain, $find regular expression($url, "(?![/|/www.])[a-zA-Z0-9\\-\\.]+\\.[a-zA-Z]\{2,4\}(?=/)"), "Global")
    clear list(%urls)
    clear list(%cleanedurls)
    add list to list(%urls, $list from text($scrape attribute(<tagname="a">, "href"), $new line), "Delete", "Global")
    set list position(%urls, 0)
    loop($list total(%urls)) {
        set(#temp, $next list item(%urls), "Global")
        if($contains(#temp, #rootdomain)) {
            then {
                add item to list(%cleanedurls, $list item(%urls, $list position(%urls)), "Delete", "Global")
            }
            else {
            }
        }
    }
    add list to list(%cleanedurls, $scrape attribute(<(href=w"/*" OR href=w"..*")>, "href"), "Delete", "Global")
    loop($rand(0, $list total(%cleanedurls))) {
        click(<href=$random list item(%cleanedurls)>, "Left Click", "No")
        wait($rand(20, 180))
    }

The problem I have is some links could be relative, others absolute, and this bot still adds the addresses for googleads, xml feeds etc.

Any help would be greatly appreciated!

Steve, posted March 13, 2013

Instead of scraping <a> tags from the site directly... What about scraping all of the site's URLs from a search engine and possibly adding a negative term to exclude rss? For example, something like:

    site:domain.com -/rss/

Not sure if that will work for you, but just a thought.

smartquin, posted March 14, 2013

Hmmm, possible, however I was more after something that will scrape the relative and absolute paths of whatever page of the site you are currently on. Essentially I just need something to exclude rss/xml feeds and anything that leads off-site. I'm not too good at reg-ex though, any help would be great :-D
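
The filtering described above (keep relative and absolute links on the same site, drop feeds, off-site links and javascript links) can be sketched without a regular expression. The following is only an illustration in Python, not UBot script, and it is not taken from the thread: the function names `internal_links` and `_bare_host`, the feed path segments and the feed file extensions are all assumptions made for the example. The idea is to resolve every href against the current page URL first, so relative and absolute links can be compared the same way, and then compare hostnames.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical filter values, chosen for illustration only.
FEED_SEGMENTS = {"rss", "feed", "atom"}
FEED_EXTENSIONS = (".xml", ".rss", ".atom")


def _bare_host(url):
    """Return the lowercase hostname of url without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def internal_links(page_url, hrefs):
    """Resolve hrefs against page_url; keep same-site, non-feed http(s) pages."""
    site = _bare_host(page_url)
    kept = []
    for href in hrefs:
        href = href.strip()
        # Skip empty values and in-page anchors.
        if not href or href.startswith("#"):
            continue
        # Turn relative links ("/x", "../x", "//host/x") into absolute URLs.
        absolute = urljoin(page_url, href)
        parts = urlparse(absolute)
        # Drops javascript:, mailto: and other non-web schemes.
        if parts.scheme not in ("http", "https"):
            continue
        # Drops other domains (social pages, ad servers, and so on).
        if _bare_host(absolute) != site:
            continue
        path = parts.path.lower()
        segments = path.strip("/").split("/")
        # Drops paths that look like feeds, by extension or path segment.
        if path.endswith(FEED_EXTENSIONS) or FEED_SEGMENTS.intersection(segments):
            continue
        if absolute not in kept:
            kept.append(absolute)
    return kept
```

Called with the current page URL and the list of scraped hrefs, this returns the absolute URLs that stay on the site. Matching whole path segments rather than substrings means a page such as /feedback is kept while /feed/ is dropped. The hostname comparison treats example.com and www.example.com as the same site but counts other subdomains as off-site, which may or may not be what you want.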