Guest turbolapp, posted November 28, 2009
Any thoughts on how to scrape these slippery bastards? ;D
http://www.dogpile.com/dogpile_other/ws/searchspy/qcat=Web/_iceUrlFlag=11?_IceUrl=true
Really, I'd like to scrape just the search terms and not the hrefs.
Aaron Nimocks, posted November 28, 2009
Attached is a script that works. Just change the output file path.
1. Basically, choose the attribute for outertext as a wildcard.
2. Then scrape the page for everything within the anchor text.
3. Save to file.
4. Refresh the page and do it again.
I don't know your goal, but I set it not to record duplicates. I guess it depends on whether you are making an automated keyword search tool or just recording search history data.
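The attached script is not reproduced in the thread, so what follows is only a rough Python sketch of the steps listed above, not Aaron's script. It assumes the search terms are the visible text of `<a>` elements in the page HTML; the names `AnchorTextParser`, `extract_terms`, `add_new_terms`, and `append_terms` are made up for illustration.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collects the visible text of every <a> element and ignores href values."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # how many <a> tags are currently open
        self._buffer = []    # text pieces seen inside the current anchor
        self.terms = []      # finished anchor texts, in page order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace into single spaces.
                text = " ".join("".join(self._buffer).split())
                self._buffer = []
                if text:
                    self.terms.append(text)


def extract_terms(html):
    """Step 2: return the anchor text of every link in the HTML."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.terms


def add_new_terms(terms, seen):
    """Return the terms not already in `seen` (order kept) and record them."""
    fresh = []
    for term in terms:
        if term not in seen:
            seen.add(term)
            fresh.append(term)
    return fresh


def append_terms(path, terms):
    """Step 3: append one term per line to the output file."""
    with open(path, "a", encoding="utf-8") as handle:
        for term in terms:
            handle.write(term + "\n")
```

Keeping the `seen` set alive between refreshes is what prevents duplicates from being written on later passes; drop that check if you want a raw search history instead.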
Guest turbolapp, posted November 28, 2009
Wow, that's a really creative way to approach it. Works just great. The only thing is the bot never stops. But whatever, it still scrapes a lot of data and I can just stop it when I'm done with it. So thanks!
Aaron Nimocks, posted November 28, 2009
When does it stop? Did you see I added a UI field to enter how many loops you want to do? Just enter a really high number and it should keep running.
Guest turbolapp, posted November 28, 2009
I just entered one or two loops and it doesn't stop. I thought it might have something to do with the loop increments being kind of redundant with the actual loop itself, but even when I removed the inc #loops it still looped endlessly. Like I said, it's really no biggie as the script itself works just fine, but if you're like me you want to know WHY it's doing that. I'll probably wake up in the middle of the night with the answer. ;D
Aaron Nimocks, posted November 28, 2009
I just had it refresh the page every 20 seconds. The loop count was the number of times to refresh. You can't just sit there and let it run and scrape. You should refresh it.
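For anyone following along, here is a minimal Python sketch of a refresh loop that stops after a set number of passes. It is not the attached script and does not explain why that script kept looping; `run_refresh_loop`, `fetch_page`, and `handle_page` are hypothetical names, and the 20-second default just mirrors the interval mentioned above.

```python
import time


def run_refresh_loop(fetch_page, handle_page, max_loops,
                     delay_seconds=20, sleep=time.sleep):
    """Fetch and process the page exactly `max_loops` times.

    `fetch_page` returns the page HTML and `handle_page` scrapes and saves it;
    both are supplied by the caller. `sleep` can be swapped out for testing.
    """
    completed = 0
    # range() owns the counter, so nothing inside the body can reset or
    # double-increment it and keep the loop alive.
    for i in range(max_loops):
        handle_page(fetch_page())
        completed += 1
        if i < max_loops - 1:
            sleep(delay_seconds)  # wait before the next refresh, not after the last
    return completed
```

Letting the loop construct own the counter, rather than incrementing a separate variable by hand, is one way to rule out the kind of redundant-increment problem guessed at earlier in the thread.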