magzmedia (posted February 17, 2011)

Hey guys, I was just running a bot I compiled on one machine on another, and I noticed something weird. You know how, when scraping Google, you select outerhtml as an attribute:

<A class=l onmousedown="return clk(this.href,*')" href="*">*</A>

and then just scrape by attribute href? You normally get the URL, e.g. http://scapedsite.com

On the other machine I'm getting this:

/url?sa=t&source=web&cd=11&ved=0CBgQFjAAOAo&url=http%3A%2F%2Flibrary.med.utah.edu%2Fblog%2Feccles%2F2010%2F04%2F08%2Fto-twitter-or-not-to-twitter%2F&ei=4GlcTe8qy63xA7bw5ZUC&usg=AFQjCNF0SoJRJlUJxM6pADE07hhlfBxbRw

Any ideas why Google is serving this data, and why it's fine on one machine and not another? Any help would be appreciated.

Cheers, Rob
LoWrIdErTJ - BotGuru (posted February 17, 2011)

Make sure your search settings are the same. In Google you can normally set search preferences, and if one machine's settings differ from the other's, that might be the issue you're having. It's best to compile a bot on Google using the standard preferences.

http://www.google.com/advanced_search?q=google+search+preference
magzmedia (posted February 17, 2011)

Thanks LoWrIdErTJ, I've checked my internet and browser settings, and they appear to be consistent across all the machines on my network. All are running Win 7, the only difference being that my laptop is 64-bit.

It's a really weird one: if I go to Google Blog Search everything is fine, but on the standard Google search it does not work. I've checked my Internet settings, and this anomaly also shows up in Google Chrome.

Doing a search for the returned result brings up information about SOAP and tracking. Does anyone have any additional ideas?

Thanks, Rob
UBotBuddy (posted February 17, 2011)

Did you try this? *
magzmedia (posted February 17, 2011)

Hey Botbuddy, thanks for the reply, but the issue is with the data Google is serving. Look what happens when I select by outerhtml in choose by attribute:

<A class=l onmousedown="" href="/url?sa=t&source=web&cd=1&sqi=2&ved=0CDoQFjAA&url=http%3A%2F%2Fwww.blogger.com%2F&ei=3h9dTfHxG4nPhAeh1bGqCA&usg=AFQjCNGb2S_ihucIn-YF2o0ZlC3KCh92Aw">Blogger: Create your free <EM>Blog</EM></A>

Using wildcards, <A class=l *>*</A> will return the following when scraping href:

/url?sa=t&source=web&cd=1&sqi=2&ved=0CDoQFjAA&url=http%3A%2F%2Fwww.blogger.com%2F&ei=3h9dTfHxG4nPhAeh1bGqCA&usg=AFQjCNGb2S_ihucIn-YF2o0ZlC3KCh92Aw

I'm hoping that this is not going to be how Google serves links in the future, or scraping will be a nightmare. I mean, even using regex won't return the URL, will it? I just hope it can be resolved with a simple change in the Internet options, but I can't see anything. If this is a new change being rolled out by Google, then we'll all be shafted, won't we?

Please advise, guys; this could make our work a lot harder.

Cheers, Rob
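On the regex question: the real address is still inside the scraped href, percent-encoded in the url parameter, so it can be recovered after scraping. Below is a minimal Python sketch of that idea, not UBot script. The function name extract_target_url is made up for illustration, and it assumes the redirect always uses the /url path with a url parameter, as in the two hrefs quoted in this thread.

```python
from urllib.parse import urlsplit, parse_qs


def extract_target_url(href):
    """Return the destination hidden in a '/url?...' redirect href.

    Assumption: the destination is carried in the 'url' query parameter,
    as in the hrefs quoted in this thread. Any other href is returned
    unchanged.
    """
    parts = urlsplit(href)
    if parts.path == "/url":
        # parse_qs splits the query string and decodes %3A, %2F, etc.
        params = parse_qs(parts.query)
        if "url" in params:
            return params["url"][0]
    return href


scraped = (
    "/url?sa=t&source=web&cd=1&sqi=2&ved=0CDoQFjAA"
    "&url=http%3A%2F%2Fwww.blogger.com%2F"
    "&ei=3h9dTfHxG4nPhAeh1bGqCA&usg=AFQjCNGb2S_ihucIn-YF2o0ZlC3KCh92Aw"
)
print(extract_target_url(scraped))  # prints http://www.blogger.com/
```

Hrefs that are already plain URLs pass through untouched, so the same step can run on results from both machines.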
UBotBuddy (posted February 17, 2011)

It's a Google thing. Google does this to change things up.
magzmedia (posted February 17, 2011)

Hey Botbuddy, do you think this is a temporary issue, then, and nothing to worry about?

Cheers, Rob
UBotBuddy (posted February 17, 2011)

I wish I could predict Google. I would become rich if I could.

I have given up on scraping Google in a production bot. Almost as soon as I deliver to a customer, Google will change something and my bot will break. Then I have to fix it. The cost is TOO high in terms of my time.
magzmedia (posted February 17, 2011)

Just discovered that it's all down to Google Instant, although I was sure I had tested for that. Google Instant throws in the stupid URLs. Problem is sorted; thanks for your help, guys.