UBot Underground

Probs with scraping href attribute on Google - Strange one



hey guys,

 

I was just running a bot that I compiled on one machine on a different machine, and I noticed something weird. You know how, when scraping Google, you select:

 

outerhtml as an attribute...

 

<A class=l onmousedown="return clk(this.href,*')" href="*">*</A>

 

and then just scrape by attribute href

 

you normally get the URL, e.g. http://scapedsite.com

 

On the other machine I'm getting this:

 

/url?sa=t&source=web&cd=11&ved=0CBgQFjAAOAo&url=http%3A%2F%2Flibrary.med.utah.edu%2Fblog%2Feccles%2F2010%2F04%2F08%2Fto-twitter-or-not-to-twitter%2F&ei=4GlcTe8qy63xA7bw5ZUC&usg=AFQjCNF0SoJRJlUJxM6pADE07hhlfBxbRw

 

Any ideas why Google is serving this data, and why it's fine on one machine and not on another?

 

Any help would be appreciated

 

Cheers

Rob


Make sure your search settings are the same. In Google you can normally set search preferences, and if one machine's preferences differ from the other's, that might be the issue you're having.

 

Best to compile a bot on Google using the standard preferences.

http://www.google.com/advanced_search?q=google+search+preference


Thanks LoWrIdErTJ,

 

I've checked my Internet and browser settings and they appear to be consistent across all the machines on my network. All are running Win 7, the only difference being that my laptop is 64-bit.

 

It's a really weird one... if I go to Google Blog Search everything is fine, but on the standard Google search it does not work. I've checked my Internet settings, and this anomaly also shows up in Google Chrome. Doing a search for the returned result brings up information about SOAP and tracking...

 

Does anyone have any additional ideas?

 

Thanks

 

Rob


Hey Botbuddy,

 

Thanks for the reply, but the issue is with the data Google is serving. Look what happens when I select by outerhtml in choose by attribute:

 

 

<A class=l onmousedown="" href="/url?sa=t&source=web&cd=1&sqi=2&ved=0CDoQFjAA&url=http%3A%2F%2Fwww.blogger.com%2F&ei=3h9dTfHxG4nPhAeh1bGqCA&usg=AFQjCNGb2S_ihucIn-YF2o0ZlC3KCh92Aw">Blogger: Create your free <EM>Blog</EM></A>

 

Using wildcards, <A class=l *>*</A> returns the following when scraping href:

 

/url?sa=t&source=web&cd=1&sqi=2&ved=0CDoQFjAA&url=http%3A%2F%2Fwww.blogger.com%2F&ei=3h9dTfHxG4nPhAeh1bGqCA&usg=AFQjCNGb2S_ihucIn-YF2o0ZlC3KCh92Aw

 

I'm hoping that this is not going to be how Google serves links in the future... or scraping will be a nightmare. I mean, even using regex won't return the URL, will it?
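
For what it's worth, the real address is still inside that href: it sits percent-encoded in the `url` query parameter, so it can be recovered by parsing the query string rather than with a regex. Here is a minimal sketch in Python (not UBot), assuming the redirect always has the `/url?...&url=...` shape shown above; the function name `real_url` is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def real_url(href):
    """Return the destination hidden in a Google '/url?...' redirect href.

    Assumes the destination is carried in the 'url' query parameter, as in
    the hrefs quoted in this thread. Any other href is returned unchanged.
    """
    parts = urlsplit(href)
    if parts.path != "/url":
        return href  # already a direct link
    # parse_qs splits the query string and percent-decodes each value
    targets = parse_qs(parts.query).get("url")
    return targets[0] if targets else href


href = ("/url?sa=t&source=web&cd=1&sqi=2&ved=0CDoQFjAA"
        "&url=http%3A%2F%2Fwww.blogger.com%2F"
        "&ei=3h9dTfHxG4nPhAeh1bGqCA&usg=AFQjCNGb2S_ihucIn-YF2o0ZlC3KCh92Aw")
print(real_url(href))  # http://www.blogger.com/
```

Direct links pass through untouched, so the same call should work whether or not a given machine is being served the redirect-style hrefs.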

 

I just hope it can be resolved with a simple change in the Internet options, but I can't see anything :(

 

If this is a new change being rolled out by Google... then we'll all be shafted, won't we?

 

Please advise, guys, this could make our work a lot harder.

 

Cheers

 

Rob


I wish I could predict Google. I would become rich if I could.

 

I have given up with scraping Google in a production bot. Almost as soon as I deliver to a customer Google will change something and my bot will break. Then I have to fix it. The cost is TOO high in terms of my time.


Just discovered that it's all down to Google Instant... although I was sure I tested for that. Google Instant throws in the stupid URLs. Problem is sorted, thanks for your help guys.
