Probs with scraping href attribute on Google - Strange one

magzmedia · February 17, 2011

hey guys,

I was just running a bot on one machine that I had compiled on another, and I noticed something weird. You know how, when scraping Google, you select:

outerhtml as an attribute...

and then just scrape by attribute href

You normally get the URL, e.g. http://scapedsite.com

On the other machine I'm getting this:

/url?sa=t&source=web&cd=11&ved=0CBgQFjAAOAo&url=http%3A%2F%2Flibrary.med.utah.edu%2Fblog%2Feccles%2F2010%2F04%2F08%2Fto-twitter-or-not-to-twitter%2F&ei=4GlcTe8qy63xA7bw5ZUC&usg=AFQjCNF0SoJRJlUJxM6pADE07hhlfBxbRw

Any ideas why Google is serving this data, and why it's fine on one machine and not another?

Any help would be appreciated

Cheers

Rob

LoWrIdErTJ - BotGuru · February 17, 2011

Make sure your search settings are the same. In Google you can normally set search preferences, and if one machine's are different from the other's, that might be the issue you're having.

Best to compile a bot on Google using the standard preferences.

http://www.google.com/advanced_search?q=google+search+preference

magzmedia · February 17, 2011

Thanks LoWrIdErTJ,

I've checked my internet and browser settings and they appear to be consistent across all the machines on my network. All are running Win 7, the only difference being that my laptop is 64-bit.

It's a really weird one... if I go to Google Blog Search everything is fine, but on the standard Google search it does not work. I've checked my Internet settings, and this anomaly also shows up in Google Chrome. Doing a search for the returned result brings up information about SOAP and tracking...

Does anyone have any additional ideas?

Thanks

Rob

UBotBuddy · February 17, 2011

Did you try this?

*

magzmedia · February 17, 2011

Hey Botbuddy,

Thanks for the reply, but the issue is with the data Google is serving. Look what happens when I select by outerhtml in choose by attribute:

<A class=l onmousedown="" href="/url?sa=t&source=web&cd=1&sqi=2&ved=0CDoQFjAA&url=http%3A%2F%2Fwww.blogger.com%2F&ei=3h9dTfHxG4nPhAeh1bGqCA&usg=AFQjCNGb2S_ihucIn-YF2o0ZlC3KCh92Aw">Blogger: Create your free <EM>Blog</EM></A>

Using the wildcard pattern <A class=l *>*</A> returns the following when scraping href:

/url?sa=t&source=web&cd=1&sqi=2&ved=0CDoQFjAA&url=http%3A%2F%2Fwww.blogger.com%2F&ei=3h9dTfHxG4nPhAeh1bGqCA&usg=AFQjCNGb2S_ihucIn-YF2o0ZlC3KCh92Aw

I'm hoping that this is not going to be how Google serves links in the future, or scraping will be a nightmare. I mean, even using regex won't return the URL, will it?

I just hope it can be resolved with a simple change in the Internet options, but I can't see anything.

If this is a new change being rolled out by Google... then we'll all be shafted, won't we?

Please advise, guys. This could make our work a lot harder.

Cheers

Rob
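
On the regex question in the post above: the real address still appears to be present in the scraped href, percent-encoded inside the url= parameter, so it can likely be recovered without a regex at all. The following is a minimal sketch in Python rather than in the bot tool used in this thread. The function name unwrap_google_href is hypothetical, and the code assumes the redirect always has the "/url?...&url=..." shape shown in the samples quoted here.

```python
from urllib.parse import urlsplit, parse_qs


def unwrap_google_href(href: str) -> str:
    """Return the target URL carried in a '/url?...' redirect href.

    Hypothetical helper: it assumes the redirect format quoted in this
    thread, where the destination sits in the 'url' query parameter.
    """
    parts = urlsplit(href)
    # Leave ordinary hrefs (already a plain URL) untouched.
    if parts.path != "/url":
        return href
    # parse_qs splits the query string and percent-decodes each value.
    params = parse_qs(parts.query)
    targets = params.get("url")
    # Fall back to the original href if no 'url' parameter is present.
    return targets[0] if targets else href


# The href scraped in the post above.
scraped = (
    "/url?sa=t&source=web&cd=1&sqi=2&ved=0CDoQFjAA"
    "&url=http%3A%2F%2Fwww.blogger.com%2F"
    "&ei=3h9dTfHxG4nPhAeh1bGqCA&usg=AFQjCNGb2S_ihucIn-YF2o0ZlC3KCh92Aw"
)
print(unwrap_google_href(scraped))  # http://www.blogger.com/
```

Because plain hrefs pass through unchanged, the same helper could be applied to results from both machines, whichever link format Google happens to serve.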

UBotBuddy · February 17, 2011

It's a Google thing. Google does this to change things up.

magzmedia · February 17, 2011

Hey Botbuddy, do you think this is a temporary issue then, and nothing to worry about?

Cheers

Rob

UBotBuddy · February 17, 2011

I wish I could predict Google. I would become rich if I could.

I have given up on scraping Google in a production bot. Almost as soon as I deliver to a customer, Google will change something and my bot will break. Then I have to fix it. The cost is TOO high in terms of my time.

magzmedia · February 17, 2011

Just discovered that it's all down to Google Instant, although I was sure I tested for that. Google Instant throws in the stupid URLs. Problem is sorted, thanks for your help guys.
