UBot Underground

Google Results Scraping Problem



Hey guys,

 

I'm unable to scrape Google results when several of them are grouped together under one listing.

 

Here's an example:

http://content.screencast.com/users/riktubrs/folders/Jing/media/6af49237-40ed-415f-a204-bb4b5c8bc6a3/2012-10-29_1530.png

 

As you can see above, I am unable to scrape these 4 results.

 

Any ideas how to scrape them as well? If I get 14 results as 'Page 1' in Google, it's fine.

 

Thanks


OK, I've got a convoluted way of getting this to work, but I'd like some more keywords to test it with, please.

 

Cheers

 

Here are two example phrases:

 

"Calculus Help Please"

"roger that mean"

 

Let me know how it goes! I'm still stuck on that.

 

Thanks!


Here is a crude way of doing it.

 

add list to list(%google results, $scrape attribute(<onmousedown=w"return *">, "href"), "Delete", "Global")

 

The only problem is that it is also scraping the link for the cached version of the page. I'm sure there is a way to regex out that part of the result.
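
For anyone who wants to see the "regex out the cached links" idea spelled out, here is a minimal sketch in Python rather than UBot script. The sample URLs are made up, and `drop_cached` is just a name I picked for illustration. The only thing taken from the thread is the cache host, `webcache.googleusercontent.com`.

```python
import re

# Made-up sample data standing in for what the scrape returns.
scraped = [
    "http://example.com/page-one",
    "http://webcache.googleusercontent.com/search?q=cache:abc:example.com/page-one",
    "http://example.org/page-two",
]

# Matches any URL that starts with Google's cache host.
CACHE_LINK = re.compile(r"^https?://webcache\.googleusercontent\.com")


def drop_cached(urls):
    """Return the URLs that are not cached-copy links, keeping their order."""
    return [u for u in urls if not CACHE_LINK.match(u)]


results = drop_cached(scraped)
print(results)
```

The same pattern should carry over to any tool with regex matching: test each scraped href against it and keep only the ones that don't match.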



Works like a charm. I removed all the cache links with $contains.

 

Thanks!


Glad I could help.

 

If it's not too much to ask, could you PM me the regex you used? I'm still learning regex and it would help me understand it a little better.

 

If not, I understand.

 

Thanks,

Justin


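I actually found an easier way of removing them than regex, using $contains. The loop walks the list from the end so that removing an item doesn't shift the ones still to be checked.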

 

set(#serp position, $subtract($list total(%serp), 1), "Global")
loop($list total(%serp)) {
    if($contains($list item(%serp, #serp position), "http://webcache.googleusercontent.com")) {
        then {
            remove from list(%serp, #serp position)
        }
    }
    decrement(#serp position)
}

 

I hope it helps.
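
If it helps to see the idea outside UBot, here is a rough Python sketch of the same backwards walk. The function name and the sample list are made up for illustration. It assumes a zero-based list, so the walk starts at the last valid index (length minus one) and runs down to 0.

```python
CACHE_HOST = "http://webcache.googleusercontent.com"


def remove_cached_in_place(serp):
    """Delete cached-copy links from serp, scanning from the last index down to 0."""
    position = len(serp) - 1  # last valid index of a zero-based list
    while position >= 0:
        if CACHE_HOST in serp[position]:
            # Deleting here only shifts items after position,
            # and those have already been checked.
            del serp[position]
        position -= 1
    return serp


# Made-up sample data.
serp = [
    "http://example.com/a",
    "http://webcache.googleusercontent.com/search?q=cache:a",
    "http://webcache.googleusercontent.com/search?q=cache:b",
    "http://example.com/b",
]
print(remove_cached_in_place(serp))
```

Going forwards instead would skip the item right after each removal, which is why two cache links in a row are a good test case.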


OK, I may as well share this solution anyhow. It took me quite a while to figure out, and I'm sure there are far simpler ways. However, I only wanted to scrape the indented results, nothing else, so I needed to look at each one of the ten results individually to see whether it had indented results or not.

 

clear cookies
clear all lists()
ui stat monitor("Total Indented Results:", $list total(%mainresults))
ui text box("Keyword", #keyword)
define clear all lists {
   clear list(%mainresults)
   clear list(%results)
   clear list(%results1)
   clear list(%results2)
   clear list(%results3)
   clear list(%results4)
   clear list(%results5)
   clear list(%results6)
   clear list(%results7)
   clear list(%results8)
   clear list(%results9)
   clear list(%results10)
}
ui button("Scrape") {
   clear all lists()
   navigate("http://www.google.ie/#hl=en&output=search&sclient=psy-ab&q=test", "Wait")
   wait for browser event("Page Loaded", "")
   change attribute(<name="q">, "value", #keyword)
   click(<name="btnG">, "Left Click", "No")
   wait for browser event("Page Loaded", "")
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'1\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 6) {
    then {
	    remove from list(%results, 0)
	    remove from list(%results, 0)
	    add list to list(%results1, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'1\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 3) {
    then {
	    remove from list(%results, 0)
	    add list to list(%results1, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'2\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 6) {
    then {
	    remove from list(%results, 0)
	    remove from list(%results, 0)
	    add list to list(%results2, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'2\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 3) {
    then {
	    remove from list(%results, 0)
	    add list to list(%results2, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'3\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 6) {
    then {
	    remove from list(%results, 0)
	    remove from list(%results, 0)
	    add list to list(%results3, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'3\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 3) {
    then {
	    remove from list(%results, 0)
	    add list to list(%results3, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'4\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 6) {
    then {
	    remove from list(%results, 0)
	    remove from list(%results, 0)
	    add list to list(%results4, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'4\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 3) {
    then {
	    remove from list(%results, 0)
	    add list to list(%results4, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'5\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 6) {
    then {
	    remove from list(%results, 0)
	    remove from list(%results, 0)
	    add list to list(%results5, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'5\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 3) {
    then {
	    remove from list(%results, 0)
	    add list to list(%results5, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'6\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 6) {
    then {
	    remove from list(%results, 0)
	    remove from list(%results, 0)
	    add list to list(%results6, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'6\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 3) {
    then {
	    remove from list(%results, 0)
	    add list to list(%results6, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'7\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 6) {
    then {
	    remove from list(%results, 0)
	    remove from list(%results, 0)
	    add list to list(%results7, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'7\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 3) {
    then {
	    remove from list(%results, 0)
	    add list to list(%results7, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'8\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 6) {
    then {
	    remove from list(%results, 0)
	    remove from list(%results, 0)
	    add list to list(%results8, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'8\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 3) {
    then {
	    remove from list(%results, 0)
	    add list to list(%results8, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'9\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 6) {
    then {
	    remove from list(%results, 0)
	    remove from list(%results, 0)
	    add list to list(%results9, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'9\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 3) {
    then {
	    remove from list(%results, 0)
	    add list to list(%results9, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'10\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 6) {
    then {
	    remove from list(%results, 0)
	    remove from list(%results, 0)
	    add list to list(%results10, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%results, $scrape attribute(<outerhtml=w"<a href=\"*\" onmousedown=\"return rwt(this,\'\',\'\',\'\',\'10\',\'*\',\'\',\'*\',null,event)\">*</a>">, "href"), "Don\'t Delete", "Global")
   if($list total(%results) = 3) {
    then {
	    remove from list(%results, 0)
	    add list to list(%results10, %results, "Delete", "Global")
	    clear list(%results)
    }
    else {
	    clear list(%results)
    }
   }
   add list to list(%mainresults, %results1, "Delete", "Global")
   add list to list(%mainresults, %results2, "Delete", "Global")
   add list to list(%mainresults, %results3, "Delete", "Global")
   add list to list(%mainresults, %results4, "Delete", "Global")
   add list to list(%mainresults, %results5, "Delete", "Global")
   add list to list(%mainresults, %results6, "Delete", "Global")
   add list to list(%mainresults, %results7, "Delete", "Global")
   add list to list(%mainresults, %results8, "Delete", "Global")
   add list to list(%mainresults, %results9, "Delete", "Global")
   add list to list(%mainresults, %results10, "Delete", "Global")
}

 

Just enter a keyword, then press Scrape. It will give you the total number of indented results. Sometimes there are more than 4 results on a page, so that's where the issue arose for me.
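
The ten near-identical blocks above differ only in the position number passed to `rwt(...)`, so the idea can be written as one loop over positions 1 to 10. Here is a simplified Python sketch of that, not a translation of the script. It assumes that an indented result carries the same position number as its parent and that the first link at each position is the main result. `make_anchor`, `indented_results` and the sample URLs are made up for illustration.

```python
import re
from collections import defaultdict

# Captures the href and the position (the fifth rwt argument) from an anchor.
ANCHOR = re.compile(
    r"""<a href="([^"]*)" onmousedown="return rwt\(this,'','','','(\d+)',"""
)


def make_anchor(href, position):
    """Build a made-up anchor in the shape the wildcard pattern above expects."""
    return (f'<a href="{href}" onmousedown="return rwt(this,'
            f"'','','','{position}','x','','y',null,event)"
            f'">link</a>')


def indented_results(anchors):
    """Group hrefs by result position and return everything after the first one."""
    by_position = defaultdict(list)
    for html in anchors:
        match = ANCHOR.search(html)
        if match:
            by_position[int(match.group(2))].append(match.group(1))
    indented = []
    for position in range(1, 11):  # the ten results on a page
        hrefs = by_position.get(position, [])
        indented.extend(hrefs[1:])  # assumption: the first href is the main result
    return indented


sample = [
    make_anchor("http://a.example/", 1),
    make_anchor("http://a.example/more", 1),
    make_anchor("http://b.example/", 2),
    make_anchor("http://c.example/", 3),
    make_anchor("http://c.example/x", 3),
    make_anchor("http://c.example/y", 3),
]
print(indented_results(sample))
```

Because this doesn't depend on a fixed count per position, it would also cope with a page that has more than 4 indented results, as long as the assumptions above hold.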


This is what I came up with quickly:

 

clear list(%googleresults)
ui text box("Keyword", #keyword)
ui stat monitor("Links", $list total(%googleresults))
navigate("http://www.google.co.uk/#hl=en&sclient=psy-ab&q={#keyword}", "Wait")
set(#scrapepages, $replace regular expression($scrape attribute(<onmousedown=w"return *">, "href"), "http://webcache\\.googleusercontent\\.com.*", ""), "Global")
add list to list(%googleresults, $list from text(#scrapepages, $new line), "Delete", "Global")
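
For comparison, here is roughly the same replace-then-split approach as a Python sketch. It reuses the regex from the code above, while the function name and sample text are made up. One thing it shows: blanking a cache link with a regex replace leaves an empty line behind, so in Python at least the blanks have to be dropped when the text is split back into a list.

```python
import re


def clean_scrape(raw_text):
    """Blank out cached-copy links, then split into a list without empty entries."""
    # '.' does not match a newline by default, so '.*' stops at the end of the line.
    blanked = re.sub(r"http://webcache\.googleusercontent\.com.*", "", raw_text)
    return [line for line in blanked.splitlines() if line.strip()]


# Made-up sample: one scraped href per line.
raw = "\n".join([
    "http://example.com/a",
    "http://webcache.googleusercontent.com/search?q=cache:a",
    "http://example.com/b",
])
print(clean_scrape(raw))
```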
