UBot Underground

How To Scrape Google Reviews Using Socks If Hidden Behind "more" Button



Hi all, I am trying to scrape Google reviews via socks.

 

The problem: Google only displays approx. 8 reviews when the Google profile is loaded. The next batch of reviews is loaded when the "More" button is pressed. I guess it is loaded via JavaScript. The Google profile URL remains unchanged.

My question: Is there any way to scrape all reviews of a profile via socks?

 

PS: I haven't tried Aymen's HTTP Post plugin. Would it work with that plugin, and if so, how?

 

Any idea is appreciated. Thank you.

 

-----

 

I am sorry - I am very new to this so maybe none of this makes much sense - but here is what I have found so far:

 

When the "more" button is pressed, this is what happens (see below). So I figured I need to reconstruct the request URL and send something to it via socket navigate POST. You can actually scrape the ozv= and f.sid= parameters from the page first. I assume I can leave avw= and rt= as they are. But I don't know about the request ID (_reqid). It seems to be generated with every click on the "more" button.

 

If I POST to the request URL, nothing happens, maybe because the reqid is wrong.

 

Since I am new to this, I don't know if I am completely on the wrong track...

 

Request URL:
https://plus.google.com/_/pages/local/loadreviews?ozv=es_oz_20150312.07_p1&avw=pr%3Apr&f.sid=-1412152857337063380&_reqid=640891&rt=j

Request Method: POST
Status Code: 200 OK

Request Headers
  Content-Type: application/x-www-form-urlencoded;charset=UTF-8
  Origin: https://plus.google.com
  Referer: https://plus.google.com/
  User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.83 Safari/537.1
  X-Same-Domain: 1

Query String Parameters
  ozv: es_oz_20150312.07_p1
  avw: pr:pr
  f.sid: -1412152857337063380
  _reqid: 640891
  rt: j

Form Data
  f.req: ["6504450088010350842",null,[null,null,[[28,18,10],[30]]],[null,null,null,true,11,true],null,null,[360,2,[110,0]]]

Response Headers
  Alternate-Protocol: 443:quic,p=0.5
  Cache-Control: no-cache, no-store, max-age=0, must-revalidate
  Content-Disposition: attachment; filename="response.bin"; filename*=UTF-8''response.bin
  Content-Encoding: gzip
  Content-Type: application/json; charset=utf-8
  Date: Sun, 15 Mar 2015 10:23:05 GMT
  Expires: Fri, 01 Jan 1990 00:00:00 GMT
  Pragma: no-cache
  Server: GSE
  Transfer-Encoding: chunked
  X-Content-Type-Options: nosniff
  X-Frame-Options: SAMEORIGIN
  X-XSS-Protection: 1; mode=block
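
For anyone who wants to experiment outside UBot, the captured request can be rebuilt from its parts. This is only a rough sketch in Python (my choice of language, not something used in this thread). It takes the URL, headers and f.req value exactly as captured above and swaps in a different _reqid. The function name build_request is made up for illustration, and nothing here confirms that the server will accept the rebuilt request. The sketch only assembles the URL, headers and body; it does not send anything.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

# Values copied from the captured request above.
CAPTURED_URL = (
    "https://plus.google.com/_/pages/local/loadreviews"
    "?ozv=es_oz_20150312.07_p1&avw=pr%3Apr"
    "&f.sid=-1412152857337063380&_reqid=640891&rt=j"
)
CAPTURED_F_REQ = (
    '["6504450088010350842",null,[null,null,[[28,18,10],[30]]],'
    '[null,null,null,true,11,true],null,null,[360,2,[110,0]]]'
)


def build_request(captured_url, f_req, reqid):
    """Return (url, headers, body) for a POST that reuses the captured
    query parameters but replaces _reqid. Hypothetical helper."""
    parts = urlsplit(captured_url)
    # parse_qs decodes the values (pr%3Apr -> pr:pr); keep the first of each.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    params["_reqid"] = str(reqid)
    url = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), "")
    )
    headers = {
        "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
        "Origin": "https://plus.google.com",
        "Referer": "https://plus.google.com/",
        "X-Same-Domain": "1",
    }
    # The form body has a single field, f.req, URL-encoded.
    body = urlencode({"f.req": f_req})
    return url, headers, body


if __name__ == "__main__":
    url, headers, body = build_request(CAPTURED_URL, CAPTURED_F_REQ, 740891)
    print(url)
    print(body)
```

Passing the original reqid (640891) reproduces the captured URL exactly, which is a quick way to check that the parameters survive the round trip before trying other reqid values.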

Edited by jens.wagner@freshamedia.de

OK, it seems the reqid is assigned for each "session" (e.g. 450186). Once the page is loaded, the reqid is set. If you reload the Google profile, you get a fresh reqid. But the difference between the two is the number of seconds that have passed between the refreshes. That means the "reqid-to-be-assigned" counts upwards with every second that passes. If 100 seconds have passed, the new reqid will be 450286.

 

The first digit of the reqid is a counter for the number of requests made with this reqid. In our example it would be the 4 in 450186. With every request this number counts up.
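
To make the guess above concrete, here is a small Python sketch of the reqid as I currently understand it: a leading request counter followed by a per-session base that grows by one per second. All function names are invented for illustration, and the model itself is only the hypothesis from this post, not confirmed behaviour. What happens once the counter passes 9 is unknown and not handled.

```python
def split_reqid(reqid):
    """Split a reqid into (request counter, session base, base width),
    assuming the first digit is the counter (unverified hypothesis)."""
    text = str(reqid)
    return int(text[0]), int(text[1:]), len(text) - 1


def build_reqid(counter, base, width=5):
    """Join counter and base again, zero-padding the base to `width` digits."""
    return int(f"{counter}{base:0{width}d}")


def next_reqid(reqid):
    """reqid for the next request in the same session: counter + 1."""
    counter, base, width = split_reqid(reqid)
    return build_reqid(counter + 1, base, width)


def reqid_after_reload(reqid, seconds_elapsed):
    """reqid expected after a page reload `seconds_elapsed` later,
    assuming the base grows by one per second and the counter is kept."""
    counter, base, width = split_reqid(reqid)
    return build_reqid(counter, base + seconds_elapsed, width)


if __name__ == "__main__":
    print(split_reqid(450186))
    print(next_reqid(450186))
    print(reqid_after_reload(450186, 100))
```

With the example from this post, reqid_after_reload(450186, 100) gives 450286, matching the observed value, and next_reqid(450186) gives 550186.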

 

Now we only need to figure out how the "50186" part of the "450186" reqid is generated in the first place. I would assume that it has something to do with a Unix time difference, possibly between dates given by headers in the responses to GET requests to gstatic that occur on page load?


I had a look at that. It is some very difficult stuff. The site is XHTML, so it runs XML. I tried firing a couple of old user agents at it, with no success (I might try older ones).

 

I did have a look at this when making my Gmail bot. The site is exactly the same: XHTML, reqid and everything else. I needed to know which mails were new and which were old, and it really bothered me. I used to just log in to Gmail and watch the traffic with Firebug, and I was messing around with randomly inputting some of the strings that came up in the URL posts, literally mashing that garbage together, till BAM, the page just returned pure XML that has everything in it, I mean everything, and it is easy to scrape. So playing around until it errors into firing out its XML can also work :)

 

There might be something on the internet about returning pure XML from an XHTML URL.

