How To Scrape Google Reviews Using Socks If Hidden Behind "more" Button

ilovepizza · March 15, 2015

Hi all, I am trying to scrape Google reviews via socks.

The problem: Google only displays about 8 reviews when the Google profile is loaded. The next batch of reviews is loaded when the "More" button is pressed. I guess it is loaded via JavaScript. The Google profile URL remains unchanged.

My question: Is there any way to scrape all reviews of a profile via socks?

PS: I haven't tried Aymen's HTTP Post plugin. Would it work with Aymen's plugin? How?

Any idea is appreciated. Thank you.

-----

I am sorry - I am very new to this so maybe none of this makes much sense - but here is what I have found so far:

When the "More" button is pressed, this is what happens (see below). So I figured I need to reconstruct the request URL and send something to it via a socket navigate POST. You can actually scrape the ozv= and the f.sid= parameters from the page first. I assume I can leave avw= and rt= as they are. But I don't know about the request ID (_reqid). It seems to be generated with every click on the "More" button.

If I POST to the request URL, nothing happens. Maybe because of the wrong reqid.

Since I am new to this, I don't know if I am completely on the wrong track...

Request URL: https://plus.google.com/_/pages/local/loadreviews?ozv=es_oz_20150312.07_p1&avw=pr%3Apr&f.sid=-1412152857337063380&_reqid=640891&rt=j

Request Method: POST

Status Code: 200 OK

Request Headers:

Content-Type: application/x-www-form-urlencoded;charset=UTF-8
Origin: https://plus.google.com
Referer: https://plus.google.com/
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.83 Safari/537.1
X-Same-Domain: 1

Query String Parameters:

ozv: es_oz_20150312.07_p1
avw: pr:pr
f.sid: -1412152857337063380
_reqid: 640891
rt: j

Form Data:

f.req: ["6504450088010350842",null,[null,null,[[28,18,10],[30]]],[null,null,null,true,11,true],null,null,[360,2,[110,0]]]

:

Response Headers:

Alternate-Protocol: 443:quic,p=0.5
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Content-Disposition: attachment; filename="response.bin"; filename*=UTF-8''response.bin
Content-Encoding: gzip
Content-Type: application/json; charset=utf-8
Date: Sun, 15 Mar 2015 10:23:05 GMT
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Pragma: no-cache
Server: GSE
Transfer-Encoding: chunked
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
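
For anyone who wants to experiment outside the bot, here is a minimal Python sketch of how the captured request could be rebuilt from its parts. It only constructs the request object and does not send anything. The function name `build_loadreviews_request` and the constant names are made up for illustration; the URL, parameters, headers and f.req value are copied from the capture above. Whether the endpoint still answers, and whether it accepts a request without session cookies or a valid reqid, is not something this sketch can tell you.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the captured Request URL.
ENDPOINT = "https://plus.google.com/_/pages/local/loadreviews"

# f.req payload copied verbatim from the capture. The meaning of the
# individual fields is unknown, so it is treated as an opaque string.
F_REQ_EXAMPLE = (
    '["6504450088010350842",null,[null,null,[[28,18,10],[30]]],'
    '[null,null,null,true,11,true],null,null,[360,2,[110,0]]]'
)


def build_loadreviews_request(ozv, f_sid, reqid, f_req):
    """Rebuild the POST seen in the capture (hypothetical helper)."""
    # Query string: ozv and f.sid are scraped from the page first,
    # avw and rt are left exactly as captured.
    query = urlencode({
        "ozv": ozv,
        "avw": "pr:pr",
        "f.sid": f_sid,
        "_reqid": reqid,
        "rt": "j",
    })
    # Form body: a single url-encoded f.req field.
    body = urlencode({"f.req": f_req}).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
        "Origin": "https://plus.google.com",
        "Referer": "https://plus.google.com/",
        "X-Same-Domain": "1",
    }
    return Request(f"{ENDPOINT}?{query}", data=body,
                   headers=headers, method="POST")
```

Calling `build_loadreviews_request("es_oz_20150312.07_p1", "-1412152857337063380", 640891, F_REQ_EXAMPLE)` gives a request whose `full_url` matches the captured Request URL. Sending it, and routing it through a SOCKS proxy, is left out on purpose.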

Edited March 15, 2015 by jens.wagner@freshamedia.de

ilovepizza · March 16, 2015

Ok, it seems the reqid is assigned for each "session" (e.g. 450186). Once the page is loaded, the reqid is set. If you reload the Google profile, you get a fresh reqid. But the difference between the two is the number of seconds that have passed between the two page loads. That means the "reqid-to-be-assigned" counts upwards with every second that passes. If 100 seconds have passed, the new reqid will be 450286.

The first digit of the reqid is a counter for the number of requests made with this reqid. In our example it would be the 4 in 450186. With every request this number counts up.

Now we only need to figure out how the "50186" part of the "450186" reqid is generated in the first place. I would assume that it has something to do with a Unix time difference. Possibly between dates given by headers resulting from GET requests to gstatic that occur on page load?
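
To make the arithmetic concrete, here is a small Python sketch of the pattern described in this post. It only encodes the observations above (leading digit = request counter, remaining five digits = a base that grows by one per second) and is a guess, not a confirmed description of how Google generates the value. The function names and the fixed five-digit split are assumptions taken from the single example 450186, and what happens when the base grows past five digits is unknown.

```python
# Assumption: the base is the last five digits, as in the example 450186.
BASE_SPAN = 100_000


def split_reqid(reqid):
    """Split a reqid into (request counter, base) under the assumption above."""
    return divmod(reqid, BASE_SPAN)


def predict_reqid(observed_reqid, seconds_elapsed=0, extra_requests=0):
    """Predict a later reqid from an observed one.

    seconds_elapsed: seconds between the two page loads (base grows by 1/s).
    extra_requests:  additional "More" clicks made with the same base
                     (each one bumps the leading counter by 1).
    """
    counter, base = split_reqid(observed_reqid)
    return (counter + extra_requests) * BASE_SPAN + base + seconds_elapsed
```

With the numbers from this post, `split_reqid(450186)` separates the 4 from the 50186, and `predict_reqid(450186, seconds_elapsed=100)` reproduces the 450286 mentioned above.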

deliter · March 19, 2015

I had a look at that. It's some very difficult stuff. The site is XHTML, so it runs XML. I tried firing a couple of old user agents at it with no success (I might try older ones).

I also had a look at this when making my Gmail bot. The site is exactly the same: XHTML, reqid and everything else. I needed to know which mails were new and which were old, and it really bothered me. I used to just log in to Gmail and watch the traffic with Firebug, and I was messing around with randomly inputting some of the strings that came up in the URL POSTs, literally mashing that garbage together, till BAM, the page just returned pure XML that had everything in it, I mean everything, and was easy to scrape. So playing around until it errors into firing out its XML can also work.

There might be something on the internet about returning pure XML from an XHTML URL.
