ilovepizza, posted March 15, 2015 (edited)

Hi all,

I am trying to scrape Google reviews via socks.

The problem: Google only displays approx. 8 reviews when the Google profile is loaded. The next batch of reviews is loaded when the "More" button is pressed. I guess it is loaded via JavaScript. The Google profile URL remains unchanged.

My question: Is there any way to scrape all reviews of a profile via socks?

PS: I haven't tried Aymen's HTTP Post plugin. Would it work with Aymen's plugin? How?

Any idea is appreciated. Thank you.

-----

I am sorry, I am very new to this, so maybe none of this makes much sense, but here is what I have found so far:

When the "More" button is pressed, this is what happens (see below). So I figured I need to reconstruct the request URL and send something to it via socket navigate POST. You can actually scrape the ozv= and the f.sid= parameters from the page first. I assume I can leave avw= and rt= as they are. But I don't know about the request ID (_reqid). It seems to be generated with every click on the "More" button. If I POST to the request URL, nothing happens, maybe because of the wrong reqid. Since I am new to this, I don't know if I am completely on the wrong track...
Request URL: https://plus.google.com/_/pages/local/loadreviews?ozv=es_oz_20150312.07_p1&avw=pr%3Apr&f.sid=-1412152857337063380&_reqid=640891&rt=j
Request Method: POST
Status Code: 200 OK

Request Headers
Content-Type: application/x-www-form-urlencoded;charset=UTF-8
Origin: https://plus.google.com
Referer: https://plus.google.com/
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.83 Safari/537.1
X-Same-Domain: 1

Query String Parameters
ozv: es_oz_20150312.07_p1
avw: pr:pr
f.sid: -1412152857337063380
_reqid: 640891
rt: j

Form Data
f.req: ["6504450088010350842",null,[null,null,[[28,18,10],[30]]],[null,null,null,true,11,true],null,null,[360,2,[110,0]]]
:

Response Headers
Alternate-Protocol: 443:quic,p=0.5
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Content-Disposition: attachment; filename="response.bin"; filename*=UTF-8''response.bin
Content-Encoding: gzip
Content-Type: application/json; charset=utf-8
Date: Sun, 15 Mar 2015 10:23:05 GMT
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Pragma: no-cache
Server: GSE
Transfer-Encoding: chunked
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block

Edited March 15, 2015 by jens.wagner@freshamedia.de
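For anyone following along, here is a rough Python sketch of the "reconstruct the request URL and POST to it" idea from the post above. It only rebuilds the request from the captured values and does not send anything. The helper name build_loadreviews_request is made up for illustration, and whether the server would accept a rebuilt request (in particular the _reqid) is exactly the open question in this thread.

```python
import json
from urllib.parse import urlencode, parse_qs
from urllib.request import Request

# Endpoint and payload copied verbatim from the capture in the post above.
BASE_URL = "https://plus.google.com/_/pages/local/loadreviews"
CAPTURED_F_REQ = (
    '["6504450088010350842",null,[null,null,[[28,18,10],[30]]],'
    '[null,null,null,true,11,true],null,null,[360,2,[110,0]]]'
)


def build_loadreviews_request(ozv, f_sid, reqid, f_req, avw="pr:pr", rt="j"):
    """Hypothetical helper: rebuild the captured POST without sending it."""
    # Query string parameters, in the same order as the captured URL.
    query = urlencode(
        {"ozv": ozv, "avw": avw, "f.sid": f_sid, "_reqid": reqid, "rt": rt}
    )
    url = f"{BASE_URL}?{query}"
    # The form body is a single URL-encoded field named f.req.
    body = urlencode({"f.req": f_req}).encode("utf-8")
    # Request headers taken from the capture (User-Agent omitted here).
    headers = {
        "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
        "Origin": "https://plus.google.com",
        "Referer": "https://plus.google.com/",
        "X-Same-Domain": "1",
    }
    return Request(url, data=body, headers=headers, method="POST")


request = build_loadreviews_request(
    ozv="es_oz_20150312.07_p1",    # scraped from the page, per the post
    f_sid="-1412152857337063380",  # scraped from the page, per the post
    reqid=640891,                  # value from the capture; generation unknown
    f_req=CAPTURED_F_REQ,
)
print(request.full_url)
```

With these inputs the rebuilt URL matches the captured Request URL character for character, so the encoding side at least is right. Sending it would be a matter of passing the object to urllib.request.urlopen, but that part is untested here.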
ilovepizza (author), posted March 16, 2015

Ok, it seems the reqid is assigned for each "session" (e.g. 450186). Once the page is loaded, the reqid is set. If you reload the Google profile, you get a fresh reqid, and the difference between the two is the number of seconds that have passed between the two page loads. That means the "reqid-to-be-assigned" counts upwards with every second that passes. If 100 seconds have passed, the new reqid will be 450286.

The first digit of the reqid is a counter for the number of requests made with this reqid. In our example it would be the 4 in 450186. With every request this number counts up.

Now we only need to figure out how the "50186" part of the "450186" reqid is generated in the first place. I would assume that it has something to do with a Unix time difference, possibly between dates given by headers from the GET requests to gstatic that occur on page load?
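The reqid observations above can be written down as a small arithmetic model. This is only a sketch of the guess in the post, not confirmed behaviour: it assumes the reqid is a leading request counter followed by a five-digit base, and that the base wraps at 100000 (the wrap is my assumption, the post does not say what happens there). The function names are invented for illustration.

```python
# Assumption: the low five digits are the time-based part, the rest is the counter.
BASE_MODULUS = 100_000


def split_reqid(reqid):
    """Split a reqid into (request counter, five-digit base)."""
    return divmod(reqid, BASE_MODULUS)


def reqid_for_next_request(reqid):
    """Per the post: each request made on the same page load bumps the leading counter."""
    counter, base = split_reqid(reqid)
    return (counter + 1) * BASE_MODULUS + base


def reqid_after_reload(reqid, seconds_elapsed):
    """Per the post: a reload shifts the base by the number of seconds elapsed."""
    counter, base = split_reqid(reqid)
    # Wrapping at BASE_MODULUS is an assumption, not something observed.
    return counter * BASE_MODULUS + (base + seconds_elapsed) % BASE_MODULUS


print(split_reqid(450186))              # (4, 50186)
print(reqid_after_reload(450186, 100))  # 450286, the example from the post
print(reqid_for_next_request(450186))   # 550186
```

How the initial base (the 50186 part) is derived is still unknown, so the model can only extrapolate from a reqid that has already been observed.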
deliter, posted March 19, 2015

Had a look at that, some very difficult stuff. The site is XHTML, so it runs XML. I tried firing a couple of old user agents at it, no success (might try older ones).

I did have a look at this when making my Gmail bot. The site is exactly the same: XHTML, reqid and everything else. I needed to know which mails were new and which were old etc., and it really bothered me. I used to just log in to Gmail and watch the traffic with Firebug, and was messing around with randomly inputting some of the strings that came up in the URL POSTs, literally mashing that garbage together, till BAM, the page just returns pure XML which has everything in it, I mean everything, and is easy to scrape.

So playing around till it errors into firing out its XML can also work. There might be something on the internet about returning pure XML from an XHTML URL.