darknode, posted October 21, 2013 (edited):

I'm trying to scrape the time field from Google search results. However, when Google displays a thumbnail next to a result, the markup changes and breaks the script. To see this, run

    add list to list(%time, $scrape attribute(<outerhtml=w"<div class=\"f slp\">*</div>">, "innertext"), "Don\'t Delete", "Global")
    add list to list(%time2, $scrape attribute(<outerhtml=w"<span class=\"f\">*</span>">, "innertext"), "Delete", "Global")

on the URL https://www.google.com/#q=open+source+site:youtube.com&tbs=qdr:d

You should see results like this: http://i.imgur.com/bi1ZuXN.png

Is it possible to create some type of if statement that keeps the list order correct for every item, rather than creating multiple lists that lose the order they should be in? There should be only 10 results, in one list.

iDollarsteam, posted October 21, 2013:

Try this:

    clear list(%time)
    navigate("https://www.google.ro/search?q=open+source+site:youtube.com&tbs=qdr:d&bav=on.2,or.r_qf.&bvm=pv.xjs.s.en_US.O2lQuQLBa4Q.O#q=open+source+site:youtube.com&start=30&tbs=qdr:d", "Wait")
    wait for browser event("DOM Ready", "")
    set(#page, $document text, "Global")
    add list to list(%time, $find regular expression(#page, "(?<=class=\"f)[\\w\\W]*?ago"), "Don\'t Delete", "Global")
    set list position(%time, 0)
    clear list(%time1)
    loop($list total(%time)) {
        set(#element, $next list item(%time), "Global")
        if($contains(#element, "slp\">")) {
            then {
                set(#rawElement, $find regular expression(#element, "(?<=slp\">).*?ago"), "Global")
                set(#newElement, $replace(#rawElement, "</div><span class=\"st\"><span class=\"f\">", $nothing), "Global")
                add item to list(%time1, #newElement, "Delete", "Global")
            }
        }
    }

darknode, posted October 21, 2013:

(Quoting iDollarsteam's script.)

I'm showing 0 items found in the debugger. Hmm.

k1lv9h, posted October 21, 2013:

Hi,

Sample code: sample-google-search-format-001.ubot (attachment)

Kevin

iDollarsteam, posted October 21, 2013:

(Quoting darknode: "I'm showing 0 items found in the debugger.")

It works perfectly on my end.

Steve, posted October 21, 2013:

(Quoting iDollarsteam: "It works perfectly on my end.")

I gave it a test run too and was also getting 0 items. The page that loaded didn't seem to show the ad time at all.

darknode, posted October 21, 2013:

(Quoting Kevin's sample code, sample-google-search-format-001.ubot.)

I ran the script, but it does not collect what I'm trying to collect in the screenshot and example I provided. I'm not even sure what the sample script is attempting to do.

k1lv9h, posted October 21, 2013:

Hi,

Yeah, you're right. It collects the blue title, the green URL, and the time if available, then places the values in a three-column list. It will process regular search results as well as results narrowed to the last 24 hours (&tbs=qdr:d). The data cannot get out of order within a search result page.

So you got me,

Kevin

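The order-preserving idea Kevin describes (one entry per search result, with the time filled in only "if available") can be sketched outside uBot. The Python below is only an illustration, not the contents of the attached .ubot file: the sample HTML blocks, `SAMPLE_RESULTS`, `TIME_PATTERN`, and `times_in_order` are all hypothetical, and real Google markup differs and changes often.

```python
import re

# Hypothetical, simplified result blocks (one string per search result).
SAMPLE_RESULTS = [
    '<div class="s"><div class="f slp">3 hours ago</div><span class="st">Video one</span></div>',
    '<div class="s"><span class="st">No timestamp on this result</span></div>',
    '<div class="s"><span class="st"><span class="f">1 day ago</span> Video three</span></div>',
]

# Assumed timestamp format: "<1-2 digits> <unit>[s] ago".
TIME_PATTERN = re.compile(r"\d{1,2}\s(?:min|hour|day)s?\sago")


def times_in_order(result_blocks):
    """Return one entry per result, using "" when a result has no timestamp."""
    times = []
    for block in result_blocks:
        match = TIME_PATTERN.search(block)
        # Appending a placeholder keeps index i aligned with result i.
        times.append(match.group(0) if match else "")
    return times


times = times_in_order(SAMPLE_RESULTS)
print(times)
```

Because every result contributes exactly one entry, the list keeps the same length and order as the results, whichever markup variant a given result uses.
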
J Bot, posted October 24, 2013:

While grabbing the whole page is easier, I prefer to find the highest-level element I HAVE to scrape in order to capture everything I want. In this case that is a DIV class="s", and I then run my regex against that.

The issue is that the nesting structure of the resulting HTML is inconsistent.

BOTH instances (with or without an image) contain the DIV class="f slp".

The only thing I found to be consistent was the format of the string you are looking for: "nn hours ago".

With a 24-hour search, I did have to allow for Google returning "1 day ago".

So, here you go:

    navigate("https://www.google.com/#q=open+source+site:youtube.com&tbs=qdr:d", "Wait")
    clear list(%time)
    set(#ResultsOnly, $scrape attribute(<class="s">, "outerhtml"), "Global")
    add list to list(%time, $find regular expression(#ResultsOnly, "\\d\{1,2\}\\s(hours|day)\\sago"), "Don\'t Delete", "Global")

darknode, posted October 24, 2013:

(Quoting J Bot's post and script.)

I'll try it out. I think one result also says "mins ago".

J Bot, posted October 25, 2013:

You should be able to modify that regex by simply adding additional options in the parentheses, e.g. (hours|day|mins). I'm not certain that nesting works, but you could try it to make the code more flexible.

Edit: tested

    \d{1,2}\s(min|hour|day|week|month)(\s|s\s)ago

This one should return both singular and plural.

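As a quick check of the singular/plural pattern outside uBot, here is a minimal Python sketch. It assumes the pattern is written as a plain regex with braces around the quantifier; the sample sentence is made up, and `PATTERN`, `sample`, and `matches` are illustrative names only.

```python
import re

# The tested pattern, written as a plain regex with a {1,2} quantifier.
PATTERN = re.compile(r"\d{1,2}\s(min|hour|day|week|month)(\s|s\s)ago")

# Made-up text containing singular and plural forms.
sample = "Uploaded 1 day ago. Another 12 hours ago, one 5 mins ago, and 1 hour ago."

# finditer + group(0) returns the full match; findall would return
# tuples of the two capture groups instead.
matches = [m.group(0) for m in PATTERN.finditer(sample)]
print(matches)
```

Note that the pattern accepts "min" and "mins" but not the spelled-out "minutes", because the unit must be followed directly by whitespace or by "s" plus whitespace.
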
darknode, posted October 25, 2013:

I get a bunch of stuff in #ResultsOnly but nothing in %time.

Bill, posted October 25, 2013:

I'm not sure if you're looking for just one list. In this example I added another list, %time3. Is this what you're looking for?

    navigate("https://www.google.com/#q=open+source+site:youtube.com&tbs=qdr:d", "Wait")
    wait for browser event("DOM Ready", "")
    add list to list(%time, $scrape attribute(<outerhtml=w"<div class=\"f slp\">*</div>">, "innertext"), "Don\'t Delete", "Global")
    add list to list(%time2, $scrape attribute(<outerhtml=w"<span class=\"f\">*</span>">, "innertext"), "Delete", "Global")
    add list to list(%time3, $list from text($find regular expression($document text, "(?<=[a-zA-Z]+\\\"\\>)\\d+\\s\\w+\\s\\w+"), ""), "Delete", "Global")

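Bill's %time3 pattern uses a variable-width lookbehind, `(?<=[a-zA-Z]+">)`, which .NET-style regex engines accept but Python's `re` module rejects. To show what the pattern matches, the sketch below swaps in a capture group with the same intent; this is a substitution for illustration, not Bill's exact regex, and the sample markup is made up.

```python
import re

# Made-up markup for illustration; real Google HTML differs.
sample = (
    '<div class="f slp">3 hours ago</div>'
    '<span class="f">1 day ago</span>'
    '<a class="title">10 best videos</a>'
)

# Capture-group stand-in for: (?<=[a-zA-Z]+">)\d+\s\w+\s\w+
pattern = re.compile(r'[a-zA-Z]+">(\d+\s\w+\s\w+)')

# With a single capture group, findall returns just the captured text.
found = pattern.findall(sample)
print(found)
```

The third match shows the trade-off: "a number followed by two words" is loose enough to pick up text that is not a timestamp, so anchoring on the word "ago" is the safer filter.
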
J Bot, posted October 25, 2013:

Just to make sure we're on the same page: http://www.screencast.com/t/tIMNrdEYA

darknode, posted October 25, 2013:

(Quoting J Bot's screencast link.)

This was very informative, thank you. I'm sure this will help with more than Google, although I haven't run into the issue anywhere else.

darknode, posted October 28, 2013:

(Quoting J Bot's screencast link.)

I just got a chance to try it out. I'm not getting results in the list, but the scrape is working as you showed, so I may be typing the regex incorrectly. Can you please post the regex you used in the video so I can compare?

Frank, posted October 28, 2013:

When it comes to Google, the code it serves differs between countries. I also recommend learning the page's markup (it is very obscure) and then using tools like regex to get exactly what you need.

Frank

darknode, posted October 29, 2013:

(Quoting Frank's post.)

I tried \d(1,2)\s(min|hour|day)(\s|s\s)ago and was not having any luck.

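One likely reason the last attempt returns nothing is the `(1,2)` after `\d`: in standard regex syntax, parentheses form a group that matches the literal text "1,2", while braces `{1,2}` are the "one or two times" quantifier. The Python sketch below compares the two on a made-up string; it illustrates plain regex behavior only, and how uBot expects braces to be escaped in code view is not covered here.

```python
import re

sample = "12 hours ago"

# Parentheses: one digit followed by the literal characters "1,2".
wrong = re.compile(r"\d(1,2)\s(min|hour|day)(\s|s\s)ago")

# Braces: one or two digits.
right = re.compile(r"\d{1,2}\s(min|hour|day)(\s|s\s)ago")

wrong_match = wrong.search(sample)
right_match = right.search(sample)

print(wrong_match)           # no match on an ordinary timestamp
print(right_match.group(0))  # the full timestamp
```

The parenthesized version only matches odd strings such as "71,2 hours ago", which never appear in search results, so the list stays empty.
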