conditional scraping possible??

darknode · October 21, 2013

I'm trying to scrape the time field from Google. However, when Google displays a thumbnail for a result, it breaks the script.

To explain, if you run

add list to list(%time, $scrape attribute(<outerhtml=w"<div class=\"f slp\">*</div>">, "innertext"), "Don\'t Delete", "Global")
add list to list(%time2, $scrape attribute(<outerhtml=w"<span class=\"f\">*</span>">, "innertext"), "Delete", "Global")

on the url

https://www.google.com/#q=open+source+site:youtube.com&tbs=qdr:d

you should see results like this

http://i.imgur.com/bi1ZuXN.png

Is it possible to create some type of if statement that keeps the list order correct for every item, rather than having to create multiple lists that will lose the order they should be in?

There should only be 10 results, all in one list.

Edited October 21, 2013 by darknode

iDollarsteam · October 21, 2013

Try this:

clear list(%time)
navigate("https://www.google.ro/search?q=open+source+site:youtube.com&tbs=qdr:d&bav=on.2,or.r_qf.&bvm=pv.xjs.s.en_US.O2lQuQLBa4Q.O#q=open+source+site:youtube.com&start=30&tbs=qdr:d", "Wait")
wait for browser event("DOM Ready", "")
set(#page, $document text, "Global")
add list to list(%time, $find regular expression(#page, "(?<=class=\"f)[\\w\\W]*?ago"), "Don\'t Delete", "Global")
set list position(%time, 0)
clear list(%time1)
loop($list total(%time)) {
    set(#element, $next list item(%time), "Global")
    if($contains(#element, "slp\">")) {
        then {
            set(#rawElement, $find regular expression(#element, "(?<=slp\">).*?ago"), "Global")
            set(#newElement, $replace(#rawElement, "</div><span class=\"st\"><span class=\"f\">", $nothing), "Global")
            add item to list(%time1, #newElement, "Delete", "Global")
        }
    }
}
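
For anyone who wants to check the two-stage logic of the script above outside UBot, here is a minimal Python sketch of the same steps: grab every chunk from `class="f` up to the next "ago", keep only chunks containing `slp">`, then strip the thumbnail markup. The sample markup is hypothetical (it is not copied from a real Google page), and `extract_times` is a made-up name. Python's `re` is assumed to behave like UBot's regex engine for these particular patterns.

```python
import re

# Hypothetical markup, NOT copied from a real Google results page.
SAMPLE_PAGE = (
    '<div class="f slp">5 hours ago - Uploaded by someone</div>'
    '<div class="f slp"></div><span class="st"><span class="f">2 hours ago</span></span>'
    '<span class="f">1 day ago</span>'
)

# Stage 1: everything after class="f up to the first "ago" (lazy match).
CHUNK_RE = re.compile(r'(?<=class="f)[\w\W]*?ago')
# Stage 2: the part of a chunk that follows slp">.
AFTER_SLP_RE = re.compile(r'(?<=slp">).*?ago')
# Markup that sits between the slp div and the time in the thumbnail case.
THUMBNAIL_MARKUP = '</div><span class="st"><span class="f">'


def extract_times(page):
    """Return the time strings found in chunks that contain slp">."""
    times = []
    for chunk in CHUNK_RE.findall(page):
        if 'slp">' not in chunk:
            continue  # same filter as the $contains check in the script
        raw = AFTER_SLP_RE.search(chunk)
        if raw:
            times.append(raw.group(0).replace(THUMBNAIL_MARKUP, ""))
    return times


print(extract_times(SAMPLE_PAGE))
```

Note that a time sitting in a bare `<span class="f">` with no `slp` div before it is skipped by the filter, so the result can hold fewer entries than there are search results.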

darknode · October 21, 2013

I'm showing 0 items found in the debugger.

hmm.

k1lv9h · October 21, 2013

Hi,

Sample code:

sample-google-search-format-001.ubot

Kevin

iDollarsteam · October 21, 2013

It works perfectly on my end

Steve · October 21, 2013

I gave it a test run too and also got 0 items. The page that loaded didn't seem to be showing the ad time at all.

darknode · October 21, 2013

I ran the script, but it does not even collect what I'm trying to collect in the screenshot and example provided.

I'm not even sure what the sample script is attempting to do.

k1lv9h · October 21, 2013

Hi,

Yeah, you're right.

It collects the blue colored title, green colored url and time if available. Then places the values in a three column list.

It will process regular search results as well as results narrowed to the last 24 hours(&tbs=qdr:d).

There is no possibility of the data getting out of order within a search result page.

So you got me,

Kevin
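
The attachment is not reproduced in the thread, so the following is not Kevin's script. It is only a minimal Python sketch of the idea he describes: process one search result at a time and write a row of title, URL, and time, leaving the time empty when a result has none. That keeps every row aligned with its result. The markup, the `<li class="g">` container, and the names `scrape_rows` and `first_match` are all assumptions made for illustration.

```python
import re

# Hypothetical markup: three result blocks, the second has no time stamp.
SAMPLE_HTML = (
    '<li class="g"><h3>First video</h3><cite>youtube.com/a</cite>'
    '<div class="f slp">3 hours ago</div></li>'
    '<li class="g"><h3>Second video</h3><cite>youtube.com/b</cite></li>'
    '<li class="g"><h3>Third video</h3><cite>youtube.com/c</cite>'
    '<span class="f">1 day ago</span></li>'
)

RESULT_RE = re.compile(r'<li class="g">(.*?)</li>', re.S)  # one block per result
TITLE_RE = re.compile(r'<h3>(.*?)</h3>', re.S)
URL_RE = re.compile(r'<cite>(.*?)</cite>', re.S)
TIME_RE = re.compile(r'\d{1,2}\s(?:min|hour|day)s?\sago')


def first_match(pattern, text, group=0):
    """Return the first match of pattern in text, or "" when there is none."""
    m = pattern.search(text)
    return m.group(group) if m else ""


def scrape_rows(html):
    """Build one (title, url, time) row per result block, in page order."""
    rows = []
    for block in RESULT_RE.findall(html):
        rows.append((
            first_match(TITLE_RE, block, 1),
            first_match(URL_RE, block, 1),
            first_match(TIME_RE, block),  # "" acts as the placeholder
        ))
    return rows


for row in scrape_rows(SAMPLE_HTML):
    print(row)
```

The conditional part is the empty-string placeholder: a result without a time still produces a row, so the third time never slides up into the second slot.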

J Bot · October 24, 2013

While grabbing the whole page is easier, I prefer to find the highest level element I HAVE to scrape in order to capture everything I want. In this case that is a DIV class="s" -- then I run my regex against that.

The issue is that the nesting structure of the resulting html is inconsistent.

BOTH instances (with or without an image) contain the DIV class="f slp".

The only thing I found to be consistent was the format of the resulting string you are looking for: "nn hours ago".

With a 24 hour search, I did have to allow for Google returning "1 day ago".

so.... Here ya go.

navigate("https://www.google.com/#q=open+source+site:youtube.com&tbs=qdr:d", "Wait")
clear list(%time)
set(#ResultsOnly, $scrape attribute(<class="s">, "outerhtml"), "Global")
add list to list(%time, $find regular expression(#ResultsOnly, "\\d\{1,2\}\\s(hours|day)\\sago"), "Don\'t Delete", "Global")

darknode · October 24, 2013

I'll try it out. I think one also says "mins ago".

J Bot · October 25, 2013

You should be able to modify that regex by simply adding additional options in the parentheses, eg.

(hours|day|mins)

not certain that nesting works, but you could try it to make the code more flexible....

Edit: tested

\d{1,2}\s(min|hour|day|week|month)(\s|s\s)ago

this one should return both singular and plural.
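
A quick way to see the difference between the first pattern and the broadened one is to run both over a few strings. This is a Python sketch, not UBot code. It assumes the intended quantifier is `{1,2}` with plain braces, and it swaps the capturing groups for non-capturing `(?:...)` groups, which does not change what matches. The sample strings are made up.

```python
import re

# First pattern from the thread: plural "hours" or singular "day" only.
NARROW = re.compile(r"\d{1,2}\s(?:hours|day)\sago")
# Broadened pattern: optional trailing "s" handled by (\s|s\s).
BROAD = re.compile(r"\d{1,2}\s(?:min|hour|day|week|month)(?:\s|s\s)ago")

SAMPLES = ["1 hour ago", "5 hours ago", "1 day ago", "12 mins ago", "2 weeks ago"]

narrow_hits = [s for s in SAMPLES if NARROW.fullmatch(s)]
broad_hits = [s for s in SAMPLES if BROAD.fullmatch(s)]

print("narrow:", narrow_hits)
print("broad: ", broad_hits)
```

The narrow pattern skips "1 hour ago", "12 mins ago", and "2 weeks ago", while the broad one accepts all five samples.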

darknode · October 25, 2013

I get a bunch of stuff in #ResultsOnly but nothing in %time.

Bill · October 25, 2013

I'm not sure if you're looking for just one list.

In this example I added another list, "time3". Is this what you're looking for?

navigate("https://www.google.com/#q=open+source+site:youtube.com&tbs=qdr:d", "Wait")
wait for browser event("DOM Ready", "")
add list to list(%time, $scrape attribute(<outerhtml=w"<div class=\"f slp\">*</div>">, "innertext"), "Don\'t Delete", "Global")
add list to list(%time2, $scrape attribute(<outerhtml=w"<span class=\"f\">*</span>">, "innertext"), "Delete", "Global")
add list to list(%time3, $list from text($find regular expression($document text, "(?<=[a-zA-Z]+\\\"\\>)\\d+\\s\\w+\\s\\w+"), "
"), "Delete", "Global")

J Bot · October 25, 2013

just to make sure we're on the same page.....

http://www.screencast.com/t/tIMNrdEYA

darknode · October 25, 2013

This was very informative, thank you. I'm sure this will help with more than Google, although I've not run into the issue anywhere else.

darknode · October 28, 2013

Just got a chance to try it out. I'm not getting results in the list, but the scrape is working as you showed, so it may be me typing the regex incorrectly. Can you please post the regex you used in the video so I can compare?

Frank · October 28, 2013

When it comes to Google, there will be differences between countries in the markup that they serve. I also recommend learning the page source (it is very obscure) and then using tools like regex to get exactly what you need.

Frank

darknode · October 29, 2013

I tried

\d(1,2)\s(min|hour|day)(\s|s\s)ago

and was not having any luck
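
One likely reason, shown here as a small Python sketch rather than UBot code: in a regex, `(1,2)` is a group that matches the literal text "1,2", whereas `{1,2}` is the quantifier meaning "one or two of the preceding item". The assumption is that UBot's regex engine treats parentheses and braces the same way Python's `re` does, which holds for common regex flavors. The test strings are made up.

```python
import re

# Pattern as typed in the post above: parentheses after \d.
PARENS = re.compile(r"\d(1,2)\s(min|hour|day)(\s|s\s)ago")
# Same pattern with the quantifier written in braces.
BRACES = re.compile(r"\d{1,2}\s(min|hour|day)(\s|s\s)ago")

text = "Uploaded 12 hours ago"

parens_match = PARENS.search(text)
braces_match = BRACES.search(text)

print("parens:", parens_match)           # no match on ordinary text
print("braces:", braces_match.group(0))

# The parentheses version only matches a digit followed by the literal "1,2".
odd_match = PARENS.search("31,2 hours ago")
print("parens on odd input:", odd_match.group(0))
```

So if the regex was typed with round brackets, it would find nothing on a normal results page even though the scrape itself works.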
