UBot Underground

conditional scraping possible??



I'm trying to scrape the time field from Google. However, when Google displays a thumbnail, it breaks the script.

 

to explain this, if you run 

add list to list(%time, $scrape attribute(<outerhtml=w"<div class=\"f slp\">*</div>">, "innertext"), "Don\'t Delete", "Global")
add list to list(%time2, $scrape attribute(<outerhtml=w"<span class=\"f\">*</span>">, "innertext"), "Delete", "Global")

on the url

https://www.google.com/#q=open+source+site:youtube.com&tbs=qdr:d

you should see results like this

 

http://i.imgur.com/bi1ZuXN.png

 

 

Is it possible to create some type of if statement to keep the list order correct for every item, rather than having to create multiple lists that will lose the order they should be in?

there should only be 10 results in one list.

Edited by darknode
Link to post
Share on other sites

try this :

clear list(%time)
navigate("https://www.google.ro/search?q=open+source+site:youtube.com&tbs=qdr:d&bav=on.2,or.r_qf.&bvm=pv.xjs.s.en_US.O2lQuQLBa4Q.O#q=open+source+site:youtube.com&start=30&tbs=qdr:d", "Wait")
wait for browser event("DOM Ready", "")
set(#page, $document text, "Global")
add list to list(%time, $find regular expression(#page, "(?<=class=\"f)[\\w\\W]*?ago"), "Don\'t Delete", "Global")
set list position(%time, 0)
clear list(%time1)
loop($list total(%time)) {
    set(#element, $next list item(%time), "Global")
    if($contains(#element, "slp\">")) {
        then {
            set(#rawElement, $find regular expression(#element, "(?<=slp\">).*?ago"), "Global")
            set(#newElement, $replace(#rawElement, "</div><span class=\"st\"><span class=\"f\">", $nothing), "Global")
            add item to list(%time1, #newElement, "Delete", "Global")
        }
    }
}

Link to post
Share on other sites

 

i'm showing 0 items found in the debugger.

hmm.

Link to post
Share on other sites

 

 

 

It works perfectly on my end

Link to post
Share on other sites

Hi,

 

Sample code:

Attachment: sample-google-search-format-001.ubot

 

Kevin

I ran the script, but it does not even collect what I'm trying to collect in the screenshot and example provided.

I'm not even sure what the sample script is attempting to do.

Link to post
Share on other sites

Hi,

 

Yeah, you're right.

 

It collects the blue colored title, green colored url and time if available. Then places the values in a three column list.

 

It will process regular search results as well as results narrowed to the last 24 hours(&tbs=qdr:d).

 

The data cannot get out of order within a search result page.

 

So you got me,

 

Kevin

Link to post
Share on other sites

While grabbing the whole page is easier, I prefer to find the highest level element I HAVE to scrape in order to capture everything I want.  In this case that is a DIV class="s" -- then I run my regex against that.

The issue is that the nesting structure of the resulting html is inconsistent.

BOTH instances... (with or without an image) contain the DIV class="f slp"

The only thing I found to be consistent was the format of the resulting string you are looking for "nn hours ago"

With a 24 hour search, I did have to allow for Google returning "1 day ago"

so.... Here ya go.

navigate("https://www.google.com/#q=open+source+site:youtube.com&tbs=qdr:d", "Wait")
clear list(%time)
set(#ResultsOnly, $scrape attribute(<class="s">, "outerhtml"), "Global")
add list to list(%time, $find regular expression(#ResultsOnly, "\\d\{1,2\}\\s(hours|day)\\sago"), "Don\'t Delete", "Global")
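
To keep exactly one entry per result (the original question), one option is to run the regex against each result block separately and store a blank when nothing matches. This is only a rough Python sketch of that logic, not UBot code: the sample HTML strings, the `times_in_order` helper, and the widened unit list are my own assumptions for illustration.

```python
import re

# Hypothetical stand-ins for the outerhtml of each <div class="s"> result block.
result_blocks = [
    '<div class="s"><div class="f slp">3 hours ago</div><span class="st">Some video</span></div>',
    '<div class="s"><span class="st">No time shown for this one</span></div>',
    '<div class="s"><div class="f slp"></div><span class="st"><span class="f">1 day ago</span> Another</span></div>',
]

# One or two digits, a unit (optionally plural), then "ago".
TIME_RE = re.compile(r"\d{1,2}\s(?:min|hour|day)s?\sago")


def times_in_order(blocks, placeholder=""):
    """Return one entry per block: the matched time, or the placeholder if none."""
    times = []
    for block in blocks:
        match = TIME_RE.search(block)
        # The conditional step: always append something so positions stay aligned.
        times.append(match.group(0) if match else placeholder)
    return times


print(times_in_order(result_blocks))  # ['3 hours ago', '', '1 day ago']
```

In UBot the same idea would presumably be a loop over the scraped blocks with an if/then/else that adds either the match or an empty item. The target list would need "Don't Delete", since removing duplicates would collapse repeated values such as two "3 hours ago" entries and shift the order.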
Link to post
Share on other sites

 

i'll try it out, i think one also says mins ago

Link to post
Share on other sites

You should be able to modify that regex by simply adding additional options in the parentheses, e.g.

 

(hours|day|mins)

 

not certain that nesting works, but you could try it to make the code more flexible....

 

Edit: tested

\d{1,2}\s(min|hour|day|week|month)(\s|s\s)ago

 

this one should return both singular and plural. 
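
A quick way to sanity-check a pattern like this outside UBot is to run it over a few sample strings. The Python sketch below assumes the raw pattern is `\d{1,2}\s(min|hour|day|week|month)(\s|s\s)ago` (no UBot string escaping); the sample strings and the `matches` helper are made up for the check.

```python
import re

# Assumed raw form of the pattern from the post, without UBot string escaping.
pattern = re.compile(r"\d{1,2}\s(min|hour|day|week|month)(\s|s\s)ago")


def matches(text):
    """True if the text contains a relative time such as '5 mins ago'."""
    return pattern.search(text) is not None


# Hypothetical sample strings covering singular, plural, and a non-match.
samples = ["5 mins ago", "1 hour ago", "23 hours ago", "1 day ago", "2 weeks ago", "ago 5 hours"]
for text in samples:
    print(text, "->", matches(text))
```

The first five samples match and the last one does not, because the pattern requires the word "ago" to come after the unit.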

Link to post
Share on other sites

I'm not sure if you're looking for just one list.

 

In this example I added another list, "time3". Is this what you're looking for?

 

navigate("https://www.google.com/#q=open+source+site:youtube.com&tbs=qdr:d", "Wait")
wait for browser event("DOM Ready", "")
add list to list(%time, $scrape attribute(<outerhtml=w"<div class=\"f slp\">*</div>">, "innertext"), "Don\'t Delete", "Global")
add list to list(%time2, $scrape attribute(<outerhtml=w"<span class=\"f\">*</span>">, "innertext"), "Delete", "Global")
add list to list(%time3, $list from text($find regular expression($document text, "(?<=[a-zA-Z]+\\\"\\>)\\d+\\s\\w+\\s\\w+"), "
"), "Delete", "Global")

Link to post
Share on other sites

just to make sure we're on the same page.....

 

http://www.screencast.com/t/tIMNrdEYA

 

Just got a chance to try it out. I'm not getting results in the list, but the scrape is working as you showed, so it may be me typing the regex incorrectly. Can you please post the regex you used in the video so I can compare?

Link to post
Share on other sites

When it comes to Google, the code they serve differs between countries. I also recommend studying the page source (it is very obscure) and then using tools like regex to get exactly what you need.

 

Frank

Link to post
Share on other sites

I tried 

\d(1,2)\s(min|hour|day)(\s|s\s)ago 

and was not having any luck
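
One likely reason: `(1,2)` in parentheses is a group that matches the literal text "1,2", while `{1,2}` in braces is the quantifier meaning "one or two of the previous item". The Python sketch below compares the two on a made-up sample string; the variable names are only for illustration.

```python
import re

# Parentheses: one digit followed by the literal characters "1,2".
tried = re.compile(r"\d(1,2)\s(min|hour|day)(\s|s\s)ago")
# Braces: one or two digits.
fixed = re.compile(r"\d{1,2}\s(min|hour|day)(\s|s\s)ago")

text = "Uploaded 14 hours ago"  # hypothetical sample line

print(tried.search(text))           # None
print(fixed.search(text).group(0))  # 14 hours ago
```

The parenthesized version would only match odd input like "91,2 mins ago", so swapping the parentheses for braces is worth trying first.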

Link to post
Share on other sites
