UBot Underground

Scraping Stuff...isolating Right Link



Hey guys, so I am facing a small problem. I am trying to scrape only specific links; this is the idea:

 

I use a keyword to search YouTube,

then put all the links from the search results into a list.

Then I visit them one by one, and now comes the tricky part:

I have to click on "Show More" to check the YouTube description.

In most cases the videos I am looking for have a lot of links in their description; check out this example:

https://www.youtube.com/watch?v=hpqbzPj92HU

So if a link starts with www.something.com, scrape it into a table. If not, just skip to the next YouTube URL from the list.

 

I was wondering, is there a way to do this in regex? I am not good with regex, so I thought I would ask you guys.

 

So: go to YouTube,

search for a keyword,

scrape all the links from the search results,

visit them one by one,

on every visited link click on "Show More" and check if the description has links starting with www.something.com (see the regex sketch after this list),

if it does, save them into a table,

if it doesn't, skip it and go to the next YouTube link.
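A minimal sketch of what that regex check could look like, assuming uBot's $find regular expression takes .NET-style patterns (the sample text here is made up):

set(#sample,"Check out www.something.com plus http://something.org and http://www.other.net","Global")
comment("the (?<!\\S) lookbehind stops http://www... from matching, so only links that literally begin with www. come back")
set(#wwwlinks,$find regular expression(#sample,"(?<!\\S)www\\.[^\\s()<>]+"),"Global")
alert(#wwwlinks)

On that sample it should alert only www.something.com.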

 

 

 

 

My code so far; feel free to enhance it if you can.

 

ui text box("Search",#keywords)
clear cookies
set user agent("Firefox 6")
navigate("www.youtube.com","Wait")
wait for browser event("Everything Loaded","")
type text(<name="search_query">,#keywords,"Standard")
wait for browser event("Everything Loaded","")
click(<id="search-btn">,"Left Click","No")
wait(5)
add list to list(%urls,$scrape attribute(<href=w"/watch?v=*">,"fullhref"),"Delete","Global")
set(#position,0,"Global")
loop($list total(%urls)) {
    navigate($list item(%urls,#position),"Wait")
    wait for browser event("Everything Loaded","")
    click(<innerhtml="<span class=\"yt-uix-button-content\">Show more </span>">,"Left Click","No")
    set(#document,$scrape attribute(<id="action-panel-details">,"innertext"),"Global")
    increment(#position)
}
 

 

Any help is greatly appreciated. I am trying to work it out, but I am stuck on this problem.

 

Thanks, guys!


This is how you would extract specific URLs:

add list to list(%URLs, $find regular expression($document text, "(?<=href=\")http://www\\.keek\\.com[^\"]+"), "Delete", "Global")

P.S.: You should start using the "wait for element" command instead of fixed delays...
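For example, in place of the wait(5) after the search click, something along these lines could wait for the first result link instead (just a sketch; the exact parameters differ a little between uBot versions):

wait for element(<href=w"/watch?v=*">,"","Appear")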


Yes, I normally use it for waiting, but I just started working on this. Besides, I don't think you understood me correctly.

 

See, the links will be different. I need to tell the difference between links which start like this:

 

"www"

"http://"

 

 

The rest of the links will be random. I need to tell the difference between these two, so if a link is

http://something.com

it should be skipped.

 

But if link starts like this

www.something.com

 

then it should be saved into a spreadsheet!


So I got this far. A friend helped me build a regex that finds any URL in the document text. What I did was click on "Show more", put the entire description into a variable, and find the URLs in that description. So now I have another issue:

 

What kind of logic should I use to tell the difference between URLs that start with

 

www.something.

and

http://something.

 

As I mentioned, if a URL starts with www.something, then I need to save it inside a table.

But if a link starts with

http://something, then I need the script to skip it and not save it into the table.

 

This is my code so far:

 

 

ui text box("Search",#keywords)
clear cookies
set user agent("Firefox 6")
navigate("www.youtube.com","Wait")
wait for browser event("Everything Loaded","")
type text(<name="search_query">,#keywords,"Standard")
wait for browser event("Everything Loaded","")
click(<id="search-btn">,"Left Click","No")
wait(5)
add list to list(%urls,$scrape attribute(<href=w"/watch?v=*">,"fullhref"),"Delete","Global")
set(#position,0,"Global")
loop($list total(%urls)) {
    navigate($list item(%urls,#position),"Wait")
    wait for browser event("Everything Loaded","")
    click(<innerhtml="<span class=\"yt-uix-button-content\">Show more </span>">,"Left Click","No")
    wait(3)
    set(#description,$scrape attribute(<id="eow-description">,"innertext"),"Global")
    set(#find,$find regular expression(#description,"(([\\w-]+://?|www[.])[^\\s()<>]+)"),"Global")
    increment(#position)
}


The example page you showed doesn't contain the 1st type of URLs; that's why I only gave you code for the 2nd type.

 

I think it would be easiest to just scrape all the URLs and then check in a loop whether each URL matches the 1st or 2nd type, and act accordingly.
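For instance, assuming one scraped link sits in #url inside your loop, the check could look roughly like this (a sketch, not tested):

comment("1st type: the link literally starts with www.")
if($comparison($find regular expression(#url,"^www\\."),"=","www.")) {
    then {
        add item to list(%WWW_Urls,#url,"Delete","Global")
    }
    else {
        comment("2nd type (http://...) or anything else just gets skipped")
    }
}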


Just use this and throw it into a list:

 

set(#urls,"http://www.something.com
http://www.something2.com
http://something.com
something.com
www.something.com","Global")
set(#WWW,$find regular expression(#urls,".*www.*"),"Global")
alert(#WWW)

 

A way to do it is:

 

ui text box("Search",#keywords)
clear cookies
set user agent("Firefox 6")
navigate("www.youtube.com","Wait")
wait for browser event("Everything Loaded","")
type text(<name="search_query">,#keywords,"Standard")
wait for browser event("Everything Loaded","")
click(<id="search-btn">,"Left Click","No")
wait(5)
add list to list(%urls,$scrape attribute(<href=w"/watch?v=*">,"fullhref"),"Delete","Global")
set(#position,0,"Global")
clear list(%WWW_Urls)
loop($list total(%urls)) {
    navigate($list item(%urls,#position),"Wait")
    wait for browser event("Everything Loaded","")
    click(<innerhtml="<span class=\"yt-uix-button-content\">Show more </span>">,"Left Click","No")
    wait(3)
    add list to list(%WWW_Urls,$list from text($find regular expression($scrape attribute(<id="eow-description">,"innertext"),".*www.*"),$new line),"Delete","Global")
    increment(#position)
}
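If the end goal is a spreadsheet, something like this after the loop could dump %WWW_Urls to a CSV (the path is only an example, and ".*www.*" also matches http://www... lines, so swap in a stricter pattern if those should be left out):

create table from list(%WWW_Urls,&results)
save to file("{$special folder("Desktop")}\\www_urls.csv",&results)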

 

CD

