itexspert (December 27, 2014)

Hey guys, I am facing a small problem. I am scraping only specific links. The idea: I use a keyword to search YouTube, put all the links from the search results into a list, then visit them one by one. Now comes the tricky part: I have to click "Show more" to reveal the full YouTube description. In most cases the videos I am looking for have a lot of links in their description; check out this example: https://www.youtube.com/watch?v=hpqbzPj92HU

So if a link starts with www.something.com, scrape it into a table; if not, just skip to the next YouTube URL from the list. I was wondering, is there a way to do this with regex? You know I am not good with regex, so I thought to ask you guys.

In short:
- Go to YouTube
- Search for a keyword
- Scrape all links from the search results
- Visit them one by one
- On every visited link, click "Show more"
- Check if the description has links starting with www.something.com
- If it does, save them into a table
- If it doesn't, skip it and go to the next YouTube link

My code so far; feel free to enhance it if you can:

ui text box("Search",#keywords)
clear cookies
set user agent("Firefox 6")
navigate("www.youtube.com","Wait")
wait for browser event("Everything Loaded","")
type text(<name="search_query">,#keywords,"Standard")
wait for browser event("Everything Loaded","")
click(<id="search-btn">,"Left Click","No")
wait(5)
add list to list(%urls,$scrape attribute(<href=w"/watch?v=*">,"fullhref"),"Delete","Global")
set(#position,0,"Global")
loop($list total(%urls)) {
    navigate($list item(%urls,#position),"Wait")
    wait for browser event("Everything Loaded","")
    click(<innerhtml="<span class=\"yt-uix-button-content\">Show more </span>">,"Left Click","No")
    set(#document,$scrape attribute(<id="action-panel-details">,"innertext"),"Global")
    increment(#position)
}

Any help is greatly appreciated. I am trying to work it out but I am stuck on this problem. Thanks, guys!
UBotDev (December 27, 2014)

This is how you would extract specific URLs:

add list to list(%URLs, $find regular expression($document text, "(?<=href=\")http://www\\.keek\\.com[^\"]+"), "Delete", "Global")

P.S.: You should start using the "wait for element" command instead of fixed delays.
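For example, instead of the fixed wait(5) after clicking the search button, you could wait for the result links themselves to appear. A rough sketch only; double-check the exact arguments against your UBot version:

click(<id="search-btn">,"Left Click","No")
wait for element(<href=w"/watch?v=*">,"","Appear")
add list to list(%urls,$scrape attribute(<href=w"/watch?v=*">,"fullhref"),"Delete","Global")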
itexspert (December 27, 2014)

Yes, I normally use it for waiting, but I just started working on this. Besides, I don't think you understood me correctly. The links will be different; I need to distinguish between links that start like this:

www.
http://

The rest of each link will be random, so I need to tell these two apart: if a link is http://something.com it gets skipped, but if a link starts with www.something.com then it gets saved to a spreadsheet!
itexspert (December 29, 2014)

So I got this far. My friend helped me build a regex that finds any URL in the document text. What I did: I clicked "Show more", put the entire description inside a variable, and found the URLs in that description. So I have another issue right now: what kind of logic should I use to distinguish between URLs that start with www.something. and http://something.?

As I mentioned, if a URL starts with www.something, I need to save it into a table, but if a link starts with http://something, the script needs to skip it.

This is my code so far:

ui text box("Search",#keywords)
clear cookies
set user agent("Firefox 6")
navigate("www.youtube.com","Wait")
wait for browser event("Everything Loaded","")
type text(<name="search_query">,#keywords,"Standard")
wait for browser event("Everything Loaded","")
click(<id="search-btn">,"Left Click","No")
wait(5)
add list to list(%urls,$scrape attribute(<href=w"/watch?v=*">,"fullhref"),"Delete","Global")
set(#position,0,"Global")
loop($list total(%urls)) {
    navigate($list item(%urls,#position),"Wait")
    wait for browser event("Everything Loaded","")
    click(<innerhtml="<span class=\"yt-uix-button-content\">Show more </span>">,"Left Click","No")
    wait(3)
    set(#description,$scrape attribute(<id="eow-description">,"innertext"),"Global")
    set(#find,$find regular expression(#description,"(([\\w-]+://?|www[.])[^\\s()<>]+)"),"Global")
    increment(#position)
}
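One regex-only way to split the two types would be to run two separate patterns over #description. An untested sketch with made-up variable names (the lookbehind is there so the www. sitting inside an http://www. URL does not land in the first result):

set(#www_links,$find regular expression(#description,"(?<![\\w./])www\\.[^\\s()<>]+"),"Global")
set(#http_links,$find regular expression(#description,"http://[^\\s()<>]+"),"Global")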
UBotDev (December 29, 2014)

The example page you showed doesn't contain the first type of URL, which is why I only gave you code for the second type. I think it would be easiest to scrape all the URLs and then check in a loop whether each URL is of the first or second type, and act accordingly.
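Roughly like this. A sketch only: %all_links, &www_table, and the single-column table layout are placeholders I made up, so adapt the names and verify the comparison syntax against your UBot version:

clear table(&www_table)
set(#i,0,"Global")
loop($list total(%all_links)) {
    set(#link,$list item(%all_links,#i),"Global")
    if($comparison($find regular expression(#link,"^www\\."),"!=","")) {
        then {
            set table cell(&www_table,$table total rows(&www_table),0,#link)
        }
        else {
        }
    }
    increment(#i)
}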
Code Docta (Nick C.) (December 29, 2014)

Just use this and throw it into a list:

set(#urls,"http://www.something.com
http://www.something2.com
http://something.com
something.com
www.something.com","Global")
set(#WWW,$find regular expression(#urls,".*www.*"),"Global")
alert(#WWW)

A way to do it is:

ui text box("Search",#keywords)
clear cookies
set user agent("Firefox 6")
navigate("www.youtube.com","Wait")
wait for browser event("Everything Loaded","")
type text(<name="search_query">,#keywords,"Standard")
wait for browser event("Everything Loaded","")
click(<id="search-btn">,"Left Click","No")
wait(5)
add list to list(%urls,$scrape attribute(<href=w"/watch?v=*">,"fullhref"),"Delete","Global")
set(#position,0,"Global")
clear list(%WWW_Urls)
loop($list total(%urls)) {
    navigate($list item(%urls,#position),"Wait")
    wait for browser event("Everything Loaded","")
    click(<innerhtml="<span class=\"yt-uix-button-content\">Show more </span>">,"Left Click","No")
    wait(3)
    add list to list(%WWW_Urls,$list from text($find regular expression($scrape attribute(<id="eow-description">,"innertext"),".*www.*"),$new line),"Delete","Global")
    increment(#position)
}

CD
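One caveat with that pattern, if the earlier posts are read strictly: ".*www.*" also keeps lines containing http://www., which itexspert wanted skipped. A tighter, untested variant using the lookbehind from above, plus writing the results out (the file path is just an example):

add list to list(%WWW_Urls,$find regular expression($scrape attribute(<id="eow-description">,"innertext"),"(?<![\\w./])www\\.[^\\s()<>]+"),"Delete","Global")
save to file("{$special folder("Desktop")}/www_urls.txt",%WWW_Urls)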