nubot Posted March 11, 2016 (edited)

I'm trying to build a simple YouTube title and URL scraper and am running into a few issues.

1.) I'm not sure I'm using the best attribute to scrape against. The last two bots I made stopped working the next day even though I used wildcards. I think this one may be good now, but I guess I'll find out tomorrow.

2.) At some point the titles and URLs stop matching up. My guess is that a page attribute can't be found, though it could just be something in my script. One suspect is that I'm using "Don't Delete" for the title list but "Delete" for the full-URL list. I did this because titles can legitimately repeat, while with "Don't Delete" the URL list was spitting out double the number of records for some reason (which could also be an error in my script). What checks could I add to prevent the titles and URLs from mismatching? What would be the best way to handle this? Not really sure how to tackle this one.

If you'd like to reproduce my error, go to youtube.com, search for something, then run the script (don't forget to change the output directory). It might run fine for the first 2 or 3 pages, but if you let it run through about 5-10, that's usually where something screws up and the titles and URLs stop matching properly from the point of the initial failure onward. To reproduce the double records, simply change the %fullurl list to "Don't Delete".

Ready to get my bot ripped apart. Here is my code. Thank you in advance!
clear all data
define Scrapes Titles and URLs {
    add list to list(%title,$scrape attribute(<outerhtml=w"<a href=\"/watch?v=*\" class=\"yt-uix-sessionlink yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2*\" title=*">,"title"),"Don\'t Delete","Global")
    add list to list(%url,$scrape attribute(<outerhtml=w"<a href=\"/watch?v=*\" class=\"yt-uix-sessionlink yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2*\" title=*">,"href"),"Don\'t Delete","Global")
    with each(%url,#url) {
        add item to list(%fullurl,"http://www.youtube.com{#url}","Delete","Global")
    }
    add list to table as column(&TitlesURLs,0,0,%title)
    add list to table as column(&TitlesURLs,0,1,%fullurl)
    save to file("DIRECTORYGOESHERE\\testcsv.csv",&TitlesURLs)
}
loop while($exists(<data-link-type="next">)) {
    Scrapes Titles and URLs()
    click(<data-link-type="next">,"Left Click","No")
    wait for browser event("Everything Loaded","")
}

Edited March 11, 2016 by nubot
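This isn't UBot code, but a quick Python sketch (using only the standard library, with made-up sample HTML) of why the mismatch likely happens: scraping titles and hrefs as two independent lists means that whenever one anchor is missing its title attribute, the title list comes up one entry short and every later row pairs the wrong title with the wrong URL. Extracting both fields from the same element keeps each record paired.

```python
# Hypothetical illustration: two parallel scrapes vs. one paired scrape.
from html.parser import HTMLParser

SAMPLE = """
<a href="/watch?v=aaa" class="yt-uix-tile-link" title="Video A"></a>
<a href="/watch?v=bbb" class="yt-uix-tile-link"></a>
<a href="/watch?v=ccc" class="yt-uix-tile-link" title="Video C"></a>
"""

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        # Pull title and href from the SAME element so they stay paired;
        # a missing title becomes a blank instead of shifting the list.
        self.rows.append((a.get("title", ""), a.get("href", "")))

p = LinkParser()
p.feed(SAMPLE)

# Two independent scrapes: lengths differ, so the columns misalign.
titles = [t for t, _ in p.rows if t]
urls = [u for _, u in p.rows]
print(len(titles), len(urls))  # 2 3 -> misaligned from the second row on

# Paired scrape: one record per element, blanks preserved.
for title, url in p.rows:
    print(title or "(no title)", "https://www.youtube.com" + url)
```

The same idea in UBot terms would be scraping one list of whole result elements and splitting each entry into title and href, rather than running two separate $scrape attribute calls.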
pash Posted March 11, 2016

Sample scrape (one pass):

clear all data
navigate("https://www.youtube.com/results?search_query=Gaming+Music","Wait")
wait for browser event("Everything Loaded","")
wait(1)
add list to list(%title,$scrape attribute(<outerhtml=w"<a href=\"/watch?v=*\" class=\"yt-uix-sessionlink yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2*\" title=*">,"title"),"Don\'t Delete","Global")
add list to list(%url,$scrape attribute(<outerhtml=w"<a href=\"/watch?v=*\" class=\"yt-uix-sessionlink yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2*\" title=*">,"fullhref"),"Don\'t Delete","Global")
add list to table as column(&TitlesURLs,0,0,%title)
add list to table as column(&TitlesURLs,0,1,%url)
save to file("DIRECTORYGOESHERE\\testcsv.csv",&TitlesURLs)
if($exists(<data-link-type="next">)) {
    then {
        click(<data-link-type="next">,"Left Click","No")
        wait for browser event("Everything Loaded","")
        wait(1)
    }
    else {
    }
}

For the URL, use "fullhref" instead of "href".
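One more point on the double-records issue: if you must deduplicate, do it at the record level rather than per-list. Deleting duplicates from the URL list but not the title list (or vice versa) shifts one column relative to the other. A minimal Python sketch of record-level dedup, with made-up sample data:

```python
# Deduplicate (title, url) pairs together so the columns stay aligned.
rows = [
    ("Video A", "https://www.youtube.com/watch?v=aaa"),
    ("Video A", "https://www.youtube.com/watch?v=aaa"),  # true duplicate record
    ("Video A", "https://www.youtube.com/watch?v=bbb"),  # same title, new URL: keep it
]

seen = set()
deduped = []
for row in rows:
    if row not in seen:
        seen.add(row)
        deduped.append(row)

for title, url in deduped:
    print(title, url)
```

Only the exact repeat is dropped; a repeated title with a different URL survives, which is the behavior the "Don't Delete" title list was trying to preserve.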
nubot (Author) Posted March 11, 2016

Thanks, works great! Now I have to work on some filtering...