UBot Underground

Scraping Youtube Comments For Email Addresses (Regex And Xpath)



I have tried to come up with a good regex to pick up emails from YouTube comments, but to no avail. I ended up using XPath to get all of the emails I was testing with, but it was a simple XPath that also picked up the text of each comment. I am hoping to find a way to filter just the email addresses out of the scrape. Regex is really a pain for me and XPath seems better, but that was the best I could do.

 

Here is an example of what I did:

<div class="comment-renderer-text-content">hey bud could u send over the spred sheet please too somefool@<a href="http://gmail.com/" class="yt-uix-servicelink  " data-url="http://gmail.com/" data-target-new-window="True" data-servicelink="CBEQtnUiEwj2rceU--zVAhWEtZwKHUN5Bss" target="_blank">gmail.com</a></div>

I used the XPath:

//div[contains(text(),"@")]

Of course it worked to pull all of the emails in the comments section, but along with the rest of each comment, which I would then have to filter further (worst-case scenario) if I wanted to get JUST the email addresses.

 

So, is there a better way to grab just the email addresses from YouTube comments? Or should I just work on filtering the lists that I get by using that xpath?

 

Any help is greatly appreciated! Thank you.


Seems like it would be less of a hassle to use their API.

https://www.googleapis.com/youtube/v3/commentThreads?key=ENTER_KEY_HERE&textFormat=plainText&part=snippet&videoId=kffacxfA7G4&maxResults=50

Sample:

{
  "kind": "youtube#commentThread",
  "etag": "\"m2yskBQFythfE4irbTIeOgYYfBU/oiE9FRIrB9fgC0gb3ysYjfk68v8\"",
  "id": "z22id1jqjxmde1k5a04t1aokgxoxaylq3nzxj2ol22bwrk0h00410",
  "snippet": {
    "videoId": "kffacxfA7G4",
    "topLevelComment": {
      "kind": "youtube#comment",
      "etag": "\"m2yskBQFythfE4irbTIeOgYYfBU/TzJp3O_T_q0Tg5RLS7mykB-X6b4\"",
      "id": "z22id1jqjxmde1k5a04t1aokgxoxaylq3nzxj2ol22bwrk0h00410",
      "snippet": {
        "authorDisplayName": "Mara Horvath",
        "authorProfileImageUrl": "https://yt3.ggpht.com/-MrRpz4O2XYw/AAAAAAAAAAI/AAAAAAAAAAA/BMUg1y6dtE8/s28-c-k-no-mo-rj-c0xffffff/photo.jpg",
        "authorChannelUrl": "http://www.youtube.com/channel/UCyn34cuY5ruzthCKbkqew_A",
        "authorChannelId": {
          "value": "UCyn34cuY5ruzthCKbkqew_A"
        },
        "videoId": "kffacxfA7G4",
        "textDisplay": "Wow factor 7 mil likes and 8 mil dislikes",
        "textOriginal": "Wow factor 7 mil likes and 8 mil dislikes",
        "canRate": true,
        "viewerRating": "none",
        "likeCount": 0,
        "publishedAt": "2017-08-24T06:46:03.000Z",
        "updatedAt": "2017-08-24T06:46:03.000Z"
      }
    },
    "canReply": true,
    "totalReplyCount": 0,
    "isPublic": true
  }
}

You can do it this way (there are other ways, but this should work). You may want to find a better regex; I just whipped this up, so it may not be perfect:

set(#html,"<div class=\"comment-renderer-text-content\">hey bud could u send over the spred sheet please too somefool@<a href=\"http://gmail.com/\" class=\"yt-uix-servicelink  \" data-url=\"http://gmail.com/\" data-target-new-window=\"True\" data-servicelink=\"CBEQtnUiEwj2rceU--zVAhWEtZwKHUN5Bss\" target=\"_blank\">gmail.com</a></div>","Global")
set(#cleaned,$strip tags(#html),"Global")
set(#emails,$find regular expression(#cleaned,"[a-zA-Z0-9\\.\\+_-]+\\@[a-zA-Z0-9-]+\\.[a-zA-Z]\{2,4\}(\\.[a-zA-Z]\{2,4\}|)"),"Global")
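
For anyone who wants to try the same two steps outside UBot, here is a rough Python sketch of the idea: strip the tags first, then run an email regex over the plain text. The helper names (`strip_tags`, `find_emails`) are made up for this example. The pattern is my own illustrative variant, not the one from the post above: it keeps the hyphen last in each character class so it is read as a literal rather than a range, and it does not cap the top-level domain at four letters. It is not a full address validator.

```python
import re
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes, dropping every tag (like "strip tags")."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment with all markup removed."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return "".join(stripper.parts)


# Illustrative pattern only (an assumption of this sketch, not a standard).
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


def find_emails(html):
    """Strip the markup, then return every email-looking substring."""
    return EMAIL_RE.findall(strip_tags(html))


html = ('<div class="comment-renderer-text-content">hey bud could u send over '
        'the spred sheet please too somefool@<a href="http://gmail.com/" '
        'target="_blank">gmail.com</a></div>')
print(find_emails(html))
```

Stripping the tags first matters here because YouTube wraps the domain in a link, so in the raw HTML the address is split across the `<a>` tag and no email regex can match it in one piece.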

 


Hmm, I am actually more confused by this than anything, lol! I have pash's API plugin, so I could probably do something with that, but I don't understand what I am looking at here. Could you explain what is going on?


 


Thank you, helloinsomnia. I really am horrible at regex. I even tried using your tool to generate something that would work, but I don't know what I am doing wrong. I will study this and see what I can learn from it. Did you use your tool to make this, or just your own knowledge?

 

So essentially I have to clean the html out before I use the regex?



 

You're looking at the more efficient way to scrape the comment section of a video. But if you'd rather wait for images, CSS, etc. to load instead, then keep doing it the way you are.

 

https://www.googleapis.com/youtube/v3/commentThreads?key=ENTER_KEY_HERE&textFormat=plainText&part=snippet&videoId=kffacxfA7G4&maxResults=50

 

 

ENTER_KEY_HERE: your API key, which you get from Google.

videoId: the ID of the video on YouTube. It is the value of the v= parameter in the video's URL, usually at the end of the URL.

 

The sample I posted shows just 1 of the 50 comments that are returned when you navigate to the url above.

 

"textDisplay": "Wow factor 7 mil likes and 8 mil dislikes",

"textOriginal": "Wow factor 7 mil likes and 8 mil dislikes",

 

The text above is a comment.
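
To make the pieces concrete, here is a small Python sketch (Python is only for illustration, it is not UBot code). It rebuilds the URL above from its parameters and shows where the comment text sits inside one returned item. The function names `build_url` and `comment_text` are made up for this example, and the sample is a trimmed-down copy of the item posted earlier.

```python
import json
from urllib.parse import urlencode

API_URL = "https://www.googleapis.com/youtube/v3/commentThreads"


def build_url(api_key, video_id, max_results=50):
    """Assemble the commentThreads request URL from its query parameters."""
    params = {
        "key": api_key,             # your API key from Google
        "textFormat": "plainText",  # ask for plain text instead of HTML
        "part": "snippet",
        "videoId": video_id,        # the v= value from the video's URL
        "maxResults": max_results,
    }
    return API_URL + "?" + urlencode(params)


def comment_text(thread):
    """Return the comment body from one commentThread item."""
    return thread["snippet"]["topLevelComment"]["snippet"]["textOriginal"]


# Trimmed copy of the sample item, keeping only the fields used here.
sample = json.loads("""
{
  "kind": "youtube#commentThread",
  "snippet": {
    "videoId": "kffacxfA7G4",
    "topLevelComment": {
      "kind": "youtube#comment",
      "snippet": {
        "textOriginal": "Wow factor 7 mil likes and 8 mil dislikes"
      }
    }
  }
}
""")

print(build_url("ENTER_KEY_HERE", "kffacxfA7G4"))
print(comment_text(sample))
```

With `textFormat=plainText` the comment comes back without markup, so there is nothing to strip before running an email regex over it.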



Oh, I don't dispute that the API is a better choice. I just don't understand the large example you shared, or where to put it.

 

I understand this code:

https://www.googleapis.com/youtube/v3/commentThreads?key=ENTER_KEY_HERE&textFormat=plainText&part=snippet&videoId=kffacxfA7G4&maxResults=50

but I don't know how to call it in UBot. (Just navigate?)

 

I am trying to figure out how to do this with the youtube plugin from pash, but with no luck so far. The documentation for that plugin is non-existent sadly.


I just wrote it off the top of my head in this case.

 

For the API stuff, you would use the HTTP Post plugin to make the request and then the JSON Path Parser to scrape what you need. Dan has some tutorials on HTTP Post that would probably be a good starting point: http://www.bot-factory.com/http-plugin-tutorials/
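
The same request-then-parse flow can be sketched in Python if that helps to see the shape of it. This is only a sketch under some assumptions: the names `fetch_json` and `iter_comment_texts` and the `max_pages` cap are made up for the example, and it relies on the response carrying an `items` list plus a `nextPageToken` that can be sent back as `pageToken` to get the next page.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://www.googleapis.com/youtube/v3/commentThreads"


def fetch_json(url):
    """Default fetcher: a plain HTTP GET that returns the parsed JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def iter_comment_texts(api_key, video_id, fetch=fetch_json, max_pages=10):
    """Yield the text of each top-level comment, one page at a time."""
    params = {
        "key": api_key,
        "textFormat": "plainText",
        "part": "snippet",
        "videoId": video_id,
        "maxResults": 100,
    }
    for _ in range(max_pages):  # hypothetical safety cap on requests
        page = fetch(API_URL + "?" + urlencode(params))
        for thread in page.get("items", []):
            yield thread["snippet"]["topLevelComment"]["snippet"]["textOriginal"]
        token = page.get("nextPageToken")
        if not token:
            break  # no further pages
        params["pageToken"] = token  # ask for the next page


# Usage (needs a real key, so it is left commented out):
# for text in iter_comment_texts("ENTER_KEY_HERE", "kffacxfA7G4"):
#     print(text)
```

The `fetch` argument is there so the loop can be tried with canned pages before spending any API quota.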


So I did a mix of things here to achieve the best results.

 

Here is a breakdown for those who are interested:

 

  • Firstly, the API returns at most 100 results per request. I don't want that, and my method returns well over 500 emails from specific targeted videos.
  • Secondly, the YouTube API plugin from pash is far too complicated and annoying to set up for this anyway, while the HTTP Post plugin is a lot easier. (Though again, the API is limited anyway, so for me it was a waste of time to use!)
  • I couldn't find a good regex to grab only the comments that had the "@" sign in them, so I used the following XPath:   //div[contains(text(),"@")]
  • After using the XPath to grab all of the comments with the "@" symbol in them, I used "strip tags" to filter out the markup (thank you for that idea, helloinsomnia!)
  • Then, with the right pattern in "find regular expression", I was able to isolate all of the emails, and I exported them.
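
For reference, the last three steps can be sketched in Python roughly like this. It is not what I ran in UBot, just the same idea under some assumptions: the markup is a small well-formed snippet (a real page would need a proper HTML parser), the `section` wrapper and the function name `emails_from_comments` are made up, and the regex is an illustrative one rather than the pattern used in the bot.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative pattern only; the hyphen is last in each class so it is a literal.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


def emails_from_comments(markup):
    """Filter comment divs on "@", flatten them to text, collect unique emails."""
    root = ET.fromstring(markup)
    seen = set()
    emails = []
    for div in root.iter("div"):
        # Roughly //div[contains(text(),"@")]: look at the div's leading text.
        if "@" not in (div.text or ""):
            continue
        text = "".join(div.itertext())  # same effect as "strip tags"
        for email in EMAIL_RE.findall(text):
            if email.lower() not in seen:  # skip repeats across comments
                seen.add(email.lower())
                emails.append(email)
    return emails


page = """<section>
<div class="comment-renderer-text-content">hey bud could u send over the spred sheet please too somefool@<a href="http://gmail.com/">gmail.com</a></div>
<div class="comment-renderer-text-content">great video, thanks</div>
<div class="comment-renderer-text-content">me too please: somefool@<a href="http://gmail.com/">gmail.com</a></div>
</section>"""

print(emails_from_comments(page))
```

The "@" filter only trims the list of comments before the regex runs; the regex is what actually isolates the addresses.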

Now the task is complete, but my path to trying to understand how to best use regex and the advanced element selector continues! I'm still so confused as to how I can best utilize these things in the future, but at least this issue is resolved. Thanks all!



 

I will have some stuff on this in the coming months when I launch my new site ;)
