ubotapprentice, posted February 14, 2011
Hello, I want to know how to scrape an email address from a page. I watched tutorial 11 and still don't understand. For example, I have a list of URLs. Each URL is visited, and now I need to scrape the email inside the text (example: name: john smith). I tried the page scrape function manually, and this is what it scraped:
'<DIV style="LINE-HEIGHT: 19px; MARGIN-LEFT: 20px"><B>john smith</B><BR>21 years, canada. soc number 1013606283<BR>birth: 15.08.1989<BR>canada canada<BR>phone. (1) 123456789 / (1) 3112048446<BR><A href="?v=b&cs=wh&to=johnsmith5@hotmail.com" target=_blank>'
How do I scrape only the email on all of the visited URLs?
JohnB, posted February 15, 2011
Ok, ubotapprentice. I didn't know which license you had, so I created a script for both the Standard and the Pro licenses. The regex being used will scrape any email from any page.
John
Attachment: scrape_email.ubot
ubotapprentice, posted February 15, 2011
I have the Standard license. What is a regex? What is this?
[a-zA-Z0-9\._\-]{3,}(@|AT|\s(at|AT)\s|\s*[\[\(\{]\s*(at|AT)\s*[\]\}\)]\s*)[a-zA-Z]{3,}(\.|DOT|\s(dot|DOT)\s|\s*[\[\(\{]\s*(dot|DOT)\s*[\]\}\)]\s^*)[a-zA-Z]{2,}((\.|DOT|\s(dot|DOT)\s|\s*[\[\(\{]\s*(dot|DOT)\s*[\]\}\)]\s*)[a-zA-Z]{2,})?$
There's no tutorial about regular expressions, so I don't know how to use them. So you mean that with that search string you can actually scrape any email in the body content of any page visited?
Diji1, posted February 16, 2011
Regex is basically a way of matching (or searching for) text that can cover any form that text can take. Here's Wikipedia's definition: http://en.wikipedia.org/wiki/Regex
If that sounds complicated, don't worry: people usually find it complicated at first, and you get the hang of it after some practice. Think of it like this: when you use search in Notepad you might look for the word "cow", and it would find "cow" in the sentence "The big brown cow chewed happily on some grass".
Now let's say that instead of an exact word you wanted to find something that takes a particular form. For example, you want to find all the words that begin with "c" and have "o" as the second letter. A regular expression (or regex, same meaning) allows you to do this. You can write a regex that will match "cow" but not "chewed" or any other word that doesn't start with "co". Or let's say you wanted to find all text that is an email address: the regex above is one way to do that.
Here are two sites that have been very helpful to me in learning regex:
http://www.regular-expressions.info/ - this has quick lessons and more detailed explanations of everything.
http://rubular.com/ - this is where you can test a regex: you enter the text you want to match, and it shows you what is matched as you write your regular expression.
I read the quickstart at the first site, then went to the second when I needed to use regex and came up with what I needed. I found that actually writing regex is the best way to learn it.
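The "co" example above can be tried out directly. This is a minimal sketch in Python's `re` module, not uBot script; the pattern `\bco\w*` and the variable names are illustrative assumptions, and the same pattern can be pasted into a tester such as Rubular.

```python
import re

# The sentence from the post above.
sentence = "The big brown cow chewed happily on some grass"

# \b   : a word boundary, so the match must start at the beginning of a word
# co   : the literal letters "c" then "o"
# \w*  : any further word characters (letters, digits, underscore)
CO_WORD = re.compile(r"\bco\w*")

matches = CO_WORD.findall(sentence)
print(matches)  # ['cow']
```

"chewed" is skipped because its second letter is "h", and "brown" is skipped because it contains no "co" at all.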
JohnB, posted February 16, 2011
In reply to ubotapprentice ("with that search string... you can actually scrape any email on the body content of any page visited?"):
That's correct. I provided that regex string for you because it is as close to a universal string as you will find. Try it on as many pages as you like, and you should have no problems scraping emails. Enjoy.
John
Kreatus (Ubot Ninja), posted February 16, 2011
Lily already created a video tutorial on regex before: http://vimeo.com/17904661 and also Frank
Diji1, posted February 16, 2011
BTW JohnB, love how your regex scrapes email [AT] server [DOT] com and other ways of hiding emails.
JohnB, posted February 16, 2011
Thanks! :) It tries to cover them all!
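One way to picture the "[AT]" / "[DOT]" handling discussed above is to normalise the bracketed forms first and then run a plain email pattern. The sketch below is a simplified stand-in written in Python, not JohnB's regex and not uBot script; the function names and both patterns are assumptions made for illustration, and it only covers bracketed forms such as [AT], (at) and {dot}, not a bare " at " between words.

```python
import re

# Bracketed "at" / "dot" in any case, with optional surrounding whitespace,
# e.g. " [AT] ", "(at)", " {dot} ".
AT_BRACKETED = re.compile(r"\s*[\[({]\s*at\s*[\])}]\s*", re.IGNORECASE)
DOT_BRACKETED = re.compile(r"\s*[\[({]\s*dot\s*[\])}]\s*", re.IGNORECASE)

# Plain address: local part, "@", then two or more dot-separated labels.
EMAIL = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def deobfuscate(text):
    """Rewrite bracketed at/dot markers as '@' and '.' (hypothetical helper)."""
    text = AT_BRACKETED.sub("@", text)
    return DOT_BRACKETED.sub(".", text)


def find_emails(text):
    """Return every address found after normalising the text."""
    return EMAIL.findall(deobfuscate(text))


sample = "Contact: email [AT] server [DOT] com or sales(at)example{dot}org"
print(find_emails(sample))
```

Normalising first keeps the email pattern itself short. The trade-off is that ordinary text containing "(at)" is rewritten too, although it will only be reported if the result still looks like an address.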
Jubu, posted August 15, 2012
Is there a uBot 4 version of this? I tried to use find regular expression with the regex, but I'm not sure what the text is supposed to be. Should it be the document, or should I scrape the page? It doesn't seem to grab the email for some reason.
Aymen, posted September 29, 2012
Here is a simple regex that will find emails in the "myname@email.com" format:
[a-zA-Z0-9\-\_]+\@[a-zA-Z0-9\_]+\.[a-zA-Z0-9\-\_]{2,4}
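To see what this simple pattern does with the HTML from the first post, here is a minimal sketch in Python rather than uBot script. The pattern is copied as posted; the shortened `page_html` string and the `scrape_emails` helper are assumptions for illustration.

```python
import re

# Aymen's pattern from the post above, copied as posted.
SIMPLE_EMAIL = re.compile(r"[a-zA-Z0-9\-\_]+\@[a-zA-Z0-9\_]+\.[a-zA-Z0-9\-\_]{2,4}")

# Shortened version of the HTML scraped in the first post.
page_html = (
    '<B>john smith</B><BR>21 years, canada.<BR>'
    '<A href="?v=b&cs=wh&to=johnsmith5@hotmail.com" target=_blank>'
)


def scrape_emails(html):
    """Return the addresses found in html, without duplicates, in page order."""
    return list(dict.fromkeys(SIMPLE_EMAIL.findall(html)))


print(scrape_emails(page_html))  # ['johnsmith5@hotmail.com']
```

The match starts after the "=" because "=" is not in the first character class, so only the address is returned and the rest of the link is left out. Running the same function on each visited page's HTML would give one list of addresses per URL.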
mamica, posted November 5, 2014
Does anyone have a new regex for scraping all types of emails from a website? The codes provided are not working, and I can't open the scrape_email.ubot file because it is not a valid 4.0 file. Please help.
bamboo, posted December 7, 2014
Bump lol. The regexes I've found everywhere are not working for emails like this: Braham.Candice-GranpapaEnterprises@email.com
Probably because of the uppercase letters, the dots, and the hyphen. If someone could make a regex for this, I'd be extremely grateful. I didn't know anything about regex as of this morning, and I've been battling with it all day.
bamboo, posted December 7, 2014
OK, looks like that email might be too hard to get. How about a regex to scrape emails in this format: Firstname.Companyname@randomdomain.com, for example Braham.GranpapaEnterprises@email.com, like the examples given above. Kindly note the uppercase letters.
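For addresses like the ones in the last two posts, the part before the "@" has to accept dots as well as letters of either case and hyphens. The simple pattern posted earlier in the thread already allows uppercase letters and hyphens there but has no dot in that character class, which is why it returns only the part after the last dot. The sketch below is in Python, not uBot script, and the pattern is an assumption written for this example rather than one taken from the thread; it has not been tested in uBot's find regular expression.

```python
import re

# Assumed pattern (not from the thread):
#   local part : letters of either case, digits, and . _ % + -
#   domain     : one or more labels separated by dots
#   ending     : a final label of two or more letters
EMAIL = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)

text = (
    "Contact Braham.Candice-GranpapaEnterprises@email.com or "
    "Firstname.Companyname@randomdomain.com for details."
)


def scrape_emails(source):
    """Return the addresses found in source, without duplicates, in order."""
    return list(dict.fromkeys(EMAIL.findall(source)))


for address in scrape_emails(text):
    print(address)
```

This is still a practical approximation, not a full validator: it will miss obfuscated addresses and accepts some strings that are not deliverable.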