Kreatus (Ubot Ninja), posted March 8, 2011: Hi guys, I need help scraping email addresses from various websites. Normally I scrape them with this regex: (\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}) but it fails when I encounter an email address written with spaces, like "contact @ gmail . com". What is the right regex to scrape email addresses, including ones with spaces? Thanks
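For readers who want to reproduce the problem outside UBot, here is a minimal Python sketch. It assumes Python's `re` module behaves like UBot's regex engine for this pattern, which the thread does not confirm; `first_email` and the sample sentences are made up for illustration.

```python
import re

# The pattern from the post above, copied as-is.
ORIGINAL = r"(\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3})"

def first_email(text):
    """Return the first substring matched by ORIGINAL, or None (hypothetical helper)."""
    m = re.search(ORIGINAL, text)
    return m.group(0) if m else None

print(first_email("mail contact@gmail.com today"))      # compact form is found
print(first_email("mail contact @ gmail . com today"))  # spaced form is not
```

The pattern requires `@` to follow the local part directly, so any whitespace around it stops the match.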
JohnB, posted March 8, 2011: Here ya go... it's universal: [a-zA-Z0-9\._\-]{3,}(@|AT|\s(at|AT)\s|\s*[\[\(\{]\s*(at|AT)\s*[\]\}\)]\s*)[a-zA-Z]{3,}(\.|DOT|\s(dot|DOT)\s|\s*[\[\(\{]\s*(dot|DOT)\s*[\]\}\)]\s*)[a-zA-Z]{2,}((\.|DOT|\s(dot|DOT)\s|\s*[\[\(\{]\s*(dot|DOT)\s*[\]\}\)]\s*)[a-zA-Z]{2,})?
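To see which obfuscation styles this pattern covers, here is a rough Python sketch. The pattern is the one posted above, only split into named pieces for readability; `AT`, `DOT`, `first_match`, and the sample addresses are my own names and test data, and Python's `re` module is assumed to be a fair stand-in for UBot's engine here.

```python
import re

# The "@" part: a literal @, AT, " at ", or a bracketed [at] / (at) / {at}.
AT = r"(@|AT|\s(at|AT)\s|\s*[\[\(\{]\s*(at|AT)\s*[\]\}\)]\s*)"
# The "." part: a literal dot, DOT, " dot ", or a bracketed [dot] / (dot) / {dot}.
DOT = r"(\.|DOT|\s(dot|DOT)\s|\s*[\[\(\{]\s*(dot|DOT)\s*[\]\}\)]\s*)"

# local part, @, domain, dot, TLD, then an optional second dot + label.
PATTERN = (
    r"[a-zA-Z0-9\._\-]{3,}" + AT + r"[a-zA-Z]{3,}" + DOT + r"[a-zA-Z]{2,}"
    + "(" + DOT + r"[a-zA-Z]{2,})?"
)

def first_match(text):
    """Return the first whole match of PATTERN in text, or None (hypothetical helper)."""
    m = re.search(PATTERN, text)
    return m.group(0) if m else None

for sample in [
    "john.doe@example.com",
    "john at example dot com",
    "john [at] example [dot] com",
    "user@example.co.uk",
    "contact @ gmail . com",
]:
    print(repr(sample), "->", first_match(sample))
```

The first four samples match in full. The last one does not, because none of the `@` alternatives allows plain whitespace around a literal `@`.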
Kreatus (Ubot Ninja), posted March 8, 2011: Thanks John, but it doesn't work on email addresses with spaces. Here is one example page I want to scrape email addresses from: http://www.seo.com/contact/
JohnB, posted March 9, 2011: OK, I modified your regex. This is NOT an optimal solution, but the regex works. (I say it's not optimal because I ultimately had to grab the result by position, but you can modify that.) The regex now matches with or without the spaces. If you need me to explain exactly what I added to the regex, let me know. Attachment: email_bich.ubot John
Kreatus (Ubot Ninja), posted March 9, 2011: Thanks John! That works great! No need for an explanation. +1
JohnB, posted March 9, 2011: Awesome!
JohnB, posted March 9, 2011: I'll explain anyhow for anyone else reading the thread. I added this: (\s|) before AND after the @. What it says is: look for a whitespace character or nothing at all. Whenever you have a "nothing" on one side of a pipe, it makes the group optional, meaning the space does not need to be there. I hope that helps someone else as well! John
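The `(\s|)` trick can be checked with a small Python sketch. Note the assumptions: the contents of email_bich.ubot are not shown in the thread, so `WITH_PIPE` is only my reconstruction from the description (the original pattern with `(\s|)` on both sides of the `@`); `SPACED_DOT`, `first_match`, and `normalize` are hypothetical additions, not something John posted.

```python
import re

# Assumed reconstruction: original pattern plus (\s|) before and after the @.
WITH_PIPE = r"\w+(\s|)@(\s|)[a-zA-Z_]+?\.[a-zA-Z]{2,3}"
# Same idea written with the optional quantifier instead of an empty alternative.
WITH_QMARK = r"\w+\s?@\s?[a-zA-Z_]+?\.[a-zA-Z]{2,3}"
# Hypothetical extension: also allow one optional whitespace on each side of the dot.
SPACED_DOT = r"\w+\s?@\s?[a-zA-Z_]+?\s?\.\s?[a-zA-Z]{2,3}"

def first_match(pattern, text):
    """Return the first whole match of pattern in text, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

def normalize(address):
    """Remove all whitespace from a matched address."""
    return re.sub(r"\s+", "", address)

print(first_match(WITH_PIPE, "info @ example.com"))     # space around @ is accepted
print(first_match(WITH_QMARK, "info @ example.com"))    # \s? behaves the same way
print(first_match(WITH_PIPE, "contact @ gmail . com"))  # spaces around the dot still fail
print(normalize(first_match(SPACED_DOT, "contact @ gmail . com")))
```

`(\s|)` and `\s?` accept the same text; the only difference is that the first form also creates a capture group. Loosening the pattern around the dot as well picks up more obfuscated addresses, but it also raises the chance of matching ordinary text that happens to contain an `@`.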
Kreatus (Ubot Ninja), posted March 9, 2011: Thanks for the explanation, John! I think we need a thread about regex patterns, or a regex sub-forum, so patterns like this can be found again for future reference.
JohnB, posted March 9, 2011: Let me talk to the "Threadmaster"... (Buddy, of course. He runs a tight ship! :) ) John PS: I say this because he already started a thread that has helpful tips, etc., so we don't have to keep looking up the little things we need often.
Kreatus (Ubot Ninja), posted March 9, 2011: Great, John. Thanks!
Abs*, posted July 7, 2011: Hi John, I tested the bot you attached (email_bich.ubot) on the following URL: http://www.kavoir.com/2011/02/www-gmail-com.html The script only pulls the following: root@host.exa. Not sure if it's due to the extra dot.
Abs*, posted July 7, 2011: Hi John, I had a play with the regex. Kreatus' tutorial pointers in the following thread helped loads: http://ubotstudio.com/forum/index.php?/topic/6489-regex-101-and-beyond/ I downloaded the regex cheat sheet suggested by Frank and also found the following site, which tests a pattern in real time, similar to the tool Frank uses: http://regex.larsolavtorvik.com/ Playing around a little, I came up with the following pattern: (\w+(\s|)@(\s|)[a-zA-Z_]+?\.[a-zA-Z_]+(\.|)[a-zA-Z]{1,3}) I added the following to your original pattern: +(\.|)[a-zA-Z]{1,3}) This should now also work with addresses like .co.uk and with subdomain addresses. I think I have understood the pattern correctly, although I was a little unsure about the escape character \. I got the idea from the (\s|) you placed to look for spaces. It seems to be working, but I need to test it properly. Thanks, abbs
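A quick Python sketch comparing the two patterns on multi-label domains. Assumptions: `JOHN_ASSUMED` is a reconstruction from the description earlier in the thread (the attached bot's contents are not shown), `ABS` is the pattern from the post above, and `sample` and `all_matches` are made up for the test.

```python
import re

# Assumed reconstruction of the earlier pattern (not confirmed by the thread).
JOHN_ASSUMED = r"\w+(\s|)@(\s|)[a-zA-Z_]+?\.[a-zA-Z]{2,3}"
# The pattern from the post above, copied as-is.
ABS = r"(\w+(\s|)@(\s|)[a-zA-Z_]+?\.[a-zA-Z_]+(\.|)[a-zA-Z]{1,3})"

def all_matches(pattern, text):
    """Return every whole match of pattern in text.

    finditer + group(0) is used because both patterns contain capture
    groups, and re.findall would return tuples of groups instead.
    """
    return [m.group(0) for m in re.finditer(pattern, text)]

sample = "root@host.example.com, sales@example.co.uk, info @ example.com"

print(all_matches(JOHN_ASSUMED, sample))  # stops 2-3 letters after the first dot
print(all_matches(ABS, sample))           # allows one more dot-separated label
```

With the reconstructed earlier pattern, the first address is cut off at `root@host.exa`, which is consistent with the result reported above. The newer pattern allows one extra label, so it covers `host.example.com` and `example.co.uk`, but not domains with more than three labels. Both patterns use `\w+` for the local part, so a dotted or hyphenated local part such as `john.doe` would be captured only from its last segment.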
Abs*, posted July 7, 2011: Hi, just wondering if anyone can help. I've set up a quick test page here. I've set the regular expression to return a list, but it only scrapes the support@gmail.com address and nothing more. Any idea why this is? Thanks
Kreatus (Ubot Ninja), posted July 7, 2011: Hi Abs, this works fine for me. Attachment: regex-email.ubot
Abs*, posted July 8, 2011: Excellent, it's working perfectly here too. I think the problem was that I was using a set command to find the regular expression instead of add to list. Thanks a million
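The same "only the first address comes back" symptom shows up in other languages when a single-value lookup is used instead of collecting every match. The Python sketch below is only an analogy: it says nothing about how UBot's set or add to list commands work internally, and `page_text` is invented sample data.

```python
import re

# The pattern from Abs*'s earlier post, copied as-is.
ABS = r"(\w+(\s|)@(\s|)[a-zA-Z_]+?\.[a-zA-Z_]+(\.|)[a-zA-Z]{1,3})"

# Hypothetical page content with three addresses.
page_text = "support@gmail.com sales@example.org admin@example.co.uk"

# Single-value lookup: stops at the first match.
single = re.search(ABS, page_text).group(0)

# Collecting into a list: walks through every match on the page.
collected = [m.group(0) for m in re.finditer(ABS, page_text)]

print(single)
print(collected)
```

If a scrape returns one address where several are expected, it is worth checking whether the result is being stored in a single variable rather than appended to a list.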