This week we decided to splurge and let you know about two features that are going to be in the new UBot Studio 5. The first is a very simple way to find data on websites or in text. For example, maybe you want to find all of the email addresses or web addresses on a web page or in a CSV file. UBot Studio has a nice scrape function that already lets you do this. Our scraper looks for text that sits between two other pieces of text that you give it. While this works a lot of the time, sometimes you just want to find every example of something on a page, regardless of what it’s between.
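To make the limitation concrete, here is a minimal Python sketch of "scrape whatever sits between two markers". It is only an illustration of the idea, not UBot Studio's scraper; the function name scrape_between and the sample text are made up for this example.

```python
import re

def scrape_between(text, before, after):
    """Return every piece of text found between `before` and `after`."""
    # re.escape treats both markers as literal text; (.*?) grabs as little as possible.
    pattern = re.escape(before) + r"(.*?)" + re.escape(after)
    return re.findall(pattern, text, flags=re.DOTALL)

# Hypothetical sample: one address sits between known markers, the other does not.
html = '<a href="mailto:ann@example.com">Ann</a> or write to bob@example.org'

found = scrape_between(html, 'mailto:', '"')
print(found)  # only the address that sits between the two markers
```

The second address is missed because nothing predictable surrounds it, which is exactly the case where matching on the shape of the data works better.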

If you give yourself a few minutes to learn Regular Expressions, you’ll find a whole world of really cool things you can scrape that our scraper would find difficult. So with that in mind, and knowing that many of you already use the RegEx engine inside UBot Studio, there will be a really cool Regular Expression Builder inside UBot Studio 5:

The Regular Expression Builder in UBot Studio 5


If you aren’t at all familiar with Regular Expressions, and the phrase is giving you chills, relax, it’s easier than you think. We already have a 3-part instructional video that will help you if you decide to dive in. But here’s a simple example to walk through:

(\w+)@(\w+\.)(\w+)(\.\w+)*

I know that doesn’t seem simple, but that expression will find you most email addresses. Here’s how.

Let’s start at the beginning. (\w+) simply means find any word of one or more characters. The \w says “find anything that is a letter, number, or underscore,” and the + means there must be at least one such character, but there can be more. The \. is a backslash followed by a period, and it matches a literal period. Putting (\w+) on either side of the @ symbol and adding \. after the second one means: find a word of one or more characters, then the @ symbol, then another word, followed by a “.”. In other words, find anything that looks like “word@domain.”.

The next portion, (\w+), you’ve already seen. With that addition, our regular expression is looking for anything along the lines of “word@domain.com” (or “whatever@address.net”, “email@server.org”, etc.).

But what about .co.uk addresses? (\.\w+)* says that we’re looking for a period followed by one or more word characters. But what does the * after the closing parenthesis mean? The * means that the preceding metacharacter, literal, or group can occur zero or more times. As an example, \w\d* would match a word character followed by zero or more digits. In our example, we use parentheses to group together a series of metacharacters, so the * applies to the whole group. So, you can interpret (\.\w+)* as “match a period followed by one or more word characters, and match that combination zero or more times”. The point is that not all email addresses have a .co.uk ending, but some do. So by adding (\.\w+)* to the end of the expression, we can find email addresses that end in .com, .co.uk, .net.au, etc.
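If you want to try the expression outside UBot Studio, here is a small sketch using Python's built-in re module. The sample text and variable names are invented for illustration, and this is not how the RegEx Builder itself works. Keep in mind that \w does not match periods or hyphens, so an address like first.last@example.com would only be picked up from "last" onward, which is why the expression finds most addresses rather than all of them.

```python
import re

# The expression from the walkthrough: word, @, word plus period, word,
# then zero or more extra ".word" endings.
EMAIL = re.compile(r"(\w+)@(\w+\.)(\w+)(\.\w+)*")

# Made-up sample text with one, two and two trailing parts after the domain.
text = "Write to seth@example.com, sales@shop.co.uk or admin@mail.net.au today."

# group(0) is the whole match, i.e. the full address.
emails = [m.group(0) for m in EMAIL.finditer(text)]
print(emails)

# The smaller example: a word character followed by zero or more digits.
tokens = re.findall(r"\w\d*", "a12 b")
print(tokens)
```

Using finditer with group(0) returns the full address each time; findall would instead return the individual parenthesized groups.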

The UBot Studio 5 RegEx Builder will include common expressions like this one, to make this sort of thing easy for everyone.

The second feature that we’ll be adding is a Spintax Editor:

The Spintax Editor in UBot Studio 5


Spintax is a simple way to randomize your content. It’s easy: simply put phrases or words that mean roughly the same thing inside curly braces, separated by pipe characters, like this: {hello|hi|g’day}. When you spin the text, you’ll get any one of those words or phrases back! While it’s true that *poorly* spun content is a terrible idea, high quality spun content is hard to detect. So we’re expanding our Spintax Engine and adding a simple feature to help you build Spintax articles. This nifty Editor will let you create articles using a built-in thesaurus, so you shouldn’t have to spend extra money on any additional spinning service in the future. It’ll be a snap to build articles right inside UBot Studio with this new feature.
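To show what "spinning" means in practice, here is a minimal Python sketch. It is not UBot Studio's Spintax Engine; the spin function and the sample sentence are assumptions made up for this example. It picks one option at random from each {a|b|c} group, and handles groups nested inside other groups by resolving the innermost ones first.

```python
import random
import re

# Matches an innermost {a|b|c} group, i.e. one with no braces inside it.
GROUP = re.compile(r"\{([^{}]*)\}")

def spin(text, rng=random):
    """Replace every {a|b|c} group with one randomly chosen option."""
    while True:
        # Resolve all innermost groups, then loop again for any that enclosed them.
        text, count = GROUP.subn(lambda m: rng.choice(m.group(1).split("|")), text)
        if count == 0:
            return text

# Each run may print a different sentence.
print(spin("{hello|hi|g'day}, welcome to {my|our} {site|blog}!"))
```

Passing a seeded random.Random instance as rng makes the output repeatable, which is handy when testing.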

 

We’re just getting started on the awesome that is UBot Studio 5. I look forward to updating you again next week!

 

xkcd agrees–Regular Expressions are a powerful thing

 

– Seth

Published by Seth Turin
