The purpose of regular expressions:
- Search for particular items within a large body of text, e.g. you may wish to identify all email addresses in some content using a text editor.
- Replace particular items, e.g. you may wish to clean up some poorly formatted HTML by replacing all uppercase tags with lowercase equivalents in a text editor.
- Validate input, e.g. in a program you are writing you may want to check that a password meets certain criteria, such as a mix of uppercase and lowercase letters, digits and punctuation.
- Coordinate actions, e.g. when working on the command line you may wish to process certain files in a directory, but only if they meet particular conditions.
- Reformat text, e.g. you may export data from one program as a text file, then use a text editor to modify its layout so that you can import it into another program.
Since version 3 (circa 2004), bash has a built-in regular expression comparison operator, represented by =~.
Note: by default, regular expressions are case sensitive, but most tools provide an option to match in a case insensitive way.
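As a rough illustration, here is how that option looks in Python’s standard re module, where it is the re.IGNORECASE flag. The sample text and variable names are made up for the example.

```python
import re

# Made-up sample text for illustration.
text = "Jan, JAN and jan"

# Default behaviour: case sensitive, so only the lowercase form matches.
case_sensitive = re.findall(r"jan", text)

# With re.IGNORECASE, upper and lower case letters are treated alike.
case_insensitive = re.findall(r"jan", text, flags=re.IGNORECASE)

print(case_sensitive)    # ['jan']
print(case_insensitive)  # ['Jan', 'JAN', 'jan']
```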
The dot – any character
The dot ( . ) (or full stop) character is what we refer to as a metacharacter. Metacharacters are characters which have a special meaning. The . allows us to match any single character (in most tools, any character except a line break).
Ranges of Characters
Sometimes we would like to be a bit more specific than that. This is where ranges come in useful. We specify a set of characters by enclosing them within square brackets ( [ ] ), and the set matches exactly one character from it. There is no limit to how many characters you may place inside the square brackets. You can either enumerate all the characters or use a range such as 1-5, which stands for 1, 2, 3, 4 and 5. Ranges work for digits and for letters, such as [0-9] or [a-f]. Note that a range always runs between two single characters, so [1-10] does not mean 1 to 10; it matches just the characters 0 and 1.
E.g. J[ai]n matches Jan or Jin.
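A minimal sketch of both styles (an enumerated set and a range), using Python’s re module. The sample strings and variable names are invented for the example.

```python
import re

# Invented sample text for illustration.
text = "Jan and Jin met Jon in June."

# [ai] enumerates the allowed characters for the middle letter.
names = re.findall(r"J[ai]n", text)

# [a-f] is a range: any single letter from a to f.
letters = re.findall(r"[a-f]", "bag of 42 figs")

print(names)    # ['Jan', 'Jin']
print(letters)  # ['b', 'a', 'f', 'f']
```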
Negated character classes
Typing a caret after the opening square bracket negates the character class. The result is that the character class matches any character that is not in the character class. Unlike the dot, negated character classes also match (invisible) line break characters. If you don’t want a negated character class to match line breaks, you need to include the line break characters in the class. [^0-9\r\n] matches any character that is not a digit or a line break.
Find characters that aren’t
Sometimes we may want to match a character which is not in a given set of characters. We can do this by placing a caret ( ^ ) at the beginning of the range.
E.g. t[^eo]d When today is over Ted will have a tedious time tidying up. (Only the tid in tidying matches.)
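The example above can be checked with a short Python sketch (re module, variable names chosen for illustration):

```python
import re

sentence = "When today is over Ted will have a tedious time tidying up."

# [^eo] matches any single character except 'e' or 'o',
# so 'tod' and 'ted' are rejected and only 'tid' is found.
matches = re.findall(r"t[^eo]d", sentence)

print(matches)  # ['tid']
```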
Multipliers
Multipliers allow us to increase the number of times an item may occur in our regular expression. Here is the basic set of multipliers:
- * – item occurs zero or more times.
- + – item occurs one or more times.
- ? – item occurs zero or one times.
- {5} – item occurs five times.
- {3,7} – item occurs between 3 and 7 times.
- {2,} – item occurs at least 2 times.
Their effect is applied to whatever is directly in front of them. It could be a normal character:
E.g. lo* Are you looking at the lock or the silk? (Matches loo in looking, lo in lock and l in silk.)
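Here is a small Python sketch that tries each multiplier from the list above. Apart from the lo* example, the sample strings are made up; \d (any digit) and \b (word boundary) are explained later in this tutorial.

```python
import re

sentence = "Are you looking at the lock or the silk?"

# * : zero or more 'o' after the 'l'
zero_or_more = re.findall(r"lo*", sentence)
# + : at least one 'o' after the 'l'
one_or_more = re.findall(r"lo+", sentence)
# ? : the 'u' is optional
optional = re.findall(r"colou?r", "color colour colouur")
# {5} : exactly five digits
exactly_five = re.findall(r"\b\d{5}\b", "12345 123 1234567")
# {3,7} : between three and seven digits
three_to_seven = re.findall(r"\b\d{3,7}\b", "12345 12 1234567 12345678")
# {2,} : at least two digits
at_least_two = re.findall(r"\b\d{2,}\b", "1 12 123")

print(zero_or_more)    # ['loo', 'lo', 'l']
print(one_or_more)     # ['loo', 'lo']
print(optional)        # ['color', 'colour']
print(exactly_five)    # ['12345']
print(three_to_seven)  # ['12345', '1234567']
print(at_least_two)    # ['12', '123']
```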
We can also use metacharacters with multipliers:
l.*k
Are you looking at the lock or the silk? (Matches everything from the l in looking to the k in silk.)
Now this one may seem a bit odd at first. The ‘.*’ matches zero or more of any character. It is natural to expect the match to stop at the first ‘k’, but ‘k’ is itself ‘any character’, so the ‘.*’ carries on as far as it can and the match only ends at the final ‘k’ in the string. This is what’s referred to as greedy matching: the normal behaviour is to find the largest string which matches the pattern. We may reverse this behaviour and make the multiplier lazy (not greedy) by placing a question mark ( ? ) after it. This can seem a little confusing as the question mark is a multiplier itself, but you’ll get the hang of it.
l.*?k Are you looking at the lock or the silk? (Matches look, lock and lk.)
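To see the difference side by side, here is a minimal Python sketch (re module, variable names chosen for illustration):

```python
import re

sentence = "Are you looking at the lock or the silk?"

# Greedy: .* grabs as much as possible, so there is one long match
# running from the first 'l' to the last 'k'.
greedy = re.findall(r"l.*k", sentence)

# Lazy: .*? grabs as little as possible, so each match stops at
# the nearest 'k'.
lazy = re.findall(r"l.*?k", sentence)

print(greedy)  # ['looking at the lock or the silk']
print(lazy)    # ['look', 'lock', 'lk']
```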
Escaping Metacharacters
Sometimes we may actually want to search for one of the characters which is a metacharacter (such as . ). To do this we use a feature called escaping. By placing a backslash ( \ ) in front of a metacharacter we can remove its special meaning.
E.g. here\.
Welcome to here.
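A quick Python sketch of the difference the backslash makes. The second sentence in the sample text is made up, and re.escape is a helper from Python’s re module that adds the backslashes for you.

```python
import re

# The second sentence is invented to show a false positive.
text = "Welcome to here. Is heres a word?"

# Unescaped: the dot matches any character, so 'heres' is found too.
unescaped = re.findall(r"here.", text)

# Escaped: the dot only matches a literal full stop.
escaped = re.findall(r"here\.", text)

# re.escape builds the escaped pattern from plain text.
pattern = re.escape("here.")

print(unescaped)  # ['here.', 'heres']
print(escaped)    # ['here.']
print(pattern)    # here\.
```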
Shorthand Character Classes
In the previous section of this tutorial we looked at the range operator ( [] ). That allowed us to specify a set of characters, any of which could be matched. There are some ranges that are used frequently so a set of shortcuts has been created to refer to them. We access these by using the escape character ‘ \ ‘ followed by a letter. (In this case the escape character introduces a special meaning rather than taking it away.)
- \s – matches anything which is considered whitespace. This could be a space, tab, line break etc.
- \S – matches the opposite of \s, that is anything which is not considered whitespace.
- \d – matches anything which is considered a digit, i.e. 0 – 9 (it is effectively a shortcut for [0-9]).
- \D – matches the opposite of \d, that is anything which is not considered a digit.
- \w – matches anything which is considered a word character. That is [A-Za-z0-9_]. Note the inclusion of the underscore character ‘_’. This is because in programming and other areas we regularly use the underscore as part of, say, a variable or function name.
- \W – matches the opposite of \w, that is anything which is not considered a word character.
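The shorthand classes can be tried out with a short Python sketch. The log line and variable names are made up for the example.

```python
import re

# Made-up sample line for illustration.
line = "user_42 logged in at 09:15"

digits = re.findall(r"\d+", line)   # runs of digits
words = re.findall(r"\w+", line)    # runs of word characters (note the _)
fields = re.split(r"\s+", line)     # split on runs of whitespace
non_words = re.findall(r"\W", line) # everything that is not a word character

print(digits)     # ['42', '09', '15']
print(words)      # ['user_42', 'logged', 'in', 'at', '09', '15']
print(fields)     # ['user_42', 'logged', 'in', 'at', '09:15']
print(non_words)  # [' ', ' ', ' ', ' ', ':']
```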
Let’s have a look at this script, which tests the user’s input:
#!/bin/bash
# Ask for a name and check it against the pattern ^p[ai]t$ (pit or pat).
echo "Please enter a name: pit or pat"
read -r name
if [[ $name =~ ^p[ai]t$ ]]
then
    echo "$name, yes! It's the one! The pattern matches."
else
    echo "Not $name! The pattern does not match, please try again."
fi
Non Printable Characters
As well as our normal characters, there are a few other characters which we don’t actually see but which help in formatting our text. These are:
- Tab – represented in regular expressions as \t
- Carriage return – represented in regular expressions as \r
- Line feed (or newline) – represented in regular expressions as \n
The tab character you should be familiar with (it prints a larger gap than a normal space) but the other two are a bit more interesting.
The concepts of carriage return and line feed came about with mechanical typewriters. The carriage return moved the carriage from the end of the line back to the beginning of the line. The line feed moved the paper down a line.
Depending on the OS you are using, one or a combination of these can be used to signify a new line.
- Windows – uses the sequence \r\n (in that order)
- Mac OS (version 9 and below) – uses the single character \r
- Unix/Linux and OS X (macOS) – use the single character \n
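One practical use of this is to bring all three conventions down to a single one before doing other processing. A minimal Python sketch, where normalize_newlines is a hypothetical helper name:

```python
import re

def normalize_newlines(text):
    """Convert Windows (\\r\\n) and old Mac (\\r) line endings to \\n."""
    # \r\n is listed first so the pair is replaced as one unit,
    # not as a carriage return followed by a separate line feed.
    return re.sub(r"\r\n|\r", "\n", text)

mixed = "one\r\ntwo\rthree\nfour"
print(normalize_newlines(mixed).split("\n"))  # ['one', 'two', 'three', 'four']
```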
Anchors – ^ and $
Building upon the idea of new lines we introduce two particular locations on a line which are the beginning and the end of the line. We can refer to these locations in our regular expressions using the following special characters:
- ^ (caret) – represents the beginning of the line.
- $ (dollar) – represents the end of the line.
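The bash script shown earlier already relies on both anchors in ^p[ai]t$. Here is a rough Python equivalent of that check, plus a multi-line variant. The function name is hypothetical, and re.MULTILINE is the Python flag that makes ^ and $ apply to every line rather than to the whole string.

```python
import re

def is_pit_or_pat(name):
    """Return True only if the whole input is 'pit' or 'pat'."""
    # ^ pins the match to the start and $ to the end, so extra
    # characters on either side make the match fail.
    return re.search(r"^p[ai]t$", name) is not None

print(is_pit_or_pat("pit"))   # True
print(is_pit_or_pat("spit"))  # False
print(is_pit_or_pat("pits"))  # False

# With re.MULTILINE the anchors are tested on each line.
text = "pat\nspit\npit"
line_matches = re.findall(r"^p[ai]t$", text, flags=re.MULTILINE)
print(line_matches)  # ['pat', 'pit']
```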
Word Boundaries
Word boundaries are another example of a zero-width match used often within regular expressions: like the anchors, they match a position rather than a character. A word boundary is the very beginning or end of a word. They may be identified using the following:
- \< – represents the beginning of a word.
- \> – represents the end of a word.
- \b – represents either the beginning or end of a word.
The first two items listed above aren’t available in all regular expression tools but \b generally is so it is the safer one to use.
A word is generally considered to be a string of characters that would be matched by the \w character class (that is, A-Z, a-z, 0-9 and _). Note that this doesn’t include punctuation such as the apostrophe ( ' ), as may be seen in the example below.
\bt\w+\b Now that's the truth and you know it. (Matches that, the and truth; the match in that's stops at the apostrophe.)
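The same example as a Python sketch (re module, variable names chosen for illustration). Note that Python supports \b but not \< and \>.

```python
import re

sentence = "Now that's the truth and you know it."

# Whole words beginning with 't'. The 't' in "it" is not at the
# start of a word, so "it" is not matched.
t_words = re.findall(r"\bt\w+\b", sentence)

# The apostrophe is not a word character, so "that's" splits in two.
pieces = re.findall(r"\w+", "that's")

print(t_words)  # ['that', 'the', 'truth']
print(pieces)   # ['that', 's']
```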
Grouping
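As in math, we can group several characters together in our regular expression using brackets ‘( )’. A multiplier placed after the group then applies to the whole group.
E.g. John (Reginald )?Smith John Reginald Smith is sometimes just called John Smith. (Both forms of the name match.)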
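We can also use grouping to find IP addresses:
An IP address is a set of 4 numbers (between 0 and 255) separated by full stops (e.g. 192.168.0.5).
\b(\d{1,3}\.){3}\d{1,3}\b The server has an address of 10.18.0.20 and the printer has an address of 10.18.0.116. (Both addresses match.)
Note that this pattern only checks the shape of the address (four groups of 1 to 3 digits). It does not check that each number is between 0 and 255, so it would also match 999.999.999.999.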
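One possible way to use this pattern in practice is to let the regular expression find candidates and then check the 0 to 255 range in ordinary code. The sketch below assumes Python’s re module; find_ipv4 and IP_SHAPE are hypothetical names, and the group is written as (?: ) (a non-capturing group) so that findall returns the whole address.

```python
import re

# Same shape as the pattern above, with a non-capturing group.
IP_SHAPE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def find_ipv4(text):
    """Return dotted addresses in text whose four numbers are all <= 255."""
    candidates = IP_SHAPE.findall(text)
    return [c for c in candidates
            if all(int(part) <= 255 for part in c.split("."))]

text = ("The server has an address of 10.18.0.20 and the printer "
        "has an address of 10.18.0.116.")
print(find_ipv4(text))                # ['10.18.0.20', '10.18.0.116']
print(find_ipv4("Bogus: 999.1.1.1"))  # []
```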
Back references
Whenever we match something within brackets, that value is actually stored in a variable which we may refer to later on in the regular expression. To access these variables we use the escape character ( \ ) followed by a digit. The first set of brackets is referred to with \1, the second set of brackets with \2 and so on.
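A common use of back references is spotting a word that has been typed twice in a row. The following is a minimal Python sketch; the sample sentences and variable names are made up for the example.

```python
import re

# (\w+) captures a word; \1 then requires the same text again.
doubled = re.compile(r"\b(\w+) \1\b")

match = doubled.search("This is is a test")
print(match.group())   # is is
print(match.group(1))  # is

# In a replacement string, \1 inserts the captured word once.
fixed = doubled.sub(r"\1", "it was the the best")
print(fixed)  # it was the best
```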
Alternation
With alternation we are looking for something or something else. We have seen a very basic example of alternation with the range operator. This allows us to perform alternation with a single character, but sometimes we would like to perform the operation with a larger set of characters. We can achieve this with the pipe symbol ( | ) which means or.
So for instance, if we wanted to find all instances of either ‘dog’ or ‘cat’ we could do the following:
dog|cat Harold Smith has two dogs and one cat. (Matches the dog in dogs, and cat.)
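The same search as a Python sketch. The second pattern is an illustrative extension, not from the example above: it wraps the alternation in a non-capturing group (?: ) so that the optional s and the word boundaries apply to both alternatives.

```python
import re

sentence = "Harold Smith has two dogs and one cat."

# Plain alternation: either 'dog' or 'cat'.
animals = re.findall(r"dog|cat", sentence)

# Grouped alternation, followed by an optional plural 's'.
whole_words = re.findall(r"\b(?:dog|cat)s?\b", sentence)

print(animals)      # ['dog', 'cat']
print(whole_words)  # ['dogs', 'cat']
```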
Look ahead and Look behind
Both lookaheads and lookbehinds operate in one of two modes:
- Positive – in which we are seeking to find something which matches.
- Negative – in which we are seeking to find something which doesn’t match.
Lookaheads
We can look ahead in our string and see if it matches the given pattern.
Let’s say we want to find numbers between 4000 and 5000 (that is, 4001 to 4999). This could be described as ‘a 4 followed by 3 digits, at least one of which is not a 0’.
We could do it like this:
\b4([1-9]\d\d|\d[1-9]\d|\d\d[1-9])\b
This looks for the first digit not being 0, or the second digit not being 0, or the third digit not being 0, which matches our criteria.
Now we will match 4010 but not 4000.
That works but it is long-winded. Since all we really want is to rule out ‘000’ after the 4, a negative lookahead will help us:
(?!x)
The negative lookahead is contained within brackets and the first two characters inside the brackets are ?!. Replace x with what it is you don’t want to match.
\b4(?!000)\d{3}\b
Now we still match 4010 but not 4000.
A positive lookahead works in the same way but the characters inside the lookahead have to match:
(?=x)
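Here is a minimal Python sketch that runs both versions of the 4000 to 5000 pattern over a few test values, and then shows a positive lookahead. The test values and the dollars example are made up for illustration.

```python
import re

explicit = re.compile(r"\b4([1-9]\d\d|\d[1-9]\d|\d\d[1-9])\b")
negative = re.compile(r"\b4(?!000)\d{3}\b")

samples = ["4000", "4010", "4999", "5000", "400"]
explicit_hits = [s for s in samples if explicit.search(s)]
negative_hits = [s for s in samples if negative.search(s)]

print(explicit_hits)  # ['4010', '4999']
print(negative_hits)  # ['4010', '4999']

# Positive lookahead: the digits only match when " dollars" follows,
# and " dollars" itself is not part of the match.
prices = re.findall(r"\d+(?= dollars)", "30 dollars or 25 euros")
print(prices)  # ['30']
```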
Lookbehind
Lookbehinds work similarly to lookaheads, but instead of checking the text that follows the current position they check the text that precedes it. In both cases the text that was checked is not included in the match.
They follow a similar syntax but include a ‘<’ after the ‘?’ (think of it as an arrow pointing backwards).
(?<=x) and (?<!x)
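If you want to find the first block of the domain name in an email address, try this:
(?<=\@)(\w*|\d*)(?=\b)
bar.ba@test.co.uk (Matches test.)
Explanation:
- (?<=\@) is a positive lookbehind: the match must be preceded by @.
- (\w*|\d*) matches any number of word characters or any number of digits. Since \w already includes the digits, \w* alone would do.
- (?=\b) is a positive lookahead: the match must be followed by a word boundary.
- \b asserts a position at a word boundary.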
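The email example can be tried in Python as follows. The shorter pattern and the negative lookbehind example are illustrative additions, with made-up sample text.

```python
import re

email = "bar.ba@test.co.uk"

# The pattern from above, unchanged.
original = re.search(r"(?<=\@)(\w*|\d*)(?=\b)", email)

# A shorter pattern that finds the same text in this example.
simpler = re.search(r"(?<=@)\w+", email)

print(original.group())  # test
print(simpler.group())   # test

# Negative lookbehind: whole numbers that are NOT preceded by '$'.
unpriced = re.findall(r"(?<!\$)\b\d+", "$40 or 30")
print(unpriced)  # ['30']
```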
Useful tools
Regular expression 101: https://regex101.com/r/0ur1We/1
RegExr: https://regexr.com/
Example
(\d{4}[-, ]?){3}\d{4} 1122-3111-1112-1111
- \d{4}[-, ]? – 4 digits, followed by a dash, comma or space, zero or one time.
- {3} – quantifier: the group is matched exactly 3 times.
- \d{4} – the last 4 digits.
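A small Python sketch of this pattern used as a check on a whole string. The function name looks_like_card is hypothetical, and re.fullmatch is used so that nothing extra is allowed before or after the 16 digits. As the third call shows, the pattern does not require the separators to be consistent.

```python
import re

card_shape = re.compile(r"(\d{4}[-, ]?){3}\d{4}")

def looks_like_card(text):
    """Return True if text is 16 digits in groups of 4, separators optional."""
    return card_shape.fullmatch(text) is not None

print(looks_like_card("1122-3111-1112-1111"))  # True
print(looks_like_card("1122311111121111"))     # True
print(looks_like_card("1122 3111,1112-1111"))  # True (mixed separators)
print(looks_like_card("1122-3111-1112"))       # False (only 12 digits)
```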
Reference
https://www.regular-expressions.info/tutorialcnt.html