Syntax of Regular Expressions in R
A regular expression is a sequence of characters that describes a pattern to be searched for within a longer string or a vector of strings. String-searching functions typically use this pattern for “find” or “find and replace” operations on strings, or for input validation. In R, regular expressions are often used for feature engineering.
Before building a model, we sometimes have to do some feature engineering. Consider a name written in the following pattern:
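As a minimal sketch of those three uses in base R (the sample strings and variable names are made up for illustration and are not from the original text), `grepl()` can find a pattern, `sub()` can find and replace it, and an anchored pattern can validate input:

```r
# Hypothetical sample data, chosen only for illustration
x <- c("order 66", "no digits here", "room 101")

# Find: which elements contain at least one digit?
has_digit <- grepl("[0-9]", x)

# Find and replace: replace the first run of digits in each element with "#"
masked <- sub("[0-9]+", "#", x)

# Validate: does the whole string consist of digits only?
is_number <- grepl("^[0-9]+$", c("2024", "20a4"))

print(has_digit)
print(masked)
print(is_number)
```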
Matinoff, Mr. Nicola
First we have to understand the pattern, and then write a regular expression that produces the expected result.
Here, the pattern in the name is last_name, then a comma, then the title, then a dot, and lastly first_name. Our requirement is to identify the last_name from the full name. A regular expression that matches the last_name, together with the comma that follows it, is:
.*,
In the above pattern:
- ‘.’ means any character, except \n or line terminator
- ‘*’ means the preceding item is repeated zero or more times
- ‘,’ means a literal comma (,), which marks the end of the last_name
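The pattern can be tried out on the sample name with base R functions. This is only a sketch: the variable names are illustrative, and stripping the comma with `sub()` is one possible approach rather than the only one.

```r
full_name <- "Matinoff, Mr. Nicola"

# regexpr() locates the first match of ".*," and regmatches() extracts it.
# The match includes the trailing comma.
matched <- regmatches(full_name, regexpr(".*,", full_name))

# To keep only the last name, remove the comma and everything after it
last_name <- sub(",.*", "", full_name)

# The title sits between the comma and the dot; capture it with a group
title <- sub(".*, *([^.]*)\\..*", "\\1", full_name)

print(matched)    # "Matinoff,"
print(last_name)  # "Matinoff"
print(title)      # "Mr"
```

Because ‘.*’ is greedy, a name containing more than one comma would be matched up to its last comma, so the pattern suits names that follow the single-comma layout shown above.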
R supports many other regular expression constructs for identifying different patterns. You can practice with the syntax listed below.
Syntax of Regular Expressions:
Syntax | Description |
\\d | Digit: 0, 1, 2 … 9 |
\\D | Not a digit |
\\s | Whitespace character |
\\S | Not a whitespace character |
\\w | Word character (letter, digit or underscore) |
\\W | Not a word character |
\\t | Tab |
\\n | New line |
^ | Beginning of the string |
$ | End of the string |
\\ | Escape a special character; written as a doubled backslash in an R string, e.g. \\+ matches “+” and \\\\ matches “\” |
| | Alternation, e.g. (e|d)n matches “en” and “dn” |
. | Any character, except \n or line terminator |
[ab] | a or b |
[^ab] | Any character except a and b |
[0-9] | Any digit |
[A-Z] | Any uppercase letter, A to Z |
[a-z] | Any lowercase letter, a to z |
[A-Za-z] | Any uppercase or lowercase letter; avoid [A-z], which in ASCII order also covers the characters between Z and a: [ \ ] ^ _ ` |
i+ | i at least one time |
i* | i zero or more times |
i? | i zero or 1 time |
i{n} | i occurs exactly n times in sequence |
i{n1,n2} | i occurs n1 to n2 times in sequence |
i{n1,n2}? | Non-greedy match: i occurs n1 to n2 times, as few times as possible |
i{n,} | i occurs >= n times |
[:alnum:] | Alphanumeric characters: [:alpha:] and [:digit:] |
[:alpha:] | Alphabetic characters: [:lower:] and [:upper:] |
[:blank:] | Blank characters: e.g. space, tab |
[:cntrl:] | Control characters |
[:digit:] | Digits: 0 1 2 3 4 5 6 7 8 9 |
[:graph:] | Graphical characters: [:alnum:] and [:punct:] |
[:lower:] | Lower-case letters in the current locale |
[:print:] | Printable characters: [:alnum:], [:punct:] and space |
[:punct:] | Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ |
[:space:] | Space characters: tab, newline, vertical tab, form feed, carriage return, space |
[:upper:] | Upper-case letters in the current locale |
[:xdigit:] | Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f |
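A few entries from the table can be tried out as follows. This is a minimal sketch with made-up sample strings. Two details matter in practice: a backslash has to be doubled inside an R string literal, and a POSIX class such as [:digit:] only works inside a bracket expression, so it is written as [[:digit:]].

```r
# Hypothetical sample string, chosen only for illustration
x <- "Room 42,\tfloor 7"

# \\d+ : one or more digits; gregexpr() finds every match in the string
digits <- regmatches(x, gregexpr("\\d+", x))[[1]]

# POSIX class inside a bracket expression: remove every digit
no_digits <- gsub("[[:digit:]]", "", x)

# ^ and [A-Z]: does the string start with an uppercase letter?
starts_upper <- grepl("^[A-Z]", x)

# Greedy vs. non-greedy quantifier on a small tag-like string
tag <- "<b>bold</b>"
greedy <- sub("<.+>", "", tag)   # ".+" extends to the last ">"
lazy <- sub("<.+?>", "", tag)    # ".+?" stops at the first ">"

print(digits)        # "42" "7"
print(no_digits)     # "Room ,\tfloor "
print(starts_upper)  # TRUE
print(greedy)        # ""
print(lazy)          # "bold</b>"
```

The same patterns can be passed to other base R functions such as grep(), regexpr() and strsplit(), which all accept a regular expression as their pattern argument.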