This chapter gives an overview of the JavaScript API for regular expressions. It assumes that you are roughly familiar with how they work. If you are not, there are many good tutorials on the Web.
The terms used here closely reflect the grammar in the ECMAScript specification. I sometimes deviate to make things easier to understand.
The syntax for general atoms is as follows:
All of the following characters have special meaning:
\ ^ $ . * + ? ( ) [ ] { } |
You can escape them by prefixing a backslash. For example:
> /^(ab)$/.test('(ab)')
false
> /^\(ab\)$/.test('(ab)')
true
Additional special characters are:
Inside a character class [...]:
-
Inside a group that starts with a question mark (?...):
: = ! < >
The angle brackets are used only by the XRegExp library (see Chapter 30), to name groups.
. (dot)
Matches any JavaScript character (UTF-16 code unit) except line terminators (newline, carriage return, etc.). To really match any character, use [\s\S]. For example:
> /./.test('\n')
false
> /[\s\S]/.test('\n')
true
Control characters: \f (form feed), \n (line feed, newline), \r (carriage return), \t (horizontal tab), and \v (vertical tab).
\0 matches the NUL character (\u0000).
Control escapes: \cA – \cZ.
Unicode escapes: \u0000 – \uFFFF (Unicode code units; see Chapter 24).
Hexadecimal escapes: \x00 – \xFF.
\d matches any digit (same as [0-9]);
\D matches any nondigit (same as [^0-9]).
\w matches any Latin alphanumeric character plus underscore (same as [A-Za-z0-9_]);
\W matches all characters not matched by \w.
\s matches whitespace characters (space, tab, line feed, carriage return, form feed, all Unicode spaces, etc.);
\S matches all nonwhitespace characters.
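The character class escapes listed above can be tried out with a short sketch. The variable names are illustrative and not part of the chapter; the input strings are made up for the demonstration.

```javascript
// Illustrative names; not taken from the chapter.
var digits = /^\d+$/;   // one or more digits, nothing else
var word = /^\w+$/;     // one or more of [A-Za-z0-9_], nothing else

console.log(digits.test('2001'));      // true
console.log(digits.test('20a1'));      // false
console.log(word.test('foo_bar1'));    // true
console.log(word.test('über'));        // false: \w covers only Latin letters, digits, underscore

// \s matches runs of spaces, tabs, and line feeds alike
var parts = 'a \t\n b'.split(/\s+/);
console.log(parts);                    // [ 'a', 'b' ]
```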
The syntax for character classes is as follows:
[«charSpecs»] matches any single character that matches at least one of the charSpecs.
[^«charSpecs»] matches any single character that does not match any of the charSpecs.
The following constructs are all character specifications:
Source characters match themselves. Most characters are source characters (even many characters that are special elsewhere). Only three characters are not:
\ ] -
As usual, you escape via a backslash. If you want to match a dash without escaping it, it must be the first character after the opening bracket or the right side of a range, as described shortly.
Class escapes: Any of the character escapes and character class escapes listed previously are allowed. There is one additional escape:
Backspace (\b): Outside a character class, \b matches word boundaries. Inside a character class, it matches the control character backspace.
Ranges comprise a source character or a class escape, followed by a dash (-), followed by a source character or a class escape.
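The rules for dashes and for \b inside a character class can be sketched as follows. The variable names are hypothetical and chosen only for this illustration.

```javascript
// Hypothetical names, used only for this illustration.
var dashFirst = /^[-a]+$/;     // a dash right after the opening bracket is literal
var dashEscaped = /^[a\-z]+$/; // an escaped dash is literal: only 'a', '-', and 'z' match
var range = /^[a-z]+$/;        // an unescaped dash between two characters forms a range
var backspace = /[\b]/;        // inside a class, \b is the backspace character (\u0008)

console.log(dashFirst.test('a-a'));       // true
console.log(dashEscaped.test('m'));       // false
console.log(range.test('m'));             // true
console.log(backspace.test('\u0008'));    // true
console.log(backspace.test('a b'));       // false: no word boundary semantics here
```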
To demonstrate using character classes, this example parses a date formatted in the ISO 8601 standard:
function parseIsoDate(str) {
    var match = /^([0-9]{4})-([0-9]{2})-([0-9]{2})$/.exec(str);

    // Other ways of writing the regular expression:
    // /^([0-9][0-9][0-9][0-9])-([0-9][0-9])-([0-9][0-9])$/
    // /^(\d\d\d\d)-(\d\d)-(\d\d)$/

    if (!match) {
        throw new Error('Not an ISO date: ' + str);
    }
    console.log('Year: ' + match[1]);
    console.log('Month: ' + match[2]);
    console.log('Day: ' + match[3]);
}
And here is the interaction:
> parseIsoDate('2001-12-24')
Year: 2001
Month: 12
Day: 24
The syntax for groups is as follows:
(«pattern») is a capturing group. Whatever is matched by pattern can be accessed via backreferences or as the result of a match operation.
(?:«pattern») is a noncapturing group. pattern is still matched against the input, but not saved as a capture. Therefore, the group does not have a number you can refer to (e.g., via a backreference).
\1, \2, and so on are known as backreferences; they refer back to a previously matched group. The number after the backslash can be any integer greater than or equal to 1, but the first digit must not be 0.
In this example, a backreference guarantees the same number of a’s before and after the dash:
> /^(a+)-\1$/.test('a-a')
true
> /^(a+)-\1$/.test('aaa-aaa')
true
> /^(a+)-\1$/.test('aa-a')
false
This example uses a backreference to match an HTML tag (obviously, you should normally use a proper parser to process HTML):
> var tagName = /<([^>]+)>[^<]*<\/\1>/;
> tagName.exec('<b>bold</b>')[1]
'b'
> tagName.exec('<strong>text</strong>')[1]
'strong'
> tagName.exec('<strong>text</stron>')
null
Any atom (including character classes and groups) can be followed by a quantifier:
? means match never or once.
* means match zero or more times.
+ means match one or more times.
{n} means match exactly n times.
{n,} means match n or more times.
{n,m} means match at least n, at most m, times.
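As a quick check of the quantifiers listed above, here is a small sketch. The patterns and variable names are made up for illustration.

```javascript
// Patterns and names are made up for this illustration.
var twoToThree = /^a{2,3}$/;   // at least 2, at most 3 times
var atLeastTwo = /^a{2,}$/;    // 2 or more times
var optionalU = /^colou?r$/;   // 'u' zero times or once

console.log(twoToThree.test('a'));      // false
console.log(twoToThree.test('aaa'));    // true
console.log(twoToThree.test('aaaa'));   // false
console.log(atLeastTwo.test('aaaaa'));  // true
console.log(optionalU.test('color'));   // true
console.log(optionalU.test('colour'));  // true
```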
By default, quantifiers are greedy; that is, they match as much as possible. You can get reluctant matching (as little as possible) by suffixing any of the preceding quantifiers (including the ranges in curly braces) with a question mark (?). For example:
> '<a> <strong>'.match(/^<(.*)>/)[1] // greedy
'a> <strong'
> '<a> <strong>'.match(/^<(.*?)>/)[1] // reluctant
'a'
Thus, .*? is a useful pattern for matching everything until the next occurrence of the following atom. For example, the following is a more compact version of the regular expression for HTML tags just shown (which used [^<]* instead of .*?):
/<(.+?)>.*?<\/\1>/
Assertions, shown in the following list, are checks about the current position in the input:
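| ^ | Matches only at the beginning of the input. |
| $ | Matches only at the end of the input. |
| \b | Matches only at a word boundary. Don’t confuse with [\b], which matches a backspace. |
| \B | Matches only if not at a word boundary. |
| (?=«pattern») | Positive lookahead: Matches only if pattern matches what comes next. pattern is used only to look ahead, but is otherwise ignored. |
| (?!«pattern») | Negative lookahead: Matches only if pattern does not match what comes next. pattern is used only to look ahead, but is otherwise ignored. |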
This example matches a word boundary via \b:
> /\bell\b/.test('hello')
false
> /\bell\b/.test('ello')
false
> /\bell\b/.test('ell')
true
This example matches the inside of a word via \B:
> /\Bell\B/.test('ell')
false
> /\Bell\B/.test('hell')
false
> /\Bell\B/.test('hello')
true
Lookbehind is not supported in ECMAScript 5 (it was only added to the language in ECMAScript 2018). The section “Manually Implementing Lookbehind” explains how to implement it manually.
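The two lookahead assertions have no example of their own in this section, so here is a minimal sketch. The input strings and variable names are made up; the point is that the pattern inside the lookahead is checked but does not become part of the match.

```javascript
// Made-up inputs; names are illustrative only.

// Positive lookahead: an 'x' that is followed by 'y'
var pos = /x(?=y)/.exec('xz xy');
console.log(pos[0]);     // 'x' (the 'y' is not part of the match)
console.log(pos.index);  // 3

// Negative lookahead: an 'x' that is not followed by 'y'
var neg = /x(?!y)/.exec('xy xz');
console.log(neg[0]);     // 'x'
console.log(neg.index);  // 3
```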
A disjunction operator (|) separates two alternatives; either of the alternatives must match for the disjunction to match. The alternatives are atoms (optionally including quantifiers).
The operator binds very weakly, so you have to be careful that the alternatives don’t extend too far.
For example, the following regular expression matches all strings that either start with aa or end with bb:
> /^aa|bb$/.test('aaxx')
true
> /^aa|bb$/.test('xxbb')
true
In other words, the disjunction binds more weakly than even ^ and $ and the two alternatives are ^aa and bb$. If you want to match the two strings 'aa' and 'bb', you need parentheses:
/^(aa|bb)$/
Similarly, if you want to match the strings 'aab' and 'abb':
/^a(a|b)b$/
JavaScript’s regular expressions have only very limited support for Unicode. Especially when it comes to code points in the astral planes, you have to be careful. Chapter 24 explains the details.
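To make the caveat about astral planes concrete, here is a small sketch for the regular expressions described in this chapter (no additional flags). The code point U+1F600 is just an arbitrary example of an astral character.

```javascript
// U+1F600 is encoded as two UTF-16 code units (a surrogate pair).
var smiley = '\uD83D\uDE00';

console.log(smiley.length);          // 2
console.log(/^.$/.test(smiley));     // false: the dot matches a single code unit
console.log(/^..$/.test(smiley));    // true: two code units are needed

// Matching a surrogate pair explicitly: lead surrogate, then trail surrogate
var surrogatePair = /^[\uD800-\uDBFF][\uDC00-\uDFFF]$/;
console.log(surrogatePair.test(smiley)); // true
```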
You can create a regular expression via either a literal or a constructor and configure how it works via flags.
There are two ways to create a regular expression: you can use a literal or the constructor RegExp:
| Literal | /xyz/i | Compiled at load time |
| Constructor (second argument is optional) | new RegExp('xyz', 'i') | Compiled at runtime |
A literal and a constructor differ in when they are compiled:
The literal is compiled at load time. The following code will cause an exception when it is evaluated:
function foo() { /[/; }
The constructor compiles the regular expression when it is called. The following code will not cause an exception, but calling foo() will:
function foo() { new RegExp('['); }
Thus, you should normally use literals, but you need the constructor if you want to dynamically assemble a regular expression.
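Dynamic assembly with the constructor might look like the following sketch. The helper anyWord() is hypothetical and assumes that the given words contain no special characters; backslashes must be doubled because the pattern is written as a string literal.

```javascript
// Hypothetical helper: matches any of the given words as a whole word.
// Assumes the words contain no regular expression special characters.
function anyWord(words) {
    return new RegExp('\\b(?:' + words.join('|') + ')\\b', 'i');
}

var fruit = anyWord(['apple', 'orange']);
console.log(fruit.test('An Orange a day')); // true
console.log(fruit.test('pineapples'));      // false

// A syntax error in the pattern only surfaces when the constructor is called
var threw = false;
try {
    new RegExp('[');
} catch (e) {
    threw = e instanceof SyntaxError;
}
console.log(threw); // true
```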
Flags are a suffix of regular expression literals and a parameter of regular expression constructors; they modify the matching behavior of regular expressions. The following flags exist:
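| Short name | Long name | Description |
| g | global | The given regular expression is matched multiple times. Influences several methods, especially replace(). |
| i | ignoreCase | Case is ignored when trying to match the given regular expression. |
| m | multiline | In multiline mode, the begin operator ^ and the end operator $ match each line, instead of the complete input string. |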
The short name is used for literal suffixes and constructor parameters (see examples in the next section). The long name is used for properties of a regular expression that indicate what flags were set during its creation.
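The effect of the multiline flag is easiest to see side by side with the default behavior. The following is a minimal sketch with made-up input; it also shows flags being combined in a literal and in the constructor.

```javascript
var text = 'first\nsecond'; // made-up two-line input

// Without /m, ^ and $ refer to the complete input
console.log(/^second$/.test(text));   // false
// With /m, they match per line
console.log(/^second$/m.test(text));  // true

// Flags can be combined
var lines = text.match(/^\w+$/gm);
console.log(lines);                   // [ 'first', 'second' ]

var re = new RegExp('^SECOND$', 'im');
console.log(re.test(text));           // true
console.log(re.ignoreCase, re.multiline, re.global); // true true false
```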
Regular expressions have the following instance properties:
Flags: boolean values indicating what flags are set:
global: Is flag /g set?
ignoreCase: Is flag /i set?
multiline: Is flag /m set?
Data for matching multiple times (flag /g is set):
lastIndex is the index where to continue the search next time.
The following is an example of accessing the instance properties for flags:
> var regex = /abc/i;
> regex.ignoreCase
true
> regex.multiline
false
In this example, we create the same regular expression first with a literal, then with a constructor, and use the test() method to determine whether it matches a string:
> /abc/.test('ABC')
false
> new RegExp('abc').test('ABC')
false
In this example, we create a regular expression that ignores case (flag /i):
> /abc/i.test('ABC')
true
> new RegExp('abc', 'i').test('ABC')
true
The test() method checks whether a regular expression, regex, matches a string, str:
regex.test(str)
test() operates differently depending on whether the flag /g is set or not.
If the flag /g is not set, then the method checks whether there is a match somewhere in str. For example:
> var str = '_x_x';
> /x/.test(str)
true
> /a/.test(str)
false
If the flag /g is set, then the method returns true as many times as there are matches for regex in str. The property regex.lastIndex contains the index after the last match:
> var regex = /x/g;
> regex.lastIndex
0
> regex.test(str)
true
> regex.lastIndex
2
> regex.test(str)
true
> regex.lastIndex
4
> regex.test(str)
false
The search() method looks for a match with regex within str:
str.search(regex)
If there is a match, the index where it was found is returned. Otherwise, the result is -1. The properties global and lastIndex of regex are ignored as the search is performed (and lastIndex is not changed).
For example:
> 'abba'.search(/b/)
1
> 'abba'.search(/x/)
-1
If the argument of search() is not a regular expression, it is converted to one:
> 'aaab'.search('^a+b+$')
0
The following method call captures groups while matching regex against str:
var matchData = regex.exec(str);
If there was no match, matchData is null. Otherwise, matchData is a match result, an array with two additional properties:
input is the complete input string.
index is the index where the match was found.
If the flag /g is not set, only the first match is returned:
> var regex = /a(b+)/;
> regex.exec('_abbb_ab_')
[ 'abbb',
'bbb',
index: 1,
input: '_abbb_ab_' ]
> regex.lastIndex
0
If the flag /g is set, all matches are returned if you invoke exec() repeatedly. The return value null signals that there are no more matches. The property lastIndex indicates where matching will continue next time:
> var regex = /a(b+)/g;
> var str = '_abbb_ab_';
> regex.exec(str)
[ 'abbb',
'bbb',
index: 1,
input: '_abbb_ab_' ]
> regex.lastIndex
5
> regex.exec(str)
[ 'ab',
'b',
index: 6,
input: '_abbb_ab_' ]
> regex.lastIndex
8
> regex.exec(str)
null
Here we loop over matches:
var regex = /a(b+)/g;
var str = '_abbb_ab_';
var match;
while (match = regex.exec(str)) {
    console.log(match[1]);
}
and we get the following output:
bbb
b
The following method call matches regex against str:
var matchData = str.match(regex);
If the flag /g of regex is not set, this method works like RegExp.prototype.exec():
> 'abba'.match(/a/)
[ 'a', index: 0, input: 'abba' ]
If the flag is set, then the method returns an array with all matching substrings in str (i.e., group 0 of every match) or null if there is no match:
> 'abba'.match(/a/g)
[ 'a', 'a' ]
> 'abba'.match(/x/g)
null
The replace() method searches a string, str, for matches with search and replaces them with replacement:
str.replace(search, replacement)
There are several ways in which the two parameters can be specified:
search
Either a string or a regular expression:
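String: To be found literally in the input string. Be warned that only the first occurrence of a string is replaced. If you want to replace multiple occurrences, you must use a regular expression with a /g flag. This is unexpected and a major pitfall.
Regular expression: To be matched against the input string. Warning: Use the global flag, otherwise only one attempt is made to match the regular expression.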
replacement
Either a string or a function:
If replacement is a string, its content is used verbatim to replace the match. The only exception is the special character dollar sign ($), which starts so-called replacement directives:
$n inserts group n from the match. n must be at least 1 ($0 has no special meaning).
The matching substring:
$` (backtick) inserts the text before the match.
$& inserts the complete match.
$' (apostrophe) inserts the text after the match.
$$ inserts a single $.
This example refers to the matching substring and its prefix and suffix:
> 'axb cxd'.replace(/x/g, "[$`,$&,$']")
'a[a,x,b cxd]b c[axb c,x,d]d'
This example refers to a group:
> '"foo" and "bar"'.replace(/"(.*?)"/g, '#$1#')
'#foo# and #bar#'
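The pitfall noted for the search parameter (only one replacement unless a regular expression with /g is used) can be seen in a short sketch. The inputs are made up for illustration.

```javascript
// A string as search value replaces only the first occurrence
var viaString = 'aaa'.replace('a', 'b');
console.log(viaString);  // 'baa'

// A regular expression without /g also makes only one attempt
var viaRegex = 'aaa'.replace(/a/, 'b');
console.log(viaRegex);   // 'baa'

// Only a regular expression with /g replaces every occurrence
var viaGlobal = 'aaa'.replace(/a/g, 'b');
console.log(viaGlobal);  // 'bbb'

// In a replacement string, $$ produces a single dollar sign
var dollar = 'price'.replace(/price/, '$$5');
console.log(dollar);     // '$5'
```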
If replacement is a function, it computes the string that is to replace the match. This function has the following signature:
function (completeMatch, group_1, ..., group_n, offset, inputStr)
completeMatch is the same as $& previously, offset indicates where the match was found, and inputStr is what is being matched against.
Thus, you can use the special variable arguments to access groups (group 1 via arguments[1], and so on). For example:
> function replaceFunc(match) { return 2 * match }
> '3 apples and 5 oranges'.replace(/[0-9]+/g, replaceFunc)
'6 apples and 10 oranges'
Regular expressions whose /g flag is set are problematic if a method invoked on them must be invoked multiple times to return all results. That’s the case for two methods:
RegExp.prototype.test()
RegExp.prototype.exec()
Then JavaScript abuses the regular expression as an iterator, as a pointer into the sequence of results. That causes problems:
Problem 1: /g regular expressions can’t be inlined
For example:
// Don't do that:
var count = 0;
while (/a/g.test('babaa')) count++;
The preceding loop is infinite, because a new regular expression is created for each loop iteration, which restarts the iteration over the results. Therefore, the code must be rewritten:
var count = 0;
var regex = /a/g;
while (regex.test('babaa')) count++;
Here is another example:
// Don't do that:
function extractQuoted(str) {
    var match;
    var result = [];
    while ((match = /"(.*?)"/g.exec(str)) != null) {
        result.push(match[1]);
    }
    return result;
}
Calling the preceding function will again result in an infinite loop. The correct version is (why lastIndex is set to 0 is explained shortly):
var QUOTE_REGEX = /"(.*?)"/g;
function extractQuoted(str) {
    QUOTE_REGEX.lastIndex = 0;
    var match;
    var result = [];
    while ((match = QUOTE_REGEX.exec(str)) != null) {
        result.push(match[1]);
    }
    return result;
}
Using the function:
> extractQuoted('"hello", "world"')
[ 'hello', 'world' ]
It’s a best practice not to inline anyway (then you can give regular expressions descriptive names). But you have to be aware that you can’t do it, not even in quick hacks.
Problem 2: /g regular expressions as parameters
Code that wants to invoke test() and exec() multiple times must be careful with a regular expression handed to it as a parameter. Its flag /g must be active and, to be safe, its lastIndex should be set to zero (an explanation is offered in the next example).
Problem 3: Shared /g regular expressions (e.g., constants)
Whenever you are referring to a regular expression that has not been freshly created, you should set its lastIndex property to zero before using it as an iterator (an explanation is offered in the next example). As iteration depends on lastIndex, such a regular expression can’t be used in more than one iteration at the same time.
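The risk of sharing a /g regular expression can be sketched with a constant that is reused for two different strings. The name HAS_DIGIT and the inputs are made up for this illustration.

```javascript
// Hypothetical shared constant with the /g flag set
var HAS_DIGIT = /\d/g;

var first = HAS_DIGIT.test('a1');
console.log(first);                // true
console.log(HAS_DIGIT.lastIndex);  // 2: the next search would start here

// The leftover lastIndex makes the next call start at index 2,
// past the digit in the new string
var second = HAS_DIGIT.test('1a');
console.log(second);               // false, although '1a' contains a digit

// Resetting lastIndex before each use avoids the problem
HAS_DIGIT.lastIndex = 0;
var third = HAS_DIGIT.test('1a');
console.log(third);                // true
```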
The following example illustrates problem 2. It is a naive implementation of a function that counts how many matches there are for the regular expression regex in the string str:
// Naive implementation
function countOccurrences(regex, str) {
    var count = 0;
    while (regex.test(str)) count++;
    return count;
}
Here’s an example of using this function:
> countOccurrences(/x/g, '_x_x')
2
The first problem is that this function goes into an infinite loop if the regular expression’s /g flag is not set. For example:
countOccurrences(/x/, '_x_x') // never terminates
The second problem is that the function doesn’t work correctly if regex.lastIndex isn’t 0, because that property indicates where to start the search. For example:
> var regex = /x/g;
> regex.lastIndex = 2;
> countOccurrences(regex, '_x_x')
1
The following implementation fixes the two problems:
function countOccurrences(regex, str) {
    if (!regex.global) {
        throw new Error('Please set flag /g of regex');
    }
    var origLastIndex = regex.lastIndex; // store
    regex.lastIndex = 0;

    var count = 0;
    while (regex.test(str)) count++;

    regex.lastIndex = origLastIndex; // restore
    return count;
}
A simpler alternative is to use match():
function countOccurrences(regex, str) {
    if (!regex.global) {
        throw new Error('Please set flag /g of regex');
    }
    return (str.match(regex) || []).length;
}
There’s one possible pitfall: str.match() returns null if the /g flag is set and there are no matches. We avoid that pitfall in the preceding code by using [] if the result of match() isn’t truthy.
This section gives a few tips and tricks for working with regular expressions in JavaScript.
Sometimes, when you assemble a regular expression manually, you want to use a given string verbatim. That means that none of the special characters (e.g., *, [) should be interpreted as such—all of them need to be escaped. JavaScript has no built-in means for this kind of quoting, but you can program your own function, quoteText, that would work as follows:
> console.log(quoteText('*All* (most?) aspects.'))
\*All\* \(most\?\) aspects\.
Such a function is especially handy if you need to do a search and replace with multiple occurrences. Then the value to search for must be a regular expression with the global flag set. With quoteText(), you can use arbitrary strings. The function looks like this:
function quoteText(text) {
    return text.replace(/[\\^$.*+?()[\]{}|=!<>:-]/g, '\\$&');
}
All special characters are escaped, because you may want to quote several characters inside parentheses or square brackets.
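As a sketch of the search-and-replace use case mentioned above, quoteText() can be combined with the RegExp constructor. The helper replaceAll() is hypothetical and not part of the chapter; quoteText() is repeated so that the example is self-contained.

```javascript
// Repeated from the text so that the sketch is self-contained
function quoteText(text) {
    return text.replace(/[\\^$.*+?()[\]{}|=!<>:-]/g, '\\$&');
}

// Hypothetical helper: replace every literal occurrence of `search`.
// Note: `replacement` is still interpreted for directives such as $&.
function replaceAll(str, search, replacement) {
    return str.replace(new RegExp(quoteText(search), 'g'), replacement);
}

var result = replaceAll('1+1=2 and 1+1=2', '1+1', 'x');
console.log(result); // 'x=2 and x=2'
```

Without the quoting step, the plus sign in '1+1' would act as a quantifier and the pattern would match '11' instead of the literal text.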
If you don’t use assertions such as ^ and $, most regular expression methods find a pattern anywhere. For example:
> /aa/.test('xaay')
true
> /^aa$/.test('xaay')
false
The empty regular expression matches everything. We can create an instance of RegExp based on that regular expression like this:
> new RegExp('').test('dfadsfdsa')
true
> new RegExp('').test('')
true
However, the empty regular expression literal would be //, which is interpreted as a comment by JavaScript. Therefore, the following is the closest you can get via a literal: /(?:)/ (empty noncapturing group). The group matches everything, while not capturing anything, which prevents the group from influencing the result returned by exec(). Even JavaScript itself uses the preceding representation when displaying an empty regular expression:
> new RegExp('')
/(?:)/
The empty regular expression has an inverse, the regular expression that matches nothing:
> var never = /.^/;
> never.test('abc')
false
> never.test('')
false
Lookbehind is an assertion. Similar to lookahead, a pattern is used to check something about the current position in the input, but otherwise ignored. In contrast to lookahead, the match for the pattern has to end at the current position (not start at it).
The following function replaces each occurrence of the string 'NAME' with the value of the parameter name, but only if the occurrence is not preceded by a quote. We handle the quote by “manually” checking the character before the current match:
function insertName(str, name) {
    return str.replace(
        /NAME/g,
        function (completeMatch, offset) {
            if (offset === 0 ||
                (offset > 0 && str[offset-1] !== '"')) {
                return name;
            } else {
                return completeMatch;
            }
        }
    );
}
> insertName('NAME "NAME"', 'Jane')
'Jane "NAME"'
> insertName('"NAME" NAME', 'Jane')
'"NAME" Jane'
An alternative is to include the characters that may escape in the regular expression. Then you have to temporarily add a prefix to the string you are searching in; otherwise, you’d miss matches at the beginning of that string:
function insertName(str, name) {
    var tmpPrefix = ' ';
    str = tmpPrefix + str;
    str = str.replace(
        /([^"])NAME/g,
        function (completeMatch, prefix) {
            return prefix + name;
        }
    );
    return str.slice(tmpPrefix.length); // remove tmpPrefix
}
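The two approaches can be compared in a short sketch. Both functions are repeated from the text under the hypothetical names insertNameManual and insertNamePrefix so that they can coexist. They agree on the inputs used earlier, but appear to differ for directly adjacent occurrences, because the second version consumes the character before each match.

```javascript
// First version from the text (hypothetical name): checks the preceding character manually
function insertNameManual(str, name) {
    return str.replace(/NAME/g, function (completeMatch, offset) {
        if (offset === 0 || str[offset - 1] !== '"') {
            return name;
        }
        return completeMatch;
    });
}

// Second version from the text (hypothetical name): matches the preceding character
function insertNamePrefix(str, name) {
    var tmpPrefix = ' ';
    str = (tmpPrefix + str).replace(/([^"])NAME/g, function (completeMatch, prefix) {
        return prefix + name;
    });
    return str.slice(tmpPrefix.length); // remove tmpPrefix
}

console.log(insertNameManual('NAME "NAME"', 'Jane')); // 'Jane "NAME"'
console.log(insertNamePrefix('NAME "NAME"', 'Jane')); // 'Jane "NAME"'

// Adjacent occurrences: the second 'NAME' has no unconsumed character before it
console.log(insertNameManual('NAMENAME', 'Jane'));    // 'JaneJane'
console.log(insertNamePrefix('NAMENAME', 'Jane'));    // 'JaneNAME'
```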
Atoms (see Atoms: General):
. (dot) matches everything except line terminators (e.g., newlines). Use [\s\S] to really match everything.
Character class escapes:
\d matches digits ([0-9]); \D matches nondigits ([^0-9]).
\w matches Latin alphanumeric characters plus underscore ([A-Za-z0-9_]); \W matches all other characters.
\s matches all whitespace characters (space, tab, line feed, etc.); \S matches all nonwhitespace characters.
Character class (set of characters): [...] and [^...]
[abc] (all characters except \ ] - match themselves)
[\d\w]
[A-Za-z0-9]
Groups:
(...); backreference: \1
(?:...)
Quantifiers (see Quantifiers):
Greedy:
? * +
{n} {n,} {n,m}
Reluctant: ? after any of the greedy quantifiers.
Assertions (see Assertions):
^ $
\b \B
(?=...) (pattern must come next, but is otherwise ignored)
(?!...) (pattern must not come next, but is otherwise ignored)
Disjunction: |
Creating a regular expression (see Creating a Regular Expression):
/xyz/i (compiled at load time)
new RegExp('xyz', 'i') (compiled at runtime)
Flags (see Flags):
/g (influences several regular expression methods)
/i
/m (^ and $ match per line, as opposed to the complete input)
Methods:
regex.test(str): Is there a match (see RegExp.prototype.test: Is There a Match?)?
/g is not set: Is there a match somewhere?
/g is set: Return true as many times as there are matches.
str.search(regex): At what index is there a match (see String.prototype.search: At What Index Is There a Match?)?
regex.exec(str): Capture groups (see the section RegExp.prototype.exec: Capture Groups)
/g is not set: Capture groups of first match only (invoked once)
/g is set: Capture groups of all matches (invoked repeatedly; returns null if there are no more matches)
str.match(regex): Capture groups or return all matching substrings (see String.prototype.match: Capture Groups or Return All Matching Substrings)
/g is not set: Capture groups
/g is set: Return all matching substrings in an array
str.replace(search, replacement): Search and replace (see String.prototype.replace: Search and Replace)
search: String or regular expression (use the latter, set /g!)
replacement: String (with $1, etc.) or function (arguments[1] is group 1, etc.) that returns a string
For tips on using the flag /g, see Problems with the Flag /g.
Mathias Bynens (@mathias) and Juan Ignacio Dopazo (@juandopazo) recommended using match() and test() for counting occurrences, and Šime Vidas (@simevidas) warned me about being careful with match() if there are no matches. The pitfall of the global flag causing infinite loops comes from a talk by Andrea Giammarchi (@webreflection). Claude Pache told me to escape more characters in quoteText().