Regular Expressions (Regex)
Consider the following text as a test for regular expressions:
It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire. During the battle, Rebel spies managed to steal secret plans to the Empire’s ultimate weapon, the DEATH STAR, an armored space station with enough power to destroy an entire planet. Pursued by the Empire’s sinister agents, Princess Leia races home aboard her starship, custodian of the stolen plans that can save her people and restore freedom to the galaxy…
Regex in Java
The Java classes for regular expressions are Pattern
and Matcher
in java.util.regex
.
The standard usage is to define a regular expression that we wish to match, and compile this Pattern
. We then take the string that we wish to match and compare our regular expression against this with Matcher
.
Matcher
Objects
Let us say we wished to search for a specific string, such as the word “Rebel”.
To do this we must first define a matching pattern for the regular expression. As we are trying to match the word “Rebel”, in our input we wish to match the characters “R”, “e”, “b”, “e”, “l”, in sequence.
This is a common operation and so it is made easy in Regex, to match this the corresponding string is "Rebel"
:
Pattern pattern = Pattern.compile("Rebel");
Matcher matcher = pattern.matcher(input);
Querying a Matcher
Object
Now that we have a matched Matcher
object, we can query this to find the results of our matching. There are three core methods that we will use. start()
, end()
, and find()
.
matcher.start()
returns the integer of the 0th position of the first occurrence of the matching regex within the input array. Unsurprisingly matcher.end()
tells us the integer position that the regex stops matching. If we call matcher.find()
this will return true
if the input string has another occurrence of the regex, and will move onto this part of the string. If no more matches remain, it will return false
.
This means the easiest way to work with this is in a while
loop, as this method will exit the loop once it has reached the end of the String:
while(matcher.find()) {
System.out.println(matcher.start());
System.out.println(matcher.end());
System.out.println(input.substring(matcher.start(), matcher.end()));
}
If we run this, we see that it prints the start and end integers of each match in the input string, and we can use the substring(start, end)
method to display each of our matches.
$ java Rebel
29
34
Rebel
158
163
Rebel
Or
|
is used as or in regex. To match two strings you could use the following:
Pattern pattern = Pattern.compile("Rebel|Empire");
Character Sets/Ranges
To define a character set use []
. You can set the range of characters like so: [0-9]
.
This will match a single character in that set.
Set Boundaries
To define how many times to match a character set you can use a qualifier with {}
.
For example, [A-Z]{5}
would match with all sequences of file letters which are fully capitalised. We can also use ranges: [A-Z]{5-7}
.
There are also some looser qualifiers:
+
- One or more.*
- Zero or more.
If we made the regex [A-Z]*
this would also match with each space in our string as this matches zero or more. We must also tell the regex to only start scanning for matching patterns once we have reached a boundary:
^
Start of line anchor.\\b
Start or end of a word.- Word boundary.
$
End of line anchor.
Wildcards
To save writing sets for commonly matched values, the following wildcards exist:
\d
- A single digit from 0-9.\w
- A single letter, number or underscore.\s
- A single whitespace character including space, tab and newline.-
.
- Any single character apart from newline.To match for a full stop then you can escape with
\
. The backslash would also need escaping for Java.
To match the opposite of these wildcards you can use the uppercase version. For example, \D
will match non-numerical characters.
In Java you must escape backslashes, \
, when you want to write them in a String literal. This makes \d
into \\d
and so on.
Grouping
To match a second regex from a previous group you can use ()
. For example From:(.*)
will save all of the characters after From:
to a specific group.
When using groups the regular expression will match the entire regular expression and save this as the first (0th) group. It will then work through each set of brackets and save these in subsequent groups. We access each of these groups via their integer index and the group()
method.
Pattern pattern = Pattern.compile("From:(.*)");
Matcher matcher = pattern.matcher(spam);
System.out.println("After From: " + matcher.group(1));