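1. Overview
This tutorial takes a close look at the Java regular expression API and at how regular expressions are used in the Java programming language.
Regular expressions come in many flavors (grep, Perl, Python, PHP, awk, and others), so a pattern that works in one language may not work in another. Java's regex syntax is closest to Perl's.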
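2. Setup
Using regular expressions in Java requires no special setup. The JDK ships with the java.util.regex package, which is dedicated to regex operations; we only need to import it. In addition, the java.lang.String class has built-in regex support that is commonly used in day-to-day code.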
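As a quick illustration of the String-based route mentioned above, the following sketch uses three standard String methods that accept a regex: matches(), replaceAll(), and split(). The class and method names (StringRegexSketch, isAllDigits, and so on) are made up for this example.

```java
import java.util.Arrays;

public class StringRegexSketch {

    // String.matches() requires the WHOLE input to match the regex.
    public static boolean isAllDigits(String input) {
        return input.matches("\\d+");
    }

    // Replaces every run of whitespace with a single space.
    public static String collapseWhitespace(String input) {
        return input.replaceAll("\\s+", " ");
    }

    // Splits on commas, ignoring whitespace around each comma.
    public static String[] splitOnCommas(String input) {
        return input.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("2024"));
        System.out.println(collapseWhitespace("a   b \t c"));
        System.out.println(Arrays.toString(splitOnCommas("x , y,z")));
    }
}
```

These methods compile the regex on every call, so for a pattern that is reused many times it is usually preferable to compile a Pattern once and reuse it.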
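3. Java Regex Package
The java.util.regex package contains three core classes:
- Pattern: a compiled regular expression. The class has no public constructor; we create an instance by calling the static compile() method with the regular expression as its argument.
- Matcher: interprets the pattern and performs match operations against an input string. It also has no public constructor; we obtain one by calling matcher() on a Pattern object.
- PatternSyntaxException: an unchecked exception that indicates a syntax error in a regular expression.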
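To make the roles of the three classes concrete, here is a minimal sketch. The class RegexCoreClassesSketch and its two helper methods are hypothetical names chosen for illustration; the Pattern, Matcher, and PatternSyntaxException calls are the standard ones.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RegexCoreClassesSketch {

    // Pattern.compile() throws PatternSyntaxException for malformed regexes.
    public static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Returns the first substring matched by the regex, or null if none.
    public static String firstMatch(String regex, String text) {
        Pattern pattern = Pattern.compile(regex);   // compiled regex
        Matcher matcher = pattern.matcher(text);    // matcher bound to this input
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("[abc"));      // unclosed character class
        System.out.println(firstMatch("o+", "foo bar"));
    }
}
```

Because PatternSyntaxException is unchecked, the try/catch is optional; it is mainly worth adding when the regex comes from user input.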
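4. Simple Example
Let's start with the most basic use of a regex. When a regex is applied to a string, it may match zero or more times. The simplest form of matching is against a string literal: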
@Test
public void givenText_whenSimpleRegexMatches_thenCorrect() {
Pattern pattern = Pattern.compile("foo");
Matcher matcher = pattern.matcher("foo");
assertTrue(matcher.find());
}
The find() method advances through the input text and returns true for each match it finds, so we can use it to count matches:
@Test
public void givenText_whenSimpleRegexMatchesTwice_thenCorrect() {
Pattern pattern = Pattern.compile("foo");
Matcher matcher = pattern.matcher("foofoo");
int matches = 0;
while (matcher.find()) {
matches++;
}
assertEquals(2, matches);
}
Let's extract the common logic into a runTest method:
public static int runTest(String regex, String text) {
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(text);
int matches = 0;
while (matcher.find()) {
matches++;
}
return matches;
}
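For comparison, the same count can be written without an explicit loop. This is only a sketch and assumes Java 9 or later, where Matcher.results() is available; the class name MatchCounterSketch and the method countMatches are hypothetical.

```java
import java.util.regex.Pattern;

public class MatchCounterSketch {

    // Counts non-overlapping matches using the stream returned by
    // Matcher.results() (added in Java 9). Returns long, not int.
    public static long countMatches(String regex, String text) {
        return Pattern.compile(regex).matcher(text).results().count();
    }

    public static void main(String[] args) {
        System.out.println(countMatches("foo", "foofoo"));
    }
}
```

The loop-based runTest above works on every Java version, which is one reason to prefer it in shared test helpers.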
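5. Meta Characters
Meta characters affect how a pattern is matched and add logic to the search pattern. The simplest is the dot (.), which matches any character (except line terminators by default):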
@Test
public void givenText_whenMatchesWithDotMetach_thenCorrect() {
int matches = runTest(".", "foo");
assertTrue(matches > 0);
}
By contrast, in the following example the regex foo. matches only once in foofoo:
@Test
public void givenRepeatedText_whenMatchesOnceWithDotMetach_thenCorrect() {
int matches = runTest("foo.", "foofoo");
assertEquals(1, matches);
}
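⚠️ Note: the dot consumes exactly one character after foo, so the first match is foof. The remaining oo is too short to match foo. again. Java's other meta characters are <([{\^-=$!|]})?*+.>.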
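Because these characters have special meaning, matching them literally requires escaping. The sketch below shows two standard options, a backslash escape and Pattern.quote(); the class EscapingSketch and its helpers are hypothetical names used only here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapingSketch {

    // Counts matches of a regex in the text.
    public static int count(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    // Counts occurrences of a literal string; Pattern.quote() wraps it
    // so that none of its characters are treated as meta characters.
    public static int countLiteral(String literal, String text) {
        return count(Pattern.quote(literal), text);
    }

    public static void main(String[] args) {
        // Unescaped, "+" is a quantifier, so "1+1=2" does not match itself.
        System.out.println(count("1+1=2", "1+1=2"));
        // Escaping the plus sign with a backslash makes it literal.
        System.out.println(count("1\\+1=2", "1+1=2"));
        System.out.println(countLiteral("1+1=2", "1+1=2 and 1+1=2"));
    }
}
```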
6. Character Classes
6.1. OR Class
The form [abc] matches any single element of the set:
@Test
public void givenORSet_whenMatchesAny_thenCorrect() {
int matches = runTest("[abc]", "b");
assertEquals(1, matches);
}
@Test
public void givenORSet_whenMatchesAnyAndAll_thenCorrect() {
int matches = runTest("[abc]", "cab");
assertEquals(3, matches);
}
@Test
public void givenORSet_whenMatchesAllCombinations_thenCorrect() {
int matches = runTest("[bcr]at", "bat cat rat");
assertEquals(3, matches);
}
6.2. NOR Class
Adding ^ as the first character inside the brackets negates the set:
@Test
public void givenNORSet_whenMatchesNon_thenCorrect() {
int matches = runTest("[^abc]", "g");
assertTrue(matches > 0);
}
@Test
public void givenNORSet_whenMatchesAllExceptElements_thenCorrect() {
int matches = runTest("[^bcr]at", "sat mat eat");
assertTrue(matches > 0);
}
6.3. Range Class
A hyphen (-) defines a range:
@Test
public void givenUpperCaseRange_whenMatchesUpperCase_thenCorrect() {
int matches = runTest("[A-Z]", "Two Uppercase alphabets 34 overall");
assertEquals(2, matches);
}
@Test
public void givenLowerCaseRange_whenMatchesLowerCase_thenCorrect() {
int matches = runTest("[a-z]", "Two Uppercase alphabets 34 overall");
assertEquals(26, matches);
}
@Test
public void givenBothLowerAndUpperCaseRange_whenMatchesAllLetters_thenCorrect() {
int matches = runTest("[a-zA-Z]", "Two Uppercase alphabets 34 overall");
assertEquals(28, matches);
}
@Test
public void givenNumberRange_whenMatchesAccurately_thenCorrect() {
int matches = runTest("[1-5]", "Two Uppercase alphabets 34 overall");
assertEquals(2, matches);
}
@Test
public void givenNumberRange_whenMatchesAccurately_thenCorrect2(){
int matches = runTest("3[0-5]", "Two Uppercase alphabets 34 overall");
assertEquals(1, matches);
}
6.4. Union Class
A union combines two or more character classes:
@Test
public void givenTwoSets_whenMatchesUnion_thenCorrect() {
int matches = runTest("[1-3[7-9]]", "123456789");
assertEquals(6, matches); // 4, 5, and 6 are skipped
}
6.5. Intersection Class
The && operator takes the intersection of two classes:
@Test
public void givenTwoSets_whenMatchesIntersection_thenCorrect() {
int matches = runTest("[1-6&&[3-9]]", "123456789");
assertEquals(4, matches); // the intersection is 3, 4, 5, 6
}
6.6. Subtraction Class
Subtraction is expressed by intersecting with a negated class:
@Test
public void givenSetWithSubtraction_whenMatchesAccurately_thenCorrect() {
int matches = runTest("[0-9&&[^2468]]", "123456789");
assertEquals(5, matches); // matches the odd digits 1, 3, 5, 7, 9
}
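7. Predefined Character Classes
Java provides predefined character classes that shorten expressions. Note that inside a Java string literal the backslash itself must be escaped, so \d is written "\\d":
- \d: a digit (equivalent to [0-9])
- \D: a non-digit
- \s: a whitespace character
- \S: a non-whitespace character
- \w: a word character (equivalent to [a-zA-Z_0-9])
- \W: a non-word character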
@Test
public void givenDigits_whenMatches_thenCorrect() {
int matches = runTest("\\d", "123");
assertEquals(3, matches);
}
@Test
public void givenNonDigits_whenMatches_thenCorrect() {
int matches = runTest("\\D", "a6c");
assertEquals(2, matches);
}
@Test
public void givenWhiteSpace_whenMatches_thenCorrect() {
int matches = runTest("\\s", "a c");
assertEquals(1, matches);
}
@Test
public void givenNonWhiteSpace_whenMatches_thenCorrect() {
int matches = runTest("\\S", "a c");
assertEquals(2, matches);
}
@Test
public void givenWordCharacter_whenMatches_thenCorrect() {
int matches = runTest("\\w", "hi!");
assertEquals(2, matches);
}
@Test
public void givenNonWordCharacter_whenMatches_thenCorrect() {
int matches = runTest("\\W", "hi!");
assertEquals(1, matches);
}
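8. Quantifiers
Quantifiers control how many times an element may match:
- ?: zero or one time (equivalent to {0,1})
- *: zero or more times (equivalent to {0,})
- +: one or more times (equivalent to {1,})
- {n}: exactly n times
- {n,m}: between n and m times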
@Test
public void givenZeroOrOneQuantifier_whenMatches_thenCorrect() {
int matches = runTest("a?", "hi");
assertEquals(3, matches); // an empty match at each of the three positions: before h, before i, and at the end
}
@Test
public void givenZeroOrManyQuantifier_whenMatches_thenCorrect() {
int matches = runTest("a*", "hi");
assertEquals(3, matches);
}
@Test
public void givenOneOrManyQuantifier_whenMatches_thenCorrect() {
int matches = runTest("a+", "hi");
assertFalse(matches > 0); // no match
}
@Test
public void givenBraceQuantifier_whenMatches_thenCorrect() {
int matches = runTest("a{3}", "aaaaaa");
assertEquals(2, matches); // matches "aaa" twice
}
@Test
public void givenBraceQuantifierWithRange_whenMatchesLazily_thenCorrect() {
int matches = runTest("a{2,3}?", "aaaa");
assertEquals(2, matches); // lazy matching: "aa" twice
}
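The last test relies on lazy matching, which deserves a side-by-side comparison. The sketch below runs the same input through a greedy (.*), a lazy (.*?), and a possessive (.*+) quantifier and collects the matched substrings. The class QuantifierModesSketch and the helper findAll are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierModesSketch {

    // Collects every substring matched by the regex, in order.
    public static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "xfooxxxxxxfoo";
        // Greedy: .* takes as much as possible, then backtracks to the last "foo".
        System.out.println(findAll(".*foo", text));
        // Lazy: .*? takes as little as possible, so each "foo" ends a match.
        System.out.println(findAll(".*?foo", text));
        // Possessive: .*+ takes everything and never gives it back, so "foo" cannot match.
        System.out.println(findAll(".*+foo", text));
    }
}
```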
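9. Capturing Groups
A capturing group treats multiple characters as a single unit. Groups are numbered and can be referred to later with back references: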
@Test
public void givenCapturingGroup_whenMatches_thenCorrect() {
int matches = runTest("(\\d\\d)", "12");
assertEquals(1, matches);
}
@Test
public void givenCapturingGroup_whenMatchesWithBackReference_thenCorrect() {
int matches = runTest("(\\d\\d)\\1", "1212");
assertEquals(1, matches); // the back reference matches the repeated "12"
}
@Test
public void givenCapturingGroupAndWrongInput_whenMatchFailsWithBackReference_thenCorrect() {
int matches = runTest("(\\d\\d)\\1", "1213");
assertFalse(matches > 0); // the back reference must repeat the captured text exactly
}
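Besides back references, groups are commonly used to pull pieces out of a match. The following sketch reads groups by number and by name through Matcher.group(); the class GroupExtractionSketch, the method parseDate, and the date format are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupExtractionSketch {

    // Group 1 is also named "year"; groups 2 and 3 are plain numbered groups.
    private static final Pattern DATE =
        Pattern.compile("(?<year>\\d{4})-(\\d{2})-(\\d{2})");

    // Returns {year, month, day} for the first date found, or null if none.
    public static String[] parseDate(String text) {
        Matcher matcher = DATE.matcher(text);
        if (!matcher.find()) {
            return null;
        }
        return new String[] {
            matcher.group("year"), // lookup by name
            matcher.group(2),      // lookup by number
            matcher.group(3)
        };
    }

    public static void main(String[] args) {
        String[] parts = parseDate("released on 2024-05-17");
        System.out.println(String.join("/", parts));
    }
}
```

Group 0 always refers to the whole match, so the numbered groups start at 1, counted by their opening parentheses from left to right.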
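10. Boundary Matchers
Boundary matchers control where in the input a match may occur:
- ^: the beginning of the input
- $: the end of the input
- \b: a word boundary
- \B: a non-word boundary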
@Test
public void givenText_whenMatchesAtBeginning_thenCorrect() {
int matches = runTest("^dog", "dogs are friendly");
assertTrue(matches > 0);
}
@Test
public void givenText_whenMatchesAtEnd_thenCorrect() {
int matches = runTest("dog$", "Man's best friend is a dog");
assertTrue(matches > 0);
}
@Test
public void givenText_whenMatchesAtWordBoundary_thenCorrect() {
int matches = runTest("\\bdog\\b", "a dog is friendly");
assertTrue(matches > 0);
}
@Test
public void givenWrongText_whenMatchFailsAtWordBoundary_thenCorrect() {
int matches = runTest("\\bdog\\b", "snoop dogg is a rapper");
assertFalse(matches > 0); // "dogg" is not the standalone word "dog"
}
11. Pattern Class Flags
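The tests in this section call runTest with a third flags argument, an overload that the text does not define. A minimal sketch, assuming it mirrors the two-argument version and simply forwards the flags to Pattern.compile(String, int), could look like this (the wrapper class name RunTestWithFlagsSketch is hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RunTestWithFlagsSketch {

    // Same counting logic as the two-argument runTest, but the pattern
    // is compiled with the given flags (for example Pattern.DOTALL).
    public static int runTest(String regex, String text, int flags) {
        Pattern pattern = Pattern.compile(regex, flags);
        Matcher matcher = pattern.matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(runTest("dog", "This is a Dog", Pattern.CASE_INSENSITIVE));
    }
}
```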
11.1. Pattern.CANON_EQ
This flag enables canonical equivalence, so a precomposed character matches its decomposed form:
@Test
public void givenRegexWithCanonEq_whenMatchesOnEquivalentUnicode_thenCorrect() {
int matches = runTest("\u00E9", "\u0065\u0301", Pattern.CANON_EQ);
assertTrue(matches > 0); // matches the decomposed form of 'é' (e followed by a combining acute accent)
}
11.2. Pattern.CASE_INSENSITIVE
This flag makes matching ignore case:
@Test
public void givenRegexWithCaseInsensitiveMatcher_whenMatchesOnDifferentCases_thenCorrect() {
int matches = runTest("dog", "This is a Dog", Pattern.CASE_INSENSITIVE);
assertTrue(matches > 0);
}
11.3. Pattern.COMMENTS
This flag permits whitespace and comments in the pattern:
@Test
public void givenRegexWithComments_whenMatchesWithFlag_thenCorrect() {
int matches = runTest("dog$ #check end of text", "This is a dog", Pattern.COMMENTS);
assertTrue(matches > 0);
}
11.4. Pattern.DOTALL
This flag makes the dot (.) match line terminators as well:
@Test
public void givenRegexWithLineTerminator_whenMatchesWithDotall_thenCorrect() {
Pattern pattern = Pattern.compile("(.*)", Pattern.DOTALL);
Matcher matcher = pattern.matcher("line1\nline2");
matcher.find();
assertEquals("line1\nline2", matcher.group(1));
}
11.5. Pattern.LITERAL
This flag treats the pattern as a literal string, so meta characters lose their special meaning:
@Test
public void givenRegex_whenMatchesWithLiteralFlag_thenCorrect() {
int matches = runTest("(.*)", "text(.*)", Pattern.LITERAL);
assertTrue(matches > 0); // matches the exact string "(.*)"
}
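11.6. Pattern.MULTILINE
In multiline mode, ^ and $ match at the beginning and end of each line rather than only at the beginning and end of the whole input: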
@Test
public void givenRegex_whenMatchesWithMultilineFlag_thenCorrect() {
int matches = runTest("dog$", "This is a dog\nAnd a fox", Pattern.MULTILINE);
assertTrue(matches > 0); // matches "dog" at the end of the first line
}
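The flags above can also be combined. The sketch below shows two standard ways to do so: OR-ing the constants together, and using an embedded flag expression such as (?im) at the start of the pattern. The class FlagCombinationSketch and its count helper are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagCombinationSketch {

    // Counts matches of a regex compiled with the given flags (0 means none).
    public static int count(String regex, String text, int flags) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "a cat\nDOG\nend";
        // Flags are bit masks, so they combine with the bitwise OR operator.
        int flags = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE;
        System.out.println(count("^dog$", text, flags));
        // (?i) and (?m) are the embedded equivalents of the same two flags.
        System.out.println(count("(?im)^dog$", text, 0));
    }
}
```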
12. Matcher Class Methods
12.1. Index Methods
Index methods report where a match was found:
@Test
public void givenMatch_whenGetsIndices_thenCorrect() {
Pattern pattern = Pattern.compile("dog");
Matcher matcher = pattern.matcher("This dog is mine");
matcher.find();
assertEquals(5, matcher.start()); // index of the first matched character
assertEquals(8, matcher.end()); // index just past the last matched character
}
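12.2. Study Methods
Study methods check whether the input matches:
- matches(): the entire input must match
- lookingAt(): a prefix of the input must match, starting at the beginning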
@Test
public void whenStudyMethodsWork_thenCorrect() {
Pattern pattern = Pattern.compile("dog");
Matcher matcher = pattern.matcher("dogs are friendly");
assertTrue(matcher.lookingAt()); // the beginning of the input matches
assertFalse(matcher.matches()); // the input as a whole does not match
}
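It can help to see these two methods next to find(), which was used throughout the earlier sections. The sketch below evaluates all three against the same input; the class MatchModesSketch and the method probe are hypothetical, and a fresh Matcher is created for each call so that no matcher state carries over.

```java
import java.util.regex.Pattern;

public class MatchModesSketch {

    // Returns {matches(), lookingAt(), find()} for the regex against the text.
    public static boolean[] probe(String regex, String text) {
        Pattern pattern = Pattern.compile(regex);
        return new boolean[] {
            pattern.matcher(text).matches(),   // whole input must match
            pattern.matcher(text).lookingAt(), // match must start at index 0
            pattern.matcher(text).find()       // match may occur anywhere
        };
    }

    public static void main(String[] args) {
        for (String text : new String[] {"dog", "dogs", "hotdogs"}) {
            boolean[] r = probe("dog", text);
            System.out.println(text + ": " + r[0] + " " + r[1] + " " + r[2]);
        }
    }
}
```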
12.3. Replacement Methods
Replacement methods replace matched text:
@Test
public void whenReplaceFirstWorks_thenCorrect() {
Pattern pattern = Pattern.compile("dog");
Matcher matcher = pattern.matcher("dogs are domestic animals, dogs are friendly");
String newStr = matcher.replaceFirst("cat");
assertEquals("cats are domestic animals, dogs are friendly", newStr);
}
@Test
public void whenReplaceAllWorks_thenCorrect() {
Pattern pattern = Pattern.compile("dog");
Matcher matcher = pattern.matcher("dogs are domestic animals, dogs are friendly");
String newStr = matcher.replaceAll("cat");
assertEquals("cats are domestic animals, cats are friendly", newStr);
}
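Replacement strings are not always plain text: $n refers to capturing group n of the match. The sketch below shows a group reference in a replacement and uses Matcher.quoteReplacement() for the case where a dollar sign should be inserted literally. The class ReplacementSketch and its methods are hypothetical names for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplacementSketch {

    // Swaps two words separated by a space: "John Smith" becomes "Smith, John".
    public static String swapNames(String fullName) {
        return Pattern.compile("(\\w+) (\\w+)")
                      .matcher(fullName)
                      .replaceAll("$2, $1"); // $1 and $2 are group references
    }

    // Inserts the replacement verbatim, even if it contains '$' or '\'.
    public static String replaceLiterally(String text, String regex, String replacement) {
        return Pattern.compile(regex)
                      .matcher(text)
                      .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(swapNames("John Smith"));
        // Without quoteReplacement, "$5" would be read as a reference to group 5.
        System.out.println(replaceLiterally("price: X", "X", "$5"));
    }
}
```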
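13. Conclusion
This article covered the core features of the Java regular expression API: pattern compilation, match operations, character classes, quantifiers, boundary matchers, and more. With the java.util.regex package, we can handle complex text-matching requirements. The complete code examples are available in the GitHub project.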