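1. Overview

This tutorial takes a close look at the Java regular expression API and at how to use regular expressions in the Java programming language.

Regular expressions come in many flavors (grep, Perl, Python, PHP, awk, and others), so a regex that works in one language may not work in another. Java's regex syntax is closest to Perl's.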

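2. Setup

Using regular expressions in Java needs no special configuration. The JDK ships with the java.util.regex package, which is dedicated to regex operations; we only need to import it. The java.lang.String class also has built-in regex support, which is what everyday code often uses.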
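
As a quick illustration of the String-level support mentioned above, here is a minimal sketch using three String methods that accept a regex: matches(), replaceAll() and split(). The class name StringRegexSketch, its helper methods and the sample inputs are made up for this example.

```java
import java.util.Arrays;

// Hypothetical class name, used only for this illustration
class StringRegexSketch {

    // True only if the ENTIRE string is lowercase letters followed by digits
    static boolean isLettersThenDigits(String input) {
        return input.matches("[a-z]+\\d+");
    }

    // Replaces every run of digits with a single '#'
    static String maskDigits(String input) {
        return input.replaceAll("\\d+", "#");
    }

    // Splits on a comma followed by optional whitespace
    static String[] splitOnCommas(String input) {
        return input.split(",\\s*");
    }

    public static void main(String[] args) {
        System.out.println(isLettersThenDigits("foo123"));                    // true
        System.out.println(maskDigits("a1b22c333"));                          // a#b#c#
        System.out.println(Arrays.toString(splitOnCommas("one, two,three"))); // [one, two, three]
    }
}
```

These convenience methods generally compile the regex on each call, so for a pattern that is reused many times it is usually worth compiling a Pattern once instead.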

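3. The Java Regex Package

The java.util.regex package contains three core classes:

  • Pattern: a compiled regular expression. The class has no public constructor; we create an instance by calling the static compile() method with the regular expression as the argument.
  • Matcher: interprets the pattern and performs match operations against an input string. It also has no public constructor; we obtain one by calling matcher() on a Pattern object.
  • PatternSyntaxException: an unchecked exception that indicates a syntax error in a regular expression pattern.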
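
The three classes can be seen working together in the sketch below. It is only an illustration: the class name RegexClassesSketch and the helper isValidRegex are hypothetical, and the deliberately broken pattern "[a-z" was chosen just to trigger the exception.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical class name, used only for this illustration
class RegexClassesSketch {

    // Returns true if the regex compiles, false if its syntax is invalid
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // getDescription() and getIndex() say what went wrong and roughly where
            System.out.println(e.getDescription() + " near index " + e.getIndex());
            return false;
        }
    }

    public static void main(String[] args) {
        // Pattern comes from the static factory, Matcher from the Pattern
        Pattern pattern = Pattern.compile("fo+");
        Matcher matcher = pattern.matcher("a foo here");
        System.out.println(matcher.find());        // true

        // The character class is never closed, so compilation fails
        System.out.println(isValidRegex("[a-z"));  // false
    }
}
```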

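4. Simple Example

Let's start with the most basic use of a regex. When a regex is applied to a string, it may match zero or more times. The simplest form of matching is a string literal match: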

@Test
public void givenText_whenSimpleRegexMatches_thenCorrect() {
    Pattern pattern = Pattern.compile("foo");
    Matcher matcher = pattern.matcher("foo");
 
    assertTrue(matcher.find());
}

The find() method advances through the input text and returns true each time it finds another match, so we can use it to count matches:

@Test
public void givenText_whenSimpleRegexMatchesTwice_thenCorrect() {
    Pattern pattern = Pattern.compile("foo");
    Matcher matcher = pattern.matcher("foofoo");
    int matches = 0;
    while (matcher.find()) {
        matches++;
    }
 
    assertEquals(matches, 2);
}

Since this counting logic will be reused in the following sections, let's extract it into a runTest method:

public static int runTest(String regex, String text) {
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(text);
    int matches = 0;
    while (matcher.find()) {
        matches++;
    }
    return matches;
}
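
As an aside, on Java 9 or later the same count can be obtained without an explicit loop, because Matcher.results() returns a stream of match results. The sketch below assumes such a JDK; the class and method names are hypothetical and are not part of the original examples.

```java
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration
class CountMatchesSketch {

    // Counts non-overlapping matches via Matcher.results() (requires Java 9+)
    static long countMatches(String regex, String text) {
        return Pattern.compile(regex).matcher(text).results().count();
    }

    public static void main(String[] args) {
        System.out.println(countMatches("foo", "foofoo")); // 2
        System.out.println(countMatches("foo", "bar"));    // 0
    }
}
```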

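5. Metacharacters

Metacharacters affect how a pattern is matched, adding logic to the search pattern. The simplest is the dot (.), which matches any character (except line terminators, by default):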

@Test
public void givenText_whenMatchesWithDotMetach_thenCorrect() {
    int matches = runTest(".", "foo");
    
    assertTrue(matches > 0);
}

Compare that with the following example, where the regex foo. matches only once in the text foofoo:

@Test
public void givenRepeatedText_whenMatchesOnceWithDotMetach_thenCorrect() {
    int matches = runTest("foo.", "foofoo");
 
    assertEquals(matches, 1);
}

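⚠️ Note: the dot after foo consumes exactly one character, here the f of the second foo. What is left over (oo) cannot match foo. again, so there is only a single match. Other metacharacters include <([{\^-=$!|]})?*+.>.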
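
Because metacharacters have special meaning, matching one literally requires escaping it. The sketch below shows two common options: a backslash escape and Pattern.quote(). The class EscapeSketch and its count helper are hypothetical names introduced for this example; count uses the same loop as runTest.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration
class EscapeSketch {

    // Same counting loop as runTest, repeated here so the sketch is self-contained
    static int count(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        // Unescaped dot matches every character: 'a', '.', 'b'
        System.out.println(count(".", "a.b"));                    // 3
        // Escaped dot matches only the literal '.'
        System.out.println(count("\\.", "a.b"));                  // 1
        // "1+1" as a regex means "one or more 1s, then a 1", which is absent here
        System.out.println(count("1+1", "1+1=2"));                // 0
        // Pattern.quote() turns the whole string into a literal pattern
        System.out.println(count(Pattern.quote("1+1"), "1+1=2")); // 1
    }
}
```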

6. Character Classes

6.1. OR Class

The form [abc] matches any single element of the set:

@Test
public void givenORSet_whenMatchesAny_thenCorrect() {
    int matches = runTest("[abc]", "b");
    assertEquals(matches, 1);
}

@Test
public void givenORSet_whenMatchesAnyAndAll_thenCorrect() {
    int matches = runTest("[abc]", "cab");
    assertEquals(matches, 3);
}

@Test
public void givenORSet_whenMatchesAllCombinations_thenCorrect() {
    int matches = runTest("[bcr]at", "bat cat rat");
    assertEquals(matches, 3);
}

6.2. NOR Class

Adding ^ as the first character inside the brackets negates the set:

@Test
public void givenNORSet_whenMatchesNon_thenCorrect() {
    int matches = runTest("[^abc]", "g");
    assertTrue(matches > 0);
}

@Test
public void givenNORSet_whenMatchesAllExceptElements_thenCorrect() {
    int matches = runTest("[^bcr]at", "sat mat eat");
    assertTrue(matches > 0);
}

6.3. Range Class

A hyphen (-) defines a range of characters:

@Test
public void givenUpperCaseRange_whenMatchesUpperCase_thenCorrect() {
    int matches = runTest("[A-Z]", "Two Uppercase alphabets 34 overall");
    assertEquals(matches, 2);
}

@Test
public void givenLowerCaseRange_whenMatchesLowerCase_thenCorrect() {
    int matches = runTest("[a-z]", "Two Uppercase alphabets 34 overall");
    assertEquals(matches, 26);
}

@Test
public void givenBothLowerAndUpperCaseRange_whenMatchesAllLetters_thenCorrect() {
    int matches = runTest("[a-zA-Z]", "Two Uppercase alphabets 34 overall");
    assertEquals(matches, 28);
}

@Test
public void givenNumberRange_whenMatchesAccurately_thenCorrect() {
    int matches = runTest("[1-5]", "Two Uppercase alphabets 34 overall");
    assertEquals(matches, 2);
}

@Test
public void givenNumberRange_whenMatchesAccurately_thenCorrect2() {
    int matches = runTest("3[0-5]", "Two Uppercase alphabets 34 overall");
    assertEquals(matches, 1);
}

6.4. Union Class

A union combines two or more character classes:

@Test
public void givenTwoSets_whenMatchesUnion_thenCorrect() {
    int matches = runTest("[1-3[7-9]]", "123456789");
    assertEquals(matches, 6); // 4, 5 and 6 are skipped
}

6.5. Intersection Class

The && operator takes the intersection of two character classes:

@Test
public void givenTwoSets_whenMatchesIntersection_thenCorrect() {
    int matches = runTest("[1-6&&[3-9]]", "123456789");
    assertEquals(matches, 4); // the intersection is 3, 4, 5, 6
}

6.6. Subtraction Class

Subtraction is expressed as an intersection with a negated class:

@Test
public void givenSetWithSubtraction_whenMatchesAccurately_thenCorrect() {
    int matches = runTest("[0-9&&[^2468]]", "123456789");
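    assertEquals(matches, 5); // matches the odd digits 1, 3, 5, 7, 9
}

7. Predefined Character Classes

Java provides predefined character classes that make expressions shorter. Note that in a Java string literal the backslash itself must be escaped, so \d is written as "\\d":

  • \d: a digit (equivalent to [0-9])
  • \D: a non-digit
  • \s: a whitespace character
  • \S: a non-whitespace character
  • \w: a word character (equivalent to [a-zA-Z_0-9])
  • \W: a non-word character

@Test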
public void givenDigits_whenMatches_thenCorrect() {
    int matches = runTest("\\d", "123");
    assertEquals(matches, 3);
}

@Test
public void givenNonDigits_whenMatches_thenCorrect() {
    int matches = runTest("\\D", "a6c");
    assertEquals(matches, 2);
}

@Test
public void givenWhiteSpace_whenMatches_thenCorrect() {
    int matches = runTest("\\s", "a c");
    assertEquals(matches, 1);
}

@Test
public void givenNonWhiteSpace_whenMatches_thenCorrect() {
    int matches = runTest("\\S", "a c");
    assertEquals(matches, 2);
}

@Test
public void givenWordCharacter_whenMatches_thenCorrect() {
    int matches = runTest("\\w", "hi!");
    assertEquals(matches, 2);
}

@Test
public void givenNonWordCharacter_whenMatches_thenCorrect() {
    int matches = runTest("\\W", "hi!");
    assertEquals(matches, 1);
}

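8. Quantifiers

Quantifiers specify how many times the preceding element may match:

  • ?: zero or one time (equivalent to {0,1})
  • *: zero or more times (equivalent to {0,})
  • +: one or more times (equivalent to {1,})
  • {n}: exactly n times
  • {n,m}: between n and m times

@Test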
public void givenZeroOrOneQuantifier_whenMatches_thenCorrect() {
    int matches = runTest("a?", "hi");
    assertEquals(matches, 3); // a zero-length match at each position, including the end of the input
}

@Test
public void givenZeroOrManyQuantifier_whenMatches_thenCorrect() {
    int matches = runTest("a*", "hi");
    assertEquals(matches, 3);
}

@Test
public void givenOneOrManyQuantifier_whenMatches_thenCorrect() {
    int matches = runTest("a+", "hi");
    assertFalse(matches > 0); // no match
}

@Test
public void givenBraceQuantifier_whenMatches_thenCorrect() {
    int matches = runTest("a{3}", "aaaaaa");
    assertEquals(matches, 2); // matches "aaa" twice
}

@Test
public void givenBraceQuantifierWithRange_whenMatchesLazily_thenCorrect() {
    int matches = runTest("a{2,3}?", "aaaa");
    assertEquals(matches, 2); // lazy matching: "aa" twice
}
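
The difference between a greedy quantifier and its lazy form (the same quantifier followed by ?) is easier to see when we look at the matched text rather than the count. The following is a small sketch; the class QuantifierSketch and the helper allMatches are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration
class QuantifierSketch {

    // Collects the text of every match instead of just counting
    static List<String> allMatches(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Greedy: takes three a's first, and the single 'a' left over cannot match
        System.out.println(allMatches("a{2,3}", "aaaa"));   // [aaa]
        // Lazy: takes the minimum of two a's each time
        System.out.println(allMatches("a{2,3}?", "aaaa"));  // [aa, aa]

        // Greedy .+ runs to the last '>' in the input
        System.out.println(allMatches("<.+>", "<a><b>"));   // [<a><b>]
        // Lazy .+? stops at the first '>' it can
        System.out.println(allMatches("<.+?>", "<a><b>"));  // [<a>, <b>]
    }
}
```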

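9. Capturing Groups

A capturing group treats multiple characters as a single unit. Groups are numbered, and a back-reference such as \1 refers to the text a group has already matched: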

@Test
public void givenCapturingGroup_whenMatches_thenCorrect() {
    int matches = runTest("(\\d\\d)", "12");
    assertEquals(matches, 1);
}

@Test
public void givenCapturingGroup_whenMatchesWithBackReference_thenCorrect() {
    int matches = runTest("(\\d\\d)\\1", "1212");
    assertEquals(matches, 1); // the back-reference matches the repeated "12"
}

@Test
public void givenCapturingGroupAndWrongInput_whenMatchFailsWithBackReference_thenCorrect() {
    int matches = runTest("(\\d\\d)\\1", "1213");
    assertFalse(matches > 0); // the back-reference must repeat the captured text exactly
}
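
Besides back-references, the text captured by each group can be read from the Matcher after a successful match. The sketch below extracts the parts of a date by group number and by name. The class GroupExtractionSketch, the helper firstDate, the group name year and the sample input are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration
class GroupExtractionSketch {

    // Groups are numbered by opening parenthesis, left to right, starting at 1.
    // The first group is also given the name "year".
    private static final Pattern DATE =
        Pattern.compile("(?<year>\\d{4})-(\\d{2})-(\\d{2})");

    // Returns {year, month, day} from the first date found, or null if none
    static String[] firstDate(String text) {
        Matcher matcher = DATE.matcher(text);
        if (!matcher.find()) {
            return null;
        }
        return new String[] {
            matcher.group("year"), // named group (same text as group(1))
            matcher.group(2),
            matcher.group(3)
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(firstDate("Released on 2024-01-15."))); // [2024, 01, 15]
        System.out.println(firstDate("no date here"));                             // null
    }
}
```

Group 0 always denotes the whole match, so matcher.group() and matcher.group(0) return the same text.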

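10. Boundary Matchers

Boundary matchers restrict where in the input a match may occur:

  • ^: the beginning of the input
  • $: the end of the input
  • \b: a word boundary
  • \B: a non-word boundary

@Test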
public void givenText_whenMatchesAtBeginning_thenCorrect() {
    int matches = runTest("^dog", "dogs are friendly");
    assertTrue(matches > 0);
}

@Test
public void givenText_whenMatchesAtEnd_thenCorrect() {
    int matches = runTest("dog$", "Man's best friend is a dog");
    assertTrue(matches > 0);
}

@Test
public void givenText_whenMatchesAtWordBoundary_thenCorrect() {
    int matches = runTest("\\bdog\\b", "a dog is friendly");
    assertTrue(matches > 0);
}

@Test
public void givenWrongText_whenMatchFailsAtWordBoundary_thenCorrect() {
    int matches = runTest("\\bdog\\b", "snoop dogg is a rapper");
    assertFalse(matches > 0); // "dogg" is not the standalone word "dog"
}
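
The tests above exercise ^, $ and \b but not \B. As a rough illustration of how \b and \B differ, the sketch below counts matches of cat in a sentence where it appears both as a word and inside another word. The class BoundarySketch, its count helper and the sample text are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration
class BoundarySketch {

    // Same counting loop as runTest, repeated here so the sketch is self-contained
    static int count(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "cat concat cat";
        // Whole-word matches only: the first and last "cat"
        System.out.println(count("\\bcat\\b", text)); // 2
        // "cat" NOT preceded by a word boundary: only the one inside "concat"
        System.out.println(count("\\Bcat", text));    // 1
        // Anchored to the start and to the end of the input
        System.out.println(count("^cat", text));      // 1
        System.out.println(count("cat$", text));      // 1
    }
}
```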

11. Pattern Class Methods
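
The tests in this section call runTest with a third argument, the flags passed to Pattern.compile(regex, flags), whereas the runTest shown earlier takes only two. A plausible overload, assuming it simply mirrors the two-argument version, might look like the sketch below; the enclosing class name RegexTestSupport is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration
class RegexTestSupport {

    // Assumed overload: same counting loop as the two-argument runTest,
    // but the pattern is compiled with the given match flags
    public static int runTest(String regex, String text, int flags) {
        Pattern pattern = Pattern.compile(regex, flags);
        Matcher matcher = pattern.matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(runTest("dog", "This is a Dog", Pattern.CASE_INSENSITIVE)); // 1
        System.out.println(runTest("dog", "This is a Dog", 0));                        // 0
    }
}
```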

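11.1. Pattern.CANON_EQ

This flag enables canonical equivalence: two character sequences are considered a match when their full canonical decompositions are the same: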

@Test
public void givenRegexWithCanonEq_whenMatchesOnEquivalentUnicode_thenCorrect() {
    int matches = runTest("\u00E9", "\u0065\u0301", Pattern.CANON_EQ);
    assertTrue(matches > 0); // the single code point é matches e followed by a combining acute accent
}

11.2. Pattern.CASE_INSENSITIVE

This flag enables case-insensitive matching (for US-ASCII characters by default):

@Test
public void givenRegexWithCaseInsensitiveMatcher_whenMatchesOnDifferentCases_thenCorrect() {
    int matches = runTest("dog", "This is a Dog", Pattern.CASE_INSENSITIVE);
    assertTrue(matches > 0);
}

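11.3. Pattern.COMMENTS

This flag permits whitespace and comments in the pattern. Whitespace is ignored, and a comment starting with # is ignored up to the end of the line: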

@Test
public void givenRegexWithComments_whenMatchesWithFlag_thenCorrect() {
    int matches = runTest("dog$  #check end of text", "This is a dog", Pattern.COMMENTS);
    assertTrue(matches > 0);
}

11.4. Pattern.DOTALL

By default the dot does not match line terminators. This flag makes . match them as well:

@Test
public void givenRegexWithLineTerminator_whenMatchesWithDotall_thenCorrect() {
    Pattern pattern = Pattern.compile("(.*)", Pattern.DOTALL);
    Matcher matcher = pattern.matcher("line1\nline2");
    matcher.find();
    assertEquals("line1\nline2", matcher.group(1));
}

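11.5. Pattern.LITERAL

This flag treats the pattern as a literal string, so metacharacters and escape sequences lose their special meaning: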

@Test
public void givenRegex_whenMatchesWithLiteralFlag_thenCorrect() {
    int matches = runTest("(.*)", "text(.*)", Pattern.LITERAL);
    assertTrue(matches > 0); // matches the literal text "(.*)"
}

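11.6. Pattern.MULTILINE

By default ^ and $ match only at the beginning and end of the entire input. In multiline mode they also match at the start and end of each line: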

@Test
public void givenRegex_whenMatchesWithMultilineFlag_thenCorrect() {
    int matches = runTest("dog$", "This is a dog\nAnd a fox", Pattern.MULTILINE);
    assertTrue(matches > 0); // matches "dog" at the end of the first line
}
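
Flags are bit masks, so several can be combined with the bitwise OR operator, and most of them also have an embedded form written inside the pattern, such as (?i) for CASE_INSENSITIVE and (?m) for MULTILINE. The sketch below shows both styles side by side; the class FlagCombinationSketch, its count helper and the sample text are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration
class FlagCombinationSketch {

    // Counts matches of an already compiled pattern
    static int count(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "cat\nDOG\nbird";

        // Without flags, ^ and $ anchor to the whole input and case must match
        System.out.println(count(Pattern.compile("^dog$"), text));       // 0

        // Two flags combined with bitwise OR
        Pattern viaFlags = Pattern.compile("^dog$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        System.out.println(count(viaFlags, text));                       // 1

        // The same two flags written as an embedded flag expression
        System.out.println(count(Pattern.compile("(?im)^dog$"), text));  // 1
    }
}
```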

12. Matcher Class Methods

12.1. Index Methods

Index methods report exactly where in the input a match was found:

@Test
public void givenMatch_whenGetsIndices_thenCorrect() {
    Pattern pattern = Pattern.compile("dog");
    Matcher matcher = pattern.matcher("This dog is mine");
    matcher.find();
    assertEquals(5, matcher.start()); // index of the first matched character
    assertEquals(8, matcher.end());   // index just past the last matched character
}
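
Since start() and end() refer to the most recent match, calling them inside a find() loop yields the position of every match. A minimal sketch follows; the class MatchSpanSketch and the helper matchSpans are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration
class MatchSpanSketch {

    // Returns one {start, end} pair per match; end is exclusive
    static List<int[]> matchSpans(String regex, String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            spans.add(new int[] { matcher.start(), matcher.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "dog, dog, dog";
        for (int[] span : matchSpans("dog", text)) {
            // substring(start, end) reproduces the matched text
            System.out.println(span[0] + "-" + span[1] + ": " + text.substring(span[0], span[1]));
        }
        // prints:
        // 0-3: dog
        // 5-8: dog
        // 10-13: dog
    }
}
```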

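12.2. Study Methods

Study methods examine the input and return a boolean indicating whether the pattern was found:

  • matches(): requires the entire input to match the pattern
  • lookingAt(): requires a match that starts at the beginning of the input, but it need not cover all of it

@Test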
public void whenStudyMethodsWork_thenCorrect() {
    Pattern pattern = Pattern.compile("dog");
    Matcher matcher = pattern.matcher("dogs are friendly");
    assertTrue(matcher.lookingAt()); // the beginning of the input matches
    assertFalse(matcher.matches());  // the input as a whole does not match
}

12.3. Replacement Methods

Replacement methods replace the matched text in the input string:

@Test
public void whenReplaceFirstWorks_thenCorrect() {
    Pattern pattern = Pattern.compile("dog");
    Matcher matcher = pattern.matcher("dogs are domestic animals, dogs are friendly");
    String newStr = matcher.replaceFirst("cat");
    assertEquals("cats are domestic animals, dogs are friendly", newStr);
}

@Test
public void whenReplaceAllWorks_thenCorrect() {
    Pattern pattern = Pattern.compile("dog");
    Matcher matcher = pattern.matcher("dogs are domestic animals, dogs are friendly");
    String newStr = matcher.replaceAll("cat");
    assertEquals("cats are domestic animals, cats are friendly", newStr);
}
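
The replacement string is not purely literal: $1, $2 and so on refer to captured groups, and a literal dollar sign therefore needs escaping, for example with Matcher.quoteReplacement(). The sketch below illustrates both points; the class ReplacementSketch, its helper methods and the sample inputs are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration
class ReplacementSketch {

    // Swaps "First Last" into "Last, First" using group references
    static String swapNames(String fullName) {
        Matcher matcher = Pattern.compile("(\\w+) (\\w+)").matcher(fullName);
        return matcher.replaceAll("$2, $1");
    }

    // Inserts the replacement verbatim, even if it contains '$' or '\'
    static String replaceLiterally(String text, String regex, String replacement) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(swapNames("John Smith"));                          // Smith, John
        // Without quoteReplacement, "$5" would be read as a reference to group 5
        System.out.println(replaceLiterally("Total: PRICE", "PRICE", "$5")); // Total: $5
    }
}
```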

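13. Conclusion

This article covered the core features of the Java regular expression API, including pattern compilation, match operations, character classes, quantifiers, and boundary matchers. Used well, the java.util.regex package handles complex text-matching requirements efficiently. The complete code examples are available in the GitHub project.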


Original title: A Guide To Java Regular Expressions API