Finding All Occurrences of a Word in a String Using indexOf

1. Overview

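Searching for a character pattern or a word in a larger body of text comes up in many fields. In bioinformatics, we may need to find a DNA snippet in a chromosome. In a news editing system, an editor may need to locate a keyword in a large volume of text. In data monitoring, a system detects spam by looking for suspicious keywords.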

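Because this task is both common and hard, it is often nicknamed the "needle in a haystack" problem. In this article, we present a simple algorithm that uses the indexOf(String str, int fromIndex) method of Java's String class to find every occurrence of a word in a string.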
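As a quick reminder of how the two-argument indexOf behaves, here is a minimal sketch. The class name IndexOfDemo, the helper next, and the sample text are illustrative assumptions and are not part of the article's code.

```java
// Illustrative only: IndexOfDemo and the sample text are assumptions for this sketch.
class IndexOfDemo {

    // Returns the index of the first occurrence of word at or after fromIndex, or -1 if none.
    static int next(String text, String word, int fromIndex) {
        return text.indexOf(word, fromIndex);
    }

    public static void main(String[] args) {
        String text = "to be or not to be";
        System.out.println(next(text, "be", 0));  // 3: the first "be"
        System.out.println(next(text, "be", 4));  // 16: the search resumes past the first match
        System.out.println(next(text, "be", 17)); // -1: no occurrence is left
    }
}
```

Calling the method repeatedly, each time with a fromIndex just past the previous result, is the loop that the rest of the article builds on.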

2. Simple Algorithm

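Rather than only counting how often the word occurs, we find every position at which it appears in the text. The algorithm has the following characteristics:

✅ It matches the word even when it is embedded in another word. For example, searching for "able" also matches "comfortable" and "tablet".
✅ It is case-insensitive.
✅ It is based on naive ("brute-force") string matching: every position in the text is checked for a match.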

2.1. Implementation

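import java.util.ArrayList;
import java.util.List;

public class WordIndexer {

    // Returns the start index of every case-insensitive occurrence of word in textString.
    public List<Integer> findWord(String textString, String word) {
        List<Integer> indexes = new ArrayList<>();
        if (word.isEmpty()) {
            return indexes; // guard: an empty word would otherwise make the loop run forever
        }
        String lowerCaseTextString = textString.toLowerCase();
        String lowerCaseWord = word.toLowerCase();

        int index = 0;
        while (index != -1) {
            index = lowerCaseTextString.indexOf(lowerCaseWord, index);
            if (index != -1) {
                indexes.add(index);
                index++; // advance by one character so overlapping matches are found too
            }
        }
        return indexes;
    }
}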
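To see the embedded-word and case-insensitive behavior in action, here is a small self-contained sketch. EmbeddedMatchDemo is a hypothetical name, its static findWord mirrors the method above (with an added empty-word guard, which is an assumption of this sketch), and the sample sentence is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: EmbeddedMatchDemo is a hypothetical class; findWord mirrors the method above.
class EmbeddedMatchDemo {

    static List<Integer> findWord(String textString, String word) {
        List<Integer> indexes = new ArrayList<>();
        if (word.isEmpty()) {
            return indexes; // assumption: an empty word has no meaningful positions
        }
        String lowerCaseTextString = textString.toLowerCase();
        String lowerCaseWord = word.toLowerCase();
        int index = lowerCaseTextString.indexOf(lowerCaseWord);
        while (index != -1) {
            indexes.add(index);
            // resume one character after the current match
            index = lowerCaseTextString.indexOf(lowerCaseWord, index + 1);
        }
        return indexes;
    }

    public static void main(String[] args) {
        // "able" is found inside "Comfortable", inside "tablet", and as upper-case "ABLE".
        System.out.println(findWord("Comfortable tablet, ABLE hands", "able")); // [7, 13, 20]
    }
}
```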

2.2. Testing the Solution

We use a well-known passage from Shakespeare's Hamlet as the test text and search it for the word "or":

@Test
public void givenWord_whenSearching_thenFindAllIndexedLocations() {
    String theString;
    WordIndexer wordIndexer = new WordIndexer();

    theString = "To be, or not to be: that is the question: "
      + "Whether 'tis nobler in the mind to suffer "
      + "The slings and arrows of outrageous fortune, "
      + "Or to take arms against a sea of troubles, "
      + "And by opposing end them? To die: to sleep; "
      + "No more; and by a sleep to say we end "
      + "The heart-ache and the thousand natural shocks "
      + "That flesh is heir to, 'tis a consummation "
      + "Devoutly to be wish'd. To die, to sleep; "
      + "To sleep: perchance to dream: ay, there's the rub: "
      + "For in that sleep of death what dreams may come,";

    List<Integer> expectedResult = Arrays.asList(7, 122, 130, 221, 438);
    List<Integer> actualResult = wordIndexer.findWord(theString, "or");
    assertEquals(expectedResult, actualResult);
}

Running the test, we find the following matches:

index of 7, in "or"
index of 122, in "fortune"
index of 130, in "Or"
index of 221, in "more"
index of 438, in "For"

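In terms of complexity, the algorithm runs in O(m(n-m)) time, where m is the length of the word and n is the length of the text. That is perfectly adequate for a text of a few thousand characters, but it becomes too slow for a text with billions of characters.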
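To make that bound concrete, the following sketch counts the character comparisons made by a textbook naive matcher. NaiveMatchCost and countComparisons are hypothetical names, and the code only illustrates the counting argument; it is not a description of how String.indexOf is implemented internally.

```java
// Illustrative sketch: NaiveMatchCost is a hypothetical helper, not how String.indexOf works internally.
class NaiveMatchCost {

    // Counts the character comparisons a textbook naive matcher performs
    // while trying every alignment of word against text.
    static long countComparisons(String text, String word) {
        int n = text.length();
        int m = word.length();
        long comparisons = 0;
        for (int start = 0; start + m <= n; start++) {
            for (int j = 0; j < m; j++) {
                comparisons++;
                if (text.charAt(start + j) != word.charAt(j)) {
                    break; // mismatch: move on to the next alignment
                }
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        // Worst case: each of the 8 alignments matches until the last character of the word.
        System.out.println(countComparisons("aaaaaaaaaa", "aab")); // 24 = 8 alignments * 3 comparisons
        // Best case: each alignment fails on its first character.
        System.out.println(countComparisons("abcdefghij", "xyz")); // 8
    }
}
```

In the worst case the count grows with the product of the word length and the number of alignments, which is where the quadratic-looking bound comes from.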

3. Improved Algorithm

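The approach above is a classic brute-force search and works in every case. If we know in advance that the search word contains no repeating pattern of characters (such as "aaa"), so that two matches can never overlap, we can optimize it slightly.

Instead of resuming the search one character after each match, we can jump straight past the end of the match. This skips comparisons that cannot succeed and, in the ideal case, brings the complexity down to O(n).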

Here is the optimized findWordUpgrade method:

public List<Integer> findWordUpgrade(String textString, String word) {
    List<Integer> indexes = new ArrayList<>();
    if (word.isEmpty()) {
        return indexes; // guard: an empty word would otherwise make the loop run forever
    }
    String lowerCaseTextString = textString.toLowerCase();
    String lowerCaseWord = word.toLowerCase();
    int wordLength = lowerCaseWord.length();

    int index = lowerCaseTextString.indexOf(lowerCaseWord);
    while (index != -1) {
        indexes.add(index);
        // slight optimization: resume right after the end of the current match
        index = lowerCaseTextString.indexOf(lowerCaseWord, index + wordLength);
    }
    return indexes;
}

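⚠️ Note: this optimization is only safe when the search word contains no repeating pattern of characters. Otherwise it skips overlapping occurrences. If in doubt, stick with the original brute-force method.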
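The difference between the two strategies shows up as soon as matches can overlap. The sketch below is a hedged illustration: OverlapDemo and its step parameter are hypothetical, introduced only so both strategies can be compared in one method (a step of 1 behaves like findWord, a step of the word length behaves like findWordUpgrade). Case handling is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: OverlapDemo and the step parameter are hypothetical,
// used only to compare the two search strategies side by side.
class OverlapDemo {

    // step = 1 mirrors findWord; step = word.length() mirrors findWordUpgrade.
    static List<Integer> find(String text, String word, int step) {
        List<Integer> indexes = new ArrayList<>();
        if (word.isEmpty() || step < 1) {
            return indexes; // assumption: nothing sensible to report in these cases
        }
        int index = text.indexOf(word);
        while (index != -1) {
            indexes.add(index);
            index = text.indexOf(word, index + step);
        }
        return indexes;
    }

    public static void main(String[] args) {
        System.out.println(find("aaaa", "aa", 1));    // [0, 1, 2]: all occurrences
        System.out.println(find("aaaa", "aa", 2));    // [0, 2]: the match at index 1 is skipped
        System.out.println(find("banana", "ana", 1)); // [1, 3]
        System.out.println(find("banana", "ana", 3)); // [1]: the overlapping match at 3 is skipped
    }
}
```

For a word such as "or", which cannot overlap with itself, both step sizes return the same list.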

4. Conclusion

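This article showed how to implement a case-insensitive word search with Java's indexOf() method. indexOf() itself is case-sensitive; we worked around that by converting both the text and the word to lower case before searching.

Overall, indexOf() is a handy method for quickly locating a substring in a text without writing any substring manipulation code by hand.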

The complete code is available on GitHub.

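✅ Well suited to quick lookups, text analysis, and similar tasks
❌ Not recommended for real-time matching over very large texts
⚠️ If high performance is required, consider KMP or a regex-based approach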
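As one possible shape for the regex-based approach, here is a minimal sketch using java.util.regex. RegexWordFinder is a hypothetical name and this is not the article's implementation. It assumes ASCII text, since Pattern.CASE_INSENSITIVE on its own only folds ASCII letters, and it makes no claim about being faster than the indexOf loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: RegexWordFinder is a hypothetical name, not part of the article's code.
class RegexWordFinder {

    static List<Integer> findWord(String text, String word) {
        List<Integer> indexes = new ArrayList<>();
        if (word.isEmpty()) {
            return indexes; // assumption: an empty word yields no positions
        }
        // Pattern.quote treats the word as a literal. The lookahead (?=...) matches
        // without consuming characters, so overlapping occurrences are reported too.
        Pattern pattern = Pattern.compile("(?=" + Pattern.quote(word) + ")", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            indexes.add(matcher.start());
        }
        return indexes;
    }

    public static void main(String[] args) {
        System.out.println(findWord("Or for more", "or")); // [0, 4, 8]
        System.out.println(findWord("aaaa", "aa"));        // [0, 1, 2]
    }
}
```

Dropping the lookahead and compiling only the quoted word would report non-overlapping matches instead, which corresponds to the behavior of findWordUpgrade.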
