1. Overview

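Apache Lucene is a high-performance full-text search library. It is written in Java, and it can also be used from several other programming languages. This article walks through its core concepts and builds a small application to get started quickly.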

2. Maven Configuration

First, add the core dependency:

<dependency>        
    <groupId>org.apache.lucene</groupId>          
    <artifactId>lucene-core</artifactId>
    <version>7.1.0</version>
</dependency>

The latest version is listed on Maven Central.

To parse search queries, we also need:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.1.0</version>
</dependency>

The latest version of this artifact is listed on Maven Central as well.

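3. Core Concepts

3.1 Indexing

Lucene uses an inverted index. Instead of mapping each document to the words it contains, it maps each term to the documents that contain it, much like the index at the back of a book. Looking a term up in this structure is far faster than scanning the text of every document.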
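To make the idea concrete, here is a minimal sketch of an inverted index in plain Java. It is a toy in-memory structure, not Lucene's actual index format; the class name ToyInvertedIndex and its methods are hypothetical and exist only for illustration.

```java
import java.util.*;

// Conceptual sketch only: a toy inverted index, not how Lucene stores its data.
class ToyInvertedIndex {
    // term -> sorted set of ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    void add(int docId, String text) {
        // Naive tokenization: lowercase, then split on runs of non-letter/digit characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // A lookup is a single map access instead of a scan over every document.
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySortedSet());
    }
}
```

After adding a few documents, lookup("tea") returns the ids of exactly those documents whose text contains the word, without touching the others.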

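3.2 Documents

A document is a collection of fields, and each field has a value associated with it. An index is typically made up of one or more documents, and a search returns the set of best-matching documents. Note that a document is not necessarily a plain text file: it could also be a database table or a collection of objects.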

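3.3 Fields

The fields of a document are key-value pairs:

title: Benefits of Tea
body: Discussing the health benefits of herbal tea...

Here, title and body are separate fields, and they can be searched together or individually.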

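3.4 Analysis

Analysis converts the given text into smaller, precise units that are easier to search. The text typically goes through these steps:

  1. Extracting keywords
  2. Removing common stop words (such as "a" and "the")
  3. Stripping punctuation
  4. Converting to lowercase

Some of the built-in analyzers:

  • StandardAnalyzer: grammar-based tokenization, plus stop-word removal and lowercasing
  • SimpleAnalyzer: splits on non-letter characters and lowercases
  • WhitespaceAnalyzer: splits on whitespace only

Custom analyzers can be written for more specialized needs.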
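The steps listed above can be sketched in a few lines of plain Java. This is only an illustration of the pipeline, not one of Lucene's analyzers: the class ToyAnalyzer and its small stop-word list are assumptions made up for this example.

```java
import java.util.*;

// Conceptual sketch of an analysis pipeline; ToyAnalyzer is hypothetical, not a Lucene class.
class ToyAnalyzer {
    // Illustrative stop-word list, far shorter than what a real analyzer would use.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "of", "in"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Splitting on non-letter/digit runs both tokenizes the text and drops punctuation.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

For example, ToyAnalyzer.analyze("The Benefits of Herbal Tea!") yields the tokens benefits, herbal and tea. The same analysis has to be applied at index time and at query time, otherwise the terms will not line up.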

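3.5 Searching

Once an index is built, we search it using a Query and an IndexSearcher:

  • IndexWriter is responsible for creating the index
  • IndexSearcher runs queries against it
  • The search returns its hits as a TopDocs object, which holds the ids and scores of the best-matching documents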

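3.6 Query Syntax

Lucene provides a flexible query syntax.

Basic query

fieldName:text    # search within a specific field
e.g. title:tea

Range query

timestamp:[1509909322 TO 1572981321]

Wildcard queries

dri?nk    # ? matches exactly one character
d*k       # * matches zero or more characters
uni*      # matches terms that start with "uni"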
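One way to reason about the two wildcards is to translate the pattern into a regular expression. The sketch below does that in plain Java; it is an illustration of the matching rules only, not how Lucene evaluates wildcard queries, and the class name ToyWildcard is hypothetical.

```java
import java.util.regex.Pattern;

// Conceptual sketch: wildcard semantics expressed as a regular expression.
class ToyWildcard {
    // Translates ? and * into regex equivalents; every other character is matched literally.
    static Pattern compile(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');       // exactly one character
            } else if (c == '*') {
                regex.append(".*");      // zero or more characters
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // The pattern must cover the whole term, not just a part of it.
    static boolean matches(String wildcard, String term) {
        return compile(wildcard).matcher(term).matches();
    }
}
```

Under these rules d*k matches "drink" and uni* matches "universe", while dri?nk does not match "drink" because ? always consumes exactly one character.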

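Combined queries

title:"Tea in breakfast" AND "coffee"

More details on the syntax are available in the official Lucene query parser documentation.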

4. A Simple Application

Let's create an in-memory index and add a document to it:

// Keep the whole index in memory; memoryIndex and analyzer are reused by the search method
Directory memoryIndex = new RAMDirectory();
StandardAnalyzer analyzer = new StandardAnalyzer();
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(memoryIndex, indexWriterConfig);
Document document = new Document();

// title and body are the String values to index
document.add(new TextField("title", title, Field.Store.YES));
document.add(new TextField("body", body, Field.Store.YES));

writer.addDocument(document);
writer.close();  // commits the document and releases the write lock

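The third argument of the TextField constructor controls whether the field's original value is stored in the index so it can be returned with search results.

Next, the search method: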

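public List<Document> searchIndex(String inField, String queryString) {
    try {
        // Parse the query string with the same analyzer that was used for indexing
        Query query = new QueryParser(inField, analyzer)
          .parse(queryString);

        // try-with-resources closes the reader once the stored fields have been read
        try (IndexReader indexReader = DirectoryReader.open(memoryIndex)) {
            IndexSearcher searcher = new IndexSearcher(indexReader);
            TopDocs topDocs = searcher.search(query, 10);  // top 10 hits
            List<Document> documents = new ArrayList<>();
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                documents.add(searcher.doc(scoreDoc.doc));
            }
            return documents;
        }
    } catch (IOException | ParseException e) {
        // Both checked exceptions are rethrown unchecked to keep the signature simple
        throw new IllegalStateException("Search failed", e);
    }
}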

A test case:

@Test
public void givenSearchQueryWhenFetchedDocumentThenCorrect() {
    InMemoryLuceneIndex inMemoryLuceneIndex 
      = new InMemoryLuceneIndex(new RAMDirectory(), new StandardAnalyzer());
    inMemoryLuceneIndex.indexDocument("Hello world", "Some hello world");
    
    List<Document> documents 
      = inMemoryLuceneIndex.searchIndex("body", "world");
    
    assertEquals(
      "Hello world", 
      documents.get(0).get("title"));
}

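6. Lucene Queries

With the basics covered, let's look at some of Lucene's query classes.

6.1 TermQuery

A Term is the basic unit of search, made up of a field name and the text to look for. TermQuery is the simplest query, built from a single term: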

@Test
public void givenTermQueryWhenFetchedDocumentThenCorrect() {
    InMemoryLuceneIndex inMemoryLuceneIndex 
      = new InMemoryLuceneIndex(new RAMDirectory(), new StandardAnalyzer());
    inMemoryLuceneIndex.indexDocument("activity", "running in track");
    inMemoryLuceneIndex.indexDocument("activity", "Cars are running on road");

    Term term = new Term("body", "running");
    Query query = new TermQuery(term);

    List<Document> documents = inMemoryLuceneIndex.searchIndex(query);
    assertEquals(2, documents.size());
}

6.2 PrefixQuery

To search for documents containing a term that starts with a given prefix:

@Test
public void givenPrefixQueryWhenFetchedDocumentThenCorrect() {
    InMemoryLuceneIndex inMemoryLuceneIndex 
      = new InMemoryLuceneIndex(new RAMDirectory(), new StandardAnalyzer());
    inMemoryLuceneIndex.indexDocument("article", "Lucene introduction");
    inMemoryLuceneIndex.indexDocument("article", "Introduction to Lucene");

    Term term = new Term("body", "intro");
    Query query = new PrefixQuery(term);

    List<Document> documents = inMemoryLuceneIndex.searchIndex(query);
    assertEquals(2, documents.size());
}
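A prefix search is cheap when the indexed terms are kept in sorted order, because all terms sharing a prefix sit next to each other. The sketch below shows that idea with a sorted set from the JDK; it is a conceptual illustration, not Lucene's term dictionary, and ToyPrefixLookup is a hypothetical name.

```java
import java.util.*;

// Conceptual sketch: prefix lookup over a sorted set of terms.
class ToyPrefixLookup {
    // Returns every term in the sorted dictionary that starts with the given prefix.
    static List<String> termsWithPrefix(NavigableSet<String> dictionary, String prefix) {
        List<String> result = new ArrayList<>();
        // Jump straight to the first term >= prefix, then read forward.
        for (String term : dictionary.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break;  // sorted order: once a term lacks the prefix, no later term has it
            }
            result.add(term);
        }
        return result;
    }
}
```

With the terms index, intro, introduction, lucene and to in the set, a lookup for "intro" returns intro and introduction and stops as soon as it reaches lucene.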

6.3 WildcardQuery

As the name suggests, this query supports the wildcards * and ?:

Term term = new Term("body", "intro*");
Query query = new WildcardQuery(term);

6.4 PhraseQuery

This query matches a sequence of terms and lets us control how far apart they may be:

inMemoryLuceneIndex.indexDocument(
  "quotes", 
  "A rose by any other name would smell as sweet.");

Query query = new PhraseQuery(
  1, "body", new BytesRef("smell"), new BytesRef("sweet"));

List<Document> documents = inMemoryLuceneIndex.searchIndex(query);

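⚠️ The first argument, slop, is the maximum number of positions by which the terms may be out of place and still match; here it allows one other word between "smell" and "sweet".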
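The effect of slop on a two-term phrase can be sketched with token positions. This simplified version only handles two terms that appear in the given order, whereas Lucene's real slop also allows reordering at a higher cost; the class ToySloppyPhrase is hypothetical.

```java
import java.util.*;

// Conceptual sketch: in-order, two-term phrase matching with a slop allowance.
class ToySloppyPhrase {
    // True if `first` occurs before `second` with at most `slop` tokens between them.
    static boolean matches(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            // j - i - 1 is the number of tokens sitting between positions i and j.
            for (int j = i + 1; j < tokens.size() && j - i - 1 <= slop; j++) {
                if (tokens.get(j).equals(second)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

For the tokens of "a rose by any other name would smell as sweet", the pair smell/sweet matches with a slop of 1 but not with a slop of 0, since "as" sits between the two words.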

6.5 FuzzyQuery

A fuzzy query matches terms that are similar to the search term, which makes it tolerant of spelling mistakes:

inMemoryLuceneIndex.indexDocument("article", "Halloween Festival");
inMemoryLuceneIndex.indexDocument("decoration", "Decorations for Halloween");

Term term = new Term("body", "hallowen");  // deliberately misspelled
Query query = new FuzzyQuery(term);

List<Document> documents = inMemoryLuceneIndex.searchIndex(query);
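Similarity here is measured as an edit distance between terms, and a FuzzyQuery built this way accepts terms within a distance of 2 by default. The sketch below computes the plain Levenshtein distance to show why "hallowen" still finds "halloween". It is an illustration only: Lucene uses an automaton-based approach and, by default, also counts a transposition of two adjacent characters as a single edit, which this sketch does not. The class name ToyEditDistance is hypothetical.

```java
// Conceptual sketch: classic dynamic-programming Levenshtein distance.
class ToyEditDistance {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b is just insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;  // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                    Math.min(curr[j - 1] + 1, prev[j] + 1),  // insertion, deletion
                    prev[j - 1] + cost);                     // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

ToyEditDistance.levenshtein("hallowen", "halloween") is 1 (one missing letter), so the misspelled term falls well within the default tolerance.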

6.6 BooleanQuery

To combine several conditions into a single query:

inMemoryLuceneIndex.indexDocument("Destination", "Las Vegas singapore car");
inMemoryLuceneIndex.indexDocument("Commutes in singapore", "Bus Car Bikes");

Term term1 = new Term("body", "singapore");
Term term2 = new Term("body", "car");

TermQuery query1 = new TermQuery(term1);
TermQuery query2 = new TermQuery(term2);

BooleanQuery booleanQuery 
  = new BooleanQuery.Builder()
    .add(query1, BooleanClause.Occur.MUST)  // this clause must match
    .add(query2, BooleanClause.Occur.MUST)
    .build();
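Conceptually, two MUST clauses ask for the documents that appear in the posting lists of both terms. The sketch below intersects two ascending lists of document ids in plain Java; it illustrates the idea only and says nothing about Lucene's actual scoring or iteration code. The class name ToyBooleanAnd is hypothetical.

```java
import java.util.*;

// Conceptual sketch: AND-ing two terms as an intersection of sorted posting lists.
class ToyBooleanAnd {
    // Intersects two ascending doc-id lists by walking both in lockstep.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i);
            int y = b.get(j);
            if (x == y) {
                result.add(x);  // present in both lists
                i++;
                j++;
            } else if (x < y) {
                i++;            // advance whichever list is behind
            } else {
                j++;
            }
        }
        return result;
    }
}
```

In the example above, "singapore" occurs only in the body of the first document while "car" occurs in both bodies, so intersecting the lists [0] and [0, 1] leaves just the first document.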

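7. Sorting Search Results

Search results can be sorted by a field. Note that the sort field must be indexed with doc values (for example, by also adding a SortedDocValuesField for the title), a step the indexing snippet in section 4 does not show: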

@Test
public void givenSortFieldWhenSortedThenCorrect() {
    InMemoryLuceneIndex inMemoryLuceneIndex 
      = new InMemoryLuceneIndex(new RAMDirectory(), new StandardAnalyzer());
    inMemoryLuceneIndex.indexDocument("Ganges", "River in India");
    inMemoryLuceneIndex.indexDocument("Mekong", "This river flows in south Asia");
    inMemoryLuceneIndex.indexDocument("Amazon", "Rain forest river");
    inMemoryLuceneIndex.indexDocument("Rhine", "Belongs to Europe");
    inMemoryLuceneIndex.indexDocument("Nile", "Longest River");

    Term term = new Term("body", "river");
    Query query = new WildcardQuery(term);

    SortField sortField 
      = new SortField("title", SortField.Type.STRING_VAL, false);  // false = not reversed, i.e. ascending
    Sort sortByTitle = new Sort(sortField);

    List<Document> documents 
      = inMemoryLuceneIndex.searchIndex(query, sortByTitle);
    assertEquals(4, documents.size());
    assertEquals("Amazon", documents.get(0).getField("title").stringValue());
}

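8. Removing Documents

To delete documents from the index that match a given term:

IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(memoryIndex, indexWriterConfig);
writer.deleteDocuments(term);  // delete every document containing this term
writer.close();                // commit the deletion so that new readers see it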

A test to verify this:

@Test
public void whenDocumentDeletedThenCorrect() {
    InMemoryLuceneIndex inMemoryLuceneIndex 
      = new InMemoryLuceneIndex(new RAMDirectory(), new StandardAnalyzer());
    inMemoryLuceneIndex.indexDocument("Ganges", "River in India");
    inMemoryLuceneIndex.indexDocument("Mekong", "This river flows in south Asia");

    Term term = new Term("title", "ganges");
    inMemoryLuceneIndex.deleteDocument(term);

    Query query = new TermQuery(term);
    List<Document> documents = inMemoryLuceneIndex.searchIndex(query);
    assertEquals(0, documents.size());  // the document is gone
}

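9. Conclusion

This article covered the core concepts of Apache Lucene along with hands-on examples:

  • ✅ How an inverted index works
  • ✅ The document and field model
  • ✅ Text analysis
  • ✅ Several query types
  • ✅ Sorting results and deleting documents

The complete example code is available on GitHub. From here, a natural next step is distributed search built on top of Lucene, such as Solr or Elasticsearch.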


Original title: Introduction to Apache Lucene | Baeldung