A Simple File Search with Lucene

1. Overview

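Apache Lucene is a full-text search engine library written in Java, with ports and bindings available for several other programming languages. To get started with Lucene, please refer to our introductory article.

In this article, we'll index a text file and then search it for sample strings and text fragments.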

2. Maven Setup

Let's add the necessary dependencies first:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>7.1.0</version>
</dependency>

The latest version can be found here.

To parse search queries, we also need:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.1.0</version>
</dependency>

3. File System Directory

Before we can index files, we need to create a file-system index.

Lucene provides the FSDirectory class for this purpose:

Directory directory = FSDirectory.open(Paths.get(indexPath));

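Here indexPath is the location of the index directory. If the directory doesn't exist, Lucene will create it.

FSDirectory has three concrete implementations: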

SimpleFSDirectory
NIOFSDirectory
MMapDirectory

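⚠️ Each of them can run into issues in particular environments:

SimpleFSDirectory has poor concurrent performance (it blocks when multiple threads read from the same file)
NIOFSDirectory has file-channel issues on Windows
MMapDirectory has memory-release issues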

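To sidestep these environment-specific issues, FSDirectory.open() lets Lucene pick the best implementation for the platform it is running on.

4. Index Text File

Once we've created the index directory, we can add a file to the index: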
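To see what this looks like in practice, here is a minimal sketch that opens a directory and reports which FSDirectory subclass was picked. It assumes the Lucene 7.1.0 dependency used in this article; the class name DirectoryChoiceDemo, the helper chosenImplementation and the temporary index location are made up for illustration. The printed implementation name depends on the operating system and JVM.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class DirectoryChoiceDemo {

    // Opens (and, if needed, creates) an index directory and reports
    // which FSDirectory subclass Lucene picked on this platform.
    static String chosenImplementation(Path indexPath) throws IOException {
        try (Directory directory = FSDirectory.open(indexPath)) {
            return directory.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical location: a sub-folder that does not exist yet
        Path indexPath = Files.createTempDirectory("lucene-demo").resolve("index");

        System.out.println("Exists before open: " + Files.exists(indexPath));
        System.out.println("Implementation: " + chosenImplementation(indexPath));
        System.out.println("Exists after open: " + Files.exists(indexPath));
    }
}
```

The sketch also shows the directory being created on demand: the sub-folder is absent before FSDirectory.open() and present afterwards.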

// analyzer and indexPath are assumed to be fields of the enclosing class
public void addFileToIndex(String filepath) throws IOException {

    Path path = Paths.get(filepath);
    File file = path.toFile();
    IndexWriterConfig indexWriterConfig
     = new IndexWriterConfig(analyzer);
    Directory indexDirectory = FSDirectory
      .open(Paths.get(indexPath));
    IndexWriter indexWriter = new IndexWriter(
      indexDirectory, indexWriterConfig);
    Document document = new Document();

    FileReader fileReader = new FileReader(file);
    document.add(
      new TextField("contents", fileReader));
    document.add(
      new StringField("path", file.getPath(), Field.Store.YES));
    document.add(
      new StringField("filename", file.getName(), Field.Store.YES));

    indexWriter.addDocument(document);
    indexWriter.close();
}

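Key points:

We create a document with two StringFields, path and filename, which are indexed as single tokens and stored in the index
We add a TextField named contents, passing the FileReader instance directly; a TextField built from a Reader is tokenized but not stored
We write the document to the index through the IndexWriter
We must call close() to release the lock on the index files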
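The last point can be checked with a small experiment. The sketch below, assuming Lucene 7.1.0 and a default FSDirectory, tries to open a second IndexWriter while the first one is still open, and again after it has been closed. The class WriteLockDemo and its helpers are hypothetical names used only for this illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

public class WriteLockDemo {

    // A fresh IndexWriterConfig is needed for every IndexWriter
    static IndexWriterConfig newConfig() {
        return new IndexWriterConfig(new StandardAnalyzer());
    }

    // Tries to open a writer; false means the index is currently locked
    static boolean canOpenWriter(Directory directory) throws IOException {
        try (IndexWriter writer = new IndexWriter(directory, newConfig())) {
            return true;
        } catch (LockObtainFailedException e) {
            return false;
        }
    }

    // Returns {result while the first writer is open, result after close()}
    static boolean[] probe(Path indexPath) throws IOException {
        try (Directory directory = FSDirectory.open(indexPath)) {
            boolean whileOpen;
            try (IndexWriter first = new IndexWriter(directory, newConfig())) {
                whileOpen = canOpenWriter(directory);
            }
            boolean afterClose = canOpenWriter(directory);
            return new boolean[] { whileOpen, afterClose };
        }
    }

    public static void main(String[] args) throws IOException {
        boolean[] result = probe(Files.createTempDirectory("lucene-lock-demo"));
        System.out.println("Second writer while first is open: " + result[0]);
        System.out.println("Second writer after close(): " + result[1]);
    }
}
```

With the default lock factory, the first probe should report false and the second true. Using try-with-resources, as here, is one way to make sure close() runs even when indexing throws.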

5. Search Indexed Files

Now let's implement the search:

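// ParseException is org.apache.lucene.queryparser.classic.ParseException
public List<Document> searchFiles(String inField, String queryString)
  throws IOException, ParseException {
    Query query = new QueryParser(inField, analyzer)
      .parse(queryString);
    Directory indexDirectory = FSDirectory
      .open(Paths.get(indexPath));
    // try-with-resources closes the reader once the hits are collected
    try (IndexReader indexReader = DirectoryReader
      .open(indexDirectory)) {
        IndexSearcher searcher = new IndexSearcher(indexReader);
        TopDocs topDocs = searcher.search(query, 10);

        // scoreDocs is an array and searcher.doc() throws a checked
        // IOException, so a plain loop is used instead of a stream
        List<Document> documents = new ArrayList<>();
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            documents.add(searcher.doc(scoreDoc.doc));
        }
        return documents;
    }
}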
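Since the search relies on QueryParser to turn the query string into a Query, it can help to look at what the parser produces. The following sketch, assuming Lucene 7.1.0 with the lucene-queryparser dependency, parses a single word and a quoted text fragment. The class QueryParsingDemo and the sample phrase are illustrative only.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class QueryParsingDemo {

    // Parses the query string against the given default field
    static Query parse(String field, String queryString) throws ParseException {
        return new QueryParser(field, new StandardAnalyzer()).parse(queryString);
    }

    public static void main(String[] args) throws ParseException {
        // A single word and a quoted fragment (sample values)
        Query word = parse("contents", "consectetur");
        Query phrase = parse("contents", "\"lorem ipsum\"");

        // Print the concrete Query type next to its string form
        System.out.println(word.getClass().getSimpleName() + " -> " + word);
        System.out.println(phrase.getClass().getSimpleName() + " -> " + phrase);
    }
}
```

A single word is parsed into a term query, while a quoted fragment becomes a phrase query that only matches the words in that order. The same analyzer should be used for indexing and for parsing, so that query terms are normalized the same way as the indexed text.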

Let's test it:

@Test
public void givenSearchQueryWhenFetchedFileNameThenCorrect() throws Exception {
    String indexPath = "/tmp/index";
    String dataPath = "/tmp/data/file1.txt";
    
    Directory directory = FSDirectory
      .open(Paths.get(indexPath));
    LuceneFileSearch luceneFileSearch 
      = new LuceneFileSearch(directory, new StandardAnalyzer());
    
    luceneFileSearch.addFileToIndex(dataPath);
    
    List<Document> docs = luceneFileSearch
      .searchFiles("contents", "consectetur");
    
    assertEquals("file1.txt", docs.get(0).get("filename"));
}

The test runs through the following steps:

Create a file-system index at indexPath
Index file1.txt
Search the contents field for the string "consectetur"
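For readers who want to try these steps without the LuceneFileSearch class, the flow can be sketched as one self-contained program. It assumes Lucene 7.1.0 with both dependencies from the Maven section; FileSearchWalkthrough, indexAndSearch, the temporary paths and the sample text are stand-ins chosen for this illustration, not part of the article's code base.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class FileSearchWalkthrough {

    // Indexes one file, then returns the names of files matching the query
    static List<String> indexAndSearch(Path indexPath, Path file, String queryString)
            throws IOException, ParseException {
        Analyzer analyzer = new StandardAnalyzer();
        try (Directory directory = FSDirectory.open(indexPath)) {
            // Steps 1 and 2: create the index and add the file to it
            try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyzer));
                 Reader contents = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                Document document = new Document();
                document.add(new TextField("contents", contents));
                document.add(new StringField("filename",
                    file.getFileName().toString(), Field.Store.YES));
                writer.addDocument(document);
            }
            // Step 3: search the contents field
            Query query = new QueryParser("contents", analyzer).parse(queryString);
            List<String> names = new ArrayList<>();
            try (IndexReader reader = DirectoryReader.open(directory)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    names.add(searcher.doc(hit.doc).get("filename"));
                }
            }
            return names;
        }
    }

    public static void main(String[] args) throws Exception {
        // Sample data written to a temporary folder (hypothetical content)
        Path workDir = Files.createTempDirectory("lucene-walkthrough");
        Path file = workDir.resolve("file1.txt");
        Files.write(file, "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
            .getBytes(StandardCharsets.UTF_8));

        System.out.println(indexAndSearch(workDir.resolve("index"), file, "consectetur"));
    }
}
```

Run against a fresh temporary folder, this should print [file1.txt]. The writer is closed before the reader is opened, because a DirectoryReader only sees changes that have been committed.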

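6. Conclusion

This article demonstrated the basics of indexing and searching text with Lucene. For more advanced indexing, searching and query techniques, please refer to our Introduction to Lucene article.

The sample code is available on GitHub.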
