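Merging Multiple PDF Files into a Single PDF Document with Java

1. Introduction

Merging several PDF files into a single PDF document is a common requirement in business and document-management workflows. Typical scenarios include:

Consolidating presentations
Combining reports
Packaging a set of documents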

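The Java ecosystem offers several ready-to-use PDF libraries. The two most widely used are:

Apache PDFBox
iText

This article implements PDF merging with both libraries and compares the two implementations.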

2. Environment Setup

2.1. Dependencies

Add the following dependencies to pom.xml:

Apache PDFBox dependency:

<dependency> 
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId> 
    <version>2.0.31</version> 
</dependency>

iText dependency (the iText 5 itextpdf artifact):

<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>5.5.13.3</version>
</dependency>

2.2. Test Setup

A helper method that creates a three-page test PDF with PDFBox:

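import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

// Creates a three-page PDF under src/test/resources/temp/; each page holds one line of text.
static void createPDFDoc(String content, String filePath) throws IOException {
    // try-with-resources closes the document even if writing a page or saving fails.
    try (PDDocument document = new PDDocument()) {
        for (int i = 0; i < 3; i++) {
            PDPage page = new PDPage();
            document.addPage(page);

            try (PDPageContentStream contentStream = new PDPageContentStream(document, page)) {
                contentStream.beginText();
                contentStream.setFont(PDType1Font.HELVETICA_BOLD, 14);
                contentStream.showText(content + ", page:" + i);
                contentStream.endText();
            }
        }
        document.save("src/test/resources/temp/" + filePath);
    }
}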

Test lifecycle: create the input files before each test and delete the temp directory afterwards:

@BeforeEach
public void create() throws IOException {
    File tempDirectory = new File("src/test/resources/temp");
    tempDirectory.mkdirs();
    List.of(List.of("hello_world1", "file1.pdf"), List.of("hello_world2", "file2.pdf"))
        .forEach(pair -> {
            try {
                createPDFDoc(pair.get(0), pair.get(1));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
}

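@AfterEach
public void destroy() throws IOException {
    // Files.walk keeps directory handles open, so the stream is closed with try-with-resources.
    try (Stream<Path> paths = Files.walk(Paths.get("src/test/resources/temp/"))) {
        // Reverse order (java.util.Comparator) deletes files before their parent directories.
        paths.sorted(Comparator.reverseOrder())
             .forEach(path -> {
                 try {
                     Files.delete(path);
                 } catch (IOException e) {
                     throw new RuntimeException(e);
                 }
             });
    }
}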

3. Using Apache PDFBox

Apache PDFBox is an open-source PDF library for creating and manipulating PDF documents and extracting their content.

Core implementation:

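import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

void mergeUsingPDFBox(List<String> pdfFiles, String outputFile) throws IOException {
    PDFMergerUtility pdfMergerUtility = new PDFMergerUtility();
    pdfMergerUtility.setDestinationFileName(outputFile);

    // A plain loop lets FileNotFoundException (an IOException) propagate
    // without being wrapped in a RuntimeException.
    for (String file : pdfFiles) {
        pdfMergerUtility.addSource(new File(file));
    }

    // Sources are appended in the order they were added.
    pdfMergerUtility.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
}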

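Key points:

PDFMergerUtility is the dedicated merge utility class
addSource() registers a source file
mergeDocuments() performs the merge and writes the destination file
MemoryUsageSetting.setupMainMemoryOnly() buffers in main memory only, with no temporary files; for large inputs, MemoryUsageSetting.setupTempFileOnly() or setupMixed() reduce heap usage at the cost of disk I/O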
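
Because addSource() fails on the first missing file, it can be convenient to check all inputs before the merge starts and report every problem at once. The following is a minimal sketch using only the JDK; the class MergeInputs and the method requireReadablePdfs are hypothetical names for illustration and are not part of PDFBox.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class MergeInputs {

    // Hypothetical helper: returns the inputs as Paths, or throws if any is unusable.
    static List<Path> requireReadablePdfs(List<String> pdfFiles) {
        if (pdfFiles == null || pdfFiles.isEmpty()) {
            throw new IllegalArgumentException("At least one PDF file is required");
        }
        List<Path> paths = new ArrayList<>();
        List<String> problems = new ArrayList<>();
        for (String file : pdfFiles) {
            Path path = Path.of(file);
            // Only checks that a readable regular file exists; it does not parse the PDF.
            if (Files.isRegularFile(path) && Files.isReadable(path)) {
                paths.add(path);
            } else {
                problems.add(file);
            }
        }
        if (!problems.isEmpty()) {
            throw new IllegalArgumentException("Missing or unreadable files: " + problems);
        }
        return paths;
    }

    public static void main(String[] args) {
        // Validates the paths given on the command line and prints them.
        System.out.println(requireReadablePdfs(List.of(args)));
    }
}
```

A caller could run this check at the top of mergeUsingPDFBox() and pass the result on to addSource(); whether failing fast is preferable to skipping bad inputs depends on the application.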

Unit test:

@Test
void givenMultiplePdfs_whenMergeUsingPDFBoxExecuted_thenPdfsMerged() throws IOException {
    List<String> files = List.of("src/test/resources/temp/file1.pdf", "src/test/resources/temp/file2.pdf");
    PDFMerge pdfMerge = new PDFMerge();
    pdfMerge.mergeUsingPDFBox(files, "src/test/resources/temp/output.pdf");

    try (PDDocument document = PDDocument.load(new File("src/test/resources/temp/output.pdf"))) {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        String actual = pdfStripper.getText(document);
        String expected = """
            hello_world1, page:0
            hello_world1, page:1
            hello_world1, page:2
            hello_world2, page:0
            hello_world2, page:1
            hello_world2, page:2
            """;
        assertEquals(expected, actual);
    }
}
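
One caveat with comparing against a text block: Java text blocks always use \n as the line terminator, while the extracted text may use the platform line separator, so the assertion above could fail on Windows. The sketch below shows one way to build the expected text and normalize line endings before comparing. ExpectedText, expectedText and normalizeLineEndings are hypothetical names introduced here, not part of PDFBox or JUnit.

```java
import java.util.List;

final class ExpectedText {

    // Builds one line per page for each source document, in merge order.
    static String expectedText(List<String> contents, int pagesPerDocument) {
        StringBuilder sb = new StringBuilder();
        for (String content : contents) {
            for (int page = 0; page < pagesPerDocument; page++) {
                sb.append(content).append(", page:").append(page).append('\n');
            }
        }
        return sb.toString();
    }

    // Converts \r\n and lone \r line endings to \n.
    static String normalizeLineEndings(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        // Prints the text expected for the two three-page test documents.
        System.out.print(expectedText(List.of("hello_world1", "hello_world2"), 3));
    }
}
```

In the test, the comparison would then read assertEquals(expectedText(List.of("hello_world1", "hello_world2"), 3), normalizeLineEndings(actual)).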

4. Using iText

iText is another widely used PDF library; it supports text, images, tables, and other complex elements.

Core implementation:

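import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;

void mergeUsingIText(List<String> pdfFiles, String outputFile) throws IOException, DocumentException {
    List<PdfReader> pdfReaders = new ArrayList<>();
    try (FileOutputStream fos = new FileOutputStream(outputFile)) {
        // Open a reader for every input instead of hard-coding the first two files.
        for (String pdfFile : pdfFiles) {
            pdfReaders.add(new PdfReader(pdfFile));
        }
        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document, fos);
        document.open();
        PdfContentByte directContent = writer.getDirectContent();
        for (PdfReader pdfReader : pdfReaders) {
            // iText page numbers are 1-based.
            for (int page = 1; page <= pdfReader.getNumberOfPages(); page++) {
                document.newPage();
                PdfImportedPage importedPage = writer.getImportedPage(pdfReader, page);
                directContent.addTemplate(importedPage, 0, 0);
            }
        }
        // Closing the document finishes the PDF and must happen before the stream is closed.
        document.close();
    } finally {
        // Readers stay open until the document has been written.
        pdfReaders.forEach(PdfReader::close);
    }
}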

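Key points:

PdfReader reads a source file
PdfWriter writes the destination file
getImportedPage() imports a page from a reader
directContent.addTemplate() places the imported page on the current output page
The document, output stream, and readers must be closed explicitly and in the right order (⚠️ an easy place to make mistakes)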

Unit test (the merged output is read back with PDFBox):

@Test
void givenMultiplePdfs_whenMergeUsingITextExecuted_thenPdfsMerged() throws IOException, DocumentException {
    List<String> files = List.of("src/test/resources/temp/file1.pdf", "src/test/resources/temp/file2.pdf");
    PDFMerge pdfMerge = new PDFMerge();
    pdfMerge.mergeUsingIText(files, "src/test/resources/temp/output1.pdf");
    try (PDDocument document = PDDocument.load(new File("src/test/resources/temp/output1.pdf"))) {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        String actual = pdfStripper.getText(document);
        String expected = """
            hello_world1, page:0
            hello_world1, page:1
            hello_world1, page:2
            hello_world2, page:0
            hello_world2, page:1
            hello_world2, page:2
            """;
        assertEquals(expected, actual);
    }
}

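5. Summary

This article compared two common ways to merge PDFs in Java:

Feature	Apache PDFBox	iText
API complexity	✅ Simple: a single utility class	❌ More verbose; resources are managed by hand
Memory control	✅ Configurable via MemoryUsageSetting	❌ Relies on JVM defaults
Commercial use	✅ Apache License 2.0	❌ AGPL; closed-source use requires a commercial license
Advanced features	⚠️ Basic	✅ Supports complex elements

Recommendations:

Simple merging → prefer PDFBox (concise API)
Complex document processing → consider iText (richer feature set)
Commercial projects → check iText's licensing terms first

The complete example code is available on GitHub.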
