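DOM Parsing with Xerces

1. Overview

This article shows how to do DOM parsing with Apache Xerces, a mature and widely used library for parsing and manipulating XML.

Java offers several ways to parse XML; this article focuses on DOM parsing. A DOM parser loads the entire XML document into memory and builds a complete tree from it, which makes random access and modification straightforward.

If you're interested in XML library support in Java, see our previous article.

✅ Pros: random access, mutable tree
❌ Cons: high memory usage, unsuitable for large files

2. Example Document

We'll use the following XML file as our example: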

<?xml version="1.0"?>
<tutorials>
    <tutorial tutId="01" type="java">
        <title>Guava</title>
        <description>Introduction to Guava</description>
        <date>04/04/2016</date>
        <author>GuavaAuthor</author>
    </tutorial>
    <tutorial tutId="02" type="spring">
        <title>Spring Boot</title>
        <description>Getting Started with Spring Boot</description>
        <date>05/04/2016</date>
        <author>SpringAuthor</author>
    </tutorial>
    <tutorial tutId="03" type="hibernate">
        <title>Hibernate</title>
        <description>Hibernate Best Practices</description>
        <date>06/04/2016</date>
        <author>HibernateAuthor</author>
    </tutorial>
    <tutorial tutId="04" type="maven">
        <title>Maven</title>
        <description>Maven Tips and Tricks</description>
        <date>07/04/2016</date>
        <author>MavenAuthor</author>
    </tutorial>
</tutorials>

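Structure of the document:

The root node is <tutorials>
It contains 4 <tutorial> child elements
Each <tutorial> has 2 attributes: tutId and type
Each <tutorial> has 4 child elements: title, description, date, author

3. Loading the XML File

⚠️ Heads-up: the JDK already ships a JAXP implementation derived from Xerces, so no extra dependency is needed. On Oracle JDK and OpenJDK, the default parser behind javax.xml.parsers is an internal fork of Xerces (packages under com.sun.org.apache.xerces.internal). To use the standalone Apache Xerces release instead, add the xercesImpl artifact to the classpath.

Loading an XML file takes only a few lines: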

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(new File("src/test/resources/example_jdom.xml"));
doc.getDocumentElement().normalize();

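Key steps:

Obtain a DocumentBuilder from DocumentBuilderFactory
Call parse() to load the XML file; it returns a Document object
Call normalize(): it merges adjacent Text nodes and removes empty ones, so each run of text is represented by a single node

📌 normalize() is easy to overlook. Note that it does not remove whitespace-only Text nodes such as the line breaks and indentation between elements; those still appear as child nodes and have to be filtered out during traversal.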
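
To make the effect of normalize() concrete, here is a minimal sketch. It parses XML from an inline String rather than a file; the class name NormalizeDemo, its helper methods, and the sample XML are assumptions made for illustration and are not part of the original example project.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class NormalizeDemo {

    // Parses XML from a String (hypothetical helper, used to keep the sketch self-contained).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Returns the root's child-node count before and after normalize().
    static int[] countsAroundNormalize() {
        Document doc = parse("<note>Hello</note>");
        Element note = doc.getDocumentElement();
        // Appending Text nodes programmatically leaves adjacent Text siblings.
        note.appendChild(doc.createTextNode(", "));
        note.appendChild(doc.createTextNode("world"));
        int before = note.getChildNodes().getLength();
        note.normalize(); // merges the adjacent Text nodes into one
        int after = note.getChildNodes().getLength();
        return new int[] {before, after};
    }

    // Counts the root's child nodes after normalize(); whitespace-only Text nodes are still included.
    static int childCountAfterNormalize(String xml) {
        Element root = parse(xml).getDocumentElement();
        root.normalize();
        return root.getChildNodes().getLength();
    }

    public static void main(String[] args) {
        int[] counts = countsAroundNormalize();
        System.out.println("before=" + counts[0] + ", after=" + counts[1]);
        System.out.println(childCountAfterNormalize("<a>\n  <b/>\n</a>"));
    }
}
```

In the first case the adjacent Text nodes collapse into a single node. In the second case the whitespace-only Text nodes around <b/> remain, which is why traversal code still needs to filter by node type.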

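4. Parsing the DOM

The core of the DOM is the Node interface: elements, attributes, and text are all nodes.

4.1 Getting Elements by Tag Name

getElementsByTagName() returns a NodeList of all elements with the given tag name, in document order: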

@Test
public void whenGetElementByTag_thenSuccess() {
    NodeList nodeList = doc.getElementsByTagName("tutorial");
    Node first = nodeList.item(0);

    assertEquals(4, nodeList.getLength());
    assertEquals(Node.ELEMENT_NODE, first.getNodeType());
    assertEquals("tutorial", first.getNodeName());        
}

✅ NodeList is an indexed, array-like collection; use item(index) to access an entry
✅ getLength() returns the number of matching nodes
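
Because NodeList does not implement Iterable, it is usually walked with an index loop. The following sketch collects the title of every tutorial; the class name TagLookupDemo, its methods, and the shortened inline XML are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TagLookupDemo {

    // Shortened version of the example document (assumption: two tutorials are enough here).
    static final String XML = "<tutorials>"
        + "<tutorial tutId=\"01\" type=\"java\"><title>Guava</title></tutorial>"
        + "<tutorial tutId=\"02\" type=\"spring\"><title>Spring Boot</title></tutorial>"
        + "</tutorials>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Collects the text of the first <title> under each <tutorial>, in document order.
    static List<String> titles(Document doc) {
        NodeList tutorials = doc.getElementsByTagName("tutorial");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tutorials.getLength(); i++) {
            Element tutorial = (Element) tutorials.item(i);
            // Called on an Element, the lookup is limited to that element's descendants.
            Node title = tutorial.getElementsByTagName("title").item(0);
            if (title != null) {
                result.add(title.getTextContent());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(titles(parse(XML)));
    }
}
```

Note that item(index) returns null when the index is out of range, so the null check guards against tutorials without a title.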

4.2 Getting Element Attributes

getAttributes() returns a NamedNodeMap, whose entries can be read by index to get each attribute's name and value:

@Test
public void whenGetFirstElementAttributes_thenSuccess() {
    Node first = doc.getElementsByTagName("tutorial").item(0);
    NamedNodeMap attrList = first.getAttributes();

    assertEquals(2, attrList.getLength());
    
    assertEquals("tutId", attrList.item(0).getNodeName());
    assertEquals("01", attrList.item(0).getNodeValue());
    
    assertEquals("type", attrList.item(1).getNodeName());
    assertEquals("java", attrList.item(1).getNodeValue());
}

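🔍 Note: NamedNodeMap is not a java.util.Map, so there is no get(key). Look up an attribute by name with getNamedItem(name), or iterate by index with item(i). The DOM specification does not guarantee any particular order for the entries, so the index-based assertions above depend on the parser's behavior.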
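
Lookup by name avoids relying on attribute order. Here is a minimal sketch using NamedNodeMap.getNamedItem(); the class name AttributeDemo, its helper methods, and the inline XML are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class AttributeDemo {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Returns the value of the named attribute, or null if the node has no such attribute.
    static String attribute(Node node, String name) {
        NamedNodeMap attrs = node.getAttributes(); // null for non-element nodes
        if (attrs == null) {
            return null;
        }
        Node attr = attrs.getNamedItem(name);
        return attr == null ? null : attr.getNodeValue();
    }

    // Copies all attributes into a sorted map so the result does not depend on parser order.
    static Map<String, String> attributes(Node node) {
        Map<String, String> result = new TreeMap<>();
        NamedNodeMap attrs = node.getAttributes();
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength(); i++) {
                result.put(attrs.item(i).getNodeName(), attrs.item(i).getNodeValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Document doc = parse("<tutorial tutId=\"01\" type=\"java\"/>");
        Node first = doc.getDocumentElement();
        System.out.println(attribute(first, "type"));
        System.out.println(attributes(first));
    }
}
```

When the node is known to be an Element, casting it and calling getAttribute(name) is a shorter alternative; it returns an empty string rather than null for a missing attribute.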

5. Traversing Nodes

To access the child nodes of an element, use getChildNodes():

@Test
public void whenTraverseChildNodes_thenSuccess() {
    Node first = doc.getElementsByTagName("tutorial").item(0);
    NodeList nodeList = first.getChildNodes();
    int n = nodeList.getLength();
    Node current;
    for (int i = 0; i < n; i++) {
        current = nodeList.item(i);
        if (current.getNodeType() == Node.ELEMENT_NODE) {
            System.out.println(
                current.getNodeName() + ": " + current.getTextContent());
        }
    }
}

Output:

title: Guava
description: Introduction to Guava
date: 04/04/2016
author: GuavaAuthor

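⚠️ Note: the list returned by getChildNodes() contains every child node, including whitespace-only Text nodes (line breaks, indentation), so filter on Node.ELEMENT_NODE.

✅ getTextContent() returns the concatenated text of a node and its descendants, so there is no need to walk the Text nodes recursively.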
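
The element filter can be wrapped in a small reusable helper. The sketch below walks the children with getFirstChild() and getNextSibling() instead of an index loop; the class name ChildElementsDemo, the childElements helper, and the inline XML are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class ChildElementsDemo {

    // Indented on purpose so the parser produces whitespace-only Text nodes.
    static final String XML = "<tutorial>\n"
        + "    <title>Guava</title>\n"
        + "    <author>GuavaAuthor</author>\n"
        + "</tutorial>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Returns only the direct children that are elements, skipping Text and comment nodes.
    static List<Element> childElements(Node parent) {
        List<Element> result = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                result.add((Element) n);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Element tutorial = parse(XML).getDocumentElement();
        System.out.println("all children: " + tutorial.getChildNodes().getLength());
        for (Element child : childElements(tutorial)) {
            System.out.println(child.getNodeName() + ": " + child.getTextContent());
        }
    }
}
```

Sibling-based iteration is a design choice, not a requirement; an index loop over getChildNodes() with the same type check gives the same result.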

6. Modifying the DOM

The DOM is mutable, so node content can be modified directly.

For example, to change the type attribute of the first <tutorial> from "java" to "other":

@Test
public void whenModifyDocument_thenModified() {
    NodeList nodeList = doc.getElementsByTagName("tutorial");
    Element first = (Element) nodeList.item(0);

    assertEquals("java", first.getAttribute("type")); 
    
    first.setAttribute("type", "other");
    assertEquals("other", first.getAttribute("type"));     
}

✅ Modification is simple: call Element.setAttribute(name, value)
✅ The change applies to the in-memory Document only; it has not been persisted yet
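
Attributes are not the only thing that can be changed. As a rough sketch, the following code also replaces an element's text, removes a child element, and removes an attribute; the class name ModifyDemo, the edit method, the new title text, and the inline XML are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class ModifyDemo {

    static final String XML = "<tutorials>"
        + "<tutorial tutId=\"01\" type=\"java\">"
        + "<title>Guava</title><author>GuavaAuthor</author>"
        + "</tutorial>"
        + "</tutorials>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Applies three in-memory edits to the first <tutorial> and returns that element.
    static Element edit(Document doc) {
        Element first = (Element) doc.getElementsByTagName("tutorial").item(0);

        // Replace the text of <title>; setTextContent swaps out all existing children.
        Node title = first.getElementsByTagName("title").item(0);
        title.setTextContent("Guava Basics");

        // Remove <author>; removal always goes through the node's parent.
        Node author = first.getElementsByTagName("author").item(0);
        author.getParentNode().removeChild(author);

        // Remove the type attribute entirely.
        first.removeAttribute("type");
        return first;
    }

    public static void main(String[] args) {
        Element first = edit(parse(XML));
        System.out.println(first.getTextContent());
        System.out.println(first.hasAttribute("type"));
    }
}
```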

7. Creating a New Document

Besides parsing, we can also build an XML document from scratch.

Target XML:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<users>
    <user id="1">
        <email>user@example.com</email>
    </user>
</users>

Implementation:

@Test
public void whenCreateNewDocument_thenCreated() throws Exception {
    Document newDoc = builder.newDocument();
    Element root = newDoc.createElement("users");
    newDoc.appendChild(root);

    Element first = newDoc.createElement("user");
    root.appendChild(first);
    first.setAttribute("id", "1");

    Element email = newDoc.createElement("email");
    email.appendChild(newDoc.createTextNode("user@example.com"));
    first.appendChild(email);

    assertEquals(1, newDoc.getChildNodes().getLength());
    assertEquals("users", newDoc.getChildNodes().item(0).getNodeName());
}

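Key APIs:

createElement(tagName): creates an element node
createTextNode(text): creates a text node
appendChild(child): appends a child node
setAttribute(name, value): sets an attribute

📌 The in-memory document holds no XML declaration node; the Transformer writes the declaration when the document is serialized.

8. Saving the Document

After modifying or creating a document, we need to write the Document to a file or an output stream.

8.1 Saving to a File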

private void saveDomToFile(Document document, String fileName) throws Exception {
    DOMSource dom = new DOMSource(document);
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    StreamResult result = new StreamResult(new File(fileName));
    transformer.transform(dom, result);
}

8.2 Printing to the Console

private void printDom(Document document) throws Exception {
    DOMSource dom = new DOMSource(document);
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.transform(dom, new StreamResult(System.out));
}

✅ Transformer serializes the document; by default it does not indent the output
✅ For pretty-printed (indented) output, set transformer.setOutputProperty(OutputKeys.INDENT, "yes");
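
The same Transformer approach can also serialize to a String, which is handy in tests. The sketch below is one possible way to do it, with the output properties passed in as flags; the class name SerializeDemo and the methods toXml and buildUsers are assumptions made for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class SerializeDemo {

    // Builds the <users> document from the "Creating a New Document" section.
    static Document buildUsers() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("users");
            doc.appendChild(root);
            Element user = doc.createElement("user");
            user.setAttribute("id", "1");
            root.appendChild(user);
            Element email = doc.createElement("email");
            email.appendChild(doc.createTextNode("user@example.com"));
            user.appendChild(email);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build document", e);
        }
    }

    // Serializes the document to a String, optionally indented and without the XML declaration.
    static String toXml(Document document, boolean indent, boolean omitDeclaration) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, indent ? "yes" : "no");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, omitDeclaration ? "yes" : "no");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(buildUsers(), true, false));
    }
}
```

The exact indentation width and line breaks produced with INDENT set to "yes" depend on the Transformer implementation and JDK version, so comparisons in tests are safer against the non-indented form.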

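9. Conclusion

This article used practical examples to show how to perform the following operations with the Xerces-based DOM parser built into Java:

✅ Loading an XML file
✅ Traversing and querying nodes
✅ Reading attributes and text content
✅ Modifying the DOM tree
✅ Creating a new document from scratch
✅ Saving to a file or printing to the console

DOM parsing is simple and intuitive, but its memory usage is high, so it suits small files or cases that need frequent modification. For large files, consider a streaming parser such as SAX or StAX.

💡 The source code is available on GitHub: https://github.com/eugenp/tutorials/tree/master/xml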
