A Complete Guide to XPath Parsing in Java | Baeldung中文网

2. A Simple XPath Parser

import java.io.File;

import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;

public class DefaultParser {

    // XML file this parser reads from
    private File file;

    public DefaultParser(File file) {
        this.file = file;
    }

    // Accessor used by the parsing snippets that follow
    public File getFile() {
        return file;
    }
}

Now let's take a closer look at the core components used in DefaultParser:

FileInputStream fileIS = new FileInputStream(this.getFile());
DocumentBuilderFactory builderFactory = newSecureDocumentBuilderFactory();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document xmlDocument = builder.parse(fileIS);
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "/Tutorials/Tutorial";
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);

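Let's break this code down.

To build the DOM tree of the XML document, we create the builderFactory instance with newSecureDocumentBuilderFactory(). This is a helper method of our own (it is not part of the JDK) that returns a hardened DocumentBuilderFactory.

The helper makes XML parsing safer by disabling external entities and the dangerous DTD-related features. When the XML comes from an untrusted source, this is the recommended way to guard against XML External Entity (XXE) and related attacks.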
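The article does not list the body of newSecureDocumentBuilderFactory(), so the following is only a minimal sketch of what such a helper might look like. The class name SecureXmlFactories is hypothetical, and the feature URIs are an assumption: they are the ones commonly recommended for the Xerces-based parser bundled with the JDK, and a different parser implementation may reject them with a ParserConfigurationException.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

final class SecureXmlFactories {

    private SecureXmlFactories() {
    }

    // Hypothetical helper: the article names this method but does not show its body.
    static DocumentBuilderFactory newSecureDocumentBuilderFactory() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Turn on the JAXP secure-processing limits.
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Reject any document that carries a DOCTYPE declaration.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        // Defense in depth: also switch off external entities and external DTD loading.
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        factory.setXIncludeAware(false);
        factory.setExpandEntityReferences(false);
        return factory;
    }
}
```

With a factory configured like this, parsing a document that contains a DOCTYPE fails with a SAXException instead of resolving its entities. If the application legitimately needs DTDs, the disallow-doctype-decl line would have to be relaxed.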

DocumentBuilderFactory builderFactory = newSecureDocumentBuilderFactory();

DocumentBuilder builder = builderFactory.newDocumentBuilder();

Once we have a DocumentBuilder instance, we can parse XML from several kinds of input: an InputStream, a File, a URI string, or a SAX InputSource:

Document xmlDocument = builder.parse(fileIS);

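A Document represents the entire XML document. It is the root of the document tree and our entry point to the data. Next we create the XPath object: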

XPath xPath = XPathFactory.newInstance().newXPath();

We use the XPath object to execute expressions and extract the information we need from the document:

xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);

We can compile an XPath expression given as a string and specify the type of result we expect back, such as NODESET, NODE, or STRING.
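To make the role of the return type more concrete, here is a small sketch that evaluates expressions with four of the XPathConstants values. The class name ReturnTypeSketch and the inline XML string are hypothetical and exist only so that the example does not depend on a file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class ReturnTypeSketch {

    // Hypothetical inline sample, so the sketch does not depend on a file on disk.
    static final String XML = "<Tutorials>"
        + "<Tutorial tutId=\"01\"><title>Guava</title></Tutorial>"
        + "<Tutorial tutId=\"02\"><title>XML</title></Tutorial>"
        + "</Tutorials>";

    private static final XPath XPATH = XPathFactory.newInstance().newXPath();

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // NODESET: every matching node, returned as a NodeList.
    static NodeList allTutorials(Document doc) throws XPathExpressionException {
        return (NodeList) XPATH.compile("/Tutorials/Tutorial").evaluate(doc, XPathConstants.NODESET);
    }

    // NODE: a single node, here the first Tutorial element.
    static Node firstTutorial(Document doc) throws XPathExpressionException {
        return (Node) XPATH.compile("/Tutorials/Tutorial[1]").evaluate(doc, XPathConstants.NODE);
    }

    // STRING: the text value of the selected node.
    static String firstTitle(Document doc) throws XPathExpressionException {
        return (String) XPATH.compile("/Tutorials/Tutorial[1]/title").evaluate(doc, XPathConstants.STRING);
    }

    // NUMBER: XPath numbers are returned as Double.
    static double tutorialCount(Document doc) throws XPathExpressionException {
        return (Double) XPATH.compile("count(/Tutorials/Tutorial)").evaluate(doc, XPathConstants.NUMBER);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        System.out.println(allTutorials(doc).getLength());
        System.out.println(firstTutorial(doc).getNodeName());
        System.out.println(firstTitle(doc));
        System.out.println(tutorialCount(doc));
    }
}
```

The cast has to match the requested constant: NODESET pairs with NodeList, NODE with Node, STRING with String, and NUMBER with Double.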

3. Let's Start

With the basic components covered, let's put them to work on a simple XML example:

<?xml version="1.0"?>
<Tutorials>
    <Tutorial tutId="01" type="java">
        <title>Guava</title>
        <description>Introduction to Guava</description>
        <date>04/04/2016</date>
        <author>GuavaAuthor</author>
    </Tutorial>
    <Tutorial tutId="02" type="java">
        <title>XML</title>
        <description>Introduction to XPath</description>
        <date>04/05/2016</date>
        <author>XMLAuthor</author>
    </Tutorial>
</Tutorials>

3.1 Retrieving a Basic List of Elements

We use an XPath expression to extract a list of nodes from the XML:

FileInputStream fileIS = new FileInputStream(this.getFile());
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document xmlDocument = builder.parse(fileIS);
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "/Tutorials/Tutorial";
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);

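The expression above retrieves the list of tutorials under the root node. We could also use "//Tutorial", but that expression returns every <Tutorial> node in the document, at any depth.

Passing XPathConstants.NODESET as the return type yields a NodeList, an ordered collection whose nodes can be accessed by index.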
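As an illustration of index-based access, the sketch below walks the returned NodeList and collects the tutId attribute of each element. The class name NodeListSketch, the method tutorialIds, and the inline XML are hypothetical; the expression is the one used in this section.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class NodeListSketch {

    // Hypothetical inline sample modeled on the article's XML.
    static final String XML = "<Tutorials>"
        + "<Tutorial tutId=\"01\" type=\"java\"><title>Guava</title></Tutorial>"
        + "<Tutorial tutId=\"02\" type=\"java\"><title>XML</title></Tutorial>"
        + "</Tutorials>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    static List<String> tutorialIds(Document doc) throws XPathExpressionException {
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodeList = (NodeList) xPath.compile("/Tutorials/Tutorial")
            .evaluate(doc, XPathConstants.NODESET);
        List<String> ids = new ArrayList<>();
        // NodeList is not Iterable, so we loop by index.
        for (int i = 0; i < nodeList.getLength(); i++) {
            // The expression selects elements only, so the cast is safe here.
            Element tutorial = (Element) nodeList.item(i);
            ids.add(tutorial.getAttribute("tutId"));
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(tutorialIds(parse(XML)));
    }
}
```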

3.2 Retrieving a Specific Node by Its ID

We can look up an element by its ID by adding a filter to the expression:

DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document xmlDocument = builder.parse(this.getFile());
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "/Tutorials/Tutorial[@tutId=" + "'" + id + "'" + "]";
Node node = (Node) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODE);

Filters of this kind are called predicates, and they are a handy way to pinpoint specific data, for example:

/Tutorials/Tutorial[1]
/Tutorials/Tutorial[last()]
/Tutorials/Tutorial[position()<4]
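To see how positional predicates behave, the following sketch applies [1], [last()], and [position()<4] to a document with four tutorials. XPath 1.0 positions start at 1, and last() and position() are part of its core function library. The class name PredicateSketch, the method idsMatching, and the four-element sample are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class PredicateSketch {

    // Hypothetical sample with four tutorials so that position()<4 filters something out.
    static final String XML = "<Tutorials>"
        + "<Tutorial tutId=\"01\"/><Tutorial tutId=\"02\"/>"
        + "<Tutorial tutId=\"03\"/><Tutorial tutId=\"04\"/>"
        + "</Tutorials>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Returns the tutId of every Tutorial selected by /Tutorials/Tutorial[predicate].
    static List<String> idsMatching(Document doc, String predicate) throws XPathExpressionException {
        XPath xPath = XPathFactory.newInstance().newXPath();
        String expression = "/Tutorials/Tutorial[" + predicate + "]";
        NodeList nodes = (NodeList) xPath.compile(expression).evaluate(doc, XPathConstants.NODESET);
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            ids.add(((Element) nodes.item(i)).getAttribute("tutId"));
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        System.out.println(idsMatching(doc, "1"));            // first tutorial
        System.out.println(idsMatching(doc, "last()"));       // last tutorial
        System.out.println(idsMatching(doc, "position()<4")); // first three tutorials
    }
}
```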

3.3 Retrieving Nodes by a Specific Tag Name

Now we introduce axes, which allow more advanced navigation:

Document xmlDocument = builder.parse(this.getFile());
this.clean(xmlDocument);
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "//Tutorial[descendant::title[text()=" + "'" + name + "'" + "]]";
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);

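The expression above finds every <Tutorial> element that has a descendant <title> with the given text.

With the sample XML, we can search for a <title> whose text is "Guava" or "XML" and get back the whole <Tutorial> element.

Axes offer a flexible way to navigate an XML document; see the official XPath documentation for the full list.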
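Beyond descendant::, a few other XPath 1.0 axes are worth trying. The sketch below is an illustration under assumed names (AxesSketch, evalString, and an inline copy of the sample data); it uses XPath.evaluate(String, Object), which returns the string value of the first selected node.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

final class AxesSketch {

    // Hypothetical inline copy of the article's sample data (whitespace omitted).
    static final String XML = "<Tutorials>"
        + "<Tutorial tutId=\"01\"><title>Guava</title><author>GuavaAuthor</author></Tutorial>"
        + "<Tutorial tutId=\"02\"><title>XML</title><author>XMLAuthor</author></Tutorial>"
        + "</Tutorials>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Evaluates the expression and returns the string value of the first selected node.
    static String evalString(Document doc, String expression) throws XPathExpressionException {
        XPath xPath = XPathFactory.newInstance().newXPath();
        return xPath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        // following-sibling:: moves from a title to the author next to it.
        System.out.println(evalString(doc, "//title[text()='Guava']/following-sibling::author"));
        // preceding-sibling:: moves the other way.
        System.out.println(evalString(doc, "//author[text()='XMLAuthor']/preceding-sibling::title"));
        // parent:: climbs from a child back to its Tutorial.
        System.out.println(evalString(doc, "//author[text()='XMLAuthor']/parent::Tutorial/@tutId"));
        // descendant:: is the axis used in this section.
        System.out.println(evalString(doc, "//Tutorial[descendant::title[text()='XML']]/@tutId"));
    }
}
```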

3.4 Manipulating Data in Expressions

XPath also lets us manipulate data directly inside an expression:

XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "//Tutorial[number(translate(date, '/', '')) > " + date + "]";
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);

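In this scenario the method receives a date string in "ddmmyyyy" format, while the XML stores dates as "dd/mm/yyyy". The built-in translate() function strips the slashes and number() converts the result, so each stored date can be compared numerically with the argument.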
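One caveat is that comparing "ddmmyyyy" values as numbers does not follow calendar order: 31012016 (31 January) is numerically greater than 04042016 (4 April). The sketch below runs the article's expression next to a variant that rearranges each date into yyyymmdd with substring() and concat() before comparing. The class and method names and the third sample tutorial are hypothetical, and the variant is only one possible approach, not the article's method.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class DateFilterSketch {

    // Hypothetical sample: the third entry is earlier in the year but has a larger day number.
    static final String XML = "<Tutorials>"
        + "<Tutorial tutId=\"01\"><date>04/04/2016</date></Tutorial>"
        + "<Tutorial tutId=\"02\"><date>04/05/2016</date></Tutorial>"
        + "<Tutorial tutId=\"03\"><date>31/01/2016</date></Tutorial>"
        + "</Tutorials>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    private static List<String> ids(Document doc, String expression) throws XPathExpressionException {
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xPath.compile(expression).evaluate(doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(((Element) nodes.item(i)).getAttribute("tutId"));
        }
        return result;
    }

    // The article's expression: strip the slashes and compare as a ddmmyyyy number.
    static List<String> idsAfterAsWritten(Document doc, String ddmmyyyy) throws XPathExpressionException {
        return ids(doc, "//Tutorial[number(translate(date, '/', '')) > " + ddmmyyyy + "]");
    }

    // Variant: rebuild each date as yyyymmdd so that numeric order matches calendar order.
    static List<String> idsAfterChronological(Document doc, int yyyymmdd) throws XPathExpressionException {
        String reordered = "concat(substring(date, 7, 4), substring(date, 4, 2), substring(date, 1, 2))";
        return ids(doc, "//Tutorial[number(" + reordered + ") > " + yyyymmdd + "]");
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        System.out.println(idsAfterAsWritten(doc, "04042016"));   // also matches 31/01/2016
        System.out.println(idsAfterChronological(doc, 20160404)); // matches only later dates
    }
}
```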

3.5 Retrieving Elements From a Document With a Namespace

If the XML declares a namespace, as in example_namespace.xml, the retrieval rules change:

<?xml version="1.0"?>
<Tutorials xmlns="/full_archive">
</Tutorials>

In this case "//Tutorial" no longer matches anything, because every <Tutorial> element now belongs to the /full_archive namespace.

To fix this:

First, set a namespace context:

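xPath.setNamespaceContext(new NamespaceContext() {
    @Override
    public Iterator<String> getPrefixes(String namespaceURI) {
        // Reverse lookup is not needed for this example
        return null;
    }
    @Override
    public String getPrefix(String namespaceURI) {
        // Reverse lookup is not needed for this example
        return null;
    }
    @Override
    public String getNamespaceURI(String prefix) {
        // Map the "bdn" prefix used in our expressions to the document's namespace
        if ("bdn".equals(prefix)) {
            return "/full_archive";
        }
        return null;
    }
});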

This defines "bdn" as the prefix for the namespace "/full_archive". From here on, our XPath expressions must include that prefix:

String expression = "/bdn:Tutorials/bdn:Tutorial";
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);
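Putting the pieces together, here is an end-to-end sketch under a few assumptions. The class name NamespaceSketch, the method count, and the inline XML are hypothetical. The sketch also calls setNamespaceAware(true) on the DocumentBuilderFactory, since the factory is not namespace-aware by default, and it returns XMLConstants.NULL_NS_URI rather than null for unknown prefixes, as the NamespaceContext contract describes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

final class NamespaceSketch {

    static final String NS = "/full_archive";

    // Hypothetical inline version of a namespaced document.
    static final String XML = "<Tutorials xmlns=\"/full_archive\">"
        + "<Tutorial tutId=\"01\"/><Tutorial tutId=\"02\"/>"
        + "</Tutorials>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Without this, the DOM carries no namespace information for XPath to match on.
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Counts the nodes selected by the expression, with "bdn" bound to the namespace.
    static int count(Document doc, String expression) throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        xPath.setNamespaceContext(new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                if (prefix == null) {
                    throw new IllegalArgumentException("prefix must not be null");
                }
                return "bdn".equals(prefix) ? NS : XMLConstants.NULL_NS_URI;
            }

            @Override
            public String getPrefix(String namespaceURI) {
                return NS.equals(namespaceURI) ? "bdn" : null;
            }

            @Override
            public Iterator<String> getPrefixes(String namespaceURI) {
                return NS.equals(namespaceURI)
                    ? Collections.singletonList("bdn").iterator()
                    : Collections.<String>emptyIterator();
            }
        });
        NodeList nodes = (NodeList) xPath.compile(expression).evaluate(doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        System.out.println(count(doc, "/bdn:Tutorials/bdn:Tutorial")); // prefixed names match
        System.out.println(count(doc, "//Tutorial"));                  // unprefixed names do not
    }
}
```

The prefix in the expression does not have to match anything in the document itself. Only the namespace URI it resolves to matters.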

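3.6 Avoiding Empty Text Node Troubles

The code in section 3.3 calls this.clean(xmlDocument) right after parsing the XML.

When a document contains empty text nodes, iterating over elements or child nodes can behave unexpectedly. For example, node.getFirstChild() may return a whitespace-only "#text" node instead of the element we expected.

The fix is to walk the document and remove those empty nodes: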

NodeList childNodes = node.getChildNodes();
for (int n = childNodes.getLength() - 1; n >= 0; n--) {
    Node child = childNodes.item(n);
    short nodeType = child.getNodeType();
    if (nodeType == Node.ELEMENT_NODE) {
        clean(child);
    }
    else if (nodeType == Node.TEXT_NODE) {
        String trimmedNodeVal = child.getNodeValue().trim();
        if (trimmedNodeVal.length() == 0){
            node.removeChild(child);
        }
        else {
            child.setNodeValue(trimmedNodeVal);
        }
    } else if (nodeType == Node.COMMENT_NODE) {
        node.removeChild(child);
    }
}

By checking each node's type, we remove the nodes we don't need: empty text nodes, comments, and so on.
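To show the effect, the sketch below wraps the loop in a clean(Node) method and compares the first child of the root element before and after the cleanup. The class name WhitespaceCleaner and the inline XML are hypothetical; the method signature is inferred from the recursive clean(child) call in the snippet.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class WhitespaceCleaner {

    // Hypothetical sample with indentation and a comment, as a hand-written file would have.
    static final String XML = "<Tutorials>\n"
        + "  <!-- first entry -->\n"
        + "  <Tutorial>  Guava  </Tutorial>\n"
        + "</Tutorials>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Same logic as the snippet in this section, wrapped in a method.
    static void clean(Node node) {
        NodeList childNodes = node.getChildNodes();
        // Iterate backwards because the NodeList is live and shrinks as children are removed.
        for (int n = childNodes.getLength() - 1; n >= 0; n--) {
            Node child = childNodes.item(n);
            short nodeType = child.getNodeType();
            if (nodeType == Node.ELEMENT_NODE) {
                clean(child);
            } else if (nodeType == Node.TEXT_NODE) {
                String trimmed = child.getNodeValue().trim();
                if (trimmed.isEmpty()) {
                    node.removeChild(child);
                } else {
                    child.setNodeValue(trimmed);
                }
            } else if (nodeType == Node.COMMENT_NODE) {
                node.removeChild(child);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        System.out.println(doc.getDocumentElement().getFirstChild().getNodeName()); // whitespace text node
        clean(doc);
        System.out.println(doc.getDocumentElement().getFirstChild().getNodeName()); // the Tutorial element
    }
}
```

Note that this version also trims the text it keeps, which may not be wanted if leading or trailing whitespace is significant in the data.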

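4. Conclusion

This article covered the basics of XPath parsing with the standard JDK. Out of the box, Java provides strong support for parsing, reading, and processing XML/HTML.

XPath expressions are not limited to Java; they are also used with XSLT to navigate XML documents. Libraries and technologies often mentioned in this context include JDOM, Saxon, XQuery (a query language rather than a library), JAXP, Jaxen, and Jackson. For HTML parsing, dedicated libraries such as JSoup are available.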
