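Guide to the Prometheus Java Client

1. Introduction

As distributed systems grow more complex, monitoring becomes essential for maintaining application performance and locating problems quickly. Prometheus, an open-source monitoring and alerting toolkit, is built for this purpose.

The Prometheus Java client lets us instrument an application with metrics and expose them in real time for Prometheus to scrape.

In this article, we'll look at how to use the Prometheus Java client library with Maven, including how to create custom metrics and how to configure an HTTP server to expose them. We'll also cover the metric types the library provides and tie these concepts together with practical examples.

2. Project Setup

To start using the Prometheus Java client, we'll manage the project dependencies with Maven. We need to add the following core dependencies to pom.xml: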

<dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>prometheus-metrics-core</artifactId>
    <version>1.3.1</version>
</dependency>
<dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>prometheus-metrics-instrumentation-jvm</artifactId>
    <version>1.3.1</version>
</dependency>
<dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>prometheus-metrics-exporter-httpserver</artifactId>
    <version>1.3.1</version>
</dependency>

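These dependencies serve the following purposes:

✅ prometheus-metrics-core: the core library, which provides the basis for defining and registering custom metrics such as counters, gauges, and histograms

✅ prometheus-metrics-instrumentation-jvm: out-of-the-box JVM metrics, including heap memory usage, garbage collection time, and thread counts

✅ prometheus-metrics-exporter-httpserver: an embedded HTTP server that exposes metrics in the Prometheus format through a /metrics endpoint for Prometheus to scrape

3. Creating and Exposing JVM Metrics

In this section, we'll expose the JVM metrics that the Prometheus Java client provides. These metrics give us insight into how the application is performing. Thanks to the prometheus-metrics-instrumentation-jvm dependency, we can register the built-in JVM metrics without writing any custom instrumentation: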

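import io.prometheus.metrics.exporter.httpserver.HTTPServer;
import io.prometheus.metrics.instrumentation.jvm.JvmMetrics;
import java.io.IOException;

public static void main(String[] args) throws InterruptedException, IOException {
    // Register the built-in JVM metrics with the default registry
    JvmMetrics.builder().register();

    // Start the embedded HTTP server that serves /metrics on port 9400
    HTTPServer server = HTTPServer.builder()
      .port(9400)
      .buildAndStart();

    System.out.println("HTTPServer listening on http://localhost:" + server.getPort() + "/metrics");

    // Block the main thread so the application keeps running
    Thread.currentThread().join();
}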

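To let Prometheus scrape the JVM metrics, we expose them over an HTTP endpoint. The prometheus-metrics-exporter-httpserver dependency gives us a simple HTTP server that listens on the specified port and serves the metric data.

Calling Thread.currentThread().join() blocks the main thread indefinitely, which keeps the application, and with it the HTTP server, running so that Prometheus can keep scraping the metrics.

3.1. Testing the Application

Once the application is running, we can view the exposed metrics by opening http://localhost:9400/metrics in a browser, or fetch them from the command line with curl:

$ curl http://localhost:9400/metrics

We'll see JVM metrics in the Prometheus format, similar to the following: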

# HELP jvm_memory_bytes_used Used bytes of a given JVM memory area.
# TYPE jvm_memory_bytes_used gauge
jvm_memory_bytes_used{area="heap",} 5242880
jvm_memory_bytes_used{area="nonheap",} 2345678
# HELP jvm_gc_collection_seconds Time spent in a given JVM garbage collector in seconds.
# TYPE jvm_gc_collection_seconds summary
jvm_gc_collection_seconds_count{gc="G1 Young Generation",} 5
jvm_gc_collection_seconds_sum{gc="G1 Young Generation",} 0.087
...

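The output shows JVM metrics such as memory usage, garbage collection details, thread counts, and more. Prometheus scrapes and analyzes the metrics exposed in this text-based format.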
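The format shown above is line-oriented: lines starting with # carry HELP and TYPE metadata, and every other line is a sample made of a metric name, an optional label block, and a value. As a rough illustration of how a consumer might read one sample line, here is a JDK-only sketch. The SampleLineParser class and its Sample type are hypothetical names made up for this example, and the parser deliberately ignores escaping, commas inside label values, and special values such as +Inf, all of which a real parser has to handle.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SampleLineParser {
    /** One parsed sample: metric name, labels, and value (hypothetical helper type). */
    static final class Sample {
        final String name;
        final Map<String, String> labels;
        final double value;

        Sample(String name, Map<String, String> labels, double value) {
            this.name = name;
            this.labels = labels;
            this.value = value;
        }
    }

    /** Parses a line such as: jvm_memory_bytes_used{area="heap",} 5242880 */
    static Sample parse(String line) {
        String trimmed = line.trim();
        int open = trimmed.indexOf('{');
        int close = trimmed.lastIndexOf('}');
        Map<String, String> labels = new LinkedHashMap<>();
        String name;
        String rest;
        if (open >= 0 && close > open) {
            name = trimmed.substring(0, open);
            // Split the label block on commas; a trailing comma produces no extra entry.
            for (String pair : trimmed.substring(open + 1, close).split(",")) {
                int eq = pair.indexOf('=');
                if (eq < 0) {
                    continue; // empty label block
                }
                String key = pair.substring(0, eq).trim();
                String quoted = pair.substring(eq + 1).trim();
                // Strip the surrounding double quotes from the label value.
                labels.put(key, quoted.substring(1, quoted.length() - 1));
            }
            rest = trimmed.substring(close + 1).trim();
        } else {
            int space = trimmed.indexOf(' ');
            name = trimmed.substring(0, space);
            rest = trimmed.substring(space + 1).trim();
        }
        // The first token after the name/labels is the value; a timestamp may follow.
        double value = Double.parseDouble(rest.split("\\s+")[0]);
        return new Sample(name, labels, value);
    }

    public static void main(String[] args) {
        Sample sample = parse("jvm_memory_bytes_used{area=\"heap\",} 5242880");
        System.out.println(sample.name + " " + sample.labels + " " + sample.value);
    }
}
```

In practice we rarely parse this format ourselves, since Prometheus does it during each scrape; the sketch is only meant to make the structure of a sample line explicit.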

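4. Metric Types

The Prometheus Java client groups metrics into several types, each of which measures a different aspect of application behavior. **These types are based on the OpenMetrics standard**, which Prometheus adheres to.

Let's go through the main metric types and their typical uses.

4.1. Counter

A Counter is a metric that only ever increases, which makes it a good fit for counting requests, errors, or completed tasks. Its value resets only when the process restarts.

For example, we can count the HTTP requests an application handles: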

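import io.prometheus.metrics.core.metrics.Counter;

Counter requestCounter = Counter.builder()
  .name("http_requests_total")
  .help("Total number of HTTP requests")
  .labelNames("method", "status")
  .register();

// Increment the series for method="GET", status="200" by one
requestCounter.labelValues("GET", "200").inc();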

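We use labelNames and labelValues to add dimensions to the metric. Here, labelNames declares the label keys (method and status) when the metric is built, and labelValues supplies the matching values, in the same order, when a value is recorded. Labels in Prometheus are key-value pairs that distinguish different categories of the same metric.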
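One way to picture labels is that every distinct combination of label values gets its own independent counter. The JDK-only sketch below models that idea with a map keyed by the label values. LabeledCounterSketch is a hypothetical class written for this illustration, not how the client library is implemented.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

class LabeledCounterSketch {
    private final List<String> labelNames;
    // One independent counter per distinct combination of label values.
    private final Map<List<String>, LongAdder> series = new ConcurrentHashMap<>();

    LabeledCounterSketch(String... labelNames) {
        this.labelNames = List.of(labelNames);
    }

    /** Increments the counter identified by the given label values. */
    void inc(String... labelValues) {
        if (labelValues.length != labelNames.size()) {
            throw new IllegalArgumentException(
                "expected " + labelNames.size() + " label values, got " + labelValues.length);
        }
        series.computeIfAbsent(List.of(labelValues), key -> new LongAdder()).increment();
    }

    /** Returns the current count for the given label values, or 0 if never incremented. */
    long get(String... labelValues) {
        LongAdder adder = series.get(List.of(labelValues));
        return adder == null ? 0 : adder.sum();
    }

    /** Number of distinct label-value combinations seen so far. */
    int seriesCount() {
        return series.size();
    }

    public static void main(String[] args) {
        LabeledCounterSketch requests = new LabeledCounterSketch("method", "status");
        requests.inc("GET", "200");
        requests.inc("GET", "200");
        requests.inc("POST", "500");
        System.out.println("GET/200 = " + requests.get("GET", "200"));
        System.out.println("series  = " + requests.seriesCount());
    }
}
```

Because each new combination creates another series, label values should come from a small, bounded set such as HTTP methods or status codes rather than from unbounded data like user IDs.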

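4.2. Gauge

A Gauge is a metric that can go up and down, which makes it suitable for tracking values that fluctuate over time, such as memory usage, temperature, or the number of active threads.

For example, we can measure the current memory usage: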

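import io.prometheus.metrics.core.metrics.Gauge;

Gauge memoryUsage = Gauge.builder()
  .name("memory_usage_bytes")
  .help("Current memory usage in bytes")
  .register();

// Overwrite the gauge with the latest measured value
memoryUsage.set(5000000);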
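The value 5000000 above is a placeholder. In a real application, the number passed to set() would come from a measurement. As a minimal sketch of one possible source, the following JDK-only code reads the heap currently in use from Runtime. HeapUsageReader is a hypothetical helper name, and the JVM instrumentation module from section 3 already exposes memory metrics, so this is only meant to show where a gauge value might come from.

```java
class HeapUsageReader {
    /** Returns the heap bytes currently in use, as reported by the JVM runtime. */
    static long usedHeapBytes() {
        Runtime runtime = Runtime.getRuntime();
        // totalMemory() is the heap currently reserved; freeMemory() is the unused part of it.
        return runtime.totalMemory() - runtime.freeMemory();
    }

    public static void main(String[] args) {
        long used = usedHeapBytes();
        // In the gauge example, this value would be passed to memoryUsage.set(used).
        System.out.println("used heap bytes: " + used);
    }
}
```

Since a gauge simply holds the last value it was given, such a reading has to be refreshed periodically for the exposed metric to stay current.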

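4.3. Histogram

A Histogram observes and tracks the distribution of values such as request latencies or response sizes. It sorts observations into preconfigured buckets and exposes the number of observations in each bucket, along with the total count and the sum of all observed values, which lets us analyze the distribution and estimate quantiles.

The following example measures HTTP request latency and records it in a histogram: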

import io.prometheus.metrics.core.metrics.Histogram;
import java.util.Random;

Histogram requestLatency = Histogram.builder()
  .name("http_request_latency_seconds")
  .help("Tracks HTTP request latency in seconds")
  .labelNames("method")
  .register();

// Record 100 simulated latencies between 0.1 and 3.1 seconds
Random random = new Random();
for (int i = 0; i < 100; i++) {
    double latency = 0.1 + (3 * random.nextDouble());
    requestLatency.labelValues("GET").observe(latency);
}

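Since we didn't specify custom buckets when creating the histogram, the library uses its default buckets. The defaults are spaced roughly exponentially, which suits measurements of durations and latencies. Their upper bounds are:

[5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s, +Inf]

Looking at the results, we'll see output similar to this: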

http_request_latency_seconds_bucket{method="GET",le="0.005"} 0
http_request_latency_seconds_bucket{method="GET",le="0.01"} 0
http_request_latency_seconds_bucket{method="GET",le="0.025"} 0
http_request_latency_seconds_bucket{method="GET",le="0.05"} 0
http_request_latency_seconds_bucket{method="GET",le="0.1"} 0
http_request_latency_seconds_bucket{method="GET",le="0.25"} 6
http_request_latency_seconds_bucket{method="GET",le="0.5"} 15
http_request_latency_seconds_bucket{method="GET",le="1.0"} 32
http_request_latency_seconds_bucket{method="GET",le="2.5"} 79
http_request_latency_seconds_bucket{method="GET",le="5.0"} 100
http_request_latency_seconds_bucket{method="GET",le="10.0"} 100
http_request_latency_seconds_bucket{method="GET",le="+Inf"} 100
http_request_latency_seconds_count{method="GET"} 100
http_request_latency_seconds_sum{method="GET"} 157.8138389516349

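Each bucket is cumulative: it counts every observation that is less than or equal to its upper bound (the le label). For example, the le="0.25" bucket shows that 6 requests took 250 ms or less. The +Inf bucket captures every observation, so its count equals the total number of observations.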
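The bucket counts in the exposed output are cumulative, so each le bucket also includes everything counted in the smaller buckets. The JDK-only sketch below reproduces that counting for the default bounds and then derives how many observations fall between two adjacent bounds. CumulativeBuckets is a hypothetical class for illustration and says nothing about how the library stores its data internally.

```java
import java.util.Arrays;

class CumulativeBuckets {
    // The default upper bounds listed in the article, in seconds.
    static final double[] DEFAULT_BOUNDS = {
        0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, Double.POSITIVE_INFINITY
    };

    /** Returns, for each upper bound, how many observations are <= that bound. */
    static long[] count(double[] upperBounds, double[] observations) {
        long[] counts = new long[upperBounds.length];
        for (double value : observations) {
            for (int i = 0; i < upperBounds.length; i++) {
                if (value <= upperBounds[i]) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }

    /** Converts cumulative counts into counts per interval (previous bound, this bound]. */
    static long[] perBucket(long[] cumulative) {
        long[] result = new long[cumulative.length];
        for (int i = 0; i < cumulative.length; i++) {
            result[i] = cumulative[i] - (i == 0 ? 0 : cumulative[i - 1]);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] observations = {0.2, 0.3, 0.7, 1.5, 3.0};
        long[] cumulative = count(DEFAULT_BOUNDS, observations);
        System.out.println(Arrays.toString(cumulative));            // [0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 5, 5]
        System.out.println(Arrays.toString(perBucket(cumulative))); // [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
    }
}
```

Applied to the output above, the same subtraction tells us that 15 - 6 = 9 requests took between 250 ms and 500 ms.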

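4.4. Summary

A Summary is similar to a Histogram, but it summarizes the observations by calculating quantiles instead of using predefined buckets. It suits tracking request latencies or response sizes and helps us calculate key figures such as the median (50th percentile) or the 90th percentile: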

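import io.prometheus.metrics.core.metrics.Summary;
import java.util.Random;

Summary requestDuration = Summary.builder()
  .name("http_request_duration_seconds")
  .help("Tracks the duration of HTTP requests in seconds")
  .quantile(0.5, 0.05)
  .quantile(0.9, 0.01)
  .register();

// Record 100 simulated durations between 0.05 and 2.05 seconds
Random random = new Random();
for (int i = 0; i < 100; i++) {
    double duration = 0.05 + (2 * random.nextDouble());
    requestDuration.observe(duration);
}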

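We defined two quantiles:

0.5 (the 50th percentile) approximates the median, with a tolerated error of 5%
0.9 (the 90th percentile) is the value that 90% of requests are faster than, with a tolerated error of 1%

When Prometheus scrapes the metrics, we'll see output similar to this: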

http_request_duration_seconds{quantile="0.5"} 1.3017345289221114
http_request_duration_seconds{quantile="0.9"} 1.8304437814581778
http_request_duration_seconds_count 100
http_request_duration_seconds_sum 110.5670284649691

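The quantile samples show the observed values at the 50th and 90th percentiles: about 50% of the requests took 1.30 seconds or less, and about 90% took 1.83 seconds or less.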
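To make the meaning of a quantile concrete, the following JDK-only sketch computes an exact quantile from a complete set of observations using the nearest-rank method. This is only an illustration of the definition: a Summary works on a stream of observations and reports an approximation within the configured error rather than sorting every value. ExactQuantile is a hypothetical class name.

```java
import java.util.Arrays;

class ExactQuantile {
    /**
     * Nearest-rank quantile: the smallest observation such that at least
     * q * n of the n observations are less than or equal to it.
     */
    static double quantile(double[] observations, double q) {
        if (observations.length == 0 || q < 0.0 || q > 1.0) {
            throw new IllegalArgumentException("need at least one observation and 0 <= q <= 1");
        }
        double[] sorted = observations.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length);
        // Rank 0 (q = 0) maps to the smallest observation.
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        double[] durations = {5, 1, 4, 2, 3, 10, 9, 8, 7, 6};
        System.out.println("median = " + quantile(durations, 0.5));
        System.out.println("p90    = " + quantile(durations, 0.9));
    }
}
```

With these ten sample durations, half of the values are at or below the median, and nine out of ten are at or below the 90th percentile, which mirrors how the quantile samples in the output above are read.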

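4.5. Info

An Info metric stores static labels about the application, such as version numbers, build information, or environment details. It isn't a performance metric, but a way to add informational metadata to the Prometheus output.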

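import io.prometheus.metrics.core.metrics.Info;

Info appInfo = Info.builder()
  .name("app_info")
  .help("Application version information")
  .labelNames("version", "build")
  .register();

// Values are given in the same order as the label names: version, then build
appInfo.addLabelValues("1.0.0", "12345");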

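4.6. StateSet

A StateSet metric represents a set of states, each of which can be active or inactive. It suits tracking the operational states of an application or the status of feature flags: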

import io.prometheus.metrics.core.metrics.StateSet;

StateSet stateSet = StateSet.builder()
  .name("feature_flags")
  .help("Feature flags")
  .labelNames("env")
  .states("feature1")
  .register();

// Mark feature1 as inactive for env="dev"
stateSet.labelValues("dev").setFalse("feature1");
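In the OpenMetrics text format, a StateSet is written as one sample per state: the state name is carried in a label named after the metric, and the value is 1 when the state is active and 0 when it isn't. The JDK-only sketch below renders such lines for the feature_flags example. StateSetLines is a hypothetical helper that hard-codes the env label from the example and is not the library's own writer, so the exact output of the client may differ in details.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class StateSetLines {
    /** Renders one sample line per state, with 1 for active and 0 for inactive. */
    static List<String> render(String metricName, String env, Map<String, Boolean> states) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Boolean> state : states.entrySet()) {
            // The state name is the value of a label that shares the metric's name.
            lines.add(String.format("%s{env=\"%s\",%s=\"%s\"} %d",
                metricName, env, metricName, state.getKey(), state.getValue() ? 1 : 0));
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, Boolean> states = new LinkedHashMap<>();
        states.put("feature1", false);
        states.put("feature2", true); // a second, made-up state to show an active flag
        render("feature_flags", "dev", states).forEach(System.out::println);
    }
}
```

Reading the metric back is then a matter of filtering on the state label, for example selecting the samples whose value is 1 to list the active flags.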

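5. Prometheus Metric Types Overview

The Prometheus Java client provides several metric types for capturing different dimensions of application performance and behavior. The following table summarizes the key characteristics and uses of each type, with an example:

Metric Type	Description	Example Use Case
Counter	A metric that only increases, typically used to count events	Counting HTTP requests or errors
Gauge	A metric that can increase or decrease, used to track fluctuating values	Monitoring memory usage or the number of active threads
Histogram	Distributes observed values into configurable buckets	Observing request latencies or response sizes
Summary	Tracks the distribution of observations and calculates configurable quantiles	Measuring request durations or latency quantiles
Info	Stores static labels holding application metadata	Capturing version or build information
StateSet	Tracks multiple operational states that can be active or inactive	Monitoring feature flag status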

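6. Conclusion

In this article, we explored how to use the Prometheus Java client to monitor an application, covering both custom metrics and JVM metrics. We first set up the project with Maven dependencies, then exposed the metrics over an HTTP endpoint, and finally looked in detail at core metric types such as Counter, Gauge, Histogram, and Summary, each of which suits a different kind of measurement.

⚠️ Pitfall: in production, choose Histogram bucket boundaries carefully. Too many fine-grained buckets increase storage pressure, while buckets that are too coarse lose precision.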
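One common way to keep the number of buckets small while still covering a wide range is to space the upper bounds exponentially. The JDK-only sketch below generates such bounds from a start value, a growth factor, and a bucket count. BucketBounds.exponential is a hypothetical helper written for this illustration, and how the resulting bounds are handed to the histogram builder is left out here.

```java
import java.util.Arrays;

class BucketBounds {
    /** Returns count upper bounds: start, start * factor, start * factor^2, and so on. */
    static double[] exponential(double start, double factor, int count) {
        if (start <= 0 || factor <= 1 || count < 1) {
            throw new IllegalArgumentException("need start > 0, factor > 1, count >= 1");
        }
        double[] bounds = new double[count];
        double bound = start;
        for (int i = 0; i < count; i++) {
            bounds[i] = bound;
            bound *= factor;
        }
        return bounds;
    }

    public static void main(String[] args) {
        // Six buckets starting at 50 ms, each twice as wide as the previous one.
        System.out.println(Arrays.toString(exponential(0.05, 2, 6)));
    }
}
```

With a factor of 2, six buckets span 50 ms to 1.6 s; raising the factor widens the range at the cost of resolution, which is exactly the trade-off described in the pitfall above.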

The complete code for this article is available on GitHub.
