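Smarter Retries with Exponential Backoff and Jitter

1. Overview

In a distributed system, any network call can fail at any time. To make the system more robust, clients usually implement a retry mechanism to cope with these transient failures.

This article focuses on two ways to improve a retry strategy:

Exponential backoff
Jitter

Used together, these two strategies keep retry requests from hitting the server as an "avalanche" and improve the overall stability of the system.

2. Retry Basics

Suppose we have a client that needs to call a remote service, PingPongService: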

interface PingPongService {
    String call(String ping) throws PingPongServiceException;
}

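When the service throws PingPongServiceException, the client should retry. Retry logic looks simple, but handled carelessly it can increase the load on a server that is already struggling.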
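
To make the risk concrete, here is a minimal sketch of a naive retry wrapper in plain Java. It is not taken from the article's source code: the class NaiveRetrySketch, the method withNaiveRetry, and the unchecked stand-in for PingPongServiceException are all hypothetical names introduced for illustration.

```java
import java.util.function.Function;

final class NaiveRetrySketch {

    /** Hypothetical stand-in for the article's exception; assumed unchecked here. */
    static class PingPongServiceException extends RuntimeException {
        PingPongServiceException(String message) {
            super(message);
        }
    }

    /**
     * Wraps fn so that it is called up to maxAttempts times in total,
     * retrying immediately whenever PingPongServiceException is thrown.
     */
    static Function<String, String> withNaiveRetry(Function<String, String> fn, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        return ping -> {
            PingPongServiceException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return fn.apply(ping);
                } catch (PingPongServiceException e) {
                    last = e; // no wait here: the next attempt fires immediately
                }
            }
            throw last; // every attempt failed
        };
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated service that fails twice before answering.
        Function<String, String> flaky = ping -> {
            if (++calls[0] < 3) {
                throw new PingPongServiceException("transient failure");
            }
            return "pong";
        };
        String reply = withNaiveRetry(flaky, 5).apply("ping");
        System.out.println(reply + " after " + calls[0] + " calls");
    }
}
```

Because there is no pause between attempts, a server that is already failing receives the repeated calls back to back, which is the behavior the rest of the article sets out to avoid.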

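3. Implementing Retries with Resilience4j

We use Resilience4j, a lightweight fault-tolerance library, to implement the retry logic. First, add the dependency to pom.xml (the version is omitted here, so specify one or manage it through a BOM):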

<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-retry</artifactId>
</dependency>

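Then create a retry configuration object. Note that in Resilience4j, maxAttempts counts the initial call as the first attempt: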

RetryConfig retryConfig = RetryConfig.custom()
  .maxAttempts(MAX_RETRIES)
  .intervalFunction(intervalFn)
  .build();

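Next, build a Retry instance from that configuration and use it to decorate our service call:

// Create a Retry instance from the configuration; "pingpong" is just a label.
Retry retry = Retry.of("pingpong", retryConfig);
Function<String, String> pingPongFn = Retry
    .decorateFunction(retry, ping -> service.call(ping));
pingPongFn.apply("Hello");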

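4. Exponential Backoff: Making Retries "Gentler"

Retrying immediately adds to the load on the server. The idea behind exponential backoff is that the wait time grows after every failure: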

wait_interval = base * multiplier^n

where:

base is the initial wait time
n is the number of failures so far
multiplier is the growth factor
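
As a quick worked example of the formula above, the following sketch computes the first few wait intervals in plain Java. The class and method names are hypothetical, and the values (a 1000 ms base, a multiplier of 2, and n starting at 0 for the first retry) are assumptions chosen only for illustration.

```java
final class BackoffSketch {

    /**
     * wait_interval = base * multiplier^n, in milliseconds.
     * n is assumed to be 0 for the first retry, 1 for the second, and so on.
     */
    static long intervalMillis(long baseMillis, double multiplier, int n) {
        return Math.round(baseMillis * Math.pow(multiplier, n));
    }

    public static void main(String[] args) {
        // Print the wait before each of the first four retries.
        for (int n = 0; n < 4; n++) {
            System.out.println("n=" + n + " -> " + intervalMillis(1000, 2.0, n) + " ms");
        }
    }
}
```

With these assumed values the waits are 1000, 2000, 4000 and 8000 ms, so each failure doubles the pause before the next attempt.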

In Resilience4j, this can be configured as follows:

IntervalFunction intervalFn =
  IntervalFunction.ofExponentialBackoff(INITIAL_INTERVAL, MULTIPLIER);

Suppose we simulate 4 concurrent clients calling the service. The log looks like this:

[thread-1] At 00:37:42.756
[thread-2] At 00:37:42.756
[thread-3] At 00:37:42.756
[thread-4] At 00:37:42.756

[thread-2] At 00:37:43.802
[thread-4] At 00:37:43.802
[thread-1] At 00:37:43.802
[thread-3] At 00:37:43.802

[thread-2] At 00:37:45.803
[thread-1] At 00:37:45.803
[thread-4] At 00:37:45.803
[thread-3] At 00:37:45.803

[thread-2] At 00:37:49.808
[thread-3] At 00:37:49.808
[thread-4] At 00:37:49.808
[thread-1] At 00:37:49.808

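✅ The wait time does grow with each attempt, but ❌ all clients retry at almost exactly the same moment, producing a "retry spike".

5. Jitter: Breaking the Synchronization to Avoid Retry Spikes

To solve the problem of synchronized retries, we introduce jitter, that is, a random component added to the wait time: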

wait_interval = (base * multiplier^n) +/- random_interval
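
The sketch below shows one plausible way to compute such a jittered interval in plain Java. It is an assumption about how a randomization factor can be applied (a uniform offset of up to that fraction of the backoff interval, in either direction), not a statement of how Resilience4j computes it internally. The names JitterSketch and jitteredIntervalMillis are hypothetical.

```java
import java.util.Random;

final class JitterSketch {

    /**
     * Returns the exponential backoff interval plus or minus a random offset.
     * Assumption: the offset is uniform in [-delta, +delta),
     * where delta = randomizationFactor * (base * multiplier^n).
     */
    static long jitteredIntervalMillis(long baseMillis, double multiplier, int n,
                                       double randomizationFactor, Random random) {
        double interval = baseMillis * Math.pow(multiplier, n);
        double delta = randomizationFactor * interval;
        double offset = (random.nextDouble() * 2.0 - 1.0) * delta;
        return Math.round(interval + offset);
    }

    public static void main(String[] args) {
        // Four simulated clients, each with its own random source.
        for (int client = 1; client <= 4; client++) {
            Random random = new Random();
            StringBuilder line = new StringBuilder("client-" + client + ":");
            for (int n = 0; n < 3; n++) {
                line.append(' ').append(jitteredIntervalMillis(1000, 2.0, n, 0.5, random)).append("ms");
            }
            System.out.println(line);
        }
    }
}
```

Since every client draws its own random offsets, their retry times drift apart instead of landing on the same instant. With a randomization factor of 0 the method falls back to plain exponential backoff.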

In Resilience4j, this can be configured as follows:

IntervalFunction intervalFn = 
  IntervalFunction.ofExponentialRandomBackoff(INITIAL_INTERVAL, MULTIPLIER, RANDOMIZATION_FACTOR);

Here is the log from another run of concurrent calls:

[thread-2] At 39:21.297
[thread-4] At 39:21.297
[thread-3] At 39:21.297
[thread-1] At 39:21.297

[thread-2] At 39:21.918
[thread-3] At 39:21.868
[thread-4] At 39:22.011
[thread-1] At 39:22.184

[thread-1] At 39:23.086
[thread-2] At 39:23.939
[thread-3] At 39:24.152
[thread-4] At 39:24.977

[thread-3] At 39:26.861
[thread-1] At 39:28.617
[thread-4] At 39:28.942
[thread-2] At 39:31.039

✅ The requests are now spread out more evenly, which avoids retry spikes and also avoids long stretches of idle time.

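6. Summary

Exponential backoff makes the client wait progressively longer between retries, which eases the pressure on the server
Jitter adds a random component that breaks the synchronization, so multiple clients do not retry at the same moment and cause an "avalanche"
Used together, they noticeably improve the system's stability in the face of network fluctuations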
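
To tie the points above together, here is a small library-free sketch of a retry helper that waits with exponential backoff plus jitter between attempts. It is only an illustration under stated assumptions: the names BackoffRetrySketch, retry, and the sleeper parameter are hypothetical, it retries on any RuntimeException (real code should retry only on failures known to be transient), and the jitter is a uniform offset as a fraction of the backoff interval.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.LongConsumer;
import java.util.function.Supplier;

final class BackoffRetrySketch {

    /**
     * Calls the supplier up to maxAttempts times in total. After each failure
     * except the last, it passes a jittered backoff interval (in ms) to sleeper.
     */
    static <T> T retry(Supplier<T> call, int maxAttempts, long baseMillis, double multiplier,
                       double randomizationFactor, Random random, LongConsumer sleeper) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts - 1) {
                    break; // no point in waiting after the final attempt
                }
                double interval = baseMillis * Math.pow(multiplier, attempt);
                double delta = randomizationFactor * interval;
                double offset = (random.nextDouble() * 2.0 - 1.0) * delta;
                sleeper.accept(Math.round(interval + offset));
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        List<Long> waits = new ArrayList<>();
        int[] calls = {0};
        // The sleeper only records the waits so the demo finishes instantly;
        // a real caller would sleep for that many milliseconds instead.
        String reply = retry(() -> {
            if (++calls[0] < 4) {
                throw new IllegalStateException("transient failure");
            }
            return "pong";
        }, 5, 1000, 2.0, 0.5, new Random(), ms -> waits.add(ms));
        System.out.println(reply + " after waits " + waits);
    }
}
```

Passing the sleeper in as a parameter is a design choice for testability: the wait schedule can be inspected without actually pausing the thread.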

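If you are building a high-concurrency system, consider adding exponential backoff with jitter to your retry logic. It is a simple but very effective optimization.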

Source code: GitHub
