2. Dependencies and Configuration

First, let's add the spring-ai-starter-model-openai dependency:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-openai</artifactId>
    <version>1.1.0</version>
</dependency>

Next, let's configure the OpenAI model parameters for Spring AI:

spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.audio.speech.options.model=tts-1
spring.ai.openai.audio.speech.options.voice=alloy
spring.ai.openai.audio.speech.options.response-format=mp3
spring.ai.openai.audio.speech.options.speed=1.0

⚠️ We must set the OpenAI API key, and also specify:

  • the TTS model name (e.g., tts-1)
  • the voice (e.g., alloy)
  • the response format (e.g., mp3)
  • the speech speed (1.0 is normal speed)

3. Building the Text-to-Speech Application

Let's create the core service class, TextToSpeechService:

@Service
public class TextToSpeechService {

    private final OpenAiAudioSpeechModel openAiAudioSpeechModel;

    @Autowired
    public TextToSpeechService(OpenAiAudioSpeechModel openAiAudioSpeechModel) {
        this.openAiAudioSpeechModel = openAiAudioSpeechModel;
    }

    public byte[] makeSpeech(String text) {
        SpeechPrompt speechPrompt = new SpeechPrompt(text);
        SpeechResponse response = openAiAudioSpeechModel.call(speechPrompt);
        return response.getResult().getOutput();
    }
}

✅ Here we use the OpenAiAudioSpeechModel auto-configured by Spring AI. The makeSpeech() method converts the given text into an array of audio bytes.

Next, let's create the controller, TextToSpeechController:

@RestController
public class TextToSpeechController {
    private final TextToSpeechService textToSpeechService;

    @Autowired
    public TextToSpeechController(TextToSpeechService textToSpeechService) {
        this.textToSpeechService = textToSpeechService;
    }

    @GetMapping("/text-to-speech")
    public ResponseEntity<byte[]> generateSpeechForText(@RequestParam String text) {
        return ResponseEntity.ok(textToSpeechService.makeSpeech(text));
    }
}

Finally, let's write a test to verify the endpoint:

@SpringBootTest
@ExtendWith(SpringExtension.class)
@AutoConfigureMockMvc
@EnabledIfEnvironmentVariable(named = "OPENAI_API_KEY", matches = ".*")
class TextToSpeechLiveTest {

    @Autowired
    private MockMvc mockMvc;

    @Autowired
    private TextToSpeechService textToSpeechService;

    @Test
    void givenTextToSpeechService_whenCallingTextToSpeechEndpoint_thenExpectedAudioFileBytesShouldBeObtained() throws Exception {
        byte[] audioContent = mockMvc.perform(get("/text-to-speech")
          .param("text", "Hello from Baeldung"))
          .andExpect(status().isOk())
          .andReturn()
          .getResponse()
          .getContentAsByteArray();

        assertNotEquals(0, audioContent.length);
    }
}

The test calls the /text-to-speech endpoint and verifies that the returned audio data is non-empty. Saving the bytes to a file gives us a playable MP3.
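Saving the returned byte array is a one-liner with java.nio. Here's a minimal sketch; the AudioFileWriter class name and the placeholder bytes are ours, not part of the tutorial's code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class AudioFileWriter {

    // Write the audio bytes returned by the service to a file on disk.
    public static Path writeAudio(byte[] audioContent, Path target) throws IOException {
        return Files.write(target, audioContent);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes standing in for real MP3 data from the service
        byte[] fakeAudio = {0x49, 0x44, 0x33};
        Path out = writeAudio(fakeAudio, Files.createTempFile("speech", ".mp3"));
        System.out.println("Wrote " + Files.size(out) + " bytes to " + out);
    }
}
```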

4. Adding a Real-Time Streaming Audio Endpoint

Fetching a large audio file in one go consumes a lot of memory and rules out playing the audio while it's still being generated. OpenAI supports streaming responses, so let's extend TextToSpeechService:

public Flux<byte[]> makeSpeechStream(String text) {
    SpeechPrompt speechPrompt = new SpeechPrompt(text);
    Flux<SpeechResponse> responseStream = openAiAudioSpeechModel.stream(speechPrompt);

    return responseStream
      .map(SpeechResponse::getResult)
      .map(Speech::getOutput);
}

✅ The stream() method emits the audio as a stream of byte chunks, so we never hold the full payload in memory.

Now let's add the streaming endpoint:

@GetMapping(value = "/text-to-speech-stream", produces = MediaType.APPLICATION_OCTET_STREAM_VALUE)
public ResponseEntity<StreamingResponseBody> streamSpeech(@RequestParam("text") String text) {
    Flux<byte[]> audioStream = textToSpeechService.makeSpeechStream(text);

    StreamingResponseBody responseBody = outputStream -> {
        audioStream.toStream().forEach(bytes -> {
            try {
                outputStream.write(bytes);
                outputStream.flush();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    };

    return ResponseEntity.ok()
      .contentType(MediaType.APPLICATION_OCTET_STREAM)
      .body(responseBody);
}

💡 We use application/octet-stream to mark the streamed response. In a WebFlux application, we could return Flux<byte[]> directly instead.
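The write-then-flush-per-chunk pattern inside the StreamingResponseBody lambda above can be isolated as a plain-Java sketch; the ChunkedWriter class and the sample chunks are ours, for illustration only:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.List;

public class ChunkedWriter {

    // Write each audio chunk to the client as soon as it arrives,
    // flushing after every chunk so playback can start before the
    // full response has been generated.
    public static void copyChunks(List<byte[]> chunks, OutputStream out) {
        chunks.forEach(bytes -> {
            try {
                out.write(bytes);
                out.flush();
            } catch (IOException e) {
                // StreamingResponseBody lambdas cannot throw checked exceptions
                throw new UncheckedIOException(e);
            }
        });
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copyChunks(List.of(new byte[]{1, 2}, new byte[]{3}), out);
        System.out.println("total bytes: " + out.size()); // 3
    }
}
```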

Let's test the streaming endpoint:

@Test
void givenStreamingEndpoint_whenCalled_thenReceiveAudioFileBytes() throws Exception {

    String longText = """
          Hello from Baeldung!
          Here, we explore the world of Java,
          Spring, and web development with clear, practical tutorials.
          Whether you're just starting out or diving deep into advanced
          topics, you'll find guides to help you write clean, efficient,
          and modern code.
          """;

    mockMvc.perform(get("/text-to-speech-stream")
        .param("text", longText)
        .accept(MediaType.APPLICATION_OCTET_STREAM))
      .andExpect(status().isOk())
      .andDo(result -> {
          byte[] response = result.getResponse().getContentAsByteArray();
          assertNotNull(response);
          assertTrue(response.length > 0);
      });
}

Although MockMvc buffers the complete response, the endpoint itself streams the audio to real clients in real time.

5. Customizing Model Parameters for Specific Calls

Sometimes we need to override the default parameters for a single request. We can do this with OpenAiAudioSpeechOptions:

public byte[] makeSpeech(String text, OpenAiAudioSpeechOptions speechOptions) {
    SpeechPrompt speechPrompt = new SpeechPrompt(text, speechOptions);
    SpeechResponse response = openAiAudioSpeechModel.call(speechPrompt);
    return response.getResult().getOutput();
}

✅ This overload of makeSpeech() accepts custom options. If the options object is empty, the defaults from the configuration apply.

Next, let's add an endpoint that accepts custom parameters:

@GetMapping("/text-to-speech-customized")
public ResponseEntity<byte[]> generateSpeechForTextCustomized(
    @RequestParam("text") String text, 
    @RequestParam Map<String, String> params) {
    
    OpenAiAudioSpeechOptions speechOptions = OpenAiAudioSpeechOptions.builder()
      .model(params.get("model"))
      .voice(OpenAiAudioApi.SpeechRequest.Voice.valueOf(params.get("voice")))
      .responseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.valueOf(params.get("responseFormat")))
      .speed(Float.parseFloat(params.get("speed")))
      .build();

    return ResponseEntity.ok(textToSpeechService.makeSpeech(text, speechOptions));
}

We build the options dynamically from a Map, supporting these parameters:

  • model: the TTS model
  • voice: the voice
  • responseFormat: the audio format
  • speed: the speech speed
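Note that the controller maps the voice and responseFormat parameters with Enum.valueOf(), which throws IllegalArgumentException for unknown names, so an unchecked request parameter would surface as a server error. A defensive-parsing sketch, using a stand-in enum of our own (the real code uses OpenAiAudioApi.SpeechRequest.Voice):

```java
public class EnumParsing {

    // A stand-in enum for illustration; not the real Spring AI type.
    public enum Voice { ALLOY, NOVA, ECHO }

    // valueOf() throws IllegalArgumentException for unknown names and
    // NullPointerException for null, so parse defensively and fall back
    // to a default instead of letting the request fail.
    public static Voice parseVoice(String raw, Voice fallback) {
        try {
            return Voice.valueOf(raw.toUpperCase());
        } catch (IllegalArgumentException | NullPointerException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseVoice("nova", Voice.ALLOY));    // NOVA
        System.out.println(parseVoice("unknown", Voice.ALLOY)); // ALLOY
    }
}
```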

Let's test the customized endpoint:

@Test
void givenTextToSpeechService_whenCallingTextToSpeechEndpointWithAnotherVoiceOption_thenExpectedAudioFileBytesShouldBeObtained() throws Exception {
    byte[] audioContent = mockMvc.perform(get("/text-to-speech-customized")
      .param("text", "Hello from Baeldung")
      .param("model", "tts-1")
      .param("voice", "NOVA")
      .param("responseFormat", "MP3")
      .param("speed", "1.0"))
    .andExpect(status().isOk())
    .andReturn()
    .getResponse()
    .getContentAsByteArray();

    assertNotEquals(0, audioContent.length);
}

Here we override the default configuration with the NOVA voice and successfully receive the customized audio.

6. Conclusion

The text-to-speech API makes generating natural-sounding speech remarkably simple. With modern models and minimal configuration, we can add dynamic voice interaction to an application.

In this article, we demonstrated how to integrate OpenAI's TTS models using Spring AI, covering:

  • basic text-to-speech conversion
  • streaming audio transfer
  • per-request parameter overrides

A few pitfalls to watch for: handle IOException when streaming, and validate enum values when overriding parameters. The complete code is available on GitHub.


Original title: A Guide to OpenAI Text-to-Speech (TTS) in Spring AI | Baeldung