Transcribing Audio Files with Spring AI and OpenAI

1. Overview

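Businesses often need to extract meaningful data from various kinds of audio content, for example:

Transcribing customer support calls for sentiment analysis
Generating subtitles for videos
Producing meeting minutes from recorded meetings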

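Transcribing audio files by hand is both time-consuming and expensive. To address this, OpenAI offers powerful speech-to-text models that can accurately transcribe audio files in many languages.

In this tutorial, we'll explore how to implement audio transcription in Spring AI using OpenAI's speech-to-text models.

To follow along, we'll need an OpenAI API key.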

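2. Setting up the Project

Before we implement audio transcription, we need to add the necessary dependency and configure our application correctly.

2.1. Dependencies

Let's start by adding Spring AI's OpenAI starter dependency to our project's pom.xml: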

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-openai</artifactId>
    <version>1.0.0-M7</version>
</dependency>

Since the current version, 1.0.0-M7, is a milestone release, we also need to add the Spring Milestones repository to our pom.xml:

<repositories>
    <repository>
        <id>spring-milestones</id>
        <name>Spring Milestones</name>
        <url>https://repo.spring.io/milestone</url>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </repository>
</repositories>

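Milestone versions are published to this repository rather than to the standard Maven Central repository.

2.2. Configuring OpenAI Properties

Next, let's configure our OpenAI API key and the speech-to-text model in application.yaml: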

spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      audio:
        transcription:
          options:
            model: whisper-1
            language: en

We use the ${} property placeholder to load the API key value from an environment variable.
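Because the placeholder has no default value, a missing environment variable typically surfaces as a placeholder resolution error when the application starts. As a minimal sketch of the same idea outside Spring, the following standalone check reads the variable and fails fast with a clear message. The class and method names (ApiKeyCheck, requireApiKey) are hypothetical and not part of Spring AI:

```java
import java.util.Map;

// Hypothetical pre-flight check; not part of Spring AI or the article's codebase.
class ApiKeyCheck {

    // Returns the API key from the given environment map,
    // or throws if the variable is absent or blank.
    static String requireApiKey(Map<String, String> env) {
        String key = env.get("OPENAI_API_KEY");
        if (key == null || key.isBlank()) {
            throw new IllegalStateException("OPENAI_API_KEY environment variable is not set");
        }
        return key;
    }

    public static void main(String[] args) {
        try {
            requireApiKey(System.getenv());
            System.out.println("OPENAI_API_KEY is set");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Passing the environment in as a map keeps the check easy to exercise without touching the real process environment.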

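Here, whisper-1 selects OpenAI's Whisper model. Note that OpenAI also offers more advanced speech-to-text models such as gpt-4o-transcribe and gpt-4o-mini-transcribe, but the Spring AI version used here (1.0.0-M7) doesn't support them yet.

Additionally, we set the language of the audio files to en. We can specify a different input language in ISO-639-1 format as needed. If no language is specified, the model attempts to detect the language of the audio automatically.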
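If the language code comes from user input or external configuration, it can be worth validating it before use. The following is a minimal sketch that checks a value against the two-letter ISO 639 codes known to the JDK via Locale.getISOLanguages(). The class and method names are hypothetical, and the JDK's list may differ slightly from the set of languages the model actually supports:

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper; not part of Spring AI or the article's codebase.
class LanguageCodeCheck {

    // All two-letter ISO 639 language codes known to the JDK (lowercase).
    private static final Set<String> ISO_639_1 =
        Arrays.stream(Locale.getISOLanguages()).collect(Collectors.toSet());

    // True only for an exact, lowercase two-letter code such as "en" or "de".
    static boolean isIso6391(String code) {
        return code != null && ISO_639_1.contains(code);
    }

    public static void main(String[] args) {
        System.out.println(isIso6391("en"));
        System.out.println(isIso6391("de"));
        System.out.println(isIso6391("english"));
    }
}
```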

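Once these properties are configured, Spring AI automatically creates a bean of type OpenAiAudioTranscriptionModel, which we can use to interact with the specified model.

3. Building the Audio Transcriber

With the configuration in place, let's create an AudioTranscriber service class and inject the OpenAiAudioTranscriptionModel bean that Spring AI creates for us.

First, let's define two simple records to represent the request and response payloads: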

record TranscriptionRequest(MultipartFile audioFile, @Nullable String context) {}

record TranscriptionResponse(String transcription) {}

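The TranscriptionRequest holds the audioFile to be transcribed and an optional context that helps guide the transcription. Note that OpenAI currently supports audio files in the mp3, mp4, mpeg, mpga, m4a, wav, and webm formats.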
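To reject unsupported uploads before calling the API, we could check the file extension against the formats listed above. This is a minimal sketch with hypothetical class and method names. It inspects only the file name, for example the value returned by MultipartFile.getOriginalFilename(), and does not verify the actual file content:

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of Spring AI or the article's codebase.
class AudioFormatCheck {

    // Extensions taken from the list of formats mentioned in the article.
    private static final Set<String> SUPPORTED =
        Set.of("mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm");

    // Returns true if the file name ends with a supported extension (case-insensitive).
    static boolean isSupported(String filename) {
        if (filename == null) {
            return false;
        }
        int dot = filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return false; // no extension, or name ends with a dot
        }
        String extension = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(extension);
    }

    public static void main(String[] args) {
        System.out.println(isSupported("support-call.mp3"));
        System.out.println(isSupported("notes.txt"));
    }
}
```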

The TranscriptionResponse simply wraps the generated transcription.

Now, let's implement the core functionality:

TranscriptionResponse transcribe(TranscriptionRequest transcriptionRequest) {
    // Wrap the uploaded audio and the optional context prompt in a transcription prompt
    AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(
      transcriptionRequest.audioFile().getResource(),
      OpenAiAudioTranscriptionOptions
        .builder()
        .prompt(transcriptionRequest.context())
        .build()
    );
    // Send the prompt to the configured speech-to-text model
    AudioTranscriptionResponse response = openAiAudioTranscriptionModel.call(prompt);
    // Extract the transcribed text from the response
    return new TranscriptionResponse(response.getResult().getOutput());
}

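Here, we add a transcribe() method to our AudioTranscriber class. It:

Creates an AudioTranscriptionPrompt from the audio file resource and the optional context prompt
Calls the call() method of the injected OpenAiAudioTranscriptionModel bean
Extracts the transcribed text from the response and returns it wrapped in a TranscriptionResponse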

⚠️ Pitfall: the speech-to-text models currently limit audio files to 25MB, but Spring Boot limits uploaded files to 1MB by default. We need to raise this limit in application.yaml:

spring:
  servlet:
    multipart:
      max-file-size: 25MB
      max-request-size: 25MB

We set both the maximum file size and the maximum request size to 25MB, which should be sufficient for most audio transcription requests.
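We could also guard against oversized or empty files in our own code before calling the model, for example by passing the value of MultipartFile.getSize() to a check like the one below. This is a minimal sketch with hypothetical names. It assumes 25MB means 25 * 1024 * 1024 bytes, which matches how Spring parses the MB suffix, though the exact byte limit enforced by OpenAI may differ:

```java
// Hypothetical helper; not part of Spring AI or the article's codebase.
class AudioSizeCheck {

    // Assumption: 25MB interpreted as 25 * 1024 * 1024 bytes.
    static final long MAX_BYTES = 25L * 1024 * 1024;

    // Returns true for a non-empty file that does not exceed the limit.
    static boolean fitsLimit(long sizeInBytes) {
        return sizeInBytes > 0 && sizeInBytes <= MAX_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(fitsLimit(5L * 1024 * 1024));
        System.out.println(fitsLimit(30L * 1024 * 1024));
    }
}
```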

4. Testing the Audio Transcriber

Now that we've implemented the service layer, let's expose a REST API on top of it:

@PostMapping("/transcribe")
ResponseEntity<TranscriptionResponse> transcribe(
  @RequestParam("audioFile") MultipartFile audioFile,
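  // context is optional, matching the @Nullable component of TranscriptionRequest
  @RequestParam(value = "context", required = false) String context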
) {
    TranscriptionRequest transcriptionRequest = new TranscriptionRequest(audioFile, context);
    TranscriptionResponse response = audioTranscriber.transcribe(transcriptionRequest);
    return ResponseEntity.ok(response);
}

Let's test the endpoint using the HTTPie command-line tool:

http -f POST :8080/transcribe audioFile@<path-to-audio-file> context="Short description about Baeldung"

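We call the /transcribe endpoint and submit the audio file along with its context. The demo audio file, a short description of Baeldung, is located in the src/test/resources/audio folder of the codebase.

Let's look at the response: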

{
    "transcription": "Baeldung is a top-notch educational platform that specializes in Java, Spring, and related technologies. It offers a wealth of tutorials, articles, and courses that help developers master programming concepts. Known for its clear examples and practical guides, Baeldung is a go-to resource for developers looking to level up their skills."
}

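The API correctly returns the transcription of the audio file.

**Note how providing the context prompt helps the model transcribe the name "Baeldung" correctly. Without this context, the Whisper model transcribes it as "Baildung"**.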

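5. Conclusion

In this article, we explored how to transcribe audio files with OpenAI in Spring AI.

We walked through the necessary configuration, implemented an audio transcriber using OpenAI's Whisper speech-to-text model, and verified it with a test request. In particular, we saw how a context prompt improves transcription accuracy, especially for domain-specific terms.

All the code examples from this article are available on GitHub.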
