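Extracting Text From Images Using Amazon Textract in Spring Boot

1. Overview

Businesses often need to extract meaningful data from many kinds of images: invoices or receipts for expense tracking, identity documents for KYC (Know Your Customer) processes, or forms whose data entry should be automated. Extracting text from images by hand, however, is slow and expensive.

Amazon Textract offers an automated alternative that uses machine learning to extract printed text and handwriting from documents.

In this tutorial, we'll explore how to use Amazon Textract in a Spring Boot application to extract text from images. We'll walk through the necessary configuration and implement text extraction both from local image files and from images stored in Amazon S3.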

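2. Setting up the Project

Before we can start extracting text from images, we need to include the SDK dependency and configure our application correctly.

2.1. Dependencies

Let's start by adding the Amazon Textract dependency to our project's pom.xml file: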

<dependency>
    <groupId>software.amazon.awssdk</groupId>
    <artifactId>textract</artifactId>
    <version>2.27.5</version>
</dependency>

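This dependency provides the TextractClient and other related classes that we'll use to interact with the Textract service.

2.2. Defining AWS Configuration Properties

To extract text with the Textract service, **we need to configure our AWS credentials for authentication and the AWS region in which we want to use the service**.

We'll store these properties in our project's application.yaml file and use @ConfigurationProperties to map the values to a POJO, which our service layer references when interacting with Textract: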

@Validated
@ConfigurationProperties(prefix = "com.baeldung.aws")
class AwsConfigurationProperties {
    @NotBlank
    private String region;

    @NotBlank
    private String accessKey;

    @NotBlank
    private String secretKey;

    // standard setters and getters
}

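We've also added validation annotations to ensure that all required properties are configured. If any validation fails, the Spring ApplicationContext fails to start, so the application follows the fail-fast principle.

Below is a snippet of our application.yaml file, which defines the required properties that are automatically mapped to our AwsConfigurationProperties class: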

com:
  baeldung:
    aws:
      region: ${AWS_REGION}
      access-key: ${AWS_ACCESS_KEY}
      secret-key: ${AWS_SECRET_KEY}

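We use the *${}* property placeholder to load the values of our properties from environment variables.

This setup lets us externalize our AWS properties and access them easily in our application.

2.3. Declaring the TextractClient Bean

Now that we've configured our properties, let's reference them to define our TextractClient bean: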

@Bean
public TextractClient textractClient() {
    String region = awsConfigurationProperties.getRegion();
    String accessKey = awsConfigurationProperties.getAccessKey();
    String secretKey = awsConfigurationProperties.getSecretKey();
    AwsBasicCredentials awsCredentials = AwsBasicCredentials.create(accessKey, secretKey);

    return TextractClient.builder()
      .region(Region.of(region))
      .credentialsProvider(StaticCredentialsProvider.create(awsCredentials))
      .build();
}

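The TextractClient class is the main entry point for interacting with the Textract service. We'll autowire it in our service layer and send requests to extract text from image files.

3. Extracting Text From Images

Now that we've defined our TextractClient bean, let's create a TextExtractor class that uses it to implement the extraction: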

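public String extract(@ValidFileType MultipartFile image) throws IOException {
    // getBytes() reads the whole upload into memory and may throw IOException
    byte[] imageBytes = image.getBytes();
    // The consumer-style builders are built by the SDK, so no explicit build() call is needed
    DetectDocumentTextResponse response = textractClient.detectDocumentText(request -> request
      .document(document -> document
        .bytes(SdkBytes.fromByteArray(imageBytes))));

    return transformTextDetectionResponse(response);
}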

private String transformTextDetectionResponse(DetectDocumentTextResponse response) {
    return response.blocks()
      .stream()
      .filter(block -> block.blockType().equals(BlockType.LINE))
      .map(Block::text)
      .collect(Collectors.joining(" "));
}

In our extract() method, we convert the MultipartFile into a byte array and pass it as a Document to the *detectDocumentText()* method.

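Amazon Textract currently supports only the PNG, JPEG, TIFF, and PDF file formats. We create a custom validation annotation, @ValidFileType, to ensure that the uploaded file is in one of these supported formats.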
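The validator behind @ValidFileType isn't shown here. As a rough sketch, the check it performs might look like the following, assuming it compares the upload's declared content type against a fixed set. The SupportedFileTypes class and its isSupported() method are hypothetical names introduced for illustration; the MIME types are the standard ones for the four formats listed above.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: a ConstraintValidator for @ValidFileType could delegate to it.
final class SupportedFileTypes {

    // Standard MIME types for the PNG, JPEG, TIFF, and PDF formats.
    private static final Set<String> SUPPORTED = Set.of(
        "image/png", "image/jpeg", "image/tiff", "application/pdf");

    private SupportedFileTypes() {
    }

    // Returns true when the declared content type is one of the supported formats.
    // A null content type (no header sent) is treated as unsupported.
    static boolean isSupported(String contentType) {
        if (contentType == null) {
            return false;
        }
        return SUPPORTED.contains(contentType.trim().toLowerCase(Locale.ROOT));
    }
}
```

A validator could call SupportedFileTypes.isSupported(image.getContentType()). Since the content type is declared by the client, a stricter implementation might also inspect the file's leading bytes.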

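For our demonstration, the helper method transformTextDetectionResponse() converts the DetectDocumentTextResponse into a simple String by joining the text of each LINE block with a space. The transformation logic can be customized to fit business requirements.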
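As one example of such a customization, the sketch below joins LINE blocks with newlines so that the original line layout survives. To keep it runnable without the AWS SDK, it uses a hypothetical TextBlock stand-in that carries only a type and a text; in the real service the same stream pipeline would operate on the SDK's Block objects, as in transformTextDetectionResponse(). The LineJoiner name is also an assumption.

```java
import java.util.List;
import java.util.stream.Collectors;

// Stand-in for the SDK's Block type, reduced to the two fields used here.
final class TextBlock {
    final String type; // for example "PAGE", "LINE", or "WORD"
    final String text; // may be null for blocks that carry no text

    TextBlock(String type, String text) {
        this.type = type;
        this.text = text;
    }
}

// Hypothetical alternative to the space-joining transformation.
final class LineJoiner {

    private LineJoiner() {
    }

    // Keeps only LINE blocks and joins their text with newlines.
    static String joinLines(List<TextBlock> blocks) {
        return blocks.stream()
            .filter(block -> "LINE".equals(block.type))
            .map(block -> block.text)
            .collect(Collectors.joining("\n"));
    }
}
```

Filtering on LINE matters because a response also contains WORD blocks for the same content; including them would duplicate every word in the output.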

In addition to passing the image from our application, we can also extract text from images stored in an S3 bucket:

public String extract(String bucketName, String objectKey) {
    // Point Textract at the S3 object instead of sending the image bytes
    DetectDocumentTextResponse response = textractClient.detectDocumentText(request -> request
      .document(document -> document
        .s3Object(s3Object -> s3Object
          .bucket(bucketName)
          .name(objectKey))));

    return transformTextDetectionResponse(response);
}

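Our overloaded *extract()* method accepts the S3 bucket name and object key as parameters, which lets us specify the location of the image in S3.

It's important to note that we call the *detectDocumentText()* method of our TextractClient bean, which is a synchronous operation intended for single-page documents. **For multi-page documents, Amazon Textract offers asynchronous operations instead**.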
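With the asynchronous operations, the caller starts a job and then checks its status until it finishes. The sketch below shows only that polling loop, in plain Java and under stated assumptions: the JobPoller class, its parameters, and the "TIMED_OUT" label are invented for illustration, and the status supplier stands in for a call that asks Textract for the job's current status, which reports IN_PROGRESS while the job is still running. The SDK wiring itself is not shown.

```java
import java.util.function.Supplier;

// Hypothetical polling helper; not part of the AWS SDK.
final class JobPoller {

    private JobPoller() {
    }

    // Calls statusSupplier until it reports something other than IN_PROGRESS,
    // sleeping delayMillis between attempts. Returns the final status, or
    // "TIMED_OUT" (a label made up for this sketch) after maxAttempts checks.
    static String awaitCompletion(Supplier<String> statusSupplier, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String status = statusSupplier.get();
            if (!"IN_PROGRESS".equals(status)) {
                return status;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and stop waiting.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for the job", e);
                }
            }
        }
        return "TIMED_OUT";
    }
}
```

Polling with a bounded number of attempts keeps a stuck job from blocking a request thread indefinitely. For production workloads, a notification-based approach may be a better fit than polling.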

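4. IAM Permissions

Finally, for our application to work, we need to grant a few permissions to the IAM user whose credentials we've configured in the application: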

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowTextractDocumentDetection",
            "Effect": "Allow",
            "Action": "textract:DetectDocumentText",
            "Resource": "*"
        },
        {
            "Sid": "AllowS3ReadAccessToSourceBucket",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::bucket-name/*"
        }
    ]
}

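In our IAM policy, the AllowTextractDocumentDetection statement allows us to invoke the DetectDocumentText API to extract text from images.

If we extract text from images stored in S3, we also need to include the AllowS3ReadAccessToSourceBucket statement, which grants read access to the objects in our S3 bucket.

Our IAM policy conforms to the principle of least privilege, granting only the permissions our application needs to function.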

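5. Conclusion

In this article, we've explored how to use Amazon Textract in Spring Boot to extract text from images.

We discussed how to extract text from local image files as well as from images stored in Amazon S3.

Amazon Textract is a powerful service that's widely used in fintech and healthtech, where it helps automate tasks such as processing invoices or extracting patient data from medical forms.

As always, all the code examples used in this article are available on GitHub.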