1. Deleting Individual Documents

Elasticsearch offers several ways to delete documents. The two most common are the Delete API and the Delete By Query API.

1.1 Using the Delete API

When you know a document's index and _id, the Delete API is the most direct option:

$ curl -X DELETE "localhost:9200/customers/_doc/1"
{"_index":"customers","_id":"1","_version":3,"result":"deleted","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":20,"_primary_term":1}

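The command above deletes the document with ID 1 from the customers index. If the document exists, Elasticsearch returns a JSON response whose result field is "deleted". If it does not exist, the response reports "not_found" with an HTTP 404 status.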

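✅ Pros: simple, and well suited to deleting a small number of documents whose IDs are known.
❌ Cons: not suited to bulk or condition-based deletion.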
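
As a rough illustration of the same call from Python, the sketch below uses only the standard library to build the DELETE request and to interpret the response body. The helper names build_delete_request and was_deleted are hypothetical, and the host is assumed to be the local node from the curl example.

```python
import json
import urllib.request
from urllib.parse import quote


def build_delete_request(index, doc_id, host="http://localhost:9200"):
    """Build (but do not send) a DELETE request for one document."""
    # Percent-encode the ID so characters such as "/" stay inside one path segment
    url = f"{host}/{index}/_doc/{quote(str(doc_id), safe='')}"
    return urllib.request.Request(url, method="DELETE")


def was_deleted(response_body):
    """Return True when the response reports result == "deleted"."""
    return json.loads(response_body).get("result") == "deleted"


request = build_delete_request("customers", "1")
print(request.get_method(), request.full_url)

# Response body copied from the curl example above
sample = '{"_index":"customers","_id":"1","_version":3,"result":"deleted","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":20,"_primary_term":1}'
print(was_deleted(sample))
```

To actually send the request, pass it to urllib.request.urlopen. Note that urlopen raises HTTPError for the 404 returned when the document does not exist, so a real script would need to catch that case.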

1.2 Using the Delete By Query API

When you need to delete multiple documents that match a condition, the Delete By Query API is the better fit:

$ curl -X POST "localhost:9200/customers/_delete_by_query" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "last_purchase_date": {
        "lt": "now-1y"
      }
    }
  }
}'
{"took":258,"timed_out":false,"total":4,"deleted":4,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1.0,"throttled_until_millis":0,"failures":[]}

This command deletes every document in the customers index whose last_purchase_date is more than one year old.

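⚠️ Caveats:

  • The operation is not atomic, so a failure partway through can leave only some of the matching documents deleted
  • On large datasets it can be resource-intensive, so run it during off-peak hours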
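
Because the operation is not atomic, it can help to inspect the response before assuming every matching document is gone. The sketch below is one possible check. The function name deletion_is_complete is hypothetical, and the fields it reads (timed_out, failures, version_conflicts, deleted, total) are the ones visible in the response above.

```python
import json


def deletion_is_complete(response):
    """Return True only if every matched document was deleted cleanly."""
    return (
        not response.get("timed_out", False)
        and not response.get("failures")
        and response.get("version_conflicts", 0) == 0
        and response.get("deleted") == response.get("total")
    )


# Response copied from the example above
ok = json.loads('{"took":258,"timed_out":false,"total":4,"deleted":4,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1.0,"throttled_until_millis":0,"failures":[]}')
print(deletion_is_complete(ok))

# Made-up partial result for illustration: one document hit a version conflict
partial = dict(ok, deleted=3, version_conflicts=1)
print(deletion_is_complete(partial))
```

When the check fails, rerunning the same query is a reasonable first step, since documents that were already deleted no longer match it.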

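✅ Tip: to reduce cluster load, cap the number of documents a single request deletes with max_docs (called size in older releases), and throttle the request with the requests_per_second parameter.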
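
One way to apply such a cap from Python is sketched below. It assumes the max_docs body field, which recent Elasticsearch releases use for this purpose, together with the requests_per_second URL parameter for throttling. The helper name build_delete_by_query and the numeric values are illustrative only.

```python
import json
from urllib.parse import urlencode


def build_delete_by_query(index, query, max_docs=None, requests_per_second=None):
    """Return the (path, JSON body) pair for a Delete By Query call."""
    body = {"query": query}
    if max_docs is not None:
        # Stop after this many documents have been deleted
        body["max_docs"] = max_docs
    params = {}
    if requests_per_second is not None:
        # Throttle the request so it does not saturate the cluster
        params["requests_per_second"] = requests_per_second
    path = f"/{index}/_delete_by_query"
    if params:
        path += "?" + urlencode(params)
    return path, json.dumps(body)


# Same range query as the curl example; 1000 and 500 are arbitrary values
old_purchases = {"range": {"last_purchase_date": {"lt": "now-1y"}}}
path, body = build_delete_by_query(
    "customers", old_purchases, max_docs=1000, requests_per_second=500
)
print(path)
print(body)
```

Repeating a capped request until the response reports zero deleted documents spreads the work over several smaller passes.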

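2. Bulk Deletion

When you need to delete a large number of documents by ID, the Bulk API is considerably more efficient than issuing one Delete request per document.

2.1 Bulk Deletion with the Python Client

The following example uses Python with the elasticsearch-py client: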

from elasticsearch import Elasticsearch, helpers

# Connect to the local node
es = Elasticsearch(["http://localhost:9200"])

def generate_actions(inactive_customer_ids):
    # Yield one delete action per customer ID
    for customer_id in inactive_customer_ids:
        yield {
            "_op_type": "delete",
            "_index": "customers",
            "_id": customer_id
        }

inactive_customer_ids = ["3", "5", "8"]

# helpers.bulk returns a tuple: (number of successful actions, errors)
deleted, errors = helpers.bulk(es, generate_actions(inactive_customer_ids))
print(f"Deleted {deleted} documents")

Output:

$ python3 bulk-removal.py
Deleted 3 documents
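
Under the hood, the helper turns those actions into the newline-delimited JSON body that the Bulk API expects. The sketch below reproduces that body with the standard library only, purely as an illustration. The function name to_bulk_body is hypothetical, and the real client handles this serialization for you.

```python
import json


def to_bulk_body(index, doc_ids):
    """Serialize delete actions as newline-delimited JSON for POST /_bulk."""
    lines = [
        json.dumps({"delete": {"_index": index, "_id": doc_id}})
        for doc_id in doc_ids
    ]
    # Every line, including the last one, must end with a newline
    return "".join(line + "\n" for line in lines)


body = to_bulk_body("customers", ["3", "5", "8"])
print(body, end="")
```

A delete action occupies a single line because, unlike index or update actions, it carries no document source. When sending such a body by hand, the Content-Type header should be application/x-ndjson.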

✅ Advantages:

  • Fewer network round trips
  • Higher deletion throughput, which suits large-scale operations
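
To make the first point concrete, the sketch below counts how many HTTP requests a batch of deletes needs once the IDs are grouped into chunks. The chunk size of 500 and the 1200 made-up IDs are assumptions chosen for illustration, and chunked is a hypothetical helper. Check the chunk_size setting of your client version for the value it really uses.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


ids = [str(n) for n in range(1200)]
chunks = list(chunked(ids, 500))
print(len(ids), "single deletes vs", len(chunks), "bulk requests")
```

Larger chunks mean fewer round trips but bigger request bodies, so the right size depends on document count and cluster capacity.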

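3. Index-Level Deletion

When you need to remove an entire index or a large share of its data, index-level operations are more efficient.

3.1 Deleting an Entire Index

Deleting the whole index is the fastest approach: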

$ curl -X DELETE "localhost:9200/customers"
{"acknowledged":true}

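⚠️ Warning: this operation is irreversible. Once the index is deleted, its data cannot be recovered unless you have a snapshot or another backup.

✅ Use cases: removing log indices, test data, and similar disposable datasets.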
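
Given how destructive this call is, a script that issues it may want a guard in front of it. The sketch below is one hypothetical safeguard, not an Elasticsearch feature: it refuses wildcard, _all, and comma-separated targets, and it requires the caller to repeat the index name as confirmation. The function name and the confirmation convention are assumptions.

```python
import urllib.request


def build_index_delete_request(index, confirm, host="http://localhost:9200"):
    """Build (but do not send) a DELETE request for exactly one named index."""
    if index != confirm:
        raise ValueError("confirmation does not match the index name")
    if not index or index == "_all" or any(ch in index for ch in "*,"):
        raise ValueError(f"refusing to delete multi-index target: {index!r}")
    return urllib.request.Request(f"{host}/{index}", method="DELETE")


request = build_index_delete_request("customers", confirm="customers")
print(request.get_method(), request.full_url)
```

Building the request separately from sending it also makes it easy to log or review the target before anything is removed.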

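3.2 Zero-Downtime Reindexing with Aliases

If you need to remove part of the data without affecting service availability, you can combine an alias with a reindex. Clients must query the alias rather than the index name for the switch to be transparent:

  1. Create the alias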
$ curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "add": { "index": "customers", "alias": "current_customers" }}
  ]
}'
{"acknowledged":true,"errors":false}
  2. Create the new index
$ curl -X PUT "localhost:9200/customers_v2" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "email": { "type": "keyword" },
      "name": { "type": "text" }
    }
  }
}'
{"acknowledged":true,"shards_acknowledged":true,"index":"customers_v2"}
  3. Reindex, excluding inactive customers
$ curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "customers",
    "query": {
      "bool": {
        "must_not": {
          "term": { "status": "inactive" }
        }
      }
    }
  },
  "dest": {
    "index": "customers_v2"
  }
}'
{"took":251,"timed_out":false,"total":7,"updated":0,"created":7,"deleted":0,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1.0,"throttled_until_millis":0,"failures":[]}
  4. Switch the alias to the new index (both actions in the request are applied atomically)
$ curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "remove": { "index": "customers", "alias": "current_customers" }},
    { "add":    { "index": "customers_v2", "alias": "current_customers" }}
  ]
}'
{"acknowledged":true,"errors":false}
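
Before switching the alias, it is worth confirming that the reindex wrote every document it matched. The sketch below shows one way to do that and then build the alias swap body. The names reindex_succeeded and build_alias_swap are hypothetical helpers, and the sample response is the _reindex response shown above.

```python
import json


def reindex_succeeded(response):
    """Return True if every source document matched by the query was written."""
    written = response.get("created", 0) + response.get("updated", 0)
    return (
        not response.get("timed_out", False)
        and not response.get("failures")
        and written == response.get("total")
    )


def build_alias_swap(alias, old_index, new_index):
    """Build the _aliases body that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Response copied from the _reindex call above
reindex_response = json.loads('{"took":251,"timed_out":false,"total":7,"updated":0,"created":7,"deleted":0,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1.0,"throttled_until_millis":0,"failures":[]}')

if reindex_succeeded(reindex_response):
    swap = build_alias_swap("current_customers", "customers", "customers_v2")
    print(json.dumps(swap, indent=2))
```

Once the alias points at customers_v2 and the old data is no longer needed, the original customers index can be removed with the index delete call from section 3.1 to reclaim its storage.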

✅ Advantages:

  • Zero downtime
  • Lets you clean, transform, or remap the data along the way

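4. Summary

In this article, we covered several ways to remove data from Elasticsearch:

  • Delete API: for deleting a single document whose ID is known
  • Delete By Query API: for deleting documents by condition, keeping in mind that it is not atomic and can be resource-intensive
  • Bulk API: for efficiently deleting large numbers of documents
  • Index-level operations: deleting an entire index, or rebuilding an index behind an alias with zero downtime

Used appropriately, these methods keep the data in Elasticsearch manageable and make efficient use of cluster performance and storage.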


Original title: Removing Data From ElasticSearch