Learning ES, Part 4: Pagination

  1. From+Size
  2. Search After
  3. Scroll
  4. Summary

Business requirements often call for paginated search results. ES supports three ways to paginate a query: from+size, search after, and scroll.

Before going into the details, let's first create an index and load some sample data:

PUT books
{
  "mappings": {
    "properties": {
      "id": {
        "type": "integer"
      },
      "name": {
        "type": "text"
      },
      "author": {
        "type": "text"
      },
      "price": {
        "type": "float"
      }
    }
  }
}

POST _bulk
{"index": {"_index": "books"}}
{"id": 1, "name": "Java Core Technology", "author": "Cay S. Horstmann", "price": 88.20}
{"index": {"_index": "books"}}
{"id": 2, "name": "The Beauty Of Math", "author": "Wujun", "price": 50.50}
{"index": {"_index": "books"}}
{"id": 3, "name": "Shakespeare: Romeo and Juliet", "author": "William Shakespeare", "price": 100.00}
{"index": {"_index": "books"}}
{"id": 4, "name": "Pride and Prejudice", "author": "Austen", "price": 32.30}
{"index": {"_index": "books"}}
{"id": 5, "name": "Animal Farm", "author": "Orwell", "price": 93.00}
{"index": {"_index": "books"}}
{"id": 6, "name": "Great Expectations", "author": "Dickens", "price": 93.00}
{"index": {"_index": "books"}}
{"id": 7, "name": "Lord of the Flies", "author": "Golding", "price": 83.00}
{"index": {"_index": "books"}}
{"id": 8, "name": "The Good Earth", "author": "Buck", "price": 93.00}
{"index": {"_index": "books"}}
{"id": 9, "name": "A Connecticut Yankee in King Arthur's Court", "author": "Twain", "price": 22.00}
{"index": {"_index": "books"}}
{"id": 10, "name": "Oliver Twist", "author": "Dickens", "price": 212.00}
{"index": {"_index": "books"}}
{"id": 11, "name": "Brave New World", "author": "Huxley", "price": 42.00}
{"index": {"_index": "books"}}
{"id": 12, "name": "the Canterbury Tales", "author": "Chaucer", "price": 24.00}
{"index": {"_index": "books"}}
{"id": 13, "name": "the Old Man and the Sea", "author": "Hemingway", "price": 66.00}

From+Size

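The ES search API implements paging through the from and size parameters. This is very similar to LIMIT/OFFSET paging in MySQL. For example:

select * from table_books order by price desc limit M offset N

To run this query, MySQL has to walk through the first N rows in sort order and discard them before it can return the next M rows. The larger the offset N, the worse the performance.

A from+size paging query in ES looks like this: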

GET /books/_search
{
 "query": {"match_all": {}},
 "sort": [
   {
     "price": {
       "order": "desc"
     }
   }
 ], 
 "from": 0,
 "size": 10
}

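Similarly, ES typically broadcasts this kind of paged request to every shard of the index. Each shard collects its own top N+M documents (N = from, M = size), and the coordinating node then merges and sorts the results from all shards, discards the first N, and returns the next M. Deep paging therefore carries a severe performance cost in ES as well. By default, the index.max_result_window setting limits from + size to at most 10,000 documents.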
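To make the cost of deep paging concrete, here is a minimal in-memory sketch of the scatter-gather step described above. It is not how ES is implemented internally; the function name `from_size_search` and the three-shard price data are made up for illustration. It only counts how many entries the coordinating node has to collect for a shallow page versus a deep page.

```python
import heapq

def from_size_search(shards, from_, size):
    """Simulate a from+size query sorted by price descending.

    `shards` is a list of lists of prices, one inner list per shard.
    Returns (page, fetched), where `fetched` counts the entries the
    coordinating node had to collect from all shards.
    """
    window = from_ + size
    # Every shard must return its own top (from + size) entries, because
    # any of them could belong to the requested page.
    per_shard = [heapq.nlargest(window, shard) for shard in shards]
    fetched = sum(len(top) for top in per_shard)
    # The coordinating node merges the candidates, sorts them, and
    # throws away the first `from_` entries.
    merged = sorted((p for top in per_shard for p in top), reverse=True)
    return merged[from_:from_ + size], fetched

# Hypothetical data: 3 shards holding 1000 distinct prices each.
shards = [list(range(i, 3000, 3)) for i in range(3)]

shallow_page, shallow_cost = from_size_search(shards, 0, 10)
deep_page, deep_cost = from_size_search(shards, 900, 10)
print(shallow_page[0], shallow_cost)  # 2999 30
print(deep_page[0], deep_cost)        # 2099 2730
```

Both calls return 10 results, but the deep page forces the coordinating node to collect 2730 entries instead of 30. The work grows with (from + size) multiplied by the number of shards.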

Search After

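Search After retrieves the next page using the sort values returned with the previous page. It resembles the following MySQL statement:

select * from table_books where id < ${last_id} order by id desc limit 10

Every hit in an ES search response carries its sort values (the sort array), which can be used to query the next page: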

GET /books/_search
{
  "query": {"match_all": {}},
  "sort": [
    {"price": "desc"},
    {"id": "asc"}
  ],
  "size": 2
}

// response -> 
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 13,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "books",
        "_id": "ce2kI5ABzT8j3ZAE2smo",
        "_score": null,
        "_source": {
          "id": 10,
          "name": "Oliver Twist",
          "author": "Dickens",
          "price": 212
        },
        "sort": [
          212,
          10
        ]
      },
      {
        "_index": "books",
        "_id": "au2kI5ABzT8j3ZAE2smo",
        "_score": null,
        "_source": {
          "id": 3,
          "name": "Shakespeare: Romeo and Juliet",
          "author": "William Shakespeare",
          "price": 100
        },
        "sort": [
          100,
          3
        ]
      }
    ]
  }
}

The next page can then be queried using the sort values of the last hit on the previous page, [100, 3], as follows:

GET /books/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {"price": "desc"},
    {"id": "asc"}
  ],
  "search_after": [
    100,
    3
  ],
  "size": 2
}

// response -> 
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 13,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "books",
        "_id": "bO2kI5ABzT8j3ZAE2smo",
        "_score": null,
        "_source": {
          "id": 5,
          "name": "Animal Farm",
          "author": "Orwell",
          "price": 93
        },
        "sort": [
          93,
          5
        ]
      },
      {
        "_index": "books",
        "_id": "be2kI5ABzT8j3ZAE2smo",
        "_score": null,
        "_source": {
          "id": 6,
          "name": "Great Expectations",
          "author": "Dickens",
          "price": 93
        },
        "sort": [
          93,
          6
        ]
      }
    ]
  }
}

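To list books from the highest price to the lowest, the sort above uses two fields: price (descending) followed by id. Why is the id field needed as well? Several books can share the same price, and without a tiebreaker they could be duplicated, skipped, or returned in an unpredictable order across pages. Adding a unique key as the last sort field guarantees a deterministic order.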
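The effect of the tiebreaker can be checked with a small in-memory sketch. It is a simplification, not the ES implementation: the (id, price) pairs are copied from the bulk request above, and the helper names `sort_key` and `paginate` are made up for illustration. Each page returns only the entries whose sort key is strictly after the last key of the previous page, which is the assumption this sketch makes about search_after.

```python
# (id, price) pairs copied from the bulk request above.
BOOKS = [(1, 88.2), (2, 50.5), (3, 100.0), (4, 32.3), (5, 93.0), (6, 93.0),
         (7, 83.0), (8, 93.0), (9, 22.0), (10, 212.0), (11, 42.0),
         (12, 24.0), (13, 66.0)]

def sort_key(book, use_tiebreaker):
    book_id, price = book
    # Price descending (negated); id ascending when the tiebreaker is on.
    return (-price, book_id) if use_tiebreaker else (-price,)

def paginate(use_tiebreaker, size=2):
    """Walk through all pages and return the ids in the order they were seen."""
    seen, after = [], None
    while True:
        # Keep only the entries that sort strictly after the previous page.
        candidates = [b for b in BOOKS
                      if after is None or sort_key(b, use_tiebreaker) > after]
        page = sorted(candidates, key=lambda b: sort_key(b, use_tiebreaker))[:size]
        if not page:
            return seen
        seen.extend(book_id for book_id, _ in page)
        after = sort_key(page[-1], use_tiebreaker)

print(len(paginate(use_tiebreaker=True)))   # 13
print(len(paginate(use_tiebreaker=False)))  # 12
```

With the tiebreaker, all 13 books are returned. Without it, the third book priced at 93 (id 8) falls on a page boundary and is never returned, because nothing distinguishes it from the last hit of the previous page.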

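This approach does not suffer from the deep paging problem, but it cannot jump to an arbitrary page. In addition, if the data is modified and the index is refreshed between requests, the pages are no longer stable and documents may be repeated. To avoid this, use point in time (PIT), available since ES 7.10, to create a lightweight view of the index data. ES keeps the data in that view stable. Usage is as follows: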

-- Open a PIT view
POST /books/_pit?keep_alive=5m

-- response:
{
  "id": "gbuKBAEFYm9va3MWNzZ6WmNZMHBUWTZtQWJqWDVCVDFqZwAWdlpSWmRScFBUdHlqVXF2YTZnc0x6ZwAAAAAAAAAsDBZlcjFaMjUxSVQ4R1BrM2Q2a2R2UWVRAAEWNzZ6WmNZMHBUWTZtQWJqWDVCVDFqZwAA"
}

-- Query the first page
GET _search
{
  "query": {"match_all": {}},
  "sort": [
    {"price": "desc"},
    {"_shard_doc": "desc"}
  ],
  "size": 2,
  "pit": {
    "id": "gbuKBAEFYm9va3MWNzZ6WmNZMHBUWTZtQWJqWDVCVDFqZwAWdlpSWmRScFBUdHlqVXF2YTZnc0x6ZwAAAAAAAAAsDBZlcjFaMjUxSVQ4R1BrM2Q2a2R2UWVRAAEWNzZ6WmNZMHBUWTZtQWJqWDVCVDFqZwAA",
    "keep_alive": "5m"
  }
}

-- response:
{
  "pit_id": "gbuKBAEFYm9va3MWNzZ6WmNZMHBUWTZtQWJqWDVCVDFqZwAWdlpSWmRScFBUdHlqVXF2YTZnc0x6ZwAAAAAAAAAsDBZlcjFaMjUxSVQ4R1BrM2Q2a2R2UWVRAAEWNzZ6WmNZMHBUWTZtQWJqWDVCVDFqZwAA",
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 13,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "books",
        "_id": "Bl7ZKJAB2vthpLOOy9ge",
        "_score": null,
        "_source": {
          "id": 10,
          "name": "Oliver Twist",
          "author": "Dickens",
          "price": 212
        },
        "sort": [
          212,
          9
        ]
      },
      {
        "_index": "books",
        "_id": "_17ZKJAB2vthpLOOy9ce",
        "_score": null,
        "_source": {
          "id": 3,
          "name": "Shakespeare: Romeo and Juliet",
          "author": "William Shakespeare",
          "price": 100
        },
        "sort": [
          100,
          2
        ]
      }
    ]
  }
}

-- Fetch the next page
GET _search
{
  "query": {"match_all": {}},
  "sort": [
    {"price": "desc"},
    {"_shard_doc": "desc"}
  ],
  "size": 2,
  "pit": {
    "id": "gbuKBAEFYm9va3MWNzZ6WmNZMHBUWTZtQWJqWDVCVDFqZwAWdlpSWmRScFBUdHlqVXF2YTZnc0x6ZwAAAAAAAAAsDBZlcjFaMjUxSVQ4R1BrM2Q2a2R2UWVRAAEWNzZ6WmNZMHBUWTZtQWJqWDVCVDFqZwAA",
    "keep_alive": "5m"
  },
  "search_after": [
    100,
    2
  ]
}

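As the example shows, when a PIT is used, the sort values in the response contain a _shard_doc value in addition to price, for example [100, 2]. This value is a combination of the shard index and Lucene's internal document ID, and ES guarantees that each document's _shard_doc is unique within the PIT. ES adds _shard_doc as an implicit tiebreaker to sorted PIT searches; you can also specify {"_shard_doc": "desc"} explicitly, as the requests above do, to control the order of the returned documents.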
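The PIT paging steps above can be wrapped in a loop. The following is a minimal sketch under stated assumptions: `send` is a hypothetical callable that submits a request body to `GET _search` and returns the parsed JSON response, and `fake_send` is an in-memory stand-in with illustrative sort values so that the sketch runs without a cluster. Only the request body fields shown earlier (query, sort, size, pit, search_after) are used.

```python
def pit_pages(send, pit_id, sort, size, keep_alive="5m"):
    """Yield pages of hits until the result set is exhausted.

    `send` is a hypothetical callable: it takes a request body (dict),
    submits it to `GET _search`, and returns the parsed JSON response.
    """
    search_after = None
    while True:
        body = {
            "query": {"match_all": {}},
            "sort": sort,
            "size": size,
            "pit": {"id": pit_id, "keep_alive": keep_alive},
        }
        if search_after is not None:
            body["search_after"] = search_after
        response = send(body)
        hits = response["hits"]["hits"]
        if not hits:
            return
        yield hits
        # Continue from the sort values of the last hit on this page.
        search_after = hits[-1]["sort"]
        # The PIT id may change between requests, so use the latest one.
        pit_id = response.get("pit_id", pit_id)

def fake_send(body):
    """In-memory stand-in for Elasticsearch, for demonstration only."""
    # Illustrative [price, _shard_doc] sort values, already in result order.
    rows = [[212, 9], [100, 2], [93, 7], [93, 5], [93, 4]]
    after = body.get("search_after")
    start = rows.index(after) + 1 if after else 0
    hits = [{"sort": row} for row in rows[start:start + body["size"]]]
    return {"pit_id": body["pit"]["id"], "hits": {"hits": hits}}

sort = [{"price": "desc"}, {"_shard_doc": "desc"}]
pages = list(pit_pages(fake_send, "example-pit-id", sort, size=2))
print([len(page) for page in pages])  # [2, 2, 1]
```

In a real client, `send` would be replaced by an HTTP call or a client library call of your choice. Once paging is finished, the PIT can be closed with DELETE /_pit rather than left to expire.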

Scroll

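Besides the two methods above, ES also supports paging with the Scroll API. Scroll is intended for retrieving large amounts of data page by page in a non-real-time fashion. When the first request runs, it builds a snapshot that hides all subsequent data changes, which is what makes the paging stable. It also means that the data retrieved through Scroll is not real-time. Moreover, the snapshot must keep all old data from being deleted, so by comparison the lightweight PIT view has a lower performance overhead. For this reason, recent ES versions recommend Search After (with a PIT) over Scroll.

Here is a brief look at the Scroll API: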

-- Query the first page
POST /books/_search?scroll=5m
{
  "query": {"match_all": {}},
  "sort": [{"price":"desc"}],
  "size": 2
}

-- response
{
  "_scroll_id": "FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFldkampjR3VmUzdDXzRTcXBtckQ3bEEAAAAAAAAKSRZ2WlJaZFJwUFR0eWpVcXZhNmdzTHpn",
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 13,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "books",
        "_id": "Bl7ZKJAB2vthpLOOy9ge",
        "_score": null,
        "_source": {
          "id": 10,
          "name": "Oliver Twist",
          "author": "Dickens",
          "price": 212
        },
        "sort": [
          212
        ]
      },
      {
        "_index": "books",
        "_id": "_17ZKJAB2vthpLOOy9ce",
        "_score": null,
        "_source": {
          "id": 3,
          "name": "Shakespeare: Romeo and Juliet",
          "author": "William Shakespeare",
          "price": 100
        },
        "sort": [
          100
        ]
      }
    ]
  }
}

-- Next page (pass the _scroll_id returned by the previous response)
POST _search/scroll
{
  "scroll": "5m",
  "scroll_id": "FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFldkampjR3VmUzdDXzRTcXBtckQ3bEEAAAAAAAAJExZ2WlJaZFJwUFR0eWpVcXZhNmdzTHpn"
}

-- The page after that (again using the most recent _scroll_id)
POST _search/scroll
{
  "scroll": "5m",
  "scroll_id": "FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFldkampjR3VmUzdDXzRTcXBtckQ3bEEAAAAAAAAJExZ2WlJaZFJwUFR0eWpVcXZhNmdzTHpn"
}

-- ...
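The scroll requests above follow a fixed pattern: one initial search, repeated calls to _search/scroll, and a final cleanup. Here is a minimal sketch of that loop. `send` is a hypothetical callable taking (method, path, body) and returning the parsed JSON response, and `fake_send` is an in-memory stand-in with made-up hits. The paths used are the ones shown above, plus DELETE /_search/scroll to clear the scroll context.

```python
def scroll_all(send, index, body, keep_alive="5m"):
    """Yield every hit of a scroll search, then clear the scroll context.

    `send` is a hypothetical callable taking (method, path, body) and
    returning the parsed JSON response from Elasticsearch.
    """
    response = send("POST", f"/{index}/_search?scroll={keep_alive}", body)
    try:
        while response["hits"]["hits"]:
            yield from response["hits"]["hits"]
            # Each response carries the scroll id to use for the next page.
            response = send("POST", "/_search/scroll",
                            {"scroll": keep_alive,
                             "scroll_id": response["_scroll_id"]})
    finally:
        # Release the search context instead of waiting for it to expire.
        send("DELETE", "/_search/scroll",
             {"scroll_id": response["_scroll_id"]})

calls = []

def fake_send(method, path, body):
    """In-memory stand-in for Elasticsearch, for demonstration only."""
    calls.append((method, path))
    if method == "DELETE":
        return {"succeeded": True}
    pages = [[{"_id": "a"}, {"_id": "b"}], [{"_id": "c"}], []]
    page_number = sum(1 for m, _ in calls if m == "POST") - 1
    return {"_scroll_id": "example-scroll-id",
            "hits": {"hits": pages[page_number]}}

query = {"query": {"match_all": {}}, "sort": [{"price": "desc"}], "size": 2}
hits = list(scroll_all(fake_send, "books", query))
print([hit["_id"] for hit in hits])  # ['a', 'b', 'c']
print(calls[-1])                     # ('DELETE', '/_search/scroll')
```

The try/finally block clears the scroll even if the caller stops iterating early and closes the generator, which keeps the snapshot from lingering until keep_alive runs out.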

Summary

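| Pagination method | Pros | Cons | Use cases |
| --- | --- | --- | --- |
| from+size | Simple, intuitive, and easy to understand; performs well on small datasets; supports jumping to an arbitrary page | Poor performance on large datasets; limited paging depth (10,000 by default) | Page jumping is required and there is no deep paging |
| scroll | A snapshot keeps the data stable; no deep paging performance problem; suitable for processing large amounts of data | Not real-time; the snapshot consumes considerable resources and should be cleared promptly after use | Exporting large amounts of data, offline batch jobs, and similar |
| search after | No deep paging performance problem | Data is not stable across pages; a suitable sort ID (tiebreaker) must be chosen to avoid duplicates | Real-time requirements with deep paging, such as infinite-scroll lists |
| search after (PIT) | The lightweight PIT view keeps the data stable; no deep paging performance problem | A suitable sort ID (tiebreaker) must be chosen to avoid duplicates | Real-time data is not required and deep paging is needed, such as infinite-scroll lists |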

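Please credit the source when reposting. You are welcome to verify the sources cited in this article and to point out anything that is wrong or unclear, either in the comment section or by email to duval1024@gmail.com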
