业务需求中我们经常会遇到需要分页展示搜索结果的情况。ES支持多种分页查询方式,包括from+size、scroll、search after等三种方式。
再开始详细介绍前,我们先初始化一个索引表:
PUT books
{
"mappings": {
"properties": {
"id": {
"type": "integer"
},
"name": {
"type": "text"
},
"author": {
"type": "text"
},
"price": {
"type": "float"
}
}
}
}
POST _bulk
{"index": {"_index": "books"}}
{"id": 1, "name": "Java Core Technology", "author": "Cay S. Horstmann", "price": 88.20}
{"index": {"_index": "books"}}
{"id": 2, "name": "The Beauty Of Math", "author": "Wujun", "price": 50.50}
{"index": {"_index": "books"}}
{"id": 3, "name": "Shakespeare: Romeo and Juliet", "author": "William Shakespeare", "price": 100.00}
{"index": {"_index": "books"}}
{"id": 4, "name": "Pride and Prejudice", "author": "Austen", "price": 32.30}
{"index": {"_index": "books"}}
{"id": 5, "name": "Animal Farm", "author": "Orwell", "price": 93.00}
{"index": {"_index": "books"}}
{"id": 6, "name": "Great Expectations", "author": "Dickens", "price": 93.00}
{"index": {"_index": "books"}}
{"id": 7, "name": "Lord of the Flies", "author": "Golding", "price": 83.00}
{"index": {"_index": "books"}}
{"id": 8, "name": "The Good Earth", "author": "Buck", "price": 93.00}
{"index": {"_index": "books"}}
{"id": 9, "name": "A Connecticut Yankee in King Arthur's Court", "author": "Twain", "price": 22.00}
{"index": {"_index": "books"}}
{"id": 10, "name": "Oliver Twist", "author": "Dickens", "price": 212.00}
{"index": {"_index": "books"}}
{"id": 11, "name": "Brave New World", "author": "Huxley", "price": 42.00}
{"index": {"_index": "books"}}
{"id": 12, "name": "the Canterbury Tales", "author": "Chaucer", "price": 24.00}
{"index": {"_index": "books"}}
{"id": 13, "name": "the Old Man and the Sea", "author": "Hemingway", "price": 66.00}
From+Size
ES的search API支持指定from、size参数实现翻页功能。这种分页方式与MySQL的from+size方式非常类似。例如:
select * from table_books order by price desc limit N size M
MySQL里为了实现该查询需要顺序先找出N条数据并丢弃,然后再取M条数据。当页数N越大时候,性能越差。
而在ES中实现from+size分页查询的样例如下:
GET /books/_search
{
"query": {"match_all": {}},
"sort": [
{
"price": {
"order": "desc"
}
}
],
"from": 0,
"size": 10
}
同样的,ES的这种分页方式通常会将请求广播到索引表的每一个分片,待每个分片都检索出(N+M条)数据后再汇总数据再进行翻页检索。因此,这种深翻页也存在非常大的性能问题。默认情况下,ES通过index.max_result_window参数限制最多只能翻页10000个文档。
Search After
Search After方式支持使用前一页返回的排序值来检索下一页数据。与MySQL的下边语句类似:
select * from table_books where id > ${last_id} order by id desc limit 10
ES在每个查询返回结果中的每一个文档都会携带一个索引值,可用于下一页的查询:
GET /books/_search
{
"query": {"match_all": {}},
"sort": [
{"price": "desc"},
{"id": "desc"}
],
"size": 2
}
// response ->
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 13,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "books",
"_id": "ce2kI5ABzT8j3ZAE2smo",
"_score": null,
"_source": {
"id": 10,
"name": "Oliver Twist",
"author": "Dickens",
"price": 212
},
"sort": [
212,
10
]
},
{
"_index": "books",
"_id": "au2kI5ABzT8j3ZAE2smo",
"_score": null,
"_source": {
"id": 3,
"name": "Shakespeare: Romeo and Juliet",
"author": "William Shakespeare",
"price": 100
},
"sort": [
100,
3
]
}
]
}
}
那么查询下一页则可以使用上一页的索引值[100, “3”],如下:
GET /books/_search
{
"query": {
"match_all": {}
},
"sort": [
{"price": "desc"},
{"id": "asc"}
],
"search_after": [
100,
"3"
],
"size": 2
}
// response ->
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 13,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "books",
"_id": "bO2kI5ABzT8j3ZAE2smo",
"_score": null,
"_source": {
"id": 5,
"name": "Animal Farm",
"author": "Orwell",
"price": 93
},
"sort": [
93,
5
]
},
{
"_index": "books",
"_id": "be2kI5ABzT8j3ZAE2smo",
"_score": null,
"_source": {
"id": 6,
"name": "Great Expectations",
"author": "Dickens",
"price": 93
},
"sort": [
93,
6
]
}
]
}
}
上边的排序请求中为了找到价格高到低的书籍,使用了两个字段[{“price”: “desc”}, {“id”: “desc”}]进行排序。这里为何还要使用id字段?原因是为确保相同价格的书籍在分页排序时候不会重复、丢失或者顺序不可控,这里需要额外增加一个唯一键以保障顺序。
这种方式不存在深翻页问题,但无法支持随机跳页。另外当数据被变更并刷新的情况下,页面的返回结果并不是稳定的,会出现重复数据的情况。为了避免这种情况,可以使用ES 7.10版本后提供的point in time (PIT)来构建索引数据的轻量级视图,在该视图下ES会保障数据的稳定性,使用如下:
-- 启动PIT视图
POST /books/_pit?keep_alive=5m
-- reponse:
{
"id": "gbuKBAEFYm9va3MWNzZ6WmNZMHBUWTZtQWJqWDVCVDFqZwAWdlpSWmRScFBUdHlqVXF2YTZnc0x6ZwAAAAAAAAAsDBZlcjFaMjUxSVQ4R1BrM2Q2a2R2UWVRAAEWNzZ6WmNZMHBUWTZtQWJqWDVCVDFqZwAA"
}
-- 查询第一页
GET _search
{
"query": {"match_all": {}},
"sort": [
{"price": "desc"},
{"_shard_doc": "desc"}
],
"size": 2,
"pit": {
"id": "gbuKBAEFYm9va3MWNzZ6WmNZMHBUWTZtQWJqWDVCVDFqZwAWdlpSWmRScFBUdHlqVXF2YTZnc0x6ZwAAAAAAAAAsDBZlcjFaMjUxSVQ4R1BrM2Q2a2R2UWVRAAEWNzZ6WmNZMHBUWTZtQWJqWDVCVDFqZwAA",
"keep_alive": "5m"
}
}
-- response:
{
"pit_id": "gbuKBAEFYm9va3MWNzZ6WmNZMHBUWTZtQWJqWDVCVDFqZwAWdlpSWmRScFBUdHlqVXF2YTZnc0x6ZwAAAAAAAAAsDBZlcjFaMjUxSVQ4R1BrM2Q2a2R2UWVRAAEWNzZ6WmNZMHBUWTZtQWJqWDVCVDFqZwAA",
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 13,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "books",
"_id": "Bl7ZKJAB2vthpLOOy9ge",
"_score": null,
"_source": {
"id": 10,
"name": "Oliver Twist",
"author": "Dickens",
"price": 212
},
"sort": [
212,
9
]
},
{
"_index": "books",
"_id": "_17ZKJAB2vthpLOOy9ce",
"_score": null,
"_source": {
"id": 3,
"name": "Shakespeare: Romeo and Juliet",
"author": "William Shakespeare",
"price": 100
},
"sort": [
100,
2
]
}
]
}
}
--翻下一页
GET _search
{
"query": {"match_all": {}},
"sort": [
{"price": "desc"},
{"_shard_doc": "desc"}
],
"size": 2,
"pit": {
"id": "gbuKBAEFYm9va3MWNzZ6WmNZMHBUWTZtQWJqWDVCVDFqZwAWdlpSWmRScFBUdHlqVXF2YTZnc0x6ZwAAAAAAAAAsDBZlcjFaMjUxSVQ4R1BrM2Q2a2R2UWVRAAEWNzZ6WmNZMHBUWTZtQWJqWDVCVDFqZwAA",
"keep_alive": "5m"
},
"search_after": [
100,
2
]
}
从上边的例子可以看到,当使用了PIT后,返回结果里的排序值除了包含price,还会多出一个_shard_doc值。例如[100, 2]。这个值是由分片索引和lucene内部文档Id组合而来。ES会确保在该PIT中每一个文档的_shard_doc都是唯一的。另外,你也可以显式指定 {“_shard_doc”: “desc”}来控制返回文档的顺序。
Scroll
除了上述两种方法以外,ES还支持Scroll命令分页。Scroll命令可用于非实时地分页检索大量数据。该命令是通过开始执行时构建一个快照以屏蔽掉所有数据变更,从而实现稳定的翻页。但这也决定了使用Scroll命令获取的翻页数据并不是实时的。另外快照需要保持所有旧数据不被删除,相比之下PIT轻量级视图的性能开销要更加小。因此当然在最新的ES版本中,已经建议使用Search After代替Scroll命令。
下边简单了解下Scroll命令:
-- 首页查询
POST /books/_search?scroll=5m
{
"query": {"match_all": {}},
"sort": [{"price":"desc"}],
"size": 2
}
-- response
{
"_scroll_id": "FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFldkampjR3VmUzdDXzRTcXBtckQ3bEEAAAAAAAAKSRZ2WlJaZFJwUFR0eWpVcXZhNmdzTHpn",
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 13,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "books",
"_id": "Bl7ZKJAB2vthpLOOy9ge",
"_score": null,
"_source": {
"id": 10,
"name": "Oliver Twist",
"author": "Dickens",
"price": 212
},
"sort": [
212
]
},
{
"_index": "books",
"_id": "_17ZKJAB2vthpLOOy9ce",
"_score": null,
"_source": {
"id": 3,
"name": "Shakespeare: Romeo and Juliet",
"author": "William Shakespeare",
"price": 100
},
"sort": [
100
]
}
]
}
}
-- 下一页
POST _search/scroll
{
"scroll": "5m",
"scroll_id": "FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFldkampjR3VmUzdDXzRTcXBtckQ3bEEAAAAAAAAJExZ2WlJaZFJwUFR0eWpVcXZhNmdzTHpn"
}
-- 再下一页
POST _search/scroll
{
"scroll": "5m",
"scroll_id": "FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFldkampjR3VmUzdDXzRTcXBtckQ3bEEAAAAAAAAJExZ2WlJaZFJwUFR0eWpVcXZhNmdzTHpn"
}
-- ...
总结
分页方式 | 优点 | 缺点 | 适用场景 |
---|---|---|---|
from+size | 简单直观容易理解; 小数据集下表现良好; 支持跳页查询; |
大数据集下性能较差; 有深度分页限制(默认10000) |
需要支持跳页但没有深翻页的情景 |
scroll | 使用快照保障数据的稳定; 没有深翻页性能问题; 适用于大量数据处理情景; |
不支持实时查询; 快照占用较多性能,用完需要及时删除; |
适用于导出大量数据、离线批处理任务等场景 |
search after | 没有深翻页性能问题; |
数据不稳定; 需要正确选择排序ID避免重复数据问题; |
适用于实时性要求高且存在深翻页的场景,如无限滚动列表 |
search after (PIT) | 使用PIT轻量级视图保障数据稳定性; 没有深翻页性能问题; |
数据稳定;需要正确选择排序ID避免重复数据问题; | 适用于不要求实时性并存在深翻页的场景,如无限滚动列表 |
转载请注明来源,欢迎对文章中的引用来源进行考证,欢迎指出任何有错误或不够清晰的表达。可以在下面评论区评论,也可以邮件至 duval1024@gmail.com