Q: Elasticsearch scroll upper limit - Python API

Stack Overflow user

Asked 2018-01-11 06:11:28

Answers: 1 | Views: 1.6K | Followers: 0 | Votes: 1

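Is there a way, using the Python API, to set an upper limit on the number of documents retrieved when scrolling in chunks of a specific size? Say I want to scroll through at most 100K documents, in chunks of 2K, when more than 10 million documents are available.

I have implemented a counter-like object, but I would like to know whether there is a more natural solution.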

es_query = {"query": {"function_score": {"functions": [{"random_score": {"seed": "1234"}}]}}}
es = Elasticsearch(ADDRESS, port=PORT)


result = es.search(
    index="INDEX", 
    doc_type="DOC_TYPE", 
    body=es_query,
    size=2000,
    scroll="1m")

data = []
for hit in result["hits"]["hits"]:
    for d in hit["_source"]["attributes"]["data_of_interest"]:
        data.append(d)
        do_something(*args)


scroll_id = result['_scroll_id']
scroll_size = result["hits"]["total"]

i = 0
while(scroll_size>0):
    if i % 10000 == 0:
        print("Scrolling ({})...".format(i))

    result = es.scroll(scroll_id=scroll_id, scroll="1m")
    scroll_id = result["_scroll_id"]
    scroll_size = len(result['hits']['hits'])

data = []
for hit in result["hits"]["hits"]:
    for d in hit["_source"]["attributes"]["data_of_interest"]:
        data.append(d)
        do_something(*args)

i += 1
if i == 100000:
    break

python

elasticsearch

1 Answer

Stack Overflow user

Answered 2018-01-11 17:24:32

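In my opinion, if you only want the first 100K, you should narrow down your query in the first place. That will speed up the process. For example, you can add a filter on a date field.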
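As a rough illustration of narrowing the query first, the sketch below wraps the existing `function_score` clause in a `bool` query with a `range` filter. The helper name `with_date_filter`, the field name `created_at`, and the dates are assumptions made up for this example; substitute whatever date field the index actually has.

```python
def with_date_filter(query_clause, field, gte, lt):
    """Wrap a query clause in a bool query restricted to a date range.

    `field`, `gte` and `lt` are placeholders chosen by the caller.
    """
    return {
        "query": {
            "bool": {
                # The original clause still decides scoring/order.
                "must": [query_clause],
                # The range filter only narrows the candidate set.
                "filter": [{"range": {field: {"gte": gte, "lt": lt}}}],
            }
        }
    }


random_order = {
    "function_score": {"functions": [{"random_score": {"seed": "1234"}}]}
}

# Hypothetical field name and dates, for illustration only.
es_query = with_date_filter(random_order, "created_at", "2018-01-01", "2018-01-08")
```

The resulting `es_query` can be passed as `body=es_query` to the same `es.search(...)` call used in the question.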

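As for the code, I don't know of any way other than using a counter. I would just correct the indentation and remove the if statement to improve readability.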

from elasticsearch import Elasticsearch

es_query = {"query": {"function_score": {"functions": [{"random_score": {"seed": "1234"}}]}}}
es = Elasticsearch(ADDRESS, port=PORT)


result = es.search(
    index="INDEX", 
    doc_type="DOC_TYPE", 
    body=es_query,
    size=2000,
    scroll="1m")

data = []
for hit in result["hits"]["hits"]:
    for d in hit["_source"]["attributes"]["data_of_interest"]:
        data.append(d)
        do_something(*args)

scroll_id = result['_scroll_id']
scroll_size = result["hits"]["total"]

i = 0
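# Use `and`, not `&`: `&` binds tighter than the comparisons, so the cap on i was never applied.
# Note that i counts scroll requests, not documents; each request returns up to 2000 hits.
while scroll_size > 0 and i < 100000: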

    print("Scrolling ({})...".format(i))

    result = es.scroll(scroll_id=scroll_id, scroll="1m")
    scroll_id = result["_scroll_id"]
    scroll_size = len(result['hits']['hits'])

    # data = [] why redefining the list ? 
    for hit in result["hits"]["hits"]:
        for d in hit["_source"]["attributes"]["data_of_interest"]:
            data.append(d)
            do_something(*args)
    i += 1
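If the cap is meant to apply to documents rather than to scroll requests, one possible variation is to move the counter into a generator. The sketch below is not part of the original answer: `scroll_limited` is a hypothetical helper name, and it assumes a client object exposing `search`, `scroll` and `clear_scroll` with the keyword arguments used in the code above.

```python
def scroll_limited(es, index, body, page_size=2000, max_docs=100000, scroll="1m"):
    """Yield at most `max_docs` hits, fetching `page_size` hits per request."""
    if max_docs <= 0:
        return
    result = es.search(index=index, body=body, size=page_size, scroll=scroll)
    scroll_id = result.get("_scroll_id")
    yielded = 0
    try:
        while True:
            hits = result["hits"]["hits"]
            if not hits:
                return  # the scroll is exhausted
            for hit in hits:
                yield hit
                yielded += 1
                if yielded >= max_docs:
                    return  # document cap reached
            result = es.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = result.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side scroll context once we are done.
        if scroll_id is not None:
            es.clear_scroll(scroll_id=scroll_id)
```

With this helper the consuming loop no longer needs its own counter, for example `for hit in scroll_limited(es, "INDEX", es_query): ...`. The client library also ships `elasticsearch.helpers.scan`, which returns a generator of hits and could probably be combined with `itertools.islice` to similar effect, although by default it does not preserve the query's sort order.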

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/48196940
