首页
学习
活动
专区
圈层
工具
发布
首页
学习
活动
专区
圈层
工具
MCP广场
社区首页 >问答首页 >用trigram检索Django中的文本

用trigram检索Django中的文本
EN

Stack Overflow用户
提问于 2020-12-01 10:04:43
回答 1查看 810关注 0票数 2

我想在我的应用程序中加快搜索结果,但是,无论我使用什么方法,我都会得到相同的结果。因为它是Django应用程序,所以我将提供ORM命令和生成的SQL代码(使用PostgreSQL)。

首先,我在数据库上启用了GIN索引和trigram操作:

其次,我创建了包含两个varchar列的表: first_name和last_name (加上一个id字段作为主键)。

代码语言:javascript
运行
复制
from django.db import models


class Author(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)

我还在数据库中填充了952条示例记录,这样就不会出现这样的情况: Postgres由于数据集太小而避免使用索引。

接下来,我对非索引数据运行以下查询。

简单的类似查询:

代码语言:javascript
运行
复制
In [50]: print(Author.objects.filter(last_name__icontains='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE UPPER("reviews_author"."last_name"::text) LIKE UPPER(%ari%)

In [51]: print(Author.objects.filter(last_name__icontains='ari').explain(analyze=T
    ...: rue))
Seq Scan on reviews_author  (cost=0.00..24.28 rows=38 width=16) (actual time=0.011..0.242 rows=56 loops=1)
  Filter: (upper((last_name)::text) ~~ '%ARI%'::text)
  Rows Removed by Filter: 896
Planning Time: 0.042 ms
Execution Time: 0.249 ms

Trigram类似:

代码语言:javascript
运行
复制
In [55]: print(Author.objects.filter(last_name__trigram_similar='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE "reviews_author"."last_name" % ari

In [56]: print(Author.objects.filter(last_name__trigram_similar='ari').explain(ana
    ...: lyze=True))
Seq Scan on reviews_author  (cost=0.00..21.90 rows=1 width=16) (actual time=0.582..0.582 rows=0 loops=1)
  Filter: ((last_name)::text % 'ari'::text)
  Rows Removed by Filter: 952
Planning Time: 0.033 ms
Execution Time: 0.591 ms

还有一个带有排序结果的更复杂的查询:

代码语言:javascript
运行
复制
In [58]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari
    ...: ')).filter(similar__gt=0).order_by('-similar').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name", SIMILARITY("reviews_author"."last_name", ari) AS "similar" FROM "reviews_author" WHERE SIMILARITY("reviews_author"."last_name", ari) > 0.0 ORDER BY "similar" DESC

In [59]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari
    ...: ')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort  (cost=38.24..39.03 rows=317 width=20) (actual time=0.680..0.683 rows=84 loops=1)
  Sort Key: (similarity((last_name)::text, 'ari'::text)) DESC
  Sort Method: quicksort  Memory: 31kB
  ->  Seq Scan on reviews_author  (cost=0.00..25.07 rows=317 width=20) (actual time=0.021..0.657 rows=84 loops=1)
        Filter: (similarity((last_name)::text, 'ari'::text) > '0'::double precision)
        Rows Removed by Filter: 868
Planning Time: 0.062 ms
Execution Time: 0.693 ms

下一步是创建一个索引:

代码语言:javascript
运行
复制
class Author(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)

    class Meta:
        indexes = [GinIndex(fields=['last_name'])]

这导致了以下SQL迁移:

代码语言:javascript
运行
复制
./manage.py sqlmigrate reviews 0004
BEGIN;
--
-- Alter field score on review
--
--
-- Create index reviews_aut_last_na_a89a84_gin on field(s) last_name of model author
--
CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name");
COMMIT;

现在我运行同样的命令。

比如:

代码语言:javascript
运行
复制
In [60]: print(Author.objects.filter(last_name__icontains='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE UPPER("reviews_author"."last_name"::text) LIKE UPPER(%ari%)

In [61]: print(Author.objects.filter(last_name__icontains='ari').explain(analyze=T
    ...: rue))
Seq Scan on reviews_author  (cost=0.00..24.28 rows=38 width=16) (actual time=0.009..0.237 rows=56 loops=1)
  Filter: (upper((last_name)::text) ~~ '%ARI%'::text)
  Rows Removed by Filter: 896
Planning Time: 0.089 ms
Execution Time: 0.244 ms

Trigram类似:

代码语言:javascript
运行
复制
In [62]: print(Author.objects.filter(last_name__trigram_similar='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE "reviews_author"."last_name" % ari

In [63]: print(Author.objects.filter(last_name__trigram_similar='ari').explain(ana
    ...: lyze=True))
Seq Scan on reviews_author  (cost=0.00..21.90 rows=1 width=16) (actual time=0.740..0.740 rows=0 loops=1)
  Filter: ((last_name)::text % 'ari'::text)
  Rows Removed by Filter: 952
Planning Time: 0.056 ms
Execution Time: 0.750 ms

还有更复杂的查询:

代码语言:javascript
运行
复制
In [64]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari
    ...: ')).filter(similar__gt=0).order_by('-similar').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name", SIMILARITY("reviews_author"."last_name", ari) AS "similar" FROM "reviews_author" WHERE SIMILARITY("reviews_author"."last_name", ari) > 0.0 ORDER BY "similar" DESC

In [65]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari
    ...: ')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort  (cost=38.24..39.03 rows=317 width=20) (actual time=0.659..0.662 rows=84 loops=1)
  Sort Key: (similarity((last_name)::text, 'ari'::text)) DESC
  Sort Method: quicksort  Memory: 31kB
  ->  Seq Scan on reviews_author  (cost=0.00..25.07 rows=317 width=20) (actual time=0.024..0.643 rows=84 loops=1)
        Filter: (similarity((last_name)::text, 'ari'::text) > '0'::double precision)
        Rows Removed by Filter: 868
Planning Time: 0.052 ms
Execution Time: 0.674 ms

执行时间的变化似乎微不足道。在最后一次查询的情况下,扫描需要0.643个单元,而在前一个情况下则是0.657个。时间也相差0.02毫秒(第二个查询运行速度甚至慢一点)。是否有一些选项,我错过了,应该启用,以帮助性能?数据集太简单了吗?

我用的医生:

编辑--我添加了一些houndred记录(现在有将近259,000条记录)--并再次运行测试。首先,没有索引:

代码语言:javascript
运行
复制
In [59]: print(Author.objects.filter(last_name__icontains='bar').explain(analyze=True))
Seq Scan on reviews_author  (cost=0.00..5433.28 rows=10358 width=16) (actual time=0.018..58.630 rows=846 loops=1)
  Filter: (upper((last_name)::text) ~~ '%BAR%'::text)
  Rows Removed by Filter: 258106
Planning Time: 0.046 ms
Execution Time: 58.662 ms

In [60]: print(Author.objects.filter(last_name__trigram_similar='bar').explain(analyze=True))
Gather  (cost=1000.00..4478.96 rows=259 width=16) (actual time=0.555..80.710 rows=698 loops=1)
  Workers Planned: 1
  Workers Launched: 1
  ->  Parallel Seq Scan on reviews_author  (cost=0.00..3453.06 rows=152 width=16) (actual time=0.503..78.743 rows=349 loops=2)
        Filter: ((last_name)::text % 'bar'::text)
        Rows Removed by Filter: 129127
Planning Time: 0.039 ms
Execution Time: 80.740 ms

In [61]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'bar')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort  (cost=12725.93..12941.72 rows=86317 width=20) (actual time=168.214..168.876 rows=14235 loops=1)
  Sort Key: (similarity((last_name)::text, 'bar'::text)) DESC
  Sort Method: quicksort  Memory: 1485kB
  ->  Seq Scan on reviews_author  (cost=0.00..5649.07 rows=86317 width=20) (actual time=0.022..165.806 rows=14235 loops=1)
        Filter: (similarity((last_name)::text, 'bar'::text) > '0'::double precision)
        Rows Removed by Filter: 244717
Planning Time: 0.052 ms
Execution Time: 169.319 ms

有了它:

代码语言:javascript
运行
复制
In [62]: print(Author.objects.filter(last_name__icontains='bar').explain(analyze=True))
Seq Scan on reviews_author  (cost=0.00..5433.28 rows=10358 width=16) (actual time=0.015..59.366 rows=846 loops=1)
  Filter: (upper((last_name)::text) ~~ '%BAR%'::text)
  Rows Removed by Filter: 258106
Planning Time: 0.072 ms
Execution Time: 59.395 ms

In [63]: print(Author.objects.filter(last_name__trigram_similar='bar').explain(analyze=True))
Gather  (cost=1000.00..4478.96 rows=259 width=16) (actual time=0.545..80.337 rows=698 loops=1)
  Workers Planned: 1
  Workers Launched: 1
  ->  Parallel Seq Scan on reviews_author  (cost=0.00..3453.06 rows=152 width=16) (actual time=0.292..78.502 rows=349 loops=2)
        Filter: ((last_name)::text % 'bar'::text)
        Rows Removed by Filter: 129127
Planning Time: 0.035 ms
Execution Time: 80.369 ms

In [64]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'bar')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort  (cost=12725.93..12941.72 rows=86317 width=20) (actual time=168.191..168.890 rows=14235 loops=1)
  Sort Key: (similarity((last_name)::text, 'bar'::text)) DESC
  Sort Method: quicksort  Memory: 1485kB
  ->  Seq Scan on reviews_author  (cost=0.00..5649.07 rows=86317 width=20) (actual time=0.029..165.743 rows=14235 loops=1)
        Filter: (similarity((last_name)::text, 'bar'::text) > '0'::double precision)
        Rows Removed by Filter: 244717
Planning Time: 0.054 ms
Execution Time: 169.340 ms

现在仍然是非常相似的时期,而且它似乎在避免使用gin索引。

EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2020-12-01 15:02:24

代码语言:javascript
运行
复制
CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name");

这并没有创建一个trigram索引。它在整个字符串上创建了一个GIN索引,使用了来自btree_gin的操作符(您似乎并不是为了任何好的目的使用这些操作符)。要建立一个trigram索引,需要如下所示:

代码语言:javascript
运行
复制
CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name" gin_trgm_ops);

但我不知道如何让django这么做,我不是Django用户。

票数 2
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/65088405

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档