PostgreSQL搜索插件有什么优点

212次阅读

共计 16134 个字符，预计需要花费 41 分钟才能阅读完成。

本篇内容主要讲解“PostgreSQL 搜索插件有什么优点”，感兴趣的朋友不妨来看看。本文介绍的方法操作简单快捷，实用性强。下面就让丸趣 TV 小编来带大家学习“PostgreSQL 搜索插件有什么优点”吧!

git clone https://github.com/postgrespro/rum 
cd rum 
. /var/lib/pgsql/.bash_profile 
USE_PGXS=1 make 
USE_PGXS=1 make install 
create extension rum;

1、生成随机浮点数组的 UDF 接口

create or replace function gen_rand_float4(int,int) returns float4[] as $$ 
 select array(select (random()*$1)::float4 from generate_series(1,$2)); 
$$ language sql strict;

2、建表，索引

create unlogged table t_rum(id int primary key, arr float4[]); 
create index idx_t_rum_1 on t_rum using rum(arr);

4、写入随机浮点数数组

vi test.sql 
\set id random(1,2000000000) 
insert into t_rum values (:id, gen_rand_float4(10,16)) on conflict(id) do nothing;

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 64 -j 64 -t 10000000

postgres=# select * from t_rum limit 2; 
 id | arr 
-----------+----------------------------------------------------------------------------------------------------------------------------------------- 
 182025544 | {5.07998,6.80827,5.42024,2.53619,4.10843,0.532198,4.33886,9.60262,6.68369,8.01305,9.60298,8.087,1.25819,6.54424,6.04902,5.3444} 
 51515704 | {0.123099,9.26626,0.00549683,9.01483,0.911669,3.44338,4.55135,4.65002,0.820029,9.66546,1.93433,3.00254,1.28121,8.99883,1.85269,6.39579} 
(2 rows) 
 
 
postgres=# select count(*) from t_rum; 
 count 
--------- 
 3244994 
(1 row)

5、使用 rum 提供的数组相似搜索（元素重叠率计算）

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_rum order by arr  =   {5.07998,6.80827,5.42024,2.53619,4.10843,0.532198,4.33886,9.60262,6.68369,8.01305,9.60298,8.087,1.25819,6.54424,6.04902,5.3444}  limit 1; 
 QUERY PLAN 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 
 Limit (cost=22435.67..22435.68 rows=1 width=97) (actual time=12527.447..12527.450 rows=1 loops=1) 
 Output: id, arr, ((arr  =   {5.07998,6.80827,5.42024,2.53619,4.10843,0.532198,4.33886,9.60262,6.68369,8.01305,9.60298,8.087,1.25819,6.54424,6.04902,5.3444} ::real[])) 
 Buffers: shared hit=50450 
 -  Sort (cost=22435.67..29469.15 rows=3244994 width=97) (actual time=12527.445..12527.446 rows=1 loops=1) 
 Output: id, arr, ((arr  =   {5.07998,6.80827,5.42024,2.53619,4.10843,0.532198,4.33886,9.60262,6.68369,8.01305,9.60298,8.087,1.25819,6.54424,6.04902,5.3444} ::real[])) 
 Sort Key: ((t_rum.arr  =   {5.07998,6.80827,5.42024,2.53619,4.10843,0.532198,4.33886,9.60262,6.68369,8.01305,9.60298,8.087,1.25819,6.54424,6.04902,5.3444} ::real[])) 
 Sort Method: top-N heapsort Memory: 25kB 
 Buffers: shared hit=50450 
 -  Seq Scan on public.t_rum (cost=0.00..8368.72 rows=3244994 width=97) (actual time=0.054..11788.483 rows=3244994 loops=1) 
 Output: id, arr, (arr  =   {5.07998,6.80827,5.42024,2.53619,4.10843,0.532198,4.33886,9.60262,6.68369,8.01305,9.60298,8.087,1.25819,6.54424,6.04902,5.3444} ::real[]) 
 Buffers: shared hit=50447 
 Planning Time: 0.115 ms 
 Execution Time: 12527.498 ms 
(13 rows)

你会发现，走了索引，但是并不快。扫描了大量 (50447) 的索引 PAGE。

原因是我们没有管它的阈值，导致扫描了大量的 index BLOCK。默认的阈值为 0.5，太低了。

postgres=# show rum.array_similarity_threshold 
postgres-# ; 
 rum.array_similarity_threshold 
-------------------------------- 
 0.5 
(1 row)

调成 0.9，只输出 90% 以上相似 (重叠度) 的数组。性能瞬间暴增，扫描的数据块也变少了。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_rum where arr %  {5.07998,6.80827,5.42024,2.53619,4.10843,0.532198,4.33886,9.60262,6.68369,8.01305,9.60298,8.087,1.25819,6.54424,6.04902,5.3444}  order by arr  =   {5.07998,6.80827,5.42024,2.53619,4.10843,0.532198,4.33886,9.60262,6.68369,8.01305,9.60298,8.087,1.25819,6.54424,6.04902,5.3444}  limit 1; 
 QUERY PLAN 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 
 Limit (cost=1.54..1.56 rows=1 width=97) (actual time=0.664..0.664 rows=0 loops=1) 
 Output: id, arr, ((arr  =   {5.07998,6.80827,5.42024,2.53619,4.10843,0.532198,4.33886,9.60262,6.68369,8.01305,9.60298,8.087,1.25819,6.54424,6.04902,5.3444} ::real[])) 
 Buffers: shared hit=128 read=40 
 -  Index Scan using idx_t_rum_1 on public.t_rum (cost=1.54..87.65 rows=3245 width=97) (actual time=0.662..0.662 rows=0 loops=1) 
 Output: id, arr, (arr  =   {5.07998,6.80827,5.42024,2.53619,4.10843,0.532198,4.33886,9.60262,6.68369,8.01305,9.60298,8.087,1.25819,6.54424,6.04902,5.3444} ::real[]) 
 Index Cond: (t_rum.arr %  {5.07998,6.80827,5.42024,2.53619,4.10843,0.532198,4.33886,9.60262,6.68369,8.01305,9.60298,8.087,1.25819,6.54424,6.04902,5.3444} ::real[]) 
 Order By: (t_rum.arr  =   {5.07998,6.80827,5.42024,2.53619,4.10843,0.532198,4.33886,9.60262,6.68369,8.01305,9.60298,8.087,1.25819,6.54424,6.04902,5.3444} ::real[]) 
 Buffers: shared hit=128 read=40 
 Planning Time: 0.184 ms 
 Execution Time: 0.691 ms 
(10 rows)

元素重叠度相似搜索优化

1、调整阈值，阶梯化解题。

《PostgreSQL 相似搜索设计与性能 – 地址、QA、POI 等文本毫秒级相似搜索实践》

实际上图像特征值近似搜索，也有优化的空间，接下来进入正题。

部署 imgsmlr (on PG 11)

1、假设 yum 安装的 PG 11

2、克隆源码

yum install -y git 
git clone https://github.com/postgrespro/imgsmlr 
cd imgsmlr

3、修改头文件

vi imgsmlr.h 
 
//  追加  
#ifndef FALSE 
#define FALSE (0) 
#endif 
 
 
#ifndef TRUE 
#define TRUE (!FALSE) 
#endif

4、安装依赖的图像转换包

yum install -y gd-devel

5、编译安装 IMGSMLR 插件

. /var/lib/pgsql/.bash_profile 
USE_PGXS=1 make 
USE_PGXS=1 make install

单节点单表图像搜索 (4 亿图像)

1、创建生成随机图像特征值 signature 的 UDF。

create or replace function gen_rand_img_sig(int) returns signature as $$ 
 select (( ||rtrim(ltrim(array(select (random()*$1)::float4 from generate_series(1,16))::text, {),} )|| ) )::signature; 
$$ language sql strict;

postgres=# select * from gen_rand_img_sig(10); 
 gen_rand_img_sig 
------------------------------------------------------------------------------------------------------------------------------------------------------------------ 
 (6.744310, 5.105020, 0.087113, 3.808010, 8.129480, 2.834540, 2.495250, 0.940481, 0.033208, 6.583490, 2.840330, 1.422440, 6.683830, 0.080847, 8.327730, 2.471430) 
(1 row) 
 
postgres=# select * from gen_rand_img_sig(10); 
 gen_rand_img_sig 
------------------------------------------------------------------------------------------------------------------------------------------------------------------ 
 (3.013650, 6.170690, 0.601905, 2.692030, 1.268540, 7.803740, 9.757770, 5.537750, 0.391753, 4.440790, 1.201580, 5.501380, 6.166980, 0.240686, 9.768680, 2.911290) 
(1 row)

2、建表，建图像特征值索引

create table t_img_sig (id int primary key, sig signature); 
create index idx_t_img_sig_1 on t_img_sig using gist(sig);

3、写入约 4 亿随机图像特征值

vi testsig.sql 
\set id random(1,2000000000) 
insert into t_img_sig values (:id, gen_rand_img_sig(10)) on conflict(id) do nothing;

pgbench -M prepared -n -r -P 1 -f ./testsig.sql -c 32 -j 32 -t 20000000

postgres=# select * from t_img limit 10; 
 id | sig 
-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------ 
 47902935 | (5.861920, 1.062770, 8.318020, 2.205840, 0.202951, 6.956610, 1.413190, 2.898480, 8.961630, 6.377800, 1.110450, 6.684520, 2.286290, 7.850760, 1.832650, 0.074348) 
 174656795 | (2.165030, 0.183753, 9.913950, 9.208260, 5.165660, 6.603510, 2.008380, 8.117910, 2.358590, 5.466330, 9.139280, 8.893700, 4.664190, 9.361670, 9.016990, 2.271000) 
 96186891 | (9.605980, 4.395920, 4.336720, 3.174360, 8.706960, 0.155107, 9.408940, 4.531100, 2.783530, 5.681780, 9.792380, 6.428320, 2.983760, 9.733290, 7.635160, 7.035780) 
 55061667 | (7.567960, 5.874530, 5.222040, 5.638520, 3.488960, 8.770750, 7.054610, 7.239630, 9.202280, 9.465020, 4.079080, 5.729770, 0.475227, 8.434800, 6.873730, 5.140080) 
 64659434 | (4.860650, 3.984440, 3.009900, 5.116680, 6.489150, 4.224800, 0.609752, 8.731120, 6.577390, 8.542540, 9.096120, 8.976700, 8.936000, 2.836270, 7.186250, 6.264300) 
 87143098 | (4.801570, 7.870150, 0.939599, 3.666670, 1.102340, 5.819580, 6.511330, 6.430760, 0.584531, 3.024190, 6.255460, 8.823820, 5.076960, 0.181344, 8.137380, 1.230360) 
 109245945 | (7.541850, 7.201460, 6.858400, 2.605210, 1.283090, 7.525200, 4.213240, 8.413760, 9.707390, 1.916970, 1.719320, 1.255280, 9.006780, 4.851420, 2.168250, 5.997360) 
 4979218 | (8.463000, 4.051410, 9.057320, 1.367980, 3.344340, 7.032640, 8.583770, 1.873090, 5.524810, 0.187254, 5.783270, 6.141040, 2.479410, 6.406450, 9.371700, 0.050690) 
 72846137 | (7.018560, 4.039150, 9.114800, 2.911170, 5.531180, 8.557330, 6.739050, 0.103649, 3.691390, 7.584640, 8.184180, 0.599390, 9.037130, 4.090610, 4.369770, 6.480000) 
 36813995 | (4.643480, 8.704640, 1.073880, 2.665530, 3.298300, 9.244280, 5.768050, 0.887555, 5.990350, 2.991390, 6.186550, 6.464940, 6.187140, 0.150242, 2.123070, 2.932270) 
(10 rows) 
 
Time: 58.101 ms

写入约 4.39 亿图像特征值。

postgres=# select count(*) from t_img_sig; 
 count 
----------- 
 438924137 
(1 row)

4、输入一个图像特征值，搜索与之最相似的图像。

explain (analyze,verbose,timing,costs,buffers) select * from t_img_sig order by sig  -   (5.07998,6.80827,5.42024,2.53619,4.10843,0.532198,4.33886,9.60262,6.68369,8.01305,9.60298,8.087,1.25819,6.54424,6.04902,5.3444)  limit 1;

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_img_sig where signature_distance(sig, (5.07998,6.80827,5.42024,2.53619,4.10843,0.532198,4.33886,9.60262,6.68369,8.01305,9.60298,8.087,1.25819,6.54424,6.04902,5.3444) )   0.9 order by sig  -   (5.07998,6.80827,5.42024,2.53619,4.10843,0.532198,4.33886,9.60262,6.68369,8.01305,9.60298,8.087,1.25819,6.54424,6.04902,5.3444)  limit 1; 
 QUERY PLAN 
 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
 Limit (cost=0.48..0.51 rows=1 width=72) (actual time=4094.810..4094.812 rows=1 loops=1) 
 Output: id, sig, ((sig  -   (5.079980, 6.808270, 5.420240, 2.536190, 4.108430, 0.532198, 4.338860, 9.602620, 6.683690, 8.013050, 9.602980, 8.087000, 1.258190, 6.544240, 6.049020, 5.344400) ::signature)) 
 Buffers: shared hit=205999 
 -  Index Scan using idx_t_img_sig_1 on public.t_img_sig (cost=0.48..5361351.06 rows=146395778 width=72) (actual time=4094.808..4094.808 rows=1 loops=1) 
 Output: id, sig, (sig  -   (5.079980, 6.808270, 5.420240, 2.536190, 4.108430, 0.532198, 4.338860, 9.602620, 6.683690, 8.013050, 9.602980, 8.087000, 1.258190, 6.544240, 6.049020, 5.344400) ::signature) 
 Order By: (t_img_sig.sig  -   (5.079980, 6.808270, 5.420240, 2.536190, 4.108430, 0.532198, 4.338860, 9.602620, 6.683690, 8.013050, 9.602980, 8.087000, 1.258190, 6.544240, 6.049020, 5.344400) ::signature) 
 Filter: (signature_distance(t_img_sig.sig,  (5.079980, 6.808270, 5.420240, 2.536190, 4.108430, 0.532198, 4.338860, 9.602620, 6.683690, 8.013050, 9.602980, 8.087000, 1.258190, 6.544240, 6.049020, 5.344400) ::signature)    0.9 ::double precision) 
 Buffers: shared hit=205999 
 Planning Time: 0.073 ms 
 Execution Time: 4194.485 ms 
(10 rows)

性能与瓶颈

性能：4.39 亿图像特征值，以图搜图约 4.2 秒。

瓶颈:

1、扫描了大量的索引页（205999）。

优化思路

1、压缩精度，比如使用 3 位小数。据用户说有 10 倍性能提升。

精度优化如下，使用新的生成图像特征值的函数，使用 3 位小数。

create or replace function gen_rand_img_sig3(int) returns signature as $$ 
 select (( ||rtrim(ltrim(array(select trunc((random()*$1)::numeric,3) from generate_series(1,16))::text, {),} )|| ) )::signature; 
$$ language sql strict;

例子如下

postgres=# select gen_rand_img_sig3(10); 
 gen_rand_img_sig3 
------------------------------------------------------------------------------------------------------------------------------------------------------------------ 
 (2.984000, 3.323000, 4.083000, 6.292000, 5.008000, 9.029000, 6.208000, 1.141000, 1.796000, 9.257000, 1.397000, 1.235000, 7.157000, 3.745000, 0.112000, 7.723000) 
(1 row)

2、使用分区表 +dblink 异步接口并行调用。（内核层面直接支持 imgsmlr gist index scan 并行更好）

下一篇介绍

3、使用 citus sharding。多机，提高整体计算能力。（因为扫描大量索引页，即使 CPU 没有瓶颈，将来内存带宽也会成为瓶颈。多机可以解决这个问题。）

下一篇介绍

4、内核层面，支持维度更低的 signature，现在是 16 片，比如支持降低到 4 片，性能也可以提升。

精度现象

1、当有记录可以完全匹配时，扫描少量 INDEX PAGE。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_img_sig order by sig  -   (3.727000, 2.594000, 0.185000, 3.996000, 6.450000, 7.126000, 5.499000, 1.540000, 8.239000, 6.262000, 2.053000, 2.566000, 4.522000, 6.929000, 1.582000, 2.179000) ::signature limit 1;
 QUERY PLAN 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit (cost=0.48..0.49 rows=1 width=72) (actual time=1.596..1.598 rows=1 loops=1)
 Output: id, sig, ((sig  -   (3.727000, 2.594000, 0.185000, 3.996000, 6.450000, 7.126000, 5.499000, 1.540000, 8.239000, 6.262000, 2.053000, 2.566000, 4.522000, 6.929000, 1.582000, 2.179000) ::signature))
 Buffers: shared hit=125
 -  Index Scan using t_img_sig1_sig_idx on public.t_img_sig (cost=0.48..7318159.22 rows=785457848 width=72) (actual time=1.594..1.595 rows=1 loops=1)
 Output: id, sig, (sig  -   (3.727000, 2.594000, 0.185000, 3.996000, 6.450000, 7.126000, 5.499000, 1.540000, 8.239000, 6.262000, 2.053000, 2.566000, 4.522000, 6.929000, 1.582000, 2.179000) ::signature)
 Order By: (t_img_sig.sig  -   (3.727000, 2.594000, 0.185000, 3.996000, 6.450000, 7.126000, 5.499000, 1.540000, 8.239000, 6.262000, 2.053000, 2.566000, 4.522000, 6.929000, 1.582000, 2.179000) ::signature)
 Buffers: shared hit=125
 Planning Time: 0.072 ms
 Execution Time: 1.621 ms
(9 rows)

2、当修改少量内容，少量值完全匹配，其他值不完全匹配时，扫描的 INDEX PAGE 增加。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_img_sig order by sig  -   (3.727000, 2.594000, 0.185000, 3.996000, 6.450000, 7.126000, 5.499000, 1.540000, 8.239000, 6.262000, 2.053000, 2.566000, 4.522000, 6.929000, 1.582000, 2.179001) ::signature limit 1;
 QUERY PLAN 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit (cost=0.48..0.49 rows=1 width=72) (actual time=7.051..7.052 rows=1 loops=1)
 Output: id, sig, ((sig  -   (3.727000, 2.594000, 0.185000, 3.996000, 6.450000, 7.126000, 5.499000, 1.540000, 8.239000, 6.262000, 2.053000, 2.566000, 4.522000, 6.929000, 1.582000, 2.179001) ::signature))
 Buffers: shared hit=454
 -  Index Scan using t_img_sig1_sig_idx on public.t_img_sig (cost=0.48..7324626.56 rows=786152016 width=72) (actual time=7.049..7.049 rows=1 loops=1)
 Output: id, sig, (sig  -   (3.727000, 2.594000, 0.185000, 3.996000, 6.450000, 7.126000, 5.499000, 1.540000, 8.239000, 6.262000, 2.053000, 2.566000, 4.522000, 6.929000, 1.582000, 2.179001) ::signature)
 Order By: (t_img_sig.sig  -   (3.727000, 2.594000, 0.185000, 3.996000, 6.450000, 7.126000, 5.499000, 1.540000, 8.239000, 6.262000, 2.053000, 2.566000, 4.522000, 6.929000, 1.582000, 2.179001) ::signature)
 Buffers: shared hit=454
 Planning Time: 0.074 ms
 Execution Time: 7.076 ms
(9 rows)

3、当大量修改值，不能完全匹配时，需要扫描大量 INDEX PAGE。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_img_sig order by sig  -   (7.727000, 3.594000, 1.185000, 4.996000, 6.950000, 7.129000, 5.429000, 1.520000, 8.219000, 6.222000, 2.013000, 2.536000, 4.532000, 6.939000, 1.538000, 2.178000) ::signature limit 1;
 QUERY PLAN 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit (cost=0.47..0.48 rows=1 width=72) (actual time=2528.890..2528.891 rows=1 loops=1)
 Output: id, sig, ((sig  -   (7.727000, 3.594000, 1.185000, 4.996000, 6.950000, 7.129000, 5.429000, 1.520000, 8.219000, 6.222000, 2.013000, 2.536000, 4.532000, 6.939000, 1.538000, 2.178000) ::signature))
 Buffers: shared hit=121510
 -  Index Scan using t_img_sig1_sig_idx on public.t_img_sig (cost=0.47..1361409.21 rows=146121007 width=72) (actual time=2528.887..2528.888 rows=1 loops=1)
 Output: id, sig, (sig  -   (7.727000, 3.594000, 1.185000, 4.996000, 6.950000, 7.129000, 5.429000, 1.520000, 8.219000, 6.222000, 2.013000, 2.536000, 4.532000, 6.939000, 1.538000, 2.178000) ::signature)
 Order By: (t_img_sig.sig  -   (7.727000, 3.594000, 1.185000, 4.996000, 6.950000, 7.129000, 5.429000, 1.520000, 8.219000, 6.222000, 2.013000, 2.536000, 4.532000, 6.939000, 1.538000, 2.178000) ::signature)
 Buffers: shared hit=121510
 Planning Time: 0.092 ms
 Execution Time: 2582.558 ms
(9 rows)

到此，相信大家对“PostgreSQL 搜索插件有什么优点”有了更深的了解，不妨来实际操作一番吧！这里是丸趣 TV 网站，更多相关内容可以进入相关频道进行查询，关注我们，继续学习！

正文完