在分布式数据库中怎么计算count

123次阅读

共计 6858 个字符，预计需要花费 18 分钟才能阅读完成。

这篇文章主要介绍“在分布式数据库中怎么计算 count”，在日常操作中，相信很多人在在分布式数据库中怎么计算 count 问题上存在疑惑，丸趣 TV 小编查阅了各式资料，整理出简单好用的操作方法，希望对大家解答”在分布式数据库中怎么计算 count”的疑惑有所帮助！接下来，请跟着丸趣 TV 小编一起来学习吧！

背景

在分布式数据库中，计算 count(distinct xxx)，需要对 distinct 的字段，

1、去重，

2、重分布去重后的数据，（这一步，如果 distinct 值特别多，那么就会比较耗时）

3、然后再去重，

4、最后 count (xxx)，

5、求所有节点的 count SUM。

例如，以下是 Greenplum 的执行计划例子

postgres=# explain analyze select count(distinct c_acctbal) from customer; 
 QUERY PLAN 
-------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
 Aggregate (cost=182242.41..182242.42 rows=1 width=8) 
 Rows out: 1 rows with 0.006 ms to first row, 69 ms to end, start offset by 23 ms. 
 -  Gather Motion 16:1 (slice2; segments: 16) (cost=53392.85..173982.82 rows=660767 width=8) 
 Rows out: 818834 rows at destination with 3.416 ms to first row, 447 ms to end, start offset by 23 ms. 
 -  HashAggregate (cost=53392.85..61652.43 rows=41298 width=8) 
 Group By: customer.c_acctbal 
 Rows out: Avg 51177.1 rows x 16 workers. Max 51362 rows (seg3) with 0.004 ms to first row, 33 ms to end, start offset by 25 ms. 
 -  Redistribute Motion 16:16 (slice1; segments: 16) (cost=30266.00..43481.34 rows=41298 width=8) 
 Hash Key: customer.c_acctbal 
 Rows out: Avg 89865.6 rows x 16 workers at destination. Max 90305 rows (seg3) with 18 ms to first row, 120 ms to end, start offset by 25 ms. 
 -  HashAggregate (cost=30266.00..30266.00 rows=41298 width=8) 
 Group By: customer.c_acctbal 
 Rows out: Avg 89865.6 rows x 16 workers. Max 89929 rows (seg2) with 0.007 ms to first row, 33 ms to end, start offset by 26 ms. 
 -  Append-only Columnar Scan on customer (cost=0.00..22766.00 rows=93750 width=8) 
 Rows out: Avg 93750.0 rows x 16 workers. Max 93751 rows (seg4) with 20 ms to first row, 30 ms to end, start offset by 26 ms. 
 Slice statistics: 
 (slice0) Executor memory: 387K bytes. 
 (slice1) Executor memory: 6527K bytes avg x 16 workers, 6527K bytes max (seg0). 
 (slice2) Executor memory: 371K bytes avg x 16 workers, 371K bytes max (seg0). 
 Statement statistics: 
 Memory used: 1280000K bytes 
 Optimizer status: legacy query optimizer 
 Total runtime: 723.143 ms 
(23 rows)

以下是 citus 的例子

postgres=# explain analyze select count(distinct bid) from pgbench_accounts ; 
 QUERY PLAN 
------------------------------------------------------------------------------------------------------------------------------------------------------------------ 
 Aggregate (cost=0.00..0.00 rows=0 width=0) (actual time=31.748..31.749 rows=1 loops=1) 
 -  Custom Scan (Citus Real-Time) (cost=0.00..0.00 rows=0 width=0) (actual time=31.382..31.510 rows=1280 loops=1) 
 Task Count: 128 
 Tasks Shown: One of 128 
 -  Task 
 Node: host=172.24.211.224 port=1921 dbname=postgres 
 -  HashAggregate (cost=231.85..231.95 rows=10 width=4) (actual time=3.700..3.702 rows=10 loops=1) 
 Group Key: bid 
 -  Seq Scan on pgbench_accounts_106812 pgbench_accounts (cost=0.00..212.48 rows=7748 width=4) (actual time=0.017..2.180 rows=7748 loops=1) 
 Planning time: 0.445 ms 
 Execution time: 3.781 ms 
 Planning time: 1.399 ms 
 Execution time: 32.159 ms 
(13 rows)

对于可估值计算的场景，即不需要精确 distinct 值的场景，PostgreSQL 提供了一个名为 hll 的插件，可以用来估算 distinct 元素个数。

citus 结合 hll，可以实现超高速的 count(distinct xxx)，即使 distinct 值非常非常多，也不慢。

SET citus.count_distinct_error_rate to 0.005; 
 
0.005 表示失真度

hll 加速 citus count(distinct xxx) 使用举例部署

1、所有节点（coordinator 与 worker 节点），安装 hll 软件

yum install -y gcc-c++ 
 
cd ~/ 
 
git clone https://github.com/citusdata/postgresql-hll 
 
cd postgresql-hll 
 
. /var/lib/pgsql/.bash_profile 
 
USE_PGXS=1 make 
USE_PGXS=1 make install

2、所有节点（coordinator 与 worker 节点），在需要用到 HLL 的 DB 中增加插件

su - postgres -c  psql -d postgres -c  create extension hll;  
 
su - postgres -c  psql -d newdb -c  create extension hll;

使用举例 1、创建测试表，128 shard

create table test (id int primary key, a int, b int, c int); 
 
set citus.shard_count =128; 
 
select create_distributed_table(test ,  id

2、写入 10 亿测试数据，a 字段 10 唯一值，b 字段 100 唯一值，c 字段 100 万唯一值

insert into test select id, random()*9, random()*99, random()*999999 from generate_series(1,1000000000) t(id);

3、（coordinator 节点）设置全局或当前会话级参数，指定失真度，越小失真度越小

SET citus.count_distinct_error_rate to 0.005; 
 
newdb=# explain select count(distinct bid) from pgbench_accounts group by bid; 
 QUERY PLAN 
------------------------------------------------------------------------------------------------------------------------------- 
 HashAggregate (cost=0.00..0.00 rows=0 width=0) 
 Group Key: remote_scan.worker_column_2 
 -  Custom Scan (Citus Real-Time) (cost=0.00..0.00 rows=0 width=0) 
 Task Count: 128 
 Tasks Shown: One of 128 
 -  Task 
 Node: host=172.24.211.224 port=8001 dbname=newdb 
 -  GroupAggregate (cost=97272.79..105102.29 rows=1000 width=36) 
 Group Key: bid 
 -  Sort (cost=97272.79..99227.04 rows=781700 width=4) 
 Sort Key: bid 
 -  Seq Scan on pgbench_accounts_102008 pgbench_accounts (cost=0.00..20759.00 rows=781700 width=4) 
(12 rows)

4、对比是否使用 HLL 加速 (少量唯一值，HLL 没有性能提升，因为本身就不存在瓶颈)4.1、未使用 hll

newdb=# set citus.count_distinct_error_rate to 0; 
newdb=# select count(distinct bid) from pgbench_accounts; 
 count 
------- 
 1000 
(1 row) 
 
Time: 423.364 ms 
 
postgres=# set citus.count_distinct_error_rate to 0; 
postgres=# select count(distinct a) from test; 
 count 
------- 
 10 
(1 row) 
 
Time: 2392.709 ms (00:02.393)

4.2、使用 hll

newdb=# set citus.count_distinct_error_rate to 0.005; 
newdb=# select count(distinct bid) from pgbench_accounts; 
 count 
------- 
 1000 
(1 row) 
 
Time: 444.287 ms 
 
postgres=# set citus.count_distinct_error_rate to 0.005; 
postgres=# select count(distinct a) from test; 
 count 
------- 
 10 
(1 row) 
 
Time: 2375.473 ms (00:02.375)

5、对比是否使用 HLL 加速 (大量唯一值，HLL 性能提升显著)5.1、未使用 hll

postgres=# set citus.count_distinct_error_rate to 0; 
 
 count 
----------
 10000000
(1 row)
Time: 5826241.205 ms (01:37:06.241)

128 个节点，每个节点最多发送 10 亿 /128 条数据给 coordinator，慢是可以理解的。另一方面，coordinator 可以边接收边去重 (postgresql 11 增加了 parallel gather, merge sort 等能力，citus coordinator 可以借鉴)，没必要等所有数据都收完再去重。

5.2、使用 hll

postgres=# set citus.count_distinct_error_rate to 0.005; 
postgres=# select count(distinct (a,c)) from test; 
 count 
--------- 
 9999995 
(1 row) 
 
Time: 4468.749 ms (00:04.469)

6、设置不同的精度参数，性能对比

newdb=# set citus.count_distinct_error_rate to 0.1; 
newdb=# select count(distinct (aid,bid)) from pgbench_accounts ; 
 count 
---------- 
 94778491 
(1 row) 
Time: 545.301 ms 
 
newdb=# set citus.count_distinct_error_rate to 0.01; 
newdb=# select count(distinct (aid,bid)) from pgbench_accounts ; 
 count 
----------- 
 100293937 
(1 row) 
Time: 554.333 ms 
 
--  推荐设置 0.005 
 
newdb=# set citus.count_distinct_error_rate to 0.005; 
newdb=# select count(distinct (aid,bid)) from pgbench_accounts ; 
 count 
----------- 
 100136086 
(1 row) 
Time: 1053.070 ms (00:01.053) 
 
newdb=# set citus.count_distinct_error_rate to 0.001; 
newdb=# select count(distinct (aid,bid)) from pgbench_accounts ; 
 count 
----------- 
 100422107 
(1 row) 
Time: 9287.934 ms (00:09.288)

到此，关于“在分布式数据库中怎么计算 count”的学习就结束了，希望能够解决大家的疑惑。理论与实践的搭配能更好的帮助大家学习，快去试试吧！若想继续学习更多相关知识，请继续关注丸趣 TV 网站，丸趣 TV 小编会继续努力为大家带来更多实用的文章！

正文完