1、 总述
主机 | cpu | 内存 | 磁盘 |
---|---|---|---|
p1(master) | 16 core | 64GB | 100GB |
P2 | 8 core | 32GB | 100GB |
P2 | 8 core | 32GB | 100GB |
之前我基于TPC-DS 做了很多轮测试,数据量也分布了几个维度,从结果行看性能提升并不明显,只是单表查询上略有提升,多表基本无变化。结果懒得贴上去了,具体可以看我提的issue。
TPC-DS 典型的单表查询
- group by 、count
1 | select ss_sold_Date_sk,count(*) as cnt from hive.tpcds_bin_partitioned_orc_50.store_sales group by ss_sold_Date_sk |
- group by、 order by、count
1 | select ss_sold_Date_sk,count(*) as cnt from hive.tpcds_bin_partitioned_orc_50.store_sales group by ss_sold_Date_sk order by cnt desc,ss_sold_Date_sk limit 10 |
- group by 、order by、count 、avg
1 | select ss_sold_Date_sk,ss_wholesale_cost,avg(ss_item_sk) as cnt,count(distinct(ss_sales_price)) as avg1 from hive.tpcds_bin_partitioned_orc_50.store_sales group by ss_sold_Date_sk,ss_wholesale_cost order by cnt desc,ss_sold_Date_sk limit 10 |
TPC-DS 典型的多表查询
1 | select channel, col_name, d_year, d_qoy, i_category, COUNT(*) sales_cnt, SUM(ext_sales_price) sales_amt FROM ( |
针对tpc-ds的sql,单表查询聚合函数多个联合使用,多表关联太多且又使用复杂的聚合函数。测试下来总是囫囵吞枣,很难侦查到有无alluxio时性能的变化,我觉得需要更细粒度的单表测试。
2、单表单聚合函数测试
表名:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
```sql
CREATE TABLE `store_sales`(
`ss_sold_time_sk` bigint,
`ss_item_sk` bigint,
`ss_customer_sk` bigint,
`ss_cdemo_sk` bigint,
`ss_hdemo_sk` bigint,
`ss_addr_sk` bigint,
`ss_store_sk` bigint,
`ss_promo_sk` bigint,
`ss_ticket_number` bigint,
`ss_quantity` int,
`ss_wholesale_cost` decimal(7,2),
`ss_list_price` decimal(7,2),
`ss_sales_price` decimal(7,2),
`ss_ext_discount_amt` decimal(7,2),
`ss_ext_sales_price` decimal(7,2),
`ss_ext_wholesale_cost` decimal(7,2),
`ss_ext_list_price` decimal(7,2),
`ss_ext_tax` decimal(7,2),
`ss_coupon_amt` decimal(7,2),
`ss_net_paid` decimal(7,2),
`ss_net_paid_inc_tax` decimal(7,2),
`ss_net_profit` decimal(7,2))
PARTITIONED BY (
`ss_sold_date_sk` bigint)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'alluxio://p1:19998/user/hive/warehouse/tpcds_bin_partitioned_orc_50.db/store_sales'
TBLPROPERTIES (
'transient_lastDdlTime'='1574907136')
1 |
|
聚合查询
每个查询三次,最后一列为平均时间,单位毫秒。
- presto+hdfs (sum)
1 | 20381 19798 19916 20031 |
- presto+hdfs+alluxio(sum)
1 | 6479 6711 6816 6668 |
- presto+hdfs (count)
1 | 19920 20434 19285 19879 |
- presto+hdfs+alluxio(count)
1 | 6194 5739 5638 5857 |
- presto+hdfs (count(xxx))
1 | 21166 18758 20217 20047 |
- presto+hdfs+alluxio (count(xxx))
1 | 6203 6268 5857 6109 |
- presto+hdfs (order by)
1 | 22183 20706 21496 21461 |
- presto+hdfs+alluxio (order by )
1 | 14981 15045 14967 14997 |
- presto+hdfs (avg)
1 | 21916 20552 19876 22114 |
- presto+hdfs+alluxio (avg)
1 | 6555 5928 6317 6266 |
3、总结
单表常规聚合, 有alluxio性能大约2-3倍的提升,presto对于join的查询不擅长,join做分析性能没什么提升,具体也可以看我提交的issue。另外对于多个聚合函数的测试可能也会干扰测试结果,建议大家单独的测试,alluxio+presto 做adhoc对sql是有要求的,不是有缓存了,性能就提升了,需要仔细甄别。