1、 总述
| 主机 | cpu | 内存 | 磁盘 | 
|---|---|---|---|
| p1(master) | 16 core | 64GB | 100GB | 
| P2 | 8 core | 32GB | 100GB | 
| P2 | 8 core | 32GB | 100GB | 
之前我基于TPC-DS 做了很多轮测试,数据量也分布了几个维度,从结果行看性能提升并不明显,只是单表查询上略有提升,多表基本无变化。结果懒得贴上去了,具体可以看我提的issue。
TPC-DS 典型的单表查询
- group by 、count
 
1  | select ss_sold_Date_sk,count(*) as cnt from hive.tpcds_bin_partitioned_orc_50.store_sales group by ss_sold_Date_sk  | 
- group by、 order by、count
 
1  | select ss_sold_Date_sk,count(*) as cnt from hive.tpcds_bin_partitioned_orc_50.store_sales group by ss_sold_Date_sk order by cnt desc,ss_sold_Date_sk limit 10  | 
- group by 、order by、count 、avg
 
1  | select ss_sold_Date_sk,ss_wholesale_cost,avg(ss_item_sk) as cnt,count(distinct(ss_sales_price)) as avg1 from hive.tpcds_bin_partitioned_orc_50.store_sales group by ss_sold_Date_sk,ss_wholesale_cost order by cnt desc,ss_sold_Date_sk limit 10  | 
TPC-DS 典型的多表查询
1  | select channel, col_name, d_year, d_qoy, i_category, COUNT(*) sales_cnt, SUM(ext_sales_price) sales_amt FROM (  | 
针对tpc-ds的sql,单表查询聚合函数多个联合使用,多表关联太多且又使用复杂的聚合函数。测试下来总是囫囵吞枣,很难侦查到有无alluxio时性能的变化,我觉得需要更细粒度的单表测试。
2、单表单聚合函数测试
表名:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
```sql
CREATE TABLE `store_sales`(
  `ss_sold_time_sk` bigint,
  `ss_item_sk` bigint,
  `ss_customer_sk` bigint,
  `ss_cdemo_sk` bigint,
  `ss_hdemo_sk` bigint,
  `ss_addr_sk` bigint,
  `ss_store_sk` bigint,
  `ss_promo_sk` bigint,
  `ss_ticket_number` bigint,
  `ss_quantity` int,
  `ss_wholesale_cost` decimal(7,2),
  `ss_list_price` decimal(7,2),
  `ss_sales_price` decimal(7,2),
  `ss_ext_discount_amt` decimal(7,2),
  `ss_ext_sales_price` decimal(7,2),
  `ss_ext_wholesale_cost` decimal(7,2),
  `ss_ext_list_price` decimal(7,2),
  `ss_ext_tax` decimal(7,2),
  `ss_coupon_amt` decimal(7,2),
  `ss_net_paid` decimal(7,2),
  `ss_net_paid_inc_tax` decimal(7,2),
  `ss_net_profit` decimal(7,2))
PARTITIONED BY (
  `ss_sold_date_sk` bigint)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'alluxio://p1:19998/user/hive/warehouse/tpcds_bin_partitioned_orc_50.db/store_sales'
TBLPROPERTIES (
  'transient_lastDdlTime'='1574907136')
1  | 
  | 
聚合查询
每个查询三次,最后一列为平均时间,单位毫秒。
- presto+hdfs (sum)
 
1  | 20381 19798 19916 20031  | 
- presto+hdfs+alluxio(sum)
 
1  | 6479 6711 6816 6668  | 
- presto+hdfs (count)
 
1  | 19920 20434 19285 19879  | 
- presto+hdfs+alluxio(count)
 
1  | 6194 5739 5638 5857  | 
- presto+hdfs (count(xxx))
 
1  | 21166 18758 20217 20047  | 
- presto+hdfs+alluxio (count(xxx))
 
1  | 6203 6268 5857 6109  | 
- presto+hdfs (order by)
 
1  | 22183 20706 21496 21461  | 
- presto+hdfs+alluxio (order by )
 
1  | 14981 15045 14967 14997  | 
- presto+hdfs (avg)
 
1  | 21916 20552 19876 22114  | 
- presto+hdfs+alluxio (avg)
 
1  | 6555 5928 6317 6266  | 
3、总结
单表常规聚合, 有alluxio性能大约2-3倍的提升,presto对于join的查询不擅长,join做分析性能没什么提升,具体也可以看我提交的issue。另外对于多个聚合函数的测试可能也会干扰测试结果,建议大家单独的测试,alluxio+presto 做adhoc对sql是有要求的,不是有缓存了,性能就提升了,需要仔细甄别。