Comparing Parquet, ORC, and Avro: Storage Space and Query Time

On HDFS, different storage formats differ significantly in both storage footprint and query time. Below is a comparison of storage space and query time for three common formats: Parquet, ORC, and Avro.

As the results show, Parquet and ORC clearly outperform Avro in both storage space and query time: the columnar formats compress each column independently and let the engine read only the columns a query touches, while row-based Avro must deserialize entire records.

Format  | Storage size | Hive SQL query time | Characteristics
--------|--------------|---------------------|-------------------------
Parquet | 180 M        | 14.8 s              | columnar storage
ORC     | 96.7 M       | 5.6 s               | columnar storage
Avro    | 2.7 G        | 82.8 s              | row-based serialization

Test data: 20 million rows of public test data.

1. Create the test tables

The original table (queryip) is stored in Avro format.

Create a table in Parquet format:

CREATE TABLE `queryip_parquet`(
  `ip` string COMMENT '',
  `isp` string COMMENT '',
  `country` string COMMENT '',
  `province` string COMMENT '',
  `city` string COMMENT '',
  `deprecate` string COMMENT '',
  `useragent` string COMMENT '',
  `referer` string COMMENT '',
  `path` string COMMENT '',
  `method` string COMMENT '',
  `protocol` string COMMENT '',
  `timestamp` string COMMENT '')
PARTITIONED BY (
  `dt` string)
  STORED AS PARQUET;
insert overwrite table queryip_parquet partition (dt)
SELECT * FROM queryip;
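Both INSERT OVERWRITE statements here and below write every partition dynamically. If Hive rejects the statement under strict dynamic-partition mode (the default), the following session settings are needed first (a minimal sketch, assuming default Hive configuration otherwise):

hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;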

Next, create a table in ORC format:

CREATE TABLE `queryip_orc`(
  `ip` string COMMENT '',
  `isp` string COMMENT '',
  `country` string COMMENT '',
  `province` string COMMENT '',
  `city` string COMMENT '',
  `deprecate` string COMMENT '',
  `useragent` string COMMENT '',
  `referer` string COMMENT '',
  `path` string COMMENT '',
  `method` string COMMENT '',
  `protocol` string COMMENT '',
  `timestamp` string COMMENT '')
PARTITIONED BY (
  `dt` string)
  STORED AS ORC;
insert overwrite table queryip_orc partition (dt)
SELECT * FROM queryip;
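Both tables rely on the formats' default internal codecs. The codec can also be pinned explicitly via table properties at creation time; below is a sketch (not used in this test) with a hypothetical table name and a trimmed column list. "orc.compress" accepts NONE, ZLIB, or SNAPPY; the Parquet equivalent, "parquet.compression", accepts UNCOMPRESSED, SNAPPY, or GZIP.

CREATE TABLE `queryip_orc_snappy`(
  `ip` string,
  `country` string)
PARTITIONED BY (
  `dt` string)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");
-- For a Parquet table, use:
-- TBLPROPERTIES ("parquet.compression"="SNAPPY");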

2. View the metadata

Gather table statistics with Hive's analyze command, then inspect them with describe formatted:

hive> analyze table queryip_parquet partition(dt) compute statistics ;
hive> analyze table queryip_orc partition(dt) compute statistics ;
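Column-level statistics can optionally be collected as well (a sketch; not required for the size and row-count figures below):

hive> ANALYZE TABLE queryip_parquet PARTITION(dt) COMPUTE STATISTICS FOR COLUMNS;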

hive> describe formatted queryip_parquet;
Table Parameters:
        COLUMN_STATS_ACCURATE   {"BASIC_STATS":"true"}
        numFiles                43
        numPartitions           43
        numRows                 21105927
        rawDataSize             274377051
        totalSize               188784031
        transient_lastDdlTime   1623294389

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed:             No
Num Buckets:            0
Bucket Columns:         []
Sort Columns:           []
Time taken: 0.125 seconds, Fetched: 46 row(s)

hive> describe formatted queryip_orc;
OK
Table Parameters:
        COLUMN_STATS_ACCURATE   {"BASIC_STATS":"true"}
        numFiles                43
        numPartitions           43
        numRows                 21184763
        rawDataSize             0
        totalSize               101381351
        transient_lastDdlTime   1623294395

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:             No
Num Buckets:            0
Bucket Columns:         []
Sort Columns:           []
Time taken: 0.135 seconds, Fetched: 46 row(s)
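Note that "Compressed: No" in both outputs is only a table-level flag Hive stores in the metastore; ORC and Parquet files still apply their own internal compression, which is why totalSize is well below rawDataSize for the Parquet table. One way to verify the actual ORC codec is the built-in file dump tool (a sketch; substitute a real data file path from the table directory):

# Prints stripe layout and the compression kind (e.g. ZLIB) of an ORC file
hive --orcfiledump /user/hive/warehouse/queryip_orc/dt=<partition>/<file>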

3. Hive SQL query time

Parquet: 14.779 seconds

 hive> SELECT country,
    >        province,
    >        city,
    >        count(*) AS cnt
    > FROM `default`.`queryip_parquet`
    > GROUP BY country,
    >          province,
    >          city
    > ORDER BY cnt DESC
    > LIMIT 10;
Query ID = root_20210610112412_3a1eaff7-5153-4dd0-985e-bbe23e67aaa2
Query Hive on Spark job[4] stages: [5, 6, 4]
Spark job[4] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
--------------------------------------------------------------------------------------
Stage-4 ........         0      FINISHED      2          2        0        0       0
Stage-5 ........         0      FINISHED      9          9        0        0       0
Stage-6 ........         0      FINISHED      1          1        0        0       0
--------------------------------------------------------------------------------------
STAGES: 03/03    [==========================>>] 100%  ELAPSED TIME: 14.03 s
--------------------------------------------------------------------------------------
Spark job[4] finished successfully in 14.03 second(s)
Spark Job[4] Metrics: TaskDurationTime: 18289, ExecutorCpuTime: 15458, JvmGCTime: 93, BytesRead / RecordsRead: 31766596 / 21105927, BytesReadEC: 0, ShuffleTotalBytesRead / ShuffleRecordsRead: 40566 / 1423, ShuffleBytesWritten / ShuffleRecordsWritten: 40566 / 1423
OK
Time taken: 14.779 seconds, Fetched: 10 row(s)

ORC: 5.645 seconds

hive> SELECT country,
    >        province,
    >        city,
    >        count(*) AS cnt
    > FROM `default`.`queryip_orc`
    > GROUP BY country,
    >          province,
    >          city
    > ORDER BY cnt DESC
    > LIMIT 10;
    Query Hive on Spark job[5] stages: [9, 7, 8]
Spark job[5] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
--------------------------------------------------------------------------------------
Stage-7 ........         0      FINISHED      2          2        0        0       0
Stage-8 ........         0      FINISHED      4          4        0        0       0
Stage-9 ........         0      FINISHED      1          1        0        0       0
--------------------------------------------------------------------------------------
STAGES: 03/03    [==========================>>] 100%  ELAPSED TIME: 5.01 s
--------------------------------------------------------------------------------------
Spark job[5] finished successfully in 5.01 second(s)
Spark Job[5] Metrics: TaskDurationTime: 6666, ExecutorCpuTime: 4968, JvmGCTime: 38, BytesRead / RecordsRead: 26663580 / 20715, BytesReadEC: 0, ShuffleTotalBytesRead / ShuffleRecordsRead: 36165 / 1384, ShuffleBytesWritten / ShuffleRecordsWritten: 36165 / 1384
OK

Time taken: 5.645 seconds, Fetched: 10 row(s)

Avro: 82.76 seconds. The BytesRead metric in the log below shows where the gap comes from: the row-based Avro scan reads the full 2.7 G table, while the columnar runs above read only the three columns referenced by the query (roughly 30 MB each).

hive> SELECT country,
    >        province,
    >        city,
    >        count(*) AS cnt
    > FROM `default`.`queryip`
    > GROUP BY country,
    >          province,
    >          city
    > ORDER BY cnt DESC
    > LIMIT 10
    > ;
Query ID = root_20210610112449_359c8c81-0588-45fd-ba42-2e5be5dff421
Query Hive on Spark job[6] stages: [12, 10, 11]
Spark job[6] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
--------------------------------------------------------------------------------------
Stage-10 .......         0      FINISHED     12         12        0        0       0
Stage-11 .......         0      FINISHED     86         86        0        0       0
Stage-12 .......         0      FINISHED      1          1        0        0       0
--------------------------------------------------------------------------------------
STAGES: 03/03    [==========================>>] 100%  ELAPSED TIME: 82.15 s
--------------------------------------------------------------------------------------
Spark job[6] finished successfully in 82.16 second(s)
Spark Job[6] Metrics: TaskDurationTime: 494625, ExecutorCpuTime: 369951, JvmGCTime: 10414, BytesRead / RecordsRead: 2885901995 / 21184763, BytesReadEC: 0, ShuffleTotalBytesRead / ShuffleRecordsRead: 263896 / 6562, ShuffleBytesWritten / ShuffleRecordsWritten: 263896 / 6562
OK
Time taken: 82.76 seconds, Fetched: 10 row(s)

4. Storage space comparison

In the hdfs dfs -du -s -h output below, the first column is the logical data size and the second is the total space consumed including HDFS replication.

# hdfs dfs -du -s -h /user/hive/warehouse/queryip_orc
96.7 M  193.4 M  /user/hive/warehouse/queryip_orc
# hdfs dfs -du -s -h /user/hive/warehouse/queryip
2.7 G  7.0 G  /user/hive/warehouse/queryip
# hdfs dfs -du -s -h /user/hive/warehouse/queryip_parquet
180.0 M  360.1 M  /user/hive/warehouse/queryip_parquet
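Taking Avro as the baseline (2.7 G ≈ 2765 M), ORC reduces the on-disk footprint by roughly 2765 / 96.7 ≈ 29x and Parquet by roughly 2765 / 180 ≈ 15x.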