Merging Small Files in HDFS

For timeliness, data is often landed on HDFS in small batches, so a table ends up backed by many small files. Queries then have to open a large number of files, which is IO-heavy and slow, so it is worth compacting small files periodically to speed up reads.

The basic idea is to rewrite the table over itself, which merges the files as a side effect.
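To gauge how fragmented a table is before merging, you can inspect it with the HDFS CLI. A minimal sketch: the warehouse path comes from the job log further down, while the dt=2021-05-01 partition directory is an assumed layout based on the table's dt partitioning.

# Summarize directory count, file count, and total bytes under the table
hdfs dfs -count -h /user/hive/warehouse/demo

# List individual file sizes in one partition (assumed directory layout)
hdfs dfs -ls -h /user/hive/warehouse/demo/dt=2021-05-01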

1. Merge the small files

Before merging, record the query speed on the unmerged table as a baseline.

Below is a query against a test environment; the read rate is 2.78 MB/s:

presto:default> SELECT isp, count(*) as cnt FROM default.demo WHERE dt BETWEEN '2021-05-01' AND '2021-05-02' group by isp limit 10;
     isp      |  cnt
--------------+--------
 A            | 115230
 B            |  57819
 C            |  78038
 D            |   3525
(4 rows)

Query 20210520_080949_00066_8dpqe, FINISHED, 1 node
Splits: 1,430 total, 1,430 done (100.00%)
0:11 [255K rows, 30MB] [23.6K rows/s, 2.78MB/s]

Before starting the merge, set the relevant parameters. hive.merge.mapfiles and hive.merge.mapredfiles enable the post-job merge pass for map-only and MapReduce jobs respectively; hive.merge.smallfiles.avgsize (128 MB here) is the threshold: if a job's average output file size falls below it, Hive launches an extra job that merges the outputs up to hive.merge.size.per.task (256 MB here). The two dynamic-partition settings are needed because the INSERT OVERWRITE below uses PARTITION (dt) without a static value:

hive> SET hive.merge.mapfiles = true;
hive> SET hive.merge.mapredfiles = true;
hive> SET hive.merge.size.per.task = 256000000;
hive> SET hive.merge.smallfiles.avgsize = 134217728;
hive> SET hive.exec.dynamic.partition.mode = nonstrict;
hive> SET hive.exec.dynamic.partition = true;

Start the merge. The INSERT OVERWRITE rewrites the selected partitions of the table onto itself, which launches a MapReduce job:

hive> INSERT OVERWRITE TABLE default.demo
    > PARTITION (dt)
    > SELECT * FROM default.demo WHERE dt BETWEEN '2021-05-01' AND '2021-05-02';
Query ID = root_20210520161711_2d1a65ec-2918-4839-9dc0-70d300dfebac
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1621253192216_0622, Tracking URL = http://hadoop-30.com:18088/proxy/application_1621253192216_0622/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1621253192216_0622
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2021-05-20 16:17:21,390 Stage-1 map = 0%,  reduce = 0%
2021-05-20 16:17:37,769 Stage-1 map = 62%,  reduce = 0%, Cumulative CPU 15.62 sec
2021-05-20 16:17:43,891 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 19.81 sec
MapReduce Total cumulative CPU time: 19 seconds 810 msec
Ended Job = job_1621253192216_0622
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://hadoop-10.com:8020/user/hive/warehouse/demo/.hive-staging_hive_2021-05-20_16-17-12_005_3392610118785009765-1/-ext-10000
Loading data to table default.demo partition (dt=null)
         Time taken to load dynamic partitions: 2.018 seconds
         Time taken for adding to write entity : 0.0 seconds
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 19.81 sec   HDFS Read: 43010457 HDFS Write: 29956911 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 19 seconds 810 msec
OK
Time taken: 35.549 seconds
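To run this on a schedule, the settings and the rewrite can be wrapped into a single script. A minimal sketch, where merge_demo.hql and the start_dt/end_dt variables are hypothetical names, not taken from the output above:

-- merge_demo.hql: compact the partitions in [start_dt, end_dt]
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
SET hive.merge.size.per.task = 256000000;
SET hive.merge.smallfiles.avgsize = 134217728;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.dynamic.partition = true;

INSERT OVERWRITE TABLE default.demo
PARTITION (dt)
SELECT * FROM default.demo
WHERE dt BETWEEN '${hivevar:start_dt}' AND '${hivevar:end_dt}';

Invoked, for example, as: hive --hivevar start_dt=2021-05-01 --hivevar end_dt=2021-05-02 -f merge_demo.hql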

2. Verify the result

Re-running the same query, the read rate is now 23.1 MB/s, roughly 8× the 2.78 MB/s baseline, and the number of splits has dropped sharply, from 1,430 to 51.

presto:default> SELECT isp, count(*) as cnt FROM default.demo WHERE dt BETWEEN '2021-05-01' AND '2021-05-02' group by isp limit 10;
     isp      |  cnt
--------------+--------
 A            | 115230
 B            |  57819
 C            |  78038
 D            |   3525
(4 rows)

Query 20210520_081822_00067_8dpqe, FINISHED, 1 node
Splits: 51 total, 51 done (100.00%)
0:01 [255K rows, 28.6MB] [206K rows/s, 23.1MB/s]
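The effect is also visible on HDFS itself: re-running the count from before the merge should show a much lower file count for the rewritten partitions (same assumptions about the table path as above):

# FILE_COUNT should drop sharply for the rewritten date range
hdfs dfs -count -h /user/hive/warehouse/demo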
