HDFS: Merging Small Files
Because data is often landed on HDFS in near-real time, it tends to be written as many small files. Queries then have to open a large number of them, which is IO-heavy and slow, so it is worth merging small files periodically to speed up access.
The basic idea is to rewrite the original table; the rewrite produces the merged files.
1. Merging the small files
Before merging, record the query speed against the unmerged data as a baseline.
Below is a query in a test environment; the read throughput is 2.78 MB/s:
presto:default> SELECT isp, count(*) as cnt FROM default.demo WHERE dt BETWEEN '2021-05-01' AND '2021-05-02' group by isp limit 10;
isp | cnt
--------------+--------
A | 115230
B | 57819
C | 78038
D | 3525
(4 rows)
Query 20210520_080949_00066_8dpqe, FINISHED, 1 node
Splits: 1,430 total, 1,430 done (100.00%)
0:11 [255K rows, 30MB] [23.6K rows/s, 2.78MB/s]
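It also helps to record how many files each partition holds before the merge. A minimal check from the shell, assuming the table lives under the warehouse path that appears in the job log further down (/user/hive/warehouse/demo); the exact partition path is an assumption:

# Directory count, file count, and total bytes for one partition (path is an assumption)
hdfs dfs -count -h /user/hive/warehouse/demo/dt=2021-05-01
# Or simply count the entries in the partition directory
hdfs dfs -ls /user/hive/warehouse/demo/dt=2021-05-01 | wc -l

A healthy partition shows a handful of files close to the HDFS block size; hundreds of kilobyte-sized files are the signal to merge.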
Set the merge parameters before starting. hive.merge.mapfiles and hive.merge.mapredfiles enable merging of small output files after map-only and map-reduce jobs respectively; hive.merge.size.per.task (256000000, ~256 MB) is the target size of each merged file; hive.merge.smallfiles.avgsize (134217728, 128 MB) triggers an extra merge job whenever the average output file size falls below it; and the two dynamic-partition settings let the INSERT OVERWRITE below write PARTITION (dt) without a static partition value:
hive> SET hive.merge.mapfiles = true;
hive> SET hive.merge.mapredfiles = true;
hive> SET hive.merge.size.per.task = 256000000;
hive> SET hive.merge.smallfiles.avgsize = 134217728;
hive> SET hive.exec.dynamic.partition.mode = nonstrict;
hive> SET hive.exec.dynamic.partition = true;
Start the merge by rewriting the table with an INSERT OVERWRITE; this launches MapReduce jobs. Note that with dynamic partitioning the partition column must be the last column selected; SELECT * works here because dt is the last column of the table.
hive> INSERT OVERWRITE TABLE default.demo
> PARTITION (dt)
> SELECT * FROM default.demo WHERE dt BETWEEN '2021-05-01' AND '2021-05-02';
Query ID = root_20210520161711_2d1a65ec-2918-4839-9dc0-70d300dfebac
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1621253192216_0622, Tracking URL = http://hadoop-30.com:18088/proxy/application_1621253192216_0622/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1621253192216_0622
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2021-05-20 16:17:21,390 Stage-1 map = 0%, reduce = 0%
2021-05-20 16:17:37,769 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 15.62 sec
2021-05-20 16:17:43,891 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 19.81 sec
MapReduce Total cumulative CPU time: 19 seconds 810 msec
Ended Job = job_1621253192216_0622
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://hadoop-10.com:8020/user/hive/warehouse/demo/.hive-staging_hive_2021-05-20_16-17-12_005_3392610118785009765-1/-ext-10000
Loading data to table default.demo partition (dt=null)
Time taken to load dynamic partitions: 2.018 seconds
Time taken for adding to write entity : 0.0 seconds
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 19.81 sec HDFS Read: 43010457 HDFS Write: 29956911 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 19 seconds 810 msec
OK
Time taken: 35.549 seconds
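The job summary reports HDFS Read: 43010457 and HDFS Write: 29956911, i.e. about 41 MB read and 28.6 MB written back as merged files (the 28.6 MB matches what the verification query below reads). To confirm the partitions now hold fewer, larger files, rerun the same file count (path again an assumption):

hdfs dfs -count -h /user/hive/warehouse/demo/dt=2021-05-01
hdfs dfs -ls -h /user/hive/warehouse/demo/dt=2021-05-01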
2. Verifying the result
The read throughput is now 23.1 MB/s, roughly 8× the original 2.78 MB/s; the query drops from 1,430 splits to 51 and finishes in about a second:
presto:default> SELECT isp, count(*) as cnt FROM default.demo WHERE dt BETWEEN '2021-05-01' AND '2021-05-02' group by isp limit 10;
isp | cnt
--------------+--------
A | 115230
B | 57819
C | 78038
D | 3525
(4 rows)
Query 20210520_081822_00067_8dpqe, FINISHED, 1 node
Splits: 51 total, 51 done (100.00%)
0:01 [255K rows, 28.6MB] [206K rows/s, 23.1MB/s]
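Because new small files keep arriving, the merge is worth scheduling rather than running by hand. A minimal sketch of a cron-driven wrapper; the script path, date arithmetic, and schedule are all assumptions to adapt to your environment:

#!/bin/bash
# merge_small_files.sh: rewrite yesterday's partition of default.demo
# so that Hive's merge settings compact its small files.
DT=$(date -d "yesterday" +%F)   # e.g. 2021-05-19; GNU date assumed
hive -e "
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
SET hive.merge.size.per.task = 256000000;
SET hive.merge.smallfiles.avgsize = 134217728;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.dynamic.partition = true;
INSERT OVERWRITE TABLE default.demo PARTITION (dt)
SELECT * FROM default.demo WHERE dt = '${DT}';
"
# Example crontab entry, daily at 03:00 (path is an assumption):
# 0 3 * * * /path/to/merge_small_files.sh >> /var/log/merge_small_files.log 2>&1

Overwriting only the most recent closed partition keeps the job cheap and avoids rewriting data that has already been compacted.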