[Translated] Enabling Compression in Hive

File compression brings two main benefits: it reduces the storage space that data occupies, and it cuts down the amount of data moved over the network or read from disk, which improves Hive query performance. This article describes how to use compression in Hive.

Find Available Compression Codecs in Hive

hive> set io.compression.codecs;
io.compression.codecs=
      org.apache.hadoop.io.compress.GzipCodec,
      org.apache.hadoop.io.compress.DefaultCodec,
      org.apache.hadoop.io.compress.BZip2Codec,
      org.apache.hadoop.io.compress.SnappyCodec
hive>
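This list comes from the Hadoop configuration: the io.compression.codecs property in core-site.xml (or Hadoop's built-in defaults). As a rough sketch, the corresponding entry would look like the following; the exact set of codecs depends on your distribution and on which native libraries are installed.

  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>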

Enable Compression on Intermediate Data

A complex query is compiled into a series of MapReduce stages, and the intermediate files produced by one stage are consumed as input by the next MapReduce job. Compression of these intermediate files is enabled with hive.exec.compress.intermediate, which can be set either with the SET command in the Hive shell or in hive-site.xml.

  <property>
    <name>hive.exec.compress.intermediate</name>
    <value>true</value>
    <description>
      This controls whether intermediate files produced by Hive between multiple map-reduce jobs are compressed. 
      The compression codec and other options are determined from Hadoop config variables mapred.output.compress*
    </description>
  </property>
  <property>
    <name>hive.intermediate.compression.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    <description/>
  </property>
  <property>
    <name>hive.intermediate.compression.type</name>
    <value>BLOCK</value>
    <description/>
  </property>

From the command line:

hive> set hive.exec.compress.intermediate=true;
hive> set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> set hive.intermediate.compression.type=BLOCK;
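To see intermediate compression take effect, run a query that compiles into more than one MapReduce job; the files handed from the first job to the second are then written with the Snappy codec. A minimal sketch against the sample table used later in this post (the column name designation is an assumption; adjust it to your schema). A GROUP BY followed by an ORDER BY typically compiles into two MapReduce stages:

hive> SELECT designation, count(*) FROM testemp GROUP BY designation ORDER BY designation;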

Enable Compression on Final Output

Compression can also be enabled for the final output of a query:

  <property>
    <name>hive.exec.compress.output</name>
    <value>true</value>
    <description>
      This controls whether the final outputs of a query (to a local/HDFS file or a Hive table) is compressed. 
      The compression codec and other options are determined from Hadoop config variables mapred.output.compress*
    </description>
  </property>

hive> set hive.exec.compress.output=true;
hive> set mapreduce.output.fileoutputformat.compress=true;
hive> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
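These settings apply to anything a query writes out, not just managed tables. As a rough illustration (the /tmp/testemp_export path is made up for this example), exporting query results to a directory should produce gzip part files:

hive> INSERT OVERWRITE DIRECTORY '/tmp/testemp_export' SELECT * FROM testemp;
hive> dfs -ls /tmp/testemp_export;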

Example Table Creation with Compression Enabled

The following SQL creates a new compressed table, compressed_emp, populated from the existing table testemp.

1. Contents of the source table

hive> select * from testemp;
OK
123	Ram	Team Lead
345	Siva	Member
678	Krishna	Member
Time taken: 0.096 seconds, Fetched: 3 row(s)

2. Set the compression properties in the Hive shell

hive> set hive.exec.compress.output=true;
hive> set mapreduce.output.fileoutputformat.compress=true;
hive> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
hive> set hive.exec.compress.intermediate=true;

3. Create the target table compressed_emp

hive> CREATE TABLE compressed_emp ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > AS SELECT * FROM testemp;
Query ID = hadoop1_20150502214545_8eb6915e-8d0d-4109-b743-cb6505dfa26b
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1430579477500_0002, Tracking URL = http://localhost:8088/proxy/application_1430579477500_0002/
Kill Command = /usr/lib/hadoop/hadoop-2.3.0/bin/hadoop job  -kill job_1430579477500_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-05-02 21:45:28,983 Stage-1 map = 0%,  reduce = 0%
2015-05-02 21:45:34,437 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.19 sec
MapReduce Total cumulative CPU time: 1 seconds 190 msec
Ended Job = job_1430579477500_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://localhost:9000/tmp/mydir/hadoop1/af416484-b7cb-4036-bfbd-5942c09fcfd9/hive_2015-05-02_21-45-19_676_3039038823977237962-1/-ext-10001
Moving data to: hdfs://localhost:9000/user/hive/warehouse/compressed_emp
Table default.compressed_emp stats: [numFiles=1, numRows=3, totalSize=66, rawDataSize=50]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.19 sec   HDFS Read: 268 HDFS Write: 144 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 190 msec
OK
Time taken: 16.422 seconds
hive> dfs -ls -R /user/hive/warehouse/compressed_emp;
-rw-r--r--   1 hadoop1 supergroup         66 2015-05-02 21:45 /user/hive/warehouse/compressed_emp/000000_0.gz
hive> dfs -cat /user/hive/warehouse/compressed_emp/000000_0.gz;
(The raw gzip bytes are unreadable binary; the garbled output is omitted here because it interfered with the site's search and RSS feed.)
hive> dfs -text /user/hive/warehouse/compressed_emp/000000_0.gz;
123	Ram	Team Lead
345	Siva	Member
678	Krishna	Member
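Reading the compressed table back requires no extra settings: Hadoop's text input format recognizes the .gz extension and decompresses on the fly, so querying compressed_emp should return the same three rows as testemp:

hive> select * from compressed_emp;
OK
123	Ram	Team Lead
345	Siva	Member
678	Krishna	Member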

Reference

Enable Compression in Hive
