Analyzing Data in Hadoop (3): Avro-Tools, Parquet-tools, NiFi
Hadoop
2020-01-15
Avro-Tools
Download the zips Avro file we created earlier:
    hadoop fs -copyToLocal /user/hive/warehouse/zips/506c2c87-39cb-437e-bd2b-b898fca06e19.avro
    mv 506c2c87-39cb-437e-bd2b-b898fca06e19.avro zips.avro
A few commonly used commands:
    avro-tools tojson zips.avro
    avro-tools getmeta zips.avro
    avro-tools getschema zips.avro
Generate Java source files from a schema:
    avro-tools compile schema evelope.avsc ./
Parquet
Download a Parquet data file:
    hadoop fs -ls /user/hive/warehouse/ratings/year=2015/month=02
    hadoop fs -copyToLocal /user/hive/warehouse/ratings/year=2015/month=02/43de224b-547e-4521-b36f-e0a2b49c8b61.parquet
Commonly used parquet-tools commands (file.parquet stands for the downloaded file):
    parquet-tools schema file.parquet
    parquet-tools head file.parquet
    parquet-tools cat file.parquet
    parquet-tools meta file.parquet
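Outputting the data as JSON
A future version of Parquet is expected to have a -j option that prints the data directly as JSON. Until then, Kite can be used as a workaround.
1. Dump the metadata:
    parquet-tools meta file.parquet > meta.text
2. Copy the schema from the metadata into a separate file (rating-from-qarquet.avsc in the next step).
3. Create the dataset. In the command, dataset:file can be changed to dataset:hive, depending on where you want to store the data.
    kite-dataset create dataset:file:datasets/parquet_ratings --schema rating-from-qarquet.avsc --format parquet
4. Produce the JSON output:
    mv file.parquet datasets/parquet_ratings/
    kite-dataset show dataset:file:datasets/parquet_ratings/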
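Step 2 can be scripted when the metadata dump carries the Avro schema on a single line. The sketch below is only an illustration. It assumes a line of the form `extra: avro.schema = {...}`, which is not guaranteed for every parquet-tools version or for files written without Avro. The heredoc is a hypothetical stand-in for meta.text, and the output goes to a temporary file rather than to rating-from-qarquet.avsc.

```shell
#!/bin/bash
# Hypothetical stand-in for meta.text; real parquet-tools output varies by version.
meta=$(mktemp)
cat > "$meta" <<'EOF'
creator:     parquet-mr
extra:       avro.schema = {"type":"record","name":"ratings","fields":[{"name":"userId","type":"int"}]}

file schema: ratings
EOF

# Print only the text after "avro.schema = " from the first matching line.
schema=$(sed -n 's/^extra:.*avro\.schema = //p' "$meta" | head -n 1)

# Assumed output location; in the steps above this file is rating-from-qarquet.avsc.
out=$(mktemp)
printf '%s\n' "$schema" > "$out"
cat "$out"
rm -f "$meta"
```

If the sed command prints nothing, the file probably has no embedded Avro schema, and the schema has to be written by hand from the `parquet-tools schema` output.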
Some ETL
1. Export data from Hadoop, compressed by year and month
    #!/bin/bash
    # Earliest rating year and month per movie, as tab-separated text (-B).
    impala-shell -B -i localhost -q 'select min(r.year), min(r.month), r.movieId, m.title, m.genres from ratings as r, movies as m where r.movieId = m.movieId group by r.movieId, m.title, m.genres order by min(r.year), min(r.month)' > /tmp/first-rated.tsv

    for year in $(seq 2013 2015); do
      for month in $(seq 1 12); do
        mkdir $year-$month
        # Compare whole fields so that month 1 does not also match months 10 to 12.
        awk -F'\t' -v y="$year" -v m="$month" '$1 == y && $2 == m' /tmp/first-rated.tsv > $year-$month/movies.tsv
        impala-shell -B -i localhost -q "select userId, movieId, rating, \`timestamp\` from ratings where year = $year and month = $month" | grep -v '^$' > $year-$month/ratings.tsv
        # Only archive months that actually contain ratings.
        if [ -s $year-$month/ratings.tsv ]; then
          tar czf $year-$month.tar.gz $year-$month/
        fi
        rm -rf $year-$month/
      done
    done
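When the exported TSV is filtered by year and month, whole-field matching matters: a prefix pattern for month 1 is also a prefix of months 10, 11 and 12. The sketch below shows the difference on a few hypothetical rows in a reduced year, month, movieId, title layout; the rows and titles are made up for illustration.

```shell
#!/bin/bash
# Hypothetical tab-separated rows: year, month, movieId, title.
tsv=$(mktemp)
printf '2013\t1\t101\tMovie A\n'  > "$tsv"
printf '2013\t10\t102\tMovie B\n' >> "$tsv"
printf '2013\t12\t103\tMovie C\n' >> "$tsv"
printf '2014\t1\t104\tMovie D\n'  >> "$tsv"

year=2013
month=1
tab=$(printf '\t')

# Prefix match: "2013<TAB>1" is also a prefix of "2013<TAB>10" and "2013<TAB>12".
prefix_hits=$(grep -c "^${year}${tab}${month}" "$tsv")

# Whole-field match: compare columns 1 and 2 exactly.
exact_hits=$(awk -F'\t' -v y="$year" -v m="$month" '$1 == y && $2 == m' "$tsv" | wc -l | tr -d ' ')

echo "prefix: $prefix_hits, exact: $exact_hits"
rm -f "$tsv"
```

The prefix pattern counts three rows for January 2013, while the field comparison counts one.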
2. Count how often each value occurs in a column
    tail -n 5000 ratings.csv | awk -F',' '{print $2}' | sort | uniq -c
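The same pipeline can be tried on a small sample to see what it produces. This is a minimal sketch: the rows are hypothetical, the userId,movieId,rating,timestamp column order is an assumption, and `count_column` is a helper name introduced only for this example. An extra sort puts the most frequent values first.

```shell
#!/bin/bash
# Hypothetical sample, assumed column order: userId,movieId,rating,timestamp.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1,10,4.0,1425941529
2,10,3.5,1425941530
3,20,5.0,1425941531
4,10,2.0,1425941532
5,20,4.5,1425941533
6,30,1.0,1425941534
EOF

# count_column FILE N: print "count value" for column N, most frequent first.
count_column() {
  awk -F',' -v col="$2" '{print $col}' "$1" | sort | uniq -c | sort -k1,1nr -k2,2
}

counts=$(count_column "$sample" 2)
echo "$counts"
rm -f "$sample"
```

`uniq -c` only merges adjacent duplicates, which is why the values are sorted before counting.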
3. Processing data flows with Apache NiFi
Download NiFi:
    wget http://apache.fayea.com/nifi/0.5.1/nifi-0.5.1-bin.tar.gz
Official documentation: [http://nifi.apache.org/docs.html](http://nifi.apache.org/docs.html "http://nifi.apache.org/docs.html")