
Analyzing Data in Hadoop (Part 3): Avro-Tools, Parquet-tools, NiFi

Avro-Tools

Download the Avro file for the zips table we created earlier, then rename it:

hadoop fs -copyToLocal /user/hive/warehouse/zips/506c2c87-39cb-437e-bd2b-b898fca06e19.avro
mv 506c2c87-39cb-437e-bd2b-b898fca06e19.avro zips.avro

A few commonly used subcommands:

avro-tools tojson zips.avro
avro-tools getmeta zips.avro
avro-tools getschema zips.avro

Generate Java files from a schema

avro-tools compile schema evelope.avsc ./
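
For readers who have not seen an .avsc file, here is a minimal sketch of the kind of input `avro-tools compile schema` expects. The schema is hypothetical: the record name `ZipEntry`, the namespace `example.avro`, the field list, and the file name `zip_entry.avsc` are all assumptions made for illustration and are not the schema used in this post. The compile step runs only if `avro-tools` is on the PATH.

```shell
#!/bin/bash
# Write a small, hypothetical Avro record schema to a scratch directory
workdir=$(mktemp -d)
cat > "$workdir/zip_entry.avsc" <<'EOF'
{
  "type": "record",
  "name": "ZipEntry",
  "namespace": "example.avro",
  "fields": [
    {"name": "zip",   "type": "string"},
    {"name": "city",  "type": "string"},
    {"name": "state", "type": "string"},
    {"name": "pop",   "type": ["null", "long"], "default": null}
  ]
}
EOF

# Generate Java classes only when avro-tools is available
if command -v avro-tools > /dev/null 2>&1; then
  avro-tools compile schema "$workdir/zip_entry.avsc" "$workdir/java"
  find "$workdir/java" -name '*.java'
else
  echo "avro-tools not found; schema written to $workdir/zip_entry.avsc"
fi
```

The second argument to `compile schema` is the output directory; the generated classes are placed in subdirectories that follow the schema's namespace.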

Parquet

Download a Parquet data file and save it locally as file.parquet:

hadoop fs -ls /user/hive/warehouse/ratings/year=2015/month=02
hadoop fs -copyToLocal /user/hive/warehouse/ratings/year=2015/month=02/43de224b-547e-4521-b36f-e0a2b49c8b61.parquet file.parquet

parquet-tools schema file.parquet
parquet-tools head file.parquet
parquet-tools cat file.parquet
parquet-tools meta file.parquet

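Output the data as JSON

A future version of parquet-tools is expected to add a -j option that prints the data directly as JSON. Until then, Kite can be used as a workaround.

1. Dump the metadata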

parquet-tools meta file.parquet > meta.text

2. Copy the schema into a separate file


3. Create the dataset

In the command below, dataset:file can be replaced with dataset:hive, depending on where you want the dataset stored.

kite-dataset create dataset:file:datasets/parquet_ratings --schema rating-from-qarquet.avsc --format parquet

4. Output as JSON

mv file.parquet datasets/parquet_ratings/
kite-dataset show dataset:file:datasets/parquet_ratings/

Some ETL

1. Export data from Hadoop, compressed by year and month

#!/bin/bash

impala-shell -B -i localhost -q 'select min(r.year), min(r.month), r.movieId, m.title, m.genres from ratings as r, movies as m where r.movieId = m.movieId group by r.movieId, m.title, m.genres order by min(r.year), min(r.month)' > /tmp/first-rated.tsv

for year in $(seq 2013 2015);
do
   for month in $(seq 1 12);
   do
     dir="$year-$month"
     mkdir -p "$dir"
     # impala-shell -B output is tab-delimited; match year and month as whole
     # fields so that month 1 does not also match 10, 11 and 12
     grep "^$year"$'\t'"$month"$'\t' /tmp/first-rated.tsv > "$dir/movies.tsv"
     impala-shell -B -i localhost -q "select userId, movieId, rating, \`timestamp\` from ratings where year = $year and month = $month" | grep -v '^$' > "$dir/ratings.tsv"
     # Archive only the months that actually contain ratings
     if [ -s "$dir/ratings.tsv" ]; then
       tar czf "$dir.tar.gz" "$dir/"
     fi
     rm -rf "$dir/"
   done
done
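
The per-month filtering step can be tried without a cluster. The sketch below assumes the exported TSV is tab-delimited with year and month as its first two fields, which is what the query above produces. It uses awk field comparison as an alternative to grep; the function name `rows_for_month` and the sample rows are made up for illustration.

```shell
#!/bin/bash
# Select the rows of a tab-delimited file whose first two fields are YEAR and MONTH.
# Usage: rows_for_month FILE YEAR MONTH
rows_for_month() {
  local file="$1" year="$2" month="$3"
  # Comparing whole fields keeps month 1 from also matching 10, 11 and 12
  awk -F'\t' -v y="$year" -v m="$month" '$1 == y && $2 == m' "$file"
}

# Hypothetical sample rows: year, month, movieId, title, genres
tsv=$(mktemp)
printf '2014\t1\t10\tMovie A\tDrama\n2014\t11\t20\tMovie B\tComedy\n2014\t1\t30\tMovie C\tAction\n' > "$tsv"

rows_for_month "$tsv" 2014 1
```

With this sample, the call prints the two January rows and skips the November one, which a plain prefix match on "2014, then 1" would have included.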

2. Count how many times each value occurs in a column

tail -n 5000 ratings.csv | awk -F',' '{print $2}' | sort | uniq -c
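
The same count can be done in a single awk pass and sorted so that the most frequent values come first. This is a minimal sketch: the function name `count_column_values` and the sample rows are hypothetical, and the sample only imitates a userId,movieId,rating,timestamp layout.

```shell
#!/bin/bash
# Count how often each value appears in one column of a CSV file.
# Usage: count_column_values FILE COLUMN_NUMBER
count_column_values() {
  local file="$1" col="$2"
  # awk tallies each value; sort puts the highest counts first, ties ordered by value
  awk -F',' -v c="$col" '{ n[$c]++ } END { for (v in n) print n[v], v }' "$file" \
    | sort -k1,1nr -k2,2
}

# Hypothetical sample rows: userId,movieId,rating,timestamp
sample=$(mktemp)
printf '%s\n' '1,10,4.0,100' '2,10,3.5,101' '3,20,5.0,102' \
              '4,10,2.0,103' '5,20,1.0,104' '6,30,4.5,105' > "$sample"

count_column_values "$sample" 2
```

Each output line is a count followed by the value, so for this sample movieId 10 is listed first with a count of 3. Unlike the `sort | uniq -c` pipeline, the input does not need to be sorted before counting.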

3. Process data flows with Apache NiFi

Download NiFi

wget http://apache.fayea.com/nifi/0.5.1/nifi-0.5.1-bin.tar.gz

Official documentation: [http://nifi.apache.org/docs.html](http://nifi.apache.org/docs.html "http://nifi.apache.org/docs.html")