
Notes on Day-to-Day Hadoop Problems

1. HDFS permission problems

例如 Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x

The /user directory is owned by "hdfs" with 755 permissions, so only hdfs can write to it. Unlike on Unix/Linux, the HDFS superuser is hdfs, not root. So you need to do this:

sudo -u hdfs hadoop fs -mkdir /user/<dir>
sudo -u hdfs hadoop fs -put myfile.txt /user/<dir>/
If you want to create a home directory for root so root can store files there, run:

sudo -u hdfs hadoop fs -mkdir /user/root
sudo -u hdfs hadoop fs -chown root /user/root

Then, as root, you can run hadoop fs -put file /user/root/.
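The same pattern works for any non-superuser account. A minimal sketch, where the username alice and the file data.txt are hypothetical:

```shell
# 'alice' and 'data.txt' are hypothetical; run on a host with Hadoop client configs
sudo -u hdfs hadoop fs -mkdir /user/alice           # hdfs superuser creates the home directory
sudo -u hdfs hadoop fs -chown alice /user/alice     # hand ownership to the target user
sudo -u alice hadoop fs -put data.txt /user/alice/  # the user can now write there
```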

2. Spark runtime bug

Running a command in spark-shell produces the following error:

java.io.IOException: Cannot run program "/etc/hadoop/conf.cloudera.yarn/topology.py" (in directory "/home/108857"): error=2.

This is arguably a Cloudera bug. Copy /etc/hadoop/conf.cloudera.yarn/topology* from a DataNode to the machine where spark-shell runs, and the error goes away.
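One way to copy the files over SSH; the hostname datanode01 is hypothetical:

```shell
# 'datanode01' is a hypothetical DataNode hostname that has the topology script
sudo mkdir -p /etc/hadoop/conf.cloudera.yarn
sudo scp datanode01:/etc/hadoop/conf.cloudera.yarn/topology* /etc/hadoop/conf.cloudera.yarn/
```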

3. Hive bulk loads into dynamic partitions cause out-of-memory errors

This is also bug-level behavior; there are two ways to handle it.

Option 1: enable hive.optimize.sort.dynamic.partition to reduce reducer memory usage

set hive.optimize.sort.dynamic.partition=true;
set hive.exec.max.dynamic.partitions.pernode=100000;
set hive.exec.max.dynamic.partitions=100000;
See also: Shuffle Hive data before storing in Parquet

Option 2: increase memory settings

 set mapred.map.tasks=100;
 set mapred.reduce.tasks=100;
 set mapreduce.map.java.opts=-Xmx4096m;
 set mapreduce.reduce.java.opts=-Xmx4096m;
 set hive.exec.max.dynamic.partitions.pernode=100000;
 set hive.exec.max.dynamic.partitions=100000;

See also:
- Hive - Out of Memory Exception - Java Heap Space
- Unable to insert into a dynamic partition parquet table
- hive dynamic partitions insert java.lang.OutOfMemoryError: Java heap space
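For context, a dynamic-partition insert that these settings apply to might look like the following sketch; the table and column names are assumptions:

```shell
# Hypothetical tables/columns; Option 1 settings applied in the same session
hive -e "
set hive.optimize.sort.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.dynamic.partitions.pernode=100000;
INSERT OVERWRITE TABLE events_parquet PARTITION (dt)
SELECT id, payload, dt FROM events_staging;
"
```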

4. Hive Parquet time zone problem

If a table's Parquet files were written by Hive, the timestamps come back with a time zone offset when Impala reads them.

Solution 1:

Convert the time zone in the Impala SELECT, e.g. from_utc_timestamp(recordtime, "HKT"), or use another time function such as hours_add.
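A sketch of such a query; the table events and column recordtime are hypothetical:

```shell
# 'events' and 'recordtime' are hypothetical; converts the stored UTC value to Hong Kong time
impala-shell -q "SELECT from_utc_timestamp(recordtime, 'HKT') AS local_time FROM events LIMIT 10;"
```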

Solution 2:

Set a flag so Impala converts automatically: in the Impala service configuration, add --convert_legacy_hive_parquet_utc_timestamps=true to Impala Daemon Command Line Argument Advanced Configuration Snippet (Safety Valve).

5. Linux system problems

http://mirror.centos.org/centos/6/SCL/x86_64/repodata/repomd.xml: [Errno 14] PYCURL ERROR 22 - "The requested URL returned error: 404 Not Found"

yum remove centos-release-SCL
yum install centos-release-scl

References:

- List of time zone abbreviations
- TIMESTAMP Data Type
- Timestamp stored in Parquet file format in Impala Showing GMT Value
