This website requires JavaScript.

Hadoop:Sqoop 增量导入

注: 本文参考https://weidongzhou.wordpress.com的两篇文章然后做了下合并

一、从Hive导出数据到RDBMS

sqoop export \
–connect jdbc:oracle:thin:@enkx3-scan:1521:dbm1 \
–username wzhou \
–password wzhou \
–direct \
–export-dir ‘/user/hive/warehouse/test_oracle.db/my_all_objects_sqoop’ \
–table WZHOU.TEST_IMPORT_FROM_SCOOP \
–fields-terminated-by ‘\001’
注意: –export-dir  是hive的目录不特指某个文件.

二、增量把数据导入Hive

1.建立源表

create table student
(
student_id int not null,
student_name varchar(20) not null,
 major varchar(20),
 CONSTRAINT student_pk PRIMARY KEY (student_id)
 );
 insert into student values ( 1, 'student1', 'math' );
 insert into student values ( 2, 'student2', 'computer' );
 insert into student values ( 3, 'student3', 'math' );
 insert into student values ( 4, 'student4', 'accounting' );

2.创建导入命令

sqoop import \
--driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" \
--connect="jdbc:sqlserver://xxx.xxx.xxx.xxxx;database=xxx;"\
--username  xxx\
 --password xxx\
 --table STUDENT \
 --incremental append \
 --check-column student_id \
 -m 4 \
 --split-by major
**check-column **指定那列在导入的时候进行检测 ,列的类型不能为*CHAR,比如VARCHAR2或者CHAR. **incremental **参数有两种模式 append 和 lastmodified. Lastmodified 参数通常使用timestamp **Last-value  **指定最后一个值,大于这个值的记录会被导入.当然也可以用一下方法获取 –last-value $($HIVE_HOME/bin/hive -e “select max(idcolumn) from tablename”)

注意: 如果看到以下错误,在—table 参数后面写上大写的表名 ERROR tool.ImportTool: Imported Failed: There is no column found in the target table all_objects_inc_test. Please ensure that your table name is correct.

3.建立外部表

USE default;
CREATE EXTERNAL TABLE student_ext (
student_id string,
student_name string,
major string
 )
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 LINES TERMINATED BY '\n'
 LOCATION '/user/hdfs/STUDENT';

4.源表上插入新数据

insert into student values ( 5,'student3','computer');
insert into student values ( 6,'student4','math');
insert into student values ( 7,'student5','computer');

5. 再执行一次Sqoop命令

注意增加 --last-value 这个参数

sqoop import 
--driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
--connect="jdbc:sqlserver://xxx.xxx.xxx.xxxx;database=xxx;"
--username xxx
--password xxx
--table STUDENT
--incremental append
--check-column student_id
--last-value 4
-m 4
--split-by major
Impala 查看,执行以下命令更新元数据 invalidate metadata

三、建立Sqoop Job

我们可以通过Sqoop建立Job这样操作比较方便

sqoop job. sqoop job –list sqoop job –show student_job sqoop job –exec student_job sqoop job –delete student_job

1. 建立一个 Sqoop job

sqoop job \
--create student_job \
-- import \
--driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" \
--connect="jdbc:sqlserver://xxx.xxx.xxx.xxx;database=xxx;"\
--username  xxx\
 --password xxx\
 --table STUDENT \
 --incremental append \
 --check-column student_id \
 --last-value 4 \
 -m 4 \
 --split-by major

2.操作Job

sqoop job --list
sqoop job --show student_job
sqoop job --exec student_job
sqoop job --delete student_job
 

参考

Use incremental import in sqoop to load data from Oracle (Part I) Use incremental import in sqoop to load data from Oracle (Part II)

0条评论
avatar