Table of Contents
Spark Load: Importing Hive Data
I. Importing Data from a Non-Partitioned Hive Table with Spark Load
1. On the node3 Hive client, prepare the data to load into the Hive table
2. Start Hive, then create the Hive table and load the data in the Hive client
3. Create a Hive external table in Doris
4. Create the Doris table
5. Create the Spark Load import job
6. Check the Spark Load job
7. Check the result in Doris
II. Importing Data from a Partitioned Hive Table with Spark Load
1. On the node3 Hive client, prepare the data to load into the Hive table
2. Create the partitioned Hive table and load the data
3. Create the partitioned Doris table
4. Create the Spark Load import job
5. Check the Spark Load job
6. Check the result in Doris
Spark Load: Importing Hive Data
I. Importing Data from a Non-Partitioned Hive Table with Spark Load
1. On the node3 Hive client, prepare the data to load into the Hive table
hive_data1.txt:
1,zs,18,100
2,ls,19,101
3,ww,20,102
4,ml,21,103
5,tq,22,104
2. Start Hive, then create the Hive table and load the data in the Hive client
# Configure $HIVE_HOME/conf/hive-site.xml on the Hive server
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
Note: this setting disables metastore schema version verification, which avoids errors when Doris reads the Hive external table.
# Start the Hive metastore on node1
[root@node1 ~]# hive --service metastore &
# On node3, open the Hive client, create the table, and load the data
create table hive_tbl (id int,name string,age int,score int) row format delimited fields terminated by ',';
load data local inpath '/root/hive_data1.txt' into table hive_tbl;
# Check the data in the Hive table
hive> select * from hive_tbl;
1 zs 18 100
2 ls 19 101
3 ww 20 102
4 ml 21 103
5 tq 22 104
3. Create a Hive external table in Doris
When using Spark Load to import data from a non-partitioned Hive table into Doris, first create a Hive external table in Doris, then use Spark Load to load the data of that external table into a Doris table.
# Create the Hive external table in Doris
CREATE EXTERNAL TABLE example_db.hive_doris_tbl
(
id INT,
name varchar(255),
age INT,
score INT
)
ENGINE=hive
properties
(
"dfs.nameservices"="mycluster",
"dfs.ha.namenodes.mycluster"="node1,node2",
"dfs.namenode.rpc-address.mycluster.node1"="node1:8020",
"dfs.namenode.rpc-address.mycluster.node2"="node2:8020",
"dfs.client.failover.proxy.provider.mycluster" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
"database" = "default",
"table" = "hive_tbl",
"hive.metastore.uris" = "thrift://node1:9083"
);
Notes:
- Creating a Hive external table in Doris does not store any data in Doris. Queries against the external table read the data from the corresponding Hive path on HDFS, so after new rows are inserted into the Hive table, querying the external table in Doris also shows the new data (see the short sketch after the query output below).
- If the Hive table is partitioned, simply treat the partition columns as ordinary columns when creating the external table in Doris.
Querying the Hive external table above returns:
mysql> select * from hive_doris_tbl;
+------+------+------+-------+
| id | name | age | score |
+------+------+------+-------+
| 1 | zs | 18 | 100 |
| 2 | ls | 19 | 101 |
| 3 | ww | 20 | 102 |
| 4 | ml | 21 | 103 |
| 5 | tq | 22 | 104 |
+------+------+------+-------+
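As noted above, the external table is only a proxy over the Hive data, so a row added on the Hive side shows up in Doris without any extra load step. A minimal sketch (the new row's values are made up for illustration):
-- In the Hive client: append one more row to the Hive table.
insert into table hive_tbl values (6,'xx',23,105);
-- In the Doris (MySQL) client: the new row is visible through the external table.
select * from hive_doris_tbl where id = 6;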
4. Create the Doris table
# Create the Doris table
create table spark_load_t2(
id int,
name varchar(255),
age int,
score double
)
ENGINE = olap
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(`id`) BUCKETS 8;
5. Create the Spark Load import job
After the Spark Load job is created, it is converted under the hood into a Spark job that needs to connect to Hive while processing the data. Make sure hive-site.xml is present under $SPARK_HOME/conf/ on the Spark client nodes (node1-node3) so that Hive can be found. Connecting to Hive also requires the MySQL connector, so mysql-connector-java-5.1.47.jar must be present under $HADOOP_HOME/share/hadoop/yarn/lib on every Yarn NodeManager node.
# Distribute hive-site.xml from the Hive client to $SPARK_HOME/conf on the Spark client nodes (node1-node3)
[root@node3 ~]# scp /software/hive-3.1.3/conf/hive-site.xml node1:/software/spark-2.3.1/conf/
[root@node3 ~]# scp /software/hive-3.1.3/conf/hive-site.xml node2:/software/spark-2.3.1/conf/
[root@node3 ~]# cp /software/hive-3.1.3/conf/hive-site.xml /software/spark-2.3.1/conf/
# Distribute mysql-connector-java-5.1.47.jar to $HADOOP_HOME/share/hadoop/yarn/lib on every NodeManager node
[root@node3 ~]# cp /software/hive-3.1.3/lib/mysql-connector-java-5.1.47.jar /software/hadoop-3.3.3/share/hadoop/yarn/lib/
[root@node3 ~]# scp /software/hive-3.1.3/lib/mysql-connector-java-5.1.47.jar node4:/software/hadoop-3.3.3/share/hadoop/yarn/lib/
[root@node3 ~]# scp /software/hive-3.1.3/lib/mysql-connector-java-5.1.47.jar node5:/software/hadoop-3.3.3/share/hadoop/yarn/lib/
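The LOAD statement below references a Spark resource named spark1. The original post does not show how that resource was created; the following is only a minimal sketch of such a definition, in which the ResourceManager address, defaultFS, working directory, and broker name/credentials are assumptions that have to be adapted to the actual cluster.
-- Hypothetical sketch: create the Spark resource "spark1" referenced by the LOAD job.
-- All addresses, the working_dir, and the broker settings below are placeholders.
CREATE EXTERNAL RESOURCE "spark1"
PROPERTIES
(
"type" = "spark",
"spark.master" = "yarn",
"spark.submit.deployMode" = "cluster",
"spark.hadoop.yarn.resourcemanager.address" = "node1:8032",
"spark.hadoop.fs.defaultFS" = "hdfs://node1:8020",
"working_dir" = "hdfs://node1:8020/tmp/doris",
"broker" = "broker_name",
"broker.username" = "root",
"broker.password" = ""
);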
Write the Spark Load job as follows:
LOAD LABEL example_db.label2
(
DATA FROM TABLE hive_doris_tbl
INTO TABLE spark_load_t2
)
WITH RESOURCE 'spark1'
(
"spark.executor.memory" = "1g",
"spark.shuffle.compress" = "true"
)
PROPERTIES
(
"timeout" = "3600"
);
6. Check the Spark Load job
Open the Yarn Web UI to check how the corresponding application is running.
Run the following command to check the Spark Load job status:
mysql> show load order by createtime desc limit 1\G;
*************************** 1. row ***************************
         JobId: 37128
         Label: label2
         State: FINISHED
      Progress: ETL:100%; LOAD:100%
          Type: SPARK
       EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=0
      TaskInfo: cluster:spark1; timeout(s):3600; max_filter_ratio:0.0
      ErrorMsg: NULL
    CreateTime: 2023-03-10 18:13:19
  EtlStartTime: 2023-03-10 18:13:34
 EtlFinishTime: 2023-03-10 18:15:27
 LoadStartTime: 2023-03-10 18:15:27
LoadFinishTime: 2023-03-10 18:15:30
           URL: http://node1:8088/proxy/application_1678424784452_0007/
    JobDetails: {"Unfinished backends":{"0-0":[]},"ScannedRows":0,"TaskNumber":1,"LoadBytes":0,"All backends":{"0-0":[-1]},"FileNumber":0,"FileSize":0}
 TransactionId: 24081
  ErrorTablets: {}
1 row in set (0.00 sec)
7. Check the result in Doris
mysql> select * from spark_load_t2;
+------+------+------+-------+
| id | name | age | score |
+------+------+------+-------+
| 5 | tq | 22 | 104 |
| 4 | ml | 21 | 103 |
| 1 | zs | 18 | 100 |
| 3 | ww | 20 | 102 |
| 2 | ls | 19 | 101 |
+------+------+------+-------+
II. Importing Data from a Partitioned Hive Table with Spark Load
To import data from a partitioned Hive table into the corresponding partitioned Doris table, the Hive-external-table approach above cannot be used, because the Hive partition columns are just ordinary columns in the external table. Instead, Spark Load reads the HDFS paths of the partitioned Hive table directly and loads the data into the partitioned Doris table.
1. On the node3 Hive client, prepare the data to load into the Hive table
hive_data2.txt:
1,zs,18,100,2023-03-01
2,ls,19,200,2023-03-01
3,ww,20,300,2023-03-02
4,ml,21,400,2023-03-02
5,tq,22,500,2023-03-02
2. Create the partitioned Hive table and load the data
# On node3, open the Hive client, create the table, and load the data
create table hive_tbl2 (id int, name string,age int,score int) partitioned by (dt string) row format delimited fields terminated by ',';
load data local inpath '/root/hive_data2.txt' into table hive_tbl2;
# Check the data in the Hive table
hive> select * from hive_tbl2;
OK
1 zs 18 100 2023-03-01
2 ls 19 200 2023-03-01
3 ww 20 300 2023-03-02
4 ml 21 400 2023-03-02
5 tq 22 500 2023-03-02
hive> show partitions hive_tbl2;
OK
dt=2023-03-01
dt=2023-03-02
After hive_tbl2 is created, its data is stored on HDFS with one directory per partition, e.g. /user/hive/warehouse/hive_tbl2/dt=2023-03-01/ and /user/hive/warehouse/hive_tbl2/dt=2023-03-02/; one way to confirm these locations from the Hive client is shown below.
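A quick way to verify a partition's storage location without leaving the Hive client (a verification sketch, not part of the original walkthrough) is DESCRIBE FORMATTED with a partition spec:
-- Shows detailed partition information, including the HDFS Location,
-- for the dt=2023-03-02 partition of hive_tbl2.
describe formatted hive_tbl2 partition (dt='2023-03-02');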
3. Create the partitioned Doris table
create table spark_load_t3(
dt date,
id int,
name varchar(255),
age int,
score double
)
ENGINE = olap
DUPLICATE KEY(dt,id)
PARTITION BY RANGE(`dt`)
(
PARTITION `p1` VALUES [("2023-03-01"),("2023-03-02")),
PARTITION `p2` VALUES [("2023-03-02"),("2023-03-03"))
)
DISTRIBUTED BY HASH(`id`) BUCKETS 8;
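In this range-partitioned layout each partition covers a left-closed, right-open interval, so p1 holds rows with dt = 2023-03-01 and p2 holds rows with dt = 2023-03-02. A small check that the partitions were created as intended (a sketch; output omitted):
-- Lists the partitions of spark_load_t3 together with their ranges.
SHOW PARTITIONS FROM spark_load_t3;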
4. Create the Spark Load import job
As in section I, the Spark Load job is converted into a Spark job that connects to Hive while processing the data, so hive-site.xml must be present under $SPARK_HOME/conf/ on the Spark client nodes (node1-node3), and mysql-connector-java-5.1.47.jar must be present under $HADOOP_HOME/share/hadoop/yarn/lib on every Yarn NodeManager node. The distribution commands are the same as in step 5 of section I.
Write the Spark Load job as follows:
LOAD LABEL example_db.label3
(
DATA INFILE("hdfs://node1:8020/user/hive/warehouse/hive_tbl2/dt=2023-03-02/*")
INTO TABLE spark_load_t3
COLUMNS TERMINATED BY ","
FORMAT AS "csv"
(id,name,age,score)
COLUMNS FROM PATH AS (dt)
SET
(
dt=dt,
id=id,
name=name,
age=age
)
)
WITH RESOURCE 'spark1'
(
"spark.executor.memory" = "1g",
"spark.shuffle.compress" = "true"
)
PROPERTIES
(
"timeout" = "3600"
);
Notes:
- The HDFS path above does not support HA mode; the active NameNode node has to be specified manually.
- The partition directory in the HDFS file path must be written out explicitly and cannot be replaced with *, which differs from Broker Load (see the sketch after these notes).
- There is a known issue in the version tested here: when DATA INFILE specifies multiple paths, sometimes only the data from the first path is imported.
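Following the notes above, loading both Hive partitions in one job means listing each partition directory explicitly in DATA INFILE. The statement below is only a sketch: the label label4 is made up, and the multi-path issue mentioned above may still apply.
LOAD LABEL example_db.label4
(
DATA INFILE(
"hdfs://node1:8020/user/hive/warehouse/hive_tbl2/dt=2023-03-01/*",
"hdfs://node1:8020/user/hive/warehouse/hive_tbl2/dt=2023-03-02/*"
)
INTO TABLE spark_load_t3
COLUMNS TERMINATED BY ","
FORMAT AS "csv"
(id,name,age,score)
COLUMNS FROM PATH AS (dt)
SET
(
dt=dt,
id=id,
name=name,
age=age
)
)
WITH RESOURCE 'spark1'
(
"spark.executor.memory" = "1g",
"spark.shuffle.compress" = "true"
)
PROPERTIES
(
"timeout" = "3600"
);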
5. Check the Spark Load job
Run the following command to check the Spark Load job status:
mysql> show load order by createtime desc limit 1\G;
*************************** 1. row ***************************
         JobId: 39432
         Label: label3
         State: FINISHED
      Progress: ETL:100%; LOAD:100%
          Type: SPARK
       EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=3
      TaskInfo: cluster:spark1; timeout(s):3600; max_filter_ratio:0.0
      ErrorMsg: NULL
    CreateTime: 2023-03-10 20:11:19
  EtlStartTime: 2023-03-10 20:11:36
 EtlFinishTime: 2023-03-10 20:12:21
 LoadStartTime: 2023-03-10 20:12:21
LoadFinishTime: 2023-03-10 20:12:22
           URL: http://node1:8088/proxy/application_1678443952851_0026/
    JobDetails: {"Unfinished backends":{"0-0":[]},"ScannedRows":3,"TaskNumber":1,"LoadBytes":0,"All backends":{"0-0":[-1]},"FileNumber":2,"FileSize":60}
 TransactionId: 25529
  ErrorTablets: {}
1 row in set (0.02 sec)
6. Check the result in Doris
mysql> select * from spark_load_t3;
+------------+------+------+------+-------+
| dt | id | name | age | score |
+------------+------+------+------+-------+
| 2023-03-02 | 3 | ww | 20 | 300 |
| 2023-03-02 | 4 | ml | 21 | 400 |
| 2023-03-02 | 5 | tq | 22 | 500 |
+------------+------+------+------+-------+
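Because spark_load_t3 is range-partitioned on dt, a filter on dt lets Doris prune the scan to a single partition; a small follow-up query (the result set is the same three rows shown above, since only the dt=2023-03-02 data was loaded):
-- Only partition p2 (dt = 2023-03-02) needs to be scanned for this predicate.
select * from spark_load_t3 where dt = '2023-03-02';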