Transferring Tables Between Hive and MySQL with DataX
1. DataX Overview (excerpted from the alibaba/DataX project README)
DataX
DataX is an offline data synchronization tool/platform widely used within Alibaba Group. It implements efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, SqlServer, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), DRDS, and more.
Features
As a data synchronization framework, DataX abstracts the synchronization of different data sources into Reader plugins, which read data from the source, and Writer plugins, which write data to the target. In principle the DataX framework can therefore synchronize any type of data source. The plugin system also works as an ecosystem: once a new data source is connected, it can immediately exchange data with every existing source.
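Every DataX job, whatever the plugins, follows the same JSON skeleton: a "setting" block for global options such as channel concurrency, and a "content" array that pairs one reader with one writer. The sketch below only illustrates that shape; "some-reader" and "some-writer" are placeholders, not real plugin names, and real plugins need plugin-specific parameters:

    {
        "job": {
            "setting": {
                "speed": { "channel": 1 }
            },
            "content": [
                {
                    "reader": { "name": "some-reader", "parameter": {} },
                    "writer": { "name": "some-writer", "parameter": {} }
                }
            ]
        }
    }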
For a detailed introduction to DataX, see: DataX-Introduction
Quick Start
Download DataX from the download address linked in the README.
Support Data Channels
DataX already has a fairly comprehensive plugin ecosystem: mainstream RDBMS databases, NoSQL stores, and big-data computing systems are all connected. (The supported-channels matrix is an image in the original post and is not reproduced here.)
2.1 Importing a Hive Table into MySQL
1. Download DataX and upload it to the server.
2. After extracting it, run the self-check script:

    python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/job.json

3. Once the self-check runs cleanly, create your own job configuration file based on the template and your needs. Here we use hive2mysql as the example. (Note that reading a Hive table really means reading the files Hive stores on HDFS.)

    {
        "job": {
            "setting": {
                "speed": {
                    "channel": 3
                }
            },
            "content": [
                {
                    "reader": {
                        "name": "hdfsreader",
                        "parameter": {
                            "path": "/apps/hive/warehouse/test_table/*",
                            "defaultFS": "hdfs://192.168.10.101:8020",
                            "column": [
                                { "index": 0, "type": "string" },
                                { "index": 1, "type": "long" }
                            ],
                            "fileType": "text",
                            "encoding": "UTF-8",
                            "fieldDelimiter": "\u0001"
                        }
                    },
                    "writer": {
                        "name": "mysqlwriter",
                        "parameter": {
                            "writeMode": "insert",
                            "username": "username",
                            "password": "password",
                            "column": ["word", "cnt"],
                            "session": ["set session sql_mode='ANSI'"],
                            "preSql": ["delete from test_table"],
                            "connection": [
                                {
                                    "jdbcUrl": "jdbc:mysql://192.168.10.116:3306/test_datax?useUnicode=true&characterEncoding=gbk",
                                    "table": ["test_table"]
                                }
                            ]
                        }
                    }
                }
            ]
        }
    }

4. Run the job.
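Assuming the configuration above was saved as hive2mysql.json in the job directory (the file name here is an assumption; the original post omits the exact command), the invocation mirrors the self-check:

    python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/hive2mysql.json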
    [root@Master datax]# (startup banner and machine info abridged)
    2018-08-29 23:19:46.691 [main] INFO  Engine - the machine info => (OS, CPU, memory, and GC details omitted)
    2018-08-29 23:20:00.489 [job-0] INFO  JobContainer - [total cpu info] and [total gc info] (omitted)
    2018-08-29 23:20:00.490 [job-0] INFO  JobContainer - PerfTrace not enable!
    2018-08-29 23:20:00.499 [job-0] INFO  StandAloneJobContainerCommunicator - Total 939 records, 31267 bytes | Speed 3.05KB/s, 93 records/s | Error 0 records, 0 bytes
    2018-08-29 23:20:00.500 [job-0] INFO  JobContainer -
    Job start time      : 2018-08-29 23:19:46
    Job end time        : 2018-08-29 23:20:00
    Total elapsed time  : 13s
    Average throughput  : 3.05KB/s
    Record write speed  : 93 rec/s
    Records read        : 939
    Read/write failures : 0
**Special note: the data Hive stores on HDFS is delimited with '\001' by default. If you open such a file in vim you will see the fields separated by ^A, yet the delimiter must be configured as "fieldDelimiter": "\u0001"!**
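To see that delimiter with your own eyes before configuring hdfsreader, you can dump a few lines of the table's files with cat -A, which prints \001 visibly as ^A. A quick check, assuming the files are uncompressed text (path taken from the job above):

    hdfs dfs -cat /apps/hive/warehouse/test_table/* | head -n 5 | cat -A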
2.2 Importing a MySQL Table into Hive
1. Create the table in Hive (stored as a text file):
    hive> create table mysql_table(word string, cnt int) row format delimited fields terminated by ',' STORED AS TEXTFILE;
    OK
    Time taken: 0.194 seconds
    hive> select * from mysql_table limit 10;
    OK
    Time taken: 0.162 seconds
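This part reads from the same MySQL table test_table that part 2.1 wrote into; its schema is never shown in the post. Matching the "column": ["word", "cnt"] lists used throughout, a plausible definition on the MySQL side would be (column types are an assumption):

    create table test_table (
        word varchar(255),
        cnt  int
    );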
{"job":{"setting":{"speed":{"channel":3},"errorLimit":{"record":0,"percentage":0.02}},"content":[{"reader":{"name":"mysqlreader","parameter":{"writeMode":"insert","username":"root","password":"123456","column":["word","cnt"],"splitPk":"cnt","connection":[{"table":["test_table"],"jdbcUrl":["jdbc:mysql://192.168.10.116:3306/test_datax"]}]}},"writer":{"name":"hdfswriter","parameter":{"defaultFS":"hdfs://192.168.10.101:8020","fileType":"text","path":"/apps/hive/warehouse/mysql_table","fileName":"mysql_table","column":[{"name":"word","type":"string"},{"name":"cnt","type":"int"}],"writeMode":"app","fieldDelimiter":",","compress":"gzip"}}}]}}3.运行脚本
    [root@Master datax]# (startup banner and machine info abridged)
    2018-08-30 01:03:36.415 [main] INFO  Engine - the machine info => (OS, CPU, memory, and GC details omitted)
    2018-08-30 01:03:52.094 [job-0] INFO  JobContainer - [total cpu info] and [total gc info] (omitted)
    2018-08-30 01:03:52.094 [job-0] INFO  JobContainer - PerfTrace not enable!
    2018-08-30 01:03:52.096 [job-0] INFO  StandAloneJobContainerCommunicator - Total 939 records, 19462 bytes | Speed 1.90KB/s, 93 records/s | Error 0 records, 0 bytes
    2018-08-30 01:03:52.101 [job-0] INFO  JobContainer -
    Job start time      : 2018-08-30 01:03:36
    Job end time        : 2018-08-30 01:03:52
    Total elapsed time  : 15s
    Average throughput  : 1.90KB/s
    Record write speed  : 93 rec/s
    Records read        : 939
    Read/write failures : 0
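Before querying Hive, you can optionally confirm that hdfswriter produced compressed files under the target path (the path is taken from the config above):

    hdfs dfs -ls /apps/hive/warehouse/mysql_table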
4. Check whether the Hive table now contains data:
    hive> select * from mysql_table limit 10;
    OK
    (rows omitted)
    Time taken: 0.148 seconds, Fetched: 10 row(s)
    hive> select * from mysql_table order by cnt desc limit 10;
    Query ID = hdfs_200_68021b2b-9bd8-44a0-92df-dbe0c717053a
    Total jobs = 1
    Status: Running (Executing on YARN cluster with App id application_30_0005)
    VERTICES: 02/02 [==========================] 100%  ELAPSED TIME: 17.63 s
    OK
    (rows omitted)
    Time taken: 33.713 seconds, Fetched: 10 row(s)

1. Errors you may encounter
    2018-08-29 22:37:54.327 [job-0] Code:[DBUtilErrorCode-01], Description:[Failed to obtain table column information.] - Failed to fetch column metadata for table test_table. Please contact your DBA to verify the database and table. - Unknown column 'id' in 'field list'
Cause: the column names defined in "column": [] do not match the columns in MySQL.
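Before fixing the config, you can list the table's real column names with a plain MySQL statement (nothing DataX-specific); every entry in "column": [] must appear in this output:

    mysql> desc test_table;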
According to DataX's intelligent analysis, the most likely cause of the job failure is: Code:[UnstructuredStorageReader-03], Description:[The parameter value you entered is invalid.] - Only single-character field delimiters are supported, but the configured delimiter is: ['01']
Cause: a delimiter problem; see the special note above.
2. MysqlWriter type-conversion table for MySQL
3. MysqlReader type-conversion table for MySQL
(The conversion tables themselves are not reproduced here; see the MysqlWriter and MysqlReader plugin documentation.) Please note:
Apart from the field types listed there, no other type is supported.
DataX treats tinyint(1) as an integer type.
DataX treats year as a string type.
bit is undefined behavior for DataX.