Hive and DataX row counts do not match
While reconciling the data I noticed a mismatch: in Hive there are only 1033 rows with staff_id = 'DF67B3FC-02DD-4142-807A-DF4A75A4A22E',
while in MySQL there are 4783 rows for the same staff_id (yesterday's count was 4781).
Even allowing for the lag of an offline extraction, the gap should never be this large, so something was clearly wrong.
Cause:
In DataX I modified the cn_attendance_day_print job file so that it extracts only the rows with staff_id = 'DF67B3FC-02DD-4142-807A-DF4A75A4A22E',
and the extraction did bring over exactly 4781 rows,
so the DataX-to-HDFS link is correct.
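For reference, this kind of test filter can be expressed with the mysqlreader "where" parameter. The fragment below is only a sketch, not the original job: it shows just the reader side, the column list is reduced to "*", and the connection details are copied from the example further down.

{
  "reader": {
    "name": "mysqlreader",
    "parameter": {
      "username": "dw_readonly",
      "password": "******",
      "column": ["*"],
      "where": "staff_id = 'DF67B3FC-02DD-4142-807A-DF4A75A4A22E'",
      "connection": [
        {"table": ["cn_attendance_day_print"], "jdbcUrl": ["jdbc:mysql://*******"]}
      ]
    }
  }
}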
The next step would be to inspect the freshly extracted files themselves; since the earlier records had already been deleted and I did not feel like reproducing the whole run, I will just describe the final finding.
The root cause turned out to be letter case: Hive compares strings case-sensitively, while this MySQL instance is configured with a case-insensitive collation globally. MySQL therefore matches rows whose staff_id differs only in letter case, and DataX copies all of them to HDFS, but Hive's case-sensitive equality only matches the 1033 rows whose staff_id is stored in exactly the upper-case form used in the query.
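To make the difference concrete, here is a minimal sketch; the table names and the collation are assumptions, not taken from the original job.

-- MySQL: with a case-insensitive collation such as utf8mb4_general_ci,
-- this predicate also matches 'df67b3fc-...' and any other casing,
-- which is why the count is 4783.
SELECT COUNT(*)
FROM cn_attendance_day_print
WHERE staff_id = 'DF67B3FC-02DD-4142-807A-DF4A75A4A22E';

-- Hive: string equality is case-sensitive, so only the rows whose staff_id
-- is stored in exactly this upper-case form are counted, hence 1033.
SELECT COUNT(*)
FROM ods.cn_attendance_day_print
WHERE staff_id = 'DF67B3FC-02DD-4142-807A-DF4A75A4A22E';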
Solution
In DataX, convert every string column to a single case (all upper case or all lower case) before writing to HDFS.
An example job file is shown below. Note that the dx_groovy transformer in this particular example only strips carriage returns and line feeds; a case-normalizing variant is sketched after it:
{"job": {"content": [{"transformer": [{"parameter": {"code": "for(int i=0;i<record.getColumnNumber();i++){if(record.getColumn(i).getByteSize()!=0){Column column = record.getColumn(i); def str = column.asString(); def newStr=null; newStr=str.replaceAll(\"[\\r\\n]\",\"\"); record.setColumn(i, new StringColumn(newStr)); };};return record;","extraPackage": []},"name": "dx_groovy"}],"writer": {"parameter": {"writeMode": "append","fieldDelimiter": "\u0001","column": [{"type": "string","name": "id"}, {"type": "string","name": "username"}, {"type": "string","name": "user_id"}, {"type": "string","name": "superior_id"}, {"type": "string","name": "finger_print_number"}],"path": "${targetdir}","fileType": "text","defaultFS": "hdfs://mycluster:8020","compress": "gzip","fileName": "cn_staff"},"name": "hdfswriter"},"reader": {"parameter": {"username": "dw_readonly","column": ["id", "username", "user_id", "superior_id", "finger_print_number"],"connection": [{"table": ["cn_staff"],"jdbcUrl": ["jdbc:mysql://*******"]}],"password": "******","splitPk": ""},"name": "mysqlreader"}}],"setting": {"speed": {"channel": 3},"errorLimit": {"record": 0,"percentage": 0.02}}}
}
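The Groovy below is a minimal sketch of a transformer body that also normalizes case, assuming every extracted column can safely be treated as a string. Like the example above, it would be pasted into the transformer's "code" field as a single line with the inner quotes escaped.

// dx_groovy transformer body: strip stray CR/LF as before, then upper-case
// every non-empty column so that Hive's case-sensitive comparison sees the
// same value that MySQL's case-insensitive collation matched.
for (int i = 0; i < record.getColumnNumber(); i++) {
    if (record.getColumn(i).getByteSize() != 0) {
        Column column = record.getColumn(i)
        def str = column.asString()
        def newStr = str.replaceAll("[\\r\\n]", "").toUpperCase()
        record.setColumn(i, new StringColumn(newStr))
    }
}
return record

Converting to lower case with toLowerCase() works just as well, as long as the same convention is used in every downstream query.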