LOAD DATA produces different results on MySQL 5.7 and 8.0 for the same file

2024/11/14 17:56:34, source: https://www.cnblogs.com/greatsql/p/18546504

Problem description

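During on-site support for a customer who had upgraded from MySQL 5.7.21 to MySQL 8.0.25, files imported with LOAD DATA came out garbled whenever files in different encodings (UTF8/GB18030) were imported one after another in the same session. Before the upgrade, the same import operations produced no garbled text on MySQL 5.7.21.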

Problem analysis

1) Review the simplified LOAD DATA statements

greatsql> LOAD DATA LOCAL INFILE '/home/greatdb/TEST_UTF8_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET UTF8MB4 IGNORE 0 LINES  (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING UTF8MB4)),'');
Query OK, 2 rows affected (0.01 sec)
Records: 2  Deleted: 0  Skipped: 0  Warnings: 0

greatsql> LOAD DATA LOCAL INFILE '/home/greatdb/TEST_GB18030_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET GB18030 IGNORE 0 LINES  (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING GB18030)),'');
Query OK, 2 rows affected (0.01 sec)
Records: 2  Deleted: 0  Skipped: 0  Warnings: 0

2) Check the table data

+----------+------------------------------------------------------+
| AUTO_INC | D_NAME                                               |
+----------+------------------------------------------------------+
|        1 | xxx社会保险xxx                                        |
|        2 | xxx市路桥区xxx                                        |
|        4 | 鍙板窞甯傝矾妗ュ尯绀句細淇濋櫓浜嬩笟绠$悊涓績             |
|        5 | 鍙板窞甯傝矾妗ュ尯绀句細淇濋櫓浜嬩笟绠$悊涓績             |
+----------+------------------------------------------------------+
4 rows in set (0.00 sec)

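3) Check the character set and collation of the business table: the character set is utf8mb4 and the collation is utf8mb4_bin.

4) Check the character set and collation of the database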

greatsql> SHOW GLOBAL VARIABLES LIKE '%char%';
+--------------------------------------+--------------------------------+
| Variable_name                        | Value                          |
+--------------------------------------+--------------------------------+
| character_set_client                 | utf8mb4                        |
| character_set_connection             | utf8mb4                        |
| character_set_database               | utf8mb4                        |
| character_set_filesystem             | binary                         |
| character_set_results                | utf8mb4                        |
| character_set_server                 | utf8mb4                        |
| character_set_system                 | utf8mb3                        |
| character_sets_dir                   | /opt/mysql3301/share/charsets/ |
| validate_password_special_char_count | 1                              |
+--------------------------------------+--------------------------------+
9 rows in set (0.01 sec)

greatsql> SHOW GLOBAL VARIABLES LIKE '%coll%';
+-------------------------------+--------------------+
| Variable_name                 | Value              |
+-------------------------------+--------------------+
| collation_connection          | utf8mb4_bin        |
| collation_database            | utf8mb4_bin        |
| collation_server              | utf8mb4_bin        |
| default_collation_for_utf8mb4 | utf8mb4_general_ci |
+-------------------------------+--------------------+
4 rows in set (0.00 sec)

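The program had run on MySQL 5.7.21 for a long time without any problem. After the database was upgraded to MySQL 8.0.25, part of the newly imported data was garbled. This led us to suspect a bug in MySQL 8.0 when fixed-length data is imported with LOAD DATA and @row.

Bug scenario: in a single session, LOAD DATA is run against files in several character sets, and an @ user variable is used to slice out the fields. The imported data ends up garbled. We reported this to MySQL, and it has been verified as a bug (number 115824).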

Reproducing the problem

MySQL 8.0.25

greatsql> SELECT VERSION();
+-----------+
| version() |
+-----------+
| 8.0.25    |
+-----------+
1 row in set (0.00 sec)

Table DDL:
CREATE TABLE `assp_sis_payres_imp_bak` (
  `AUTO_INC` bigint unsigned NOT NULL AUTO_INCREMENT COMMENT '自增列',
  `D_NAME` varchar(210) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL,
  PRIMARY KEY (`AUTO_INC`)
) ENGINE=InnoDB AUTO_INCREMENT=7 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;

greatsql> SHOW GLOBAL VARIABLES LIKE '%char%';
+--------------------------------------+--------------------------------+
| Variable_name                        | Value                          |
+--------------------------------------+--------------------------------+
| character_set_client                 | utf8mb4                        |
| character_set_connection             | utf8mb4                        |
| character_set_database               | utf8mb4                        |
| character_set_filesystem             | binary                         |
| character_set_results                | utf8mb4                        |
| character_set_server                 | utf8mb4                        |
| character_set_system                 | utf8mb3                        |
| character_sets_dir                   | /opt/mysql3301/share/charsets/ |
| validate_password_special_char_count | 1                              |
+--------------------------------------+--------------------------------+
9 rows in set (0.01 sec)

greatsql> SHOW GLOBAL VARIABLES LIKE '%coll%';
+-------------------------------+--------------------+
| Variable_name                 | Value              |
+-------------------------------+--------------------+
| collation_connection          | utf8mb4_bin        |
| collation_database            | utf8mb4_bin        |
| collation_server              | utf8mb4_bin        |
| default_collation_for_utf8mb4 | utf8mb4_general_ci |
+-------------------------------+--------------------+
4 rows in set (0.00 sec)

greatsql> TRUNCATE TABLE assp_sis_payres_imp_bak;
Query OK, 0 rows affected (0.03 sec)

greatsql> SELECT charset(@row), @row;
+---------------+------------+
| charset(@row) | @row       |
+---------------+------------+
| binary        | NULL       |
+---------------+------------+
1 row in set (0.00 sec)

greatsql> LOAD DATA LOCAL INFILE '/root/dba_zc/load/TEST_UTF8_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET UTF8MB4 IGNORE 0 LINES  (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING UTF8MB4)),'');
Query OK, 2 rows affected (0.01 sec)
Records: 2  Deleted: 0  Skipped: 0  Warnings: 0

greatsql> SELECT charset(@row), @row;
+---------------+------------------------+
| charset(@row) | @row                   |
+---------------+------------------------+
| utf8mb4       | XXX路桥区社会保XXX       |
+---------------+------------------------+

greatsql> LOAD DATA LOCAL INFILE '/root/dba_zc/load/TEST_GB18030_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET GB18030 IGNORE 0 LINES  (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING GB18030)),'');
Query OK, 2 rows affected (0.01 sec)
Records: 2  Deleted: 0  Skipped: 0  Warnings: 0

greatsql> SELECT charset(@row), @row;
+---------------+------------------------+
| charset(@row) | @row                   |
+---------------+------------------------+
| gb18030       | XXX路桥区社会保XXX       |
+---------------+------------------------+

greatsql> SELECT * FROM ASSP_SIS_PAYRES_IMP_BAK;
+----------+---------------------------------------------------------+
| AUTO_INC | D_NAME                                                  |
+----------+---------------------------------------------------------+
|        1 | XXX路桥区社会保XXX                                        |
|        2 | XXX路桥区社会保XXX                                        |
|        4 | 鍙板窞甯傝矾妗ュ尯绀句細淇濋櫓浜嬩笟绠$悊涓績                 |
|        5 | 鍙板窞甯傝矾妗ュ尯绀句細淇濋櫓浜嬩笟绠$悊涓績                 |
+----------+---------------------------------------------------------+
4 rows in set (0.00 sec)

MySQL 5.7.21

greatsql> SELECT VERSION();
+------------+
| version()  |
+------------+
| 5.7.21-log |
+------------+
1 row in set (0.01 sec)

Table DDL:
CREATE TABLE `assp_sis_payres_imp_bak` (
  `AUTO_INC` bigint unsigned NOT NULL AUTO_INCREMENT COMMENT '自增列',
  `D_NAME` varchar(210) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL,
  PRIMARY KEY (`AUTO_INC`)
) ENGINE=InnoDB AUTO_INCREMENT=7 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;

greatsql> SHOW GLOBAL VARIABLES LIKE '%char%';
+--------------------------------------+--------------------------------+
| Variable_name                        | Value                          |
+--------------------------------------+--------------------------------+
| character_set_client                 | utf8mb4                        |
| character_set_connection             | utf8mb4                        |
| character_set_database               | utf8mb4                        |
| character_set_filesystem             | binary                         |
| character_set_results                | utf8mb4                        |
| character_set_server                 | utf8mb4                        |
| character_set_system                 | utf8                           |
| character_sets_dir                   | /opt/mysql3305/share/charsets/ |
| validate_password_special_char_count | 1                              |
+--------------------------------------+--------------------------------+
9 rows in set (0.00 sec)

greatsql> SHOW GLOBAL VARIABLES LIKE '%coll%';
+----------------------+--------------------+
| Variable_name        | Value              |
+----------------------+--------------------+
| collation_connection | utf8mb4_general_ci |
| collation_database   | utf8mb4_general_ci |
| collation_server     | utf8mb4_general_ci |
+----------------------+--------------------+
3 rows in set (0.00 sec)

greatsql> SELECT charset(@row), @row;
+---------------+------------+
| charset(@row) | @row       |
+---------------+------------+
| binary        | NULL       |
+---------------+------------+
1 row in set (0.00 sec)

greatsql> LOAD DATA LOCAL INFILE '/root/dba_zc/load/TEST_UTF8_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET UTF8MB4 IGNORE 0 LINES  (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING UTF8MB4)),'');
Query OK, 2 rows affected (0.01 sec)
Records: 2  Deleted: 0  Skipped: 0  Warnings: 0

greatsql> SELECT charset(@row), @row;
+---------------+-----------------------+
| charset(@row) | @row                  |
+---------------+-----------------------+
| utf8mb4       | XXX路桥区社会保XXX      |
+---------------+-----------------------+

greatsql> LOAD DATA LOCAL INFILE '/root/dba_zc/load/TEST_GB18030_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET GB18030 IGNORE 0 LINES  (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING GB18030)),'');
Query OK, 2 rows affected (0.01 sec)
Records: 2  Deleted: 0  Skipped: 0  Warnings: 0

greatsql> SELECT charset(@row), @row;
+---------------+-----------------------+
| charset(@row) | @row                  |
+---------------+-----------------------+
| gb18030       | XXX路桥区社会保XXX      |
+---------------+-----------------------+

greatsql> SELECT * FROM ASSP_SIS_PAYRES_IMP_BAK;
+----------+-----------------------------+
| AUTO_INC | D_NAME                      |
+----------+-----------------------------+
|        1 | XXX路桥区社会保XXX          |
|        2 | XXX路桥区社会保XXX          |
|        4 | XXX路桥区社会保XXX          |
|        5 | XXX路桥区社会保XXX          |
+----------+-----------------------------+
4 rows in set (0.00 sec)

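Bug workaround

The output of SELECT charset(@row), @row; shows that @row looks the same after LOAD DATA on 5.7.21 and on 8.0.25, yet the final result differs. MySQL confirmed the problem as a bug but offered neither a workaround nor a fix. After investigating, engineers at Wanli (万里) found a workable workaround: run [set @row=_binary'';] before every LOAD DATA command.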

greatsql> SELECT VERSION();
+-----------+
| version() |
+-----------+
| 8.0.25    |
+-----------+
1 row in set (0.00 sec)

greatsql> SET @row=_binary'';
Query OK, 0 rows affected (0.00 sec)

greatsql> LOAD DATA LOCAL INFILE '/home/greatdb/TEST_UTF8_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET UTF8MB4 IGNORE 0 LINES  (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING UTF8MB4)),'');
Query OK, 2 rows affected (0.01 sec)
Records: 2  Deleted: 0  Skipped: 0  Warnings: 0

greatsql> SET @row=_binary'';
Query OK, 0 rows affected (0.00 sec)

greatsql> LOAD DATA LOCAL INFILE '/home/greatdb/TEST_GB18030_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET GB18030 IGNORE 0 LINES  (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING GB18030)),'');
Query OK, 2 rows affected (0.01 sec)
Records: 2  Deleted: 0  Skipped: 0  Warnings: 0

greatsql> SELECT * FROM assp_sis_payres_imp_bak;
+----------+-----------------------------+
| AUTO_INC | D_NAME                      |
+----------+-----------------------------+
|        1 | XXX路桥区社会保XXX          |
|        2 | XXX路桥区社会保XXX          |
|        4 | XXX路桥区社会保XXX          |
|        5 | XXX路桥区社会保XXX          |
+----------+-----------------------------+
4 rows in set (0.00 sec)
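
In a script that runs many imports over one connection, the reset statement is easy to leave out. The following is a minimal Python sketch of a helper that always emits the two statements as a pair. The name `load_statements` and its parameters are hypothetical, not part of the original program. It only builds SQL text: it does not connect to a server and does not escape untrusted input.

```python
from typing import List


def load_statements(path: str, charset: str, table: str,
                    column: str, hex_len: int) -> List[str]:
    """Return the @row reset followed by the matching LOAD DATA statement."""
    # Workaround from the article: reset @row to an empty binary string first.
    reset = "SET @row=_binary'';"
    load = (
        f"LOAD DATA LOCAL INFILE '{path}' IGNORE INTO TABLE {table} "
        f"CHARACTER SET {charset} IGNORE 0 LINES (@row) "
        f"SET `{column}` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,{hex_len})) "
        f"USING {charset})),'');"
    )
    return [reset, load]


# The two files used in the transcript above, each with its own character set.
files = [
    ("/home/greatdb/TEST_UTF8_bak.txt", "UTF8MB4"),
    ("/home/greatdb/TEST_GB18030_bak.txt", "GB18030"),
]

statements: List[str] = []
for path, charset in files:
    statements += load_statements(path, charset,
                                  "ASSP_SIS_PAYRES_IMP_BAK", "D_NAME", 240)

for sql in statements:
    print(sql)
```

Keeping the reset inside the helper means a caller cannot issue a LOAD DATA without it, which is the property the workaround depends on.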

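Summary

1. Cause of the bug

The bug is in MySQL 8.0's reworked handling of fixed-length data import with LOAD DATA and @row. When LOAD DATA with @row is executed several times in one database session, the n-th execution uses the character set of the (n-1)-th execution. If the files differ in character set, for example GB18030 and UTF8 files processed one after another, the data is garbled. MySQL has verified this as a bug (number 115824).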
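
The garbled rows themselves fit this explanation. The sketch below is a check outside MySQL, using Python's standard codecs rather than the server's conversion code. It takes the first three garbled characters of rows 4 and 5 in the 8.0.25 result, 鍙板窞, and shows that their GB18030 bytes form valid UTF-8. In other words, the text looks like UTF-8 bytes that were decoded as GB18030. The function names are illustrative assumptions.

```python
def undo_mojibake(garbled: str) -> str:
    """Re-encode the text as GB18030 and decode those bytes as UTF-8."""
    return garbled.encode("gb18030").decode("utf-8")


def make_mojibake(text: str) -> str:
    """Decode the UTF-8 bytes of the text as if they were GB18030."""
    return text.encode("utf-8").decode("gb18030")


# First three garbled characters of rows 4 and 5 in the 8.0.25 output.
sample = "鍙板窞"
recovered = undo_mojibake(sample)

print(recovered)
# Applying the wrong decoding again reproduces the garbled text.
print(make_mojibake(recovered) == sample)
```

This only identifies the shape of the corruption. It does not show where inside the server the stale character set is applied.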

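2. Conditions that trigger the bug

All three of the following conditions must hold for the bug to be triggered.

1) The LOAD DATA command processes the data through a user variable such as @row, for example slicing a fixed-length record into several fields by byte:

LINES (@row) SET COLUMN_NAME = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row),1,20)) USING GB18030)),'');

2) LOAD DATA is executed several times on the same connection, and the files processed differ in character set (for example GB18030 and UTF8).

3) MySQL 8.0 is used.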
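
The expression in condition 1 works because two hex digits stand for one byte. SUBSTR(HEX(@row),1,20) therefore selects the first 10 bytes of the record, regardless of where character boundaries fall. Below is a minimal Python sketch of the same slicing. It assumes a made-up record layout (a 10-byte name field followed by a 4-byte code), and `slice_field` is a hypothetical name.

```python
from typing import Optional


def slice_field(record: bytes, start: int, length: int,
                charset: str) -> Optional[str]:
    """Mimic NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(r), start, length)) USING cs)), '').

    start is 1-based and, like length, counts hex digits, as in SUBSTR.
    """
    hex_text = record.hex().upper()                    # HEX(@row)
    piece = hex_text[start - 1:start - 1 + length]     # SUBSTR(..., start, length)
    raw = bytes.fromhex(piece)                         # UNHEX(...)
    text = raw.decode(charset).strip(" ")              # CONVERT ... USING, then TRIM
    return text or None                                # NULLIF(..., '')


# Hypothetical fixed-length record: name padded to 10 bytes, then a 4-byte code.
record = "路桥区".encode("gb18030").ljust(10) + b"0001"

name = slice_field(record, 1, 20, "gb18030")   # hex digits 1..20  = bytes 1..10
code = slice_field(record, 21, 8, "gb18030")   # hex digits 21..28 = bytes 11..14
print(name, code)
```

Because the cut is made on bytes, the character set named in CONVERT must match the bytes actually held in @row. That is the step that goes wrong when the bug is triggered.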

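3. Workaround

Proposed by Wanli (万里) engineers and confirmed in discussion with the MySQL community: in any scenario that meets the trigger conditions above, run [set @row=_binary'';] before every LOAD DATA command.

Reference: https://bugs.mysql.com/bug.php?id=115824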

