基于LLM的数据质量错误检测-编程知识

原文地址：automated-detection-of-data-quality-issues

2024 年 3 月 23 日

本文是有关使用大型语言模型 (LLM) 清理数据的系列文章中的第二篇文章，重点是识别表格数据集中的错误。

该图概述了我们将在本文中探讨的方法，该方法侧重于在最少的人工参与下评估表格数据集的数据脏度分数。

数据脏度分数

我们鼓励读者首先阅读有关数据脏度分数的介绍性文章，其中解释了关键假设并演示了如何计算该分数。

快速回顾一下，数据脏度分数估计数据集中包含错误的单元格的预期比例。以下是该指标背后的关键假设：

数据错误与违反约束有关。
如果没有预期，就不会对分数产生影响。
数据问题可以精确定位到特定的cell。
每个数据错误都会分配一个置信度分数。
每个单元格对总分都有相同的影响。

此过程的第一步涉及识别和分类数据集中存在的数据不准确之处。

自动检测数据质量问题的重要性

检测数据问题在此过程中至关重要，但由于以下几个因素而具有挑战性：

人工标记成本高：识别数据错误通常需要数据专业人员（如科学家、工程师和分析师）或主题专家 (SME) 的大量输入。这需要大量时间并且成本高昂。
数据从业者对这项繁重的工作缺乏热情：众所周知，许多业内人士认为数据清理是他们工作中不太有吸引力的方面。数据清理被视为建模、构建现代数据堆栈或回答业务查询等更具吸引力的活动的先驱，其优先级列表中的数据清理通常排在较低的位置，从而导致拖延，或者在某些情况下完全被忽视，直到出现关键问题。
中小企业的局限性：中小企业拥有宝贵的知识，但可能缺乏 SQL 或编程等技术技能。虽然无代码和低代码工具在一定程度上有所帮助，但它们尚未被完全采用，并且可能无法涵盖所有数据管理方面，例如版本控制。
专业知识差距：有效的数据清理超越了基本技能，需要专门的专业知识。缺乏培训以及对数据准备普遍不感兴趣意味着许多从业者可能只能识别出表面的错误，而忽略了需要更深入地了解数据清理的更复杂的问题。

尽管存在固有的挑战，大型语言模型 (LLM) 领域的进步为自动识别简单数据问题和发现更复杂的数据质量问题提供了有前景的解决方案。

由LLM提供支持的数据错误检测

大型语言模型正在成为自动检测数据质量问题的宝贵工具，成为高效的人机交互迭代过程的有效起点。模型，例如《Jellyfish：用于数据预处理的大型语言模型》等论文中讨论的模型，语言模型可以自动处理数据吗？和大型语言模型作为数据预处理器，展示了它们自动约束生成和数据错误检测的潜力。这种自动化并没有取代人工干预，而是增强了人工干预，允许通过直接解决问题或修改置信度分数以反映数据错误检测中固有的不确定性来审查和调整自动化约束。

LLM特别适合检测数据质量问题，因为他们接受过各种互联网内容的广泛培训，包括大量的领域知识和与数据质量问题相关的大量代码审查示例。这种培训使LLM能够根据文本内容识别数据错误，而不需要明确定义的规则。通过将表格数据集转换为纯文本（称为序列化），LLM可以像经验丰富的团队一样仔细检查数据，利用他们的“压缩”互联网知识来查明错误。这种广泛的培训使他们能够以模仿人类专业知识的直觉水平识别人类可读数据集（例如 CSV 文件）中的潜在错误。此外，特定领域知识中的任何差距都可以通过检索增强生成 (RAG) 等技术或根据数据集的特定性质定制模型的提示来弥补。

在数据错误检测中采用LLM的另一个关键优势是它们能够处理与数据质量问题相关的固有不确定性。并非所有错误都是直截了当的，甚至专家有时也会对数据问题的构成存在分歧。LLM可以为他们的发现分配置信度分数，就像人类基于直觉和经验的混合所做的那样，反映错误的估计可能性。

跨不同数据集和潜在问题进行泛化错误检测的挑战是巨大的。传统方法通常采用一组广泛的决策规则或专门的机器学习模型的组合来解决各种场景，例如检查地址和电话号码的有效性或异常检测。这就是LLM的闪光点，它提供了适应性更强、劳动力密集程度更低的替代方案。它们无需大量基于规则的系统或特定领域模型即可理解和识别各种数据质量问题的能力使它们成为宝贵的工具。机器学习方法相对于传统业务规则或统计方法的优势的类比非常有趣。机器学习的采用是由于其相对易用性和跨不同用例的适应性，需要较少的特定领域知识和实施时间。

接下来，我们将通过一个实际例子来演示这种方法。

案例研究

在上一篇文章中，我们使用《清理数据以实现有效数据科学》一书中的数据集示例探讨了数据脏度分数的概念。有问题的数据集如下：

Student#,Last Name,First Name,Favorite Color,Age
1,Johnson,Mia,periwinkle,12
2,Lopez,Liam,blue,green,13
3,Lee,Isabella,,11
4,Fisher,Mason,gray,-1
5,Gupta,Olivia,9,102
6,,Robinson,,Sophia,,blue,,12

数据错误已经指出。现在，我们想探索如何使用大型语言模型，特别GPT-4是自动查找这些错误。这种新方法提供了一种发现数据集中问题的现代方法，但在使用外部 API 时可能会带来隐私问题等风险。然而，这可以适用于任何LLM，而不仅仅是GPT-4，尽管有效性可能会根据模型的功能而有所不同。

初步步骤：检索表注释

为了帮助模型识别数据不一致，提供有关数据框架的附加上下文是有益的。这正是数据目录的作用，尽管它是一个广泛的主题，但我们将对其进行简化，以仅关注LLM在检查批量数据集行时检测数据错误所需的基本上下文信息。

所需的关键元数据包括：

该表的概述，包括其描述和用途。
清楚地了解每一列的含义和类型。

鉴于组织中经常缺乏数据目录或可靠的文档，我们将探索如何使用LLM来加速这一过程。此过程称为表注释，它涉及识别有关表元素的语义信息，包括列、它们的关系以及单元格内的实体。有关更多详细信息，请参阅Column Type Annotation using ChatGPT、Annotating Columns with Pre-trained Language Models或SOTAB：The WDC Schema.org Table Annotation Benchmark 等来源。

这是我使用的提示：

Analyse the table below and provide schema annotations based on Schema.org standards.Student#,Last Name,First Name,Favorite Color,Age
1,Johnson,Mia,periwinkle,12
2,Lopez,Liam,blue,green,13
3,Lee,Isabella,,11
4,Fisher,Mason,gray,-1
5,Gupta,Olivia,9,102
6,,Robinson,,Sophia,,blue,,12Follow these steps:
1. Identify the overall semantic type of the table.
2. Provide a short description of each column.
3. Annotate each column with its semantic type from Schema.org.
4. Determine the most suitable data type for each column (after data cleaning).Summarise the table schema as follows:
- Table Semantic Type: <type>
- Column: <name>, Description: <description>, Semantic Type: <Schema.org type>, Pandas Type: <Pandas data type>

在提示说明中，我指示模型分析提供的表（或表的概述）并建议遵循Schema.org标准的注释。具体来说，输出应包括：

表的语义类型
每列的简要说明
来自 Schema.org 的列注释类型（如果适用）
每列的理想或最适合的数据类型，无论提供的文本序列化中的数据问题如何

然后，响应被格式化以提供清晰且结构化的摘要，可以在后续提示中重复使用作为上下文：

Summarise the table schema as follows:
- Table Semantic Type: <type>
- Column: <name>, Description: <description>, Semantic Type: <Schema.org type>, Pandas Type: <Pandas data type>

该表是使用{table}提示中的占位符合并的。典型的方法涉及通过序列化将表格数据转换为文本，如表格数据的大型语言模型 (LLM)：预测、生成和理解 - 调查中所讨论的。

以下是来自的响应示例GPT-4：

- Table Semantic Type: Person
- Column: Student#, Description: Unique identifier for each student, Semantic Type: identifier, Pandas Type: int
- Column: Last Name, Description: Family name of the student, Semantic Type: familyName, Pandas Type: string
- Column: First Name, Description: Given name of the student, Semantic Type: givenName, Pandas Type: string
- Column: Favorite Color, Description: Preferred color of the student, Semantic Type: color (custom), Pandas Type: string (or list if cleaned for multiple values)
- Column: Age, Description: Age of the student, Semantic Type: age, Pandas Type: int (after cleaning invalid entries)

对于这样一个简单的例子，响应可能略有不同，但通常是一致的。这里的目的是加速初始过程而不是完全自动化。因此，这可以被视为初步草案，然后可以根据我们的知识和主题专家 (SME) 的外部背景的见解进行完善。

现在，了解有关该表的一些背景信息，让我们探讨如何自动识别数据质量问题。

使用 LLM 嗅探数据错误

首先，我建议使用一个提示来帮助识别给定表中的数据质量问题。

Task: Analyse the provided table to identify and document data quality issues.Below are common data quality issues to guide your analysis. However, you may also identify other relevant issues:
- Ingestion errors
- Typecasting issues
- Duplicates
- Date parsing issues
- Character encoding problems
- Missing values
- Typos/spelling mistakes
- Anomalies/outliers
- Conversion errors and inconsistent units
- Privacy concerns (e.g., exposed PII)
- Domain-specific errors (e.g., invalid formats for addresses, phone numbers, emails)Instructions:
1. Examine silently the table and its metadata.
2. Line by line, identify potential data quality issues without coding.
3. Document each issue, including:- Nature and description of the issue- Expected correct state- Violated constraint- Confidence level in your assessment using ordinal categories: `low`, `medium`, `high` and `certain`.- Specific location of the issue in the table (use 'None' for table-wide issues): Index and Column names.Provided Data:Table: 
,Student#,Last Name,First Name,Favorite Color,Age
0,1,Johnson,Mia,periwinkle,12
1,2,Lopez,Liam,blue,green,13
2,3,Lee,Isabella,,11
3,4,Fisher,Mason,gray,-1
4,5,Gupta,Olivia,9,102
5,6,,Robinson,,Sophia,,blue,,12Metadata:
- Table Semantic Type: Person
- Column: Student#, Description: Unique identifier for each student, Semantic Type: identifier, Pandas Type: int or string
- Column: Last Name, Description: Family name of the student, Semantic Type: familyName, Pandas Type: string
- Column: First Name, Description: Given name of the student, Semantic Type: givenName, Pandas Type: string
- Column: Favorite Color, Description: Preferred color of the student, Semantic Type: color (custom), Pandas Type: string (or list if cleaned for multiple values)
- Column: Age, Description: Age of the student, Semantic Type: age, Pandas Type: int (after cleaning invalid entries)Detected Data Issues:

提示的最初部分设定了任务的目标，并列出了常见数据问题的示例，例如摄取错误、重复和隐私问题等。此列表并不详尽，我们鼓励您根据表的上下文添加更多相关类型以指导分析。

接下来，提示详细说明了遵循思想链方法的分步说明，确保模型在逐行识别数据问题之前系统地分析表及其元数据，反映人工分析。此过程无需编码即可进行，以保持简单性和广泛的适用性。这一点至关重要，因为尽管GPT-4 具有分析功能的模型可以执行有用的迭代编码会话，但仅依赖文本分析会促进泛化。

检测到潜在的数据问题后，提示将指示记录以下详细信息：

问题的性质和描述
预期的正确状态
违反的约束
使用序数类别进行评估的置信度low：、medium和high。certain
问题在表中的具体位置，对于全表范围的问题使用“None”，并附有 Index 和 Column 名称以供参考。

该表及其元数据在提示中提供，并在每行中添加一个索引，以帮助模型查明错误的确切位置。

对于大型表，可以批量应用此提示以覆盖整个数据集，并汇总结果以识别所有数据质量问题。

以下是此提示可以生成的输出示例，其格式为详细说明已识别数据问题的报告，每个问题都有描述、预期状态、违反的约束、置信度和位置。

1. Issue: Incorrect value format in "Favorite Color" column- Nature and Description: The entry "blue,green" in the "Favorite Color" column for the second row appears to contain two colors instead of one, suggesting a possible ingestion or data entry error.- Expected Correct State: Each cell under the "Favorite Color" column should contain only one color.- Violated Constraint: Single value constraint.- Confidence Level: High- Specific Location: Index 1, Column "Favorite Color"2. Issue: Missing value in "Favorite Color" and "Age" columns- Nature and Description: The third row has a missing value for "Favorite Color", and the sixth row has missing values for both "Last Name" and "Favorite Color".- Expected Correct State: No missing values in any of the columns.- Violated Constraint: Non-null constraint.- Confidence Level: Certain- Specific Location: Index 2, Column "Favorite Color"; Index 5, Columns "Last Name", "Favorite Color"3. Issue: Negative value in "Age" column- Nature and Description: The fourth row has an age of "-1", which is not possible and indicates a data entry error.- Expected Correct State: Age values should be positive integers.- Violated Constraint: Age value range (greater than 0).- Confidence Level: Certain- Specific Location: Index 3, Column "Age"4. Issue: Misplaced values in "Favorite Color" and "Age" columns- Nature and Description: In the fifth row, what appears to be an age ("9") is placed in the "Favorite Color" column, and an unlikely age ("102") is in the "Age" column, suggesting a possible shifting or misplacement of values.- Expected Correct State: Age should be a reasonable positive integer, and favorite color should be a string denoting a color.- Violated Constraint: Data type and value range constraint.- Confidence Level: High- Specific Location: Index 4, Columns "Favorite Color", "Age"5. Issue: Inconsistent formatting in row entries- Nature and Description: The sixth row contains additional commas, which might indicate incorrect splitting or entry of the data, leading to misalignment of the values with their respective columns.- Expected Correct State: Each row should consistently follow the format defined by the column headers without extra delimiters.- Violated Constraint: Data format and consistency constraint.- Confidence Level: High- Specific Location: Index 5, Entire row

下一步是将这些已识别的问题转换为 Python 对象，以便更轻松地计算数据脏度分数。

将识别的数据问题转换为正确的格式

本节重点介绍将之前确定的数据问题从纯文本描述转换为 Python 对象。这些对象应遵循类定义的结构DataIssue：

@dataclass
class DataIssue:type_of_issue: strexpectation: strconstraint_violated: strconfidence_score: str # `low`, `medium`, `high` or `certain`location: List[Tuple]  # Cell positions as (Index, Column). Use None for row/column-wide issues.

这是我使用的提示：

Task: Document data table issues using the provided `DataIssue` class.For each identified issue and its description, document it using the provided `DataIssue` class structure. ```python
from dataclasses import dataclass
from typing import List, Tuple@dataclass
class DataIssue:type_of_issue: strexpectation: strconstraint_violated: strconfidence_score: str # `low`, `medium`, `high` or `certain`location: List[Tuple]  # Cell positions as (Index, Column). Use None for row/column-wide issues.Instructions:
1. Review all identified issues provided and their descriptions silently.
2. For each issue, instantiate it using the provided `DataIssue` class structure. 
3. Return only the code.
4. Once the code has been validated, stop generation.Identified issues:
{issues_found}Your code here:

目标是生成将每个数据问题表示为DataIssue对象的 Python 代码，而无需冗余文本。使用占位符将已识别的问题注入到提示中{issues_found}。

输出示例：

issue1 = DataIssue(type_of_issue="Incorrect value format",expectation="Each cell under 'Favorite Color' should contain only one color",constraint_violated="Single value constraint",confidence_score="high",location=[(1, "Favorite Color")]
)issue2 = DataIssue(type_of_issue="Missing value",expectation="No missing values in any columns",constraint_violated="Non-null constraint",confidence_score="certain",location=[(2, "Favorite Color"), (5, "Last Name"), (5, "Favorite Color")]
)issue3 = DataIssue(type_of_issue="Negative value",expectation="Age values should be positive integers",constraint_violated="Age value range (greater than 0)",confidence_score="certain",location=[(3, "Age")]
)issue4 = DataIssue(type_of_issue="Misplaced values",expectation="Age should be a reasonable positive integer, and favorite color should be a string denoting a color",constraint_violated="Data type and value range constraint",confidence_score="high",location=[(4, "Favorite Color"), (4, "Age")]
)issue5 = DataIssue(type_of_issue="Inconsistent formatting",expectation="Each row should consistently follow the format defined by the column headers without extra delimiters",constraint_violated="Data format and consistency constraint",confidence_score="high",location=[(5, None)]  # None indicates entire row issue
)

最后一步涉及将location属性从元组列表转换为numpy数组，附录中详细介绍了这一点。

所有元素就位后，我们现在可以计算数据脏度分数。

数据脏度分数的计算以及与真实值的比较

让我们回顾一下上一篇文章中的函数，compute_data_dirtiness_score它使用前面提到的对象列表DataIssue。

compute_data_dirtiness_score(data_issues)

数据脏度分数：28.33%

使用该GPT-4模型，我们估计该样本的得分约为 28%。这与 31.87% 的“真实情况”分数相当接近。

为了了解这些分数之间的差异，让我们深入研究有关数据问题检测的更详细的指标。除了总体得分之外，我们还有基本事实和模型估计的细胞问题概率矩阵。

下面是真实矩阵，为了清晰起见添加了列和索引：

   Student#  Last Name  First Name  Favorite Color   Age
0      0.00        0.0        0.00            0.00  0.00
1      0.00        0.0        0.00            0.75  0.00
2      0.00        0.0        0.00            1.00  0.00
3      0.00        0.0        0.00            0.00  1.00
4      0.00        0.0        0.00            0.75  0.75
5      0.75        1.0        0.75            1.00  0.75

这是模型估计的概率矩阵：

   Student#  Last Name  First Name  Favorite Color   Age
0       0.0        0.0        0.00          0.0000  0.00
1       0.0        0.0        0.00          0.7500  0.00
2       0.0        0.0        0.00          1.0000  0.00
3       0.0        0.0        0.00          0.0000  1.00
4       0.0        0.0        0.25          0.8125  0.75
5       1.0        1.0        1.00          1.0000  1.00

尽管这些矩阵乍一看很相似，但我们可以应用基于阈值的指标，例如accuracy、recall、precision和F1-score来获得更清晰的图像。如果模型的可能性超过 0，这些指标可以通过考虑单元格有问题来提供对模型性能的直接评估。以下是获得的指标：

该模型正确识别了 91% 的问题细胞 ( recall)，并且所有错误预测都是准确的 ( precision)。

该模型错过了一个特定问题：“考虑Favorite Color到可以同时是名称和颜色，因此和字段可能会First Name交换。”这在置信度分数下被认为是不可能的，表明更有可能是而不是。因此，即使这种可能性这个问题被忽视了，它的最小置信度分数减少了它对整体数据脏度分数的影响。这解释了为什么尽管有这个遗漏，两个分数仍然相对接近。OlivialowOliviaFirst NameFavorite Color

总之，这种基于大型语言模型（LLM）的方法提供了一种检测数据框架中的数据质量问题的方法。虽然此方法可能尚未完全自动化，并且可能需要手动调整，但希望它能够加快数据错误的检测和表格数据集的数据脏度分数的计算。

下一步和挑战

我使用两步过程将问题生成为代码。这样做是因为我发现这比一体化解决方案（即扫描数据集和元数据并直接以正确的代码格式输出数据问题）增加了更多稳定性。这并不意味着这是不可能的，但我暂时选择将此步骤分为两个阶段以提高稳健性。

我们面临的一个问题涉及管理大型数据集，包括行数和列数。尽管最近取得了进展，LLM仍然面临输入上下文窗口和生成内容长度的限制。这些约束限制了可以序列化为分析提示的表的大小以及模型生成的数据问题报告的长度。如何根据数据框的大小和模型的功能来划分数据框是一个问题。

在某些情况下，缺乏一般上下文可能会出现问题，例如在识别数据库中的重复行或在没有广泛了解列值的情况下检测拼写错误时。例如，在重复并不简单的情况下，常见的方法是实体匹配。该技术在数据清理过程中特别有用，并且通过使用大型语言模型取得了进步。该领域的相关研究包括使用大型语言模型进行实体匹配和基础模型能否处理您的数据等研究？，以及作为数据预处理器的大型语言模型和Jellyfish：用于数据预处理的大型语言模型。

机器学习中的集成方法涉及组合多个模型，可以提高性能和稳定性。这种方法可以通过同时运行多个LLM来识别数据集中的问题来应用。改变每个LLM的提示和设置是有益的，以确保获得多样化的见解。此外，将特定的错误类型（例如拼写错误）分配给各个模型可以使流程更加高效。虽然这种方法可以通过将任务划分为更小的部分来获得更可靠的结果，但它也增加了软件的成本和复杂性。通过收集所有已识别的数据问题，我们可以提高发现错误的机会（增加召回率），但也可能识别更多错误（降低精度）。然而，检查这些已识别的错误通常比一开始就发现它们更省时。

LLM直接与数据库交互的能力，类似于中的代码分析能力ChatGPT-4，为检测数据错误开辟了更广泛的可能性。这里的一个挑战是使这个过程自动化，因为如果没有足够的指导，模型可能会偏离其预期路径。

尽管面临所有挑战，但我们通过这种简单的方法能够实现的目标已经非常有希望。通过更多的工程工作，我希望我们很快就能提供更强大的解决方案来覆盖更大的数据集并完全自动化检测过程。

下一篇文章将讨论自动数据修复，或者至少提出修复待验证的解决方案。

References

Data Dirtiness Score
Jellyfish: A Large Language Model for Data Preprocessing
Can language models automate data wrangling?
Large Language Models as Data Preprocessors
Column Type Annotation using ChatGPT
Annotating Columns with Pre-trained Language Models
SOTAB: The WDC Schema.org Table Annotation Benchmark
Large Language Models(LLMs) on Tabular Data: Prediction, Generation, and Understanding — A Survey
Entity Matching using Large Language Models

附录

本节介绍如何将来自 LLM 的对象location的属性转换为不同的格式。DataIssue此转换将表示单元格位置的元组列表更改为数组numpy。该数组充当这些单元格位置的掩码。
这是使用该数据集的基本示例Students：

create_mask_from_list_of_cell_positions(shape=dataset_shape,list_of_cell_positions=[(4, 'Favorite Color'), (4, 'Age')],columns=columns
)

array([[0, 0, 0, 0, 0],[0, 0, 0, 0, 0],[0, 0, 0, 0, 0],[0, 0, 0, 0, 0],[0, 0, 0, 1, 1],[0, 0, 0, 0, 0]], dtype=int8)

以下是函数定义：

def validate_cell_position(cell_position: Union[Tuple[int, int], Tuple[None, int], Tuple[int, None], Tuple[None, None]],columns: List[str] = None,
) -> Tuple[int, int]:"""Validate the cell position and convert column names to indices if necessary."""if not isinstance(cell_position, tuple):raise ValueError("Cell position must be a tuple")# Convert column name to index if columns are providedif isinstance(cell_position[1], str):if columns is None:raise ValueError("Column names must be provided to create a mask based on column names")column_index = columns.index(cell_position[1])return (cell_position[0], column_index)return cell_positiondef set_mask_values(mask: np.ndarray, cell_position: Tuple[int, int]):"""Set values in the mask based on the cell position."""row_index, col_index = cell_positionif row_index is None:mask[:, col_index] = 1elif col_index is None:mask[row_index, :] = 1else:mask[row_index, col_index] = 1def create_mask_from_list_of_cell_positions(shape: Tuple[int, int],list_of_cell_positions: List[Tuple],columns: List[str] = None,
) -> np.ndarray:"""Create a mask array based on a list of cell positions."""mask = np.zeros(shape=shape, dtype=np.int8)for cell_position in list_of_cell_positions:validated_position = validate_cell_position(cell_position, columns)set_mask_values(mask, validated_position)return mask