LLM 构建Data Multi-Agents 赋能数据分析平台的实践之②:数据治理之二(自动处理)

前述

在前文的multi Agents for Data Analysis的设计说起,本文将继续探索和测试借助llm实现基于私有知识库的数据治理全自动化及智能化。整体设计如下:
在这里插入图片描述
整个体系设计了3个Agent以及一个Planer&Execute Agent,第一个Agent用于从企业数据标准私有知识库中检索生成与用户问题相关联的知识块,第二个Agent用于结合企业或者用户的数据做数据质量分析,第三个Agent用于根据用户的问题或者其他Agent的结果总结生成报告;Planer&Excute Agent用于根据用户的问题规划任务及调度Agents执行。**本文实践的例子流程是这样的:**根据用户的私有知识库检索数据治理的流程或者数据的标准范围,根据该标准或者流程,分析用户上传的数据的异常值及数据质量,撰写数据质量报告。

一、Agent①:知识检索

使用create_retriever_tool及 ZeroShotAgent构建第一个Agent用于从私有知识库检索生成相关的知识块。

model_id = "iic/nlp_corom_sentence-embedding_english-base"
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
loader = Docx2txtLoader('./standard documents (1).docx')
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=500, chunk_overlap=150)
split_docs = text_splitter.split_documents(docs)
vectordb = Chroma.from_documents(documents=split_docs,embedding=embeddings)
retriever_tool = create_retriever_tool(vectordb.as_retriever(), "Search_for_data_retriever", "which you can retriever relate data block."
)tools_kg = [retriever_tool]
memory = ConversationBufferMemory(memory_key="chat_history",human_prefix="Input",ai_prefix="Response",input_key="question",return_messages=False,
)
template_kg = 'You are a knowledge retrieval and generation tool that searches for relevant text based on user questions and generates answers accordingly.'
agent_kg = ZeroShotAgent.from_llm_and_tools(llm=llm,tools=tools_kg,prefix=template_kg,)

二、Agent②:数据分析

1、提示词设计:包含三个部分,工具的简单介绍,数据基础信息的提示(dhead),可参考的例子。

TEMPLATE_dt = """You are working with a pandas dataframe in Python. The name of the dataframe is `df`.
It is important to understand the attributes of the dataframe before working with it. This is the result of running `df.head().to_markdown()`<df>
{dhead}
</df>You are not meant to use only these rows to answer questions - they are meant as a way of telling you about the shape and schema of the dataframe.
You also do not have use only the information here to answer questions - you can run intermediate queries to do exporatory data analysis to give you more information as needed.You possess  essential tools:  `python_repl_dt`: With this tool, you can analyze and process the data retrieved from df using Python code.When facing a question, assess whether you need to employ these tools iteratively.Example:
<question>It is known that the price interval for apples in the market is between 5 and 20 yuan per kilogram; your task is to detect any anomalous values within the set of apple market price quotes. </question>
<logic>First, To begin with, you should scrutinize the user's query, where in the user has furnished an essential detail: that the benchmark data range for apple prices is from 5 to 20 yuan per kilogram.Then, leverage `python_repl_dt` to detect any anomalous values based on the retrieved data.the code write by python_repl_dt:'''import pandas as pddf[df['apple'].lt(10) | df['apple'].gt(20)].to_csv('abnormal_data.csv')the output should print the anomalous,and output a file named apple market price anomalous.csv.
</logic>
If you have not found the precise answer, please iteratively use the tools.
"""

2、工具的构建及Agent的组装

#使用PythonAstREPLTool工具
repl = PythonAstREPLTool(locals={"df": df},name="python_repl_dt",description="The tool is used to generate Python code analysis based on the 'df' named pig_market_data ,runs the code and outputs both the code and the results of the computation.",#args_schema=PythonInputs,
)
tools_re = [repl]
#Agent_dt
template_dt = TEMPLATE_dt.format(dhead=df.head().to_markdown())
agent_dt = ZeroShotAgent.from_llm_and_tools(llm=llm,tools=tools_re,prefix=template_dt,)
agent_dt = AgentExecutor(agent=agent_dt, tools=tools_re, max_iterations=50, handle_parsing_errors=True,early_stopping_method="generate",verbose=True)

三、Agent③:总结报告

1、提示词设计:设计一个总结报告的基础框架,包含目的、背景、过程、结论、建议等。

template_bg = """As a Data Analys, please compose a comprehensive data report addressing the user's inquiry:{question}, adhering to the following guidelines:1.Introduction: Begin with a clear and concise overview of the purpose of the report, briefly outlining the user's question or problem statement. This section should set the context and provide a roadmap for the content that follows.2.Data Scope and Source: Specify the scope of the data analyzed, including relevant timeframes, geographical coverage, and any specific datasets utilized. Mention the sources from which the data was obtained, emphasizing their credibility and relevance to the analysis.3.Methodology: Describe the analytical methods employed to examine the data, including any statistical techniques, models, or tools used. Explain why these methods were chosen and how they contribute to answering the user's question. Outline any assumptions, limitations, or caveats associated with the methodology.4.Key Findings: Present the main insights derived from the analysis in a structured and visually appealing manner, using tables, charts, graphs, or other appropriate visualizations. Accompany each finding with a clear explanation and, where applicable, quantitative measures such as percentages, averages, or trends. Ensure findings are directly responsive to the user's inquiry and are contextualized within the broader data landscape.5.Interpretation and Implications: Interpret the key findings, drawing meaningful conclusions and highlighting their significance. Relate these conclusions back to the user's question, explaining how they address the initial concerns or objectives. Discuss any potential implications for business decisions, strategy, or further research, and offer actionable recommendations where appropriate.6.Quality Assurance and Limitations: Discuss the steps taken to ensure data quality throughout the analysis process, such as data cleaning, validation, and outlier detection. Acknowledge any limitations or challenges encountered during the analysis, including data gaps, inconsistencies, or inherent biases, and discuss how these may have influenced the results and conclusions.7.Conclusion and Next Steps: Summarize the key takeaways from the report, reinforcing the most important findings and their implications. Suggest potential avenues for future analysis or data collection that could further enhance understanding or address remaining questions. Encourage user engagement by inviting feedback or follow-up inquiries.8.Appendix and Supporting Materials: Include any additional information, detailed calculations, or raw data that support the analysis but would disrupt the flow of the main report. This might include detailed statistical outputs, full dataset summaries, or detailed descriptions of complex methodologies.By adhering to these guidelines, your data report will effectively communicate the results of your analysis, address the user's question thoroughly, and provide a robust foundation for informed decision-making.
"""

2、Agent的构建

prompt = ChatPromptTemplate.from_template(template_bg)
chain_bg = ({"question": RunnablePassthrough()}| prompt| llm| StrOutputParser()
)

四、Planer&Execute Agent设计及构建

1、方式一:使用ZeroShotAgent+提示词的方式构建一个任务规划及调度执行的Agent
(1)提示词的设计:包含三部分,介绍该Agent的基础信息(角色定位、工具库)、任务执行的流程、可参考的例子

template= """As an AI data analyst, it is crucial to optimize workflow by utilizing tools within the tool library, especially "Search_for_data_standard" 、 "Analysis_for_pig_market_data" and "Tool_for_Data_analysis_report". Below are streamlined methods for efficiently handling user inquiries:1. **Task Understanding & Strategy Formulation**:- Firstly, comprehensively understand the user's requirements and constraints, aligning with industry expertise and best practices.- Develop personalized data analysis strategies, setting operational guidelines for each data item based on established norms.- Based on the user's questions, output a matching task list.2. **Utilizing "Search_for_data_standard" to Establish Data Analysis Framework and Standards**:- Within the strategy, define standards and guidelines for each data item to ensure alignment with business context and analysis objectives.3. **Performing Data Analysis Using "Analysis_for_pig_market_data" Tool**:- Employ the "Analysis_for_pig_market_data" tool to analyze data uploaded by the user, following the framework and standards established by "Search_for_data_standard".— That is, the input of tool "Analysis_for_pig_market_data" should contain the framework and standards established by "Search_for_data_standard".
4. ** Utilize the "Tool_for_Data_analysis_report" to craft a report.— Compose a data analysis report based on industry knowledge and standards (query by "Search_for_data_standard" ), data analysis results("Analysis_for_pig_market_data" output), and user inquiries.Example:
<question>Please calculate the cost and profit of the company's batch products for this year, and identify the batches of products that exceed the average cost. And generate a financial report</question>
<logic>First, Based on the user's questions, plan a task list and execution process:1. Use tool "Search_for_data_standard" to search and generate the company's financial management system and cost accounting process;2. According to the system or process, use tool "Analysis_for_pig_market_data" to analyze financial data, including batch cost, profit, and batch product ranking;3. Write financial analysis reports using "Tool_for_Data_analysis_report" tools based on management systems, cost accounting processes, data analysis, etc.
<task1>Use tool "Search_for_data_standard" to search and generate......
task1 output: The cost of product batches should be between 10-20 yuan/kg
<task2>use tool "Analysis_for_pig_market_data" to analyze financial data......
task2 input: the batch cost of the product should be between 10-20 yuan/kg. Analyze products with unreasonable batch costs
task2 output:'''import pandas as pddf[df['apple'].lt(10) | df['apple'].gt(20)].to_csv("the batch cost of the product  anomalous.csv", index=False)'''
<task3>Write a report using tool "Tool_for_Data_analysis_report"......
task3 input: the company's financial management system and cost accounting process,Financial data analysis results.....
task3 output: Our company's batch cost analysis overview is as follows: the average cost of a batch is 15 yuan/batch, and the batches of products that exceed the abnormal cost warning value include A, B, C, etc......
</logic>"""

(2)工具库及P-Agent的构建

tools = [Tool(name = "Search_for_data_standard",func=agent_KG.invoke,description="By retrieving relevant text, obtain industry standards, data analysis strategies related to user questions, and summarize and generate answers."),Tool(name = "Analysis_for_pig_market_data",func=agent_dt.invoke,description="A tool used to excute code analyze the situation in the data named pig_market_data, where the data is the slaughter weight and price data of the pig market."),Tool(name = "Tool_for_Data_analysis_report",func=chain_bg.invoke,description="As a Data Analyst, please compose a comprehensive data report addressing the user's inquiry."),]agent = ZeroShotAgent.from_llm_and_tools(llm=llm,tools=tools,prefix=template,)
agent_executor = AgentExecutor(agent=agent, tools=tools, max_iterations=150, handle_parsing_errors=True,early_stopping_method="generate",verbose=True)

2、方式二、使用langchain 的plan & excute模块

from langchain_experimental.plan_and_execute import PlanAndExecute, load_agent_executor, load_chat_plannerplanner = load_chat_planner(llm,template)
executor = load_agent_executor(llm, tools, verbose=True)agent = PlanAndExecute(planner=planner, executor=executor, verbose=True)

五、实践结果讨论

任务一:
从标准私有知识库中检索数据标准(生猪体重标准范围),根据此标准分析用户数据中的异常值,生成一份异常数据表,并根据数据标准与异常数据情况撰写数据质量分析报告。

agent_executor.invoke({'input':'''Obtain the weight standards for pigs in the market; 
According to this standard, Identify abnormal weight values in data named pig_market_data from various provinces and cities across the country that exceed the slaughter standard weight range,
And generate a csv containing the identified outliers,above all, 
Write a data quality analysis report.'''})

①任务规划:

steps=[Step(value='Use tool "Search_for_data_standard" to search and generate weight standards for pigs in the market; ')
Step(value='According to the weight standards, use tool "Analysis_for_pig_market_data" to analyze the data named pig_market_data from various provinces and cities across the country, identifying abnormal weight values that exceed the slaughter standard weight range; ')
Step(value='Generate a CSV file containing the identified outliers;')
Step(value='Write a data quality analysis report based on the weight standards, data analysis results, and identified outliers.\n</logic>\n\n<task1>Use tool "Search_for_data_standard" to search and generate weight standards for pigs in the market.\ntask1 output: The weight standards for pigs in the market should be between 100-150 kg.\n\n<task2>Use tool "Analysis_for_pig_market_data" to analyze the data named pig_market_data from various provinces and cities across the country.\ntask2 input: The weight standards for pigs in the market should be between 100-150 kg. Analyze data to identify abnormal weight values that exceed the slaughter standard weight range.\ntask2 output: \n```python\nimport pandas as pd\ndf[df[\'weight\'].lt(100) | df[\'weight\'].gt(150)].to_csv("abnormal_weight_values.csv", index=False)\n```\n\n<task3>Write a report using tool "Tool_for_Data_analysis_report" based on weight standards, data analysis results, and identified outliers.\ntask3 input: Weight standards for pigs in the market, data analysis results, identified outliers in the data.\ntask3 output: The data quality analysis report includes information on the weight standards, analysis of abnormal weight values exceeding the slaughter standard weight range, and recommendations for data quality improvement.\n')]

②执行第一个Agent:知识检索生成,从私有知识库检索生成生猪出栏体重标准为70~200kg。
在这里插入图片描述

③执行第二个Agent:数据质量分析,根据生猪出栏体重标准范围【70,200】,查询异常值并生成异常数据文件。
在这里插入图片描述

④执行第三个Agent:数据质量报告撰写,根据提供的数据分析报告模板撰写报告。
在这里插入图片描述
任务二:从私有知识库中检索生成数据质量分析流程根据此流程分析用户数据中的异常值,生成一份异常数据表,并根据数据标准与异常数据情况撰写数据质量分析报告。

agent_executor.invoke({'input':'Obtain the data quality analysis process for pigs in the market; According to this process, Identify abnormal weight values in data named pig_market_data from various provinces and cities across the country that exceed the slaughter standard weight range,And generate a csv containing the identified outliers,above all, Write a data quality analysis report.'})

①任务规划

I need to first search for the data quality analysis process for pigs in the market to understand the standards and guidelines. Then, I should analyze the pig_market_data to identify abnormal weight values that exceed the slaughter standard weight range. Finally, I need to generate a csv file with the identified outliers and write a data quality analysis report.

在这里插入图片描述

②执行第一个Agent:检索生成数据分析流程

Observation: {'input': 'Data quality analysis process for pigs in the market', 'output': 'The data quality analysis process for pigs in the market involves querying abnormal data, using methods like Z-score and GNN to discover anomalies, and generating a data report to evaluate data quality in terms of integrity, consistency, timeliness, and accuracy.'}

在这里插入图片描述
③执行第二个Agent:根据数据质量分析流程对数据质量分析,使用Z-score算法
在这里插入图片描述
生成的异常数据文件
在这里插入图片描述

④执行第三个Agent:数据质量分析报告
在这里插入图片描述

六、讨论

通过本次测试,初步验证了将LLM构建的Data Multi-Agents嵌入数据平台用作数据治理的可行性,并实现了多个Agent的协同工作、以及根据私有知识库自动分析数据质量。在测试过程中有如下问题尚需考虑:
1、大模型选择问题:本次实践针对每个Agent的核心—大模型均做了测试,理论上每个Agent可以使用不同的大模型作为核心。经过本次测试,有如下考虑或者建议,针对每个Agent的用途可以选择适合的大模型,比如数据分析(python code)可以选择代码生成能力强的大模型,Plan&Excute Agent可以选择任务规划能力强的大模型,甚至针对特定领域需要微调一个适合、匹配的大模型。
2、知识检索Agent可能还需借助最新的RAG技术,以获取在海量数据标准中能检索匹配生成到更准确的知识块。
3、数据质量分析Agent应对多表、多库以及各表之间的关联等更复杂的场景,需要建立一个数据资源基础信息库,如元数据、数据血缘、数据库表关联关系等信息,大模型处理起来会更全面和精准。

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.hqwc.cn/news/597934.html

如若内容造成侵权/违法违规/事实不符,请联系编程知识网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

视听杂志知网收录期刊投稿发表论文

《视听》是由国家新闻出版总署批准的正规的&#xff0c;兼有新闻传播、新媒体理论探索的当代学术性省级综合期刊。坚持正确的理论导向&#xff0c;全面展示广播影视事业发展中出现的新情况、新事物&#xff0c;探索现代传媒基础理论&#xff0c;研究网络传播、通讯传播等新媒体…

leetcode.707. 设计链表

题目 题意&#xff1a; 在链表类中实现这些功能&#xff1a; get(index)&#xff1a;获取链表中第 index 个节点的值。如果索引无效&#xff0c;则返回-1。 addAtHead(val)&#xff1a;在链表的第一个元素之前添加一个值为 val 的节点。插入后&#xff0c;新节点将成为链表的…

文心一言指令词宝典之自媒体篇

作者&#xff1a;哈哥撩编程&#xff08;视频号、抖音、公众号同名&#xff09; 新星计划全栈领域优秀创作者博客专家全国博客之星第四名超级个体COC上海社区主理人特约讲师谷歌亚马逊演讲嘉宾科技博主极星会首批签约作者 &#x1f3c6; 推荐专栏&#xff1a; &#x1f3c5;…

【算法】两数之和(暴力求解+哈希表)

本题来源---《两数之和》。 题目描述 给定一个整数数组 nums 和一个整数目标值 target&#xff0c;请你在该数组中找出 和为目标值 target 的那 两个 整数&#xff0c;并返回它们的数组下标。 你可以假设每种输入只会对应一个答案。但是&#xff0c;数组中同一个元素在答案里…

恶意代码的多种形式

恶意代码是一种程序代码有害的计算机代码或 web 脚本&#xff0c;其设计目的是创建系统漏洞&#xff0c;并借以造成后门、安全隐患、信息和数据盗窃、以及其他对文件和计算机系统的潜在破坏。为了防范恶意代码的攻击&#xff0c;我们可以采取一系列措施。德迅云安全-代码审计可…

深度学习理论基础(七)Transformer编码器和解码器

目录 前述&#xff1a; Transformer总体结构框图一、编码器encoder1. 编码器作用2. 编码器部分&#xff08;1&#xff09;单个编码器层代码&#xff08;2&#xff09;编码器总体代码 二、解码器decoder1. 解码器作用2. 解码器部分&#xff08;1&#xff09;单个解码器层代码&am…

纯CSS实现未读消息显示99+

在大佬那看到这个小技巧&#xff0c;我觉得这个功能点还挺常用&#xff0c;所以给大家分享下具体的实现。当未读消息数小于100的时候显示准确数值&#xff0c;大于99的时候显示99。 1. 实现效果 2. 组件封装 <template><span class"col"><sup :styl…

JavaEE 初阶篇-深入了解线程池(线程池创建、线程池如何处理任务)

&#x1f525;博客主页&#xff1a; 【小扳_-CSDN博客】 ❤感谢大家点赞&#x1f44d;收藏⭐评论✍ 文章目录 1.0 线程池概述 1.1 线程池的优点 1.2 不使用线程池的问题 1.3 线程池的工作原理图 1.4 如何创建线程池&#xff1f; 2.0 通过 ThreadPoolExecutor 类自定义创建线程…

基于springboot+vue的地方美食分享网站(全套资料)

一、系统架构 前端&#xff1a;vue | element-ui 后端&#xff1a;springboot | mybatis-plus 环境&#xff1a;jdk8 | mysql | maven | node 二、代码及数据库 三、功能介绍 01. 用户端-登录 02. 用户端-首页 03. 用户端-外国美食 04. 用户端-中式美食 05. 用户端-热…

Centos7使用docker安装Jenkins(含pipeline脚本语句)

一、下载Jenkins docker pull jenkins/jenkins:lts 二、启动Jenkins docker run \-u root \--rm \-d \-p 8081:8080 \-p 50000:50000 \-v /root/docker/jenkins/var/jenkins_home:/var/jenkins_home \-v /var/run/docker.sock:/var/run/docker.sock \-v /usr/bin/docker:/usr…

2024.4.4-day09-CSS 布局模型(标准流模型、浮动模型)

个人主页&#xff1a;学习前端的小z 个人专栏&#xff1a;HTML5和CSS3悦读 本专栏旨在分享记录每日学习的前端知识和学习笔记的归纳总结&#xff0c;欢迎大家在评论区交流讨论&#xff01; 文章目录 作业 2024.4.4-学习笔记1 CSS 布局模型1.1 标准流1.2 CSS 浮动1.3 去除塌陷 2…