Data Collection and Fusion Technology: Comprehensive Design Project

2024/12/16 3:13:28 / Source: https://www.cnblogs.com/Con1427/p/18607800

职途启航 (Career Launch)

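Data collection project practice	https://edu.cnblogs.com/campus/fzu/2024DataCollectionandFusiontechnology
Group name and project overview	Group: 数据矿工 (Data Miners). Requirements: scrape job information from recruitment websites, build an information-matching system, and so on. Overview: recommend the most suitable jobs based on a job seeker's personal information, and suggest suitable cities based on industry data for each province. Technical approach: database access uses the pymysql library to talk to MySQL, run SQL queries, and fetch data; the web layer uses the Flask framework to create the web service and API endpoints; pywsgi serves as the WSGI server that runs the Flask app.
Team members and student IDs	102202132 郑冰智, 102202131 林鑫, 102202143 梁锦盛, 102202111 刘哲睿, 102202122 张诚坤, 102202136 洪金举
Project goal	Recommend the most suitable jobs based on a job seeker's personal information
Other references	https://github.com/balloonwj/CppGuide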

Repository link (Gitee): https://gitee.com/zheng-bingzhi/2022-level-data-collection/tree/master/职业规划与就业分析平台

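I. Project Introduction:

1.1 Project Background:

With the rapid growth of the internet, online recruitment platforms have multiplied, giving job seekers and recruiters a broad place to connect. Faced with a flood of postings, however, many job seekers struggle to judge which positions fit their skills and experience. In addition, cities differ in their development priorities and dominant industries, so the same position can have very different salary levels and prospects from one region to another. A tool that offers accurate job matching together with city and industry analysis is therefore especially valuable to job seekers.

1.2 Project Goals:

This project aims to build an integrated online recruitment service platform that gives job seekers personalized job recommendations through intelligent algorithms and analyzes industry development in each province, helping them make better-informed career choices.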

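1.3 Project Features:

1. Intelligent job matching: use machine learning algorithms to analyze a job seeker's resume and skills, match them against the postings in the database, and recommend the most suitable positions.
2. City and industry analysis: collect and analyze data on the main industries of each province, including industry size, growth trend, and average salary, to give job seekers a reference for industry prospects.
3. Career path planning: based on the job seeker's long-term career goals and personal preferences, suggest a career development path, including the skills to build and advice on changing direction.
4. User interface: provide an intuitive, easy-to-use interface so that job seekers can browse positions, view industry analysis reports, and get career advice.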
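The job-matching feature listed above can be illustrated with a deliberately simple stand-in. The sketch below is not the project's algorithm, which the original does not show; it only ranks postings by the Jaccard overlap between a job seeker's skills and each posting's keywords. The function names `jaccard` and `rank_jobs` and the sample postings are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of strings, case-insensitive."""
    sa = {x.lower() for x in a}
    sb = {x.lower() for x in b}
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def rank_jobs(skills, jobs, top_n=3):
    """Return up to top_n (score, job name) pairs, best match first."""
    scored = [
        (jaccard(skills, job["Job_Keywords"]), job["Occupation_Name"])
        for job in jobs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Hypothetical sample postings, using the field names of the scraped items.
jobs = [
    {"Occupation_Name": "Backend Developer", "Job_Keywords": ["Python", "Flask", "MySQL"]},
    {"Occupation_Name": "Data Analyst", "Job_Keywords": ["SQL", "Excel", "Python"]},
    {"Occupation_Name": "UI Designer", "Job_Keywords": ["Figma", "Sketch"]},
]
ranking = rank_jobs(["python", "flask", "scrapy"], jobs)
```

A set-overlap score ignores word meaning and keyword importance, so it is best read as a baseline that a learned model would be compared against.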

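1.4 Overall System Structure

Data collection layer:

Scrape recruitment sites such as BOSS直聘 and 58同城 for posting details such as salary, company location, and company benefits.
Scrape, from the site the original calls 中国数据局, how the share of each industry in each province changes over time.
Store the scraped postings in a cloud database for later matching.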
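To make the "share of each industry over time" idea concrete, here is a minimal sketch of how such shares could be computed once the raw figures are collected. The record layout (`province`, `year`, `industry`, `value`), the function name `industry_shares`, and the numbers are all made up for illustration; the original does not describe the scraped data format.

```python
from collections import defaultdict


def industry_shares(records):
    """Map (province, year) -> {industry: share of that year's total}."""
    totals = defaultdict(float)
    for r in records:
        totals[(r["province"], r["year"])] += r["value"]

    shares = defaultdict(dict)
    for r in records:
        key = (r["province"], r["year"])
        if totals[key] > 0:  # skip groups with no data to avoid dividing by zero
            shares[key][r["industry"]] = r["value"] / totals[key]
    return dict(shares)


# Made-up figures, for illustration only.
records = [
    {"province": "Fujian", "year": 2022, "industry": "Manufacturing", "value": 60.0},
    {"province": "Fujian", "year": 2022, "industry": "Services", "value": 40.0},
    {"province": "Fujian", "year": 2023, "industry": "Manufacturing", "value": 50.0},
    {"province": "Fujian", "year": 2023, "industry": "Services", "value": 50.0},
]
shares = industry_shares(records)
# Year-over-year change in the Services share for this province.
change = shares[("Fujian", 2023)]["Services"] - shares[("Fujian", 2022)]["Services"]
```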

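Front end:

The interface is built with HTML and JavaScript and handles the interaction between users and the system. Users can enter their personal information, look up matching jobs, and get advice generated by the AI.

Back end:

Implemented in Python with the Flask framework. It calls the AI interface, processes and matches the stored crawler data, and provides data visualization.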

II. Individual Work

2.1 Scraping the recruitment pages

2.1.1 Main code

    import scrapy

    # JobItem is the Item class shown under "Item code" below; in a Scrapy
    # project it is normally imported from the project's items module.

    DELIMITER = '139842'  # separator for multi-valued fields stored in one column


    # The methods below belong to the spider class (its definition is not shown).
    def parse(self, response):
        """Parse the job list page and follow each posting's detail link."""
        jobs = response.xpath("//ul[@class='job-list-box']/li")
        for job in jobs:
            item = JobItem()
            item['Occupation_Name'] = job.xpath(".//span[@class='job-name']//text()").extract_first()
            item['Location'] = job.xpath(".//span[@class='job-area']//text()").extract_first()
            item['Salary'] = job.xpath(".//span[@class='salary']//text()").extract_first()
            item['Work_Experience'] = job.xpath(".//ul[@class='tag-list']/li[1]//text()").extract_first()
            item['Education'] = job.xpath(".//ul[@class='tag-list']/li[2]//text()").extract_first()
            tags = job.xpath(".//div[@class='job-card-footer clearfix']//ul[@class='tag-list']//li")
            # First text node of each tag, joined with the delimiter.
            # default='' avoids a TypeError when a tag has no text.
            item['Job_Keywords'] = DELIMITER.join(
                tag.xpath(".//text()").extract_first(default='') for tag in tags)
            part_url = job.xpath(".//a[@class='job-card-left']/@href").extract_first()
            detail_url = response.urljoin(part_url)
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        """Parse the job detail page, then follow the link to the company page."""
        item = response.meta['item']
        texts = response.xpath("//div[@class='job-detail-section']//div[@class='job-sec-text']//text()")
        item['Details'] = ''.join(texts.extract())
        welfares = response.xpath(".//div[@class='job-tags']//span")
        item['Company_welfare'] = DELIMITER.join(
            welfare.xpath(".//text()").extract_first(default='') for welfare in welfares)
        # The original takes the second text node; guard against shorter results.
        names = response.xpath("//li[@class='company-name']//text()").extract()
        item['Company_Name'] = names[1] if len(names) > 1 else None
        part_url = response.xpath("//a[@class='look-all'][@ka='job-cominfo']/@href").extract_first()
        image_url = response.urljoin(part_url)
        yield scrapy.Request(url=image_url, callback=self.parse_image, meta={'item': item})

    def parse_image(self, response):
        """Parse the company page: profile text and photo URLs."""
        item = response.meta['item']
        # The photo carousel uses either a row or a column layout; the column
        # layout takes precedence when both are present, as in the original.
        images = response.xpath("//ul[@class='swiper-wrapper swiper-wrapper-row']//li")
        col_images = response.xpath("//ul[@class='swiper-wrapper swiper-wrapper-col']//li")
        if col_images:
            images = col_images
        texts = response.xpath("//div[@class='job-sec']//div[@class='text fold-text']//text()")
        item['Company_Profile'] = ''.join(texts.extract())
        item['Company_Photo'] = DELIMITER.join(
            image.xpath(".//img/@src").extract_first(default='') for image in images)
        yield item

Item code:

    import scrapy


    class JobItem(scrapy.Item):
        Occupation_Name = scrapy.Field()
        Location = scrapy.Field()
        Salary = scrapy.Field()
        Work_Experience = scrapy.Field()
        Education = scrapy.Field()
        Job_Keywords = scrapy.Field()
        Details = scrapy.Field()
        Company_Name = scrapy.Field()
        Company_Profile = scrapy.Field()
        Company_welfare = scrapy.Field()
        Company_Photo = scrapy.Field()
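The spider above packs multi-valued fields (keywords, benefits, photo URLs) into a single string using the separator `139842`, and the detail API later splits them again. The following sketch shows that round trip in isolation. `join_values` and `split_values` are hypothetical helper names, and skipping empty values is an assumption of this sketch rather than something the original code does.

```python
DELIMITER = "139842"  # separator string used by the spider above


def join_values(values, delimiter=DELIMITER):
    """Join the non-empty values into one string for a single text column."""
    return delimiter.join(v.strip() for v in values if v and v.strip())


def split_values(data, delimiter=DELIMITER):
    """Inverse of join_values; an empty or missing column yields an empty list."""
    return data.split(delimiter) if data else []


welfare = ["Insurance", None, " Paid leave ", ""]
stored = join_values(welfare)
restored = split_values(stored)
```

One limitation worth keeping in mind: a value that happens to contain `139842` would be split in the wrong place. Storing the list as JSON text is a common alternative that avoids this.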

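2.2 Endpoint for listing all jobs

2.2.1 Import and initialize the Flask framework and enable CORS (steps repeated in later endpoints are not shown again)

    from flask_cors import CORS
    from flask import Flask, jsonify
    import random
    import pymysql

    app = Flask(__name__)
    CORS(app)  # allow cross-origin requests from the front end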

2.2.2 Database connection settings

    DB_CONFIG = {
        'host': '81.70.22.101',
        'user': 'root',
        'password': 'xxx',
        'database': 'job',
        'charset': 'utf8mb4'
    }

2.2.3 Define the function that fetches recruitment data

    def fetch_recruitment_data():
        """Fetch recruitment data from the database."""
        connection = pymysql.connect(**DB_CONFIG)
        try:
            # DictCursor returns each row as a dict keyed by column name
            with connection.cursor(pymysql.cursors.DictCursor) as cursor:
                sql = "SELECT id, Occupation_Name, Location, Salary, Company_Name, big_kind FROM jobInfo"
                cursor.execute(sql)
                result = cursor.fetchall()
        finally:
            connection.close()
        return result

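2.2.4 Define the province recruitment endpoint

    # Assumption: the original does not show where initial_data is defined;
    # loading it once at startup with fetch_recruitment_data() is the presumed intent.
    initial_data = fetch_recruitment_data()


    @app.route('/province-recruitment', methods=['GET'])
    def get_province_recruitment():
        # The first call returns the fixed order; later calls (refreshes) return a random order.
        if not getattr(get_province_recruitment, 'loaded_once', False):
            get_province_recruitment.loaded_once = True
            response_data = initial_data  # fixed order
        else:
            response_data = random.sample(initial_data, len(initial_data))  # random order
        print("Response data:", response_data)
        return jsonify({
            "code": 200,
            "message": "Success",
            "data": response_data
        })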
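The "fixed order on first load, random order on refresh" behavior of this endpoint can be isolated from Flask and tested on its own. The sketch below is one possible way to package it; the class name `OrderedThenShuffled` and the injectable `rng` parameter are hypothetical and not part of the original code.

```python
import random


class OrderedThenShuffled:
    """Return items in their original order on the first call, shuffled afterwards."""

    def __init__(self, items, rng=None):
        self._items = list(items)
        self._rng = rng or random.Random()
        self._served_once = False

    def next_view(self):
        if not self._served_once:
            self._served_once = True
            return list(self._items)  # first load: original order
        # Later loads: a new list with the same items in random order.
        return self._rng.sample(self._items, len(self._items))


feed = OrderedThenShuffled(["job-1", "job-2", "job-3", "job-4"], rng=random.Random(0))
first = feed.next_view()
second = feed.next_view()
```

As with the function attribute used in the endpoint, the flag here is shared by the whole process, so "first load" means the first request from any client, not the first request from each user.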

2.3 Endpoint for job details

2.3.1 Configure the database connection

    DB_CONFIG = {
        "host": "81.70.22.101",
        "user": "root",
        "password": "xxx",
        "database": "job",
        "charset": "utf8mb4"
    }

2.3.2 Define the data-splitting function and the function that fetches a posting's details from the database

    def split_data(data, delimiter="139842"):
        """Split a delimiter-joined column back into a list."""
        return data.split(delimiter) if data else []


    def get_recruitment_from_db(recruitment_id):
        connection = pymysql.connect(**DB_CONFIG)
        try:
            with connection.cursor() as cursor:
                sql = """
                    SELECT Occupation_Name, Location, Salary, Work_Experience,
                           Education, Job_Keywords, Details, Company_Name,
                           Company_Profile, Company_Photo, Company_Welfare
                    FROM jobInfo
                    WHERE id = %s
                """
                # The id is passed as a query parameter, not formatted into the SQL
                cursor.execute(sql, (recruitment_id,))
                result = cursor.fetchone()
                if result:
                    return {
                        "Occupation_Name": result[0],
                        "Location": result[1],
                        "Salary": result[2],
                        "Work_Experience": result[3],
                        "Education": result[4],
                        "Job_Keywords": split_data(result[5]),
                        "Details": result[6],
                        "Company_Name": result[7],
                        "Company_Profile": result[8],
                        "Company_Photo": split_data(result[9]),
                        "Company_Welfare": split_data(result[10])
                    }
                return None
        finally:
            connection.close()
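The function above maps the result row to a dict by position (`result[0]` through `result[10]`), which is easy to get out of step with the `SELECT` list. A possible alternative, sketched below with no database involved, pairs the column names with the row using `zip`. `COLUMNS`, `LIST_COLUMNS`, `row_to_dict`, and the sample row are hypothetical names and data introduced for illustration.

```python
# Column order must match the SELECT list of the query.
COLUMNS = [
    "Occupation_Name", "Location", "Salary", "Work_Experience", "Education",
    "Job_Keywords", "Details", "Company_Name", "Company_Profile",
    "Company_Photo", "Company_Welfare",
]
# Columns that hold several values joined by the delimiter.
LIST_COLUMNS = {"Job_Keywords", "Company_Photo", "Company_Welfare"}


def split_data(data, delimiter="139842"):
    return data.split(delimiter) if data else []


def row_to_dict(row):
    """Turn a positional row tuple into a dict, splitting multi-valued columns."""
    record = dict(zip(COLUMNS, row))
    for name in LIST_COLUMNS:
        record[name] = split_data(record[name])
    return record


# Hypothetical row, shaped like the tuple a default cursor would return.
row = (
    "Backend Developer", "Fuzhou", "10-15K", "1-3 years", "Bachelor",
    "Python139842Flask", "Build APIs", "Example Co.", "A sample company",
    None, "Insurance139842Paid leave",
)
detail = row_to_dict(row)
```

Using `pymysql.cursors.DictCursor`, as the list endpoint in 2.2.3 does, would make the database driver return dicts directly and leave only the splitting step.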

2.3.3 Define the endpoint: recruitment detail lookup

    from flask import request  # needed to read the query string


    @app.route('/recruitment-detail', methods=['GET'])
    def get_recruitment_detail():
        recruitment_id = request.args.get('id')
        if not recruitment_id:
            return jsonify({"code": 400, "message": "recruitment_id is required"}), 400
        recruitment_detail = get_recruitment_from_db(recruitment_id)
        if recruitment_detail:
            return jsonify({"code": 200, "message": "Success", "data": recruitment_detail})
        else:
            return jsonify({"code": 404, "message": "Recruitment detail not found"}), 404
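The endpoint passes the raw `id` string straight to the database lookup. If the `id` column is a positive integer (an assumption; the original does not show the table schema), the value could be validated first so that malformed input gets a 400 response instead of a database round trip. `parse_recruitment_id` is a hypothetical helper.

```python
def parse_recruitment_id(raw):
    """Return a positive int, or None if raw is missing or not a valid id."""
    if raw is None:
        return None
    text = raw.strip()
    # isascii() rules out Unicode digit characters that int() cannot parse.
    if not (text.isascii() and text.isdigit()):
        return None
    value = int(text)
    return value if value > 0 else None
```

In the endpoint, a `None` result from this helper would map to the existing 400 branch.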

2.4 Endpoint that gives AI-generated advice based on the user's resume

2.4.1 Get the access token

    import requests

    # AI_AUTH_URL, API_KEY and SECRET_KEY are defined elsewhere in the project (not shown).


    def get_access_token():
        try:
            url = f"{AI_AUTH_URL}?grant_type=client_credentials&client_id={API_KEY}&client_secret={SECRET_KEY}"
            response = requests.post(url)
            response.raise_for_status()
            return response.json().get("access_token")
        except requests.RequestException as err:
            print(f"Failed to get access token: {err}")
            return None
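Because the advice endpoint requests a token on every call, a small cache may save one HTTP round trip per request. The sketch below assumes the token stays valid for some period; the original does not state the lifetime, so `ttl_seconds` is a placeholder to be set from the provider's documentation. `TokenCache` is a hypothetical class, and the fetch function is injected so the example runs without network access.

```python
import time


class TokenCache:
    """Cache a token for ttl_seconds and refetch it once it has expired."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # callable returning a token or None
        self._ttl = ttl_seconds
        self._clock = clock          # injectable for testing
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            token = self._fetch()
            if token is None:
                return None          # fetch failed; retry on the next call
            self._token = token
            self._expires_at = now + self._ttl
        return self._token


# Demonstration with a fake fetcher and a fake clock.
calls = []

def fake_fetch():
    calls.append(1)
    return f"token-{len(calls)}"

fake_now = [0.0]
cache = TokenCache(fake_fetch, ttl_seconds=60, clock=lambda: fake_now[0])
t1 = cache.get()      # fetches
t2 = cache.get()      # served from cache
fake_now[0] = 61.0    # move past the expiry time
t3 = cache.get()      # fetches again
```

In the application, `TokenCache(get_access_token, ttl_seconds=...)` would wrap the function shown above.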

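2.4.2 Call the AI model to generate a result

    def get_ai_response(payload, access_token):
        url = f"{AI_API_URL}?access_token={access_token}"
        headers = {"Content-Type": "application/json"}
        try:
            response = requests.post(url, json=payload, headers=headers)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as err:
            print(f"AI API request failed: {err}")
            # If the request never got a response (e.g. a connection error), the
            # local name `response` is unbound, so read the status from the exception.
            status_code = getattr(err.response, "status_code", None)
            return {"error": "Failed to get response from AI API", "status_code": status_code}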

2.4.3 Parse the AI response

    import re


    def parse_ai_response(response_text):
        sections = {
            "Current_Situation": "",
            "Interview_Advice": "",
            "Career_Direction": "",
            "Communication_Skills": ""
        }
        # Each section starts with a "==== Name ====" header line and runs
        # until the next header or the end of the text.
        pattern = r'====\s*(Current_Situation|Interview_Advice|Career_Direction|Communication_Skills)\s*====\n(.+?)(?=\n====|$)'
        matches = re.findall(pattern, response_text, re.DOTALL)
        for section_name, content in matches:
            sections[section_name.strip()] = content.strip()
        return sections
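A short usage example may help show what the parser expects. The function below repeats the same regular expression in compact form so the block runs on its own; the sample text is invented and much shorter than a real model reply.

```python
import re

SECTION_NAMES = [
    "Current_Situation", "Interview_Advice",
    "Career_Direction", "Communication_Skills",
]
PATTERN = r'====\s*(' + '|'.join(SECTION_NAMES) + r')\s*====\n(.+?)(?=\n====|$)'


def parse_ai_response(response_text):
    sections = {name: "" for name in SECTION_NAMES}
    for name, content in re.findall(PATTERN, response_text, re.DOTALL):
        sections[name] = content.strip()
    return sections


# Invented sample reply containing two of the four sections.
sample = (
    "==== Current_Situation ====\n"
    "Demand for backend developers is steady.\n"
    "Entry-level roles are competitive.\n"
    "==== Interview_Advice ====\n"
    "Prepare two project walkthroughs."
)
parsed = parse_ai_response(sample)
```

Sections whose header is missing from the reply stay as empty strings, so a caller that needs all four sections has to check the values and not just the presence of the keys.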

2.4.4 Define the career advice endpoint

    @app.route('/recommend-career-advice', methods=['POST'])
    def recommend_career_advice():
        # silent=True yields None instead of raising on a missing or invalid JSON body
        data = request.get_json(silent=True)
        if not isinstance(data, dict):
            data = {}
        required_params = ["desired_position", "expected_salary", "resume", "work_experience"]
        for param in required_params:
            if not data.get(param):
                return jsonify({"code": 400, "message": f"{param} is required"}), 400
        desired_position = data["desired_position"]
        expected_salary = data["expected_salary"]
        resume = data["resume"]
        work_experience = data["work_experience"]
        requirements = data.get("requirements", "")

        access_token = get_access_token()
        if not access_token:
            return jsonify({"code": 500, "message": "Failed to retrieve access token"}), 500

        # Per-section instructions sent to the model (kept in Chinese, as in the original)
        sections = {
            # Prospects and job market for the desired position
            "Current_Situation": "请详细描述应聘者的期望岗位的发展前景和就业情况。",
            # Interview preparation and how to present one's strengths
            "Interview_Advice": "提供针对面试的详细建议，如如何准备面试，如何展示自己的优势等。",
            # Long-term career planning based on position and skills
            "Career_Direction": "根据应聘者的期望职位和技能背景，提供详细的长期职业规划建议。",
            # Workplace communication skills
            "Communication_Skills": "提供应聘者在工作中的详细沟通技巧建议。"
        }
        results = {}
        for section, description in sections.items():
            # The prompt asks for a reply of at least 300 characters that starts
            # with the "==== {section} ====" marker expected by parse_ai_response.
            prompt = (
                "\n根据以下求职信息，生成详细的个性化职业规划建议。请确保按照以下部分返回回答，"
                f"并在开头使用“==== {section} ====”分隔标识符，确保内容超过300字：\n"
                f"==== {section} ====\n"
                f"{description}\n"
                f"期望职位: {desired_position}\n"
                f"期望薪资: {expected_salary}\n"
                f"工作经验: {work_experience}\n"
                f"简历: {resume}\n"
                f"求职要求: {requirements}\n"
                "确保返回的内容详细且字数不少于 300 字。\n"
            )
            ai_payload = {
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0.7,
                "max_tokens": 1024
            }
            response = get_ai_response(ai_payload, access_token)
            if "result" not in response:
                return jsonify({"code": 500, "message": f"Failed to generate {section}"}), 500
            results[section] = response["result"]

        parsed_results = parse_ai_response("\n".join(results.values()))
        # parse_ai_response always returns a dict with all four keys,
        # so check the values to detect a failed parse.
        if not any(parsed_results.values()):
            return jsonify({"code": 500, "message": "Failed to parse AI response"}), 500
        return jsonify({
            "code": 200,
            "message": "Success",
            "data": parsed_results
        })