数据采集作业3

news/2025/1/11 6:59:06/文章来源:https://www.cnblogs.com/xiaowang254/p/18525935
课程链接 https://edu.cnblogs.com/campus/fzu/2024DataCollectionandFusiontechnology
作业链接 https://edu.cnblogs.com/campus/fzu/2024DataCollectionandFusiontechnology/homework/13287
实验三仓库链接 https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3

一、实验准备

1.查看Scrapy库安装

2.查看数据库安装

3.创建作业文件夹

二 、实验编写

作业1

  • 数据采集实验

Gitee链接:https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3/weather_spider
要求:指定一个网站,爬取该网站中的所有图片,例如中国气象网(http://www.weather.com.cn)。使用 scrapy 框架分别实现单线程和多线程的方式爬取。
输出信息:将下载的 Url 信息在控制台输出,并把下载的图片存储在 images 子文件中
1.创建 Scrapy 项目
(1)进入作业文件夹、创建新的 Scrapy 项目

(2)查看目录

(3)进入项目目录

2.编写爬虫
(1)创建爬虫文件
在 weather_spider/spiders 目录下创建一个爬虫文件:weather_spider.py
gitee链接:https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/weather_spider/weather_spider/spiders/weather_spider.py
(2)编辑爬虫文件,代码如下

点击查看代码
import scrapy
from itemadapter import ItemAdapter
import logging
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItemclass ImagePipeline(ImagesPipeline):def get_media_requests(self, item, info):for image_url in item['image_urls']:logging.info(f"Downloading image: {image_url}")yield scrapy.Request(image_url)def file_path(self, request, response=None, info=None, *, item=None):image_guid = request.url.split('/')[-1]return f'images/{image_guid}'def item_completed(self, results, item, info):image_paths = [x['path'] for ok, x in results if ok]if not image_paths:raise DropItem("Item contains no images")return item

(3)编辑 settings.py 文件,代码如下
gitee链接:https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/weather_spider/weather_spider/settings.py

点击查看代码
BOT_NAME = "weather_spider"
SPIDER_MODULES = ["weather_spider.spiders"]
NEWSPIDER_MODULE = "weather_spider.spiders"
# 启用图片下载管道
ITEM_PIPELINES = {'weather_spider.pipelines.ImagePipeline': 1,
}
# 配置图片存储路径
IMAGES_STORE = 'images'
# 配置并发请求数量
CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 0.5  # 每个请求之间的延迟时间
ROBOTSTXT_OBEY = True
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

(4)编辑 pipelines.py 文件,代码如下
gitee链接:https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/weather_spider/weather_spider/pipelines.py

点击查看代码
import scrapy
from itemadapter import ItemAdapter
import logging
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItemclass ImagePipeline(ImagesPipeline):def get_media_requests(self, item, info):for image_url in item['image_urls']:logging.info(f"Downloading image: {image_url}")yield scrapy.Request(image_url)def file_path(self, request, response=None, info=None, *, item=None):image_guid = request.url.split('/')[-1]return f'images/{image_guid}'def item_completed(self, results, item, info):image_paths = [x['path'] for ok, x in results if ok]if not image_paths:raise DropItem("Item contains no images")return item

(5)编辑 items.py 文件,代码如下
gitee链接:https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/weather_spider/weather_spider/items.py
3.运行爬虫
(1)在命令行或终端中,导航到项目根目录:scrapy crawl weather

(2)显示结果截图
gitee链接:https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3/weather_spider/weather_spider/images/images

  • 心得体会
    虽然还是爬取天气网,但这次是第一次用scrapy框架,能感受到scrapy更加便捷,具有良好的可扩展性和灵活性。

作业2

  • 数据采集实验

Gitee链接:https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3/stock_spider
要求:熟练掌握 scrapy 中 Item、Pipeline 数据的序列化输出方法;Scrapy+Xpath
MySQL数据库存储技术路线爬取股票相关信息。MySQL数据库存储
候选网站:
东方财富网:https://www.eastmoney.com/
新浪财经:http://finance.sina.com.cn/stock/
1.创建 Scrapy 项目
(1)进入作业文件夹、创建新的 Scrapy 项目

(2)查看目录

(3)进入项目目录

2.编写爬虫
(1)创建爬虫文件
在 stock_spider/spiders 目录下创建一个爬虫文件:stock_spider.py
Gitee链接:https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/stock_spider/stock_spider/spiders/stock_spider.py
(2)编辑爬虫文件,代码如下

点击查看代码
import scrapy
from stock_spider.items import StockItem
class StockSpider(scrapy.Spider):name = 'stock'allowed_domains = ['finance.sina.com.cn']start_urls = ['http://vip.stock.finance.sina.com.cn/q/go.php/vInvestConsult/kind/lsjy/index.phtml']def parse(self, response):rows = response.xpath('//table[@class="list_table"]//tr')[1:]  # 跳过表头for row in rows:item = StockItem()item['id'] = row.xpath('td[1]/text()').get()item['bStockNo'] = row.xpath('td[2]/a/text()').get()item['stockName'] = row.xpath('td[3]/a/text()').get()item['latestPrice'] = row.xpath('td[4]/text()').get()item['changePercent'] = row.xpath('td[5]/text()').get()item['changeAmount'] = row.xpath('td[6]/text()').get()item['volume'] = row.xpath('td[7]/text()').get()item['turnover'] = row.xpath('td[8]/text()').get()item['amplitude'] = row.xpath('td[9]/text()').get()item['highest'] = row.xpath('td[10]/text()').get()item['lowest'] = row.xpath('td[11]/text()').get()item['openPrice'] = row.xpath('td[12]/text()').get()item['closePrice'] = row.xpath('td[13]/text()').get()yield item

(3)编辑 settings.py 文件,代码如下
Gitee链接:https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/stock_spider/stock_spider/settings.py

点击查看代码
BOT_NAME = "stock_spider"
SPIDER_MODULES = ["stock_spider.spiders"]
NEWSPIDER_MODULE = "stock_spider.spiders"
ITEM_PIPELINES = {'stock_spider.pipelines.MySQLPipeline': 300,
}
# settings.py
DB_SETTINGS = {'db': 'stock_spider','user': 'root','passwd': '123456','host': 'localhost','port': 3306,
}
ROBOTSTXT_OBEY = False
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

(4)编辑 pipelines.py 文件,代码如下
Gitee链接:https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/stock_spider/stock_spider/pipelines.py

点击查看代码
import mysql.connector
from mysql.connector import errorcode
class MySQLPipeline:def __init__(self):self.connection = Noneself.cursor = Nonedef open_spider(self, spider):try:self.connection = mysql.connector.connect(host='localhost',user='root',password='123456',database='stock_spider')self.cursor = self.connection.cursor()self.create_table()except mysql.connector.Error as err:if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:print("Something is wrong with your user name or password")elif err.errno == errorcode.ER_BAD_DB_ERROR:print("Database does not exist")else:print(err)self.cursor = Noneself.connection = Nonedef create_table(self):if self.cursor:create_table_query = """CREATE TABLE IF NOT EXISTS stocks (id INT AUTO_INCREMENT PRIMARY KEY,bStockNo VARCHAR(255),stockName VARCHAR(255),latestPrice VARCHAR(255),changePercent VARCHAR(255),changeAmount VARCHAR(255),volume VARCHAR(255),turnover VARCHAR(255),amplitude VARCHAR(255),highest VARCHAR(255),lowest VARCHAR(255),openPrice VARCHAR(255),closePrice VARCHAR(255))"""self.cursor.execute(create_table_query)self.connection.commit()def process_item(self, item, spider):if not self.cursor:return item  # 如果 cursor 为 None,直接返回 iteminsert_query = """INSERT INTO stocks (bStockNo, stockName, latestPrice, changePercent, changeAmount, volume, turnover, amplitude, highest, lowest, openPrice, closePrice)VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""values = (item.get('bStockNo', ''),item.get('stockName', ''),item.get('latestPrice', ''),item.get('changePercent', ''),item.get('changeAmount', ''),item.get('volume', ''),item.get('turnover', ''),item.get('amplitude', ''),item.get('highest', ''),item.get('lowest', ''),item.get('openPrice', ''),item.get('closePrice', ''))try:self.cursor.execute(insert_query, values)self.connection.commit()except mysql.connector.Error as err:print(f"Error inserting data: {err}")return itemdef close_spider(self, spider):if self.cursor:self.cursor.close()if self.connection:self.connection.close()

(5)编辑 items.py 文件,代码如下
Gitee链接:https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/stock_spider/stock_spider/items.py

点击查看代码
import scrapy
class StockItem(scrapy.Item):id = scrapy.Field()bStockNo = scrapy.Field()  # 股票代码stockName = scrapy.Field()  # 股票名称latestPrice = scrapy.Field()  # 最新报价changePercent = scrapy.Field()  # 涨跌幅changeAmount = scrapy.Field()  # 涨跌额volume = scrapy.Field()  # 成交量turnover = scrapy.Field()  # 成交额amplitude = scrapy.Field()  # 振幅highest = scrapy.Field()  # 最高lowest = scrapy.Field()  # 最低openPrice = scrapy.Field()  # 今开closePrice = scrapy.Field()  # 昨收

3.运行爬虫
(1)在命令行或终端中,导航到项目根目录:scrapy crawl stock

(2)显示结果截图
Gitee链接:https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3/stock_spider/data

  • 心得体会
    在作业2中系统的学习了基本的使用scrapy框架爬取数据方法以及xpath爬取信息的方法,我在这一实验中一一体会,受益颇多。

作业3

  • 数据采集实验

Gitee链接:https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3/currency_project
要求: 熟练掌握 scrapy 中 Item、Pipeline 数据的序列化输出方法; 使用 scrapy 框架 + Xpath + MySQL 数据库存储技术路线爬取外汇网站数据。使用 scrapy 框架 + MySQL 数据库存储技术爬取外汇网站数据。
候选网站: 招商银行网: https://www.boc.cn/sourcedb/whpj
输出信息: (MySQL 数据库存储和输出格式)
1.创建 Scrapy 项目
(1)进入作业文件夹、创建新的 Scrapy 项目

(2)查看目录

(3)进入项目目录

2.编写爬虫
(1)创建爬虫文件
在 currency_project/spiders 目录下创建一个爬虫文件:CurrencySpider.py
Gitee链接:https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/currency_project/currency_project/spiders/CurrencySpider.py
(2)编辑爬虫文件,代码如下

点击查看代码
import scrapy
from scrapy import signals
from scrapy.utils.log import configure_logging
from ..items import CurrencyItem
class CurrencySpider(scrapy.Spider):name = "currency"allowed_domains = ["boc.cn"]start_urls = ["https://www.boc.cn/sourcedb/whpj/"]def __init__(self, *args, **kwargs):# 配置日志输出configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})super().__init__(*args, **kwargs)def start_requests(self):for url in self.start_urls:yield scrapy.Request(url, callback=self.parse, errback=self.errback)def parse(self, response):# 使用XPath选择所有<tr>元素rows = response.xpath("//tr[position()>1]")  # 忽略第一个<tr>元素# 遍历每个<tr>元素for row in rows:# 使用XPath选择当前<tr>下的所有<td>元素,并提取文本值currencyname = row.xpath("./td[1]//text()").get()hui_in = row.xpath("./td[2]//text()").get()chao_in = row.xpath("./td[3]//text()").get()hui_out = row.xpath("./td[4]//text()").get()chao_out = row.xpath("./td[5]//text()").get()zhonghang = row.xpath("./td[6]//text()").get()date = row.xpath("./td[7]//text()").get()time = row.xpath("./td[8]//text()").get()currency = CurrencyItem()currency['currencyname'] = str(currencyname)currency['hui_in'] = str(hui_in)currency['chao_in'] = str(chao_in)currency['hui_out'] = str(hui_out)currency['chao_out'] = str(chao_out)currency['zhonghang'] = str(zhonghang)currency['date'] = str(date)currency['time'] = str(time)yield currencydef errback(self, failure):self.logger.error(repr(failure))

(3)编辑 settings.py 文件,代码如下
Gitee链接:https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/currency_project/currency_project/settings.py

点击查看代码
BOT_NAME = "currency_project"
SPIDER_MODULES = ["currency_project.spiders"]
NEWSPIDER_MODULE = "currency_project.spiders"
ITEM_PIPELINES = {'currency_project.pipelines.CurrencyPipeline': 300,
}
LOG_LEVEL = 'INFO'
MYSQL_HOST = 'localhost'
MYSQL_PORT = 3307
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_DB = 'data acquisition'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'waihui (+http://www.yourdomain.com)'# Obey robots.txt rules
ROBOTSTXT_OBEY = False

(4)编辑 pipelines.py 文件,代码如下
Gitee链接:https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/currency_project/currency_project/pipelines.py

点击查看代码
import pymysql
class CurrencyPipeline:def __init__(self):# 初始化数据库连接参数self.host = "localhost"self.port = 3306self.user = "root"self.password = "123456"self.db = "data acquisition"self.charset = "utf8"self.table_name = "currency"def open_spider(self, spider):# 在爬虫启动时建立数据库连接self.client = pymysql.connect(host=self.host,port=self.port,user=self.user,password=self.password,db=self.db,charset=self.charset)self.cursor = self.client.cursor()# 创建表(如果不存在)create_table_query = """CREATE TABLE IF NOT EXISTS {} (id INT AUTO_INCREMENT PRIMARY KEY,currencyname VARCHAR(255),hui_in FLOAT,chao_in FLOAT,hui_out FLOAT,chao_out FLOAT,zhonghang FLOAT,date VARCHAR(255),time VARCHAR(255))""".format(self.table_name)self.cursor.execute(create_table_query)def process_item(self, item, spider):# 处理每个抓取的item,将其插入数据库# 转换hui_in字段,处理None值hui_in_value = item.get("hui_in")if hui_in_value == 'None':hui_in_value = Noneelse:try:hui_in_value = float(hui_in_value)except ValueError:hui_in_value = None# 转换hui_out字段,处理None值hui_out_value = item.get("hui_out")if hui_out_value == 'None':hui_out_value = Noneelse:try:hui_out_value = float(hui_out_value)except ValueError:hui_out_value = None# 转换chao_in字段,处理None值chao_in_value = item.get("chao_in")if chao_in_value == 'None':chao_in_value = Noneelse:try:chao_in_value = float(chao_in_value)except ValueError:chao_in_value = None# 转换chao_out字段,处理None值chao_out_value = item.get("chao_out")if chao_out_value == 'None':chao_out_value = Noneelse:try:chao_out_value = float(chao_out_value)except ValueError:chao_out_value = None# 转换zhonghang字段,处理None值zhonghang_value = item.get("zhonghang")if zhonghang_value == 'None':zhonghang_value = Noneelse:try:zhonghang_value = float(zhonghang_value)except ValueError:zhonghang_value = None# 准备要插入的数据args = [item.get("currencyname"),hui_in_value,chao_in_value,hui_out_value,chao_out_value,zhonghang_value,item.get("date"),item.get("time"),]print("Inserting data:", args)# 插入数据到数据库sql = "INSERT INTO {} (currencyname, hui_in, chao_in, hui_out, chao_out, zhonghang, date, time) VALUES (%s,%s,%s,%s,%s,%s,%s,%s)".format(self.table_name)self.cursor.execute(sql, args)self.client.commit()# 返回处理后的itemreturn itemdef close_spider(self, spider):# 在爬虫关闭时关闭数据库连接self.cursor.close()self.client.close()

(5)编辑 items.py 文件,代码如下
Gitee链接:https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/currency_project/currency_project/items.py

点击查看代码
import scrapy
class CurrencyItem(scrapy.Item):currencyname = scrapy.Field()hui_in = scrapy.Field()chao_in = scrapy.Field()hui_out = scrapy.Field()chao_out = scrapy.Field()zhonghang = scrapy.Field()date = scrapy.Field()time = scrapy.Field()

3.运行爬虫
(1)在命令行或终端中,导航到项目根目录:scrapy crawl currency

(2)显示结果截图
Gitee链接:https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3/currency_project/data@0020acquisition

  • 心得体会
    通过本次实验,我熟练掌握了使用 Scrapy 框架结合 Xpath 技术爬取外汇网站数据,并将其存储到 MySQL 数据库的完整流程。定义了 CurrencyItem 类来存储数据,编写了 Spider 类提取外汇牌价信息,并通过 Pipeline 将数据序列化存储到 MySQL 中。实验过程加深了我对 Scrapy 框架的理解,提升了数据爬取和处理的能力。

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.hqwc.cn/news/826720.html

如若内容造成侵权/违法违规/事实不符,请联系编程知识网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

10分钟快速开发一个监控ASIN的系统-计算引擎

定义数据模型 发布生成表单 Python端直接操作亚马逊amazon# <span data-hook="rating-out-of-text" aria-hidden="true" class="a-size-medium a-color-base">4.5 out of 5</span>if com_amz_x.check_driver_is_exists_by_x_path(…

数据库 PostgreSQL 和 MySQL 开源协议的区别

说到开源数据库,很多人想到了MySQL和PostgreSQL数据库,这是两种比较有名和使用量较多的数据库技术,两者虽然都是开源的,但开源协议是不一样的。 PostgreSQL遵循灵活的开源协议BSD,MySQL使用的是GPL(GNU General Public License)协议,二者有以下区别: 1. 商业使用限制方面…

MySQL8.0安装配置教程【超级详细图解】

万分感谢.参考文章内容:https://blog.csdn.net/m0_73442728/article/details/131359479 万分感谢.参考文章内容:https://blog.csdn.net/qq_40187702/article/details/130618805 目录** 一、MySQL下载与安装 二、MySQL安装 三、MySQL连接测试 四、配置环境变量一、MySQL下载与安…

使用TLE计算轨道并使用Cesium绘制

使用TLE计算轨道并使用Cesium绘制 依赖库satellite.jsnpm install satellite.js实现思路利用satellite.js获取一个轨道周期内的全部轨迹点坐标 使用Cesium的primitive绘制轨迹线完整代码 import { twoline2satrec, propagate, gstime, eciToEcf } from "satellite.js"…

实例化对象有哪几种方式

1. **new 关键字** User user = new User(); 2. **clone() 方法** User clonedUser = originalUser.clone(); 3. **通过反射机制创建** Class<?> cls = Class.forName("com.dao.User"); User u = (User) cls.newInstance(); 4. **序列化反序列化** …

操作字符串都有哪些类以及它们之间有什么区别

1. **String**: - 是不可变对象。每次对 String 类型进行修改时都会生成一个新的对象。 - 适用于不频繁修改字符串的情况。 2. **StringBuilder**: - 线程不安全,效率高,多用于单线程环境。 - 适用于需要频繁修改字符串的操作。 3. **StringBuffer**: - 线程安全…

微信公众号服务器配置一直提示token验证失败?

本地使用postman,请求了要设置在微信公众号服务器回调的URL,可以正常返回echostr,点击提交, 一直报错token验证失败,请问这个是什么原因呢? 解决办法: 解决了,我去,好坑啊,遇到该问题的朋友,请做如下检查:1.检查,request 是不是 UTF-8,避免获取的数据是乱码 2…

「闲话」NOIP 集训

10.31 因为明天是 11.1,所以从今天开始写 上午 T1 没看让输出啥所以一眼会了求所有 j 看了输出之后,额 诶,其实也对啊,直接根据每个 j 求出的 i 区间查分一下就好了,调和级数的复杂度 20 min 打完了,本来以为有些 conercase 要调一会,但直接过了所有样例,爽!! 后记:…

2024-11-04 Request textDocument/documentSymbol failed. ==》重启vscode

今个儿早上一来,跑一下项目,我勒个去,这个vscode就给我不断弹出这种弹窗 Request textDocument/documentSymbol failed. 请求文本Document/documentSymbol失败。来源:Vue-官方。 原因:不知道,可能是终端报错了,但是把终端关掉了(没有杀死),报错仍旧存在。 解决方案:…

django抛出异常然后在axios里处理

1.views.py代码@csrf_exempt def get_selectdata_by_sql(request):current = request.GET.get("current", 1)size = request.GET.get("size", 0)limit_size=50try:res_list = []request_dict = json.loads(request.body)# print(request_dict)runsql = req…

Tmux基本用法

Tmux 是一个终端复用器(terminal multiplexer),非常有用,属于常用的开发工具。 它的核心功能是解绑当前运行会话进程和当前终端窗口。即使出现意外情况,当前的终端窗口被kill掉了(断电、断网、手贱点错了等等),你之前使用Tmux执行的会话进程仍在继续运行,并可以绑定到…

配置数据同步环境v1

配置数据同步环境v1.0 1 配置Canal+MQ数据同步环境 1.1 配置Mysql主从同步 根据Canal的工作原理,首先需要开启MySQL主从同步。 1.在MySQL中需要创建一个用户,并授权 进入mysql容器: docker exec -it mysql /bin/bash -- 使用命令登录: mysql -u root -p -- 创建用户 用户名…