Data Collection and Fusion Technology Practice: Assignment 3

Source: https://www.cnblogs.com/chsiyu/p/18513935 (2025/1/20 14:52:28)

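Assignment 1

Requirements: Pick a website and crawl all of its images, for example the China Weather site (http://www.weather.com.cn). Use the Scrapy framework to implement both a single-threaded and a multi-threaded crawl.
Be sure to limit the crawl, for example by capping the total number of pages (the last 2 digits of the student ID) and the total number of downloaded images (the last 3 digits).
Output:

Code: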

weather_images.py
import re

import scrapy
from scrapy.http import Request
from weather_image_spider.items import WeatherImageSpiderItem


class WeatherImagesSpider(scrapy.Spider):
    name = "weather_images"
    allowed_domains = ["weather.com.cn"]
    start_urls = ["http://www.weather.com.cn/"]

    # Limits on the number of images and pages
    total_images = 125
    total_pages = 25  # total pages, set from the student ID suffix
    images_downloaded = 0
    pages_visited = 0

    def parse(self, response):
        # Collect the links on the page
        url_list = re.findall(r'<a href="(.*?)"', response.text, re.S)
        for url in url_list:
            if self.pages_visited >= self.total_pages:
                break
            self.pages_visited += 1
            # urljoin resolves relative links against the current page URL
            yield Request(response.urljoin(url), callback=self.parse_images)

    def parse_images(self, response):
        img_list = re.findall(r'<img.*?src="(.*?)"', response.text, re.S)
        for img_url in img_list:
            if self.images_downloaded >= self.total_images:
                break
            self.images_downloaded += 1
            print(f"Saving image {self.images_downloaded}, URL: {img_url}")
            yield WeatherImageSpiderItem(image_urls=[response.urljoin(img_url)])
pipelines.py
from scrapy.pipelines.images import ImagesPipeline


class WeatherImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Name each file after the last segment of its URL
        image_name = request.url.split('/')[-1]
        return f'images/{image_name}'
settings.py
# Scrapy settings for weather_image_spider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "weather_image_spider"

SPIDER_MODULES = ["weather_image_spider.spiders"]
NEWSPIDER_MODULE = "weather_image_spider.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "weather_image_spider (+http://www.yourdomain.com)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "weather_image_spider.middlewares.WeatherImageSpiderSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "weather_image_spider.middlewares.WeatherImageSpiderDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    "weather_image_spider.pipelines.WeatherImageSpiderPipeline": 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

# Restrict the crawl to one request at a time
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 0.25  # adjust as needed

# Enable the image pipeline
ITEM_PIPELINES = {
    'weather_image_spider.pipelines.WeatherImagePipeline': 1,
}
IMAGES_STORE = './images'

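Reflections: I still don't fully understand the difference between single-threaded and multi-threaded crawling. As far as I can tell, it comes down to how CONCURRENT_REQUESTS is set in settings.py: with CONCURRENT_REQUESTS = 1 Scrapy sends one request at a time, while leaving it unset keeps the default of 16 concurrent requests.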
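One way to make the two modes concrete is to generate both sets of settings from a single switch. The sketch below is not part of the original project: `build_settings` is a hypothetical helper, and the concurrent values (16 and 0) are illustrative. `CONCURRENT_REQUESTS`, `DOWNLOAD_DELAY`, `CLOSESPIDER_PAGECOUNT` and `CLOSESPIDER_ITEMCOUNT` are standard Scrapy settings; the last two ask Scrapy to stop the crawl after roughly that many responses or items, which could complement the counters kept in the spider.

```python
def build_settings(concurrent: bool, max_pages: int = 25, max_images: int = 125) -> dict:
    """Return Scrapy settings for a sequential or a concurrent crawl.

    Hypothetical helper; the limits default to the values used in the spider.
    """
    settings = {
        # Stop after roughly this many responses / scraped items
        "CLOSESPIDER_PAGECOUNT": max_pages,
        "CLOSESPIDER_ITEMCOUNT": max_images,
        "ITEM_PIPELINES": {"weather_image_spider.pipelines.WeatherImagePipeline": 1},
        "IMAGES_STORE": "./images",
    }
    if concurrent:
        # Several requests in flight at once (16 is Scrapy's default)
        settings["CONCURRENT_REQUESTS"] = 16
        settings["DOWNLOAD_DELAY"] = 0
    else:
        # One request at a time, with a small pause between requests
        settings["CONCURRENT_REQUESTS"] = 1
        settings["DOWNLOAD_DELAY"] = 0.25
    return settings


if __name__ == "__main__":
    for concurrent in (False, True):
        chosen = build_settings(concurrent)
        print(concurrent, chosen["CONCURRENT_REQUESTS"], chosen["DOWNLOAD_DELAY"])
```

A dictionary like this could be assigned to the spider's `custom_settings` attribute. Requests already in flight when a limit is reached may still complete, so the final counts can overshoot slightly.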

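Assignment 2

Requirements: Become proficient with serializing and exporting data through Scrapy's Item and Pipeline classes; crawl stock information using the Scrapy framework + XPath + MySQL storage approach.
Candidate site: Eastmoney: https://www.eastmoney.com/
Code: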

eastmoney.py
import json

import scrapy

from ..items import StockItem


class StockSpider(scrapy.Spider):
    name = "stock"
    allowed_domains = ["eastmoney.com"]

    def start_requests(self):
        base_url1 = 'http://95.push2.eastmoney.com/api/qt/clist/get?cb=jQuery112404577990037157569_1696660645140'
        base_url2 = '&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&wbp2u=|0|0|0|web&fid=f3&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23,m:0+t:81+s:2048&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1696660645141'
        total_pages = 4  # crawl the first 4 pages
        for page_number in range(1, total_pages + 1):
            page_url = f"{base_url1}&pn={page_number}{base_url2}"
            yield scrapy.Request(page_url, callback=self.parse)

    def parse(self, response):
        data = response.text
        # The response is wrapped in the cb= callback; keep only the JSON object
        json_data = json.loads(data[data.find('{'):data.rfind('}') + 1])
        stock_list = json_data['data']['diff']
        for stock in stock_list:
            item = StockItem()
            item['code'] = stock['f12']
            item['name'] = stock['f14']
            item['latest_price'] = stock['f2']
            item['change_percent'] = stock['f3']
            item['change_amount'] = stock['f4']
            item['volume'] = stock['f5']
            item['turnover'] = stock['f6']
            item['amplitude'] = stock['f7']
            item['highest'] = stock['f15']
            item['lowest'] = stock['f16']
            item['open_price'] = stock['f17']
            item['close_price'] = stock['f18']
            yield item
items.py
import scrapy


class StockItem(scrapy.Item):
    code = scrapy.Field()
    name = scrapy.Field()
    latest_price = scrapy.Field()
    change_percent = scrapy.Field()
    change_amount = scrapy.Field()
    volume = scrapy.Field()
    turnover = scrapy.Field()
    amplitude = scrapy.Field()
    highest = scrapy.Field()
    lowest = scrapy.Field()
    open_price = scrapy.Field()
    close_price = scrapy.Field()
pipelines.py
import sqlite3

import MySQLdb  # MySQL driver


class StockPipeline:
    def __init__(self):
        self.create_database()

    def create_database(self):
        self.conn = sqlite3.connect('stock.db')
        self.cursor = self.conn.cursor()
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS stocks (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                code TEXT,
                name TEXT,
                latest_price REAL,
                change_percent REAL,
                change_amount REAL,
                volume INTEGER,
                turnover REAL,
                amplitude REAL,
                highest REAL,
                lowest REAL,
                open_price REAL,
                close_price REAL
            )
        ''')
        self.conn.commit()

    def process_item(self, item, spider):
        self.save_stock_data_to_database(item)
        return item

    def save_stock_data_to_database(self, item):
        self.cursor.execute('''
            INSERT INTO stocks (
                code, name, latest_price, change_percent, change_amount,
                volume, turnover, amplitude, highest, lowest, open_price, close_price
            ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        ''', (
            item['code'], item['name'], item['latest_price'], item['change_percent'],
            item['change_amount'], item['volume'], item['turnover'], item['amplitude'],
            item['highest'], item['lowest'], item['open_price'], item['close_price']
        ))
        self.conn.commit()

    def close_spider(self, spider):
        self.conn.close()


class TerminalOutputPipeline(object):
    def process_item(self, item, spider):
        # Print the data to the terminal
        print(f"Stock code: {item['code']}")
        print(f"Stock name: {item['name']}")
        print(f"Latest price: {item['latest_price']}")
        print(f"Change percent: {item['change_percent']}")
        print(f"Change amount: {item['change_amount']}")
        print(f"Volume: {item['volume']}")
        print(f"Turnover: {item['turnover']}")
        print(f"Amplitude: {item['amplitude']}")
        print(f"High: {item['highest']}")
        print(f"Low: {item['lowest']}")
        print(f"Open: {item['open_price']}")
        print(f"Previous close: {item['close_price']}")
        return item


class MySQLPipeline:
    def open_spider(self, spider):
        # Open the database connection
        self.connection = MySQLdb.connect(
            host='localhost',    # database host
            user='root',         # database user
            password='123456',   # database password
            db='stocks_db',      # database name
            charset='utf8mb4',
            use_unicode=True
        )
        self.cursor = self.connection.cursor()

    def close_spider(self, spider):
        # Commit pending inserts and close the database connection
        self.connection.commit()
        self.cursor.close()
        self.connection.close()

    def process_item(self, item, spider):
        # Process the item and store it in the database
        self.save_stock_data_to_database(item)
        return item

    def save_stock_data_to_database(self, item):
        # The stocks table is assumed to exist already in stocks_db
        sql = '''
            INSERT INTO stocks (
                code, name, latest_price, change_percent, change_amount,
                volume, turnover, amplitude, highest, lowest, open_price, close_price
            ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        '''
        self.cursor.execute(sql, (
            item['code'],
            item['name'],
            item['latest_price'],
            item['change_percent'],
            item['change_amount'],
            item['volume'],
            item['turnover'],
            item['amplitude'],
            item['highest'],
            item['lowest'],
            item['open_price'],
            item['close_price']
        ))
settings.py
BOT_NAME = "stock_scraper"

SPIDER_MODULES = ["stock_scraper.spiders"]
NEWSPIDER_MODULE = "stock_scraper.spiders"

ITEM_PIPELINES = {
    'stock_scraper.pipelines.MySQLPipeline': 2,
    'stock_scraper.pipelines.StockPipeline': 1,
    'stock_scraper.pipelines.TerminalOutputPipeline': 3,
}

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'

ROBOTSTXT_OBEY = False
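Scrapy runs the enabled pipelines in ascending order of the integer values in ITEM_PIPELINES, so with the settings above each item reaches StockPipeline first, then MySQLPipeline, then TerminalOutputPipeline, regardless of the order in which the dictionary lists them. The short sketch below only sorts the dictionary to show that order; `pipeline_order` is a hypothetical helper, not a Scrapy function.

```python
ITEM_PIPELINES = {
    "stock_scraper.pipelines.MySQLPipeline": 2,
    "stock_scraper.pipelines.StockPipeline": 1,
    "stock_scraper.pipelines.TerminalOutputPipeline": 3,
}


def pipeline_order(pipelines: dict) -> list:
    """Return pipeline class names sorted by priority (lower value runs first)."""
    ordered = sorted(pipelines.items(), key=lambda pair: pair[1])
    return [path.rsplit(".", 1)[-1] for path, _priority in ordered]


if __name__ == "__main__":
    print(pipeline_order(ITEM_PIPELINES))
    # ['StockPipeline', 'MySQLPipeline', 'TerminalOutputPipeline']
```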

Output:

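Reflections: This task took quite a long time because the stock page is dynamic, and at first I assumed it would be simpler than it was. I only succeeded after using the browser's developer tools (F12) to capture the request that actually returns the stock data; the spider parses that JSON response directly instead of applying XPath to the rendered page. I have not yet tried combining this with Selenium as covered in class, and I did not follow that part of the lecture very well.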
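The key step in the spider is stripping the callback wrapper from the response before parsing it as JSON. The sketch below reproduces that step on a made-up payload: the function name `unwrap_jsonp`, the callback name and the two sample records are invented for illustration and are not real market data. Only the field keys (f12, f14, f2) match those used in the spider.

```python
import json


def unwrap_jsonp(text: str) -> dict:
    """Extract and parse the JSON object inside a JSONP response."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON object found in response")
    return json.loads(text[start:end + 1])


# Invented sample shaped like the API response (not real data)
SAMPLE = (
    'jQuery123_456({"data": {"diff": ['
    '{"f12": "000001", "f14": "Sample A", "f2": 10.5},'
    '{"f12": "000002", "f14": "Sample B", "f2": 20.0}'
    ']}});'
)

if __name__ == "__main__":
    payload = unwrap_jsonp(SAMPLE)
    for stock in payload["data"]["diff"]:
        print(stock["f12"], stock["f14"], stock["f2"])
```

Slicing between the first `{` and the last `}` is simple and matches what the spider does, but it relies on the callback name itself containing no braces.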

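Assignment 3

Requirements: Become proficient with serializing and exporting data through Scrapy's Item and Pipeline classes; crawl foreign exchange data using the Scrapy framework + XPath + MySQL storage approach.
Candidate site: Bank of China: https://www.boc.cn/sourcedb/whpj/
Code: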

bank.py
import scrapy
from bank.items import BankItem


class BankSpider(scrapy.Spider):
    name = 'bank'
    start_urls = ['https://www.boc.cn/sourcedb/whpj/index.html']

    def start_requests(self):
        # Number of pages to crawl; defaults to 4 unless a pages attribute is set
        num_pages = int(getattr(self, 'pages', 4))
        for page in range(1, num_pages + 1):
            if page == 1:
                start_url = 'https://www.boc.cn/sourcedb/whpj/index.html'
            else:
                start_url = f'https://www.boc.cn/sourcedb/whpj/index_{page-1}.html'
            yield scrapy.Request(start_url, callback=self.parse)

    def parse(self, response):
        # Skip the first row of each table
        bank_list = response.xpath('//tr[position()>1]')
        for bank in bank_list:
            item = BankItem()
            item['Currency'] = bank.xpath('.//td[1]/text()').get()
            item['TBP'] = bank.xpath('.//td[2]/text()').get()
            item['CBP'] = bank.xpath('.//td[3]/text()').get()
            item['TSP'] = bank.xpath('.//td[4]/text()').get()
            item['CSP'] = bank.xpath('.//td[5]/text()').get()
            item['Time'] = bank.xpath('.//td[8]/text()').get()
            yield item
items.py
import scrapy


class BankItem(scrapy.Item):
    Currency = scrapy.Field()
    TBP = scrapy.Field()
    CBP = scrapy.Field()
    TSP = scrapy.Field()
    CSP = scrapy.Field()
    Time = scrapy.Field()
pipelines.py
import sqlite3


class BankPipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect('bank.db')
        self.cursor = self.conn.cursor()
        self.create_database()

    def create_database(self):
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS banks (
                Currency TEXT,
                TBP REAL,
                CBP REAL,
                TSP REAL,
                CSP REAL,
                Time TEXT
            )
        ''')
        self.conn.commit()

    def process_item(self, item, spider):
        self.cursor.execute('''
            INSERT INTO banks (Currency, TBP, CBP, TSP, CSP, Time)
            VALUES (?, ?, ?, ?, ?, ?)
        ''', (
            item['Currency'], item['TBP'], item['CBP'],
            item['TSP'], item['CSP'], item['Time']
        ))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
settings.py
BOT_NAME = "bank"

SPIDER_MODULES = ["bank.spiders"]
NEWSPIDER_MODULE = "bank.spiders"

CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 1

ITEM_PIPELINES = {
    'bank.pipelines.BankPipeline': 300,
}
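The parse method in bank.py skips the first row and then reads fixed cell positions from each remaining row. The sketch below illustrates the same idea with the standard library's xml.etree.ElementTree instead of Scrapy selectors, purely so that it runs without Scrapy. The table is an invented, well-formed sample: the header names for columns 6 and 7 are guesses, and the values are placeholders, not real exchange rates.

```python
import xml.etree.ElementTree as ET

# Invented sample table (placeholder values, not real rates)
SAMPLE_TABLE = """
<table>
  <tr><th>Currency</th><th>TBP</th><th>CBP</th><th>TSP</th><th>CSP</th>
      <th>Middle</th><th>Date</th><th>Time</th></tr>
  <tr><td>AAA</td><td>1.0</td><td>0.9</td><td>1.1</td><td>1.1</td>
      <td>1.05</td><td>2024-01-01</td><td>10:30:00</td></tr>
  <tr><td>BBB</td><td>2.0</td><td>1.9</td><td>2.1</td><td>2.1</td>
      <td>2.05</td><td>2024-01-01</td><td>10:30:00</td></tr>
</table>
"""

# Field name -> 1-based cell position, as in the spider
COLUMNS = {"Currency": 1, "TBP": 2, "CBP": 3, "TSP": 4, "CSP": 5, "Time": 8}


def parse_rows(html: str) -> list:
    """Return one dict per data row, skipping the header row."""
    table = ET.fromstring(html.strip())
    rows = table.findall(".//tr")[1:]  # drop the header row
    items = []
    for row in rows:
        cells = row.findall("td")
        items.append({name: cells[pos - 1].text for name, pos in COLUMNS.items()})
    return items


if __name__ == "__main__":
    for item in parse_rows(SAMPLE_TABLE):
        print(item)
```

ElementTree only accepts well-formed markup, so this works on the tidy sample but would not be a substitute for Scrapy's selectors on a real page.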

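Reflections: After finishing Assignment 2, Assignment 3 was much easier. It mainly strengthened my ability to combine the Scrapy framework with database operations (SQLite3 here, alongside MySQL). Finally, I used Navicat Premium to view the database tables.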
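The storage step can be tried without running a crawl. The sketch below repeats the CREATE TABLE and parameterized INSERT from BankPipeline against an in-memory SQLite database and then counts the stored rows. The helper name `store_rows` and the two records are invented placeholders. With MySQL the same pattern applies, except that the MySQLdb driver uses %s placeholders instead of ?, as in MySQLPipeline from Assignment 2.

```python
import sqlite3

COLUMNS = ("Currency", "TBP", "CBP", "TSP", "CSP", "Time")

ROWS = [  # invented placeholder records
    {"Currency": "AAA", "TBP": "1.0", "CBP": "0.9", "TSP": "1.1", "CSP": "1.1", "Time": "10:30:00"},
    {"Currency": "BBB", "TBP": "2.0", "CBP": "1.9", "TSP": "2.1", "CSP": "2.1", "Time": "10:30:00"},
]


def store_rows(conn: sqlite3.Connection, rows: list) -> int:
    """Create the banks table if needed, insert the rows, return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS banks ("
        "Currency TEXT, TBP REAL, CBP REAL, TSP REAL, CSP REAL, Time TEXT)"
    )
    # Placeholders keep the values separate from the SQL text
    conn.executemany(
        "INSERT INTO banks (Currency, TBP, CBP, TSP, CSP, Time) VALUES (?, ?, ?, ?, ?, ?)",
        [tuple(row[col] for col in COLUMNS) for row in rows],
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM banks").fetchone()[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(store_rows(connection, ROWS))
    connection.close()
```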
