Python Web Scraping with Scrapy, Part 3

Table of Contents

Disclaimer

Task

File Overview

Scraping Dangdang with a Single Pipeline

pipelines.py

items.py

settings.py

dang.py

Downloading Images with Multiple Pipelines

pipelines.py

settings.py

Multi-Page Scraping on Dangdang

dang.py

pipelines.py

settings.py

items.py

Summary


Disclaimer

This article is for learning purposes only and has no commercial use.

Some of the images in this article come from 尚硅谷 (Atguigu).

Task

Scrape all product data from the Dangdang automotive supplies category page (汽车用品_汽车用品【价格 品牌 推荐 正品折扣】-当当网).

File Overview

In the Scrapy framework, pipelines and items are both tools for processing and storing the scraped data.

  • Items: an Item is a container for the scraped data. It behaves much like a Python dictionary and can hold arbitrary fields and their values. In Scrapy you define your own Item class, create Item objects in the spider, and fill them with the data you scrape. Items are then passed on to the pipelines for further processing and storage.

  • Pipelines: a Pipeline is a component that processes and stores Item objects. After the spider fills an Item with data, the Item is handed to the pipelines, which can perform all kinds of operations such as cleaning, de-duplication, validation, and persistence. You can define multiple pipelines, and they are executed in priority order.

Both of them are needed in this project, as the minimal sketch below illustrates.
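To make the relationship concrete, here is a minimal, purely illustrative sketch (the names DemoItem, DemoPipeline, and myproject are invented for this example; the project's real classes follow in the next section):

# items.py -- illustrative sketch only
import scrapy

class DemoItem(scrapy.Item):
    name = scrapy.Field()


# pipelines.py -- illustrative sketch only
class DemoPipeline:
    def process_item(self, item, spider):
        # Every item yielded by the spider passes through here; returning it
        # lets any lower-priority pipeline process it afterwards.
        print(item.get('name'))
        return item

# In the spider you would yield DemoItem(name="example"), and enable the
# pipeline in settings.py, e.g.
# ITEM_PIPELINES = {"myproject.pipelines.DemoPipeline": 300}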


Scraping Dangdang with a Single Pipeline

The image below comes from 尚硅谷 (Atguigu).

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


# To use a pipeline, it must be enabled in settings.py
class ScrapyDangdang060Pipeline:

    # Runs once before the spider starts
    def open_spider(self, spider):
        print("++++++++++=========")
        self.fp = open('book.json', 'w', encoding='utf-8')

    # item is the book object yielded by the spider:
    # book = ScrapyDangdang060Item(src=src, name=name, price=price)
    def process_item(self, item, spider):
        # NOTE: the commented-out approach below is not recommended, because
        # it opens the file once for every item that passes through -- far
        # too much file I/O.
        # (1) write() only accepts strings, not other objects.
        # (2) Mode 'w' would reopen and truncate the file for each item, so
        #     every write would overwrite the previous one.
        # with open('book.json', 'a', encoding='utf-8') as fp:
        #     fp.write(str(item))

        # Keeping the file open in open_spider/close_spider avoids opening
        # it over and over again.
        self.fp.write(str(item))
        return item

    # Runs once after the spider has finished
    def close_spider(self, spider):
        print("------------------==========")
        self.fp.close()

Uncomment the ITEM_PIPELINES section in settings.py to enable the pipeline:

ITEM_PIPELINES = {
    # There can be many pipelines, and each has a priority in the range
    # 1-1000; the smaller the value, the higher the priority.
    "scrapy_dangdang_060.pipelines.ScrapyDangdang060Pipeline": 300,
}

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyDangdang060Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # In plain terms, these are the fields we want to scrape:
    # image URL
    src = scrapy.Field()
    # product name
    name = scrapy.Field()
    # price
    price = scrapy.Field()

settings.py

# Scrapy settings for scrapy_dangdang_060 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "scrapy_dangdang_060"

SPIDER_MODULES = ["scrapy_dangdang_060.spiders"]
NEWSPIDER_MODULE = "scrapy_dangdang_060.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "scrapy_dangdang_060 (+http://www.yourdomain.com)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "scrapy_dangdang_060.middlewares.ScrapyDangdang060SpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "scrapy_dangdang_060.middlewares.ScrapyDangdang060DownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# There can be many pipelines, and each has a priority in the range 1-1000;
# the smaller the value, the higher the priority.
ITEM_PIPELINES = {
    "scrapy_dangdang_060.pipelines.ScrapyDangdang060Pipeline": 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

dang.py

import scrapy
# The IDE may flag this import as an error, but that is an editor issue and
# does not affect the code below.
from scrapy_dangdang_060.items import ScrapyDangdang060Item


class DangSpider(scrapy.Spider):
    name = "dang"
    allowed_domains = ["category.dangdang.com"]
    start_urls = ["https://category.dangdang.com/cid4002429.html"]

    def parse(self, response):
        print("===============success================")
        # pipelines: download/persist the data
        # items:     define the data structure
        # src   = //ul[@id="component_47"]/li//img/@src
        # alt   = //ul[@id="component_47"]/li//img/@alt
        # price = //ul[@id="component_47"]/li//p/span/text()

        # Every Selector object can call xpath() again
        li_list = response.xpath('//ul[@id="component_47"]/li')
        for li in li_list:
            # The page lazy-loads its images, so @src cannot be used directly.
            # Only the first image has a usable @src; the rest keep the real
            # address in @data-original.
            src = li.xpath('.//a//img/@data-original').extract_first()
            if not src:
                src = li.xpath('.//a//img/@src').extract_first()

            name = li.xpath('.//img/@alt').extract_first()
            price = li.xpath('.//p[@class="price"]/span[1]/text()').extract_first()
            # print(src, name, price)

            book = ScrapyDangdang060Item(src=src, name=name, price=price)
            # Hand each book to the pipelines as soon as it is built
            yield book

After this, book.json contains all the data scraped from this single Dangdang page.
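One caveat: because the pipeline writes str(item) objects back to back, book.json is not strictly valid JSON. If you want output that other tools can parse, a common variation (my own sketch, not part of the original tutorial) is to write one JSON object per line with json.dumps and ItemAdapter:

import json

from itemadapter import ItemAdapter


class JsonLinesPipeline:
    # Hypothetical alternative to ScrapyDangdang060Pipeline: it writes one
    # JSON object per line ("JSON Lines"), which is easy to parse later.

    def open_spider(self, spider):
        self.fp = open('book.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False)
        self.fp.write(line + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()

To try it, you would register it in ITEM_PIPELINES just like the pipeline above.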


Downloading Images with Multiple Pipelines

# (1) Define the pipeline class
# (2) Enable it in settings.py:
#     "scrapy_dangdang_060.pipelines.DangDangDownloadPipeline": 301,

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


# To use a pipeline, it must be enabled in settings.py
class ScrapyDangdang060Pipeline:

    # Runs once before the spider starts
    def open_spider(self, spider):
        print("++++++++++=========")
        self.fp = open('book.json', 'w', encoding='utf-8')

    # item is the book object yielded by the spider:
    # book = ScrapyDangdang060Item(src=src, name=name, price=price)
    def process_item(self, item, spider):
        # NOTE: the commented-out approach below is not recommended, because
        # it opens the file once for every item that passes through -- far
        # too much file I/O.
        # (1) write() only accepts strings, not other objects.
        # (2) Mode 'w' would reopen and truncate the file for each item, so
        #     every write would overwrite the previous one.
        # with open('book.json', 'a', encoding='utf-8') as fp:
        #     fp.write(str(item))

        # Keeping the file open in open_spider/close_spider avoids opening
        # it over and over again.
        self.fp.write(str(item))
        return item

    # Runs once after the spider has finished
    def close_spider(self, spider):
        print("------------------==========")
        self.fp.close()


import urllib.request


# Enabling a second pipeline:
# (1) Define the pipeline class
# (2) Enable it in settings.py:
#     "scrapy_dangdang_060.pipelines.DangDangDownloadPipeline": 301,
class DangDangDownloadPipeline:
    def process_item(self, item, spider):
        url = 'https:' + item.get('src')
        filename = './books/' + item.get('name') + '.jpg'
        urllib.request.urlretrieve(url=url, filename=filename)
        return item

settings.py

ITEM_PIPELINES = {
    "scrapy_dangdang_060.pipelines.ScrapyDangdang060Pipeline": 300,
    # DangDangDownloadPipeline
    "scrapy_dangdang_060.pipelines.DangDangDownloadPipeline": 301,
}

Only these two files need to change; everything else stays the same.
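One practical caveat with DangDangDownloadPipeline as written: urlretrieve fails if the ./books/ directory does not already exist, and product names may contain characters that are not legal in file names. Below is a hedged sketch of a slightly more defensive version; the directory creation and name sanitising are my additions, not part of the original tutorial:

import os
import re
import urllib.request


class DangDangDownloadPipeline:
    def open_spider(self, spider):
        # Create the target directory up front so urlretrieve does not fail.
        os.makedirs('./books', exist_ok=True)

    def process_item(self, item, spider):
        url = 'https:' + item.get('src')
        # Replace characters that are not allowed in file names.
        safe_name = re.sub(r'[\\/:*?"<>|]', '_', item.get('name') or 'unnamed')
        filename = './books/' + safe_name + '.jpg'
        urllib.request.urlretrieve(url=url, filename=filename)
        return item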


Multi-Page Scraping on Dangdang

dang.py

Here we look at how the URLs differ from page to page, and then reuse the parse method to scrape every page.

import scrapy
# The IDE may flag this import as an error, but that is an editor issue and
# does not affect the code below.
from scrapy_dangdang_060.items import ScrapyDangdang060Item


class DangSpider(scrapy.Spider):
    name = "dang"
    # For multi-page crawling, allowed_domains must cover every page;
    # normally only the domain name is listed here.
    allowed_domains = ["category.dangdang.com"]
    start_urls = ["https://category.dangdang.com/cid4002429.html"]

    base_url = 'https://category.dangdang.com/pg'
    page = 1

    def parse(self, response):
        print("===============success================")
        # pipelines: download/persist the data
        # items:     define the data structure
        # src   = //ul[@id="component_47"]/li//img/@src
        # alt   = //ul[@id="component_47"]/li//img/@alt
        # price = //ul[@id="component_47"]/li//p/span/text()

        # Every Selector object can call xpath() again
        li_list = response.xpath('//ul[@id="component_47"]/li')
        for li in li_list:
            # The page lazy-loads its images, so @src cannot be used directly.
            # Only the first image has a usable @src; the rest keep the real
            # address in @data-original.
            src = li.xpath('.//a//img/@data-original').extract_first()
            if not src:
                src = li.xpath('.//a//img/@src').extract_first()

            name = li.xpath('.//img/@alt').extract_first()
            price = li.xpath('.//p[@class="price"]/span[1]/text()').extract_first()
            # print(src, name, price)

            book = ScrapyDangdang060Item(src=src, name=name, price=price)
            # Hand each book to the pipelines as soon as it is built
            yield book

        # Every page follows the same scraping logic, so we simply request
        # the next page and let parse handle it again.
        # Page 1: https://category.dangdang.com/cid4002429.html
        # Page 2: https://category.dangdang.com/pg2-cid4002429.html
        # Page 3: https://category.dangdang.com/pg3-cid4002429.html
        if self.page < 100:
            self.page = self.page + 1
            url = self.base_url + str(self.page) + '-cid4002429.html'
            # The line below is Scrapy's way of issuing a GET request.
            # Pass self.parse without parentheses: we hand over the method
            # itself as the callback, we do not call it here.
            yield scrapy.Request(url=url, callback=self.parse)
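For reference, the same pagination can also be written with response.follow, which builds the Request for you and accepts relative URLs. This is only an equivalent sketch of the pagination part (the spider name is invented), not a change to the project:

import scrapy


class DangPagesSketchSpider(scrapy.Spider):
    # Hypothetical minimal spider showing only the pagination pattern.
    name = "dang_pages_sketch"
    allowed_domains = ["category.dangdang.com"]
    start_urls = ["https://category.dangdang.com/cid4002429.html"]

    base_url = 'https://category.dangdang.com/pg'
    page = 1

    def parse(self, response):
        # ... extract and yield the items here, exactly as in dang.py above ...

        if self.page < 100:
            self.page += 1
            next_url = self.base_url + str(self.page) + '-cid4002429.html'
            # response.follow accepts absolute or relative URLs; pass
            # self.parse without parentheses as the callback.
            yield response.follow(next_url, callback=self.parse)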

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


# To use a pipeline, it must be enabled in settings.py
class ScrapyDangdang060Pipeline:

    # Runs once before the spider starts
    def open_spider(self, spider):
        print("++++++++++=========")
        self.fp = open('book.json', 'w', encoding='utf-8')

    # item is the book object yielded by the spider:
    # book = ScrapyDangdang060Item(src=src, name=name, price=price)
    def process_item(self, item, spider):
        # NOTE: the commented-out approach below is not recommended, because
        # it opens the file once for every item that passes through -- far
        # too much file I/O.
        # (1) write() only accepts strings, not other objects.
        # (2) Mode 'w' would reopen and truncate the file for each item, so
        #     every write would overwrite the previous one.
        # with open('book.json', 'a', encoding='utf-8') as fp:
        #     fp.write(str(item))

        # Keeping the file open in open_spider/close_spider avoids opening
        # it over and over again.
        self.fp.write(str(item))
        return item

    # Runs once after the spider has finished
    def close_spider(self, spider):
        print("------------------==========")
        self.fp.close()


import urllib.request


# Enabling a second pipeline:
# (1) Define the pipeline class
# (2) Enable it in settings.py:
#     "scrapy_dangdang_060.pipelines.DangDangDownloadPipeline": 301,
class DangDangDownloadPipeline:
    def process_item(self, item, spider):
        url = 'https:' + item.get('src')
        filename = './books/' + item.get('name') + '.jpg'
        urllib.request.urlretrieve(url=url, filename=filename)
        return item

settings.py

# Scrapy settings for scrapy_dangdang_060 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "scrapy_dangdang_060"

SPIDER_MODULES = ["scrapy_dangdang_060.spiders"]
NEWSPIDER_MODULE = "scrapy_dangdang_060.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "scrapy_dangdang_060 (+http://www.yourdomain.com)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "scrapy_dangdang_060.middlewares.ScrapyDangdang060SpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "scrapy_dangdang_060.middlewares.ScrapyDangdang060DownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# There can be many pipelines, and each has a priority in the range 1-1000;
# the smaller the value, the higher the priority.
ITEM_PIPELINES = {
    "scrapy_dangdang_060.pipelines.ScrapyDangdang060Pipeline": 300,
    # DangDangDownloadPipeline
    "scrapy_dangdang_060.pipelines.DangDangDownloadPipeline": 301,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyDangdang060Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # In plain terms, these are the fields we want to scrape:
    # image URL
    src = scrapy.Field()
    # product name
    name = scrapy.Field()
    # price
    price = scrapy.Field()

Summary

It was hard, but a man can't admit that it's hard ┭┮﹏┭┮

ヾ( ̄▽ ̄)Bye~Bye~
