[Experiment 3.1] Use Scrapy to scrape books rated 9.0 or above from Douban Books. For each book, collect the title, rating, author, publisher, and publication year.

Contents

1. Project code (as shown in the figure)

2. Code details

__init__.py

dbbook.py

__init__.py

items.py

pipelines.py

settings.py

main.py

scrapy.cfg

3. Run results

4. Pre-experiment preparation


1. Project code, as shown in the figure:
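The original figure is not reproduced here. As a sketch, the layout produced by scrapy startproject doubanbook, with main.py added next to scrapy.cfg and the files covered below, would look like this (reconstructed from the file list):

doubanbook/
├── scrapy.cfg
├── main.py
└── doubanbook/
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── dbbook.py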

2. Code details

__init__.py (doubanbook/spiders/__init__.py)
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.
dbbook.py
# -*- coding: utf-8 -*-
import scrapy
import re
from doubanbook.items import DoubanbookItem


class DbbookSpider(scrapy.Spider):
    name = "dbbook"
    # allowed_domains = ["https://www.douban.com/doulist/1264675/"]
    start_urls = ('https://www.douban.com/doulist/1264675/',)
    URL = 'https://www.douban.com/doulist/1264675/?start=PAGE&sort=seq&sub_type='

    def parse(self, response):
        selector = scrapy.Selector(response)
        # Each book on the doulist page lives in one of these blocks
        books = selector.xpath('//div[@class="bd doulist-subject"]')
        for each in books:
            title = each.xpath('.//div[@class="title"]/a/text()').get().strip()
            rate = each.xpath('.//div[@class="rating"]/span[@class="rating_nums"]/text()').get().strip()
            # Pull the author / publisher / publication-year lines out of the abstract block
            author = each.xpath('.//div[@class="abstract"]/text()[contains(., "作者")]').extract_first()
            publisher = each.xpath('.//div[@class="abstract"]/text()[contains(., "出版社")]').extract_first()
            year = each.xpath('.//div[@class="abstract"]/text()[contains(., "出版年")]').extract_first()
            # Keep only the value after the "label:" prefix, then strip spaces and newlines
            author = author.split(':')[1].replace(' ', '').replace('\n', '')
            publisher = publisher.split(':')[1].replace(' ', '').replace('\n', '')
            year = year.split(':')[1].replace(' ', '').replace('\n', '')
            title = title.replace(' ', '').replace('\n', '')
            yield DoubanbookItem(
                title=title,
                rate=rate,
                author=author,
                publisher=publisher,
                year=year,
            )

        # An earlier regex-based variant of the same extraction, kept for reference:
        # for each in books:
        #     title = each.xpath('div[@class="title"]/a/text()').extract()[0]
        #     rate = each.xpath('div[@class="rating"]/span[@class="rating_nums"]/text()').extract()[0]
        #     author = re.search('<div class="abstract">(.*?)<br', each.extract(), re.S).group(1)
        #     abstract = each.xpath('.//div[@class="abstract"]/text()').get().strip()
        #     publisher = re.search(r'出版社:(.*?)(?=<br|\Z)', abstract)
        #     year = re.search(r'出版年:(\d{4}-\d+)(?=<br|\Z)', abstract)
        #     item = DoubanbookItem()
        #     item['title'] = title
        #     item['rate'] = rate
        #     item['author'] = author
        #     item['year'] = year
        #     item['publisher'] = publisher
        #     yield item

        # Follow the "next page" link until the list runs out
        next_page = selector.xpath('//span[@class="next"]/link/@href').extract()
        if next_page:
            print(next_page[0])
            yield scrapy.http.Request(next_page[0], callback=self.parse)
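One fragility worth flagging in parse(): extract_first() returns None when a book entry is missing one of the labels, and split(':')[1] raises IndexError when the colon is absent, so a small guard helps in practice. A minimal sketch of such a helper (clean_field is our addition, not part of the original project):

def clean_field(raw):
    """Reduce a raw abstract line such as '作者: 鲁迅' to its bare value.

    Returns '' when the line is absent and tolerates lines without a colon.
    """
    if raw is None:
        return ''
    parts = raw.split(':', 1)
    value = parts[1] if len(parts) > 1 else parts[0]
    return value.replace(' ', '').replace('\n', '')

With it, the three split/replace lines in parse() collapse to author = clean_field(author), and likewise for publisher and year.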
__init__.py (doubanbook/__init__.py, left empty)

items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanbookItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    rate = scrapy.Field()
    author = scrapy.Field()
    publisher = scrapy.Field()
    year = scrapy.Field()
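A scrapy.Item behaves much like a dict, which is why the spider can populate it either through keyword arguments or item['key'] assignments. A quick illustration (the sample values are made up for demonstration):

item = DoubanbookItem(title='朝花夕拾', rate='9.4')
item['author'] = '鲁迅'
print(dict(item))   # {'title': '朝花夕拾', 'rate': '9.4', 'author': '鲁迅'}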
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class DoubanbookPipeline(object):
    def process_item(self, item, spider):
        return item
settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for doubanbook project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'doubanbook'

SPIDER_MODULES = ['doubanbook.spiders']
NEWSPIDER_MODULE = 'doubanbook.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0'

FEEDS = {
    'file:///F:/2023-2024学年大三学期(2023.9.4至)/0.2023-202学年大三下学期(2023.2.22至)/douban.csv': {
        'format': 'csv',
        'encoding': 'utf-8',
    },
}
# FEED_URI = u'file:///F:/2023-2024学年大三学期(2023.9.4至)/0.2023-202学年大三下学期(2023.2.22至)/douban.csv'
# FEED_FORMAT = 'CSV'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS=32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY=3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN=16
#CONCURRENT_REQUESTS_PER_IP=16

# Disable cookies (enabled by default)
#COOKIES_ENABLED=False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED=False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'doubanbook.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'doubanbook.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'doubanbook.pipelines.SomePipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
# NOTE: AutoThrottle will honour the standard settings for concurrency and delay
#AUTOTHROTTLE_ENABLED=True
# The initial download delay
#AUTOTHROTTLE_START_DELAY=5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY=60
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG=False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED=True
#HTTPCACHE_EXPIRATION_SECS=0
#HTTPCACHE_DIR='httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES=[]
#HTTPCACHE_STORAGE='scrapy.extensions.httpcache.FilesystemCacheStorage'
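Two practical notes on this file: the FEEDS dictionary is the modern replacement (Scrapy 2.1+) for the deprecated FEED_URI/FEED_FORMAT pair kept in the comments, and Douban tends to throttle aggressive clients, so enabling a delay is prudent. The lines below are suggestions, not part of the original configuration:

# Be polite to douban.com; the exact delay is a judgment call.
DOWNLOAD_DELAY = 2
# To activate the RatingFilterPipeline sketched under pipelines.py:
# ITEM_PIPELINES = {
#     'doubanbook.pipelines.RatingFilterPipeline': 300,
# }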
main.py
from scrapy import cmdline
cmdline.execute("scrapy crawl dbbook".split())
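main.py exists only so the crawl can be started from an IDE with one click. The equivalent terminal commands, run from the directory containing scrapy.cfg, are:

scrapy crawl dbbook
scrapy crawl dbbook -o douban.csv   # export to CSV without relying on FEEDS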
scrapy.cfg
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = doubanbook.settings

[deploy]
#url = http://localhost:6800/
project = doubanbook

3. Run results

(figure: screenshot of the crawler's run results, not reproduced here)

4. Pre-experiment preparation:

Install Scrapy
One command takes care of it. In a terminal, run: pip install scrapy
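A quick sanity check after installing:

scrapy version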

To get familiar with the code and the target site, open the page and study its source: press F12, or open the browser menu and choose More tools > Developer tools. This step matters.

For example, the developer tools show that each book entry on the doulist page sits inside a <div class="bd doulist-subject"> node, which is exactly what the spider's first XPath selects.
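A convenient way to experiment with such selectors before committing them to the spider is scrapy shell; the XPath below is the same one dbbook.py uses (note that Douban may reject requests that lack a browser-like User-Agent, such as the one set in settings.py):

scrapy shell "https://www.douban.com/doulist/1264675/"
>>> response.xpath('//div[@class="bd doulist-subject"]//div[@class="title"]/a/text()').get()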

The full source code is available as a free download among my uploaded resources; related topics will be filled in by follow-up posts.
