Python Web Scraping: A Simple Scraper (Part 4)

Scraping Dynamic Web Pages (Part 2)


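Contents

  • Scraping Dynamic Web Pages (Part 2)
  • Foreword
  • 1. Overview
  • 2. Basic Approach
  • 3. Writing the Code
    • 1. Importing the libraries
    • 2. Loading the page
    • 3. Fetching and saving
    • 4. Saving the workbook
  • Summary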


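Foreword

The previous article covered how to scrape the data. This one shows how to organize the data and save it to an Excel workbook as it is collected.

Previous article, "Python 爬虫之简单的爬虫(三)" (Python Web Scraping: A Simple Scraper, Part 3): https://blog.csdn.net/weixin_57061292/article/details/135073002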


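1. Overview

This article builds on the previous one, adding to and modifying its code.
Added: code that uses Python libraries for working with documents (openpyxl here).
Modified: the "3. 获取指定数据" (fetching the specified data) section of the previous article, so that while iterating over the scraped data each item is written into a newly created Excel workbook.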

Result of a run: [figure: screenshot of the run result]


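2. Basic Approach

Continuing from the basic approach of the previous article:

  • Step 5: Import the additional libraries that are needed.
  • Step 6: Replace the print() calls in the previous article's "3. 获取指定数据" (fetching the specified data) section with code that writes the data into the workbook.
  • Step 7: Delete the workbook's default worksheet, Sheet, and save the workbook.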
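Leaving the browser aside, steps 5 to 7 can be sketched with openpyxl alone. This is a minimal sketch rather than the script itself: `build_workbook` is a hypothetical helper name, and the two sheet titles are placeholders (the real script reads the titles from the page).

```python
import openpyxl


def build_workbook(sheet_titles, header):
    """Create one worksheet per title, write a header row, drop the default sheet."""
    wb = openpyxl.Workbook()  # a new workbook starts with one sheet named 'Sheet'
    for i, title in enumerate(sheet_titles):
        ws = wb.create_sheet(index=i, title=title)
        ws.freeze_panes = 'A2'  # keep the header row visible while scrolling
        for col, text in enumerate(header, start=1):
            ws.cell(row=1, column=col, value=text)
    del wb['Sheet']  # remove the default sheet
    return wb


# Placeholder titles; the real script takes them from the scraped page.
wb = build_workbook(['Category A', 'Category B'], ['片名', '演员', '评分', '产地'])
print(wb.sheetnames)
```

Nothing is written to disk here; saving is a separate call to `wb.save(...)`.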

3. Writing the Code

1. Importing the libraries

The code is as follows:

# From the previous article
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Newly added
from openpyxl.styles import Font, Alignment, Border, Side
import openpyxl
import re

2. Loading the page

The code is as follows:

# From the previous article
driver = webdriver.Firefox()
driver.get("https://movie.douban.com/annual/2022/?fullscreen=1&source=movie_navigation")
time.sleep(5)
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

# Newly added
# Create the workbook instance
wb = openpyxl.Workbook()

The new part creates a Workbook instance, which is used to generate the Excel file.


3. Fetching and saving

The code is as follows:

# Get the titles of the four categories
comment_Titles = driver.find_elements(by=By.CSS_SELECTOR, value='.module-top10-grid-chart-title')
# Create four worksheets, one per category title
i = 0
for comment in comment_Titles:
    # Create the worksheet
    ws = wb.create_sheet(index=i, title=comment.text)
    # Freeze the first row
    ws.freeze_panes = 'A2'
    # First row: centered, bold, bordered
    # Column headings (title, cast, rating, region) go into the first row of each worksheet
    cell_titles = ['片名', '演员', '评分', '产地']
    index = 1
    for title in cell_titles:
        wc = ws.cell(row=1, column=index, value=title)
        # Bold
        wc.font = Font(bold=True)
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
        # Center horizontally and vertically
        wc.alignment = Alignment(horizontal='center', vertical='center')
        index += 1
    i += 1
# Get the title of the top-ranked entry in each category
which_mo_list = driver.find_elements(by=By.CSS_SELECTOR, value='.subject-top-title')
# Write the top-ranked title into each worksheet
a = 0
for each_mo in which_mo_list:
    movie_title = each_mo.get_attribute('title')
    if a == 0:
        ws = wb['评分最高华语电影']
        wc = ws.cell(column=1, row=2, value=f'《{movie_title}》')
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    elif a == 1:
        ws = wb['评分最高外语电影']
        wc = ws.cell(column=1, row=2, value=f'《{movie_title}》')
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    elif a == 2:
        ws = wb['年度冷门佳片']
        wc = ws.cell(column=1, row=2, value=f'《{movie_title}》')
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    elif a == 3:
        ws = wb['华语剧集']
        wc = ws.cell(column=1, row=2, value=f'《{movie_title}》')
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    a += 1
# Get the rating of the top-ranked entry in each category
movies_top_scores_list = driver.find_elements(by=By.CSS_SELECTOR, value='.rating-card-value')
# Write the top-ranked rating into each worksheet
c = 0
for movie_top_score in movies_top_scores_list:
    score = movie_top_score.text
    if c == 0:
        ws = wb['评分最高华语电影']
        wc = ws.cell(column=3, row=2, value=score)
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    elif c == 1:
        ws = wb['评分最高外语电影']
        wc = ws.cell(column=3, row=2, value=score)
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    elif c == 2:
        ws = wb['年度冷门佳片']
        wc = ws.cell(column=3, row=2, value=score)
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    elif c == 3:
        ws = wb['华语剧集']
        wc = ws.cell(column=3, row=2, value=score)
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    c += 1
# Get the cast information for all entries
persons_list = driver.find_elements(by=By.CSS_SELECTOR, value='.subject-credit')
# Write the cast information into the matching worksheet
b = 0
for person in persons_list:
    person_title = person.find_elements(by=By.TAG_NAME, value='p')
    for title in person_title:
        # Cast information
        actor = title.text
        # b counts <p> elements across all credit blocks; indices 0, 11, 22 and 33 are skipped
        if 0 < b <= 10:
            ws = wb['评分最高华语电影']
            wc = ws.cell(column=2, row=b+1, value=actor)
            # Border on all four sides of the cell
            wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                               top=Side(border_style='thin'), bottom=Side(border_style='thin'))
        elif 11 < b <= 21:
            ws = wb['评分最高外语电影']
            wc = ws.cell(column=2, row=b-10, value=actor)
            # Border on all four sides of the cell
            wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                               top=Side(border_style='thin'), bottom=Side(border_style='thin'))
        elif 22 < b <= 32:
            ws = wb['年度冷门佳片']
            wc = ws.cell(column=2, row=b-21, value=actor)
            # Border on all four sides of the cell
            wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                               top=Side(border_style='thin'), bottom=Side(border_style='thin'))
        elif 33 < b <= 43:
            ws = wb['华语剧集']
            wc = ws.cell(column=2, row=b-32, value=actor)
            # Border on all four sides of the cell
            wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                               top=Side(border_style='thin'), bottom=Side(border_style='thin'))
        b += 1
# Get the titles of all entries (except the top-ranked one in each category)
movies_title_list = driver.find_elements(by=By.CSS_SELECTOR, value='.subjects-rank-title')
# Write the titles into each worksheet
d = 0
for movie_title in movies_title_list:
    # Extract the Chinese text with a regular expression.
    # [\u4e00-\u9fff]+ matches one or more consecutive Chinese characters;
    # re.search().group(1) returns the content of the first capture group, i.e. the Chinese text.
    chinese_text = re.search(r'([\u4e00-\u9fff]+)', movie_title.text).group(1)
    if 0 <= d <= 8:
        ws = wb['评分最高华语电影']
        wc = ws.cell(column=1, row=d+3, value=f'《{chinese_text}》')
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    elif 9 <= d <= 17:
        ws = wb['评分最高外语电影']
        wc = ws.cell(column=1, row=d-6, value=f'《{chinese_text}》')
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    elif 18 <= d <= 26:
        ws = wb['年度冷门佳片']
        wc = ws.cell(column=1, row=d-15, value=f'《{chinese_text}》')
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    elif 27 <= d <= 35:
        ws = wb['华语剧集']
        wc = ws.cell(column=1, row=d-24, value=f'《{chinese_text}》')
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    d += 1
# Get the production region of each entry (except the top-ranked one in each category)
addresses_list = driver.find_elements(by=By.CSS_SELECTOR, value='.subjects-rank-credits > div:nth-child(2)')
# Write the region names into each worksheet
e = 0
for addresses in addresses_list:
    address_text = addresses.text
    if 0 <= e <= 8:
        ws = wb['评分最高华语电影']
        wc = ws.cell(column=4, row=e + 3, value=address_text)
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    elif 9 <= e <= 17:
        ws = wb['评分最高外语电影']
        wc = ws.cell(column=4, row=e - 6, value=address_text)
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    elif 18 <= e <= 26:
        ws = wb['年度冷门佳片']
        wc = ws.cell(column=4, row=e - 15, value=address_text)
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    elif 27 <= e <= 35:
        ws = wb['华语剧集']
        wc = ws.cell(column=4, row=e - 24, value=address_text)
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    e += 1
# Get the ratings (except the top-ranked entry in each category)
movies_scores_list = driver.find_elements(by=By.CSS_SELECTOR, value='.subjects-rank-rating')
# Write the ratings into each worksheet
f = 0
for movie_score in movies_scores_list:
    score = movie_score.text
    if 0 <= f <= 8:
        ws = wb['评分最高华语电影']
        wc = ws.cell(column=3, row=f + 3, value=score)
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    elif 9 <= f <= 17:
        ws = wb['评分最高外语电影']
        wc = ws.cell(column=3, row=f - 6, value=score)
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    elif 18 <= f <= 26:
        ws = wb['年度冷门佳片']
        wc = ws.cell(column=3, row=f - 15, value=score)
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    elif 27 <= f <= 35:
        ws = wb['华语剧集']
        wc = ws.cell(column=3, row=f - 24, value=score)
        # Border on all four sides of the cell
        wc.border = Border(left=Side(border_style='thin'), right=Side(border_style='thin'),
                           top=Side(border_style='thin'), bottom=Side(border_style='thin'))
    f += 1
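
One detail of the title extraction above is worth a note: `re.search` returns `None` when the text contains no Chinese characters, so calling `.group(1)` on the result would raise `AttributeError`. The pattern also keeps only the first run of Chinese characters, so a title that starts with digits loses them. A more defensive variant might look like the sketch below; `extract_chinese` is a hypothetical helper name and the sample strings are made up.

```python
import re

# Same character range as in the script: one or more consecutive Chinese characters.
CHINESE_RUN = re.compile(r'[\u4e00-\u9fff]+')


def extract_chinese(text, default=''):
    """Return the first run of Chinese characters in text, or default if there is none."""
    match = CHINESE_RUN.search(text)
    return match.group(0) if match else default


print(extract_chinese('3 示例片名 Sample Title'))
print(extract_chinese('No Chinese here', default='(unknown)'))
```

Whether a fallback value or skipping the entry is the better choice depends on how the page formats its titles, which this sketch does not assume.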

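There is a lot of code, but it follows a regular pattern. In the previous article, each batch of scraped data was turned into a list that was then iterated over and printed.

Here the iteration saves instead of printing. The elements in each list above come in a regular order (working through it yourself is the best way to see this), so a few conditional checks are enough to write them into the four category worksheets, along with some code that formats the cells.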
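
Much of the repetition comes from building the same four-sided thin border in every branch. One way to shorten the script, shown here only as a sketch, is to create the border once and reuse it; `THIN_BORDER` and `write_cell` are hypothetical names that do not appear in the original code.

```python
import openpyxl
from openpyxl.styles import Border, Side

# Build the border once; the same style object can be assigned to many cells.
THIN = Side(border_style='thin')
THIN_BORDER = Border(left=THIN, right=THIN, top=THIN, bottom=THIN)


def write_cell(ws, row, column, value):
    """Write value into the cell and give it a thin border on all four sides."""
    cell = ws.cell(row=row, column=column, value=value)
    cell.border = THIN_BORDER
    return cell


wb = openpyxl.Workbook()
ws = wb.active
cell = write_cell(ws, row=2, column=3, value='9.1')  # placeholder rating
print(cell.coordinate, cell.value, cell.border.left.style)
```

With such a helper, each branch of the loops shrinks to a single `write_cell(...)` call.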
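
The ranges in the last three loops (0 to 8, 9 to 17, 18 to 26, 27 to 35) follow one rule: the ranked lists hold nine entries per category, in category order, and they start at row 3 of each worksheet. Under that assumption the target sheet and row can be computed with `divmod` instead of an if/elif chain. This is a sketch; `locate`, `ITEMS_PER_SHEET` and `FIRST_ROW` are hypothetical names.

```python
SHEET_NAMES = ['评分最高华语电影', '评分最高外语电影', '年度冷门佳片', '华语剧集']
ITEMS_PER_SHEET = 9  # ranks 2-10; the top-ranked entry is scraped separately
FIRST_ROW = 3        # row 1 is the header, row 2 holds the top-ranked entry


def locate(index):
    """Map a position in the flat scraped list to (sheet name, row number)."""
    sheet_no, offset = divmod(index, ITEMS_PER_SHEET)
    return SHEET_NAMES[sheet_no], FIRST_ROW + offset


for d in (0, 8, 9, 27, 35):
    print(d, locate(d))
```

Unlike the if/elif chain, which silently ignores indices of 36 and above, `locate` raises `IndexError` for them, which makes an unexpected page layout easier to notice.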


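4. Saving the workbook

The code is as follows:

del wb['Sheet']
wb.save(f'example{int(time.time())}.xlsx')
# Close the browser once the data has been saved
driver.quit()

This deletes the workbook's default worksheet, Sheet, which is not needed, and saves the workbook (into the current working directory by default). The timestamp in the file name keeps earlier results from being overwritten, and driver.quit() closes the Firefox window that would otherwise stay open.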
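
To check that the file was written as expected, it can be opened again with `openpyxl.load_workbook`. The sketch below writes to a temporary directory so that it does not clutter the current folder; the sheet name 'Demo' and the single header cell are placeholders.

```python
import tempfile
import time
from pathlib import Path

import openpyxl

wb = openpyxl.Workbook()
ws = wb.create_sheet(index=0, title='Demo')  # placeholder sheet name
ws.cell(row=1, column=1, value='片名')
del wb['Sheet']

# Timestamp in the file name, as in the script, so earlier results are not overwritten.
out_path = Path(tempfile.mkdtemp()) / f'example{int(time.time())}.xlsx'
wb.save(str(out_path))

# Read the file back and inspect what was stored.
reloaded = openpyxl.load_workbook(str(out_path))
print(reloaded.sheetnames, reloaded['Demo']['A1'].value)
```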


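Summary

The rest is straightforward. The main part is the conditional logic for iterating over the data and saving it, which only becomes clear once you work through it by hand. This article uses Python 3.11.6. Pay attention to the environment: with a different setup, even identical code may not run correctly.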
