爬虫入门到精通_基础篇4(BeautifulSoup库_解析库,基本使用,标签选择器,标准选择器,CSS选择器)

1 Beautiful说明

BeautifulSoup库是灵活又方便的网页解析库,处理高效,支持多种解析器。利用它不用编写正则表达式即可方便地实线网页信息的提取。

安装

pip3 install beautifulsoup4

解析库

解析器使用方法优势劣势
Python标准库BeautifulSoup(markup, “html.parser”)Python的内置标准库、执行速度适中 、文档容错能力强Python 2.7.3 or 3.2.2)前的版本中文容错能力差
lxml HTML 解析器BeautifulSoup(markup, “lxml”)速度快、文档容错能力强需要安装C语言库
lxml XML 解析器BeautifulSoup(markup, “xml”)速度快、唯一支持XML的解析器需要安装C语言库
html5libBeautifulSoup(markup, “html5lib”)最好的容错性、以浏览器的方式解析文档、生成HTML5格式的文档速度慢、不依赖外部扩展

2 基本使用

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml')  # 传入解析器:lxml
print(soup.prettify())  # 格式化代码,自动补全
print(soup.title.string)  # 得到title标签里的内容

在这里插入图片描述

报错:
在这里插入图片描述

3 标签选择器

选择元素

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml')  # 传入解析器:lxml
print(soup.title)  # 选择了title标签
print(type(soup.title))  # 查看类型
print(soup.head)

在这里插入图片描述

获取名称

获得标签的名称:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml')  # 传入解析器:lxml
print(soup.title.name)

在这里插入图片描述

获取属性

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml')  # 传入解析器:lxml
print(soup.p.attrs['name'])#获取p标签中,name这个属性的值
print(soup.p['name'])#另一种写法,比较直接

在这里插入图片描述

获取内容

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml')  # 传入解析器:lxml
print(soup.p.string)

在这里插入图片描述

嵌套选择

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml')  # 传入解析器:lxml
print(soup.head.title.string)

在这里插入图片描述

子节点和子孙节点

contents方式

html = """
<html><head><title>The Dormouse's story</title></head><body><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>and they lived at the bottom of a well.</p><p class="story">...</p>
"""from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml')  # 传入解析器:lxml
print(soup.p.contents)  # 获取指定标签的子节点,类型是list

输出结果:

['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']Process finished with exit code 0

child方式

html = """
<html><head><title>The Dormouse's story</title></head><body><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>and they lived at the bottom of a well.</p><p class="story">...</p>
"""from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml')  # 传入解析器:lxml
print(soup.p.children)#获取指定标签的子节点的迭代器对象
for i,children in enumerate(soup.p.children):#i接受索引,children接受内容print(i,children)

2为空,是因为标签与标签之间空一行
在这里插入图片描述

子孙节点

html = """
<html><head><title>The Dormouse's story</title></head><body><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>and they lived at the bottom of a well.</p><p class="story">...</p>
"""from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml')  # 传入解析器:lxml
print(soup.p.descendants)#获取指定标签的子孙节点的迭代器对象
for i,child in enumerate(soup.p.descendants):#i接受索引,child接受内容print(i,child)

在这里插入图片描述

父节点和祖先节点

parent

html = """
<html><head><title>The Dormouse's story</title></head><body><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>and they lived at the bottom of a well.</p><p class="story">...</p>
"""from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml')  # 传入解析器:lxml
print(soup.a.parent)#获取指定标签的父节点

打印出了a节点的父节点:p标签
在这里插入图片描述

parents

html = """
<html><head><title>The Dormouse's story</title></head><body><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>and they lived at the bottom of a well.</p><p class="story">...</p>
"""from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml')  # 传入解析器:lxml
print(list(enumerate(soup.a.parents)))#获取指定标签的祖先节点

输出结果:


[(0, <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>and they lived at the bottom of a well.</p>), (1, <body>
<p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>), (2, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>), (3, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>)]Process finished with exit code 0

兄弟节点

html = """
<html><head><title>The Dormouse's story</title></head><body><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>and they lived at the bottom of a well.</p><p class="story">...</p>
"""from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml')  # 传入解析器:lxml
print(list(enumerate(soup.a.next_siblings)))#获取指定标签的后面的兄弟节点
print(list(enumerate(soup.a.previous_siblings)))#获取指定标签的前面的兄弟节点

输出结果:

[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
[(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]Process finished with exit code 0

4 标准选择器

find_all( name , attrs , recursive , text , **kwargs )
可根据标签名、属性、内容查找文档。

name

html = '''
<div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div>
</div>
'''
from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))  # 查找所有ul标签下的内容
print(type(soup.find_all('ul')[0]))  # 查看其类型

在这里插入图片描述
嵌套地查找标签下的子标签:

html = '''
<div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div>
</div>
'''
from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):print(ul.find_all('li'))

在这里插入图片描述

attrs

通过属性进行元素的查找:

html = '''
<div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1" name="elements"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div>
</div>
'''from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))  # 传入的是一个字典类型,也就是想要查找的属性
print(soup.find_all(attrs={'name': 'elements'}))

在这里插入图片描述
特殊类型的参数查找:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))#id是个特殊的属性,可以直接使用
print(soup.find_all(class_='element')) #class是关键字所以要用class_

text

根据文本内容来进行选择:

html = '''
<div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))#查找文本为Foo的内容,但是返回的不是标签

在这里插入图片描述
text在做内容匹配的时候比较方便,但是在做内容查找的时候并不是太方便。

其他方式

find
find用法和findall一模一样,但是返回的是找到的第一个符合条件的内容输出。find_parents(), find_parent()
find_parents()返回所有祖先节点,find_parent()返回直接父节点。find_next_siblings() ,find_next_sibling()
1返回后面的所有兄弟节点,2返回后面的第一个兄弟节点find_previous_siblings(),find_previous_sibling()
1返回前面所有兄弟节点…find_all_next(),find_next()
1返回节点后所有符合条件的节点,2返回后面第一个符合条件的节点find_all_previous()和find_previous()
同理。

5 CSS选择器

通过select()直接传入CSS选择器即可完成选择

html = '''
<div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div>
</div>
'''
from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # .代表class,中间需要空格来分隔
print(soup.select('ul li'))  # 选择ul标签下面的li标签
print(soup.select('#list-2 .element'))  # '#'代表id。这句的意思是查找id为"list-2"的标签下的,class=element的元素
print(type(soup.select('ul')[0]))  # 打印节点类型

在这里插入图片描述
层层嵌套的选择:

html = '''
<div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):print(ul.select('li'))

在这里插入图片描述

获取属性

html = '''
<div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div>
</div>
'''
from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):print(ul['id'])  # 用[ ]即可获取属性print(ul.attrs['id'])  # 另一种写法

在这里插入图片描述

获取内容

html = '''
<div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):print(li.get_text())

在这里插入图片描述

6 总结

  • 推荐使用lxml解析库,必要时使用html.parser
  • 标签选择筛选功能弱但是速度快
  • 建议使用find()、find_all() 查询匹配单个结果或者多个结果
  • 如果对CSS选择器熟悉建议使用select()
  • 记住常用的获取属性和文本值的方法

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.hqwc.cn/news/449728.html

如若内容造成侵权/违法违规/事实不符,请联系编程知识网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

服了,一个ThreadLocal被问出了花

分享是最有效的学习方式。 博客&#xff1a;https://blog.ktdaddy.com/ 故事 地铁上&#xff0c;小帅无力地倚靠着杆子&#xff0c;脑子里尽是刚才面试官的夺命连环问&#xff0c;“用过TheadLocal么&#xff1f;ThreadLocal是如何解决共享变量访问的安全性的呢&#xff1f;你…

跨平台开发:浅析uni-app及其他主流APP开发方式

随着智能手机的普及&#xff0c;移动应用程序&#xff08;APP&#xff09;的需求不断增长。开发一款优秀的APP&#xff0c;不仅需要考虑功能和用户体验&#xff0c;还需要选择一种适合的开发方式。随着技术的发展&#xff0c;目前有多种主流的APP开发方式可供选择&#xff0c;其…

postgres:锁申请

什么是弱锁&#xff0c;强锁&#xff1f; 为了提高并发控制&#xff0c;PG通过将锁信息在本地缓存&#xff08;**LOCALLOCK**&#xff09;和快速处理常见锁&#xff08;fastpath&#xff09;&#xff0c;减少了对共享内存的访问&#xff0c;提高性能。从而出现了弱锁和强锁的概…

羊大师:冬季出行,心血管病患者应做好哪些准备?

羊大师&#xff1a;冬季出行&#xff0c;心血管病患者应做好哪些准备&#xff1f; 冬季将至&#xff0c;气温骤降&#xff0c;寒冷的天气不仅让人感到不适&#xff0c;对于患有心血管病的人来说&#xff0c;更是需要格外注意。在这个寒冷的季节里&#xff0c;心血管病患者需要…

MH-ET LIVE Boards(ATTiny88)实验一---点亮板载灯

MH-ET LIVE Boards(ATTiny88&#xff09;实验一点亮板载灯 在Arduino IDE中添加开发板资源包加入开发板json添加开发板 安装开发板驱动方法一&#xff1a;github下载2.0a4.rar方法二&#xff1a;开发板的package包中自带的2.0a4.rar安装驱动确认安装成功 blink.ino程序测试![在…

C++:输入流/输出流

C流类库简介 C为了克服C语言中的scanf和printf存在的缺点。&#xff0c;使用cin/cout控制输入/输出。 cin&#xff1a;表示标准输入的istream类对象&#xff0c;cin从终端读入数据。cout&#xff1a;表示标准输出的ostream类对象&#xff0c;cout向终端写数据。cerr&#xff…

Jmeter学习系列之四:测试计划元素介绍

测试计划元素 JMeter包含各种相互关联但为不同目的而设计的元素。在开始使用JMeter之前&#xff0c;最好先了解一下JMeter的一些主要元素。 注意:测试计划包含至少一个线程组。 以下是JMeter的一些主要组件: 测试计划&#xff08;Plan&#xff09;线程组(Thread Group)控制器…

数据结构—基础知识:哈夫曼编码

文章目录 数据结构—基础知识&#xff1a;哈夫曼编码哈夫曼编码的主要思想有关编码的概念哈夫曼编码满足两个性质&#xff1a; 数据结构—基础知识&#xff1a;哈夫曼编码 哈夫曼编码的主要思想 在进行数据压缩时&#xff0c;为了使压缩后的数据文件尽可能短&#xff0c;可采…

车企销售官网搭建流程

引言 近期在整理车企销售官网相关的一些材料时,由于之前也没接触过相关的业务,所以也是一边学习一遍整理,将自己理解官的网搭建流程给梳理出来,供与大家交流讨论。 官网搭建流程 官网搭建流程我分为两大步:网站Demo设计和网站搭建部署,具体流程如下图所示。 流程的具体…

03.PostgreSQL排序和分页

当我们使用SELECT语句查询表中数据的时候,PostgreSQL不确保按照一定的顺序返回结果。如果想要将查询的结果按照某些规则排序显示,需要使用ORDER BY子句。 1. 排序规则 使用ORDER BY 子句排序 ASC: 升序 DESC:降序 ORDER BY 子句在SELECT语句的结尾 1.1 单列排序 是指按…

C++类和对象入门(二)

顾得泉&#xff1a;个人主页 个人专栏&#xff1a;《Linux操作系统》 《C从入门到精通》 《LeedCode刷题》 键盘敲烂&#xff0c;年薪百万&#xff01; 一、类的作用域 类定义了一个新的作用域&#xff0c;类的所有成员都在类的作用域中。在类体外定义成员时&#xff0c;需要…

Red Panda Dev C++ 病毒项目模板

前言 在看这篇文章之前&#xff0c;请查看此文章&#xff01;否则你可能看不懂。 还记得上一讲吗&#xff1f; 没错&#xff0c;我的小脑瓜动了动&#xff0c;就。。。。。 好吧&#xff0c;模板&#xff0c;你又一次成功引起了我的注意 一、创建项目 首先创建一个项目&#x…