[Web Crawler] Jsoup: an HTML parsing tool

2025/3/10 5:19:08 / Article source: https://www.cnblogs.com/johnnyzen/p/18449179

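1 Overview

Introduction

Jsoup is a Java-based HTML parser. It offers a simple, flexible API for parsing HTML documents from a URL, a file, or a string, and it helps developers extract data from HTML documents, manipulate DOM elements, and handle form submissions.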

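Key features

Jsoup's main features include:

Easy to use: Jsoup provides a small, simple API that makes HTML parsing straightforward. Developers can pick DOM elements with a jQuery-like selector syntax and pull out the data they need.
Robust HTML handling: Jsoup follows the HTML5 standard and can process incomplete or broken HTML documents. It repairs markup errors (for example unclosed tags) while parsing and otherwise keeps the structure of the original HTML.
Safety: Jsoup includes a sanitizer for defending against XSS. Jsoup.clean(...) with a Safelist strips tags and attributes that are not on the allow-list, so untrusted HTML can be made safe before it is used. Note that this is an explicit step; plain parsing does not filter anything.
CSS selector support: Jsoup selects DOM elements with CSS selectors, which gives developers a flexible way to locate and operate on elements in an HTML document.
Java integration: Jsoup is written in Java and integrates seamlessly with Java programs, so the parsed data can be processed with any Java language feature or library.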
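As a minimal sketch of the selector-based extraction described above: the sample HTML, the class name SelectorDemo, and the selector div.news > a[href] are all made up for illustration and are not taken from the original article.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo {
    // Hypothetical sample page: one news block with two links and one anchor without href.
    static final String SAMPLE =
            "<div class=\"news\"><a href=\"/a\">First</a><a href=\"/b\">Second</a><a>No link</a></div>"
          + "<div class=\"ads\"><a href=\"/x\">Ad</a></div>";

    // Collect "text -> href" for every link that is a direct child of div.news.
    static List<String> headlineLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> out = new ArrayList<>();
        for (Element a : doc.select("div.news > a[href]")) {
            out.add(a.text() + " -> " + a.attr("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        headlineLinks(SAMPLE).forEach(System.out::println);
    }
}
```

The anchor without an href and the link inside div.ads are skipped because they do not match the selector.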

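Use cases

Jsoup's use cases in big data and cloud computing include, but are not limited to:

Web data scraping: Jsoup helps developers extract data from web pages, for example when crawling news or product information. Parsing the HTML document gives fast, accurate access to the required data.
Data cleaning and processing: cloud workloads often need to clean and process large volumes of data. Jsoup can parse HTML documents, extract the relevant data, and hand it on for further processing and analysis.
Web content analysis: Jsoup helps analyze page content, for example extracting keywords or counting how often each tag occurs. This is useful for search engine optimization and web analytics.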
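The tag-counting idea mentioned above could look roughly like this. It is only a sketch: the class name TagCount and the sample markup are assumptions, and the selector "body *" is used so that only elements inside the body are counted.

```java
import java.util.Map;
import java.util.TreeMap;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class TagCount {
    // Count how many times each tag name occurs among the descendants of <body>.
    static Map<String, Integer> count(String html) {
        Map<String, Integer> counts = new TreeMap<>(); // TreeMap keeps tag names sorted
        for (Element el : Jsoup.parse(html).select("body *")) {
            counts.merge(el.tagName(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical input with one h1, two p and one b element.
        System.out.println(count("<h1>T</h1><p>a</p><p>b <b>c</b></p>"));
    }
}
```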

Comparable tools

Tools that crawlers commonly use to parse HTML documents include:

[java] Jsoup

https://github.com/jhy/jsoup

https://jsoup.org/

https://mvnrepository.com/artifact/org.jsoup/jsoup/1.12.2

[python] Beautiful Soup

https://www.crummy.com/software/BeautifulSoup/

https://github.com/DeronW/beautifulsoup/tree/v4.4.0

https://beautifulsoup.readthedocs.io/

https://beautifulsoup.readthedocs.io/zh-cn/v4.4.0/

2 Usage guide

This chapter is based on version 1.14.3.

Adding the dependency

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <!-- 1.12.2 / 1.14.3 / 1.17.2 -->
    <version>1.14.3</version>
</dependency>

Core API

org.jsoup.Jsoup

package org.jsoup;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import javax.annotation.Nullable;
import org.jsoup.helper.DataUtil;
import org.jsoup.helper.HttpConnection;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;
import org.jsoup.safety.Whitelist;

public class Jsoup {
    private Jsoup() {
    }

    public static Document parse(String html, String baseUri) {
        return Parser.parse(html, baseUri);
    }

    public static Document parse(String html, String baseUri, Parser parser) {
        return parser.parseInput(html, baseUri);
    }

    public static Document parse(String html, Parser parser) {
        return parser.parseInput(html, "");
    }

    public static Document parse(String html) {
        return Parser.parse(html, "");
    }

    public static Connection connect(String url) {
        return HttpConnection.connect(url);
    }

    public static Connection newSession() {
        return new HttpConnection();
    }

    public static Document parse(File file, @Nullable String charsetName, String baseUri) throws IOException {
        return DataUtil.load(file, charsetName, baseUri);
    }

    public static Document parse(File file, @Nullable String charsetName) throws IOException {
        return DataUtil.load(file, charsetName, file.getAbsolutePath());
    }

    public static Document parse(File file, @Nullable String charsetName, String baseUri, Parser parser) throws IOException {
        return DataUtil.load(file, charsetName, baseUri, parser);
    }

    public static Document parse(InputStream in, @Nullable String charsetName, String baseUri) throws IOException {
        return DataUtil.load(in, charsetName, baseUri);
    }

    public static Document parse(InputStream in, @Nullable String charsetName, String baseUri, Parser parser) throws IOException {
        return DataUtil.load(in, charsetName, baseUri, parser);
    }

    public static Document parseBodyFragment(String bodyHtml, String baseUri) {
        return Parser.parseBodyFragment(bodyHtml, baseUri);
    }

    public static Document parseBodyFragment(String bodyHtml) {
        return Parser.parseBodyFragment(bodyHtml, "");
    }

    public static Document parse(URL url, int timeoutMillis) throws IOException {
        Connection con = HttpConnection.connect(url);
        con.timeout(timeoutMillis);
        return con.get();
    }

    public static String clean(String bodyHtml, String baseUri, Safelist safelist) {
        Document dirty = parseBodyFragment(bodyHtml, baseUri);
        Cleaner cleaner = new Cleaner(safelist);
        Document clean = cleaner.clean(dirty);
        return clean.body().html();
    }

    /** @deprecated */
    @Deprecated
    public static String clean(String bodyHtml, String baseUri, Whitelist safelist) {
        return clean(bodyHtml, baseUri, (Safelist)safelist);
    }

    public static String clean(String bodyHtml, Safelist safelist) {
        return clean(bodyHtml, "", safelist);
    }

    /** @deprecated */
    @Deprecated
    public static String clean(String bodyHtml, Whitelist safelist) {
        return clean(bodyHtml, (Safelist)safelist);
    }

    public static String clean(String bodyHtml, String baseUri, Safelist safelist, Document.OutputSettings outputSettings) {
        Document dirty = parseBodyFragment(bodyHtml, baseUri);
        Cleaner cleaner = new Cleaner(safelist);
        Document clean = cleaner.clean(dirty);
        clean.outputSettings(outputSettings);
        return clean.body().html();
    }

    /** @deprecated */
    @Deprecated
    public static String clean(String bodyHtml, String baseUri, Whitelist safelist, Document.OutputSettings outputSettings) {
        return clean(bodyHtml, baseUri, (Safelist)safelist, outputSettings);
    }

    public static boolean isValid(String bodyHtml, Safelist safelist) {
        return (new Cleaner(safelist)).isValidBodyHtml(bodyHtml);
    }

    /** @deprecated */
    @Deprecated
    public static boolean isValid(String bodyHtml, Whitelist safelist) {
        return isValid(bodyHtml, (Safelist)safelist);
    }
}
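The clean(...) and isValid(...) methods in the listing above are the entry points to Jsoup's sanitizer. A minimal sketch of how they might be used follows; the class name SanitizeDemo and the input string are invented for illustration, and Safelist.basic() is just one of the built-in presets (on versions that predate Safelist, the deprecated Whitelist plays the same role).

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class SanitizeDemo {
    // Hypothetical untrusted input: a script element and an inline event handler.
    static final String DIRTY =
            "<p>Hi <script>alert(1)</script>"
          + "<a href=\"http://example.com/\" onclick=\"steal()\">link</a></p>";

    // Keep only the tags and attributes allowed by the basic safelist.
    static String sanitize(String html) {
        return Jsoup.clean(html, Safelist.basic());
    }

    public static void main(String[] args) {
        System.out.println("valid before cleaning: " + Jsoup.isValid(DIRTY, Safelist.basic()));
        System.out.println(sanitize(DIRTY));
    }
}
```

The script element and the onclick attribute are dropped, while the paragraph, the link text, and the http link survive.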

Node

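Key API

Methods for traversing the DOM tree

Find an element by id: getElementById(String id)

Find elements by tag: getElementsByTag(String tag)

Find elements by class: getElementsByClass(String className)

Find elements by attribute: getElementsByAttribute(String key)

Sibling traversal: siblingElements(), firstElementSibling(), lastElementSibling(); nextElementSibling(), previousElementSibling()

Traversal between levels: parent(), children(), child(int index)

These methods return an Element or an Elements object, which exposes the following accessors:

attr(String key): get the value of an attribute
attributes(): get all attributes of the node
id(): get the node's id
className(): get the node's class attribute value as a single string
classNames(): get all of the node's class names as a set
text(): get the combined text of the node and its descendants
html(): get the node's inner HTML
outerHtml(): get the node's outer HTML
data(): get the node's data content, used for script or style tags and the like
tag(): get the Tag object
tagName(): get the node's tag name

With these APIs, working with the DOM is as convenient as it is with jQuery.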
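To make the lookup and accessor methods above concrete, here is a small sketch. The markup and the class name TraversalDemo are assumptions chosen for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TraversalDemo {
    // Hypothetical markup: a div with two classes, a paragraph and a trailing span.
    static final String HTML =
            "<div id=\"main\" class=\"box highlight\"><p>Hello <b>world</b></p><span>tail</span></div>";

    // Walk from the div to its first child, that child's next sibling, and back up to the parent.
    static String describe() {
        Document doc = Jsoup.parse(HTML);
        Element main = doc.getElementById("main");
        Element p = main.child(0);
        Element next = p.nextElementSibling();
        return main.tagName() + "#" + main.id()
                + " classes=" + main.classNames().size()
                + " children=" + main.children().size()
                + " first=" + p.tagName() + ":" + p.text()
                + " next=" + next.tagName() + ":" + next.text()
                + " parentOfP=" + p.parent().id();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```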

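Jsoup also supports modifying the DOM tree:

text(String value): set the text content

html(String value): replace the inner HTML directly

append(String html): parse the HTML and add it inside the element, after its existing children

prepend(String html): parse the HTML and add it inside the element, before its existing children

appendText(String text), prependText(String text)

appendElement(String tagName), prependElement(String tagName)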
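A short sketch of these mutators in action; the list markup and the class name ModifyDemo are hypothetical. It shows that prepend and append place the new nodes inside the target element, at the start and the end of its children respectively.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ModifyDemo {
    // Start from a list with a single item and grow it on both ends.
    static List<String> buildList() {
        Document doc = Jsoup.parseBodyFragment("<ul id=\"list\"><li>b</li></ul>");
        Element ul = doc.getElementById("list");
        ul.prepend("<li>a</li>");          // becomes the first child
        ul.append("<li>c</li>");           // becomes the last child
        ul.appendElement("li").text("d");  // create an empty <li>, then set its text
        return ul.children().eachText();
    }

    public static void main(String[] args) {
        System.out.println(buildList());
    }
}
```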

Source code

package org.jsoup.nodes;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import javax.annotation.Nullable;
import org.jsoup.SerializationException;
import org.jsoup.helper.Validate;
import org.jsoup.internal.StringUtil;
import org.jsoup.select.NodeFilter;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public abstract class Node implements Cloneable {
    static final List<Node> EmptyNodes = Collections.emptyList();
    static final String EmptyString = "";
    @Nullable
    Node parentNode;
    int siblingIndex;

    protected Node() {
    }

    public abstract String nodeName();

    protected abstract boolean hasAttributes();

    public boolean hasParent() {
        return this.parentNode != null;
    }

    public String attr(String attributeKey) {...}

    public abstract Attributes attributes();

    public int attributesSize() {
        return this.hasAttributes() ? this.attributes().size() : 0;
    }

    public Node attr(String attributeKey, String attributeValue) {
        attributeKey = NodeUtils.parser(this).settings().normalizeAttribute(attributeKey);
        this.attributes().putIgnoreCase(attributeKey, attributeValue);
        return this;
    }

    public boolean hasAttr(String attributeKey) {
        Validate.notNull(attributeKey);
        if (!this.hasAttributes()) {
            return false;
        } else {
            if (attributeKey.startsWith("abs:")) {
                String key = attributeKey.substring("abs:".length());
                if (this.attributes().hasKeyIgnoreCase(key) && !this.absUrl(key).isEmpty()) {
                    return true;
                }
            }
            return this.attributes().hasKeyIgnoreCase(attributeKey);
        }
    }

    public Node removeAttr(String attributeKey) {
        Validate.notNull(attributeKey);
        if (this.hasAttributes()) {
            this.attributes().removeIgnoreCase(attributeKey);
        }
        return this;
    }

    public Node clearAttributes() {
        if (this.hasAttributes()) {
            Iterator<Attribute> it = this.attributes().iterator();
            while(it.hasNext()) {
                it.next();
                it.remove();
            }
        }
        return this;
    }

    public abstract String baseUri();

    protected abstract void doSetBaseUri(String var1);

    public void setBaseUri(String baseUri) {
        Validate.notNull(baseUri);
        this.doSetBaseUri(baseUri);
    }

    public String absUrl(String attributeKey) {
        Validate.notEmpty(attributeKey);
        return this.hasAttributes() && this.attributes().hasKeyIgnoreCase(attributeKey) ? StringUtil.resolve(this.baseUri(), this.attributes().getIgnoreCase(attributeKey)) : "";
    }

    protected abstract List<Node> ensureChildNodes();

    public Node childNode(int index) {
        return (Node)this.ensureChildNodes().get(index);
    }

    public List<Node> childNodes() {
        if (this.childNodeSize() == 0) {
            return EmptyNodes;
        } else {
            List<Node> children = this.ensureChildNodes();
            List<Node> rewrap = new ArrayList(children.size());
            rewrap.addAll(children);
            return Collections.unmodifiableList(rewrap);
        }
    }

    public List<Node> childNodesCopy() {
        List<Node> nodes = this.ensureChildNodes();
        ArrayList<Node> children = new ArrayList(nodes.size());
        Iterator var3 = nodes.iterator();
        while(var3.hasNext()) {
            Node node = (Node)var3.next();
            children.add(node.clone());
        }
        return children;
    }

    public abstract int childNodeSize();

    protected Node[] childNodesAsArray() {
        return (Node[])this.ensureChildNodes().toArray(new Node[0]);
    }

    public abstract Node empty();

    @Nullable
    public Node parent() {
        return this.parentNode;
    }

    @Nullable
    public final Node parentNode() {
        return this.parentNode;
    }

    public Node root() {
        Node node;
        for(node = this; node.parentNode != null; node = node.parentNode) {
        }
        return node;
    }

    @Nullable
    public Document ownerDocument() {
        Node root = this.root();
        return root instanceof Document ? (Document)root : null;
    }

    public void remove() {
        Validate.notNull(this.parentNode);
        this.parentNode.removeChild(this);
    }

    public Node before(String html) {
        this.addSiblingHtml(this.siblingIndex, html);
        return this;
    }

    public Node before(Node node) {
        Validate.notNull(node);
        Validate.notNull(this.parentNode);
        this.parentNode.addChildren(this.siblingIndex, node);
        return this;
    }

    public Node after(String html) {
        this.addSiblingHtml(this.siblingIndex + 1, html);
        return this;
    }

    public Node after(Node node) {
        Validate.notNull(node);
        Validate.notNull(this.parentNode);
        this.parentNode.addChildren(this.siblingIndex + 1, node);
        return this;
    }

    private void addSiblingHtml(int index, String html) {
        Validate.notNull(html);
        Validate.notNull(this.parentNode);
        Element context = this.parent() instanceof Element ? (Element)this.parent() : null;
        List<Node> nodes = NodeUtils.parser(this).parseFragmentInput(html, context, this.baseUri());
        this.parentNode.addChildren(index, (Node[])nodes.toArray(new Node[0]));
    }

    public Node wrap(String html) {
        Validate.notEmpty(html);
        Element context = this.parentNode != null && this.parentNode instanceof Element ? (Element)this.parentNode : (this instanceof Element ? (Element)this : null);
        List<Node> wrapChildren = NodeUtils.parser(this).parseFragmentInput(html, context, this.baseUri());
        Node wrapNode = (Node)wrapChildren.get(0);
        if (!(wrapNode instanceof Element)) {
            return this;
        } else {
            Element wrap = (Element)wrapNode;
            Element deepest = this.getDeepChild(wrap);
            if (this.parentNode != null) {
                this.parentNode.replaceChild(this, wrap);
            }
            deepest.addChildren(new Node[]{this});
            if (wrapChildren.size() > 0) {
                for(int i = 0; i < wrapChildren.size(); ++i) {
                    Node remainder = (Node)wrapChildren.get(i);
                    if (wrap != remainder) {
                        if (remainder.parentNode != null) {
                            remainder.parentNode.removeChild(remainder);
                        }
                        wrap.after(remainder);
                    }
                }
            }
            return this;
        }
    }

    @Nullable
    public Node unwrap() {
        Validate.notNull(this.parentNode);
        List<Node> childNodes = this.ensureChildNodes();
        Node firstChild = childNodes.size() > 0 ? (Node)childNodes.get(0) : null;
        this.parentNode.addChildren(this.siblingIndex, this.childNodesAsArray());
        this.remove();
        return firstChild;
    }

    private Element getDeepChild(Element el) {
        List<Element> children = el.children();
        return children.size() > 0 ? this.getDeepChild((Element)children.get(0)) : el;
    }

    void nodelistChanged() {
    }

    public void replaceWith(Node in) {
        Validate.notNull(in);
        Validate.notNull(this.parentNode);
        this.parentNode.replaceChild(this, in);
    }

    protected void setParentNode(Node parentNode) {
        Validate.notNull(parentNode);
        if (this.parentNode != null) {
            this.parentNode.removeChild(this);
        }
        this.parentNode = parentNode;
    }

    protected void replaceChild(Node out, Node in) {
        Validate.isTrue(out.parentNode == this);
        Validate.notNull(in);
        if (in.parentNode != null) {
            in.parentNode.removeChild(in);
        }
        int index = out.siblingIndex;
        this.ensureChildNodes().set(index, in);
        in.parentNode = this;
        in.setSiblingIndex(index);
        out.parentNode = null;
    }

    protected void removeChild(Node out) {
        Validate.isTrue(out.parentNode == this);
        int index = out.siblingIndex;
        this.ensureChildNodes().remove(index);
        this.reindexChildren(index);
        out.parentNode = null;
    }

    protected void addChildren(Node... children) {
        List<Node> nodes = this.ensureChildNodes();
        Node[] var3 = children;
        int var4 = children.length;
        for(int var5 = 0; var5 < var4; ++var5) {
            Node child = var3[var5];
            this.reparentChild(child);
            nodes.add(child);
            child.setSiblingIndex(nodes.size() - 1);
        }
    }

    protected void addChildren(int index, Node... children) {...}

    protected void reparentChild(Node child) {
        child.setParentNode(this);
    }

    private void reindexChildren(int start) {
        if (this.childNodeSize() != 0) {
            List<Node> childNodes = this.ensureChildNodes();
            for(int i = start; i < childNodes.size(); ++i) {
                ((Node)childNodes.get(i)).setSiblingIndex(i);
            }
        }
    }

    public List<Node> siblingNodes() {
        if (this.parentNode == null) {
            return Collections.emptyList();
        } else {
            List<Node> nodes = this.parentNode.ensureChildNodes();
            List<Node> siblings = new ArrayList(nodes.size() - 1);
            Iterator var3 = nodes.iterator();
            while(var3.hasNext()) {
                Node node = (Node)var3.next();
                if (node != this) {
                    siblings.add(node);
                }
            }
            return siblings;
        }
    }

    @Nullable
    public Node nextSibling() {
        if (this.parentNode == null) {
            return null;
        } else {
            List<Node> siblings = this.parentNode.ensureChildNodes();
            int index = this.siblingIndex + 1;
            return siblings.size() > index ? (Node)siblings.get(index) : null;
        }
    }

    @Nullable
    public Node previousSibling() {
        if (this.parentNode == null) {
            return null;
        } else {
            return this.siblingIndex > 0 ? (Node)this.parentNode.ensureChildNodes().get(this.siblingIndex - 1) : null;
        }
    }

    public int siblingIndex() {
        return this.siblingIndex;
    }

    protected void setSiblingIndex(int siblingIndex) {
        this.siblingIndex = siblingIndex;
    }

    public Node traverse(NodeVisitor nodeVisitor) {
        Validate.notNull(nodeVisitor);
        NodeTraversor.traverse(nodeVisitor, this);
        return this;
    }

    public Node filter(NodeFilter nodeFilter) {
        Validate.notNull(nodeFilter);
        NodeTraversor.filter(nodeFilter, this);
        return this;
    }

    public String outerHtml() {
        StringBuilder accum = StringUtil.borrowBuilder();
        this.outerHtml(accum);
        return StringUtil.releaseBuilder(accum);
    }

    protected void outerHtml(Appendable accum) {
        NodeTraversor.traverse(new OuterHtmlVisitor(accum, NodeUtils.outputSettings(this)), this);
    }

    abstract void outerHtmlHead(Appendable var1, int var2, Document.OutputSettings var3) throws IOException;

    abstract void outerHtmlTail(Appendable var1, int var2, Document.OutputSettings var3) throws IOException;

    public <T extends Appendable> T html(T appendable) {
        this.outerHtml(appendable);
        return appendable;
    }

    public String toString() {
        return this.outerHtml();
    }

    protected void indent(Appendable accum, int depth, Document.OutputSettings out) throws IOException {
        accum.append('\n').append(StringUtil.padding(depth * out.indentAmount()));
    }

    public boolean equals(@Nullable Object o) {
        return this == o;
    }

    public int hashCode() {
        return super.hashCode();
    }

    public boolean hasSameValue(@Nullable Object o) {
        if (this == o) {
            return true;
        } else {
            return o != null && this.getClass() == o.getClass() ? this.outerHtml().equals(((Node)o).outerHtml()) : false;
        }
    }

    public Node clone() {...}
    ...

org.jsoup.nodes.Element extends Node

org.jsoup.nodes.Document extends Element
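Because Document and Element are both Node subclasses, a parsed page can be walked generically through Node.childNodes(). The sketch below illustrates that hierarchy by collecting every TextNode in document order; the class name NodeTreeDemo, the helper collectText, and the sample markup are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class NodeTreeDemo {
    // Depth-first walk over the Node tree, recording the raw text of each TextNode.
    static void collectText(Node node, List<String> out) {
        if (node instanceof TextNode) {
            out.add(((TextNode) node).getWholeText());
        }
        for (Node child : node.childNodes()) {
            collectText(child, out);
        }
    }

    static List<String> texts(String html) {
        Document doc = Jsoup.parse(html); // a Document is itself an Element, hence a Node
        List<String> out = new ArrayList<>();
        collectText(doc, out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(texts("<p>Hello <b>world</b>!</p>"));
    }
}
```

Element-level methods such as children() skip text nodes, which is why the walk uses childNodes() instead.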

Use cases

CASE: parse an HTML document => get a Document object

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String html = "<html><head><title>First parse</title></head><body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

CASE: parse an HTML fragment => get a Document object

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();
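A small sketch of what parseBodyFragment does with the unclosed fragment above: the div is closed automatically and the result is wrapped in a full html/head/body shell. The class name FragmentDemo is hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FragmentDemo {
    // Parse a body fragment; jsoup supplies the surrounding html/head/body elements.
    static Document parse(String fragment) {
        return Jsoup.parseBodyFragment(fragment);
    }

    public static void main(String[] args) {
        Document doc = parse("<div><p>Lorem ipsum.</p>"); // note the missing </div>
        System.out.println("body children: " + doc.body().children().size());
        System.out.println("first child:   " + doc.body().child(0).tagName());
        System.out.println("paragraph:     " + doc.select("div > p").text());
    }
}
```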

CASE: parse a URL => get a Document object

org.jsoup.Connection connection = Jsoup.connect("http://example.com/");
Document doc = connection.get();//HTTP Method = GET
String title = doc.title();

The request can also carry cookies and other parameters (much like a Python crawler):

Document doc = Jsoup.connect("http://example.com")
    .data("query", "Java")
    .userAgent("Mozilla")
    .cookie("auth", "token")
    .timeout(3000)
    .post(); //HTTP Method = POST
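Nothing is sent until get(), post(), or execute() is called, so the configured request can be inspected first. The sketch below builds the same request as above without touching the network; the class name RequestPreview is an assumption, and the explicit method(Connection.Method.POST) call stands in for the final post().

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class RequestPreview {
    // Configure the request but do not execute it; no network traffic happens here.
    static Connection build() {
        return Jsoup.connect("http://example.com")
                .data("query", "Java")
                .userAgent("Mozilla")
                .cookie("auth", "token")
                .timeout(3000)
                .method(Connection.Method.POST);
    }

    public static void main(String[] args) {
        Connection.Request req = build().request();
        System.out.println(req.method() + " " + req.url());
        System.out.println("timeout=" + req.timeout() + " ms, cookie auth=" + req.cookie("auth"));
        // To actually send it, call build().execute(), which throws IOException on failure.
    }
}
```

Inspecting the Request this way can be handy when debugging why a crawler is rejected, since the timeout, cookies, and method are visible before any request goes out.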

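CASE: parse a local HTML file => get a Document object

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

/**
 * Extract the text content of a file.
 * Needs java.io.BufferedReader, java.io.File, java.io.FileInputStream and java.io.InputStreamReader.
 * ENCODE is assumed to be a charset-name constant (for example "UTF-8") defined elsewhere in the class.
 */
public static String openFile(String szFileName) {
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader bis = new BufferedReader(new InputStreamReader(new FileInputStream(new File(szFileName)), ENCODE))) {
        StringBuilder szContent = new StringBuilder();
        String szTemp;
        while ((szTemp = bis.readLine()) != null) {
            szContent.append(szTemp).append("\n");
        }
        return szContent.toString();
    } catch (Exception e) {
        // any I/O or encoding problem yields an empty string
        return "";
    }
}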
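The third argument to Jsoup.parse(input, "UTF-8", "http://example.com/") is the base URI, which is what lets relative links in a local file be resolved. A minimal sketch of that behavior, using an in-memory string instead of a file so it runs anywhere; the class name BaseUriDemo and the sample link are made up.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriDemo {
    // Resolve the first link's href against the given base URI.
    static String resolveFirstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element a = doc.select("a[href]").first();
        return a.absUrl("href"); // a.attr("abs:href") is the equivalent attribute-style lookup
    }

    public static void main(String[] args) {
        String html = "<a href=\"/docs/page.html\">Docs</a>";
        System.out.println(resolveFirstLink(html, "http://example.com/"));
    }
}
```

Without a base URI, absUrl returns an empty string for relative links, while attr("href") still returns the raw value.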

