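1 Overview
Introduction
Jsoup is a Java-based HTML parser. It offers a simple, flexible API for parsing HTML documents from a URL, a file, or a string, and lets developers extract data from the document, manipulate DOM elements, and submit forms.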
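Key features
Jsoup's key features include:
- Easy to use: Jsoup exposes a small API that makes HTML parsing straightforward. Developers can pick out DOM elements with a jQuery-like selector syntax and extract the data they need.
- Robust HTML handling: Jsoup parses according to the HTML5 standard and copes with incomplete or malformed documents. It repairs errors such as unclosed tags while parsing, so the resulting DOM is well-formed and close to what a browser would build, rather than a byte-for-byte copy of the input markup.
- Security: Jsoup ships a Safelist-based cleaner (Jsoup.clean) that strips tags and attributes not on the allow list, which helps prevent XSS. Cleaning has to be called explicitly; plain parsing does not filter anything.
- CSS selector support: elements can be selected with CSS selectors, which gives developers a flexible way to locate and manipulate elements in an HTML document.
- Java integration: Jsoup is written in Java and integrates directly with Java programs, so the parsed data can be processed with any Java language feature or library.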
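The selector support described above can be sketched as follows. This is a minimal illustration rather than code from the original article: the class `SelectorExample`, the method `linkTexts`, and the sample markup are hypothetical; only `Jsoup.parse`, `Document.select`, and `Element.text` are jsoup calls.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class; not part of jsoup.
class SelectorExample {

    /** Returns the text of every link that has an href and sits directly inside div.news. */
    static List<String> linkTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        // "div.news > a[href]": direct <a> children of <div class="news"> that carry an href
        for (Element link : doc.select("div.news > a[href]")) {
            texts.add(link.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<div class=\"news\"><a href=\"/a\">First</a><a>No href</a></div>"
                + "<a href=\"/b\">Outside</a>";
        System.out.println(linkTexts(html));
    }
}
```

Only the first link matches: the second has no `href`, and the third is outside the `div`.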
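Use cases
In big data and cloud computing, typical uses of Jsoup include, but are not limited to:
- Web scraping: Jsoup helps developers extract data from web pages, for example when crawling news articles or product information. Parsing the HTML document makes it possible to retrieve the required data quickly and accurately.
- Data cleaning and processing: cloud workloads often need to clean and process large volumes of data. Jsoup can parse HTML documents, extract the relevant data, and hand it over for further processing and analysis.
- Page content analysis: Jsoup can be used to analyze page content, for example to extract keywords or count how often each tag occurs. This is useful for search engine optimization and web analytics.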
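As a rough sketch of the tag-counting analysis mentioned above, the following counts how often each tag appears in a document. The class `TagCountExample` and the method `countTags` are hypothetical names introduced for illustration; the jsoup calls used are `Jsoup.parse`, `Element.getAllElements`, and `Element.tagName`.

```java
import java.util.Map;
import java.util.TreeMap;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class; not part of jsoup.
class TagCountExample {

    /** Counts the occurrences of each tag name in the parsed document. */
    static Map<String, Integer> countTags(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, Integer> counts = new TreeMap<>();
        for (Element el : doc.getAllElements()) {
            if (el instanceof Document) {
                continue; // skip the synthetic document root itself
            }
            counts.merge(el.tagName(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countTags("<p>a</p><p>b</p><div><p>c</p></div>"));
    }
}
```

Note that the parser adds `html`, `head`, and `body` elements even when the input omits them, so they show up in the counts as well.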
Alternatives
HTML parsing tools commonly used in crawlers include:
- [java] Jsoup
  - https://github.com/jhy/jsoup
  - https://jsoup.org/
  - https://mvnrepository.com/artifact/org.jsoup/jsoup/1.12.2
- [python] Beautiful Soup
  - https://www.crummy.com/software/BeautifulSoup/
  - https://github.com/DeronW/beautifulsoup/tree/v4.4.0
  - https://beautifulsoup.readthedocs.io/
  - https://beautifulsoup.readthedocs.io/zh-cn/v4.4.0/
2 Usage Guide
- This chapter is based on version 1.14.3.
Adding the dependency
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <!-- 1.12.2 / 1.14.3 / 1.17.2 -->
    <version>1.14.3</version>
</dependency>
Core API
org.jsoup.Jsoup
package org.jsoup;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import javax.annotation.Nullable;
import org.jsoup.helper.DataUtil;
import org.jsoup.helper.HttpConnection;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;
import org.jsoup.safety.Whitelist;

public class Jsoup {
    private Jsoup() {
    }

    public static Document parse(String html, String baseUri) {
        return Parser.parse(html, baseUri);
    }

    public static Document parse(String html, String baseUri, Parser parser) {
        return parser.parseInput(html, baseUri);
    }

    public static Document parse(String html, Parser parser) {
        return parser.parseInput(html, "");
    }

    public static Document parse(String html) {
        return Parser.parse(html, "");
    }

    public static Connection connect(String url) {
        return HttpConnection.connect(url);
    }

    public static Connection newSession() {
        return new HttpConnection();
    }

    public static Document parse(File file, @Nullable String charsetName, String baseUri) throws IOException {
        return DataUtil.load(file, charsetName, baseUri);
    }

    public static Document parse(File file, @Nullable String charsetName) throws IOException {
        return DataUtil.load(file, charsetName, file.getAbsolutePath());
    }

    public static Document parse(File file, @Nullable String charsetName, String baseUri, Parser parser) throws IOException {
        return DataUtil.load(file, charsetName, baseUri, parser);
    }

    public static Document parse(InputStream in, @Nullable String charsetName, String baseUri) throws IOException {
        return DataUtil.load(in, charsetName, baseUri);
    }

    public static Document parse(InputStream in, @Nullable String charsetName, String baseUri, Parser parser) throws IOException {
        return DataUtil.load(in, charsetName, baseUri, parser);
    }

    public static Document parseBodyFragment(String bodyHtml, String baseUri) {
        return Parser.parseBodyFragment(bodyHtml, baseUri);
    }

    public static Document parseBodyFragment(String bodyHtml) {
        return Parser.parseBodyFragment(bodyHtml, "");
    }

    public static Document parse(URL url, int timeoutMillis) throws IOException {
        Connection con = HttpConnection.connect(url);
        con.timeout(timeoutMillis);
        return con.get();
    }

    public static String clean(String bodyHtml, String baseUri, Safelist safelist) {
        Document dirty = parseBodyFragment(bodyHtml, baseUri);
        Cleaner cleaner = new Cleaner(safelist);
        Document clean = cleaner.clean(dirty);
        return clean.body().html();
    }

    /** @deprecated */
    @Deprecated
    public static String clean(String bodyHtml, String baseUri, Whitelist safelist) {
        return clean(bodyHtml, baseUri, (Safelist)safelist);
    }

    public static String clean(String bodyHtml, Safelist safelist) {
        return clean(bodyHtml, "", safelist);
    }

    /** @deprecated */
    @Deprecated
    public static String clean(String bodyHtml, Whitelist safelist) {
        return clean(bodyHtml, (Safelist)safelist);
    }

    public static String clean(String bodyHtml, String baseUri, Safelist safelist, Document.OutputSettings outputSettings) {
        Document dirty = parseBodyFragment(bodyHtml, baseUri);
        Cleaner cleaner = new Cleaner(safelist);
        Document clean = cleaner.clean(dirty);
        clean.outputSettings(outputSettings);
        return clean.body().html();
    }

    /** @deprecated */
    @Deprecated
    public static String clean(String bodyHtml, String baseUri, Whitelist safelist, Document.OutputSettings outputSettings) {
        return clean(bodyHtml, baseUri, (Safelist)safelist, outputSettings);
    }

    public static boolean isValid(String bodyHtml, Safelist safelist) {
        return (new Cleaner(safelist)).isValidBodyHtml(bodyHtml);
    }

    /** @deprecated */
    @Deprecated
    public static boolean isValid(String bodyHtml, Whitelist safelist) {
        return isValid(bodyHtml, (Safelist)safelist);
    }
}
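The `clean` and `isValid` entry points listed above can be exercised as in the sketch below. It assumes the built-in `Safelist.basic()` profile is an acceptable policy, which is an assumption made for this example and not a recommendation from the article; the class `CleanExample` and its method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Hypothetical example class; not part of jsoup.
class CleanExample {

    /** Removes every tag and attribute that Safelist.basic() does not allow. */
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    /** Reports whether the input would pass through the cleaner without anything being discarded. */
    static boolean isSafe(String untrustedHtml) {
        return Jsoup.isValid(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String dirty = "<p>Hi<script>alert(1)</script> "
                + "<a href=\"http://example.com/\" onclick=\"steal()\">link</a></p>";
        System.out.println(sanitize(dirty)); // the script element and the onclick attribute are gone
        System.out.println(isSafe(dirty));
        System.out.println(isSafe("<b>ok</b>"));
    }
}
```

`Jsoup.clean` parses the input as a body fragment, so it is meant for snippets of user-supplied markup, not for whole pages.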
Node
Key API
- Methods for finding elements and traversing the DOM tree
  - Find an element by id: getElementById(String id)
  - Find elements by tag: getElementsByTag(String tag)
  - Find elements by class: getElementsByClass(String className)
  - Find elements by attribute: getElementsByAttribute(String key)
  - Sibling traversal: siblingElements(), firstElementSibling(), lastElementSibling(); nextElementSibling(), previousElementSibling()
  - Traversal between levels: parent(), children(), child(int index)
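These methods return an Element or an Elements collection. The following methods read properties from an element:
- attr(String key): returns the value of the given attribute
- attributes(): returns all attributes of the node
- id(): returns the node's id
- className(): returns the raw value of the class attribute, which may hold several space-separated names
- classNames(): returns the individual class names as a set
- text(): returns the combined text of the node and all of its descendants
- html(): returns the node's inner HTML
- outerHtml(): returns the node's outer HTML
- data(): returns the node's data content, used for script or style tags and the like
- tag(): returns the Tag object
- tagName(): returns the node's tag name
With these APIs, working with the DOM is about as convenient as it is with jQuery.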
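A short sketch of the lookup and accessor methods listed above. The class `TraversalExample`, the helper `mainDiv`, and the sample markup are hypothetical and exist only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical example class; not part of jsoup.
class TraversalExample {

    // Sample markup invented for this example.
    static final String HTML =
            "<div id=\"main\" class=\"box wide\"><p>Hello <b>world</b></p><a href=\"/x\">go</a></div>";

    /** Parses the sample markup and looks up the div by its id. */
    static Element mainDiv() {
        return Jsoup.parse(HTML).getElementById("main");
    }

    public static void main(String[] args) {
        Element div = mainDiv();
        System.out.println(div.id());          // value of the id attribute
        System.out.println(div.className());   // raw class attribute: both names in one string
        System.out.println(div.classNames());  // the same names as separate set entries

        Element p = div.child(0);              // first child element
        System.out.println(p.text());          // text of <p> including its <b> descendant
        System.out.println(p.html());          // inner HTML of <p>
        System.out.println(p.outerHtml());     // <p> element including its own tags

        Element link = p.nextElementSibling(); // the <a> that follows the <p>
        System.out.println(link.tagName() + " -> " + link.attr("href"));
        System.out.println(link.parent() == div);
    }
}
```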
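- Jsoup can also modify the DOM tree:
  - text(String value): sets the text content, replacing any existing children
  - html(String value): replaces the inner HTML
  - append(String html): parses the HTML and adds it as the last children of the element (inside it, not after it)
  - prepend(String html): parses the HTML and adds it as the first children of the element
  - appendText(String text), prependText(String text): add a text node at the end or start of the element's children
  - appendElement(String tagName), prependElement(String tagName): create a new element and add it as the last or first child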
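The modification methods above can be tried out with a sketch like the following. `ModifyExample` and its methods are hypothetical names; pretty-printing is switched off in `build` only so that the resulting markup is easy to compare. The example also calls `after(String)`, inherited from Node, to contrast adding a sibling with adding a child.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class; not part of jsoup.
class ModifyExample {

    /** Applies several edits to one element and returns the resulting body HTML. */
    static String build() {
        Document doc = Jsoup.parse("<div id=\"box\">middle</div>");
        doc.outputSettings().prettyPrint(false); // keep the output on one line

        Element box = doc.getElementById("box");
        box.append("<span>tail</span>");   // becomes the last child of the div
        box.prepend("<em>head</em>");      // becomes the first child of the div
        box.appendText("!");               // text node after the span, still inside the div
        box.after("<p>sibling</p>");       // a sibling placed after the div, not a child
        return doc.body().html();
    }

    /** Shows that text(String) escapes markup instead of parsing it. */
    static String escapedText() {
        Element div = Jsoup.parseBodyFragment("<div></div>").body().child(0);
        div.text("<b>bold</b>");
        return div.html();
    }

    public static void main(String[] args) {
        System.out.println(build());
        System.out.println(escapedText());
    }
}
```

`text(String)` treats its argument as plain text, whereas `html(String)` and `append(String)` parse it as markup, so the former is the safer choice for untrusted strings.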
Source code
package org.jsoup.nodes;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import javax.annotation.Nullable;
import org.jsoup.SerializationException;
import org.jsoup.helper.Validate;
import org.jsoup.internal.StringUtil;
import org.jsoup.select.NodeFilter;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public abstract class Node implements Cloneable {
    static final List<Node> EmptyNodes = Collections.emptyList();
    static final String EmptyString = "";
    @Nullable
    Node parentNode;
    int siblingIndex;

    protected Node() {
    }

    public abstract String nodeName();

    protected abstract boolean hasAttributes();

    public boolean hasParent() {
        return this.parentNode != null;
    }

    public String attr(String attributeKey) {...}

    public abstract Attributes attributes();

    public int attributesSize() {
        return this.hasAttributes() ? this.attributes().size() : 0;
    }

    public Node attr(String attributeKey, String attributeValue) {
        attributeKey = NodeUtils.parser(this).settings().normalizeAttribute(attributeKey);
        this.attributes().putIgnoreCase(attributeKey, attributeValue);
        return this;
    }

    public boolean hasAttr(String attributeKey) {
        Validate.notNull(attributeKey);
        if (!this.hasAttributes()) {
            return false;
        } else {
            if (attributeKey.startsWith("abs:")) {
                String key = attributeKey.substring("abs:".length());
                if (this.attributes().hasKeyIgnoreCase(key) && !this.absUrl(key).isEmpty()) {
                    return true;
                }
            }
            return this.attributes().hasKeyIgnoreCase(attributeKey);
        }
    }

    public Node removeAttr(String attributeKey) {
        Validate.notNull(attributeKey);
        if (this.hasAttributes()) {
            this.attributes().removeIgnoreCase(attributeKey);
        }
        return this;
    }

    public Node clearAttributes() {
        if (this.hasAttributes()) {
            Iterator<Attribute> it = this.attributes().iterator();
            while(it.hasNext()) {
                it.next();
                it.remove();
            }
        }
        return this;
    }

    public abstract String baseUri();

    protected abstract void doSetBaseUri(String var1);

    public void setBaseUri(String baseUri) {
        Validate.notNull(baseUri);
        this.doSetBaseUri(baseUri);
    }

    public String absUrl(String attributeKey) {
        Validate.notEmpty(attributeKey);
        return this.hasAttributes() && this.attributes().hasKeyIgnoreCase(attributeKey) ? StringUtil.resolve(this.baseUri(), this.attributes().getIgnoreCase(attributeKey)) : "";
    }

    protected abstract List<Node> ensureChildNodes();

    public Node childNode(int index) {
        return (Node)this.ensureChildNodes().get(index);
    }

    public List<Node> childNodes() {
        if (this.childNodeSize() == 0) {
            return EmptyNodes;
        } else {
            List<Node> children = this.ensureChildNodes();
            List<Node> rewrap = new ArrayList(children.size());
            rewrap.addAll(children);
            return Collections.unmodifiableList(rewrap);
        }
    }

    public List<Node> childNodesCopy() {
        List<Node> nodes = this.ensureChildNodes();
        ArrayList<Node> children = new ArrayList(nodes.size());
        Iterator var3 = nodes.iterator();
        while(var3.hasNext()) {
            Node node = (Node)var3.next();
            children.add(node.clone());
        }
        return children;
    }

    public abstract int childNodeSize();

    protected Node[] childNodesAsArray() {
        return (Node[])this.ensureChildNodes().toArray(new Node[0]);
    }

    public abstract Node empty();

    @Nullable
    public Node parent() {
        return this.parentNode;
    }

    @Nullable
    public final Node parentNode() {
        return this.parentNode;
    }

    public Node root() {
        Node node;
        for(node = this; node.parentNode != null; node = node.parentNode) {
        }
        return node;
    }

    @Nullable
    public Document ownerDocument() {
        Node root = this.root();
        return root instanceof Document ? (Document)root : null;
    }

    public void remove() {
        Validate.notNull(this.parentNode);
        this.parentNode.removeChild(this);
    }

    public Node before(String html) {
        this.addSiblingHtml(this.siblingIndex, html);
        return this;
    }

    public Node before(Node node) {
        Validate.notNull(node);
        Validate.notNull(this.parentNode);
        this.parentNode.addChildren(this.siblingIndex, node);
        return this;
    }

    public Node after(String html) {
        this.addSiblingHtml(this.siblingIndex + 1, html);
        return this;
    }

    public Node after(Node node) {
        Validate.notNull(node);
        Validate.notNull(this.parentNode);
        this.parentNode.addChildren(this.siblingIndex + 1, node);
        return this;
    }

    private void addSiblingHtml(int index, String html) {
        Validate.notNull(html);
        Validate.notNull(this.parentNode);
        Element context = this.parent() instanceof Element ? (Element)this.parent() : null;
        List<Node> nodes = NodeUtils.parser(this).parseFragmentInput(html, context, this.baseUri());
        this.parentNode.addChildren(index, (Node[])nodes.toArray(new Node[0]));
    }

    public Node wrap(String html) {
        Validate.notEmpty(html);
        Element context = this.parentNode != null && this.parentNode instanceof Element ? (Element)this.parentNode : (this instanceof Element ? (Element)this : null);
        List<Node> wrapChildren = NodeUtils.parser(this).parseFragmentInput(html, context, this.baseUri());
        Node wrapNode = (Node)wrapChildren.get(0);
        if (!(wrapNode instanceof Element)) {
            return this;
        } else {
            Element wrap = (Element)wrapNode;
            Element deepest = this.getDeepChild(wrap);
            if (this.parentNode != null) {
                this.parentNode.replaceChild(this, wrap);
            }
            deepest.addChildren(new Node[]{this});
            if (wrapChildren.size() > 0) {
                for(int i = 0; i < wrapChildren.size(); ++i) {
                    Node remainder = (Node)wrapChildren.get(i);
                    if (wrap != remainder) {
                        if (remainder.parentNode != null) {
                            remainder.parentNode.removeChild(remainder);
                        }
                        wrap.after(remainder);
                    }
                }
            }
            return this;
        }
    }

    @Nullable
    public Node unwrap() {
        Validate.notNull(this.parentNode);
        List<Node> childNodes = this.ensureChildNodes();
        Node firstChild = childNodes.size() > 0 ? (Node)childNodes.get(0) : null;
        this.parentNode.addChildren(this.siblingIndex, this.childNodesAsArray());
        this.remove();
        return firstChild;
    }

    private Element getDeepChild(Element el) {
        List<Element> children = el.children();
        return children.size() > 0 ? this.getDeepChild((Element)children.get(0)) : el;
    }

    void nodelistChanged() {
    }

    public void replaceWith(Node in) {
        Validate.notNull(in);
        Validate.notNull(this.parentNode);
        this.parentNode.replaceChild(this, in);
    }

    protected void setParentNode(Node parentNode) {
        Validate.notNull(parentNode);
        if (this.parentNode != null) {
            this.parentNode.removeChild(this);
        }
        this.parentNode = parentNode;
    }

    protected void replaceChild(Node out, Node in) {
        Validate.isTrue(out.parentNode == this);
        Validate.notNull(in);
        if (in.parentNode != null) {
            in.parentNode.removeChild(in);
        }
        int index = out.siblingIndex;
        this.ensureChildNodes().set(index, in);
        in.parentNode = this;
        in.setSiblingIndex(index);
        out.parentNode = null;
    }

    protected void removeChild(Node out) {
        Validate.isTrue(out.parentNode == this);
        int index = out.siblingIndex;
        this.ensureChildNodes().remove(index);
        this.reindexChildren(index);
        out.parentNode = null;
    }

    protected void addChildren(Node... children) {
        List<Node> nodes = this.ensureChildNodes();
        Node[] var3 = children;
        int var4 = children.length;
        for(int var5 = 0; var5 < var4; ++var5) {
            Node child = var3[var5];
            this.reparentChild(child);
            nodes.add(child);
            child.setSiblingIndex(nodes.size() - 1);
        }
    }

    protected void addChildren(int index, Node... children) {...}

    protected void reparentChild(Node child) {
        child.setParentNode(this);
    }

    private void reindexChildren(int start) {
        if (this.childNodeSize() != 0) {
            List<Node> childNodes = this.ensureChildNodes();
            for(int i = start; i < childNodes.size(); ++i) {
                ((Node)childNodes.get(i)).setSiblingIndex(i);
            }
        }
    }

    public List<Node> siblingNodes() {
        if (this.parentNode == null) {
            return Collections.emptyList();
        } else {
            List<Node> nodes = this.parentNode.ensureChildNodes();
            List<Node> siblings = new ArrayList(nodes.size() - 1);
            Iterator var3 = nodes.iterator();
            while(var3.hasNext()) {
                Node node = (Node)var3.next();
                if (node != this) {
                    siblings.add(node);
                }
            }
            return siblings;
        }
    }

    @Nullable
    public Node nextSibling() {
        if (this.parentNode == null) {
            return null;
        } else {
            List<Node> siblings = this.parentNode.ensureChildNodes();
            int index = this.siblingIndex + 1;
            return siblings.size() > index ? (Node)siblings.get(index) : null;
        }
    }

    @Nullable
    public Node previousSibling() {
        if (this.parentNode == null) {
            return null;
        } else {
            return this.siblingIndex > 0 ? (Node)this.parentNode.ensureChildNodes().get(this.siblingIndex - 1) : null;
        }
    }

    public int siblingIndex() {
        return this.siblingIndex;
    }

    protected void setSiblingIndex(int siblingIndex) {
        this.siblingIndex = siblingIndex;
    }

    public Node traverse(NodeVisitor nodeVisitor) {
        Validate.notNull(nodeVisitor);
        NodeTraversor.traverse(nodeVisitor, this);
        return this;
    }

    public Node filter(NodeFilter nodeFilter) {
        Validate.notNull(nodeFilter);
        NodeTraversor.filter(nodeFilter, this);
        return this;
    }

    public String outerHtml() {
        StringBuilder accum = StringUtil.borrowBuilder();
        this.outerHtml(accum);
        return StringUtil.releaseBuilder(accum);
    }

    protected void outerHtml(Appendable accum) {
        NodeTraversor.traverse(new OuterHtmlVisitor(accum, NodeUtils.outputSettings(this)), this);
    }

    abstract void outerHtmlHead(Appendable var1, int var2, Document.OutputSettings var3) throws IOException;

    abstract void outerHtmlTail(Appendable var1, int var2, Document.OutputSettings var3) throws IOException;

    public <T extends Appendable> T html(T appendable) {
        this.outerHtml(appendable);
        return appendable;
    }

    public String toString() {
        return this.outerHtml();
    }

    protected void indent(Appendable accum, int depth, Document.OutputSettings out) throws IOException {
        accum.append('\n').append(StringUtil.padding(depth * out.indentAmount()));
    }

    public boolean equals(@Nullable Object o) {
        return this == o;
    }

    public int hashCode() {
        return super.hashCode();
    }

    public boolean hasSameValue(@Nullable Object o) {
        if (this == o) {
            return true;
        } else {
            return o != null && this.getClass() == o.getClass() ? this.outerHtml().equals(((Node)o).outerHtml()) : false;
        }
    }

    public Node clone() {...}
    ...
org.jsoup.nodes.Element extends Node
org.jsoup.nodes.Document extends Element
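Two of the Node methods shown in the source above, `absUrl` and `traverse`, are sketched below. The class `NodeExample`, its method names, the sample markup, and the base URI `http://example.com/docs/` are hypothetical. The visitor implements both `head` and `tail` explicitly, on the assumption that both must be provided in the 1.14.3 version of `NodeVisitor`.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeVisitor;

// Hypothetical example class; not part of jsoup.
class NodeExample {

    // Sample markup and base URI invented for this example.
    static final String HTML = "<p>Hi <a href=\"page.html\">x</a></p>";
    static final String BASE_URI = "http://example.com/docs/";

    /** Resolves the relative href of the first link against the document's base URI. */
    static String resolvedHref() {
        Document doc = Jsoup.parse(HTML, BASE_URI);
        return doc.select("a").first().absUrl("href");
    }

    /** Records "depth:nodeName" for every node under body, in depth-first order. */
    static List<String> visitOrder() {
        Document doc = Jsoup.parse(HTML, BASE_URI);
        final List<String> seen = new ArrayList<>();
        doc.body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                seen.add(depth + ":" + node.nodeName()); // called when a node is first entered
            }

            @Override
            public void tail(Node node, int depth) {
                // called after all descendants were visited; nothing to do here
            }
        });
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(resolvedHref());
        System.out.println(visitOrder());
    }
}
```

Text nodes are visited too and report the node name `#text`, which is why traversal works at the Node level and not only on Element.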
Use cases
CASE: Parse an HTML document => obtain a Document object
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String html = "<html><head><title>First parse</title></head><body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
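CASE: Parse an HTML fragment => obtain a Document object
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// The <div> is deliberately left unclosed; the parser closes it.
String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();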
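To see what the fragment parser does with the unclosed `div` from the case above, the following sketch prints the repaired body. `FragmentExample` and `bodyHtml` are hypothetical names, and pretty-printing is disabled only to keep the output on one line.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical example class; not part of jsoup.
class FragmentExample {

    /** Parses a body fragment and returns the repaired markup found inside body. */
    static String bodyHtml(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        doc.outputSettings().prettyPrint(false); // no added line breaks or indentation
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(bodyHtml("<div><p>Lorem ipsum.</p>")); // <div><p>Lorem ipsum.</p></div>
        System.out.println(bodyHtml("<p>One<p>Two"));             // <p>One</p><p>Two</p>
    }
}
```

The returned Document still contains a full `html`, `head`, and `body` shell, which is why the fragment is read back through `doc.body()`.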
CASE: Load a URL => obtain a Document object
org.jsoup.Connection connection = Jsoup.connect("http://example.com/");
Document doc = connection.get(); // HTTP method = GET; throws IOException on failure
String title = doc.title();
Request data, a user agent, cookies, and a timeout can also be set (much as in a Python crawler):
Document doc = Jsoup.connect("http://example.com")
    .data("query", "Java")
    .userAgent("Mozilla")
    .cookie("auth", "token")
    .timeout(3000)
    .post(); // HTTP method = POST
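The builder calls above only configure the request; nothing is sent until `get()`, `post()`, or `execute()` is called. The sketch below builds the same request and inspects it through `Connection.request()` without touching the network. `ConnectionExample` and `prepare` are hypothetical names.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

// Hypothetical example class; not part of jsoup.
class ConnectionExample {

    /** Configures a POST request but does not execute it. */
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .data("query", "Java")      // form parameter
                .userAgent("Mozilla")       // sets the User-Agent header
                .cookie("auth", "token")    // request cookie
                .timeout(3000)              // milliseconds
                .method(Connection.Method.POST);
    }

    public static void main(String[] args) {
        Connection.Request req = prepare("http://example.com/").request();
        System.out.println(req.method());
        System.out.println(req.timeout());
        System.out.println(req.header("User-Agent"));
        System.out.println(req.cookie("auth"));
        // Calling prepare(...).execute() would perform the HTTP request and may throw IOException.
    }
}
```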
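CASE: Parse a local HTML file => obtain a Document object
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

// ENCODE is not defined in the original snippet; "UTF-8" is assumed here.
private static final String ENCODE = "UTF-8";

/** Extracts the text content of a file; returns "" if the file cannot be read. */
public static String openFile(String szFileName) {
    // try-with-resources closes the reader even when readLine() throws
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(new File(szFileName)), ENCODE))) {
        StringBuilder content = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            content.append(line).append("\n");
        }
        return content.toString();
    } catch (IOException e) {
        return "";
    }
}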
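The base URI passed to `Jsoup.parse(File, String, String)` matters once relative links have to be resolved. The following self-contained sketch writes a temporary file, parses it, and reads the links back as absolute URLs. `FileParseExample`, `absoluteLinks`, `demo`, and the sample markup are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class; not part of jsoup.
class FileParseExample {

    /** Parses a UTF-8 HTML file and returns every link resolved against baseUri. */
    static List<String> absoluteLinks(File file, String baseUri) throws IOException {
        Document doc = Jsoup.parse(file, "UTF-8", baseUri);
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            links.add(a.attr("abs:href")); // the "abs:" prefix resolves against the base URI
        }
        return links;
    }

    /** Writes a small temporary file, parses it, and cleans up afterwards. */
    static List<String> demo() {
        try {
            Path tmp = Files.createTempFile("jsoup-demo", ".html");
            try {
                Files.write(tmp, "<a href=\"/about\">About</a>".getBytes(StandardCharsets.UTF_8));
                return absoluteLinks(tmp.toFile(), "http://example.com/");
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```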