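1 Overview: Markdown
The origins of Markdown
- What is Markdown? Why Markdown was created
Markdown is a plain text format for writing structured documents, based on conventions for indicating formatting in email and usenet posts.
It was developed by John Gruber (with help from Aaron Swartz) and released in 2004 in the form of a syntax description and a Perl script (Markdown.pl) for converting Markdown to HTML.
Over the next decade, dozens of implementations were developed in many languages. Some extended the original Markdown syntax with conventions for footnotes, tables, and other document elements.
Some allowed Markdown documents to be rendered in formats other than HTML. Websites such as Reddit, StackOverflow, and GitHub had millions of people using Markdown. Markdown also started to be used beyond the web, to author books, articles, slide shows, letters, and lecture notes.
What distinguishes Markdown from many other lightweight markup syntaxes, which are often easier to write, is its readability.
As Gruber writes: the overriding design goal for Markdown's formatting syntax is to make it as readable as possible.
The idea is that a Markdown-formatted document should be publishable as-is, as plain text, without looking like it has been marked up with tags or formatting instructions. (http://daringfireball.net/projects/markdown/)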
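The Markdown specification
- The Markdown specification (CommonMark Spec)
- https://spec.commonmark.org/0.28/
- Why is a Markdown specification needed?
John Gruber's canonical description of Markdown's syntax does not specify the syntax unambiguously. Here are some examples of questions it does not answer:
How much indentation is needed for a sublist? The spec says that continuation paragraphs need to be indented four spaces, but it is not fully explicit about sublists. It is natural to think that they, too, must be indented four spaces, but Markdown.pl does not require that. This is hardly a "corner case," and divergences between implementations on this issue often lead to surprises for users in real documents. (See this comment by John Gruber.)
Is a blank line needed before a block quote or heading? Most implementations do not require the blank line. However, this can lead to unexpected results in hard-wrapped text, and also to ambiguities in parsing (note that some implementations put the heading inside the block quote, while others do not). (John Gruber has also spoken in favor of requiring the blank lines.)
Is a blank line needed before an indented code block? (Markdown.pl requires it, but this is not mentioned in the documentation, and some implementations do not require it.)
What is the exact rule for determining when list items get wrapped in <p> tags? Can a list be partially "loose" and partially "tight"? What should we do with a list like this?
...
- Open-source components that follow this specification
- commonmark-java
- flexmark-java
- ...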
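To make the "loose" versus "tight" question concrete, here is a minimal sketch in plain Java. It is not taken from CommonMark or from any of the libraries above. The class name `ListTightness` is made up for illustration, and it assumes a flat list whose items are single lines starting with `- `. Under that assumption it applies one part of the CommonMark rule: a list is loose when a blank line separates two of its items. Nested lists, continuation paragraphs, and other bullet markers are out of scope.

```java
/** Toy classifier for a flat "- item" list (hypothetical helper, not a library API). */
final class ListTightness {

    /**
     * Returns true when a blank line separates two list items.
     * Assumes every item is a single line that starts with "- ".
     */
    static boolean isLoose(String markdown) {
        boolean seenItem = false;        // at least one item has been read
        boolean blankSinceItem = false;  // a blank line followed the last item
        for (String line : markdown.split("\n")) {
            if (line.trim().isEmpty()) {
                if (seenItem) {
                    blankSinceItem = true;
                }
            } else if (line.startsWith("- ")) {
                if (seenItem && blankSinceItem) {
                    return true;         // blank line between two items
                }
                seenItem = true;
                blankSinceItem = false;
            }
            // Any other line is ignored in this simplified sketch.
        }
        return false;                    // trailing blank lines do not make a list loose
    }
}
```

A conforming renderer wraps the items of a loose list in <p> tags and leaves the items of a tight list unwrapped, which is why implementations need to agree on this classification.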
2 Markdown to HTML
Dependencies
- [Java] commonmark-java (recommended)
- Why recommended: the community is consistently active and the component is relatively mature
- Tagline
Java library for parsing and rendering Markdown text according to the CommonMark specification (and some extensions).
- URL: https://github.com/commonmark/commonmark-java/
- Dependency coordinates
<!-- commonmark | https://github.com/commonmark/commonmark-java -->
<dependency>
    <groupId>com.atlassian.commonmark</groupId>
    <artifactId>commonmark</artifactId>
    <!-- 0.9.0 -->
    <version>${commonmark.version}</version>
</dependency>
- [Java] flexmark (not recommended; not actually used in the source code examples)
- https://github.com/vsch/flexmark-java
- https://spec.commonmark.org/0.28/
- Tagline
CommonMark/Markdown Java parser with source level AST. CommonMark 0.28, emulation of: pegdown, kramdown, markdown.pl, MultiMarkdown. With HTML to MD, MD to PDF, MD to DOCX conversion modules.
flexmark-java is a Java implementation of the CommonMark (spec 0.28) parser using the blocks-first, inlines-after Markdown parsing architecture.
- Dependency coordinates
It depends internally on jsoup and other components.
<!-- Flexmark | https://github.com/vsch/flexmark-java -->
<dependency>
    <groupId>com.vladsch.flexmark</groupId>
    <artifactId>flexmark-all</artifactId>
    <!-- 0.62.2 -->
    <version>${flexmark.version}</version>
</dependency>
The flexmark-all artifact bundles, among others, the following dependency:
<dependency>
    <groupId>com.vladsch.flexmark</groupId>
    <artifactId>flexmark-html2md-converter</artifactId>
    <version>${flexmark.version}</version>
</dependency>
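The "blocks first, inlines after" architecture mentioned for flexmark-java can be illustrated with a deliberately tiny toy in plain Java. This is not flexmark or commonmark code. The class `TwoPhaseToyParser` and its methods are hypothetical, it only knows paragraphs and `*emphasis*`, and it does not escape HTML. The point is the order of work: the block structure is settled first, and inline markup is parsed inside each block afterwards.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy two-phase Markdown renderer: paragraphs first, then inline emphasis. */
final class TwoPhaseToyParser {

    /** Phase 1: split the input into paragraph blocks at blank lines. */
    static List<String> parseBlocks(String markdown) {
        List<String> blocks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : markdown.split("\n")) {
            if (line.trim().isEmpty()) {
                flush(current, blocks);          // a blank line closes the open paragraph
            } else {
                if (current.length() > 0) {
                    current.append('\n');
                }
                current.append(line);
            }
        }
        flush(current, blocks);
        return blocks;
    }

    private static void flush(StringBuilder current, List<String> blocks) {
        if (current.length() > 0) {
            blocks.add(current.toString());
            current.setLength(0);
        }
    }

    /** Phase 2: inline pass over one block; *text* becomes <em>text</em>. */
    static String parseInlines(String block) {
        return block.replaceAll("\\*([^*\\n]+)\\*", "<em>$1</em>");
    }

    /** Runs both phases and wraps each block in a paragraph tag. */
    static String render(String markdown) {
        StringBuilder html = new StringBuilder();
        for (String block : parseBlocks(markdown)) {
            html.append("<p>").append(parseInlines(block)).append("</p>\n");
        }
        return html.toString();
    }
}
```

Calling `TwoPhaseToyParser.render("This is *Sparta*")` yields `<p>This is <em>Sparta</em></p>` followed by a newline. Real parsers handle many more block and inline types, but they keep the same separation between the two phases.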
Source code examples
- Core idea
- Based on commonmark-java or flexmark-java
Custom HTML styling: HtmlNodeCustomStyleAttributeProvider
package xx.xx;

import org.commonmark.node.Node;
import org.commonmark.renderer.html.AttributeProvider;

import java.util.Map;

/**
 * Attribute provider that customizes the tag attributes of rendered HTML nodes.
 */
public class HtmlNodeCustomStyleAttributeProvider implements AttributeProvider {
    /** Sample node; any rendered node of the same class receives the extra attributes. */
    private final Node targetNode;
    /** Stored for reference only; setAttributes does not filter on it. */
    private final String targetTagName;
    private final Map<String, String> targetAttributes;

    public HtmlNodeCustomStyleAttributeProvider(Node targetNode, String targetTagName, Map<String, String> targetAttributes) {
        this.targetNode = targetNode;
        this.targetTagName = targetTagName;
        this.targetAttributes = targetAttributes;
    }

    /**
     * Sets the attributes.
     * @param node
     *        the node to set attributes for
     *        e.g. org.commonmark.node.Image
     * @param tagName
     *        the HTML tag name that these attributes are for (e.g. {@code h1}, {@code pre}, {@code code}).
     * @param attributes
     *        the attributes, with any default attributes already set in the map
     *        e.g. attributes.put("class", "border");
     *
     * Usage:
     *   Node document = parser.parse("![text](/url.png)");
     *   renderer.render(document);
     *   // "<p><img src=\"/url.png\" alt=\"text\" class=\"border\" /></p>\n"
     */
    @Override
    public void setAttributes(Node node, String tagName, Map<String, String> attributes) {
        // Roughly equivalent to: node instanceof Image
        if (targetNode.getClass().isInstance(node)) {
            // attributes.put("class", "border");
            attributes.putAll(this.targetAttributes);
        }
    }
}
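The match in setAttributes relies on `Class.isInstance`, which also accepts subclasses of the sample node's class. The sketch below shows that behavior in isolation. The nested `Node`, `Image`, and `Link` classes are hypothetical stand-ins defined here so the example runs without commonmark-java. They are not the real org.commonmark.node types.

```java
/** Demonstrates the runtime-class match used by the attribute provider. */
final class NodeMatchDemo {

    // Hypothetical stand-ins for org.commonmark.node types.
    static class Node {}
    static class Image extends Node {}
    static class Link extends Node {}

    /** Same check as in setAttributes: is candidate an instance of target's runtime class? */
    static boolean matches(Node target, Node candidate) {
        return target.getClass().isInstance(candidate);
    }

    public static void main(String[] args) {
        System.out.println(matches(new Image(), new Image())); // true: same class
        System.out.println(matches(new Image(), new Link()));  // false: unrelated classes
        System.out.println(matches(new Node(), new Image()));  // true: Image is a subclass of Node
    }
}
```

One consequence is that a sample node of a broad type matches every node of its subtypes. Passing a `Class` object instead of a sample instance would express the same intent more directly, though that is a design suggestion rather than what the class above does.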
Markdown-to-HTML utility: MarkdownConverter
package xx.mdtohtml;

import org.commonmark.node.Node;
import org.commonmark.parser.Parser;
import org.commonmark.renderer.html.AttributeProvider;
import org.commonmark.renderer.html.AttributeProviderContext;
import org.commonmark.renderer.html.AttributeProviderFactory;
import org.commonmark.renderer.html.HtmlRenderer;
//import com.vladsch.flexmark.util.ast.Node;
//import com.vladsch.flexmark.html.HtmlRenderer;
//import com.vladsch.flexmark.html2md.converter.FlexmarkHtmlConverter;

import xx.xx.HtmlNodeCustomStyleAttributeProvider;

public class MarkdownConverter {
    private final Parser parser = Parser.builder().build();
    private final HtmlNodeCustomStyleAttributeProvider attributeProvider;
    private HtmlRenderer htmlRenderer;

    /**
     * Supports customizing the attributes of the output HTML (key class: AttributeProvider).
     * @param attributeProvider may be null, in which case the default renderer is used
     */
    public MarkdownConverter(HtmlNodeCustomStyleAttributeProvider attributeProvider) {
        // Store the provider before init() so the renderer can pick it up.
        this.attributeProvider = attributeProvider;
        init();
    }

    public MarkdownConverter() {
        this(null);
    }

    private void init() {
        // Initialize htmlRenderer
        if (this.attributeProvider == null) {
            htmlRenderer = HtmlRenderer.builder().build();
        } else {
            htmlRenderer = HtmlRenderer.builder()
                    .attributeProviderFactory(new AttributeProviderFactory() {
                        @Override
                        public AttributeProvider create(AttributeProviderContext context) {
                            // Customized HTML rendering
                            return attributeProvider; // new ImageAttributeProvider();
                        }
                    })
                    .build();
        }
    }

    /**
     * Markdown to HTML [based on commonmark-java]
     * @param markdownContent
     *        e.g. "This is *Sparta*"
     * @return
     *        e.g. "<p>This is <em>Sparta</em></p>\n"
     * @reference
     *   [1] commonmark-java, a Markdown parser for Java - oschina.net - https://www.oschina.net/p/commonmark-java
     */
    public String convertMarkdownToHtml(String markdownContent) {
        Node document = this.parser.parse(markdownContent); // "This is *Sparta*"
        HtmlRenderer renderer = this.htmlRenderer; // HtmlRenderer.builder().build();
        return renderer.render(document); // "<p>This is <em>Sparta</em></p>\n"
    }
}
Demo
MarkdownConverter converter = new MarkdownConverter();
String fileContent = "This is *Sparta*";
//String fileContent = reader.readHtml(markdownFilePath);
String renderedContent = converter.convertMarkdownToHtml(fileContent);
System.out.println(renderedContent);
Output
<p>This is <em>Sparta</em></p>
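The demo leaves reading the Markdown source from a file commented out, and the `reader` it mentions is not defined in this article. One possible way to fill that gap with the standard library alone is sketched below. `MarkdownFileDemo` and `convertFile` are hypothetical names, and the converter is passed in as a `Function<String, String>` so the sketch stays independent of commonmark-java.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Function;

/** Reads a Markdown file as UTF-8 and hands its text to a converter function. */
final class MarkdownFileDemo {

    /**
     * @param markdownFile path of the Markdown source file
     * @param toHtml       any Markdown-to-HTML function, supplied by the caller
     * @return whatever the converter function returns for the file's content
     */
    static String convertFile(Path markdownFile, Function<String, String> toHtml) throws IOException {
        String content = new String(Files.readAllBytes(markdownFile), StandardCharsets.UTF_8);
        return toHtml.apply(content);
    }
}
```

With the converter from this section, a call could look like `MarkdownFileDemo.convertFile(Paths.get("README.md"), converter::convertMarkdownToHtml)`, where the file name is only an example.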
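3 HTML to Markdown
Source code examples
- Core idea: use jsoup to parse the HTML document structure, and borrow from existing source code
- jsoup: an HTML parsing tool
See: Jsoup: HTML parsing tool - cnblogs / 千千寰宇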
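Before the full converter, the core of the approach can be shown without jsoup: each HTML tag name maps to a Markdown prefix or wrapper around the element's text. The sketch below is a simplified illustration under that assumption. `TagToMarkdown` is a hypothetical helper, it receives the tag name and text as plain strings, and it ignores nesting, attributes, and escaping, all of which the real converter has to handle while walking the parsed tree.

```java
import java.util.Locale;

/** Maps a single HTML tag name and its text to a Markdown fragment (toy version). */
final class TagToMarkdown {

    static String convert(String tagName, String text) {
        String tag = tagName.toLowerCase(Locale.ROOT);
        if (tag.matches("h[1-6]")) {
            int level = tag.charAt(1) - '0';     // "h3" -> 3
            StringBuilder prefix = new StringBuilder();
            for (int i = 0; i < level; i++) {
                prefix.append('#');
            }
            return prefix + " " + text;
        }
        switch (tag) {
            case "strong":
            case "b":
                return "**" + text + "**";
            case "em":
                return "*" + text + "*";
            case "code":
                return "`" + text + "`";
            case "hr":
                return "----";
            default:
                return text;                     // unknown tags fall back to their text
        }
    }
}
```

For example, `TagToMarkdown.convert("h2", "Title")` returns `## Title`, and `TagToMarkdown.convert("b", "x")` returns `**x**`.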
MarkdownConverter
package xx.htmltomd;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Entities;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Tag;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;

import java.util.ArrayList;
import java.util.List;

// MarkdownLine and its nested MDLineType enum are not listed in this article.
import xx.MarkdownLine;
import xx.MarkdownLine.*;

public class MarkdownConverter {
    // Shared static state: this converter is not thread-safe.
    private static int indentation = -1;
    private static boolean orderedList = false;

    public MarkdownConverter() {
    }

    private void init() {
    }

    /**
     * HTML to Markdown [based on jsoup, borrowing from jHTML2Md]
     * @reference-doc
     *   [1] https://github.com/nico2sh/jHTML2Md/blob/master/src/main/java/com/pnikosis/html2markdown/HTML2Md.java
     *   [2] java html 转换为 markdown - 51CTO - https://blog.51cto.com/u_16213398/11878678 [not recommended]
     * @param htmlContent the HTML source
     * @return the Markdown text
     */
    public String convertHtmlToMarkdown(String htmlContent) {
        Document document = Jsoup.parse(htmlContent);
        // Delegate to the Document-based conversion
        // return traverseNodes(document.body());
        return convertHtmlToMarkdown(document);
    }

    /**
     * Traverses the nodes (earlier attempt, kept for reference).
     * @param node
     * @return
     */
    /*
    private String traverseNodes(org.jsoup.nodes.Node node) {
        StringBuilder markdown = new StringBuilder();
        // Visit each child node
        for (org.jsoup.nodes.Node childNode : node.childNodes()) {
            if (childNode instanceof TextNode) {
                // Text node
                markdown.append(childNode.outerHtml()).append("\n");
            } else if (childNode instanceof Element) {
                markdown.append(handleElement((Element) childNode));
            }
            int childNodeSize = childNode.childNodeSize();
            if (childNodeSize > 0) {
                String childNodeMarkdownContent = traverseNodes(childNode);
                markdown.append(childNodeMarkdownContent);
            }
        }
        return markdown.toString();
    }

    private String handleElement(Element element) {
        StringBuilder result = new StringBuilder();
        switch (element.tagName().toLowerCase()) {
            case "h1": result.append("# ").append(element.ownText()).append("\n"); break;
            case "h2": result.append("## ").append(element.ownText()).append("\n"); break;
            case "h3": result.append("### ").append(element.ownText()).append("\n"); break;
            case "h4": result.append("#### ").append(element.ownText()).append("\n"); break;
            case "h5": result.append("##### ").append(element.ownText()).append("\n"); break;
            case "h6": result.append("###### ").append(element.ownText()).append("\n"); break;
            case "div": // todo
                result.append(element.ownText()).append("\n"); break;
            // Further tags (p, ul, ...) can be handled here
            case "p": // todo
                result.append(element.ownText()).append("\n"); break;
            default:
                result.append(element.outerHtml());
        }
        return result.toString();
    }
    */

    private String convertHtmlToMarkdown(Document htmlDirtyDoc) {
        indentation = -1;
        String title = htmlDirtyDoc.title();
        // Newer jsoup releases rename Whitelist to Safelist.
        Whitelist whitelist = Whitelist.relaxed();
        Cleaner cleaner = new Cleaner(whitelist);
        Document doc = cleaner.clean(htmlDirtyDoc);
        doc.outputSettings().escapeMode(Entities.EscapeMode.xhtml);
        if (!title.trim().equals("")) {
            return "# " + title + "\n\n" + getTextContent(doc);
        } else {
            return getTextContent(doc);
        }
    }

    private static String getTextContent(Element element) {
        ArrayList<MarkdownLine> lines = new ArrayList<MarkdownLine>();
        List<org.jsoup.nodes.Node> children = element.childNodes();
        for (org.jsoup.nodes.Node child : children) {
            if (child instanceof TextNode) {
                TextNode textNode = (TextNode) child;
                MarkdownLine line = getLastLine(lines);
                // Skip whitespace-only text at the start of a line
                if (!line.getContent().equals("") || !textNode.isBlank()) {
                    // Markdown escapes special characters with a backslash
                    line.append(textNode.text().replace("#", "\\#").replace("*", "\\*"));
                }
            } else if (child instanceof Element) {
                Element childElement = (Element) child;
                processElement(childElement, lines);
            }
            // Other node types (comments, etc.) are ignored
        }
        // Join the lines, collapsing runs of blank lines into a single blank line
        int blankLines = 0;
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).toString().trim();
            if (line.equals("")) {
                blankLines++;
            } else {
                blankLines = 0;
            }
            if (blankLines < 2) {
                result.append(line);
                if (i < lines.size() - 1) {
                    result.append("\n");
                }
            }
        }
        return result.toString();
    }

    private static void processElement(Element element, ArrayList<MarkdownLine> lines) {
        Tag tag = element.tag();
        String tagName = tag.getName();
        if (tagName.equals("div")) {
            div(element, lines);
        } else if (tagName.equals("p")) {
            p(element, lines);
        } else if (tagName.equals("br")) {
            br(lines);
        } else if (tagName.matches("^h[0-9]+$")) {
            h(element, lines);
        } else if (tagName.equals("strong") || tagName.equals("b")) {
            strong(element, lines);
        } else if (tagName.equals("em")) {
            em(element, lines);
        } else if (tagName.equals("hr")) {
            hr(lines);
        } else if (tagName.equals("a")) {
            a(element, lines);
        } else if (tagName.equals("img")) {
            img(element, lines);
        } else if (tagName.equals("code")) {
            code(element, lines);
        } else if (tagName.equals("ul")) {
            ul(element, lines);
        } else if (tagName.equals("ol")) {
            ol(element, lines);
        } else if (tagName.equals("li")) {
            li(element, lines);
        } else {
            MarkdownLine line = getLastLine(lines);
            line.append(getTextContent(element));
        }
    }

    private static MarkdownLine getLastLine(ArrayList<MarkdownLine> lines) {
        MarkdownLine line;
        if (lines.size() > 0) {
            line = lines.get(lines.size() - 1);
        } else {
            line = new MarkdownLine(MDLineType.None, 0, "");
            lines.add(line);
        }
        return line;
    }

    private static void div(Element element, ArrayList<MarkdownLine> lines) {
        MarkdownLine line = getLastLine(lines);
        String content = getTextContent(element);
        if (!content.equals("")) {
            if (!line.getContent().trim().equals("")) {
                lines.add(new MarkdownLine(MDLineType.None, 0, ""));
                lines.add(new MarkdownLine(MDLineType.None, 0, content));
                lines.add(new MarkdownLine(MDLineType.None, 0, ""));
            } else {
                if (!content.trim().equals(""))
                    line.append(content);
            }
        }
    }

    private static void p(Element element, ArrayList<MarkdownLine> lines) {
        MarkdownLine line = getLastLine(lines);
        if (!line.getContent().trim().equals(""))
            lines.add(new MarkdownLine(MDLineType.None, 0, ""));
        lines.add(new MarkdownLine(MDLineType.None, 0, ""));
        lines.add(new MarkdownLine(MDLineType.None, 0, getTextContent(element)));
        lines.add(new MarkdownLine(MDLineType.None, 0, ""));
        if (!line.getContent().trim().equals(""))
            lines.add(new MarkdownLine(MDLineType.None, 0, ""));
    }

    private static void br(ArrayList<MarkdownLine> lines) {
        MarkdownLine line = getLastLine(lines);
        if (!line.getContent().trim().equals(""))
            lines.add(new MarkdownLine(MDLineType.None, 0, ""));
    }

    private static void h(Element element, ArrayList<MarkdownLine> lines) {
        MarkdownLine line = getLastLine(lines);
        if (!line.getContent().trim().equals(""))
            lines.add(new MarkdownLine(MDLineType.None, 0, ""));
        int level = Integer.parseInt(element.tagName().substring(1));
        switch (level) {
            case 1: lines.add(new MarkdownLine(MDLineType.Head1, 0, getTextContent(element))); break;
            case 2: lines.add(new MarkdownLine(MDLineType.Head2, 0, getTextContent(element))); break;
            case 3: lines.add(new MarkdownLine(MDLineType.Head3, 0, getTextContent(element))); break;
            case 4: lines.add(new MarkdownLine(MDLineType.Head4, 0, getTextContent(element))); break;
            case 5: lines.add(new MarkdownLine(MDLineType.Head5, 0, getTextContent(element))); break;
            case 6: lines.add(new MarkdownLine(MDLineType.Head6, 0, getTextContent(element))); break;
            default: throw new RuntimeException("Not Support the tag: " + element.tagName());
        }
        lines.add(new MarkdownLine(MDLineType.None, 0, ""));
        lines.add(new MarkdownLine(MDLineType.None, 0, ""));
    }

    private static void strong(Element element, ArrayList<MarkdownLine> lines) {
        MarkdownLine line = getLastLine(lines);
        line.append("**");
        line.append(getTextContent(element));
        line.append("**");
    }

    private static void em(Element element, ArrayList<MarkdownLine> lines) {
        MarkdownLine line = getLastLine(lines);
        line.append("*");
        line.append(getTextContent(element));
        line.append("*");
    }

    private static void hr(ArrayList<MarkdownLine> lines) {
        lines.add(new MarkdownLine(MDLineType.None, 0, ""));
        lines.add(new MarkdownLine(MDLineType.HR, 0, ""));
        lines.add(new MarkdownLine(MDLineType.None, 0, ""));
    }

    private static void a(Element element, ArrayList<MarkdownLine> lines) {
        MarkdownLine line = getLastLine(lines);
        line.append("[");
        line.append(getTextContent(element));
        line.append("]");
        line.append("(");
        String url = element.attr("href");
        line.append(url);
        String title = element.attr("title");
        if (!title.equals("")) {
            line.append(" \"");
            line.append(title);
            line.append("\"");
        }
        line.append(")");
    }

    private static void img(Element element, ArrayList<MarkdownLine> lines) {
        MarkdownLine line = getLastLine(lines);
        line.append("![");
        String alt = element.attr("alt");
        line.append(alt);
        line.append("]");
        line.append("(");
        String url = element.attr("src");
        line.append(url);
        String title = element.attr("title");
        if (!title.equals("")) {
            line.append(" \"");
            line.append(title);
            line.append("\"");
        }
        line.append(")");
    }

    private static void code(Element element, ArrayList<MarkdownLine> lines) {
        // Treat the element as an inline snippet when its own text has no line break
        boolean isOneLineCodeSnippet = !(element.ownText().contains("\n") || element.ownText().contains("\r"));
        StringBuilder codeContent = new StringBuilder();
        codeContent.append(isOneLineCodeSnippet ? " `" : " ```code\n");
        codeContent.append(element.ownText());
        codeContent.append(isOneLineCodeSnippet ? "` " : "\n ``` ");
        lines.add(new MarkdownLine(MDLineType.None, 0, codeContent.toString()));
    }

    private static void ul(Element element, ArrayList<MarkdownLine> lines) {
        lines.add(new MarkdownLine(MDLineType.None, 0, ""));
        indentation++;
        orderedList = false;
        MarkdownLine line = new MarkdownLine(MDLineType.None, 0, "");
        line.append(getTextContent(element));
        lines.add(line);
        indentation--;
        lines.add(new MarkdownLine(MDLineType.None, 0, ""));
    }

    private static void ol(Element element, ArrayList<MarkdownLine> lines) {
        lines.add(new MarkdownLine(MDLineType.None, 0, ""));
        indentation++;
        orderedList = true;
        MarkdownLine line = new MarkdownLine(MDLineType.None, 0, "");
        line.append(getTextContent(element));
        lines.add(line);
        indentation--;
        lines.add(new MarkdownLine(MDLineType.None, 0, ""));
    }

    private static void li(Element element, ArrayList<MarkdownLine> lines) {
        MarkdownLine line;
        if (orderedList) {
            line = new MarkdownLine(MDLineType.Ordered, indentation, getTextContent(element));
        } else {
            line = new MarkdownLine(MDLineType.Unordered, indentation, getTextContent(element));
        }
        lines.add(line);
    }
}
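The converter above depends on a `MarkdownLine` class with a nested `MDLineType` enum, and neither is listed in this article. The following is a minimal sketch of what such a class might look like, inferred only from how the converter calls it: a `(type, level, content)` constructor, `append`, `getContent`, and `toString`. The four-space indent per level, the `* ` bullet, the `1. ` ordered marker, and the `----` rule are assumptions made for illustration, and the package declaration is omitted so the sketch compiles on its own.

```java
/** Hypothetical stand-in for the MarkdownLine class used by the converter. */
final class MarkdownLine {

    /** Line kinds referenced by the converter code. */
    enum MDLineType { None, Ordered, Unordered, Head1, Head2, Head3, Head4, Head5, Head6, HR }

    private final MDLineType type;
    private final int level;               // nesting depth, used for list indentation
    private final StringBuilder content;

    MarkdownLine(MDLineType type, int level, String content) {
        this.type = type;
        this.level = level;
        this.content = new StringBuilder(content);
    }

    void append(String text) {
        content.append(text);
    }

    String getContent() {
        return content.toString();
    }

    /** Heading level 1..6 for Head1..Head6, or 0 for every other type. */
    private static int headingLevel(MDLineType type) {
        String name = type.name();
        return name.startsWith("Head") ? name.charAt(4) - '0' : 0;
    }

    /** Renders the line: indentation, then the type-specific prefix, then the content. */
    @Override
    public String toString() {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < level; i++) {
            out.append("    ");            // assumed: four spaces per nesting level
        }
        int heading = headingLevel(type);
        if (heading > 0) {
            for (int i = 0; i < heading; i++) {
                out.append('#');
            }
            out.append(' ');
        } else if (type == MDLineType.Ordered) {
            out.append("1. ");             // Markdown renderers renumber ordered items
        } else if (type == MDLineType.Unordered) {
            out.append("* ");
        } else if (type == MDLineType.HR) {
            out.append("----");
        }
        return out.append(content).toString();
    }
}
```

With this sketch, `new MarkdownLine(MarkdownLine.MDLineType.Head2, 0, "Title").toString()` gives `## Title`. A negative level, which the converter can produce for an `li` outside any list, simply results in no indentation.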
X References
- CommonMark Spec
- https://spec.commonmark.org/0.28/
- commonmark-java
- https://github.com/commonmark/commonmark-java/
- flexmark-java
Includes sub-components such as flexmark-html2md-converter, ...
- https://github.com/vsch/flexmark-java
- JohannesKaufmann | html-to-markdown
Written in Go
- https://github.com/JohannesKaufmann/html-to-markdown
- jHTML2Md
This project is not recommended, but the approach taken in HTML2Md.java is worth borrowing
- https://github.com/nico2sh/jHTML2Md/blob/master/src/main/java/com/pnikosis/html2markdown/HTML2Md.java