A Rite of Passage: Compiling Markdown

A drawn picture of a mountain lake with a lakeside house during sunset

Intro

Odds are, if you're a web developer like me, you've tried to build a developer blog. And if you have, you've probably experienced the delight that is Markdown parsing and compilation. But it's never just Markdown parsing; it's everything else surrounding it: extracting front matter, highlighting code blocks, dealing with tables, generating a table of contents, handling embeds, and more. An entire ecosystem of packages has sprung up to deal with it, each with its own requirements and limitations. Parsing all that is not easy, even for experienced devs.

Tanner Linsley decrying the state of Markdown packages

Shouldn't you be using Portable Text?

Shh. That's even harder to set up outside of the Sanity ecosystem.

Performance Considerations

As you can imagine, all that parsing is not particularly fast. And because it's not that fast, we have to consider where and when to do it. I've seen a variety of solutions. Remix Discord user Kilman processes Markdown once and stores the result in a file. Amos processes it and stores it in a SQLite database. Kent C. Dodds processes it and then has the browser cache the result by setting Cache-Control headers. All of these approaches suggest we're concerned about the time this takes, so let's put some numbers on that concern.
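To make the "process once, reuse the result" idea concrete, here's a minimal sketch of its simplest form: an in-memory cache wrapped around any Markdown-to-HTML function. The names `withCache` and `fakeCompile` are made up for illustration, and this is not how any of the setups mentioned above are actually implemented.

```typescript
// A compiler is anything that turns a Markdown string into an HTML string.
type Compile = (markdown: string) => string;

// Hypothetical helper: wraps a compiler so identical input is only compiled once.
function withCache(compile: Compile): Compile {
  const cache = new Map<string, string>();
  return (markdown) => {
    const hit = cache.get(markdown);
    if (hit !== undefined) return hit; // cache hit: skip the expensive work
    const html = compile(markdown);
    cache.set(markdown, html);
    return html;
  };
}

// Stand-in compiler so the sketch runs without any dependencies.
let compileCalls = 0;
const fakeCompile: Compile = (markdown) => {
  compileCalls += 1;
  return `<p>${markdown}</p>`;
};

const cachedCompile = withCache(fakeCompile);
cachedCompile("hello");
cachedCompile("hello");
console.log(compileCalls); // 1
```

Keying the cache on the full Markdown string keeps the sketch short. A real blog would more likely key on the post slug plus a modification time or a content hash, and persist the result to a file or database so it survives restarts.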

The Contenders

I spent some time scouring the JS ecosystem for the most common packages used for the task. I believe I've found the most common ones, but if I haven't, please ping me on Twitter.

unified, remark, and rehype-highlight ecosystem
- Arguably one of the most popular ecosystems for parsing Markdown into HTML in JS, it features a huge number of packages and plugins designed to handle every possible option. It's an ecosystem more than a single package, so to get where we need to go we'll need a lot of different packages.
marked and highlight.js
- marked claims it is "Built for Speed", and while it doesn't have as many features as the above, it can do everything we need with a lot fewer packages. It's also pure JS, so it is very portable.

The Test

For this test, I've chosen to compile a blog post from the prolific blog fasterthanlime! If you haven't heard of Amos, he writes very, very long-form content about Rust, Go, static typing, DevOps, and so much more. Definitely worth a read if you have a spare hour or two. He's graciously provided one of his posts, The Curse of Strong Typing, in Markdown format, and agreed to make the file available to everyone. Coming in at a whopping 20,395 words and 7,645 lines, with 378 code blocks, you'd be hard-pressed to find a better challenge for a Markdown compiler outside of a book.
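If you want to size up your own posts the same way, a rough sketch like the one below will do. The function name `markdownStats` is hypothetical, and the counting rules (whitespace-separated words, fence lines divided by two) are assumptions, so the numbers may not match however the figures above were counted.

```typescript
interface MarkdownStats {
  lines: number;
  words: number;
  codeBlocks: number;
}

// Hypothetical helper: rough size metrics for a Markdown document.
function markdownStats(markdown: string): MarkdownStats {
  const lines = markdown.split("\n");
  // Words are simply whitespace-separated chunks, including Markdown syntax.
  const words = markdown.split(/\s+/).filter((word) => word.length > 0).length;
  // A fenced block has an opening and a closing fence line (backticks or tildes).
  const fenceLines = lines.filter((line) => /^\s*(`{3,}|~{3,})/.test(line)).length;
  return { lines: lines.length, words, codeBlocks: Math.floor(fenceLines / 2) };
}

// Build the fence programmatically so this sample stays easy to embed.
const fence = "`".repeat(3);
const sample = [
  "# Title",
  "",
  "Some text here.",
  fence + "rust",
  "fn main() {}",
  fence,
].join("\n");

console.log(markdownStats(sample)); // { lines: 6, words: 10, codeBlocks: 1 }
```

Indented code blocks and inline code are not counted here; only fenced blocks are.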

Test Environment

Because I've been using Remix a lot lately, and because I just rewrote my blog with it, I'm going to test this using Remix. I'll measure how long each option takes to return HTML with syntax-highlighted code blocks. I'll try to keep the output equivalent for each option, so no HTML sanitization, table of contents generation, front matter parsing, or anything else. This is running locally on my M1 13" MacBook Pro, which should give best-case performance, but we're interested in the relative differences anyway. I'll run each test independently ten times, and then average the results.
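The "run it ten times and average" part can be sketched as a small helper. This is a simplification: the numbers in this post come from separate page loads rather than an in-process loop, and `benchmark` and `standIn` are names invented for this example.

```typescript
interface BenchResult {
  samples: number[]; // one wall-clock duration per run, in milliseconds
  mean: number; // arithmetic mean of the samples
}

// Hypothetical helper: times `task` `runs` times and averages the durations.
async function benchmark(task: () => unknown, runs = 10): Promise<BenchResult> {
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await task(); // awaiting works for both sync and async tasks
    samples.push(performance.now() - start);
  }
  const mean = samples.reduce((sum, ms) => sum + ms, 0) / samples.length;
  return { samples, mean };
}

// Stand-in workload; a real test would call a Markdown compiler here instead.
const standIn = () => "x".repeat(100_000).split("").reverse().join("");

benchmark(standIn).then(({ samples, mean }) => {
  console.log(`${samples.length} runs, mean ${mean.toFixed(2)}ms`);
});
```

Because the helper awaits the task, the same function can time a synchronous compiler and an async one without changes.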

For those unfamiliar with Remix and the loader paradigm, loaders run on the server when the page loads. They return JSON or other data in an HTTP response, which the client then renders.

Below is the loader from my post page, which renders the Markdown content.

TypeScript React code

import { marked } from "marked";
import hljs from 'highlight.js';
import rust from 'highlight.js/lib/languages/rust';
import codeStyles from 'highlight.js/styles/github.css';
import { markdownToHtml } from "~/models/markdown.server";
...

export const loader: LoaderFunction = async ({ params, request }) => {
    invariant(params.slug, `params.slug is required`);
    const post = await getPost(params.slug);
    invariant(post, `Post not found: ${params.slug}`);

    // Register the languages highlight.js should recognize
    hljs.registerLanguage('rust', rust);

    // Configure marked to highlight fenced code blocks with highlight.js
    marked.setOptions({
        renderer: new marked.Renderer(),
        highlight: function (code, lang) {
            // Fall back to plain text for languages highlight.js doesn't know
            const language = hljs.getLanguage(lang) ? lang : 'plaintext';
            return hljs.highlight(code, { language }).value;
        },
        langPrefix: 'hljs language-', // highlight.js css expects a top-level 'hljs' class.
        pedantic: false,
        gfm: true,
    });

    // Time marked + highlight.js, keeping the output for the response
    const markedStart = performance.now();
    const html = marked.parse(post.markdown);
    const markedEnd = performance.now();
    console.log(`Marked Time: ${markedEnd - markedStart}ms`);

    // Time the unified/remark/rehype pipeline
    const remarkStart = performance.now();
    await markdownToHtml(post.markdown);
    const remarkEnd = performance.now();
    console.log(`Remark Time: ${remarkEnd - remarkStart}ms`);

    return json<LoaderData>({ admin, post, html });
};

Wait a minute, where's the Remark code?

Well, as it turns out Remark's ecosystem is a lot more involved, so I extracted it to its own file. Here it is, shamelessly cribbed from the venerable Kent C. Dodds.

TypeScript React code

async function markdownToHtml(markdownString: string) {
    // Load the unified pipeline and its plugins via dynamic import
    const { unified } = await import('unified')
    const { default: markdown } = await import('remark-parse')
    const { default: remark2rehype } = await import('remark-rehype')
    const { default: rehypeHighlight } = await import('rehype-highlight')
    const { default: rehypeStringify } = await import('rehype-stringify')

    // Parse Markdown, convert to an HTML tree, highlight code blocks, then serialize
    const result = await unified()
        .use(markdown)
        .use(remark2rehype)
        .use(rehypeHighlight, { ignoreMissing: true, aliases: { 'none': 'text' } })
        .use(rehypeStringify)
        .process(markdownString)

    return result.value.toString()
}

export {
    markdownToHtml,
}

If everybody is caching the output or storing it in a DB, why do we care about any of this?

That's not very helpful, but maybe you'd like to render a split page post editor with a live preview. Or you care about reducing the environmental impact of your blog posts. Or maybe you just want bragging rights. Anyway, on to the results!

Results

Performance graph showing marked and remark, with marked handily winning

Clearly, the winner here is marked, handily beating the remark ecosystem in every test. I guess it's truly built for speed after all.

Aren't you missing something? Don't you know a way this could be faster?

No, not really. These are the current best options in JS. What could you mean?

Maybe you're feeling a little bit... crabby? 🦀

Alright fine, let's see if I can do this faster in Rust!

Rust Contenders

As it turns out, Rust has several good crates for Markdown-to-HTML compilation, the most popular being pulldown-cmark and comrak. It also has a popular syntax highlighter called syntect, which uses Sublime Text syntax definitions. But I did not find any npm packages that use these, so there's not really a way to compare them...

Giving up so easily?

Ok fine. I'll just make my own. Let's grab pulldown-cmark and syntect, compile them into an npm package using napi-rs and wasm-bindgen, and see how they shake out. I shall call the package femark.

Performance graph of femark-napi, femark, remark, and marked options

In this graph, the femark package is compiled to WASM, and femark-napi is compiled as a native Rust module. And would you look at that, both options handily beat remark. That's progress! But I'm disappointed that the native Rust version is roughly on par with marked, and femark loses to marked every time. Isn't Rust supposed to be faster than JS?

You've fallen victim to one of the classic blunders!

A land war in Asia?

No, silly. What do you think Node.js uses to do Regular Expressions?

It's written in C, isn't it?

Yep

highlight.js uses regexes, just like syntect, but those regex calls are just a thin wrapper around the native regex engine inside V8, the JavaScript engine Node runs on. Since Rust's native performance is roughly equivalent to C, the two options are roughly equivalent in speed. And WASM incurs various performance penalties, plus the overhead of copying data in and out, so it won't be faster either. Well played, Node/Browser devs.
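To see why the JS side holds up so well, here's a toy regex-driven highlighter. It is only an illustration, not how highlight.js is implemented (its real grammars are far more elaborate), and the token set and `hl-` class names are made up. The point is that the JS loop only stitches strings together, while the matching itself happens inside the engine's native regex code.

```typescript
// One regex with named groups; each alternative is a token kind.
const TOKEN =
  /(?<comment>\/\/[^\n]*)|(?<string>"(?:[^"\\]|\\.)*")|(?<keyword>\b(?:fn|let|mut|struct|impl|return)\b)|(?<number>\b\d+\b)/g;

function escapeHtml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical toy highlighter: wraps each matched token in a classed span.
function highlight(code: string): string {
  let out = "";
  let last = 0;
  for (const match of code.matchAll(TOKEN)) {
    const index = match.index ?? 0;
    // Copy the untouched text between the previous token and this one.
    out += escapeHtml(code.slice(last, index));
    // The first named group with a value tells us which token kind matched.
    const kind =
      Object.entries(match.groups ?? {}).find(([, value]) => value !== undefined)?.[0] ?? "text";
    out += `<span class="hl-${kind}">${escapeHtml(match[0])}</span>`;
    last = index + match[0].length;
  }
  return out + escapeHtml(code.slice(last));
}

console.log(highlight("let x = 1;"));
// <span class="hl-keyword">let</span> x = <span class="hl-number">1</span>;
```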

The End?

At this point, feeling a bit miffed, I reached out to Amos and asked him how he parses and highlights his Markdown. He mentioned that he much preferred using tree-sitter over syntect, because it "uses actual parsers, not regex soup". tree-sitter has worked wonders in my Neovim environment, but I hadn't heard about anyone using it on the web before. I also couldn't find any performance comparisons between regex-based and parser-based implementations. I suspect that it might be faster, so let's check it out.

Run time comparison between the packages. femark-ts handily beats all the options

There we go! femark-ts is the tree-sitter version compiled with napi-rs, and it handily trounces the syntect version and marked by about 3x, and the remark one by about 20x!
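The "about 3x" and "about 20x" figures fall out of the averages listed at the end of this post. Here's the arithmetic as a short sketch, with the averages rounded to two decimals; `speedupOver` is a name invented for this example.

```typescript
// Average run times in ms, rounded to two decimals from the averages table.
const averagesMs: Record<string, number> = {
  marked: 55.35,
  "femark-napi": 48.91,
  "femark-ts": 14.94,
  remark: 329.07,
  femark: 144.91,
};

// How many times longer `baseline` takes than `candidate`.
function speedupOver(baseline: string, candidate: string): number {
  return averagesMs[baseline] / averagesMs[candidate];
}

for (const name of ["marked", "femark-napi", "femark", "remark"]) {
  console.log(`femark-ts is ${speedupOver(name, "femark-ts").toFixed(1)}x faster than ${name}`);
}
// femark-ts is 3.7x faster than marked
// femark-ts is 3.3x faster than femark-napi
// femark-ts is 9.7x faster than femark
// femark-ts is 22.0x faster than remark
```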

Conclusion

This whole experience is a perfect lesson that just because you rewrite an npm package in Rust, it does not automatically make it faster. One needs to analyze what the JS version does, and whether it just calls out to C. A good design, written in a fast language, will be faster than a good design in JS, unless that JS has help.

If you're interested in quickly compiling your Markdown to HTML and syntax highlighting it, I've published the fastest version of the package, previously referred to as femark-ts, to npm as femark. Not only is it blazingly fast, it also uses classes instead of style tags for brevity and customization. Check it out, PRs and comments welcome!

Thanks

A big thanks to Amos for providing the markdown of his post, and guidance. This post wouldn't be possible without the hard work of developers in the Rust lang, tree-sitter, pulldown-cmark, syntect, napi-rs, and wasm-bindgen projects. And others, many many others.

Average Run Times (ms)

marked	femark-napi	femark-ts	remark	femark
55.3464833855629	48.9104747891426	14.9385375857353	329.070762491226	144.905074894428

Raw Data (ms)

marked	femark-napi	femark-ts	remark	femark
52.5864169597626	47.04791700840000	13.743250012397800	192.7639158964160	202.89374995231600
66.54324996471410	52.11045789718630	22.594791889190700	408.9021250009540	155.1495840549470
63.25125002861020	47.306707978248600	14.036875009536700	286.07187509536700	132.7103749513630
51.76112496852880	47.231040954589800	13.419041991233800	346.07716703414900	133.82504105567900
51.40050005912780	48.94004201889040	14.00616705417630	363.82941591739700	139.92316699028000
40.90349996089940	46.65362501144410	13.58062493801120	301.24004209041600	135.77149999141700
46.893749952316300	49.40858292579650	14.68291699886320	319.53983294963800	134.85695803165400
61.449041962623600	53.4950829744339	14.871457934379600	364.0340839624410	137.96595799922900
50.269083976745600	48.28879106044770	14.132166981697100	341.6937919855120	138.78633296489700
68.40691602230070	48.62250006198880	14.318083047866800	366.55537497997300	137.1680829524990
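The mean is sensitive to the occasional slow run, such as 202.89 ms for femark and 22.59 ms for femark-ts. As a sanity check, here's a small sketch that computes both the mean and the median per package from the raw data, rounded to two decimals. The helper names are invented for this example.

```typescript
// Raw timings in ms, rounded to two decimals from the raw data table.
const rawMs: Record<string, number[]> = {
  marked: [52.59, 66.54, 63.25, 51.76, 51.4, 40.9, 46.89, 61.45, 50.27, 68.41],
  "femark-napi": [47.05, 52.11, 47.31, 47.23, 48.94, 46.65, 49.41, 53.5, 48.29, 48.62],
  "femark-ts": [13.74, 22.59, 14.04, 13.42, 14.01, 13.58, 14.68, 14.87, 14.13, 14.32],
  remark: [192.76, 408.9, 286.07, 346.08, 363.83, 301.24, 319.54, 364.03, 341.69, 366.56],
  femark: [202.89, 155.15, 132.71, 133.83, 139.92, 135.77, 134.86, 137.97, 138.79, 137.17],
};

function mean(values: number[]): number {
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b); // copy so the input stays untouched
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

for (const [name, runs] of Object.entries(rawMs)) {
  console.log(`${name}: mean ${mean(runs).toFixed(2)}ms, median ${median(runs).toFixed(2)}ms`);
}
```

The ranking between packages comes out the same with either statistic; the median just dampens the effect of a single slow run.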