Writing plugins for remark and gatsby-transformer-remark (part 2)

Welcome to part two of my four-part tutorial on writing plugins for remark and gatsby-transformer-remark. In part one, we created the well-tested functionality to fetch content from GitHub. In this part, we’ll look at Abstract Syntax Trees (ASTs) and explore some of the fun things we can do with them.

Markdown AST

Practically speaking, an AST represents a piece of source code written in a particular language as a tree structure in which each node represents a language feature (variable definition, function call etc). This tree structure is produced by a parser based on that language’s grammar. A language can have multiple different AST formats and, conversely, multiple languages can share the same AST format. Here are a few examples:

  • The transpiler Babel uses the babylon parser to parse JavaScript into a babylon AST.
  • The linter ESLint uses the espree parser to parse JS into an ESTree AST.
  • The minifier UglifyJS, interestingly enough, has its own parser and AST format. Is this the reason it is the fastest JS minifier out there?
  • Flow and TypeScript, which are supersets of JavaScript that add static type annotations, can be parsed into either an ESTree AST (by flow-parser and typescript-eslint-parser) or a Babylon AST (by the babylon parser). In addition, TypeScript has its own AST format.
  • postcss (Babel for CSS) parses CSS into its own AST format.
  • styled-components uses stylis to parse CSS with interpolated JS.
  • Markdown has lots of different parsers. One that is popular in the JavaScript ecosystem is remark, which we will work with in this tutorial. remark’s AST format is MDAST.

If you think the list above is weighted heavily towards front-end development languages, that's because my impression is that software developers working in other languages don’t manipulate ASTs as often as we front-end developers do. In the vast majority of cases, browsers can only consume programs written as plain-text source [1] rather than compiled byte code or machine code. As such, most, if not all, optimizations in front-end development must be done by transforming plain-text source code into more optimized plain-text source code [2] (which in turn requires manipulating ASTs) instead of making compilers produce more efficient byte code or machine code from an AST.

Exploring a simple AST

To get the hang of ASTs, I’ll visually show you a simple AST with a fabulous tool called AST Explorer. Paste some code (JavaScript, TypeScript, Markdown etc) into the left panel and it’ll parse that code and show you the corresponding AST in the right panel. You can click on any text in the left panel and the corresponding AST node will be highlighted in the right panel (and vice versa). Conveniently, because AST Explorer uses remark to parse Markdown, its output will help us keep a mental model while manipulating Markdown ASTs later.

Let’s look at the AST for a simple Markdown snippet:

simple Markdown

In the corresponding AST, we can see that the free-form text has been parsed into a tree structure that captures our intuitive understanding of how various Markdown formatting features translate to visual output. Here are a few examples:

We expect *italicized text* to be rendered as the italicized phrase “italicized text.” Indeed, the corresponding AST node’s type is emphasis and its content is a single text node with the value of italicized text:

{
  "type": "emphasis",
  "children": [
    {"type": "text", "value": "italicized text"}
  ],
},

You’ll see that most formatted texts have text nodes as their terminal children. text nodes are leaves in the tree, meaning they have no children.

We expect # Hello to be rendered as an h1 heading. Indeed, its corresponding AST node has a type of heading and a depth (or level) of 1, with the heading's text held in a child text node:

{
  "type": "heading",
  "depth": 1,
  "children": [
    {"type": "text", "value": "Hello"}
  ]
}

We expect the code snippet to be rendered as a JavaScript code block and, indeed, its AST node has a type of code and a lang (language) of javascript:

{
  "type": "code", "lang": "javascript",
  "value": "console.log('!');",
}

Feel free to play around more with this tool. If you want to dig deeper, the MDAST specification describes all the node types, and the relationships between them, that remark understands natively.
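As a rough sketch of how these nodes fit together, the node shapes discussed above can be modeled in TypeScript and walked recursively. The type names (TextNode, CodeNode, ParentNode) and the collectText helper are illustrative and cover only a simplified subset of MDAST; they are not part of remark itself.

```typescript
// Illustrative types only: a simplified subset of the node shapes shown above.
interface TextNode {
  type: 'text';
  value: string;
}
interface CodeNode {
  type: 'code';
  lang: string | null;
  value: string;
}
interface ParentNode {
  type: 'emphasis' | 'heading' | 'paragraph' | 'root';
  depth?: number; // only meaningful on heading nodes
  children: MdNode[];
}
type MdNode = TextNode | CodeNode | ParentNode;

// Recursively concatenate the values of all text leaves under a node.
const collectText = (node: MdNode): string => {
  if (node.type === 'text') {
    return node.value; // a leaf: no children to descend into
  }
  if (node.type === 'code') {
    return ''; // code nodes carry a value, but we don't count it as prose
  }
  return node.children.map(collectText).join('');
};

const heading: MdNode = {
  type: 'heading',
  depth: 1,
  children: [
    {type: 'text', value: 'Hello '},
    {type: 'emphasis', children: [{type: 'text', value: 'world'}]},
  ],
};

const headingText = collectText(heading); // 'Hello world'
```

The point of the sketch is only that formatting nodes nest while text nodes terminate the recursion, which is the same structure AST Explorer shows.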

Remark plugin

Let’s take a short detour and talk about the structure of a remark plugin. remark is really just the scaffolding on which plugins do their jobs. remark’s core handles conversion between plain text Markdown sources and ASTs while all AST manipulations are performed by plugins.

The top-level export of a remark plugin must be a function, called an attacher, that can accept configuration options for that plugin from the user. The attacher can perform some initialization based on these options and then return another function, called the transformer, which will perform all the heavy lifting. During program execution, the transformer will receive the Markdown AST and mutate it (e.g. add/remove nodes, change node types etc) to achieve the desired output. Although we’ll only examine how plugins can transform ASTs, they can also add new syntactic constructs to Markdown or new types of output (e.g. HTML).

For example, here is a bare-bones attacher:

/**
 * https://github.com/huy-nguyen/remark-github-plugin/blob/dcffd535/src/index.ts
 */
import {transform} from './transform';
const attacher = () => {
  return transform;
};

export default attacher;

and a no-op transformer:

/**
 * https://github.com/huy-nguyen/remark-github-plugin/blob/dcffd535/src/transform.ts
 */
export const transform = () => {

};
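To make the attacher/transformer split a little more concrete, here is a minimal sketch of an attacher that reads an option and returns a transformer that mutates the tree in place. The headingOffset option and the SketchNode type are hypothetical, made up for illustration, and the tree is walked by hand so the snippet has no dependencies.

```typescript
// Hypothetical, simplified node shape for this sketch.
interface SketchNode {
  type: string;
  depth?: number;
  children?: SketchNode[];
}

// Hypothetical plugin options.
interface SketchOptions {
  headingOffset: number;
}

// The attacher: receives the user's options once, can do any setup, and
// returns the transformer.
const sketchAttacher = ({headingOffset}: SketchOptions) => {
  // The transformer: receives the AST and mutates it in place.
  const transformer = (tree: SketchNode): void => {
    const walk = (node: SketchNode): void => {
      if (node.type === 'heading' && typeof node.depth === 'number') {
        // Demote the heading, but never past the deepest level (h6).
        node.depth = Math.min(6, node.depth + headingOffset);
      }
      (node.children || []).forEach(walk);
    };
    walk(tree);
  };
  return transformer;
};

const sketchTree: SketchNode = {
  type: 'root',
  children: [
    {type: 'heading', depth: 1, children: [{type: 'text'}]},
    {type: 'paragraph', children: [{type: 'text'}]},
  ],
};

// Configure the plugin, then run the resulting transformer on the tree.
sketchAttacher({headingOffset: 1})(sketchTree);
// The heading's depth is now 2; the paragraph is untouched.
```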

First AST manipulation

To warm up, let’s perform a simple transformation: replacing occurrences of the paragraph GITHUB-EMBED in the following input:

basic input

with the following short code snippet:

basic code snippet

in order to get this output:

basic output

Because ASTs can be a bit difficult to think about, a useful trick I usually employ when working with them is just to copy-paste the Markdown input and desired Markdown output into AST Explorer and compare them to determine a reasonably simple way to change the former into the latter. By doing this, I can see that I need to transform this AST node:

{
  "type": "paragraph",
  "children": [
    {
      "type": "text",
      "value": "GITHUB-EMBED",
    }
  ],
}

into this AST node:

{
  "type": "code",
  "lang": "js",
  "value": "const a = 1;",
}

Thus, our plan is to visit every node in the AST and, whenever we encounter a paragraph node whose content is the marker GITHUB-EMBED, change the type of that node to code, unset the children key and add two new keys: lang with the value js and value with the value const a = 1;. The following transformer accomplishes that goal:

/**
 * https://github.com/huy-nguyen/remark-github-plugin/blob/c187c72dded0b57179648776e9b887c5fbcbc5da/src/transform.ts
 */
import visit from 'unist-util-visit';

export interface IOptions {
  marker: string;
}
export const transform = ({marker}: IOptions) => (tree: any) => {
  const visitor = (node: any) => {
    const {children} = node;
    if (children.length >= 1 && children[0].value === marker) {
      node.type = 'code';
      node.children = undefined;
      node.lang = 'js';
      node.value = `const a = 1;`;
    }
  };

  visit(tree, 'paragraph', visitor);
};

Note that instead of hard-coding the embedding marker to be GITHUB-EMBED, we’ve made it a configurable option, marker. Instead of traversing the AST manually, we use the utility package unist-util-visit (from the unist ecosystem, on which remark’s MDAST format is built). Its main export (the visit function) takes three arguments: an AST to traverse, a condition (such as the node type paragraph in this case) and a callback to invoke for each node that matches that condition. This is the visitor design pattern in action. All AST-parsing libraries I’ve seen so far (remark, ESLint, Babel etc) provide utility packages to traverse the ASTs they produce, e.g. babel-traverse for Babel.
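To demystify what visit does, the traversal can be approximated by a short recursive function. This is a simplified stand-in written for illustration, not unist-util-visit's actual implementation (which supports more kinds of conditions and more control over the traversal); simpleVisit and VisitNode are hypothetical names.

```typescript
// Hypothetical, simplified node shape for this sketch.
interface VisitNode {
  type: string;
  value?: string;
  children?: VisitNode[];
}

// Depth-first, pre-order traversal: call `visitor` on every node whose type
// matches `type`, then descend into that node's children.
const simpleVisit = (
  tree: VisitNode,
  type: string,
  visitor: (node: VisitNode) => void,
): void => {
  if (tree.type === type) {
    visitor(tree);
  }
  // Children are read after the visitor runs, so a visitor that unsets
  // `children` (as our transformer does) also stops the descent.
  (tree.children || []).forEach(child => simpleVisit(child, type, visitor));
};

const visitTree: VisitNode = {
  type: 'root',
  children: [
    {type: 'paragraph', children: [{type: 'text', value: 'GITHUB-EMBED'}]},
    {type: 'heading', children: [{type: 'text', value: 'Title'}]},
    {type: 'paragraph', children: [{type: 'text', value: 'Other text'}]},
  ],
};

// Record the first child's text for every paragraph we visit.
const visitedValues: string[] = [];
simpleVisit(visitTree, 'paragraph', node => {
  const first = (node.children || [])[0];
  if (first !== undefined && first.value !== undefined) {
    visitedValues.push(first.value);
  }
});
// visitedValues is ['GITHUB-EMBED', 'Other text']; the heading was skipped.
```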

Side note on testing

One common way to test code transformers, such as the one we’re writing, is to store each pair of input and expected output as separate files within a directory (called a test fixture) and then programmatically generate tests from each directory. For example, to test the transformer we have created so far, we create the following simple-example directory inside __fixtures__:

src
├── __fixtures__
    ├── simple-example
        ├── input.md
        ├── expected.md
        ├── options.js

where input.md and expected.md are the Markdown input and expected Markdown output, respectively, taken from above. options.js contains the configuration for the transformer (in this case, setting marker to GITHUB-EMBED) and “simple example” is the name of this test fixture. The plumbing to convert these fixtures into tests is in the file src/__tests__/transform.js if you’re interested.

After this step, the repo should look like this. When you git checkout that commit and run npm run test, the tests should pass, indicating that the actual output of the plugin matches the expected output.

Feel free to play around with the input, expected output, options and transformer code. For example, can you try a new marker phrase or make the plugin transform the marker into a different code snippet while keeping the tests passing? If you set marker to be GITHUB_EMBED, what happens? What constraint does that put on possible values for marker?

Recognizing embedding markers

Now that we’ve gotten the hang of transforming ASTs and testing those transformations, let’s apply those skills to our actual use case. We want to target our transformation at URLs sandwiched between embedding markers of this form:

sample input

and replace them with the toy JavaScript snippet above (const a = 1;) while avoiding false positives. For example, in this sample input:

recognize markers test input

only the first paragraph containing GITHUB-EMBED should be replaced by the code snippet while the latter two should be left alone because one of them doesn’t contain a URL and the other contains only one marker:

recognize markers test output

We will again use AST Explorer for guidance. After pasting the sample input into AST Explorer, we can see the difference between our target:

{
  "type": "paragraph",
  "children": [
    {"type": "text", "value": "GITHUB-EMBED "},
    {
      "type": "link", "title": null, "url": "https://github.com/huy-nguyen/squarify/blob/d7074c2/.babelrc",
      "children": [
        {"type": "text", "value": "https://github.com/huy-nguyen/squarify/blob/d7074c2/.babelrc"}
      ],
    },
    {"type": "text", "value": " GITHUB-EMBED"}
  ],
},

and the two potential false positives:

{
  "type": "paragraph",
  "children": [
    {"type": "text", "value": "GITHUB-EMBED GITHUB-EMBED"}
  ],
}
{
  "type": "paragraph",
  "children": [
    {"type": "text", "value": "GITHUB-EMBED"}
  ],
},

From this exercise in compare-and-contrast, we can reasonably conclude that we need to transform paragraph nodes that have three children, of which:

  • The first is a text node whose value contains the embedding marker (GITHUB-EMBED).
  • The second is a link to the desired GitHub file.
  • The last is another text node whose value also contains the embedding marker (GITHUB-EMBED).

Based on the above three conditions, we can write a function checkNode to check whether a paragraph node is a candidate for transformation:

/**
 * https://github.com/huy-nguyen/remark-github-plugin/blob/0784899e/src/transform.ts
 */
// ...
type CheckResult = {
  isCandidate: true;
  link: string;
} | {
  isCandidate: false;
};

const checkNode = (embedMarker: string, node: any): CheckResult => {
  const {children} = node;
  const numChildren = children.length;
  if (numChildren < 3) {
    return {
      isCandidate: false,
    };
  } else {
    const firstChild = children[0];
    const firstChildContent = firstChild.value.trim();

    const lastChild = children[numChildren - 1];
    const lastChildContent = lastChild.value.trim();

    const [linkChild ] = children.slice(1, numChildren - 1);

    if (firstChild.type === 'text' &&
        firstChildContent === embedMarker &&
        lastChild.type === 'text' &&
        lastChildContent.includes(embedMarker) &&
        linkChild.type === 'link') {

      return {
        isCandidate: true,
        link: linkChild.url,
      };
    } else {
      return {
        isCandidate: false,
      };
    }

  }

};
// ...

and use this checker to guard against false positives in our transformer:

/**
 * https://github.com/huy-nguyen/remark-github-plugin/blob/0784899e/src/transform.ts
 */
// ...
export const transform = ({marker}: IOptions) => (tree: any) => {
  const visitor = (node: any) => {
    const checkResult = checkNode(marker, node);
    if (checkResult.isCandidate === true) {
    // ...
    }
  };

  visit(tree, 'paragraph', visitor);
};

After this step, the repo should look like this. The tests should pass, indicating that our detection works as expected.
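One edge case to be aware of: checkNode reads firstChild.value.trim() before confirming that the child is a text node, so a three-child paragraph that begins with, say, an emphasis node (which has no value) would throw instead of being rejected. A more defensive variant is sketched below. It is not taken from the plugin's repository: safeCheckNode, CandidateNode and isTextContaining are hypothetical names, and the sketch assumes exactly three children, as in the conditions listed above.

```typescript
// Hypothetical, simplified node shape for this sketch.
interface CandidateNode {
  type: string;
  value?: string;
  url?: string;
  children?: CandidateNode[];
}

type SafeCheckResult =
  | {isCandidate: true; link: string}
  | {isCandidate: false};

// True only for text nodes whose value mentions the marker. The type is
// checked before the value is touched.
const isTextContaining = (node: CandidateNode, marker: string): boolean =>
  node.type === 'text' &&
  typeof node.value === 'string' &&
  node.value.indexOf(marker) !== -1;

const safeCheckNode = (marker: string, node: CandidateNode): SafeCheckResult => {
  const children = node.children || [];
  if (children.length !== 3) {
    return {isCandidate: false};
  }
  const [first, link, last] = children;
  if (isTextContaining(first, marker) &&
      link.type === 'link' && typeof link.url === 'string' &&
      isTextContaining(last, marker)) {
    return {isCandidate: true, link: link.url};
  }
  return {isCandidate: false};
};

const embedUrl = 'https://github.com/huy-nguyen/squarify/blob/d7074c2/.babelrc';
const embedParagraph: CandidateNode = {
  type: 'paragraph',
  children: [
    {type: 'text', value: 'GITHUB-EMBED '},
    {type: 'link', url: embedUrl, children: [{type: 'text', value: embedUrl}]},
    {type: 'text', value: ' GITHUB-EMBED'},
  ],
};
// A paragraph like "*a* b *c*": three children, but the first has no value.
const emphasisParagraph: CandidateNode = {
  type: 'paragraph',
  children: [
    {type: 'emphasis', children: [{type: 'text', value: 'a'}]},
    {type: 'text', value: ' b '},
    {type: 'emphasis', children: [{type: 'text', value: 'c'}]},
  ],
};

const embedResult = safeCheckNode('GITHUB-EMBED', embedParagraph);
const emphasisResult = safeCheckNode('GITHUB-EMBED', emphasisParagraph);
// embedResult is a candidate carrying the URL; emphasisResult is rejected
// without throwing.
```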

Allow specifying language and line range

I think we all want syntax highlighting for our new embedded code blocks. Additionally, it would be nice to be able to embed only a subset of lines from a GitHub file. After some consideration, I decided to make it as simple as possible to specify the language for an embedded code block. The language name should come after the URL (but still stay within the two embedding markers) and be separated from the URL by whitespace, like this:

input with syntax highlighting

The user can additionally specify that only a subset of lines from the GitHub file should be embedded. For example, the following embedding will only insert line 1 and lines 3 through 5 into the output code block.

input with syntax highlighting and line range

I chose this numeric range notation because it’s used to specify which pages should be printed from the print dialog of many operating systems and software, thus making it immediately familiar to a large number of users. Additionally, there’s already an NPM package to parse this notation for us: parse-numeric-range.
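To illustrate how such a range could be applied, here is a hand-rolled sketch that expands a range string into 1-based line numbers and picks those lines out of a file's contents. It is a simplified stand-in written for this example (it handles only comma-separated numbers and ascending a-b ranges) and does not show parse-numeric-range's own API; expandRange and pickLines are hypothetical names.

```typescript
// Expand a string such as '1,3-5' into [1, 3, 4, 5].
const expandRange = (range: string): number[] => {
  const result: number[] = [];
  range.split(',').forEach(part => {
    // Either a single number ('1') or an ascending range ('3-5').
    const match = /^\s*(\d+)(?:\s*-\s*(\d+))?\s*$/.exec(part);
    if (match === null) {
      return; // this sketch silently skips anything it doesn't understand
    }
    const start = parseInt(match[1], 10);
    const end = match[2] === undefined ? start : parseInt(match[2], 10);
    for (let line = start; line <= end; line++) {
      result.push(line);
    }
  });
  return result;
};

// Keep only the requested (1-based) lines of `source`.
const pickLines = (source: string, range: string): string => {
  const lines = source.split('\n');
  return expandRange(range)
    .filter(lineNumber => lineNumber >= 1 && lineNumber <= lines.length)
    .map(lineNumber => lines[lineNumber - 1])
    .join('\n');
};

const sampleSource = ['one', 'two', 'three', 'four', 'five', 'six'].join('\n');
const picked = pickLines(sampleSource, '1,3-5');
// picked is 'one\nthree\nfour\nfive'
```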

As with the language name, I decided to let the line range simply follow the language name, separated by whitespace but still within the two embedding markers. This does raise a potential conflict: if only one whitespace-delimited “word” appears between the URL and the closing embedding marker, should that “word” be interpreted as a language name or a line range? After some more consideration, I decided that because a user is more likely to specify a language name than a line range, that ambiguous “word” should be interpreted as a language name.

We can now incorporate these new requirements into our test input:

comprehensive test input

and output:

comprehensive test output

Note that we include the expected line range inside the expected output code blocks (e.g. const range = '1,3-5') to visually demonstrate that if the tests pass, we have correctly extracted the line range from within the embedding markers.

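With the test in place, we can update the checkNode function to detect these extra use cases by counting the whitespace-delimited entities in the text that follows the link. Note that the closing embedding marker counts as one of those entities, which is why the code below compares the count against 3 and 2 rather than 2 and 1: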

/**
 * https://github.com/huy-nguyen/remark-github-plugin/blob/061fddea/src/transform.ts
 */
// ...
type CheckResult = {
  isCandidate: true;
  link: string;
  language: string | undefined;
  range: string | undefined;
} | {
  isCandidate: false;
};
// ...
const checkNode = (embedMarker: string, node: any): CheckResult => {
// ...
    if (firstChild.type === 'text' &&
        firstChildContent.includes(embedMarker) &&
        lastChild.type === 'text' &&
        lastChildContent.includes(embedMarker) &&
        linkChild.type === 'link') {

      // Ref https://stackoverflow.com/a/14912552/7075699
      const matched = lastChildContent.match(/\S+/g);
      let range: string | undefined, language: string | undefined;
      if (matched.length === 3) {
        // If there are 2 settings, the first is the language and the second the
        // range:
        language = matched[0];
        range = matched[1];
      } else if (matched.length === 2) {
        // If there's only one option provided, it's the language:
        language = matched[0];
        range = undefined;
      } else {
        range = undefined;
        language = undefined;
      }

      return {
        isCandidate: true,
        link: linkChild.url,
        range,
        language,
      };
    } else {
    // ...
};
// ...

Once a node passes the checkNode check, we need to set the lang property on the code block and insert the line range into the code block:

/**
 * https://github.com/huy-nguyen/remark-github-plugin/blob/061fddea/src/transform.ts
 */
// ...
export const transform = ({marker}: IOptions) => (tree: any) => {
// ...
    if (checkResult.isCandidate === true) {
      const {language, link, range} = checkResult;
      node.type = 'code';
      node.children = undefined;
      node.lang = (language === undefined) ? null : language;
      node.value = `const link = '${link}';\nconst range = '${range}';`;
    }
    // ...
};

After this step, the repo should look like this. Running npm run test should show all tests passing.

So far our tool is pretty rudimentary but has correctly performed the tasks we asked of it. In my experience with writing code transformers, it’s best to start simple and avoid overengineering, then slowly add more complex test cases later.

This is the end of part two of my tutorial. Click here for part three.


  1. The notable (and probably only) exception is WebAssembly byte code.

  2. For example, the transform-react-constant-elements Babel plugin “factors out” constant React elements to avoid calling React.createElement more than once for those elements. A more extreme example is the Prepack “compiler” by Facebook, which actually executes JavaScript source code to eliminate all computations that can be done at compile-time. For example, it can turn const a = 1; const b = 2; const c = a + b; into const c = 3.