#formula

#expression

#interpreter

#parser

#chevrotain

9 min read

How to Create Your Own Expression Interpreter

Marian Zagoruiko

Ever wondered how to write a simple interpreter? Or maybe wanted to create a simple macro language for your app?

In this article, I’ll show you how you can easily build a simple interpreter using JavaScript (or TypeScript) and the Chevrotain library.

I strongly believe that the best way to learn is by example, so we’ll build something real, more specifically, a simple mathematical expression evaluation engine, that’s going to be able to evaluate expressions like this: 10 + (20 - 30) * 40. All the source code is available on GitHub and I’d recommend checking it out and playing around with it: https://github.com/LeanyLabs/formula-engine-blog.

Anatomy of a Typical Interpreter

A typical interpreter usually consists of 3 parts:

Lexer
Parser
Visitor

Lexer is responsible for transforming raw text into a stream of tokens. A parser is transforming the stream of tokens into Concrete Syntax Tree (CST). CST is basically a source code represented as a tree data structure. A visitor is doing the actual evaluation/execution by recursively “visiting” each node in CST. We’ll take a closer look at each of them a bit later.

The following picture describes how expressions like 10 + (20 - 30) * 40 would be processed by the pipeline of lexer-parser-visitor.

Chevrotain

Writing custom lexers, parsers and visitors is somewhat tedious and time-consuming.

So, naturally, there’re many tools that either generate them from some kind of notation (usually Backus-Naur form, BNF). But working with generated code is not always convenient, so we’ll take another path.

There’s a library called Chevrotain, which is quite performant and very easy to use.

Chevrotain is just a DSL (Domain-specific language), which means that Chevrotain grammars are defined in the source code without any additional levels of abstraction (as opposed to parser generators). Because of that, you can debug your parser like any other JavaScript code.

Lexer

As I said previously, lexer (also known as tokenizer) is responsible for transforming raw text into a stream of tokens. A token is a single element in a programming language. Keywords, operators, literals (like strings and numbers), operators (like + and -) are all tokens. In Chevrotain, you can define tokens using regular expressions, which is very convenient. Also, the order in which tokens are defined matters because the lexer will try to match these regexes in order until it finds a match.

This is how our lexer is going to look like:

import { createToken, Lexer, TokenType } from "chevrotain";

enum TokenName {
  AdditionOperator = "AdditionOperator",
  Plus = "Plus",
  Minus = "Minus",
  MultiplicationOperator = "MultiplicationOperator",
  Mul = "Mul",
  Div = "Div",

  LParen = "LParen",
  RParen = "RParen",

  WhiteSpace = "WhiteSpace",
  Comma = "Comma",

  NumberLiteral = "NumberLiteral",
}

const AdditionOperator = createToken({
  name: TokenName.AdditionOperator,
  pattern: Lexer.NA,
});
const Plus = createToken({
  name: TokenName.Plus,
  pattern: /\+/,
  categories: AdditionOperator,
});
const Minus = createToken({
  name: TokenName.Minus,
  pattern: /-/,
  categories: AdditionOperator,
});

const MultiplicationOperator = createToken({
  name: TokenName.MultiplicationOperator,
  pattern: Lexer.NA,
});
const Mul = createToken({
  name: TokenName.Mul,
  pattern: /\*/,
  categories: MultiplicationOperator,
});
const Div = createToken({
  name: TokenName.Div,
  pattern: /\//,
  categories: MultiplicationOperator,
});

const LParen = createToken({
  name: TokenName.LParen,
  pattern: /\(/,
});
const RParen = createToken({
  name: TokenName.RParen,
  pattern: /\)/,
});

const WhiteSpace = createToken({
  name: "WhiteSpace",
  pattern: /\s+/,
  group: Lexer.SKIPPED,
});

const NumberLiteral = createToken({
  name: TokenName.NumberLiteral,
  pattern: /[0-9]+[.]?[0-9]*([eE][+\-][0-9]+)?/,
});

const tokensByPriority = [
  WhiteSpace,
  Plus,
  Minus,
  Mul,
  Div,
  LParen,
  RParen,
  NumberLiteral,
  AdditionOperator,
  MultiplicationOperator,
];

export const FormulaLexer = new Lexer(tokensByPriority, {
  ensureOptimizations: true,
});

export type TokenTypeDict = { [key in TokenName]: TokenType };
export const tokens: TokenTypeDict = tokensByPriority.reduce(
  (acc, tokenType) => {
    acc[tokenType.name] = tokenType;
    return acc;
  },
  {} as TokenTypeDict
);

It consists of the following parts:

token enum with all the tokens that we’ll need
actual tokens defined with regular expressions using createToken function
An array of tokens ordered by priority (the order is important here)
actual lexer instance that is exported as FormulaLexer
dictionary of tokens created mainly for convenience (export const tokens)

Now you’ll be able to tokenize input using the FormulaLexer.tokenize method.

Parser

Parsing is a little bit more complicated. Its main function is to build a syntax tree out of a stream of tokens. There are two types of syntax trees:

Abstract Syntax Tree (AST), also known as Syntax Tree
Concrete Syntax Tree (CST), also known as Parse Tree

Most of you probably heard about ASTs, but not CSTs. So, let me explain the difference if that’s the case.

CST (or Parse Tree) is a direct representation of the grammar in the form of a tree. It is usually much bigger than AST since it contains all grammatical details that aren’t usually relevant for semantics (like a semicolon, parens, etc).

AST on the other hand is a much shorter representation that only holds semantically relevant information and is more convenient to work with.

However, in our case, we will work directly with CST, as the language that we are going to build isn’t that complicated.

It is also helpful to familiarize yourself with BNF notation since parser essentially is a grammar that’s defined with chevrotain’s DSL.

Let’s take a look at an example of an expression parser:

import { CstParser } from "chevrotain";
import { tokens } from "./lexer";

export class FormulaParser extends CstParser {
  constructor() {
    super(tokens, {
      maxLookahead: 1,
    });
    this.performSelfAnalysis();
  }

  expression = this.RULE("expression", () => {
    this.SUBRULE(this.additionExpression);
  });

  additionExpression = this.RULE("additionExpression", () => {
    this.SUBRULE(this.multiplicationExpression, { LABEL: "lhs" });
    this.MANY(() => {
      this.CONSUME(tokens.AdditionOperator);
      this.SUBRULE1(this.multiplicationExpression, { LABEL: "rhs" });
    });
  });

  multiplicationExpression = this.RULE("multiplicationExpression", () => {
    this.SUBRULE(this.atomicExpression, { LABEL: "lhs" });
    this.MANY(() => {
      this.CONSUME(tokens.MultiplicationOperator);
      this.SUBRULE1(this.atomicExpression, { LABEL: "rhs" });
    });
  });

  atomicExpression = this.RULE("atomicExpression", () => {
    this.OR([
      { ALT: () => this.SUBRULE(this.parenthesisExpression) },
      { ALT: () => this.CONSUME(tokens.NumberLiteral) },
    ]);
  });

  parenthesisExpression = this.RULE("parenthesisExpression", () => {
    this.CONSUME(tokens.LParen);
    this.SUBRULE(this.expression);
    this.CONSUME(tokens.RParen);
  });
}

Parser written using Chervotain’s DSL looks very similar to BNF, in our case, it basically means the following:

Input is an expression.
Expression is an addition expression (we could use addition expression as a top-level, but I like it to be flexible).
Addition expression consists of 1 or more multiplication expressions.
Multiplication expression consists of 1 or more atomic expressions. Also, multiplication expression is described as a descendant of addition expression for a reason: deeper rules have higher priority, so multiplication expressions will always be evaluated before addition.
Atomic expression is either parenthesis expression of number literal.
Parenthesis expression consists of left parenthesis, expression (recursion here!), and right parenthesis. Recursion is very important here, it means that we can put any expression inside parens and it will be parsed in exactly the same way.

That’s it. Now you can use this parser to create CSTs, like this:

    this.parser = new FormulaParser();
    this.parser.reset();
    this.parser.input = lexingResult.tokens;
    const cst = this.parser.expression();

Also, there’s another useful tool if you want to explore syntax trees: https://ohmlang.github.io/editor/

Visitor

A visitor is a place where the actual execution is happening. It is called visitor because it implements a Visitor pattern. I’d recommend familiarizing yourself with it in case you aren’t because it will make things a lot more clear. In a nutshell, a visitor reduces the syntax tree to a single value by recursively visiting (calculating) each node of the tree. Chevrotain’s parser provides us with the base class with the implementation of visitor pattern, we get it by calling parser.getBaseCstVisitorConstructorWithDefaults(). Then, all we need to do is to inherit from it and add methods for visiting each node type. As you can see, addition and multiplication are almost the same, so it could be generalized, but I wanted to keep this code as simple and straightforward as possible.

import { CstNode, tokenMatcher } from "chevrotain";
import { tokens } from "./lexer";
import { FormulaParser } from "./parser";

export interface IVisitor {
  visit(cst: CstNode, state?: any): any;
}

export function createEvalVisitor(parser: FormulaParser): IVisitor {
  const FormulaVisitorBase = parser.getBaseCstVisitorConstructorWithDefaults();

  class InterpreterVisitor extends FormulaVisitorBase {
    constructor() {
      super();
      this.validateVisitor();
    }

    expression(ctx, state): any {
      return this.visit(ctx.additionExpression, state);
    }

    additionExpression(ctx, state): any {
      let result = this.visit(ctx.lhs, state);
      if (!ctx.rhs) return result;
      for (let i = 0; i < ctx.rhs.length; i++) {
        const operator = ctx.AdditionOperator[i];
        const value = this.visit(ctx.rhs[i], state);
        if (tokenMatcher(operator, tokens.Plus)) {
          result += value;
        } else if (tokenMatcher(operator, tokens.Minus)) {
          result -= value;
        } else {
          throw new Error(
            `Unknown operator: ${operator.image} at ${operator.startOffset}`
          );
        }
      }
      return result;
    }

    multiplicationExpression(ctx, state): any {
      let result = this.visit(ctx.lhs, state);
      if (!ctx.rhs) return result;
      for (let i = 0; i < ctx.rhs.length; i++) {
        const operator = ctx.MultiplicationOperator[i];
        const value = this.visit(ctx.rhs[i], state);
        if (tokenMatcher(operator, tokens.Mul)) {
          result *= value;
        } else if (tokenMatcher(operator, tokens.Div)) {
          result /= value;
        } else {
          throw new Error(
            `Unknown operator: ${operator.image} at ${operator.startOffset}`
          );
        }
      }
      return result;
    }

    atomicExpression(ctx, state): any {
      if (ctx.parenthesisExpression) {
        return this.visit(ctx.parenthesisExpression, state);
      }
      if (ctx.NumberLiteral) {
        return Number.parseFloat(ctx.NumberLiteral[0].image);
      }
    }

    parenthesisExpression(ctx, state): any {
      return this.visit(ctx.expression, state);
    }
  }

  return new InterpreterVisitor();
}

Engine

I like to have clean and easy-to-use interfaces, so in our example, I’ve created another entity called “engine” that is going to orchestrate work between lexer, parser, and visitor. It can be used like this:

    const engine = new FormulaEngine();
    const result = engine.exec(formula);

Implementation is pretty straightforward as it just creates instances of lexer, parser, and visitor and pipes input through this chain.

Summary

So, as you can see, it’s not that hard to create a simple expression parser if you pick the right tools (Chevrotain in our case). It would be way more complicated if you’d need to write all the components by yourself, especially for more complicated languages. It’s not to say that it isn’t fun, but if you just need to have things working, I’d recommend not reinventing the wheel. You can also build much more complicated things using this approach, but depending on the case, you may need to have two visitors: one to convert from CST to AST, and the other one (or even more) to process the AST and implement the actual semantics of the language. As an example of a more complicated expression parser, take a look at https://github.com/LeanyLabs/formula-engine. It supports 3 data types (boolean, string, and numbers), functions, and external variables.