# Tutorial - Lexer

# TLDR

Run and Debug the source code.

# Introduction

In this tutorial we will implement a Lexer for a simple SQL Select statement language:

SELECT column1 FROM table2
SELECT name, age FROM persons WHERE age > 100
...

A Lexer transforms a string input into a Token vector. Chevrotain has a built in Lexer engine based on Javascript Regular Expressions.

# Our First Token

To use the Chevrotain lexer the Tokens must first be defined. Lets examine the definition for a "FROM" Token:

const createToken = chevrotain.createToken
// using createToken API
const From = createToken({ name: "From", pattern: /FROM/ })

There is nothing much to it. We simply use the createToken API to define the tokens, and provide it with a name property and a pattern property which is a RegExp which will be used when splitting up the input string into separate Tokens.

# More complex Tokens

How can we define Tokens for Identifiers or Integers?

const Identifier = createToken({ name: "Identifier", pattern: /[a-zA-Z]\w*/ })

const Integer = createToken({ name: "Integer", pattern: /0|[1-9]\d*/ })

# Skipping Tokens

The obvious use case in this language (and many others) is whitespace. skipping certain Tokens is easily accomplished by marking them with the SKIP group.

const WhiteSpace = createToken({
  name: "WhiteSpace",
  pattern: /\s+/,
  group: chevrotain.Lexer.SKIPPED
})

# All Our Tokens

Lets examine all the needed Tokens definitions"

const Identifier = createToken({ name: "Identifier", pattern: /[a-zA-Z]\w*/ })
// We specify the "longer_alt" property to resolve keywords vs identifiers ambiguity.
// See: https://github.com/SAP/chevrotain/blob/master/examples/lexer/keywords_vs_identifiers/keywords_vs_identifiers.js
const Select = createToken({
  name: "Select",
  pattern: /SELECT/,
  longer_alt: Identifier
})
const From = createToken({
  name: "From",
  pattern: /FROM/,
  longer_alt: Identifier
})
const Where = createToken({
  name: "Where",
  pattern: /WHERE/,
  longer_alt: Identifier
})

const Comma = createToken({ name: "Comma", pattern: /,/ })

const Integer = createToken({ name: "Integer", pattern: /0|[1-9]\d*/ })

const GreaterThan = createToken({ name: "GreaterThan", pattern: />/ })

const LessThan = createToken({ name: "LessThan", pattern: /</ })

const WhiteSpace = createToken({
  name: "WhiteSpace",
  pattern: /\s+/,
  group: chevrotain.Lexer.SKIPPED
})

# Creating The Lexer

We now have Token definitions, but how do we create a Lexer from these?

// note we are placing WhiteSpace first as it is very common thus it will speed up the lexer.
let allTokens = [
  WhiteSpace,
  // "keywords" appear before the Identifier
  Select,
  From,
  Where,
  Comma,
  // The Identifier must appear after the keywords because all keywords are valid identifiers.
  Identifier,
  Integer,
  GreaterThan,
  LessThan
]
let SelectLexer = new Lexer(allTokens)

Note that:

  • The order of Token definitions passed to the Lexer is important. The first PATTERN to match will be chosen not the longest.

  • The lexer's Tokenize method is a pure function, thus only a single Lexer (per grammar) is needed.

  • The lexer is context unaware, it lexes each token (pattern) individually.

    • If you need to distinguish between different contexts during the lexing phase, take a look at Lexer Modes.
  • For more patterns requiring more complex constraints than a regular expression, take a look at Custom Token Patterns.

# Using The Lexer

let inputText = "SELECT column1 FROM table2"
let lexingResult = SelectLexer.tokenize(inputText)

The Lexing Result will contain:

  1. A Token Vector.
  2. the lexing errors (if any were encountered)
  3. And other Token groups (if grouping was used)