Grammar Reference¶

This page is the canonical grammar reference for Pegium.

If you want the recommended learning order first, go back to Write the Grammar.

All snippets assume:

#include <pegium/core/parser/PegiumParser.hpp>
using namespace pegium::parser;

Snippets that use Terminal<T>, Rule<T>, or Infix<...> assume they are written inside a subclass of pegium::parser::PegiumParser.

Parser class¶

A Pegium grammar starts with a parser class deriving from pegium::parser::PegiumParser.

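class MyParser : public PegiumParser {
public:
  using PegiumParser::PegiumParser;
  using PegiumParser::parse;

protected:
  // Rule used by parse(...); it must produce an AST node.
  const pegium::grammar::ParserRule &getEntryRule() const noexcept override {
    return Module;
  }

  // Hidden-token policy applied between the elements of parser-level rules.
  const Skipper &getSkipper() const noexcept override {
    return skipper;
  }

  // Whitespace is ignored everywhere; see "Skipper integration".
  static constexpr auto WS = some(s);
  Skipper skipper = SkipperBuilder().ignore(WS).build();

  Terminal<std::string> ID{"ID", "a-zA-Z_"_cr + many(w)};
  Rule<ast::Module> Module{"Module", "module"_kw + assign<&ast::Module::name>(ID)};
};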

Parser aliases¶

Inside a PegiumParser subclass, prefer these aliases:

Terminal<T>: terminal rule
Rule<T>: data type rule or parser rule, depending on T
Infix<T, Left, Op, Right>: infix rule

Rule<T> resolves automatically:

Rule<ast::Node> becomes a parser rule
Rule<std::string> or Rule<double> becomes a data type rule

Entry rule¶

Every parser must override getEntryRule().

The entry rule:

is the rule used by PegiumParser::parse(...)
must be AST-producing
is therefore typically a Rule<RootAstNode>

Use getSkipper() to provide the hidden-token policy for the whole parser.

Terminals¶

`Literal`¶

Create literals with "_kw":

auto kw = "catalogue"_kw;
auto ciKw = "catalogue"_kw.i();

Literal matches a fixed piece of text.

Use it for:

keywords such as "entity"_kw
punctuation such as "{"_kw, ":"_kw, or ";"_kw
fixed operators such as "+"_kw or "->"_kw

Important behavior:

literals are case-sensitive by default
call .i() to make a literal case-insensitive
when the literal ends with a word character, Pegium enforces a word boundary

That last point means "entity"_kw matches entity, but not the entity prefix inside entityName.

Case-insensitive literals are useful for language keywords:

"module"_kw.i()
"entity"_kw.i()
"extends"_kw.i()

For punctuation and operators, the default case-sensitive form is usually the right one.
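
The matching rules above (case sensitivity, .i(), and the word boundary) can be modelled in a few lines of standalone C++. This sketch does not use Pegium and is not its implementation; matchesKeyword and isWordChar are hypothetical names, and the word-character set is assumed to be [a-zA-Z0-9_].

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

// Word characters, assumed to be [a-zA-Z0-9_].
inline bool isWordChar(char c) {
  return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

// Returns true when `keyword` matches `input` at `pos`.
// Standalone model only; this is not Pegium's implementation.
inline bool matchesKeyword(std::string_view input, std::size_t pos,
                           std::string_view keyword, bool ignoreCase = false) {
  assert(!keyword.empty() && pos <= input.size());
  if (input.size() - pos < keyword.size()) return false;
  for (std::size_t i = 0; i < keyword.size(); ++i) {
    unsigned char a = static_cast<unsigned char>(input[pos + i]);
    unsigned char b = static_cast<unsigned char>(keyword[i]);
    if (ignoreCase ? std::tolower(a) != std::tolower(b) : a != b) return false;
  }
  // Word boundary: only checked when the keyword ends with a word character.
  std::size_t end = pos + keyword.size();
  if (isWordChar(keyword.back()) && end < input.size() &&
      isWordChar(input[end]))
    return false;
  return true;
}
```

With this model, "entity" matches at the start of "entity Person" but not at the start of "entityName", while "->" matches in front of any character because it does not end with a word character.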

`CharacterRange`¶

Create ranges with "_cr":

auto lower = "a-z"_cr;
auto digit = "0-9"_cr;
auto notNewline = "^\n"_cr;

CharacterRange matches one character chosen from a set or interval.

Typical uses:

letters: "a-zA-Z"_cr
digits: "0-9"_cr
identifier start: "a-zA-Z_"_cr
identifier continuation: "a-zA-Z0-9_"_cr

Rules of thumb:

a-z means every character from a to z
you can list several intervals and single characters in the same string, as in "a-zA-Z_"
a leading ^ negates the range

Examples:

auto identifierStart = "a-zA-Z_"_cr;
auto identifierPart = "a-zA-Z0-9_"_cr;
auto hexDigit = "0-9a-fA-F"_cr;
auto notQuote = "^\""_cr;

CharacterRange is best for compact lexical constraints. When the language needs to match structured text rather than one character at a time, combine several ranges with parser expressions or move to a named rule.
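
To make the rules of thumb concrete, here is a standalone membership test for a range string. It is a sketch under assumptions, not Pegium code: inCharacterRange is a hypothetical helper, and the exact syntax Pegium accepts (for example how a literal - is written) is not covered here.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns true when `c` belongs to the set described by `spec`,
// e.g. "a-zA-Z_" or "^\n". Assumed syntax: `x-y` is an inclusive interval,
// any other character stands for itself, a leading `^` negates the set.
inline bool inCharacterRange(std::string_view spec, char c) {
  bool negated = !spec.empty() && spec.front() == '^';
  if (negated) spec.remove_prefix(1);
  assert(!spec.empty());  // a range needs at least one character
  bool found = false;
  for (std::size_t i = 0; i < spec.size() && !found;) {
    if (i + 2 < spec.size() && spec[i + 1] == '-') {
      found = spec[i] <= c && c <= spec[i + 2];  // interval such as a-z
      i += 3;
    } else {
      found = spec[i] == c;  // single character such as _
      i += 1;
    }
  }
  return found != negated;
}
```

Under these assumptions "a-zA-Z_" accepts q and _ but rejects 7, and "^\n" accepts every character except a newline.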

`AnyCharacter`¶

Use dot:

auto any = dot;

dot matches any single character except end-of-input.

In practice it is useful for:

fallback matching
comment bodies
scanning until a delimiter
simple “consume one more character” patterns

Example:

auto blockComment = "/*"_kw <=> "*/"_kw;
auto untilEndOfLine = many(!eol + dot);

dot is the most permissive primitive. Prefer Literal or CharacterRange when the intent is more specific.

`Terminal<T>`¶

Use terminals for lexical items:

// inside a PegiumParser subclass
Terminal<std::string> ID{"ID", "a-zA-Z_"_cr + many(w)};
Terminal<double> NUMBER{"NUMBER", some(d) + option("."_kw + many(d))};

Important: a terminal is contiguous. Hidden tokens cannot appear between the elements inside a terminal rule.

If you need whitespace or comments between parts, use Rule<T> instead.

This distinction is easy to miss:

Terminal<std::string> A{"A", ID + "."_kw + ID};
Rule<std::string> B{"B", ID + "."_kw + ID};

A only matches contiguous text such as foo.bar
B can match foo.bar, foo . bar, or foo /*comment*/ . bar, depending on the skipper
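
The difference between A and B can be illustrated without Pegium. The sketch below hand-matches ID "." ID twice: once contiguously (terminal-like) and once with whitespace allowed between the three elements (rule-like, assuming a skipper that ignores whitespace only). All names are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

// Consumes an identifier ([a-zA-Z_] followed by word characters).
inline bool matchId(std::string_view in, std::size_t &pos) {
  assert(pos <= in.size());
  if (pos == in.size()) return false;
  unsigned char first = static_cast<unsigned char>(in[pos]);
  if (!std::isalpha(first) && first != '_') return false;
  ++pos;
  while (pos < in.size() &&
         (std::isalnum(static_cast<unsigned char>(in[pos])) || in[pos] == '_'))
    ++pos;
  return true;
}

// Stand-in for a skipper that ignores whitespace only.
inline void skipSpaces(std::string_view in, std::size_t &pos) {
  while (pos < in.size() && std::isspace(static_cast<unsigned char>(in[pos])))
    ++pos;
}

// Matches ID "." ID over the whole input. With `skipBetween` set, whitespace
// may appear between the elements; otherwise the text must be contiguous.
inline bool matchQualified(std::string_view in, bool skipBetween) {
  std::size_t pos = 0;
  if (!matchId(in, pos)) return false;
  if (skipBetween) skipSpaces(in, pos);
  if (pos == in.size() || in[pos] != '.') return false;
  ++pos;
  if (skipBetween) skipSpaces(in, pos);
  return matchId(in, pos) && pos == in.size();
}
```

matchQualified("foo . bar", false) fails the way a terminal would, while matchQualified("foo . bar", true) succeeds the way a rule would.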

Rules¶

`Rule<T>`¶

Use Rule<T> for named parser-level constructs:

// inside a PegiumParser subclass
Rule<std::string> QualifiedName{"QualifiedName", some(ID, "."_kw)};
Rule<ast::Entity> EntityRule{
    "Entity",
    "entity"_kw.i() + assign<&ast::Entity::name>(ID)};

Rules run with the current skipper, so hidden tokens can appear between their elements.

Data-type rules¶

Rule<T> is a data-type rule when T is not derived from AstNode.

Typical use cases:

qualified names
scalar values
enums
strongly-typed terminal conversions

Parser rules¶

Rule<T> is a parser rule when T derives from AstNode.

Use parser rules for:

AST-producing language constructs
nesting and containment
assignments and actions

PEG combinators¶

`Group`¶

Sequence with +:

auto qualifiedName = some(w) + "."_kw + some(w);

`OrderedChoice`¶

Choice with |:

auto sign = "+"_kw | "-"_kw;

`UnorderedGroup`¶

All elements once, in any order, with &:

auto modifiers = "public"_kw & "static"_kw;

`Repetition`¶

Helpers:

auto optionalSemicolon = option(";"_kw);          // 0..1
auto spaces = many(s);                            // 0..N
auto identifierTail = some(w | d | "_"_kw);       // 1..N
auto exactly2Digits = repeat<2>(d);               // exactly 2
auto oneToThreeDigits = repeat<1, 3>(d);          // min/max
auto csvWords = many(some(w), ","_kw);            // separated repetition
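
The bounds in the comments above can be read as a greedy counting loop. The following standalone sketch models only the digit case; repeatDigits is a hypothetical helper, and greedy matching is assumed because that is the usual PEG behavior.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string_view>

// Greedy bounded repetition of digits starting at `pos`.
// Consumes at most `max` digits and returns how many were consumed, or
// std::nullopt when fewer than `min` digits are available.
inline std::optional<std::size_t> repeatDigits(std::string_view in,
                                               std::size_t pos,
                                               std::size_t min,
                                               std::size_t max) {
  assert(min <= max && pos <= in.size());
  std::size_t count = 0;
  while (count < max && pos + count < in.size() &&
         std::isdigit(static_cast<unsigned char>(in[pos + count])))
    ++count;
  if (count < min) return std::nullopt;
  return count;
}
```

In this model option corresponds to bounds 0..1, many to 0..N, some to 1..N, repeat<2> to 2..2, and repeat<1, 3> to 1..3.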

`AndPredicate`¶

Lookahead with unary &:

auto beforeSemicolon = &";"_kw;

AndPredicate checks that the next expression matches without consuming input. Use it for lookahead constraints.

`NotPredicate`¶

Negative lookahead with unary !:

auto untilEolChar = !eol + dot;

NotPredicate checks that the next expression does not match, again without consuming input. It is often combined with dot to express “consume until ...”.
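
The "consume until" idiom many(!eol + dot) amounts to a loop that tests for a line ending without consuming it, then advances by one character. A standalone sketch, with hypothetical helper names and eol assumed to mean \n, \r\n, or \r:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Length of the line ending at `pos` ("\r\n", "\n" or "\r"), or 0 if none.
// Nothing is consumed here, which mirrors what a predicate does.
inline std::size_t eolLength(std::string_view in, std::size_t pos) {
  if (pos >= in.size()) return 0;
  if (in[pos] == '\r')
    return (pos + 1 < in.size() && in[pos + 1] == '\n') ? 2 : 1;
  return in[pos] == '\n' ? 1 : 0;
}

// Models many(!eol + dot): returns the position just before the next line
// ending, or the end of input when there is none.
inline std::size_t consumeUntilEol(std::string_view in, std::size_t pos) {
  assert(pos <= in.size());
  while (pos < in.size() && eolLength(in, pos) == 0)  // !eol succeeds
    ++pos;                                            // dot consumes one char
  return pos;
}
```

The line ending itself is left in the input, so a following expression can still match it.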

Until operator (`<=>`)¶

start <=> end parses from start to the first end:

auto blockComment = "/*"_kw <=> "*/"_kw;

This operator is a concise non-greedy “from ... to ...” pattern. It is especially useful for comment terminals.
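
A non-greedy "from ... to ..." match stops at the first closing delimiter, not the last one. The standalone sketch below shows that behavior for plain text delimiters; matchUntil is a hypothetical helper, and it assumes the closing delimiter is part of the match, as it would be for a block comment.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Models `start <=> end` for plain text delimiters: if `start` matches at
// `pos`, consume up to and including the first following `end`.
// Returns the consumed length, or std::string_view::npos when nothing matches.
inline std::size_t matchUntil(std::string_view in, std::size_t pos,
                              std::string_view start, std::string_view end) {
  assert(pos <= in.size() && !end.empty());
  if (in.substr(pos, start.size()) != start) return std::string_view::npos;
  std::size_t close = in.find(end, pos + start.size());
  if (close == std::string_view::npos) return std::string_view::npos;
  return close + end.size() - pos;
}
```

On the input /* a */ b */ the match covers only /* a */, which is why nested or repeated closing delimiters need a different construct.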

Assignments and actions¶

`Assignment`¶

Use assign, append, and enable_if to fill AST members:

struct FieldNode : pegium::AstNode {
  std::string name;
  std::vector<std::string> tags;
  bool optional = false;
};

// inside a PegiumParser subclass
Rule<FieldNode> FieldRule{
    "Field",
    assign<&FieldNode::name>("name"_kw) +
        ":"_kw +
        many(append<&FieldNode::tags>("tag"_kw), ","_kw) +
        enable_if<&FieldNode::optional>("?"_kw)};
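
Helpers of the form assign<&T::member>(...) take a pointer to member as a template argument. The standalone C++17 sketch below shows how that language feature can select the target field at compile time. It is an illustration only: assignMember, appendMember, and enableMember are hypothetical names, FieldNode is redeclared without its Pegium base class, and nothing here claims to match Pegium's internals.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct FieldNode {
  std::string name;
  std::vector<std::string> tags;
  bool optional = false;
};

// Stores a value in the member selected at compile time.
template <auto Member, typename Node, typename Value>
void assignMember(Node &node, Value &&value) {
  node.*Member = std::forward<Value>(value);
}

// Appends a value to the container member selected at compile time.
template <auto Member, typename Node, typename Value>
void appendMember(Node &node, Value &&value) {
  (node.*Member).emplace_back(std::forward<Value>(value));
}

// Sets a boolean member to true, as a flag-style assignment would.
template <auto Member, typename Node>
void enableMember(Node &node) {
  node.*Member = true;
}

// Fills a node the way the rule above would for the text "name: tag, tag ?".
inline FieldNode buildSampleField() {
  FieldNode field;
  assignMember<&FieldNode::name>(field, "name");
  appendMember<&FieldNode::tags>(field, "tag");
  appendMember<&FieldNode::tags>(field, "tag");
  enableMember<&FieldNode::optional>(field);
  assert(field.tags.size() == 2);
  return field;
}
```

Because the member is a template argument, a mismatch between the member type and the assigned value is reported at compile time rather than while parsing.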

`Action`¶

Use create<T>(), action<T>(), and nest<&T::member>() to create or wrap AST nodes while parsing.

struct Expr : pegium::AstNode {};
struct NumberExpr : Expr {
  std::string value;
};
struct UnaryExpr : Expr {
  pegium::pointer<Expr> operand;
};

auto newNumber = action<NumberExpr>();
auto wrapCurrent = action<UnaryExpr, &UnaryExpr::operand>();

Infix rules¶

Use Infix<...> for operator-heavy languages with precedence and associativity.

// inside a PegiumParser subclass
Infix<ast::BinaryExpression, &ast::BinaryExpression::left,
      &ast::BinaryExpression::op, &ast::BinaryExpression::right>
    BinaryExpression{"BinaryExpression",
                     PrimaryExpression,
                     LeftAssociation("%"_kw),
                     LeftAssociation("^"_kw),
                     LeftAssociation("*"_kw | "/"_kw),
                     LeftAssociation("+"_kw | "-"_kw)};

Use:

LeftAssociation(...) for left-associative operators
RightAssociation(...) for right-associative operators

Important: the operator declarations are listed from the strongest precedence to the weakest precedence.

So this declaration:

Infix<ast::BinaryExpression, &ast::BinaryExpression::left,
      &ast::BinaryExpression::op, &ast::BinaryExpression::right>
    BinaryExpression{"BinaryExpression",
                     PrimaryExpression,
                     LeftAssociation("%"_kw),
                     LeftAssociation("^"_kw),
                     LeftAssociation("*"_kw | "/"_kw),
                     LeftAssociation("+"_kw | "-"_kw)};

means:

% binds more strongly than ^
^ binds more strongly than * and /
* and / bind more strongly than + and -

That precedence table drives the AST grouping:

2 + 3 * 4 parses as 2 + (3 * 4)
a * b + c parses as (a * b) + c
a - b - c with LeftAssociation(...) parses as (a - b) - c
a ^ b ^ c with RightAssociation(...) parses as a ^ (b ^ c)
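
The groupings above follow from a small table-driven algorithm. The standalone sketch below is one way to implement such a table, assuming space-separated tokens and single-token operands; it is not Pegium's implementation, and Level, InfixSketch, group, arithmetic, and power are hypothetical names. It returns the input with explicit parentheses so the grouping is visible.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// One precedence level: its operators and its associativity.
struct Level {
  std::vector<std::string> ops;
  bool rightAssoc;
};

class InfixSketch {
public:
  // `levels` is ordered from strongest to weakest, like the operator
  // declarations of an infix rule. `text` holds space-separated tokens.
  InfixSketch(std::vector<Level> levels, const std::string &text)
      : levels_(std::move(levels)) {
    std::istringstream in(text);
    for (std::string tok; in >> tok;) tokens_.push_back(tok);
  }

  // Returns the input with parentheses around every binary node.
  std::string parse() { return parseLevel(levels_.size()); }

private:
  // Parses an expression that may use the `n` strongest levels.
  std::string parseLevel(std::size_t n) {
    if (n == 0) {  // base rule: a single operand token
      assert(pos_ < tokens_.size());
      return tokens_[pos_++];
    }
    const Level &level = levels_[n - 1];
    std::string left = parseLevel(n - 1);
    while (pos_ < tokens_.size() && hasOp(level, tokens_[pos_])) {
      std::string op = tokens_[pos_++];
      // A right-associative operator lets its right operand use the same
      // level again; a left-associative one only allows stronger levels.
      std::string right = parseLevel(level.rightAssoc ? n : n - 1);
      left = "(" + left + " " + op + " " + right + ")";
    }
    return left;
  }

  static bool hasOp(const Level &level, const std::string &tok) {
    return std::find(level.ops.begin(), level.ops.end(), tok) !=
           level.ops.end();
  }

  std::vector<Level> levels_;
  std::vector<std::string> tokens_;
  std::size_t pos_ = 0;
};

inline std::string group(std::vector<Level> levels, const std::string &text) {
  return InfixSketch(std::move(levels), text).parse();
}

// Sample tables: left-associative arithmetic and a right-associative power.
inline std::vector<Level> arithmetic() {
  return {Level{{"*", "/"}, false}, Level{{"+", "-"}, false}};
}
inline std::vector<Level> power() { return {Level{{"^"}, true}}; }
```

For example, group(arithmetic(), "2 + 3 * 4") yields "(2 + (3 * 4))" and group(power(), "a ^ b ^ c") yields "(a ^ (b ^ c))".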

The first argument after the rule name is the base rule. In practice this is usually a rule such as PrimaryExpression containing the atomic expressions of the language:

number or string literals
identifiers
grouped expressions
function calls

Infix then builds binary-expression nodes on top of that base rule according to the precedence and associativity declarations.

This is especially useful when you want to:

avoid manually writing one rule per precedence level
keep a uniform AST node type for binary expressions
make the operator table obvious in one place

An infix rule cannot be used as a terminal. It is a parser-level construct.

Skipper integration¶

Global skipper¶

Build a skipper with ignored and hidden terminals:

// inside a PegiumParser subclass
static constexpr auto WS = some(s);
Terminal<> SL_COMMENT{"SL_COMMENT", "//"_kw <=> &(eol | eof)};
Terminal<> ML_COMMENT{"ML_COMMENT", "/*"_kw <=> "*/"_kw};

Skipper skipper =
    SkipperBuilder().ignore(WS).hide(ML_COMMENT, SL_COMMENT).build();

The skipper defines which tokens may appear automatically between parser-level elements, and whether those tokens are discarded or preserved in the CST.

`ignore(...)`¶

ignore(...) is for tokens that should disappear completely after parsing.

Typical example:

static constexpr auto WS = some(s);
Skipper skipper = SkipperBuilder().ignore(WS).build();

Ignored tokens:

are consumed automatically between elements of parser rules and data-type rules
do not appear in the CST
cannot be visited later as CST nodes

`hide(...)`¶

hide(...) is for tokens that should not influence parsing, but should remain available in the CST as hidden nodes.

Typical example:

Terminal<> SL_COMMENT{"SL_COMMENT", "//"_kw <=> &(eol | eof)};
Terminal<> ML_COMMENT{"ML_COMMENT", "/*"_kw <=> "*/"_kw};

Skipper skipper =
    SkipperBuilder().ignore(WS).hide(ML_COMMENT, SL_COMMENT).build();

Hidden tokens:

are consumed automatically between elements of parser rules and data-type rules
do appear in the CST
are marked as hidden nodes
can be used later by formatting or source-aware features
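
The practical difference between ignore(...) and hide(...) is whether the skipped text leaves a trace. The standalone sketch below models that with whitespace treated as ignored and /* ... */ comments treated as hidden; it does not use Pegium, and skipHidden and hiddenTokens are hypothetical names. A plain vector of strings stands in for hidden CST nodes.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Skips whitespace and /* ... */ comments starting at `pos`.
// Whitespace is dropped (ignore-like); comment text is recorded in `hidden`
// (hide-like). Returns the position of the next visible character.
inline std::size_t skipHidden(std::string_view in, std::size_t pos,
                              std::vector<std::string> &hidden) {
  assert(pos <= in.size());
  for (;;) {
    if (pos < in.size() && std::isspace(static_cast<unsigned char>(in[pos]))) {
      ++pos;  // ignored: leaves no trace
    } else if (in.substr(pos, 2) == "/*") {
      std::size_t close = in.find("*/", pos + 2);
      if (close == std::string_view::npos) return pos;  // unterminated comment
      hidden.emplace_back(in.substr(pos, close + 2 - pos));  // kept as hidden
      pos = close + 2;
    } else {
      return pos;
    }
  }
}

// Convenience wrapper: the hidden tokens found before the first visible
// character of `in`.
inline std::vector<std::string> hiddenTokens(std::string_view in) {
  std::vector<std::string> hidden;
  skipHidden(in, 0, hidden);
  return hidden;
}
```

Both kinds of token are skipped in the same place; only the recorded comment remains available afterwards, for example to a formatter.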

Parser rules and data-type rules¶

The skipper applies between the elements of parser-level rules.

That includes:

Rule<T> when T is an AST node type
Rule<T> when T is a scalar or other non-AST value type

So a rule such as:

// inside a PegiumParser subclass
Rule<std::string> QualifiedName{"QualifiedName", ID + "."_kw + ID};

can accept all of the following depending on the configured skipper:

foo.bar
foo . bar
foo /* comment */ . bar

Terminals¶

The skipper does not apply inside Terminal<T>.

So:

// inside a PegiumParser subclass
Terminal<std::string> QualifiedNameToken{"QualifiedNameToken",
                                         ID + "."_kw + ID};

matches contiguous text only. It does not allow ignored or hidden tokens between ID, ".", and ID.

This is why whitespace-sensitive lexical constructs usually belong in Terminal<T>, while whitespace-tolerant structured constructs usually belong in Rule<T>.

Local skipper on expressions¶

with_skipper(...) is supported on Group, OrderedChoice, UnorderedGroup, and Repetition expressions:

auto localSkipper = SkipperBuilder().ignore(WS).build();

auto grouped = ("a"_kw + "b"_kw).with_skipper(localSkipper);
auto choice = ("a"_kw | "b"_kw).with_skipper(localSkipper);
auto unordered = ("a"_kw & "b"_kw).with_skipper(localSkipper);
auto repeated = some("a"_kw).with_skipper(localSkipper);

Local skipper on rules¶

Use opt::with_skipper(...) for named rules:

// inside a PegiumParser subclass
Rule<std::string> Token{
    "Token", "a"_kw + "b"_kw, opt::with_skipper(localSkipper)};

Built-in shortcuts¶

Predefined expressions in pegium::parser:

auto end = eof;
auto endLine = eol;
auto space = s;
auto nonSpace = S;
auto word = w;
auto nonWord = W;
auto digit = d;
auto nonDigit = D;

These shortcuts are meant to cover the most common lexical building blocks:

eof: end of input
eol: line ending (\n, \r\n, or \r)
s: whitespace character
S: non-whitespace character
w: word character ([a-zA-Z0-9_])
W: non-word character
d: digit ([0-9])
D: non-digit character

Examples:

Terminal<> WS{"WS", some(s)};
Terminal<std::string> ID{"ID", "a-zA-Z_"_cr + many(w)};
Terminal<> SL_COMMENT{"SL_COMMENT", "//"_kw <=> &(eol | eof)};

Nullability constraints¶

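Several APIs intentionally reject nullable expressions, that is, expressions that can succeed without consuming any input (for example option(...) or many(...)):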

Terminal<T>
Rule<T>
assign, append, enable_if
many, some, repeat
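
An expression is nullable when it can succeed without consuming input. The standalone sketch below computes that property for a tiny expression tree using the standard PEG rules. It only illustrates the concept: Kind, Expr, and the builder functions are hypothetical stand-ins that reuse familiar names, and how Pegium itself performs the check is not described here.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

enum class Kind { Literal, Option, Many, Some, Sequence, Choice };

struct Expr {
  Kind kind;
  std::vector<Expr> children;  // empty for Literal
};

// Returns true when the expression can succeed without consuming input.
inline bool isNullable(const Expr &e) {
  switch (e.kind) {
  case Kind::Literal:
    return false;  // assumed to hold non-empty text
  case Kind::Option:
  case Kind::Many:
    return true;  // zero occurrences are allowed
  case Kind::Some:
    assert(!e.children.empty());
    return isNullable(e.children.front());  // one occurrence is required
  case Kind::Sequence:  // nullable only if every element is nullable
    return std::all_of(e.children.begin(), e.children.end(), isNullable);
  case Kind::Choice:  // nullable if any alternative is nullable
    return std::any_of(e.children.begin(), e.children.end(), isNullable);
  }
  return false;
}

// Small builders so that examples read like grammar expressions.
inline Expr lit() { return Expr{Kind::Literal, {}}; }
inline Expr option(Expr e) { return Expr{Kind::Option, {std::move(e)}}; }
inline Expr many(Expr e) { return Expr{Kind::Many, {std::move(e)}}; }
inline Expr some(Expr e) { return Expr{Kind::Some, {std::move(e)}}; }
inline Expr seq(std::vector<Expr> items) {
  return Expr{Kind::Sequence, std::move(items)};
}
inline Expr choice(std::vector<Expr> items) {
  return Expr{Kind::Choice, std::move(items)};
}
```

In this model seq({option(lit()), lit()}) is not nullable because its second element must consume input, whereas seq({option(lit()), many(lit())}) is. Repeating a nullable expression is the classic case to avoid, since each iteration could succeed without advancing.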
