4. Shape the AST and CST

After defining the grammar, you already have a parser. But a language project needs more than “accepted text”: it needs a semantic model that later services, such as validation and linking, can rely on.

In Pegium, this step is not a separate code generation phase. The shape of the AST and CST is defined directly by the parser and the C++ node types you choose.

The semantic model

The AST is the semantic tree of your language. A typical AST node looks like this:

struct Entity : pegium::AstNode {
  string name;
  optional<reference<Entity>> superType;
  vector<pointer<Feature>> features;
};

This single node already combines three kinds of fields, each with different semantics:

  • name is plain scalar data
  • superType is a reference to another Entity that is resolved later
  • features are contained child nodes owned by the entity
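
To make the three roles concrete, here is a minimal, self-contained sketch that does not use Pegium at all. Ref, Feature, Entity, and linkSuperTypes are hypothetical stand-ins built from the standard library; Pegium's own reference<T> and pointer<T> types may well behave differently. The sketch only illustrates the distinction between stored data, owned children, and a link that is filled in by a later pass.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT Pegium types.
struct Feature {
  std::string name;
};

// A cross-reference: parsing stores only the written name,
// a later linking pass fills in the non-owning target pointer.
template <typename T>
struct Ref {
  std::string text;           // name as written in the source
  const T* target = nullptr;  // set by linking; stays null if unresolved
};

struct Entity {
  std::string name;                                // plain scalar data
  std::optional<Ref<Entity>> superType;            // link, resolved later
  std::vector<std::unique_ptr<Feature>> features;  // owned children
};

// Resolve every superType by entity name.
// Returns the number of references that could not be resolved.
inline int linkSuperTypes(std::vector<std::unique_ptr<Entity>>& entities) {
  std::unordered_map<std::string, const Entity*> byName;
  for (const auto& e : entities) {
    assert(e != nullptr);  // the model must not contain null entries
    byName[e->name] = e.get();
  }
  int unresolved = 0;
  for (auto& e : entities) {
    if (!e->superType) continue;  // no "extends" clause was written
    auto it = byName.find(e->superType->text);
    if (it != byName.end()) {
      e->superType->target = it->second;
    } else {
      ++unresolved;
    }
  }
  return unresolved;
}
```

Note how ownership and linking are kept apart: destroying an Entity destroys its features, while a Ref never owns its target and can remain unresolved without invalidating the tree.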

How the parser builds the model

The grammar determines how those fields are populated. For example:

Rule<ast::Entity> EntityRule{
    "Entity",
    "entity"_kw.i() + assign<&ast::Entity::name>(ID) +
        option("extends"_kw.i() +
               assign<&ast::Entity::superType>(QualifiedName)) +
        "{"_kw + many(append<&ast::Entity::features>(FeatureRule)) +
        "}"_kw};

This is where parsing and model construction meet:

  • assign writes one value into a field
  • append adds repeated children or values
  • create, action, and nest help shape more advanced trees
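
The idea behind assign and append can be illustrated without the framework: a pointer to member names the target field, and the operation either overwrites that field or pushes one more element into it. The following is a simplified sketch of that idea only. assignField, appendField, buildExample, and the node types are hypothetical and are not meant to reflect how Pegium implements its combinators.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node types for illustration only.
struct Feature {
  std::string name;
};

struct Entity {
  std::string name;
  std::vector<std::unique_ptr<Feature>> features;
};

// Overwrite the field named by Member with a parsed value.
template <auto Member, typename Node, typename Value>
void assignField(Node& node, Value&& value) {
  node.*Member = std::forward<Value>(value);
}

// Push one more parsed element into the container field named by Member.
template <auto Member, typename Node, typename Value>
void appendField(Node& node, Value&& value) {
  (node.*Member).push_back(std::forward<Value>(value));
}

// Mimics the sequence of actions a parser might perform for an
// entity named "Person" that contains two features.
inline Entity buildExample() {
  Entity e;
  assignField<&Entity::name>(e, std::string("Person"));
  for (const char* f : {"name", "age"}) {
    appendField<&Entity::features>(e, std::make_unique<Feature>(Feature{f}));
  }
  assert(e.features.size() == 2);
  return e;
}
```

Because the target field is a compile-time template argument, a mismatch between the parsed value and the field type is reported by the compiler rather than discovered at run time.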

Why the CST matters too

Pegium keeps the CST alongside the AST. This matters because not every tooling feature can be driven by semantics alone.

The CST is later used for:

  • formatting
  • comment handling
  • precise keyword and property lookup
  • cursor-position-sensitive editor features

So even when the AST is your main semantic model, the CST remains a first-class part of the language infrastructure.
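
As an illustration of why cursor-sensitive features need concrete syntax, the sketch below finds the innermost syntax node that covers a given text offset. CstNode, its fields, findAt, and exampleTree are hypothetical simplifications and do not represent Pegium's CST API; the point is only that such a lookup needs source ranges for keywords and punctuation, which an AST alone does not keep.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified CST node covering the half-open
// text range [begin, end).
struct CstNode {
  std::string kind;  // e.g. "keyword", "ID", "Entity"
  std::size_t begin = 0;
  std::size_t end = 0;
  std::vector<CstNode> children;
};

// Return the deepest node whose range contains offset,
// or nullptr if the offset lies outside this subtree.
inline const CstNode* findAt(const CstNode& node, std::size_t offset) {
  assert(node.begin <= node.end);
  if (offset < node.begin || offset >= node.end) return nullptr;
  for (const CstNode& child : node.children) {
    if (const CstNode* hit = findAt(child, offset)) return hit;
  }
  return &node;  // no child covers the offset, e.g. whitespace
}

// Hand-built tree for the 11-character text "entity A {}".
inline CstNode exampleTree() {
  CstNode root{"Entity", 0, 11, {}};
  root.children.push_back({"keyword", 0, 6, {}});    // entity
  root.children.push_back({"ID", 7, 8, {}});         // A
  root.children.push_back({"keyword", 9, 10, {}});   // {
  root.children.push_back({"keyword", 10, 11, {}});  // }
  return root;
}
```

With this tree, an offset inside the word entity maps to a keyword node, an offset on A maps to the ID node, and an offset on the whitespace between them falls back to the enclosing Entity node.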

When shaping the AST, keep it close to the semantics of the language:

  • use scalar fields for plain values
  • use pointer<T> for owned children
  • use vector<pointer<T>> for repeated owned children
  • use reference<T> for links to other nodes
  • use optional<T> only when absence is meaningful

This tends to make validation, linking, and formatting easier later.

What to expect at the end of this step

At the end of this step, your parser should produce the node structure that the rest of the framework works with. That structure should already read like the language model you want to reason about.