Semantic Model

When AST nodes are created during parsing, they become the semantic model of your language. In Pegium, that semantic model is shaped directly by two things:

  • the C++ AST node types you define
  • the grammar assignments and actions that populate them

Unlike generator-centric workflows, Pegium does not infer a separate semantic type system from another DSL. The semantic model is already present in your C++ types.

This page explains how AST types, grammar assignments, references, and CST structure fit together.

AST fields shape the model

Consider this example:

struct Entity : pegium::AstNode {
  string name;
  optional<reference<Entity>> superType;
  vector<pointer<Feature>> features;
};

This one type already tells the framework a lot:

  • name is plain scalar semantic data
  • superType is a link to another node
  • features are owned nested children

The parser then decides how those fields are populated:

Rule<ast::Entity> EntityRule{
    "Entity",
    "entity"_kw.i() + assign<&ast::Entity::name>(ID) +
        option("extends"_kw.i() +
               assign<&ast::Entity::superType>(QualifiedName)) +
        "{"_kw + many(append<&ast::Entity::features>(FeatureRule)) +
        "}"_kw};

So the semantic model in Pegium is not a post-processing artifact. It is the combined result of AST node definitions and grammar wiring.
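One way to picture this wiring is a pair of helpers that take a pointer-to-member as a template argument, the same shape as the assign<&ast::Entity::name>(...) syntax above. The following is only a sketch under assumptions: assignField, appendField, buildExample, and the simplified Entity and Feature structs are hypothetical stand-ins, not Pegium's implementation.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; these are NOT Pegium types.
struct Feature {
  std::string name;
};

struct Entity {
  std::string name;
  std::vector<Feature> features;
};

// Stores a parsed value into the field selected by a pointer-to-member.
template <auto Member, typename Node, typename Value>
void assignField(Node &node, Value &&value) {
  node.*Member = std::forward<Value>(value);
}

// Appends a parsed value to the container field selected by a pointer-to-member.
template <auto Member, typename Node, typename Value>
void appendField(Node &node, Value &&value) {
  (node.*Member).push_back(std::forward<Value>(value));
}

// Simulates the field updates a successful parse of
// `entity Person { name age }` could perform.
inline Entity buildExample() {
  Entity entity;
  assignField<&Entity::name>(entity, std::string{"Person"});
  appendField<&Entity::features>(entity, Feature{"name"});
  appendField<&Entity::features>(entity, Feature{"age"});
  return entity;
}
```

Because the field is named at compile time, renaming or retyping a member in the AST struct breaks the grammar at build time rather than at parse time.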

Why this matters

The shape of the AST is not just an implementation detail. It directly affects:

  • how references are represented and linked
  • how validation reasons about the model
  • how formatter and editor features find the right source regions
  • how stable your language-specific code remains as the grammar evolves

References are part of the model

References are not just strings stored in fields. They carry:

  • the text written by the user
  • the owning container, feature name, target type, and reference cardinality
  • the source CST node, when available
  • the resolved target, once linking has happened

That is why the same model can support linking, completion, rename, and hover without inventing separate data structures for each feature.
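To make the list above concrete, here is a minimal sketch of a reference record that carries that information. Everything in it is an assumption for illustration: ReferenceSketch, Node, sourceOffset, and link are hypothetical names, and Pegium's real reference type and linker are likely richer (for example, this sketch matches the exact dynamic type instead of accepting subtypes).

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <vector>

// Hypothetical node base; NOT pegium::AstNode.
struct Node {
  virtual ~Node() = default;
  std::string name;
};

struct Entity : Node {};

// Hypothetical reference record mirroring the information listed above.
struct ReferenceSketch {
  std::string text;                        // text written by the user
  const Node *container;                   // owning node
  std::string feature;                     // field that holds the reference
  std::type_index targetType;              // expected target type
  bool multi;                              // cardinality: many targets or one
  std::optional<std::size_t> sourceOffset; // stand-in for the source CST node
  const Node *target = nullptr;            // filled in once linking succeeds
};

// Resolves the reference against candidates by name and exact dynamic type.
inline bool link(ReferenceSketch &ref,
                 const std::vector<const Node *> &candidates) {
  for (const Node *candidate : candidates) {
    if (candidate->name == ref.text &&
        std::type_index(typeid(*candidate)) == ref.targetType) {
      ref.target = candidate;
      return true;
    }
  }
  return false;
}
```

Completion, rename, and hover can all read the same record: completion uses the target type, rename uses the text and source location, and hover uses the resolved target.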

Runtime type information

Runtime AST reflection treats pegium::AstNode as the implicit root type of every registered AST class. In practice, the results of AstReflection::isSubtype(...) and AstReflection::getAllSubTypes(...) stay consistent with AstReflection::isInstance(..., typeid(pegium::AstNode)).

AstReflection::getAllTypes() and AstReflection::getAllSubTypes(...) expose stable std::unordered_set views over the bootstrapped registry state, so their iteration order is intentionally unspecified.
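If calling code needs a deterministic order, for example in diagnostics or generated documentation, it can copy the set into a sorted container first. The sketch below assumes the elements can be reduced to names held as std::string; the actual element type of the reflection sets is not stated here, and sortedNames is a hypothetical helper.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Copies an unordered set of type names into a vector sorted alphabetically,
// so that callers get the same order on every run.
inline std::vector<std::string>
sortedNames(const std::unordered_set<std::string> &names) {
  std::vector<std::string> result(names.begin(), names.end());
  std::sort(result.begin(), result.end());
  return result;
}
```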

Pegium bootstraps that reflection once during single-threaded language registration by walking the parser entry rule. Because this bootstrap probes the participating AST types directly, AST-producing parser aliases such as Rule<T> and Infix<T, ...> require T to satisfy DefaultConstructibleAstNode<T>.

That constraint only applies to AST types that are produced directly by parser rules. Abstract or non-default-constructible AST supertypes can still be used as reference targets or as containment slot supertypes in assignments; the bootstrap only needs their runtime type information for subtype filtering.
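The distinction can be pictured with a small compile-time check. This is a sketch under assumptions: AstNodeBase and isDefaultConstructibleNode are hypothetical stand-ins written as a C++17 trait, and the actual definition of DefaultConstructibleAstNode may differ.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical root type; NOT pegium::AstNode.
struct AstNodeBase {
  virtual ~AstNodeBase() = default;
};

// True when T derives from the root type and can be default-constructed.
template <typename T>
inline constexpr bool isDefaultConstructibleNode =
    std::is_base_of_v<AstNodeBase, T> && std::is_default_constructible_v<T>;

// Simple node: could be produced directly by a parser rule.
struct Entity : AstNodeBase {
  std::string name;
};

// Abstract supertype: usable as a reference target, not as a rule result.
struct NamedElement : AstNodeBase {
  virtual std::string label() const = 0;
};

// Requires a constructor argument, so it cannot be default-constructed.
struct StrictEntity : AstNodeBase {
  explicit StrictEntity(std::string n) : name(std::move(n)) {}
  std::string name;
};

static_assert(isDefaultConstructibleNode<Entity>);
static_assert(!isDefaultConstructibleNode<NamedElement>);
static_assert(!isDefaultConstructibleNode<StrictEntity>);
```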

Pegium therefore treats the parsed AST as a mutable technical model of the source program. The recommended pattern is:

  • keep parser-managed AST nodes simple and default-constructible
  • express semantic constraints with validation, linking, or later transforms
  • build a stricter domain model after parsing if your application needs one
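The last point could look like the following sketch: a permissive, default-constructible AST struct on one side, and a domain type that can only be obtained through a checking factory on the other. EntityAst, EntityModel, and fromAst are hypothetical names, and the two invariants (non-empty name, unique feature names) are chosen only for illustration.

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Permissive parse result: every field may be empty or inconsistent.
struct EntityAst {
  std::string name;
  std::vector<std::string> featureNames;
};

// Stricter domain type: instances exist only if the invariants hold.
class EntityModel {
public:
  // Returns std::nullopt when the AST violates an invariant.
  static std::optional<EntityModel> fromAst(const EntityAst &ast) {
    if (ast.name.empty()) {
      return std::nullopt;
    }
    std::vector<std::string> sorted = ast.featureNames;
    std::sort(sorted.begin(), sorted.end());
    if (std::adjacent_find(sorted.begin(), sorted.end()) != sorted.end()) {
      return std::nullopt; // duplicate feature name
    }
    return EntityModel(ast.name, ast.featureNames);
  }

  const std::string &name() const { return name_; }
  const std::vector<std::string> &features() const { return features_; }

private:
  EntityModel(std::string name, std::vector<std::string> features)
      : name_(std::move(name)), features_(std::move(features)) {}

  std::string name_;
  std::vector<std::string> features_;
};
```

In a real language, validation would normally report both problems as diagnostics first; the factory then acts as a final guard for code that consumes the domain model.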

Why the CST still matters

The AST is the semantic model, but Pegium keeps the CST alongside it because many editor-facing features still need source structure:

  • formatting
  • comment rewriting
  • keyword lookup
  • precise property ranges
  • cursor-sensitive features

So the real working model of a Pegium language is AST plus CST plus references.
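As a rough illustration of why source structure is needed, the sketch below maps a cursor offset back to the AST property written at that position. PropertyRange and featureAt are hypothetical; a real CST stores much more (keywords, comments, hidden tokens), and Pegium's actual lookup API is not shown here.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical record tying an AST property to a half-open source range.
struct PropertyRange {
  std::string feature;
  std::size_t begin; // offset of the first character
  std::size_t end;   // offset one past the last character
};

// Returns the property whose source range contains the given offset, if any.
inline std::optional<std::string>
featureAt(const std::vector<PropertyRange> &ranges, std::size_t offset) {
  for (const PropertyRange &range : ranges) {
    if (offset >= range.begin && offset < range.end) {
      return range.feature;
    }
  }
  return std::nullopt;
}
```

For the input `entity Person extends Base`, the name would cover offsets 7 to 13 and the superType reference offsets 22 to 26, so a cursor at offset 23 resolves to superType while a cursor on the extends keyword resolves to no property.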

Stable modeling advice

For mature languages, it is worth treating the AST as a deliberate API rather than just whatever happened to fall out of the first grammar draft.

Good signs of a stable model:

  • containment is explicit
  • references are modeled as references, not as raw strings
  • optionality reflects real semantics
  • later services can reason about the tree without special-case hacks