This manual (7 December 2022) is for Libmarpa 11.0.1.
Copyright © 2022 Jeffrey Kegler.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
The Libmarpa license takes precedence over the statements in this document. In particular, the license states that Libmarpa is free software and has no warranty. No statement in this document should be construed as providing any kind of warranty.
For important information that has changed since the last stable release, there is an “updates” document (https://github.com/jeffreykegler/libmarpa/blob/updated/UPDATES.md). The updates document includes
To allow that information to be kept current without issuing a new stable release, we describe how to obtain support in the updates document. See Support.
This is essentially a reference document, but its early chapters lay out concepts essential to the others. Readers will usually want to read the chapters up to and including Introduction to the method descriptions in order. Otherwise, they should follow their interests.
This document is very far from self-contained. It assumes the following:
Marpa::R2 or Marpa::R3, in Perl.
This chapter contains a quick overview of Libmarpa, using standard parsing terminology. It is intended to help a prospective reader of the whole document to know what to expect. Details and careful definitions will be provided in later chapters.
Libmarpa implements the Marpa parsing algorithm. Marpa is named after the legendary 11th century Tibetan translator, Marpa Lotsawa. In creating Marpa, I depended heavily on previous work by Jay Earley, Joop Leo, John Aycock and Nigel Horspool.
Libmarpa implements the entire Marpa algorithm. This library does the necessary grammar preprocessing, recognizes the input, and produces a “bocage”, which is an optimized parse forest. Libmarpa also supports the ordering, iteration and evaluation of the parse trees in the bocage.
Libmarpa is very low-level. For example, it has no strings. Rules, symbols, and token values are all represented by integers. This, of course, will not suffice for many applications. Users will very often want names for the symbols, non-integer values for tokens, or both. Typically, applications will use arrays to translate Libmarpa’s integer IDs to strings or other values as required.
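The array translation mentioned above might look like the following minimal sketch. It does not call Libmarpa: the table `symbol_names` and the helpers `remember_symbol()` and `symbol_name()` are hypothetical, and the IDs are simply the next free array index rather than IDs returned by the library.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SYMBOLS 16

/* Application-side table, indexed by integer symbol ID. */
static const char *symbol_names[MAX_SYMBOLS];
static int symbol_count = 0;

/* Record a name and return the ID it is stored under. In a real
   application the ID would come from Libmarpa; here it is just the
   next free index (an assumption made for illustration). */
static int remember_symbol(const char *name)
{
    assert(symbol_count < MAX_SYMBOLS);
    symbol_names[symbol_count] = name;
    return symbol_count++;
}

/* Translate an ID back to its name, or NULL if the ID is unknown. */
static const char *symbol_name(int id)
{
    return (id >= 0 && id < symbol_count) ? symbol_names[id] : NULL;
}
```

The same pattern, an array indexed by ID, can hold any per-symbol or per-rule data the application needs.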
Libmarpa also does not implement most of the semantics. Libmarpa does have an evaluator (called a “valuator”), but it does not manipulate the stack directly. Instead, Libmarpa, based on its traversal of the parse tree, passes optimized step-by-step stack manipulation instructions to the upper layer. These instructions indicate the token or rule involved, and the proper location for the true token value or the result of the rule evaluation. For rule evaluations, the instructions include the stack location of the arguments.
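To make the division of labor concrete, here is a rough sketch of an upper layer that owns the stack and follows such instructions. The `Step` structure, its field names, and the "sum the arguments" semantics are all assumptions made for illustration; they are not Libmarpa's valuator interface.

```c
#include <assert.h>

/* Hypothetical instruction kinds; these are NOT Libmarpa's own types. */
typedef enum { STEP_TOKEN, STEP_RULE } StepKind;

typedef struct {
    StepKind kind;
    int id;        /* token symbol ID or rule ID */
    int value;     /* token value (STEP_TOKEN only) */
    int arg_first; /* stack index of first argument (STEP_RULE only) */
    int arg_last;  /* stack index of last argument (STEP_RULE only) */
    int result;    /* stack index where the result belongs */
} Step;

#define STACK_SIZE 32

/* Apply the steps in order. The application owns the stack and the
   semantics; in this sketch every rule simply sums its arguments. */
static int evaluate(const Step *steps, int n_steps)
{
    int stack[STACK_SIZE] = {0};
    int last_result = 0;
    for (int i = 0; i < n_steps; i++) {
        const Step *s = &steps[i];
        assert(s->result >= 0 && s->result < STACK_SIZE);
        if (s->kind == STEP_TOKEN) {
            stack[s->result] = s->value;
        } else {
            int sum = 0;
            assert(s->arg_first >= 0 && s->arg_last < STACK_SIZE);
            for (int a = s->arg_first; a <= s->arg_last; a++)
                sum += stack[a];
            stack[s->result] = sum;
        }
        last_result = s->result;
    }
    return n_steps > 0 ? stack[last_result] : 0;
}
```

Because the instructions carry only locations and IDs, the application is free to store any kind of value on its own stack.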
Marpa requires most semantics to be implemented in the application. This allows the application total flexibility. It also puts the application in a much better position to prevent errors, to catch errors at runtime or, failing all else, to successfully debug the logic.
max(x,y) is the maximum of x and y, where x and y are two numbers.
This document assumes the reader is familiar with parsing theory. The following exposition is not intended as an introduction or a reference. Instead, it is intended to serve as a guide to the definitions of parsing terms as used in this document.
Where a narrow or specialized sense of the term is the one that applies within Marpa, that is the only definition given. Marpa also sometimes uses a standard term with a definition which is slightly different from the standard one. “Ambiguous grammar” is one example: See Ambiguity. The term “grammar” itself is another. See grammar-non-standard. When a definition is non-standard, this is explicitly pointed out.
Readers who want a textbook or tutorial in parsing theory can look at Mark Jason Dominus’s excellent chapter on parsing in the Perl context. See Bibliography-Dominus-2005. It is available on-line. Wikipedia is also an excellent place to start. See Bibliography-Wikipedia.
A grammar is a set of rules, associated with a set of symbols, one of which is distinguished as the start symbol. A symbol string, or simply string where the meaning is clear, is an ordered series of symbols. The length of a string is the number of symbols in it. A symbol string is also called a sentential form.
Some of the symbols are terminals. For the purposes of this subsection, a terminal is a symbol which may occur in an input to a parse of a grammar. In a parse, an input is either accepted or rejected. A potential input string, that is, a sentential form which is made up entirely of terminal symbols, is called a sentence. The set of sentences that a grammar accepts is the language of the grammar.
It is important to note that the term language, as it is used in parsing theory, means something very different from what it means in ordinary use. The meaning of the strings is an essential part of the ordinary idea of what a language is. In ordinary use, the word “language” means far more than an unordered list of its sentences. In parsing terminology, meaning (or semantics as it is called) is a separate issue. For parsing theory a language is exactly a set of strings – that and nothing more.
The Marpa definition of a grammar differs slightly from the various standard ones. Standard definitions usually sharply distinguish terminal symbols from non-terminals. Marpa does not. Marpa’s handling of terminals is discussed further below (see Terminals).
A recognizer is a program that determines whether its input is in the language of a grammar and a start symbol. A parser is a program which finds the structure of that input.
The term parsing is used in a strict and a loose sense. Parsing in the loose sense is all phases of finding the structure of an input, including a separate recognition phase if the parser has one. (Marpa does.) If a parser has phases, parsing in the strict sense refers specifically to the phase that finds the structure of the input. When the Marpa documents use the term parsing in its strict sense, they will speak explicitly of “parsing in the strict sense”. Otherwise, parsing will mean parsing in the loose sense.
Parsers often use a lexical analyzer to convert raw input, usually input text, into a token stream, which is a series of tokens. Each token represents a symbol of the grammar and has a value. A lexical analyzer is often called a lexer or a scanner, and lexical analysis is often called lexing or scanning.
The series of symbols represented by the series of tokens becomes the symbol string input seen by the recognizer. The symbol string input is more often called the input sentence.
The output of the Marpa parser is a parse forest. See def-forest.
A standard way of describing rules is Backus-Naur Form, or BNF. A rule of a grammar is sometimes called a production. In one common way of writing BNF, a rule looks like this:
Expression ::= Term Factor
In the rule above, Expression, Term and Factor are symbols. A rule consists of a left hand side and a right hand side. In a context-free grammar, like those Marpa parses, the left hand side of a rule is always a symbol string of length 1. The right hand side of a rule is a symbol string of zero or more symbols. In the example, Expression is the left hand side, and Term and Factor are right hand side symbols.
Left hand side and right hand side are often abbreviated as LHS and RHS. If the RHS of a rule has no symbols, the rule is called an empty rule or an empty production.
In a standard grammar, all rules are BNF rules, as just described. Marpa grammars differ from standard grammars in allowing a second kind of rule: a sequence rule. The RHS of a sequence rule is a single symbol, which is repeated zero or more times. Libmarpa allows the application to specify other parameters, including a separator symbol. See Sequence rules.
A step of a derivation, or derivation step, is a change made to a symbol string by applying one of the rules from the grammar. The rule must be one of those with a LHS that occurs in the symbol string. The result of the derivation step is another symbol string, one in which every occurrence of the LHS symbol from the rule is replaced by the RHS of the rule. For example, if A, B, C, D, and X are symbols, and
X ::= B C
is a rule, then
A X D -> A B C D
is a derivation step, with “A X D” as its beginning and “A B C D” as its end or result.
A derivation is a sequence of derivation steps. The length of a derivation is its length in steps.
“A X D” derives the symbol string “A B C D” in one step.
“A X D” directly derives the symbol string “A B C D”.
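The derivation step in the example can be sketched in a few lines of C. This is only an illustration of the definition, under the assumption that each symbol is a single character; the function name `derive_step()` is hypothetical and nothing here is part of Libmarpa.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Symbols are single characters here (an assumption for brevity).
   Replace every occurrence of `lhs` in `in` with the string `rhs`,
   writing the result to `out`. Returns the number of replacements,
   or -1 if `out` is too small. */
static int derive_step(const char *in, char lhs, const char *rhs,
                       char *out, size_t out_size)
{
    size_t rhs_len = strlen(rhs);
    size_t used = 0;
    int replaced = 0;
    assert(out_size > 0);
    for (; *in != '\0'; in++) {
        if (*in == lhs) {
            if (used + rhs_len >= out_size) return -1;
            memcpy(out + used, rhs, rhs_len);
            used += rhs_len;
            replaced++;
        } else {
            if (used + 1 >= out_size) return -1;
            out[used++] = *in;
        }
    }
    out[used] = '\0';
    return replaced;
}
```

Applying the rule `X ::= B C` to the string `AXD` with this function yields `ABCD`, matching the step shown above.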
Technically, a symbol X and a string that consists of only that symbol are two different things. But we often say “the symbol X” as shorthand for “the string of length 1 whose only symbol is X”. For example, if the string containing only the symbol X derives a string Y, we will usually say simply that “X derives Y”.
Wherever symbol or string X derives Y, we may also say X produces Y. Derivations are often described as symbol matches. Wherever symbol or string X derives Y, we may also say that Y matches X or that X matches Y. It is particularly common to say that X matches Y when X or Y is a sentence.
The parse of an input by a grammar is successful if and only if, according to the grammar, the start symbol produces the input sentence. The set of all input sentences that a grammar will successfully parse is the language of the grammar.
The zero length symbol string is called the empty string. The empty string can be considered to be a sentence, in which case it is the empty sentence. A string of one or more symbols is non-empty. A derivation which produces the empty string is a null derivation. A derivation from the start symbol which produces the empty string is a null parse.
If a symbol has a null derivation, it is a nullable symbol. If the only sentence produced by a symbol is the empty sentence, it is a nulling symbol. All nulling symbols are nullable symbols.
If a symbol is not nullable, it is non-nullable. If a symbol is not nulling, it is non-nulling.
A rule is nullable iff it is the rule of the first step of a null derivation. A rule is nullable iff its LHS symbol is nullable.
A rule R is nulling iff every derivation whose first step has R as its rule is a null derivation. A rule is nulling iff its LHS symbol is nulling.
If a rule is not nullable, it is non-nullable. If a rule is not nulling, it is non-nulling.
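The definition of a nullable symbol suggests a simple way to compute the set: keep marking the LHS of any rule whose RHS symbols are all already marked, until nothing changes. The sketch below shows that fixed-point iteration. The `Rule` structure and `find_nullable()` are hypothetical names, and this is not a description of how Libmarpa itself performs the computation.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RHS 4

/* A BNF rule over integer symbol IDs; rhs_len == 0 is an empty rule. */
typedef struct {
    int lhs;
    int rhs[MAX_RHS];
    int rhs_len;
} Rule;

/* Mark nullable[s] = true for every symbol s with a null derivation.
   Repeats passes over the rules until no new symbol is marked. */
static void find_nullable(const Rule *rules, int n_rules,
                          bool *nullable, int n_symbols)
{
    bool changed = true;
    for (int s = 0; s < n_symbols; s++) nullable[s] = false;
    while (changed) {
        changed = false;
        for (int r = 0; r < n_rules; r++) {
            const Rule *rule = &rules[r];
            bool all_nullable = true;
            assert(rule->lhs >= 0 && rule->lhs < n_symbols);
            assert(rule->rhs_len >= 0 && rule->rhs_len <= MAX_RHS);
            for (int i = 0; i < rule->rhs_len; i++) {
                if (!nullable[rule->rhs[i]]) { all_nullable = false; break; }
            }
            if (all_nullable && !nullable[rule->lhs]) {
                nullable[rule->lhs] = true;
                changed = true;
            }
        }
    }
}
```

An empty rule has no RHS symbols to check, so its LHS is marked on the first pass; every other nullable symbol follows from those.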
If any derivation from the start symbol uses a rule, that rule is called reachable or accessible. A rule that is not accessible is called unreachable or inaccessible. If any derivation which results in a sentence uses a rule, that rule is said to be productive. A rule that is not productive is called unproductive. A rule is productive iff every symbol on its RHS is productive. A symbol is productive iff it is a terminal or it is the LHS of a productive rule. A rule which is inaccessible or unproductive is called a useless rule. Marpa can handle grammars with useless rules.
A symbol is reachable or accessible if it appears in a reachable rule. If a symbol is not reachable, it is unreachable or inaccessible. A symbol is productive if it appears on the LHS of a productive rule, or if it is a nullable symbol. If a symbol is not productive, it is unproductive. A symbol which is inaccessible or unproductive is called a useless symbol. Marpa can handle grammars with useless symbols.
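Reachability can be computed in the same spirit: start from the start symbol and keep marking the RHS symbols of any rule whose LHS is already marked. The following sketch illustrates this under the assumption that symbols are small integer IDs; `Rule` and `find_reachable()` are hypothetical names, not Libmarpa calls.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RHS 4

/* A BNF rule over integer symbol IDs. */
typedef struct { int lhs; int rhs[MAX_RHS]; int rhs_len; } Rule;

/* Mark every symbol that some derivation from `start` can contain.
   A rule is then reachable exactly when its LHS is marked. */
static void find_reachable(const Rule *rules, int n_rules, int start,
                           bool *reachable, int n_symbols)
{
    bool changed = true;
    assert(start >= 0 && start < n_symbols);
    for (int s = 0; s < n_symbols; s++) reachable[s] = false;
    reachable[start] = true;
    while (changed) {
        changed = false;
        for (int r = 0; r < n_rules; r++) {
            if (!reachable[rules[r].lhs]) continue;
            for (int i = 0; i < rules[r].rhs_len; i++) {
                int sym = rules[r].rhs[i];
                if (!reachable[sym]) { reachable[sym] = true; changed = true; }
            }
        }
    }
}
```

A symbol left unmarked after the loop is inaccessible, and so is every rule that has it as its LHS.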
If any symbol in the grammar non-trivially produces a symbol string containing itself, the grammar is said to be recursive. If any symbol non-trivially produces a symbol string with itself on the left, the grammar is said to be left-recursive. If any symbol non-trivially produces a symbol string with itself on the right, the grammar is said to be right-recursive. Marpa can handle all recursive grammars, including grammars which are left-recursive, grammars which are right-recursive, and grammars which contain both left- and right-recursion.
A cycle is a non-trivial derivation of a string of symbols from itself. If it is not possible for any derivation using a grammar to contain a cycle, then that grammar is said to be cycle-free. Traditionally, a grammar is considered useless if it is not cycle-free.
The traditional deprecation of cycles is well-founded. A cycle is the parsing equivalent of an infinite loop. Once a cycle appears, it can be repeated over and over again. Even a very short input sentence can have an infinite number of parses when the grammar is not cycle-free.
For that reason, a grammar which contains a cycle is also called infinitely ambiguous. Marpa can parse with grammars which are not cycle-free, and will even parse inputs that cause cycles. When a parse is infinitely ambiguous, Marpa limits cycles to a single loop, so that only a finite number of parses is returned.
In this document, unless otherwise stated,
For brevity, in contexts where the meaning is clear, we refer to a tree node simply as a node. Especially when looked at from the point of view of its labels, a node is often called an instance.
A node is a pair of tuples:
We note that this definition of a tree node is recursive.
In the following list of definitions and assertions, let
nd = [ [ sym, start, end ], children ]
be a tree node:
end-start.
start = end.
Let nd1 and nd2 be two nodes. If nd2 is a child of nd1, then nd1 is the parent of nd2.
We define ancestor recursively such that nd1 is the ancestor of a node nd2 iff one of the following is true:
Similarly, we define descendant recursively such that nd1 is the descendant of a node nd2 iff one of the following is true:
A tree is its own root node. That implies that, in fact, tree and node are just two different terms for the same thing. We usually speak of trees when we are thinking of the nodes/trees as a collection of nodes, and we speak of nodes when we are more focused on the individual nodes.
A parse forest is a set of one or more parse trees. Each tree represents a parse.
We have used “parse” as a noun in several senses. Depending on context a “parse” may be
When the meaning of “parse” is not clear in context, we will be explicit about which sense is intended.
[ TODO: give example of tree ] [ TODO: define path ] [ TODO: define left vs. right ] [ TODO: define cut ] [ TODO: define frontier ] [ TODO: define top-down traversal ] [ TODO: define bottom-up traversal ]
The structure of a parse can be represented as a series of derivation steps from the start symbol to the input. The node at the root of the tree is also called the start node.
Marpa allows ambiguous grammars. Traditionally we say that a parse is ambiguous if, for a given grammar and a given input, more than one derivation tree is possible. However, Marpa allows ambiguous input tokens, which the traditional definition does not take into account. If Marpa used the traditional definition, all grammars would be ambiguous except those grammars which allowed only the null parse.
[ TODO: Rewrite two reasons to differ from traditional definition – ambiguous tokens and pruned null forests. Def is that cardinality of forest > 1. ]
It would be easiest if the Marpa definition and the traditional definition were extensionally equivalent, that is, if Marpa’s set of ambiguous grammars was exactly the same as the set of traditionally ambiguous grammars. This can be accomplished by using a slightly altered definition. In the Marpa context, a grammar is ambiguous if and only if, for some UNAMBIGUOUS stream of input tokens, that grammar produces more than one parse tree.
A parser is an algorithm that takes a string of symbols (tokens or characters) and finds a structure in it. Traditionally, that structure is a tree.
Rarely is an application interested only in the tree. Usually the idea is that the string “means” something: the idea is that the string has a semantics. Traditionally and most often, the tree is an intermediate step in producing a value, a value which represents the “meaning” or “semantics” of the string. Evaluating a tree means finding its semantics.
In real life, the structure of a parse is usually a means to an end. Grammars usually have a semantics associated with them, and what the user actually wants is the value of the parse according to the semantics.
The tree representation is especially useful when evaluating a parse. In the traditional method of evaluating a parse tree, every node which represents a terminal symbol has a value associated with it on input. Recall that nodes are often called “instances” of their symbols or rules. Semantics is associated with instances of rules or of lexemes.
Non-null inner nodes take their semantics from the rule whose LHS they represent. Nulled nodes are dealt with as special cases.
The semantics for a rule describe how to calculate the value of the node which represents the LHS (the parent node) from the values of zero or more of the nodes which represent the RHS symbols (child nodes). Values are computed recursively, bottom-up. The value of a parse tree is the value of its start symbol.
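The recursive, bottom-up computation described above can be sketched as follows. The `Node` layout, the two rule IDs, and their arithmetic semantics are assumptions chosen for illustration; nulled nodes, which the text treats as special cases, are left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 3

/* A parse-tree node. Leaves carry a token value supplied on input;
   inner nodes carry the ID of the rule whose LHS they represent. */
typedef struct Node {
    int rule_id;                          /* used by inner nodes */
    int token_value;                      /* used by leaves */
    int n_children;                       /* 0 for a leaf */
    const struct Node *children[MAX_CHILDREN];
} Node;

/* Hypothetical rule IDs; each one stands for a rule's semantics. */
enum { RULE_ADD = 0, RULE_MULTIPLY = 1 };

/* Compute a node's value from the values of its children, bottom-up. */
static int node_value(const Node *nd)
{
    assert(nd != NULL);
    assert(nd->n_children >= 0 && nd->n_children <= MAX_CHILDREN);
    if (nd->n_children == 0) return nd->token_value;
    assert(nd->rule_id == RULE_ADD || nd->rule_id == RULE_MULTIPLY);
    int result = (nd->rule_id == RULE_MULTIPLY) ? 1 : 0;
    for (int i = 0; i < nd->n_children; i++) {
        int child = node_value(nd->children[i]);
        if (nd->rule_id == RULE_MULTIPLY) result *= child;
        else result += child;
    }
    return result;
}
```

Calling `node_value()` on the root node gives the value of the whole tree, since each parent is evaluated only after its children.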
An application behavior is a behavior on which it is intended that the design of applications will be based. In this document, a behavior is an application behavior unless otherwise stated. Most of the behaviors specified in this document are application behaviors. We sometimes say that “applications may expect” a certain behavior to emphasize that that behavior is an application behavior.
After an irrecoverable failure, the behavior of a Libmarpa application is undefined, so that there are no behaviors that can be relied on for normal application processing, and therefore, there are no application behaviors. In this circumstance, some of the application behaviors become diagnostic behaviors. A diagnostic behavior is a behavior that this document suggests that the programmer may attempt in the face of an irrecoverable failure, for purposes of testing, diagnostics and debugging. Diagnostic behaviors are hoped for, rather than expected, and intended to allow the programmer to deal with irrecoverable failures as smoothly as possible. (See Failure.)
In this document, a behavior is a diagnostic behavior only if that is specifically indicated. Applications should not be designed to rely on diagnostic behaviors. We sometimes say that “diagnostics may attempt” a certain behavior to emphasize that that behavior is a diagnostic behavior.
The classes of Libmarpa’s object system fall into two types: major and numbered. These are Libmarpa’s major classes, in sequence.
The major objects have one letter abbreviations, which are used frequently. These are, in the standard sequence,
All of Libmarpa’s major classes, except the configuration class, are “time” classes. Except for objects in the grammar class, all time objects are created from another time object. Each time object is created from a time object of the class before it in the sequence. A recognizer cannot be created without a precomputed grammar; a bocage cannot be created without a recognizer; and so on.
When one time object is used to create a second time object, the first time object is the parent object and the second time object is the child object. For example, when a bocage is created from a recognizer, the recognizer is the parent object, and the bocage is the child object.
Grammars have no parent object. Every other time object has exactly one parent object. Value objects have no child objects. All other time objects can have any number of children, from zero up to some machine-determined limit, such as memory.
An object is the ancestor of another object if it is the parent of that object, or if it is the parent of an ancestor of that object. An object is the descendant of another object if it is the child of that object, or if it is the child of a descendant of that object. The following three statements are mutually exclusive:
X is of class C.
X has an ancestor of class C.
X has a descendant of class C.
It follows from the definitions of “parent” and “ancestor” that, for any time object class, an object can have at most one ancestor of that class. On the other hand, if an object has descendants in a class, there can be many of them.
An object is a base of another object, if it is that object, or if it is the ancestor of the object. For each time object class, an object has at most one base object. For example, a recognizer is its own base recognizer, and has exactly one base grammar.
The base grammar of a time object is of special importance. Every time object has a base grammar. A grammar object is its own base grammar. The base grammar of a recognizer is its parent grammar, the one that it was created with. The base grammar of any other time object is the base grammar of its parent object. For example, the base grammar of a bocage is the base grammar of the recognizer that it was created with.
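Finding a base object amounts to walking the chain of parents until an object of the wanted class is found. The sketch below models this with a hypothetical `TimeObject` structure and a `base_of()` helper; it covers only the grammar, recognizer and bocage classes named above, and it is not how Libmarpa stores its objects.

```c
#include <assert.h>
#include <stddef.h>

/* Three of the time classes, in creation order. The enum itself is
   hypothetical and exists only for this illustration. */
typedef enum { CLASS_GRAMMAR, CLASS_RECOGNIZER, CLASS_BOCAGE } TimeClass;

typedef struct TimeObject {
    TimeClass cls;
    const struct TimeObject *parent;   /* NULL only for a grammar */
} TimeObject;

/* Return the base object of class `cls`: the object itself or its
   ancestor of that class, or NULL if there is none. */
static const TimeObject *base_of(const TimeObject *obj, TimeClass cls)
{
    assert(obj != NULL);
    for (; obj != NULL; obj = obj->parent)
        if (obj->cls == cls) return obj;
    return NULL;
}
```

Because every object has at most one parent, the walk visits at most one object of each class, which is why a base object, when it exists, is unique.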
Every object in a “time” class has its own, distinct, lifetime, which is controlled by the object’s reference count. Reference counting follows the usual practice. Contexts that take a share of the “ownership” of an object increase the reference count by 1. When a context relinquishes its share of the ownership of an object, it decreases the reference count by 1.
Each class of time object has a “ref” and an “unref” method, to be used by those contexts that need to explicitly increment and decrement the reference count. For example, the “ref” method for the grammar class is marpa_g_ref() and the “unref” method for the grammar class is marpa_g_unref().
Time objects do not have explicit destructors. When the reference count of a time object reaches 0, that time object is destroyed.
Much of the necessary reference counting is performed automatically. The context calling the constructor of a time object does not need to explicitly increase the reference count, because Libmarpa time objects are always created with a reference count of 1.
Child objects “own” their parents, and when a child object is successfully created, the reference count of its parent object is automatically incremented to reflect this. When a child object is destroyed, it automatically decrements the reference count of its parent.
In a typical application, a calling context needs only to remember to “unref” each time object that it creates, once it is finished with that time object. All other reference decrements and increments are taken care of automatically. The typical application never needs to explicitly call one of the “ref” methods.
More complex applications may find it convenient to have one or more contexts share ownership of objects created in another context. These more complex situations are the only cases in which the “ref” methods will be needed.
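The ownership rules in this section can be modeled with a small stand-alone sketch. The `Obj` structure and the `obj_new()`, `obj_ref()` and `obj_unref()` functions are hypothetical stand-ins for a time object and its constructor, “ref” and “unref” methods; the code illustrates the counting discipline only and is not Libmarpa's implementation.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Obj {
    int refcount;
    struct Obj *parent;   /* NULL for an object with no parent */
} Obj;

/* Create an object with a reference count of 1. A child takes a share
   of the ownership of its parent. Returns NULL if allocation fails. */
static Obj *obj_new(Obj *parent)
{
    Obj *o = malloc(sizeof *o);
    if (o == NULL) return NULL;
    o->refcount = 1;
    o->parent = parent;
    if (parent != NULL) parent->refcount++;
    return o;
}

/* Take an additional share of the ownership. */
static void obj_ref(Obj *o)
{
    assert(o != NULL && o->refcount > 0);
    o->refcount++;
}

/* Drop one reference; at zero, destroy the object and release its
   share of the parent. */
static void obj_unref(Obj *o)
{
    assert(o != NULL && o->refcount > 0);
    if (--o->refcount == 0) {
        Obj *parent = o->parent;
        free(o);
        if (parent != NULL) obj_unref(parent);
    }
}
```

In this model, a caller that creates a grammar and a recognizer and then unrefs both, in either order, leaves nothing allocated: the grammar survives until the recognizer that owns a share of it is destroyed.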
In addition to its major, “time” objects, Libmarpa also has numbered objects. Numbered objects do not have lifetimes of their own. Every numbered object belongs to a time object, and is destroyed with it. Rules and symbols are numbered objects. Token values are another class of numbered objects.
In traditional Earley parsers, the concept of location is very simple. Locations are numbered from 0 to n, where n is the length of the input. Every location has an Earley set, and vice versa. Location 0 is the start location. Every location after the start location has exactly one input token associated with it.
Some applications do not fit this traditional input model — natural language processing requires ambiguous tokens, for example. Libmarpa allows a wide variety of alternative input models.
In Libmarpa a location is called an earleme. The number of an Earley set is the ID of the Earley set, or its ordinal. In the traditional model, the ordinal of an Earley set and its earleme are always exactly the same, but in Libmarpa’s advanced input models the ordinal of an Earley set can be different from its location (earleme).
The important earleme values are the latest earleme, the current earleme, and the furthest earleme. Latest, current and furthest earleme, when they have specified values, obey a lexical order in this sense: The latest earleme is always at or before the current earleme, and the current earleme is always at or before the furthest earleme.
The latest Earley set is the Earley set completed most recently. This is initially the Earley set at location 0. The latest Earley set is always the Earley set with the highest ordinal, and the Earley set with the highest earleme location. The latest earleme is the earleme of the latest Earley set. If there is an Earley set at the current earleme, it is the latest Earley set and the latest earleme is equal to the current earleme. There is never an Earley set after the current earleme, and therefore the latest Earley set is never after the current earleme. The marpa_r_start_input() and marpa_r_earleme_complete() methods are the only ones that change the latest earleme. See marpa_r_start_input(), and marpa_r_earleme_complete().
The latest earleme is different from the current earleme if and only if there is no Earley set at the current earleme. A different end of parsing can be specified, but by default, parsing is of the input in the range from earleme 0 to the latest earleme.
The current earleme is the earleme that Libmarpa is currently working on. More specifically, it is the one at which new tokens will start. Since tokens are never zero length, a new token will always end after the current earleme. marpa_r_start_input() initializes the current earleme to 0, and every call to marpa_r_earleme_complete() advances the current earleme by 1. The marpa_r_start_input() and marpa_r_earleme_complete() methods are the only ones that change the current earleme. See marpa_r_start_input(), and marpa_r_earleme_complete().
Loosely speaking, the furthest earleme is the furthest earleme reached by the parse. More precisely, it is the highest numbered earleme at which a token ends and is 0 if there are no tokens. The furthest earleme is 0 when a recognizer is created. With every call to marpa_r_alternative(), the end of the token it adds is calculated. A token ends at the earleme location current+length, where current is the current earleme, and length is the length of the newly added token. If old_f is the furthest earleme before a call to marpa_r_alternative(), the furthest earleme after the call is max(old_f, current+length). The marpa_r_new() and marpa_r_alternative() methods are the only ones that change the furthest earleme. See marpa_r_new(), and marpa_r_alternative().
In the basic input models, where every token has length 1, calling marpa_r_earleme_complete() after each marpa_r_alternative() call is sufficient to process all inputs, and the furthest earleme’s value can typically be ignored.
In alternative input models, where tokens have lengths greater than 1, calling marpa_r_earleme_complete() once after the last token is read may not be enough to ensure that all tokens have been processed. To ensure that all tokens have been processed, an application must advance the current earleme by calling marpa_r_earleme_complete(), until the current earleme is equal to the furthest earleme.
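The bookkeeping for the current and furthest earlemes can be illustrated with a toy model. The sketch below does not use a Libmarpa recognizer: `EarlemeModel` and the `model_*` functions are hypothetical, and they mimic only the arithmetic described in this section, namely the max(old_f, current+length) update and the loop that advances the current earleme until it equals the furthest earleme.

```c
#include <assert.h>

/* A toy model of two of the earleme values; it is not Libmarpa's
   recognizer, and the names below are hypothetical. */
typedef struct {
    int current;
    int furthest;
} EarlemeModel;

static int max_int(int x, int y) { return x > y ? x : y; }

/* Models the start of input: both earlemes are 0. */
static void model_start(EarlemeModel *m) { m->current = 0; m->furthest = 0; }

/* Models adding a token of `length` earlemes at the current earleme. */
static void model_alternative(EarlemeModel *m, int length)
{
    assert(length >= 1);   /* tokens are never zero length */
    m->furthest = max_int(m->furthest, m->current + length);
}

/* Models completing an earleme: the current earleme advances by 1. */
static void model_earleme_complete(EarlemeModel *m) { m->current++; }

/* Advance until every token added so far has ended. Returns the
   number of completions performed. */
static int model_drain(EarlemeModel *m)
{
    int calls = 0;
    while (m->current < m->furthest) {
        model_earleme_complete(m);
        calls++;
    }
    return calls;
}
```

With tokens of length 1 the drain loop never runs more than once per token, which is why the furthest earleme can usually be ignored in the basic input models.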
For the purposes of presentation, we (somewhat arbitrarily) divide Libmarpa’s input models into two groups: basic and advanced. In the basic models of input, every token is exactly one earleme long. This implies that, in a basic model of input,
In the advanced models of input, tokens may have a length other than 1. Most applications use the basic input models. The details of the advanced models of input are presented in a later chapter. See Advanced input models.
In the standard model of input, there is exactly one successful marpa_r_alternative() call immediately previous to every marpa_r_earleme_complete() call. A marpa_r_alternative() call is immediately previous to a marpa_r_earleme_complete() call iff that marpa_r_earleme_complete() call is the first marpa_r_earleme_complete() call after the marpa_r_alternative() call. Recall that, since the standard model is a basic model, the token length in every successful call to marpa_r_alternative() will be one.
For an input of length n, there will be exactly n marpa_r_earleme_complete() calls, and all but the last call to marpa_r_earleme_complete() must be successful.
In the standard model, after a successful call to marpa_r_alternative(), if c is the value of the current earleme before the call,
In the standard model, a call to marpa_r_earleme_complete() follows a successful call of marpa_r_alternative(), so that the value of the furthest earleme before the call to marpa_r_earleme_complete() will be c+1, where c is the value of the current earleme.
After a successful call to marpa_r_earleme_complete(), c+1; and
Recall that, in the basic models of input, the latest earleme is always equal to the current earleme.
We can loosen the standard model to allow more than one successful call to marpa_r_alternative() immediately previous to each call to marpa_r_earleme_complete(). This change will mean that multiple tokens become possible at each earleme or, in other words, that the input becomes ambiguous. We continue to require that there be at least one successful call to marpa_r_alternative() before each call to marpa_r_earleme_complete(). And we recall that, since this is a basic input model, all tokens must have a length of 1.
In the ambiguous input model, the behavior of the current, latest and furthest earlemes is exactly as described for the standard model. See The standard model of input.
Traditionally, a terminal symbol is a symbol that may appear in the input. Traditional grammars divide all symbols sharply into terminals and non-terminals: A terminal symbol must always be used as a terminal. A non-terminal symbol can never be used as a terminal.
In Libmarpa, by default, a symbol is a terminal, and therefore may appear in the input iff both of the following are true:
Marpa’s default behavior follows tradition. A now-deprecated feature of Marpa allowed for LHS terminals. See LHS terminals. Most readers will want to stick to Marpa’s default behavior, and can and should ignore the possibility of LHS terminals. Even when LHS terminals are allowed, terminals can never be zero length.
In Libmarpa, every terminal instance has a token value associated with it. Token values are ints. Libmarpa does nothing with token values except accept them from the application and return them during parse evaluation.
A parse is exhausted when it cannot accept any further input. A parse is active iff it is not exhausted. For a parse to be exhausted, the furthest earleme and the current earleme must be equal. However, the converse is not always the case: if more tokens can be read at the current earleme, then it is possible for the furthest earleme and the current earleme to be equal in an active parse.
Parse exhaustion always has a location. That is, if a parse is exhausted it is exhausted at some earleme location X. If a parse is exhausted at location X, then
X
.
X
.
X
.
X
.
X
.
X
.
X
.
marpa_r_alternative()
after a parser has become exhausted.
X
.
marpa_r_earleme_complete()
after a parser has become exhausted.
Users sometimes assume that parse exhaustion means parse failure. But other users sometimes assume that parse exhaustion means parse success. For many grammars, there are strong associations between parse exhaustion and parse success, but the strong association can go either way. Both exhaustion-loving and exhaustion-hating grammars are very common in practical applications.
In an exhaustion-hating application, parse exhaustion typically means parse failure. C programs, Perl scripts and most programming languages are exhaustion-hating applications. If a C program is well-formed, it is always possible to read more input. The same is true of a Perl program that does not have a __DATA__ section.
In an exhaustion-loving application parse exhaustion means parse success. A toy example of an exhaustion-loving application is the language consisting of balanced parentheses. When the parentheses come into perfect balance the parse is exhausted, because any further input would unbalance the brackets. And the parse succeeds when the parentheses come into perfect balance. Exhaustion means success. Any language that balances start and end indicators will tend to be exhaustion-loving. HTML and XML, with their start and end tags, can be seen as exhaustion-loving languages.
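The balanced-parentheses example can be made concrete with a small scanner. This is a hand-written sketch, not a Libmarpa parse: toy_exhausted_at() is a hypothetical function that reports the location at which a parse of the toy language would become exhausted.

```c
#include <assert.h>
#include <stddef.h>

/* Scan a string over the alphabet "(" and ")".  Return the number
   of characters read at the point where the parentheses first come
   into perfect balance, which is where the toy parse is exhausted
   and also where it succeeds.  Return -1 if the parse never
   reaches that point. */
static int toy_exhausted_at(const char *input)
{
    int depth = 0;
    size_t i;
    assert(input != NULL);
    for (i = 0; input[i] != '\0'; i++) {
        if (input[i] == '(')
            depth++;
        else if (input[i] == ')')
            depth--;
        else
            return -1;            /* not in the toy alphabet */
        if (depth < 0)
            return -1;            /* ")" with no matching "(" */
        if (depth == 0)
            return (int)(i + 1);  /* balanced: exhausted here */
    }
    return -1;                    /* input ended, parse still active */
}
```

For the input "(()())x", the sketch reports exhaustion at location 6: nothing after the sixth character can be accepted. For "(()", the parse is still active when the input ends, and the sketch returns -1.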
One common form of exhaustion-loving parsing occurs in lexers that look for longest matches. Exhaustion will indicate that the longest match has been found.
It is possible for a language to be exhaustion-loving at some points and exhaustion-hating at others. We mentioned Perl’s __DATA__ as a complication in a basically exhaustion-hating language.
marpa_r_earleme_complete() and marpa_r_start_input() are the only methods that may encounter parse exhaustion. See marpa_r_earleme_complete(), and marpa_r_start_input().
When the marpa_r_start_input() or marpa_r_earleme_complete() methods exhaust the parse, they generate a MARPA_EVENT_EXHAUSTED event. Applications can also query parse exhaustion status directly with the marpa_r_is_exhausted() method. See marpa_r_is_exhausted().
Libmarpa’s handling of semantics is unusual. Most semantics are left up to the application, but Libmarpa guides them. Specifically, the application is expected to maintain the evaluation stack. Libmarpa’s valuator provides instructions on how to handle the stack. Libmarpa’s stack handling instructions are called “steps”. For example, a Libmarpa step might tell the application that the value of a token needs to go into a certain stack position. Or a Libmarpa step might tell the application that a rule is to be evaluated. For rule evaluation, Libmarpa will tell the application where the operands are to be found, and where the result must go.
The detailed discussion of Libmarpa’s handling of semantics is in the reference chapters of this document, under the appropriate methods and classes. The most extensive discussion of the semantics is in the section that deals with the methods of the value time class (Value methods).
Libmarpa is thread-safe, given circumstances as described below. The Libmarpa methods are not reentrant.
Libmarpa is C89-compliant. It uses no global data, and calls only the routines that are defined in the C89 standard and that can be made thread-safe. In most modern implementations, the default C89 implementation is thread-safe to the extent possible. But the C89 standard does not require thread-safety, and even most modern environments allow the user to turn thread safety off. To be thread-safe, Libmarpa must be compiled and linked in an environment that provides thread-safety.
While Libmarpa can be used safely across multiple threads, a Libmarpa grammar cannot be. Further, a Libmarpa time object can only be used safely in the same thread as its base grammar. This is because all time objects with the same base grammar share data from that base grammar.
To work around this limitation, the same grammar definition can be used to create a new Libmarpa grammar time object in each thread. If there is sufficient interest, future versions of Libmarpa could allow thread-safe cloning of grammars and other time objects.
Traditionally, grammars only allow BNF rules. Libmarpa allows sequence rules, which express sequences by allowing a single RHS symbol to be repeated.
A sequence rule consists of a LHS and a RHS symbol. Additionally, the application must indicate the minimum number of repetitions. The minimum count must be 0 or 1.
Optionally, a separator symbol may be specified. For example, a comma-separated sequence of numbers
1,42,7192,711,
may be recognized by specifying the rule Seq ::= num and the separator comma ::= ','. By default, an optional final separator, as shown in the example above, is recognized, but “proper separation” may also be specified. In proper separation separators must, in fact, come between (“separate”) items of the sequence. A final separator is not a separator in the strict sense, and therefore is not recognized when proper separation is in effect. For more on specifying sequence rules, see marpa_g_sequence_new.
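The difference between the default separation and proper separation can be illustrated with a hand-written recognizer over a token string. This is a sketch under stated assumptions, not how Libmarpa implements sequence rules: toy_sequence_matches() is a hypothetical function, and each token is encoded as one character, "n" for num and "," for comma.

```c
#include <assert.h>

/* Return 1 if tokens is a separated sequence of "n" items with at
   least min_count items, and 0 otherwise.  If proper is non-zero,
   a final separator is rejected; otherwise it is optional. */
static int toy_sequence_matches(const char *tokens, int min_count,
                                int proper)
{
    int items = 0;
    const char *p = tokens;
    assert(min_count == 0 || min_count == 1);
    while (*p == 'n') {
        items++;
        p++;
        if (*p != ',')
            break;          /* no separator: the sequence ends here */
        if (p[1] == 'n') {
            p++;            /* a separator that comes between items */
            continue;
        }
        if (!proper)
            p++;            /* accept the optional final separator */
        break;
    }
    /* Succeed only if the whole input was consumed. */
    return *p == '\0' && items >= min_count;
}
```

The example input 1,42,7192,711, is encoded as "n,n,n,n,". It matches under the default, but not when proper separation is in effect.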
Sequence rules are “sugar” — their presence in the Libmarpa interface does not extend its power. Every Libmarpa grammar that can be written using sequence rules can be rewritten as a grammar without sequence rules.
The RHS symbol and the separator, if there is one, must not be nullable. This is because it is not completely clear what an application intends when it asks for a sequence of items, some of which are nullable — the most natural interpretation of this usually results in a highly ambiguous grammar.
Libmarpa allows highly ambiguous grammars and a programmer who wants a grammar with sequences containing nullable items or separators can write that grammar using BNF rules. The use of BNF rules makes it clearer that ambiguity is what the programmer intended, and allows the programmer more flexibility.
A sequence rule must have a dedicated LHS — that is, the LHS of a sequence rule must not be the LHS of any other rule. This implies that the LHS of a sequence rule can never be the LHS of a BNF rule.
The requirement that the LHS of a sequence rule be unique is imposed for reasons similar to those for the prohibition against RHS and separator nullables. Often reuse of the LHS of a sequence rule is simply a mistake. Even when deliberate, reuse of the LHS results in a complex grammar, one which often parses in ways that the programmer did not intend.
A programmer who believes they know what they are doing, and really does want alternative sequences starting at the same input location, can specify this behavior indirectly. They can do this by creating two sequence rules with distinct LHS’s:
Seq1 ::= Item1
Seq2 ::= Item2
and adding a new “parent” LHS which recognizes the sequences as alternatives.
SeqChoice ::= Seq1
SeqChoice ::= Seq2
In Libmarpa, there is no direct way to mark a symbol nullable or nulling. All Libmarpa’s terminal symbols are non-nullable. By default, Libmarpa’s non-terminal symbols are nullable or nulling depending on the rules in which they appear on the LHS. The default behavior for non-terminals can be changed (see LHS terminals), but this is deprecated.
To make a symbol x nullable, a user must create a nulling rule whose LHS is x. The empty rule is nulling, so that one way a user can ensure x is nullable is by making it the LHS of an empty rule. If every rule with x on the LHS is nulling, x will be not just nullable, but nulling as well.
In the valuator, every nulling tree is pruned back to its topmost nulling symbol. This means that there are no nulling rules in the valuator, only nulling symbols. For an example of how this works, see Example of nulled symbol.
While this may sound draconian, the “lost” semantics of the nulled rules and non-topmost nulled symbols are almost never missed. Nulled subtrees cannot contain input, and therefore do not contain token symbols. So no token values are lost when nulled subtrees are pruned, and we are dealing with the semantics of the empty string. See Evaluating nulled symbols.
Libmarpa leaves the semantics to an upper layer, so that we usually treat semantics as outside the scope of this document. But most upper layers will find that nulled symbols are a corner case for their semantics, and we therefore offer the writers of upper layers some hints.
Typically, upper layers will assign semantics to a LHS symbol based on the rule instance in which the LHS occurs. All nulled symbols are LHS symbols, but the valuator prunes all nulled rules, forcing the application to determine the semantics of a nulled symbol instance based on its symbol.
One method of making this determination is the one which is implemented in Marpa::R2. Call a grammar g; let x be a symbol that is nulled in a parse that uses g; and call a rule in g with x on its LHS, an “x LHS rule”. Marpa::R2 assigns a semantics to x using the first of the following guidelines that applies:
Marpa::R2 assigns that shared semantics to x.
Marpa::R2 assigns the semantics of that empty rule to x.
Marpa::R2 reports an error.
In theory, the semantics of nulled symbols, like any semantics, can be arbitrarily complex. In practice, we are dealing with the semantics of the empty string, which is literally the “semantics of nothing”. If what we are dealing with truly is primarily a parsing problem, we can usually expect that the semantics of nothing will be simple.
The possible subtrees below a nulled symbol can be seen as a set, and that set is a constant that depends on the grammar. Since the input corresponding to the nulled symbol is also a constant (the empty string), the semantics of a nulled symbol will also be constant, with a few exceptions:
All of these exceptions are unusual or rare. When they do occur, the upper layer can implement the semantics of the nulled symbols with a function or a closure.
As already stated, Marpa prunes every null subtree back to its topmost null symbol. Here is an example grammar, with S as the start symbol.
S ::= L R
L ::= A B X
L ::=
R ::= A B Y
R ::=
A ::=
B ::=
X ::=
X ::= "x"
Y ::=
Y ::= "y"
If we let the input be ‘x’, we can write the unpruned parse tree in pre-order, depth-first, indenting children below their parents, like this:
0: Visible Rule: S := L R
     1: Visible Rule L := A B X
          1.1: Nulled Symbol A
          1.2: Nulled Symbol B
          1.3: Token, Value is "x"
     2: Nulled Rule, Rule R := A B Y
          2.1: Nulled Symbol A
          2.2: Nulled Symbol B
          2.3: Nulled Symbol Y
In this example, five symbols and a rule are nulled. The nulled rule and three of the nulled symbols are in a nulled subtree: 2, 2.1, 2.2 and 2.3. Marpa prunes every null subtree back to its topmost symbol, which in this case is the LHS of the rule numbered 2. The pruned tree looks like this:
0: Visible Rule: S := L R
     1: Visible Rule L := A B X
          1.1: Nulled Symbol A
          1.2: Nulled Symbol B
          1.3: Token, Value is "x"
     2: LHS of Nulled Rule, Symbol R
Nulled nodes 1.1, 1.2 and 2 were all kept, because they are topmost in their nulled subtree. All the other nulled nodes were discarded.
As a reminder, no language in this chapter (or, for that matter, in this document) should be read as providing, or suggesting the existence of, a warranty. See license. Also, see No warranty.
Libmarpa is a C language library, and inherits the traditional C language approach to avoiding and handling user programming errors. This approach will strike readers unfamiliar with this tradition as putting an appallingly large portion of the burden of avoiding application programmer error on the application programmer themself.
But in the early 1970’s, when the C language first stabilized, the alternative, and the consensus choice for its target applications, was assembly language. In that context, C was radical in its willingness to incur a price in efficiency in order to protect the programmer from themself. C was considered to take an excessively “hand holding” approach which very much flew in the face of consensus.
The decades have made a large difference in the trade-offs, and the consensus about the degree to which even a low-level language should protect the user has changed. It seems inevitable that C will be replaced as the low-level language of choice, by a language that places fewer burdens on the programmer, and more on the machine. The question seems to be not whether C will be dethroned as the “go to” language for low-level programming, but when, and by which alternative.
Modern hardware makes many simple checks essentially cost-free, and Libmarpa’s efforts to protect the application programmer go well beyond what would have been considered best practice in the past. But it remains a C language library. On the whole, the Libmarpa application programmer must be prepared to exercise the high degree of carefulness traditionally required by its C language environment. Libmarpa places the burden of avoiding irrecoverable failures, and of handling recoverable failures, largely on the application programmer.
This document specifies many behaviors for Libmarpa application programs to follow, such as the nature of the arguments to each method. The C language environment specifies many more behaviors, such as proper memory management. When a non-conformity to specified behavior is unintentional and problematic, it is frequently called a “bug”. Even the most carefully programmed Libmarpa application may sometimes contain a “bug”. In addition, some specified behaviors are explicitly stated as characterizing a primary branch of the processing, rather than made mandatory for all successful processing. Non-conformity to non-mandatory behaviors can be efficiently recoverable, and is often intentional.
This chapter describes how non-conformity to specified behavior by a Libmarpa application is handled by Libmarpa. Non-conformity to specified behavior by a Libmarpa application is also called, for the purposes of this document, a Libmarpa application programming failure. In contexts where no ambiguity arises, Libmarpa application programming failure will usually be abbreviated to failure.
Libmarpa application programming success in a context is defined as the absence of unrecovered failure in that context. When no ambiguity arises, Libmarpa application programming success is almost always abbreviated to success. For example, the success of an application means the application ran without any irrecoverable failures, and that it recovered from all the recoverable failures that were detected.
A Libmarpa application programming failure, unless stated otherwise, is an irrecoverable failure. Once an irrecoverable failure has occurred, the further behavior of the program is undefined. Nonetheless, we specify, and Libmarpa attempts, diagnostic behaviors (see Application and diagnostic behavior) in an effort to handle irrecoverable failures as smoothly as possible.
A Libmarpa application programming failure is not recoverable, unless this document states otherwise.
A failure is called a hard failure if it has an error code associated with it. A recoverable failure is called a soft failure if it has no associated error code. (For more on error codes, see Error codes.)
All failures fall into one of five types. In order of severity, these are
Failure to allocate memory is the most irrecoverable of irrecoverable errors. Even effective error handling assumes the ability to allocate memory, so that the practice has been, in the event of a memory allocation failure, to take Draconian action. On memory allocation failure, as with all irrecoverable failures, Libmarpa’s behavior is undefined, but Libmarpa attempts to terminate the current program abnormally by calling abort().
Memory allocation failure is the only case in which Libmarpa terminates the program. In all other cases, Libmarpa leaves the decision to terminate the program, whether normally or abnormally, up to the application programmer.
Memory allocation failure does not have an error code. As a pedantic matter, memory allocation failure is neither a hard nor a soft failure.
An undetected failure is a failure that the Libmarpa library does not detect. Many failures are impossible or impractical for a C library to detect. Two examples of failure that the Libmarpa methods do not detect are writes outside the bounds of allocated memory, and use of memory after it has been freed. C is not strongly typed, and arguments of Libmarpa routines undergo only a few simple tests, tests which are inadequate to detect many of the potential problems.
By undetected failure we emphasize that we mean failures undetected by the Libmarpa methods. In the examples just given, there exist tools that can help the programmer detect memory errors and other tools exist to check the sanity of method arguments.
This document points out some of the potentially undetected problems, when doing so seems more helpful than tedious. But any attempt to list all the undetected problems would be too large and unwieldy to be useful.
Undetected failure is always irrecoverable. An undetected failure is neither a hard nor a soft failure.
An irrecoverable hard failure is an irrecoverable Libmarpa application programming failure that has an error code associated with it. Libmarpa attempts to behave as predictably as possible in the face of a hard failure, but once an irrecoverable failure occurs, the behavior of a Libmarpa application is undefined.
In the event of an irrecoverable failure, there are no application behaviors. The diagnostic behavior for a hard failure is as described for the method that detects the hard failure. At a minimum, this diagnostic behavior will be returning from the method that detects the hard failure with the return value specified for hard failure, and setting the error code as specified for hard failure.
A partially recoverable hard failure is a recoverable Libmarpa application programming failure
For every partially recoverable hard failure, this document specifies the application behaviors that remain available after it occurs. The most common kind of partially recoverable hard failure is a library-recoverable hard failure. For an example of partially recoverable hard failure, see Library-recoverable hard failure.
A library-recoverable hard failure is a type of partially recoverable hard failure. Loosely described, it is a hard failure that allows the programmer to continue to use many of the Libmarpa methods in the library, but that disallows certain methods on some objects.
To state the restrictions of application behaviors more precisely, let the “failure grammar” be the base grammar of the method that detected the library-recoverable hard failure. After a library-recoverable hard failure, the following behaviors are no longer application behaviors:
Recall that any use of a behavior that is not an application behavior is an irrecoverable failure.
The application behaviors remaining after a library-recoverable hard failure are the following:
Note that Libmarpa destructors remain available after a library-recoverable failure. An application will often want to destroy all Libmarpa objects whose base grammar is the failure grammar, in order to clear memory of problematic objects.
An example of a library-recoverable hard failure is the MARPA_ERR_COUNTED_NULLABLE error in the marpa_g_precompute() method. See marpa_g_precompute().
An ancestry-recoverable hard failure is a type of partially recoverable hard failure. An ancestry-recoverable failure allows a superset of the application behaviors allowed by a library-recoverable hard failure. More precisely, let the “failure object” be the object that detected the ancestry-recoverable hard failure. After an ancestry-recoverable hard failure, the following behaviors are no longer application behaviors:
Recall that any use of a behavior that is not an application behavior is an irrecoverable failure.
The application behaviors remaining after an ancestry-recoverable hard failure are the following:
Note that all Libmarpa destructors remain available after an ancestry-recoverable failure. An application will often want to destroy the failure object and all of its descendants, in order to clear memory of problematic objects.
As an example, users calling marpa_g_precompute() will often want to treat a MARPA_EVENT_EARLEY_ITEM_THRESHOLD event as if it were an ancestry-recoverable hard failure. See marpa_g_precompute().
Library-recoverable failure is a special case of ancestry-recoverable failure. When the failure object is a grammar, ancestry-recoverable failure is synonymous with library-recoverable failure.
A fully recoverable hard failure is a recoverable Libmarpa application programming failure
One example of a fully recoverable hard failure is the error code MARPA_ERR_UNEXPECTED_TOKEN_ID. The “Ruby Slippers” parsing technique (see Ruby Slippers), which has seen extensive usage, is based on Libmarpa’s ability to recover from a MARPA_ERR_UNEXPECTED_TOKEN_ID error fully and efficiently.
A soft failure is a recoverable Libmarpa application programming failure that has no error code associated with it. Hard errors are assigned error codes in order to tell them apart. Error codes are not necessary or useful for soft errors, because there is at most one type of soft failure per Libmarpa method.
Soft failures are so called, because they are the least severe kind of failure. The most severe failures are “bugs”: unintended, and a symptom of a problem. Soft failures, on the other hand, are a frequent occurrence in normal, successful, processing. In the phrase “soft failure”, the word “failure” is used in the same sense that its cognate “fail” is used when we say that a loop terminates when it “fails” its loop condition. That “failure” is of a condition necessary to continue on a main branch of processing, and a signal to proceed on another branch.
It is expected that Libmarpa applications will be designed such that successful execution is based on the handling specified for soft failures. In fact, a non-trivial Libmarpa application can hardly be designed except on that basis.
As stated, every hard failure has an associated error code. Full descriptions of the error codes that are returned by the external methods are given in their own section (External error codes).
How the error code is accessed depends on the method that detects the hard failure associated with that error code. Methods for time objects always set the error code in the base grammar, from which it may be accessed using the error methods described below (Error methods). If a method has no base grammar, the way in which the error code for the hard failures that it detects can be accessed will be stated in the description of that method.
Since the error of a time object is set in the base grammar, it follows that every object with the same base grammar has the same error code. Objects with different base grammars may have different error codes.
While error codes are properties of a base grammar, irrecoverability is application-wide. That is, whenever any irrecoverable failure occurs, the entire application is irrecoverable. Once an application becomes irrecoverable, those Libmarpa objects with error codes for recoverable errors are still subject to the general irrecoverability.
The following chapters describe Libmarpa’s methods in detail.
The method descriptions are grouped into chapters and sections. Each such group of method descriptions begins, optionally, with an overview. These overviews, again optionally, end with a “cheat sheet”. The “cheat sheets” name the most important Libmarpa methods in that chapter or section, in the order in which they are typically used, and very briefly describe their purpose.
The overviews sometimes speak of an “archetypal” application. The archetypal Libmarpa application implements a complete logic flow, starting with the creation of a grammar, and proceeding all the way to the return of the final result from a value object. In the archetypal Libmarpa application, the grammar, input and semantics are all small but non-trivial.
Methods in Libmarpa follow a strict naming convention. All methods have a name beginning with marpa_, if they are part of the external interface. If an external method is not a static method, its name is prefixed with one of marpa_c_, marpa_g_, marpa_r_, marpa_b_, marpa_o_, marpa_t_ or marpa_v_, where the single letter between underscores is one of the Libmarpa major class abbreviations. The letter indicates which class the method belongs to.
Methods that are exported, but that are part of the internal interface, begin with _marpa_. Methods that are part of the internal interface (often called “internal methods”) are subject to change and are intended for use only by Libmarpa’s developers.
Libmarpa reserves the marpa_ and _marpa_ prefixes for itself, with all their capitalization variants. All Libmarpa names visible outside the package will begin with a capitalization variant of one of these two prefixes.
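The naming convention is regular enough to be checked mechanically. The following sketch classifies a method name by its prefix. It is an illustration only: toy_classify() and the toy_name_kind values are hypothetical names, and the sketch looks only at the all-lowercase prefixes used for methods, not at the other capitalization variants.

```c
#include <assert.h>
#include <string.h>

enum toy_name_kind {
    TOY_NOT_LIBMARPA,     /* no reserved prefix */
    TOY_INTERNAL,         /* begins with _marpa_ */
    TOY_EXTERNAL_STATIC,  /* begins with marpa_, no class letter */
    TOY_EXTERNAL_CLASS    /* begins with marpa_X_, X a class letter */
};

/* Classify name by its prefix.  For a class method, the class
   abbreviation is stored in *class_letter; otherwise '\0' is. */
static enum toy_name_kind toy_classify(const char *name,
                                       char *class_letter)
{
    assert(name != NULL && class_letter != NULL);
    *class_letter = '\0';
    if (strncmp(name, "_marpa_", 7) == 0)
        return TOY_INTERNAL;
    if (strncmp(name, "marpa_", 6) != 0)
        return TOY_NOT_LIBMARPA;
    /* name[6] is checked first, so name[7] is never read past
       the end of the string. */
    if (name[6] != '\0' && strchr("cgrbotv", name[6]) != NULL
        && name[7] == '_') {
        *class_letter = name[6];
        return TOY_EXTERNAL_CLASS;
    }
    return TOY_EXTERNAL_STATIC;
}
```

Under this sketch, marpa_r_alternative is classified as an external method of the class abbreviated r, and any name that begins with _marpa_ is classified as internal.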
Some general conventions for return values are worth mentioning:
NULL usually indicates method failure. Any other result usually indicates method success.
The words “success” and “failure” are heavily overloaded in these documents. But in contexts where our meaning is clear we will usually abbreviate “method success” and “method failure” to “success” and “failure”, respectively.
The Libmarpa programmer should not overly rely on the general conventions for return values. In particular, -2 may sometimes be ambiguous: both a valid return value for method success, and a potential indication of hard method failure. In this case, the programmer must distinguish the two return statuses based on the error code, and a programmer who is relying too heavily on the general conventions will fall into a trap. For the description of the return values of marpa_g_rule_rank_set(), see Rank methods.
The method descriptions are written on the assumption that the reader has the following in mind while reading them:
void, the last paragraph of its method description is a “return value summary”. The return value summary starts with the label “Return Value”.
Checks that the Marpa library in use is compatible with the given version. Generally, the application programmer will pass in the constants MARPA_MAJOR_VERSION, MARPA_MINOR_VERSION, and MARPA_MICRO_VERSION as the three arguments, to check that their application was compiled with headers that match the version of Libmarpa that they are using.
If required_major.required_minor.required_micro is an exact match with 11.0.1, the method succeeds. Otherwise the return status is an irrecoverable hard failure.
Return value: On success, MARPA_ERR_NONE. On hard failure, the error code.
Writes the version number in version. It is an undetected irrecoverable hard failure if version does not have room for three int’s.
Return value: Always succeeds. The return value is unspecified.
The configuration object is intended for future extensions. These may allow the application to override Libmarpa’s memory allocation and fatal error handling without resorting to global variables, and therefore in a thread-safe way. Currently, the only function of the Marpa_Config class is to give marpa_g_new() a place to put its error code.
Marpa_Config is Libmarpa’s only “major” class which is not a time class. There is no constructor or destructor, although Marpa_Config objects do need to be initialized before use. Aside from its own accessor, Marpa_Config objects are only used by marpa_g_new(), and no reference to their location is kept in any of Libmarpa’s time objects. The intent is that it be convenient to have them in memory that might be deallocated soon after marpa_g_new() returns. For example, they could be put on the stack.
Initialize the config information to “safe” default values. An irrecoverable error will result if an uninitialized configuration is used to create a grammar.
Return value: Always succeeds. The return value is unspecified.
Error codes are usually kept in the base grammar, which leaves marpa_g_new() no place to put its error code on failure. Objects of the Marpa_Config class provide such a place. p_error_string is reserved for use by the internals. Applications should set it to NULL.
Return value: The error code in config. Always succeeds, so that marpa_c_error() never requires an error code for itself.
• Grammar overview
• Grammar constructor
• Grammar reference counting
• Symbol methods
• Rule methods
• Sequence methods
• Rank methods
• Grammar precomputation
An archetypal application has a grammar. To create a grammar, use the marpa_g_new() method. When a grammar is no longer in use, its memory can be freed using the marpa_g_unref() method.
To be precomputed, a grammar must have one or more symbols. To create symbols, use the marpa_g_symbol_new() method.
To be precomputed, a grammar must have one or more rules. To create rules, use the marpa_g_rule_new() and marpa_g_sequence_new() methods.
To be precomputed, a grammar must have exactly one start symbol. To mark a symbol as the start symbol, use the marpa_g_start_symbol_set() method.
Before parsing with a grammar, it must be precomputed. To precompute a grammar, use the marpa_g_precompute() method.
Creates a new grammar time object. The returned grammar object is not yet precomputed, and will have no symbols and rules. Its reference count will be 1.
Unless the application calls marpa_c_error(), Libmarpa will not reference the location pointed to by the configuration argument after marpa_g_new() returns. (See marpa_c_error().) The configuration argument may be NULL, but if it is, there will be no way to determine the error code on failure.
Return value: On success, the grammar object. On hard failure, NULL. Also on hard failure, if the configuration argument is not NULL, the error code is set in configuration. The error code may be accessed using marpa_c_error().
It is recommended that this call be made immediately after the grammar constructor. It turns off a deprecated feature.
The marpa_g_force_valued() method forces all the symbols in a grammar to be “valued”. The opposite of a valued symbol is one about whose value you do not care. This distinction has been made in the past in hope of gaining efficiencies at evaluation time. Current thinking is that the gains do not repay the extra complexity.
Return value: On success, a non-negative integer, whose value is otherwise unspecified. On failure, -2.
Increases the reference count of g by 1. Not needed by most applications.
Return value: On success, g. On hard failure, NULL.
Decreases the reference count by 1, destroying g once the reference count reaches zero.
When successful, returns the ID of the start symbol. Soft fails if there is no start symbol. The start symbol is set by the marpa_g_start_symbol_set() call.
Return value: On success, the ID of the start symbol, which is always a non-negative number. On soft failure, -1. On hard failure, -2.
When successful, sets the start symbol of grammar g to symbol sym_id. Soft fails if sym_id is well-formed (a non-negative integer), but a symbol with that ID does not exist.
Return value: On success, sym_id, which will always be a non-negative number. On soft failure, -1. On hard failure, -2.
Return value: On success, the numerically largest symbol ID of g. On hard failure, -2.
A symbol is accessible if it can be reached from the start symbol. Soft fails if sym_id is well-formed (a non-negative integer), but a symbol with that ID does not exist. A common hard failure is calling this method with a grammar that is not precomputed.
Return value: On success, 1 if symbol sym_id is accessible, 0 if not. On soft failure, -1. On hard failure, -2.
A symbol is nullable if it sometimes produces the empty string. A nulling symbol is always a nullable symbol, but not all nullable symbols are nulling symbols. Soft fails if sym_id is well-formed (a non-negative integer), but a symbol with that ID does not exist. A common hard failure is calling this method with a grammar that is not precomputed.
Return value: On success, 1 if symbol sym_id is nullable, 0 if not. On soft failure, -1. On hard failure, -2.
A symbol is nulling if it always produces the empty string. Soft fails if sym_id is well-formed (a non-negative integer), but a symbol with that ID does not exist. A common hard failure is calling this method with a grammar that is not precomputed.
Return value: On success, 1 if symbol sym_id is nulling, 0 if not. On soft failure, -1. On hard failure, -2.
A symbol is productive if it can produce a string of terminals. All nullable symbols are considered productive. Soft fails if sym_id is well-formed (a non-negative integer), but a symbol with that ID does not exist. A common hard failure is calling this method with a grammar that is not precomputed.
Return value: On success, 1 if symbol sym_id is productive, 0 if not. On soft failure, -1. On hard failure, -2.
On success, if sym_id is the start symbol, returns 1. On success, if sym_id is not the start symbol, returns 0. On success, if no start symbol has been set, returns 0.
Soft fails if sym_id is well-formed (a non-negative integer), but a symbol with that ID does not exist.
Return value: On success, 1 or 0. On soft failure, -1. On hard failure, -2.
On success, returns the “terminal status” of sym_id. The terminal status is 1 if sym_id is a terminal, 0 otherwise. To be used as an input symbol in the marpa_r_alternative() method, a symbol must be a terminal.
Soft fails if sym_id is well-formed (a non-negative integer), but a symbol with that ID does not exist.
Return value: On success, 1 or 0. On soft failure, -1. On hard failure, -2.
When successful, creates a new symbol in grammar g. The symbol ID’s are non-negative integers. Within each grammar, a symbol’s ID is unique to that symbol.
Symbols are numbered consecutively, starting at 0. That is, the first successful call of this method for a grammar returns the symbol with ID 0. The n’th successful call for a grammar returns the symbol with ID n-1. This makes it convenient for applications to store additional information about the symbols in an array.
Return value: On success, the ID of the new symbol, which will be a non-negative integer. On hard failure, -2.
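Because symbol IDs are consecutive and start at 0, an application can use the ID directly as an array index. The sketch below shows one possible application-side table. The type struct symbol_table and the function symbol_table_add() are hypothetical names invented for this example. The sketch assumes the application records each symbol right after creating it, so that IDs arrive in creation order.

```c
#include <stdlib.h>

/* Hypothetical application-side table, not part of Libmarpa.
   names[id] holds the application's name for the symbol with that ID. */
struct symbol_table {
  const char **names;
  int count;
};

/* Records NAME for the symbol with ID id.  Expects IDs in creation
   order, so id must equal the current count.  Returns 0 on success,
   -1 if the ID is out of order or memory cannot be allocated. */
int
symbol_table_add (struct symbol_table *t, int id, const char *name)
{
  const char **grown;
  if (id != t->count)
    return -1;
  grown = realloc (t->names, (size_t) (t->count + 1) * sizeof *grown);
  if (!grown)
    return -1;
  grown[id] = name;
  t->names = grown;
  t->count++;
  return 0;
}
```

The same approach works for rule IDs, which follow the same numbering scheme.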
Return value: On success, the numerically largest rule ID of g. On hard failure, -2.
A rule is accessible if it can be reached from the start symbol. A rule is accessible if and only if its LHS symbol is accessible. The start rule is always an accessible rule.
Soft fails if rule_id is well-formed (a non-negative integer), but a rule with that ID does not exist. A common hard failure is calling this method with a grammar that is not precomputed.
Return value: On success 1 or 0: 1 if rule with ID rule_id is accessible, 0 if not. On soft failure, -1. On hard failure, -2.
A rule is nullable if it sometimes produces the empty string. A nulling rule is always a nullable rule, but not all nullable rules are nulling rules.
Soft fails if rule_id is well-formed (a non-negative integer), but a rule with that ID does not exist. A common hard failure is calling this method with a grammar that is not precomputed.
Return value: On success 1 or 0: 1 if the rule with ID rule_id is nullable, 0 if not. On soft failure, -1. On hard failure, -2.
A rule is nulling if it always produces the empty string.
Soft fails if rule_id is well-formed (a non-negative integer), but a rule with that ID does not exist. A common hard failure is calling this method with a grammar that is not precomputed.
Return value: On success 1 or 0: 1 if the rule with ID rule_id is nulling, 0 if not. On soft failure, -1. On hard failure, -2.
A rule is a loop rule if it non-trivially produces the string of length one that consists only of its LHS symbol. Such a derivation takes the parse back to where it started, hence the term “loop”. “Non-trivially” means the zero-step derivation does not count — the derivation must have at least one step.
The presence of a loop rule makes a grammar infinitely ambiguous, and applications will typically want to treat them as fatal errors. But nothing forces an application to do this, and Marpa will successfully parse and evaluate grammars with loop rules.
Soft fails if rule_id is well-formed (a non-negative integer), but a rule with that ID does not exist. A common hard failure is calling this method with a grammar that is not precomputed.
Return value: On success 1 or 0: 1 if the rule with ID rule_id is a loop rule, 0 if not. On soft failure, -1. On hard failure, -2.
A rule is productive if it can produce a string of terminals. A rule is productive if and only if all the symbols on its RHS are productive. The empty string counts as a string of terminals, so that a nullable rule is always a productive rule. For that same reason, an empty rule is considered productive.
Soft fails if rule_id is well-formed (a non-negative integer), but a rule with that ID does not exist. A common hard failure is calling this method with a grammar that is not precomputed.
Return value: On success 1 or 0: 1 if the rule with ID rule_id is productive, 0 if not. On soft failure, -1. On hard failure, -2.
The length of a rule is the number of symbols on its RHS.
Soft fails if rule_id is well-formed (a non-negative integer), but a rule with that ID does not exist.
Return value: On success, the length of the rule with ID rule_id. On soft failure, -1. On hard failure, -2.
Soft fails if rule_id is well-formed (a non-negative integer), but a rule with that ID does not exist.
Return value: On success, the ID of the LHS symbol of the rule with ID rule_id. On soft failure, -1. On hard failure, -2.
On success, creates a new external BNF rule in grammar g. In addition to BNF rules, Marpa also allows sequence rules, which are created by the marpa_g_sequence_new() method. See marpa_g_sequence_new(). We call marpa_g_rule_new() and marpa_g_sequence_new() rule creation methods.
Sequence rules and BNF rules are both rules: They share the same series of rule IDs, and are accessed and manipulated by the same methods, with the only differences being as noted in the descriptions of those methods.
Each grammar’s rule ID’s are a consecutive sequence of non-negative integers, starting at 0. This is intended to make it convenient for applications to store additional information about a grammar’s rules in an array. Within each grammar, the following is true:
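The first successful call of a rule creation method for a grammar returns the rule with ID 0. The n’th successful call returns the rule with ID n-1.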
The LHS symbol is lhs_id, and there are length symbols on the RHS. The RHS symbols are in an array pointed to by rhs_ids.
Possible hard failures, with their error codes, include:
• MARPA_ERR_SEQUENCE_LHS_NOT_UNIQUE: The LHS symbol is the same as that of a sequence rule.
• MARPA_ERR_DUPLICATE_RULE: The new rule would duplicate another BNF rule. Another BNF rule is considered the duplicate of the new one, if its LHS symbol is the same as symbol lhs_id, if its length is the same as length, and if its RHS symbols match one for one those in the array of symbols rhs_ids.
Return value: On success, the ID of the new external rule. On hard failure, -2.
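The duplicate-rule definition above can be restated as a small predicate. The function is_duplicate_rule() is a hypothetical helper for illustration only. It is not how Libmarpa itself performs the check, which this manual does not describe.

```c
/* Hypothetical helper, not part of Libmarpa.  Returns 1 if two BNF
   rules are duplicates under the definition above: same LHS symbol,
   same length, and RHS symbols that match one for one. */
int
is_duplicate_rule (int lhs_a, const int *rhs_a, int length_a,
                   int lhs_b, const int *rhs_b, int length_b)
{
  int ix;
  if (lhs_a != lhs_b || length_a != length_b)
    return 0;
  for (ix = 0; ix < length_a; ix++)
    if (rhs_a[ix] != rhs_b[ix])
      return 0;
  return 1;
}
```

Note that under this definition two empty rules with the same LHS are duplicates, since they have equal length and no RHS symbols to differ.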
When successful, returns the ID of the symbol at index ix in the RHS of the rule with ID rule_id. The indexing of RHS symbols is zero-based.
Soft fails if rule_id is well-formed (a non-negative integer), but a rule with that ID does not exist.
A common hard failure is for ix not to be a valid index of the RHS. This happens if ix is less than zero, or if ix is greater than or equal to the length of the rule.
Return value: On success, a symbol ID, which is always non-negative. On soft failure, -1. On hard failure, -2.
When successful, returns 1 if the proper separation flag of the rule with ID rule_id is set, and 0 otherwise. Does not distinguish sequence rules without proper separation from non-sequence rules. That is, does not distinguish an unset proper separation flag from a proper separation flag whose value is unspecified because rule_id is the ID of a BNF rule. Applications that want to determine whether or not a rule is a sequence rule can use marpa_g_sequence_min() to do this. See marpa_g_sequence_min().
Soft fails if rule_id is well-formed (a non-negative integer), but a rule with that ID does not exist.
Return value: On success, 1 or 0. On soft failure, -1. On hard failure, -2.
On success, returns the minimum length of a sequence rule. Soft fails if a rule with ID rule_id exists, but is not a sequence rule. This soft failure can be used to test whether or not a rule is a sequence rule.
Hard fails irrecoverably if rule_id is not well-formed (a non-negative number). Also, hard fails irrecoverably if no rule with ID rule_id exists, even when rule_id is well-formed. Note that, in its handling of the non-existence of a rule for its rule argument, this method differs from many of the other grammar methods. Grammar methods that take a rule ID argument more often treat the non-existence of a rule for a well-formed rule ID as a soft, recoverable, failure.
Return value: On success, the minimum length of the sequence rule with ID rule_id, which is always non-negative. On soft failure, -1. On hard failure, -2.
When successful, adds a new sequence rule to grammar g, and returns its ID. In addition to sequence rules, Marpa also allows BNF rules, which are created by the marpa_g_rule_new() method. See marpa_g_rule_new(). We call marpa_g_rule_new() and marpa_g_sequence_new() rule creation methods. For details on the use of sequence rules, see Sequence rules.
Sequence rules and BNF rules are both rules: They share the same series of rule IDs, and are accessed and manipulated by the same methods, with the only differences being as noted in the descriptions of those methods.
Each grammar’s rule ID’s are a consecutive sequence of non-negative integers, starting at 0. This is intended to make it convenient for applications to store additional information about a grammar’s rules in an array. Within each grammar, the following is true:
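The first successful call of a rule creation method for a grammar returns the rule with ID 0. The n’th successful call returns the rule with ID n-1.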
The LHS of the sequence is lhs_id, and the item to be repeated on the RHS of the sequence is rhs_id. The sequence must be repeated at least min times, where min is 0 or 1. The sequence RHS, or item, is restricted to a single symbol, and that symbol cannot be nullable. If separator_id is non-negative, it is a separator symbol, which cannot be nullable. flags is a bit vector. Use of any other bit except MARPA_PROPER_SEPARATION results in undefined behavior.
By default, a sequence rule recognizes a trailing separator. If flags & MARPA_PROPER_SEPARATION is non-zero, separation is “proper”. Proper separation means that the rule does not recognize a trailing separator. Specifying proper separation has no effect unless a separator symbol has also been specified.
The LHS symbol cannot be the LHS of any other rule, whether a BNF rule or a sequence rule. On an attempt to create a sequence rule with a duplicate LHS, this method hard fails, with an error code of MARPA_ERR_SEQUENCE_LHS_NOT_UNIQUE.
Return value: On success, the ID of the newly added sequence rule, which is always non-negative. On hard failure, -2.
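To make the effect of min, the separator and proper separation concrete, here is a toy matcher over an array of symbol IDs. It does not call Libmarpa, and sequence_matches() is a hypothetical function written for this example. It encodes one reading of the description above: a separator may appear only between items or, when separation is not proper, once after the last item, and an empty input matches only when min is 0.

```c
/* Toy model, not part of Libmarpa.  Returns 1 if tokens[0..count-1]
   is a sequence of ITEM symbols, optionally separated by SEPARATOR
   (negative means no separator), with at least MIN items.  If PROPER
   is non-zero, a trailing separator is rejected. */
int
sequence_matches (const int *tokens, int count, int item,
                  int separator, int min, int proper)
{
  int ix = 0;
  int items = 0;
  if (separator < 0)
    {
      /* No separator: every token must be the item. */
      for (ix = 0; ix < count; ix++)
        if (tokens[ix] != item)
          return 0;
      return count >= min;
    }
  while (ix < count)
    {
      if (tokens[ix] != item)
        return 0;
      items++;
      ix++;
      if (ix == count)
        break;                  /* input ended on an item */
      if (tokens[ix] != separator)
        return 0;
      ix++;
      if (ix == count)          /* input ended on a separator */
        return !proper && items >= min;
    }
  return items >= min;
}
```

For example, with item 7 and separator 9, the input 7 9 is accepted by default but rejected under proper separation, while 7 9 7 is accepted in both cases.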
On success, returns the symbol ID of the separator of the sequence rule with ID rule_id. Soft fails if there is no separator. The causes of hard failure include rule_id not being well-formed; rule_id not being the ID of a rule that exists; and rule_id not being the ID of a sequence rule.
Return value: On success, a symbol ID, which is always non-negative. On soft failure, -1. On hard failure, -2.
On success, returns a boolean whose value is 1 iff the symbol with ID sym_id is counted. A symbol is counted iff
Soft fails iff sym_id is well-formed (a non-negative integer), but a symbol with that ID does not exist.
Return value: On success, a boolean. On soft failure, -1. On hard failure, -2.
On success, returns the default rank of the grammar g. For more about the default rank of a grammar, see marpa_g_default_rank_set().
Return value: On success, returns the default rank of the grammar, and sets the error code to MARPA_ERR_NONE. On failure, returns -2, and sets the error code to an appropriate value, which will never be MARPA_ERR_NONE. Note that when the default rank of the grammar is -2, the error code is the only way to distinguish success from failure. The error code can be determined by using the marpa_g_error() call. See marpa_g_error().
On success, sets the default rank of the grammar g to rank. When a grammar is created, the default rank is 0. When rules and symbols are created, their rank is the default rank of the grammar.
Changing the grammar’s default rank does not affect those rules and symbols already created, only those that will be created. This means that the grammar’s default rank can be used to, in effect, assign ranks to groups of rules and symbols. Applications may find this behavior useful.
Return value: On success, returns rank and sets the error code to MARPA_ERR_NONE. On failure, returns -2, and sets the error code to an appropriate value, which will never be MARPA_ERR_NONE. Note that when the rank is -2, the error code is the only way to distinguish success from failure. The error code can be determined by using the marpa_g_error() call. See marpa_g_error().
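The way the default rank assigns ranks to groups of rules can be pictured with a toy model. The names struct model_ranks and model_rule_new() are hypothetical, and the code does not call Libmarpa. It only mirrors the stated behavior that a new rule takes the default rank in force when it is created, and that later changes to the default leave existing rules alone.

```c
/* Toy model, not part of Libmarpa. */
#define MODEL_MAX_RULES 8

struct model_ranks {
  int default_rank;                /* starts at 0, as for a new grammar */
  int rule_rank[MODEL_MAX_RULES];  /* rank of each rule, indexed by ID */
  int rule_count;
};

/* Creates a rule whose rank is the current default rank.
   Returns the new rule ID, or -2 if the toy table is full. */
int
model_rule_new (struct model_ranks *m)
{
  if (m->rule_count >= MODEL_MAX_RULES)
    return -2;
  m->rule_rank[m->rule_count] = m->default_rank;
  return m->rule_count++;
}
```

In this model, creating a rule, setting default_rank to 5, and then creating a second rule leaves the first rule at rank 0 and gives the second rank 5.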
When successful, returns the rank of the symbol with ID sym_id. When a symbol is created, its rank is initialized to the default rank of the grammar.
Return value: On success, returns a symbol rank, and sets the error code to MARPA_ERR_NONE. On hard failure, returns -2, and sets the error code to an appropriate value, which will never be MARPA_ERR_NONE. Note that -2 is a valid symbol rank, so that when -2 is returned, the error code is the only way to distinguish success from failure. The error code can be determined using marpa_g_error(). See marpa_g_error().
When successful, sets the rank of the symbol with ID sym_id to rank. When a symbol is created, its rank is initialized to the default rank of the grammar.
Return value: On success, returns rank, and sets the error code to MARPA_ERR_NONE. On hard failure, returns -2, and sets the error code to an appropriate value, which will never be MARPA_ERR_NONE. Note that rank may be -2, and in this case the error code is the only way to distinguish success from failure. The error code can be determined using marpa_g_error(). See marpa_g_error().
When successful, returns the rank of the rule with ID rule_id. When a rule is created, its rank is initialized to the default rank of the grammar.
Return value: On success, returns a rule rank, and sets the error code to MARPA_ERR_NONE. The rule rank is an integer. On hard failure, returns -2, and sets the error code to an appropriate value, which will never be MARPA_ERR_NONE. Note that -2 is a valid rule rank, so that when -2 is returned, the error code is the only way to distinguish success from failure. The error code can be determined using marpa_g_error(). See marpa_g_error().
When successful, sets the rank of the rule with ID rule_id to rank and returns rank.
Return value: On success, returns rank, which will be an integer, and sets the error code to MARPA_ERR_NONE. On hard failure, returns -2, and sets the error code to an appropriate value, which will never be MARPA_ERR_NONE. Note that -2 is a valid rule rank, so that when -2 is returned, the error code is the only way to distinguish success from failure. The error code can be determined using marpa_g_error(). See marpa_g_error().
On success, returns a boolean whose value is 1 iff “null ranks high” is set in the rule with ID rule_id. When a rule is created, “null ranks high” is unset. For more on the “null ranks high” setting, read the description of marpa_g_rule_null_high_set(). See marpa_g_rule_null_high_set().
Soft fails iff rule_id is well-formed (a non-negative integer), but a rule with that ID does not exist.
Return value: On success, a boolean. On soft failure, -1. On hard failure, -2.
On success,
The “null ranks high” setting affects the ranking of rules with properly nullable symbols on their right hand side. If a rule has properly nullable symbols on its RHS, each instance in which it appears in a parse will have a pattern of nulled and non-nulled symbols. Such a pattern is called a “null variant”.
If “null ranks high” is set, nulled symbols rank high. If “null ranks high” is unset (the default), nulled symbols rank low. Ranking of null variants is done from left to right.
Soft fails iff rule_id is well-formed (a non-negative integer), but a rule with that ID does not exist.
Hard fails if the grammar has been precomputed.
Return value: On success, a boolean. On soft failure, -1. On hard failure, -2.
On success, returns a boolean which is 1 iff g has a cycle. Cycles make a grammar infinitely ambiguous, and are considered useless in current practice. Cycles make processing the grammar less efficient, sometimes considerably so. Applications will almost always want to treat cycles as mistakes on the part of the writer of the grammar. To determine which rules are in the cycle, marpa_g_rule_is_loop() can be used.
Return value: On success, a boolean. On hard failure, -2.
Return value: On success, a boolean which is 1 iff grammar g is precomputed. On hard failure, -2.
On success, and on fully recoverable hard failure, precomputes the grammar g. Precomputation involves running a series of grammar checks and “precomputing” some useful information which is kept internally to save repeated calculations. After precomputation, the grammar is “frozen” in many respects, and many grammar mutators that succeed before precomputation will cause hard failures after precomputation. Precomputation is necessary for a recognizer to be generated from a grammar.
When called, clears any events already in the event queue. May return one or more events. The types of event that this method may return are MARPA_EVENT_LOOP_RULES, MARPA_EVENT_COUNTED_NULLABLE, and MARPA_EVENT_NULLING_TERMINAL. All of these events occur only on failure. Applications must be prepared for this method to return additional events, including events that occur on success. Events may be queried using the marpa_g_event() method. See marpa_g_event().
The fully recoverable hard failure is MARPA_ERR_GRAMMAR_HAS_CYCLE. Recall that for fully recoverable hard failures this method precomputes the grammar. Most applications, however, will want to treat a grammar with cycles as if it were a library-recoverable error. A MARPA_ERR_GRAMMAR_HAS_CYCLE error occurs iff a MARPA_EVENT_LOOP_RULES event occurs. For more details on cycles, see marpa_g_has_cycle().
The error code MARPA_ERR_COUNTED_NULLABLE is library-recoverable. This failure occurs when a symbol on the RHS of a sequence rule is nullable, which Libmarpa does not allow in a grammar. Error code MARPA_ERR_COUNTED_NULLABLE occurs iff one or more MARPA_EVENT_COUNTED_NULLABLE events occur. There is one MARPA_EVENT_COUNTED_NULLABLE event for every symbol that is nullable and on the right hand side of a sequence rule. An application may use these events to inform the user of the problematic symbols, and this detail may help the user fix the grammar.
The error code MARPA_ERR_NULLING_TERMINAL occurs only if LHS terminals are enabled. The LHS terminals feature is deprecated. See LHS terminals. Error code MARPA_ERR_NULLING_TERMINAL is library-recoverable. One or more MARPA_EVENT_NULLING_TERMINAL events will occur iff this method fails with error code MARPA_ERR_NULLING_TERMINAL. See Nulling terminals.
Among the other error codes that may cause this method to fail are the following:
• MARPA_ERR_NO_RULES: The grammar has no rules.
• MARPA_ERR_NO_START_SYMBOL: No start symbol was specified.
• MARPA_ERR_INVALID_START_SYMBOL: A start symbol ID was specified, but it is not the ID of a valid symbol.
• MARPA_ERR_START_NOT_LHS: The start symbol is not on the LHS of any rule.
• MARPA_ERR_UNPRODUCTIVE_START: The start symbol is not productive.
More details of these can be found under the description of the appropriate code. See External error codes.
Return value: On success, a non-negative number, whose value is otherwise unspecified. On hard failure, -2. For the error code MARPA_ERR_GRAMMAR_HAS_CYCLE, the hard failure is fully recoverable. For the error codes MARPA_ERR_COUNTED_NULLABLE and MARPA_ERR_NULLING_TERMINAL, the hard failure is library-recoverable.
• Recognizer overview
• Creating a new recognizer
• Recognizer reference counting
• Recognizer life cycle mutators
• Location accessors
• Other parse status methods
An archetypal application uses a recognizer to read input. To create a recognizer, use the marpa_r_new() method. When a recognizer is no longer in use, its memory can be freed using the marpa_r_unref() method.
To make a recognizer ready for input, use the marpa_r_start_input() method.
The recognizer starts with its current earleme at location 0. To read a token at the current earleme, use the marpa_r_alternative() call.
To complete the processing of the current earleme, and move forward to a new one, use the marpa_r_earleme_complete() call.
On success, creates a new recognizer and increments the reference count of g, the base grammar, by one. In the new recognizer,
Return value: On success, the newly created recognizer, which is never NULL. If g is not precomputed, or on other hard failure, NULL.
Increases the reference count by 1. This method is not needed by most applications.
Return value: On success, the recognizer object, r, which is never NULL. On hard failure, NULL.
Decreases the reference count by 1, destroying r once the reference count reaches zero. When r is destroyed, the reference count of its base grammar is decreased by one. If this takes the reference count of the base grammar to zero, the base grammar is also destroyed.
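The interaction between the two reference counts can be followed in a toy model. The model_ types and functions below are hypothetical and do not call Libmarpa. Instead of freeing memory, the model sets a destroyed flag, so that the order of events can be inspected.

```c
/* Toy model of the reference counting described above,
   not part of Libmarpa. */
struct model_grammar { int refcount; int destroyed; };

struct model_recognizer {
  int refcount;
  int destroyed;
  struct model_grammar *base;   /* the base grammar */
};

/* Drops one reference to g; marks it destroyed at zero. */
void
model_g_unref (struct model_grammar *g)
{
  if (--g->refcount == 0)
    g->destroyed = 1;
}

/* Sets up r with a count of 1 and takes a reference to g. */
void
model_r_init (struct model_recognizer *r, struct model_grammar *g)
{
  r->refcount = 1;
  r->destroyed = 0;
  r->base = g;
  g->refcount++;
}

/* Drops one reference to r; at zero, marks r destroyed and
   releases its reference to the base grammar. */
void
model_r_unref (struct model_recognizer *r)
{
  if (--r->refcount == 0)
    {
      r->destroyed = 1;
      model_g_unref (r->base);
    }
}
```

In this model an application can drop its own reference to the grammar as soon as the recognizer exists. The grammar survives until the recognizer is destroyed.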
When successful, does the following:
MARPA_EVENT_EXHAUSTED event. See Exhaustion.
MARPA_EVENT_SYMBOL_NULLED, MARPA_EVENT_SYMBOL_PREDICTED, or MARPA_EVENT_SYMBOL_EXPECTED events. See Events.
Return value: On success, a non-negative value, whose value is otherwise unspecified. On hard failure, -2.
The token_id argument must be the symbol ID of a terminal. The value argument is an integer that represents the “value” of the token, and which should not be zero. The length argument is the length of the token, which must be greater than zero.
On success, does the following, where current is the value of the current earleme before the call and furthest is the value of the furthest earleme before the call:
current+length.
max(current+length,furthest).
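The two expressions above can be read as simple arithmetic on earleme locations. The sketch below assumes that current+length is the earleme at which the token ends, and that max(current+length,furthest) is the furthest earleme after the call. Both readings are assumptions made for this example. The type struct earleme_state and the function note_token() are hypothetical and do not call Libmarpa.

```c
/* Toy model, not part of Libmarpa. */
struct earleme_state {
  int current;    /* current earleme */
  int furthest;   /* furthest earleme reached by any token so far */
};

/* Records a token of the given length starting at the current
   earleme.  Returns the earleme where the token ends and raises
   furthest if the token reaches past it.  The current earleme is
   left unchanged. */
int
note_token (struct earleme_state *s, int length)
{
  int end = s->current + length;
  if (end > s->furthest)
    s->furthest = end;
  return end;
}
```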
After recoverable failure, the following are the case:
Libmarpa allows tokens to be ambiguous. Two tokens are ambiguous if they end at the same earleme location. If two tokens are ambiguous, Libmarpa will attempt to produce all the parses that include either of them.
Libmarpa allows tokens to overlap. Let the notation t@s-e indicate that token t starts at earleme s and ends at earleme e. Let t1@s1-e1 and t2@s2-e2 be two tokens such that s1<=s2. We say that t1 and t2 overlap iff e1>s2.
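The overlap definition translates directly into a predicate. The type struct token_span and the function tokens_overlap() are hypothetical names for this example. The function first orders the two tokens so that s1<=s2 holds, as the definition requires, and then tests e1>s2.

```c
/* Hypothetical helper types, not part of Libmarpa. */
struct token_span {
  int start;   /* earleme where the token starts */
  int end;     /* earleme where the token ends */
};

/* Returns 1 if the two tokens overlap under the definition above,
   0 otherwise.  The arguments may be given in either order. */
int
tokens_overlap (struct token_span t1, struct token_span t2)
{
  if (t1.start > t2.start)
    {
      struct token_span tmp = t1;
      t1 = t2;
      t2 = tmp;
    }
  return t1.end > t2.start;
}
```

Under this definition, a token ending at earleme 2 and a token starting at earleme 2 do not overlap, since 2>2 is false.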
The value argument is not used inside Libmarpa — it is simply stored to be returned by the valuator as a convenience for the application. In applications where the token’s actual value is not an integer, it is expected that the application will use value as a “virtual” value, perhaps finding the actual value by using value to index an array. Some applications may prefer to track token values on their own, perhaps based on the earleme location and token_id, instead of using Libmarpa’s token values.
A value of 0 does not cause a failure, but it is reserved for unvalued symbols, a now-deprecated feature. See Valued and unvalued symbols.
Hard fails irrecoverably with MARPA_ERR_DUPLICATE_TOKEN if the token added would be a duplicate. Two tokens are duplicates iff all of the following are true:
marpa_r_alternative() attempts to read them while at the same current earleme.
If a token was not accepted because of its token ID, hard fails with the error code MARPA_ERR_UNEXPECTED_TOKEN_ID. This hard failure is fully recoverable so that, for example, the application may retry this method with different token IDs until it succeeds. These retries are efficient, and are quite usable as a parsing technique, so much so that we have given the technique a name: the Ruby Slippers. The Ruby Slippers are used in several applications.
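The retry pattern can be sketched without committing to any particular recognizer. In the code below, `read_token_fn` is a hypothetical stand-in for a call such as marpa_r_alternative(), assumed to return 0 on success and a non-zero error code on failure. `DEMO_UNEXPECTED` is an arbitrary placeholder and is not the real value of MARPA_ERR_UNEXPECTED_TOKEN_ID. The `demo_recce` type exists only so the sketch can be exercised.

```c
#include <assert.h>

#define DEMO_UNEXPECTED 1000 /* placeholder, not Libmarpa's real value */

/* Hypothetical stand-in for a token-reading call: returns 0 on
   success, a non-zero error code on failure. */
typedef int (*read_token_fn)(void *ctx, int token_id, int value, int length);

/* Try each candidate token ID in turn.  Returns the index of the
   accepted candidate, -1 if every candidate was unexpected, or -2 if
   some other error stopped the retries. */
int ruby_slippers(read_token_fn read_token, void *ctx,
                  const int *candidates, int n_candidates,
                  int value, int length, int unexpected_code)
{
  int i;
  for (i = 0; i < n_candidates; i++) {
    int rc = read_token(ctx, candidates[i], value, length);
    if (rc == 0) return i;                /* accepted */
    if (rc != unexpected_code) return -2; /* not safe to retry */
  }
  return -1;
}

/* Demo recognizer: accepts exactly one token ID and counts attempts. */
struct demo_recce {
  int wanted_id;
  int attempts;
};

int demo_read_token(void *ctx, int token_id, int value, int length)
{
  struct demo_recce *r = ctx;
  (void)value;
  assert(length > 0);
  r->attempts++;
  return (token_id == r->wanted_id) ? 0 : DEMO_UNEXPECTED;
}
```

Only the “unexpected token ID” code is retried here. Any other failure stops the loop, since the text above promises full recoverability only for that code.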
Return value: On success, MARPA_ERR_NONE. On failure, an error code other than MARPA_ERR_NONE. The hard failure for MARPA_ERR_UNEXPECTED_TOKEN_ID is fully recoverable.
For the purposes of this method description, we define the following:
marpa_r_earleme_complete
.
marpa_r_earleme_complete
.
marpa_r_terminal_is_expected() determines if a terminal is “expected” at the current earleme. See marpa_r_terminals_expected().
marpa_r_alternative() to end at an earleme after the current earleme. An anticipated terminal will have length greater than one. “Anticipated” terminals only occur if the application is using an advanced model of input. See Advanced input models.
On success, does the final processing for the current earleme, including the following:
current+1
.
current+1
.
marpa_r_earleme_complete()
.
MARPA_EVENT_SYMBOL_COMPLETED, MARPA_EVENT_SYMBOL_NULLED, MARPA_EVENT_SYMBOL_PREDICTED, or MARPA_EVENT_SYMBOL_EXPECTED events. See Events.
MARPA_EVENT_EARLEY_ITEM_THRESHOLD event. Often, the application will want to treat this event as if it were an ancestry-recoverable failure. See marpa_r_earley_item_warning_threshold_set().
MARPA_EVENT_EXHAUSTED event. Exhaustion on success only occurs if no terminals are expected at the current earleme after the call to this method (that is, at current+1) and no terminals are anticipated after current+1.
On hard failure with the code MARPA_ERR_PARSE_EXHAUSTED, does the following:
marpa_r_earleme_complete().
MARPA_EVENT_EXHAUSTED event and no others.
MARPA_ERR_PARSE_EXHAUSTED to be fully recoverable.
We note that exhaustion can occur when this method fails and when it succeeds. The distinction is that, on success, the call creates a new Earley set before becoming exhausted while, on failure, it becomes exhausted without creating a new Earley set.
This method is commonly called at the top of a loop. Almost all applications will want to check the return value and take special action in case of a value other than zero. If the value is greater than zero, an event will have occurred and almost all applications should react to MARPA_EVENT_EARLEY_ITEM_THRESHOLD events, as described above, and to unexpected events. If the value is less than zero, it may be due to an irrecoverable error, and only in very unusual circumstances will an application wish to ignore these.
How an application reacts to exhaustion will depend on the kind of parsing it is doing:
MARPA_EVENT_SYMBOL_COMPLETED event. Typically, these applications will treat exhaustion on method failure and exhaustion before the end of input as parse errors. They may wish to ignore exhaustion on method success at the end of input.
Return value: On success, the number of events generated. On hard failure, -2. Hard failure with the code MARPA_ERR_PARSE_EXHAUSTED is fully recoverable.
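The three-way check on the return value described above can be sketched as a small classifier. The enum and function names are hypothetical. The sketch takes the return value as a plain integer and does not call Libmarpa.

```c
#include <assert.h>

/* Hypothetical classification of a return value such as the one
   marpa_r_earleme_complete() produces. */
enum complete_outcome {
  COMPLETE_NO_EVENTS,  /* 0: success, nothing to inspect */
  COMPLETE_HAS_EVENTS, /* > 0: success, that many events to examine */
  COMPLETE_FAILED      /* < 0: failure, consult the error code */
};

enum complete_outcome classify_complete(int rv)
{
  if (rv == 0) return COMPLETE_NO_EVENTS;
  if (rv > 0) return COMPLETE_HAS_EVENTS;
  return COMPLETE_FAILED;
}

/* Number of events to examine; only meaningful after success. */
int complete_event_count(int rv)
{
  assert(rv >= 0);
  return rv;
}
```

On `COMPLETE_FAILED`, an application would typically look at the error code next, since MARPA_ERR_PARSE_EXHAUSTED is fully recoverable while other codes may not be.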
Marpa_Earleme marpa_r_current_earleme (Marpa_Recognizer r)
Successful iff input has started. If input has not started, returns soft failure.
Return value: On success, the current earleme, which is always non-negative. On soft failure, -1. Never returns a hard failure.
On success, returns the earleme of the Earley set with ID set_id. The ID of an Earley set is also called its ordinal. In the default, token-stream model, Earley set ID and earleme are always equal, but this is not the case in other input models.
Hard fails if there is no Earley set whose ID is set_id. This hard failure is fully recoverable. If set_id is negative, the error code of the hard failure is MARPA_ERR_INVALID_LOCATION. If set_id is greater than the ordinal of the latest Earley set, the error code of the hard failure is MARPA_ERR_NO_EARLEY_SET_AT_LOCATION.
At this writing, there is no method for the inverse operation (conversion of an earleme to an Earley set ID). One consideration in writing such a method is that not all earlemes correspond to Earley sets. Applications that want to map earlemes to Earley sets will have no trouble if they are using the standard input model: the Earley set ID is always exactly equal to the earleme in that model. For other applications that want an earleme-to-ID mapping, the most general method is to create an ID-to-earleme array using the marpa_r_earleme() method and invert it.
Return value: On success, the earleme corresponding to Earley set set_id, which is always non-negative. On hard failure, -2. The hard failures with error codes MARPA_ERR_INVALID_LOCATION and MARPA_ERR_NO_EARLEY_SET_AT_LOCATION are fully recoverable.
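The inversion of an ID-to-earleme array, as suggested above, might look like the following sketch. It assumes the application has already filled `id_to_earleme` (for example, with one marpa_r_earleme() call per Earley set ID) and that the entries are strictly increasing. The function name and the use of -1 for “no Earley set at this earleme” are our own choices for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Build an earleme-to-ID array from id_to_earleme[0 .. n_sets-1].
   Entry e of the result is the ID of the Earley set at earleme e,
   or -1 if no Earley set exists there.  The length of the result is
   stored in *out_len.  The caller frees the result.  Returns NULL if
   n_sets <= 0 or if allocation fails. */
int *invert_earleme_map(const int *id_to_earleme, int n_sets, int *out_len)
{
  int len, e, id;
  int *earleme_to_id;

  if (n_sets <= 0) return NULL;
  /* Assumption: the latest Earley set has the largest earleme. */
  len = id_to_earleme[n_sets - 1] + 1;
  earleme_to_id = malloc((size_t)len * sizeof *earleme_to_id);
  if (earleme_to_id == NULL) return NULL;

  for (e = 0; e < len; e++) earleme_to_id[e] = -1;
  for (id = 0; id < n_sets; id++) {
    assert(id_to_earleme[id] >= 0 && id_to_earleme[id] < len);
    earleme_to_id[id_to_earleme[id]] = id;
  }
  *out_len = len;
  return earleme_to_id;
}
```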
On success, returns the “integer value” of earley_set. For more about the integer value of an Earley set, see marpa_r_earley_set_values().
Return value: On success, returns the integer value of earley_set, and sets the error code to MARPA_ERR_NONE. On hard failure, returns -2, and sets the error code to the error code of the hard failure, which will never be MARPA_ERR_NONE. Note that -2 is a valid “integer value” for an Earley set, so that when -2 is returned, the error code is the only way to distinguish success from failure. The error code can be determined using marpa_g_error(). See marpa_g_error().
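Because -2 can be either a stored value or a failure indication, a caller has to look at both the return value and the error code. A minimal sketch of that check follows. `APP_ERR_NONE` is assumed to equal MARPA_ERR_NONE, and the `set_value_result` type is hypothetical. How the error code is fetched is left to the caller.

```c
#include <assert.h>

#define APP_ERR_NONE 0 /* assumed to equal MARPA_ERR_NONE */

/* Hypothetical result type: `ok` tells whether `value` is a real
   Earley set integer value. */
struct set_value_result {
  int ok;
  int value;
};

/* `rv` is the method's return value; `error_code` is the error code
   read immediately after the call. */
struct set_value_result interpret_set_value(int rv, int error_code)
{
  struct set_value_result r;
  r.value = rv;
  /* Only a return of -2 is ambiguous; there the error code decides. */
  r.ok = (rv != -2) || (error_code == APP_ERR_NONE);
  return r;
}

/* Unwrap a result that the caller has already checked. */
int set_value_unwrap(struct set_value_result r)
{
  assert(r.ok);
  return r.value;
}
```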
On success, does the following:
The “value” and “pointer” of an Earley set are an arbitrary integer and an arbitrary pointer. Libmarpa never examines them and the application is free to use them for its own purposes. In an application with a character-per-earleme input model, for example, the integer value of the Earley set can be used to store the codepoint of the current character. In a traditional token-per-earleme input model, the integer and pointer values could be used to track the string value of the token: the pointer could point to the start of the string, and the integer could indicate its length.
The Earley set integer value defaults to -1, and the pointer value defaults to NULL. The Earley set value and pointer can be set using the marpa_r_latest_earley_set_values_set() method. See marpa_r_latest_earley_set_values_set().
Return value: On success, returns a non-negative integer. On hard failure, returns -2.
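The token-per-earleme usage mentioned above, with the pointer at the start of the token text and the integer holding its length, can be sketched as follows. The `set_annotation` type is a hypothetical application-side mirror of the two values. It uses a `const` pointer for convenience, and nothing here calls Libmarpa.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the integer and pointer kept per Earley set. */
struct set_annotation {
  int value;           /* here: token length in bytes */
  const void *pointer; /* here: start of the token text */
};

/* The documented defaults: -1 and NULL. */
struct set_annotation default_annotation(void)
{
  struct set_annotation a;
  a.value = -1;
  a.pointer = NULL;
  return a;
}

/* Describe the token occupying input[start .. start+length-1]. */
struct set_annotation annotate_token(const char *input, size_t start,
                                     int length)
{
  struct set_annotation a;
  assert(length >= 0);
  a.value = length;
  a.pointer = input + start;
  return a;
}

/* Copy the token text into buf, NUL-terminated and truncated to fit.
   Returns the number of bytes copied. */
int annotation_text(struct set_annotation a, char *buf, int buf_size)
{
  int n;
  assert(buf_size > 0);
  if (a.pointer == NULL || a.value < 0) {
    buf[0] = '\0';
    return 0;
  }
  n = (a.value < buf_size - 1) ? a.value : buf_size - 1;
  memcpy(buf, a.pointer, (size_t)n);
  buf[n] = '\0';
  return n;
}
```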
unsigned int marpa_r_furthest_earleme (Marpa_Recognizer r)
Return value: The furthest earleme. Always succeeds.
Returns the Earley set ID of the latest Earley set. The ID of an Earley set is also called its ordinal. Applications that want the value of the latest earleme can convert this value using the marpa_r_earleme() method. See marpa_r_earleme().
Return value: The ID of the latest Earley set. Always succeeds.
Sets the “integer value” of the latest Earley set to value. For more about the integer value of an Earley set, see marpa_r_earley_set_values().
Return value: On success, returns the newly set integer value of the latest Earley set, and sets the error code to MARPA_ERR_NONE. On hard failure, returns -2, and sets the error code to the error code of the hard failure, which will never be MARPA_ERR_NONE. Note that -2 is a valid “integer value” for an Earley set, so that when -2 is returned, the error code is the only way to distinguish success from failure. The error code can be determined using marpa_g_error(). See marpa_g_error().
Sets the integer and pointer value of the latest Earley set. For more about the “integer value” and “pointer value” of an Earley set, see marpa_r_earley_set_values().
Return value: On success, returns a non-negative integer. On hard failure, returns -2.
For details about the “Earley item warning threshold”, see marpa_r_earley_item_warning_threshold_set().
Return value: The Earley item warning threshold. Always succeeds.
On success, sets the Earley item warning threshold. The Earley item warning threshold is a number that is compared with the count of Earley items in each Earley set. When it is matched or exceeded, a MARPA_EVENT_EARLEY_ITEM_THRESHOLD event is created. See MARPA_EVENT_EARLEY_ITEM_THRESHOLD.
If threshold is zero or less, an unlimited number of Earley items will be allowed without warning. This will rarely be what the user wants.
By default, Libmarpa calculates a value based on the grammar. The formula Libmarpa uses is the result of some experience, and most applications will be happy with it.
What should be done when the threshold is exceeded depends on the application, but exceeding the threshold means that it is very likely that the time and space resources consumed by the parse will prove excessive. This is often a sign of a bug in the grammar. Applications often will want to smoothly shut down the parse, in effect treating the MARPA_EVENT_EARLEY_ITEM_THRESHOLD event as equivalent to a library-recoverable hard failure.
Return value: The value that the Earley item warning threshold has after the method call is finished. Always succeeds.
A parser is “exhausted” if it cannot accept any more input. See Exhaustion.
Return value: 1 if the parser is exhausted, 0 otherwise. Always succeeds.
Returns a list of the IDs of the symbols that are acceptable as tokens at the current earleme. buffer is expected to be large enough to hold the result. This is guaranteed to be the case if the buffer is large enough to hold an array of Marpa_Symbol_ID’s whose length is greater than or equal to the number of symbols in the grammar.
Return value: On success, the number of Marpa_Symbol_ID’s in buffer, which is always non-negative. On hard failure, -2.
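The buffer-sizing rule above suggests allocating one slot per grammar symbol. The sketch below does only that, plus a search of the filled part of the buffer. `symbol_id` is a stand-in typedef for Marpa_Symbol_ID and both function names are hypothetical. How the symbol count is obtained and how the buffer is filled are left to the application.

```c
#include <assert.h>
#include <stdlib.h>

typedef int symbol_id; /* stand-in for Marpa_Symbol_ID */

/* Allocate a buffer with one slot per grammar symbol, which is always
   large enough to hold the list of expected terminals.  The caller
   frees the result.  Returns NULL on allocation failure. */
symbol_id *expected_buffer_new(int symbol_count)
{
  assert(symbol_count > 0);
  return malloc((size_t)symbol_count * sizeof(symbol_id));
}

/* Is `id` among the first `count` entries of `buffer`?  `count` is
   the non-negative return value of the method that filled it. */
int is_expected(const symbol_id *buffer, int count, symbol_id id)
{
  int i;
  for (i = 0; i < count; i++)
    if (buffer[i] == id) return 1;
  return 0;
}
```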
On success, does the following:
Hard fails if the symbol with ID symbol_id does not exist.
Return value: On success, 0 or 1. On hard failure, -2.
It is an important property of the Marpa algorithm that the Earley sets are added one at a time, so that before we have started the construction of the Earley set at n+1, we know the full state of the parse at and before location n. Libmarpa’s progress reports allow access to the Earley items in an Earley set.
To start a progress report, use the marpa_r_progress_report_start() command. For each recognizer, only one progress report can be in use at any one time.
To step through the Earley items, use the marpa_r_progress_item() method.
On success, sets the current vertex of the report traverser to the null vertex. For more about the report traverser, including details about the current and null vertices, see marpa_r_progress_report_start().
This method is not usually needed. Its effect is to leave the traverser in the same state as it is immediately after the marpa_r_progress_report_start() method. Loosely speaking, it allows the traversal to “start over”.
Hard fails if the recognizer is not started, or if no progress report traverser is active.
Return value: On success, a non-negative value. On failure, -2.
Creates a progress report traverser in recognizer r for the Earley set with ID set_id. A progress report traverser is a non-empty directed cycle graph whose vertices consist of the following:
n, and we will write ritem[i] for the i’th report item.
null for the null vertex.
There may be no Earley items in an Earley set, and therefore a progress report traverser may contain no report items. A progress report traverser with no report items is called a “trivial traverser”. A trivial traverser has exactly one edge: (null, null).
The edges of a non-trivial traverser are (null, ritem[0]), (ritem[n-1], null), and, for every i such that 0 < i <= n-1, (ritem[i-1], ritem[i]). This implies that every vertex has exactly one direct successor. The report items are a subgraph, and this graph can be seen as inducing the sequence ritem[0] ... ritem[n-1].
When a progress report traverser is active, one vertex is distinguished as the current vertex, which we will write as current. We call the direct successor of the current vertex the next vertex.
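Since every vertex has exactly one direct successor, the traverser can be modeled by a successor function. In the sketch below, report items are numbered 0 to n-1 and the null vertex is represented by -1. That encoding and the function name are our own choices for illustration and say nothing about how Libmarpa represents the traverser internally.

```c
#include <assert.h>

#define NULL_VERTEX (-1) /* our encoding of the null vertex */

/* Direct successor of `current` in a traverser with n report items.
   `current` is either NULL_VERTEX or a report item index 0..n-1. */
int traverser_next(int current, int n)
{
  assert(n >= 0);
  assert(current == NULL_VERTEX || (current >= 0 && current < n));
  if (n == 0) return NULL_VERTEX;           /* trivial: (null, null) */
  if (current == NULL_VERTEX) return 0;     /* (null, ritem[0]) */
  if (current == n - 1) return NULL_VERTEX; /* (ritem[n-1], null) */
  return current + 1;                       /* (ritem[i], ritem[i+1]) */
}
```

Starting from the null vertex and applying `traverser_next` repeatedly visits ritem[0] through ritem[n-1] and then returns to the null vertex, which is the cycle described above.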
On success, does the following:
n, the number of report items. n may be zero.
Hard fails if no Earley set with ID set_id exists. The error code is MARPA_ERR_INVALID_LOCATION if set_id is negative. The error code is MARPA_ERR_NO_EARLEY_SET_AT_LOCATION if set_id is greater than the ID of the latest Earley set.
Return value: On success, the number of report items, which will always be non-negative. On hard failure, -2.
On success, destroys the progress report traverser for recognizer r, freeing its memory. For details about the report traverser, see marpa_r_progress_report_start().
It is often not necessary to call this method. marpa_r_progress_report_start() destroys any previously existing progress report. And, when a recognizer is destroyed, its progress report is destroyed as a side effect.
Hard fails if no progress report is active.
Return value: On success, a non-negative value. On hard failure, -2.
This method allows access to the data for the next progress report item of a progress report. For details about progress reports, see marpa_r_progress_report_start().
In the event of success:
Let c_before be the vertex that is the current vertex immediately before the call to this method. The report item traverser has exactly one edge such that c_before is its first element. Let this edge be (c_before,c_after). This method sets the current vertex to c_after. In this method description, we will write current as an alias for c_after.
current will be a report item vertex and therefore there will be an Earley item corresponding to current.
current to the location pointed to by the position argument.
current to the location pointed to by the origin argument.
current.
The “cooked dot position” is
Use of the cooked dot position allows an application to quickly determine if the dotted rule is a completion. The cooked dot position is -1 iff the dotted rule is a completion.
In the event of soft failure:
current is the null vertex.
MARPA_ERR_PROGRESS_REPORT_EXHAUSTED.
In addition to watching for soft failure, the application can use the item count returned by marpa_r_progress_report_start() to determine when the last item has been seen.
Return value: On success, the rule ID of the progress report item, which is always non-negative. On soft failure, -1. If either the position or the origin argument is NULL, or on other hard failure, -2.
To create a bocage, use the marpa_b_new() method. When a bocage is no longer in use, its memory can be freed using the marpa_b_unref() method.
A bocage is a data structure containing the parses found by processing the input according to the grammar. It is related to a parse forest, but is in a form that is more compact and easily traversable. “Bocage” is our term, and we discovered this structure independently, but our work was preceded by Elizabeth Scott. And, unlike us, Prof. Scott did the all-important work of documenting it and providing the appropriate mathematical apparatus. See Elizabeth Scott's SPPFs.
The bocage contains the data for the parse trees whose root is an instance of the start symbol that begins at Earley set 0 and ends at the end of parse Earley set. Applications usually use the Earley set at the current earleme as the “end of parse Earley set”, so that the bocage is for parses of the entire input. But some applications may be interested in parsing prefixes of the input, and these applications can choose other end of parse Earley sets in their constructor. See marpa_b_new().
On success, the following is the case:
If earley_set_ID is non-negative, creates a new bocage object, whose “end of parse Earley set” is the Earley set with ID earley_set_ID.
If earley_set_ID is -1, creates a new bocage object, whose “end of parse Earley set” is the Earley set at the current earleme.
If earley_set_ID is -1 and there is no Earley set at the current earleme; or if earley_set_ID is non-negative and there is no parse ending at Earley set earley_set_ID, marpa_b_new() hard fails with the error code MARPA_ERR_NO_PARSE.
Return value: On success, the new bocage object. On hard failure, NULL.
On success, increases the reference count by 1. This method is not needed by most applications.
Return value: On success, b. On hard failure, NULL.
Decreases the reference count by 1, destroying b once the reference count reaches zero. When b is destroyed, the reference count of its parent recognizer is decreased by 1.
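The reference-counting behavior described above, including the effect on the parent when a child is destroyed, can be modeled in a few lines. This is a sketch of the general technique only. The `counted` type is hypothetical, and the assumption that creating a child adds one reference to its parent is ours, inferred from the fact that destroying the child removes one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a reference-counted object with a parent. */
struct counted {
  int refcount;
  int destroyed; /* set once the count reaches zero */
  struct counted *parent;
};

/* A new object starts with one reference and, by assumption, holds
   one reference on its parent. */
void counted_init(struct counted *c, struct counted *parent)
{
  c->refcount = 1;
  c->destroyed = 0;
  c->parent = parent;
  if (parent != NULL) parent->refcount++;
}

/* Like the *_ref() methods: add a reference and return the object. */
struct counted *counted_ref(struct counted *c)
{
  assert(!c->destroyed);
  c->refcount++;
  return c;
}

/* Like the *_unref() methods: drop a reference; on reaching zero,
   destroy the object and release its reference on the parent. */
void counted_unref(struct counted *c)
{
  assert(c->refcount > 0);
  if (--c->refcount == 0) {
    c->destroyed = 1;
    if (c->parent != NULL) counted_unref(c->parent);
  }
}
```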
On success, returns an ambiguity metric. If the parse is unambiguous, the metric is 1. If the parse is ambiguous, the metric is 2 or greater, and is otherwise unspecified. See Better defined ambiguity metric.
Return value: On success, the ambiguity metric, which is always non-negative. On hard failure, -2.
Return value: On success, a non-negative integer: 1 or greater if the bocage is for a null parse, and 0 if the bocage is not for a null parse. On hard failure, -2.
Before iterating through the parse trees in the bocage, the parse trees must be ordered. To create an ordering, use the marpa_o_new() method. When an ordering is no longer in use, its memory can be freed using the marpa_o_unref() method.
An ordering is frozen under the following circumstances:
marpa_o_ambiguity_metric() is successfully called. See marpa_o_ambiguity_metric().
marpa_o_rank() is successfully called. See marpa_o_rank().
A frozen ordering cannot be changed. There is no way to “unfreeze” an ordering.
On success, does the following:
Return value: On success, the new ordering object. On hard failure, NULL.
On success, increases the reference count by 1. Not needed by most applications.
Return value: On success, o. On hard failure, NULL.
Decreases the reference count by 1, destroying o once the reference count reaches zero.
On success, returns an ambiguity metric. If the parse is unambiguous, the metric is 1. If the parse is ambiguous, the metric is 2 or greater, and is otherwise unspecified. See Better defined ambiguity metric.
If “high rank only” is in effect, this ambiguity metric may differ from that returned by marpa_b_ambiguity_metric(). In particular, a “high rank only” ordering may be unambiguous even if its base bocage is ambiguous. But note also that, because multiple parse choices may have the same rank, a “high rank only” ordering may be ambiguous.
If the ordering is not already frozen, it will be frozen on return from marpa_o_ambiguity_metric(). For our purposes, marpa_o_ambiguity_metric() is considered an “accessor”, because it treats its ordering as if it was frozen before the call to marpa_o_ambiguity_metric().
Return value: On success, the ambiguity metric, which is non-negative. On hard failure, -2.
Return value: On success: A number greater than or equal to 1 if the ordering is for a null parse; otherwise, 0. On hard failure, -2.
On success, returns the “high rank only” flag of ordering o. See marpa_o_high_rank_only_set().
Return value: On success, the value of the “high rank only” flag, which is a boolean. On hard failure, -2.
Sets the “high rank only” flag of ordering o. A flag of 1 indicates that, when ranking, all choices should be discarded except those of the highest rank. A flag of 0 indicates that no choices should be discarded on the basis of their rank.
A value of 1 is the default. The value of the “high rank only” flag has no effect until ranking is turned on using the marpa_o_rank() method.
Hard fails if the ordering is frozen.
Return value: On success, a boolean which is the value of the “high rank only” flag after the call. On hard failure, -2.
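The effect of the “high rank only” flag on a set of alternative choices can be illustrated with a small filter. This is a sketch of the idea under the assumption that each choice carries a single integer rank. The function name is hypothetical and the code does not reflect how Libmarpa stores or compares ranks internally.

```c
#include <assert.h>

/* Select among n choices with the given ranks.  If high_rank_only is
   non-zero, keep only the choices of the highest rank; otherwise keep
   them all.  Indices of the kept choices are written to `kept`, which
   must have room for n entries.  Returns the number kept. */
int select_choices(const int *ranks, int n, int high_rank_only, int *kept)
{
  int i, best, count = 0;
  assert(n > 0);
  best = ranks[0];
  for (i = 1; i < n; i++)
    if (ranks[i] > best) best = ranks[i];
  for (i = 0; i < n; i++)
    if (!high_rank_only || ranks[i] == best)
      kept[count++] = i;
  return count;
}
```

When two choices share the highest rank, both survive the filter, which is why a “high rank only” ordering can still be ambiguous.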
By default, the ordering of parse trees is arbitrary. On success, the following happens:
Return value: On success, a non-negative value. On hard failure, -2.
Once the bocage has an ordering, the parse trees can be iterated. Marpa’s parse tree iterators iterate the parse trees contained in a bocage object. In Libmarpa, “parse tree iterators” are usually just called trees.
To create a tree, use the marpa_t_new() method. A newly created tree iterator is positioned before the first parse tree. When a tree iterator is no longer in use, its memory can be freed using the marpa_t_unref() method.
To position a newly created tree iterator at the first parse tree, use the marpa_t_next() method. Once the tree iterator is positioned at a parse tree, the same marpa_t_next() method is used to position it to the next parse tree.
On success, does the following:
Return value: On success, a newly created tree. On hard failure, NULL.
On success, increases the reference count by 1. Not needed by most applications.
Return value: On success, t. On hard failure, NULL.
Decreases the reference count by 1, destroying t once the reference count reaches zero.
On success, positions t at the next parse tree in the iteration.
Tree iterators are initialized to the position before the first parse tree, so this method must be called before creating a valuator from a tree.
If a tree iterator is positioned after the last parse, the tree is said to be “exhausted”. A tree iterator for a bocage with no parse trees is considered to be “exhausted” when initialized.
If the tree iterator is exhausted, soft fails, and sets the error code to MARPA_ERR_TREE_EXHAUSTED. See Orthogonal treatment of soft failures. If the tree iterator is paused, hard fails, and sets the error code to MARPA_ERR_TREE_PAUSED. This hard failure is fully recoverable. See marpa_v_new().
Return value: On success, a non-negative value. On soft failure, -1. On hard failure, -2. The hard failure with error code MARPA_ERR_TREE_PAUSED is fully recoverable.
Returns the count of the number of parse trees traversed so far. The count includes the current iteration of the tree. A value of 0 indicates that the tree iterator is at its initialized position, before the first parse tree.
Return value: The number of parses traversed so far. Always succeeds.
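The iterator states described above (before the first tree, at a tree, exhausted, paused) and the matching return values can be captured in a small model. The `tree_iter_model` type is hypothetical and tracks only counts. The order in which the paused and exhausted conditions are checked is our assumption.

```c
#include <assert.h>

/* Hypothetical model of a tree iterator. */
struct tree_iter_model {
  int n_trees;     /* parse trees available */
  int parse_count; /* trees traversed so far, including the current one */
  int exhausted;   /* positioned after the last parse tree */
  int paused;      /* non-zero while a valuator exists for this tree */
};

/* A new iterator sits before the first tree.  With no trees at all,
   it is exhausted from the start. */
void tree_iter_init(struct tree_iter_model *t, int n_trees)
{
  assert(n_trees >= 0);
  t->n_trees = n_trees;
  t->parse_count = 0;
  t->exhausted = (n_trees == 0);
  t->paused = 0;
}

/* Returns 0 on success, -1 on soft failure (exhausted), and -2 on
   hard failure (paused). */
int tree_iter_next(struct tree_iter_model *t)
{
  if (t->paused) return -2;
  if (t->exhausted) return -1;
  if (t->parse_count >= t->n_trees) {
    t->exhausted = 1;
    return -1;
  }
  t->parse_count++;
  return 0;
}
```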
The archetypal application needs a value object (or valuator) to produce the value of the parse tree. To create a valuator, use the marpa_v_new() method.
The application is required to maintain the stack, and the application is also required to implement most of the semantics, including the evaluation of rules. Libmarpa’s valuator provides instructions to the application on how to manipulate the stack. To iterate through this series of instructions, use the marpa_v_step() method.
When successful, marpa_v_step() returns the type of step. Most step types have values associated with them. See Basic step accessors, see How to use the valuator, and see Stepping through the valuator.
When a valuator is no longer in use, its memory can be freed using the marpa_v_unref() method.
Libmarpa’s valuator provides the application with “steps”, which are instructions for stack manipulation. Libmarpa itself does not maintain a stack. This leaves the upper layer in total control of the stack and the values that are placed on it.
An example may make this clearer. Suppose the evaluation is at a place in the parse tree where an addition is being performed. Libmarpa does not know that the operation is an addition. It will tell the application that rule number R is to be applied to the arguments at stack locations N and N+1, and that the result is to be placed in stack location N.
In this system the application keeps track of the semantics for all rules, so it looks up rule R and determines that it is an addition. The application can do this by using R as an index into an array of callbacks, or by any other method it chooses. Let’s assume a callback implements the semantics for rule R. Libmarpa has told the application that two arguments are available for this operation, and that they are at locations N and N+1 in the stack. They might be the numbers 42 and 711. So the callback is called with its two arguments, and produces a return value, let’s say, 753. Libmarpa has told the application that the result belongs at location N in the stack, so the application writes 753 to location N.
Since Libmarpa knows nothing about the semantics, the operation for rule R could be string concatenation instead of addition. Or, if it is addition, it could allow for its arguments to be floating point or complex numbers. Since the application maintains the stack, it is up to the application whether the stack contains integers, strings, complex numbers, or polymorphic objects that are capable of being any of these things and more.
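The addition example above can be sketched with an array of callbacks indexed by rule ID. Everything in this sketch belongs to the application: the `rule_semantics` type, the `sem_add` callback, the choice of `long` as the stack element type, and the `apply_rule` helper are all illustrative assumptions.

```c
#include <assert.h>

/* Hypothetical callback type: receives the argument values and
   returns the value of the rule. */
typedef long (*rule_semantics)(const long *args, int n_args);

/* Semantics for a two-argument addition rule. */
long sem_add(const long *args, int n_args)
{
  assert(n_args == 2);
  return args[0] + args[1];
}

/* Apply rule `rule_id` to the arguments at stack locations
   arg_0 .. arg_n and write the result to location `result`. */
void apply_rule(long *stack, const rule_semantics *table, int rule_id,
                int arg_0, int arg_n, int result)
{
  /* Compute first, then store: `result` may coincide with arg_0. */
  long value = table[rule_id](stack + arg_0, arg_n - arg_0 + 1);
  stack[result] = value;
}
```

With 42 and 711 at locations N and N+1, applying the addition rule leaves 753 at location N. Replacing `sem_add` with a concatenation callback and `long` with a string type would change the semantics without changing the stack instructions.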
Step-driven valuation hides Libmarpa’s grammar rewrites from the application, and is quite efficient. Libmarpa knows which rules are sequences. Libmarpa optimizes stack manipulations based on this knowledge. Long sequences are very common in practical grammars. For these, the stack manipulations suggested by Libmarpa’s step-driven valuator will be significantly faster than the traditional stack evaluation algorithm.
Step-driven evaluation has another advantage. To illustrate this, consider what is a very common case: The semantics are implemented in a higher-level language, using callbacks. If Libmarpa did not use step-driven valuation, it would need to provide for this case. But for generality, Libmarpa would have to deal in C callbacks. Therefore, a middle layer would have to create C language wrappers for the callbacks in the higher level language.
The implementation that results is this: The higher level language would need to wrap each callback in C. When calling Libmarpa, it would pass the wrappered callback. Libmarpa would then need to call the C language “wrappered” callback. Next, the wrapper would call the higher-level language callback. The return value, which would be data native to the higher-level language, would need to be passed to the C language wrapper, which will need to make arrangements for it to be passed back to the higher-level language when appropriate.
A setup like this is not terribly efficient. And exception handling across language boundaries would be very tricky.
But neither of these is the worst problem. The worst problem is that callbacks are hard to debug. Wrappered callbacks are even worse. Calls made across language boundaries are harder yet to debug. In the system described above, by the time a return value is finally consumed, a language boundary will have been crossed four times. The ability to debug can make the difference between code that works and code that does not work.
So, while step-driven valuation seems a roundabout approach, it is simpler and more direct than the likely alternatives. And there is something to be said for pushing semantics up to the higher levels — they can be expected to know more about it.
These advantages of step-driven valuation are strictly in the context of a low-level interface. We are under no illusion that direct use of Libmarpa’s valuator will be found satisfactory by most Libmarpa users, even those using the C language. Libmarpa’s valuator is intended to be used via an upper layer, one that does know about semantics.
This section discusses in detail the requirements for maintaining the stack. In some cases, such as implementation using a Perl array, fulfilling these requirements is trivial. Perl auto-extends its arrays, and initializes the element values, on every read or write. For the C programmer, things are not quite so easy.
In this section, we will assume a C89 standard-conformant C application. This assumption is convenient on two grounds. First, this will be the intended use for many readers. Second, standard-conformant C89 is a “worst case”. Any issue faced by a programmer of another environment is likely to also be one that must be solved by the C programmer.
Libmarpa often optimizes away unnecessary stack writes to stack locations. When it does so, it will not necessarily optimize away all reads to that stack location. This means that a location’s first access, as suggested by the Libmarpa step instructions, may be a read. This possibility requires a special awareness from the C programmer. See Sizing the stack.
In our discussion of the stack handler for the valuator, we will treat the stack as a 0-based array. If an implementation applies Libmarpa’s step instructions literally, using a physical stack, it must make sure that all locations in the stack are initialized. The range of locations that must be initialized is from stack location 0 to the “end of stack” location. The result of the parse tree is always stored in stack location 0, so that a stack location 0 is always needed. Therefore, the end of stack location is always a specified value, and greater than or equal to 0. The end of stack location must also be greater than or equal to
marpa_v_result(v) for every MARPA_STEP_TOKEN step,
marpa_v_result(v) for every MARPA_STEP_NULLING_SYMBOL step, and
marpa_v_arg_n(v) for every MARPA_STEP_RULE step.
In practice, an application will often extend the stack as it iterates through the steps, initializing the new stack locations as they are created.
Note that our requirement is not merely that the stack locations exist and be writable, but that they be initialized. This is necessary for C89 conformance. Because of write optimizations in our implementation, the first access to any stack location may be a read. C89 allows trap values, so that a read of an uninitialized location could result in undefined behavior. See Trap representations.
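One way to meet the initialization requirement is to extend the stack on demand and initialize every new location as it is created. The sketch below assumes a stack of `long` values initialized to 0. The `value_stack` type and the function name are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical application stack.  Invariant: slot[0 .. size-1]
   exist and are all initialized. */
struct value_stack {
  long *slot;
  int size;
};

/* Make sure locations 0 .. end_of_stack exist and are initialized.
   Returns 0 on success, or -1 on allocation failure, in which case
   the stack is left unchanged. */
int value_stack_ensure(struct value_stack *s, int end_of_stack)
{
  long *grown;
  int i;

  assert(end_of_stack >= 0);
  if (end_of_stack < s->size) return 0; /* already large enough */

  grown = realloc(s->slot, ((size_t)end_of_stack + 1) * sizeof *grown);
  if (grown == NULL) return -1;

  /* Initialize only the new locations, so that a first access that
     happens to be a read never sees an indeterminate value. */
  for (i = s->size; i <= end_of_stack; i++) grown[i] = 0;

  s->slot = grown;
  s->size = end_of_stack + 1;
  return 0;
}
```

An application would call `value_stack_ensure` with the relevant result or argument location before acting on each step, starting from an empty stack (`slot` NULL, `size` 0).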
On success, does the following:
As long as a parent tree iterator is paused, marpa_t_next() will not succeed, and therefore the parent tree iterator cannot move on to a new parse tree. Many valuators can share the same parent parse tree. A tree iterator is “unpaused” when all of the valuators of that tree iterator are destroyed.
Return value: On success, the newly created valuator. On hard failure, NULL.
On success, increases the reference count by 1. Not needed by most applications.
Return value: On success, v. On hard failure, NULL.
Decreases the reference count by 1, destroying v once the reference count reaches zero.
Steps through the tree in depth-first, left-to-right order. On success, does the following:
Marpa_Step_Type
.
The step type tells the application how it is expected to act on the step. See Valuator step types. Steps are often referred to along with their step type so that, for example, we say “a MARPA_STEP_RULE step” to refer to a step whose step type is MARPA_STEP_RULE.
When the iteration through the steps is finished, the step type is MARPA_STEP_INACTIVE. At this point, we say that the valuator is inactive. Once a valuator becomes inactive, it stays inactive.
Return value: On success, a Marpa_Step_Type, which will always be a non-negative integer. On hard failure, -2.
MARPA_STEP_RULE is the step type for a rule node. The application should perform its equivalent of rule execution.
• The values of the child nodes will be in stack locations marpa_v_arg_0(v) to marpa_v_arg_n(v).
• The ID of the rule will be marpa_v_rule(v).
• The result should be placed in stack location marpa_v_result(v). Typically, the result of this step is determined by executing the semantics for its rule on its child values. marpa_v_result(v) is guaranteed to be equal to marpa_v_arg_0(v).
MARPA_STEP_TOKEN is the step type for a token node. The application’s equivalent of the evaluation of the semantics of a non-null token should be performed.
• The ID of the token will be marpa_v_token(v).
• Libmarpa’s “token value” is the value that was assigned to the token by the marpa_r_alternative() method. See marpa_r_alternative(). Libmarpa’s “token value” will be marpa_v_token_value(v).
• The result should be placed in stack location marpa_v_result(v). Often, an application will simply copy Libmarpa’s “token value” to stack location marpa_v_result(v).
MARPA_STEP_NULLING_SYMBOL is the step type for a nulled node. The application’s equivalent of the evaluation of the semantics of a nulling token should be performed.
• The ID of the symbol will be marpa_v_symbol(v).
• The result should be placed in stack location marpa_v_result(v). Often, an application will assign a fixed value to each nullable symbol, and will simply copy this fixed value to stack location marpa_v_result(v).
The use of the word “nulling” in the step type name MARPA_STEP_NULLING_SYMBOL is problematic: while the node must be zero-length (nulled or nulling), the node’s symbol need not be nulling: it may be nullable. See Nulling versus nulled.
When this is the step type, the valuator has gone through all of its steps and is now inactive. The value of the parse tree will be in stack location 0. Because of optimizations, it is possible for a valuator to become inactive immediately: MARPA_STEP_INACTIVE could be both the first and the last step. Once a valuator becomes inactive, it stays inactive.
The valuator is new and has yet to go through any steps.
These step types are reserved for internal purposes.
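The step types above can be illustrated with a small stack handler. The sketch below does not call Libmarpa: the Step record and the StepKind enumeration are hypothetical stand-ins for what a real application would read from the valuator with marpa_v_step() and the step accessors. The rule semantics ("sum the child values") and the fixed value for nulled symbols are arbitrary choices made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the step types; a real application
   would use the values returned by marpa_v_step(). */
typedef enum {
    STEP_RULE, STEP_TOKEN, STEP_NULLING_SYMBOL, STEP_INACTIVE
} StepKind;

/* Hypothetical record of what the accessors would return for a step. */
typedef struct {
    StepKind kind;
    int arg_0;        /* stands in for marpa_v_arg_0(v) */
    int arg_n;        /* stands in for marpa_v_arg_n(v) */
    int result;       /* stands in for marpa_v_result(v) */
    int token_value;  /* stands in for marpa_v_token_value(v) */
} Step;

#define NULLED_VALUE 0   /* fixed value assigned to every nulled symbol */

/* Apply the steps to `stack`, all of whose locations must already be
   initialized. Returns the value left in stack location 0. */
static int run_steps(const Step *steps, size_t step_count, int *stack)
{
    size_t i;

    assert(steps != NULL && stack != NULL);
    for (i = 0; i < step_count; i++) {
        const Step *step = &steps[i];
        int child, sum;

        switch (step->kind) {
        case STEP_TOKEN:
            /* Copy the token value to the result location. */
            stack[step->result] = step->token_value;
            break;
        case STEP_NULLING_SYMBOL:
            stack[step->result] = NULLED_VALUE;
            break;
        case STEP_RULE:
            /* Combine the child values found in arg_0..arg_n. */
            sum = 0;
            for (child = step->arg_0; child <= step->arg_n; child++)
                sum += stack[child];
            stack[step->result] = sum;
            break;
        case STEP_INACTIVE:
            return stack[0];
        }
    }
    return stack[0];
}
```

For a step sequence of a token with value 5 at location 0, a nulled symbol at location 1, a token with value 7 at location 2, and a rule over locations 0 to 2 whose result goes to location 0, this handler leaves 12 in stack location 0.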
This section describes the accessors that are basic to stack manipulation.
Return value: For a MARPA_STEP_RULE step, the stack location where the value of the first child can be found. For other step types, an unspecified value. Always succeeds.
Return value: For a MARPA_STEP_RULE step, the stack location where the value of the last child can be found. For other step types, an unspecified value. Always succeeds.
Return value: For MARPA_STEP_RULE, MARPA_STEP_TOKEN, and MARPA_STEP_NULLING_SYMBOL steps, the stack location where the result of the semantics should be placed. For other step types, an unspecified value. Always succeeds.
Return value: For a MARPA_STEP_RULE step, the ID of the rule. For other step types, an unspecified value. Always succeeds.
This macro is usually not needed since its return value is the same as the value that marpa_v_step() returns on success.
Return value: The current step type: MARPA_STEP_TOKEN, MARPA_STEP_RULE, etc. Always succeeds.
Return value: For a MARPA_STEP_NULLING_SYMBOL step, the ID of the symbol. The value returned is the same as that returned by the marpa_v_token() macro. For other step types, an unspecified value. Always succeeds.
Return value: For a MARPA_STEP_TOKEN step, the ID of the token. The value returned is the same as that returned by the marpa_v_symbol() macro. For other step types, an unspecified value. Always succeeds.
Return value: For a MARPA_STEP_TOKEN step, the “token value” that was assigned to the token by the marpa_r_alternative() method. See marpa_r_alternative(). For other step types, an unspecified value. Always succeeds.
This section describes step accessors that are not basic to stack manipulation. They provide Earley set location information about the parse tree.
A step’s location in terms of Earley sets is called its ES location. Every ES location is the ID of an Earley set. ES location is only relevant for steps that correspond to tree nodes.
Every tree node has both a start ES location and an end ES location. The start ES location is the first ES location of that parse node.
The end ES location of a leaf is the ES location where the next leaf symbol in the fringe of the current parse tree would start. Typically, this is the location where a leaf node actually starts but, toward the end of a parse, there may not be an actual next leaf node.
The start ES location of a MARPA_STEP_RULE step is the start ES location of its first child node in the current parse tree. The end ES location of a MARPA_STEP_RULE step is the end ES location of its last child node in the current parse tree.
Return value: If the current step type is MARPA_STEP_RULE, MARPA_STEP_TOKEN, or MARPA_STEP_NULLING_SYMBOL, the return value is the end ES location of the parse node. If the current step type is anything else, or if the valuator is inactive, the return value is unspecified.
Return value: If the current step type is MARPA_STEP_RULE, the start ES location of the rule node. If the current step type is anything else, or if the valuator is inactive, the return value is unspecified.
Return value: If the current step type is MARPA_STEP_TOKEN or MARPA_STEP_NULLING_SYMBOL, the start ES location of the leaf node. If the current step type is anything else, or if the valuator is inactive, the return value is unspecified.
For every parse node of the current parse tree, the Earley set length (ES length) of the node is the end ES location, less the start ES location. The ES length of a nulled node is always 0.
If v is a valuator whose current step type is MARPA_STEP_NULLING_SYMBOL, it is always the case that
marpa_v_token_start_es_id(v) == marpa_v_es_id(v)
If v is a valuator whose current step type is MARPA_STEP_RULE or MARPA_STEP_TOKEN, it is always the case that
marpa_v_token_start_es_id(v) <= marpa_v_es_id(v)
For the following examples,
Ordered from left to right, a possible fringe is
Null@0-0, Tok@0-1, Null@1-1, Tok@1-2, Null@2-2
Another example is
Null@0-0, Null@0-0, Tok@0-1, Null@1-1, Null@1-1, Tok@1-2, Null@2-2, Null@2-2
In this second example, note that when a nulled leaf immediately follows another nulled leaf, both leaves have the same start ES location and the same end ES location. This makes sense, because nulled symbol instances do not advance the current ES location, but it also implies that the ES locations and LHS symbol cannot be used to uniquely identify a parse node.
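The ES length arithmetic and the point about adjacent nulled leaves can be checked with a short sketch. The Leaf record and both helper functions are hypothetical; in a real application the start and end values would come from the step location accessors described in this section.

```c
#include <assert.h>

/* Hypothetical record of one fringe leaf, holding the start and
   end ES locations described above. */
typedef struct {
    int is_nulled;
    int start_es;
    int end_es;
} Leaf;

/* ES length: the end ES location, less the start ES location. */
static int es_length(const Leaf *leaf)
{
    assert(leaf->start_es <= leaf->end_es);
    return leaf->end_es - leaf->start_es;
}

/* Two leaves cannot be told apart by ES location alone when both
   their start and their end ES locations match. */
static int same_span(const Leaf *a, const Leaf *b)
{
    return a->start_es == b->start_es && a->end_es == b->end_es;
}
```

Applied to the second fringe above, the first two leaves (both Null@0-0) have ES length 0 and the same span, while the following Tok@0-1 has ES length 1.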
This chapter discusses Libmarpa’s events. It contains descriptions of both grammar and recognizer methods.
A method is event-generating iff it can add events to the event queue. The event-generating methods are marpa_g_precompute(), marpa_r_earleme_complete(), and marpa_r_start_input(). The event-generating methods always clear all previous events so that, on return from an event-generating method, the only events in the event queue will be the events generated by that method.
A Libmarpa method or macro is event-safe iff it does not change the event queue. All Libmarpa accessors are event-safe.
Regardless of the event-safety of the method calls between event generation and event access, it is good practice to access events as soon as reasonable after the method that generated them. Note that events are kept in the base grammar, so that multiple recognizers using the same base grammar overwrite each other’s events.
To find out how many events are in the event queue, use the marpa_g_event_count() method. To access specific events, use the marpa_g_event() and marpa_g_event_value() methods.
On success,
Event indexes are in sequence. Valid events will be in the range from 0 to n, where n is one less than the event count. The event count can be read using the marpa_g_event_count() method.
Hard fails if there is no ix’th event, or if ix is negative. On failure, the locations pointed to by event are not changed.
Return value: On success, the type of event ix, which is always non-negative. On hard failure, -2.
Return value: On success, the number of events, which is always non-negative. On hard failure, -2.
Returns the “value” of the event. The semantics of the value varies according to the type of the event, and is described in the section on event codes. See Event codes.
Return value: The “value” of the event. Always succeeds.
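The indexing and failure behavior described above can be modeled in a few lines. This is a toy model, not Libmarpa code: the Event and EventQueue types, the MAX_EVENTS bound, and the queue_* functions are all hypothetical, and only the behavior they imitate (clearing before generation, indexes from 0 to one less than the count, -2 on a bad index, the caller's event left unchanged on failure) is taken from the text.

```c
#include <assert.h>

#define MAX_EVENTS 8   /* arbitrary bound for this model */

/* Hypothetical model of an event and of the event queue. */
typedef struct { int type; int value; } Event;
typedef struct { Event events[MAX_EVENTS]; int count; } EventQueue;

/* An event-generating method starts by discarding earlier events. */
static void queue_clear(EventQueue *queue)
{
    queue->count = 0;
}

static void queue_add(EventQueue *queue, int type, int value)
{
    assert(queue->count < MAX_EVENTS);
    queue->events[queue->count].type = type;
    queue->events[queue->count].value = value;
    queue->count++;
}

/* Copy event `ix` into *event and return its type. Returns -2, and
   leaves *event unchanged, if `ix` is negative or if there is no
   event at that index. */
static int queue_event(const EventQueue *queue, Event *event, int ix)
{
    if (ix < 0 || ix >= queue->count)
        return -2;
    *event = queue->events[ix];
    return event->type;
}
```

In this model, a caller would loop ix from 0 while ix is less than the count, which mirrors reading marpa_g_event_count() once and then fetching each event in turn.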
Libmarpa can be set up to generate a MARPA_EVENT_SYMBOL_COMPLETED event whenever the symbol is completed. A symbol is said to be completed when a non-nulling rule with that symbol on its LHS is completed.
For a completion event to be generated, the symbol must be marked, and the event must be activated. To mark a symbol as a completion event symbol, use the marpa_g_symbol_is_completion_event_set() method. The event will be activated by default. To activate or deactivate a completion symbol event, use the marpa_r_completion_symbol_activate() method.
Allows the user to deactivate and reactivate symbol completion events in the grammar. On success, does the following:
The activation status of a completion event in the grammar becomes the initial activation status of a completion event in all of its child recognizers.
This method is seldom needed. When a symbol is marked as a completion event symbol in the grammar, it is activated by default. See marpa_g_symbol_is_completion_event_set(). A completion event can also be deactivated and reactivated in the recognizer using the marpa_r_completion_symbol_activate() method. See marpa_r_completion_symbol_activate().
Hard fails if the sym_id is not marked as a completion event symbol in the grammar, or if the grammar has not been precomputed.
Return value: On success, the value of reactivate, which is a boolean. On hard failure, -2.
Allows the user to deactivate and reactivate symbol completion events in the recognizer. On success, does the following:
When a recognizer is created, the activation status of its symbol completion event for sym_id is initialized to the activation status of the symbol completion event for sym_id in the base grammar.
Hard fails if sym_id was not marked for completion events in the base grammar.
Return value: On success, the value of reactivate, which is a boolean. On hard failure, -2.
On success, returns a boolean which is 1 iff sym_id is marked as a completion event symbol in g. For more about completion events, see marpa_g_symbol_is_completion_event_set().
On soft failure, sym_id is well-formed, but there is no such symbol.
Hard fails if g is precomputed.
Return value: On success, a boolean. On soft failure, -1. On hard failure, -2.
Libmarpa can be set up to generate a MARPA_EVENT_SYMBOL_COMPLETED event whenever the symbol is completed. A symbol is said to be completed when a non-nulling rule with that symbol on its LHS is completed.
For completion events for sym_id to occur, sym_id must be marked as a completion event symbol, and the completion event for sym_id must be activated in the recognizer. Event activation also occurs in the grammar, and the recognizer event activation status for sym_id is initialized from the grammar event activation status for sym_id. See marpa_g_completion_symbol_activate(), and see marpa_r_completion_symbol_activate().
On success, if value is 1,
On success, if value is 0,
Nulled rules and symbols will never cause completion events. Nullable symbols may be marked as completion event symbols, but this will have an effect only when the symbol is not nulled. Nulling symbols may be marked as completion event symbols, but no completion events will ever be generated for a nulling symbol. Note that this implies that no completion event will ever be generated at earleme 0, the start of parsing.
If sym_id is well-formed, but there is no such symbol, soft fails.
Hard fails if the grammar is precomputed.
Return value: On success, value, which is a boolean. On soft failure, -1. On hard failure, -2.
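The relationship between marking and activation, in the grammar and in the recognizer, can be summarized with a toy model. Everything in the sketch is hypothetical: Libmarpa keeps this state internally, and the model leaves out the precomputation checks and the soft failures described above. It shows only that marking activates the event by default, that a new recognizer starts from the grammar's activation status, and that later changes in either object do not affect the other.

```c
#include <assert.h>

/* Hypothetical per-symbol event flags for a grammar and a recognizer. */
typedef struct { int is_marked; int is_active; } GrammarEventFlags;
typedef struct { int is_marked; int is_active; } RecognizerEventFlags;

/* Marking a symbol in the grammar also activates its event. */
static void grammar_mark(GrammarEventFlags *g)
{
    g->is_marked = 1;
    g->is_active = 1;
}

/* Returns the new activation status, or -2 if the symbol is unmarked. */
static int grammar_activate(GrammarEventFlags *g, int reactivate)
{
    assert(reactivate == 0 || reactivate == 1);
    if (!g->is_marked)
        return -2;
    g->is_active = reactivate;
    return reactivate;
}

/* A new recognizer takes its initial status from the grammar. */
static RecognizerEventFlags recognizer_new(const GrammarEventFlags *g)
{
    RecognizerEventFlags r;
    r.is_marked = g->is_marked;
    r.is_active = g->is_active;
    return r;
}

/* Returns the new activation status, or -2 if the symbol is unmarked. */
static int recognizer_activate(RecognizerEventFlags *r, int reactivate)
{
    assert(reactivate == 0 || reactivate == 1);
    if (!r->is_marked)
        return -2;
    r->is_active = reactivate;
    return reactivate;
}

/* An event can occur only if the symbol is marked and the event active. */
static int event_can_occur(const RecognizerEventFlags *r)
{
    return r->is_marked && r->is_active;
}
```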
Libmarpa can be set up to generate a MARPA_EVENT_SYMBOL_NULLED event whenever the symbol is nulled. A symbol is said to be nulled when a zero-length instance of that symbol is recognized.
For a nulled event to be generated, the symbol must be marked, and the event must be activated. To mark a symbol as a nulled event symbol, use the marpa_g_symbol_is_nulled_event_set() method. The event will be activated by default. To activate or deactivate a nulled symbol event, use the marpa_r_nulled_symbol_activate() method.
Allows the user to deactivate and reactivate symbol nulled events in the grammar. On success, does the following:
The activation status of a nulled event in the grammar becomes the initial activation status of a nulled event in all of its child recognizers.
This method is seldom needed. When a symbol is marked as a nulled event symbol in the grammar, it is activated by default. See marpa_g_symbol_is_nulled_event_set(). A nulled event can also be deactivated and reactivated in the recognizer using the marpa_r_nulled_symbol_activate() method. See marpa_r_nulled_symbol_activate().
Hard fails if the sym_id is not marked as a nulled event symbol in the grammar, or if the grammar has not been precomputed.
Return value: On success, the value of reactivate, which is a boolean. On hard failure, -2.
Allows the user to deactivate and reactivate symbol nulled events in the recognizer. On success, does the following:
When a recognizer is created, the activation status of its symbol nulled event for sym_id is initialized to the activation status of the symbol nulled event for sym_id in the base grammar.
Hard fails if sym_id was not marked for nulled events in the base grammar.
Return value: On success, the value of reactivate, which is a boolean. On hard failure, -2.
On success, returns a boolean which is 1 iff sym_id is marked as a nulled event symbol in g. For more about nulled events, see marpa_g_symbol_is_nulled_event_set().
On soft failure, sym_id is well-formed, but there is no such symbol.
Hard fails if g is precomputed.
Return value: On success, a boolean. On soft failure, -1. On hard failure, -2.
Libmarpa can be set up to generate a MARPA_EVENT_SYMBOL_NULLED event whenever the symbol is nulled. A symbol is said to be nulled when a zero-length instance of that symbol is recognized.
For nulled events for sym_id to occur, sym_id must be marked as a nulled event symbol, and the nulled event for sym_id must be activated in the recognizer. Event activation also occurs in the grammar, and the recognizer event activation status for sym_id is initialized from the grammar event activation status for sym_id. See marpa_g_nulled_symbol_activate(), and see marpa_r_nulled_symbol_activate().
On success, if value is 1,
On success, if value is 0,
A symbol instance can never generate both a nulled and a prediction event at the same location. Also, a symbol instance can never generate both a nulled and a completion event at the same location. (As a reminder, a symbol instance is a symbol starting at a specific location in the input, and with a specific length.) This is because the symbol instance for a nulled event must be zero length, and the symbol instance for prediction and completion events can never be zero length.
However, prediction and nulled events for the same symbol can trigger at the same location. This is because the same location can be the location of a nulled instance of a symbol, and the start of a non-nulled instance of the same symbol.
Also, completion and nulled events for the same symbol can trigger at the same location. This is because the same location can be the location of a nulled instance of a symbol, and the end of one or more non-nulled instances of the same symbol.
The marpa_g_symbol_is_nulled_event_set() method will mark a symbol as a nulled event symbol, even if the symbol is non-nullable. This is convenient, for example, for automatically generated grammars. Applications that wish to treat an attempt to mark a non-nullable symbol as a nulled event symbol as a failure can check for this case using the marpa_g_symbol_is_nullable() method.
If sym_id is well-formed, but there is no such symbol, soft fails.
Hard fails if the grammar is precomputed.
Return value: On success, value, which is a boolean. On soft failure, -1. On hard failure, -2.
Libmarpa can be set up to generate a MARPA_EVENT_SYMBOL_PREDICTED event when a non-nulled symbol is predicted. A non-nulled symbol is said to be predicted when an instance of it is acceptable at the current earleme according to the grammar. Nulled symbols do not generate predictions.
For a prediction event to be generated, the symbol must be marked, and the event must be activated. To mark a symbol as a prediction event symbol, use the marpa_g_symbol_is_prediction_event_set() method. The event will be activated by default. To activate or deactivate a prediction symbol event, use the marpa_r_prediction_symbol_activate() method.
Allows the user to deactivate and reactivate symbol prediction events in the grammar. On success, does the following:
The activation status of a prediction event in the grammar becomes the initial activation status of a prediction event in all of its child recognizers.
This method is seldom needed. When a symbol is marked as a prediction event symbol in the grammar, it is activated by default. See marpa_g_symbol_is_prediction_event_set(). A prediction event can also be deactivated and reactivated in the recognizer using the marpa_r_prediction_symbol_activate() method. See marpa_r_prediction_symbol_activate().
Hard fails if the sym_id is not marked as a prediction event symbol in the grammar, or if the grammar has not been precomputed.
Return value: On success, the value of reactivate, which is a boolean. On hard failure, -2.
Allows the user to deactivate and reactivate symbol prediction events in the recognizer. On success, does the following:
When a recognizer is created, the activation status of its symbol prediction event for sym_id is initialized to the activation status of the symbol prediction event for sym_id in the base grammar.
Hard fails if sym_id was not marked for prediction events in the base grammar.
Return value: On success, the value of reactivate, which is a boolean. On hard failure, -2.
On success, returns a boolean which is 1 iff sym_id is marked as a prediction event symbol in g. For more about prediction events, see marpa_g_symbol_is_prediction_event_set().
On soft failure, sym_id is well-formed, but there is no such symbol.
Hard fails if g is precomputed.
Return value: On success, a boolean. On soft failure, -1. On hard failure, -2.
Libmarpa can be set up to generate a MARPA_EVENT_SYMBOL_PREDICTED event when a non-nulled symbol is predicted. A non-nulled symbol is said to be predicted when an instance of it is acceptable at the current earleme according to the grammar. Nulled symbols do not generate predictions.
For prediction events for sym_id to occur, sym_id must be marked as a prediction event symbol, and the prediction event for sym_id must be activated in the recognizer. Event activation also occurs in the grammar, and the recognizer event activation status for sym_id is initialized from the grammar event activation status for sym_id. See marpa_g_prediction_symbol_activate(), and see marpa_r_prediction_symbol_activate().
On success, if value is 1,
On success, if value is 0,
If sym_id is well-formed, but there is no such symbol, soft fails.
Hard fails if the grammar is precomputed.
Return value: On success, value, which is a boolean. On soft failure, -1. On hard failure, -2.
Libmarpa can be set up to generate an expected symbol event (MARPA_EVENT_SYMBOL_EXPECTED) when the symbol with ID symbol_id is acceptable as a terminal at the current earleme. Note that the symbol expected event is only generated if the symbol with ID symbol_id is acceptable as a terminal. If the symbol with ID symbol_id is expected at the current earleme as a non-terminal, but is not acceptable as a terminal, an expected symbol event will not be triggered at the current earleme.
On success, if value is 1,
On success, if value is 0,
Hard fails if value is not a boolean. Hard fails if value is 1, and symbol_id is the ID of a nulling symbol, an inaccessible symbol, or an unproductive symbol. Hard fails if symbol_id is not the ID of a valid symbol.
Return value: On success, value, which will be a boolean. On hard failure, -2.
Applications should never see this event. Event value: Unspecified. Suggested message: "No event".
A nullable symbol is either the separator for, or the right hand side of, a sequence. Event value: The ID of the symbol. Suggested message: "This symbol is a counted nullable".
This event indicates that an application-settable threshold on the number of Earley items has been reached or exceeded. See marpa_r_earley_item_warning_threshold_set().
Event value: The current Earley item count. Suggested message: "Too many Earley items".
The parse is exhausted. Event value: Unspecified. Suggested message: "Recognizer is exhausted".
One or more rules are loop rules — rules that are part of a cycle. Cycles are pathological cases of recursion, in which the same symbol string derives itself a potentially infinite number of times. Nonetheless, Marpa parses in the presence of these, and it is up to the application to treat these as fatal errors, something they almost always will wish to do. Event value: The count of loop rules. Suggested message: "Grammar contains a infinite loop".
This event occurs only if the LHS terminals feature is in use. The LHS terminals feature is deprecated. See LHS terminals. Event value: The ID of the symbol. Suggested message: "This symbol is a nulling terminal".
The recognizer can be set to generate an event when a symbol is completed, using its marpa_g_symbol_is_completion_event_set() method. (A symbol is “completed” if and only if any rule with that symbol as its LHS is completed.) This event code indicates that one of those events occurred. Event value: The ID of the completed symbol. Suggested message: "Completed symbol".
The recognizer can be set to generate an event when a symbol is expected as a terminal, using its marpa_r_expected_symbol_event_set() method. Note that this event only triggers if the symbol is expected as a terminal. Predicted symbols that are not expected as terminals do not trigger this event. This event code indicates that one of those events occurred. Event value: The ID of the expected symbol. Suggested message: "Expecting symbol".
The recognizer can be set to generate an event when a symbol is nulled, that is, recognized as a zero-length symbol. To set a nulled symbol event, use the recognizer’s marpa_g_symbol_is_nulled_event_set() method. This event code indicates that a nulled symbol event occurred. Event value: The ID of the nulled symbol. Suggested message: "Symbol was nulled".
The recognizer can be set to generate an event when a symbol is predicted. To set a predicted symbol event, use the recognizer’s marpa_g_symbol_is_prediction_event_set() method. Unlike the MARPA_EVENT_SYMBOL_EXPECTED event, the MARPA_EVENT_SYMBOL_PREDICTED event triggers for predictions of both non-terminals and terminals. This event code indicates that a predicted symbol event occurred. Event value: The ID of the predicted symbol. Suggested message: "Symbol was predicted".
Allows the application to read the error code. p_error_string is reserved for use by the internals. Applications should set it to NULL.
Return value: The current error code. Always succeeds.
Sets the error code to MARPA_ERR_NONE. Not often used, but now and then it can be useful to force the error code to a known state.
Return value: MARPA_ERR_NONE. Always succeeds.
The number of error codes. All error codes, whether internal or external, will be integers, non-negative but strictly less than MARPA_ERRCODE_COUNT.
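Because every error code lies in the range from 0 up to, but not including, MARPA_ERRCODE_COUNT, an application can index a message table by error code after a simple range check. The sketch below is hypothetical application code: error_message() is not a Libmarpa function, the table is supplied by the caller, and errcode_count stands in for MARPA_ERRCODE_COUNT, whose actual value comes from the Libmarpa headers.

```c
#include <assert.h>
#include <stddef.h>

/* Look up a message for `code` in a caller-supplied table that has
   `errcode_count` entries. Codes outside the valid range, and codes
   with no table entry, get a fallback message. */
static const char *error_message(int code,
                                 const char *const *messages,
                                 int errcode_count)
{
    assert(messages != NULL && errcode_count > 0);
    if (code < 0 || code >= errcode_count || messages[code] == NULL)
        return "Unknown error code";
    return messages[code];
}
```

A table for this function might, for example, hold "No error" at index 0, matching the suggested message for MARPA_ERR_NONE.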
This section lists the external error codes. These are the only error codes that users of the Libmarpa external interface should ever see. Internal error codes are in their own section (Internal error codes).
No error condition. The error code is initialized to this value. Methods that do not result in failure sometimes reset the error code to MARPA_ERR_NONE. Numeric value: 0. Suggested message: "No error".
A separator was specified for a sequence rule, but its ID was not that of a valid symbol. Numeric value: 6. Suggested message: "Separator has invalid symbol ID".
A tree iterator is positioned before the first tree, and the tree iterator was specified in a context where the tree iterator must be positioned at or after the first tree. A newly created tree iterator is positioned before the first tree. To position a newly created tree iterator at the first tree, use the marpa_t_next() method. Numeric value: 91. Suggested message: "Tree iterator is before first tree".
A “counted” symbol was found that is also a nullable symbol. A “counted” symbol is one that appears on the RHS of a sequence rule. If a symbol is nullable, counting its occurrences becomes difficult. Questions of definition and problems of implementation arise. At a minimum, a sequence with counted nullables would be wildly ambiguous.
Sequence rules are simply an optimized shorthand for rules that can also be written in ordinary BNF. If the equivalent of a sequence of nullables is really what your application needs, nothing in Libmarpa prevents you from specifying that sequence with ordinary BNF rules.
Numeric value: 8. Suggested message: "Nullable symbol on RHS of a sequence rule".
This error indicates an attempt to add a BNF rule that is a duplicate of a BNF rule already in the grammar. Two BNF rules are considered duplicates if
Duplication of sequence rules, and duplication between BNF rules and sequence rules, is dealt with by requiring that the LHS of a sequence rule not be the LHS of any other rule.
Numeric value: 11. Suggested message: "Duplicate rule".
This error indicates an attempt to add a duplicate token. A token is a duplicate if one already read at the same earleme has the same symbol ID and the same length. Numeric value: 12. Suggested message: "Duplicate token".
This error code indicates that an implementation-defined limit on the number of Earley items per Earley set was exceeded. This limit is different from the Earley item warning threshold, an optional limit on the number of Earley items in an Earley set, which can be set by the application.
The implementation-defined limit is very large, at least 500,000,000 Earley items. An application is unlikely ever to see this error. Libmarpa’s use of memory would almost certainly exceed the implementation’s limits before it occurred. Numeric value: 13. Suggested message: "Maximum number of Earley items exceeded".
A negative event index was specified. That is not allowed. Numeric value: 15. Suggested message: "Negative event index".
A non-negative event index was specified, but there is no event at that index. Since the events are in sequence, this means it was too large. Numeric value: 16. Suggested message: "No event at that index".
The grammar has a cycle, that is, one or more loop rules. This is a recoverable error, although most applications will want to treat it as fatal. For more, see the description of marpa_g_precompute(). Numeric value: 17. Suggested message: "Grammar has cycle".
This is an internal error, and indicates that Libmarpa was wrongly built. Libmarpa was compiled with headers that do not match the rest of the code. The solution is to find a correctly built Libmarpa. Numeric value: 98. Suggested message: "Internal error: Libmarpa was built incorrectly".
The Libmarpa base grammar is in a “not ok” state. Currently, the only way this can happen is if Libmarpa memory is being overwritten. Numeric value: 29. Suggested message: "Marpa is in a not OK state".
This error code indicates that the token symbol is an inaccessible symbol — one that cannot be reached from the start symbol.
Since the inaccessibility of a symbol is a property of the grammar, this error code typically indicates an application error. Nevertheless, a retry at this location, using another token ID, may succeed. At this writing, the author knows of no uses of this technique.
Numeric value: 18. Suggested message: "Token symbol is inaccessible".
A function was called that takes a boolean argument, but the value of that argument was not either 0 or 1. Numeric value: 22. Suggested message: "Argument is not boolean".
The location (Earley set ID) is not valid. It may be invalid for one of two reasons:
For users of input models other than the standard one, the term “location”, as used in association with this error code, means Earley set ID or Earley set ordinal. In the standard input model, this will always be identical with Libmarpa’s other idea of location, the earleme.
Numeric value: 25. Suggested message: "Location is not valid".
A start symbol was specified, but its symbol ID is not that of a valid symbol. Numeric value: 27. Suggested message: "Specified start symbol is not valid".
A method was called with an invalid assertion ID. This is an assertion ID that not only does not exist, but cannot exist. Currently that means its value is less than zero. Numeric value: 96. Suggested message: "Assertion ID is malformed".
A method was called with an invalid rule ID. This is a rule ID that not only does not exist, but cannot exist. Currently that means its value is less than zero. Numeric value: 26. Suggested message: "Rule ID is malformed".
A method was called with an invalid symbol ID. This is a symbol ID that not only does not exist, but cannot exist. Currently that means its value is less than zero. Numeric value: 28. Suggested message: "Symbol ID is malformed".
There was a mismatch in the major version number between the requested version of Libmarpa and the actual one. Numeric value: 30. Suggested message: "Libmarpa major version number is a mismatch".
There was a mismatch in the micro version number between the requested version of Libmarpa and the actual one. Numeric value: 31. Suggested message: "Libmarpa micro version number is a mismatch".
There was a mismatch in the minor version number between the requested version of Libmarpa and the actual one. Numeric value: 32. Suggested message: "Libmarpa minor version number is a mismatch".
A non-negative Earley set ID (also called an Earley set ordinal) was specified, but there is no corresponding Earley set. Since the Earley set ordinals are in sequence, this means that the specified ID is greater than that of the latest Earley set. Numeric value: 39. Suggested message: "Earley set ID is after latest Earley set".
The grammar is not precomputed, and an attempt was made to do something with it that is not allowed for unprecomputed grammars. For example, a recognizer cannot be created from a grammar until it is precomputed. Numeric value: 34. Suggested message: "This grammar is not precomputed".
The application attempted to create a bocage from a recognizer with no parse tree. Applications will often want to treat this as a soft error. Numeric value: 41. Suggested message: "No parse".
A grammar that has no rules is being used in a way that is not allowed. Usually the problem is that the user is trying to precompute the grammar. Numeric value: 42. Suggested message: "This grammar does not have any rules".
The grammar has no start symbol, and an attempt was made to perform an operation that requires one. Usually the problem is that the user is trying to precompute the grammar. Numeric value: 43. Suggested message: "This grammar has no start symbol".
A method was called with an assertion ID that is well-formed, but the assertion does not exist. Numeric value: 97. Suggested message: "No assertion with this ID exists".
A method was called with a rule ID that is well-formed, but the rule does not exist. Numeric value: 89. Suggested message: "No rule with this ID exists".
A method was called with a symbol ID that is well-formed, but the symbol does not exist. Numeric value: 90. Suggested message: "No symbol with this ID exists".
This error code indicates that no tokens at all were expected at this earleme location. This can only happen in alternative input models.
Typically, this indicates an application programming error. Retrying input at this location will always fail. But if the application is able to leave this earleme empty, a retry at a later location, using this or another token, may succeed. At this writing, the author knows of no uses of this technique.
Numeric value: 44. Suggested message: "No token is expected at this earleme location".
This error occurs in situations where a rule is required to be a sequence, and indicates that the rule of interest is, in fact, not a sequence.
Numeric value: 99. Suggested message: "Rule is not a sequence".
This error occurs only if the LHS terminals feature is in use. The LHS terminals feature is deprecated. See LHS terminals. Numeric value: 49. Suggested message: "A symbol is both terminal and nulling".
The Marpa order object has been frozen. If a Marpa order object is frozen, it cannot be changed.
Multiple tree iterators can share a Marpa order object, but that order object is frozen after the first tree iterator is created from it. Applications can order a bocage in many ways, but they must do so by creating multiple order objects.
Numeric value: 50. Suggested message: "The ordering is frozen".
The parse is exhausted. Numeric value: 53. Suggested message: "The parse is exhausted".
The parse is too long. The limit on the length of a parse is implementation dependent, but it is very large, at least 500,000,000 earlemes.
This error code is unlikely in the standard input model. Almost certainly memory would be exceeded before it could occur. If an application sees this error, it is almost certainly using one of the non-standard input models.
Most often this message will occur because of a request to add a single extremely long token, perhaps as a result of an application error. But it is also possible this error condition will occur after the input of a large number of long tokens.
Numeric value: 54. Suggested message: "This input would make the parse too long".
In a method that takes pointers as arguments, one of the pointer arguments is NULL, in a case where that is not allowed. One such method is marpa_r_progress_item(). Numeric value: 56. Suggested message: "An argument is null when it should not be".
An attempt was made to use a precomputed grammar in a way that is not allowed. Often this is an attempt to change the grammar. Nearly every change to a grammar after precomputation invalidates the precomputation, and is therefore not allowed. Numeric value: 57. Suggested message: "This grammar is precomputed".
No recognizer progress report is currently active, and an action has been attempted that requires the progress report to be active. One such action would be a marpa_r_progress_item() call. Numeric value: 59. Suggested message: "No progress report has been started".
The progress report is “exhausted” — all its items have been iterated through. Numeric value: 58. Suggested message: "The progress report is exhausted".
A symbol or rule rank was specified that was less than an implementation-defined minimum. Implementations will always allow at least those ranks in the range between -134,217,727 and 134,217,727. Numeric value: 85. Suggested message: "Rule or symbol rank too low".
A symbol or rule rank was specified that was greater than an implementation-defined maximum. Implementations will always allow at least those ranks in the range between -134,217,727 and 134,217,727. Numeric value: 86. Suggested message: "Rule or symbol rank too high".
The recognizer is “inconsistent”, usually because the user has rejected one or more rules or terminals, and has not yet called the marpa_r_consistent() method. Numeric value: 95. Suggested message: "The recognizer is inconsistent".
The recognizer is not accepting input, and the application has attempted something that is inconsistent with that fact. Numeric value: 60. Suggested message: "The recognizer is not accepting input".
The recognizer has not been started, and the application has attempted something that is inconsistent with that fact. Numeric value: 61. Suggested message: "The recognizer has not been started".
The recognizer has been started, and the application has attempted something that is inconsistent with that fact. Numeric value: 62. Suggested message: "The recognizer has been started".
The index of a RHS symbol was specified, and it was negative. That is not allowed. Numeric value: 63. Suggested message: "RHS index cannot be negative".
A non-negative index of a RHS symbol was specified, but there is no symbol at that index. Since the indexes are in sequence, this means the index was greater than or equal to the rule length. Numeric value: 64. Suggested message: "RHS index must be less than rule length".
An attempt was made to add a rule with too many right hand side symbols. The limit on the RHS symbol count is implementation dependent, but it is very large, at least 500,000,000 symbols. This is far beyond what is required in any current practical grammar. An application with rules of this length is almost certain to run into memory and other limits. Numeric value: 65. Suggested message: "The RHS is too long".
The LHS of a sequence rule cannot be the LHS of any other rule, whether a sequence rule or a BNF rule. An attempt was made to violate this restriction. Numeric value: 66. Suggested message: "LHS of sequence rule would not be unique".
The start symbol is not on the LHS of any rule. That means it could never match any possible input, not even the null string. Presumably, this is an error in writing the grammar. Numeric value: 73. Suggested message: "Start symbol not on LHS of any rule".
An attempt was made to use a symbol in a way that requires it to be set up for completion events, but the symbol was not set up for completion events. The archetypal case is an attempt to activate a completion event in the recognizer for a symbol that is not set up as a completion event. Numeric value: 92. Suggested message: "Symbol is not set up for completion events".
An attempt was made to use a symbol in a way that requires it to be set up for nulled events, but the symbol was not set up for nulled events. The archetypal case is an attempt to activate a nulled event in the recognizer for a symbol that is not set up as a nulled event. Numeric value: 93. Suggested message: "Symbol is not set up for nulled events".
An attempt was made to use a symbol in a way that requires it to be set up for prediction events, but the symbol was not set up for prediction events. The archetypal case is an attempt to activate a prediction event in the recognizer for a symbol that is not set up as a prediction event. Numeric value: 94. Suggested message: "Symbol is not set up for prediction events".
Unvalued symbols are a deprecated Marpa feature, which may be avoided with the marpa_g_force_valued() method. An unvalued symbol may take on any value, and therefore a symbol that is unvalued at some points cannot safely be used to contain a value at others. This error indicates that such an unsafe use is being attempted. Numeric value: 74. Suggested message: "Symbol is treated both as valued and unvalued".
An attempt was made to change the terminal status of a symbol to a different value after it was locked. Numeric value: 75. Suggested message: "The terminal status of the symbol is locked".
A token was specified whose symbol ID is not a terminal. Numeric value: 76. Suggested message: "Token symbol must be a terminal".
A token length was specified that is less than or equal to zero. Zero-length tokens are not allowed in Libmarpa. Numeric value: 77. Suggested message: "Token length must greater than zero".
The token length is too long. The limit on the length of a token is implementation dependent, but it is at least 500,000,000 earlemes. An application using a token that long is almost certain to run into some other limit. Numeric value: 78. Suggested message: "Token is too long".
A Libmarpa parse tree iterator is “exhausted”, that is, it has no more parse trees. Numeric value: 79. Suggested message: "Tree iterator is exhausted".
A Libmarpa tree is “paused” and an operation was attempted that is inconsistent with that fact. Typically, this operation will be a call of the marpa_t_next() method. Numeric value: 80. Suggested message: "Tree iterator is paused".
An attempt was made to read a token where a token with that symbol ID is not expected. This message can also occur when an attempt is made to read a token at a location where no token is expected. Numeric value: 81. Suggested message: "Unexpected token".
The start symbol is unproductive. That means it could never match any possible input, not even the null string. Presumably, this is an error in writing the grammar. Numeric value: 82. Suggested message: "Unproductive start symbol".
The valuator is inactive in a context where that should not be the case. Numeric value: 83. Suggested message: "Valuator inactive".
Unvalued symbols are a deprecated Marpa feature, which may be avoided with the marpa_g_force_valued() method. This error code indicates that the valued status of a symbol is locked, and an attempt was made to change it to a status different from the current one. Numeric value: 84. Suggested message: "The valued status of the symbol is locked".
An attempt was made to do something with a nulling symbol that is not allowed. For example, the ID of a nulling symbol cannot be an argument to marpa_r_expected_symbol_event_set(), because it is not possible to create an “expected symbol” event for a nulling symbol. Numeric value: 87. Suggested message: "Symbol is nulling".
An attempt was made to do something with an unused symbol that is not allowed. An “unused” symbol is an inaccessible or unproductive symbol. For example, the ID of an unused symbol cannot be an argument to marpa_r_expected_symbol_event_set(), because it is not possible to create an “expected symbol” event for an unused symbol. Numeric value: 88. Suggested message: "Symbol is not used".
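As an illustration of how an application might put the numeric values and suggested messages above to use, here is a minimal sketch of a lookup table. The struct, the table and the function name are hypothetical and are not part of Libmarpa, and only a handful of the codes listed above are included. A real application would presumably use the MARPA_ERR_* macros rather than bare numbers.

```c
#include <stddef.h>

/* Hypothetical application-side table; not part of Libmarpa. */
struct err_entry {
  int code;               /* numeric value, as listed above */
  const char *suggested;  /* suggested message, as listed above */
};

static const struct err_entry err_table[] = {
  { 22, "Argument is not boolean" },
  { 41, "No parse" },
  { 53, "The parse is exhausted" },
  { 79, "Tree iterator is exhausted" },
  { 81, "Unexpected token" }
};

/* Returns the suggested message for code,
   or NULL if code is not in this (partial) table. */
static const char *suggested_message(int code)
{
  size_t i;
  for (i = 0; i < sizeof err_table / sizeof err_table[0]; i++) {
    if (err_table[i].code == code) return err_table[i].suggested;
  }
  return NULL;
}
```

A caller would pass the numeric error code it received and fall back to a generic message when the result is NULL.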
An internal error code may be one of two things: First, it can be an error code that arises from an internal Libmarpa programming issue (in other words, something happening in the code that was not supposed to be able to happen). Second, it can be an error code that only occurs when a method from Libmarpa’s internal interface is used. Both kinds of internal error message share one common trait: users of Libmarpa’s external interface should never see them.
Internal error messages require someone with knowledge of the Libmarpa internals to follow up on them. They usually do not have descriptions or suggested messages.
Numeric value: 1.
Numeric value: 2.
Numeric value: 3.
Numeric value: 4.
Numeric value: 5.
Numeric value: 7.
“Development” errors were used heavily during Libmarpa’s development, when it was not yet clear how precisely to classify every error condition. Unless they are using a developer’s version, users of the external interface should never see development errors.
Development errors have an error string associated with them. The error string is a short 7-bit ASCII error string that describes the error. Numeric value: 9. Suggested message: "Development error, see string".
Numeric value: 10.
Numeric value: 14.
A “catchall” internal error. Numeric value: 19.
The AHFA ID was invalid. There are no AHFAs any more, so this message should not occur. Numeric value: 20.
The AHM ID was invalid. The term “AIMID” is a legacy of earlier implementations and must be kept for backward compatibility. Numeric value: 21.
Numeric value: 23.
Numeric value: 24.
Numeric value: 33.
Numeric value: 35.
Numeric value: 36.
Numeric value: 37.
Numeric value: 38.
Numeric value: 40.
Numeric value: 46.
Numeric value: 47.
Numeric value: 45.
Numeric value: 48.
Numeric value: 51.
Numeric value: 52.
Numeric value: 55.
Numeric value: 70.
Numeric value: 71.
Numeric value: 68.
Numeric value: 69.
Numeric value: 67.
Numeric value: 72.
This section contains technical notes that are not necessary for the main presentation, but which may be helpful or interesting.
• Elizabeth Scott's SPPFs
• Data types used by Libmarpa
• Why so many time objects
• Design of numbered objects
• Trap representations
One of our most important data structures is what we call a “bocage”. Prof. Scott’s work preceded ours, and her SPPF structure is our bocage in all essential respects, so much so that her excellent writeup serves perfectly as documentation for the bocage: Scott, Elizabeth. “SPPF-style parsing from Earley recognisers.” Electronic Notes in Theoretical Computer Science 203.2 (2008): 53-67, https://dinhe.net/~aredridel/.notmine/PDFs/Parsing/SCOTT%2C%20Elizabeth%20-%20SPPF-Style%20Parsing%20From%20Earley%20Recognizers.pdf.
Libmarpa does not use any floating point data or strings. All data are either integers or pointers.
Marpa is an aggressively multi-pass algorithm. Marpa achieves its efficiency, not in spite of making multiple passes over the data, but because of it. Marpa regularly substitutes two fast O(n) passes for a single O(n log n) pass. Marpa’s proliferation of time objects is in keeping with its multi-pass approach.
Bocage objects come at no cost, even for unambiguous parses, because the same pass that creates the bocage also deals with other issues that are of major significance for unambiguous parses. It is the post-processing of the bocage pass that enables Marpa to do both left- and right-recursion in linear time.
Of the various objects, the best case for elimination is of the ordering object. In many cases, the ordering is trivial. Either the parse is unambiguous, or the application does not care about the order in which parse trees are returned. But while it would be easy to add an option to bypass creation of an ordering object, there is little to be gained from it. When the ordering is trivial, its overhead is very small — essentially a handful of subroutine calls. Many orderings accomplish nothing, but these cost next to nothing.
Tree objects come at minimal cost to unambiguous grammars, because the same pass that allows iteration through multiple parse trees does the tree traversal. This eliminates much of the work that otherwise would need to be done in the valuation time object. In the current implementation, the valuation time object needs only to step through a sequence already determined by the tree iterator.
As the name suggests, the choice was made to implement numbered objects as integers, and not as pointers. In standard-conformant C, integers can be safely checked for validity, while pointers cannot.
There are efficiency tradeoffs between pointers and integers but they are complicated, and they go both ways. Pointers can be faster, but integers can be used as indexes into more than one data structure. Which is actually faster depends on the design. Integers allow for a more flexible design, so that once the choice is settled on, careful programming can make them a win, possibly a very big one.
The approach taken in Libmarpa was to settle, from the outset, on integers as the implementation for numbered objects, and to optimize on that basis. The author concedes that it is possible that others redoing Libmarpa from scratch might find that pointers are faster. But the author is confident that they will also discover, on modern architectures, that the lack of safe validity checking is far too high a price to pay for the difference in speed.
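The validity checking that integers make possible can be sketched as follows. This is an illustration only, not Libmarpa's code: the function and enum names are hypothetical, the sketch assumes that IDs are assigned in sequence starting at 0, and the two non-zero values are the numeric values given for symbol IDs among the external error codes.

```c
/* Hypothetical result codes for this sketch. */
enum id_check {
  ID_OK = 0,
  ID_MALFORMED = 28,  /* "Symbol ID is malformed" */
  ID_NO_SUCH = 90     /* "No symbol with this ID exists" */
};

/* symbol_count is the number of symbols created so far.
   Assumption: IDs run in sequence from 0 to symbol_count-1. */
static enum id_check check_symbol_id(int id, int symbol_count)
{
  if (id < 0) return ID_MALFORMED;          /* cannot exist */
  if (id >= symbol_count) return ID_NO_SUCH; /* well-formed, but does not exist */
  return ID_OK;
}
```

Both tests are ordinary integer comparisons, so they are well defined for every possible argument. No comparable portable test exists for an arbitrary pointer value.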
In order to be C89 conformant, an application must initialize all locations that might be read. This is because C89 allows trap representations.
A trap representation is a byte pattern in memory that is not a valid value of some object type. When read, the trap representation causes undefined behavior according to the C89 standard, making the application that allowed the read non-conformant to the C89 standard. Trap representations are carefully defined and discussed in the C99 standard.
In real life, trap representations can occur when floating point values are used: Some byte patterns that can occur in memory are not valid floating point values, and can cause undefined behavior when read.
Pointers raise the same issue, although some insist that an invalid pointer is not, strictly speaking, a trap representation, since it can be safely read as an integer. But there is no portable C89-conformant way of testing the integer form of a pointer for validity, so the only way to guarantee C89 conformance is to initialize the pointer, either to a valid pointer, or to a known and therefore testable value, such as NULL.
All this implies that, in order to claim C89 conformance, an application must initialize all locations that might be read to non-trap values. For a stack implementation, this means that, as a practical matter, all locations on the stack must be initialized.
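A minimal sketch of what this means for a stack of pointers follows. The struct and function names are hypothetical and the fixed capacity is chosen only for illustration; this is not Libmarpa's stack implementation.

```c
#include <stddef.h>

#define STACK_CAPACITY 8

/* Hypothetical fixed-size stack of pointers, for illustration only. */
struct ptr_stack {
  void *slot[STACK_CAPACITY];
  int count;
};

/* Set every slot to NULL, used or not, so that any later read
   of any slot yields a known and testable value. */
static void ptr_stack_init(struct ptr_stack *s)
{
  int i;
  for (i = 0; i < STACK_CAPACITY; i++) s->slot[i] = NULL;
  s->count = 0;
}

/* Returns 1 if slot ix holds a non-NULL pointer, 0 otherwise. */
static int ptr_stack_slot_is_set(const struct ptr_stack *s, int ix)
{
  if (ix < 0 || ix >= STACK_CAPACITY) return 0;
  return s->slot[ix] != NULL;
}
```

The sketch assigns NULL slot by slot rather than calling memset(), because the C standard does not guarantee that an all-zero byte pattern is a null pointer.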
In an earlier chapter, we introduced Libmarpa’s concept of input, and described its basic input models. See Input. In this chapter we describe Libmarpa’s advanced models of input. These advanced input models have attracted considerable interest. However, they have seen little actual use so far, and for that reason we delayed their consideration until now.
A Libmarpa input model is advanced if it allows tokens of length other than 1. The advanced input models are also called variable-length token models because they allow the token length to vary from the “normal” length of 1.
• The dense variable-length token model
• The fully general input model
In the dense variable-length model of input, one or more successful calls of marpa_r_alternative() must be immediately previous to every call to marpa_r_earleme_complete().
Note that, for a variable-length input model to be “dense” according to this definition, at least one successful call of marpa_r_alternative() must be immediately previous to each call to marpa_r_earleme_complete().
Recall that, in this document, we say that a marpa_r_alternative() call is “immediately previous” to a marpa_r_earleme_complete() call iff that marpa_r_earleme_complete() call is the first marpa_r_earleme_complete() call after the marpa_r_alternative() call.
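The definition of “dense” can be restated as a check on a sequence of calls. In the sketch below, which is an illustration and not part of Libmarpa, a call history is encoded as a string in which 'A' stands for a successful marpa_r_alternative() call and 'C' for a marpa_r_earleme_complete() call. The encoding and the function name are hypothetical.

```c
/* calls is a string of 'A' (a successful marpa_r_alternative() call)
   and 'C' (a marpa_r_earleme_complete() call), in call order.
   Returns 1 if every 'C' has at least one immediately previous 'A',
   0 otherwise.  Any other character is rejected. */
static int is_dense_sequence(const char *calls)
{
  int alternatives_pending = 0;  /* 'A' calls since the last 'C' */
  const char *p;
  for (p = calls; *p != '\0'; p++) {
    if (*p == 'A') {
      alternatives_pending++;
    } else if (*p == 'C') {
      if (alternatives_pending == 0) return 0;
      alternatives_pending = 0;
    } else {
      return 0;
    }
  }
  return 1;
}
```

Under this encoding, the history "ACAAC" is dense, while "ACC" is not, because its second 'C' has no immediately previous 'A'.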
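In the dense model of input, after a successful call of marpa_r_alternative(), the earleme variables are as follows: The furthest earleme is max(old_f, old_c+length), where old_f is the furthest earleme before the call to marpa_r_alternative(), old_c is the current earleme before the call to marpa_r_alternative(), and length is the length of the token read by the call. The call to marpa_r_alternative() never changes the latest or current earleme.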
In the dense variable-length model of input, the effect of the marpa_r_earleme_complete() mutator on the earleme variables is the same as for the basic models of input. See The standard model of input.
In the dense model of input, the latest earleme is always the same as the current earleme. In fact, the latest earleme and the current earleme are always the same, except in the fully general model of input.
In the sparse variable-length model of input, zero or more successful calls of marpa_r_alternative() must be immediately previous to every call to marpa_r_earleme_complete().
The sparse model is the dense variable-length model, with its only restriction lifted: the sparse variable-length input model allows calls to marpa_r_earleme_complete() that are not immediately preceded by calls to marpa_r_alternative().
Since it is unrestricted, the sparse input model is Libmarpa’s fully general input model. Because of this, it may be useful for us to state the effect of mutators on the earleme variables in detail, even at the expense of some repetition.
In the sparse input model, empty earlemes are now possible. An empty earleme is an earleme with no tokens and no Earley set. An empty earleme occurs iff marpa_r_earleme_complete() is called when there is no immediately previous call to marpa_r_alternative().
The sparse model takes its name from the fact that there may be earlemes with no Earley set. In the sparse model, Earley sets are “sparsely” distributed among the earlemes.
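In the sparse model of input, the effect on the earleme variables of a successful call of the marpa_r_alternative() mutator is the same as for the dense model of input: The furthest earleme is max(old_f, old_c+length), where old_f is the furthest earleme before the call to marpa_r_alternative(), old_c is the current earleme before the call to marpa_r_alternative(), and length is the length of the token read by the call. The call to marpa_r_alternative() never changes the latest or current earleme.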
In the sparse model, when the earleme is not empty, the effect of a call to marpa_r_earleme_complete() on the earleme variables is the same as in the dense and the basic models of input. Specifically, the following will be true: The current earleme will be old_c+1, where old_c is the current earleme before the call. The latest earleme will be old_c+1, and therefore will be equal to the current earleme. The furthest earleme is not changed by the call to marpa_r_earleme_complete().
Recall that, in the dense and basic input models, as a matter of definition, there are no empty earlemes. For the sparse input model, in the case of an empty earleme, the effect of the marpa_r_earleme_complete() mutator on the earleme variables is the following: The current earleme will be old_c+1, where old_c is the current earleme before the call. The latest and furthest earlemes are not changed by the call to marpa_r_earleme_complete().
After a call to marpa_r_earleme_complete() for an empty earleme, the latest and current earlemes will have different values. In a parse that never calls marpa_r_earleme_complete() for an empty earleme, the latest and current earlemes will always be the same.
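The rules for the earleme variables in the sparse model can be summarized in a small model. This is a sketch of the rules as stated in this chapter, not Libmarpa's implementation. The struct and function names are hypothetical, the sketch assumes all three variables are 0 when input starts, and it decides whether an earleme is empty exactly as the definition above does, by whether a marpa_r_alternative() call is immediately previous.

```c
/* Hypothetical model of the three earleme variables; not Libmarpa code. */
struct earleme_vars {
  int current;
  int latest;
  int furthest;
  int alternatives_pending;  /* modeled alternative calls since the last completion */
};

/* Assumption: all three variables are 0 when input starts. */
static void model_init(struct earleme_vars *v)
{
  v->current = 0;
  v->latest = 0;
  v->furthest = 0;
  v->alternatives_pending = 0;
}

/* Models a successful marpa_r_alternative() call for a token of the
   given length: furthest becomes max(old_f, old_c+length), while
   the current and latest earlemes are unchanged. */
static void model_alternative(struct earleme_vars *v, int length)
{
  int end = v->current + length;
  if (end > v->furthest) v->furthest = end;
  v->alternatives_pending++;
}

/* Models marpa_r_earleme_complete(): the current earleme always
   advances by one; the latest earleme catches up only if the
   earleme is not empty; the furthest earleme is unchanged. */
static void model_earleme_complete(struct earleme_vars *v)
{
  v->current++;
  if (v->alternatives_pending > 0) v->latest = v->current;
  v->alternatives_pending = 0;
}
```

In this model, reading one token of length 3 and then completing two earlemes leaves the current earleme at 2, the latest earleme at 1 and the furthest earleme at 3. The second completion is for an empty earleme, which is why the latest and current earlemes differ.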
The “updates” document (https://github.com/jeffreykegler/libmarpa/blob/updated/UPDATES.md) contains instructions for reporting bugs, getting answers to questions, and other support.
This chapter is not about the current interface. Instead, it discusses changes or additions that might be made to this document or to the external interface in the future.
Currently we call a zero-length instance (aka tree node) either a nulling instance or a nulled instance. The use of “nulling” is for historic reasons and arguably is confusing. The symbol of a nulling instance is not necessarily a nulling symbol — it might be a nullable symbol. Usage of the term “nulled” is less confusing. At this time, we continue to allow zero-length instances to be called nulling instances because that terminology is embedded in a lot of code and documents.
A more formal approach to documenting preconditions of the methods is possible, and may be helpful enough to repay any cost in verbosity or complexity. Dave Abrahams recommended I look at https://www.boost.org/sgi/stl/ for one approach.
Some of the events interfaces are unnecessarily complex. Activation in the grammar is unnecessary, as is the ability to “unmark” an event for a symbol before precomputation. See Completion events, see Symbol nulled events, and see Prediction events.
With experience, we are now in a position to define an ambiguity metric that can be cheaply calculated, and that might be of real use. Preliminary notes are in the CWeb code.
Right now, a report item traverser is a kind of “subobject” of a recognizer. It should be made into a full-fledged time object. This will allow multiple report item traversers to be in use at once, allowing more aggressive use of this facility.
The treatment of soft failure evolved along with this interface, leaving traces of that evolution in the interface. For example, soft failures should not set the error code, but soft failure in marpa_r_progress_item() sets the error code to MARPA_ERR_PROGRESS_REPORT_EXHAUSTED. See marpa_r_progress_item(). Similarly, soft failure in marpa_t_next() sets the error code to MARPA_ERR_TREE_EXHAUSTED. These non-orthogonalities should be fixed someday.
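One practical consequence is that an application cannot rely on “the error code was set” as its test for hard failure. The sketch below classifies an outcome by return value alone. It assumes a convention in which non-negative values mean success, -1 means soft failure and -2 means hard failure. That convention is an assumption of this sketch and should be checked against the description of each method. The enum and function names are hypothetical.

```c
enum outcome {
  OUTCOME_SUCCESS,
  OUTCOME_SOFT_FAILURE,
  OUTCOME_HARD_FAILURE
};

/* Classify by return value alone, ignoring the error code.
   Assumption: >= 0 is success, -1 is soft failure, and any
   other negative value is treated as hard failure. */
static enum outcome classify_return(int return_value)
{
  if (return_value >= 0) return OUTCOME_SUCCESS;
  if (return_value == -1) return OUTCOME_SOFT_FAILURE;
  return OUTCOME_HARD_FAILURE;
}
```

With this approach, an application iterating until exhaustion would stop on OUTCOME_SOFT_FAILURE without treating the accompanying error code as a sign of trouble.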
The treatment of parse exhaustion is very awkward. marpa_r_start_input() returns success on exhaustion, while marpa_r_earleme_complete() either returns success or a hard failure, depending on circumstances. See marpa_r_earleme_complete(), and marpa_r_start_input().
Ideally the treatment should be simpler, more intuitive and more orthogonal. Better, perhaps, would be to always treat parse exhaustion as a soft failure.
marpa_r_furthest_earleme() returns unsigned int, which is non-orthogonal with marpa_r_current_earleme(). This leaves no room for a failure return value, which we deal with by not checking for failures. The only important potential failure is calling marpa_r_furthest_earleme() when the furthest earleme is an indeterminate value. We eliminate this potential cause of failure by regarding the furthest earleme as having been initialized when the recognizer was created, which is another non-orthogonality with marpa_r_current_earleme().
All this might be fine, if something were gained, but in fact the furthest earleme, unless there is a problem, always becomes the current earleme, and no use cases for extremely long variable-length tokens are envisioned, so that the two should never be far apart. Additionally, the additional values for the furthest earleme only come into play if the parse is too large for the computer memories as of this writing.
Summarizing, marpa_r_furthest_earleme() should return an int, like marpa_r_current_earleme(), and the non-orthogonalities should be eliminated.
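The point about failure return values can be seen in a few lines of C. The function names below are hypothetical and the sketch does not call Libmarpa. It only illustrates that a negative failure indicator, once forced into an unsigned int, is indistinguishable from a very large earleme.

```c
#include <limits.h>

/* With an int return type, negative values are free to signal failure. */
static int is_failure_int(int return_value)
{
  return return_value < 0;
}

/* With an unsigned int return type, every value is a possible earleme.
   A would-be negative indicator converts to a very large value
   that looks like any other result. */
static unsigned int as_unsigned_return(int would_be_indicator)
{
  return (unsigned int) would_be_indicator;
}
```

For example, as_unsigned_return(-1) yields UINT_MAX, which a caller has no way to tell apart from a legitimate, if implausible, furthest earleme.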
Among the hard failures that marpa_r_alternative() returns are the error codes MARPA_ERR_DUPLICATE_TOKEN, MARPA_ERR_NO_TOKEN_EXPECTED_HERE and MARPA_ERR_INACCESSIBLE_TOKEN. These are currently treated as irrecoverable. They may in fact be fully recoverable, but are not documented as such because this has not been tested.
At this writing, we know of no applications that attempt to recover from these errors. It is possible that these error codes may also be usable for techniques similar to the Ruby Slippers, but as of this writing we know of no proposals to use them in this way.
The methods of this section are not in the external interface, because they have not been adequately tested. Their fate is uncertain. Users should regard these methods as unsupported.
• Zero-width assertion methods
• Methods for revising parses
On success, returns previous default value of the assertion.
Changes default value to default_value. On success, returns previous default value of the assertion.
Marpa allows an application to “change its mind” about a parse, rejecting rules previously recognized or predicted, and terminals previously scanned. The methods in this section provide that capability.
• LHS terminals
• Valued and unvalued symbols
• Overview of LHS terminals
• Motivation of LHS terminals
• LHS terminal methods
• Precomputation and LHS terminals
• Nulling terminals
The user creates LHS terminals with the marpa_g_symbol_is_terminal_set() method. See marpa_g_symbol_is_terminal_set(). If the marpa_g_symbol_is_terminal_set() method is never called for a grammar, then LHS terminals are not in use for any time object with that grammar as its base grammar.
Next: LHS terminal methods, Previous: Overview of LHS terminals, Up: LHS terminals [Contents][Index]
Recall that a terminal symbol is a symbol that may appear in the input. Traditionally, all LHS symbols, as well as the start symbol, must be non-terminals. By default, this is Marpa’s behavior.
In a departure from tradition, Marpa had a feature that allowed the user to eliminate the distinction between terminals and non-terminals. This feature is now deprecated.
When LHS terminals are in use, a terminal can appear on the LHS of one or more rules, and can be the start symbol. Note, however, that terminals can never be zero length.
The basis of the LHS terminals feature was that, while a sharp division between terminals and non-terminals was a useful simplification for proving theorems, it was not essential in practice. In the UNIX “toolkit” tradition, the practice has been to include even awkward, dangerous tools with no known use in the toolkit. The philosophy was that empowering the user who discovers new techniques is more important than playing nanny to the toolkit’s users.
LHS terminals could be used to bypass, or “short circuit”, the rules on whose LHS they occur. Short circuiting rules, it was thought, might prove helpful in debugging, or have other applications.
But a decade after the release of Libmarpa, no uses for LHS terminals have emerged. And they do introduce many new corner cases into the code and complicate the API documentation.
Next: Precomputation and LHS terminals, Previous: Motivation of LHS terminals, Up: LHS terminals [Contents][Index]
The terminal status of a symbol is a boolean, which is true iff the symbol is a terminal. The terminal status of a symbol is locked iff the terminal status of that symbol cannot be changed.
On success, does the following:
marpa_r_alternative()
method,
a symbol must be a terminal.
Hard fails with error code MARPA_ERR_TERMINAL_IS_LOCKED if the terminal status of the symbol with ID sym_id is locked and is not equal to value. Also hard fails if value is not a boolean or if g is precomputed.
Return value: On success, value, which will be 1 or 0. On soft failure, -1. On hard failure, -2.
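The failure rules above can be sketched as a small stand-alone model. This is not Libmarpa code: every name beginning with model_ or MODEL_ is hypothetical, and the model assumes that a successful call locks the terminal status, that an ID naming no symbol is the soft failure, and that a negative ID is a hard failure. Only the return values 1, 0, -1 and -2 and the role of MARPA_ERR_TERMINAL_IS_LOCKED are taken from the text.

```c
/* Hypothetical model of the terminal-status rules; not Libmarpa code. */
#define MODEL_MAX_SYMBOLS 8

enum model_error {
  MODEL_ERR_NONE = 0,
  MODEL_ERR_TERMINAL_IS_LOCKED, /* stands in for MARPA_ERR_TERMINAL_IS_LOCKED */
  MODEL_ERR_INVALID_BOOLEAN,    /* assumed name: value is not 0 or 1 */
  MODEL_ERR_PRECOMPUTED,        /* assumed name: grammar already precomputed */
  MODEL_ERR_INVALID_SYMBOL_ID   /* assumed name: malformed (negative) ID */
};

struct model_symbol {
  int is_terminal; /* the terminal status, 0 or 1 */
  int is_locked;   /* 1 once the terminal status can no longer change */
};

struct model_grammar {
  struct model_symbol symbols[MODEL_MAX_SYMBOLS];
  int symbol_count;
  int is_precomputed;
  enum model_error error;
};

/* Returns value (1 or 0) on success, -1 on soft failure, -2 on hard failure. */
int model_symbol_is_terminal_set(struct model_grammar *g, int sym_id, int value)
{
  struct model_symbol *sym;

  if (g->is_precomputed) {
    g->error = MODEL_ERR_PRECOMPUTED;
    return -2;
  }
  if (value != 0 && value != 1) {
    g->error = MODEL_ERR_INVALID_BOOLEAN;
    return -2;
  }
  if (sym_id < 0) {
    g->error = MODEL_ERR_INVALID_SYMBOL_ID;
    return -2;
  }
  if (sym_id >= g->symbol_count) {
    return -1; /* assumption: no such symbol is the soft failure */
  }
  sym = &g->symbols[sym_id];
  if (sym->is_locked && sym->is_terminal != value) {
    g->error = MODEL_ERR_TERMINAL_IS_LOCKED;
    return -2; /* the locked status is left unchanged */
  }
  sym->is_terminal = value;
  sym->is_locked = 1; /* assumption: a successful set locks the status */
  return value;
}
```

In this model, repeating a call with the same value succeeds, while a call with the opposite value hard fails and leaves the status as it was.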
Next: Nulling terminals, Previous: LHS terminal methods, Up: LHS terminals [Contents][Index]
On success, marpa_g_precompute() sets and locks the terminal status of every symbol.
More precisely,
let the symbol be x,
let the terminal status of x when
marpa_g_precompute()
was called
be v_before,
and let the terminal status of x when
marpa_g_precompute()
returns success
be v_after.
The effect of the successful call of
marpa_g_precompute()
will be as follows:
marpa_g_precompute()
was called,
then v_after = v_before
.
The terminal status of all symbols is locked after a successful call to marpa_g_precompute(). See marpa_g_precompute().
Previous: Precomputation and LHS terminals, Up: LHS terminals [Contents][Index]
When LHS terminals are not in use, nulling terminals cannot occur, and applications need not take them into account. This is because, in order to be nullable, a symbol must appear on the LHS of a nullable rule. Without LHS terminals, therefore, no terminals can ever be either nullable or nulling.
Things become more complicated if LHS terminals are allowed. In that case nulling terminals can be created, and Libmarpa must take measures to prevent a recognizer from being created for a grammar with nulling terminals. Libmarpa will not allow a recognizer to be created from a grammar with nulling terminals because they are a logical contradiction. A terminal is (by definition) a symbol which can appear in the input, and a nulling symbol, by definition, cannot appear in the input.
Libmarpa’s marpa_g_precompute() method fails with the error code MARPA_ERR_NULLING_TERMINAL if it detects nulling terminals during precomputation. The error code MARPA_ERR_NULLING_TERMINAL is library-recoverable. See marpa_g_precompute().
Libmarpa’s marpa_g_precompute() method also triggers one MARPA_EVENT_NULLING_TERMINAL event for every nulling terminal in the grammar. This implies that one or more MARPA_EVENT_NULLING_TERMINAL events occur iff marpa_g_precompute() fails with error code MARPA_ERR_NULLING_TERMINAL.
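The relationship between the error code and the events can be illustrated with a small stand-alone model. This is a sketch, not Libmarpa code: the model_ and MODEL_ names are hypothetical, and the is_nulling flag is assumed to have been computed already, since the nullability analysis itself is outside the scope of the sketch.

```c
/* Hypothetical model of nulling-terminal detection; not Libmarpa code. */
#define MODEL_MAX_EVENTS 16

enum {
  MODEL_ERR_NONE = 0,
  MODEL_ERR_NULLING_TERMINAL = 1 /* stands in for MARPA_ERR_NULLING_TERMINAL */
};

struct model_symbol {
  int is_terminal; /* 1 if the symbol may appear in the input */
  int is_nulling;  /* 1 if the symbol always derives the empty string */
};

struct model_precompute_result {
  int error;
  int event_count;                     /* one event per nulling terminal */
  int event_symbols[MODEL_MAX_EVENTS]; /* symbol ID carried by each event */
};

/* Returns 0 on success and -2 on failure.  In this model, events beyond
   MODEL_MAX_EVENTS are silently dropped. */
int model_precompute(const struct model_symbol *symbols, int symbol_count,
                     struct model_precompute_result *result)
{
  int id;

  result->error = MODEL_ERR_NONE;
  result->event_count = 0;
  for (id = 0; id < symbol_count; id++) {
    if (symbols[id].is_terminal && symbols[id].is_nulling
        && result->event_count < MODEL_MAX_EVENTS) {
      result->event_symbols[result->event_count++] = id;
    }
  }
  /* Events were recorded iff precomputation fails with this error. */
  if (result->event_count > 0) {
    result->error = MODEL_ERR_NULLING_TERMINAL;
    return -2;
  }
  return 0;
}
```

Because the failure is decided from the event count, the model makes the "iff" visible: a caller sees MODEL_ERR_NULLING_TERMINAL exactly when at least one event was recorded.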
Previous: LHS terminals, Up: Deprecated techniques and methods [Contents][Index]
• What unvalued symbols were: | ||
• Grammar methods dealing with unvalued symbols: | ||
• Registering semantics in the valuator: |
Next: Grammar methods dealing with unvalued symbols, Previous: Valued and unvalued symbols, Up: Valued and unvalued symbols [Contents][Index]
Libmarpa symbols can have values, which is the traditional way of doing semantics. Libmarpa also allows symbols to be unvalued. An unvalued symbol is one whose value is unpredictable from instance to instance. If a symbol is unvalued, we sometimes say that it has “whatever” semantics.
Situations where the semantics can tolerate unvalued symbols are surprisingly frequent. For example, the top-level of many languages is a series of major units, all of whose semantics are typically accomplished via side effects. The compiler is typically indifferent to the actual value produced by these major units, and tracking them is a waste of time. Similarly, the value of the separators in a list is typically ignored.
Rules are unvalued if and only if their LHS symbols are unvalued. When rules and symbols are unvalued, Libmarpa optimizes their evaluation.
It is in principle unsafe to check the value of a symbol if it can be unvalued. For this reason, once a symbol has been treated as valued, Libmarpa marks it as valued. Similarly, once a symbol has been treated as unvalued, Libmarpa marks it as unvalued. Once marked, a symbol’s valued status is locked and cannot be changed later.
The valued status of terminals is marked the first time they are read.
Unvalued symbols may be used in combination with another deprecated feature, LHS terminals. See LHS terminals. The valued status of LHS symbols must be explicitly marked by the application when initializing the valuator — this is Libmarpa’s equivalent of registering a callback.
The valued status of a LHS terminal will be locked in the recognizer if it is used as a terminal, and the symbol’s use as a rule LHS in the valuator must be consistent with the recognizer’s valued marking. LHS terminals are disabled by default.
Marpa reports an error when a symbol’s use conflicts with its locked valued status. Doing so usually saves the Libmarpa user some tricky debugging further down the road.
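As a rough illustration of this marking behavior, the following stand-alone model locks a symbol's valued status on first use and rejects a later conflicting use. It is a sketch under assumptions, not Libmarpa code: the model_ names are hypothetical, and the -2 return simply mirrors the hard-failure convention used by the methods in this chapter.

```c
/* Hypothetical model of "mark on first use"; not Libmarpa code. */
struct model_symbol {
  int is_valued; /* meaningful only once is_locked is 1 */
  int is_locked; /* 1 after the symbol has been treated one way or the other */
};

/* Treat the symbol as valued (as_valued == 1) or unvalued (as_valued == 0).
   Returns the valued status on success, or -2 if the use conflicts with
   the status that an earlier use locked in. */
int model_use_symbol(struct model_symbol *sym, int as_valued)
{
  if (sym->is_locked) {
    if (sym->is_valued != as_valued) {
      return -2; /* conflict: the locked status is left unchanged */
    }
    return sym->is_valued;
  }
  sym->is_valued = as_valued;
  sym->is_locked = 1; /* the first use marks and locks the status */
  return as_valued;
}
```

For example, a list separator first treated as unvalued stays unvalued in this model, and a later attempt to read a value from it is reported as an error at once.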
Next: Registering semantics in the valuator, Previous: What unvalued symbols were, Up: Valued and unvalued symbols [Contents][Index]
These methods, respectively, set and query the “valued status” of a symbol. Once set to a value with the marpa_g_symbol_is_valued_set() method, the valued status of a symbol is “locked” at that value. It cannot thereafter be changed. Subsequent calls to marpa_g_symbol_is_valued_set() for the same sym_id will fail, leaving sym_id’s valued status unchanged, unless value is the same as the locked-in value.
Return value: On success, 1 if the symbol sym_id is valued after the call, 0 if not. If the valued status is locked and value differs from the current status, -2. If value is not 0 or 1, or on other failure, -2.
Previous: Grammar methods dealing with unvalued symbols, Up: Valued and unvalued symbols [Contents][Index]
By default, Libmarpa’s valuator objects assume that non-terminal symbols have no semantics. The archetypal application will need to register symbols that contain semantics. The primary method for doing this is marpa_v_symbol_is_valued(). Applications will typically register semantics by rule, and these applications will find the marpa_v_rule_is_valued() method more convenient.
These methods, respectively, set and query the valued status of symbol sym_id. marpa_v_symbol_is_valued_set() will set the valued status to the value of its status argument. A valued status of 1 indicates that the symbol is valued. A valued status of 0 indicates that the symbol is unvalued. If the valued status is locked, an attempt to change to a status different from the current one will fail (error code MARPA_ERR_VALUED_IS_LOCKED).
Return value: On success, the valued status after the call. If value is not either 0 or 1, or on other failure, -2.
These methods, respectively, set and query the valued status for the LHS symbol of rule rule_id. marpa_v_rule_is_valued_set() sets the valued status to the value of its status argument. A valued status of 1 indicates that the symbol is valued. A valued status of 0 indicates that the symbol is unvalued. If the valued status is locked, an attempt to change to a status different from the current one will fail (error code MARPA_ERR_VALUED_IS_LOCKED).
Rules have no valued status of their own. The valued status of a rule is always that of its LHS symbol. These methods are conveniences — they save the application the trouble of looking up the rule’s LHS.
Return value: On success, the valued status of the rule rule_id’s LHS symbol after the call. If value is not either 0 or 1, or on other failure, -2.
This method locks the valued status of all symbols to 1, indicating that every symbol is valued. If this is not possible, for example because one of the grammar’s symbols is already locked at a valued status of 0, failure is returned.
Return value: On success, a non-negative number. On failure, returns -2, and sets the error code to an appropriate value, which will never be MARPA_ERR_NONE.
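The interplay of the three valuator methods described in this section can be sketched with a stand-alone model. It is not Libmarpa code: the model_ and MODEL_ names are hypothetical, the model does not lock a status on an ordinary set (the text does not say whether Libmarpa does), and the error reported when forcing fails is an assumption.

```c
/* Hypothetical model of valued-status registration; not Libmarpa code. */
#define MODEL_MAX_SYMBOLS 8
#define MODEL_MAX_RULES 8

enum {
  MODEL_ERR_NONE = 0,
  MODEL_ERR_VALUED_IS_LOCKED, /* stands in for MARPA_ERR_VALUED_IS_LOCKED */
  MODEL_ERR_INVALID_BOOLEAN,  /* assumed name: status is not 0 or 1 */
  MODEL_ERR_INVALID_ID        /* assumed name: ID out of range */
};

struct model_valuator {
  int symbol_count;
  int is_valued[MODEL_MAX_SYMBOLS];
  int is_locked[MODEL_MAX_SYMBOLS];
  int rule_count;
  int rule_lhs[MODEL_MAX_RULES]; /* LHS symbol ID of each rule */
  int error;
};

/* Returns the valued status after the call, or -2 on failure. */
int model_symbol_is_valued_set(struct model_valuator *v, int sym_id, int status)
{
  if (sym_id < 0 || sym_id >= v->symbol_count) {
    v->error = MODEL_ERR_INVALID_ID;
    return -2;
  }
  if (status != 0 && status != 1) {
    v->error = MODEL_ERR_INVALID_BOOLEAN;
    return -2;
  }
  if (v->is_locked[sym_id] && v->is_valued[sym_id] != status) {
    v->error = MODEL_ERR_VALUED_IS_LOCKED;
    return -2;
  }
  v->is_valued[sym_id] = status;
  return status;
}

/* Convenience: a rule has no status of its own, so set its LHS symbol's. */
int model_rule_is_valued_set(struct model_valuator *v, int rule_id, int status)
{
  if (rule_id < 0 || rule_id >= v->rule_count) {
    v->error = MODEL_ERR_INVALID_ID;
    return -2;
  }
  return model_symbol_is_valued_set(v, v->rule_lhs[rule_id], status);
}

/* Locks every symbol at valued status 1.  Returns 0 on success, or -2 if
   some symbol is already locked at 0; on failure nothing is changed. */
int model_force_valued(struct model_valuator *v)
{
  int id;

  for (id = 0; id < v->symbol_count; id++) {
    if (v->is_locked[id] && !v->is_valued[id]) {
      v->error = MODEL_ERR_VALUED_IS_LOCKED; /* assumed error code */
      return -2;
    }
  }
  for (id = 0; id < v->symbol_count; id++) {
    v->is_valued[id] = 1;
    v->is_locked[id] = 1;
  }
  return 0;
}
```

The rule-level setter is a thin wrapper in this model, which mirrors the statement that the valued status of a rule is always that of its LHS symbol.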
Next: Annotated bibliography, Previous: Deprecated techniques and methods, Up: Top [Contents][Index]
This chapter is a quick summary of the most important events in Marpa’s development. My “timeline” of the major events in parsing theory has a much broader scope, and also includes more detail about Marpa’s development. See Timeline.
Next: Index of terms, Previous: History of the Marpa algorithm, Up: Top [Contents][Index]
• Aho and Ullman 1972: | ||
• Aycock and Horspool 2002: | ||
• Dominus 2005: | ||
• Earley 1970: | ||
• Grune and Jacobs 1990: | ||
• Grune and Jacobs 2008: | ||
• Kegler 2022: | ||
• Timeline: | ||
• Leo 1991: | ||
• Wikipedia: |
Next: Aycock and Horspool 2002, Previous: Annotated bibliography, Up: Annotated bibliography [Contents][Index]
The Theory of Parsing, Translation and Compiling, Volume I: Parsing by Alfred Aho and Jeffrey Ullman (Prentice-Hall: Englewood Cliffs, New Jersey, 1972). I think this was the standard source for Earley’s algorithm for decades. It certainly was my standard source. The account of Earley’s algorithm is on pages 320-330.
Next: Dominus 2005, Previous: Aho and Ullman 1972, Up: Annotated bibliography [Contents][Index]
Marpa is based on ideas from John Aycock and R. Nigel Horspool’s “Practical Earley Parsing”, The Computer Journal, Vol. 45, No. 6, 2002, pp. 620-630. The idea of doing LR(0) precomputation for Earley’s general parsing algorithm (see Bibliography-Earley-1970), and Marpa’s approach to handling nullable symbols and rules, both came from this article.
The Aycock and Horspool paper summarizes Earley’s algorithm very nicely and is available on the web: http://www.cs.uvic.ca/~nigelh/Publications/PracticalEarleyParsing.pdf. Unlike Earley’s 1970 paper (see Bibliography-Earley-1970), Aycock and Horspool 2002 is not easy reading. I have been following this particular topic on and off for years and nonetheless found this paper very heavy going.
Next: Earley 1970, Previous: Aycock and Horspool 2002, Up: Annotated bibliography [Contents][Index]
Although my approach to parsing is not influenced by Mark Jason Dominus’s Higher Order Perl, Mark’s treatment of parsing is an excellent introduction to parsing, especially in a Perl context. His focus on just about every other technique except general BNF parsing is pretty much standard, and will help a beginner understand how unconventional Marpa’s approach is.
Both Mark’s Perl and his English are examples of good writing, and the book is dense with insights. Mark’s discussion on memoization in Chapter 3 is the best I’ve seen. I wish I’d bought his book earlier in my coding.
Mark’s book is available on-line. You can download chapter-by-chapter or the whole thing at once, and you can take your pick of his original sources or PDF, at http://hop.perl.plover.com/book/. A PDF of the parsing chapter is at http://hop.perl.plover.com/book/pdf/08Parsing.pdf.
Next: Grune and Jacobs 1990, Previous: Dominus 2005, Up: Annotated bibliography [Contents][Index]
Of Jay Earley’s papers on his general parsing algorithm, the most readily available is “An efficient context-free parsing algorithm”, Communications of the Association for Computing Machinery, 13:2:94-102, 1970.
Ordinarily, I’d not bother pointing out 35-year-old nits in a brilliant and historically important article. But more than a few people treat this article as not just the first word in Earley parsing, but the last as well. Many implementations of Earley’s algorithm come, directly and unaltered, from his paper. These implementers and their users need to be aware of two issues.
First, the recognition engine itself, as described, has a serious bug. There’s an easy fix, but one that greatly slows down an algorithm whose main problem, in its original form, was speed. This issue is well laid out by Aycock and Horspool in their article. See Bibliography-Aycock-and-Horspool-2002.
Second, according to Tomita there is a mistake in the parse tree representation. See page 153 of Bibliography-Grune-and-Jacobs-1990, page 210 of Bibliography-Grune-and-Jacobs-2008, and the bibliography entry for Earley 1970 in Bibliography-Grune-and-Jacobs-2008. In the printed edition of the 2008 bibliography, the entry is on page 578, and on the web (ftp://ftp.cs.vu.nl/pub/dick/PTAPG_2nd_Edition/CompleteList.pdf), it’s on pp. 583-584. My methods for producing parse results from Earley sets do not come from Earley 1970, so I am taking Tomita’s word on this one.
Next: Grune and Jacobs 2008, Previous: Earley 1970, Up: Annotated bibliography [Contents][Index]
Parsing Techniques: A Practical Guide, by Dick Grune and Ceriel Jacobs, (Ellis Horwood Limited: Chichester, West Sussex, England, 1990). This book is available on the Web: http://dickgrune.com/Books/PTAPG_1st_Edition/
Next: Kegler 2022, Previous: Grune and Jacobs 1990, Up: Annotated bibliography [Contents][Index]
Parsing Techniques: A Practical Guide, by Dick Grune and Ceriel Jacobs, 2nd Edition. (Springer: New York NY, 2008). This is the most authoritative and comprehensive introduction to parsing I know of. In theory it requires no mathematics, only a programming background, but even so it is moderately difficult reading.
This is Bibliography-Grune-and-Jacobs-1990, updated. The bibliography for this book is available in enlarged form on the web: ftp://ftp.cs.vu.nl/pub/dick/PTAPG_2nd_Edition/CompleteList.pdf.
Next: Timeline, Previous: Grune and Jacobs 2008, Up: Annotated bibliography [Contents][Index]
My writeup of the theory behind Marpa, with proofs of correctness and of my complexity claims, was first made public in 2013. It was updated in 2022, and can be found on arxiv.org (https://arxiv.org/abs/1910.08129).
Next: Leo 1991, Previous: Kegler 2022, Up: Annotated bibliography [Contents][Index]
Far more popular than my Marpa theory paper is my Parsing: a timeline. This is a detailed history of parsing theory, and is available online: https://jeffreykegler.github.io/personal/timeline_v3.
Next: Wikipedia, Previous: Timeline, Up: Annotated bibliography [Contents][Index]
Marpa’s handling of right-recursion uses the ideas in Joop M.I.M. Leo’s “A General Context-Free Parsing Algorithm Running in Linear Time on Every LR(k) Grammar Without Using Lookahead”, Theoretical Computer Science, Vol. 82, No. 1, 1991, pp. 165-176. This is a difficult paper. It is available online at http://www.sciencedirect.com/science/article/pii/030439759190180A; click the PDF icon at the top left.
Previous: Leo 1991, Up: Annotated bibliography [Contents][Index]
Wikipedia’s article on Backus-Naur form is http://en.wikipedia.org/wiki/Backus-Naur_form. It’s a great place to start if you don’t know the basics of grammars and parsing. As Wikipedia points out, BNF might better be called Panini-Backus Form. The grammarian Panini gave a precise description of Sanskrit more than 23 centuries earlier in India using a similar notation.
Previous: Annotated bibliography, Up: Top [Contents][Index]
This index is of terms that are used in a special sense in this document. Not every use of these terms is indexed — only those uses that are in some way defining.
Jump to: | A B C D E F G H I L M N O P R S T U V W |
---|