Parse Tree
Context-Sensitive Analysis
Keith D. Cooper, Linda Torczon, in Engineering a Compiler (Second Edition), 2012
Instantiating the Parse Tree
An attribute grammar specifies a computation relative to the parse tree for a valid sentence in the underlying grammar. The paradigm relies, inherently, on the availability of the parse tree. The evaluator might simulate the parse tree, but it must behave as if the parse tree exists. While the parse tree is useful for discussions of parsing, few compilers actually build a parse tree.
Some compilers use an abstract syntax tree (ast) to represent the program being compiled. The ast has the essential structure of the parse tree but eliminates many of the internal nodes that represent nonterminal symbols in the grammar (see the description starting on page 226 of Section 5.2.1). If the compiler builds an ast, it could use an attribute grammar tied to a grammar for the ast. However, if the compiler has no other use for the ast, then the programming effort and compile-time cost associated with building and maintaining the ast must be weighed against the benefits of using the attribute-grammar formalism.
URL: https://www.sciencedirect.com/science/article/pii/B9780120884780000049
Intermediate Representations
Keith D. Cooper, Linda Torczon, in Engineering a Compiler (Second Edition), 2012
Parse Trees
As we saw in Section 3.2.2, the parse tree is a graphical representation for the derivation, or parse, that corresponds to the input program. Figure 5.1 shows the classic expression grammar alongside a parse tree for a × 2 + a × 2 × b. The parse tree is large relative to the source text because it represents the complete derivation, with a node for each grammar symbol in the derivation. Since the compiler must allocate memory for each node and each edge, and it must traverse all those nodes and edges during compilation, it is worth considering ways to shrink this parse tree.
Figure 5.1. Parse Tree for a × 2 + a × 2 × b Using the Classic Expression Grammar.
Minor transformations on the grammar, as described in Section 3.6.1, can eliminate some of the steps in the derivation and their corresponding syntax-tree nodes. A more effective technique is to abstract away those nodes that serve no real purpose in the rest of the compiler. This approach leads to a simplified version of the parse tree, called an abstract syntax tree.
Parse trees are used primarily in discussions of parsing, and in attribute-grammar systems, where they are the primary ir. In most other applications in which a source-level tree is needed, compiler writers tend to use one of the more concise alternatives, described in the remainder of this subsection.
URL: https://www.sciencedirect.com/science/article/pii/B9780120884780000050
Semantic Analysis
Michael L. Scott, in Programming Language Pragmatics (Third Edition), 2009
Inherited Attributes
In general, we can imagine (and will in fact have need of) attributes whose values are calculated when their symbol is on the right-hand side of the current production. Such attributes are said to be inherited. They allow contextual information to flow into a symbol from above or from the side, so that the rules of that production can be enforced in different ways (or generate different values) depending on surrounding context. Symbol table information is commonly passed from symbol to symbol by means of inherited attributes. Inherited attributes of the root of the parse tree can also be used to represent the external environment (characteristics of the target machine, command-line arguments to the compiler, etc.).
Example 4.7
Top-Down CFG and Parse Tree for Subtraction
As a simple example of inherited attributes, consider the following simplified fragment of an LL(1) expression grammar (here covering only subtraction):
expr → const expr_tail
expr_tail → − const expr_tail | ϵ
For the expression 9 − 4 − 3, we obtain the following parse tree:
If we want to create an attribute grammar that accumulates the value of the overall expression into the root of the tree, we have a problem: because subtraction is left associative, we cannot summarize the right subtree of the root with a single numeric value. If we want to decorate the tree bottom-up, with an S-attributed grammar, we must be prepared to describe an arbitrary number of right operands in the attributes of the top-most expr_tail node (see Exercise 4.4). This is indeed possible, but it defeats the purpose of the formalism: in effect, it requires us to embed the entire tree into the attributes of a single node, and do all the real work inside a single semantic function.
Example 4.8
Decoration with Left-to-Right Attribute Flow
If, however, we are allowed to pass attribute values not only bottom-up but also left-to-right in the tree, then we can pass the 9 into the top-most expr_tail node, where it can be combined (in proper left-associative fashion) with the 4. The resulting 5 can then be passed into the middle expr_tail node, combined with the 3 to make 2, and then passed upward to the root:
Example 4.9
Top-Down AG for Subtraction
To effect this style of decoration, we need the following attribute rules:
expr → const expr_tail
⊳ expr_tail.st := const.val
⊳ expr.val := expr_tail.val
expr_tail1 → − const expr_tail2
⊳ expr_tail2.st := expr_tail1.st − const.val
⊳ expr_tail1.val := expr_tail2.val
expr_tail → ϵ
⊳ expr_tail.val := expr_tail.st
In each of the first two productions, the first rule serves to copy the left context (value of the expression so far) into a "subtotal" (st) attribute; the second rule copies the final value from the right-most leaf back up to the root. In the expr_tail nodes of the picture in Example 4.8, the left box holds the st attribute; the right holds val.
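The flow of the st and val attributes in these rules can be traced with a small Python sketch. It represents the parse tree for 9 − 4 − 3 simply as a first constant plus a list of right operands (a convenience of the sketch, not part of the formalism); each recursive call plays the role of one expr_tail node, receiving st from the left and returning val upward.

```python
# Attribute evaluation for the subtraction grammar. Each call to
# eval_expr_tail plays the role of one expr_tail node: "st" is the
# inherited subtotal flowing in from the left, and the return value is
# the synthesized "val" flowing back up. The tree for 9 - 4 - 3 is
# represented as a first constant plus its list of right operands.

def eval_expr(const_val, rest):
    # expr -> const expr_tail:   expr_tail.st := const.val
    #                            expr.val    := expr_tail.val
    return eval_expr_tail(const_val, rest)

def eval_expr_tail(st, rest):
    if not rest:
        # expr_tail -> eps:      expr_tail.val := expr_tail.st
        return st
    # expr_tail1 -> - const expr_tail2:
    #     expr_tail2.st  := expr_tail1.st - const.val
    #     expr_tail1.val := expr_tail2.val
    return eval_expr_tail(st - rest[0], rest[1:])

print(eval_expr(9, [4, 3]))  # (9 - 4) - 3 = 2
```

Running eval_expr(9, [4, 3]) combines the operands left-associatively, (9 − 4) − 3, and returns 2, matching the decoration described in Example 4.8.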
Example 4.10
Top-Down AG for Constant Expressions
We can flesh out the grammar fragment of Example 4.7 to produce a more complete expression grammar, as shown (with shorter symbol names) in Figure 4.3. The underlying CFG for this grammar accepts the same language as the one in Figure 4.1, but where that one was SLR(1), this one is LL(1). Attribute flow for a parse of (1 + 3) * 2, using the LL(1) grammar, appears in Figure 4.4. As in the grammar fragment of Example 4.9, the value of the left operand of each operator is carried into the TT and FT productions by the st (subtotal) attribute. The relative complexity of the attribute flow arises from the fact that operators are left associative, but the grammar cannot be left recursive: the left and right operands of a given operator are thus found in separate productions. Grammars to perform semantic analysis for practical languages generally require some non–S-attributed flow.
Figure 4.3. An attribute grammar for constant expressions based on an LL(1) CFG.
In this grammar several productions have two semantic rules.
Figure 4.4. Decoration of a top-down parse tree for (1 + 3) * 2, using the AG of Figure 4.3.
Curving arrows again indicate attribute flow; the arrow(s) entering a given box represent the application of a single semantic rule. Flow in this case is no longer strictly bottom-up, but it is still left-to-right. At FT and TT nodes, the left box holds the st attribute; the right holds val.
URL: https://www.sciencedirect.com/science/article/pii/B9780123745149000136
Compilers
Keith D. Cooper, ... Linda Torczon, in Encyclopedia of Physical Science and Technology (Third Edition), 2003
III Internal Representations
Once the compiler is broken into distinct phases, it needs an internal representation to transmit the program between them. This internal form becomes the definitive representation of the program—the compiler does not retain the original source program. Compilers use a variety of internal forms. The selection of a particular internal form is one of the critical design decisions that a compiler writer must make. Internal forms capture different properties of the program; thus, different forms are appropriate for different tasks. The two most common internal representations—the abstract syntax tree and three-address code—mimic the form of the program at different points in translation.
- The abstract syntax tree (AST) resembles the parse tree for the input program. It includes the important syntactic structure of the program while omitting any nonterminals that are not needed to understand that structure. Because of its ties to the source-language syntax, an AST retains concise representations for most of the abstractions in the source language. This makes it the IR of choice for analyses and transformations that are tied to source program structure, such as the high-level transformations discussed in Section VC.
- Three-address code resembles the assembly code of a typical microprocessor. It consists of a sequence of operations with an implicit order. Each operation has an operator, one or two input arguments, and a destination argument. A typical three-address code represents some of the relevant features of the target machine, including a realistic memory model, branches and labels for changing the flow of control, and a specified evaluation order for all the expressions. Because programs expressed in three-address code must provide an explicit implementation for all of the source language's abstractions, this kind of IR is well suited to analyses and transformations that attack the overhead of implementing those abstractions.
To see the difference between an AST and a three-address code, consider representing an assignment statement a[i] ← b * c in each. Assume that a is a vector of 100 elements (numbered 0–99) and that b and c are scalars.
The AST for the assignment, shown on the left, captures the essence of the source-language statement. It is easy to see how a simple in-order treewalk could reprint the original assignment statement. However, it shows none of the details about how the assignment can be implemented. The three-address code for the assignment, shown on the right, loses any obvious connection to the source-language statement. It imposes an evaluation order on the statement: first b, then c, then b * c, then i, then a[i], and, finally, the assignment. It uses the notation @b to refer to b's address in memory—a concept missing completely from the AST.
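A rough Python sketch can make the contrast concrete. It walks a hand-built AST for a[i] ← b * c (a nested tuple) in postorder and emits a three-address sequence. The opcode names, the temporary-name scheme, the @x address notation, and the assumed 4-byte element size are all illustrative choices, not details from the text.

```python
# Emitting three-address code for a[i] <- b * c by a postorder walk of a
# tuple-based AST. Opcodes, @x addressing, and 4-byte elements are
# illustrative assumptions.

ast = ('assign', ('index', 'a', 'i'), ('mul', 'b', 'c'))

temps = iter('t%d' % n for n in range(100))   # fresh temporary names

def emit(node, code):
    if isinstance(node, str):                 # scalar: load its value
        t = next(temps)
        code.append('%s <- load @%s' % (t, node))
        return t
    op = node[0]
    if op == 'mul':
        left, right = emit(node[1], code), emit(node[2], code)
        t = next(temps)
        code.append('%s <- mult %s, %s' % (t, left, right))
        return t
    if op == 'index':                         # compute the address of a[i]
        i = emit(node[2], code)
        t1, t2 = next(temps), next(temps)
        code.append('%s <- mult %s, 4' % (t1, i))        # scale the index
        code.append('%s <- add @%s, %s' % (t2, node[1], t1))
        return t2
    if op == 'assign':
        rhs = emit(node[2], code)             # b * c first
        addr = emit(node[1], code)            # then a[i]'s address
        code.append('store %s -> %s' % (rhs, addr))

code = []
emit(ast, code)
print('\n'.join(code))
```

The emitted sequence follows the evaluation order described above: first b, then c, then b * c, then i, then the address of a[i], and finally the store.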
Many compilers use more than one IR. These compilers shift between representations so that they can use the most appropriate form in each stage of translation.
URL: https://www.sciencedirect.com/science/article/pii/B0122274105001265
Parsers
Keith D. Cooper, Linda Torczon, in Engineering a Compiler (Second Edition), 2012
3.3 Top-Down Parsing
A top-down parser begins with the root of the parse tree and systematically extends the tree downward until its leaves match the classified words returned by the scanner. At each point, the process considers a partially built parse tree. It selects a nonterminal symbol on the lower fringe of the tree and extends it by adding children that correspond to the right-hand side of some production for that nonterminal. It cannot extend the frontier from a terminal. This process continues until either
- a. the fringe of the parse tree contains only terminal symbols, and the input stream has been exhausted, or
- b. a clear mismatch occurs between the fringe of the partially built parse tree and the input stream.
In the first case, the parse succeeds. In the second case, two situations are possible. The parser may have selected the wrong production at some earlier step in the process, in which case it can backtrack, systematically reconsidering earlier decisions. For an input string that is a valid sentence, backtracking will lead the parser to a correct sequence of choices and let it construct a correct parse tree. Alternatively, if the input string is not a valid sentence, backtracking will fail and the parser should report the syntax error to the user.
One key insight makes top-down parsing efficient: a large subset of the context-free grammars can be parsed without backtracking. Section 3.3.1 shows transformations that can often convert an arbitrary grammar into one suitable for backtrack-free top-down parsing. The two sections that follow it introduce two distinct techniques for constructing top-down parsers: hand-coded recursive-descent parsers and generated ll(1) parsers.
Figure 3.2 shows a concrete algorithm for a top-down parser that constructs a leftmost derivation. It builds a parse tree, anchored at the variable root. It uses a stack, with access functions push() and pop(), to track the unmatched portion of the fringe.
Figure 3.2. A Leftmost, Top-Down Parsing Algorithm.
The main portion of the parser consists of a loop that focuses on the leftmost unmatched symbol on the partially-built parse tree's lower fringe. If the focus symbol is a nonterminal, it expands the parse tree downward; it chooses a production, builds the corresponding part of the parse tree, and moves the focus to the leftmost symbol on this new portion of the fringe. If the focus symbol is a terminal, it compares the focus against the next word in the input. A match moves both the focus to the next symbol on the fringe and advances the input stream.
If the focus is a terminal symbol that does not match the input, the parser must backtrack. First, it systematically considers alternatives for the most recently chosen rule. If it exhausts those alternatives, it moves back up the parse tree and reconsiders choices at a higher level in the parse tree. If this process fails to match the input, the parser reports a syntax error. Backtracking increases the asymptotic cost of parsing; in practice, it is an expensive way to discover syntax errors.
The implementation of "backtrack" is straightforward. It sets focus to its parent in the partially-built parse tree and disconnects its children. If an untried rule remains with focus on its left-hand side, the parser expands focus by that rule. It builds children for each symbol on the right-hand side, pushes those symbols onto the stack in right-to-left order, and sets focus to point at the first child. If no untried rule remains, the parser moves up another level and tries again. When it runs out of possibilities, it reports a syntax error and quits.
To facilitate finding the "next" rule, the parser can store the rule number in a nonterminal's node when it expands that node.
When it backtracks, the parser must also rewind the input stream. Fortunately, the partial parse tree encodes enough information to make this action efficient. The parser must place each matched terminal in the discarded production back into the input stream, an action it can take as it disconnects them from the parse tree in a left-to-right traversal of the discarded children.
3.3.1 Transforming a Grammar for Top-Down Parsing
The efficiency of a top-down parser depends critically on its ability to pick the correct production each time that it expands a nonterminal. If the parser always makes the right choice, top-down parsing is efficient. If it makes poor choices, the cost of parsing rises. For some grammars, the worst case behavior is that the parser does not terminate. This section examines two structural issues with cfgs that lead to problems with top-down parsers and presents transformations that the compiler writer can apply to the grammar to avoid these problems.
A Top-Down Parser with Oracular Choice
As an initial exercise, consider the behavior of the parser from Figure 3.2 with the classic expression grammar in Figure 3.1 when applied to the string a + b × c. For the moment, assume that the parser has an oracle that picks the correct production at each point in the parse. With oracular choice, it might proceed as shown in Figure 3.3. The right column shows the input string, with a marker ↑ to indicate the parser's current position in the string. The symbol → in the rule column represents a step in which the parser matches a terminal symbol against the input string and advances the input. At each step, the sentential form represents the lower fringe of the partially-built parse tree.
Figure 3.3. Leftmost, Top-Down Parse of a + b × c with Oracular Choice.
With oracular choice, the parser should take a number of steps proportional to the length of the derivation plus the length of the input. For a + b × c the parser applied eight rules and matched five words.
Notice, however, that oracular choice means inconsistent choice. In both the first and second steps, the parser considered the nonterminal Expr. In the first step, it applied rule 1, Expr → Expr + Term. In the second step, it applied rule 3, Expr → Term. Similarly, when expanding Term in an attempt to match a, it applied rule 6, Term → Factor, but when expanding Term to match b, it applied rule 4, Term → Term × Factor. It would be difficult to make the top-down parser work with consistent, algorithmic choice when using this version of the expression grammar.
Eliminating Left Recursion
One problem with the combination of the classic expression grammar and a leftmost, top-down parser arises from the structure of the grammar. To see the difficulty, consider an implementation that always tries to apply the rules in the order in which they appear in the grammar. Its first several actions would be:
It starts with Expr and tries to match a. It applies rule 1 to create the sentential form Expr + Term on the fringe. Now, it faces the nonterminal Expr and the input word a, again. By consistent choice, it applies rule 1 to replace Expr with Expr + Term. Of course, it still faces Expr and the input word a. With this grammar and consistent choice, the parser will continue to expand the fringe indefinitely because that expansion never generates a leading terminal symbol.
This problem arises because the grammar uses left recursion in productions 1, 2, 4, and 5. With left-recursion, a top-down parser can loop indefinitely without generating a leading terminal symbol that the parser can match (and advance the input). Fortunately, we can reformulate a left-recursive grammar so that it uses right recursion—any recursion involves the rightmost symbol in a rule.
Left Recursion
A rule in a cfg is left recursive if the first symbol on its right-hand side is the symbol on its left-hand side or can derive that symbol.
The former case is called direct left recursion, while the latter case is called indirect left recursion.
The translation from left recursion to right recursion is mechanical. For direct left recursion, like the one shown below to the left, we can rewrite the individual productions to use right recursion, shown on the right.
The transformation introduces a new nonterminal, Fee′, and transfers the recursion onto Fee′. It also adds the rule Fee′ → ∊, where ∊ represents the empty string. This ∊-production requires careful interpretation in the parsing algorithm. To expand the production Fee′ → ∊, the parser simply sets focus ← pop(), which advances its attention to the next node, terminal or nonterminal, on the fringe.
In the classic expression grammar, direct left recursion appears in the productions for both Expr and Term.
Plugging these replacements back into the classic expression grammar yields a right-recursive variant of the grammar, shown in Figure 3.4. It specifies the same set of expressions as the classic expression grammar.
Figure 3.4. Right-Recursive Variant of the Classic Expression Grammar.
The grammar in Figure 3.4 eliminates the problem with nontermination. It does not avoid the need for backtracking. Figure 3.5 shows the behavior of the top-down parser with this grammar on the input a + b × c. The example still assumes oracular choice; we will address that issue in the next subsection. It matches all 5 terminals and applies 11 productions—3 more than it did with the left-recursive grammar. All of the additional rule applications involve productions that derive ∊.
Figure 3.5. Leftmost, Top-Down Parse of a + b × c with the Right-Recursive Expression Grammar.
This simple transformation eliminates direct left recursion. We must also eliminate indirect left recursion, which occurs when a chain of rules such as α → β, β → γ, and γ → αδ creates the situation that α →+ αδ. Such indirect left recursion is not always obvious; it can be obscured by a long chain of productions.
To convert indirect left recursion into right recursion, we need a more systematic approach than inspection followed by application of our transformation. The algorithm in Figure 3.6 eliminates all left recursion from a grammar by thorough application of two techniques: forward substitution to convert indirect left recursion into direct left recursion and rewriting direct left recursion as right recursion. It assumes that the original grammar has no cycles (A →+ A) and no ∊-productions.
Figure 3.6. Removal of Indirect Left Recursion.
The algorithm imposes an arbitrary order on the nonterminals. The outer loop cycles through the nonterminals in this order. The inner loop looks for any production that expands Ai into a right-hand side that begins with Aj, for j < i. Such an expansion may lead to an indirect left recursion. To avoid this, the algorithm replaces the occurrence of Aj with all the alternative right-hand sides for Aj. That is, if the inner loop discovers a production Ai → Aj γ, and Aj → δ1 | δ2 | ⋯ | δk, then the algorithm replaces Ai → Aj γ with a set of productions Ai → δ1 γ | δ2 γ | ⋯ | δk γ. This process eventually converts each indirect left recursion into a direct left recursion. The final step in the outer loop converts any direct left recursion on Ai to right recursion using the simple transformation shown earlier. Because new nonterminals are added at the end and only involve right recursion, the loop can ignore them—they do not need to be checked and converted.
Considering the loop invariant for the outer loop may make this clearer. At the start of the ith outer loop iteration
∀ k < i, no production expanding Ak has Al in its rhs, for l < k.
At the end of this process, (i = n), all indirect left recursion has been eliminated through the repetitive application of the inner loop, and all immediate left recursion has been eliminated in the final step of each iteration.
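The whole process can be sketched in Python. Grammars here are dicts mapping a nonterminal to a list of right-hand sides (lists of symbol names), with the placeholder symbol 'eps' standing for ∊; the representation and the primed-name convention are conveniences of this sketch, not details of Figure 3.6.

```python
# Removing left recursion, in the style of Figure 3.6. Assumes the input
# grammar has no cycles and no eps-productions; 'eps' marks the
# eps-production introduced by the direct-recursion rewrite.

def remove_left_recursion(grammar, order):
    g = {a: [list(rhs) for rhs in rhss] for a, rhss in grammar.items()}
    for i, ai in enumerate(order):
        # forward-substitute A_j (j < i) to expose direct left recursion
        for aj in order[:i]:
            new = []
            for rhs in g[ai]:
                if rhs[0] == aj:                 # A_i -> A_j gamma ...
                    for delta in g[aj]:          # ... becomes delta_k gamma
                        new.append(delta + rhs[1:])
                else:
                    new.append(rhs)
            g[ai] = new
        # rewrite any direct left recursion on A_i as right recursion
        rec  = [rhs[1:] for rhs in g[ai] if rhs[0] == ai]
        base = [rhs     for rhs in g[ai] if rhs[0] != ai]
        if rec:
            prime = ai + "'"                     # new nonterminal A_i'
            g[ai] = [b + [prime] for b in base]
            g[prime] = [alpha + [prime] for alpha in rec] + [['eps']]
    return g

g = remove_left_recursion(
    {'Expr': [['Expr', '+', 'Term'], ['Term']]}, ['Expr'])
print(g)
```

Applied to the left-recursive Expr productions, the sketch produces the right-recursive form of Figure 3.4: Expr → Term Expr′ and Expr′ → + Term Expr′ | ∊.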
Backtrack-Free Parsing
The major source of inefficiency in the leftmost, top-down parser arises from its need to backtrack. If the parser expands the lower fringe with the wrong production, it eventually encounters a mismatch between that fringe and the parse tree's leaves, which correspond to the words returned by the scanner. When the parser discovers the mismatch, it must undo the actions that built the wrong fringe and try other productions. The act of expanding, retracting, and re-expanding the fringe wastes time and effort.
In the derivation of Figure 3.5, the parser chose the correct rule at each step. With consistent choice, such as considering rules in order of appearance in the grammar, it would have backtracked on each name, first trying Factor → ( Expr ) and then Factor → num before deriving name. Similarly, the expansions by rules 4 and 8 would have considered the other alternatives before expanding to ∊.
For this grammar, the parser can avoid backtracking with a simple modification. When the parser goes to select the next rule, it can consider both the focus symbol and the next input symbol, called the lookahead symbol. Using a one symbol lookahead, the parser can disambiguate all of the choices that arise in parsing the right-recursive expression grammar. Thus, we say that the grammar is backtrack free with a lookahead of one symbol. A backtrack-free grammar is also called a predictive grammar.
Backtrack-Free Grammar
a cfg for which the leftmost, top-down parser can always predict the correct rule with lookahead of at most one word
We can formalize the property that makes the right-recursive expression grammar backtrack free. At each point in the parse, the choice of an expansion is obvious because each alternative for the leftmost nonterminal leads to a distinct terminal symbol. Comparing the next word in the input stream against those choices reveals the correct expansion.
The intuition is clear, but formalizing it will require some notation. For each grammar symbol α, define the set first(α) as the set of terminal symbols that can appear as the first word in some string derived from α. The domain of first is the set of grammar symbols, T ∪ NT ∪ {∊, eof} and its range is T ∪ {∊, eof}. If α is either a terminal, ∊, or eof, then first(α) has exactly one member, α. For a nonterminal A, first(A) contains the complete set of terminal symbols that can appear as the leading symbol in a sentential form derived from A.
first Set
For a grammar symbol α, first(α) is the set of terminals that can appear at the start of a sentence derived from α.
Figure 3.7 shows an algorithm that computes the first sets for each symbol in a grammar. As its initial step, the algorithm sets the first sets for the simple cases: terminals, ∊, and eof. For the right-recursive expression grammar shown in Figure 3.4 on page 101, that initial step produces the following first sets:
eof occurs implicitly at the end of every sentence in the grammar. Thus, it is in both the domain and range of first.
Figure 3.7. Computing first Sets for Symbols in a Grammar.
Next, the algorithm iterates over the productions, using the first sets for the right-hand side of a production to derive the first set for the nonterminal on its left-hand side. This process halts when it reaches a fixed point. For the right-recursive expression grammar, the first sets of the nonterminals are:
We defined first sets over single grammar symbols. It is convenient to extend that definition to strings of symbols. For a string of symbols, s = β1 β2 β3 … βk, we define first(s) as the union of the first sets for β1, β2, …, βn, where βn is the first symbol whose first set does not contain ∊, and ∊ ∈ first(s) if and only if it is in the set for each of the βi, 1 ≤ i ≤ k. The algorithm in Figure 3.7 computes this quantity into the variable rhs.
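This fixed-point computation can be sketched in Python, under the same representation assumptions as before (grammars as dicts, with 'eps' and 'eof' as placeholder symbols):

```python
# Fixed-point computation of first sets, after Figure 3.7. Terminals,
# 'eps', and 'eof' get singleton sets; nonterminal sets grow until no
# set changes.

def first_of_string(symbols, first):
    # first of a string: union first(beta_i) left to right, stopping at
    # the first symbol whose first set does not contain eps
    result = set()
    for sym in symbols:
        result |= first[sym] - {'eps'}
        if 'eps' not in first[sym]:
            return result
    return result | {'eps'}       # every symbol can derive eps

def first_sets(grammar, terminals):
    first = {t: {t} for t in terminals | {'eps', 'eof'}}
    first.update({a: set() for a in grammar})
    changed = True
    while changed:
        changed = False
        for a, rhss in grammar.items():
            for rhs in rhss:
                rhs_first = first_of_string(rhs, first)
                if not rhs_first <= first[a]:
                    first[a] |= rhs_first
                    changed = True
    return first

# right-recursive fragment: Expr -> Term Expr'; Expr' -> + Term Expr' | eps
g = {'Expr':  [['Term', "Expr'"]],
     "Expr'": [['+', 'Term', "Expr'"], ['eps']],
     'Term':  [['name'], ['num']]}
fs = first_sets(g, {'+', 'name', 'num'})
print(sorted(fs['Expr']))    # ['name', 'num']
print(sorted(fs["Expr'"]))   # ['+', 'eps']
```

Here first_of_string plays the role of the variable rhs in Figure 3.7: it unions first sets left to right and stops at the first symbol that cannot derive ∊.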
Conceptually, first sets simplify implementation of a top-down parser. Consider, for example, the rules for Expr′ in the right-recursive expression grammar:
When the parser tries to expand an Expr′, it uses the lookahead symbol and the first sets to choose between rules 2, 3, and 4. With a lookahead of +, the parser expands by rule 2 because + is in first(+ Term Expr′) and not in first(− Term Expr′) or first(∊). Similarly, a lookahead of − dictates a choice of rule 3.
Rule 4, the ∊-production, poses a slightly harder problem. first(∊) is just {∊}, which matches no word returned by the scanner. Intuitively, the parser should apply the ∊ production when the lookahead symbol is not a member of the first set of any other alternative. To differentiate between legal inputs and syntax errors, the parser needs to know which words can appear as the leading symbol after a valid application of rule 4—the set of symbols that can follow an Expr′.
To capture that knowledge, we define the set follow(Expr′) to contain all of the words that can occur to the immediate right of a string derived from Expr′. Figure 3.8 presents an algorithm to compute the follow set for each nonterminal in a grammar; it assumes the existence of first sets. The algorithm initializes each follow set to the empty set and then iterates over the productions, computing the contribution of the partial suffixes to the follow set of each symbol in each right-hand side. The algorithm halts when it reaches a fixed point. For the right-recursive expression grammar, the algorithm produces:
follow Set
For a nonterminal α, follow(α) contains the set of words that can occur immediately after α in a sentence.
Figure 3.8. Computing follow Sets for Non-Terminal Symbols.
The parser can use follow(Expr′) when it tries to expand an Expr′. If the lookahead symbol is +, it applies rule 2. If the lookahead symbol is −, it applies rule 3. If the lookahead symbol is in follow(Expr′), which contains eof and ) , it applies rule 4. Any other symbol causes a syntax error.
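A Python sketch of the follow computation, in the same style: first sets are assumed to be given (here written out by hand for a small parenthesized fragment, so that ) actually lands in follow(Expr′)).

```python
# Fixed-point computation of follow sets, after Figure 3.8. The first
# sets are assumed and written out by hand for this fragment.

def follow_sets(grammar, start, first):
    follow = {a: set() for a in grammar}
    follow[start] = {'eof'}                # eof ends every sentence
    changed = True
    while changed:
        changed = False
        for a, rhss in grammar.items():
            for rhs in rhss:
                trailer = set(follow[a])   # what can follow the suffix so far
                for sym in reversed(rhs):
                    if sym in grammar:     # nonterminal
                        if not trailer <= follow[sym]:
                            follow[sym] |= trailer
                            changed = True
                        if 'eps' in first[sym]:
                            trailer = trailer | (first[sym] - {'eps'})
                        else:
                            trailer = set(first[sym])
                    else:                  # terminal (or eps)
                        trailer = set(first[sym])
    return follow

g = {'Expr':  [['Term', "Expr'"]],
     "Expr'": [['+', 'Term', "Expr'"], ['eps']],
     'Term':  [['(', 'Expr', ')'], ['name']]}
first = {'(': {'('}, ')': {')'}, '+': {'+'}, 'name': {'name'},
         'eps': {'eps'}, 'Expr': {'(', 'name'},
         "Expr'": {'+', 'eps'}, 'Term': {'(', 'name'}}
fol = follow_sets(g, 'Expr', first)
print(sorted(fol["Expr'"]))   # [')', 'eof']
```

As the text states, follow(Expr′) comes out to {eof, )}.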
Using first and follow, we can specify precisely the condition that makes a grammar backtrack free for a top-down parser. For a production A → β, define its augmented first set, first+(A → β), as first(β) when ∊ ∉ first(β), and as first(β) ∪ follow(A) otherwise.
Now, a backtrack-free grammar has the property that, for any nonterminal A with multiple right-hand sides, A → β1 | β2 | … | βn, the augmented first sets of the alternatives are pairwise disjoint: first+(A → βi) ∩ first+(A → βj) = ∅, for all 1 ≤ i < j ≤ n.
Any grammar that has this property is backtrack free.
For the right-recursive expression grammar, only productions 4 and 8 have first+ sets that differ from their first sets.
Applying the backtrack-free condition pairwise to each set of alternate right-hand sides proves that the grammar is, indeed, backtrack free.
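The pairwise check is easy to do mechanically. The sketch below builds first+ for each alternative and tests the alternatives for disjointness; the first and follow sets are written out by hand for the Expr′ productions, and dropping ∊ from first+ is a choice of this sketch (∊ never matches an input word).

```python
# Checking the backtrack-free condition: first+ extends first with
# follow(A) whenever eps is in first of the right-hand side, and a
# nonterminal is predictable when its alternatives' first+ sets are
# pairwise disjoint.

def first_plus(a, rhs_first, follow):
    if 'eps' in rhs_first:
        return (rhs_first - {'eps'}) | follow[a]
    return rhs_first

def backtrack_free(first_plus_sets):
    seen = set()
    for fp in first_plus_sets:
        if seen & fp:             # two alternatives share a lookahead word
            return False
        seen |= fp
    return True

# Expr' -> + Term Expr' | - Term Expr' | eps, follow(Expr') = {eof, )}
follow = {"Expr'": {'eof', ')'}}
alts = [first_plus("Expr'", fs, follow)
        for fs in ({'+'}, {'-'}, {'eps'})]
print(backtrack_free(alts))   # True
```

The ∊-alternative's first+ set becomes follow(Expr′) = {eof, )}, disjoint from {+} and {−}, so Expr′ is predictable with one word of lookahead.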
Left-Factoring to Eliminate Backtracking
Not all grammars are backtrack free. For an example of such a grammar, consider extending the expression grammar to include function calls, denoted with parentheses, ( and ), and array-element references, denoted with square brackets, [ and ]. To add these options, we replace production 11, Factor → name, with a set of three rules, plus a set of right-recursive rules for argument lists.
Because productions 11, 12, and 13 all begin with name, they have identical first+ sets. When the parser tries to expand an instance of Factor with a lookahead of name, it has no basis to choose among 11, 12, and 13. The compiler writer can implement a parser that chooses one rule and backtracks when it is wrong. As an alternative, we can transform these productions to create disjoint first+ sets.
A two-word lookahead would handle this case. However, for any finite lookahead we can devise a grammar where that lookahead is insufficient.
The following rewrite of productions 11, 12, and 13 describes the same language but produces disjoint first+ sets:
The rewrite breaks the derivation of Factor into two steps. The first step matches the common prefix of rules 11, 12, and 13. The second step recognizes the three distinct suffixes: [ Expr ] , ( Expr ) , and ∊. The rewrite adds a new nonterminal, Arguments, and pushes the alternate suffixes for Factor into right-hand sides for Arguments. We call this transformation left factoring.
Left Factoring
the process of extracting and isolating common prefixes in a set of productions
We can left factor any set of rules that has alternate right-hand sides with a common prefix. The transformation takes a nonterminal and its productions:
where α is the common prefix and the γ i 's represent right-hand sides that do not begin with α. The transformation introduces a new nonterminal B to represent the alternate suffixes for α and rewrites the original productions according to the pattern:
To left factor a complete grammar, we must inspect each nonterminal, discover common prefixes, and apply the transformation in a systematic way. For example, in the pattern above, we must consider factoring the right-hand sides of B, as two or more of the β i 's could share a prefix. The process stops when all common prefixes have been identified and rewritten.
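One step of this transformation can be sketched as follows. The version below factors a single leading symbol at a time, so repeated application handles longer common prefixes; the generated suffix nonterminal's name and the 'eps' placeholder are conveniences of the sketch.

```python
# One left-factoring step: group a nonterminal's alternatives by their
# leading symbol and move each shared prefix into a new suffix
# nonterminal (the role Arguments plays for Factor in the text).

from collections import defaultdict

def left_factor(a, rhss):
    by_prefix = defaultdict(list)
    for rhs in rhss:
        by_prefix[rhs[0]].append(rhs)
    rules = {a: []}
    for prefix, group in by_prefix.items():
        if len(group) == 1:                  # no sharing: keep as is
            rules[a].append(group[0])
        else:
            b = a + '_suffix'                # fresh nonterminal (assumed name)
            rules[a].append([prefix, b])
            rules[b] = [rhs[1:] or ['eps'] for rhs in group]
    return rules

# Factor -> name | name [ Expr ] | name ( ArgList )
rules = left_factor('Factor', [['name'],
                               ['name', '[', 'Expr', ']'],
                               ['name', '(', 'ArgList', ')']])
print(rules['Factor'])       # [['name', 'Factor_suffix']]
print(rules['Factor_suffix'])
```

The three Factor alternatives collapse to Factor → name Factor_suffix, with the distinct suffixes ∊, [ Expr ], and ( ArgList ) pushed into the new nonterminal, mirroring the rewrite of rules 11, 12, and 13 above.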
Left-factoring can often eliminate the need to backtrack. However, some context-free languages have no backtrack-free grammar. Given an arbitrary cfg, the compiler writer can systematically eliminate left recursion and use left-factoring to eliminate common prefixes. These transformations may produce a backtrack-free grammar. In general, however, it is undecidable whether or not a backtrack-free grammar exists for an arbitrary context-free language.
3.3.2 Top-Down Recursive-Descent Parsers
Backtrack-free grammars lend themselves to simple and efficient parsing with a paradigm called recursive descent. A recursive-descent parser is structured as a set of mutually recursive procedures, one for each nonterminal in the grammar. The procedure corresponding to nonterminal A recognizes an instance of A in the input stream. To recognize a nonterminal B on some right-hand side for A, the parser invokes the procedure corresponding to B. Thus, the grammar itself serves as a guide to the parser's implementation.
Predictive Parsers versus DFAs
Predictive parsing is the natural extension of dfa-style reasoning to parsers. A dfa transitions from state to state based solely on the next input character. A predictive parser chooses an expansion based on the next word in the input stream. Thus, for each nonterminal in the grammar, there must be a unique mapping from the first word in any acceptable input string to a specific production that leads to a derivation for that string. The real difference in power between a dfa and a predictively parsable grammar derives from the fact that one prediction may lead to a right-hand side with many symbols, whereas in a regular grammar, it predicts only a single symbol. This lets predictive grammars include productions such as p → ( p ), which are beyond the power of a regular expression to describe. (Recall that a regular expression such as (+ Σ* )+ can recognize strings that begin with opening parentheses and end with closing ones, but it cannot specify that the numbers of opening and closing parentheses must match.)
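The power gap can be made concrete. A predictive recognizer for p → ( p ) | ε needs only one word of lookahead, yet it accepts exactly the nested-parenthesis strings that no regular expression can pin down (a minimal Python sketch of our own, not code from the book):

```python
def parse_p(words, i=0):
    """Recognize p -> ( p ) | epsilon with one-word lookahead.
    Returns the index just past the parsed prefix."""
    if i < len(words) and words[i] == "(":
        i = parse_p(words, i + 1)        # predict p -> ( p )
        if i < len(words) and words[i] == ")":
            return i + 1
        raise SyntaxError("expected ')'")
    return i                             # predict p -> epsilon

def balanced(s):
    """True iff s is in the language of p: '', '()', '(())', ..."""
    try:
        return parse_p(list(s)) == len(s)
    except SyntaxError:
        return False
```

Here balanced("((()))") succeeds while balanced("(()") fails; the single lookahead character decides between the two productions at every step.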
Of course, a hand-coded, recursive-descent parser can use arbitrary tricks to disambiguate production choices. For example, if a particular left-hand side cannot be predicted with a single-symbol lookahead, the parser could use two symbols. Done judiciously, this should not cause problems.
Consider the three rules for Expr′ in the right-recursive expression grammar:
To recognize instances of Expr′, we will create a routine EPrime(). It follows a simple scheme: choose among the three rules (or a syntax error) based on the first+ sets of their right-hand sides. For each right-hand side, the code tests directly for any further symbols.
To test for the presence of a nonterminal, say A, the code invokes the procedure that corresponds to A. To test for a terminal symbol, such as name, it performs a direct comparison and, if successful, advances the input stream by calling the scanner, NextWord(). If it matches an ∊-production, the code does not call NextWord(). Figure 3.9 shows a straightforward implementation of EPrime(). It combines rules 2 and 3 because they both end with the same suffix, Term Expr′.
Figure 3.9. An Implementation of EPrime().
The strategy for constructing a complete recursive-descent parser is clear. For each nonterminal, we construct a procedure to recognize its alternative right-hand sides. These procedures call one another to recognize nonterminals. They recognize terminals by direct matching. Figure 3.10 shows a top-down recursive-descent parser for the right-recursive version of the classic expression grammar shown in Figure 3.4 on page 101. The code for similar right-hand sides has been combined.
Figure 3.10. Recursive-Descent Parser for Expressions.
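Since the figure itself is not reproduced here, the overall shape of such a parser can be conveyed with a Python sketch (our illustration, not the book's code; * and / stand in for × and ÷). Note how EPrime and TPrime match an ∊-production simply by returning without advancing the input:

```python
class Parser:
    """Recursive-descent recognizer for the right-recursive expression
    grammar (Expr, Expr', Term, Term', Factor)."""
    def __init__(self, words):
        self.words = list(words) + ["eof"]
        self.pos = 0

    def word(self):
        return self.words[self.pos]

    def next_word(self):
        self.pos += 1                    # advance the input stream

    def expr(self):                      # Expr -> Term Expr'
        self.term(); self.eprime()

    def eprime(self):                    # Expr' -> + Term Expr'
        if self.word() in ("+", "-"):    #        | - Term Expr'
            self.next_word(); self.term(); self.eprime()
        # otherwise match the epsilon-production: do not advance

    def term(self):                      # Term -> Factor Term'
        self.factor(); self.tprime()

    def tprime(self):                    # Term' -> * Factor Term'
        if self.word() in ("*", "/"):    #        | / Factor Term' | eps
            self.next_word(); self.factor(); self.tprime()

    def factor(self):                    # Factor -> ( Expr ) | num | name
        if self.word() == "(":
            self.next_word(); self.expr()
            if self.word() != ")":
                raise SyntaxError("expected ')'")
            self.next_word()
        elif self.word() != "eof" and self.word().isalnum():
            self.next_word()             # a num or name token
        else:
            raise SyntaxError(f"unexpected word {self.word()!r}")

def accepts(words):
    p = Parser(words)
    try:
        p.expr()
        return p.word() == "eof"         # all input consumed
    except SyntaxError:
        return False
```

The two raise statements in factor, and their analogues one would add in eprime and tprime, are precisely the points where a production-quality parser would issue its error messages.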
For a small grammar, a compiler writer can quickly craft a recursive-descent parser. With a little care, a recursive-descent parser can produce accurate, informative error messages. The natural location for generating those messages is when the parser fails to find an expected terminal symbol—inside EPrime, TPrime , and Factor in the example.
3.3.3 Table-Driven LL(1) Parsers
Following the insights that underlie the first+ sets, we can automatically generate top-down parsers for backtrack-free grammars. The tool constructs first, follow, and first+ sets. The first+ sets completely dictate the parsing decisions, so the tool can then emit an efficient top-down parser. The resulting parser is called an ll(1) parser. The name ll(1) derives from the fact that these parsers scan their input left to right, construct a leftmost derivation, and use a lookahead of 1 symbol. Grammars that work in an ll(1) scheme are often called ll(1) grammars. ll(1) grammars are, by definition, backtrack free.
To build an ll(1) parser, the compiler writer provides a right-recursive, backtrack-free grammar and a parser generator constructs the actual parser. The most common implementation technique for an ll(1) parser generator uses a table-driven skeleton parser, such as the one shown at the top of Figure 3.11. The parser generator constructs the table, Table , which codifies the parsing decisions and drives the skeleton parser. The bottom of Figure 3.11 shows the ll(1) table for the right-recursive expression grammar shown in Figure 3.4 on page 101.
Figure 3.11. An ll(1) Parser for Expressions.
Parser Generator
a tool that builds a parser from specifications, usually a grammar in a bnf-like notation
Parser generators are also called compiler compilers.
In the skeleton parser, the variable focus holds the next grammar symbol on the partially built parse tree's lower fringe that must be matched. (focus plays the same role as it does in Figure 3.2.) The parse table, Table, maps pairs of nonterminals and lookahead symbols (terminals or eof) into productions. Given a nonterminal A and a lookahead symbol w, Table[A, w] specifies the correct expansion.
The algorithm to build Table is straightforward. It assumes that first, follow, and first+ sets are available for the grammar. It iterates over the grammar symbols and fills in Table, as shown in Figure 3.12. If the grammar meets the backtrack-free condition (see page 107), the construction will produce a correct table in O(|P| × |T|) time, where P is the set of productions and T is the set of terminals.
Figure 3.12. ll(1) Table-Construction Algorithm.
If the grammar is not backtrack free, the construction will assign more than one production to some elements of Table . If the construction assigns to Table[A, w] multiple times, then two or more alternative right-hand sides for A have w in their first + sets, violating the backtrack-free condition. The parser generator can detect this situation with a simple test on the two assignments to Table.
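Both the table construction and the skeleton parser fit in a few lines of Python. The sketch below (our names and a toy grammar, with the first+ sets supplied rather than computed) follows the same logic: a second assignment to a table cell signals the conflict described above, and the skeleton parser pops the fringe symbol, matches terminals, and expands nonterminals through the table:

```python
def build_ll1_table(productions, first_plus):
    """Fill Table[A, w] from each production's FIRST+ set, in the manner
    of Figure 3.12; a multiple assignment to the same cell signals a
    violation of the backtrack-free condition."""
    table = {}
    for p, (lhs, rhs) in enumerate(productions):
        for w in first_plus[p]:
            if (lhs, w) in table:
                raise ValueError(f"FIRST+ conflict at Table[{lhs}, {w}]")
            table[lhs, w] = p
    return table

def ll1_parse(words, table, productions, start):
    """Table-driven skeleton parser in the spirit of Figure 3.11."""
    stack = ["eof", start]
    words = list(words) + ["eof"]
    i = 0
    while stack:
        focus = stack.pop()              # next symbol on the lower fringe
        if focus == words[i]:            # terminal (or eof): match it
            i += 1
        elif (focus, words[i]) in table: # nonterminal: expand it
            rhs = productions[table[focus, words[i]]][1]
            stack.extend(reversed(rhs))  # push rhs, leftmost symbol on top
        else:
            raise SyntaxError(f"error at word {words[i]!r}")
    return True

# Toy grammar: P -> ( P ) | epsilon, with precomputed FIRST+ sets
productions = [("P", ("(", "P", ")")), ("P", ())]
first_plus = {0: {"("}, 1: {")", "eof"}}
table = build_ll1_table(productions, first_plus)
```

Here ll1_parse(["(", "(", ")", ")"], table, productions, "P") succeeds, while an illegal input raises a syntax error at the offending word, just as the "—" entries do in the table of Figure 3.11.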
The example in Figure 3.13 shows the actions of the ll(1) expression parser for the input string a + b × c. The central column shows the contents of the parser's stack, which holds the partially completed lower fringe of the parse tree. The parse concludes successfully when it pops Expr′ from the stack, leaving eof exposed on the stack and eof as the next symbol, implicitly, in the input stream.
Figure 3.13. Actions of the ll(1) Parser on a + b × c.
Now, consider the actions of the ll(1) parser on the illegal input string x + ÷ y, shown in Figure 3.14 on page 115. It detects the syntax error when it attempts to expand a Term with lookahead symbol ÷. Table [Term,÷] contains "—", indicating a syntax error.
Figure 3.14. Actions of the ll(1) Parser on x + ÷ y.
Alternatively, an ll(1) parser generator could emit a direct-coded parser, in the style of the direct-coded scanners discussed in Chapter 2. The parser generator would build first, follow, and first+ sets. Next, it would iterate through the grammar, following the same scheme used by the table-construction algorithm in Figure 3.12. Rather than emitting table entries, it would generate, for each nonterminal, a procedure to recognize each of the possible right-hand sides for that nonterminal. This process would be guided by the first+ sets. It would have the same speed and locality advantages that accrue to direct-coded scanners and recursive-descent parsers, while retaining the advantages of a grammar-generated system, such as a concise, high-level specification and reduced implementation effort.
Section Review
Predictive parsers are simple, compact, and efficient. They can be implemented in a number of ways, including hand-coded, recursive-descent parsers and generated ll(1) parsers, either table driven or direct coded. Because these parsers know, at each point in the parse, the set of words that can occur as the next symbol in a valid input string, they can produce accurate and useful error messages.
Most programming-language constructs can be expressed in a backtrack-free grammar. Thus, these techniques have widespread application. The restriction that alternate right-hand sides for a nonterminal have disjoint first+ sets does not seriously limit the utility of ll(1) grammars. As we will see in Section 3.5.4, the primary drawback of top-down, predictive parsers lies in their inability to handle left recursion. Left-recursive grammars model the left-to-right associativity of expression operators in a more natural way than right-recursive grammars.
Review Questions
- 1.
-
To build an efficient top-down parser, the compiler writer must express the source language in a somewhat constrained form. Explain the restrictions on the source-language grammar that are required to make it amenable to efficient top-down parsing.
- 2.
-
Name two potential advantages of a hand-coded recursive-descent parser over a generated, table-driven ll(1) parser, and two advantages of the ll(1) parser over the recursive-descent implementation.
Read full chapter
URL:
https://www.sciencedirect.com/science/article/pii/B9780120884780000037
Big Data Analytics
Venkat N. Gudivada , ... Vijay V. Raghavan , in Handbook of Statistics, 2015
2.5.3 Evaluating Parsing Algorithms
Performance of parsing algorithms is evaluated using precision and recall measures on test data. Parse trees are represented as a collection of constituents to facilitate evaluation. A constituent refers to a subtree in the parse tree and corresponds to a sequence of words in the sentence being parsed. The label assigned to a constituent is the label of the root of the subtree. A constituent will also have two other pieces of information: the start and end word numbers of the subsequence of the sentence represented by the constituent.
For each sentence in the test data, a manually parsed tree is available. The latter is called the gold standard. The constituents in the parse tree generated by the algorithm are compared with the constituents in the gold standard. Based on the degree of conformance of constituents, precision and recall measures are calculated. Parsers based on lexicalized PCFGs (LPCFGs) perform significantly better than their plain PCFG counterparts.
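The comparison described above reduces to set operations over (label, start, end) triples. A minimal Python version (our names, not the chapter's) is:

```python
def parseval(gold, predicted):
    """Labeled precision and recall over constituents.  Each constituent
    is a (label, start_word, end_word) triple drawn from one gold-standard
    tree and one algorithm-generated tree for the same sentence."""
    gold, predicted = set(gold), set(predicted)
    matched = len(gold & predicted)      # constituents that conform
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical example: the predicted VP span disagrees with the gold one
gold = {("NP", 0, 1), ("VP", 2, 4), ("S", 0, 4)}
pred = {("NP", 0, 1), ("VP", 2, 3), ("S", 0, 4)}
p, r = parseval(gold, pred)              # both 2/3 here
```

Corpus-level scores are then obtained by summing matched, gold, and predicted counts over all test sentences before dividing.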
Read full chapter
URL:
https://www.sciencedirect.com/science/article/pii/B9780444634924000095
Principles and Methods for Data Science
Kalidas Yeturu , in Handbook of Statistics, 2020
9.1 Automatic differentiation
Automatic differentiation (Kakade and Lee, 2018) refers to the generation of an expression parse tree after applying differentiation as an operator. In a graphical representation of a program, the loss function is essentially an expression tree. The more complex a loss function, the more difficult it is to perform the gradient calculation manually and analytically. Automatic differentiation of loss functions has become the de facto standard today, minimizing human error and effort.
An illustration of a very minimalistic differentiation is shown in Fig. 31. The figure shows application of the differential operator to the multiplication operator,
Fig. 31. (A) An illustration of automatic differentiation of a multiplication operation over two operands. The task of differentiation itself is carried forward recursively to the subtrees of the operands. (B) An illustration of how a loop is handled in automatic differentiation. The recurrence equations and updated values are shown in the other two boxes. The initial equation is y = x⁴, and the iterations compute y = y × x + x², from which the derivative is determined. Recurrence equations define the value of y in the kth iteration, y(k), and the value of its derivative, y′(k). For a starting value of x = 3, the derivative y′(2) = 1491 is calculated, as described in the intermediate steps.
The recursive definition of conversion of expression trees extends to loop structures as well. Most typically, loop iterations are treated as incremental, and the differentiation operation moves inside the loop. It is therefore the responsibility of the engineer to create loss functions in which each iteration of the loop corresponds to an operation on an independent parameter of the data. Given that a loss function is a sequence of operations rather than a single operation, the sequence is processed from top to bottom (or left to right) when the differential operator is applied. Whenever the loss function references a previously used variable, the automatic differentiation machinery guarantees that the variable's previously computed differential value is used, seamlessly.
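Forward-mode automatic differentiation makes the loop handling of Fig. 31 concrete: if every value carries its derivative alongside it, running the loop on such pairs applies the product and sum rules at each operation, and the previously computed differential value of y is reused automatically on every iteration. The dual-number sketch below (ours, not the chapter's) reproduces the figure's numbers:

```python
class Dual:
    """A value paired with its derivative; arithmetic on Duals carries
    the derivative along (forward-mode automatic differentiation)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):            # sum rule
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):            # product rule
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

x = Dual(3.0, 1.0)        # seed dx/dx = 1, evaluated at x = 3
y = x * x * x * x         # initial equation: y = x^4
for _ in range(2):        # two iterations of y = y * x + x^2
    y = y * x + x * x
print(y.val, y.dot)       # 765.0 1491.0, i.e. y'(2) = 1491
```

Each pass through the loop differentiates y × x + x² using the current (y, y′) pair, matching the recurrence for y′(k) in the figure.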
Read full chapter
URL:
https://www.sciencedirect.com/science/article/pii/S0169716120300225
Creativity in Computing and DataFlow SuperComputing
D. Bojić , M. Bojović , in Advances in Computers, 2017
Abstract
Parsing is the task of analyzing grammatical structures of an input sentence and deriving its parse tree. Efficient solutions for parsing are needed in many applications such as natural language processing, bioinformatics, and pattern recognition. The Cocke–Younger–Kasami (CYK) algorithm is a well-known parsing algorithm that operates on context-free grammars in Chomsky normal form and has been extensively studied for execution on parallel machines. In this chapter, we analyze the parallelizing opportunities for the CYK algorithm and give an overview of existing implementations on different hardware architectures. We propose a novel, efficient streaming dataflow implementation of the CYK algorithm on reconfigurable hardware (Maxeler dataflow engines), which achieves 18–76× speedup over an optimized sequential implementation for real-life grammars for natural language processing, depending on the length of the input string.
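For reference, the sequential baseline is compact. A straightforward Python recognizer for a Chomsky-normal-form grammar (our sketch, far simpler than the optimized implementation the chapter measures against) makes the algorithm's structure visible:

```python
def cyk(words, grammar, start="S"):
    """CYK recognizer for a grammar in Chomsky normal form.
    grammar holds (A, word) terminal rules and (A, B, C) binary rules;
    table[i][j] collects the nonterminals deriving words[i..j]."""
    n = len(words)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):              # spans of length 1
        for rule in grammar:
            if len(rule) == 2 and rule[1] == w:
                table[i][i].add(rule[0])
    for length in range(2, n + 1):             # longer spans, bottom up
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):              # every split point
                for rule in grammar:
                    if (len(rule) == 3 and rule[1] in table[i][k]
                            and rule[2] in table[k + 1][j]):
                        table[i][j].add(rule[0])
    return n > 0 and start in table[0][n - 1]

# A toy CNF grammar for { a^n b^n : n >= 1 }
grammar = [("S", "A", "B"), ("S", "A", "X"), ("X", "S", "B"),
           ("A", "a"), ("B", "b")]
```

Here cyk(list("aabb"), grammar) accepts and cyk(list("abb"), grammar) rejects; the nested loops over span length, start position, and split point expose the O(n³) work that parallel and dataflow implementations distribute.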
Read full chapter
URL:
https://www.sciencedirect.com/science/article/pii/S0065245816300602
Prolog Programming Language
Heimo H. Adelsberger , in Encyclopedia of Physical Science and Technology (Third Edition), 2003
VII.E A Grammar Example
The arguments of the nonterminals can be used to produce data structures during the parsing process. This means that such a grammar can be used not only to check the validity of a sentence, but also to produce the corresponding parse tree. Another use of arguments is to deal with context dependence. To handle, e.g., number agreement, certain nonterminals will have an extra argument, which can take the values "singular" or "plural."
The following grammar demonstrates some aspects of writing a DCG in Prolog. It produces the complete parse tree of the sentence. It handles some form of number agreement. (A sentence like "The boy kick the ball" would be rejected.) Finally, it separates grammar rules from the dictionary. In this form it is easier to maintain the grammar:
/* A simple grammar */
sentence(s(NP, VP)) -->
    noun_phrase(NP, Number),
    verb_phrase(VP, Number).
noun_phrase(np(Det, Noun), Number) -->
    determiner(Det, Number),
    noun(Noun, Number).
verb_phrase(vp(V, NP), Number) -->
    verb(V, Number, transitive),
    noun_phrase(NP, _).
determiner(det(Word), Number) --> [Word], {is_determiner(Word, Number)}.
noun(n(Root), Number) --> [Word], {is_noun(Word, Number, Root)}.
verb(v(Root, Tense), Number, Transitivity) -->
    [Word], {is_verb(Word, Root, Number, Tense, Transitivity)}.
/* the dictionary */
/* determiner */
is_determiner(a, singular).
is_determiner(every, singular).
is_determiner(the, singular).
is_determiner(all, plural).
/* nouns */
is_noun(man, singular, man).
is_noun(men, plural, man).
is_noun(boy, singular, boy).
is_noun(boys, plural, boy).
is_noun(woman, singular, woman).
is_noun(women, plural, woman).
is_noun(ball, singular, ball).
is_noun(balls, plural, ball).
/* verbs */
is_verb(Word, Root, Number, Tense, Transitivity):-
verb_form(Word, Root, Number, Tense), infinitive(Root, Transitivity).
infinitive(kick, transitive).
infinitive(live, intransitive).
infinitive(like, transitive).
verb_form(kicks, kick, singular, present).
verb_form(kick, kick, plural, present).
verb_form(kicked, kick, _, past).
verb_form(lives, live, singular, present).
verb_form(live, live, plural, present).
verb_form(lived, live, _, past).
verb_form(likes, like, singular, present).
verb_form(like, like, plural, present).
verb_form(liked, like, _, past).
This grammar allows one to parse the original sentence "The boy kicked the ball," producing the following parse tree:
?- phrase(sentence(X), [the, boy, kicked, the, ball]).
X = s(np(det(the), n(boy)), vp(v(kick, past),np(det(the),n(ball)))).
In addition, one can parse many other sentences, most of which may not make sense, such as
a man kicks a man.
a man kicks a boy.
a boy kicked a man.
every man likes every woman.
Read full chapter
URL:
https://www.sciencedirect.com/science/article/pii/B012227410500853X
VIEWs, Derived, and Other Virtual Tables
Joe Celko , in Joe Celko's SQL for Smarties (Fifth Edition), 2015
6.4.3 In-Line Text Expansion
Another approach is to store the text of the CREATE VIEW statement and work it into the parse tree of the SELECT, INSERT, UPDATE, or DELETE statements that use it. This allows the optimizer to blend the VIEW definition into the final query plan. For example, you can create a VIEW based on a particular department, thus:
CREATE VIEW SalesDept (dept_name, city_name, . . .)
AS SELECT 'Sales', city_name, . . .
FROM Departments
WHERE dept_name = 'Sales';
and then use it as a query, thus:
SELECT *
FROM SalesDept
WHERE city_name = 'New York';
The parser expands the VIEW into text (or an intermediate tokenized form) within the FROM clause. The query would become, in effect,
SELECT *
FROM (SELECT 'Sales', city_name, . . .
FROM Departments
WHERE dept_name = 'Sales')
AS SalesDept (dept_name, city_name, . . .)
WHERE city_name = 'New York';
and the query optimizer would then "flatten it out" into
SELECT *
FROM Departments
WHERE (dept_name = 'Sales')
AND (city_name = 'New York');
Though this sounds like a nice approach, it had problems in early systems where the in-line expansion did not result in proper SQL. An earlier version of DB2 was one such system. To illustrate the problem, imagine that you are given a DB2 table that has a long identification number and some figures in each row. The long identification number is like those 40-digit monsters they give you on a utility bill—they are unique only in the first few characters, but the utility company prints the whole thing out anyway. Your task is to create a report that is grouped according to the first six characters of the long identification number. The immediate naive query uses the SUBSTRING() function:
SELECT SUBSTRING(long_id FROM 1 TO 6), SUM(amt1), SUM(amt2), . . .
FROM TableA
GROUP BY long_id;
This does not work; it is incorrect SQL, since the SELECT and GROUP BY lists do not agree. Other common attempts include GROUP BY SUBSTRING(long_id FROM 1 TO 6), which will fail because you cannot use a function, and GROUP BY 1, which will fail because you can use a column position only in a UNION statement (column position is now deprecated in Standard SQL) and in the ORDER BY clause in some products.
The GROUP BY has to have a list of simple column names drawn from the tables of the FROM clause. The next attempt is to build a VIEW:
CREATE VIEW BadTry (short_id, amt1, amt2, . . .)
AS
SELECT SUBSTRING(long_id FROM 1 TO 6), amt1, amt2, . . .
FROM TableA;
and then do a grouped select on it. This is correct SQL, but it did not work in older versions of DB2. The compiler apparently tried to insert the VIEW into the FROM clause, as we have seen, but when it expanded the VIEW out, the results were the same as those of the incorrect first query attempt with a function call in the GROUP BY clause. The trick was to force DB2 to materialize the VIEW so that you could name the column constructed with the SUBSTRING() function. Anything that causes a sort will do this—the SELECT DISTINCT, UNION, GROUP BY, and HAVING clauses, for example.
Since we know that the short identification number is a key, we can use this VIEW:
CREATE VIEW Shorty (short_id, amt1, amt2, . . .)
AS
SELECT DISTINCT SUBSTRING(long_id FROM 1 TO 6), amt1, amt2, . . .
FROM TableA;
Then the report query is
SELECT short_id, SUM(amt1), SUM(amt2), . . .
FROM Shorty
GROUP BY short_id;
This works fine in DB2. I am indebted to Susan Vombrack of Loral Aerospace for this example. Incidentally, this can be written in Standard SQL as
SELECT *
FROM (SELECT SUBSTRING(long_id FROM 1 TO 6) AS short_id,
SUM(amt1), SUM(amt2), . . .
FROM TableA
GROUP BY long_id)
GROUP BY short_id;
The name on the substring result column in the subquery expression makes it recognizable to the parser.
Read full chapter
URL:
https://www.sciencedirect.com/science/article/pii/B9780128007617000061
Source: https://www.sciencedirect.com/topics/computer-science/parse-tree