current position:Home>Scripting language compilation front-end algorithm notes

Scripting language compilation front-end algorithm notes

2022-09-23 08:49:04Book memories of Jiangnan


1. 在词法分析阶段,Finite automata can be constructed by hand(FSA,或 FSM)实现词法解析,过程比较简单.There is now no need to manually construct the lexer,Instead, parse the lexical directly with regular expressions:

(1)首先,translate the regular expression into非确定的有限自动机(Nondeterministic Finite Automaton,NFA).

(2)其次,基于NFA处理字符串,see what features it has.

(3)然后,Convert a nondeterministic finite automaton into确定的有限自动机(Deterministic Finite Automaton,DFA).

(4)最后,运行DFA,see what features it has.

在词法分析阶段,有限自动机(FSA)has a finite number of states,识别Token的过程就是FSAstate transition process.其中,FSAdeterministic finite automata(DFA)和非确定的有限自动机(NFA).DFAis characterized by being in any state,Can do a deterministic conversion based on the input string,比如:

NFA的特点是,It exists certain state,Cannot do a deterministic transformation for some inputs,This is subdivided into two cases:


(2)there is some kind of conversion,在没有任何输入的情况下,It is also possible to migrate from one state to another.

比如,“a[a-zA-Z0-9]*bc”The regular expression to the requirements of the string isa开头,以bc结尾,a和bcThere can be any number of letters or numbers in between.Status in the figure below1node inputb时,There are two paths to choose from in this state,So this finite automaton is aNFA:

这个NFA还有引入εConverted drawing,它们是等价的.实际上第二个NFAThe algorithm mentioned next can be used,Automatically generated from regular expressions,如下图所示:

需要注意的是,无论是NFA还是DFAare equivalent to regular expressions,That is, all regular expressions can be converted intoNFA或DFA,反之亦然.

2. 理解了NFA和DFA之后,To see how to generate from the regular expressionNFA.It needs to be split into two subtasks:

(1)Parse the regular expression into an internal data structure,Easy to use in subsequent programs.Because a regular expression is also a string,So first make a small compiler,to understand strings representing regular expressions.Here is a bit of laziness to generate the corresponding data structure directly for the regular expression of the example,No need to make this compiler.The regular expression used to test can beint关键字、标识符,or numeric literals,如下所示:

int | [a-zA-Z][a-zA-Z0-9]* | [0-9]+

The following code creates a tree-like data structure,to represent the regular expression:

private static GrammarNode sampleGrammar1() {
    GrammarNode node = new GrammarNode("regex1",GrammarNodeType.Or);

    GrammarNode intNode = node.createChild(GrammarNodeType.And);
    intNode.createChild(new CharSet('i'));
    intNode.createChild(new CharSet('n'));
    intNode.createChild(new CharSet('t'));

    GrammarNode idNode = node.createChild(GrammarNodeType.And);
    GrammarNode firstLetter = idNode.createChild(CharSet.letter);

    GrammarNode letterOrDigit = idNode.createChild(CharSet.letterOrDigit);
    letterOrDigit.setRepeatTimes(0, -1);

    GrammarNode literalNode = node.createChild(CharSet.digit);
    literalNode.setRepeatTimes(1, -1);

    return node;



A more intuitive schematic is shown below:

(2)After the test data is generated,put data structures representing regular expressions,转换成一个NFA.这个过程比较简单,because for every structure in the regular expression,can be converted according to a fixed rule.There are several rule steps like this:

a. 识别ε的NFA:does not accept any input,can also move from one state to another,Labels on the edge of the state diagramε,如下图所示:

b. 识别i的NFA:when accepting charactersitriggers a transition,Labels on the edge of the state diagrami.如下图所示:

c. 转换“s|t”这样的正则表达式:意思是或者s或者t,二者选一.s和tis itself two subexpressions,Two new states can be added:Start state and an accepting state(最终状态)Also is the state with double line in the picture below,It means that the string being checked now matches the regular expression.然后用εConverting separately concatenated repss和t的子图,That is, either take the above path and that iss,Either take the following path that ist:

d. 对于“?”“*”和“+”这样的操作:means repeatable0次、0到多次、1到多次,Add extra states and edges when transitioning.以“s*”为例,do the following conversion:

That is, it can bei直接到 f,也就是对s匹配零次,也可以在sloop multiple times on the start and end nodes of.

3. 按照这些规则,Programs can be written to convert.如下列代码所示:

public static State[] regexToNFA(GrammarNode node) {
        State beginState = null;
        State endState = null;

        switch (node.getType()) {
            case Or:
                beginState = new State(); //New start state
                endState = new State(true); //new acceptance state
                for (GrammarNode child : node.children()) {
                    //递归,生成子图,Return to the first and last two states
                    State[] childState = regexToNFA(child);
                    //beginState,通过εThe start state of the received subgraph
                    beginState.addTransition(new CharTransition(), childState[0]);
                    //end state of the subgraph,通过ε接到endState
                    childState[1].addTransition(new CharTransition(), endState);
            case And:
                State[] lastChildState = null;
                for (int i = 0; i < node.getChildCount(); i++) {
                    State[] childState = regexToNFA(node.getChild(i)); //生成子图
                    if (lastChildState != null) {
                        //Merge the accept state of the previous subgraph with the start state of the next subgraph,Join two subgraphs together
                    lastChildState = childState;

                    if (i == 0) {
                        beginState = childState[0];  //overall starting state
                        endState = childState[1];
                    } else {
                        endState = childState[1];    //overall acceptance status
            //Handling ordinary characters
            case Char:
                beginState = new State();
                endState = new State(true);
                //On the edge of the chart is the current nodecharSet,That is, the set of characters that caused the migration,like all letters
                beginState.addTransition(new CharTransition(node.getCharSet()), endState);


        State[] rtn = null;

        //考虑重复的情况,Add necessary nodes and edges
        if (node.getMinTimes() != 1 || node.getMaxTimes() != 1) {
            rtn = addRepitition(beginState, endState, node);
        } else {
            rtn = new State[]{beginState, endState};

        //mark named syntax nodes,Behind will be used to set upToken类型.
        if (node.getName() != null) {
        return rtn;

转换完毕以后,将生成的NFA打印输出,列出了所有的状态,and transitions from each state to other states,比如“0 ε -> 2”means from the state0通过ε转换,到达状态2 :

NFA states:
0  ε -> 2
  ε -> 8
  ε -> 14
2  i -> 3
3  n -> 5
5  t -> 7
7  ε -> 1
1  (end)
8  [a-z]|[A-Z] -> 9
9  ε -> 10
  ε -> 13
10  [0-9]|[a-z]|[A-Z] -> 11
11  ε -> 10
  ε -> 13
13  ε -> 1
14  [0-9] -> 15
15  ε -> 14
  ε -> 1

The image below shows the output visually,The figure is divided into three paths: upper, middle and lower,can clearly see the analysisint关键字、Procedures for identifiers and numeric literals:

生成NFA之后,How to use it to identify whether a string matches thisNFAWhat about the regular expression represented?以上图为例,当解析intA这个字符串时,First select the top path to match,匹配完intThese three characters later come to the state7,If no other characters follow,to reach the accepting state1,返回匹配成功的信息.可实际上,int后面是有A的,So the first path match fails.Cannot return directly after failure“匹配失败”的结果,because there are other paths,So to backtrack to the state0to try the second path,Attempt in the second path succeeded.

运行Regex.java中的matchWithNFA()方法,可以用NFAto do regular expression matching:

 * 用NFA来匹配字符串
 * @param state current state
 * @param chars 要匹配的字符串,用数组表示
 * @param index1 where the current matching character begins.
 * @return 匹配后,新index的位置.Points to the next character after the successful match.
private static int matchWithNFA(State state, char[] chars, int index1){
    System.out.println("trying state : " + + ", index =" + index1);

    int index2 = index1;
    for (Transition transition : state.transitions()){
        State nextState = state.getState(transition);
        if (transition.isEpsilon()){
            index2 = matchWithNFA(nextState, chars, index1);
            if (index2 == chars.length){
        //digest a character,指针前移
        else if (transition.match(chars[index1])){
            index2 ++; //consumes a character

            if (index2 < chars.length) {
                index2 = matchWithNFA(nextState, chars, index1 + 1);
            //If all characters have been scanned
            //Check if the current state is an accepting state,或者可以通过epsilon到达接受状态
            //If the state machine has not reached the accepting state,本次匹配失败
            else {
                if (acceptable(nextState)) {
                    index2 = -1;

    return index2;

which is matching“intA”时,will see its backtracking process,如下所示:

NFA matching: 'intA'
trying state : 0, index =0
trying state : 2, index =0    //Take the first path,即intkeyword this path
trying state : 3, index =1
trying state : 5, index =2
trying state : 7, index =3
trying state : 1, index =3    //It's the end,found that there are characters'A'没有匹配上
trying state : 8, index =0    //回溯,try the second path,即标识符
trying state : 9, index =1
trying state : 10, index =1   //在10和11This loops multiple times
trying state : 11, index =2
trying state : 10, index =2
trying state : 11, index =3
trying state : 10, index =3

It can be seen fromNFA算法的特点:Because there are multiple possible paths,So heuristics and backtracking are required,在比较极端的情况下,There will be a lot of backtracking,Performance will become very slow.Especially when dealing with something likes*这样的语句时,因为s可以重0 到无穷次,So when matching the string,May need to try many times.Pay attention to in the generatedNFA中,If a state has two paths to other states,The algorithm tries different paths in a certain order.

9和11Both states have two lines going out,where the red line is the more preferred path,that is, try to*matches as many characters as possible.This algorithmic strategy is called“贪婪(greedy)”策略.在有的情况下,would like to have the algorithm adoptnon-greedy strategy,或者叫“忽略优先”策略,in order to be more efficient.Some regular expression tools will support adding one more?,比如??、*?、+?to represent a non-greedy strategy.

4. NFAruns that can result in a lot of backtracking,So can it beNFA转换成DFA,Make the string matching process easier?如果能的话,That whole process can be automated,从正则表达式NFA,再从 NFA到DFA.Such an algorithm is子集构造法,它的思路如下:

(1)首先NFA有一个初始状态(从状态0通过εTransitions can reach all states,That is, without accepting any input,从状态0also reachable).This set of states is called“状态0的ε闭包”,simply call its0,s0包含0、2、8、14这几个状态,如下图所示:

(2)将字母i给到s0中的每一个状态,to see what state they can transform into,then pass these states throughεThe state that can be reached by the transition is also added.,形成一个包含“3、9、10、13、1”这5个状态的集合s1.其中3和9letters are acceptedistate to migrate to,10、13、1是在状态9的ε闭包中.如下图所示:

(3)在s0和s1Intermediate drawing transition line,标注上i,意思是s0接收到i的情况下,转换到s1:

在这里把s0和s1regarded as a state.也就是说要生成的DFA,each state of it,就是原来的NFAA collection of some state.在上面的推导过程中,There are two main calculations:

(1)ε-closure(s),即集合s的ε闭包.that is, from the collections中的每个节点,Plus starting from this node throughεtransition to all states.

(2)move(s,’i’),i.e. from the collections接收一个字符i,The set of new states that can be reached.

Continue the derivation according to the above idea,识别intThe identification path of the keyword is also derived.,如下图所示:

Write the above derivation idea as an algorithm,NFA2DFA()The pseudo code of the function is shown below:

把s0push onto the pending stack
把s0Add a collection of all the state setsS
循环:There are still unprocessed state sets in the pending stack
   循环:for each character in the alphabetc
      循环:For each state set in the stacks(i)(unprocessed state set)
          计算s(m) = move(s(i), c)(就是从s(i)出发,接收字符c能够
                                   A collection of new states to migrate to)
          看看s(j)Is it a new state set,If this state set already exists,把它找出来
                  否则,把s(j)Join the full setSand pending stack
          建立s(i)到s(j)的连线,The transition condition isc

运行NFA2DFA()方法,Then print out the generatedDFA.By drawing a picture, you can see the migration path intuitively.,如下图所示:

从初始状态开始,如果输入是i那就走intIdentify this line,也就是按照19、21、22This line migrates in turn,If found to be inconsistentint模式,就跳转到20i.e. the identifier state.注意,在上面的DFA中,As long as the accept status is included1的,都是DFA的接受状态.To further distinguish,22是intThe acceptance status of the keyword,因为它包含了intThe original acceptance status of the keyword7.同理,17is the accepting state of a numeric literal,18、19、20、21Both are the acceptance status of the identifier.

You can run the sample programRegex.java中的matchWithDFA()的方法,看看效果:

private static boolean matchWithDFA(DFAState state, char[] chars, int index){
    System.out.println("trying DFAState : " + + ", index =" + index);
    //根据字符,find next state
    DFAState nextState = null;
    for (Transition transition : state.transitions()){
        if (transition.match(chars[index])){
            nextState = (DFAState)state.getState(transition);

    if (nextState != null){
        //continue to match the string
        if (index < chars.length-1){
            return matchWithDFA(nextState,chars, index + 1);
            //The string is matched
            //See if it arrived in an accepting state
                return true;
                return false;
        return false;

When running, it prints out the matching process,And no traceback is generated during execution.现在可以自动生成DFA了,可以根据DFADo more efficient calculations.But good is bad,DFA也存在一些缺点.比如DFAThere may be many states.假设原来NFA的状态有n个,then combine them into different sets,The total number of possible sets is2的n次方个.for the example aboveNFA有13个状态,所以最坏的情况下,形成的DFA可能有2的13次方,也就是8192个状态,will occupy more memory space.And generate thisDFAIt also requires a certain amount of computing time.Of course this worst situation rarely happens,上面示例的NFA生成DFA后,只有7个状态.

因此,NFA和DFA有各自的优缺点:NFAUsually the number of states is relatively small,can be used directly for calculation,but may involve backtracking,resulting in poor performance;DFAThe number of state can be quite large,占用更多的空间,并且生成DFAIt also needs to consume computing resources.一般来说,Regular expression tools can be directly based onNFA.And the lexical analyzer(如Lex)则是基于DFA,because when generating lexical analysis tools,只需要计算一次DFA,就可以基于这个DFADo a lot of lexical analysis.


5. The recursive descent algorithm is generated during the parsing phaseAST时很常用,But there will be backtracking,There will be a loss in performance.So update the algorithm,Implementing top-down analysis algorithms with predictive capabilities,避免回溯.Algorithms for top-down analysis are a large class of algorithms.总体来说,it starts from a nonterminal,step-by-step derivation of the same program as the parsed programToken串.This process can be seen as a search process of a graph,This picture is very large,Because for every derivation,may generate a new node,如下图所示:

The job of the algorithm is to draw the big picture,找到一条路径,produce a sentence(Token串).For example, the above figure finds three red paths,都能产生“2+3*5”这个表达式.

According to the search strategy,有深度优先(Depth First)和广度优先(Breadth First)两种,The derivation process of the two strategies is different.深度优先is to explore all possibilities along a branch,以“add->mul+add”Production for example,它会先把mulThis non-terminal expansion,比如替换成pri,and then put its first nonterminalpri展开.Only after expanding this branch downwards,will return to the previous node,to expand its siblings.The recursive descent algorithm is depth-first,That's why it can't handle left recursion,Because the left branch can never be fully expanded.

而针对“add->add+mul”这个产生式,广度优先会把add和mulExpand both first,This forms four search paths,分别是mul+mul、add+mul+mul、add+pri和add+mul*pri.Then expand each of their nonterminals again,会形成18new search paths.所以Breadth-first traversal quickly explodes in the number of paths to explore,成指数级上升.Even with the simplest syntax below,去匹配“2+3”表达式,都需要尝试20多次,Not to mention targeting more complex expressions or employing more complex grammar rules:

//a very simple syntax
add -> pri          //1
add -> add + pri    //2
pri -> Int          //3
pri -> (add)        //4

这样看来,The exponential rise of memory consumption and the amount of calculation,makes breadth-first useless at all in this regard.Although the above algorithm has room for optimization,But it cannot fundamentally reduce the complexity of the algorithm.Of course it also has the advantage of being able to use left recursive grammars,However, it will not suffer the performance problems of the algorithm for this advantage..而Depth-first algorithm on the memory footprint is linear growth.Consider backtracking,It also grows exponentially in the worst case,But the complexity can be reduced to linear growth through optimization.

The optimization direction for the depth-first algorithm is to reduce or even avoid backtracking,The idea is to add predictive ability to the algorithm.比如在解析statement时,看到一个if,I know for sure that this is a conditional statement,No need to try other productions.LL算法this type of predictive algorithm.第一个L是Left-to-right,Represents left-to-right handler code.第二个L是Leftmost,Mean the leftmost derivation.

按照语法规则,A non-terminal symbol will form multiple child nodes after expansion,Contains the terminator and non-terminal.最左推导是指,Deduce and expand these nonterminal symbols from left to right.采用Leftmost的方法,During the derivation process, the left side of the sentence is gradually replaced with a terminal symbol,Only the right-hand side may contain nonterminals.以“2+3*5”为例,Its derivation order is from left to right,Nonterminals are gradually replaced by terminals:

The following figure is established by the above derivation processAST,“1、2、3……”Etc number isASTorder of node creation:

6. The above gives a general overview of the top-down analysis algorithm.,And talked about the meaning of the leftmost derivation,现在来看看LL算法到底是什么.LLAlgorithms are top-down algorithms with predictive power.In derivation it is desirable when there are multiple candidate productions,Take a look at the next one(或多个)Token,know which production to use.If you only need to preview aToken,就是LL(1)算法.拿statementgrammar example,it has several productions,分别产生if 语句、while语句、switch语句等,如下所示:

    : block
    | IF parExpression statement (ELSE statement)?
    | FOR '(' forControl ')' statement
    | WHILE parExpression statement
    | DO statement WHILE parExpression ';'
    | SWITCH parExpression '{' switchBlockStatementGroup* switchLabel*                
    | RETURN expression? ';'
    | SEMI
    | statementExpression=expression ';'
    | identifierLabel=IDENTIFIER ':' statement

If you see the next oneToken是if,Then it must be followed byif语句,This achieves the prediction,No need to try each production.问题来了,ifThe first element of a statement's production is a terminator,This is a good natural judgment,But if it is a non-terminal such as an expression statement,该怎么判断呢?

其实可以为statementComputes a set for each branch of,The set contains all possible beginnings of this branchToken.If the start of each branchToken是不一样的,That is, the intersection of these sets is the empty set,Then it is easy to decide which production to choose based on this set,Call such a set the productionFirst集合.FirstCalculation of sets is straightforward,Suppose the production to be computed isx:

(1)如果x以Token开头,那么First(x)The contained element is thisToken,比如if语句的First集合就是{IF}.

(2)如果xstarts with a nonterminala,那么First(x)要包含First(a)的所有成员.比如expressionStatment是以expression开头,因此它的FirstCollection to containFirst(expression)的全体成员.

(3)如果x的第一个元素a能够产生ε,Then look down one more elementb,把First(b)members also joinedFirst(x),以此类推.if all elements may returnε,那么First(x)也应该包含ε,意思是x也可能产生ε.比如下面的blockStatements产生式,它的第一个元素是blockStatement*,也就意味着blockStatementThe number of may0,therefore may produceε.那么First(blockStatements) Except to includeFirst(blockStatement)的全部成员,also include the following“;”,如下所示:

        : blockStatement*

(4)如果x是一个非终结符,It has multiple produce type to choose from,那么First(x)should contain all productionsFirst()集合的成员.比如statement的FirstCollection to containif、whileetc. all productionsFirst集合的成员.and if only one of these productions could produceε,那么x就可能产生ε,因此First(x)就应该包含ε.

在这里的示例中,可以用SampleGrammar.expressionGrammar()method to get the syntax of an expression,把它dump()一下,This is actually an expression syntax that eliminates left recursion:

expression  : assign ;
assign  : equal | assign1 ;
assign1 : '=' equal assign1 | ε;  
equal  : rel equal1 ;
equal1  : ('==' | '!=') rel equal1 | ε ;
rel    : add rel1 ;
rel1  : ('>=' | '>' | '<=' | '<') add rel1 | ε ;
add    : mul add1 ;
add1  : ('+' | '-') mul add1 | ε ;
mul    : pri mul1 ;
mul1  : ('*' | '/') pri mul1 | ε ;
pri    : ID | INT_LITERAL | LPAREN expression RPAREN ;

用GrammarNodeA class represents a node of the grammar,A grammar is formed(The subordinate nodes of the blue node are“或”的关系,That is, the vertical bar in grammar),如下图所示:

Based on this data structure it is possible to calculate the value of each nonterminalFirst集合,Be careful when calculating,Because the context-free grammar recursive nesting is allowed,所以这些GrammarNodeNodes form a graph, not a tree,cannot be computed by simply traversing the treeFirst集合.比如pri节点是expression的后代节点,但pri又引用了expression(pri->(expression)).这样,计算First(expression)需要用到First(pri),而计算First(pri)又需要依赖First(expression).

The way to break this deadlock is to use“不动点法”来计算.Iterate over the nodes in the graph multiple times,See if a new set member is calculated each time.For example, in the first calculation,当求First(pri)的时候,它所依赖的First(expression)The members may not be full,When waiting for the next round to continue the calculation,If you find a new set member, just add it in,until all members of the set are unchanged.

7. 现在可以用FirstThe set is branched,But also to deal with the production may beε的情况,比如“+mul add1 | ε”或“blockStatement*”都会产生ε.对ε的处理分成两种情况:

(1)Some elements of the production will yieldε.比如在JavaWhen declaring a class member in the grammar,可能会用public、privatewait to fix,But you can also omit not to write.In the grammar rules this part is“accessModifier?”,it may produceε,如下所示:

memberDeclaration : accessModifier? type identifier ';' ;
accessModifier : 'public' | 'private' ;
type : 'int' | 'long' | 'double' ;

So when the following two statements are encountered,can be judged as declarations of class members:

public int a;
int b;

这时,typeterminators that can be generated‘int’、‘long’和‘double’也在memberDeclaration的First集合中.This actually putsaccessModifierpenetrated,directly to the next nonterminaltype.So such questions depend onFirstCollection still resolves.在解析的过程中,如果下一个Token是‘int’,可以认为accessModifier返回了ε,ignore it and continue parsing the next elementtype,因为它的Firstwill be included in the collection‘int’.

(2)the production itself(而不是其组成部分)产生ε.Just rely on this kind of problemFirstCollections are unresolvable,To introduce another collection:Follow集合.It is the set of all terminals that may follow a nonterminal.以block语句为例,在PlayScript.g4中,roughly defined like this:

    : '{' blockStatements '}'

    : blockStatement*
    : variableDeclarators ';'
    | statement
    | functionDeclaration
    | classDeclaration

也就是说,block是由blockStatements构成的,而blockStatements可以由0到n个blockStatement构成,therefore may produceε.Then look at parsingblock时会发生什么,Suppose a statement in the curly braces,也就是blockStatments实际上产生了ε,那么在解析block时,首先读取了一个Token即“{”,然后处理blockStatements,Read one moreToken发现是“}”,Then the closing curly brace isblockStatementWhich production of ??In fact it is not in any productionFirst集合中,The following is the pseudo code for the judgment:

nextToken = tokens.peek();                //得到'}'
nextToken in First(variableDeclarators) ? //no
nextToken in First(statement) ?           //no
nextToken in First(functionDeclaration) ? //no
nextToken in First(classDeclaration) ?    //no

Could not find any available productions.这可怎么办呢?Except maybeblockStatmentsproduced itselfε之外,Another possibility is that there is a grammatical error.but continue to judge,就需要用到Follow集合.像blockStatements的Followset has only one element,is the closing curly brace“}”.So just check againnextTokenIs it the curly braces?:

nextToken = tokens.peek();                //得到'}'
nextToken in First(variableDeclarators) ? //no
nextToken in First(statement) ?           //no
nextToken in First(functionDeclaration) ? //no
nextToken in First(classDeclaration) ?    //no

if (nextToken in Follow(blockStatements)) //检查Follow集合
  return Epsilon;                         //推导出ε
  error;                                  //语法错误


(1)Scan grammar rules,看看xwhat symbols might follow.

(2)for the terminator that follows,都加到Follow(x)集合中去.

(3)If the back is the terminator,就把它的Firstcollection to ownFollow集合中去.

(4)最后,If the following nonterminal may yieldε,Just look back,until the program terminator is found.

This symbol is usually written as$,means the end of a program.For example, in the syntax of expressions,expressionmay be followed by this symbol,expressionAll descendant nodes of the right branch of , may also follow this symbol,That is, they may all appear at the end of the program.But other nonterminals are not followed by this symbol,如blockstatements,Because it's there will definitely be back“}”.The fixed point method is also used here for the calculation..Running the program prints out the example syntaxFollow集合.program printoutFirst和followThe collection is organized as follows(The printout also contains some intermediate nodes,这里省略):

In the expression of parsing,will be used comprehensivelyFirst和Follow集合.比如对于“add1 -> + mul add1 | ε”,If read ahead of the nextToken是+,Then process it according to the first production,因为+在First(“+ mul add1”)集合中.If read aheadToken是>号,Then it's definitely not thereFirst(add1)中,but to see if it belongs toFollow(add1),如果是那么add1就产生一个ε,否则就报错.

8. has now establishedFirst集合、Follow集合和LLIntuitive Cognition of Algorithmic Computing Process,This makes it easier to write the implementation of the algorithm..用LLWhen the algorithm parses the grammar,There are two options for implementation:

(1)Or use recursive descent algorithm,It's just that the current recursive descent algorithm does not have any backtracking..No matter where you go,can accurately predict which production should be used.

(2)采用表驱动的方式.This time need based on the calculatedFirst和FollowSet up a predictive analysis table.According to this table lookup to meetToken的情况下,which path should be taken.

这两种方式是等价的,Next, let's talk about how to designLL(k)特别是LL(1)Grammar of Algorithms.We already know that left-recursive grammars are to be avoided.,In addition, try to extract the left common factor,这样可以避免FirstCollection produces intersection.举例来说,The rules for variable declarations and function declarations are similar in the first half,Are type followed by an identifier:

statement : variableDeclare | functionDeclare | other;
variableDeclare : type Identifier ('=' expression)? ;
funcationDeclare : type Identifier '(' parameterList ')' block ;


int age;
int cacl(int a, int b){
  return a + b;

Such grammatical rules, if according toLL(1)算法,First(variableDeclare)和First(funcationDeclare)是相同的,Can't decide which path to take.就算用LL(2)也是一样的,要用到LL(3)才行.但对于LL(k) k > 1来说,Program overhead is a bit high,because there are more sets to compute,Constructing more complex forecast analysis tables.But this problem is easy to solve,As long as the left common factor of them:

statement: declarator | other;
declarator : declarePrefix (variableDeclarePostfix                
                            |functionDeclarePostfix) ;
variableDeclarePostfix : ('=' expression)? ;
functionDeclarePostfix : '(' parameterList ')' block ;

这样,The public portion of the program to parse them namelydeclarePrefix,Then look at the back of the differences.Both of them at this timeFirst集合,一个是{ =  ; },一个是{  (  },No intersection can easily distinguish between both.

三、Shift and reduce:用LR算法推演

9. The parsing algorithms discussed above are all top-down.Corresponding to the bottom-up algorithm,比如LR算法,it can support more syntax,And there is no problem with left recursion.第一个字母L与LL算法的第一个L一样,代表Read programs from left to right.第二个字母R,指的是RightMost(最右推导),也就是Used in production,expands nonterminals from right to left,例如对于“add->add+mul”Such a production,is the prioritymulexpand and thenadd.

自顶向下的算法,Pattern matching is to recursively,to gradually constructAST.So how is the bottom-up algorithm constructed?AST的呢?The answer is to move in-The reduced algorithm.Let's first look at the process of bottom-up parsing through an example,如下所示:

add -> mul
add -> add + mul
mul -> pri
mul -> mul * pri  
pri -> Int | (add)

然后来解析“2+3*5”这个表达式,AST如下,Step look at the specific process of parsing:

(1)看到第一个Token是Int即2,把它作为AST的第一个节点,At the same time, put it in a stack(The part to the left of the red line in the figure below).This stack represents some of theAST节点,把TokenMoved to action is called in the stack移进(Shift).

(2)根据语法规则,Int是从pri推导出来的(pri->Int),Well it's superiorAST肯定是pri,So add a parent node to itpri,Also put in the stackInt替换成了pri.This process is the reverse process of grammar derivation,叫做规约(Reduce),如下图所示:

具体来讲,It was taken out of the work area upside down1到n个元素,Combines the nonterminal of the previous level according to a production,也就是AST的上级节点,Then put it in the work area(That is, to the left of the vertical line above).At this time, there may be non-terminal symbols in the stack,may also have terminators,it seems to be assembledASTa workspace of,The right side of the vertical line is allToken(即终结符),they are waiting.

(3)与第2步一样,因为pri只能是mul推导出来的,That is, the production is“mul->pri”,So make another statute,如下图所示:

(4)根据“add->mul”产生式,将mul规约成add.So far for the first oneToken做了3Secondary Statute,已经到头了.Why do statutes here,rather than stopmulMove in+号,是有原因的.because no production ismul后面跟+号,而addyou can follow+号,如下图所示:

(5)移进+号.There are now two elements in the stack,分别是add和+,如下图所示:

(6)移进Int也就是数字3.There are now in the stack3个元素,如下图所示:

(7)&(8)将Int规约到priRe-contract tomul.So far the way to do the protocol is relatively simple,is the element at the top of the stack,reverse it,如下图所示:

(9)面临3个选择.The first option is to continuemul规约成add,The second option is to put“add+mul”规约成add.This two options are wrong,because they ultimately fail to form the correctAST,如下图所示:

第三个选择,也就是按照“mul->mul*pri”继续移进*numbers instead of conventions.Only in this way can the correctAST,Like the dotted line in the image below:



(12)mul*pri规约成mul.Note that there are also two options here,比如把priContinue to contractmul,But it's also obviously the wrong choice,如下图所示:


At this point, a correctAST,And there is only one element left in the stack,就是根节点.

10. The entire grammar parsing process above,实质是reverse rightmost derivation(Reverse RightMost Derivation),即如果把ASTNodes are numbered according to the order in which they were created,This is what the picture below looks like,The root node number is at most13:

But this is the process of statute,如果是The derivation process from the root node,The order happens to be reversed,先是13The number is then the right child node12号,再是12right child of11号,以此类推.The rightmost derivation process is as follows:

When parsing the grammar, it is pushed back from the bottom,So it is called the reverse rightmost derivation process.从这个意义上讲,LR算法中的Rwith reverse(Reverse)和最右(Reightmost)these two meanings.During the rightmost derivation,The underlined part is called a句柄(Handle).A handle is the right-hand part of a production,and it is in a right-sentence form(Sentence patterns that can be obtained by rightmost derivation)中的位置.Take the bottom row as an example,这个句柄“Int”是产生式“pri->Int”的右边部分,its position is the sentence pattern“Int + Int * Int”的第一个位置.简单来说,The handle is the production at which the derivation is made,If need to do work backward,It is also from this position to do the statute.

11. To turn this judgment process into a rigorous algorithm,Make sure you know at every step whether to move or reduce,Which production should we follow to make the specification?,这就是LRThe core problem to be solved by the algorithm.So how do you find the correct handle??The right derived from the beginning to start production,After several steps of derivation(记作->*),Step by step to form the current situation(That is, there are some non-terminal symbols and terminal symbols in the left stack,You can also preview on the right1到k个Token),如下所示:

add ->* 栈 | Token

According to the information at hand,This multi-step derivation path can be reversed,to get the correct handle.This is based on the information in the left stack,以及右边的Token串.对于LR(0)算法来说,Only on the basis of the stack on the left to find the right handle,对于LR(1)算法来说,One can preview from the rightToken.The idea is to reproduce this derivation path according to the grammar rules.以上面第8步为例,The following figure is its derivation process,The red path is the only way to reach the8步的路径.Know the path of forward derivation,在第8The right choice for step is to do shift:

To demonstrate this derivation process,这里引入一个新概念:项目(Item).Item代表带有“.”production of symbols.比如“pri->(add)”可以产生4个Item,“.”分别在不同的位置.“.”It can be seen as the vertical line in the schematic diagram of the previous steps,The part on the left is considered to be already on the stack,“.”The one on the right is the expected part:


Above is aNFA,利用这个NFAexpresses all possible derivation steps.每个Item(或者状态)when a symbol is received,就迁移到下一个状态,比如“add->.add+mul”在接收到一个addmigrating to “add->add.+mul”,receive another“+”就迁移到“add->add+.mul”.

in the upper left corner of this state diagram,with an auxiliary production“start->add”作为整个 NFA 的唯一入口.from this entrance,可以用这个NFAto match the contents of the stack,比如在第 8 步的时候,stack and next to the right Token 的状态如下,The left side of the vertical line is the content of the stack:

add + mul | *

在NFA中,从start开始遍历,Based on the contents of the stack, the red multi-step derivation path in the figure can be found.During this state transition,The symbols that cause the conversion are“ε、add、+、ε、mul”,忽略其中的εIs the content in the stack.在NFA中,查找到的Item是“mul->mul.*pri”.这个时候“.”在Item的中间.So the next operation can only be aShift操作,that is, the nextToken即*move the number onto the stack.如果“.”在ItemThe last of which corresponds to a reduction operation,比如在第12步,The content is on the stack:

add + mul | $    //$代表Token串的结尾

这个时候的Item是“add->add+mul.”.For all dot symbols at the endItem,There is no way to continue the downward migration,At this time, a protocol operation needs to be done,也就是基于“add + mul”规约到add,也就是到“add->.add+mul”这个状态.对于任何的ε转换,Its reverse operation is also a statute,For example from the picture“add->.add+mul”规约到“start->.add”.But before doing the reduction operation,Still need to check the followingToken是不是在Follow(add)中.对于add来说,它的Follow集合包括{ $ + ) },如果是这些TokenThen do the protocol,Otherwise, a compilation error will be reported.

当然,每个NFAcan be converted into aDFA.So it can be directly on the aboveNFAto match,也可以把NFA转成DFA,避免NFAretrospective phenomenon,让算法效率更高.convertedDFA如下:

在这个DFAAlso marked in the8Derivation path at step.为了更清晰地理解LR算法的本质,基于这个DFADerive the process of parsing again:

(1)Move in oneInt,从状态1迁移到9.Item是“pri->Int.”.

(2)依据“pri->Int”做规约,从状态9回到状态1.Because now there is apri元素,So it's back to the state8.

(3)依据“mul->pri”做规约,从状态8回到状态1,Then according to the stackmulelement enters state7.Pay attention to the state7的时候,There are two possible directions for the next step,分别是“add->mul.”和“mul->mul.*pri”这两个Itemrepresents the direction.基于“add->mul.”will do the protocol,而基于“mul->mul.*pri”will do the move in,This requires a look at the backToken了.如果后面的Token是*号,That's actually going to be the second way.But now the back is+号,So it means that only the statute can be done here.

(4)依据“add->mul”做规约,从状态7回到状态1,再依据addelement enters state2.

(5)移进+号.This corresponds to two transitions on the statechart,First according to the first element in the stackadd从1迁移到2.然后再根据“+”从2到3.Item的变化是:

状态 1:start->.add
状态 1:add->.add+mul
状态 2:add->add.+mul
状态 3:add->add+.mul

By moving in this plus sign,In fact, knowing that the top of this expression must have a“add+mul”的结构.

(6)~(8)移进Int,and has been stipulated untilmul.state change is the first from the state3到状态9,then back to the state3,back to the state4.

(9)Move in one*.According to the elements in the stack,The migration path is 1->2->3->4->5.


(11)根据“pri->Int”规约到pri,Return to state first5,接着根据pri进入状态6.

(12)根据“mul->mul*pri”规约到mul,thereby returning to the state4.

(13)根据“add->add+mul”规约到add,thereby returning to the state2.

从状态2再根据“start->add”One more stipulation, it becomesstart,回到状态1解析完成.

12. LRThe algorithm is based on the strength of the ability and the complexity of the implementation,Can be divided into multiple levels,分别是LR(0)、SLR(k)(即简单 LR)、LALR(k)(Look ahead LR)和LR(k),其中k表示要在TokenRead ahead in the queuek个Token.The characteristics of these types of algorithms are explained below.:

(1)LR(0)No need to look at the right Token,Do backward derivation only from the stack on the left.比如,前面DFA中的状态8只有一个Item:“mul->pri.”.If in this state,Then the next operation is the protocol.Suppose there is another state,它也只有一个Item,Dot not at the end,比如“mul->mul.*pri”,Then the next operation is to move in,The next input on the stack.

But the grammar rules used in practice are rarely so simple.所以LR(0)expression is too weak,Limited grammar rules that can be handled,not very in the previous example,If not read ahead oneToken,Only use information from the left workspace,Could not find the correct handle.such as in state7中,可以做两个操作:对于第一个Item即“add->mul.”,A protocol operation is required;对于第二个Item即“mul->mul.*pri”,actually need to do a shift operation.

The conflict in here,就叫做“移进/规约”冲突(Shift/Reduce Conflict),It means that it can be moved and can be reduced,对于状态7来说,Which operation to do actually depends on the rightToken.

(2)SLR(Simple LR)是在LR(0)的基础上做了增强.对于状态7In this case, a judgment condition must be added:the next input on the rightToken,是不是在add的Follow集合中.Because only this way the statute makes sense.在例子中,add的Follow集合是{+ ) $},如果不在这个范围内,So it's definitely illegal to make a statute.因为Follow集合的意思,就是哪些Tokencan appear after a nonterminal.So if in the state7中下一个Token是*,它不在add的Follow集合中,Then the only viable option left is to move in.So there are no two options,no conflict.

实际上,As far as the example syntax used here,SLR就足够了,But for some more complex syntax,采用SLRstill conflict,比如:

start -> exp
exp -> lvalue = rvalue
exp -> rvalue
lvalue -> Id
lvalue -> *rvalue
rvalue -> lvalue

This syntax says about lvalue and rvalue cases,in the example syntax here,rvalues ​​can only appear on the right side of an assignment symbol.在状态2if the next input is“=”,Then it is possible to do shift and reduce.因为“=”在rvalue的Follow集合中,如下图所示:

(3)How to deal with this conflict?仅仅根据Followcollection to determine whetherReduce不太严谨.Because in the state of the picture above2的情况下,even if it is followed by“=”Still can't do the statute.Because a reduction becomes an rvalue,but it's to the left of the equals sign,Obviously it conflicts with the above grammar definition.办法是FollowCollection dismantled,turn each of its members intoItem的一部分.This allows for a more detailed judgment.如下图所示,After this refinement,found in state2Only the next input in is“$”can make a protocol when.这就是LR(1)算法的原理,它更加强大:

LR(1)Algorithms also have a downside,就是DFA可能会很大.在语法分析阶段,DFAAs the size of the number of grammatical rules rises exponentially,a typical languageDFAThe state may reach thousands of,This makes the parsing performance very poor,Thus lost the practicality.

(4)LALR(k)is an improvement based on this shortcoming.It uses some tricks toMake the number of states smaller,But there is no big loss in processing power,YACC和BisonThese two tools are based onLALR(1)算法的.

copyright notice
author[Book memories of Jiangnan],Please bring the original link to reprint, thank you.

Random recommended