Decidability

  Some problems about CFGs are decidable and others are undecidable. The following seven problems

  1. Decide if two CFGs define the same language
  2. Decide if a given CFG is ambiguous
  3. Decide if a given ambiguous CFG has an equivalent CFG without ambiguity
  4. Decide if the complement of a CFL is a CFL
  5. Decide if the intersection of two CFLs is a CFL
  6. Decide if two CFLs have words in common
  7. Decide if there are any words that a given CFG cannot generate

are all undecidable: it has been proved that no algorithm can decide them.

Emptiness

Given any CFG G, there exists an algorithm to decide whether it generates any words

Proof

  First, if \Lambda\in L(G), then the language contains at least one word. In that case G must satisfy S\stackrel{\ast}{\Rightarrow}\Lambda, which means S is a nullable nonterminal, and we already gave an algorithm to decide whether a nonterminal is nullable in the chapter Grammatical Format.
  Having settled the case of \Lambda, we turn the CFG into CNF. If there is a production S\rightarrow t where t is a terminal, then the CFG generates at least one word. If there is no such production, we have a way to try to build one

  1. For every production N\rightarrow t where t is a terminal, remove all the other productions that have N as their left side, then replace every occurrence of N by t
  2. Repeat step one until either S itself is eliminated or no new nonterminal can be eliminated. If S is eliminated, the CFG generates some word; if not, the CFG generates no words.

This algorithm works because it retraces a derivation backwards. If S yields any word, that word is produced through other nonterminals (possibly none, in which case some production S\rightarrow t already applies), each of which is eventually replaced by terminals. Tracing the derivation steps of the word back up to S, every nonterminal along the way can be eliminated, and finally S itself can be eliminated because it yields that word. If S cannot be eliminated, then no word's derivation can be retraced up to S.
  Termination of this algorithm is obvious: every repetition of step one eliminates at least one nonterminal, and there are only finitely many of them.
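
To make the idea concrete, here is a minimal sketch in Python. It expresses the same back-propagation idea as a fixed-point computation rather than literal substitution: a nonterminal is marked productive once one of its right-hand sides consists only of terminals and already-marked nonterminals. The grammar representation (a dict mapping each nonterminal to a list of right-hand sides, each a list of symbols) is my own choice for illustration, not the book's.

```python
def generates_any_word(grammar, start="S"):
    """Emptiness test: does the grammar derive at least one terminal string?

    grammar: dict mapping each nonterminal to a list of right-hand sides,
             where every right-hand side is a list of symbols.
    A symbol counts as a nonterminal iff it is a key of `grammar`.
    """
    nonterminals = set(grammar)
    productive = set()          # nonterminals known to derive a terminal string
    changed = True
    while changed:              # iterate until a fixed point is reached
        changed = False
        for left, rights in grammar.items():
            if left in productive:
                continue
            for right in rights:
                # right side contains only terminals and productive nonterminals
                if all(s not in nonterminals or s in productive for s in right):
                    productive.add(left)
                    changed = True
                    break
    return start in productive


# S -> AB | a,  A -> AA,  B -> b : A is unproductive, but S -> a still yields a word
g = {"S": [["A", "B"], ["a"]], "A": [["A", "A"]], "B": [["b"]]}
print(generates_any_word(g))    # True
```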

Uselessness

Given a CFG G, there exists an algorithm to decide whether a nonterminal X is ever used in the derivation of a word

Proof

  This proof will be constructive

  1. Find all nonterminals that cannot derive any string of terminals (we call these nonterminals "unproductive")
  2. Eliminate all productions that involve the unproductive nonterminals
  3. For every nonterminal A that has a production with X in its right part, trace upwards to see whether S\stackrel{\ast}{\Rightarrow}uAy, where u and y are (possibly empty) strings of terminals and nonterminals. If there is at least one such A with S\stackrel{\ast}{\Rightarrow}uAy, then X is used in the derivation of some word

This is similar to the propagation of a virus: if a production with left-side nonterminal, say K, involves X in its right part, then any production that involves K in its right part can also yield X. We repeat this procedure bottom-up until we reach S; if we do reach S, then S can yield X, which answers our question.
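
Below is a sketch of the whole test, under the same hypothetical grammar representation as before. It first computes the productive nonterminals, drops every production that mentions an unproductive one, and then propagates reachability downward from S; that finds the same nonterminals as the bottom-up "virus" propagation described above. X is useful iff it is both productive and reachable.

```python
def is_useful(grammar, X, start="S"):
    """Decide whether the nonterminal X can appear in a derivation of some word."""
    nonterminals = set(grammar)

    # 1. Productive nonterminals (same fixed point as in the emptiness test).
    productive = set()
    changed = True
    while changed:
        changed = False
        for left, rights in grammar.items():
            if left not in productive and any(
                all(s not in nonterminals or s in productive for s in right)
                for right in rights
            ):
                productive.add(left)
                changed = True

    # 2. Drop every production that mentions an unproductive nonterminal.
    cleaned = {
        left: [r for r in rights
               if all(s not in nonterminals or s in productive for s in r)]
        for left, rights in grammar.items() if left in productive
    }

    # 3. Propagate reachability downward from the start symbol.
    reachable = {start}
    changed = True
    while changed:
        changed = False
        for left in list(reachable):
            for right in cleaned.get(left, []):
                for s in right:
                    if s in nonterminals and s not in reachable:
                        reachable.add(s)
                        changed = True

    return X in productive and X in reachable
```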

Finiteness

Given a CFG G, there exists an algorithm to decide whether the language it generates is finite

Proof

  This proof will be constructive

  1. Eliminate all the productions that involve useless nonterminals that we've just defined in the last section.
  2. For every nonterminal X, try to find a production K\rightarrow N_1N_2...X...N_n whose left side K is a different nonterminal and whose right side contains X, such that X\stackrel{\ast}{\Rightarrow}uKy, where u and y are (possibly empty) strings of terminals and nonterminals. Such an X is self-embedded, because X\stackrel{\ast}{\Rightarrow}uKy\stackrel{\ast}{\Rightarrow}u'Xy'
  3. Check whether S\stackrel{\ast}{\Rightarrow}uKy for some (possibly empty) strings u and y, i.e., whether the self-embedded nonterminal is reachable from the start symbol; if so, the language is infinite, otherwise it is finite

This proof hides a subtle detail: we need to ensure that there exists a production with X as its left side. This is actually unnecessary, because if there were no such production, X would already have been eliminated in step 1, since it cannot generate any word. The essential idea is to find a self-embedded nonterminal and make sure it is reachable from the start symbol S.
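
A minimal sketch of this test, assuming the grammar is already in CNF and that step 1 has removed every useless nonterminal (both assumptions come from the construction above; the dict representation is again hypothetical). Under these assumptions the language is infinite exactly when some nonterminal can reappear in its own derivation, i.e. when the nonterminal dependency graph contains a cycle.

```python
def language_is_infinite(cnf_grammar):
    """Finiteness test for a CNF grammar whose nonterminals are all useful.

    Builds the graph with an edge A -> B whenever B occurs in a right-hand
    side of an A-production, then looks for a cycle by depth-first search;
    a cycle means some nonterminal is self-embedded, so L(G) is infinite.
    """
    nonterminals = set(cnf_grammar)
    edges = {A: {s for right in rights for s in right if s in nonterminals}
             for A, rights in cnf_grammar.items()}

    WHITE, GRAY, BLACK = 0, 1, 2        # unvisited / on current path / finished
    color = {A: WHITE for A in nonterminals}

    def has_cycle_from(A):
        color[A] = GRAY
        for B in edges[A]:
            if color[B] == GRAY:        # back edge: B derives a string containing B
                return True
            if color[B] == WHITE and has_cycle_from(B):
                return True
        color[A] = BLACK
        return False

    return any(color[A] == WHITE and has_cycle_from(A) for A in nonterminals)


g_inf = {"S": [["A", "S"], ["a"]], "A": [["a"]]}         # L = {a, aa, aaa, ...}
g_fin = {"S": [["A", "B"]], "A": [["a"]], "B": [["b"]]}  # L = {ab}
print(language_is_infinite(g_inf), language_is_infinite(g_fin))   # True False
```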

Membership, CYK Algorithm

Given a CFG G and a string x\in\Sigma^\ast, there exists an algorithm to decide whether x can be generated by the CFG

Proof

  Let n be the length of x and k be the number of nonterminals, let the letters of x be x_1...x_n and the nonterminals be N_1...N_k
  If we know that N_i\stackrel{\ast}{\Rightarrow}x_p...x_q,\ p\le q, and N_o\stackrel{\ast}{\Rightarrow}x_{q+1}...x_j,\ q\lt j, then if there exists a production N_u\rightarrow N_iN_o, we know that N_u\stackrel{\ast}{\Rightarrow} x_p...x_j.
  Going through the letters of x one by one, we construct the following algorithm:

  1. Find the nonterminals that generate x_1, ..., x_n respectively. This is simple: we only need to look through all the terminal productions \text{Nonterminal}\rightarrow\text{terminal}
  2. For each i from 2 to n, find all the nonterminals that produce a substring x_p...x_q with q-p=i-1. For instance, when i=2, the substrings are
    x_1x_2\\
    x_2x_3\\
    x_3x_4\\

    From step 1 we already know which nonterminals generate x_1 to x_n, respectively, so we can list every combination of nonterminals that could generate x_1x_2, x_2x_3, ..., x_{n-1}x_{n}, and check whether some nonterminal has a production yielding that combination. For example, if N_1\rightarrow x_1, N_2\rightarrow x_2 and N_3\rightarrow N_1N_2, we know that N_3\stackrel{\ast}{\Rightarrow}x_1x_2. We repeat this to find all the nonterminals that generate some substring of length 2. When the possibilities are exhausted, we increase i. Now i=3 and we do the same: if N_3\stackrel{\ast}{\Rightarrow}x_1x_2, N_4\rightarrow x_3 and N_4\rightarrow N_3N_4, then N_4\stackrel{\ast}{\Rightarrow}x_1x_2x_3, so N_4 generates one of the substrings of length 3. We apply this repeatedly, exhausting all the possibilities from substrings of length 2 up to substrings of length n, the last of which is x_1...x_n itself.

  3. Check whether S generates the substring x_1...x_n, using the method in step 2. Since x_1...x_n is x itself, if S can generate it then x is an element of L(G); otherwise it is not.

This algorithm is called the CYK algorithm, and it works bottom-up: first we find all the nonterminals that generate the substrings of length 1, then all the nonterminals that can combine the nonterminals found in the previous step into substrings of length 2, then length 3, length 4, ..., up to length n. If, when we reach length n, S is among the nonterminals found, then S is capable of yielding x.
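
Here is a compact sketch of the table-filling just described, again assuming a CNF grammar in the hypothetical dict representation used earlier. table[i][l] holds every nonterminal that derives the substring of x starting at position i with length l; the answer is whether S ends up in the entry for the whole string.

```python
def cyk(cnf_grammar, x, start="S"):
    """CYK membership test for a grammar in Chomsky Normal Form."""
    n = len(x)
    if n == 0:
        return False          # Lambda is handled separately, as in the text
    table = [[set() for _ in range(n + 1)] for _ in range(n)]

    # length-1 substrings: terminal productions N -> x_i
    for i, a in enumerate(x):
        for N, rights in cnf_grammar.items():
            if [a] in rights:
                table[i][1].add(N)

    # longer substrings: split x_i ... x_{i+l-1} into two shorter pieces
    for l in range(2, n + 1):               # substring length
        for i in range(n - l + 1):          # start position
            for split in range(1, l):       # length of the left piece
                for N, rights in cnf_grammar.items():
                    for right in rights:
                        if (len(right) == 2
                                and right[0] in table[i][split]
                                and right[1] in table[i + split][l - split]):
                            table[i][l].add(N)
    return start in table[0][n]


# Example: S -> AB, A -> a, B -> b accepts exactly "ab"
g = {"S": [["A", "B"]], "A": [["a"]], "B": [["b"]]}
print(cyk(g, "ab"), cyk(g, "ba"))   # True False
```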

Parsing

  There is another problem that needs to be solved: given a CFG G and a word w, we want to know not only whether w is in L(G), but also how it gets generated

Given a word w generated by a grammar G, the procedure of finding its derivation is called parsing

The reason why we need parsing is that if we can deduce the derivation tree of a word, then we can analyze its meaning.

Top-down Parser

  Starting with G and w, the task is to find a sequence of productions of G that generates w. We accomplish this by checking all of the leftmost derivations; during this process we build a syntax tree and modify it many times (which may involve another algorithm, Depth-First Search, to traverse the tree: DFS simulates a leftmost derivation all the way down until it finds that the path is impossible, then turns to the next leftmost derivation from the current symbol). In other words, we simulate every possible leftmost derivation until we find a suitable one or determine that the current path is clearly impossible. "Impossible" means that the current derivation can never lead to w, for example because a terminal that w doesn't contain appears during the process, or a terminal appears in a wrong position compared to w. This step is called disambiguation1; it prunes the syntax tree, cutting off the impossible subtrees and reducing the size of the whole tree to improve the time complexity.
  Sometimes there is a tougher problem: for example, if there is a production rule T\rightarrow T+E|F and the current deriving string is i+i+T, both rules can be applied. The method used to solve this problem is called backtracking: we "memorize" the current deriving string and try one of the productions; if it turns out to be impossible, we "restore" the deriving string back to the one we memorized and try the other productions.
  There are some rules that can help us decide which subtrees need to be pruned

  1. Bad substring: If the deriving string contains terminals that don't appear in the original string
  2. Good substrings but too many: The substrings are correct, but they appear more times than they occur in the original string
  3. Good substrings but wrong order: The deriving string has the correct substrings, but in the wrong order, e.g., for substrings ab and ba, the original order is ab...ba, but the deriving string has them reversed
  4. Improper ending or beginning: If the deriving string ends or starts with a terminal that the original string does not, then the derivation is impossible
  5. Excess projected length: The deriving string's length keeps growing and can never shrink (we presume that all \Lambda-productions have been eliminated), so if the target word has n letters and the deriving string must generate at least n+1 terminals, prune it (you can compute the least number of terminals a deriving string can produce by replacing each of its nonterminals with its shortest terminal yield)
  6. Ends with the wrong word: If a derivation terminates, meaning the deriving string contains only terminals, and these terminals do not form the original string, prune this subtree

(These are not all the rules; the full set of pruning rules is not fixed and depends highly on the grammar at hand)
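
As an illustration of the search just described, here is a minimal backtracking sketch that uses pruning rules 4 and 5 (prefix mismatch and excess length). It assumes, as above, that every terminal is a single letter, that \Lambda-productions have been removed, and that there is no cycle of unit productions, so a successful derivation of w never has a sentential form longer than w; the grammar representation and function name are my own, not the book's.

```python
def topdown_parse(grammar, w, start="S"):
    """Backtracking top-down parse: search the leftmost derivations of w,
    pruning deriving strings that can no longer lead to w.
    Returns the list of productions used, or None if w is not in L(G)."""
    nonterminals = set(grammar)
    target = list(w)

    def prefix_ok(form):
        # rule 4: the terminals before the first nonterminal must be a prefix of w
        for i, s in enumerate(form):
            if s in nonterminals:
                return form[:i] == target[:i]
        return form == target

    stack = [([start], [])]                   # (deriving string, productions used)
    while stack:
        form, used = stack.pop()
        if form == target:
            return used                       # found a leftmost derivation of w
        if len(form) > len(target) or not prefix_ok(form):
            continue                          # rules 5 and 4: prune this subtree
        i = next((k for k, s in enumerate(form) if s in nonterminals), None)
        if i is None:
            continue                          # only terminals, but not equal to w
        for right in grammar[form[i]]:        # expand the leftmost nonterminal
            stack.append((form[:i] + right + form[i + 1:],
                          used + [(form[i], right)]))
    return None


# S -> aSb | ab generates a^n b^n; parse "aabb"
g = {"S": [["a", "S", "b"], ["a", "b"]]}
print(topdown_parse(g, "aabb"))   # [('S', ['a', 'S', 'b']), ('S', ['a', 'b'])]
```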

Bottom-up Parser

  The bottom-up parser is the inverse of the top-down parser: it tries to reconstruct the derivation starting from the word itself rather than from the start symbol. The aforementioned CYK algorithm is a bottom-up parser: it traverses all the possible derivations and deduces the nonterminals all the way up to the start symbol.

There is actually a third kind of parser, which works on postfix notation.


  1. Elimination of grammar ambiguity ↩︎
