# Non-Context-Free Languages

## Self-Embeddedness

### Theorem 1

Let G be a CFG in CNF, we call the productions like

\text{Nonterminal}\rightarrow\text{Nonterminal}\text{Nonterminal}

live productions, and

\text{Nonterminal}\rightarrow\text{terminal}

If each production can be used only once during the derivation, we can only generate finitely many words.

#### Proof

For every CFG in CNF, applies a live production will increase the count of nonterminals by one, and applies a dead production will decrease it by one, since the derivation starts with a single nonterminal(the start symbol) initially, and end up with zero nonterminals, we need exactly one more dead productions than live productions because we need to eliminate the single nonterminal, this truth holds for every word that every CFG generates, if a CFG has p live productions and q dead production and each of them can be used only once during the derivation, then there will be at most p live productions and p+1 dead productions be applied, and the words that can be generated from p+1 dead production have at most p+1 letters, so the words generated by applying each production at most once can have at most p+1 letters, which means there can only be finitely many of them.

### Theorem 2

Let G be a CFG in CNF with p live productions and q dead productions, if w\in L(G)\land |w|\gt 2^p, then there exist two nodes m and n in w's syntax tree such that they're the same nonterminal and n is a direct descendant of m for every syntax tree of w

#### Proof

Let's make a simple clarification before we make the proof, assume that the syntax tree of the word w has, let's say, k rows, and its is a perfect binary tree, whereby "perfect" means each of its nodes has two descendants except the last row, because the last row is where the dead productions take place, and those dead productions can only have one descendant which is the terminal itself, if w has exactly s letters, then the syntax tree would have exactly log_{2} s+1 rows for the last two rows have the same number of nodes, so if w has more than s, the syntax tree must has more than log_{2} s+1 rows, since we've restricted the length of the w to be greater than 2^p, this then implies that the number of the rows of the syntax tree derived from w must have more than p+1 rows, which means at least p+2 rows, if a terminal at the last row can be derived from the root of this tree, it must have at least p+2-1=p+1 ancestors, all this ancestors must be live productions because once a dead production occurs the path will end and the terminal of the last row will be generated, now the contradiction appears, there're only p live productions, according to the pigeonhole principle, there must exist a live production such that it was used more than once, and multiple occurrences of the same live production means the same nonterminal, which proves the theorem.

A nonterminal of a derivation of a word in a CFG is said to be self-embedded if it ever occurs as a tree descendant of itself, according to the theorem 2, any sufficiently long word must have a leftmost derivation that include a self-embededed nonterminal

Let the notation \stackrel{\ast}{\Rightarrow} deontes "yields", we say x yields y, if y can be derived from x after certain productions, if

S_1\Rightarrow S_2 \Rightarrow ...\Rightarrow S_n

we say

S_1\stackrel{\ast}{\Rightarrow}S_n

There is actually a funny fact, if a nonterminal N is self-embedded, which means N\stackrel{\ast}{\Rightarrow}vNy(where v and y are strings of terminals yielded along the procedure), then this procedure can be repeated indefinitely, we can apply N\stackrel{\ast}{\Rightarrow}vNy as many times as we want and eventually end up with an infinite language, the conclusion is, if a CFG contains at least one self-embedded nonterminal, it must be infinite, if N\stackrel{\ast}{\Rightarrow}vNy, then so does N\stackrel{\ast}{\Rightarrow}v^nNy^n

## The Pumping Lemma for Context-Free Languages

The discovery we just found is analogous to the essential idea of the pumping lemma of regular languages, we've shown that if a CFG has at least one self-embedded nonterminal, then it can yields those words that some of its part repeated occur and the rest of the word are the same

### Theorem 4

Let G be a CFG in CNF with p live productions, let w\in L(G)\land |w|\gt 2^p, then the w can be break into

w=uvxy

where x\neq\Lambda, v and y can't both be empty, and |vxy|\leqslant 2^p
such that

uv^nxy^nz\in L(G), n\in\mathbb{N}^+

#### Proof

Let P\rightarrow QR is a live production of G, and G starts with S, suppose there exists S\stackrel{\ast}{\Rightarrow}uPz, since we've proved if w is sufficiently long then every of its nonterminals will be self-embedded, so P must be self-embedded which means P\stackrel{\ast}{\Rightarrow}vPy, let's presume that the derivation string just before the second application of P\rightarrow QR is uvPyz, where v or y can be \Lambda but not both, because the v and y are apparently come from P\stackrel{\ast}{\Rightarrow}vPy, and P can be substituted into QR then vPy, the only two edges can be derived from P is Q and R, and Q is always the left cousin of R, if Q yields the next P, the v would be null, because there're only two edges and vPy contains three routes, since the P is on the left edge Q, it occupies the v's place, if the P is yielded from R the situation becomes the opposite, y would be empty, however, no matter the empty one is v or y, the other side must be capable of deriving a subtree, i.e. if Q yields P which implies that v is empty, then R would derive a subtree of y. so either v or y can be empty, but not both of them. At this point, the things become clearer, the current derivation string is uvPyz, where vPy is yielded from the last application of P\rightarrow QR, now we can do this again, we apply P\rightarrow QR to the derivation string and performs exactly the same steps as the first time until we meet the third P to generates uvvPyyz, the P is substituted into vPy again because they're using the same procedures as the first application, keep repeating the applications, we'll eventually find that every uv^nPy^nz where n\in \mathbb{N^+} can be derived. Since the u and z only exist if S yields them before the first P, so they can be null with no doubt, it is totally OK that S yields only P without u and z. Now after some subtitutions, we can finally stop the repeats and yields a string of terminals from P, let's say, x, the derivation string then be like uv^nxy^nz, this means there're totally n+1 occurrences of P where the first n times are the applications of P\stackrel{\ast}{\Rightarrow}vPy, and the last time is P\stackrel{\ast}{\Rightarrow}x. Now let's make another observation, suppose that instead of choosing P arbitrarily, we choose the repeating nonterminals from the bottom, which is row that generates those terminals, we look up the tree from the bottom for p+1 rows, since we've proved that p+1 rows must have two identical nonterminals, let's denote this two nonterminals by N' and N'' where the former one is the upper one, we know that the subtree from N' is at most p+1 rows so the length of the string it generates is at most 2^p. The reason we keep on this is that we can obtain a upper bound of the pumping length, we need to find the closest repetitive nonterminals because the lower it is, the shorter string it generates, which is turned out to be 2^p, be noticed that this string is exactly the vxy, the upper N' generates v and y, and the lower N'' generates x, just like a pyramid, so we've proved that |vxy|\leqslant 2^p