Grammatical Format

We've already know that there're several kinds of languages, some of them can be defined by regular expressions and others can't. Last chapter involves a new way to define languages which we called CFG, and we're now facing a new question, what's the relationship between regular languages and CFG? Can CFG generates all possible languages? Or just part of them?

DEFINITION
A semiword for a CFG is a string of terminals with exactly one trailing nonterminal, such as

(terminal)(terminal)...(terminal)(Nonterminal)

Relationship between regular languages and context-free langauges

Theorem 1

All regular langauges are context-free langauges

All regular langauges can be represented by a DFA, which means to prove this, we need to prove there exists a CFG for every DFA, then prove that all words that accepted by DFA could be generated from CFG, if and only if all strings generated from CFG is accepted by DFA, the first part could be finished by constructive algorithm, for DFA M_n with states S:

Part.1

  1. Let every s\in S be a nonterminal in CFG, where the start state will be the start symbol
  2. for every edge(including edges loop back to the state itself)

    create production

    X\rightarrow aY\hspace{10pt}or\hspace{10pt}Z\rightarrow aZ

    change the terminals in productions according to the character labeled on edge

  3. Create a null string derivation for each final state in S such as
    F\rightarrow \Lambda\text{(Where F is a final state in S)}

Part.2

The paths by which strings traveling through DFA could be represented by semiwords for each steps, respectively, for instance

\begin{aligned}
&&\text{Semiwords} \\
&\text{Start with S} &S\\
&\text{Read an a and goto X} &aX \\
&\text{Read a b and goto Y} & abY \\
&...&... \\
&\text{Read an a and goto F} & abbaaF\\
&\text{And F is a final state, so the word is accepted}
\end{aligned}

and the semiwords it generates is exactly the derivations from the CFG constrcuted by aforementioned algorithm

Part.3

Since we know that the right side of all productions generated by that algorithm are semiwords like

Nonterminal\rightarrow terminal\ Nonterminal

so it is sufficient to conclude that all words derived from the CFG are following a segment of the path on the DFA, and the production it follows to derive a new stage constructs a part of DFA, where the left side of production is one state, the trailing nonterminal of the right side is another state, those terminals in between are edges labeled with characters identical to itself. If the trailing nonterminals derive a null string, the process is then terminated because that means we have reached a final state according to the algorithm, which implies the string has been accepted. Repeat this step on each derivation stage until any nonterminal yields a null string, then we can get the path in which the string traveled through DFA.

Regular languages generated by CFG

Theorem 2

If all the productions of a CFG satisfies

Nonterminal\rightarrow semiword

or

Nonterminal\rightarrow word\text{(may be }\Lambda)

then the language from which this CFG generates is regular

Proof

We shall prove this theorem by showing how to construct a TG from such CFG

\begin{aligned}
N_1&\rightarrow w_1N_2 \\
N_2&\rightarrow w_2N_3 \\
N_3&\rightarrow w_3N_4 \\
&...\\
N_{10}&\rightarrow w_{11}
\end{aligned}

each of the productions satisfies the requirement, we draw states for every nonterminals in CFG except the first one, in this example, from N_2 to N_{10}, then let N_1 be start state and draw an extra halt state Q, for every N_p\rightarrow w_qN_z, draw an directed edge from N_p to N_z labeled with w_q, if N_p is the same as N_z, it would be a loop; for each N_p\rightarrow w_q, draw an directed edge from N_p to Q labeled with w_q.
Now we've built the TG, observe that any path from start state to halt state corresponds to a derivation in the CFG, each step of that path corresponds to a stage of the derivation, and every word generated by the CFG according to some derivations corresponds to a successful path on TG.
Note that the CFG might generates no words at all, but we can still build a TG upon it.

Regular grammar

After those proves, we can now give a definition

DEFINITION
A CFG is said to be a regular grammar, if each of its productions satisfies

Nonterminal\rightarrow semiword

or

Nonterminal\rightarrow word\text{(including }\Lambda)

However, this definition is only a sufficient condition, which means it's not vice-versa, there're still some regular languages that can be defined by CFG does not satisfy the two forms above

Elimination Rules

Λ-Production Elimination

Some of the CFG employed productions with \Lambda, however, it's not always necessary, it is indeed not interchangable if the language includes \Lambda, but sometimes it involves productions like this:

\begin{aligned}
S&\rightarrow aX\\
X&\rightarrow \Lambda
\end{aligned}

observe that the second production is nearly useless, no matter when and how you apply it, the resultant string is always a single a. Actually, the \Lambda-Production is only functioning on those CFLs that contain the word \Lambda, we're about to introduce a theorem to show that \Lambda-Production is unnecessary for a CFG G such that \Lambda\notin L(G)

Theorem 3

For any CFG G for a L(G) that involves \Lambda-Production, there exists a CFG without \Lambda-Production that generates exactly the same langauge(if \Lambda\notin L(G)), or generates L(G)\backslash\lbrace \Lambda\rbrace(if \Lambda\notin L(G))

Proof

DEFINITION
A nonterminal N of a CFG is said to be nullable if there exists

  1. A \Lambda-Production like N\rightarrow \Lambda
  2. A N-derivation that leads to \Lambda like N\implies ...\implies\Lambda

Firstly, delete all the \Lambda-Productions, then for every X\rightarrow s where s is a string contains nonterminals, add new production X\rightarrow w where w is all the possible results of s without one of \mathcal{P}(\text{nonterminals in s}), but there is one exception, we do not allow \Lambda-Productions such as X\rightarrow \Lambda to be generated even if the right side is consist of solely nullable nonterminals, for instance if M is nullable nonterminal, then X\rightarrow M must stay as it is instead of being reduced to X\rightarrow\Lambda, the reason that why we doing this is because if X\rightarrow M can be reduced into X\rightarrow \Lambda, one more \Lambda-Production will be generated, then you have to start again to eliminate this one, such exchanges like this one could keep running indefinitely on some circumstances
For example, if we want to eliminate \Lambda-Productions in CFG

\begin{aligned}
S&\rightarrow a|Xb|aYa\\
X&\rightarrow Y|\Lambda\\
Y&\rightarrow b|X
\end{aligned}

firstly, observe that X is the nonterminal with \Lambda-Production, so we eliminate production X\rightarrow \Lambda; then pick out all the original nullable nonterminals, in this case are X and Y; next step, find all productions with X and Y involved and construct new productions:

\begin{aligned}
&\text{Old Productions}
&\text{Newly Added Productions}\\
&X\rightarrow Y&Nothing\\
&Y\rightarrow X&Nothing\\
&X\rightarrow \Lambda&Nothing\\
&S\rightarrow Xb& S\rightarrow b\\
&S\rightarrow aYa& S\rightarrow aa
\end{aligned}

please notice that the first 3 old productions generates nothing new, because their right side is consist of solely nullable nonterminals and the only thing they can generate is \Lambda-Productions which we don't allow to, so the new CFG will be like

\begin{aligned}
S&\rightarrow a|Xb|aYa|b|aa\\
X&\rightarrow Y\\
Y&\rightarrow b|X
\end{aligned}

Another concern we might be interested in is before performing this algorithm, we must find all nullable nonterminals and this will be far less intuitive than the example above if the CFG is very complicate, we hereby give an algorithm to find all nullable nonterminals in a CFG:

  1. Mark every nonterminal that has a \Lambda-Production
  2. Mark every nonterminal that produces strings in which consist of marked nonterminals
  3. Repeat step two until there're no more nonterminals to be marked, and now all of the nullable nonterminals will be marked

for instance, if the CFG is

\begin{aligned}
S&\rightarrow Xay|YY|aX|ZYX\\
X&\rightarrow Za|bZ|ZZ|Yb\\
Y&\rightarrow Ya|XY|\Lambda\\
Z&\rightarrow aX|YYY
\end{aligned}

step 1, mark the nonterminal Y because it contains the \Lambda-Production; step 2, find all the nonterminals that produce strings of marked nonterminals, since Y is marked, so YYY is consist of solely marked nonterminals, which implies that Z must be marked according to production Z\rightarrow YYY; step 3, X must be marked according to production X\rightarrow ZZ, then Y must also be marked because of XY; in this example, all nonterminals are nullable because all of them are being marked after algorithm ends

Unit Production Elimination

DEFINITION
A unit production is a production rule like

Nonterminal\rightarrow Nonterminal

the right side of the production contains exacly one nonterminal

Theorem 4

If there is a CFG for L with no \Lambda-Productions, then there is a CFG for L without both \Lambda-Production and unit production

Proof

Every nonterminal is exists for substitution, so if there is an unit production A\rightarrow B and another production B\rightarrow str, then for every A it can be substituted to str, looks like we only need to replace unit productions with the final symbol after several iteration of substitutions, but like when we talked in the \Lambda-Productions, if you just eliminate the unit production directly using above method, it may lead into an indefinitely substitution circularly, for instance

\begin{aligned}
S&\rightarrow A|bb\\
A&\rightarrow B|b\\
B&\rightarrow S|a
\end{aligned}

eliminate A\rightarrow B,you will getA\rightarrow S|a|b,there used to be three unit productions before elimination, and there are still three unit productions after elimination: S\rightarrow A, A\rightarrow S, B\rightarrow S, what we've done is only adjust their positions, so we must treat this carefully like the one before.
What we going to do is something like the one above, the substitution is still the only way to eliminate unit productions, but this time, we only substitute the nonunit production parts, for instance, if a CFG contains such a unit production like A\rightarrow B, or it contains an A that can be substituted multiple times and then results in B like

A\implies X_1\implies X_2\implies...\implies B

we then create new productions, if B being like

B\rightarrow S_1|S_2|s_1|s_2|...

where capital S stands for unit productions and s stands for nonunit productions, we copy only the nonunit part to the newly created production

A\rightarrow s_1|s_2|...

we do this for all the unit productions simultaneously, and remove all of them after each of them had been substituted by such a precedure. This algorithm successfully remove all the indefinitely circularly by only copying those nonunit productions, and leave the unit productions as they were, that's the reason of why the newly created productions never involve new unit productions. Also noticed that in the theorem we emphasized that no \Lambda-Productions can be tolerated, because it will make us hard to find the unit production and unit production chains, in the above example, S=ZXY is a unit production, althought it's hard to identify that because it's not so obvious like X\rightarrow Y.
Consider the example above

\begin{aligned}
S&\rightarrow A|bb\\
A&\rightarrow B|b\\
B&\rightarrow S|a
\end{aligned}

it's easy to find all unit production and all unit production chains:

\begin{aligned}
S&\rightarrow A\\
S&\rightarrow A\rightarrow B\\
A&\rightarrow B\\
A&\rightarrow B\rightarrow S\\
B&\rightarrow S\\
B&\rightarrow S\rightarrow A
\end{aligned}

each of them generates a new production

\begin{aligned}
S&\rightarrow b\\
S&\rightarrow a\\
A&\rightarrow a\\
A&\rightarrow bb\\
B&\rightarrow bb\\
B&\rightarrow b
\end{aligned}

so now we get

\begin{aligned}
S&\rightarrow A|bb|b|a\\
A&\rightarrow B|b|a|bb\\
B&\rightarrow S|a|bb|b
\end{aligned}

eliminates all unit productions, we'll get the resultant CFG

\begin{aligned}
S&\rightarrow bb|b|a\\
A&\rightarrow b|a|bb\\
B&\rightarrow a|bb|b
\end{aligned}

Separation of Terminals and Nonterminals

Theorem 5

If L is a language generated by a CFG, there exists another CFG generates all the non-\Lambda words in L, and all of the productions in new CFG are one of two forms:

\begin{aligned}
Nonterminal&\rightarrow \text{string of only Nonterminals}\\
Nonterminal&\rightarrow \text{only one terminal}
\end{aligned}

Proof

The proof is quite simple, just introduce some new nonterminals for each of the terminals in old CFG, for instance, if the old CFG contains terminals a and b, then let A\rightarrow a, B\rightarrow b, replace all the occurrences of a and b in old CFG with A and B(of course it is legal to use a nonterminal that already stands for a or b in old CFG)

Chomsky Normal Form

DEFINITION
A CFG is said to be in Chomsky Normal Form, abbr. CNF, If all productions in a CFG are one of two forms:

\begin{aligned}
Nonterminal&\rightarrow \text{string of exactly two Nonterminals}\\
Nonterminal&\rightarrow \text{only one terminal}
\end{aligned}

Theorem 6

For any CFL(context-free langauge), its non-\Lambda words can be generated by a CFG in CNF, this means the CFG with \Lambda will drop it after converted to CNF

Proof

For any CFG, we will eliminate its \Lambda-Productions and unit productions first, and then convert it into the form specified in Theorem 5, now we've separated the terminals and nonterminals, which means all productions either contains only nonterminals, or contains exactly one terminal, now pick out all productions with more than two nonterminals on the right side, for example, S\rightarrow N_1N_2N_3N_4N_5, we're about to introduce some new nonterminals, each of them stands for another two nonterminals, until we've completely splited the N_1N_2N_3N_4N_5, let the newly introduced nonterminals be R_n, for S\rightarrow N_1N_2N_3N_4N_5, we generates:

\begin{aligned}
S&\rightarrow N_1R_1\\
R_1&\rightarrow N_2R_2\\
R_2&\rightarrow N_3R_3\\
R_3&\rightarrow N_4N_5
\end{aligned}

now, the quintuple-nonterminals production has been splited into four double-nonterminals production, we could run this algorithm for every production with more than two nonterminals, when S is being substituted, it will first substitute R_1, then R_2, until R_3:

\begin{aligned}
S&\implies N_1R_1\\
&\implies N_1N_2R_2\\
&\implies N_1N_2N_3R_3\\
&\implies N_1N_2N_3N_4N_5
\end{aligned}
Invariance

Like always, to prove the correctness of this algorithm, we must ensure that the language it generates has not been altered: 1. new CFG generates all words that're formerly generates 2. new CFG generates no words that cannot be generated formerly

  1. We can still generate all the words that can be generated before, because according to the substitution that we've shown above, we've just split one step into multiple steps, the resultant string stays unchanged, formerly, it could generate N_1N_2N_3N_4N_5 in one step; now, it still generate and only generate N_1N_2N_3N_4N_5 with three extra steps
  2. To show that it does not generate words that cannot be generated before, observe that the newly introduced nonterminals R_n only appear in two places, for instance R_2, it only appears in R_1\rightarrow N_2R_2 and R_2\rightarrow N_3R_3, and these two productions will be substituted at last, they just like some intermediate, only visible during the derivation of S, so it does not interfere other productions, which implies no extra words will be generated

Leftmost Derivations

DEFINITION
Leftmost nonterminal is the first nonterminal(left to right) of the string during derivation, for example, abAXcNFy's leftmost nonterminal is A

A word's derivation procedure from a CFG is said to be leftmost derivation, if on each stage of the derivation, we always apply the production rule to its leftmost nonterminal

Consider the CFG

\begin{aligned}
S&\rightarrow aSX|b\\
X&\rightarrow Xb|a
\end{aligned}

word aababa can be generated from this CFG by a leftmost derivation

\begin{aligned}
S&\implies aSX\\
&\implies aaSXX\\
&\implies aabXX\\
&\implies aabXbX\\
&\implies aababX\\
&\implies aababa
\end{aligned}

the production is applied to the leftmost nonterminal on each derivation stage, apply Depth-First Search algorithm on the leftmost derivation's syntax tree, you will found that the order of nonterminals in the iteration will be the order of their substitutions.

Theorem 7

Any word that can be generated by a CFG also has a leftmost derivation

We just need to change the substitute order(or said to be the order of the application of productions) by always replacing the leftmost nonterminal, those nonterminals are just placeholders, changing the substitute order won't affect the resultant string


行成于思,毁于随