Lecture No. 06

Dated: 14-03-2025

How to Describe Tokens

Regular languages¹ are the most popular formalism for specifying tokens² because:

They are based on a simple and useful theory.
They are easy to understand.
Efficient implementations exist for generating lexical analyzers² from such specifications.

Languages

Let \(\Sigma\) be a set³ of characters. \(\Sigma\) is called the alphabet.⁴ A language⁴ defined over \(\Sigma\) is a set³ of strings⁴ of characters drawn from \(\Sigma\).
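
The definition can be made concrete with a minimal Python sketch. The alphabet `SIGMA`, the set `LANGUAGE`, and the helper `is_over_alphabet` are hypothetical names chosen for this illustration; they do not come from the lecture.

```python
# Hypothetical two-character alphabet, chosen only for illustration.
SIGMA = {"a", "b"}

def is_over_alphabet(w, sigma):
    """Return True if every character of the string w is drawn from sigma."""
    return all(ch in sigma for ch in w)

# A language over SIGMA is simply a set of strings over SIGMA.
# This one is finite; languages may also be infinite.
LANGUAGE = {"", "ab", "abb", "ba"}

# Every member of the language uses only characters from the alphabet.
assert all(is_over_alphabet(w, SIGMA) for w in LANGUAGE)

print("abb" in LANGUAGE)               # True: "abb" is a member
print(is_over_alphabet("abc", SIGMA))  # False: "c" is not in SIGMA
```

Note that a string over the alphabet is not automatically in the language: "aa" uses only characters from `SIGMA` but is not a member of `LANGUAGE`.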

Examples

Alphabet = English letters
Language = English sentences

Alphabet = ASCII
Language = the set of valid programs in C++ (or, likewise, in Java or C#)

Each regular expression⁵ is a notation for a regular language.¹ If \(A\) is a regular expression,⁵ then \(L(A)\) denotes the regular language¹ that it describes.
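
As a small worked illustration (the expressions are chosen here for the example and are not from the lecture), take \(\Sigma = \{a, b\}\), read \(|\) as "or", and read juxtaposition as "followed by":

\[ L(a) = \{a\}, \qquad L(a | b) = \{a, b\}, \qquad L((a | b)(a | b)) = \{aa, ab, ba, bb\} \]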

Regular Expressions

A regular expression⁵ over \(\Sigma\) is built from the following forms:

\(a\) matches the single character \(a\), for any \(a\) in \(\Sigma\).
\(\epsilon\) matches the empty string.⁴
\(R | S\) matches either \(R\) or \(S\) (union).
\(RS\) matches \(R\) followed by \(S\) (concatenation).
\(R^*\) matches zero or more concatenated repetitions of \(R\) (\(R^* = \epsilon | R | RR | RRR | \ldots\)).
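
The five forms above can be turned directly into a membership test. The following is a minimal sketch, not an efficient implementation: the tuple encoding (`"chr"`, `"eps"`, `"alt"`, `"cat"`, `"star"`) and the function names `ends` and `in_language` are assumptions made for this illustration.

```python
# Hypothetical encoding of regular expressions as nested tuples:
#   ("chr", c)      the single character c
#   ("eps",)        the empty string
#   ("alt", R, S)   R | S
#   ("cat", R, S)   RS
#   ("star", R)     R*

def ends(r, w, i):
    """Return the set of positions j such that w[i:j] is in L(r)."""
    kind = r[0]
    if kind == "eps":
        return {i}
    if kind == "chr":
        return {i + 1} if i < len(w) and w[i] == r[1] else set()
    if kind == "alt":
        return ends(r[1], w, i) | ends(r[2], w, i)
    if kind == "cat":
        return {k for j in ends(r[1], w, i) for k in ends(r[2], w, j)}
    if kind == "star":
        # Zero repetitions reach position i; keep applying R until
        # no new end positions are discovered.
        seen, frontier = {i}, {i}
        while frontier:
            reached = {k for j in frontier for k in ends(r[1], w, j)}
            frontier = reached - seen
            seen |= frontier
        return seen
    raise ValueError("unknown form: %r" % (kind,))

def in_language(r, w):
    """True if the whole string w belongs to L(r)."""
    return len(w) in ends(r, w, 0)

a, b = ("chr", "a"), ("chr", "b")
ab_star = ("star", ("cat", a, b))   # (ab)*

print(in_language(ab_star, ""))      # True: zero repetitions
print(in_language(ab_star, "abab"))  # True: two repetitions
print(in_language(ab_star, "aba"))   # False
```

Each branch of `ends` mirrors one line of the definition, which is the point of the sketch. A practical lexical analyzer would instead compile the expression to a finite automaton.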

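These forms are commonly extended with shorthand notation:

\(R?\) means \(\epsilon | R\) (an optional \(R\)).
\(R^+\) means \(RR^*\) (one or more repetitions of \(R\)).
\((R)\) groups \(R\) so that operators apply to it as a unit.
\([abc]\) means \(a | b | c\).
\([a-z]\) means any single character in the range \(a\) to \(z\), that is, \(a | b | \ldots | z\).
[^ab] means any single character of \(\Sigma\) other than \(a\) and \(b\).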
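
The extensions add convenience, not expressive power: each can be rewritten using the basic forms. The sketch below checks this empirically with Python's standard `re` module, by comparing the sets of short strings that a shorthand pattern and its expansion fully match. The four-character alphabet, the length bound, and the helper `language_upto` are assumptions made for this illustration. Python's `re` also supports features beyond regular languages, but none are used here.

```python
import re
from itertools import product

ALPHABET = "abcd"  # hypothetical small alphabet, chosen for this sketch

def language_upto(pattern, max_len=4):
    """Strings over ALPHABET of length <= max_len that fully match pattern."""
    words = ("".join(t)
             for n in range(max_len + 1)
             for t in product(ALPHABET, repeat=n))
    return {w for w in words if re.fullmatch(pattern, w)}

# (shorthand, expansion into basic forms); "()" plays the role of epsilon.
EQUIVALENCES = [
    ("(ab)?", "()|ab"),    # R? = epsilon | R
    ("(ab)+", "ab(ab)*"),  # R+ = RR*
    ("[abc]", "a|b|c"),
    ("[a-c]", "a|b|c"),
    ("[^ab]", "c|d"),      # complement is taken relative to ALPHABET
]

for shorthand, expanded in EQUIVALENCES:
    assert language_upto(shorthand) == language_upto(expanded), shorthand

print(sorted(language_upto("(ab)+")))  # ['ab', 'abab']
```

The comparison only covers strings up to a fixed length, so it is a sanity check rather than a proof of equivalence.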

Finite Automaton

We need a mechanism to determine whether a string⁴ \(w\) belongs to the language⁴ \(L(R)\), where \(R\) is a regular expression.⁵ Such a mechanism is called an acceptor.
The acceptor is based on a finite automaton,⁶ which consists of:

An input alphabet⁴ \(\Sigma\).
A set³ of states.
A start (initial) state.
A set³ of transitions.
A set³ of accepting (final) states.
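
The five components can be written down directly as data. The following minimal sketch assumes a deterministic automaton (exactly one transition per state and input character) and uses a hypothetical example, an acceptor for \((a | b)^*abb\); the names `SIGMA`, `STATES`, `START`, `DELTA`, `ACCEPTING`, and `accepts` are chosen for this illustration.

```python
SIGMA = {"a", "b"}     # input alphabet
STATES = {0, 1, 2, 3}  # set of states
START = 0              # start (initial) state
ACCEPTING = {3}        # set of accepting (final) states

# Set of transitions: (current state, input character) -> next state.
# State k records that the last k characters read form a prefix of "abb".
DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}

def accepts(w):
    """Return True if the automaton ends in an accepting state after reading w."""
    state = START
    for ch in w:
        if ch not in SIGMA:
            return False  # character outside the alphabet: reject
        state = DELTA[(state, ch)]
    return state in ACCEPTING

print(accepts("abb"))   # True
print(accepts("babb"))  # True
print(accepts("abba"))  # False: the string does not end in "abb"
```

The acceptor reads each input character exactly once and keeps only the current state, which is why lexical analyzers built this way run in time linear in the input length.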

References

¹ Read more about regular languages.
² Read more about tokens and lexical analyzers.
³ Read more about sets.
⁴ Read more about languages, alphabets, and strings.
⁵ Read more about regular expressions.
⁶ Read more about finite automata.