Lecture No. 06
Dated: 14-03-2025
How to Describe Tokens
Regular languages
1 are the most popular for specifying tokens
2 because
- They are based on simple and useful theory.
- Are easy to understand.
- Efficient implementations exist for generating
lexical analyzers
2 based on such languages.
Languages
Let \(\Sigma\) be a set
3 of characters. \(\Sigma\) is called the alphabet
.4 A language
4 defined over \(\Sigma\) is a set
3 of strings
4 of characters drawn from \(\Sigma\).
Examples
Alphabet = English letters
Language = English sentences
Alphabet = ASCII
Language = C++, Java, C#
Each regular expression
5 is a notation for a regular language
.1 If \(A\) is the regular expression
5 then \(L(A)\) is the regular language
.1
Regular Expressions
Regular expression
5 is defined as
- \(a\) for ordinary character from \(\Sigma\).
- \(\epsilon\) for empty
string
.4 - \(R | S\) means either \(R\) or \(S\).
- \(RS\) means \(R\) followed by \(S\).
- \(R^*\) means concatenation of \(R\) zero or more times (\(R^* = \epsilon | R | RR | RRR | …\))
Then there are extensions to it as well
- \(R?\) means \(\epsilon | R\).
- \(R^+\) means \(RR^*\).
- \((R)\) is for grouping.
- \([abc]\) means \(a | b | c\).
- \([a-z]\) means characters ranging from \(a\) to \(z\).
[^ab]
means anything except for \(a\) and \(b\).
Finite Automaton
We need a mechanism to determine if the string
4 \(w\) belongs to \(L(R)\) language
4 where \(R\) is the regular expression
.5 Such a mechanism is called an acceptor
.
The acceptor
is based on a finite automaton
6 which consists of
- An input
alphabet
4 \(\Sigma\). - A
set
3 ofstates
. - A start (initial)
state
. - A
set
3 oftransitions
. - A
set
3 of accepting (final)states
.