Lecture No. 05
Dated: 14-03-2025
Lexical Analysis
The scanner
1 converts the stream of characters into stream of tokens
.1 This process is called lexical analysis
.
Tokens
A token
1 is a syntactic category in a sentence of a language. The pair <role, word>
is called the token
.1
Example
Consider the English language sentence "He wrote the program".
<Subject, He>
<Verb, wrote>
<Object, the program>
Similarly, for languages like C
if (b == 0) a = b
<Keyword, if>
<parenthesis, (>
<variable, b>
<bool operator, ==>
<number, 0>
<parenthesis, )>
<variable, a>
<assignment operator, =>
<variable, b>
Ad Hoc Lexer
We can write a lexer
in c++
which reads from left to right and also reads a little bit ahead to determine where a token
1 ends and where does a new token
1 begins.
class Lexer {
Inputstream s;
char next; // look ahead
Lexer(Inputstream);
Token nextToken();
Token readID();
bool idChar(char c);
Token readNumber();
};
Lexer::Lexer(Inputstream _s) {
s = _s;
next = s.read();
}
Token Lexer::nextToken() {
if (idChar(next))
return readID();
if (number(next))
return readNumber();
if (next == '"')
return readString();
// ...
}
Token Lexer::readID() {
string id = "";
while (true) {
char c = input.read();
if (idChar(c) == false)
return new Token(TID, id);
id = id + string(c);
}
}
bool Lexer::idChar(char c) {
if (isAlpha(c))
return true;
if (isDigit(c))
return true;
if (c == '_')
return true;
return false;
}
Token Lexer::readNumber () {
string num = "";
while (true) {
next = input.read();
if (!isNumber(next))
return new Token(TNUM, num);
num = num + string(next);
}
}
This works okay but there is a problem. We don't know what type of token
1 is being read just by reading the first character.
- Reading
i
, does that mean it is avariable
namedi
or akeyword
such asif
? - Reading
=
, does that mean it is anassignment operator
or==
operator?
This makes hand writing the lexer
tedious therefore, the more principle approach is to generate a tokenizer
automatically.