Lecture No. 05

Dated: 14-03-2025

Lexical Analysis

The scanner converts the stream of characters into a stream of tokens [1]. This process is called lexical analysis.

Tokens

A token is a syntactic category in a sentence of a language. The pair <role, word> is called a token [1].

Example

Consider the English language sentence "He wrote the program".

  • <Subject, He>
  • <Verb, wrote>
  • <Object, the program>

Similarly, for a language like C, consider the statement:

if (b == 0) a = b
  • <Keyword, if>
  • <parenthesis, (>
  • <variable, b>
  • <bool operator, ==>
  • <number, 0>
  • <parenthesis, )>
  • <variable, a>
  • <assignment operator, =>
  • <variable, b>
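The <role, word> pairs above map directly onto a small data structure. The following is a minimal sketch, assuming a token is stored as a pair of strings; the names Token, formatToken and exampleTokens are hypothetical and chosen only for illustration.

```cpp
#include <string>
#include <utility>
#include <vector>

// A token as the pair <role, word>.
using Token = std::pair<std::string, std::string>;

// Formats a token the way the lists above write it, e.g. <Keyword, if>.
std::string formatToken(const Token& t) {
    return "<" + t.first + ", " + t.second + ">";
}

// The token stream for the statement: if (b == 0) a = b
std::vector<Token> exampleTokens() {
    return {
        {"Keyword", "if"},
        {"parenthesis", "("},
        {"variable", "b"},
        {"bool operator", "=="},
        {"number", "0"},
        {"parenthesis", ")"},
        {"variable", "a"},
        {"assignment operator", "="},
        {"variable", "b"},
    };
}
```

A real lexer would more likely use an enumeration for the role instead of a string; strings are used here only to mirror the notation of the lists.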

Ad Hoc Lexer

We can write a lexer in C++ that reads the input from left to right and looks a little ahead to determine where one token ends and the next one begins.

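#include <cctype>  // std::isalpha, std::isdigit
#include <string>
using std::string;

// Inputstream and Token are assumed to be defined elsewhere;
// Inputstream::read() is assumed to return the next character of the input.
class Lexer {
    Inputstream s; // character source
    char next;     // lookahead: the next character not yet consumed

public:
    Lexer(Inputstream);
    Token nextToken();

private:
    Token readID();
    bool idChar(char c);
    Token readNumber();
    Token readString();
};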

Lexer::Lexer(Inputstream _s) {
    s = _s;
    next = s.read();
}

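Token Lexer::nextToken() {
    // Test for a digit first: idChar() also accepts digits, so testing it
    // first would send every number to readID().
    if (std::isdigit(static_cast<unsigned char>(next)))
        return readNumber();
    if (idChar(next))
        return readID();
    if (next == '"')
        return readString();
    // ...
}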

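Token Lexer::readID() {
    string id = "";

    // next already holds the first character of the identifier, so start
    // from it instead of reading a fresh character and losing it.
    while (idChar(next)) {
        id += next;
        next = s.read(); // advance the lookahead
    }

    // next now holds the first character after the identifier.
    return Token(TID, id);
}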

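bool Lexer::idChar(char c) {
    // The <cctype> functions expect a value representable as unsigned char.
    unsigned char uc = static_cast<unsigned char>(c);

    if (std::isalpha(uc))
        return true;
    if (std::isdigit(uc))
        return true;
    if (c == '_')
        return true;

    return false;
}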

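Token Lexer::readNumber() {

    string num = "";

    // next already holds the first digit of the number.
    while (std::isdigit(static_cast<unsigned char>(next))) {
        num += next;
        next = s.read(); // advance the lookahead
    }

    // next now holds the first character after the number.
    return Token(TNUM, num);
}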

This works, but there is a problem: the first character alone does not tell us what type of token is being read.

  • On reading i, is it a variable named i or the start of a keyword such as if?
  • On reading =, is it the assignment operator = or the start of the == operator?
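Both ambiguities can be resolved by hand with lookahead, at the cost of extra code for every such case. The following is a minimal sketch, assuming std::istream stands in for the lecture's Inputstream and that a token is a pair of strings; readWord, readEquals and the keyword set are hypothetical names and values used only for illustration.

```cpp
#include <cctype>
#include <istream>
#include <set>
#include <sstream>
#include <string>
#include <utility>

using Token = std::pair<std::string, std::string>; // <role, word>

// Reads the longest run of identifier characters, and only then decides
// whether the word is a keyword or a variable.
Token readWord(std::istream& in) {
    std::string word;
    // peek() looks at the next character without consuming it; at end of
    // input it returns EOF, which std::isalnum rejects.
    for (int c = in.peek(); std::isalnum(c) || c == '_'; c = in.peek())
        word += static_cast<char>(in.get());

    static const std::set<std::string> keywords = {"if", "else", "while"};
    return {keywords.count(word) ? "Keyword" : "variable", word};
}

// Called when the next character is '='; one character of lookahead
// decides between "=" and "==".
Token readEquals(std::istream& in) {
    in.get(); // consume the first '='
    if (in.peek() == '=') {
        in.get(); // consume the second '='
        return {"bool operator", "=="};
    }
    return {"assignment operator", "="};
}
```

For example, calling readWord on a std::istringstream holding "if" yields a keyword token, while the input "i" yields a variable token. Every additional operator or keyword family needs similar special-case code.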

This makes writing the lexer by hand tedious; therefore, the more principled approach is to generate a tokenizer automatically.
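To give a feel for the idea, the sketch below describes each token type by a pattern in a table and lets one generic loop do the scanning. It is only an illustration under stated assumptions: it uses std::regex at run time as a stand-in, which is not how a tokenizer generator works internally, and the rule table, the role names and the function tokenize are hypothetical.

```cpp
#include <cctype>
#include <regex>
#include <string>
#include <utility>
#include <vector>

using Token = std::pair<std::string, std::string>; // <role, word>

std::vector<Token> tokenize(const std::string& src) {
    // Rule order matters: the first matching rule wins, so the keyword
    // precedes identifiers and "==" precedes "=".
    static const std::vector<std::pair<std::string, std::regex>> rules = {
        {"Keyword",             std::regex(R"(if\b)")},
        {"variable",            std::regex(R"([A-Za-z_][A-Za-z0-9_]*)")},
        {"number",              std::regex(R"([0-9]+)")},
        {"bool operator",       std::regex(R"(==)")},
        {"assignment operator", std::regex(R"(=)")},
        {"parenthesis",         std::regex(R"([()])")},
    };

    std::vector<Token> out;
    auto pos = src.cbegin();
    while (pos != src.cend()) {
        if (std::isspace(static_cast<unsigned char>(*pos))) { ++pos; continue; }

        bool matched = false;
        for (const auto& rule : rules) {
            std::smatch m;
            // match_continuous anchors the match at the current position.
            if (std::regex_search(pos, src.cend(), m, rule.second,
                                  std::regex_constants::match_continuous)) {
                out.emplace_back(rule.first, m.str());
                pos += m.length();
                matched = true;
                break;
            }
        }
        if (!matched) ++pos; // skip a character that no rule recognizes
    }
    return out;
}
```

Adding a new token type now means adding one row to the table instead of writing another hand-coded reader.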

References