Lecture No. 09

Dated: 17-03-2025

`DFA`¹ Minimization

The generated DFA¹ can have large number of states and can be reduced using Hopcroft's algorithm which is to merge the equivalent states into groups.

Lexical Analyzers

Lexical Analyzers³ (scanners³) use the same mechanism but have character stream as input and multiple regular expressions⁴ for multiple tokens⁵ and returns a sequence of matching tokens.⁵
It always returns the longest one.

Lexical Analyzer Generators

The process of constructing a lexical analyzer can be automated. We only need to specify Regular expressions for tokens and rules for assigning priorities for multiple longest match cases, e.g, == and =, == is longer.

There are two famous ones.

Flex: For C and C++. It is the modern version of Lex, a tool in AT&T bell labs version of UNIX.
Jlex: For Java.

Using Flex

Flex takes an input file which consists of 3 sections.

sections: C or C++ and flex definitions
%%
token definitions and actions
%%
user code

The <!-- separates the sections.

Example

Contents of lex.l file

%{
#include "tokdefs.h"
#include <iostream>
%}

D [0-9]
L [a-zA-Z_]
id {L}({L}|{D})*

-->

"void" {return(TOK_VOID);}
"int" {return(TOK_INT);}
"if" {return(TOK_IF);}
"else" {return(TOK_ELSE);}
"while" {return(TOK_WHILE);}
"<=" {return(TOK_LE);}
">=" {return(TOK_GE);}
"==" {return(TOK_EQ);}
"!=" {return(TOK_NE);}
{D}+ {return(TOK_DECIMAL);}
{id} {return(TOK_ID);}

%%

int main() {
    int tc = yylex();

    while(tc != 0) {
        std::cout << tc << "," << yytext << std::endl;
        tc = yylex();
    }

    return 0;
}

int yywrap () {
    return 1;
}

Contents of tokdefs.h are

#define TOK_VOID 1
#define TOK_INT 2 
#define TOK_IF 3 
#define TOK_ELSE 4 
#define TOK_WHILE 5 
#define TOK_LE 6 
#define TOK_GE 7 
#define TOK_EQ 8 
#define TOK_NE 9 
#define TOK_DECIMAL 10 
#define TOK_ID 111

flex lex.l
g++ lex.yy.c -o scanner.exe
./scanner.exe

References

Read more about finite automaton. ↩↩
Read more about acceptor. ↩
Read more about scanners. ↩↩
Read more about regular expressions. ↩
Read more about tokens. ↩↩