Skip to content

Lecture No. 06

Dated: 14-03-2025

How to Describe Tokens

Regular languages1 are the most popular for specifying tokens2 because

  • They are based on simple and useful theory.
  • Are easy to understand.
  • Efficient implementations exist for generating lexical analyzers2 based on such languages.

Languages

Let \(\Sigma\) be a set3 of characters. \(\Sigma\) is called the alphabet.4 A language4 defined over \(\Sigma\) is a set3 of strings4 of characters drawn from \(\Sigma\).

Examples

Alphabet = English letters
Language = English sentences

Alphabet = ASCII
Language = C++, Java, C#

Each regular expression5 is a notation for a regular language.1 If \(A\) is the regular expression5 then \(L(A)\) is the regular language.1

Regular Expressions

Regular expression5 is defined as

  • \(a\) for ordinary character from \(\Sigma\).
  • \(\epsilon\) for empty string.4
  • \(R | S\) means either \(R\) or \(S\).
  • \(RS\) means \(R\) followed by \(S\).
  • \(R^*\) means concatenation of \(R\) zero or more times (\(R^* = \epsilon | R | RR | RRR | …\))

Then there are extensions to it as well

  • \(R?\) means \(\epsilon | R\).
  • \(R^+\) means \(RR^*\).
  • \((R)\) is for grouping.
  • \([abc]\) means \(a | b | c\).
  • \([a-z]\) means characters ranging from \(a\) to \(z\).
  • [^ab] means anything except for \(a\) and \(b\).

Finite Automaton

We need a mechanism to determine if the string4 \(w\) belongs to \(L(R)\) language4 where \(R\) is the regular expression.5 Such a mechanism is called an acceptor.
The acceptor is based on a finite automaton6 which consists of

  • An input alphabet4 \(\Sigma\).
  • A set3 of states.
  • A start (initial) state.
  • A set3 of transitions.
  • A set3 of accepting (final) states.

References


  1. Read more about regular languages

  2. Read more about tokens and lexical analyzers

  3. Read more about sets

  4. Read more about languages, alphabets, strings

  5. Read more about regular expressions

  6. Read more about finite automaton