Lexing Article Index for
Lexing
Website Links For
Lexical
 

Information About

Lexing





A lexical analyzer, or '''lexer''' for short, can be thought of having two stages (these are often integrated, for efficiency reasons, so they operate in parallel). The first stage is called the ''scanner'' and is usually based on a Finite State Machine . This first stage has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles (individual instances of these character sequences are known as a Lexeme s). For instance, an integer token may contain any sequence of Numerical Digit characters. In many cases the first non-whitespace character can be used to deduce the kind of token that follows, the input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token (this is known as the Maximal Munch rule). In some languages the lexeme creation rules are more complicated and may involve Backtracking over previously read characters.

A Lexeme , however, is only a string of characters known to be of a certain type. In order to construct a token, the lexical analyzer needs a second stage. This stage, the ''evaluator,'' goes over the characters of the lexeme to produce a ''value.'' The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser. (Some tokens such as parentheses do not really have values, and so the evaluator function for these can return nothing. The evaluators for integers, identifiers, and strings can be considerably more complex. Sometimes evaluators can suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments.)

For example, in the source code of a computer program the string

net_worth_future = (assets - liabilities);

might be converted (with Whitespace suppressed) into the lexical token stream:

NAME "net_worth_future"
EQUALS
OPEN_PARENTHESIS
NAME "assets"
MINUS
NAME "liabilities"
CLOSE_PARENTHESIS
SEMICOLON

Though it is possible and sometimes necessary to write a lexer by hand, lexers are often generated by automated tools. These tools generally accept Regular Expression s that describe the tokens allowed in the input stream. Each regular expression is associated with a production in the lexical grammar of the programming language that evaluates the lexemes matching the regular expression. These tools may generate source code that can be compiled and executed or construct a state table for a Finite State Machine (which is plugged into template code for compilation and execution).

  • . This means "any character a-z, A-Z or _, then 0 or more of a-z, A-Z, _ or 0-9".


Regular expressions and the finite state machines they generate are not powerful enough to handle recursive patterns, such as "n opening parentheses, followed by a statement, followed by n closing parentheses." They are not capable of keeping count, and verifying that n is the same on both sides -- unless you have a finite set of permissible values for n. It takes a full-fledged parser to recognize such patterns in their full generality. A parser can push parentheses on a stack and then try to pop them off and see if the stack is empty at the end.

The Lex Programming Tool and its compiler is designed to generate code for fast lexical analysers based on a formal description of the lexical syntax. It is not generally considered sufficient for applications with a complicated set of lexical rules and severe performance requirements; for instance, the GNU Compiler Collection uses hand-written lexers.


EXAMPLE LEXICAL ANALYZER


This is an example of a scanner (written in the C programming
language) for the instructional programming language PL/0 .

The symbols recognized are:

  • ', '/', '=', '(', ')', ',', ';', '.', ':=', '<', '<=',

  • '<>', '>', '>='


numbers: 0-9 {0-9}

identifiers: a-zA-Z {a-zA-Z0-9}

keywords:

"begin", "call", "const", "do", "end", "if", "odd", "procedure",
"then", "var", "while"

External variables used:

  • FILE ---source -- the source file

  • int cur_line, cur_col, err_line, err_col -- for error reporting

  • int num -- last number read stored here, for the parser

  • char id {Link without Title} -- last identifier read stored here, for the parser

  • Hashtab ---keywords -- list of keywords


External routines called:

  • error(const char msg {Link without Title} ) -- report an error

  • Hashtab ---create_htab(int estimate) -- create a lookup table

  • int enter_htab(Hashtab ---ht, char name {Link without Title} , void ---data) -- add an entry to a lookup table

  • Entry ---find_htab(Hashtab ---ht, char ---s) -- find an entry in a lookup table

  • void ---get_htab_data(Entry ---entry) -- returns data from a lookup table

  • FILE ---fopen(char fn char mode[ ) -- opens a file for reading

  • fgetc(FILE ---stream) -- read the next character from a stream

  • ungetc(int ch, FILE ---stream) -- put-back a character onto a stream

  • isdigit(int ch), isalpha(int ch), isalnum(int ch) -- character classification


External types:

  • Symbol -- an enumerated type of all the symbols in the PL/0 language.

  • Hashtab -- represents a lookup table

  • Entry -- represents an entry in the lookup table


Scanning is started by calling ''init_scan'', passing the name of the
source file. If the source file is successfully opened, the parser
calls ''getsym'' repeatedly to return successive symbols from the
source file.

The heart of the scanner, ''getsym'', should be straightforward.
First, whitespace is skipped. Then the retrieved character is
classified. If the character represents a multiple-character symbol,
additional processing must be done. Numbers are converted to internal
form, and identifiers are checked to see if they represent a keyword.


int read_ch(void) {
int ch = fgetc(source);
cur_col++;
if (ch == '
') {
cur_line++;
cur_col = 0;
}
return ch;
}

void put_back(int ch) {
ungetc(ch, source);
cur_col--;
if (ch == '
') cur_line--;
}

Symbol getsym(void) {
int ch;

while ((ch = read_ch()) != EOF && ch <= ' ')
;
err_line = cur_line;
err_col = cur_col;
switch (ch) {
case EOF: return eof;
case '+': return plus;
case '-': return minus;
  • ': return times;

  • case '/': return slash;

case '=': return eql;
case '(': return lparen;
case ')': return rparen;
case ',': return comma;
case ';': return semicolon;
case '.': return period;
case ':':
ch = read_ch();
return (ch == '=') ? becomes : nul;
case '<':
ch = read_ch();
if (ch == '>') return neq;
if (ch == '=') return leq;
put_back(ch);
return lss;
case '>':
ch = read_ch();
if (ch == '=') return geq;
put_back(ch);
return gtr;
default:
if (isdigit(ch)) {
num = 0;
  • no checking for overflow! ---/

  • num + ch - '0';

  • } while ((ch = read_ch()) != EOF && isdigit(ch));

put_back(ch);
return number;
}
if (isalpha(ch)) {
  • entry;

  • id_len = 0;

do {
if (id_len < MAX_ID) {
id {Link without Title} = (char)ch;
id_len++;
}
} while ((ch = read_ch()) != EOF && isalnum(ch));