Tiny Scanner

In this post I’ll outline the process I went through to write the scanner for the “Tiny” language using Objeck. With minor adjustments, the algorithm below can be used to build scanners for more complex languages. The code is general enough to be easily rewritten in Python, Java or C++.

Here’s the full code, as well as a more complex scanner for Objeck written in C++.

Step 1 – Read the file

reader := FileReader->New(@file);
leaving {
  reader->Close();
};
 
line_num := 0;
while(reader->IsEOF() <> true) {
  line := reader->ReadString();
  line_num += 1;
  # do stuff
}

There are two approaches to reading a file: read it a line at a time, or read the entire file into memory. For this scanner we read the file a line at a time. This approach also lets us remove newlines as we go using Objeck’s “FileReader->ReadString()” method.
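The same read loop is easy to sketch in Python (the article notes the code ports readily); `read_lines` is a placeholder name, not part of the Tiny scanner:

```python
def read_lines(path):
    """Read a source file a line at a time, tracking line numbers.

    Yields (line_number, line) pairs with trailing newlines stripped,
    mirroring the behavior of Objeck's FileReader->ReadString(). The
    'with' block closes the file automatically, like Objeck's 'leaving'.
    """
    with open(path) as reader:
        line_num = 0
        for line in reader:
            line_num += 1
            yield line_num, line.rstrip("\n")
```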

Step 2 – Ignore Whitespace

# skip whitespace
while((line->Get(i) = ' ' | line->Get(i) = '\t') & i < line->Size()) {
  i += 1;
};

One of the major functions of a scanner is to discard whitespace. Most languages encourage programmers to indent their code for readability. Indentation characters should be ignored so that the same tokens are produced regardless of whitespace.
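In Python the same skip loop can be written as a small helper (a sketch; `skip_whitespace` is a placeholder name):

```python
def skip_whitespace(line, i):
    """Advance index i past spaces and tabs, stopping at end of line."""
    while i < len(line) and line[i] in (' ', '\t'):
        i += 1
    return i
```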

Step 3 – Identify what needs to be parsed
At this point we have a good idea what type of token we’ll be dealing with. If it starts with a letter or underscore, it’s an identifier or keyword. If it starts with a digit, it’s a number; if it starts with an open quote, it’s a string. Otherwise, it’s a symbolic token such as equals or less-than.

if(line->Get(i)->IsChar() | line->Get(i) = '_') {
  # do stuff
} else if(line->Get(i)->IsDigit()) {
  # do stuff
} else if(line->Get(i) = '"') {
  # do stuff
} else {
  # symbol based token
}
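The same dispatch translates directly to Python; the category names returned here are illustrative placeholders, not token types from the Tiny scanner:

```python
def classify(line, i):
    """Decide which kind of token starts at position i in the line.

    Returns 'word', 'number', 'string', or 'symbol' (placeholder names).
    """
    ch = line[i]
    if ch.isalpha() or ch == '_':
        return 'word'      # identifier or keyword
    elif ch.isdigit():
        return 'number'
    elif ch == '"':
        return 'string'
    else:
        return 'symbol'    # e.g. '=', '<', ';'
```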

Step 4 – Parse the token
When parsing words we need to determine whether they are keywords or identifiers. The easiest way to do this is to load all of the reserved words into a dictionary (i.e. a map or a hash table), with each keyword mapped to its token type. When a word is found, look it up; if it’s in the dictionary, create a new token of the associated type, otherwise treat it as an identifier.

token := reserved->Find(string)->As(IntHolder);
if(token <> Nil) {
  type := token->Get()->As(Token->Type);
  @tokens->AddBack(Token->New(line_num, type, string));
}
else {
  @tokens->AddBack(Token->New(line_num, Token->Type->VAR, string));
};
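A Python version of the lookup is a one-liner with a dict. The keyword set below is hypothetical, chosen only for illustration; the real Tiny scanner loads its own table:

```python
# Hypothetical keyword table mapping reserved words to token types.
RESERVED = {'if': 'IF', 'then': 'THEN', 'end': 'END', 'while': 'WHILE'}

def word_token(line_num, word):
    """Look the word up in the keyword table; fall back to VAR (identifier)."""
    token_type = RESERVED.get(word)
    if token_type is not None:
        return (line_num, token_type, word)
    return (line_num, 'VAR', word)
```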

Simple character matching can be used for symbolic tokens. Using a “select” statement to match the first character is a quick way to parse these types of tokens. You should also correctly handle cases where the first character is shared by more than one token type, for example “<” and “<=”.

label ';': {
  @tokens->AddBack(Token->New(line_num, Token->Type->SEMI, "semi-colon"));
}
 
label '<': {
  if(i + 1 < line->Size() & line->Get(i + 1) = '=') {
    @tokens->AddBack(Token->New(line_num, Token->Type->LESS_EQL, "less-equal"));
    i += 1;
  }
  else {
    @tokens->AddBack(Token->New(line_num, Token->Type->LESS, "less"));
  };
}
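The lookahead logic can be sketched in Python as a function that returns both the token and the next index, so the caller skips past two-character symbols (a sketch; names mirror the article's token types):

```python
def symbol_token(line, i, line_num):
    """Match one- and two-character symbols, longest match first.

    Returns (token, next_index); token type names mirror the article's.
    """
    ch = line[i]
    if ch == ';':
        return (line_num, 'SEMI', 'semi-colon'), i + 1
    if ch == '<':
        # peek ahead so "<=" is not scanned as "<" followed by "="
        if i + 1 < len(line) and line[i + 1] == '=':
            return (line_num, 'LESS_EQL', 'less-equal'), i + 2
        return (line_num, 'LESS', 'less'), i + 1
    return (line_num, 'OTHER', ch), i + 1
```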

Step 5 – Storing the tokens
A scanner normally works hand-in-hand with a parser to build an AST (abstract syntax tree). One may either scan all of the tokens in the source code before building the parse tree, or scan for tokens incrementally while the parse tree is being built. For a small language like “Tiny” I chose to scan all of the tokens ahead of time before building the tree. For larger languages it’s more efficient to scan for tokens as you build your AST.
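The batch approach amounts to collecting a flat token list the parser can walk. A minimal Python sketch, where `Token` and the whitespace-split tokenizer are stand-ins for the per-character logic from the earlier steps:

```python
from dataclasses import dataclass

@dataclass
class Token:
    line_num: int
    type: str
    value: str

def scan_all(lines):
    """Batch approach: produce every token before parsing begins.

    Splitting on whitespace here is a stand-in for the real per-line
    scanning shown in the earlier steps.
    """
    tokens = []
    for line_num, line in enumerate(lines, start=1):
        for word in line.split():
            tokens.append(Token(line_num, 'WORD', word))
    return tokens
```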