A Lexical Analyzer Generator written in C -- currently works on Linux and OS X.
sclex follows the same specifications that lex and GNU flex follow, except for what hasn't been implemented yet.
I wrote sclex as my first project to get back into programming after some time AFK.
Currently, I am adding refactoring, adding unit tests via Unity, and CMock, and adding to the already existing documentation.
-
For educational purposes I am not using the C Standard Library much but I am implementing my own data structures, alogorithms, and functions
-
I intend on trying to recompile sclex with my own C compiler when that is finished.
-
Add additional Unit Tests and Mocks.
-
Keep refactoring and refining code - going more modular and object-oriented.
-
Remove existing use of streams and buffers from C Standard Library and use OS calls instead.
EXCUSE THE MESS!
- Recursive Descent Predictive Parser (to parse the Regular Expressions(RE's) in the specificiation file
- Internal Syntax-Tree Construction
- Tabularized Determinitistic Finite Automata(DFA) construction
- Implements Dragon Book Algorithms firstpos,lastpos,followpos,nullable, etc.
- Double-Sided Buffer implementation
- regular expressions used to define token strings to recognize
- regular expression short-hand definitions
- character class set and ranges
- a subset of escape characters
- lex/flex specification file -- ending in .l
- and more...
To COMPILE: (in the source directory) Run make:
computer:src/sclex nicebub$ make
To Run the generator on the specification file and generate the source file "sclex.yy.c" (lex.l is a properly written specification file)
computer:src/sclex nicebub$ ./build/sclex example/lex.l
To Run the demo lexer driver Use the included "ex.r" file in the examples folder, with the "lex_driver" demo binary.
The "ex.r" example follows the language definition described in the example "lex.l"
First make the lex_driver binary:
computer:src/sclex nicebub$ make lex_driver
Then run the lex_driver binary on the example file:_
computer:src/sclex nicebub$ ./build/lex_driver examples/ex.r
First initialize the git submodules once, within the project directory:
computer:src/sclex nicebub$ git submodule init
Next, run the genrunner.sh script from the tools directory to create the test runners. Start this script from the root directory of the source code project:
computer:src/sclex nicebub$ ./tools/genrunners.sh
Then run make test:
computer:src/sclex nicebub$ make test
In the right column are examples or symbols that represent what can be written as a regular expression. The regular expressions are used to to tell the lexer how to match patterns of the input.
Usage | Definition |
---|---|
a | a stands for any alpha-numeric or printable character, will be matched |
ab | ab two regular expressions concatenated with each other, a followed by b will be matched |
(E) | (E) is a regular expression, ie, a string surrounded by parenthesis will be matched |
E|C | Both E and C are regular expressions and either one can be matched |
E* | stands for 0 or more E regular expressions concatenated with each other can match the input |
E? | stands for 0 or 1 E regular expression will match the input |
E+ | stands for 1 or more E regular expression concatenated will match the input stream |
[abc] | stands for a character set that matches any of the characters in the brackets to one in the input |
[a-z] | stands for a character set that can match the input to any character in the range of lowercase letters a through z |
\x | is an escape character, x can be any of: n,t,0,\,|,.,[,],{,},(,),*,?,and % |
{name} | is a short-hand definition name surrounded by curly brackets, expanded and matched to the input |
The lexer supports defining names to represent long or heavily-used regular expressions. These definitions are written in the top section of the lexer spec file of the form:
definition name regular expression
definitions can include previously defined short-hand names in the file above them, ie.
{ title } { movie } | { tv }*
You can find more examples in the included file lex.l
This file is read by the generator and used to create the C source file that will later become your lexer.
The layout of the file is as follows
%{
/* included C source code and possibly token definitions, ie. */
#include <stdio.h>
#define TOKEN1 5
#define STMT 6
#define PLUS 7
....
%}
/* short-hand definitions */
digit [0-9]
digits {digit}+
...
%%
/* regular expressions with C source code to run when this
* regular expression is matched to the input */
a|b*(cat[cat]){digits} _at least 1 required space token_ {
return STMT; /* returns token STMT defined above */
}
{expr} {
printf("found a expression!\n");
return EXPR;
}
%%
/* Auxillary Code to include into the generated lexer */
int my_function(float* in_float, int num_args){
...
...
return blah;
}
The generator will output a C source file called "sclex.yy.c". Within this file is a function called sclex(). This function can be called from another program or main function to start tokenizing input from an already opened file represented with a FILE pointer.
The prototype for the function is:
int sclex(FILE * opened_file);
Calling this function will run the lexer. When a token is found, the existing C code will be run. Ideally it will eventually return an INTEGER which represents a token defined somewhere else in the program.
Subsequent calls to sclex(filename) will run the lexer again on any left over file input in the buffer. (See lex_test.c for a demo use case of the lexer)
A few GOALS for this projects progress:
-
I would love community help with:
- any coding
- suggestions
- code review
- bug discovery
- feature requests
- feature implementation
- etc...
-
I intend on creating a few other tools to complement sclex:
- a suite, or toolchain, for a complete compiler.
- a YACC/BISON minimal type of clone to create the next stage to my own compiler - available here.
-
Sclex Features
- To make this program a threaded app where the lexer will buffer at most a certain amount of tokens
- clean up character sets so that can have multiple shorthand classes in the set.
- clean up further character class so isn't dependent on '-' being second and '-' can be last symbol in class.
- add negative character classes ie the ^ symbol, which matches anything not mentioned in the class minus newline '\n'
- add . as a regular expression matching any character of the alphabet
- add the ability to have comments in the entire .l file or portions of the file.
- eventually add the lookahead operator '/'.
- use better set data structure and algorithms.
- would like to go to bitmaps, for ASCII for char sets
- for integer sets maybe start at a minimum of 10 items and if needed,
- dynamically make the set larger, but still bitmapped?
- take out all c library code, if possible, anything that uses string.h or stdlib for minor things like NULL
- add buffer class to actual output file so isn't an included file with header
- change buffer to use read instead of fread perhaps
- add the ability for stack use for a stack of dfas
- Thus stack swaps from dfa to dfa depending on context. (like for string or comment, normal, etc.)
- need to add literal and string support so characters can be written normally between quotations.
- add the ability to swap buffers
- this may be useful when converting regex definition short-hands and we don't want to manipulate the original buffer but still use, the current code for translations which already uses the existing buffer code.
- add operators like $ and ^ representing end and beginning of match
Feel free to contact me with questions, comments, suggestions, etc. I'm open to feedback.