What follows after lexical analysis?
- by madflame991
I'm working on a toy compiler (for some simple language like PL/0) and I have my lexer up and running. At this point I should start working on building the parse tree, but before I start I was wondering: How much information can one gather from just the string of tokens? Here's what I gathered so far:
One can already do syntax highlighting having only the list of tokens. Numbers and operators get coloured accordingly and keywords also.
Autoformatting (indenting) should also be possible. How? Specify for each token type how many white spaces or new line characters should follow it. Also when you print tokens modify an alignment variable (when the code printer reads "{" increment the alignment variable by 1, and decrement by 1 for "}". Whenever it starts printing on a new line the code printer will align according to this alignment variable)
In languages without nested subroutines one can get a complete list of subroutines and their signature. How? Just read what follows after the "procedure" or "function" keyword until you hit the first ")" (this should work fine in a Pascal language with no nested subroutines)
In languages like Pascal you can even determine local variables and their types, as they are declared in a special place (ok, you can't handle initialization as well, but you can parse sequences like: "var a, b, c: integer")
Detection of recursive functions may also be possible, or even a graph representation of which subroutine calls who. If one can identify the body of a function then one can also search if there are any mentions of other function's names.
Gathering statistics about the code, like number of lines, instructions, subroutines
EDIT: I clarified why I think some processes are possible. As I read comments and responses I realise that the answer depends very much on the language that I'm parsing.