How tosh is parsed

05 Nov 2015

On the Scratch forums, I mentioned that parsing is hard, and I was asked the following about tosh:

MegaApuTurkUltra wrote:

Lol, are you using some sort of parser generator or just making it from scratch? pun intended

The answer is complicated, and seemed like a good excuse for a longer post here.


As I’ve said before, tosh is a difficult language to parse.

In particular, the syntax highlighter has to be based on the same parser as the language itself. Most languages use a simpler system for highlighting, based on regular expressions and/or state machines. But I want the different operations to reflect the colour of the block in Scratch, so I need a full parser.

The code editor pane uses the excellent CodeMirror library. To implement syntax highlighting, tosh presents a custom mode to CodeMirror. CodeMirror gives me a single line at a time; I split the line into tokens and return the colour of each one.

The syntax highlighting is nearly as colourful as the project itself.

So this informs the architecture of the parser: it must operate on a single line at a time, and it must use a tokenizer.
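To make that constraint concrete, here is a minimal sketch of a line-at-a-time tokenizer feeding a highlighter. It is an illustration under stated assumptions, not tosh's actual code: the token kinds, the `tokenizeLine` and `highlightLine` names, and the tiny `WORD_CATEGORY` table are all hypothetical. In particular, the table lookup is only a stand-in for the real parser, which is what decides the block colours in tosh.

```typescript
// Hypothetical token kinds; tosh's real tokenizer may use different ones.
type TokenKind = "number" | "string" | "word" | "symbol" | "error";

interface Token {
  kind: TokenKind;
  text: string;
  start: number; // column offset within the line
}

// Rules are tried in order; each regex is anchored to the start of the
// remaining input.
const RULES: Array<[TokenKind, RegExp]> = [
  ["number", /^\d+(\.\d+)?/],
  ["string", /^"(?:[^"\\]|\\.)*"/],
  ["word", /^[A-Za-z_][A-Za-z0-9_]*/],
  ["symbol", /^[-+*\/=<>()\[\]{}]/],
];

// Split one line into tokens. No state is carried between lines.
function tokenizeLine(line: string): Token[] {
  const tokens: Token[] = [];
  let pos = 0;
  while (pos < line.length) {
    if (/\s/.test(line[pos])) {
      pos++; // skip whitespace
      continue;
    }
    const rest = line.slice(pos);
    let matched = false;
    for (const [kind, re] of RULES) {
      const m = re.exec(rest);
      if (m) {
        tokens.push({ kind: kind, text: m[0], start: pos });
        pos += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // Unknown character: emit an error token so the editor keeps going.
      tokens.push({ kind: "error", text: line[pos], start: pos });
      pos++;
    }
  }
  return tokens;
}

// Assumed word-to-category table, standing in for the full parser.
const WORD_CATEGORY: { [word: string]: string } = {
  move: "motion",
  steps: "motion",
  say: "looks",
};

// Return one style name per token on the line.
function highlightLine(line: string): Array<{ text: string; style: string }> {
  return tokenizeLine(line).map(function (tok) {
    let style = "unknown";
    if (tok.kind === "number" || tok.kind === "string") {
      style = "literal";
    } else if (tok.kind === "symbol") {
      style = "operators";
    } else if (tok.kind === "error") {
      style = "error";
    } else if (Object.prototype.hasOwnProperty.call(WORD_CATEGORY, tok.text)) {
      style = WORD_CATEGORY[tok.text];
    }
    return { text: tok.text, style: style };
  });
}
```

Calling `highlightLine("move 10 steps")` yields three entries, one per token. In a CodeMirror 5 mode, logic like this would sit behind the mode's `token(stream, state)` method, which returns a style name for each token it consumes.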

Here’s what it looks like at the moment:

So that’s an overview of this part of tosh’s implementation. Writing an Earley parser and designing the tosh language have definitely been the most interesting parts of the project: all I have left is boring UI programming!

Autocomplete is where things get super-interesting… but that’s a story for next time :-)