everything is programming ♪♫

How tosh is parsed

Nov 05, 2015 • < 1 min read • Filed under tosh

I mentioned on the Scratch forums that parsing is hard, and I was asked the following about tosh:

MegaApuTurkUltra wrote:

Lol, are you using some sort of parser generator or just making it from scratch? pun intended

The answer is complicated, and it seemed like a good excuse for a longer post here.


As I've said before, tosh is a difficult language to parse.

In particular, the syntax highlighter has to be based on the same parser as the language itself. Most syntax highlighters use a simpler system, based on regular expressions and/or state machines. But I want each operation to take the colour of the corresponding block in Scratch, and working out which block a piece of text belongs to takes a full parser.

The code editor pane uses the excellent CodeMirror library. To implement syntax highlighting, tosh presents a custom mode to CodeMirror. CodeMirror gives me a single line at a time; I split the line into tokens and return the colour of each one.

The syntax highlighting is nearly as colourful as the project itself.

So this informs the architecture of the parser: it must operate on a single line at a time, and it must use a tokenizer.
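To make the line-at-a-time constraint concrete, here is a minimal sketch of the tokenizing step. It is written in Python purely for illustration and is not tosh's actual code: `tokenize_line`, `TOKEN_RE`, `WORD_STYLES`, and the style names are all hypothetical. The real highlighter takes its colours from the full parser rather than from a word lookup like this one.

```python
import re

# Hypothetical word-to-category table. tosh derives colours from a full
# parse, so this lookup only stands in for that step.
WORD_STYLES = {
    "move": "motion",
    "steps": "motion",
    "repeat": "control",
    "forever": "control",
}

# One named alternative per token kind, tried left to right at each position.
TOKEN_RE = re.compile(r"""
    (?P<space>\s+)
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<string>"(?:[^"\\]|\\.)*")
  | (?P<word>[A-Za-z]+)
  | (?P<symbol>.)
""", re.VERBOSE)


def tokenize_line(line):
    """Split one line into (text, style) pairs, skipping whitespace."""
    tokens = []
    for match in TOKEN_RE.finditer(line):
        kind, text = match.lastgroup, match.group()
        if kind == "space":
            continue
        if kind == "word":
            # Words not in the table get a neutral style.
            style = WORD_STYLES.get(text, "unknown")
        else:
            style = kind
        tokens.append((text, style))
    return tokens
```

Calling `tokenize_line('move 10 steps')` gives `[('move', 'motion'), ('10', 'number'), ('steps', 'motion')]`. Each call sees only its own line, which is the constraint the editor imposes.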

Here's what it looks like at the moment:

  • The language is defined using a context-free grammar, split into two parts:

    • The core grammar is defined by hand, and contains things like arithmetic.
    • It's then augmented with automatically-generated rules to define the rest of the Scratch blocks.
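The two-part grammar can be sketched roughly as follows. This is an illustration in Python built on assumptions, not tosh's real grammar: the rule format, the `%n`/`%s` placeholder style for block descriptions, and the names `CORE_RULES`, `BLOCK_SPECS`, `SLOT_SYMBOLS`, `rule_for_spec`, and `build_grammar` are all invented here. The point is only that a few hand-written rules can be extended mechanically, one rule per block.

```python
# Hand-written core rules: (left-hand side, right-hand side symbols).
CORE_RULES = [
    ("expr", ["expr", "+", "term"]),
    ("expr", ["term"]),
    ("term", ["NUMBER"]),
]

# Hypothetical block descriptions with "%n" (number) and "%s" (string) slots.
BLOCK_SPECS = [
    "move %n steps",
    "say %s",
    "turn right %n degrees",
]

# Assumed mapping from a placeholder to the grammar symbol that fills it.
SLOT_SYMBOLS = {"%n": "expr", "%s": "STRING"}


def rule_for_spec(spec):
    """Turn one block description into a rule for the 'block' nonterminal."""
    rhs = [SLOT_SYMBOLS.get(part, part) for part in spec.split()]
    return ("block", rhs)


def build_grammar(core_rules, specs):
    """Core rules first, then one generated rule per block description."""
    return list(core_rules) + [rule_for_spec(spec) for spec in specs]


GRAMMAR = build_grammar(CORE_RULES, BLOCK_SPECS)
```

Here `rule_for_spec("move %n steps")` yields `("block", ["move", "expr", "steps"])`, so adding a block to the list adds a rule without touching the core grammar.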