tjvr

String syntax

29 May 2015

When I first wrote about tosh, I mentioned that design is difficult.

As it turns out, designing a programming language is hard. Even when you’re only designing the syntax, and not also the semantics. There are so many little details! And every one of them matters, because they all affect how nice the language is to use.

I’m going to talk about one such decision: strings.

Tosh needs a way to delimit text inputs (strings), since otherwise it can’t tell where a text input ends. Take the following example:

say join Hello, name

Is name supposed to be a text input or a variable reporter? We can’t tell without knowing what variables are defined.
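To make the ambiguity concrete, here is a minimal Python sketch. It is not tosh's parser: `classify_input` and its `defined_variables` argument are hypothetical names, used only to show that the same word gets a different meaning depending on which variables exist.

```python
def classify_input(word, defined_variables):
    """Guess what an undelimited word means (hypothetical helper)."""
    if word in defined_variables:
        # The word matches a known variable, so treat it as a reporter.
        return ("variable", word)
    # Otherwise the only remaining reading is plain text.
    return ("text", word)


# The same source word, read in two different projects:
print(classify_input("name", {"name", "score"}))
print(classify_input("name", {"score"}))
```

The result changes with the variable set, so a reader cannot interpret the line on its own.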

So for tosh I have chosen to delimit string literals. This makes the code easier to read (you don’t have to know what variables are defined in order to understand it) as well as easier to parse.

Nathan seemed to disagree with me on this point:

…you shouldn’t need to delimit anything, including string literals.

He’s often right, which makes me uneasy…

Anyway: what syntax to delimit strings should tosh use?

At first I used single- or double-quoted strings, in imitation of Python and many other programming languages.

Then I read A Short Quiz About Language Design, which made me think more carefully about the decision. James Hague points out that quotes are a bad choice of delimiter, since quote characters are often used in English text.

Of course, if we try to write a double quote inside a double-quoted string, Python will think it marks the end of the string! (I’m using Python as an example in what follows, but a lot of languages work the same way.)

"He asked "What is it?", and waited expectantly."

It wants to interpret ‘What is it?’ as Python code, and so the input will fail to parse.

We can use backslash escaping to avoid this. Putting a backslash in front of the quote makes Python interpret it as part of the string, instead of its end:

"He asked \"What is it?\", and waited expectantly."

But the backslashes look ugly.

We can mitigate this by using single-quoted strings:

'He asked "What is it?", and waited expectantly.'

But eventually we’ll want to use both kinds of quote character, so we’ll need backslashes again:

'She replied, "It\'s a boy!"'
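The Python examples above can be checked directly. This is a small sketch using only the standard library; `parses` is a helper name made up for this illustration.

```python
import ast

# Backslash-escaped and single-quoted spellings denote the same text.
escaped = "He asked \"What is it?\", and waited expectantly."
single = 'He asked "What is it?", and waited expectantly.'
assert escaped == single

# With both kinds of quote in the text, one of them must be escaped.
both = 'She replied, "It\'s a boy!"'
assert both == "She replied, \"It's a boy!\""

# The unescaped version, held here as source text rather than as a literal.
broken_source = '"He asked "What is it?", and waited expectantly."'


def parses(source):
    """Return True if `source` is a valid Python expression."""
    try:
        ast.parse(source, mode="eval")
    except SyntaxError:
        return False
    return True


print(parses(broken_source))
print(parses(repr(both)))
```

The first check fails to parse because the string ends at the second double quote; the second succeeds because `repr` escapes the embedded quote.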

We’re always going to need escaping sometimes, whatever delimiter character we use. Otherwise you can never write that character in a string! And since tosh aims to be compatible with all Scratch projects, it needs to support writing any character, including backslashes.

(Fortunately Scratch disallows newlines in text inputs, so the only things we need to escape are the delimiter and the escape character, e.g. quote " and backslash \.)
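As a rough illustration of how little machinery this needs, here is a Python sketch of reading one delimited string. It is not tosh's actual implementation: `read_string` and its parameters are hypothetical. As a simplifying assumption, it treats whatever character follows the escape character as literal, rather than rejecting escapes other than the delimiter and the backslash.

```python
def read_string(source, start=0, delimiter='"', escape="\\"):
    """Read one delimited string beginning at source[start].

    Returns the decoded text and the index just past the closing delimiter.
    """
    if source[start] != delimiter:
        raise ValueError("expected opening delimiter")
    chars = []
    i = start + 1
    while i < len(source):
        ch = source[i]
        if ch == escape:
            # The escape character makes the next character literal.
            if i + 1 >= len(source):
                raise ValueError("escape character at end of input")
            chars.append(source[i + 1])
            i += 2
        elif ch == delimiter:
            # An unescaped delimiter closes the string.
            return "".join(chars), i + 1
        else:
            chars.append(ch)
            i += 1
    raise ValueError("unterminated string")


src = r'"a \"b\" \\ c" rest'
text, end = read_string(src)
print(text)       # the decoded text input
print(src[end:])  # whatever follows the string
```

Only two characters are special inside the string, so the loop needs just two branches beyond the default.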

Still: if I chose a different delimiter, we wouldn’t need to use escaping so often.

Options

Hague suggests the pipe character |, since it’s rarely used in English text.

scratchblocks uses square brackets [ ] to delimit text inputs, so I could also use that.

Backslash-escaping has the disadvantage that it makes the backslash character special, in addition to the delimiter. To write a backslash, you must backslash-escape it: "\\". Could I use a different escaping method instead, such as doubling-up the delimiter?
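For comparison, doubling-up the delimiter could look like the following Python sketch. The function names are hypothetical, and the pipe is only the default delimiter here for illustration. The point is that the delimiter is the only special character, so a backslash needs no escaping.

```python
def quote_doubled(text, delimiter="|"):
    """Wrap text in delimiters, doubling any delimiter inside it."""
    return delimiter + text.replace(delimiter, delimiter * 2) + delimiter


def unquote_doubled(literal, delimiter="|"):
    """Invert quote_doubled, rejecting malformed literals."""
    if len(literal) < 2 or literal[0] != delimiter or literal[-1] != delimiter:
        raise ValueError("literal must start and end with the delimiter")
    body = literal[1:-1]
    chars = []
    i = 0
    while i < len(body):
        if body[i] == delimiter:
            # Inside the body, a delimiter is only valid as a doubled pair.
            if body[i + 1:i + 2] != delimiter:
                raise ValueError("unescaped delimiter inside string")
            chars.append(delimiter)
            i += 2
        else:
            chars.append(body[i])
            i += 1
    return "".join(chars)


print(quote_doubled("a|b"))
print(quote_doubled("C:\\temp"))
print(quote_doubled('say "hi"', delimiter='"'))
```

The trade-off is that an empty string and an escaped delimiter both involve two delimiters in a row, which can be harder to read at a glance.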

Here are all the options I came up with:

It was hard to choose. Summarising the options:

I studied a variety of projects by advanced Scratch users, to see if either of the characters | or ] was less common than the other (and therefore more suitable). But in my small sample neither option was significantly better.

None of the options make escaping easy to discover. But I could add helpful error messages to mitigate this.

I have design principles for tosh which can help with making choices like this. But like most design problems, it comes down to a tradeoff between different factors. Should I favour:

In the end, I’m going to stick with my original choice: quoted strings with backslash escaping.

New programmers are going to have to learn how escaping works eventually. Using something unusual instead of backslash-escaping will upset experienced programmers, without helping anyone.

I’m using quoted strings for similar reasons. Helping beginners learn to use Python-like strings is more important than similarity to scratchblocks. Superficial similarity would actually be bad, since they behave subtly differently! I’m already trying hard to communicate that tosh is not scratchblocks.

I think there’s a principle of interface design: similar things should look similar; things that behave differently should look different. (This is related to the principle of least astonishment; maybe it’s a corollary?) So making tosh look different to scratchblocks is beneficial, since they do behave differently.

Aftermath

So after all that, I didn’t change my mind. But thinking through all the design choices was still useful.

And I came up with a couple of subtle improvements:

I hope this was an interesting insight into some of the thinking behind tosh’s design. If you disagree with any of the above, let me know!