×


WebAssembly text format: LESv3(Read 154 times)
WebAssembly text format: LESv3 on: October 16, 2017, 07:50:23 pm
Background: WebAssembly is a new technology being standardized for running native applications (e.g. C/C++/Rust) on the web. It used to be an expression-based language that was typically viewed as an abstract syntax tree (AST) — basically a high-level assembly language, moreso than C ever was. Recently, however, the design shifted into a postorder “stack machine” which supports all the code that could have been expressed as an AST, plus some “oddball” code that is not expressible as an AST (details).

The text format of WebAssembly is not yet standardized; this article is the second followup to my proposal to use LES, an open-ended multi-purpose code format. The current version of LES is simpler and JSON-compatible, while some ideas here add complexity and are tailored more specifically for WebAssembly.

How complex should LES be? How valuable is backward compatibility with JSON? How acceptable is space-sensitivity? This article is long, but at the end there will be a survey so you can have your say in the WebAssembly text format.

Now that Wasm is moving toward a stack machine, I think that LES is now more relevant than ever, because there seems to be a need for not one Wasm language but two: one for Wasm itself, and another for the pseudo-wasm AST used by producers (i.e. binaryen). Both languages would support expressions, but each would interpret them differently. In the “official” Wasm, $a() + $b() could be syntactic sugar for the sequence $a(); $b(); i32.add if both functions return i32, whereas in compiler tools like binaryen, the same text would represent the syntax tree i32.add($a(), $b()).

While only one Wasm variation is expected so far, it’s not hard to imagine others:
-People will need to prototype new Wasm features.
-I’m interested in making some kind of higher-level language using Wasm’s AST variant as a semantic foundation (this, in turn, will be used for converting code between different programming languages).

A general-purpose syntax like LESv3 allows people to do all this without touching the parser or the printer (serializer), let alone writing new ones.

This post summarizes my current ideas for LESv3 syntax, especially as they relate to WebAssembly, while highlighting a few questions for the community.

What is LES?
If LISP had been invented in the 90s as a native member of the C family, LES would be its parser. Like the s-expression, LES is parsed into a simple data structure called a Loyc tree, but unlike s-expressions, the data structure is designed to hold code. Instead of using a “list” as the recursive element, Loyc trees use a “call”: f(x, y) instead of (f x y). f is called the “target” of the call and although most targets are identifiers, a target can be any node, including another call (e.g. f(x)(y)). Also, every node has an (optional) list of attributes attached (which can include “trivia” such as comments, and in Wasm might be used for debug info), and should also hold a source code range for use in error messages (e.g. ~/code/my_code.les:1012:12:1012:19).

In short, LES is not a programming language, it’s a data format that represents code in a natural way.

Like LESv2, LESv3 will support general C-like expressions with function calls, infix, prefix, and suffix (++ --) operators and indexing (possible syntax for stores: i32[$x] = $y). It will have JS-like [lists], maybe {dictionaries}, and tuples. Operators are represented by calls to identifiers that start with a single quote, e.g. 2 + 3 is a call with a target called '+. In the current notation, 2 + 3 could also be written as `'+`(2, 3).)

Relationship to WebAssembly
I’ve been optimizing LESv3 to make it a good basis for the WebAssembly text format and I am publishing this in the hope of getting your feedback, opinions and preferences. I’d much rather have your opinion now than after I’ve written separate parsers for multiple languages!

Keep in mind that LES is also designed for general-purpose non-Wasm uses. That’s the main reason for syntax differences between LES and the older proposal by Dan Gohman (and Michael Bebenita?), now developed in this repo. For example, in that proposal, plain words like foo are reserved for future use as keywords, whereas $foo is an “identifier”. In LESv3 (and in this document!), plain words like foo are considered identifiers, while $foo is just an identifier with the prefix operator $ in front.

Wasm stack machine in LESv3, by example

Let’s consider the C example
Code: [Select]
int sumIntegers(int* input, int length) {
  int sum = 0;
  for (int i = 0; i < length; ++i) {
    sum += input[i];
  }
  return sum;
}

The optimized s-expression I got from Wasm Explorer is, after adjusting for readability,

Code: [Select]
(module
  (memory 1)
  (export "memory" memory)
  (export "_sumIntegers" $_sumIntegers)
  (func $_sumIntegers (param $input i32) (param $length i32) (result i32)
    (local $sum i32)
    (set_local $sum (i32.const 0))
    (block $stop
      (br_if $stop (i32.lt_s (get_local $length) (i32.const 1)))
      (set_local $sum (i32.const 0))
      (loop $unused $loop
        (set_local $sum
          (i32.add (i32.load (get_local $input)) (get_local $sum)))
        (set_local $input (i32.add (get_local $0) (i32.const 4)))
        (br_if $loop (set_local $length
          (i32.add (get_local $length) (i32.const -1))))
      )
    )
    (return (get_local $sum))
  )
)

Manually updating this to stack-machine form, I get

Code: [Select]
  (memory 1)
  ...
  (func $_sumIntegers (param $input i32) (param $length i32) (result i32)
    (local $sum i32)
    (i32.const 0) (set_local $sum)
    (block $stop
      (get_local $length) (i32.const 1) i32.lt_s (br_if $stop)
      (i32.const 0) (set_local $sum)
      (loop $loop
        (get_local $input) i32.load (get_local $sum) i32.add (set_local $sum)
        (get_local $input) (i32.const 4) i32.add (set_local $input)
        (get_local $length) (i32.const -1) i32.add (tee_local $length) (br_if $loop)
      )
    )
    (return (get_local $sum))
  )

In LESv3, this stack-machine code could be expressed direcly as follows (omitting get_local and i32.const, which are redundant):

Code: [Select]
.memory 1;
  ...
  .fn $_sumIntegers($input: i32, $length: i32): i32 {
    $sum: i32;
    0; set_local $sum;
    {
      $length; 1; i32'lt_s; br_if stop;
      0; set_local $sum;
      loop (loop) {
        $input; i32'load; $sum; i32'add; set_local $sum;
        $input; 4; i32'add; set_local $input;
        $length; -1; i32'add; tee_local $length; br_if loop;
      }
      stop:
    }
    $sum; // return value
  }

More typically it would use expression notation, like this:

Code: [Select]
.memory 1;
  ...
  .fn $_sumIntegers($input: i32, $length: i32): i32 {
    $sum: i32;
    $sum = 0;
    {
      br stop 'if $input '<s 1;
      $sum = 0;
      loop (loop) {
        // I picked := for set_local (which discards the result)
        // and = for tee_local (which puts the result on the stack);
        // feel free to vote your own preference.
        $sum := i32[$input] + $sum;
        $input := $input + 4;
        br loop 'if $length = $length + -1;
      }
      stop:
    }
    $sum; // return value
  }

LESv3 syntax elements
Now that you know what it looks like, let’s discuss the details.

Semicolons
Should semicolons should be required to terminate statements? If semicolons are required then a LESv3 parser could read JSON files; if statements are terminated by newlines then JSON isn’t supported because
Code: [Select]
{ "foo"
  : "bar" }
will be parsed like { "foo"; (:"bar"); } which is quite different. Maybe a postprocessing step could restore the intended meaning, but instead you should probably just use a dedicated JSON parser in the first place.

For various technical reasons, it seems easier if newline is a terminator, so I am inclined to drop JSON support. If newline is a terminator, semicolon can still be used as a separator or terminator, which is useful when writing postorder notation or declaring multiple variables.

Update: They’re trying to standardize Wasm quickly, so the browser people want to represent Wasm with a flat instruction list. I’d say a flat list looks better without semicolons, e.g.

Code: [Select]
$a
$b
i32'add

Keywords
LESv2 has no keywords. Although Wasm doesn’t need this, in LESv3 I decided to introduce exactly three keywords, true, false and null, because I found that writing a special notation like @true was cumbersome for end-users, while relying on a postprocessing step to translate true into a true literal made it difficult or impossible to make an identifier that happened to be named true. In LESv3, true is a literal and `true` is an identifier named true.

Fancy identifiers
An identifier is a name (e.g. foo). Normal identifiers have letters, digits, underscores (_) and/or apostrophes ('). The first character must be a letter or _.

In LES it has always been possible to use any arbitrary string as an identifier. But WebAssembly has an even more stringent requirement: allowing any arbitrary byte string as an identifier. These byte strings will be interpreted as UTF-8, but some of them are not character strings.

Dan Gohman’s prototype used backslashes to escape individual characters in an identifier. That prototype doesn’t support standard escapes like $line1\nline2 for an identifier with a newline in the middle, whereas I thought identifiers should be escaped similarly to normal strings (except that you could write, for instance, fun\ fact\! to get an identifier with a space and exclamation mark in it).

In LESv3, my plan had a problem. If true is a keyword then we need a way to write an identifier called true. Using \true to represent the identifier called true is a no-go because \t conventionally represents a tab character, so this particular identifier wouldn’t have its obvious meaning of “tab character followed by rue”.

Instead, I decided that rather than escaping each individual character, identifiers can be enclosed in backquotes and parsed exactly as a string, e.g. `I'm a whole sentence!` is an identifier. Compared to the alternative, this rule gives longer identifiers in some cases and shorter ones in others. It’s perfect for representing C++ mangled names like `?Fmyclass_v@@YAXVmyclass@@@Z`, but less efficient when there’s single strange byte like `\x1B`.

So in LESv3, all strings will be interpreted as UTF-8, and \x lets you write an arbitrary byte. This means you can write invalid UTF8 like \xFF, but also valid UTF-8 such as \xE2\x82\xAC which means the same thing as \u20AC. One subtle problem with this plan is that some languages (JavaScript, Java, C#) store strings as UTF-16, and I do not wish to define identifiers as byte strings rather than strings. I solved this problem with a scheme that encodes invalid UTF-8 as invalid UTF-16 in a way that is guaranteed to round-trip properly.

Keyword expressions
LES has no keywords for creating functions, classes or other things. In lieu of keywords, LESv2 used two different kinds of left parenthesis (one with a space in front, the other without), which help distinguish “superexpressions” (such as a function declaration) from normal expressions.
Code: [Select]
// LESv2
if x { f(); };   // Superexpression equivalent to `if(x, {f();});`
if (x) { f(); }; // Superexpression equivalent to `if((x), {f();});`
if(x, y);        // Call a function named `if`
if (x, y);       // Syntax error with suggestion to remove the space
fn Foo(x: i32);  // Superexpression equivalent to `fn(Foo(x: i32));`
fn Foo (x: i32); // Syntax error with suggestion to remove the space

Although no one from the Wasm group objected to this, no one supported it either. I decided to try a different design for v3 that avoids any possible confusion. At first I planned to use # to denote “keywords”:

Code: [Select]
#fn Foo(x: int32) {...}
But # is a relatively noisy character to look at, and you need the shift key to make one. My new plan is to use a dot, which I think is a little easier on the eyes and hands:

Code: [Select]
.fn Foo(x: int32) {...}
This makes some sense, as . is used for “directives” in some assembly languages. Following the dot, a “keyword statement”, as this is called, has a single expression followed by an optional braced block with optional “continuator clauses”. Here are some examples of keyword statements, a.k.a. “dot expressions”:

Code: [Select]
// Dot expression with expression
.return $x + 1;
// Dot expression with expression and braced block
.struct Point { x: f64; y: f64; };
// Dot expression with expression, braced block and continuator
.if x > 0 { f(); } else { g(); };

A “continuator” is an identifier from a predefined set that includes else, catch and finally, that allows an extra clause to be added. The exact set of words allowed as continuators is not finalized, and the syntax of a continuator clause is not finalized either.

The dot in the keyword-expression becomes part of the name of the identifier that is called, e.g. .return $x + 1 really means `.return`($x + 1).

Potentially, a keyword statement could take comma-separated arguments:

Code: [Select]
.memory initial=1, maximum=10, exported=true;
However, LESv3 is a flexible language, so a keyword-expression is allowed anywhere that a normal expression is allowed, including as a function argument, which makes this ambiguous:

Code: [Select]
foo(bar, .baz a, b, c);
Does this function take four arguments, or two? The same issue arises if we try to keep JSON compatibility, since you could write something like

Code: [Select]
{"key":"value", .baz a, "b":"c"}