Sunday, May 29, 2011

Fully Typed LISP

I have been thinking about this concept for at least couple years now, pretty much every since I stumbled upon LISP in the first place.

Let's first outlay a couple assumptions I have built up over the years.

Compression potential is essentially abstraction potential, or the level of abstraction you can comfortably maintain in a programming language. Assembly has no compression potential aside from functional decomposition while a programming language with a programmably compiler has infinite compression potential. Lisp and Factor, (moreso factor than Lisp for whatever reason) have almost infinite compression potential. C++, using templates and X-macros, has actually decent compression potential. Functional languages, with closures, discriminated unions, and pattern matching has pretty good compression potential for some problems. For others, simple imperative languages clearly dominate.

Optimization potential is the potential for the exact same piece of code to run faster over time due to advances in compiler (or runtime) design. Lisp again had seemingly infinite optimization potential but we later figured out that much of this potential was impossible to get to (no compiler is sufficiently smart) although you will notice that SBCL is often 3rd in the language shootout game meaning if you use it in certain ways, it can be quite fast. A completely abstract programming language like a true functional language such as Haskell or perhaps prolog should also have very high optimization potential because you are programming to a completely abstract environment.

Correctness potential is the potential for the compiler to check that the program is correct. This requires something of a type system and all compilers, dynamic or otherwise have it. Again fully typed functional languages such as Haskell definitely have this exploited while in C++ the exploitation is up to the discretion of the programmer and in LISP you have to hope that your tests cover all of your runtime situations because in dynamically typed languages the compiler isn't helping you much.

One key difference between the type of problems caught by a static type system and the type of problems caught by some sort of runtime testing, be it unit tests, integration tests, system tests, black box, white box, or however you want to categorize the tests. A type system guarantees that certain types of problems won't occur. But it doesn't guarantee that the program will ever produce the right answer for any input, or that it will run to completion (in most causes, I know Haskell catches some of these). Thus you always need unit tests, but unit tests can't get complete coverage of an even moderately complex system.

In a highly complex system, such as a physics engine, unit tests help but can't hope to cover even a small percentage of the range of test cases. Especially when the right answer is highly dependent upon human perception of what looks correct or some other fuzzy definitely. So a type system is absolutely necessary.

So taking all of this into account, I want to outlay the vision and some syntax for a new language, a sort of fully typed lisp-syntax language.

Lets start with the preprocessing stage. At this stage we can't have any types, any sort of compiler extension is going to be roughly of the type of text->text. Traditional LISP macros work in this range. The output of the preprocessor goes to the typing stage where the static typing system starts interpreting the data and providing feedback.


So, in this stage, anything that looks even mildly like lisp gets passed through. We have at the moment only one mode of interaction with the compiler, def-parse-fn.

The compiler language at this point is traditional, untyped list with a macro syntax of clojure. Unlike traditional lisps lists evaluate to themselves unless their first member is something declared through def-parse-fn. Inside parse-fns, we have the normal family of escape characters of a lisp, i.e. backtick, tilde-evalutation, etc.

Now we get to another stage, where the type system can start to function. To do this we must, unfortunately, define more syntax.

(defvar <vartype>varname (init-body)(attribute-list))

vartype may be inferred from the initialization body. But it can also be specified formally to ensure the initialization body matches what one expects.

(defn <rettype>fnname (<type1>arg1 <type2>arg2)
(body-list) (function-attribute-list))

Like vartype, rettype can be inferred from the function body but it can also be specified formally. Also both types have matching dec{var,fn} members for simply declaring these objects to the type system but and expecting them to be found elsewhere.

I can't see how clearly this shows through in the blog but there can't get a space between these items, and types are always inclosed in <type-list>. This makes it a bit more familiar to people used to c++, java, or C#. It also allows them to pass through the first stage unchanged and with no chance of interpretation.

So, lets talk about defining a structure, or a named data layout possibly associated with some functions.

(deftype typename
(struct
(field name (no-arg-initializer body) (field-attribute-list))
(field name (mutable))
(constructor (arg1 arg2 )
(init-field fieldname init-list))
(destructor)
(property (get-fn) (set-fn)))

I could go on with member functions, but you get the idea. In this case, the typename names a defined layout, in memory, along with functions for creating them and destroying them. Fields can specify default initialization values for their members and such, but structs should also be able to specify that they a much dumber, meaning they can be treated like PODS (plain-old-datatype members)

Given a type, we would like to associate functions with the type and with an instance of the type (static vs normal functions).

(assoc-type-fn typename fnname) function is associated with the type.
(assoc-fn typename fnname) function is associated with an instance. One of the arguments to the function must be of the instance type and this argument will be expected to be elided when calling the function through the instance.

These functions may be used outside of the type definition to associate functions with the type. Whether the association is done inside or outside of the type, if you are calling a function via an association it looks like this:

(type-name fnname arg1 arg2)
(instance-var fnname arg1 arg2)

Breaking the normal lisp tradition allows intellisense to work normally. So in essence, instance are functions, where the first argument is the function name telling you which function to call. Ditto for types.

An important point is that getting/setting a field is no different syntactically than getting/setting a property.

(instance-var set prop-or-field-name value)
(instance-var get prop-or-field-name)

So I give a natural migration path that does require a recompilation but doesn't require much else.

At this point we have at least two other interesting meta programming possibilities. The first would be def-fn-macro.


(def-fn-macro macro-name( arg1 arg2 ) (body))

Note that these

Another macro type would be a type macro, where a the compiler expects a new type defined, def-type-macro.

The final statement of a type macro *must* or return a new type. This type may have functions associated with it, even static functions but it must be a type.

Now we turn our attention to the type system.

The type system for a language like C++, removing auto-casting, is very simple. It simply asks if the types match for assignment and function calls.

The type system for Haskell is a theorem proving machine.

I would like to start with the C++ type system, and carefully and transparently work toward the Haskell type system in a way that doesn't interfere with existing code. I would also like the type system to be compile-time extensible such that you can have a type system macro but I don't have a clear idea what that would look like yet.

One difference I have noticed between what I want to do and C++ is that in C++ changing a type, which is a compile time object, requires a runtime function call (which of course may be elided or inlined, but whatever).

Let's say we have a function that only works on integers within a certain range. I would like to declare that function with those parameters declared in the type of the function call.

But I don't want changing an arbitrary integer into the correctly typed integer (or back) to mean any real copy operations, even with the naive compiler. The type is something that, while it may require a runtime check, doesn't fundamentally change the underlying data in the system. So it shouldn't require an data copy or transfer operations. The compiler should ensure this runtime check is done before the function call. You can do this type of programming in C++ but it is tedious and perhaps that is the best you can do.

But one large difference between the way C++ is used and Haskell is that in Haskell people really do try to use the type system to ensure program correctness. This type of programming needs to be encouraged.

Also, Eiffel's contracts need to be encouraged but I believe the way to do this is to enable them to be done in the type system. Then let the compiler work out exactly when and how to check the pre-condition and post-conditions and such. Code that doesn't maintain the types correctly doesn't check the pre and post conditions.

Now, moving quickly, the runtime. Unfortunately, the standard lisp runtime is tough to do if you are type-checking and compiling things although haskell seems to have done something like it.

The LISP runtime, however, allows you to replace an existing function *in a running program*. This is actually quite key when debugging certain types of problems, and can make almost all debugging much faster. It also eliminates the possibility of inlining and a whole host of optimizations so it needs to be used sparsely; or at the very least the programming needs to be able to enable/disable it for a set of functions.

In addition there is the technical problem of calling a specific function from a generic data system (the repl is of course, generic). This means that every function that needs to be called generically at least needs to have a generic thunk wrapped around it that takes the arguments from an appropriately sized pointer array and writes the result back to an appropriately sized buffer.

Anyway, I am attempting to write this typed list language. Note that I didn't say anything about a garbage collection system. I think that while those do make programming easier, they are a large, unpredictable solution to a set of smaller problems. Regions, memory pools, and stack based construction/destruction all contribute to lessen the need for any sort of garbage collection.

Ditto for multithreading. I think clojure's current facilities for multithreading are exactly the correct ones (especially its stm implementation) but I also think some of the latest advances in the C# compiler, basically enabling continuations, are also very important.

Finally I am not talking about locally defined closures yet. The three main things I think make programming much more concise aside from a programmable compiler are closures, discriminated unions and more generally pattern matching. Out of those, only closures so far need extensive compiler support and I will get to that support when the time comes.

The other feature of C++ that absolutely must make it is being able to attach a set of statements that have to be done at any exit of an enclosed block of code. This is a generalization of using objects on the stack whose destructors do cleanup, thus ensuring that code gets run no matter how the function is existed. The generalization would look like:

(on-exit (stment-list)) and the compiler would ensure that the various details around this are done.

Exceptions, at least c++ style exceptions aren't mentioned because they are non-trivial to implement and are actually still controversial. I do feel that their use in certain contexts simplifies the code but I also know from experience that you can live happily without them and their overuse makes the code much less predictable. LLVM provides pretty good support for them, however, and i do intend at some point in the future to support them.

I am working on an initial implementation of the preprocessor, it has taken me a month alone to go through the various details to solidify my thoughts on the language compiler this far.

The overall vision is that a programmable compiler will beat a sophisticated one both in the short and long runs, at least in terms of programmer productivity. Furthermore, if we carefully break down the most useful features of most languages we can add them slowly and carefully when necessary but they don't need to be boiled into the base compiler. Extending the compiler allows us to add most features in a more modular and optional way.

Can this be faster than C/C++? I don't know, gcc has 30 years of optimizations behind it.
Can I write safer code in this than in C/C++? I don't know, that depends on how sophisticated we want to make the type system.

I am betting than a simpler low level system with good programming and modification possibilities beats more complex and developed systems. Over time I know that this means that community agreement and understanding of shared abstractions is much more important than with a language with a fixed, standardized definition. A good digression is why this doesn't work with c++. Why does everyone roll their own vector class?

This does give a good platform for experimentation for higher (and lower) level language features.

2 comments:

Sam Tobin-Hochstadt said...

Lots of people have worked on type systems for Lisp. For example, one reason that SBCL is fast is that you can add type annotations to help the compiler. People have also worked on developing entirely statically-typed Lisp dialects, such as Typed Racket. I recommend checking these out when starting on your language.

Also, the two biggest differences between the Haskell type system and the C++ type system are (1) the approach to polymorphism/generics, and (2) the use of subtyping in C++. These are really important choices in the design of a language.

Unknown said...

Have you seen Shen (a lisp with a type system)? (http://shenlanguage.org/)