Sunday, December 30, 2007

The Greatest Contribution to Programming Languages

Thinking about tools, it occurred to me that the greatest contribution to programming languages is not any single language. No, not Lisp, C, ML, Haskell, Qi, or any of the more obscure yet equally great languages.

The greatest contribution someone could make to the programming language and compiler community is not to make a great programming language, but to make the tools necessary for making new programming languages cheap — so cheap that programming languages and the tools that allow their efficient use (e.g. compilers, debuggers, code-editors, etc.) become abundant.

And once they become abundant, people will start to see how interfaces are everything, how everything is a language for a compiler, and how there's no real answer to code's worst enemy — size — other than abstracting away parts of it and then specifying which parts you need in a particular situation via a language.1

What tools would make implementing new programming languages cheap? Let's make a list.2
  • Parsing and Printing (Bi-directional Mappings between Text and Abstract Syntax)
  • Tree/Graph Editor
  • Inference Engine (Theorem Prover)
  • Data-flow Analysis
  • Incremental Transformation Engine (Rule-based Transformation, Incremental Compilation, Execution Stepper)
  • Dependency Analysis
  • Unified Runtime System (Memory Allocation, Garbage Collection, Runtime Safety Verification)
Some more details...

Parsing and Printing

This may be only a first step before a true code-editor is created, but I suspect it will still be desirable in some cases to stringize code. In addition, I imagine parsing code, perhaps not entire programs but small code-blocks, will still be useful inside a code editor.

What is completely obvious to a compiler-writer (in fact, I knew someone in college whose side project was exactly this), yet seems to escape everyone else, is that you can make "skins" for your code to format it differently as text, but have it parse down to the same underlying representation (i.e. abstract syntax). This way, Joe can use curly braces to specify blocks while I use whitespace indentation, and it will represent the same code. The same goes for every other parsing variation, small or large.

What's cool about this is that even though Joe and I are typing in code differently, I can view Joe's code in my format and vice versa. So no more conforming to (practically uncheckable) coding standards enforced by your tyrannical employer.

I imagine specifying these via bi-directional mappings between strings and abstract syntax data-structures. In the past, I've always had to write one piece of code to parse and another piece of code to print, and that's just silly. Of course, there are corner cases specific to one direction or the other, and I imagine some kind of declarative DSL for specifying those as well.
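
As a rough sketch of the skinning idea (the types and printers here are invented purely for illustration, not a proposed design), here is a tiny abstract syntax with two interchangeable "skins". A real system would specify one declarative, bi-directional mapping and derive both the printer and the parser from it, instead of hand-writing each direction:

    (* A toy abstract syntax: a block is a list of statements. *)
    type stmt =
      | Assign of string * string        (* x = e       *)
      | If of string * stmt list         (* conditional *)

    (* Skin 1: curly braces. *)
    let rec to_curly indent stmt =
      let pad = String.make indent ' ' in
      match stmt with
      | Assign (x, e) -> pad ^ x ^ " = " ^ e ^ ";"
      | If (c, body) ->
          pad ^ "if (" ^ c ^ ") {\n"
          ^ String.concat "\n" (List.map (to_curly (indent + 2)) body)
          ^ "\n" ^ pad ^ "}"

    (* Skin 2: whitespace indentation. *)
    let rec to_indented indent stmt =
      let pad = String.make indent ' ' in
      match stmt with
      | Assign (x, e) -> pad ^ x ^ " = " ^ e
      | If (c, body) ->
          pad ^ "if " ^ c ^ ":\n"
          ^ String.concat "\n" (List.map (to_indented (indent + 2)) body)

    (* The same tree, viewed in Joe's format or in mine. *)
    let example = If ("ready", [ Assign ("x", "1"); Assign ("y", "x") ])
    let () =
      print_endline (to_curly 0 example);
      print_endline (to_indented 0 example)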

Tree/Graph Editor

Code at its core is a tree. Once it is parsed, it is represented internally as an abstract-syntax tree, or absyn tree. If you consider the semantics of the language, though, variable references, for example, refer back to binding nodes closer to the root of the tree. So in that sense, it is a graph.

The core of most IDEs today is a text-editor. This is silly, though, since the programmer is actually editing code, which is this tree or graph data-structure. In particular, when you edit the name of a variable reference, the editor should distinguish between changing what that node refers to and renaming the node: overwriting a variable in text changes what the node refers to, while renaming requires manually replacing all references to it, taking the language's scoping rules into account. The editor should also understand when you're trying to wrap an expression in another expression, instead of relying on you to go to the beginning, add an open-paren, go to the end, and add a close-paren.
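
A minimal sketch of the rename-versus-repoint distinction (the representation is made up purely for illustration): give every binder a stable id and let references point at ids rather than at text. Renaming and re-pointing then become two different operations on the graph, and neither can be confused with the other:

    (* A binder has a stable id; its name is just display text. *)
    type binder = { id : int; mutable name : string }
    type expr =
      | Ref of int                        (* refers to the binder with this id *)
      | Let of binder * expr * expr       (* let <binder> = e1 in e2 *)

    (* Renaming changes how a binder is displayed; every Ref to it follows along. *)
    let rename binder new_name = binder.name <- new_name

    (* Re-pointing changes which binder one particular reference actually means. *)
    let rec repoint ~from_id ~to_id = function
      | Ref id when id = from_id -> Ref to_id
      | Ref id -> Ref id
      | Let (b, e1, e2) ->
          Let (b, repoint ~from_id ~to_id e1, repoint ~from_id ~to_id e2)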

Sure, it's a matter of convenience, but it's also a matter of error-proneness. I can't count the number of times I've had bugs due to copying and pasting textual code. If the editor I was using were aware that the structure I was copying was a graph, it would indicate which nodes the pasted code was referring to, allowing me to see my bug immediately.

The editor must be aware of the semantics of the language you are editing and give continuous feedback on the changes being made. If it weren't aware of the semantics, how could it give you feedback that is specific to the language you are using? Thus, your code editor must be tied very tightly to your compiler. However, even a general tree-editor would be a huge step in the right direction.

Inference Engine

An inference engine is critical for any code-level automation: type checking, type inference, and semantics-preserving optimization, for example. This is no simple feat to create, as it is, at its core, a theorem prover.

Instead of re-writing a theorem prover for every compiler, a compiler writer should only need to specify the axioms and rules of inference for his language to an industrial-strength theorem prover. Simple things like constant folding and more complicated things like induction-variable elimination can fall out of this.
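
For example, constant folding is little more than a pair of equations handed to the engine as left-to-right rewrites. A minimal sketch with an invented rule representation, where the generic machinery that applies rules to a fixed point belongs to the engine, not to each compiler writer:

    type expr =
      | Int of int
      | Add of expr * expr
      | Mul of expr * expr

    (* A rule is a partial rewrite: Some replacement when it applies, None otherwise. *)
    let constant_fold : expr -> expr option = function
      | Add (Int a, Int b) -> Some (Int (a + b))
      | Mul (Int a, Int b) -> Some (Int (a * b))
      | _ -> None

    (* Generic bottom-up application of a rule until nothing more applies. *)
    let rec rewrite rule expr =
      let expr' =
        match expr with
        | Int _ -> expr
        | Add (a, b) -> Add (rewrite rule a, rewrite rule b)
        | Mul (a, b) -> Mul (rewrite rule a, rewrite rule b)
      in
      match rule expr' with
      | Some e -> rewrite rule e
      | None -> expr'

    (* rewrite constant_fold (Mul (Add (Int 1, Int 2), Int 4))  ==>  Int 12 *)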

Data-flow Analysis

Of course, some standard optimizations like dead-code elimination (if your language is eager) and semantic checks like whether variables have been initialized (if your language allows such things) rely on data-flow analysis. Some compilers even use this to optimize memory allocation, as in whether to allocate something on the stack or the heap, by determining whether a reference to an object may escape a local block.

Again, this is another instance of something that can be built once, generically. It should also be tied in with the incremental transformation engine so that only the pieces of code that have changed are re-compiled or re-executed. And it could be extremely useful in a static sense, where the editor shows the programmer which pieces of code are affected by the flow of the changes made (since the last save or last revision, for example).
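
Most of these analyses are the same fixpoint computation with different ingredients plugged in, which is what makes a general engine plausible. A rough sketch, using an "is this variable definitely initialized?" analysis as the example (the block representation is invented for illustration):

    module VarSet = Set.Make (String)

    (* A basic block, already reduced to what the analysis cares about:
       which variables it assigns, and which blocks can jump to it. *)
    type block = { assigns : VarSet.t; preds : int list }

    (* Forward "must" analysis: a variable is initialized at the end of a block
       if it is initialized at the end of every predecessor, or assigned here. *)
    let initialized (blocks : block array) : VarSet.t array =
      let all =
        Array.fold_left (fun s b -> VarSet.union s b.assigns) VarSet.empty blocks
      in
      let out = Array.make (Array.length blocks) all in   (* optimistic start *)
      let changed = ref true in
      while !changed do
        changed := false;
        Array.iteri
          (fun i b ->
            let in_facts =
              match b.preds with
              | [] -> VarSet.empty
              | ps -> List.fold_left (fun acc p -> VarSet.inter acc out.(p)) all ps
            in
            let out_facts = VarSet.union in_facts b.assigns in
            if not (VarSet.equal out_facts out.(i)) then begin
              out.(i) <- out_facts;
              changed := true
            end)
          blocks
      done;
      out

Change the direction, the join, and the transfer function, and the same skeleton computes liveness, reaching definitions, or a crude escape analysis.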

Incremental Transformation Engine

This is the meat of the compiler: the transformation engine. It transforms source code to object code. Or in the case of an interpreter, source code to immediately-executed instructions. It must be incremental in the global sense in that it must handle incremental building for a more responsive development environment. But it must also be incremental in the local sense in that it must allow for the programmer to pause, step, observe, or otherwise control the transformation for debugging purposes.

Home-grown interpreters usually start out all-or-nothing; they either transform or they don't. The reason is simply that the most straightforward and clean implementation is a recursive traversal of an absyn tree. This leaves no room for stepping through code and displaying the current state to the programmer. A debugger is an afterthought, and rightly so, as its cost outweighs its benefit when a language implementor is creating a language for himself or for others who have access to the language the compiler is written in.

I imagine writing pattern-matching rules for the transformation engine much the way someone might write a recursive traversal of an absyn tree in ML; the engine, however, would automatically handle incremental stepping and observation of the transformation process for interaction with the programmer when debugging. I also imagine many additions to this language compared to ML that allow shorthand for common transformation patterns, like specifying only the differing base cases and having the engine recursively apply the transformation to all other nodes.
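
To make "incremental in the local sense" concrete, here is a minimal sketch (a toy expression language, invented for illustration) of the difference between the usual all-or-nothing recursive evaluator and a single-step transformer that a debugger could drive and display one step at a time:

    type expr =
      | Num of int
      | Add of expr * expr

    (* One small step of evaluation, or None if the expression is already a value.
       A debugger can call this repeatedly, showing each intermediate tree. *)
    let rec step = function
      | Num _ -> None
      | Add (Num a, Num b) -> Some (Num (a + b))
      | Add (Num a, e2) ->
          (match step e2 with Some e2' -> Some (Add (Num a, e2')) | None -> None)
      | Add (e1, e2) ->
          (match step e1 with Some e1' -> Some (Add (e1', e2)) | None -> None)

    (* Running to completion is just stepping until no rule applies; pausing,
       observing, or single-stepping all fall out of the same function. *)
    let rec run e = match step e with Some e' -> run e' | None -> e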

Dependency Analysis

Based on dependency analysis, the transformation engine can automatically determine which code to (re-)compile and in which order. This should include external dependencies as well.

In many projects I've worked on in the past, circular dependencies have prevented standard tools from working as expected. As a work-around, we were forced to rearrange our module layouts and handle the dependencies manually. Although circular dependencies can in general lead to confusing module interfaces, there are cases where they just make sense, and the transformation engine should handle them by simply using a two-pass algorithm.
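
A rough sketch of the recompilation side (the module names and dependency representation are made up for illustration): given each module's direct dependencies, the set that must be rebuilt after a change is just the transitive closure of "depends on", computed here as a small fixpoint:

    module StrSet = Set.Make (String)

    (* deps lists, for each module, the modules it depends on directly. *)
    let needs_rebuild (deps : (string * string list) list) (changed : string) =
      let rec grow acc =
        let next =
          List.fold_left
            (fun acc (name, ds) ->
              if List.exists (fun d -> StrSet.mem d acc) ds
              then StrSet.add name acc
              else acc)
            acc deps
        in
        if StrSet.equal next acc then acc else grow next
      in
      grow (StrSet.singleton changed)

    (* With deps = [ "main", ["eval"; "parser"]; "eval", ["parser"]; "parser", [] ],
       changing "parser" forces "eval" and "main" to be rebuilt as well. Because the
       fixpoint doesn't care about cycles, mutually dependent modules simply end up
       in the same rebuild set, ready for a two-pass compile. *)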

Unified Runtime System

This has got to be the most depressing topic for all language implementors. If you implement your own runtime system for memory allocation and garbage collection, what hope do you ever have of your language interacting with the rest of the world — something that is necessary both for production apps and for popular adoption? If you use an existing runtime system, you are stuck on its platform and forced to use its (often unsuitable) virtual machine.

Parrot, looking good. Perhaps this is on its way already.

As for runtime safety verification, I would love to see something that used proof-carrying code. For those who don't know about it, it provides the guarantees that traditional sandbox-style VMs enforce at runtime, but statically, so the runtime overhead is eliminated and native machine code can be run directly with all the same safety guarantees.3



All of these tools are non-language-specific, and they would help make creating new programming languages economically viable for software developers and startups in particular. In the end, this would advance the field of programming languages immensely, more than the development of any single language could.
1. I mean "language" in the most general sense, as it's sometimes suitable to abstract something away into a library and have your language be function calls into that library. Other times it makes sense to have a full-blown Turing-complete language. Anyway, this is why PL design is so important — it's really being done everywhere even though most people don't realize it.
2. What's missing from this list?
3. Peter Lee, a professor I worked for in college, was one of the inventors of proof-carrying code, and it turns out something so amazingly useful is practically a special case of type checking. However, this also requires proof-generation, another use of a good theorem prover. In the sci-fi stories that talk about computers programming themselves — they're talking about theorem provers. ...It's sad for us CSers. Most people are simply unaware of the amazing things currently in development.

Friday, December 21, 2007

Design — A Life Philosophy

Every problem is a design problem full of design choices. Even the choice to take a hands-off approach is a design choice. Every decision you make in your life — what to eat for lunch, what to do as a profession, who to hang out with on Saturday night — is a design decision. Each and every one of us does our best to design our life. Why would you ever choose something unless you were convinced that it would further you toward your idea of a successful life, whatever that idea may be?

Joel gives a nice description of what design is in his 3-parter. Design is all about making trade-offs — as is life. Everything has an opportunity cost.

Sunday, December 16, 2007

On Practicality and Tool-Making

A comment on my last post on design and programming languages spawned the following.

Q: Are you implementing some of your ideas?

A: It's always in the back of my mind to design and implement my dream programming language. This blog is in part my outlet for ideas that come streaming into my mind about programming languages. In fact, it's my only outlet. The more specialized you get, the fewer people you can find who consciously value your field; most people don't even know it exists. Consequently, I would just love to have someone to bounce ideas around with who actually has a clue what I'm talking about.

That said, I'm not actively working on something at the moment, and I think that warrants an explanation.

Back when I started programming, I primarily liked to make things that were useful on a day-to-day basis. In making such things, I started to see patterns and got the idea of making tools to make creating them easier and quicker. That eventually led me to programming languages, the medium of expression of all software. As a result, I spent practically my entire undergraduate career studying type-theory and compilers.

To this day, I am more convinced than ever that PL design is one of the most important things to {software as we know it}.

However, given all that, tool-making can be a huge side-track, because like everything else, pattern-discovery and tool-making follow a pattern of their own:
  1. You use the tools you have to solve a problem.
  2. You discover patterns in the use of those tools.
  3. You make new tools that eliminate those patterns.
  4. You return to step 1 with your new tools.
But if you're like me and see patterns everywhere, you could end up spending most of your time in step 3, not step 1. Especially when you see patterns in the tool-making process and start to recurse on sub-goals. And if your ultimate goal is to solve your original problem, there's something wrong.

The bottom line is that hard work — sometimes an obscene amount of it — pays off if it's focused on your goal. And the whole tool-making branch only makes sense towards your bottom-line if {the amount of work spent making your tool} plus {the amount of work it takes to accomplish your goal with the tool} is less than {the amount of work it would take without the tool}. This simple accounting formula has told me that making compilers and new programming languages is just not worth it given my goals. The cost of making a compiler and all the tools necessary to be productive with a new language is just so high that it dominates the inequality.
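
To make the accounting concrete, a throwaway sketch with made-up numbers (months of effort, purely hypothetical):

    (* The tool is only worth building if the total effort with it beats
       the total effort without it. All numbers below are invented. *)
    let worth_building ~tool ~goal_with_tool ~goal_without_tool =
      tool +. goal_with_tool < goal_without_tool

    let () =
      (* Say a usable compiler plus tooling costs 24 months, the app then takes
         3 months, but the app in an existing language takes 9 months. *)
      assert (not (worth_building ~tool:24. ~goal_with_tool:3. ~goal_without_tool:9.))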

This can be frustrating, especially for someone who wants to create something beautiful.

But it is a matter of practicality.

Now, this may seem like an elaborate excuse to not work on something hard and important. But to me, it is the reason why I should work on other hard and important things.

I'm not trying to discourage anyone from working on compilers or designing new programming languages. On the contrary. If your goal is to make the best programming language in the history of programming languages — a so-called hundred-year language — then by all means, work on that! Seriously, please do; we are all in need.

All I'm really saying is that everything has an opportunity cost, and you should evaluate everything in terms of your goal, whatever that may be. Sometimes that means putting aside your own desires, like the desire to make a tool. But in the end, it's worth it.

Sunday, December 9, 2007

1, 2, n

It's the 1, 2, n rule in software development.

You do something once. You do something twice. And the next time you do it, you generalize it for all times henceforth.

Waiting longer to generalize is painful at best. In the worst case, it leads to maintenance nightmares, as when the same thing is done 14 times in 14 different places, each time slightly different from the rest.

Generalizing too soon, however, leads to other problems. Everyone's done this at one point in their coding career or another: Hey, this is a great idea! Why hasn't someone else done it before? I'll code it completely generally the first time around so that it'll be done right from the start. And then a few weeks of coding go by, you start to actually use it, and you find out the hard way that either it was a really bad idea or the API is so terrible that you have to re-write it. Because, basically, you didn't iterate your design.

Code is all about design. And good design is re-design.

This is a big part of the reason why programming languages are so far behind. The amount of work it currently takes to create (a compiler for) a programming language is so high that the time it takes for a language designer to get feedback on his design is years. But by that time (if he was lucky enough to actually get users), he's already invested so much into it that he is unlikely to throw it away and start from scratch — something a good designer must do. On top of that, even if he wanted to start from scratch, it would take many months to de-program his mind from thinking in the previous language. (Like the first time a Java programmer learns Haskell and tries to write a loop!)

An IDE is part of a language. Error messages are part of a language. Interacting with other systems, written in other languages, is part of a language. They're all part of the interface of the language. And the interface is everything.

It's a chicken-vs.-egg problem, though, because getting all these tools made has, historically, required that a language be popular. But how do you get people to use a language if it doesn't have the most basic tools, like a debugger?

JavaScript is doing it right now. JavaScript tools — which make using the language no longer so painful — are cropping up everywhere these days. And it managed that by being the only choice for developers who want their code to run in a browser window. Most languages aren't so lucky. So what is a language designer to do?

As a designer, I can't simply make a command-line compiler for my language. I have a vision of what editing code is like. And it's not with a text-editor. The closest thing I've ever seen to a true code-editor is Eclipse for editing Java code, but Eclipse has been around long enough for us to learn a few lessons. I can already hear the Emacs junkies crying about how Emacs is the ultimate editor. But again, it's still a glorified text-editor.

A true code-editor must absolutely be aware of the semantics of the language you are editing. It is inseparable from the compiler, because the compiler is what knows how to interpret your code, and when something is amiss, it must be able to feed that back to the programmer.

Writing code is all about design. And a designer needs feedback — as much feedback as possible. The quicker he gets that feedback the better, because he can make any necessary changes to the design only after he's gotten feedback on his current design.

So any code-editor that doesn't give immediate feedback on parse errors and requires that you do a full build is simply crap! How can you possibly work like that?! Ditto for type errors, link errors, and test-assertion errors. People see it as obvious when the old-timers tell stories about how they used to submit compilation jobs to mainframes on punch-cards, only to find out the next day that they forgot a semicolon and had to re-submit the entire job. Yet when it comes to where we are, here and now, people are blind.

Frankly, I'm getting tired of it. I want to see test data flow through my code as I write it and the unit-test results change from red to green as I fix a bug. I want to be able to highlight two blocks of code and have my code-editor factor out a function, replacing the two blocks with applications. I want bottlenecks to be highlighted in-editor based on profiling results. And compilation should be done completely in the background so I don't have to wait after making a change to use it. And this read-eval-print-loop that's all the rage nowadays might actually be useful if I could use it in the context of a particular point of my code.

Is it really that hard? Someone please tell me I'm a visionary because I can't believe that this can't be done. Nothing a little universal substitution and computing power can't solve.