The greatest contribution someone could make to the programming language and compiler community is not to make a great programming language, but to make the tools necessary for making new programming languages cheap — so cheap, that programming languages and the tools that allow their efficient use (e.g. compilers, debuggers, code-editors, etc.) become abundant.
And once they become abundant, people will start to see how interfaces are everything, how everything is a language for a compiler, and how there's no real answer to code's worst enemy — size — other than abstracting away parts of it and then specifying which parts you need in a particular situation via a language.1
What tools would make implementing new programming languages cheap? Let's make a list.2
- Parsing and Printing (Bi-directional Mappings between Text and Abstract Syntax)
- Tree/Graph Editor
- Inference Engine (Theorem Prover)
- Data-flow Analysis
- Incremental Transformation Engine (Rule-based Transformation, Incremental Compilation, Execution Stepper)
- Dependency Analysis
- Unified Runtime System (Memory Allocation, Garbage Collection, Runtime Safety Verification)
Parsing and Printing
This may be a first step before a true code editor is created, but I suspect it will still be desirable in some cases to stringize code. In addition, I imagine parsing code, perhaps not entire programs but small code blocks, will still be useful in a code editor.

What is completely obvious to a compiler writer (in fact, I knew of someone in college whose side project was exactly this), yet others seem completely oblivious to, is that you can make "skins" for your code that format it differently as text but parse down to the same underlying representation (i.e. abstract syntax). This way, Joe can use curly braces to delimit blocks while I use whitespace indentation, and it will represent the same code. The same goes for every other parsing variation, small or large.
What's cool about this is that even though Joe and I are typing in code differently, I can view Joe's code in my format and vice versa. So no more conforming to (practically uncheckable) coding standards enforced by your tyrannical employer.
I imagine specifying these via bi-directional mappings between strings and abstract-syntax data structures. In the past, I've always had to write one piece of code to parse and another piece of code to print, and that's just silly. Of course, there are corner cases specific to one or the other, and I imagine some kind of declarative DSL for specifying those.
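Something like this toy OCaml sketch is what I have in mind (all of the types and functions here are made up for illustration): one abstract syntax, two "skins" rendered by two printers, and the corresponding parsers, not shown, would map either notation back onto the same tree.

```ocaml
(* A toy sketch: one abstract syntax, two "skins".  Both printers render
   the same AST; the corresponding parsers (not shown) would map either
   notation back to this one type.  All names here are invented. *)

type expr =
  | Var of string
  | Call of string * expr list

type stmt =
  | Assign of string * expr
  | If of expr * stmt list          (* condition, body *)

let rec show_expr = function
  | Var x -> x
  | Call (f, args) ->
      f ^ "(" ^ String.concat ", " (List.map show_expr args) ^ ")"

(* Joe's skin: curly braces delimit blocks. *)
let rec curly indent s =
  let pad = String.make indent ' ' in
  match s with
  | Assign (x, e) -> pad ^ x ^ " = " ^ show_expr e ^ ";"
  | If (c, body) ->
      pad ^ "if (" ^ show_expr c ^ ") {\n"
      ^ String.concat "\n" (List.map (curly (indent + 2)) body)
      ^ "\n" ^ pad ^ "}"

(* My skin: indentation delimits blocks. *)
let rec offside indent s =
  let pad = String.make indent ' ' in
  match s with
  | Assign (x, e) -> pad ^ x ^ " = " ^ show_expr e
  | If (c, body) ->
      pad ^ "if " ^ show_expr c ^ ":\n"
      ^ String.concat "\n" (List.map (offside (indent + 2)) body)

(* The same tree, viewed two ways. *)
let example = If (Call ("ready", [ Var "x" ]), [ Assign ("y", Var "x") ])

let () =
  print_endline (curly 0 example);
  print_endline (offside 0 example)
```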
Tree/Graph Editor
Code at its core is a tree. Once it is parsed, it is represented internally as an abstract-syntax tree, or absyn tree. If you consider the semantics of the language, though, variable references, for example, refer back to nodes closer to the root of the tree. So in this sense, it is a graph.

The core of most IDEs today is a text editor. This is silly, though, since the programmer is actually editing code, which is this tree or graph data structure. In particular, when you edit the name of a variable reference, the editor should distinguish between changing what that node refers to and renaming the node: overwriting a variable in text changes what the node refers to, while to rename it you have to manually replace all references to it, taking into account the scoping rules of the language. The editor should understand when you're trying to wrap an expression in another expression, instead of relying on you to go to the beginning, add an open-paren, go to the end, and add a close-paren.
Sure, it's a matter of convenience, but it's also a matter of error-proneness. I can't count the number of times I've had bugs due to copying and pasting textual code. If the editor I was using were aware that the structure I was copying was a graph, it would indicate to me which nodes the pasted code was referring to, allowing me to see my bug immediately.
The editor must be aware of the semantics of the language you are editing and give continuous feedback on changes made. If it weren't aware of the semantics of the language, how would it give you feedback that is specific to the language you are using? Thus, your code editor must be tied very tightly to your compiler. However, a general tree editor would be a huge step in the right direction.
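As a rough sketch of the rename-versus-rebind distinction (again with invented types), imagine references pointing at binder nodes rather than at names:

```ocaml
(* A toy sketch of rename vs. rebind.  References carry the id of the
   binder node they point to rather than a name, so renaming a binder is
   one edit and every reference follows along.  All names are invented. *)

type id = int

type expr =
  | Let of id * string * expr * expr   (* binder id, display name, bound expr, body *)
  | Ref of id                          (* points at a binder node, not at a name *)
  | Int of int
  | Add of expr * expr

(* Renaming: change the display name stored at one binder node. *)
let rec rename target new_name = function
  | Let (i, name, e, body) ->
      let name = if i = target then new_name else name in
      Let (i, name, rename target new_name e, rename target new_name body)
  | Add (a, b) -> Add (rename target new_name a, rename target new_name b)
  | e -> e

(* Rebinding: make one particular reference point at a different binder. *)
let rebind reference target =
  match reference with
  | Ref _ -> Ref target
  | e -> e

let prog = Let (1, "x", Int 2, Add (Ref 1, Int 3))

(* After renaming, Ref 1 still points at the same binder: no textual
   search-and-replace, and no accidental capture. *)
let renamed = rename 1 "count" prog
```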
Inference Engine
An inference engine is critical for any code-level automation: for example, type checking, type inference, and semantics-preserving optimization. This is no simple feat to create, as it is, at its core, a theorem prover.

Instead of re-writing a theorem prover for every compiler, a compiler writer should only need to specify the axioms and rules of inference for his language to an industrial-strength theorem prover. Simple things like constant folding and more complicated things like induction-variable elimination can fall out of this.
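As a crude illustration of handing rules to a generic engine (the rule format and the engine below are invented for this sketch, and a real inference engine would be far more general), constant folding can be expressed as a handful of local rewrite axioms:

```ocaml
(* The language implementor supplies local rewrite axioms; the generic
   engine applies them bottom-up until nothing more changes. *)

type expr =
  | Int of int
  | Add of expr * expr
  | Mul of expr * expr

(* A rule either rewrites a node or declines. *)
type rule = expr -> expr option

let fold_constants : rule = function
  | Add (Int a, Int b) -> Some (Int (a + b))
  | Mul (Int a, Int b) -> Some (Int (a * b))
  | Mul (Int 1, e) | Mul (e, Int 1) -> Some e
  | Add (Int 0, e) | Add (e, Int 0) -> Some e
  | _ -> None

(* The generic part: the implementor never writes this traversal. *)
let rec rewrite (r : rule) (e : expr) : expr =
  let e' =
    match e with
    | Add (a, b) -> Add (rewrite r a, rewrite r b)
    | Mul (a, b) -> Mul (rewrite r a, rewrite r b)
    | Int _ -> e
  in
  match r e' with
  | Some e'' -> rewrite r e''
  | None -> e'

let () =
  match rewrite fold_constants (Mul (Add (Int 1, Int 2), Int 1)) with
  | Int n -> Printf.printf "folded to %d\n" n
  | _ -> print_endline "could not fold"
```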
Data-flow Analysis
Of course, some standard optimizations like dead-code elimination (if your language is eager) and semantic checks like whether variables have been initialized (if your language allows such things) rely on data-flow analysis. Some compilers even use it to optimize memory allocation, deciding whether to allocate something on the stack or the heap by determining whether a reference to an object may escape a local block.

Again, this is another instance of something that can be done once, generally. But it should also be tied in with the incremental transformation engine to allow re-compiling or re-executing only the pieces of code that have changed. This could also be extremely useful in a static sense, where the editor displays to the programmer which pieces of code are affected by the flow of changes made (since the last save or the last revision, for example).
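Here's a toy sketch of the initialized-before-use check as a forward data-flow analysis; the program representation is invented, and a real engine would of course work over a control-flow graph rather than a statement list:

```ocaml
(* A toy forward data-flow check: "is every variable initialized before
   use?"  Everything here is made up for illustration. *)

module S = Set.Make (String)

type stmt =
  | Assign of string * string list   (* variable defined, variables read *)
  | If of stmt list * stmt list      (* then-branch, else-branch *)

(* The set of definitely-initialized variables flows forward; after an
   If, a variable is definitely initialized only if both branches
   initialize it. *)
let rec check defined stmts =
  List.fold_left
    (fun defined stmt ->
      match stmt with
      | Assign (x, reads) ->
          List.iter
            (fun v ->
              if not (S.mem v defined) then
                Printf.printf "warning: %s may be uninitialized\n" v)
            reads;
          S.add x defined
      | If (t, f) -> S.inter (check defined t) (check defined f))
    defined stmts

let () =
  ignore
    (check S.empty
       [ If ([ Assign ("x", []) ], []);   (* x is set on one branch only *)
         Assign ("y", [ "x" ]) ])         (* so this read gets a warning *)
```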
Incremental Transformation Engine
This is the meat of the compiler: the transformation engine. It transforms source code to object code, or in the case of an interpreter, source code to immediately-executed instructions. It must be incremental in the global sense, in that it must handle incremental building for a more responsive development environment. But it must also be incremental in the local sense, in that it must allow the programmer to pause, step, observe, or otherwise control the transformation for debugging purposes.

Home-grown interpreters usually start out all-or-nothing; they either transform or they don't. The reason is simply that the most straightforward and clean implementation is a recursive traversal of an absyn tree. This leaves no room for stepping through code and displaying the current state to the programmer. A debugger is an afterthought, and rightly so, as its cost outweighs its benefit when a language implementor is creating a language for himself or for others who have access to the language the compiler is written in.
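To sketch the local, pause-and-step kind of incrementality (with an invented toy language): rather than a recursive evaluator that runs to completion, the engine performs one reduction at a time, so stepping and observing come for free.

```ocaml
(* Each call to step performs one reduction, so an editor or debugger
   can stop and display every intermediate state. *)

type expr =
  | Int of int
  | Add of expr * expr

(* Perform at most one reduction; None means we have reached a value. *)
let rec step = function
  | Int _ -> None
  | Add (Int a, Int b) -> Some (Int (a + b))
  | Add (a, b) -> (
      match step a with
      | Some a' -> Some (Add (a', b))
      | None -> (
          match step b with
          | Some b' -> Some (Add (a, b'))
          | None -> None))

let rec show = function
  | Int n -> string_of_int n
  | Add (a, b) -> "(" ^ show a ^ " + " ^ show b ^ ")"

(* Driving the stepper; a debugger would pause here between steps. *)
let rec trace e =
  print_endline (show e);
  match step e with
  | Some e' -> trace e'
  | None -> ()

let () = trace (Add (Add (Int 1, Int 2), Int 3))
```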
I imagine writing pattern-matching rules for the transformation engine much the way someone might write a recursive traversal of an absyn tree in ML; however, the transformation engine would automatically handle incremental stepping and observation of the transformation process for interaction with the programmer when debugging. In addition, I imagine many additions to this language compared to ML that allow shorthand for common cases of transformation, like specifying only the differences in base cases and having the engine recursively apply the transformations to all other nodes.
Dependency Analysis
Based on dependency analysis, the transformation engine can automatically determine which code to (re-)compile and in which order. This should include external dependencies as well.

In many projects I've worked on in the past, circular dependencies have prevented standard tools from working as expected. As a work-around, we were forced to rearrange our module layouts and handle the dependencies manually. Although circular dependencies in general can lead to confusing module interfaces, there are some cases where they just make sense, and the transformation engine should handle circular dependencies by simply using a two-pass algorithm.
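A toy sketch of the ordering side (the module graph here is made up): put dependencies first, and report any cycle so the engine can fall back to a two-pass compile for that group.

```ocaml
(* Order modules so that dependencies come first; raise Cycle when a
   module transitively depends on itself. *)

let deps = function
  | "main"   -> [ "parser"; "eval" ]
  | "parser" -> [ "ast" ]
  | "eval"   -> [ "ast"; "parser" ]
  | _        -> []

exception Cycle of string list

(* Depth-first post-order over the dependency graph. *)
let build_order roots =
  let visited = Hashtbl.create 16 in
  let order = ref [] in
  let rec visit path m =
    if List.mem m path then raise (Cycle (m :: path))
    else if not (Hashtbl.mem visited m) then begin
      Hashtbl.add visited m ();
      List.iter (visit (m :: path)) (deps m);
      order := m :: !order
    end
  in
  List.iter (visit []) roots;
  List.rev !order

let () =
  match build_order [ "main" ] with
  | order -> print_endline (String.concat " -> " order)
  | exception Cycle ms ->
      (* e.g. if "ast" also depended on "eval", we would land here *)
      print_endline
        ("circular group, compile in two passes: " ^ String.concat ", " ms)
```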
Unified Runtime System
This has got to be the most depressing topic for all language implementors. If you implement your own runtime system for memory allocation and garbage collection, what hope do you ever have of your language interacting with the rest of the world — something that is necessary both for production apps and for popular adoption? If you use an existing runtime system, you are stuck on its platform and forced to use its (often unsuitable) virtual machine.

Parrot is looking good, though. Perhaps this is on its way already.
As for the runtime safety verification, I would love to see something that used proof-carrying code. For those who don't know about this, it is a way to guarantee runtime safety in the way that traditional sandbox-like VMs do at runtime, but statically, so that runtime overhead is eliminated and native machine-code can be run directly with all the same safety guarantees.3
All of these tools are non-language-specific, and they would help make creating new programming languages economically viable for software developers and startups in particular. In the end, this would advance the field of programming languages immensely, more than the development of any single language could.
1. I mean "language" in the most general sense, as it's sometimes suitable to abstract something away into a library and have your language be function calls into that library. Other times it makes sense to have a full-blown Turing-complete language. Anyway, this is why PL design is so important — it's really being done everywhere even though most people don't realize it.
2. What's missing from this list?
3. Peter Lee, a professor I worked for in college, was one of the inventors of proof-carrying code, and it turns out something so amazingly useful is practically a special case of type checking. However, this also requires proof-generation, another use of a good theorem prover. In the sci-fi stories that talk about computers programming themselves — they're talking about theorem provers. ...It's sad for us CSers. Most people are simply unaware of the amazing things currently in development.
20 comments:
LLVM
I'm totally with you, and I've spent the last six months (off and on) working on a solution to "Parsing and Printing." My project is called "Gazelle"
http://www.reverberate.org/gazelle/
Going from text to AST is usually people's ultimate goal, but with Gazelle I focus on the concrete syntax with the option of layering AST-building on top. I do this for a few reasons:
- it enables whitespace-preserving transformations and other applications that care about concrete syntax.
- it makes event-based parsing possible, so that applications can avoid building any in-memory structures of any kind for greatest efficiency.
I invite you to read more about Gazelle and/or subscribe to my blog if you're interested. I'm also eager to hear any feedback.
I also agree with Steve that LLVM provides a lot of the infrastructure to support building new languages. It doesn't handle a lot of the really high-level ideas you were describing, but if you're dealing with an imperative language and you can emit LLVM assembly code, LLVM will handle the optimization and assembly code generation.
While I was reading, I was about to post a quip about how parsing isn't necessary when creating a PL and that the idea of "syntax" is illogical. But I kept reading, and it turns out you're already half-way to that conclusion.
My vision of the perfect programming language is, essentially, a way to cut the amount of effort of implementing a new language down to 0.
The idea is to basically create a "base" language that maps 1 to 1 with the underlying architecture. Different implementations might use something like x86 assembly, MSIL, or even C-like structures for an interpreter. And then you add the magic ingredient: macros. With those, you can create a generic layer above the implementation-dependent language, which would be the "compiler" in most normal languages.
Basically, the idea is to make as much as possible available in the language, essentially offloading most of the compiler or interpreter into a macro library.
Regarding syntax—there is none, per se. Code would be stored in whatever method allows the easiest loading. If the compiler were written in python, for example, it would be easiest to just pickle the AST and dump it to a file. Gone are the days of complicated parsing or variable name limitations (or syntax errors, for that matter). The benefits of such a system are innumerable.
In any event, it would be far too difficult to describe everything in one comment. I've actually been looking for a partner in crime to bounce ideas off of. Would you be interested?
I had to copy and paste this text into a text editor just so I could read it. White text on black background? Sheesh.
The idea of representing multiple languages with a common intermediate representation (like an AST) only works if the languages are semantically pretty close to each other - i.e. if everything you can say in language X is "sayable" in language Y via a fairly straightforward mapping from X's constructs to Y's.
But some things you can say in language X may have *no* reasonable mapping in language Y. Here are some examples: language X uses fully non-strict evaluation, and Y uses strict evaluation. Or language X has built-in support for concurrency (think Occam) while language Y only does this via a library (e.g. C). Or language X has dataflow variables (like Oz) and language Y does not (like Ruby).
If you want to dig further into why putting your parse tree into an AST will not buy you much, you need to do some reading on the history of this idea, starting with UNCOL. UNCOL was a manifest failure, because people discovered, slowly, painfully, that there is no one underlying "abstract computing machine" that equally well serves the practical needs of all the different high level languages we have. (Note that I said "the practical needs" - I'm not talking about theoretical Turing equivalence here.)
Having an LLVM is only "the answer" for a narrow range of possible languages.
Factorcode.org
Forget it, the easiest way to write a new language these days is to embed it in a sufficiently expressive existing language - Haskell in particular is terrific for this. (yeah, Lisp does give you more rope but then it leaves you to hang yourself) Rather than trying to compete with that from scratch, your energies would be better spent documenting _how_ one goes about writing an embedded language.
Re: Tools - we did some interesting work at Dynamic Aspects (see demo)
The editor didn't have any textual input, you just refactored the code using intuitive keyboard/mouse shortcuts. This was based on the observation that most of the time you are actually just shuffling code about with the aim of adding a dimension of freedom ready for adding the next feature (e.g. add a parameter to a method, promote a method to a class, introduce a parameter object etc).
The editor was "semantic" and understood the dataflow of the program when making transformations. The grand vision provides more details.
Unfortunately, we didn't get much further than the online demo (it was real, not simulated!), though there continues to be research in this area (link from the demo).
Also Take a look at Subtext where Jonathan Edwards is working on some similar ideas.
Anonymous, I'm familiar with UNCOL and its apparent failure. If you're referring to CW's comment, I kind of have to agree with you.
But what I'm talking about isn't a silver bullet. Just tools to help make this sort of translation easier, as there are some things which apply to compiling a vast variety of languages. And this would be very useful if I'm making a compiler for a new language.
In a sense, I'm talking about a DSL specially designed for writing compilers. ...This is a whole can of worms I'm not going to go in to right now.
I'll take that as a "no," then. Still, your ideas are rather interesting. I'll check by here every once in a while.
CW, do you have a website or blog describing in more detail what you're working on? It could be that I just don't completely understand what you mean, as some of what you said definitely makes a lot of sense and is in line with what Jonathan Edwards and I are talking about.
Are you familiar with UNCOL and previous work on compiler-compilers?
Jeff, that's beautiful! I'm sad to hear you say that it's no longer being worked on. Looking forward to seeing more of your work, and hopefully others will follow your lead.
Honestly, I don't read much about that stuff. Almost all of my ideas are home grown, mostly because I enjoy thinking about this sort of stuff. I don't have a blog that I use at the moment, although I have a blank one ready, as you probably noticed.
Basically, the idea is to create a non-textual base language. You would then build up from that using macros. This allows for consistent access to very low level mechanisms, while creating a nicely layered system to boot. The final assortment would basically be something akin to Haskell.
On the other hand, it occurs to me that I'm still not being clear on what it is I'm describing. I'm talking about a programming language, not a means to simplify (traditional) compiler writing.
If you want, we can take this discussion outside of blogger comment (as much as I love 60x15 text boxes...). My email is cwarren89@gmail.com
There is a decent amount of research trying to achieve what you are talking about. I need to run, but in the meantime you can play with this: http://www.jetbrains.com/mps/
I'm not necessarily trying to achieve anything. I'm just looking for a debate partner to test my theories. (assuming that post was aimed at me.)
Nevermind. It seems that comment was directed at the original post. Interesting link, although I'm a bit skeptical of anything with the words "Martin Fowler" on the page.
Something that's missing from the list: security/sandboxing. Although proof-carrying code eliminates the need for runtime sandboxing, there'll always be cases where the code doesn't carry a proof, or doesn't try to prove the property you're interested in. And as we all know, security as an afterthought doesn't work.
Re CW: I think a compiler as a set of macro libraries is insufficient, because plain macro expansions don't do optimization, and without some optimizations the higher-level abstractions would quickly become dog slow. Doing the optimisations automatically is again a question of building a good enough theorem prover.
A 'good enough' theorem prover would also solve the problem of translating language X to language Y if the languages are quite different, but I'm afraid the 'good enough' theorem prover will be built on the same day as the 'sufficiently smart compiler'.
Another missing point: if we're going to do optimisations, we need profiling, so support for profiling should be built in. And I think the profiling data should not be discarded when the program exits, but saved between runs, so there has to be some thought about persistent storage as well.
I saw the last post was 2 days ago. I don't know if you continued your discussion over email, but if so I'm interested in joining: somejan [a] gmail
somejan, you make a good point about security. Especially in the beginning when adoption won't be widespread, people are going to want to run code w/o safety proofs. Suppose we run untrusted code in a traditional VM using dynamic checking. We could then reap the benefits of PCC when the safety proofs are available. That would be a major incentive for developers to include the proper proofs. How would that do?
As for optimization, it's true that this would be another benefit of a good theorem prover, and also that it can never be done quite well enough; as you've probably seen, there's a proof that optimizing-compiler writers will always have a job. :-) (You can always improve optimization but never quite get it perfect b/c that's equivalent to the halting problem.) Another tough issue, but all the more reason why theorem provers are so important.
As for profiling, yes, I've thought before that profiling should be automatically done by the compiler/IDE and fed back to the programmer. As it is nowadays, most programmers never even see a profiler because it's complicated enough that they only pull it out as a last resort. (Another example of why having a nice simple interface is crucial.) I imagine profiling as an extension of the transformation engine which captures profiling info, along with other meta-data like inferred types, and tacks it onto the source tree (or at least its display).
Thanks for mentioning these issues.
Oh, and about continuing this discussion over email... we haven't. Would anyone be interested in a new Google Group or something to continue discussions?
There's always comp.lang.misc. It's been rather deserted lately, but I think most of the regulars are still there.