Tuesday, October 23, 2007

Everything Is a Compiler

All algorithms can be classified into one of the following.
  • Search Engines
  • Prediction Engines
  • Networks
  • Operating Systems
  • Compilers
This becomes obvious when you look at what programs do from a high level: search, predict, communicate, manage resources, and translate/transform.1


Searching is when you start with a question and your goal is to determine the answer. The answer may be fuzzy, but there is a definite best answer that can be tested. In other words, given two potential answers, a search engine has the ability to test which answer is the better one. Moreover, there are a finite number of potential answers. Examples include string searching and graph shortest path algorithms.


Predicting, like searching, is when you start with a question and your goal is to find a good answer. Unlike search, there are potentially infinite possible answers and/or the best answer can not be tested explicitly. Heuristics must inevitably be used, but there is no guarantee that the chosen heuristics will lead to a good answer. Like fitness functions in a genetic algorithm (which falls under the prediction category), local maximas can be hit without indication. Current chess-playing algorithms are also prediction engines since they can not search the entire game-space of moves and must guess which move will more likely lead to a win.


Networks are made up of multiple entities that are independent in some respects, and the goal is to transfer/transport resources between those entities. In the case when the resource is information, the goal is to communicate. Any problem where entities do not have perfect knowledge of the entire system can be a networking problem. For example, a P2P search engine must use a networking algorithm or protocol to determine the existence of a target on a foreign entity, assuming the searching entity does not have an index which would amount to all-encompassing (possibly perfect) knowledge.

Operating Systems

Operating systems manage limited resources. When a resource is abundant compared to demands for it, management of the resource is not needed. When it is too rare to meet demands, management can not help. But when a resource is scarce, care must be taken to ensure that it is not abused, especially when entities with differing goals are competing for the resource. This is the goal of operating systems: to prevent abuse of scarce resources.


Compilers take an input and give an output that somehow corresponds to the input. In the case of a programming language, a compiler takes in a program in a source language and outputs a program in a target language whose semantics are the same as that of the input program. In this way, the input and output correspond. However, the correspondence need not be unique (or surjective or injective). The compiler acts as a map from one language to another. Moreover, the language need not be a programming language, but something more general like a GUI. In this context, the languages that a compiler translates between are languages in the most general sense — any symbols. If the symbols of the language represent something and have meaning, then we usually call this translation; if they don't, we call it transformation.


How is compilation different from search?

Fundamentally, both search engines and compilers transform an input into a corresponding output.2 The difference is in their approach. With search, there is a finite answer space and the answer is determined via exhaustive comparison with the input, whether that comparison is done at query-time or indexing-time. With compilation, the answer is determined via formulaic transformation of the input. There is a formula or method to get from the source to the target without ever touching all possibilities.

How Is This Helpful?

Realizing that a program you are making is implementing a certain kind of algorithm allows you to take advantage of all previous work on algorithms of that type. Currently, the way primitives and libraries are organized in most languages, they don't lend to revealing the high-level purpose of a program, and it's actually very easy to slowly go from a few parameters   to a configuration file   to a domain-specific language without even realizing it.

When things become that complex, you will wish you had designed your configuration interpreter as a compiler, which people have been making for years   and in general have well-designed methods of handling issues that come up during compilation. Likewise for searching and network protocols. And I suspect the same is true for prediction and resource management.

Algorithm Mismatches

Some algorithms pretend to be in one category when the problem they are trying to solve is fundamentally in another category. For example, web search. Certain cases aren't really search at all; they're prediction — because a person doesn't know what he's searching for until he's seen it. In other words, a person types in "Brad Pitt's birthday" as a Google Search, and there is a definite answer. But if someone searches for "cook", does he mean "cook" the verb or "cook" the proper name? If he means the name, does he mean the company, the county, or the person? (This is exactly why there are things like Clusty.) Or maybe his intent was literally to find all the things related to the word "cook", which most search engines assume. Because this is a search problem they can solve.

But the fundamental problem is ultimately prediction because even if the search weren't ambiguous, say it were about cooking the verb, what about cooking is the person looking for? Maybe the person is looking for a good recipe to impress his honey this weekend. Or maybe he doesn't know how to cook at all and wants to learn. Or maybe he's a master chef wanting to read articles about cooking.

An application developer will say that if this were the case, the person should have typed in exactly that: "articles about cooking". But my point is that even if he did type in "articles about cooking", which article is the one that will satisfy him? And this is something the person does not even know, until he's seen it. This is why the ultimate problem is prediction, not search. Search is merely an approximation. It may be an extremely useful one, but it's still an approximation. And seeing this mismatch will allow you to take your algorithm to the next level.3


Repetitiveness of Programming

Programming — as often as programmers attempt to never repeat themselves — is an extremely repetitive task. From a high-level view, all programs do these 5 things in varying combinations and details of problem domains. So the question arises... Can we simply master these 5 things, create the ultimate library for each of them, and only specify the details or parameters when writing a new program? At first, this seems like what we do already with programming languages. However, what this pattern suggests is that the primitives of the language should be Searching, Predicting, Networking, Managing Resources, and Compiling. Not looping, maintaining data structures, or other more low-level operations that are inherent in implementing the Big 5.

This is not a brand-new idea; it is just taking the next step. Already we see other languages that have just one of these primitives, and they excel beyond the rest. For example, Perl's regular expressions are a form of searching. They did it right and made it a language primitive. When people saw how powerful Perl regular expressions were, they duplicated them in practically every other language. Regular expressions though are still very specific in that they are only designed to search strings for a limited kind of patterns.

What would the interface to these 5 modules look like? I'm not sure, but the ultimate goal is that they entail a much higher-level language than mainstream programming languages today.


When you start to look at what I'm suggesting, you might recognize that this has already been thought of. It is the idea that your program is a specification for a search engine or a specification of a compiler, and your new compiler knows how to create the search engine or compiler from your specification of it. In other words, it is a search engine compiler. Or a compiler-compiler.

Years ago, much research was done on creating compiler-compilers. As a result, we have tools like Lex and Yacc. "Yacc" even stands for "Yet Another Compiler-Compiler". The research went further though, attempting to automatically generate the next phase of compilation. Other phases have been completely automated in a similar way with tools like ml-burg (or its equivalent in other languages like JBurg).

Similarly, a specification of a network protocol should entail its implementation. A policy for resource usage should entail its implementation. Likewise for the other categories.

Then the specification to combine these — yet another instance of one of the above types — yields the whole program.
1. If you haven't noticed by now, I don't usually bother with proofs. All axioms and even the rules of inference themselves are intuitive assumptions. For those who don't see that, I challenge you to try to prove induction, or better yet   deduction. (My study of epistemology taught me that intuition is the complementary tool of logic, and that the only way to know something for certain is to know the knower.) I realize this opens me up to a lot of criticism, but so be it. ...As long as we keep questioning.

Are there any categories that I missed?

I didn't include databases even though some might argue they are more fundamental since all algorithms use some kind of data structure. However, even though data drives all algorithms and the algorithms themselves are a form of data, no one would care about the data if they could get the results of the algorithms without the data. So databases are an implementation detail, and I am only concerned with the highest level in this post.
2. If you want to get really high-level, every program does this — transform input into a corresponding output. And this is what led me to the title "Everything Is a Compiler". ...Except perhaps truly intelligent machines, which some argue   are more than I/O.
3. Realizing that web searching is really prediction, you start to see that current criteria like frequency of keywords on a page or page rank make up the fitness function of a genetic algorithm. The fittest pages are then presented to the user in search results, who then clicks through and/or refines his search query. Smart search engines actually learn from this user-interaction — hence things like Google's search history and bookmarks. And they evolve their fitness function based on this. Voilà — co-evolution drives one of the most used, most influential websites of all time.

No comments: