Copyright © 2002, 2005, 2006 Gene Michael Stover. All rights reserved. Permission to copy, store, & view this document unmodified & in its entirety is granted.
Cite: Please cite with a form similar to that of the entry for this article in its own bibliography ([3]).
Currently1, there is interest in compiling source code to instructions for virtual machines in an attempt to write portable applications. That approach is commonly called Write Once Run Anywhere (WORA), but it's possible to achieve the same goals of portable source code by using C instead of virtual machine byte-codes as the portable intermediate language. What's more, virtual machine byte-codes have disadvantages that C does not.
Here's a description of what I mean when I suggest compiling to C as a portable intermediate language.
Say you want to write programs in some source language L. It could be any programming language. Language L could be Java, C++, Ada, Lisp, Smalltalk, Perl, Pascal, Algol, Fortran, Cobol, Bourne shell, C itself, or any other programming language. Literally, L can be any programming language you can implement.
So you write your program in L.
You feed your program to a compiler that compiles L to C.
Then you feed the C program to a compiler that produces native object code. You process the object code to produce a natively runnable program in whatever way the native platform requires.
Now you have a natively runnable program compiled from the program you wrote in L. You run that program.
That's it. There's no magic. You compile L to native code via C.
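To make the pipeline concrete, here is a minimal sketch of an L-to-C compiler, written in C. Everything in it is an assumption made for illustration: the toy language L (reverse Polish arithmetic over non-negative integers with +, -, & *), the names compile_rpn_to_c & emit, & the shape of the C it produces. A real compiler would be far larger, but the flow is the same: read L, write C, & leave native code generation to the C compiler.

```c
#include <assert.h>
#include <ctype.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Output buffer for the generated C text. */
struct cbuf {
    char *data;
    size_t cap;
    size_t len;
    int overflow;   /* set when the generated text did not fit */
};

/* Append formatted text, noting overflow instead of writing past the end. */
static void emit(struct cbuf *b, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(b->data + b->len, b->cap - b->len, fmt, ap);
    va_end(ap);
    if (n < 0 || (size_t)n >= b->cap - b->len)
        b->overflow = 1;
    else
        b->len += (size_t)n;
}

/*
 * Toy L-to-C compiler (hypothetical). Translates an RPN expression in
 * `src` into a complete C program stored in `out`. Returns 0 on success,
 * -1 if the L program is malformed or the output does not fit.
 * Numeric overflow in literals is not checked in this sketch.
 */
int compile_rpn_to_c(const char *src, char *out, size_t cap)
{
    struct cbuf b = { out, cap, 0, 0 };
    int stack[64];   /* numbers of the C temporaries holding operands */
    int depth = 0;   /* current operand stack depth */
    int next = 0;    /* next unused temporary number */

    assert(src != NULL && out != NULL && cap > 0);
    emit(&b, "#include <stdio.h>\nint main(void)\n{\n");
    while (*src != '\0') {
        if (isspace((unsigned char)*src)) {
            src++;
        } else if (isdigit((unsigned char)*src)) {
            char *end;
            long value = strtol(src, &end, 10);
            if (depth == 64)
                return -1;
            emit(&b, "    long t%d = %ldL;\n", next, value);
            stack[depth++] = next++;
            src = end;
        } else if (strchr("+-*", *src) != NULL) {
            int rhs, lhs;
            if (depth < 2)
                return -1;
            rhs = stack[--depth];
            lhs = stack[--depth];
            emit(&b, "    long t%d = t%d %c t%d;\n", next, lhs, *src, rhs);
            stack[depth++] = next++;
            src++;
        } else {
            return -1;
        }
    }
    if (depth != 1)
        return -1;
    emit(&b, "    printf(\"%%ld\\n\", t%d);\n    return 0;\n}\n", stack[0]);
    return b.overflow ? -1 : 0;
}
```

Given the L program ``2 3 + 4 *'', the sketch emits a C program that declares temporaries t0 through t4 & prints t4. Feeding that output to any C compiler produces the native executable.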
One advantage of compiling to C is portable applications.
Let's say you are an open source or free software developer. (I'll discuss shrink wrapped software distribution later.) You can distribute your software to users who don't have compilers for L.
To make a distribution file, you compile your program to C & stop there. You package that.
Your user obtains a copy of the distribution archive, expands it, runs a configuration step, & then runs the C compiler on the intermediate C code that you compiled & distributed. The user then has a natively executable version of your program.
Okay, you're saying ``My users are not technical. I can't require them to run a C compiler; even a configuration step is asking a lot of them.''
I reply, ``You are absolutely correct. You are also unimaginative.'' I'll describe ways of hiding the configuration & installation steps on Windows, Mac, & Unix so that your user sees exactly the type of installation process that's usual for his platform. Let's start with Unix.
One common configuration & installation process on Unix is ``./configure; make all check install''. It doesn't take much imagination to see how C as an intermediate language can translate to this just fine. ./configure determines system settings, as it always has. Then ``make all'' compiles the program, but when you (the programmer) created the distribution archive, you compiled the L source files to C, so your user's computer only needs to have a C compiler. In fact, there's precedent for this in the C files derived from (in other words, compiled from) the Bison & Flex source files in GNU's gcc compiler.
So much for Unix.
Mac & Windows use graphical installation programs, but the same principles apply. During the configuration stage, the installation program can, well, determine the configuration. Then it can run make (or an equivalent program) & a C compiler to produce native executables. The user doesn't need to be aware that the C compiler is running. He just sees the progress bar that tells how much of the process has been done & how much remains.
Okay, now you are legitimately pointing out that the average Mac & Windows computer does not have a C compiler because those operating systems don't ship with C compilers. This is a problem that I hope would be rectified if the idea of C as a portable intermediate language became popular. Notice that the target host only needs a basic C compiler & a simple make, not a huge integrated development environment. The alternative of delivering a program compiled to byte-codes for a virtual machine requires that the user has a virtual machine & its complete run-time environment installed, & that's surely larger than a plain C compiler & make. Even worse, most Java applications I've seen ship with the entire run-time.
So yes, the technique of C as an intermediate language has a stumbling block of requiring a plain C compiler & a make (or equivalent) on those platforms, but it's no more painful than the problems VM-based applications cause. More on this later.
I've described the two parts of the technique I'm suggesting. For clarity, Figure 1 shows the steps to the technique again.
[Figure 1: The steps of the technique]
So let's say you deliver shrink-wrapped software, & your source code is proprietary, neither open nor free. How does C as an intermediate language translate to you?
Answer: It translates just fine. You use the same development steps & distribution steps I described for open source & free software developers, except that when you create your distribution archive, you don't include the L source files the way the open source & free software developers did. You include the C files & other files required to compile your program, but not the L files.
You might be thinking ``But my algorithms are in the C source files so that people (programmers at least) can read them, & I need to keep them secret''.
Not quite true. Your algorithms were translated to C, true, but it's not a C that's meant for humans to read. When I say that we compiled L sources to C, I mean we really compiled it. The L-to-C compiler sucked up your L source files, analyzed them, & produced C code for a C compiler. It didn't produce C code for another human; it's not one of those ``please can someone give me a program that converts Pascal to C so I can have all my Pascal programs in C'' programs that newbie programmers sometimes request on Usenet. It's a real compiler interested in converting the run-time semantics of your source code to C, but it doesn't make human-readable or human-maintainable C code. Even the symbol names from your L sources are lost. (More on this topic later.) So your secrets are safe.
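As a rough idea of what such output might look like, here is a sketch of C that a compiler could plausibly emit for a factorial routine written in L. The function name f17, the temporaries, & the labels are all invented for illustration; no particular L-to-C compiler is being quoted. The point is only that the original identifiers & control structures need not survive.

```c
/* Hypothetical machine-generated C for a factorial routine in L.
 * Names, temporaries, and labels are invented for illustration. */
long f17(long a0)
{
    long r0, r1;

    r0 = 1L;
    r1 = a0;
L0:
    if (!(r1 > 1L)) goto L1;
    r0 = r0 * r1;
    r1 = r1 - 1L;
    goto L0;
L1:
    return r0;
}
```

The code is valid C & computes the right answer, but nothing in it says ``factorial'' & the loop has been flattened to gotos.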
And if you want to distribute executable, binary files (as most shrink-wrapped software is nowadays, anyway), you still benefit from compiling L to C as an intermediate language because you need only one L-to-C compiler, which can run on whatever computer type you want. You only need C-to-native compilers on the types of systems you want to support.
A benefit to compiling to C as an intermediate language is that C is ubiquitous.2 So if someone implements just one L-to-C compiler (hopefully in C), then language L is available on nearly every computer in existence.
There can be just one L-to-C compiler. It can be implemented & debugged once. That's Write Once Run Anywhere (WORA).
It's fairly easy to write a compiler whose output language is C. It's a lot easier than writing a compiler whose output language is machine code. It is similarly fairly easy to debug a compiler whose output language is C instead of machine code.
Because L was compiled to C, then to native, your program executes at native speed; it does not have the performance penalty of a virtual machine. Sure sure sure, a virtual machine with a Just In Time (JIT) compiler can convert the VM's byte-codes to native code to overcome the performance penalty of a VM, but why bother? You can compile L to C to native & be done with it.
And why bother to implement the JIT, which is just a compiler that outputs native code, when someone has already implemented such a compiler (the C compiler)? Ever hear of ``code reuse''?
These days, C compilers produce efficient code. So although an L-to-native compiler could produce smaller or faster code than L-to-C-to-native, the difference will be minimal. What's more, there is more semantic information in a C program than in VM byte-codes, so the C compiler has a better chance at optimizations than the JIT in the VM.
If all your programs compile to C, it can be easier to integrate different languages because, ultimately, they all use the same calling convention. You might have to twiddle with indirections or support libraries, but all in all, integrating multiple languages that are compiled to C should be simpler than learning & trying to fit together the different language-integration techniques for all those languages that compile to native code in their own way.
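A small sketch of the idea, assuming two hypothetical compilers (Pascal-to-C & Lisp-to-C) whose output is shown here by hand. The function names pas_square & lisp_sum_of_squares are invented. Because both routines end up as ordinary C functions, one calls the other with no foreign-function glue.

```c
#include <stddef.h>

/* As if emitted by a hypothetical Pascal-to-C compiler. */
long pas_square(long x)
{
    return x * x;
}

/* As if emitted by a hypothetical Lisp-to-C compiler. It calls the
 * Pascal-derived routine directly through the ordinary C calling
 * convention. */
long lisp_sum_of_squares(const long *items, size_t count)
{
    long total = 0;
    size_t i;

    for (i = 0; i < count; i++)
        total += pas_square(items[i]);
    return total;
}
```

Languages with richer run-time models (garbage collection, exceptions, closures) would still need the indirections or support libraries mentioned above; this sketch only shows the simplest case.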
I have heard an anecdote that some optimizations can only be performed at run-time & that such optimizations are better than those which can be performed at compile-time. The story I heard claims that Hewlett-Packard has a run-time optimizer for HPPA that improves performance. In other words, HP has an HPPA native-to-native JIT.
If this is true, then there are some optimizations that can be performed by a JIT but not by a pre-run-time compiler (such as a C compiler).
I have not been able to confirm or refute that anecdote. (I haven't tried very hard, either.)
I find this claim difficult to believe. What optimizations could be performed at run-time rather than compile-time? In a language that uses late binding, maybe a function could be compiled, at run-time, to each actual data type on which it is invoked; the compiled functions could be saved in a dispatch table keyed on actual data types. That would cost a lot of memory, & languages with late binding often (usually?) provide for optimizations that can be specified at compile time. Lisp's declare special form is an example.
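The dispatch table described above can be sketched in C. This is an illustration under stated assumptions, not a run-time compiler: the type tags, the value representation, & the names negate & negate_table are all invented, & where a JIT would generate specialized code on first use, the sketch merely installs a function that was compiled ahead of time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical run-time type tags for a late-bound language. */
enum type_tag { TAG_INT, TAG_DOUBLE, TAG_COUNT };

struct value {
    enum type_tag tag;
    union { long i; double d; } as;
};

typedef struct value (*unary_fn)(struct value);

/* Specializations, one per actual data type. */
static struct value negate_int(struct value v)
{
    v.as.i = -v.as.i;
    return v;
}

static struct value negate_double(struct value v)
{
    v.as.d = -v.as.d;
    return v;
}

/* Dispatch table keyed on the actual data type.
 * A NULL slot means no specialization has been installed yet. */
static unary_fn negate_table[TAG_COUNT];

struct value negate(struct value v)
{
    assert(v.tag < TAG_COUNT);
    if (negate_table[v.tag] == NULL) {
        /* A run-time compiler would generate code here. This sketch
         * just installs a function that already exists. */
        negate_table[v.tag] = (v.tag == TAG_INT) ? negate_int
                                                 : negate_double;
    }
    return negate_table[v.tag](v);
}
```

The memory cost mentioned above shows up as one table slot, plus one specialized function body, per operation per data type.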
Shut up & get a sense of perspective.
I believe virtual machines, with their run-time interpretation or even with Just In Time compilation (JIT), are the wrong way to achieve portability. They have performance penalties, size penalties, reliability issues, portability issues, & they effectively just re-implement perfectly good functionality that already exists in C compilers.
Creating a new intermediate language, such as Microsoft's Intermediate Language for Dot-Net, is just as bad. C is public, standardized, documented, understood, widely known, widely ported, & already exists.
Compiling to C as an intermediate language is a better way of achieving application portability than are virtual machines or a new intermediate language. Compiling to C is a good way to implement almost any higher-level language.
Since originally writing this essay, I have learned a little about UNCOL.
Many people have asked whether this technique would work with interactive debuggers. The question didn't occur to me when I wrote the essay because I don't use debuggers.3
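There's some small hope that a debugger could use
#line
directives in the C code to redirect it to the original source files. (Those directives are the standard C way to change what __FILE__ & __LINE__ report; the macros themselves can't portably be redefined with #define.)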
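The standard C mechanism for remapping source positions is the #line directive, which sets the line number & file name that the compiler attributes to the lines that follow. Here is a minimal sketch of how generated C might use it. The file name hello.l, the line number 42, & the function where_am_i are invented for illustration. Whether a given debugger then steps through the L file depends on the compiler's debug information & on the debugger.

```c
/* A source position as the C compiler sees it. */
struct position {
    const char *file;
    int line;
};

/* Stands in for C that a hypothetical L-to-C compiler generated from
 * line 42 of an L source file named hello.l. */
struct position where_am_i(void)
{
    struct position p;
#line 42 "hello.l"
    p.file = __FILE__; p.line = __LINE__;
    return p;
}
```

After the directive, the compiler treats the next line as line 42 of hello.l, so diagnostics & __FILE__/__LINE__ refer to the L source rather than to the generated C.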
If that didn't work, I don't know of a way to get run-time debugging unless someone modified the debugger.
The compiler could always inject C code to help debugging without a run-time debugger. It could insert memory- & pointer-validation code, execution traces for a log file, & readable ``core dumps'' from the virtual machine when things went really bad.
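A sketch of what such injected code might look like, assuming a hypothetical L statement ``return a[i]'' on line 7 of a file named demo.l. The helper names l_trace & l_check_index, & the generated function itself, are invented; they only illustrate an execution trace & a validation check that carry L source positions.

```c
#include <stdio.h>

/* Hypothetical run-time helper: append an execution trace record. */
static void l_trace(FILE *log, const char *l_file, int l_line)
{
    fprintf(log, "trace: %s:%d\n", l_file, l_line);
}

/* Hypothetical run-time helper: validate an index before it is used.
 * Returns 1 if the index is in range, 0 (after logging) if not. */
static int l_check_index(FILE *log, long index, long length,
                         const char *l_file, int l_line)
{
    if (index >= 0 && index < length)
        return 1;
    fprintf(log, "%s:%d: index %ld out of range 0..%ld\n",
            l_file, l_line, index, length - 1);
    return 0;
}

/* As if generated from the L statement "return a[i]" at demo.l:7,
 * with the trace & the check injected by the compiler. */
long generated_fetch(FILE *log, const long *a, long length, long i, int *ok)
{
    l_trace(log, "demo.l", 7);
    *ok = l_check_index(log, i, length, "demo.l", 7);
    return *ok ? a[i] : 0;
}
```

The log lines name positions in the L source, so a failure can be located without a run-time debugger.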
I wonder if, in a really desperate move to get run-time debugging to work, the C code could have comments for each line of L source code so that, if you ran the run-time debugger on the executable, it'd show you the problem place in the C code, & the comments would give enough info about the corresponding location in the L source code.
Gene Michael Stover 2008-04-20