Often, compilers for programming languages are written in their own language. This is less true now that so many compilers are built on complete compiler production systems such as LLVM. LLVM itself is written in C and C++, but it has front-ends for a huge portfolio of languages.
But the first time you come across the fact that a compiler can be written in its own language, your first reaction might be the same as mine:
Huh, how does that work?
When I was an undergraduate at the University of Cambridge, our "local" language was called BCPL (basic combined programming language, nothing to do with the Basic programming language). It is actually the ancestor of C, but that's a story for another day. I was intrigued when I discovered that the compiler was written in BCPL. Reading the source code of the compiler probably did more for my understanding of compilers than our undergraduate course on compiler construction, which was coincidentally taught by Martin Richards, the inventor of BCPL and the creator of its early compilers.
Later, I did my postgraduate work at Edinburgh University. (Pronunciation note for Americans: Edinburgh is not pronounced "edin-borrow", it is more like "embra".) At Edinburgh, our "local" language was called IMP and, yes, the IMP compiler was written in IMP (I don't remember what IMP stood for, maybe IMPerative programming language). In that era, computers and programming languages were a lot less standard than today, where almost everything is x86 or Arm (and, at least in academia, RISC-V). New architectures came along all the time, and often universities and even companies had their own "local" programming languages. This was especially true for system programming languages. The two giant standard programming languages, FORTRAN and COBOL, were for scientific and business programming respectively. However, you couldn't rely on new computers having compilers for either language. Often new computer architectures shipped with only an assembler to get basic machine-code into the computer, and the operating system would be written in assembly code (Unix was one of the first operating systems written in a higher-level language, C, although it did have a fair bit of assembly code too in the early days).
When a new computer came along with a new architecture, a new version of the compiler would be required to support it. But here is the motivation for writing the compiler in its own language: once a back-end for the new architecture existed, the compiler could compile itself for the new machine and so run there, too. This was only possible because the compiler was written in itself.
Let's make it more concrete. I was programming in BCPL long before the PC came along. But now it is 1981, and IBM has just come out with the PC and MS-DOS. Let's assume you want to program this newfangled PC-thingy in BCPL. So you need an x86 BCPL compiler. You already have a BCPL compiler running on whatever computer you normally use. In my Cambridge days, that was mostly either an IBM 370 that ran the university's time-sharing service, or a Modular One computer that the computer science department owned and that only computer science students could use (unlike the time-sharing service, where you were competing with physicists, geologists, and most of the rest of the university).
Compilers are constructed from a front-end, which depends only on the language, and a back-end, which depends on the target computer (technically, the target instruction set architecture, or ISA). So you need to write a new back-end for the x86. Once you have done that, you can "cross-compile" any BCPL program on your usual computer and get it onto the PC (presumably by floppy disk, since the original PC had no networking). That program will then run on the PC.
Once you have that working and debugged, the magic happens. You compile the BCPL compiler's own source code, using the cross-compiler on your usual computer to generate x86 code. But now you are up and running, since the BCPL compiler itself will run on the PC, too. Going forward, you (and presumably your users) can simply work on the PC: write BCPL programs there, compile them there, and run them there. You no longer need your usual computer, and it doesn't matter if it dies or becomes obsolete.
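The sequence of steps above can be sketched as a toy model (the function and ISA names here are invented for illustration; none of this is real BCPL tooling). A compiler binary is characterized by just two things: the machine it runs on and the machine it emits code for.

```python
# Toy model of cross-compiling a self-hosted compiler.
# A compiler binary is a pair: (ISA it runs on, ISA it emits code for).

def compile_compiler(compiler, backend):
    """Run `compiler` on the compiler's own source, which has been given
    a back-end for `backend`. The resulting binary runs on whatever ISA
    `compiler` emits, and itself emits code for `backend`."""
    runs_on, emits = compiler
    return (emits, backend)

# The existing BCPL compiler runs on the IBM 370 and targets the IBM 370.
host = ("ibm370", "ibm370")

# Step 1: write an x86 back-end and compile the compiler on the host.
# The result is a cross-compiler: it still runs on the 370 but emits x86.
cross = compile_compiler(host, "x86")

# Step 2: compile the same source again, this time with the cross-compiler.
# Now the compiler binary itself is x86 code: self-hosted on the PC.
native = compile_compiler(cross, "x86")
```

The point of the model is the two-step shape: one pass through the host compiler yields a cross-compiler, and one pass through the cross-compiler yields the native, self-hosted compiler.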
Of course, that assumes that there is already a BCPL compiler for your usual machine. So how did that compiler come into existence? When you create a new language there is no compiler for it, obviously. So you can't write the very first compiler for the new language in its own language because you don't have a compiler for it. So you have to write the very first compiler in a different language, one that you do have a compiler for, maybe FORTRAN. Or if there are no existing compilers, then in assembly code. In practice, you only do that for a small useful subset of the language. This process is known as bootstrapping the compiler. The analogy is to "pulling yourself up by your bootstraps."
When you "reboot" an operating system, the word has the same roots: rebooting actually means running the bootstrap program, today held in read-only memory. But back when I was at university, it was held in ferrite core memory, which didn't lose its contents when powered down. If, for some reason, that memory got overwritten, you would have to reenter the bootstrap from scratch using the switches on the front panel of the computer. Yes, it was as tedious as it sounds.
The pulling-yourself-up process can then begin. You have a version of the compiler that is written in, and can compile, a small subset of the language. Now you can add more features, still using only that small subset in the compiler's own source code. Once those features work, larger programs, including the compiler itself, can use the larger subset. In practice, compilers written in their own language often avoid the most advanced features of the language, since using them makes porting the compiler to a new architecture trickier: those features will not be available early in a port, so you would also need a version of the compiler that does without them. And once everything is working, it makes little sense to go back, reimplement the working code, and debug it all again.
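This staged growth can also be sketched as a toy model (the feature names are invented for illustration). Here a "compiler" is simply the set of language features it can translate, and compiling succeeds only if the source uses no feature the compiler lacks.

```python
# Toy model of bootstrapping a compiler from a small subset of its language.
# A "compiler" is the set of features it can translate.

def compile_source(compiler_features, uses, implements):
    """Compile compiler source code that `uses` some features and whose
    code `implements` support for a (possibly larger) set of features."""
    if not uses <= compiler_features:
        raise ValueError("source uses features the compiler cannot handle")
    return implements

# Stage 0: hand-written in assembly; handles only a small subset.
stage0 = {"ints", "if", "while"}

# Stage 1: the full compiler, deliberately written using only the subset,
# so that the stage-0 compiler can compile it.
stage1 = compile_source(stage0,
                        uses={"ints", "if", "while"},
                        implements={"ints", "if", "while", "structs", "floats"})

# From here on, ordinary programs may use the whole language, but as long
# as the compiler's own source sticks to the subset, recompiling it on a
# new machine only ever needs a subset compiler.
stage2 = compile_source(stage1, uses={"ints", "if", "while"}, implements=stage1)
```

The last line is why keeping advanced features out of the compiler's own source pays off: the fixed point `stage2 == stage1` is reachable from any machine that can run even the subset compiler.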
There is some similarity to self-reproducing programs (yes, that's possible, too) and self-reproducing organisms in biology. There is the same bootstrapping problem. Today, there is a whole mechanism for copying DNA to RNA and then using RNA to make proteins. But that mechanism is itself encoded in the DNA, so how did it get started? Once you have some basic reproductive mechanism working, evolution is all you need to bootstrap more complex mechanisms, all the way up to multicellular organisms like worms...or us. When you hear that our DNA is 90% the same as a cockroach's, or something similar, it is not all that surprising: it takes a lot of DNA to make the basic cellular system work and reproduce. It takes a lot less DNA to make an ear or an antenna once those basic mechanisms are working. But nobody knows how life got started, or even whether it is common or whether it happened just once and we are alone in the universe.

You may have heard of the primordial soup experiments done by Miller and Urey in the 1950s. They created a soup in a vessel with water, ammonia, hydrogen, and so on, and then used a spark generator as a surrogate for lightning. When they analyzed the soup afterward, they found detectable amounts of some amino acids, the building blocks of proteins. But that is still a very long way from a protein, let alone a mechanism for creating proteins, a mechanism for reproducing the mechanism for creating proteins, and a mechanism for reproducing the mechanism for reproduction.
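Those self-reproducing programs are known as quines: programs whose output is exactly their own source code. Here is a minimal sketch of one in Python (the variable `source` is only there so the claim can be checked; the quine proper is the last two lines).

```python
# A minimal self-reproducing program (a "quine"). The program consists
# of the two lines stored in `source`; running them prints exactly those
# two lines, the same trick a compiler plays when it compiles itself.
source = "s = 's = %r\\nprint(s %% s)'\nprint(s % s)"

s = 's = %r\nprint(s %% s)'
assert (s % s) == source  # the program's output is its own source
print(s % s)
```

The trick is the same self-reference as in the DNA analogy: the string `s` is both the data being printed and the template that describes how to print it.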
If it makes your head hurt to think about writing programs in their own languages, thinking about what it takes to get reproducing life is even more mindbending.