Paul McLellan

artificial intelligence
risc-v
neural networks
Krste Asanović
sifive

Accelerating AI: Past...

21 May 2018 • 9 minute read

SiFive does a quarterly series of tech talks, not necessarily directly to do with SiFive or even RISC-V. For example, last quarter it was Paul Kocher (and if you don't know that name, you need to go and read my post about that talk, Paul Kocher: Differential Power Analysis and Spectre). This quarter it was Krste Asanović on Accelerating AI: Past, Present, and Future. This post will cover the past. The present and future will have to wait (good title for a movie?).

If you know Krste's name at all, then it is probably as one of the leaders of the team that created the RISC-V ISA and as its primary evangelist. In one presentation I saw him give last year, he threw out a line that his PhD is not in instruction set architectures, but in neural networks. "Bad sense of timing, I was twenty years too early." As you know, unless you have been on sabbatical in Timbuktu, neural networks are one of the hottest areas in computer science, semiconductors, automotive, vision, and artificial intelligence. But having worked in and around neural nets for his whole career makes him the ideal person to tell the whole story.

Krste opened by pointing out that we have applications, things we want to do. And we have technology, the stuff we have to do it with, which today is primarily silicon. Computer architecture is what comes in between the applications and the technology. Today, the hardest application we are dealing with is AI, and so there is a lot of investigation going on into what is the best architecture. As Krste indicated in his throwaway remark about being too early by twenty years, he has been in this field for a couple of decades. But it actually goes back a couple of decades further than that.

The Prolog Era

Let's start in the 1980s with Japan's Fifth-Generation project. MITI (the Japanese Ministry of International Trade and Industry) had been very successful in directing a large part of the economy to improve quality, and had also identified semiconductors as an area for special investment. Today, when Japan is famous for its high quality, it is hard to remember that in the 1960s it was apparently a joke. I grew up in England, and "made in Hong Kong" was a similar joke (and, I only realized later, meant made in China and somehow exported through Hong Kong to avoid trade barriers).

Anyway, MITI decided that the future was artificial intelligence (AI), that logic programming was the way to get there, and that programs would all be written in Prolog. It was then just a question of building optimized fifth-generation architectures to run them faster and faster.

(In a weird coincidence, the primary textbook for learning Prolog was Programming in Prolog by William Clocksin and Christopher Mellish. When I first went to Edinburgh to do my PhD, I had to find somewhere to live from 400 miles away (this was before email, the Internet, etc.) by phoning the University Housing Bureau. They found me a room in Morningside, famous for its accent. In Muriel Spark's book The Prime of Miss Jean Brodie, Miss Brodie lives there. If you've seen the movie of the same title, her accent (well, Maggie Smith's) is a Morningside accent. The joke is that in Morningside "sex [sacks] are what you put your potatoes in". Another guy who had done the same thing from much further afield, Washington State, was Bill Clocksin. So for a year he and I were roommates, although I think this was before he was doing much with Prolog, since I don't recall coming across it until I was already in the US. In that era, Edinburgh's departments of AI and CS were separate, and miles apart, so we didn't work together at all.)

If you don't know anything about Prolog, you can think of it as a language that makes it easy to express constraints (parts of the Specman e verification language, and some aspects of formal verification, have things in common with it). One example that is easy to explain is Sudoku. A valid Sudoku board has each of the digits 1 through 9 exactly once in every row, column, and 3x3 square. The code below expresses that. Unlike an imperative or even a functional programming language, this can run in a number of ways: trivially testing whether a given solution is correct, but also generating all valid Sudoku squares, or completing partial squares (which is what we think of as "solving" a puzzle):

:- use_module(library(clpfd)).   % for ins/2, all_distinct/1, transpose/2 (SWI-Prolog)

% A valid grid: 9 rows of 9 cells, every cell in 1..9, and all rows,
% columns, and 3x3 blocks made up of distinct values.
sudoku(Rows) :-
    length(Rows, 9),
    maplist(same_length(Rows), Rows),
    append(Rows, Vs), Vs ins 1..9,
    maplist(all_distinct, Rows),
    transpose(Rows, Columns),
    maplist(all_distinct, Columns),
    Rows = [As,Bs,Cs,Ds,Es,Fs,Gs,Hs,Is],
    blocks(As, Bs, Cs),
    blocks(Ds, Es, Fs),
    blocks(Gs, Hs, Is).

% Take three rows at a time, three cells from each, and constrain
% every resulting 3x3 block to hold distinct values.
blocks([], [], []).
blocks([N1,N2,N3|Ns1], [N4,N5,N6|Ns2], [N7,N8,N9|Ns3]) :-
    all_distinct([N1,N2,N3,N4,N5,N6,N7,N8,N9]),
    blocks(Ns1, Ns2, Ns3).
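As a quick illustration of the "completing partial squares" mode (my own example in SWI-Prolog, not something from Krste's talk), you bind the cells you know, leave the rest as variables, and let constraint propagation plus labeling fill in the blanks:

?- Rows = [[5,3,_, _,7,_, _,_,_],
           [6,_,_, 1,9,5, _,_,_],
           [_,9,8, _,_,_, _,6,_],
           [8,_,_, _,6,_, _,_,3],
           [4,_,_, 8,_,3, _,_,1],
           [7,_,_, _,2,_, _,_,6],
           [_,6,_, _,_,_, 2,8,_],
           [_,_,_, 4,1,9, _,_,5],
           [_,_,_, _,8,_, _,7,9]],
   sudoku(Rows),                     % post the constraints
   maplist(labeling([ff]), Rows),    % search for concrete values
   maplist(portray_clause, Rows).    % print the solved grid

Call the same sudoku/1 predicate with a completely unbound 9x9 grid and the same labeling step, and it will enumerate valid boards instead, which is the "generating" mode mentioned above.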

It is hard to remember how threatened people in the West felt by Japan's Fifth-Generation project. MITI seemed invincible. They had declared that Japan should get into automobiles, and suddenly Japanese cars were of higher quality than Detroit's and imports were starting to shake up the market. They had declared that Japan should get into semiconductors, and they had driven the US out of the memory market. They declared that Japan would build AI computers, and there was no obvious sign you could point to that indicated how totally the program would fail.

At the time, Krste was an undergraduate in the UK and thought, "maybe they know something." He went to work on the Padmavati project at the GEC Hirst Research Centre in 1987-89. They were building a machine for natural language processing called SPACE. It had 170,000 processors, with 148 processors per chip. The problem was that each processor only had 36 bits of storage. It was built in 1.2um (1200nm) CMOS and ran at 8MHz. In some ways, it was the ultimate in computing-in-memory, since every memory bit had logic attached. Everything was done by associative lookup. It was designed to accelerate Prolog unification and LISP associative arrays. However, Krste and others soon discovered that if you programmed on the bare metal, and bypassed Prolog and LISP, it was over a hundred times faster.

Lessons that Krste took away from that:

  • The language needs to match the application domain
  • You need high memory capacity to hold the problem state, or else you waste too much effort swapping parts of it in and out
  • Fine-grained computing-in-memory is very inefficient since only a tiny fraction of the memory is involved in any compute step
  • Bit-serial arithmetic is very inefficient; in practice, most of the machine cycles go to sequencing multi-bit adds

Since the world is not programming everything in Prolog today, something else happened in AI, and that something was neural networks. They had actually started back in the 1950s with single-layer perceptrons, with limited success.

The ICSI Era

In 1989, the International Computer Science Institute (ICSI) in Berkeley was working on the RAP machine (Ring Array Processor) for fast training of "big dumb" networks for speech recognition. It was based on TI DSPs, with 4 per board, and up to 10 boards. It was fast and flexible, but at $100K+ (in 1989 dollars) it was too expensive; you couldn't give one to each researcher.

Krste, now a naïve grad student, said:

I joined the group to design custom chips for neural nets, sounded cool.

In those days, MOSIS had a "tiny chip" program where it cost $500 to fab a 2.2mm x 2.2mm chip in 2um (2000nm) CMOS. They used this to build various bits of the Highly Pipelined Network Trainer, or HiPNeT-1.

But before they were finished, the language people came up with a new network architecture, which needed a different hardware architecture, and so completely different chips. Like any computer scientist faced with that sort of problem (people who keep changing the specification), the answer is to make it programmable. So they built a VLIW/SIMD (sorry, but if you don't know what these stand for already, just telling you the words won't help you understand what they mean) machine, with a vector unit and a scalar unit, similar to many embedded DSPs today (Tensilica's basic architecture is similar, although with more of everything).

The SQUIRT test chip from 1992 was built in 1.2um CMOS with 2 metal layers, ran at 50MHz, consumed 400mW at 5V (yes, remember we used to have 5V power supplies), and was 8x4mm. Then, in 1992-95, they built the CNS-1, the Connectionist Network Supercomputer.

One of the things that they had learned from watching earlier work at Thinking Machines was just how important the physical enclosure was. The benchmark program was to evaluate a network with a million units and a thousand connections per unit, 100 times per second. This required about 200 GOPS (for comparison, the neural engine in the iPhone X decades later is 600 GOPS).
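Just to sanity-check where that 200 GOPS comes from (my arithmetic, not a slide from the talk), counting each connection update as one multiply-accumulate, i.e., two operations:

% 10^6 units x 10^3 connections/unit x 100 evaluations/s,
% at 2 operations (multiply + add) per connection update.
?- Ops is 1000000 * 1000 * 100 * 2,
   GOPS is Ops / 1.0e9.
Ops = 200000000000,
GOPS = 200.0.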

Then there came a realization. Wait, we haven't finished SPERT and we're going to do another processor...who's going to write all the software? It turned out that the correct answer was to scrap the whole thing, since VLIW is a major pain, with no upward compatibility. Even the scalar compiler was complex, and the parallel compiler was harder still. Even assembly was next-to-impossible to write.

It's the Software, Stupid

The big realization: it's all about the software. So don't build something that is impossible to program. Instead, use a commercial RISC, add a vector unit, and extend a standard compiler. Basically, put a Cray on a chip. They called it Torrent-0 or T-0.

The candidate system architectures were SPARC, HP-PA, PowerPC, Alpha, and MIPS. They picked MIPS since there were good tools, desktop workstations they could use for development, and even a 64-bit extension. However, there were no soft cores; this was still before the age of synthesis. Verilog was just showing up. So they decided to build their own MIPS core. After all, how hard can it be?

They tried to get a license to the MIPS architecture in 1991, just to use the instruction set. But it was $2M, and that was just for "their blessing and some test vectors".

If you want to know where RISC-V came from, then this was one of the formative experiences.

So they cheated. They went ahead and built it anyway, and left out the few bits that they knew were patented and that they didn't need. HP built it (they still had their fab in Boise in those days). It had 8 MACs per cycle, a 32-bit datapath, and 16x16 multipliers. It was 16.7mm square in 1um CMOS, ran at 40MHz, and filled the reticle completely.


The chip, SPERT-2, worked. 35 boards were built in 1995 and shipped to nine international sites. It was used as a research platform for nine years (which is more like a century in computer years) and was last powered up in anger in 2004.

The Wilderness Years

Neural networks faded in popularity and became a niche. In 1996, Intel introduced MMX, which added narrow fixed-point arithmetic and data parallelism, primarily for MPEG video decode. So the multimedia instructions already provided almost all the capabilities neural networks needed. If that wasn't enough, Moore's Law was still in full swing, and a few years later your code would just run faster without your having to do anything.

That was the past. But neural networks are back..."present and future" in a second post. 

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.