Yesterday, in Running the Program for an Embedded System, I wrote about using emulation and FPGA prototyping to test and debug the software load for an SoC. But I didn't discuss where the software came from, just that somehow there was some object code to be run. Of course, you could just have written a lot of C code, but these days there are more specialized types of code, and more specialized languages and environments. Today, we'll take a look at this area.
There are embedded systems, such as Arduino or Raspberry Pi, or even a smartphone, that come "ready to program". In the rest of this post, I'm going to focus on designing an SoC, since that is the area with the biggest challenges and the most complex tradeoffs. In fact, even those "ready-to-program" systems look a lot more complex if you are responsible for the lower-level code. There is a big difference between programming a smartphone app for looking after conference agendas, and programming the part of the system that handles text messages, phone calls, and the like.
I said above that pretty much everything has a general-purpose processor as the heart of the system, to orchestrate everything else that has to happen. But increasingly, the general-purpose processor is enhanced by one or more domain-specific processors.
I wrote a series of three blog posts on the reasons for this trend, namely why further increases in compute performance require domain-specific processors.
The result is that an embedded system often contains several codebases: code for the general-purpose processor, typically written in C or C++ (other languages are available); DSP code, typically developed in MATLAB; and a trained neural network described in one of the neural-network environments such as TensorFlow, TensorFlow Lite, Caffe, or Caffe2. For embedded applications, the TensorFlow Lite and Caffe2 variants are more lightweight and probably the most appropriate.
Each of these environments has its own flow to get from the level at which the programmers describe the algorithm down to the object code itself.
The simplest case is C++ or another general-purpose programming language such as Python, Go, Swift, or Rust. I'm just going to use C++ as the generic language since the basic approaches remain the same. Other languages are available, as I discussed in my post last week Programming Languages for Embedded Systems.
For code that runs at a high level, and does not interact closely with the hardware, the simplest approach is to compile the code to run on a PC or a server in a farm. Most of the debugging can be done at this level, without ever needing to cross-compile the code to get a binary that can run on the actual microprocessor in the SoC. However, for code that runs close to the hardware, and for accurate performance monitoring, the code needs to be compiled to the real machine code for the processor(s) and then run on either the real chip or a surrogate for it. More about that in tomorrow's post.
The actual development environment is often whatever the team wants to use. Typically this will include an IDE (integrated development environment), often based on Eclipse. The compilers may be any of the commonly used ones like gcc and LLVM, or there may be a requirement to use a specific compiler as a result of the operating system, debug environment, or other restriction. For most languages, the compilers, debuggers, and development environments are all open source. The big exception is when the code has to be certified, and then there may be limitations on which compilers are acceptable as part of a toolchain producing software that has to meet safety certification, such as in avionics (DO-178B, Software Considerations in Airborne Systems and Equipment Certification) or automotive (ISO 26262, Road Vehicles: Functional Safety).
For example, Cadence partner Green Hills Software's compilers are certified.
It is not universal, but most DSP algorithms start life in MATLAB (from another Cadence partner, MathWorks) using full floating-point arithmetic. The details of the algorithm and its accuracy can be assessed in MATLAB. It used to be the case that floating-point arithmetic in an embedded processor was slow and power hungry, so nobody would consider using it in the SoC. That is no longer the case. However, a fixed-point version of the algorithm is most likely even faster and even lower power, so it may be worth doing the analysis needed to do the conversion safely.
It is worth pointing out that there are two other ways to implement algorithms that start life in MATLAB, other than running code on a specialized DSP. MATLAB can write out Verilog under some circumstances, which can then be synthesized into gates with Genus synthesis. More flexibly still, the algorithm can be converted into C or C++ and then Stratus High-Level Synthesis (HLS) can be used to create Verilog that meets the performance constraints. For a good example of doing this, see my post Designing a Wi-Fi HaLow Baseband in Less than Six Months about Methods2Business creating a family of Wi-Fi routers with different performance points from a single MATLAB model.
For running code, MATLAB can output C. But the situation is not usually quite the same as with handwritten C since there is typically just one choice of compiler for the DSP chosen. For example, the Tensilica DSPs (such as the Vision Q6 DSP) have a compiler that is targeted to the precise instruction set of the processor, including any instructions added via the TIE (Tensilica Instruction Extension). In fact, any configurable processor has those additional degrees of freedom since, if you don't get the performance you want, in addition to simply changing your source code, you can modify the processor in some way. This diagram shows how the tool flow for code goes down the right-hand side, and all the scaffolding for altering the processor is on the left.
Deep learning in an embedded context typically comes in two parts: training the neural network in the cloud, and running inference on the SoC. I'm going to assume that split here, and ignore how the training gets done. By the time the neural network needs to be implemented on the SoC, all the weights have been calculated.
However, mapping that trained network onto the processor requires optimization, and then the operations must be mapped onto the array of MACs (multiply-accumulate units) available. The key to effective edge inference is to do this mapping in a way that optimizes the memory accesses. Neural network processors don't have caches, since the order of memory access is known and under the control of the compiler. Caches are useful when the order of access is completely unknown at compile time and so has to be deduced on the fly at run time.
The diagram below shows two flows for edge inference, one using Caffe or TensorFlow, and the other via the Android Neural Networks API (for Android, obviously).
For more on Tensilica processors and tool chains, see the Tensilica IP product page.
For more on MathWorks, see their website.
For more on Green Hills Software, see their website.
Sign up for Sunday Brunch, the weekly Breakfast Bytes email.