At CadenceLIVE, Kioxia's Ravi Tangirala presented System-Level Emulation and Prototyping Performance for Storage Controllers. He is the director of validation engineering for Kioxia America (the former Toshiba Memory, before it was spun out as a separate company).
A storage controller is a "normal" SoC that interfaces to all the NAND flash (3D these days, although that doesn't make much difference from the point of view of the storage controller). The heart of the validation is to combine real hardware with emulation on Cadence's Palladium Z1 Enterprise Emulation Platform, and real hardware with FPGA prototyping on the Cadence Protium X1 Enterprise Prototyping Platform.
The chart above shows the team's flow. Note the thick red bar that rises and falls. This shows the amount of Palladium usage. At the start, there is not much usage since much of the RTL has not been completed. As the design is created, usage goes up. Firmware development, the blue shaded area, starts on Palladium but switches to Protium once the design is stable enough. The advantage of Protium is that it is much faster for running software than Palladium. The disadvantage is that it takes a lot longer to get the design ready due to the FPGA place and route that is required. So when the RTL is unstable, Protium is not attractive, but once it is stable enough, then the software developers in the firmware team will have a strong preference to use it. Once silicon is available, software development winds down, and Palladium is used to debug any remaining issues found in the silicon.
The diagram above shows the whole validation setup at a high level. In the center is the emulated SoC. On the left are the actual host (which accesses the flash) and a debug host (which accesses the debug port). On the right is the actual flash on DIMMs. The flash is a huge memory, so emulating it directly is impractical and would also be slow. The host, a PC, is connected to Palladium through Cadence's SpeedBridge hardware interface. The debug host, another PC, is connected to the debug port. The NAND, on DIMMs, is also connected directly to Palladium.
Actually, they use two modes of operation on Palladium:
Kioxia allocated six servers in its server farm for Palladium compilation. A single iteration could compile in three hours. With iterative compile options, it could do a 30-iteration compile in just 12 hours. Of course, this is really only effective when the RTL is stable.
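As a quick sanity check on those figures, here is a small Python sketch of the arithmetic. It assumes, hypothetically, that without the iterative options each of the 30 iterations would cost a full three-hour compile run back to back. The presentation does not say how the six servers share the work, so the function names and the serial baseline are illustrative only.

```python
def serial_compile_hours(iterations: int, hours_per_iteration: float) -> float:
    """Total time if every iteration were a full compile, run one after another (assumed baseline)."""
    return iterations * hours_per_iteration


def effective_hours_per_iteration(total_hours: float, iterations: int) -> float:
    """Average wall-clock cost of one iteration inside a batched compile."""
    return total_hours / iterations


# Figures quoted in the post: 3 hours per single compile, 30 iterations in 12 hours.
baseline_hours = serial_compile_hours(30, 3.0)            # 90 hours
per_iteration = effective_hours_per_iteration(12.0, 30)   # 0.4 hours, i.e. 24 minutes
speedup = baseline_hours / 12.0                           # 7.5x against the assumed baseline

print(f"serial baseline: {baseline_hours:.0f} h")
print(f"effective cost per iteration: {per_iteration * 60:.0f} min")
print(f"speedup over serial baseline: {speedup:.1f}x")
```

Under that assumption, the iterative flow brings the average cost of an iteration down from three hours to about 24 minutes.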
All of the clock optimization resulted in a 373% core functional clock improvement, as shown in the bar graph above. Note that this is an improvement in emulation performance, not an improvement in the SoC clock rate itself (373% would be pretty amazing on real silicon).
The Palladium setup with all the hardware interfaces allows for end-to-end testing (all the way from application software running on the host PC to actual NAND memory chips). They can run disk applications that measure I/O rates, thrash the disk in some sort of worst-case way, and so on. The overall performance of the whole platform is critical since tests need long runs to produce accurate metrics. With this setup, they were able to improve I/O operations per second by 9X, and reduce boot time and NAND erase time by 5.5X.
The migration to Protium for firmware development was seamless (the two platforms use the same compiler front-end). If you look at the above diagram, it looks almost like a copy of the original one with Palladium. It is not quite the same, but the two platforms use identical SpeedBridges and I/O cards.
This allows for the combined Palladium and Protium flow, where an RTL drop can be compiled for both platforms as in the diagram above. This flow allows for using Palladium for debugging the hardware, with good performance, and Protium for debugging the firmware, with great performance. Because the design takes 15-24 hours to compile in Protium, it really only gets usable once the design is stable enough that software developers can work effectively. For example, the operating system can boot and individual software developers can at least reach the code they are working on.
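The trade-off between Protium's long compile and its faster execution can be sketched as a break-even calculation. This is a rough model, not anything from the presentation: it borrows the three-hour single Palladium compile and the 4.6X Protium speed-up quoted elsewhere in this post, treats both as constants, and ignores debug visibility, queueing, and everything else that matters in practice. The function name and its parameters are hypothetical.

```python
def break_even_palladium_hours(
    protium_compile_hours: float,
    palladium_compile_hours: float = 3.0,  # single-iteration figure quoted in this post
    protium_speedup: float = 4.6,          # Protium speed relative to Palladium, from this post
) -> float:
    """Hours of Palladium run time beyond which Protium finishes first.

    Model (an assumption): total time = compile time + run time, where a
    workload that runs W hours on Palladium runs W / protium_speedup hours
    on Protium. Solving
        palladium_compile + W = protium_compile + W / speedup
    for W gives the break-even workload.
    """
    if protium_speedup <= 1.0:
        raise ValueError("Protium must be faster than Palladium for a break-even point to exist")
    extra_compile = protium_compile_hours - palladium_compile_hours
    return extra_compile / (1.0 - 1.0 / protium_speedup)


# The post quotes a 15-24 hour Protium compile, so check both ends of that range.
for compile_hours in (15.0, 24.0):
    workload = break_even_palladium_hours(compile_hours)
    print(f"Protium compile {compile_hours:.0f} h: break-even at {workload:.1f} h of Palladium run time")
```

In this simplified model, a single RTL drop has to carry roughly 15 to 27 hours' worth of Palladium run time before the Protium compile pays for itself, which is consistent with the point that Protium only becomes attractive once the RTL stops changing frequently.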
For a deeper dive into using Palladium and Protium together, see my post The Dynamic Duo.
Palladium was used for maximum debug ability at high speeds, 373% faster than when the project started.
Protium ran at 4.6X the Palladium speed, targeted at firmware development.
Easy migration between Palladium and Protium using QTDB.
Here's the video version of these flows, titled Early Firmware Development on Palladium and Protium, Enables 1st Silicon Success at Toshiba Memory. This is not a video of Ravi's presentation; it is a video made jointly with Cadence featuring T.R. Ramesh, Senior Director of Datacenter SSD Engineering, along with Ravi himself. The video is 3½ minutes long.