Mark Littlefield, Vertical Product Manager, Defense, and Dr. Mohamed Bergach, System/Software Architect, Kontron
Deep learning is an increasingly popular approach to processing very large data sets. Many high-visibility image processing and data mining projects rely on deep learning techniques, including Google's self-driving car, Google Brain, and AlphaGo, as well as the US Department of Homeland Security's Synthetic Environment for Analysis and Simulations (SEAS) project for predicting and evaluating future events and courses of action. While deep learning methodologies are not exactly new, the processing power needed for such complex applications is finally becoming small and low-power enough to package into embedded systems.
History of Deep Learning
Modern deep learning approaches evolved out of research into perceptrons and neural networks performed in the 1970s and '80s, which attempted to leverage structures found in biological systems for processing tasks such as feature extraction. Systems that use deep learning (sometimes referred to as deep neural networks, or DNNs) use many processing layers, sometimes thousands, together with non-linear transforms to tease information or patterns from large or even enormous data sets. These systems are characterised by "learning", either supervised or unsupervised, through iterative parameter optimisation. Another feature of deep learning systems is that they tend to form a feature hierarchy, with earlier processing layers representing low-level features and deeper layers progressing towards higher-level features. Thus, they generally benefit from having additional processing stages added to the transformation chain.
Applications for deep learning include such diverse problems as computer vision, natural language processing, pattern recognition, navigation and route planning, and, in the case of Google's AlphaGo, defeating humans at board games. With the current trend in embedded defence systems towards greater operational autonomy, integrators are drawn to the recent breakthrough performance achieved by deep learning approaches and their ability to produce highly reliable autonomous decisions from huge data sets.
In the early days, the copious amount of processing needed for such computing architectures made them impractical for any sort of real-time application. That has all changed with the number-crunching capabilities of today's very large FPGAs, power-efficient graphics processing units (GPUs), and advanced SIMD units tied to flexible multi-core processors. These advances now make it possible to apply deep learning algorithms in the military's SWaP-constrained high-performance embedded computing (HPEC) space.
The Role HPEC Plays
Deep learning applications can leverage technologies such as high-speed switched serial links, rugged standardized form factors, and HPEC middlewares that have been developed and honed over the years on classical HPEC problems such as synthetic aperture radar (SAR) and military signal intelligence (SIGINT) applications (Figure 1). The challenge for the system integrator, therefore, is to define how deep learning algorithms can be applied to solve their particular problem. Subsequently, the challenge for the supplier base is to tailor and refine their HPEC-based platforms to ensure that they are well adapted to deep learning types of applications.
Any signal (observation) coming from sensors (image, sound, GPS position, radar, etc.) can be represented in an abstract way as features (shapes, corners, patterns, etc.). In a DNN, each layer of the network processes this data based on a particular type of feature and passes the result to the next layer (Figure 2). The beauty of this approach is that it can be applied to a wide range of problems with impressive results (sometimes better than hand-crafted solutions), such as face recognition, image registration, natural language processing, and fraud detection.
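A minimal sketch of this layer-by-layer flow, in pure Python with made-up layer widths and random weights, might look like the following: each layer computes a weighted transform of its input, applies a non-linear activation, and feeds the result to the next layer.

```python
import random

random.seed(0)

def relu(x):
    # Non-linear transform applied after each layer's weighted sum
    return [max(0.0, v) for v in x]

def dense(weights, x):
    # Each output feature is a dot product of the input with one weight row
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

# Hypothetical layer widths: raw input -> low-level -> higher-level features
sizes = [64, 32, 16, 8]
layers = [[[random.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

x = [random.gauss(0, 1) for _ in range(64)]   # abstract sensor observation
for W in layers:                              # each layer feeds the next
    x = relu(dense(W, x))

print(len(x))  # 8 features at the deepest, most abstract layer
```

In a real DNN the weights would of course be learned rather than random; the sketch only illustrates the forward flow of data through the feature hierarchy.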
All of the processing is based on computing dot products (the basis of the convolutional neural network, or CNN) and requires a huge amount of computation, especially for training the network. The training or learning phase is the stage in which the deep neural network is supplied with a large amount of data in order to search for the settings of the convolutional weights that minimise the network's error. This phase is time-consuming and requires many rounds of optimisation to approach a global minimum.
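The dot-product core of a CNN layer can be seen in a minimal one-dimensional convolution, where each output sample is the dot product of the kernel with a sliding window of the input (a pure-Python sketch with made-up values):

```python
def conv1d(signal, kernel):
    """Valid-mode 1-D convolution: each output sample is one dot product,
    i.e. a chain of multiply/accumulate operations."""
    k = len(kernel)
    out = []
    for i in range(len(signal) - k + 1):
        acc = 0.0
        for j in range(k):
            acc += kernel[j] * signal[i + j]   # multiply/accumulate (MAC)
        out.append(acc)
    return out

# An edge-detecting kernel applied to a ramp signal
print(conv1d([1, 2, 3, 4, 5], [1, 0, -1]))  # [-2.0, -2.0, -2.0]
```

A 2-D CNN layer performs the same operation over image patches with many kernels at once, which is why the total MAC count, and hence the required compute, grows so quickly with network depth.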
For that reason, the learning phase is typically performed in data centres under continual 24/7 operation. Then, a snapshot is taken of the network with each training result and is deployed on the actual embedded HPEC system for testing. This process is repeated with the expectation that the next snapshot will respond better than the previous one. The processing needed for running a DNN is similarly quite large, but it depends on relatively simple multiply/add operations, known as multiply/accumulate (MAC), or fused multiply/accumulate (FMA) when the multiply and add are performed in a single instruction.
An ideal platform for this task is available today in the Intel Xeon Processor D-1540 (Broadwell DE), which contains eight cores (16 threads), each with two AVX2 units. Each AVX2 unit can process eight single-precision FMA operations at a time, so a single processor can perform 256 floating point operations with each clock cycle. With this amount of processing power available per processor, military program developers can build a modular HPEC system suitable for a wide range of deep learning applications. This is especially true using VPX-based boards and platforms that deliver high-speed, low-latency communication across the backplane via PCIe Gen3 or 10 Gbit Ethernet links. Another benefit of using Intel architecture is binary compatibility across generations of Intel architecture 64-bit processors, keeping software investments safe from future incompatibilities.
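The peak-throughput arithmetic for the Xeon D-1540 can be written out explicitly. One FMA counts as two floating point operations (a multiply and an add); the 2.0 GHz base clock used below is the part's published base frequency, and real sustained throughput will of course be lower than this theoretical peak.

```python
cores = 8            # Xeon D-1540 core count
fma_units = 2        # AVX2 units per core
lanes = 8            # single-precision floats per 256-bit AVX2 register
flops_per_fma = 2    # one fused multiply-add = 2 floating point operations

flops_per_cycle = cores * fma_units * lanes * flops_per_fma
print(flops_per_cycle)  # 256 floating point operations per clock cycle

clock_ghz = 2.0      # base clock frequency
print(flops_per_cycle * clock_ghz, "GFLOPS theoretical peak")  # 512.0
```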
Defence systems could also use the recently announced Intel Xeon Phi coprocessor (Knights Landing), which contains up to 72 cores, each with two AVX-512 units processing 16 FMA operations per clock. Intel has also announced a future Xeon chip that contains a multi-core CPU tightly integrated with an FPGA sharing some levels of the memory hierarchy. This has the potential to be an ideal architecture for deploying deep learning algorithms on HPEC systems. For the CNN example, this solution allows the specialised part of the algorithm to be synthesised on the FPGA, while the more general-purpose functions of the deep learning application can be deployed to the CPU cores.
In addition, OpenCL is quickly becoming the go-to standard for heterogeneous computing. It provides a rich and expressive API for managing data flow and computational objects. OpenCL also helps to ensure portability of source code across different platforms such as GPUs, CPUs, and FPGAs. Altera (now Intel) produced the first, and what many would argue is the best, OpenCL-to-FPGA compiler. It closes the code productivity gap that has stubbornly remained open for decades in the FPGA world. Now, developers and system integrators can write code once and, with relatively minor adjustments, port it from a CPU to an FPGA or GPU. As a result, OpenCL is an increasingly popular middleware choice for deep learning applications.
Handling Voluminous Data
Many defence-related real-time applications could benefit from deep learning techniques. As the trend in defence systems is towards greater application autonomy, and as deep learning tends to be most useful for pattern recognition tasks such as natural language processing and image feature detection, it makes sense that deep learning could be successfully applied to on-platform processing of streaming signal or image data. Such systems would be able to sift through voluminous streams of data looking for signals or targets of interest, and even hunt for threats and autonomously deploy active protection systems. The increasingly congested terrestrial spectrum and satellite bandwidth in military areas of operation will further drive the need for smart autonomy, and thus make the application of deep learning techniques in defence systems a more popular and useful proposition.
Defence applications can leverage tools previously applied to large signal and image processing applications, such as synthetic aperture radar, real-time signal intelligence collection and classification, and high-altitude surveillance platforms like Global Hawk, for new deep learning applications. By applying tens or even hundreds of processors, which may themselves be massively parallel devices such as GPUs or FPGAs, along with high-speed mesh fabrics and middlewares such as OpenCL and OpenMP to facilitate low-latency interprocessor communication and thread synchronisation, one can realistically expect to implement deep learning algorithms hundreds or even thousands of stages deep while maintaining real-time levels of performance and latency.
While it is still early days for deep learning in real-time defence systems, it is entirely possible today to leverage existing technology solutions. One example is Kontron's StarVX HPEC system based on the company's VX3058 3U VPX single board computer (Figure 3). This high-performance VPX computing node leverages the breakthrough processing capabilities of the 8-core Intel Xeon Processor D-1540 (Broadwell DE).
For deep learning defence applications, the StarVX packs server-class silicon and highly ruggedized technologies in a compact 3U blade footprint. When combined with a high-speed 10 Gigabit Ethernet switch card such as Kontron’s VX3920 and off-the-shelf low-latency middleware, this powerful solution makes it possible to design a deployable platform suitable for deep-learning-based applications. Naturally, as processing densities and backplane speeds continue to increase, the possibilities for hosting computationally-challenging deep learning applications on realistically deployable hardware will also continue to grow.