What is the eki Research Project?

The eki research project aims to increase the energy efficiency of AI systems for deep neural network (DNN) inference through approximation techniques and mapping to high-end FPGA systems in the data center. FPGAs (Field-Programmable Gate Arrays) are reprogrammable hardware devices that allow computing units, memory, and interconnection networks to be specialized and optimized for a concrete application through a software configuration process. FPGAs are technologically mature and established and are already used by leading cloud providers, notably Microsoft and Amazon. The structural reconfigurability of the hardware, which no other architecture supports, allows maximum specialization, e.g., extreme quantization of parameters or heterogeneous quantization of the layers of a DNN, and thus energy savings. However, the transfer to practical applications is still missing because user-friendly, automated development tools are lacking. Methods for implementing particularly complex DNNs through partitioning, parallelization, and scaling on a cluster of high-end FPGAs are also not yet available. The eki project runs for three years and is funded by the German Federal Ministry for the Environment, Nature Conservation, Nuclear Safety and Consumer Protection under the funding line “AI Lighthouses for Environment, Climate, Nature and Resources”.

Motivation

DNNs are statistical machine learning methods that have become established in recent years as an approach to AI problems in numerous fields of application, especially pattern recognition, image analysis, and language processing. Programming environments and libraries such as TensorFlow and Keras have greatly simplified the use of DNNs. DNNs are trained with characteristic data in a one-time training phase and then compute classification or regression results on typically very large sets of input data in the subsequent inference phase. The problem is that DNNs are already responsible for a significant portion of the computational load in (cloud) data centers and thus for the associated consumption of electrical energy and CO2 emissions. To date, no precise peer-reviewed studies on the environmental costs of DNNs are available. However, recent estimates suggest that training an average DNN releases 284 t of CO2 and that (according to data from Facebook) inference of a DNN consumes 10 times more energy than its training. For training DNNs, GPUs have become the technology of choice due to their parallelism and numerical accuracy. Today, inference also takes place primarily on GPUs or CPUs with high accuracy, but often with low energy efficiency. Since the demand for DNN inference will continue to grow strongly, there is an urgent need to reduce the environmental impact and operating costs of DNN-based AI.

Methodological Approaches

  • DNN Approximation for FPGAs

    The computational effort for DNN inference is deliberately reduced without falling short of the result quality required for the specific application. The reduction of the computational effort then leads to significant energy and hardware savings in the FPGA implementation. We apply two methodological approaches: First, DNN models are compressed by network pruning, and second, the compressed DNNs are directly mapped into hardware using the open-source framework FINN. FINN creates a specialized dataflow architecture that maps a DNN layer-by-layer in hardware, achieving very high throughput and very low latency. Fixed-point formats down to binary quantization can be chosen for the DNN parameters. Strong quantization reduces the required hardware and also the number of external memory accesses, which are a significant contributor to the energy budget. FINN uses Brevitas, a PyTorch library that supports training of quantized DNN models. Quantization to fewer than 8 bits usually comes with a slight loss in result quality (e.g., accuracy). Depending on the application, this can be tolerated to a certain degree, resulting in an efficiency gain that cannot be achieved with traditional architectures (e.g., GPUs). For adapting FINN to single-FPGA systems and extending it to multi-FPGA systems, we pursue several approaches: (i) heterogeneous quantization, i.e., the individual layers of a DNN compute with different precisions (see the sketch below); it has been shown that mixing highly quantized fixed-point operations with infrequently used floating-point operations increases the accuracy of the DNN and can reduce the resource requirements; (ii) the reduction or elimination of external memory accesses; (iii) the balancing of the dataflow pipeline across FPGA boundaries; the Layer 1 optical switch available at PC2 makes it possible to implement various multi-FPGA topologies with very low latencies and high bandwidths; (iv) the ability to create different DNN implementations with different performance (throughput, latency), power, and accuracy trade-offs.
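
    As a concrete illustration of heterogeneous quantization with Brevitas, the following minimal sketch assigns each layer of a small CNN its own bit width. The model and the bit widths are purely illustrative, the keyword arguments (weight_bit_width, bit_width) are assumed from recent Brevitas versions and may differ across releases, and the network is not one of the project's DNNs and not necessarily directly synthesizable with FINN.

      # Minimal sketch: per-layer ("heterogeneous") quantization with Brevitas.
      # Bit widths are illustrative; exact keyword arguments may differ between
      # Brevitas versions.
      import torch.nn as nn
      from brevitas.nn import QuantConv2d, QuantIdentity, QuantLinear, QuantReLU

      class HeterogeneousCNN(nn.Module):
          def __init__(self, num_classes=10):
              super().__init__()
              self.features = nn.Sequential(
                  QuantIdentity(bit_width=8),                    # quantize input activations to 8 bits
                  QuantConv2d(3, 32, kernel_size=3, padding=1,
                              weight_bit_width=8, bias=False),   # first layer keeps 8-bit weights
                  nn.BatchNorm2d(32),
                  QuantReLU(bit_width=4),                        # 4-bit activations
                  QuantConv2d(32, 64, kernel_size=3, padding=1,
                              weight_bit_width=2, bias=False),   # middle layer pushed to 2-bit weights
                  nn.BatchNorm2d(64),
                  QuantReLU(bit_width=4),
                  nn.AdaptiveAvgPool2d(1),
              )
              self.classifier = QuantLinear(64, num_classes, bias=True,
                                            weight_bit_width=8)  # output layer back at 8-bit weights

          def forward(self, x):
              x = self.features(x)
              return self.classifier(x.flatten(1))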

  • Energy Characterization

    We develop a function library that attributes energy consumption to the individual system components CPU, FPGA, GPU, RAM, and network adapter, based on IPMI mechanisms or hardware-side instrumentation. This library forms the core of a GitLab-based framework for continuous integration and benchmarking that fully automates translation, optimization, execution, and evaluation of different DNNs, implementation variants, and execution architectures. The framework will be integrated with the HPC cluster and its workload management software, enabling automated characterization of DNN inference energy consumption; a sketch of the accounting principle follows below. In principle, the function library can support any system component (e.g., in addition to x86 systems, also ARM-based systems such as the Apple M1 and various hardware accelerators), provided that the component offers an interface for querying energy values, regardless of the type of interface.
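
    The following sketch illustrates the accounting principle only; it is not the project's library. It assumes a user-supplied read_power_watts() callback (a hypothetical placeholder here) that returns the current power draw of one component, e.g., obtained via an IPMI query or a vendor tool, and integrates the sampled power over the runtime of the measured code region.

      # Hypothetical per-component energy accounting: sample a power readout in a
      # background thread and integrate it over the measured code region.
      import threading
      import time
      from contextlib import contextmanager

      class EnergyMeter:
          def __init__(self, read_power_watts, interval_s=0.1):
              self._read_power = read_power_watts  # callback returning the component's current power draw in W
              self._interval = interval_s
              self.energy_joules = 0.0

          def _sample(self, stop):
              last = time.monotonic()
              while not stop.is_set():
                  time.sleep(self._interval)
                  now = time.monotonic()
                  # rectangle rule: current power sample times elapsed time
                  self.energy_joules += self._read_power() * (now - last)
                  last = now

          @contextmanager
          def measure(self):
              self.energy_joules = 0.0
              stop = threading.Event()
              worker = threading.Thread(target=self._sample, args=(stop,), daemon=True)
              worker.start()
              try:
                  yield self
              finally:
                  stop.set()
                  worker.join()

      # Usage (read_fpga_power and run_inference are placeholders):
      # meter = EnergyMeter(read_power_watts=read_fpga_power)
      # with meter.measure():
      #     run_inference()
      # print(f"FPGA energy: {meter.energy_joules:.1f} J")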

  • AutoML for Energy Optimization

    AutoML describes the automated composition and optimal configuration of a machine learning application. In the eki project, we develop an AutoML method based on FINN for configuring DNN mappings on single and multiple FPGAs. The novelty here is that we consider not only the configuration parameters of the DNN but also those of the building blocks of the FPGA architecture and include both together in the optimization. AutoML then performs a systematic search in a high-dimensional design or configuration space in which numerous parameters can be varied and different optimization goals can be pursued. Search methods such as hill climbing, simulated annealing, or even AI methods can be used for this purpose; a sketch of such a search is shown below. Parameters include, for example, the quantization of weights (global, per layer), the arithmetic (floating point, fixed point, binary), the partitioning across multiple FPGAs (optimized for energy vs. throughput vs. latency), the location of weights (fixed, internal SRAM, HBM, DRAM), and the network topology. The central optimization goal in our project is minimizing energy while satisfying constraints on result quality and latency/throughput.
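
    The following sketch shows one possible search of this kind: simulated annealing over a configuration of per-layer weight bit widths, FPGA count, and weight location, minimizing energy under accuracy and latency constraints. The configuration knobs and the evaluate() callback are placeholders invented for illustration; in the project, the evaluation would drive the FINN toolflow and the benchmarking framework described above.

      # Hedged sketch of simulated annealing over a DNN/FPGA configuration space.
      # evaluate(cfg) must return (energy_joules, accuracy, latency_ms) for a
      # candidate configuration; here it is left as a caller-supplied placeholder.
      import math
      import random

      BIT_WIDTHS = [1, 2, 4, 8]
      WEIGHT_LOCATIONS = ["internal_sram", "hbm", "dram"]

      def random_config(num_layers):
          return {
              "bits": [random.choice(BIT_WIDTHS) for _ in range(num_layers)],
              "num_fpgas": random.choice([1, 2, 4]),
              "weights": random.choice(WEIGHT_LOCATIONS),
          }

      def neighbor(cfg):
          # mutate one knob at random to obtain a nearby configuration
          new = {**cfg, "bits": list(cfg["bits"])}
          knob = random.choice(["bits", "num_fpgas", "weights"])
          if knob == "bits":
              new["bits"][random.randrange(len(new["bits"]))] = random.choice(BIT_WIDTHS)
          elif knob == "num_fpgas":
              new["num_fpgas"] = random.choice([1, 2, 4])
          else:
              new["weights"] = random.choice(WEIGHT_LOCATIONS)
          return new

      def cost(cfg, evaluate, min_accuracy, max_latency_ms):
          energy, accuracy, latency = evaluate(cfg)
          penalty = 0.0
          if accuracy < min_accuracy:            # constraint violations enter as penalties
              penalty += 1e3 * (min_accuracy - accuracy)
          if latency > max_latency_ms:
              penalty += 1e3 * (latency - max_latency_ms)
          return energy + penalty

      def simulated_annealing(evaluate, num_layers, min_accuracy, max_latency_ms,
                              steps=200, t_start=5.0, t_end=0.05):
          current = random_config(num_layers)
          current_cost = cost(current, evaluate, min_accuracy, max_latency_ms)
          best, best_cost = current, current_cost
          for step in range(steps):
              temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
              cand = neighbor(current)
              cand_cost = cost(cand, evaluate, min_accuracy, max_latency_ms)
              # accept better configs always, worse ones with temperature-dependent probability
              if cand_cost < current_cost or random.random() < math.exp((current_cost - cand_cost) / temp):
                  current, current_cost = cand, cand_cost
                  if current_cost < best_cost:
                      best, best_cost = current, current_cost
          return best, best_cost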

  • Empirical Evaluation

    The quantitative evaluation of the developed methods is performed on the one hand using publicly available DNNs, e.g., YOLOv3, AlexNet, and ResNet50, and on the other hand on two concrete and current application case studies. In the first case study, the partner FHSWF considers Transformer models, the de facto standard in Natural Language Processing (NLP). Transformer models do not process the input data sequentially like recurrent DNNs, which facilitates parallelization and reduces numerical problems during optimization. As a result, significantly larger DNNs can be trained, and the leading NLP models have rapidly growing parameter counts, e.g., from Google BERT with 350 million to OpenAI GPT-3 with 175 billion to Google Switch Transformer with 1.6 trillion parameters. Inference of these models requires a correspondingly high number of computational operations. Since only a fraction of the trained capabilities is used in many applications, current approaches specialize pre-trained Transformer models via transfer learning and then reduce them via network pruning and quantization, resulting in significant energy reduction (a minimal pruning sketch follows below). This case study is highly relevant for FPGAs, as pruning leads to sparse parameter matrices, which CPUs, GPUs, and even TPUs, unlike FPGAs, cannot exploit efficiently.

    The second case study, from the project "5G-Landwirtschaft-ML" (5G.NRW competition) of the partner HSHL, considers how 5G can be used to make the agricultural process of sugar beet cultivation in NRW more ecological, economical, and sustainable. Field crossings with unmanned vehicles are used to monitor plants and collect data. The low latency of 5G makes it possible to apply computationally demanding DNN inference in the cloud in real time to categorize plants during a crossing and to dynamically adjust the application rate of fertilizers and crop protection products. Within the eki project, this case study stands out as a cyber-physical system that combines aspects of edge and cloud computing and, in particular, imposes latency bounds on DNN inference, a requirement where FPGAs have significant advantages over GPUs.
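
    As a minimal illustration of the pruning step mentioned above (standard PyTorch tooling, not the project's actual compression pipeline), the following sketch zeroes the 90% smallest-magnitude weights of a single layer, producing the kind of sparse parameter matrix that dataflow FPGA architectures can exploit.

      # Unstructured magnitude pruning of one layer with PyTorch's pruning utilities.
      import torch
      import torch.nn.utils.prune as prune

      layer = torch.nn.Linear(1024, 1024)
      prune.l1_unstructured(layer, name="weight", amount=0.9)  # zero the 90% smallest-magnitude weights
      prune.remove(layer, "weight")                            # make the pruning permanent

      sparsity = (layer.weight == 0).float().mean().item()
      print(f"weight sparsity: {sparsity:.1%}")                # roughly 90% zeros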