
Edge Computing Hardware for Smart Buildings


Research Team

Research Overview

Observed Problem

In practice, going through the PDCA cycle daily and generating the planned material quantities, the actual quantities installed at the end of each day, and daily performance metrics is challenging for foremen: it requires ten minutes or more per day, and sometimes more than an hour, depending on the work they need to quantify.

Primary Research Objective

Make the process of going through the PDCA cycle with digital information (i.e., LOD400-based daily BOM) usable and operable for a foreman.

Potential Value to CIFE Members and Practice

  • Value 1: The prototype and field methods presented in this research contribute to closing the daily feedback loop.
  • Value 2: The daily resource information (such as the daily BOM and the number of workers and their links to LOD400 objects) available from the prototype can be used for proactive and precise day-to-day field control.
  • Research provides relevant insights for: builders and suppliers.

Research and Theoretical Contributions

The prototype and field methods that a foreman can use to operate an LOD400-based daily BOM through the PDCA cycle.

Industry and Academic Partners

CCC, Oscar Properties, DPR Construction, Quality Consulting Solutions, Drees & Sommer, and Cosapi

Keywords

LOD400, PDCA cycle, BIM, daily BOM

Progress Report - April 2020

Research Overview and Observed Problem

Deep learning has fueled new visions of smart buildings and work environments. For example, recent work at Stanford [1] (Albert Haque, Serena Yeung, Arnold Milstein, Fei-Fei Li et al.) looked at vision-based smart hospitals and demonstrated a system for monitoring staff hand hygiene (see Figure 1). While the achieved results are impressive, taking such a system from a proof-of-concept demonstrator to a scalable real-world solution will require considerable progress in hardware/algorithm/software co-design.

Figure 1: Vision-Based Smart Hospital Environment by Haque et al. [1]

Project Background

The system by Haque et al. employs 50 distributed depth sensors to monitor and classify the activity of hospital staff and patients. The choice of depth sensors over RGB cameras was motivated by two considerations. First, privacy laws prevent the recording and distribution of full video data. Second, acquiring fine-grain video would lead to an unmanageable “data deluge” in the facility, at more than 10 GB/hour per camera. As another example of this problem, a commercial system, CogniPoint, has also attempted to leverage video streaming data in the smart building space; similar to [1], it incurs the same penalties of heavy network load and privacy concerns in its smart building product. To alleviate this problem, Haque et al. developed their system based on adaptively sampled depth images that bring the data volume down to 25 MB/hour per sensor, or 1.25 GB/hour in aggregate across all 50 sensors. This data is streamed to a remote graphics processing unit (GPU) cluster that runs the machine learning (ML) and data analytics algorithms.

A critical next step is to move the algorithmic compute local to the sensors, i.e. to the “edge” of the system. This will enable scaling both to large hospitals, as well as diverse types of assisted living and home/workspaces where more powerful remote computing is either unavailable or undesired for privacy reasons. Through local processing, we expect to achieve: (1) data reduction of more than 100x, (2) reduced energy consumption by a similar magnitude, and (3) improved algorithmic performance due to the resulting low latency and increased image resolution.

While there has been a large push in industry for more efficient machine learning hardware in servers and in mobile/battery-powered devices, it is currently not known how to optimally engineer these systems for the constraints and desired features of smart building applications. For example, how does one optimally use distributed and privacy-preserving machine learning algorithms? How can we use the learned information for real-time action planning?

Assessing Algorithm Constraints and the Performance Under Quantization

A core technique for implementing these models at the edge is quantization. Neural network parameters and intermediate activations are typically trained using 32-bit floating-point representations. The final trained model is usually heavily over-parameterized and does not require the large dynamic range of floating point, so for inference it is quantized to lower bit widths, typically 8 bits or fewer. This reduces the hardware, latency, and power required to perform the same computation.
Quantization has its drawbacks. The final weights and activations differ slightly from those of the floating-point trained model, and this difference, called the quantization error, can reduce network performance. Since quantization is a necessary step in shrinking the model to meet the hardware performance requirements at the edge, we must balance hardware constraints against the error introduced by quantization techniques.
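
As a concrete illustration, the sketch below implements a symmetric uniform ("fake") quantizer in NumPy; the bit width and tensor shape are illustrative rather than our experimental settings.

```python
# Minimal sketch of a symmetric uniform ("fake") quantizer in NumPy. The bit
# width and tensor shape below are illustrative, not our experimental settings.
import numpy as np

def quantize_uniform(x, num_bits=8):
    """Round x onto a signed integer grid and map it back to floats."""
    qmax = 2 ** (num_bits - 1) - 1           # e.g. 127 for 8 bits
    scale = np.max(np.abs(x)) / qmax         # a single scale for the whole tensor
    if scale == 0:
        return x.copy()
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale                         # result carries the quantization error

# Example: fake-quantize a randomly initialized weight tensor to 8 bits.
w = np.random.randn(64, 64).astype(np.float32)
w_q = quantize_uniform(w, num_bits=8)
print("max abs quantization error:", np.max(np.abs(w - w_q)))
```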

With this balance in mind, we investigated the sources of quantization error and how this error can be reduced. We focused on uniform quantization, the simplest form of quantization and the one most readily implemented on edge devices.

Quantization error can be reduced in one of two ways: either use more bits, which means more hardware (see Figure 2), or reshape the distribution of the values being quantized (see Figure 3). We therefore set out to understand how neural network training techniques affect the distributions of the weights and activations.

Figure 2: Reduce error by using more bins
Figure 3: Reduce error by reshaping the distribution

For uniform quantization, the ideal distribution is a uniform distribution. Any technique that makes the distributions of the weights and activations closer to uniform will result in a network with better quantization performance. Conversely, uniform quantization performs particularly poorly when the distributions have heavy tails or a very strong peak near zero. We observed that networks trained with a technique called Batch Normalization (BatchNorm) seemed to have better quantization performance, and part of this work was devoted to understanding why.
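
The toy experiment below, on synthetic data rather than real network activations, illustrates this point: under the same 8-bit uniform quantizer, a near-uniform distribution achieves a noticeably higher signal-to-quantization-noise ratio (SQNR) than a heavy-tailed, zero-peaked one.

```python
# Toy demonstration on synthetic data: the same 8-bit uniform quantizer yields a
# much higher SQNR on a near-uniform distribution than on a heavy-tailed one.
import numpy as np

def quantize_uniform(x, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def sqnr_db(x, x_q):
    return 10 * np.log10(np.sum(x ** 2) / np.sum((x - x_q) ** 2))

rng = np.random.default_rng(0)
uniform_vals = rng.uniform(-1.0, 1.0, 100_000)     # close to the ideal case
heavy_tail = rng.laplace(0.0, 0.1, 100_000)        # strong peak at zero, long tails

for name, vals in [("uniform", uniform_vals), ("heavy-tailed", heavy_tail)]:
    print(f"{name}: {sqnr_db(vals, quantize_uniform(vals)):.1f} dB")
```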

Understanding Batch Normalization

BatchNorm is a training technique that enables convolutional neural networks to be trained at higher learning rates. After training, the BatchNorm layer can be completely eliminated from the inference model with no degradation in network performance by folding the BatchNorm parameters into the weights of the adjacent convolution layer. Although BatchNorm is a training technique, our experiments show that networks trained with BatchNorm have better noise performance at the same quantization levels. For these experiments we use the VGG architecture trained on CIFAR-10 and compare the network trained with and without BatchNorm; VGG is one of the few standard CNN architectures that can be easily trained to the same validation accuracy without BatchNorm.
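
For reference, a minimal PyTorch sketch of this BatchNorm folding step is shown below; it assumes a plain Conv2d followed by BatchNorm2d (no groups or dilation) and is not the exact code used in our pipeline.

```python
# Hedged sketch of BatchNorm folding into the preceding Conv2d for inference.
# Assumes a plain Conv2d + BatchNorm2d pair (no groups/dilation).
import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    with torch.no_grad():
        std = torch.sqrt(bn.running_var + bn.eps)
        gamma = bn.weight / std                              # per-channel scale
        fused.weight.copy_(conv.weight * gamma.reshape(-1, 1, 1, 1))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_((conv_bias - bn.running_mean) * gamma + bn.bias)
    return fused

# Consistency check: after folding, the fused conv matches conv followed by BN.
conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)
bn.train()
_ = bn(conv(torch.randn(4, 3, 16, 16)))   # move the running stats off their defaults
bn.eval()
x = torch.randn(1, 3, 16, 16)
print(torch.allclose(bn(conv(x)), fold_bn_into_conv(conv, bn)(x), atol=1e-4))
```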

In Figure 4, we quantize one layer’s activations at a time and measure the signal-to-quantization-noise ratio (SQNR) as a layer-wise proxy for final network performance. Figure 4 shows that the network trained with BatchNorm is much more resilient to quantization error. Since quantization error is related to the shape of the distributions, we also measured how the tail of the activation distribution changes when training with BatchNorm. The metric used is the 4th moment of the distribution, and the result is shown in Figure 5. Figure 5 demonstrates that the BatchNorm-trained network has activations with a much lighter tail, indicating that those activations are much more easily quantized.


Figure 4: SQNR vs Convolution Layer for VGG neural network. Two models are trained with and without BatchNorm to the same floating-point validation accuracy. However, the BatchNorm-trained network has vastly superior quantization performance. We use the SQNR metric instead of final network performance to demonstrate the layer-wise noise degradation.


Figure 5: 4th moment of the activation distributions for the VGG networks trained with and without BatchNorm. The BatchNorm-trained network has activations with a much lighter tail, indicating that they are more easily quantized.
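
The helper below sketches how such layer-wise statistics can be gathered: forward hooks capture each convolution's output, and we report its SQNR under 8-bit uniform quantization together with its normalized 4th moment (kurtosis) as a tail-weight proxy. The model and input batch are placeholders, not our experimental setup.

```python
# Hedged sketch: gather per-layer activation statistics with forward hooks and
# report SQNR under 8-bit uniform quantization plus the normalized 4th moment.
import torch
import torch.nn as nn
import torchvision.models as models

def quantize_uniform(x, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max() / qmax
    return torch.clamp(torch.round(x / scale), -qmax, qmax) * scale

def sqnr_db(x, x_q):
    return 10 * torch.log10((x ** 2).sum() / ((x - x_q) ** 2).sum())

def fourth_moment(x):
    x = x.flatten()
    return ((x - x.mean()) ** 4).mean() / x.std() ** 4

def layerwise_stats(model, batch):
    stats, handles = {}, []
    def make_hook(name):
        def hook(module, inputs, output):
            stats[name] = (sqnr_db(output, quantize_uniform(output)).item(),
                           fourth_moment(output).item())
        return hook
    for name, m in model.named_modules():
        if isinstance(m, nn.Conv2d):
            handles.append(m.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(batch)
    for h in handles:
        h.remove()
    return stats   # {layer name: (SQNR in dB, 4th moment)}

# Example with a stand-in CNN; a VGG-style model would be probed the same way.
print(layerwise_stats(models.vgg11(), torch.randn(1, 3, 224, 224)))
```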

Quantization with Normalization in the Smart Hospital (Ongoing)

Moving from ImageNet to Smart Hospital / Application-Specific Data

Standard datasets such as CIFAR-10 and ImageNet enable rapid development of new techniques and architectures and facilitate more level comparisons of ideas across the neural network community. Many of the models and techniques validated on standard datasets are applicable to the Smart Hospital project. However, as the earlier quantization experiments showed, one must consider how the distribution of the new dataset changes and calibrate accordingly.

We applied quantization techniques [2] as well as error corrections to the Smart Hospital model. The model under test uses a ResNet-18 backbone for feature extraction and an LSTM backend for activity recognition. The LSTM backend infers the activity from the features of a 64-frame image sequence.
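
A hedged sketch of this backbone-plus-LSTM structure is given below; the hidden size, the 3-channel input, and the four-way classification head are illustrative assumptions rather than the exact configuration of the Smart Hospital model.

```python
# Hedged sketch of the described structure: a ResNet-18 feature extractor whose
# per-frame features feed an LSTM over a 64-frame clip. Sizes are illustrative.
import torch
import torch.nn as nn
import torchvision.models as models

class ActivityRecognizer(nn.Module):
    def __init__(self, num_activities=4, hidden_size=256):
        super().__init__()
        backbone = models.resnet18()          # depth inputs may need a 1-channel stem
        backbone.fc = nn.Identity()           # expose the 512-d pooled features
        self.backbone = backbone
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_activities)

    def forward(self, frames):                # frames: (B, T, C, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])          # classify from the final time step

model = ActivityRecognizer()
logits = model(torch.randn(1, 64, 3, 224, 224))   # one 64-frame clip
```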

One of the first challenges we observed was the quantization of the input data. The system in [1] uses images from a depth camera (see Figure 6), which are immediately and visibly different from typical JPEG images such as the one in Figure 7.

Figure 6: Depth Camera Image Sample
Figure 7: ImageNet Image Sample

In applying quantization techniques to the Smart Hospital algorithms, even though the underlying model architecture is the same as the ImageNet ResNet-18 architecture, the activation distributions are noticeably different because the Smart Hospital algorithm has been fine-tuned on the depth camera data. This difference is most stark for the input images: if one does not properly calibrate for the distributions of the Smart Hospital input data and weights, all the useful information in the input image is removed by the quantization operation in the first layer (compare Figure 8 with Figure 9).

Figure 8: Unquantized Data
Figure 9: Quantized Data
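
The sketch below shows one simple way to do such input calibration: estimate the input quantizer's clipping range from the observed depth-value distribution (here via percentiles on synthetic depth-like data) instead of assuming an ImageNet-style range. The percentile choices and data shapes are illustrative assumptions.

```python
# Hedged sketch of input calibration: derive the quantizer's clipping range from
# the observed depth-value distribution rather than an ImageNet-style range.
import numpy as np

def calibrate_input_range(sample_batches, lower_pct=0.1, upper_pct=99.9):
    """Estimate a clipping range from a few batches of depth images."""
    vals = np.concatenate([b.ravel() for b in sample_batches])
    return np.percentile(vals, lower_pct), np.percentile(vals, upper_pct)

def quantize_input(x, vmin, vmax, num_bits=8):
    levels = 2 ** num_bits - 1
    scale = (vmax - vmin) / levels
    q = np.clip(np.round((x - vmin) / scale), 0, levels)
    return q * scale + vmin                    # dequantized, clipped input

# Synthetic "depth-like" data (skewed, in millimeters) standing in for real frames.
batches = [np.random.gamma(2.0, 800.0, size=(4, 1, 240, 320)) for _ in range(8)]
vmin, vmax = calibrate_input_range(batches)
x_q = quantize_input(batches[0], vmin, vmax)
```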

Having properly calibrated for the different distribution of the input data, we applied several variants of 8-bit quantization (see Figure 10). We started with very fine-grained quantization, in which each channel of a convolution layer has its own set of scaling parameters. While this led to very little performance degradation, this type of quantization typically requires extra hardware and is not supported by all edge devices. We therefore applied layer-wise quantization, in which the quantization parameters are shared by all channels in a given layer. Naturally, network performance degraded; interestingly, however, the degradation was not equal across the four activities. For two of the four activities, performance degraded so much that it was indistinguishable from a network with completely random weights. We then applied bias correction techniques, which restored some of the lost network performance.

Quantization Type                    Activity 0/4   Activity 1/4   Activity 2/4   Activity 3/4
Float reference                      0.674094       0.451947       0.588575       0.665778
8-bit, per channel                   0.738046       0.418778       0.637560       0.673258
8-bit, per layer                     0.484788       0.466534       0.040068       0.0120136
8-bit, per layer, bias correction    0.654493       0.355384       0.143746       0.292883
Completely random weights            0.041191       0.050332       0.032721       0.019223

Figure 10: Preliminary Results from Quantization of Smart Hospital Algorithm on Simulation Data
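
To make the per-channel versus per-layer distinction and the bias-correction idea concrete, the sketch below quantizes a convolution's weights with either one scale per output channel or one scale per layer, then applies an empirical bias correction that subtracts the mean output shift measured on a small calibration batch. This is a simplified illustration in the spirit of [2]-style toolkits, not the exact procedure behind the results in Figure 10.

```python
# Hedged sketch: per-channel vs. per-layer weight quantization plus empirical
# bias correction (subtract the mean output shift caused by weight rounding,
# measured on a small calibration batch). Shapes and names are illustrative.
import torch
import torch.nn.functional as F

def quantize_weights(w, num_bits=8, per_channel=True):
    qmax = 2 ** (num_bits - 1) - 1
    if per_channel:
        # One scale per output channel (finer grained, more hardware).
        scale = w.abs().amax(dim=(1, 2, 3), keepdim=True) / qmax
    else:
        # One scale shared by the whole layer.
        scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

def bias_correction(conv, w_q, calib_batch):
    """Shift conv.bias by the mean error the quantized weights introduce."""
    with torch.no_grad():
        err = F.conv2d(calib_batch, w_q - conv.weight,
                       stride=conv.stride, padding=conv.padding)
        conv.bias -= err.mean(dim=(0, 2, 3))   # per-output-channel correction

# Illustrative usage (assumes the conv has a bias, e.g. after BatchNorm folding).
conv = torch.nn.Conv2d(64, 64, 3, padding=1, bias=True)
calib = torch.randn(8, 64, 28, 28)
w_q = quantize_weights(conv.weight.data, per_channel=False)
bias_correction(conv, w_q, calib)
conv.weight.data = w_q
```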

Corrections such as bias correction are able to mitigate this very uneven degradation, but they did not restore network performance completely. A core architectural departure from the standard ResNet-18 architecture is that our inference results are produced by an LSTM in the algorithm backend. Unlike standard CNN architectures, which are effectively directed acyclic graphs (DAGs), LSTMs are a form of recurrent neural network (RNN). RNNs are characterized by feedback loops that accumulate the quantization errors from the CNN frontend. Further work is needed to understand how to suppress the accumulation of this error in the LSTM.
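
As a toy illustration of this accumulation effect (random weights and synthetic features, not the actual Smart Hospital model), the snippet below feeds an LSTM a clean 64-step feature sequence and a copy perturbed by frontend-style noise, then compares the hidden-state divergence at early and late time steps; later steps carry noise injected at every earlier frame.

```python
# Toy illustration: hidden-state divergence of an LSTM when its input features
# are perturbed by frontend-style noise. Random weights, synthetic features.
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=512, hidden_size=256, batch_first=True)
feats = torch.randn(1, 64, 512)                      # stand-in for clean CNN features
noisy = feats + 0.05 * torch.randn_like(feats)       # stand-in for quantization error

with torch.no_grad():
    clean_out, _ = lstm(feats)
    noisy_out, _ = lstm(noisy)

# Divergence per time step; later steps reflect noise injected at all earlier frames.
drift = (clean_out - noisy_out).norm(dim=-1).squeeze(0)
print("early steps:", drift[:3], "late steps:", drift[-3:])
```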

Ongoing Work

Quantization is critical to reducing the model size and meeting the power and latency constraints of edge hardware. However, quantization comes with its own challenges, as it introduces errors that degrade overall classification performance. For the Smart Hospital algorithm under test, the backend uses an LSTM architecture, whose recurrent structure allows it to perform classification over video data but also accumulates quantization error from the convolutional frontend. We will continue to investigate techniques and the latest trends in custom hardware development [3], not just to reduce error in the CNN frontend but also to suppress the accumulated error in the LSTM backend.

We will merge the compressed model with the single-room experimental demo system. Last year we performed a baseline survey of off-the-shelf commercial solutions and determined the necessary parts for the demo system; however, that survey used off-the-shelf algorithms similar to the frontend of the Smart Hospital model. The next step is to implement the compressed model on the edge hardware and server backend. We will then verify that we can meet the power constraints and study the bandwidth limitations, which will give us insight into the fundamental limits of data reduction and the scaling limits across building network infrastructure.

References

[1] A. Haque, M. Guo, A. Alahi, S. Yeung, Z. Luo, A. Rege, J. Jopling, L. Downing, W. Beninati, A. Singh, T. Platchek, A. Milstein, and L. Fei-Fei, “Towards Vision-Based Smart Hospitals: A System for Tracking and Monitoring Hand Hygiene Compliance,” arXiv 1708.00163, Aug. 2017.
[2] N. Zmora, G. Jacob, L. Zlotnik, B. Elharar and G. Novik. “Neural Network Distiller”, 2018.
[3] Y-H. Chen, T. Yang, J. Emer, and V. Sze. "Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices." IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2019.

Original Research Proposal

Research Proposal

Final Project Report

TR244

Funding Year: 2020