
Edge Computing Hardware for Smart Buildings

Project Team

Boris Murmann, Elaina Chai
 

Overview

The astonishing progress in deep learning and inference techniques has opened up new capabilities and possibilities for vision-based smart buildings. However, the state of the art in computer hardware has not kept up with the increasing computational and data requirements of these vision algorithms. As a result, existing real-time systems are not scalable, because unmanageable amounts of raw data need to be piped between local/edge sensors and remote servers in smart building applications. We propose a custom solution for vision-based smart buildings. By leveraging edge processing, the solution can use cascaded classifiers to reduce this data deluge, and it opens up the possibility of online training and adaptation of the smart vision algorithms at the edge. Because processing so close to the sensor is subject to resource constraints such as power and latency, this solution will also require an assessment of the practicality of vision-based smart building systems, and a study of how to optimally partition and distribute these increasingly powerful vision-based machine learning algorithms between local and server processing.

Project Background

In recent work at Stanford, [5] (Albert Haque, Serena Yeung, Arnold Milstein, Fei-Fei Li et al.) looked at vision-based smart hospitals and demonstrated a system for monitoring the staff’s hand hygiene as a specific example. While the achieved results are very impressive, transmitting all of the raw remote sensor data creates a data deluge. Taking such a system from a proof-of-concept demonstrator to a scalable real-time solution for the real world will therefore require more than raw improvements in hardware speed and efficiency, which can no longer be counted on given the end of Moore’s law (and of the less well-known Dennard scaling). Instead, it is widely agreed that major improvements in hardware will need to come from application-centric optimization in the future [8], necessitating considerable progress in hardware/software co-design.

Research Objectives

1. Acquire and analyze the datasets and algorithms used in the work of [5]. Duplicate their results in software and understand the key challenges. Analyze the complexity of the deployed machine learning algorithms and update them based on the most recent findings in this fast-paced field.

2. Study the mapping of the system’s most basic machine learning tasks/kernels onto hardware that can be deployed local to the sensor. Compare GPU and FPGA implementations in terms of power dissipation and size.

3. Devise a scheme that partitions the machine learning tasks between local processing and server processing to enable efficient data fusion and improved classification performance. Study cascaded classifiers to provide large compute power locally on demand.

4. Deploy a small (single-room) experimental system to perform tests with real-time data on the trained machine learning algorithms.

5. Perform a study on the fundamental limits of data and power reduction for future implementations on custom silicon chips.
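
The cascaded-classifier idea in objective 3 can be illustrated with a minimal sketch. It assumes a two-stage cascade in which a cheap model at the edge handles confident cases and defers uncertain ones to a larger model on the server. The names `run_cascade`, `toy_edge_model`, `toy_server_model` and the 0.9 threshold are hypothetical and chosen only for illustration; they are not taken from the project.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# A classifier maps a feature vector to (label, confidence in [0, 1]).
Classifier = Callable[[Sequence[float]], Tuple[str, float]]


@dataclass
class CascadeResult:
    label: str
    confidence: float
    stage: str  # "edge" or "server"


def run_cascade(features: Sequence[float],
                edge_model: Classifier,
                server_model: Classifier,
                threshold: float = 0.9) -> CascadeResult:
    """Use the cheap edge model first; escalate only when it is unsure."""
    label, confidence = edge_model(features)
    if confidence >= threshold:
        return CascadeResult(label, confidence, "edge")
    # Low confidence: pay the cost of sending the features to the server.
    label, confidence = server_model(features)
    return CascadeResult(label, confidence, "server")


# Toy stand-ins for real models (hypothetical, for illustration only).
def toy_edge_model(features: Sequence[float]) -> Tuple[str, float]:
    score = sum(features) / len(features)
    label = "activity" if score >= 0.5 else "background"
    # Confidence grows with the distance from the decision boundary.
    return label, abs(score - 0.5) * 2


def toy_server_model(features: Sequence[float]) -> Tuple[str, float]:
    label = "activity" if max(features) >= 0.5 else "background"
    return label, 0.99


if __name__ == "__main__":
    print(run_cascade([0.95, 0.98, 1.0], toy_edge_model, toy_server_model))
    print(run_cascade([0.4, 0.6, 0.5], toy_edge_model, toy_server_model))
```

In such a scheme, the threshold trades network traffic against accuracy: a higher value sends more inputs to the server, a lower value keeps more decisions local.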

Research Update:

Assessing Current Hardware Capabilities

Over the course of the last year, we met with our collaborator Serena Yeung to detail the design requirements of the Smart Hospital Project. We analyzed the datasets and algorithms used in her work in [5] and studied current hardware capabilities. Then we outlined hard performance requirements for real-time deployment, and began the hardware/software co-design process for current and future algorithms and tasks. For the activity recognition tasks described in [5], the hardware setup was as follows. Fifty depth cameras captured image data that was then sent over the hospital network. The network fed this data to a backend server (see Figure 2) that processed the images offline. The bandwidth of this network was determined to be one of the key roadblocks to realizing a scalable real-time solution.

For the proposed scalable real-time hardware solution, we decided to partition the backend algorithms between local/edge processing and server processing. Instead of sending raw camera images over the hospital network, we send the much smaller extracted features. This has two key benefits. First, it mitigates the data deluge and therefore increases the scalability of the hardware solution. Second, the extracted features carry less identifying information about individuals, which increases privacy in the event of a network breach. The partitioning of the algorithms is shown in Figure 3, using a Xilinx device as an example of edge processing hardware. To enable real-time performance, we need to ensure an edge processing latency of at most 100 ms per image.
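
To make the data-reduction argument concrete, the following back-of-the-envelope sketch compares the aggregate network load of raw depth frames with that of extracted feature vectors. Only the camera count (50) comes from the setup described above. The frame resolution, pixel depth, frame rate, and feature size are assumptions chosen for illustration, not measured values from the project.

```python
NUM_CAMERAS = 50  # from the deployment described in [5]

# --- Assumptions (hypothetical values, for illustration only) ---
FRAME_WIDTH = 320            # assumed depth-frame width in pixels
FRAME_HEIGHT = 240           # assumed depth-frame height in pixels
BYTES_PER_DEPTH_PIXEL = 2    # assumed 16-bit depth values
FRAMES_PER_SECOND = 10       # assumed; matches one image per 100 ms
FEATURE_DIM = 2048           # assumed: one pooled ResNet50 feature vector
BYTES_PER_FEATURE = 4        # assumed 32-bit floats


def stream_mbps(bytes_per_frame: int, fps: int, num_streams: int) -> float:
    """Aggregate bandwidth in megabits per second for identical streams."""
    return bytes_per_frame * 8 * fps * num_streams / 1e6


raw_bytes = FRAME_WIDTH * FRAME_HEIGHT * BYTES_PER_DEPTH_PIXEL
feature_bytes = FEATURE_DIM * BYTES_PER_FEATURE

raw_mbps = stream_mbps(raw_bytes, FRAMES_PER_SECOND, NUM_CAMERAS)
feature_mbps = stream_mbps(feature_bytes, FRAMES_PER_SECOND, NUM_CAMERAS)
reduction = raw_bytes / feature_bytes

if __name__ == "__main__":
    print(f"raw frames:  {raw_mbps:.1f} Mbit/s")
    print(f"features:    {feature_mbps:.1f} Mbit/s")
    print(f"reduction:   {reduction:.2f}x")
```

Under these assumed values, the raw streams total about 614 Mbit/s and the feature streams about 33 Mbit/s, a reduction of roughly 19x. The actual ratio depends on the real camera settings and on which layer of the network the features are taken from.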

(Ongoing) Target Tasks and Algorithms

The Smart Hospital project targets tasks for "rich visual recognition". The core tasks under this group that we are interested in are:

  • Activity Recognition [5],[3]
  • Dense Human Pose Estimation [2]
  • Instance Segmentation [2]

While the specific hand hygiene task in [5] required only activity recognition, the goal of the smart hospital project is to expand the capabilities of the system using a combination of the core tasks outlined above. At a high level, the state-of-the-art algorithms for the above tasks consist of two parts: a front-end deep neural network (DNN) that extracts features from the camera images, and a second DNN on the back-end with a recurrent structure. We analyzed state-of-the-art algorithms for these tasks ([5],[3],[2],[6]), looking for opportunities for application-centric optimization, specifically areas of algorithm commonality. These areas of commonality are obvious targets for applying parameter sharing techniques among the core tasks. Parameter sharing techniques are often seen in multi-task learning (MTL) [10]. In MTL, the underlying idea is that by learning shared features between similar tasks, model generalization on the original task is improved. In our case, we would instead use these ideas to reduce redundant feature extraction between DNNs. This can potentially lead to dramatic reductions in overall computation costs at the edge as we expand the core tasks for the project.

Indeed, for each of the state-of-the-art algorithms for the core tasks above, a central commonality is the backbone architecture of the feature extraction front-end. All use a form of a standard convolutional neural network, ResNet [7], as the backbone. Currently, one of Serena Yeung’s networks uses the 18-layer ResNet, known as "ResNet18", as the backbone architecture. Other algorithms use the much larger "ResNet50" architecture as the backbone (see Figure 5). Having identified this commonality in the feature extraction backbone architecture, one of our ongoing research steps is to explore to what extent these ResNet-based front-ends can be successfully combined into a single front-end shared among all core tasks of interest.
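
The potential saving from a shared front-end can be estimated with simple arithmetic. The sketch below compares running one backbone per task against running a single backbone followed by per-task heads. The backbone cost of about 3.8 billion multiply-adds per 224×224 image is the approximate figure reported for the 50-layer network in [7]. The per-task head costs are hypothetical placeholders, and the function names are ours.

```python
from typing import Dict

# Approximate multiply-adds (in billions) for ResNet50 on a 224x224 image,
# as reported in [7]. Other input sizes would scale this figure.
RESNET50_GMACS = 3.8

# Hypothetical per-task head costs in GMACs (placeholders, not measured).
HEAD_GMACS: Dict[str, float] = {
    "activity_recognition": 0.5,
    "dense_pose_estimation": 0.5,
    "instance_segmentation": 0.5,
}


def total_gmacs_separate(backbone_gmacs: float,
                         head_gmacs: Dict[str, float]) -> float:
    """Each task runs its own copy of the backbone plus its head."""
    return sum(backbone_gmacs + cost for cost in head_gmacs.values())


def total_gmacs_shared(backbone_gmacs: float,
                       head_gmacs: Dict[str, float]) -> float:
    """The backbone runs once per image; only the heads are per-task."""
    return backbone_gmacs + sum(head_gmacs.values())


if __name__ == "__main__":
    separate = total_gmacs_separate(RESNET50_GMACS, HEAD_GMACS)
    shared = total_gmacs_shared(RESNET50_GMACS, HEAD_GMACS)
    print(f"separate backbones: {separate:.1f} GMACs per image")
    print(f"shared backbone:    {shared:.1f} GMACs per image")
```

With these placeholder head costs, three tasks cost about 12.9 GMACs per image with separate backbones and about 5.3 GMACs with a shared one. Whether a single shared front-end preserves accuracy on every task is exactly the open question described above, and this estimate says nothing about it.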

 

Hardware Exploration

Due to the large growth of commercial opportunities for deep learning at the edge, many companies are dedicating considerable engineering effort to providing solutions in this space. Examples include Xilinx Zynq MPSoCs, the Nvidia Jetson SoC, and most recently the Google Edge TPU. For any of these edge solutions to be practical, they must be hardware-software co-designed solutions in order to both manage the heavy workloads and adopt the latest developments in the deep neural network algorithmic space. That is, solutions must fundamentally have two parts:

  1. Dedicated hardware to run the core operations of DNNs at low power and low latency.
  2. Software compilers to quickly and seamlessly convert DNN models from popular development frameworks such as TensorFlow [1] and PyTorch [9] into a form optimized for the specific target edge hardware.

In light of this rapidly developing space of hardware-software co-designed solutions, we first established a baseline performance survey of off-the-shelf hardware solutions. We constructed an experiment to compare the offerings of two of the largest companies in this edge hardware space: Xilinx and Nvidia. We targeted chips that had readily available evaluation boards. Additionally, we targeted solutions with publicly available compilers that could deploy our standard version of ResNet50. Due to the commonalities in the backbone architecture that we identified in the previous section, we determined that the standard ResNet50 architecture provided the most representative algorithm for evaluating the off-the-shelf hardware solutions.

While we evaluated multiple hardware solutions, the rest of this update details the results on the lowest-power hardware solutions publicly available at the time of testing: the Ultra96 board, featuring a Xilinx ZU3 chip and using the Xilinx/DeePhi Edge AI Compiler, and the Nvidia Jetson TX2 evaluation board, featuring the Jetson TX2 SoC and using TensorRT 1.0. For the Jetson TX2, we evaluated the system in its Max-Q mode, the most power-efficient setting available. For an example of the measurement setup, see Figure 6.

We found that both hardware solutions easily met the 100 ms latency requirement. Both solutions have a similar steady-state power consumption of a little over 7 W. This power consumption was determined to be suitable for this particular edge application. Both the camera and the edge board will be wall-powered, so we are not power constrained. Additionally, the 7 W power consumption is on the same scale as that of the depth camera (under 5 W). However, the more consistent single-image latency of the Ultra96 board made it the more desirable solution with which to begin development of the single-room experimental system. Additionally, the Xilinx/DeePhi compiler provided practically turn-key compilation of target models from standard ML frameworks for the Ultra96 board.
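
Single-image latency and its consistency can be characterized with a simple timing harness. The sketch below is a generic illustration, not the measurement setup used in this study. It assumes the inference call is wrapped in a zero-argument function (here the hypothetical `infer`), and the function names and summary fields are ours. Only the 100 ms budget comes from the requirement stated earlier.

```python
import statistics
import time
from typing import Callable, Dict, List

LATENCY_BUDGET_MS = 100.0  # per-image requirement stated earlier


def measure_latencies_ms(infer: Callable[[], None],
                         runs: int = 100,
                         warmup: int = 10) -> List[float]:
    """Time repeated single-image inferences, discarding warm-up runs."""
    for _ in range(warmup):
        infer()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - start) * 1e3)
    return samples


def summarize(samples_ms: List[float]) -> Dict[str, object]:
    """Report the average, the spread (consistency), and the worst case."""
    worst = max(samples_ms)
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "stdev_ms": statistics.pstdev(samples_ms),
        "max_ms": worst,
        "meets_budget": worst <= LATENCY_BUDGET_MS,
    }


if __name__ == "__main__":
    # Placeholder workload standing in for a real inference call.
    def infer() -> None:
        sum(i * i for i in range(10_000))

    print(summarize(measure_latencies_ms(infer, runs=20, warmup=2)))
```

A small standard deviation and a worst case close to the mean correspond to the kind of consistent single-image latency discussed above. Checking the budget against the worst case rather than the mean is a conservative design choice for a real-time system.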

(Ongoing) Single-Room Experimental Demo System

Having completed a baseline survey of off-the-shelf commercial solutions, one of the ongoing tasks of this project is the deployment of the single-room experimental system. Parts have been ordered, and we are currently waiting for their arrival. Upon arrival, we will construct a small experimental system using the target depth camera and Ultra96 board. The target depth camera in this case will be the newer ASUS Xtion 2, because the original cameras (ASUS Xtion Pro Live) used in [5] have since been discontinued. The aim is to demonstrate a real-time deployment of the front-end architectures of the algorithms for the core tasks outlined above. From this demo system, we will further study the limits of the system to provide large compute power locally on demand. This will provide further insight into the fundamental limits of data and power reduction for front-end DNN architectures at the edge, and how we can approach these limits with future implementations on custom silicon chips.

References

[1]  Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin et al. "Tensorflow: A system for large-scale machine learning." In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.

[2]  Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. "Dense pose: Dense human pose estimation in the wild." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7297–7306, 2018.

[3]  Joao Carreira and Andrew Zisserman. "Quo vadis, action recognition? a new model and the kinetics dataset." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.

[4]  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. "Imagenet: A large-scale hierarchical image database." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.

[5]  Albert Haque, Michelle Guo, Alexandre Alahi, Serena Yeung, Zelun Luo, Alisha Rege, Jeffrey Jopling, Lance Downing, William Beninati, Amit Singh, et al. "Towards vision-based smart hospitals: A system for tracking and monitoring hand hygiene compliance." arXiv preprint arXiv:1708.00163, 2017.

[6]  Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. "Mask r-cnn." In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.

[7]  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[8]  Trevor Mudge. "The specialization trend in computer hardware: Technical Perspective." Communications of the ACM, 58(4):84–84, 2015.

[9]  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. "Automatic differentiation in pytorch." In NIPS-W, 2017.

[10]  Sebastian Ruder. "An overview of multi-task learning in deep neural networks." arXiv preprint arXiv:1706.05098, 2017.

Original Research Proposal

Final Project Report

CIFE Technical Report TR236

Funding Year: 
2019
Stakeholder Categories: 
Owners
Users
Operators/Facility Managers