Electronic data

ispa18
Rights statement: ©2018 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Accepted author manuscript, 1.57 MB, PDF document
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Text available via DOI: https://doi.org/10.1109/BDCloud.2018.00110 (final published version)

To Compress, or Not to Compress: Characterizing Deep Learning Model Compression for Embedded Inference

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

To Compress, or Not to Compress: Characterizing Deep Learning Model Compression for Embedded Inference. / Qing, Qin; Yu, Jialong; Ren, Jie et al.
The 16th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA). IEEE, 2018. p. 729-736.

Harvard

Qing, Q, Yu, J, Ren, J, Gao, L, Wang, H, Zheng, J, Feng, Y, Fang, J & Wang, Z 2018, To Compress, or Not to Compress: Characterizing Deep Learning Model Compression for Embedded Inference. in The 16th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA). IEEE, pp. 729-736, 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications, Melbourne, Australia, 11/12/18. https://doi.org/10.1109/BDCloud.2018.00110

APA

Qing, Q., Yu, J., Ren, J., Gao, L., Wang, H., Zheng, J., Feng, Y., Fang, J., & Wang, Z. (2018). To Compress, or Not to Compress: Characterizing Deep Learning Model Compression for Embedded Inference. In The 16th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA) (pp. 729-736). IEEE. https://doi.org/10.1109/BDCloud.2018.00110

Vancouver

Qing Q, Yu J, Ren J, Gao L, Wang H, Zheng J et al. To Compress, or Not to Compress: Characterizing Deep Learning Model Compression for Embedded Inference. In The 16th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA). IEEE. 2018. p. 729-736 doi: 10.1109/BDCloud.2018.00110

Author

Qing, Qin ; Yu, Jialong ; Ren, Jie et al. / To Compress, or Not to Compress : Characterizing Deep Learning Model Compression for Embedded Inference. The 16th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA). IEEE, 2018. pp. 729-736

Bibtex

@inproceedings{8e12311f8b1c45d68735db6f352c8423,

title = "To Compress, or Not to Compress: Characterizing Deep Learning Model Compression for Embedded Inference",

abstract = "The recent advances in deep neural networks (DNNs) make them attractive for embedded systems. However, it can take a long time for DNNs to make an inference on resource-constrained computing devices. Model compression techniques can address the computation issue of deep inference on embedded devices. These techniques are highly attractive, as they do not rely on specialized hardware, or computation-offloading that is often infeasible due to privacy concerns or high latency. However, it remains unclear how model compression techniques perform across a wide range of DNNs. To design efficient embedded deep learning solutions, we need to understand their behaviors. This work develops a quantitative approach to characterize model compression techniques on a representative embedded deep learning architecture, the NVIDIA Jetson TX2. We perform extensive experiments by considering 11 influential neural network architectures from the image classification and the natural language processing domains. We experimentally show how two mainstream compression techniques, data quantization and pruning, perform on these network architectures and the implications of these techniques for the model storage size, inference time, energy consumption and performance metrics. We demonstrate that there are opportunities to achieve fast deep inference on embedded systems, but one must carefully choose the compression settings. Our results provide insights on when and how to apply model compression techniques and guidelines for designing efficient embedded deep learning systems.",

author = "Qin Qing and Jialong Yu and Jie Ren and Ling Gao and Hai Wang and Jie Zheng and Yansong Feng and Jianbin Fang and Zheng Wang",

note = "{\textcopyright}2018 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.; 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications, (ISPA/IUCC/BDCloud/SocialCom/SustainCom) ; Conference date: 11-12-2018 Through 13-12-2018",

year = "2018",

month = dec,

day = "11",

doi = "10.1109/BDCloud.2018.00110",

language = "English",

isbn = "9781728111421",

pages = "729--736",

booktitle = "The 16th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA)",

publisher = "IEEE",

}

RIS

TY - GEN

T1 - To Compress, or Not to Compress

T2 - 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications

AU - Qing, Qin

AU - Yu, Jialong

AU - Ren, Jie

AU - Gao, Ling

AU - Wang, Hai

AU - Zheng, Jie

AU - Feng, Yansong

AU - Fang, Jianbin

AU - Wang, Zheng

N1 - ©2018 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

PY - 2018/12/11

Y1 - 2018/12/11

AB - The recent advances in deep neural networks (DNNs) make them attractive for embedded systems. However, it can take a long time for DNNs to make an inference on resource-constrained computing devices. Model compression techniques can address the computation issue of deep inference on embedded devices. These techniques are highly attractive, as they do not rely on specialized hardware, or computation-offloading that is often infeasible due to privacy concerns or high latency. However, it remains unclear how model compression techniques perform across a wide range of DNNs. To design efficient embedded deep learning solutions, we need to understand their behaviors. This work develops a quantitative approach to characterize model compression techniques on a representative embedded deep learning architecture, the NVIDIA Jetson TX2. We perform extensive experiments by considering 11 influential neural network architectures from the image classification and the natural language processing domains. We experimentally show how two mainstream compression techniques, data quantization and pruning, perform on these network architectures and the implications of these techniques for the model storage size, inference time, energy consumption and performance metrics. We demonstrate that there are opportunities to achieve fast deep inference on embedded systems, but one must carefully choose the compression settings. Our results provide insights on when and how to apply model compression techniques and guidelines for designing efficient embedded deep learning systems.

U2 - 10.1109/BDCloud.2018.00110

DO - 10.1109/BDCloud.2018.00110

M3 - Conference contribution/Paper

SN - 9781728111421

SP - 729

EP - 736

BT - The 16th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA)

PB - IEEE

Y2 - 11 December 2018 through 13 December 2018

ER -
