ST-CNN: Spatial-Temporal Convolutional Neural Network for crowd counting in videos

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

ST-CNN: Spatial-Temporal Convolutional Neural Network for crowd counting in videos. / Miao, Y.; Han, J.; Gao, Y. et al.
In: Pattern Recognition Letters, Vol. 125, 01.07.2019, p. 113-118.


Vancouver

Miao Y, Han J, Gao Y, Zhang B. ST-CNN: Spatial-Temporal Convolutional Neural Network for crowd counting in videos. Pattern Recognition Letters. 2019 Jul 1;125:113-118. Epub 2019 Apr 16. doi: 10.1016/j.patrec.2019.04.012

Author

Miao, Y.; Han, J.; Gao, Y. et al. / ST-CNN: Spatial-Temporal Convolutional Neural Network for crowd counting in videos. In: Pattern Recognition Letters. 2019; Vol. 125, pp. 113-118.

Bibtex

@article{6a1a42ad240d4e8d96be99e089a50c9c,
title = "ST-CNN: Spatial-Temporal Convolutional Neural Network for crowd counting in videos",
abstract = "The task of crowd counting and density map estimation from videos is challenging due to severe occlusions, scene perspective distortions and diverse crowd distributions. Conventional deep-learning-based crowd counting methods process each video frame independently, with no consideration of the intrinsic temporal correlation among neighboring frames, thus leaving performance below the level required by real-world applications. To overcome this shortcoming, a new end-to-end deep architecture named Spatial-Temporal Convolutional Neural Network (ST-CNN) is proposed, which unifies a 2D convolutional neural network (C2D) and a 3D convolutional neural network (C3D) to learn spatial-temporal features in the same framework. On top of that, a merging scheme is applied to the resulting density maps, taking advantage of the spatial and temporal information simultaneously for the crowd counting task. Experimental results on two benchmark datasets, the Mall dataset and the WorldExpo'10 dataset, show that our ST-CNN outperforms state-of-the-art models in terms of mean absolute error (MAE) and mean squared error (MSE).",
keywords = "Crowd analysis, Crowd counting, Spatio-temporal feature, Convolution, Deep learning, Mean square error, Convolutional neural network, Learning techniques, Perspective distortion, Spatial-temporal features, Spatio temporal features, Temporal correlations, Neural networks",
author = "Y. Miao and J. Han and Y. Gao and B. Zhang",
year = "2019",
month = jul,
day = "1",
doi = "10.1016/j.patrec.2019.04.012",
language = "English",
volume = "125",
pages = "113--118",
journal = "Pattern Recognition Letters",
issn = "0167-8655",
publisher = "Elsevier Science B.V.",
}
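The abstract's evaluation criteria, mean absolute error (MAE) and mean squared error (MSE) over per-frame crowd counts, can be sketched in plain Python; the function and variable names here are illustrative, not from the paper.

```python
# Minimal sketch of the two metrics named in the abstract, computed
# over predicted vs. ground-truth crowd counts for a set of frames.

def mae(pred_counts, true_counts):
    """Mean absolute error: average of |prediction - ground truth|."""
    n = len(pred_counts)
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / n

def mse(pred_counts, true_counts):
    """Mean squared error: average of (prediction - ground truth)^2."""
    n = len(pred_counts)
    return sum((p - t) ** 2 for p, t in zip(pred_counts, true_counts)) / n

pred = [30.5, 28.0, 41.2]  # hypothetical per-frame predictions
true = [31.0, 27.0, 40.0]  # hypothetical ground-truth counts
print(round(mae(pred, true), 4))  # → 0.9
print(round(mse(pred, true), 4))  # → 0.8967
```

Lower values of both metrics indicate better counting accuracy; MSE penalizes large per-frame errors more heavily than MAE.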

RIS

TY - JOUR

T1 - ST-CNN: Spatial-Temporal Convolutional Neural Network for crowd counting in videos

AU - Miao, Y.

AU - Han, J.

AU - Gao, Y.

AU - Zhang, B.

PY - 2019/7/1

Y1 - 2019/7/1

N2 - The task of crowd counting and density map estimation from videos is challenging due to severe occlusions, scene perspective distortions and diverse crowd distributions. Conventional deep-learning-based crowd counting methods process each video frame independently, with no consideration of the intrinsic temporal correlation among neighboring frames, thus leaving performance below the level required by real-world applications. To overcome this shortcoming, a new end-to-end deep architecture named Spatial-Temporal Convolutional Neural Network (ST-CNN) is proposed, which unifies a 2D convolutional neural network (C2D) and a 3D convolutional neural network (C3D) to learn spatial-temporal features in the same framework. On top of that, a merging scheme is applied to the resulting density maps, taking advantage of the spatial and temporal information simultaneously for the crowd counting task. Experimental results on two benchmark datasets, the Mall dataset and the WorldExpo'10 dataset, show that our ST-CNN outperforms state-of-the-art models in terms of mean absolute error (MAE) and mean squared error (MSE).

AB - The task of crowd counting and density map estimation from videos is challenging due to severe occlusions, scene perspective distortions and diverse crowd distributions. Conventional deep-learning-based crowd counting methods process each video frame independently, with no consideration of the intrinsic temporal correlation among neighboring frames, thus leaving performance below the level required by real-world applications. To overcome this shortcoming, a new end-to-end deep architecture named Spatial-Temporal Convolutional Neural Network (ST-CNN) is proposed, which unifies a 2D convolutional neural network (C2D) and a 3D convolutional neural network (C3D) to learn spatial-temporal features in the same framework. On top of that, a merging scheme is applied to the resulting density maps, taking advantage of the spatial and temporal information simultaneously for the crowd counting task. Experimental results on two benchmark datasets, the Mall dataset and the WorldExpo'10 dataset, show that our ST-CNN outperforms state-of-the-art models in terms of mean absolute error (MAE) and mean squared error (MSE).

KW - Crowd analysis

KW - Crowd counting

KW - Spatio-temporal feature

KW - Convolution

KW - Deep learning

KW - Mean square error

KW - Convolutional neural network

KW - Learning techniques

KW - Perspective distortion

KW - Spatial-temporal features

KW - Spatio temporal features

KW - Temporal correlations

KW - Neural networks

U2 - 10.1016/j.patrec.2019.04.012

DO - 10.1016/j.patrec.2019.04.012

M3 - Journal article

VL - 125

SP - 113

EP - 118

JO - Pattern Recognition Letters

JF - Pattern Recognition Letters

SN - 0167-8655

ER -
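The abstract describes merging the density maps produced by the 2D and 3D branches and deriving the crowd count from the result. As a hedged illustration only (the paper's actual merging scheme is not specified here), an element-wise weighted average followed by integration over the map might look like this; all names and the 0.5 weight are assumptions for the sketch.

```python
# Illustrative sketch: combine two density maps (e.g. from a C2D branch
# and a C3D branch) and read off a crowd count as the sum over the
# merged map. The weighted average is an assumed merging rule, not
# necessarily the one used in the paper.

def merge_density_maps(map_2d, map_3d, weight=0.5):
    """Element-wise weighted average of two equally sized 2D density maps."""
    return [
        [weight * a + (1 - weight) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(map_2d, map_3d)
    ]

def count_from_density(density_map):
    """The crowd count is the integral (sum) of the density map."""
    return sum(sum(row) for row in density_map)

m2d = [[0.1, 0.2], [0.3, 0.4]]  # toy density map from the 2D branch
m3d = [[0.3, 0.2], [0.1, 0.0]]  # toy density map from the 3D branch
merged = merge_density_maps(m2d, m3d)
print(round(count_from_density(merged), 6))  # → 0.8
```

Summing a density map to obtain a count is the standard convention in density-map-based crowd counting, which is why the merging step operates on maps rather than on scalar counts.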