Final published version
Research output: Contribution to Journal/Magazine › Journal article › peer-review
}
TY - JOUR
T1 - ST-CNN: Spatial-Temporal Convolutional Neural Network for crowd counting in videos
AU - Miao, Y.
AU - Han, J.
AU - Gao, Y.
AU - Zhang, B.
PY - 2019/7/1
Y1 - 2019/7/1
N2 - The task of crowd counting and density map estimation from videos is challenging due to severe occlusions, scene perspective distortions and diverse crowd distributions. Conventional deep learning methods for crowd counting process each video frame independently, without considering the intrinsic temporal correlation among neighboring frames, which keeps their performance below the level required by real-world applications. To overcome this shortcoming, a new end-to-end deep architecture named Spatial-Temporal Convolutional Neural Network (ST-CNN) is proposed, which unifies a 2D convolutional neural network (C2D) and a 3D convolutional neural network (C3D) to learn spatial-temporal features in the same framework. On top of that, a merging scheme is applied to the resulting density maps, taking advantage of the spatial-temporal information simultaneously for the crowd counting task. Experimental results on two benchmark data sets, the Mall dataset and the WorldExpo′10 dataset, show that our ST-CNN outperforms the state-of-the-art models in terms of mean absolute error (MAE) and mean squared error (MSE).
AB - The task of crowd counting and density map estimation from videos is challenging due to severe occlusions, scene perspective distortions and diverse crowd distributions. Conventional deep learning methods for crowd counting process each video frame independently, without considering the intrinsic temporal correlation among neighboring frames, which keeps their performance below the level required by real-world applications. To overcome this shortcoming, a new end-to-end deep architecture named Spatial-Temporal Convolutional Neural Network (ST-CNN) is proposed, which unifies a 2D convolutional neural network (C2D) and a 3D convolutional neural network (C3D) to learn spatial-temporal features in the same framework. On top of that, a merging scheme is applied to the resulting density maps, taking advantage of the spatial-temporal information simultaneously for the crowd counting task. Experimental results on two benchmark data sets, the Mall dataset and the WorldExpo′10 dataset, show that our ST-CNN outperforms the state-of-the-art models in terms of mean absolute error (MAE) and mean squared error (MSE).
KW - Crowd analysis
KW - Crowd counting
KW - Spatio-temporal feature
KW - Convolution
KW - Deep learning
KW - Mean square error
KW - Convolutional neural network
KW - Learning techniques
KW - Perspective distortion
KW - Spatial-temporal features
KW - Spatio temporal features
KW - Temporal correlations
KW - Neural networks
U2 - 10.1016/j.patrec.2019.04.012
DO - 10.1016/j.patrec.2019.04.012
M3 - Journal article
VL - 125
SP - 113
EP - 118
JO - Pattern Recognition Letters
JF - Pattern Recognition Letters
SN - 0167-8655
ER -