The goal of fine-grained action recognition is to successfully
discriminate between action categories with subtle differences. To tackle
this, we draw inspiration from the human visual system, which contains
specialized brain regions that are dedicated to handling specific
tasks. We design a novel Dynamic Spatio-Temporal Specialization
(DSTS) module, which consists of specialized neurons that are only activated
for a subset of samples that are highly similar. During training,
the loss forces the specialized neurons to learn discriminative fine-grained
differences to distinguish between these similar samples, improving fine-grained
recognition. Moreover, a spatio-temporal specialization method
further optimizes the architectures of the specialized neurons to capture
either more spatial or more temporal fine-grained information, to better
tackle the large range of spatio-temporal variations in the videos. Lastly,
we design an Upstream-Downstream Learning algorithm to optimize our
model’s dynamic decisions during training, improving the performance
of our DSTS module. We obtain state-of-the-art performance on two
widely used fine-grained action recognition datasets.
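
As a toy illustration of the idea that a specialized neuron is activated only for a subset of highly similar samples, the sketch below routes each sample to the neurons whose prototype it closely matches. This is a simplified sketch under stated assumptions, not the DSTS module itself: the prototype vectors, the cosine-similarity gate, the threshold value, and the names `cosine` and `active_neurons` are all hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def active_neurons(sample, prototypes, threshold=0.9):
    """Indices of the hypothetical specialized neurons that fire for this sample.

    Assumption: a neuron fires when the sample is close to that neuron's
    prototype. The gating rule used in the actual method may differ.
    """
    return [i for i, p in enumerate(prototypes) if cosine(sample, p) >= threshold]


# Two hypothetical specialized neurons, each tuned to one direction in feature space.
prototypes = [[1.0, 0.0], [0.0, 1.0]]

# Samples "a" and "b" are highly similar to each other; "c" is not similar to either.
samples = {"a": [0.9, 0.1], "b": [1.0, 0.05], "c": [0.1, 0.95]}

routing = {name: active_neurons(x, prototypes) for name, x in samples.items()}
print(routing)
```

In this toy setting the two similar samples share one neuron while the dissimilar sample activates the other, so any loss applied to the shared neuron must separate samples that already look alike.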