Let’s assume we have a series of images that have been augmented as above. If we train only on the original images, the neural network fails to predict well on the augmented ones.
One more thing to note is that such augmentations are hand-crafted. What we ultimately want might be for the models to figure out how to handle such variations on their own.
Q. How can we overcome/achieve the aforementioned points?
A. Let’s make the convolution deformable (i.e. remove the “fixed” geometric structure of CNN filters) so it can learn various shapes of features even without actual augmentation!
What is Deformable Convolution?
- Usual convolution takes all “direct” neighbors of a pixel and computes a linear combination of them.
- “Deformable” convolution does not merely take the fixed neighbors. It learns which locations to sample by adding 2D offsets to the regular sampling grid of the standard convolution.
- It’s somewhat like Spatial Transformer Networks in that it can also handle scaled, rotated, and shifted objects.
How does D-Conv work?
- Δp_n is the key here. A standard convolution computes y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n) over a fixed grid R; deformable convolution instead computes y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n + Δp_n), so we index adjacent pixels with the learned offsets taken into account.
- So how do we get these offsets? An offset field is produced from the same input feature map by a separate standard convolution layer.
- One thing to note is that Δp_n is usually fractional. We could round to the nearest pixel, but bilinear interpolation is used instead, which also keeps the operation differentiable with respect to the offsets.
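The bullets above can be sketched as follows. This is a minimal NumPy illustration of computing a single deformable-convolution output value y(p_0) for a 3×3 kernel; the `offsets` argument is a placeholder for what the offset-predicting convolution layer would produce, not a learned field:

```python
import numpy as np

def bilinear(x, py, px):
    """Bilinearly interpolate feature map x at fractional location (py, px)."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    # Clip to the valid range (out-of-bounds samples read the border pixel)
    y0c, y1c = np.clip([y0, y0 + 1], 0, h - 1)
    x0c, x1c = np.clip([x0, x0 + 1], 0, w - 1)
    wy1, wx1 = py - y0, px - x0
    wy0, wx0 = 1 - wy1, 1 - wx1
    return (wy0 * wx0 * x[y0c, x0c] + wy0 * wx1 * x[y0c, x1c]
            + wy1 * wx0 * x[y1c, x0c] + wy1 * wx1 * x[y1c, x1c])

def deform_conv_pixel(x, w, p0, offsets):
    """y(p0) = sum_n w(p_n) * x(p0 + p_n + Δp_n) over a 3x3 grid R.

    offsets has shape (9, 2): one fractional (Δy, Δx) per grid position.
    """
    r = w.shape[0] // 2
    out, n = 0.0, 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            oy, ox = offsets[n]
            # Sample at the regular grid point shifted by the learned offset
            out += w[dy + r, dx + r] * bilinear(x, p0[0] + dy + oy, p0[1] + dx + ox)
            n += 1
    return out
```

With all-zero offsets this reduces exactly to a standard 3×3 convolution at p_0; nonzero fractional offsets sample between pixels via the bilinear kernel.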
Deformable RoI Pooling
- It’s pretty much the same idea as D-Conv, except that each bin of the RoI pooling grid gets its own offset. (Much more memory intensive!)
Great Figures to Intuitively Understand D-Conv
- The red dots represent the sampling locations of the receptive field. What an interesting visualization!
- This shows how the effective receptive field adapts its size to the object size.