This is the first attempt to use a transformer for object detection.
- Previously, object detection was done with convolutional networks, and major architectures such as Faster R-CNN and YOLO are not fully differentiable end to end, mainly because of the non-maximum suppression (NMS) post-processing step.
- DETR uses a transformer, which improves detection of large objects, and is fully differentiable because it removes the NMS step.
- DETR uses a Hungarian loss, which relies on bipartite matching between ground-truth and predicted boxes. Furthermore, DETR can be naturally extended to instance segmentation by adding a small mask head on top of the decoder output.
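
The bipartite matching step can be illustrated with a minimal sketch. It assumes a simplified matching cost (negative class probability plus L1 box distance) and finds the optimal assignment by brute force over permutations, which is only practical for tiny inputs. A real implementation would typically use the Hungarian algorithm, for example `scipy.optimize.linear_sum_assignment`, and a richer box cost. The names `match_cost` and `best_matching` and the toy data are hypothetical, chosen for illustration only.

```python
from itertools import permutations


def l1_distance(box_a, box_b):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_cost(pred, target):
    """Simplified cost (an assumption): low when the predicted class
    probability is high and the boxes are close."""
    return -pred["probs"][target["label"]] + l1_distance(pred["box"], target["box"])


def best_matching(preds, targets):
    """Return (pred_index, target_index) pairs minimizing the total cost.

    Brute force over all one-to-one assignments; stands in for the
    Hungarian algorithm. Assumes len(preds) >= len(targets).
    """
    best_cost, best_pairs = float("inf"), []
    for pred_indices in permutations(range(len(preds)), len(targets)):
        cost = sum(
            match_cost(preds[p], targets[t]) for t, p in enumerate(pred_indices)
        )
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, t) for t, p in enumerate(pred_indices)]
    return sorted(best_pairs)


# Toy data: three predictions, two ground-truth objects.
# Boxes are (center_x, center_y, width, height) in normalized coordinates.
preds = [
    {"probs": [0.1, 0.9], "box": (0.5, 0.5, 0.2, 0.2)},
    {"probs": [0.8, 0.2], "box": (0.2, 0.3, 0.1, 0.1)},
    {"probs": [0.5, 0.5], "box": (0.9, 0.9, 0.1, 0.1)},
]
targets = [
    {"label": 0, "box": (0.2, 0.3, 0.1, 0.1)},
    {"label": 1, "box": (0.5, 0.5, 0.2, 0.2)},
]

matches = best_matching(preds, targets)

# Predictions left without a ground-truth partner are trained as "no object".
matched_preds = {p for p, _ in matches}
unmatched = [i for i in range(len(preds)) if i not in matched_preds]

print(matches)    # [(0, 1), (1, 0)]
print(unmatched)  # [2]
```

Because each ground-truth box is assigned to exactly one prediction, duplicate predictions are penalized during training. This is why no NMS step is needed afterwards.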