- The localization approach of global max pooling over a sliding window is worth borrowing.
The paper makes the following three modifications to a classification network:
- Treat the fully connected layers as convolutions, which allows us to deal with nearly arbitrary-sized images as input.
- The aim is to apply the network to bigger images in a sliding-window manner, extending its output to n×m×K, where n and m denote the number of sliding-window positions in the x- and y-directions in the image, respectively.
- 3×h×w —> convs —> K×n×m (K: number of classes)
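A minimal sketch of the fully-connected-as-convolution idea (not the paper's code; all shapes and names are illustrative assumptions): sliding the same FC weights over every h×w window of a larger image yields a K×n×m score map.

```python
import numpy as np

rng = np.random.default_rng(0)
K, C, h, w = 3, 3, 4, 4                 # classes; input channels; FC window size
W = rng.standard_normal((K, C, h, w))   # FC weights viewed as conv kernels
b = np.zeros(K)

def score_map(img):
    """Apply the classifier to every h x w window of img (stride 1)."""
    _, H, W_ = img.shape
    n, m = H - h + 1, W_ - w + 1
    out = np.empty((K, n, m))
    for i in range(n):
        for j in range(m):
            # contract (C, h, w) of W against the window -> K scores
            out[:, i, j] = np.tensordot(W, img[:, i:i+h, j:j+w], axes=3) + b
    return out

big = rng.standard_normal((C, 10, 12))  # larger than the 4 x 4 "training" size
s = score_map(big)                      # shape (K, 7, 9): K x n x m
```

On an image of exactly the training size the map collapses to K×1×1 and matches the plain fully connected output.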
- Explicitly search for the highest scoring object position in the image by adding a single global max-pooling layer at the output.
- K×n×m —> K×1×1
- The max-pooling operation hypothesizes the location of the object in the image at the position with the maximum score
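The global max-pooling step can be sketched as follows (illustrative names, not the paper's code): the K×n×m map is reduced to K×1×1, and the per-class argmax position is the hypothesized object location.

```python
import numpy as np

def global_max_pool(scores):
    """scores: (K, n, m) map -> (K, 1, 1) pooled scores + per-class argmax."""
    K = scores.shape[0]
    flat = scores.reshape(K, -1)
    vals = flat.max(axis=1).reshape(K, 1, 1)            # K x 1 x 1 output
    locs = np.stack(np.unravel_index(flat.argmax(axis=1),
                                     scores.shape[1:]), axis=1)
    return vals, locs                                   # locs: per-class (row, col)

scores = np.zeros((2, 5, 5))
scores[0, 1, 3] = 4.0    # class 0 peaks at (1, 3)
scores[1, 4, 0] = 2.5    # class 1 peaks at (4, 0)
vals, locs = global_max_pool(scores)
```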
- Use a cost function that can explicitly model multiple objects present in the image.
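A sketch of such a multi-label cost, assuming K independent one-vs-all log-losses on the pooled scores so several classes can be present at once (the exact form used in the paper may differ):

```python
import numpy as np

def multilabel_log_loss(pooled, labels):
    """pooled: (K,) max-pooled class scores; labels: (K,) in {-1, +1}."""
    # Sum of independent per-class logistic losses log(1 + exp(-y_k * s_k)).
    return np.sum(np.log1p(np.exp(-labels * pooled)))

# A zero score is maximally uncertain for every class:
loss = multilabel_log_loss(np.zeros(3), np.array([1.0, -1.0, 1.0]))
# loss == 3 * log(2)
```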
- mAP on VOC 2012 test: +3.1% compared with
- mAP on VOC 2012 test: +7.6% compared with K×1×1 output and single-scale training
- mAP on VOC: +2.6% compared with R-CNN
- mAP on COCO 62.8%
- Metric: if the maximal response across scales falls within the ground-truth bounding box of an object of the same class (with an 18-pixel tolerance), the predicted location is labeled correct. Otherwise the response counts as a false positive (it hit the background), and the false-negative count is also incremented (no object was found).
- metric on VOC 2012 val: -0.3% compared with RCNN
- mAP on COCO 41.2%
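The localization metric above can be sketched as a point-in-expanded-box test (the function name and box format are assumptions for illustration): a predicted point is correct if it lands inside a same-class ground-truth box grown by the 18-pixel tolerance; otherwise it adds one false positive and one false negative.

```python
def location_correct(pred_xy, gt_boxes, tol=18):
    """pred_xy: (x, y) peak response; gt_boxes: list of same-class (x1, y1, x2, y2)."""
    x, y = pred_xy
    # Correct if the point falls inside any box expanded by tol pixels.
    return any(x1 - tol <= x <= x2 + tol and y1 - tol <= y <= y2 + tol
               for x1, y1, x2, y2 in gt_boxes)
```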
- Would replacing max pooling with average pooling work better when multiple instances are present?