Object detection

Pre-course research

1. Papers with code object detection ranking

Rank  Model              Box AP
1     Group DETR v2      64.5
2     FocalNet-H (DINO)  64.4
3     FD-SwinV2-G        64.2
4     BEiT-3             63.7
5     DINO               63.3

* AP = Average Precision: the average of precision values across recall levels, i.e., the area under the precision-recall curve.

Model outputs the prediction scores

-> convert prediction scores to class labels

-> calculate the confusion matrix, precision, and recall metrics

-> calculate the area under the precision-recall curve to obtain the average precision
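The steps above can be sketched in a few lines of NumPy. This is a minimal per-class AP with rectangular integration of the precision-recall curve; real benchmarks such as COCO use interpolated variants and sweep IoU thresholds, so treat this only as an illustration of the idea:

```python
import numpy as np

def average_precision(scores, labels):
    """Sort predictions by confidence score, sweep a threshold,
    and integrate the resulting precision-recall curve."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]   # 1 = true positive, 0 = false positive
    tp = np.cumsum(labels)
    fp = np.cumsum(1 - labels)
    recall = tp / labels.sum()
    precision = tp / (tp + fp)
    # area under the precision-recall curve (rectangular approximation)
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - prev_recall) * precision))

# toy example: 4 detections sorted by score, 3 of them correct
print(average_precision([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 1]))  # 0.9166...
```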

 

2. Main object detection models

https://www.v7labs.com/blog/object-detection-guide

- Deep learning-based approaches use neural network architectures such as RetinaNet, YOLO (You Only Look Once), CenterNet, SSD (Single Shot MultiBox Detector), and region-proposal models (R-CNN, Fast R-CNN, Faster R-CNN, Cascade R-CNN) to extract object features and then classify them into labels.

- Single-stage object detectors: remove the RoI extraction step and directly classify and regress the candidate anchor boxes.

- Two-stage object detectors: extract RoIs (Regions of Interest), then classify and regress those RoIs.

 

R-CNN model family

- R-CNN: utilizes a selective search method to locate RoIs in the input image and a DCN (Deep Convolutional Neural Network)-based region-wise classifier to classify each RoI independently.

- SPPNet and Fast R-CNN: improved versions of R-CNN that extract RoIs from feature maps computed once per image, which makes them much faster than the original R-CNN architecture.

- Faster R-CNN: an improved version of Fast R-CNN that is trained end to end by introducing an RPN (Region Proposal Network), a network that generates RoIs by regressing anchor boxes. These RoIs are then used in the object detection task.

- Mask R-CNN adds a mask prediction branch to Faster R-CNN, so it can detect objects and predict their masks at the same time.

- R-FCN replaces the fully connected layers with position-sensitive score maps for better object detection.

- Cascade R-CNN addresses overfitting during training and the quality mismatch at inference by training a sequence of detectors with increasing IoU thresholds.
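IoU, the quantity Cascade R-CNN raises its thresholds over, is simply intersection area divided by union area. A minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)   # 0 if the boxes don't overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = 0.1428...
```

A cascade stage trained at threshold 0.5 would accept the pair above as a (poor) match, while later stages at 0.6 or 0.7 would reject it, which is exactly the progressive tightening the paper describes.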

 

YOLO model family

- YOLO uses fewer anchor boxes (it divides the input image into an S × S grid) to do regression and classification. It was built on the Darknet neural network framework.

- YOLOv2 improves the performance by using more anchor boxes and a new bounding box regression method.

- YOLOv3 is an enhanced version of the v2 variant with a deeper feature-extractor network and minor representational changes. YOLOv3 has relatively fast inference, taking roughly 30 ms per image.

- YOLOv4 (a YOLOv3 upgrade) breaks the object detection task into two pieces: regression to identify object positions via bounding boxes, and classification to determine each object's class. YOLOv4 and its successors are technically the product of a different set of researchers than versions 1-3.

- YOLOv5 builds on YOLOv4, adding a mosaic augmentation technique to increase its general performance.
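The S × S grid assignment mentioned above works by making the cell containing an object's center responsible for predicting it. A toy sketch (the function name is illustrative, not from any YOLO codebase):

```python
def responsible_cell(box_center, img_size, S=7):
    """Return the (row, col) of the grid cell responsible for an object,
    i.e. the cell of the S x S grid that contains the object's center.
    box_center = (x, y) in pixels, img_size = (width, height)."""
    col = int(box_center[0] / img_size[0] * S)
    row = int(box_center[1] / img_size[1] * S)
    return row, col

# an object centered at (224, 112) in a 448x448 image falls in cell (1, 3)
print(responsible_cell((224, 112), (448, 448), S=7))  # (1, 3)
```

That cell then regresses the box offsets and class probabilities, which is why YOLO needs no separate region proposal stage.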

 

SSD and CenterNet model families

- SSD places anchor boxes densely over an input image and uses features from different convolutional layers to regress and classify the anchor boxes.

- DSSD introduces a deconvolution module into SSD to combine low-level and high-level features, while R-SSD uses pooling and deconvolution operations in different feature layers for the same purpose.

- RON proposes a reverse connection and an objectness prior to extract multiscale features effectively.

- RefineDet refines the locations and sizes of the anchor boxes twice, inheriting the merits of both one-stage and two-stage approaches.

- CornerNet is another keypoint-based approach, detecting an object directly as a pair of corners. Although CornerNet achieves high performance, it still has room for improvement.

- CenterNet explores the visual patterns within each bounding box, using a triplet rather than a pair of keypoints to detect an object. CenterNet models each object as a single point by predicting the x and y coordinates of its center together with its width and height. This technique has proven to outperform variants such as SSD and the R-CNN family.
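A toy illustration of the center-point idea: take the peak of a (hypothetical) center heatmap and the predicted width/height at that location to decode a box. Real CenterNet uses per-class heatmaps with local-maximum extraction and offset regression; this shows only the core decoding step:

```python
import numpy as np

def decode_center(heatmap, wh):
    """Decode one box from the strongest peak of a center heatmap.
    heatmap: (H, W) center confidences; wh: (H, W, 2) predicted width/height."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    w, h = wh[y, x]
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

heat = np.zeros((8, 8)); heat[3, 5] = 1.0        # one object, center at (x=5, y=3)
wh = np.zeros((8, 8, 2)); wh[3, 5] = (4.0, 2.0)  # predicted width 4, height 2
print(decode_center(heat, wh))  # (3.0, 2.0, 7.0, 4.0)
```

Because every prediction lives at a single pixel of the heatmap, no anchor boxes and no NMS over thousands of candidates are needed, which is the appeal of the approach.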

 

* For the history of object detection models, see the blog below:

https://89douner.tistory.com/324

3. Single-stage vs. two-stage detectors

https://viso.ai/deep-learning/object-detection/

- One-stage detectors: predict bounding boxes over the image without a separate region proposal step. This takes less time, so they can be used in real-time applications.

- One-stage object detectors prioritize inference speed and are very fast, but they are not as good at recognizing irregularly shaped objects or groups of small objects.

- The most popular one-stage detectors include YOLO, SSD, and RetinaNet. The latest real-time detectors are YOLOv7 (2022), YOLOR (2021), and YOLOv4-Scaled (2020).

- The main advantage of single-stage detectors is that they are generally faster than multi-stage detectors and structurally simpler.

 

- Two-stage detectors: approximate object regions are first proposed using deep features; these features are then used for classification as well as bounding-box regression of each object candidate.

- The two-stage architecture involves (1) object region proposal with conventional Computer Vision methods or deep networks, followed by (2) object classification based on features extracted from the proposed region with bounding-box regression.

- Two-stage methods achieve the highest detection accuracy but are typically slower. Because many inference steps are needed per image, their throughput (frames per second) is lower than that of one-stage detectors.

- Various two-stage detectors include the region-based convolutional neural network (R-CNN), with evolutions such as Faster R-CNN and Mask R-CNN. The latest evolution is the granulated R-CNN (G-RCNN).

- Two-stage object detectors first find a region of interest and use this cropped region for classification. However, such multi-stage detectors are usually not end-to-end trainable, because cropping is a non-differentiable operation.
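The two stages can be sketched with stubbed-out placeholders. None of these functions come from a real library; they only show the control flow, including the cropping step that makes the naive pipeline non-differentiable:

```python
def propose_regions(image):
    """Stage 1 stub: in a real detector this would be selective search or an RPN."""
    return [(10, 10, 50, 50), (30, 20, 90, 80)]    # candidate boxes (x1, y1, x2, y2)

def classify_region(image, box):
    """Stage 2 stub: crop the RoI, then classify it and refine the box."""
    x1, y1, x2, y2 = box
    crop = [row[x1:x2] for row in image[y1:y2]]    # the non-differentiable crop
    label, refined_box = "object", box             # placeholder prediction
    return label, refined_box

image = [[0] * 100 for _ in range(100)]            # dummy 100x100 image
detections = [classify_region(image, b) for b in propose_regions(image)]
print(detections)
```

Techniques like RoI pooling/RoI align (used by Fast/Faster R-CNN) replace the hard crop with a differentiable feature-map operation, which is what makes end-to-end training possible.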

 

- I wanted to cover the Transformer-based models as well, but gave up because it would have made this too long.

- ViT needs a very large amount of training data and has a huge number of parameters, so it would be hard to train. If we use a transformer model, we would probably have to pick a smaller one instead.

- Since our task does not require real-time inference, a two-stage detector seems appropriate.
