An Efficient and Accurate Scene Text Detector (EAST)

Excerpts from the paper

we propose a simple yet powerful pipeline that yields fast and accurate text detection in natural scenes. The pipeline directly predicts words or text lines of arbitrary orientations and quadrilateral shapes in full images, eliminating unnecessary intermediate steps

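> We propose a simple yet powerful pipeline for fast, accurate text detection in natural scenes.

> The pipeline predicts words or text lines of arbitrary orientation and quadrilateral shape directly from the full image, skipping unnecessary intermediate steps.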

Contributions:
(1) We propose a scene text detection method that consists of two stages: a Fully Convolutional Network and an NMS merging stage. The FCN directly produces text regions, excluding redundant and time-consuming intermediate steps.
(2) The pipeline is flexible to produce either word level or line level predictions, whose geometric shapes can be rotated boxes or quadrangles, depending on specific applications.
(3) The proposed algorithm significantly outperforms state-of-the-art methods in both accuracy and speed.

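> We propose a scene text detection method made of two stages: a fully convolutional network (FCN) and an NMS merging stage. The FCN produces text regions directly, without the redundant, time-consuming intermediate steps.

> The pipeline is flexible: it can produce word-level or line-level predictions, shaped as rotated boxes or quadrangles depending on the application.

> The proposed algorithm beats state-of-the-art methods in both accuracy and speed.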
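
To make the "NMS merging stage" concrete, here is a minimal sketch of plain greedy non-maximum suppression. It is a simplification and an assumption on my part: it works on axis-aligned boxes `(x1, y1, x2, y2)`, whereas the merging stage in EAST has to handle rotated boxes or quadrangles. The names `iou`, `greedy_nms`, and `iou_threshold` are illustrative, not taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes that survive, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


# Two near-duplicate detections of one word plus one separate word.
boxes = [(10, 10, 110, 40), (12, 11, 112, 42), (200, 50, 260, 80)]
scores = [0.90, 0.80, 0.95]
print(greedy_nms(boxes, scores))  # prints [2, 0]
```

Box 1 overlaps box 0 with an IoU of about 0.87, so it is suppressed and only the higher-scoring duplicate survives.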

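Background needed before EAST: FCN

A model designed for semantic segmentation.
Feature 1. Convolutional layers replace the FC layers at the end of the network
- An FC layer discards the spatial layout of the image, so class/instance/background cannot be told apart per location
- An FC layer also fixes the input image size
FCN therefore uses deconvolution & interpolation to bring the feature map back to roughly the size of the original image

Feature 2. skip architecture
Defines a skip architecture that combines semantic information from deep & coarse layers with appearance information from shallow & fine layers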
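
A minimal sketch of why a convolutional head avoids both problems, using a 1x1 convolution written in plain Python. The weights, the two-class setup, and the helper names (`conv1x1`, `make_map`) are hypothetical and only for illustration. The same small weight matrix is applied at every pixel, so the output keeps the spatial layout and the input can be any size.

```python
def conv1x1(feature_map, weights, bias):
    """Apply a 1x1 convolution: the same linear map at every pixel.

    feature_map: H x W x C_in nested lists
    weights:     C_out x C_in
    bias:        C_out
    Returns an H x W x C_out nested list, so spatial layout is preserved.
    """
    return [
        [
            [sum(w * x for w, x in zip(w_row, pixel)) + b
             for w_row, b in zip(weights, bias)]
            for pixel in row
        ]
        for row in feature_map
    ]


def make_map(height, width, pixel):
    """Build an H x W feature map where every pixel is a copy of `pixel`."""
    return [[list(pixel) for _ in range(width)] for _ in range(height)]


# Hypothetical 2-class head (background vs. text) over 3 input channels.
weights = [[1.0, 0.0, -1.0], [0.0, 2.0, 0.0]]
bias = [0.0, 0.5]

out_small = conv1x1(make_map(2, 2, [1.0, 2.0, 3.0]), weights, bias)
out_large = conv1x1(make_map(4, 5, [1.0, 2.0, 3.0]), weights, bias)

print(len(out_small), len(out_small[0]), out_small[0][0])  # prints 2 2 [-2.0, 4.5]
print(len(out_large), len(out_large[0]), out_large[0][0])  # prints 4 5 [-2.0, 4.5]
```

An FC layer would need one weight per input pixel and channel, which is what ties it to a single input size.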
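
The fusion step can be sketched as follows. This is a toy version under stated assumptions: nearest-neighbor upsampling stands in for the deconvolution/interpolation step, the maps are single-channel, and the two maps are fused by element-wise sum. The function names are made up for this example.

```python
def upsample2x_nearest(fmap):
    """Double the height and width of a 2-D map by repeating each value."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def skip_fuse(coarse, fine):
    """Upsample the coarse (deep) map to the fine (shallow) map's size and add them."""
    up = upsample2x_nearest(coarse)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("fine map must be exactly 2x the size of the coarse map")
    return [[u + f for u, f in zip(u_row, f_row)]
            for u_row, f_row in zip(up, fine)]


coarse = [[1.0, 2.0]]                 # 1 x 2, from a deep layer
fine = [[0.1, 0.2, 0.3, 0.4],
        [0.5, 0.6, 0.7, 0.8]]         # 2 x 4, from a shallow layer
fused = skip_fuse(coarse, fine)
print(len(fused), len(fused[0]))      # prints 2 4
```

Each fused value carries the coarse, semantic response of its region plus the fine, local detail of its own position.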

EAST

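Features
- Whereas earlier text detection models pass through 3-5 intermediate steps, EAST reduces the pipeline to a single FCN (followed by NMS merging), which greatly cuts computation time
- Uses the FCN approach, originally designed for image segmentation, to predict the rotated rectangle (or quadrilateral box) that contains each word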

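U-shaped FCN structure
Input: 512x512 RGB image
Output: a set of maps at about 1/4 of the input resolution. They carry the information needed to recover the five parameters (x1, y1, w, h, angle) of the rotated rectangle (or quadrangle) we are after. The maps are combined and thresholded to produce the final boxes.
- One score map with values in [0, 1] that predicts where text is present
- Four maps giving the distance from each pixel to the four edges of the word box
  - Large distances mean the pixel is far from all four edges, so it is more likely to lie near the center of the word region
- One map storing the angle by which the box is rotated relative to horizontal, using the center information
- In total the output holds one score map plus five geometry channels (four edge distances and the angle), from which the box parameters (x1, y1, w, h, angle) are recovered. These are combined and thresholded to produce the output.
- For training, the same score map and five geometry channels are generated from the ground-truth labels and compared with the predictions to compute the loss.
  - Final loss = Ls (difference between score maps) + λg Lg (loss over the five geometry channels)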
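
A minimal sketch of how one pixel's geometry could be turned back into a box, and how the two loss terms combine. Several things here are assumptions rather than something these notes state: the rotation sign convention, the stride of 4 (from the "about 1/4" output size), the IoU-plus-cosine form of Lg, and the default weights `lambda_theta` and `lambda_g`. Ls is passed in as a number instead of being computed.

```python
import math


def decode_rbox(px, py, d_top, d_right, d_bottom, d_left, angle, stride=4):
    """Recover the four corners of a rotated box from one score-map pixel.

    (px, py) is the pixel position on the output map; distances are in
    input-image pixels. Corners come back as top-left, top-right,
    bottom-right, bottom-left in input-image coordinates.
    """
    cx, cy = px * stride, py * stride
    # Corners relative to the pixel, in the box's own un-rotated frame.
    local = [(-d_left, -d_top), (d_right, -d_top),
             (d_right, d_bottom), (-d_left, d_bottom)]
    c, s = math.cos(angle), math.sin(angle)
    # Assumed convention: standard rotation matrix in image coordinates.
    return [(cx + c * x - s * y, cy + s * x + c * y) for x, y in local]


def geometry_loss(pred, target, lambda_theta=10.0):
    """Assumed form of Lg for one text pixel: -log(IoU) + lambda_theta * (1 - cos(angle diff)).

    pred and target are (d_top, d_right, d_bottom, d_left, angle).
    """
    pt, pr, pb, pl, pa = pred
    tt, tr, tb, tl, ta = target
    inter = (min(pt, tt) + min(pb, tb)) * (min(pl, tl) + min(pr, tr))
    union = (pt + pb) * (pl + pr) + (tt + tb) * (tl + tr) - inter
    iou_loss = -math.log((inter + 1.0) / (union + 1.0))  # +1 avoids log(0)
    angle_loss = 1.0 - math.cos(pa - ta)
    return iou_loss + lambda_theta * angle_loss


def total_loss(score_loss, geo_loss, lambda_g=1.0):
    """Final loss = Ls + lambda_g * Lg."""
    return score_loss + lambda_g * geo_loss


corners = decode_rbox(10, 5, d_top=8, d_right=30, d_bottom=12, d_left=20, angle=0.0)
print(corners)  # un-rotated box with w = 20 + 30 = 50 and h = 8 + 12 = 20

target = (8.0, 30.0, 12.0, 20.0, 0.0)
pred = (6.0, 28.0, 12.0, 22.0, 0.1)
print(total_loss(score_loss=0.3, geo_loss=geometry_loss(pred, target)))
```

With angle 0 the box width is simply left + right distance and the height is top + bottom distance, which is how the five channels map onto (x1, y1, w, h, angle).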