1 Introduction
As scene text recognition plays a critical role in many real-world applications, it has consistently drawn much research attention from the computer vision community over the decades. However, due to the diversity of scene text (e.g., fonts, colors, and arrangements), the complexity of backgrounds, as well as tough imaging conditions (e.g., blur, perspective distortion, and partial occlusion), scene text recognition remains an extremely challenging task.
Early scene text recognition systems detect and classify each character separately, and then join the classified results into sequence predictions [26, 4, 36, 34]. These methods require character-level annotations, which are usually expensive to obtain in real-world tasks. Furthermore, mistakes and confusions in character detection and classification lead to degraded recognition accuracy. Therefore, inspired by speech recognition, Connectionist Temporal Classification (CTC) [12] was introduced to train image-based sequence recognition end-to-end with unsegmented data and sequence-level labels [31]. In CTC-based methods [21, 23], the whole sequence labels are used to compute conditional probabilities from sequence predictions directly. Many recent studies [6, 7, 21, 33, 38, 9, 3] followed the framework in which the network produces frame-wise predictions and aligns them with labels using CTC or an attention mechanism. However, vanilla CTC [12] was essentially designed for 1D sequence recognition and can only handle 1D probability distributions. As shown in Fig. 1, vanilla CTC-based sequence recognition methods have to collapse the 2D features of images into a 1D probability distribution at each frame. Due to the significant conflict between 2D text distribution and 1D sequence representation, this may lose crucial information and introduce extra noise, thus resulting in errors.
To solve this severe limitation of vanilla CTC, we propose a new 2D-CTC (short for two-dimensional CTC) formulation to directly compute the conditional probability of labels from 2D distributions.
In vanilla CTC [12], given a group of probability distributions, the algorithm seeks the various paths that produce the same target sequence. The summation of the conditional probabilities of all such paths gives the probability of the label, which measures the likelihood of the label given the prediction. As for 2D-CTC, an extra dimension is added for path searching: in addition to the time step (the length of the prediction), probability distributions in the height dimension are preserved. This guarantees that all possible paths over height are taken into consideration, and different path choices over height may still lead to the same target sequence.
Without destroying the 2D distribution information of characters, more accurate predictions can be achieved. As shown in Fig. 2, it is natural for 2D-CTC to recognize curved and irregular text, which is difficult for frame-wise-prediction-based methods. Furthermore, Fig. 3 also shows that our proposed method can detect approximate directions and locations of characters, even though it is trained without additional character-level annotations. This can be used in applications requiring character location information, or as a weakly supervised character detector.
In addition, to reduce the computation burden brought by the dimension increase in 2D-CTC, we devise an effective dynamic programming algorithm (see Sec. 3), which significantly decreases the time complexity of the conditional probability computation.
The contributions of this paper are threefold: (1) We extend the vanilla CTC model by adding another dimension along the height direction to support 2D probability distributions, which can naturally handle various cases such as arbitrarily oriented, curved or rotated text instances. The proposed 2D-CTC model introduces a novel and natural perspective for scene text recognition, where text distributions in 2D space are preserved. (2) It is demonstrated that the proposed method outperforms the current state-of-the-art method on regular benchmarks, such as IIIT5K, and achieves significant improvement on irregular benchmarks, such as CUTE80 and Total-Text. (3) We devise an efficient dynamic programming algorithm to compute the conditional probability of a specific label from a two- or one-dimensional probability distribution. With dynamic programming, the time consumption of 2D-CTC is reduced to a negligible cost.
2 Related Work
Early scene text recognition methods detect individual characters and recognize each character separately [26, 4, 25]. These methods suffer from poor character detection accuracy, which limits recognition performance.
Inspired by speech recognition, CRNN [31] introduced CTC into image-based sequence recognition, making it possible to train sequence recognition end-to-end. Following this design, various methods were proposed [21, 23] and achieved significant improvements in performance. However, vanilla CTC is designed for sequence-to-sequence alignment. To fit this formulation, these methods have to collapse 2D image features into a 1D distribution, which may lead to the loss of relevant information or the introduction of background noise. Therefore, many recent works in scene text recognition have abandoned CTC and turned to other objective functions [33, 6, 14].
Another notable direction of frame-wise prediction alignment is the attention-based sequence encoder-decoder framework [6, 7, 21, 33, 38, 9, 3]. These models focus on one position and predict the corresponding character at each time step, but suffer from the problems of misalignment and attention drift [6]. Misclassification at previous steps may lead to drifted attention locations and wrong predictions at successive time steps because of the recursive mechanism. Recent works take attention decoders to even better accuracy by suggesting new loss functions [6, 3] and introducing image rectification modules [33, 41]. Both bring appreciable improvements to attention decoders. However, despite the high accuracy attention decoders have achieved, their considerably lower inference speed is the fundamental factor limiting their application in real-world text recognition systems. Detailed experiments are presented in Sec. 4.4.3. In contrast to the aforementioned methods, Liao et al. [22] recently proposed to utilize instance segmentation to simultaneously predict character locations and recognition results, avoiding the problem of attention misalignment. They also noticed the conflict between the 2D image feature and the collapsed sequence representation, and proposed a reasonable solution. However, this method requires character-level annotations and is thus limited in real-world applications, especially in areas where detailed annotations are hardly available (e.g., handwritten text recognition).
Considering both accuracy and efficiency, 2D-CTC recognizes text from a 2D perspective similar to [22], but is trained without any character-level annotations. By extending vanilla CTC, 2D-CTC achieves state-of-the-art performance while retaining the high efficiency of CTC models.
3 The 2D-CTC Method
In this section, we present the details of the proposed 2D-CTC algorithm, which can be applied to image-based sequence recognition problems. Sec. 3.1 discusses the output representation of 2D-CTC that allows for the computation of the target probability. Sec. 3.2 describes the procedure by which 2D probability distributions are decoded into sequence predictions. Sec. 3.3 explains the dynamic programming algorithm designed to compute the 2D-CTC loss effectively and efficiently.
3.1 Network Outputs
The outputs of the network of 2D-CTC can be decomposed into two types of predictions: the probability distribution map [3] and the path transition map.
First, we define $\mathcal{A}$ as the alphabet (the set of character classes). Similar to vanilla CTC, the probability distribution of 2D-CTC is predicted by a softmax output layer with $|\mathcal{A}|+1$ units for the class labels. The activation of the first unit at each position represents the probability of a 'blank', which means no valid label. The following $|\mathcal{A}|$ units correspond to the probabilities of the character classes, respectively. The shape of the probability distribution map is $H \times T \times (|\mathcal{A}|+1)$, where $H$ and $T$ stand for the spatial size (height and width) of the prediction maps, which is proportional to the size of the original image.
The distinction between the outputs of vanilla CTC and 2D-CTC is the spatial distribution of the predicted probabilities. The output of vanilla CTC models is composed of $T$ prediction frames, while the probability map of 2D-CTC includes predictions at $H \times T$ spatial positions. Each predicted vector in the 2D distribution map indicates the classification probabilities of the corresponding spatial location.
To satisfy the definition of 2D-CTC for transition probability, normalization over the height dimension is also required by the 2D-CTC formulation. In practice, an extra prediction, the path transition map, of shape $T \times H \times H$, is produced by another individual softmax layer. The path transition probability is interpreted as the path selection over the vertical direction, which is one of the substantial improvements of 2D-CTC over vanilla CTC.
The two separate predictions, the probability distribution map and the path transition map, are used for loss computation and sequence decoding.
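The two output heads described above can be sketched as follows. This is a minimal NumPy illustration of the shapes and normalizations only (the function names and the random projections are illustrative, not from the paper's released code, which is implemented in PyTorch; the transition head here already uses the simplified $H \times T$ form introduced later in Sec. 3.3.2):

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def network_heads(features, num_classes):
    """Toy stand-in for the two output heads.

    `features` has shape (H, T, C). A real model applies learned
    convolutions; random projections here merely produce tensors of the
    right shape so the two normalizations can be demonstrated.
    """
    H, T, C = features.shape
    rng = np.random.default_rng(0)
    # Probability distribution map: softmax over the class dimension
    # -> shape (H, T, num_classes + 1), index 0 reserved for 'blank'.
    dist = softmax(features @ rng.standard_normal((C, num_classes + 1)), axis=-1)
    # Path transition map (simplified form): softmax over the height
    # dimension -> shape (H, T), each column sums to 1 over height.
    trans = softmax((features @ rng.standard_normal((C, 1)))[..., 0], axis=0)
    return dist, trans
```

The key point is that the two softmaxes run over different axes: the distribution map is normalized per location over classes, while the transition map is normalized per column over height positions.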
3.2 Prediction Alignment
As discussed in Sec. 2, vanilla CTC is widely used in sequence recognition. It decodes frame-wise predictions into sequence labels. The key conception of CTC decoding is to skip frames in predictions. To allow for ignoring steps in output paths, vanilla CTC introduced a blank token $\epsilon$. $\epsilon$ implies no valid character and will be removed from the output. The predicted sequence can then be aligned by ignoring steps where the predicted class is the same as that of the previous step. A simple example is illustrated in Fig. 4.
Equipped with the CTC algorithm, instead of predicting each corresponding character in the label, models predict probability distributions over the character classes and blank ($\epsilon$). This alignment-free prediction brings better integration and makes it easier for models to converge. However, the classical vanilla CTC formulation still lacks the ability to produce multiple outputs in a single column. Therefore we propose 2D-CTC.
2D-CTC inherits the alignment conception of vanilla CTC. In the decoding procedure of 2D-CTC, maximum probabilities over height positions and character classes are reduced to a sequence prediction. The alignment of 2D-CTC is illustrated in Fig. 2. Beyond the frame alignment of vanilla CTC, 2D-CTC additionally chooses paths over height (see Fig. 2 (b)). The predicted probabilities of the character classes on the chosen paths are then joined into frame-wise predictions, which reduces the decoding procedure to vanilla CTC alignment.
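The decoding procedure above can be sketched in a few lines of Python. This is an illustrative greedy simplification (function names are ours, not the paper's): pick a height per column via the transition map, pick the best class at that cell, then apply the standard CTC collapse (merge repeats, drop blanks):

```python
def ctc_collapse(path, blank=0):
    """Vanilla CTC collapse: merge consecutive repeats, drop blanks."""
    out = []
    prev = None
    for c in path:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return out

def decode_2d(dist, trans, blank=0):
    """Greedy 2D-CTC decoding sketch.

    dist:  (H, T, K) array of class probabilities per location.
    trans: (H, T) array of height-selection weights per column.
    At each column, choose the height with the largest transition
    weight, take the best class there, then collapse as in vanilla CTC.
    """
    H, T, _ = dist.shape
    path = []
    for t in range(T):
        h = trans[:, t].argmax()               # height chosen for column t
        path.append(int(dist[h, t].argmax()))  # best class at that cell
    return ctc_collapse(path, blank)
```

A beam search over heights and classes would replace the two `argmax` calls; as noted in Sec. 4.1, the paper reports that beam search brings only negligible accuracy improvement over this greedy scheme.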
3.3 Training the Networks
We have discussed the transition from 1D and 2D probability distributions to sequence predictions. To optimize the networks using gradient descent, we now briefly describe the objective functions of vanilla CTC and our proposed 2D-CTC.
3.3.1 Vanilla CTC Loss
As is known, optimizing CTC models means maximizing the log-likelihood of the target labels [12]. Formally, the loss function of vanilla CTC can be summarized as follows:

$\mathcal{L}_{\mathrm{CTC}} = -\log p(Y \mid X)$  (1)
where $X$ is the predicted probability distribution and $Y$ is the corresponding label. The objective for the probability of $Y$ over $X$ is:

$p(Y \mid X) = \sum_{\pi \in \mathcal{B}^{-1}(Y)} \prod_{t=1}^{T} p(\pi_t \mid X)$  (2)

where $\mathcal{B}^{-1}(Y)$ denotes all valid alignments in $X$ over $Y$, and $T$ is the length of $X$.
As shown above, $p(Y \mid X)$ can be computed by summing the probabilities of all valid paths, but the time complexity of directly adding them together grows exponentially with $T$.
Fortunately, dynamic programming provides an efficient way to solve this problem.
Since the blank before and after any symbol can be handled identically, we make a variant of the label to describe it more clearly:

$Y' = (\epsilon, y_1, \epsilon, y_2, \ldots, \epsilon, y_L, \epsilon)$  (3)

where $L$ is the length of $Y$, so that $Y'$ has length $2L+1$.
Given $Y'$, let $Y'_{1:s}$ be the first $s$ symbols of $Y'$. We define $\alpha_{t,s}$ as the conditional probability of $Y'_{1:s}$ after $t$ time steps. Then $p(Y \mid X)$ can be computed with a divide-and-conquer strategy.
For cases where the symbol at $s-1$ cannot be ignored, i.e., $y'_s = \epsilon$ or $y'_s = y'_{s-2}$, $\alpha_{t,s}$ can be inferred from the previous step by the following equation:

$\alpha_{t,s} = \left( \alpha_{t-1,s} + \alpha_{t-1,s-1} \right) \cdot p(y'_s \mid X, t)$  (4)
For the other cases, where $y'_{s-1}$ lies between different symbols, i.e., $y'_s \neq \epsilon$ and $y'_s \neq y'_{s-2}$ are satisfied, $y'_{s-1}$ can be ignored, so the probability of the two-symbol-shorter state may also contribute to $\alpha_{t,s}$:

$\alpha_{t,s} = \left( \alpha_{t-1,s} + \alpha_{t-1,s-1} + \alpha_{t-1,s-2} \right) \cdot p(y'_s \mid X, t)$  (5)
In combination, the state transition equation of the dynamic programming for vanilla CTC can be described as:

$\alpha_{t,s} = \begin{cases} \left( \alpha_{t-1,s} + \alpha_{t-1,s-1} \right) \cdot p(y'_s \mid X, t), & \text{if } y'_s = \epsilon \text{ or } y'_s = y'_{s-2} \\ \left( \alpha_{t-1,s} + \alpha_{t-1,s-1} + \alpha_{t-1,s-2} \right) \cdot p(y'_s \mid X, t), & \text{otherwise} \end{cases}$  (6)
Finally, $p(Y \mid X)$ required by the loss function can be computed by dynamic programming, with a fully differentiable procedure:

$p(Y \mid X) = \alpha_{T, 2L+1} + \alpha_{T, 2L}$  (7)
3.3.2 From Vanilla CTC to 2D-CTC
In vanilla CTC, the input probability distribution is limited to a 1D sequence. For compatibility with 2D distributions, which are essential to image-based sequence recognition, we extend the vanilla CTC formulation to adapt to 2D distributions.
Conceptually, given an input 2D probability distribution map $X$ and a target label $Y$, the target of the 2D-CTC loss is likewise to compute the conditional probability $p(Y \mid X)$ over all valid paths. This conditional probability is used to measure the discrepancy between the label and the prediction.
We follow the assumption that paths spread from left to right, which is natural for most sequence recognition problems. More formally, given a 2D distribution with height $H$ and width $T$, we define the path transition map $W$. Concretely, $w_{t,h',h}$ indicates the transition probability of the decoding path transferring from $(t-1, h')$ to $(t, h)$, where $1 \le h', h \le H$ and $1 \le t \le T$.
By definition, the summation of the path transition probabilities from each location equals one:

$\sum_{h=1}^{H} w_{t,h',h} = 1$  (8)
The essential enhancement of 2D-CTC over 1D CTC is the introduction of the dimension along height, which alleviates the problem of losing or mixing information and provides more paths for CTC to decode. This enhancement changes the state transition equations for both of the aforementioned partial cases.
Given the same expanded label $Y'$, the recursive formula for the cases where $y'_{s-1}$ cannot be ignored ($y'_s = \epsilon$ or $y'_s = y'_{s-2}$) becomes:

$\alpha_{t,h,s} = p(y'_s \mid X, t, h) \cdot \sum_{h'=1}^{H} w_{t,h',h} \left( \alpha_{t-1,h',s} + \alpha_{t-1,h',s-1} \right)$  (9)
And for the other cases:

$\alpha_{t,h,s} = p(y'_s \mid X, t, h) \cdot \sum_{h'=1}^{H} w_{t,h',h} \left( \alpha_{t-1,h',s} + \alpha_{t-1,h',s-1} + \alpha_{t-1,h',s-2} \right)$  (10)
As we can see from the equations, the conditional probability at each time step is computed separately along the height dimension. Paths in the height dimension are weighted by the corresponding transition probabilities and summed into the distribution representation.
Thus, the dynamic programming procedure of 2D-CTC can be described as:

$\alpha_{t,h,s} = \begin{cases} p(y'_s \mid X, t, h) \cdot \sum_{h'=1}^{H} w_{t,h',h} \left( \alpha_{t-1,h',s} + \alpha_{t-1,h',s-1} \right), & \text{if } y'_s = \epsilon \text{ or } y'_s = y'_{s-2} \\ p(y'_s \mid X, t, h) \cdot \sum_{h'=1}^{H} w_{t,h',h} \left( \alpha_{t-1,h',s} + \alpha_{t-1,h',s-1} + \alpha_{t-1,h',s-2} \right), & \text{otherwise} \end{cases}$  (11)
In implementation, we need to define the initialization for the first state of $\alpha$:

$\alpha_{1,h,1} = p(\epsilon \mid X, 1, h), \quad \alpha_{1,h,2} = p(y_1 \mid X, 1, h), \quad \alpha_{1,h,s} = 0$  (12)

where $1 \le h \le H$ and $2 < s \le 2L+1$.
As discussed so far, the conditional probability of 2D-CTC is summarized as:

$p(Y \mid X) = \sum_{h=1}^{H} \left( \alpha_{T,h,2L+1} + \alpha_{T,h,2L} \right)$  (13)
Besides, as an implementation trick, the transition probabilities from the different locations in the same column to a given location in the next column can be assumed to be equal:

$w_{t,h',h} = w_{t,h}, \quad \forall\, h' \in \{1, \ldots, H\}$  (14)

This simplifies the transition prediction to an $H \times T$ map. We implemented 2D-CTC with both formulations; whether or not the assumption is adopted, the difference in performance is negligible in our experiments. We therefore recommend the implementation with simplified transition probabilities. The experiments in this work are based on this simplified implementation.
4 Experiments
To verify the effectiveness and advantages of our proposed 2D-CTC, we conduct experiments on standard datasets for scene text recognition and compare it with previous methods in this field.
4.1 Implementation Details
The whole system of 2D-CTC is implemented with PyTorch [27]. The overall structure of the network is illustrated in Fig. 5. The base network is identical to the typical PSPNet [42], which is chosen for its simplicity and excellent performance. The path transition probability map and the probability distribution map are produced separately by dedicated convolution layers. The input image is resized to 64 pixels in height and 256 pixels in width, for the sake of training in batches. However, since 2D-CTC is a fully-convolutional network, it can in fact handle images of arbitrary sizes. Our models are trained and evaluated on an NVIDIA Titan X. The CPU used for efficiency evaluation is a 2.40 GHz Intel(R) E5-2680. During inference, all images are scaled to 64 pixels in height. Images with an aspect ratio (width over height) smaller than 4 are resized to a fixed width of 256 pixels, while the others are resized with their aspect ratios unchanged. Greedy or beam search can be used to find the path with maximum probability during inference, where beam search brings negligible accuracy improvement.
We use the Adam [20] optimizer to train our model with a batch size of 256. The learning rate is decayed twice from its initial value over the course of training.
4.2 Benchmark Datasets
Following the settings of recent works, our model is purely trained with the synthetic data MJSynth [13] and SynthText [16], and evaluated on all the standard benchmarks without any further fine-tuning. The benchmarks for evaluation include both regular and irregular text instances. The details of the datasets are described as follows.
IIIT 5K-Words (IIIT5k) is a dataset released for scene text recognition by [24]. It consists of 5000 images, including 3000 images for testing and 2000 images for training. Two lexicons are provided for each image, containing 50 and 1000 words respectively.
Street View Text [36] (SVT) contains roadside images collected from Google Street View. 647 word images were cropped from the 249 test images in the dataset. A 50-word lexicon associated with each word image is also defined.
ICDAR 2013 [19] (IC13) inherits most of its data from IC03. There are 233 scene images in the dataset that comprise 1,015 cropped word images with ground truths.
ICDAR 2015 [18] (IC15) contains 500 testing images in total. These images are of low quality and have multi-oriented bounding boxes. Following [6], we ignore images containing non-alphanumeric characters or irregular text instances, resulting in 1811 cropped text images.
Street View Text Perspective [28] (SVTP) contains 238 street images taken at the same addresses as SVT. 645 word images with a great variety of viewpoints are cropped from them. It is specifically designed for perspective text recognition.
CUTE80 [29] (CUTE) is a dataset focusing on curved text, which contains 80 natural scene images. No lexicon is provided in the dataset.
Total-Text [8] is a scene text dataset with 2201 cropped images, in which text instances range from slightly to extremely curved.
4.3 Performances on Standard Benchmarks
Table 1. Recognition accuracies (%) on standard benchmarks. "50", "1k" and "0" denote lexicon sizes (0 = no lexicon); "-" denotes a result not reported.

| Methods | IIIT5k 50 | IIIT5k 1k | IIIT5k 0 | SVT 50 | SVT 0 | IC13 0 | IC15 0 | SVTP 0 | CUTE 0 | Total-Text 0 |
|---|---|---|---|---|---|---|---|---|---|---|
| ABBYY [36] | 24.3 | - | - | 35.0 | - | - | - | - | - | - |
| Wang et al. [36] | - | - | - | 57.0 | - | - | - | - | - | - |
| Mishra et al. [24] | 64.1 | 57.5 | - | 73.2 | - | - | - | - | - | - |
| Wang et al. [37] | - | - | - | 70.0 | - | - | - | - | - | - |
| Goel et al. [10] | - | - | - | 77.3 | - | - | - | - | - | - |
| Bissacco et al. [5] | - | - | - | - | - | 87.6 | - | - | - | - |
| Alsharif and Pineau [2] | - | - | - | 74.3 | - | - | - | - | - | - |
| Almazán et al. [1] | 91.2 | 82.1 | - | 89.2 | - | - | - | - | - | - |
| Yao et al. [40] | 80.2 | 69.3 | - | 75.9 | - | - | - | - | - | - |
| Rodríguez-Serrano et al. [30] | 76.1 | 57.4 | - | 70.0 | - | - | - | - | - | - |
| Jaderberg et al. [15] | - | - | - | 86.1 | - | - | - | - | - | - |
| Su and Lu [35] | - | - | - | 83.0 | - | - | - | - | - | - |
| Gordo [11] | 93.3 | 86.6 | - | 91.8 | - | - | - | - | - | - |
| Jaderberg et al. [17] | 97.1 | 92.7 | - | 95.4 | 80.7 | 90.8 | - | - | - | - |
| Jaderberg et al. [15] | 95.5 | 89.6 | - | 93.2 | 71.7 | 81.8 | - | - | - | - |
| Shi et al. [31] | 97.6 | 94.4 | 78.2 | 96.4 | 80.8 | 86.7 | - | - | - | - |
| Shi et al. [32] | 96.2 | 93.8 | 81.9 | 95.5 | 81.9 | 88.6 | - | 71.8 | 59.2 | - |
| Lee et al. [21] | 96.8 | 94.4 | 78.4 | 96.3 | 80.7 | 90.0 | - | - | - | - |
| Yang et al. [39] | 97.8 | 96.1 | - | 95.2 | - | - | - | 75.8 | 69.3 | - |
| Cheng et al. [6] | 99.3 | 97.5 | 87.4 | 97.1 | 85.9 | 93.3 | 70.6 | - | - | - |
| Cheng et al. [7] | 99.6 | 98.1 | 87.0 | 96.0 | 82.8 | - | 68.2 | 73.0 | 76.8 | - |
| Bai et al. [3] | 99.5 | 97.9 | 88.3 | 96.6 | 87.5 | 94.4 | 73.9 | - | - | - |
| Shi et al. [33] | 99.6 | - | - | - | - | 91.8 | 76.1 | - | - | - |
| Vanilla CTC (Baseline) | 97.8 | - | 92.2 | 95.7 | 86.9 | 91.2 | 69.8 | 75.6 | 77.1 | - |
| 2D-CTC (Ours) | 99.8 | 98.9 | 94.7 | 97.9 | 90.6 | 93.9 | 75.2 | 79.2 | 81.3 | 63.0 |
We first evaluate our proposed 2D-CTC model on the standard benchmarks. A set of intermediate results produced by 2D-CTC is depicted in Fig. 3. The probability distribution maps and path transition maps clearly show that 2D-CTC can adaptively describe the geometric properties of the text instances being recognized, exclude the interference from background clutters, and give precise predictions of the classes of the characters and their overall shapes.
The quantitative results of 2D-CTC, vanilla CTC (in this work, vanilla CTC has the same base network as 2D-CTC and is trained using the same training data and learning policy; the main difference lies in the formulation of the prediction and loss function), and other competitors are reported in Tab. 1. As can be seen, 2D-CTC consistently outperforms vanilla CTC by large margins on all the benchmarks (improvements of 2.7%-5.4%). This proves that 2D-CTC, an extension and generalization of vanilla CTC, is effective at recognizing text of various forms (horizontal, oriented and curved).
Compared with previous algorithms, the advantage of 2D-CTC in recognition accuracy is also obvious. Among the 10 evaluation settings of the benchmarks, 2D-CTC achieves the highest performance on 8 and the second highest on 2, substantially surpassing all existing text recognition methods (including recent strong competitors such as EP [3] and ASTER [33]).
Note that 2D-CTC performs particularly well on oriented and curved text, e.g., that from SVTP, CUTE and Total-Text. The main reason is that oriented and curved text instances are nonlinear objects distributed in 2D space, while 2D-CTC is specifically designed to preserve the 2D nature of objects. In contrast to vanilla CTC, in 2D-CTC the collapsing operation along the height direction is eliminated and the optimal alignments are sought in a larger space ($H \times T$ instead of $T$), so it is possible to better utilize useful information and avoid the negative influence of clutters or noises. This point is also evidenced by the qualitative examples shown in Fig. 6. When clutters or noises appear in the images (notice the borders and small characters in the backgrounds in the first row), vanilla CTC is prone to errors, since it simply compresses and employs all features, regardless of their source and quality. However, 2D-CTC handles such cases well, as it can selectively concentrate on features that are relevant and bypass those from clutters or noises. In summary, the ability to model the 2D nature of text is the key to the excellent text recognition performance of 2D-CTC.
Table 2. Performance on benchmarks (%, no lexicon) and speed (FPS).

| Methods | IIIT5k | SVT | IC13 | IC15 | SVTP | CUTE | Total-Text | Train | Test (GPU) | Test (CPU) |
|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla CTC | 92.2 | 86.9 | 91.2 | 69.8 | 75.6 | 77.1 | 58.3 | 1.42 | 41.63 | 4.55 |
| Vanilla CTC + Attention | 91.1 | 88.3 | 91.0 | 70.7 | 76.2 | 78.3 | 60.4 | 1.40 | 39.47 | 4.31 |
| Attention Decoder [33] | 94.4 | 90.5 | 92.6 | 75.8 | 78.3 | 80.4 | 61.3 | 1.15 | 11.35 | 1.13 |
| 2D-CTC | 94.7 | 90.6 | 93.9 | 75.2 | 79.2 | 81.3 | 63.0 | 1.36 | 36.22 | 3.96 |
4.4 Comparisons and Discussions
4.4.1 Time Cost of Loss Function
Table 3. Time cost (ms) of the loss functions, for batch sizes 1 and 256.

| Loss Function | Device | Batch 1 | Batch 256 |
|---|---|---|---|
| Vanilla CTC | CPU | 0.12 | 1.24 |
| Vanilla CTC | GPU | 0.23 | 0.24 |
| 2D-CTC | CPU | 0.2 | 1.98 |
| 2D-CTC | GPU | 0.26 | 0.27 |
The major difference between the loss function of 2D-CTC and that of vanilla CTC is that the former introduces another dimension into the formulation, and thus theoretically has higher computational complexity. However, with the proposed dynamic programming algorithm (see Sec. 3.3.2), the computation of the 2D-CTC loss is constrained to a nearly negligible cost. As shown in Tab. 3, both loss functions run quite fast on both CPU and GPU. The actual runtime of the 2D-CTC loss is only slightly longer than that of the vanilla CTC loss, which means that 2D-CTC can obtain a notable improvement in accuracy with only a marginal decrease in speed.
4.4.2 Vanilla CTC with 2D Attention
Observing the formulation of 2D-CTC, the path transition probabilities can be approximately regarded as a kind of attention, which separates the foreground text from background clutters and noises. To better demonstrate the contribution of this attention-like mechanism, we also introduce a similar attention module into vanilla CTC. The variant "Vanilla CTC + Attention" has the identical model structure and outputs as 2D-CTC; an extra summation over the height dimension is adopted to fit into the formulation of vanilla CTC. The transition probability map can be interpreted as a two-dimensional attention map on the prediction. The additional attention module helps with images exhibiting perspective transformations but does not improve performance on regular text instances.
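The extra summation over height used by this variant can be sketched as follows (an illustrative NumPy snippet, with a name of our choosing): the 2D class distribution is weighted by the transition (attention) map and collapsed over height, yielding the 1D per-frame distribution that vanilla CTC expects.

```python
import numpy as np

def collapse_with_attention(dist, trans):
    """'Vanilla CTC + Attention' collapse sketch.

    dist:  (H, T, K) class probabilities per location.
    trans: (H, T) attention weights, each column summing to 1 over H.
    Returns a (T, K) per-frame distribution for the vanilla CTC loss.
    """
    # Broadcast the attention weights over the class axis, then sum
    # over height; each output frame remains a valid distribution.
    return (dist * trans[..., None]).sum(axis=0)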
As can be observed from Tab. 2, with a different loss formulation, 2DCTC consistently outperforms “Vanilla CTC + Attention” on both regular and irregular text benchmarks. This confirms the advantage of the 2DCTC formulation, which is the essential reason for the improvements in recognition performance.
4.4.3 CTC versus Attention Decoder
Besides CTC, the attention decoder [21, 6, 33] is another widely-used paradigm for sequence prediction, which has represented the prior state of the art for years. Using an RNN and the attention mechanism, the attention decoder is fed with the global image feature and the previous output; it then predicts one character at each time step. Despite obvious drawbacks [6], methods with attention decoders have proven to outperform CTC-based methods. However, CTC-based algorithms are still widely adopted in real-world applications, due to their superiority over attention decoders in running efficiency.
To compare 2D-CTC with attention-decoder-based methods fairly, we re-implemented the decoder of ASTER [33], the state-of-the-art method to date, with the same base model as 2D-CTC. The evaluated performances and training/test speeds are shown in Tab. 2. As can be seen, the proposed 2D-CTC model achieves higher or comparable performance on regular text datasets and outperforms the attention decoder on irregular text datasets.
Fig. 7 gives an intuitive comparison of the different types of methods in accuracy and speed, where accuracy is measured using the average recognition rate on all the images from the benchmarks and speed is measured using the number of images processed per second (in FPS). The attention decoder achieves promising accuracy, but its speed is quite limited, as the built-in recursive modeling of RNNs prevents parallelism. In contrast, 2D-CTC outperforms the attention decoder in accuracy while running about three times faster (36.22 FPS vs. 11.35 FPS), as 2D-CTC only employs a fully-convolutional network and its formulation allows for efficient inference. This indicates that with the creation of 2D-CTC, the algorithms from the CTC family gain advantages over attention-decoder-based methods in terms of both accuracy and efficiency.
Nevertheless, the internal capability of attention decoders to model the contextual relationships in sequences is actually complementary to the characteristics of the CTC family. For applications that require extra-high recognition accuracy, it would be a feasible solution to combine these two types of methodologies, taking the best of both worlds for higher performance. This direction is worthy of further exploration and investigation.
5 Conclusion
In this paper, we have presented a novel 2D-CTC model for scene text recognition, which is an extension of the vanilla CTC model. Motivated by the observation that instances of scene text actually take 2D forms, 2D-CTC is devised and formulated to describe this distinctive property and produce more precise recognition results and more explainable intermediate predictions. The qualitative and quantitative results on standard benchmarks confirm the effectiveness and advantages of 2D-CTC.
References
 [1] J. Almazán, A. Gordo, A. Fornés, and E. Valveny. Word spotting and recognition with embedded attributes. IEEE transactions on pattern analysis and machine intelligence, 36(12):2552–2566, 2014.
 [2] O. Alsharif and J. Pineau. End-to-end text recognition with hybrid HMM maxout models. arXiv preprint arXiv:1310.1811, 2013.

 [3] F. Bai, Z. Cheng, Y. Niu, S. Pu, and S. Zhou. Edit probability for scene text recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 1508–1516, 2018.
 [4] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven. PhotoOCR: Reading text in uncontrolled conditions. In 2013 IEEE International Conference on Computer Vision, pages 785–792, Dec 2013.
 [5] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven. PhotoOCR: Reading text in uncontrolled conditions. In Proceedings of the IEEE International Conference on Computer Vision, pages 785–792, 2013.
 [6] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou. Focusing attention: Towards accurate text recognition in natural images. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5086–5094, Oct 2017.
 [7] Z. Cheng, Y. Xu, F. Bai, Y. Niu, S. Pu, and S. Zhou. AON: towards arbitrarilyoriented text recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 1822, 2018, pages 5571–5579, 2018.
 [8] C. K. Ch’ng and C. S. Chan. Totaltext: A comprehensive dataset for scene text detection and recognition. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, volume 1, pages 935–942. IEEE, 2017.

 [9] S. K. Ghosh, E. Valveny, and A. D. Bagdanov. Visual attention models for scene text recognition. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 01, pages 943–948, Nov 2017.
 [10] V. Goel, A. Mishra, K. Alahari, and C. Jawahar. Whole is greater than sum of parts: Recognizing scene text words. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, pages 398–402. IEEE, 2013.
 [11] A. Gordo. Supervised midlevel features for word image representation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 712, 2015, pages 2956–2964, 2015.

 [12] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376, Pittsburgh, Pennsylvania, USA, 2006.
 [13] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2315–2324, 2016.
 [14] T. He, Z. Tian, W. Huang, C. Shen, Y. Qiao, and C. Sun. An endtoend textspotter with explicit alignment and attention. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 1822, 2018, pages 5020–5029, 2018.
 [15] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Deep structured output learning for unconstrained text recognition. arXiv preprint arXiv:1412.5903, 2014.

 [16] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. NIPS Deep Learning Workshop, 2014.
 [17] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. International Journal of Computer Vision, 116(1):1–20, 2016.
 [18] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. ICDAR 2015 competition on robust reading. In Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, pages 1156–1160. IEEE, 2015.
 [19] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i. Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazàn, and L. P. de las Heras. ICDAR 2013 robust reading competition. In 2013 12th International Conference on Document Analysis and Recognition, pages 1484–1493, Aug 2013.
 [20] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
 [21] C.-Y. Lee and S. Osindero. Recursive recurrent nets with attention modeling for OCR in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2231–2239, 2016.

 [22] M. Liao, J. Zhang, Z. Wan, F. Xie, J. Liang, P. Lyu, C. Yao, and X. Bai. Scene text recognition from two-dimensional perspective. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
 [23] W. Liu, C. Chen, K.-Y. K. Wong, Z. Su, and J. Han. STAR-Net: A spatial attention residue network for scene text recognition. In BMVC, volume 2, page 7, 2016.
 [24] A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In BMVC - British Machine Vision Conference. BMVA, 2012.
 [25] A. Mishra, K. Alahari, and C. V. Jawahar. Enhancing energy minimization framework for scene text recognition with top-down cues. Computer Vision and Image Understanding, 145:30–42, 2016.
 [26] T. Novikova, O. Barinova, P. Kohli, and V. S. Lempitsky. Large-lexicon attribute-consistent text recognition in natural images. In Computer Vision - ECCV 2012 - 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part VI, pages 752–765, 2012.

[27] A. Paszke, S. Gross, S. Chintala, and G. Chanan. PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration. URL https://github.com/pytorch/pytorch, 2017.
 [28] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan. Recognizing text with perspective distortion in natural scenes. In 2013 IEEE International Conference on Computer Vision, pages 569–576, Dec 2013.
 [29] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan. A robust arbitrary text detection system for natural scene images. Expert Systems with Applications, 41(18):8027 – 8048, 2014.
 [30] J. A. RodriguezSerrano, A. Gordo, and F. Perronnin. Label embedding: A frugal baseline for text recognition. International Journal of Computer Vision, 113(3):193–207, Jul 2015.
 [31] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11):2298–2304, 2017.
 [32] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai. Robust scene text recognition with automatic rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4168–4176, 2016.
 [33] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai. ASTER: An attentional scene text recognizer with flexible rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1. IEEE, 2018.
 [34] D. L. Smith, J. Field, and E. Learned-Miller. Enforcing similarity constraints with integer programming for better scene text recognition. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '11, pages 73–80, Washington, DC, USA, 2011. IEEE Computer Society.
 [35] B. Su and S. Lu. Accurate scene text recognition based on recurrent neural network. In D. Cremers, I. Reid, H. Saito, and M.-H. Yang, editors, Computer Vision - ACCV 2014, pages 35–48, Cham, 2015. Springer International Publishing.
 [36] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In Proceedings of the 2011 International Conference on Computer Vision, ICCV '11, pages 1457–1464, Washington, DC, USA, Nov 2011. IEEE Computer Society.
 [37] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng. End-to-end text recognition with convolutional neural networks. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR 2012), pages 3304–3308, Nov 2012.
 [38] Y.-C. Wu, F. Yin, X.-Y. Zhang, L. Liu, and C.-L. Liu. SCAN: Sliding convolutional attention network for scene text recognition. arXiv preprint arXiv:1806.00578, 2018.
 [39] X. Yang, D. He, Z. Zhou, D. Kifer, and C. L. Giles. Learning to read irregular text with attention mechanisms. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3280–3286, 2017.
 [40] C. Yao, X. Bai, B. Shi, and W. Liu. Strokelets: A learned multiscale representation for scene text recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4042–4049, 2014.
 [41] F. Zhan and S. Lu. ESIR: End-to-end scene text recognition via iterative image rectification. arXiv preprint arXiv:1812.05824, 2018.
 [42] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6230–6239, Honolulu, HI, USA, July 2017.