Urdu Nasta’liq text recognition using implicit segmentation based on multi-dimensional long short term memory neural networks
© The Author(s) 2016
Received: 17 August 2015
Accepted: 29 September 2016
Published: 25 November 2016
The recognition of Arabic script and its derivatives such as Urdu, Persian and Pashto is a difficult task owing to the complexity of the script. Urdu text recognition is particularly difficult because of its Nasta’liq writing style. The Nasta’liq style has a complex calligraphic nature, which presents major issues for the recognition of Urdu text owing to diagonality in writing, high cursiveness, context sensitivity and overlapping of characters. Therefore, the work done for the recognition of Arabic script cannot be directly applied to Urdu recognition. We present Multi-dimensional Long Short Term Memory (MDLSTM) Recurrent Neural Networks with an output layer designed for sequence labeling for the recognition of printed Urdu text-lines written in the Nasta’liq writing style. Experiments show that MDLSTM attained a recognition accuracy of 98% for unconstrained printed Urdu Nasta’liq text, which significantly outperforms the state-of-the-art techniques.
The tremendous advances in the field of image processing and computational intelligence have resulted in significant progress in the development of character recognition applications for complex scripts. In particular, several OCR systems have been developed in the commercial as well as the open source domain for the recognition of Asian scripts like Chinese, Japanese and Korean, such as ABBYY FineReader, MeOCR, JOCR and Tesseract (Smith 2007). However, progress in the recognition of Arabic script has been relatively slow, mainly due to the special cursive characteristics of the script. Recognition of its derivative scripts like Nasta’liq is further complicated by their calligraphic nature (Naz et al. 2014a). We point out these complexities to show that the work done for Arabic script recognition is not suitable for Urdu Nasta’liq script (cf. “Urdu–Nasta’liq script” section).
To handle these complexities in Arabic script in general, and in Urdu Nasta’liq script in particular, a number of different approaches have been studied (Naz et al. 2014a). These approaches can be primarily categorized under Analytical and Holistic frameworks. Analytical approaches are further divided into explicit segmentation and implicit segmentation based methods. Explicit segmentation approaches usually have three major steps: over-segmentation, grouping, and classification. In the first phase, the ligature is segmented into units no bigger than the various shapes of a character, and the recognized units are then grouped to form ligature hypotheses. These ligature candidates are then fed to the recognition engine to find the most plausible combination. These approaches are script dependent and rely mainly on the analytical characteristics of the particular script to perform segmentation (Naz et al. 2015b). Accurate and consistent segmentation under various document degradations usually becomes the performance bottleneck of such systems. Implicit segmentation approaches are based on predefined labels or code-books for images of text-lines, words or ligatures. The labels with their corresponding images are fed to a given machine learning model, which is then used to identify segmentation cue points at recognition time without pre-segmented units of ligatures (Saeed and Albakoor 2009). Holistic approaches, on the other hand, deal with the shapes of entire ligatures. The shape of the ligature or sub-word is learned by the model without segmenting it into sub-units, and the system is trained to recognize each ligature or word directly. Holistic approaches are considered to be script independent. However, they suffer from scalability issues, as the number of unique ligature or sub-word shapes in a particular script may be very large.
Urdu has more than 25,000 ligatures (Lehal 2012), so holistic approaches are not suitable for such a large number of classes. However, small scale applications with a limited vocabulary, such as city names or bank checks, could be developed using holistic approaches.
Urdu Nasta’liq writing has a diagonal nature: the pen stroke moves not only from right to left but also from top to bottom. A suitable model must therefore learn patterns and sequences not only from right to left and from left to right, but also from top to bottom and from bottom to top. In this work we propose adapting Multidimensional Long Short Term Memory (MDLSTM) neural networks to the recognition of Urdu Nasta’liq script under the implicit segmentation approach. The reason for choosing MDLSTM is that it can scan the input image in all four directions (up, down, right and left). MDLSTM is a variant of the Recurrent Neural Network (RNN) and has been used effectively for multi-dimensional sequence learning (Naz et al. 2013b). The novelty of this work is, in general, the first use of MDLSTM for Urdu Nasta’liq script recognition and, in particular, the investigation of the MDLSTM architecture against the diagonal nature of Nasta’liq script. Furthermore, we propose a Connectionist Temporal Classification (CTC) layer as the output layer. CTC can probabilistically align the labels with the learned sequences in the image, thus avoiding explicit segmentation. To evaluate the performance of MDLSTM on Urdu Nasta’liq script, we used the Urdu Printed Text Images (UPTI) dataset, which contains 10,000 text-lines written in the Nasta’liq writing style.
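The four scanning directions can be pictured with a small sketch. It assumes, as an illustration rather than a description of the authors' implementation, that each directional layer scans the same image starting from a different corner, which is equivalent to flipping the image along one or both axes before a single top-left to bottom-right scan. The function name `four_direction_views` is hypothetical.

```python
import numpy as np

def four_direction_views(image):
    """Return the image flipped so that one top-left to bottom-right scan
    visits the pixels in each of the four 2D scan orders."""
    return {
        "right_down": image,           # scan starts at the top-left corner
        "left_down": image[:, ::-1],   # scan starts at the top-right corner
        "right_up": image[::-1, :],    # scan starts at the bottom-left corner
        "left_up": image[::-1, ::-1],  # scan starts at the bottom-right corner
    }

# Toy 3 x 4 "image" whose pixel values are their row-major positions.
line = np.arange(12).reshape(3, 4)
views = four_direction_views(line)
for name, view in views.items():
    # The first pixel visited differs for every scan direction.
    print(name, int(view[0, 0]))
```

For a right-to-left, top-to-bottom stroke such as Nasta’liq, the `left_down` view is the one whose scan order follows the pen movement most closely.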
The rest of this paper is organized as follows: “Urdu–Nasta’liq script” section illustrates the complexities of Urdu script. “Related work” section and “Database” section describe the related work and dataset. “Methods” section presents MDLSTM based Urdu Nasta’liq recognition system and finally “Conclusion” section presents conclusions of the work.
Nasta’liq script emerged as a combination of two other Arabic scripts, Naskh and Talique, and gained popularity due to its beauty and compactness. Nasta’liq therefore carries the properties of both scripts, and its calligraphic nature introduces unique challenges that do not occur in Naskh and other Arabic scripts (Naz et al. 2014a, 2015b). These complexities make character segmentation and recognition in Nasta’liq script a very challenging task.
Another complexity is introduced by multiple baselines in the Nasta’liq script (Naz et al. 2014b). The baseline is a virtual line on which characters are combined to form ligatures, and it helps both readers and writers. Unlike in the Naskh script, a character may appear on different descender lines depending upon the associated characters. In the Nasta’liq writing style, the varying locations of ascenders and descenders lead to errors in baseline detection because of their oblique orientation and long tails. Thus, without prior knowledge of the word and text-line structure, it is quite difficult to estimate the baseline.
Due to the calligraphic nature of Nasta’liq script, character segmentation is challenging and prone to error (Hussain and Ali 2015). The purpose of segmentation is to divide the ligature into recognizable units or characters. Segmentation has considerable overheads and it is difficult to find accurate segmentation points for Nasta’liq script (Lehal 2012).
In traditional segmentation-based approaches, the performance of character recognition depends on character segmentation accuracy (Naz et al. 2013b). As discussed earlier in the “Urdu–Nasta’liq script” section, explicit segmentation of cursive Nasta’liq script is difficult and prone to errors (Naz et al. 2015b). The trend is shifting towards implicit segmentation, as these approaches, especially the ones based on Recurrent Neural Networks (RNN), have shown promising results for cursive scripts. In the following text, we discuss the benchmark results based on implicit segmentation for different languages using RNN classifiers.
Graves et al. (2009) applied Bidirectional Long Short-Term Memory (BLSTM) networks for online and offline character recognition on IAM-onDB and IAM-DB databases without explicit segmentation of words into characters and reported word-level recognition accuracies of 79.7 and 74.1%, and character-level accuracies of 88.5 and 81.8% respectively. The experiment showed that BLSTM out-performed the state-of-the-art segmentation based and segmentation free approaches. Graves further extended 1-dimensional (1D) LSTM into two-dimensional (2D) LSTM and presented an MDLSTM system based on a hierarchy of MDRNN and CTC (Graves et al. 2006) in ICDAR-2009 cursive handwriting recognition competition (Mozaffari and Soltanizadeh 2009). They used the raw pixels as input to the MDLSTM classifier and obtained accuracy of 91.85 and 95.9% for Arabic characters and digits respectively. Further, Graves and Schmidhuber (2009) presented another MDLSTM based system (Märgner and El Abed 2009) and achieved the highest results (91.4%) in the ICDAR 2009 competition on IFN/ENIT dataset (Mozaffari et al. 2008). A remarkable contribution of Graves in the field of character and speech recognition is the development of open source library, RNNLIB (Graves 2013), that implements RNNs, BLSTM, and MDLSTM architectures. Rashid et al. (2013) extracted raw pixels from Arabic words and fed them to MDLSTM to achieve 99% recognition on APTI dataset and subsequently win ICDAR 2013 Printed Arabic Recognition Competition.
Recently, Anupama and Sai (2015) implemented BLSTM using raw pixels for the Oriya language and claimed a 95.85% recognition rate. Another recent contribution (Pham et al. 2013) performed classification based on raw pixels from the text image for English, French and Arabic using an MDLSTM classifier. Pham et al. demonstrated the effectiveness of dropout in the traditional RNN architectures and reported 91.1, 85.6 and 90.1% on the RIMES (French) (Grosicki et al. 2009), IAM (English) (Marti and Bunke 2002) and OpenHaRT (Arabic) (Morillot et al. 2013b) datasets, respectively. In ICDAR-2015, Chherawala et al. (2013) presented a scale invariant Pashto ligature recognition system using MDLSTM and reported a 99% recognition rate. There are also some works using BLSTM or MDLSTM systems based on feature vectors rather than raw pixels (Ahmad et al. 2015; Chherawala et al. 2013; Liwicki et al. 2007).
In the literature on Urdu OCR using the implicit segmentation approach, Ul-Hasan et al. (2013) performed two experiments for Urdu text-line recognition on the UPTI database (Ahmed et al. 2016) using one-dimensional BLSTM and a sliding window. In the first experiment, they considered the shape variations of Urdu characters (i.e. initial, middle, final and isolated) as separate classes. In the second experiment, they merged all shape variations of one basic character into one class and extracted the raw pixels from a \(30 \times 1\) sliding window to train the BLSTM classifier. They achieved character recognition rates of 86.4 and 94.85% for the two experiments respectively. Another work on the UPTI dataset is reported in Ahmed et al. (2016), in which Ahmed et al. employed BLSTM on raw pixels, with and without shape variations, using a \(30 \times 1\) sliding window over Urdu text-lines and reported recognition rates of up to 88.4% for the first scenario and 88.94% for the second scenario.
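As a rough illustration of how a \(30 \times 1\) sliding window turns a text-line image into a sequence of raw-pixel frames, the following NumPy sketch cuts a height-normalized line into one-pixel-wide columns. It is not taken from the cited works: the function name `column_frames`, the assumption that 30 is the line height in pixels, and the dropping of a ragged final window are all illustrative choices.

```python
import numpy as np

def column_frames(line_image, window_width=1):
    """Cut a (height, width) text-line image into a left-to-right sequence of
    flattened windows, each `window_width` pixel columns wide."""
    height, width = line_image.shape
    usable = width - (width % window_width)  # drop a ragged final window
    num_frames = usable // window_width
    frames = line_image[:, :usable].reshape(height, num_frames, window_width)
    # Reorder to (num_frames, height * window_width): one vector per frame.
    return frames.transpose(1, 0, 2).reshape(num_frames, height * window_width)

# Toy line image: 30 pixels high, 5 pixels wide.
image = np.arange(30 * 5).reshape(30, 5)
frames = column_frames(image)
print(frames.shape)  # one 30-dimensional raw-pixel vector per column
```

Because Urdu is written from right to left, a one-dimensional recognizer might consume `frames[::-1]` instead; a bidirectional network sees both orders in any case.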
Because it also uses the UPTI dataset for Urdu text recognition, we mention the work of Morillot et al. (2013a). They presented a segmentation free OCR system for the recognition of clean as well as degraded ligature images of Urdu Nasta’liq. They segmented the ligatures from the text-lines and recognized the ligatures based on holistic features. They achieved a recognition rate of 88.8% for degraded ligatures and 91% for clean ligatures.
As mentioned above, Ul-Hasan et al. (2013) and Ahmed et al. (2016) implemented BLSTM for Urdu Nasta’liq text recognition, while in Naz et al. (2015a) statistical features were extracted and fed to MDLSTM. To the best of our knowledge, the MDLSTM approach using raw pixels has not been explored for Urdu Nasta’liq recognition. In the proposed system, we investigate MDLSTM using raw pixels for Urdu Nasta’liq recognition. The description of the database used in our study is given in the following sections.
The UPTI dataset splits used in this work
In this section, we present the experimental design of Urdu Nasta’liq text line recognition. We adopted pixel based MDLSTM approach reported in Naz et al. (2013b) for recognition of cursive Urdu script. The normalized grayscale text-lines and the corresponding transcriptions are fed to the MDLSTM network. The network is trained on raw pixels of images having Urdu text-lines and the CTC layer is deployed to generate the sequence of labels for the text line images. During recognition, a normalized grayscale test image is classified through the trained network and it generates the text line transcription.
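A minimal preprocessing sketch for the normalized grayscale input is given below. The target height of 46 pixels is the value the paper reports for its test images; everything else is an assumption made for illustration, including the BT.601 luma weights, nearest-neighbour resampling, preserving the aspect ratio, and the function names `to_grayscale` and `normalize_height`.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an (H, W, 3) RGB array to (H, W) grayscale using the
    ITU-R BT.601 luma weights (an assumed choice)."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb[..., :3] @ weights

def normalize_height(gray, target_height=46):
    """Resize to `target_height` rows with nearest-neighbour sampling,
    scaling the width by the same factor to keep the aspect ratio."""
    height, width = gray.shape
    scale = target_height / height
    target_width = max(1, round(width * scale))
    # Map every output row/column back to its nearest source index.
    rows = np.minimum((np.arange(target_height) / scale).astype(int), height - 1)
    cols = np.minimum((np.arange(target_width) / scale).astype(int), width - 1)
    return gray[np.ix_(rows, cols)]

# Toy text-line image: 92 pixels high, 400 pixels wide, all white.
rgb = np.ones((92, 400, 3))
normalized = normalize_height(to_grayscale(rgb))
print(normalized.shape)
```

A production pipeline would more likely use an image library with proper interpolation; nearest-neighbour sampling is used here only to keep the sketch dependency-free.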
Preprocessing and features extraction
The overall architecture of the MDLSTM network for recognition of Urdu Nasta’liq text-lines (see Fig. 7) is specified by the input block size, the hidden block size, the sub-sampling layer sizes and the LSTM layer sizes, together with the maximum number of nodes for the CTC output layer. The input block size is the size of the small patches that scan the pixels of the image for further processing. The hidden block size is the size of the small patches at each hidden layer in the MDLSTM network. The sub-sampling layers sit between each pair of hidden layers, and the sub-sampling size specifies the total number of feed-forward \(\tanh\) units in each sub-sampling layer.
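The idea of an input block can be sketched as follows: the image is tiled into small non-overlapping patches, and each patch becomes one input vector. This is only an illustration under stated assumptions. The text does not say whether a 4 × 1 block means 4 rows by 1 column or the reverse, and zero-padding of images whose size is not a multiple of the block size is a choice made here; `gather_blocks` is a hypothetical helper.

```python
import numpy as np

def gather_blocks(image, block_h=4, block_w=1):
    """Tile a (height, width) image into non-overlapping blocks and return an
    array of shape (height / block_h, width / block_w, block_h * block_w)."""
    height, width = image.shape
    # Zero-pad so both dimensions are multiples of the block size (assumption).
    pad_h = (-height) % block_h
    pad_w = (-width) % block_w
    padded = np.pad(image, ((0, pad_h), (0, pad_w)))
    rows, cols = padded.shape[0] // block_h, padded.shape[1] // block_w
    blocks = padded.reshape(rows, block_h, cols, block_w)
    # Bring the two block axes together and flatten each block to a vector.
    return blocks.transpose(0, 2, 1, 3).reshape(rows, cols, block_h * block_w)

# A 46 x 10 toy image is padded to 48 rows and becomes a 12 x 10 grid
# of 4-dimensional input vectors.
image = np.arange(46 * 10, dtype=float).reshape(46, 10)
blocks = gather_blocks(image)
print(blocks.shape)
```

The same tiling, applied to hidden-layer activations instead of pixels, gives one way to picture the hidden blocks described in the text.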
Selected parameters for training the network
Input block size: 4 × 1
Hidden block size: 4 × 2
Sub-sampling layer sizes: 6 and 20
Hidden layer sizes: 2, 10 and 50
\(1 \times 10^{-4}\)
Total network weight
Different parameters for training MDLSTM and the corresponding training and validation error rates
Column headers: Error rate (%) train set/validation set; Number of passes; Approx. ave. time per epoch (minutes)
Sub-sampling layer sizes: 6 and 20; 6 and 40; 12 and 40; 24 and 80
Hidden layer sizes: 2, 4 and 20; 4, 10 and 30; 2, 10 and 50; 4, 20 and 100
398 (experiment was terminated)
403 (experiment was terminated)
MDLSTM based Urdu character recognition system
The remaining network layers are as follows. There are three fully connected hidden layers consisting of LSTM cells, with sizes 2, 10 and 50 respectively. These three layers are separated by two sub-sampling layers of sizes 6 and 20 respectively. The sub-sampling layers are feed-forward \(\tanh\) layers. The features are collected into 4 × 2 hidden blocks, which are then fed to the feed-forward layer, which uses \(\tanh\) summation units for the cell activation, as shown in Fig. 7. The MDLSTM activations are finally collapsed into a one-dimensional sequence, and the CTC layer labels the contents of this one-dimensional sequence (Fig. 8).
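How a CTC output layer turns the collapsed one-dimensional sequence into a label string can be illustrated with best-path decoding: take the most probable label at each position, merge consecutive repeats, and drop the blank label. The text does not state which decoding scheme was used, so this is a sketch of one standard option; the function name `ctc_best_path` and the toy alphabet are hypothetical.

```python
import numpy as np

def ctc_best_path(probs, alphabet, blank=0):
    """Decode a (T, K) array of per-position label probabilities.

    `blank` is the index of the CTC blank label; `alphabet` maps every
    other label index to its character."""
    best = np.argmax(probs, axis=1)  # most probable label per position
    decoded = []
    previous = blank
    for label in best:
        label = int(label)
        # Emit a character only when it is not blank and not a repeat.
        if label != blank and label != previous:
            decoded.append(alphabet[label])
        previous = label
    return "".join(decoded)

# Toy example: the label path 1 1 0 1 2 2 0 collapses to "aab".
alphabet = {1: "a", 2: "b"}
path = [1, 1, 0, 1, 2, 2, 0]
probs = np.eye(3)[path]  # one-hot "probabilities" for the path
print(ctc_best_path(probs, alphabet))
```

The blank between the two runs of label 1 is what allows the same character to be emitted twice in a row. Training with CTC additionally requires the forward-backward computation over all alignments, which is not shown here.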
Error rates for Urdu Nasta’liq text line recognition for training and validation sets
Column headers: Training set (%); Validation set (%)
The test set of 1600 unseen images is fed to the trained MDLSTM model for classification. Once again, each image is converted to grayscale and its height is normalized to 46 pixels. The classification and recognition of the 1600 images took a total of 1 minute and 43.3 seconds on a 3.4 GHz Intel Core i7 machine with 8 GB RAM. After recognition, the predicted text is generated for each image as the output. The predicted text is then compared against the corresponding ground truth and the overall error rates are calculated.
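The comparison of predicted text with ground truth can be sketched as a character-level edit distance that also counts substitutions, deletions and insertions. This is an illustrative implementation, not the authors' evaluation code; the function names and the definition of the error rate as total edits divided by the reference length are assumptions.

```python
def edit_operations(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of one minimum-cost
    alignment between the two strings."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # cost[i][j] = (total, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    cost = [[None] * cols for _ in range(rows)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)  # delete every reference character
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)  # insert every hypothesis character
    for i in range(1, rows):
        for j in range(1, cols):
            t, s, d, n = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (t, s, d, n)          # characters match
            else:
                diagonal = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            cost[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    return cost[-1][-1][1:]

def character_error_rate(reference, hypothesis):
    """Error rate in percent: total edits divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * sum(edit_operations(reference, hypothesis)) / len(reference)

print(edit_operations("kitten", "sitting"))
print(character_error_rate("abcd", "abxd"))
```

Counting the three operation types separately makes it possible to report, for a single character such as the space, how many errors are insertions and how many are deletions.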
Results and discussion
A sample input image and the corresponding OCR output text are shown in Fig. 9, with an illustration of insertion, deletion and substitution errors. A closer analysis of the results revealed that most misclassification errors originate from the recognition of the “space” character. This issue is inherent in Nasta’liq script, because after each non-joiner character there is a space-like gap (Naz et al. 2013a). However, this gap is not a “space”, as it occurs naturally within a word when a non-joiner character is present at the initial or middle position of the word. Due to the compact nature of Nasta’liq script, spaces between words are no larger than the gaps within ligatures of the same word that are caused by non-joiner characters. Indeed, it is difficult even for non-native speakers to distinguish this break from the regular “space” character. Thus, the MDLSTM model confuses the space character with the gap caused by non-joiner characters. The numbers of insertion and deletion errors for “space” are 304 and 279 respectively. If the errors related to spaces are ignored, the accuracy goes up to 99.82%. This indicates that our network is able to achieve near-perfect results in discriminating the shapes of different characters. To further improve the recognition of spaces, and hence achieve better word segmentation, the use of language modeling could be explored (Durrani and Hussain 2010).
Three types of analysis techniques employed for generalization of recognition error rates on UPTI dataset
Column headers: Type of model validation technique; Error rate (%) of exp-1; exp-2; exp-3; exp-4; exp-5; Ave. error rate (%)
Rows: Train-set size based validation; Five-fold cross validation; Repeated random sub-sampling validation
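Two of the validation techniques named above, five-fold cross validation and repeated random sub-sampling, can be sketched as index generators over the 10,000 UPTI text-lines. The sketch makes several assumptions that the text does not confirm: the shuffling seed, five repetitions (matching the five experiment columns), and a 1600-line held-out set (the test-set size stated earlier). The function names are hypothetical.

```python
import random

def five_fold_splits(num_items, seed=0):
    """Yield (train, test) index lists; every item is tested exactly once."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::5] for k in range(5)]  # five near-equal folds
    for k in range(5):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]

def random_subsampling_splits(num_items, test_size, repeats=5, seed=0):
    """Yield `repeats` independent random (train, test) index lists."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(num_items))
        rng.shuffle(indices)
        yield indices[test_size:], indices[:test_size]

folds = list(five_fold_splits(10000))
subsamples = list(random_subsampling_splits(10000, test_size=1600))
print(len(folds), len(folds[0][1]), len(subsamples[0][1]))
```

In cross validation the test folds never overlap, whereas repeated random sub-sampling may draw the same text-line into several test sets; averaging the error over the five runs gives the kind of figure reported in the last column.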
A comparison of the presented system on UPTI dataset with other techniques reported in the literature
Ul-Hasan et al. (2013): 46% train set, 34% validation set, 20% test set
Ahmed et al. (2016): 46% train set, 44% validation set, 10% test set
Naz et al. (2015a): 68% train set, 16% validation set, 16% test set
Presented system: 68% train set, 16% validation set, 16% test set; ave. char. accuracy (%) 98 ± 0.25
We presented an Urdu Nasta’liq text line recognition system using a multidimensional deep learning approach (MDLSTM). The proposed approach is particularly suitable due to the diagonal nature of the script. Our results demonstrate that the presented system outperforms state-of-the-art approaches based on Bidirectional LSTM networks. We also show that automated feature extraction using raw pixels as input to the MDLSTM classifier achieves better results than manually designed statistical features. Results of our approach on the publicly available UPTI dataset show a reduction in error rate of more than 50% compared to state-of-the-art systems.
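The relative reduction in error rate can be checked with a short calculation. The baseline used here is an assumption: 94.85% is the highest BLSTM character recognition rate quoted in the related work discussion, and 98% is the accuracy reported for the presented system; the text does not state which baseline the authors used for their figure. The function name is hypothetical.

```python
def relative_error_reduction(baseline_accuracy, new_accuracy):
    """Percentage by which the error rate drops, given accuracies in percent."""
    baseline_error = 100.0 - baseline_accuracy
    new_error = 100.0 - new_accuracy
    return 100.0 * (baseline_error - new_error) / baseline_error

# Error falls from 5.15% to 2.0%, i.e. by roughly 61% of the baseline error.
reduction = relative_error_reduction(94.85, 98.0)
print(round(reduction, 1))
```

Under this assumed baseline the reduction is consistent with the claim of more than 50%.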
SN designed and performed experiment, analysed data and wrote paper; RA did implementation, analysis and wrote paper; AIU and MIR supervised and helped in experiment, analysis and paper writing. SFR and FS performed analysis and wrote manuscript. All authors discussed the results and implications and commented on the manuscript at all stages. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- Ahmad R, Zeeshan MA, Rashid SF, Lickiwi M, Breuel T (2015) Scale and rotation invariant OCR for Pashto cursive script using MDLSTM network. In: Document analysis and recognition (ICDAR)
- Ahmed SB, Naz S, Razzak MI, Rashid SF, Afzal MZ, Breuel TM (2016) Evaluation of cursive and non-cursive scripts using recurrent neural networks. Neural Comput Appl 27(3):603–613
- Akram QUA, Hussain S, Niazi A, Anjum U, Irfan F (2014) Adapting Tesseract for complex scripts: an example for Urdu Nastalique. In: 11th IAPR international workshop on document analysis systems (DAS). IEEE, New York, pp 191–195
- Anupama R, Sai CSR (2015) Text recognition using deep BLSTM networks. In: 2015 eighth international conference on advances in pattern recognition (ICAPR)
- Chherawala Y, Roy PP, Cheriet M (2013) Feature design for offline Arabic handwriting recognition: handcrafted vs automated. In: 12th international conference on document analysis and recognition (ICDAR)
- Durrani N, Hussain S (2010) Urdu word segmentation. In: Proceedings of the human language technologies: conference of the North American chapter of the association of computational linguistics, Los Angeles, CA, USA, pp 528–536
- Graves A (2012) Offline Arabic handwriting recognition with multidimensional recurrent neural networks. Springer, London
- Graves A (2013) RNNLIB: a recurrent neural network library for sequence learning problems. http://sourceforge.net/projects/rnnl/
- Graves A, Schmidhuber J (2009) Offline handwriting recognition with multidimensional recurrent neural networks. In: Advances in neural information processing systems, pp 545–552
- Graves A, Fernández S, Gomez F, Schmidhuber J (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd international conference on Machine learning (ICML). ACM, New York, pp 369–376
- Graves A, Liwicki M, Fernández S, Bertolami R, Bunke H, Schmidhuber J (2009) A novel connectionist system for unconstrained handwriting recognition. IEEE Trans Pattern Anal Mach Intell 31:855–868
- Grosicki E, Carré M, Geoffrois E (2009) Results of the RIMES evaluation campaign for handwritten mail processing abstract, pp 941–945
- Hussain S, Ali S (2015) Nastalique segmentation-based approach for Urdu OCR. Int J Doc Anal Recogn (IJDAR) 18(4):357–374
- Lehal GS (2012) Choice of Recognizable Units for Urdu OCR. In: Proceeding of the workshop on document analysis and recognition, pp 79–85
- Liwicki M, Graves A, Bunke H, Schmidhuber J (2007) A novel approach to on-line handwriting recognition based on bidirectional long short-term memory networks. In: Proceedings of the 9th international conference on document analysis and recognition, vol 1, pp 367–371
- Märgner V, El Abed H (2009) ICDAR 2009 Arabic handwriting recognition competition. In: Proceedings of the international conference on document analysis and recognition, ICDAR, pp 1383–1387
- Marti U, Bunke H (2002) The IAM-database: an English sentence database for offline handwriting recognition, pp 39–46
- Morillot O, Likforman-Sulem L, Grosicki E (2013a) New baseline correction algorithm for text-line recognition with bidirectional recurrent neural networks. J Electron Imaging 22(2):023028
- Morillot O, Oprean C, Likforman-sulem L, Mokbel C, Chammas E, Grosicki E, Paristech IMT, Ltci C (2013b) The UOB-telecom ParisTech Arabic handwriting recognition and translation systems for the OpenHart 2013 competition, vol 1
- Mozaffari S, Soltanizadeh H (2009) ICDAR 2009 handwritten Farsi/Arabic character recognition competition. In: 2009 10th international conference on document analysis and recognition, pp 1413–1417
- Mozaffari S, El Abed H, Maergner V, Faez K, Amirshahi A (2008) IfN/Farsi-database: a database of Farsi handwritten city names. In: Proceedings of the 11th international conference of frontiers of handwriting recognition
- Naz S, Hayat K, Razzak MI, Anwar MW, Akbar H (2013a) Arabic script based language character recognition: Nastaliq vs Naskh analysis. In: World congress on computer and information technology (WCCIT’13), pp 1–7
- Naz S, Hayat K, Razzak MI, Anwar MW, Akbar H (2013b) Arabic script based character segmentation: a review. In: World congress on computer and information technology (WCCIT’13), pp 1–6
- Naz S, Hayat K, Razzak MI, Anwar MW (2014a) The optical character recognition of Urdu-like cursive scripts. Pattern Recognit 47(3):1229–1248
- Naz S, Razzak MI, Hayat K, Anwar MW, Khan SZ (2014b) Challenges in baseline detection of Arabic script based languages. Intell Syst Sci Inf 542:181–196
- Naz S, Umar AI, Ahmad R, Ahmed SB, Shirazi SH, Razzak MI (2015a) Urdu Nasta’liq text recognition system based on multi-dimensional recurrent neural network and statistical features. Neural Comput Appl 1–13. doi:10.1007/s00521-015-2051-4
- Naz S, Umar AI, Ahmed SB, Shirazi SH, Razzak MI, Siddiqi I (2015b) Segmentation techniques for recognition of Arabic-like scripts: a comprehensive survey. Educ Inf Technol 21(5):1225–1241. doi:10.1007/s10639-015-9377-5
- Naz S, Umar AI, Ahmad R, Ahmed SB, Shirazi SH, Siddiqi I, Razzak MI (2016a) Offline cursive Urdu–Nastaliq script recognition using multidimensional recurrent neural networks. Neurocomputing 177:228–241
- Naz S, Ahmed SB, Ahmad R, Razzak MI (2016b) Zoning features and 2DLSTM for Urdu text-line recognition. Procedia Comput Sci 96(96):16–22
- Pham V, Bluche T, Kermorvant C, Louradour J (2013) Dropout improves recurrent neural networks for handwriting recognition. arXiv:1312.4569
- Rashid SF, Schambach MP, Rottland J, Nüll SVD (2013) Low resolution Arabic recognition with multidimensional recurrent neural networks. In: Proceedings of the 4th international workshop on multilingual OCR, p 6
- Sabbour N, Shafait F (2013) A segmentation-free approach to Arabic and Urdu OCR. In: Proceedings of the SPIE international society for optics and photonics, vol 86580, p 86580N
- Saeed K, Albakoor M (2009) Region growing based segmentation algorithm for typewritten and handwritten text recognition. J Appl Soft Comput 9(2):608–617
- Slimane F, Ingold R, Kanoun S, Alimi AM, Hennebert J (2009) A new Arabic printed text image database and evaluation protocols. In: Proceedings international conference document analysis and recognition, ICDAR, pp 946–950
- Smith R (2007) An overview of the Tesseract OCR engine. In: IEEE international conference of document analysis and recognition (ICDAR), pp 629–633
- Ul-Hasan A, Ahmed SB, Rashid F, Shafait F, Breuel TM (2013) Offline printed Urdu Nastaleeq script recognition with bidirectional LSTM networks. In: 12th international conference on document analysis and recognition (ICDAR’13), pp 1061–1065