Punjabi University Patiala, India, Website http://www.punjabiuniversity.ac.in http://www.advancedcentrepunjabi.org http://www.punjabiuniversity.ac.in/sangam/ http://www.advancedcentrepunjabi.org/intro1.asp

Project Title: Development of Robust Document Image Analysis and Recognition System for Printed Urdu Script
Sponsored by: Department of Science & Technology
Grant: 52.90 Lakh
Duration: 2010 to 2013

Aim of Project

     
  1. To develop an Urdu OCR system with the following capabilities:
    1. Recognize commonly used Urdu fonts with 95% recognition accuracy at the character level.
    2. Recognize common Urdu symbols and numerals.
    3. Handle documents with complex layouts (tables, multiple columns, etc.).
    4. Process multi-color pages.
  2. To prepare an annotated corpus of 5000 pages for Urdu script.
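
The 95% character-level target above needs a concrete definition before it can be measured. One common convention, assumed here and not stated by the project, is one minus the edit distance between the ground-truth text and the OCR output, divided by the ground-truth length. The function names below are hypothetical, and the comparison is made per Unicode code point, which is only one of several possible definitions of "character" for Urdu.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_accuracy(ref: str, hyp: str) -> float:
    """Character-level accuracy of OCR output `hyp` against ground truth `ref`."""
    if not ref:
        return 1.0 if not hyp else 0.0
    # Clamp at 0 because a very long hypothesis can exceed len(ref) errors.
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))
```

Under this definition, one wrong character in a 20-character ground-truth string gives an accuracy of 0.95.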

Scope of the Project

     
 

We shall also be developing the first-generation OCR for Urdu script. From an OCR point of view, Urdu is one of the most challenging scripts, because character and word shapes change according to context and the characters are usually joined together. An Urdu word grows in both the horizontal and vertical directions. An Urdu word is a combination of ligatures (characters which join together) and isolated characters. Urdu writing does not use the space as a word boundary marker, which makes word segmentation a challenging task. Urdu font developers have estimated that there are around 18,000 ligatures in Urdu, which makes ligature classification a difficult task.
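
To make the notion of a ligature concrete, the following minimal sketch splits the text of a word into ligatures using letter-joining behaviour. It rests on assumptions that are not part of the project description: the set `NON_CONNECTORS` is a basic, non-exhaustive list of Urdu letters that do not join to the following letter, and the input contains no diacritics or other combining marks. The function name `split_ligatures` is hypothetical. It operates on text, not on images, so it illustrates the concept and is not an OCR step.

```python
from typing import List

# Letters that join only to the preceding letter, so a ligature ends after them.
# Assumption: basic, non-exhaustive set (alef, alef madda, dal, ddal, thal,
# reh, rreh, zain, jeh, waw, yeh barree).
NON_CONNECTORS = set(
    "\u0627\u0622\u062f\u0688\u0630\u0631\u0691\u0632\u0698\u0648\u06d2"
)


def split_ligatures(word: str) -> List[str]:
    """Split a diacritic-free Urdu word into its ligatures."""
    ligatures: List[str] = []
    current = ""
    for ch in word:
        current += ch
        if ch in NON_CONNECTORS:
            # The next letter cannot attach, so the ligature is complete.
            ligatures.append(current)
            current = ""
    if current:
        ligatures.append(current)
    return ligatures


# The word "Pakistan" (peh, alef, keheh, seen, teh, alef, noon)
pakistan = "\u067e\u0627\u06a9\u0633\u062a\u0627\u0646"
print(len(split_ligatures(pakistan)))  # 3 ligatures
```

In this example the word breaks after each alef, giving two multi-letter ligatures followed by an isolated noon.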

Project Methodology

     
 

Urdu is written using the Arabic script in the Nastalique writing style. Urdu words are written from right to left and numbers are written from left to right, so the script is bidirectional. From an OCR point of view, Urdu is one of the most challenging scripts, because character and word shapes change according to context and the characters are usually joined together. There has been considerable work on Arabic OCR. However, all that work is based on the Naskh style. The Nastalique style used for Urdu presents many more challenges and is therefore a very different OCR problem. In summary, the challenges include much greater cursiveness, diagonality, mark placement and significantly more contextual shaping. This means that although the work on Arabic is relevant, those algorithms need to be evolved further for Urdu. The main modules for OCR of Urdu script will be:

  1. Pre-processing routines for slant normalisation, smoothing, noise cleaning, skew correction, thinning, etc.
  2. Segmentation routines for line, ligature and character segmentation.
  3. Feature extraction and recognition of Urdu ligatures and characters.
  4. Development of a language model for Urdu for combining adjacent Urdu ligatures to form Urdu words.
  5. Creation of an annotated corpus of images of 5000 Urdu pages.

We propose to develop an Urdu OCR with around 95% character recognition accuracy on noise-free documents.
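
As an illustration of the line segmentation module, the sketch below uses a horizontal projection profile, a standard baseline technique. The project description does not say which method is used, so this is an assumption, as are the function name `segment_lines` and the `min_ink` threshold. The input is taken to be an already binarised and deskewed page, stored as rows of 0 (background) and 1 (ink). Nastalique text may need more than this baseline where marks from neighbouring lines overlap vertically.

```python
from typing import List, Tuple


def segment_lines(image: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges of text lines, with bottom exclusive.

    A row belongs to a text line when it holds at least `min_ink` ink pixels.
    """
    profile = [sum(row) for row in image]  # ink pixels per row
    lines: List[Tuple[int, int]] = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                   # first inked row of a new line
        elif ink < min_ink and start is not None:
            lines.append((start, y))    # a blank row closes the line
            start = None
    if start is not None:               # page ends inside a text line
        lines.append((start, len(profile)))
    return lines


page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
]
print(segment_lines(page))  # [(1, 3), (4, 5)]
```

Raising `min_ink` makes the split less sensitive to isolated noise pixels, at the cost of possibly cutting off thin marks at the top or bottom of a line.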

Intermediate Milestones

     
  • Year-I
    • Initiation of OCR development for Urdu script with already developed tools
    • Pre-processing routines of Urdu script
    • Statistical analysis of ligatures
    • Development of line and ligature segmentation routines
    • Development of language models for Urdu characters, ligatures and words
  • Year-II
    • Feature extraction routines for Urdu Script
    • Multi-classifier system for ligature recognition, with separate classifiers for high-frequency and low-frequency ligatures
    • Rules and language models to combine adjacent ligatures to form valid Urdu words.
  • Year-III
    • Testing and development of the first-generation OCR for Urdu script
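
The Year-II milestone of combining adjacent ligatures into valid words can be illustrated with a small lexicon-based sketch. It assumes a hypothetical lexicon in which each word is stored as a tuple of ligatures, and it picks the grouping with the fewest words by dynamic programming. The project's actual rules and language models are not described on this page, so the approach, the function name `group_ligatures` and the `max_len` limit are all illustrative assumptions. Latin placeholders stand in for ligature labels.

```python
from typing import List, Optional, Sequence, Set, Tuple

Word = Tuple[str, ...]


def group_ligatures(ligs: Sequence[str], lexicon: Set[Word],
                    max_len: int = 6) -> Optional[List[Word]]:
    """Group recognised ligatures into the fewest lexicon words.

    Returns None when no grouping into lexicon words exists.
    """
    n = len(ligs)
    # best[i] holds the best grouping of the first i ligatures, if any.
    best: List[Optional[List[Word]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prefix = best[start]
            if prefix is None:
                continue
            word = tuple(ligs[start:end])
            if word in lexicon:
                candidate = prefix + [word]
                current = best[end]
                if current is None or len(candidate) < len(current):
                    best[end] = candidate
    return best[n]


# Hypothetical lexicon with placeholder ligature labels
lexicon = {("pa", "kista", "n"), ("a", "r", "d", "u"), ("pa",), ("n",)}
ligs = ["pa", "kista", "n", "a", "r", "d", "u"]
print(group_ligatures(ligs, lexicon))
# [('pa', 'kista', 'n'), ('a', 'r', 'd', 'u')]
```

A statistical language model would typically replace the fewest-words criterion with word or ligature probabilities, but the dynamic-programming structure stays the same.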

Project Team

     
 
Project staff members for development:

Principal Investigator: Dr. Gurpreet Singh Lehal
Punjabi University, Patiala
Co-Investigator: Dr. Dharam Veer Sharma
Punjabi University, Patiala

© 2009 ACTDPL Punjabi University, Patiala