Punjabi University, Patiala, India. Website: http://www.universitypunjabi.org http://www.advancedcentrepunjabi.org http://g2s.learnpunjabi.org/default.aspx http://www.advancedcentrepunjabi.org/intro1.asp

Optical Character Recognition (OCR) is the process of converting printed material into text or word-processing files that can be easily edited and stored. When we scan a sheet of paper, we convert it from a tangible “hard” object into a digital object, which is saved as an image. The image can be manipulated as a whole, but its text cannot be manipulated separately. To do that, we need to “tell” the computer to recognize the text as text and to let us manipulate it as if it were text in a word-processing document. An OCR application does exactly this: it recognizes the characters and makes the text editable and searchable. The technology also allows such material to be stored in much less space than the hard copies occupy. OCR technology has had a major impact on the way information is stored, shared and edited.

Character accuracy, the most important aspect of text recognition, varies widely with the quality and nature of the image (type and size of font, presence of special characters, complex layouts, and non-Roman characters), the scanning resolution, and the OCR software itself. In general, the better the image quality and the higher the resolution, the higher the accuracy. Accuracy is expressed as a percentage: 98% accuracy, for example, means two errors in every 100 characters. Depending on these factors, OCR accuracy might range between 80% and 99%.
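As a rough illustration of how such a percentage can be computed, the sketch below compares OCR output against a reference transcription using edit distance. This is one common way of counting character errors, not necessarily the measure used in this project; the function names are hypothetical, and characters are counted as Unicode code points, which is an assumption that matters for Gurmukhi because a consonant and its vowel sign are separate code points.

```python
def edit_distance(reference: str, recognized: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(recognized) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, rec_char in enumerate(recognized, start=1):
            cost = 0 if ref_char == rec_char else 1
            current.append(min(previous[j] + 1,          # character missing from output
                               current[j - 1] + 1,       # extra character in output
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def character_accuracy(reference: str, recognized: str) -> float:
    """Accuracy in percent, relative to the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, recognized)
    # Clamp at zero: very noisy output can contain more errors than characters.
    return max(0.0, 100.0 * (len(reference) - errors) / len(reference))
```

With a 100-character reference and an output that gets two characters wrong, `character_accuracy` returns 98.0, which matches the "two errors out of 100 characters" reading given above.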

Optical Character Recognition (OCR) is one of the most common and useful applications of machine vision technology. Researchers have experimented with programs designed to recognize images of printed characters since at least the 1960s, but it was in the 1980s that OCR systems expanded in use and significance. Improvements in the power and price of software and hardware since the 1980s have made OCR practical and affordable on standard desktop computers. OCR for Gurmukhi script has a much shorter history: an offline OCR for Gurmukhi was developed only in the 2000s. That OCR, developed by the coordinator of this project, has an accuracy of around 97% on clean text. We propose to increase the accuracy of the Gurmukhi OCR and to extend it to noisy and old documents. Support will also be added for more typefaces and for symbols such as numerals.

To develop the Gurmukhi OCR, we shall participate in a consortium of multiple institutes. The following modules will be developed for Gurmukhi script:

  1. Line-to-Word Segmentation

    • Line-level Text Statistics

    • Detection of Word Boundaries

  2. Word-to-Component Segmentation

    • Detection/Removal of Shirorekhas

    • Detection of Connected Components

    • Script-based Component Splitting

    • Removal of Cuts and Merges

  3. Component Recognition

    • Feature Extraction from Components

    • Script-based Classifier Design

    • Learning of Classifier Parameters

    • Combination of Multi-Modal Classifiers

  4. Language-specific Post Processing

    • Character Recognition from Component Labels

    • Word Recognition from Characters

    • Language-model based Disambiguation of Labels

    • Dictionary Lookup for Spell Check
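
The first two modules in the list above can be illustrated with projection profiles, a common technique for scripts written under a headline (shirorekha). The following is a minimal sketch under stated assumptions, not the consortium's implementation: the image representation (nested lists of 0/1 pixels), the function names, and the `min_gap` threshold are all hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image, row-major: 1 = ink, 0 = background


def column_profile(line: Image) -> List[int]:
    """Count ink pixels in each column of a text-line image."""
    return [sum(column) for column in zip(*line)]


def word_boundaries(line: Image, min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, one per word.

    A run of at least `min_gap` empty columns is treated as a word gap;
    shorter runs are assumed to be gaps inside a word.
    """
    profile = column_profile(line)
    words: List[Tuple[int, int]] = []
    start = None  # first ink column of the word being scanned
    gap = 0       # length of the current run of empty columns
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                words.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        # Close the last word, dropping any short trailing run of empty columns.
        words.append((start, len(profile) - gap))
    return words


def shirorekha_row(word: Image) -> int:
    """Guess the headline row: the row with the most ink in the upper half."""
    upper = word[: max(1, len(word) // 2)]
    row_ink = [sum(row) for row in upper]
    return row_ink.index(max(row_ink))


def crop_columns(line: Image, start: int, end: int) -> Image:
    """Cut the columns [start, end) out of a line image."""
    return [row[start:end] for row in line]
```

A possible use is to call `word_boundaries` on each line image, crop each word with `crop_columns`, and pass the crop to `shirorekha_row` before splitting the word into components. Real scans would need noise handling and skew correction first, which this sketch leaves out.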