Punjabi University Patiala, India, Website http://www.punjabiuniversity.ac.in http://www.advancedcentrepunjabi.org http://www.punjabiuniversity.ac.in/sangam/ http://www.advancedcentrepunjabi.org/intro1.asp

Project Title: Development of Robust Document Image Analysis and Bilingual Recognition System for Printed Gurmukhi and Roman Scripts
Sponsored by: Department of Science & Technology
Grant: 52.90 Lakh
Duration: 2010 to 2013

Aim of Project

  1. To enhance the robustness and speed of the existing Gurmukhi OCR, incorporating feedback obtained through deployment.
  2. To develop the next version of the existing Gurmukhi OCR with the following capabilities:
    1. Bilingual (English and Gurmukhi) text
    2. Handling documents with complex layouts (tables, multiple columns, etc.)
    3. Processing multi-color pages
    4. Multiple fonts (about 25 fonts, based on those used by publication houses and newspapers)
    5. Italicized and bold fonts
    6. Common symbols and numerals
    7. Higher word-level recognition accuracy with appropriate language-specific post-processing schemes
  3. To prepare an annotated corpus of 5000 pages.

Scope of the Project

  1. We shall deploy the existing Gurmukhi OCR for, among other applications, the generation of Braille books for the visually challenged, to meet the current demand for Braille books in Indian languages. It will be deployed with the help of NGOs and government institutions. For deployment, the OCR will be packaged with an interface for Braille printing, with a human operator in the loop.
  2. Technology will be developed for dealing with documents that have complex layouts, e.g. newspapers and magazines. The current Gurmukhi OCR cannot deal with italicized and bold fonts, and it also fails for large point sizes. We shall remove these limitations and make the OCR more robust against font variations. Further, language-specific post-processing tools that exploit performance models of the OCR can improve the final word recognition rates.
  3. There are many scientific and research issues involved in developing technology for processing images of old books. Degradation in the quality of the pages and printed matter poses a different kind of challenge for historical documents. Old layouts and printing technology affect appearance in distinct ways, which makes standard OCR techniques difficult to apply. Further, for many languages, scripts and alphabets have evolved over time, making current recognition systems inadequate.
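
The post-processing idea in item 2 (using a performance model of the OCR to improve word recognition) can be illustrated with a minimal Python sketch. It is not the project's method: the confusion costs, the function names and the Roman-script lexicon are all hypothetical, chosen only to show how known character confusions can make a lexicon lookup cheaper for likely OCR errors than for arbitrary ones.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical confusion model: cost of the OCR reading `truth` as `seen`.
# Pairs not listed cost 1.0. The values are illustrative, not measured.
CONFUSION_COST: Dict[Tuple[str, str], float] = {
    ("l", "1"): 0.2,
    ("o", "0"): 0.2,
    ("e", "c"): 0.4,
}


def sub_cost(truth: str, seen: str) -> float:
    """Substitution cost, reduced for confusions the OCR is known to make."""
    if truth == seen:
        return 0.0
    return CONFUSION_COST.get((truth, seen), 1.0)


def weighted_distance(candidate: str, ocr_word: str) -> float:
    """Weighted edit distance from a lexicon candidate to the OCR output."""
    prev = [float(j) for j in range(len(ocr_word) + 1)]
    for i, c in enumerate(candidate, start=1):
        curr = [float(i)]
        for j, s in enumerate(ocr_word, start=1):
            curr.append(min(
                prev[j] + 1.0,                 # candidate character missing in OCR output
                curr[j - 1] + 1.0,             # extra character in OCR output
                prev[j - 1] + sub_cost(c, s),  # match or (possibly cheap) substitution
            ))
        prev = curr
    return prev[-1]


def correct(ocr_word: str, lexicon: Iterable[str], max_cost: float = 1.0) -> str:
    """Return the closest lexicon word, or the OCR word if nothing is close enough."""
    best = min(lexicon, key=lambda w: weighted_distance(w, ocr_word), default=ocr_word)
    if weighted_distance(best, ocr_word) <= max_cost:
        return best
    return ocr_word
```

For example, `correct("he11o", ["hello", "help", "cello"])` selects `hello`, because both errors are listed as cheap confusions, while a word far from every lexicon entry is returned unchanged. A real system would estimate the costs from OCR error statistics and would need a Gurmukhi lexicon.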

Intermediate Milestones

  • Year-I
    • Development of modules to recognize bold and italic Gurmukhi characters
    • Handling common symbols, numerals and punctuation marks in Gurmukhi
    • Enhancement of corpus of image collections
    • Development of script identification and handling of English and Gurmukhi text
  • Year-II
    • Development of language-based post-processing modules for Gurmukhi
    • Technology for dealing with degraded documents
    • Classification techniques for a larger set of fonts
    • Techniques for dealing with noisy OCR output
    • Web-based OCR service for Gurmukhi OCR
  • Year-III
    • 2nd generation Gurmukhi OCR
    • Web-based search engine
    • Mobile platform based OCR
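
As a small illustration of handling mixed English and Gurmukhi text, the sketch below tags each word of already recognized text by script, using the Unicode Gurmukhi block (U+0A00 to U+0A7F). This is an assumption-laden example, not the project's script identification, which would have to work on page images before recognition. The function names are hypothetical. A tagger like this could, for instance, decide which language's post-processing to apply to a word.

```python
from collections import Counter
from typing import List, Tuple

GURMUKHI_START, GURMUKHI_END = 0x0A00, 0x0A7F  # Unicode Gurmukhi block


def char_script(ch: str) -> str:
    """Classify one character as 'gurmukhi', 'roman' or 'other'."""
    if GURMUKHI_START <= ord(ch) <= GURMUKHI_END:
        return "gurmukhi"
    if "a" <= ch <= "z" or "A" <= ch <= "Z":
        return "roman"
    return "other"


def word_script(word: str) -> str:
    """Majority script of a word, ignoring digits, punctuation and the like."""
    counts = Counter(char_script(ch) for ch in word)
    counts.pop("other", None)
    if not counts:
        return "other"
    return counts.most_common(1)[0][0]


def tag_words(line: str) -> List[Tuple[str, str]]:
    """Split a line on whitespace and tag each word with its script."""
    return [(w, word_script(w)) for w in line.split()]
```

Calling `tag_words` on a line that mixes Gurmukhi words, Roman words and numbers yields one `(word, script)` pair per word, with numbers tagged `other`.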

Project Team

Project staff members for development:

Principal Investigator: Dr. Gurpreet Singh Lehal
Punjabi University, Patiala
Co-Investigator: Dr. Chandan Singh
Punjabi University, Patiala
Co-Investigator: Dr. Renu Dhir
NIT, Jalandhar
Co-Investigator: Ms. Rajneesh Rani
NIT, Jalandhar

© 2009 ACTDPL Punjabi University, Patiala