The objective of this project is to develop a robust OCR system for printed Gurmukhi script that delivers the performance needed to convert legacy printed documents into an electronically accessible format, according to the following specifications.

System Specification

The end-to-end OCR system will be developed with reference to the following functional specifications:

OCR for Gurmukhi Script will be developed with

    • Font- and point-size-independent recognition capability


    • Scanned pages of books published after 1950

    • 25 books published at different times over the last 50 years will be considered. Each book is expected to have about 200 pages on average, giving roughly 5,000 pages in total. These pages are expected to be representative of the page quality that the developed OCR system will be able to handle. They are also expected to contain representative examples of Gurmukhi script (fonts and sizes) and layout patterns (including graphics and image components). These pages will form the annotated corpus for development and testing.


    • XML/HTML representation of the pages with appropriate tags, so that layout and font information, along with graphics and image components, is retained to the maximum extent possible.
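As an illustration of the tagged output described above, the following is a minimal sketch of how one recognized page might be serialized to XML. The specification does not define a schema, so the function name `build_page_xml`, the tag names (`page`, `textblock`, `image`), and the attribute names are all assumptions made for this example only.

```python
import xml.etree.ElementTree as ET


def build_page_xml(page_number, regions):
    """Serialize the recognized regions of one page to an XML string.

    `regions` is a list of dicts. Tag and attribute names here are
    hypothetical; the specification does not prescribe a schema.
    """
    page = ET.Element("page", number=str(page_number))
    for region in regions:
        x, y, w, h = region["bbox"]
        # Bounding box keeps the layout position of each region.
        attrs = {"x": str(x), "y": str(y), "width": str(w), "height": str(h)}
        if region["kind"] == "text":
            block = ET.SubElement(page, "textblock", attrs)
            # Font size is stored so point-size information is retained.
            block.set("font-size", str(region["point_size"]))
            block.text = region["text"]
        else:
            # Non-text regions (graphics, images) are kept by reference.
            ET.SubElement(page, "image", attrs, src=region["src"])
    return ET.tostring(page, encoding="unicode")


if __name__ == "__main__":
    print(build_page_xml(1, [
        {"kind": "text", "bbox": (10, 20, 300, 40),
         "point_size": 12, "text": "ਪੰਜਾਬੀ"},
        {"kind": "image", "bbox": (10, 80, 200, 150),
         "src": "page1_fig1.png"},
    ]))
```

Storing a bounding box on every region is one simple way to keep text and image components in their original layout positions; a real schema would likely also carry font family and reading order.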


The developed system is expected to meet the following performance metrics:

  • Expected Accuracy

    • Page Segmentation: 99% (text and non-text separation)

    • Character-level recognition after post-processing

  • Documents are grouped into four classes (A, B, C, D) depending on the quality of the page, the quality of printing, etc. The OCR is expected to provide at least the following recognition accuracy for each class of documents:

    • Class A: 98%

    • Class B: 97%

    • Class C: 96%

    • Class D: 95%
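
The per-class targets above could be checked along the following lines. This is a sketch only: the specification does not state how character-level accuracy is measured, so the example assumes accuracy = 1 - (edit distance / reference length), counted over Unicode code points. The names `MIN_ACCURACY`, `edit_distance`, `character_accuracy`, and `meets_target` are hypothetical. For Gurmukhi, an evaluation might instead count whole aksharas or grapheme clusters, which would change the numbers.

```python
# Minimum character-level accuracy per document class, from the targets above.
MIN_ACCURACY = {"A": 0.98, "B": 0.97, "C": 0.96, "D": 0.95}


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(ref, hyp):
    """Assumed metric: 1 - edit_distance / len(ref), clamped at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def meets_target(doc_class, ref, hyp):
    """True if OCR output `hyp` reaches the minimum accuracy for its class."""
    return character_accuracy(ref, hyp) >= MIN_ACCURACY[doc_class]


if __name__ == "__main__":
    truth = "a" * 100
    ocr_output = "a" * 97 + "bbb"  # three substitution errors
    print(meets_target("A", truth, ocr_output))
    print(meets_target("D", truth, ocr_output))
```

In practice the accuracy would be aggregated over all annotated pages of a class rather than computed on a single string.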