The objective of this project is to develop a robust OCR system for printed Gurmukhi script that performs well enough to convert legacy printed documents into an electronically accessible format.
The end-to-end OCR system will be developed with reference to the following functional specifications:
The OCR for Gurmukhi script will be developed to support:
Font- and point-size-independent recognition
Scanned pages of books published after 1950
Twenty-five books published at different times over the last 50 years will be considered. Each book is expected to have 200 pages on average, giving a corpus of roughly 5,000 pages. These pages are expected to be representative of the page quality that the developed OCR system will be able to handle. They are also expected to contain representative examples of Gurmukhi script (fonts and sizes) and layout patterns (including graphics and image components). These pages will form the annotated corpus for development and testing.
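The corpus described above can be sized and divided with a few lines of code. The following is a minimal sketch only: the specification does not say how the corpus is divided between development and testing, so the book-level split, the 20% test fraction, the fixed seed, and the function name split_books are all assumptions made for illustration. Splitting by book rather than by page keeps pages with the same font and print quality from appearing on both sides.

```python
import random

# Figures taken from the specification.
NUM_BOOKS = 25
AVG_PAGES_PER_BOOK = 200
EXPECTED_PAGES = NUM_BOOKS * AVG_PAGES_PER_BOOK  # rough corpus size in pages


def split_books(book_ids, test_fraction=0.2, seed=0):
    """Split whole books into development and test sets.

    ASSUMPTION: the 20% test fraction and book-level split are
    illustrative; the specification does not define a split policy.
    """
    ids = sorted(book_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for repeatability
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]


dev_books, test_books = split_books(range(1, NUM_BOOKS + 1))
print(EXPECTED_PAGES, len(dev_books), len(test_books))
```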
The developed system is expected to meet the following performance metrics:
Documents will be grouped into four classes (A, B, C, D) depending on the quality of the page, the quality of printing, etc. The OCR is expected to achieve at least the following recognition accuracy for each class of documents:
Class A: 98%
Class B: 97%
Class C: 96%
Class D: 95%
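The class targets above can be checked against ground-truth transcriptions from the annotated corpus. The following is a minimal sketch under stated assumptions: the specification does not say how recognition accuracy is measured, so this example assumes character-level accuracy, computed as one minus the Levenshtein edit distance divided by the length of the reference text. The names MIN_ACCURACY, edit_distance, char_accuracy, and meets_target are hypothetical. Python strings are compared per Unicode code point, so a Gurmukhi consonant and its dependent vowel sign count as separate characters here; a project may prefer a different unit.

```python
# Minimum recognition accuracy per document class, from the specification.
MIN_ACCURACY = {"A": 0.98, "B": 0.97, "C": 0.96, "D": 0.95}


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(ref, hyp):
    """ASSUMED metric: 1 - edit_distance / reference length, floored at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def meets_target(doc_class, ref, hyp):
    """True if the OCR output reaches the minimum accuracy for its class."""
    return char_accuracy(ref, hyp) >= MIN_ACCURACY[doc_class]
```

For example, an OCR output with one wrong character in a 100-character Class A page scores 0.99 under this metric and passes, while five wrong characters would score 0.95 and fall short of the 98% target.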