Gurmukhi OCR :: Robust Document Analysis and Recognition System

Punjabi University, Patiala, India. Website: http://www.universitypunjabi.org

http://www.universitypunjabi.org/sangam/

http://www.advancedcentrepunjabi.org/intro1.asp



The objective of this project is to develop a robust OCR system for printed Gurmukhi script that delivers the performance needed to convert legacy printed documents into an electronically accessible format, according to the following specifications.

System Specification

The end-to-end OCR system will be developed with reference to the following functional specifications:

OCR for Gurmukhi script will be developed with font- and point-size-independent recognition capability.

INPUT

Scanned pages of books published after 1950
25 books published at different times over the last 50 years will be considered. Each book is expected to have 200 pages on average, giving a corpus of roughly 5,000 pages. These pages are expected to be representative of the page quality that the developed OCR system will be able to handle. They are also expected to contain representative examples of Gurmukhi script (fonts and sizes) and of layout patterns (including graphics and image components). These pages will form the annotated corpus for development and testing.

OUTPUT

XML/HTML representation of the pages with appropriate tags, so that layout and font information, along with graphics and image components, can be retained to the maximum possible extent.
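
To illustrate what such a tagged output might look like, the following is a minimal sketch in Python. The project's actual schema is not given here, so the element and attribute names (page, block, type, font, pointSize, x, y, width, height) and the function build_page_xml are hypothetical and chosen only for illustration. Each block carries its position on the page, and text blocks also carry font information, so that layout can be rebuilt from the output.

```python
import xml.etree.ElementTree as ET


def build_page_xml(page_number, blocks):
    """Serialize recognized page blocks to an XML string.

    The schema used here is hypothetical, not the project's own format.
    Each block is a dict with "type" ("text" or "image") and "bbox"
    (x, y, width, height); text blocks may also carry "text", "font"
    and "point_size".
    """
    page = ET.Element("page", {"number": str(page_number)})
    for block in blocks:
        x, y, width, height = block["bbox"]
        element = ET.SubElement(page, "block", {
            "type": block["type"],
            "x": str(x),
            "y": str(y),
            "width": str(width),
            "height": str(height),
        })
        if block["type"] == "text":
            # Keep font details so the original appearance can be approximated.
            element.set("font", block.get("font", "unknown"))
            element.set("pointSize", str(block.get("point_size", "unknown")))
            element.text = block.get("text", "")
    # encoding="unicode" returns a str and keeps Gurmukhi characters as-is.
    return ET.tostring(page, encoding="unicode")


if __name__ == "__main__":
    sample_blocks = [
        {"type": "text", "bbox": (10, 20, 300, 40),
         "font": "ExampleFont", "point_size": 12, "text": "ਪੰਜਾਬੀ"},
        {"type": "image", "bbox": (10, 80, 200, 150)},
    ]
    print(build_page_xml(1, sample_blocks))
```

Image and graphics regions are kept as blocks with position only, which is one simple way to retain non-text components alongside the recognized text.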

Performance

The developed system is expected to meet the following performance metrics:

Expected Accuracy

Page Segmentation: 99% (text and non-text separation)
Character-level recognition after post-processing

Documents are grouped into four classes (A, B, C, D) depending on the quality of the page, the quality of printing, etc. The OCR is expected to provide at least the following recognition accuracy for each class of documents:

Class A : 98%
Class B : 97%
Class C : 96%
Class D : 95%
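
The page does not state how character-level accuracy is measured. A common convention, assumed here, is one minus the edit distance between the OCR output and the ground-truth text, divided by the length of the ground truth. The Python sketch below applies that assumption to check a result against the class targets listed above. The names MIN_ACCURACY, edit_distance, character_accuracy and meets_target are hypothetical, and the comparison works on Unicode code points, so a Gurmukhi base letter and its attached vowel sign count as separate characters.

```python
# Minimum character-level accuracy per document class, from the list above.
MIN_ACCURACY = {"A": 0.98, "B": 0.97, "C": 0.96, "D": 0.95}


def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """Accuracy as 1 - edit_distance / len(reference), floored at 0.

    This definition is an assumption; the project may measure accuracy
    differently.
    """
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


def meets_target(document_class, reference, hypothesis):
    """True if the accuracy reaches the minimum for the given class."""
    return character_accuracy(reference, hypothesis) >= MIN_ACCURACY[document_class]


if __name__ == "__main__":
    truth = "a" * 100
    ocr_output = "b" * 3 + "a" * 97  # three substitution errors
    print(character_accuracy(truth, ocr_output))
    print(meets_target("D", truth, ocr_output))
```

In practice the accuracy would be aggregated over all pages of a class in the annotated corpus rather than computed on a single string.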