Research Centre for Punjabi Language Technology,
Punjabi University, Patiala
- Breaking the Script barrier
GURMUKHI-SHAHMUKHI Transliteration System
- First Gurmukhi-OCR System with 97% accuracy
- First Punjabi Grammar Checker System
- First Gurmukhi Unicode Typing Pad
- Breaking the Language barrier
HINDI-PUNJABI Machine Translation System
- First Customized Punjabi Search Engine
- First Gurmukhi Text Summarization System
- First Intelligent Bilingual (Legacy to Unicode) Font Converter
|ਪੰਜਾਬੀ ਭਾਸ਼ਾ ਤਕਨਾਲੋਜੀ ਦਾ ਖੋਜ ਕੇਂਦਰ, ਪੰਜਾਬੀ ਯੂਨੀਵਰਸਿਟੀ, ਪਟਿਆਲਾ |
|The Indian subcontinent is one of those unique parts of the world where a single language may be written in different scripts. Thus far, with the aid of Pan Asia ICT and ISF grants, we have resolved the communication issues of 500 million inhabitants of South Asia through the development of Urdu/Hindi translation and Punjabi transliteration. A similar problem exists for the Sindhi language, which is written in a Perso-Arabic script in Pakistan and in Devanagari in India, while a growing number of Sindhis (2 million) in the EU and US use Roman script. Sindhi as spoken in India and Pakistan is mutually comprehensible, but in its written form it is not. We aim to provide a tool that will help Sindhi people to link across a hostile geographical divide. In so doing we will provide an ICT solution to a social problem that had seemed insurmountable for centuries. |
The aim of the project is to facilitate electronic and written communication between Sindhi people living in India and Pakistan through the development of a bi-directional, web-based Sindhi Language Transliteration Tool. The target groups will be media organizations (such as magazines and newspapers), literary and literacy promotional organizations, writers and NGOs involved in dissemination activities amongst the urban and rural poor, virtual Sindhi-speaking communities, schools, and colleges. This project will develop a complete machine transliteration system for Sindhi scripts and facilitate the use of these technologies on the web, thus enhancing networking between India and Pakistan.
|The proposed project is a multidisciplinary project involving Cognitive Science, Computer Science and Linguistics. The objective of the project is to develop a text-to-speech (TTS) synthesis system for the Punjabi language as an aid to persons with cognitive disabilities such as dyslexia, visual comprehension difficulties and other learning disabilities. This TTS system will be used as an add-on tool embedded in web browsers, enabling the browser to read a website aloud in Punjabi. With more and more electronic data becoming available online, software that includes this TTS system as an add-on will be helpful for information dissemination: a user who cannot read Punjabi but can understand it will be able to get the information contained in a document or web page by listening to it. This type of assistive technology can be particularly helpful to individuals with cognitive disabilities, visually impaired persons and elderly people who find it difficult to read from a computer screen.|
|The aim of the project is to enhance the existing Gurmukhi OCR in terms of robustness and speed, and to support bilingual (English and Gurmukhi) text. In addition, the next version of the Gurmukhi OCR will have the following capabilities: |
- Handling documents with complex layouts (tables, multiple columns, etc.)
- Processing Multi-color pages
- Multiple fonts (about 25 fonts based upon publication house and newspapers)
- Italicized and bold fonts, and common symbols and numerals
- Higher word level recognition accuracy with appropriate language specific post-processing schemes
|We shall also be developing the first-generation OCR for the Urdu script. From an OCR point of view, Urdu is one of the most challenging scripts, as character and word shapes change according to context and the characters are usually joined together. An Urdu word grows in both the horizontal and the vertical direction. An Urdu word is a combination of ligatures (characters which join together) and isolated characters. The concept of a space as a word boundary marker is not present in Urdu writing, which makes word segmentation a challenging task. Urdu font developers have estimated that there are around 18,000 ligatures in Urdu, which makes ligature classification a tough job.|
|A grammar checker for a language is a system that detects grammatical errors in a given text based on the grammar of that language, and reports those errors to the user along with a list of suggestions for correcting them. The input text is first given to a preprocessor, which breaks it into sentences and words. The tokenized text is then passed to a morphological analyzer, which provides grammatical information for each word. Next, a POS tagger performs part-of-speech tagging, and the POS-tagged text is passed to a phrase chunker, which marks phrase and clause boundaries. In the last stage, syntax and agreement checks are performed using the POS tag information, first at the phrase level and then at the clause level. Any discrepancy found is reported to the user along with suggested corrections and detailed error information.|
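The stages described above can be sketched as a small pipeline. This is only an illustrative sketch, not the Centre's implementation: the function names, the toy English lexicon and the single subject-verb number agreement rule are all assumptions chosen to keep the example short and readable.

```python
import re

# Hypothetical toy lexicon (English, for readability): word -> (POS, number).
LEXICON = {
    "the": ("DET", None),
    "dog": ("NOUN", "sg"),
    "dogs": ("NOUN", "pl"),
    "barks": ("VERB", "sg"),
    "bark": ("VERB", "pl"),
}

def preprocess(text):
    """Stage 1: split text into sentences, then each sentence into words."""
    sentences = [s for s in re.split(r"[.!?।]\s*", text) if s.strip()]
    return [s.split() for s in sentences]

def analyze(word):
    """Stage 2: morphological analysis via a simple lexicon lookup."""
    pos, number = LEXICON.get(word.lower(), ("UNK", None))
    return {"word": word, "pos": pos, "number": number}

def tag(sentence):
    """Stage 3: POS tagging; here just the lexicon's POS, no disambiguation."""
    return [analyze(word) for word in sentence]

def chunk(tagged):
    """Stage 4: group DET/NOUN runs into NP chunks and verbs into VP chunks."""
    chunks = []
    for tok in tagged:
        if tok["pos"] in ("DET", "NOUN"):
            label = "NP"
        elif tok["pos"] == "VERB":
            label = "VP"
        else:
            label = "O"
        if chunks and chunks[-1][0] == label:
            chunks[-1][1].append(tok)
        else:
            chunks.append((label, [tok]))
    return chunks

def check_agreement(chunks):
    """Stage 5: flag an NP whose head noun disagrees in number with the next VP."""
    errors = []
    for (label1, toks1), (label2, toks2) in zip(chunks, chunks[1:]):
        if label1 == "NP" and label2 == "VP":
            noun = next((t for t in reversed(toks1) if t["pos"] == "NOUN"), None)
            verb = toks2[0]
            if noun and noun["number"] != verb["number"]:
                errors.append(
                    f"'{noun['word']}' ({noun['number']}) does not agree "
                    f"with '{verb['word']}' ({verb['number']})"
                )
    return errors

def check(text):
    """Run every stage in order and collect all reported errors."""
    return [err for sent in preprocess(text)
            for err in check_agreement(chunk(tag(sent)))]
```

Calling `check("The dogs barks. The dog barks.")` reports one agreement error for the first sentence and none for the second. A real checker for Punjabi would need a full morphological lexicon, contextual tag disambiguation and many more agreement rules.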
| Hindi and Urdu are mutually comprehensible languages written in mutually incomprehensible scripts and spoken by more than 600 million people in India and Pakistan. Over time, with the influence of Persian on Urdu and Sanskrit on Hindi, the vocabularies of the two languages have diverged, though they still share more than 70% of their words, and their grammar remains the same. This project is the culmination of twelve years of academic research and literacy development in the UK and of language development in Pakistan and India. The aim of the partnership is to facilitate electronic and written communication between people living in India and Pakistan through the development of a bi-directional, web-based Hindi-Urdu Language Transliteration/Translation Tool. The target groups will be media organisations (such as magazines and newspapers), literary and literacy promotional organizations, writers and NGOs involved in dissemination activity amongst the urban and rural poor, virtual Hindi-Urdu speaking communities, schools and colleges. The intellectual background to this work has already been completed via a grant from the EU Asia-ITC programme. Punjabi University, Patiala has also developed software for Gurmukhi to Shahmukhi (Urdu) transliteration and the reverse, and is currently working on an Urdu-Hindi transliteration tool through a funded research project. This project will develop the complementary Hindi to Urdu transliteration tool as well as a complete machine translation system between Hindi and Urdu, and facilitate the use of these technologies on the web, thus enhancing networking between India and Pakistan.|
|Optical Character Recognition (OCR) is the process of converting printed materials into text or word-processing files that can be easily edited and stored. When we scan a sheet of paper, we convert it from a tangible object into a digital one, which we save as an image. The image can be manipulated as a whole, but its text cannot be manipulated separately. To do so, we need the computer to recognize the text as such and let us manipulate it as if it were text in a word-processing document. An OCR application does that: it recognizes the characters and makes the text editable and searchable. The technology also allows such materials to be stored in much less space than the hard copies require. OCR technology has had a huge impact on the way information is stored, shared and edited. Character accuracy, the most important aspect of text recognition, varies widely with the quality and nature of the image (type and size of font, presence of special characters, complex layouts, and non-Roman characters), its scanning resolution and the OCR software itself. The better the image quality and the higher the resolution, the higher the accuracy. Accuracy is expressed as a percentage: 98% accuracy means that there are two errors in every 100 characters. Depending on the factors mentioned, OCR accuracy might range between 80% and 99%.|
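The character accuracy figure described above can be computed by comparing OCR output against a hand-corrected reference text. The sketch below assumes that errors are counted as the Levenshtein edit distance (insertions, deletions and substitutions) between the two strings; the page does not say how its own figures were measured, and the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def character_accuracy(reference, ocr_output):
    """Percentage of reference characters that were recognized correctly."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 100.0 * (len(reference) - errors) / len(reference))
```

For a 100-character reference in which the OCR output has two wrong characters, `character_accuracy` returns 98.0. Note that Python strings count Unicode code points, so for Gurmukhi text a single visible syllable may contribute several characters to the total.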
|This project is the culmination of ten years of academic research and literacy development in the UK and of language development in Pakistan and India. The aim of the partnership is to facilitate electronic and written communication between people living in and originating from East (Indian) and West (Pakistani) Punjab through the development of a Punjabi Language Transliteration Tool. The target groups will be media organisations (such as magazines and newspapers), literary and literacy promotional organizations, writers and NGOs involved in dissemination activity amongst the urban and rural poor, the virtual Punjabi community, schools and colleges. The intellectual background to this work has already been completed via a grant from the EU Asia-ITC programme.|
|Punjabikhoj uses the Google database to search for Gurmukhi words in Shahmukhi, Devanagari and Gurmukhi Unicode-based web sites. A user interface is provided for easy Punjabi typing. An intelligent facility for words of similar meaning is also integrated, for example: ਭਾਰਤ OR ਇੰਡੀਆ OR ਹਿੰਦੋਸਤਾਨ. An advanced search facility using fuzzy search is provided to handle multiple spelling variations.|
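The similar-meaning expansion shown above can be sketched as a simple query rewrite. This is a minimal illustration, not Punjabikhoj's actual code: the `SYNONYMS` table holds only the one example given on this page, and the function name and the parenthesized grouping for multi-word queries are assumptions.

```python
# Hypothetical synonym table; the single entry is the example from the page.
SYNONYMS = {
    "ਭਾਰਤ": ["ਇੰਡੀਆ", "ਹਿੰਦੋਸਤਾਨ"],
}

def expand_query(query, synonyms=SYNONYMS):
    """Replace each query word with an OR-group of the word and its synonyms."""
    words = query.split()
    parts = []
    for word in words:
        alternatives = [word] + synonyms.get(word, [])
        term = " OR ".join(alternatives)
        # Group with parentheses only when other query words are present.
        if len(alternatives) > 1 and len(words) > 1:
            term = f"({term})"
        parts.append(term)
    return " ".join(parts)
```

With the table above, `expand_query("ਭਾਰਤ")` produces the OR query quoted in the description; the expanded string would then be handed to the underlying search engine.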
|Before Unicode was invented, no single encoding could contain enough characters even for a single language, and the presence of non-standard fonts complicated the situation further. Based on the Unicode Standard for Gurmukhi, Punjabi University, Patiala has recently developed a Gurmukhi Unicode typing pad that provides:|
- A unique solution to the problems of Gurmukhi Fonts
- Conversion of existing Gurmukhi text in common fonts into Unicode with a single click
- A user-friendly interface for Unicode typing in Phonetic, Remington and Romanized styles
- The freedom to email the Unicode text in Gurmukhi as well as in Shahmukhi script on the fly.
|The online Punjabi-English dictionary is very useful for Punjabi language learners, researchers and translation tasks. The dictionary contains more than 35,000 words, compiled in both Gurmukhi and Shahmukhi scripts. Each entry has sound and part-of-speech tagging, and flexible fuzzy search is available in Gurmukhi, Shahmukhi or English.|
|The site offers verbal help and a guide to pronunciation, with variations in the pronunciation of words that sound alike but have several different meanings.|
- The website also includes a pictorial vocabulary of more than 3,000 words along with their pronunciation that are organised into 80 related topics such as animals, birds, colours, fruits and the days of the week.
- Games such as crossword puzzles, hangman, recognising a word from its pronunciation, and arranging letters in the correct sequence are part of this website.
- Rhymes and animated stories that make learning easy and interesting are presented along with the text in Gurmukhi script and a corresponding English translation.
- A set of talking stories, in which the user can click on any word or sentence of a story to get its meaning and pronunciation, is another highlight of the website.
- Any Gurmukhi text in Unicode can be converted into Shahmukhi with correct spellings at more than 98% accuracy.
- Using real-time web page transliteration support, a Shahmukhi reader can read Gurmukhi web pages in Shahmukhi with just a single click.
- Multiple input support: the input text can be in Unicode, and two popular Gurmukhi fonts, Satluj and Anmol Lipi, are directly supported.
- User Interface is provided for easy Punjabi typing.
- The existence of two scripts for Punjabi has created a script barrier between the Punjabi literature written in India and that written in Pakistan. This is a very useful tool for Punjabi people for exchanging Punjabi literature between India and Pakistan.
|The Janam Sakhis have great historical importance, and their online availability in Shahmukhi will greatly benefit Urdu-speaking researchers and thousands of followers of Guru Nanak Dev Ji in Pakistan, Afghanistan, Iran, Iraq and other Muslim countries. This is the first time that the Janam Sakhis have been transliterated into Shahmukhi; the Sangam software developed at the Centre was used to transliterate the original text from Gurmukhi script to Shahmukhi. The Janam Sakhis are simultaneously available in Gurmukhi, Shahmukhi, Devanagari and Roman scripts.|
|Punjabi text in Gurmukhi script can be translated into Hindi with a single mouse click. The translation accuracy of this system is more than 90% at the word level.|
- Online Web page Translation support is provided to translate any Unicode based Gurmukhi website into Hindi
- An on-screen keyboard is provided for easy typing in Gurmukhi, and Unicode files are also supported as another way of entering Gurmukhi text
- Email facility: users can send or share translated data using the built-in interface for sending emails
- A Hindi to Punjabi machine translation system has also been developed and added to the list of online resources. The translation accuracy of this system is more than 90% at the word level.
- As with many of the other resources, online web page translation support is provided to translate any Unicode-based Hindi website into Punjabi
- Similarly, users can share translated data in Hindi as well as in Punjabi text using the built-in interface for sending emails
- An on-screen keyboard is provided for easy typing in Hindi; alternative input methods include Roman, Krutidev or AnmolHindi, and a bilingual text typing facility
|The Punjabi Morphological Analyser and Generator was developed by Mandeep Singh, a Ph.D. student, under the guidance of Dr. Gurpreet Singh Lehal, and a copy of the software was bought by C-DAC, Pune for Rs. One Lakh. The software was released in 2007 by Thiru Dayanidhi Maran, Hon'ble Union Minister of Communications and Information Technology, for mass distribution on the Punjabi CD launched by the Ministry of Communications and Information Technology.|
|In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up the words in a text (corpus) as corresponding to a particular part of speech, based on both their definition and their context, i.e. their relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.|
- For example, consider the following sentence: ਕੰਪਿਊਟਰ ਸਾਡੀ ਜ਼ਿੰਦਗੀ ਦਾ ਬਹੁਤ ਹੀ ਅਹਿਮ ਅੰਗ ਬਣ ਗਿਆ ਹੈ। The output of our POS tagger is:
ਕੰਪਿਊਟਰ_NNMXD ਸਾਡੀ_AJIFSO ਜ਼ਿੰਦਗੀ_NNFSO ਦਾ_PPIDAMSD ਬਹੁਤ_AJU ਹੀ_PTUE ਅਹਿਮ_AJU ਅੰਗ_NNMXD ਬਣ_VBMAXSS3XINO ਗਿਆ_VBOPMSXXPINIA ਹੈ_VBAXBST1
- Phrase chunking is a natural language processing task that segments a sentence into its subconstituents, such as noun, verb, and prepositional phrases.
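Tagger output in the `word_TAG` form shown above is easy to consume programmatically. The following is a minimal sketch, assuming only that tokens are separated by whitespace and that the tag follows the last underscore of each token; the function name is illustrative and the tag names are treated as opaque strings.

```python
def parse_tagged(line):
    """Split 'word_TAG word_TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST underscore, so a word that itself
        # contains an underscore keeps it.
        word, sep, tag = token.rpartition("_")
        if not sep:  # no underscore found: treat the token as untagged
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs
```

For example, `parse_tagged("ਕੰਪਿਊਟਰ_NNMXD ਹੈ_VBAXBST1")` yields two pairs whose tags are `NNMXD` and `VBAXBST1`.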
|GTrans v1.0 transliterates a Unicode text file in Gurmukhi script into Roman script in a way that follows the phonetics of the Punjabi language. The transliteration scheme is mainly based on the ISO 15919 international standard. A unique feature of GTrans is its rule-based schwa deletion algorithm.|
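To illustrate what schwa deletion means in romanization, here is a deliberately simplified sketch. It is not the GTrans algorithm: it covers only a handful of letters chosen for illustration, and it applies a single assumed rule, namely dropping the inherent vowel "a" after a word-final consonant. Real rule sets also handle word-medial schwas and many more characters.

```python
# Partial, illustrative tables: base consonant sounds without the inherent vowel.
CONSONANTS = {"ਕ": "k", "ਮ": "m", "ਲ": "l", "ਰ": "r", "ਨ": "n"}
# Dependent vowel signs, written as escapes because they are combining marks.
VOWEL_SIGNS = {
    "\u0A3E": "\u0101",  # GURMUKHI VOWEL SIGN AA -> a with macron
    "\u0A3F": "i",       # GURMUKHI VOWEL SIGN I
    "\u0A40": "\u012B",  # GURMUKHI VOWEL SIGN II -> i with macron
}

def transliterate_word(word):
    """Romanize one word, deleting the inherent 'a' (schwa) word-finally."""
    out = []
    for i, ch in enumerate(word):
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = word[i + 1] if i + 1 < len(word) else None
            if nxt in VOWEL_SIGNS:
                continue  # an explicit vowel sign replaces the inherent vowel
            if nxt is None and i > 0:
                continue  # simplified rule: drop the word-final schwa
            out.append("a")  # otherwise keep the inherent vowel
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        else:
            out.append(ch)  # pass unknown characters through unchanged
    return "".join(out)

def transliterate(text):
    """Romanize whitespace-separated words one at a time."""
    return " ".join(transliterate_word(w) for w in text.split())
```

Under these assumptions, `transliterate("ਕਮਲ")` gives `kamal` rather than `kamala`, because the inherent vowel after the final consonant is dropped.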
|© 2013 Research Centre for Punjabi Language Technology,|
Punjabi University, Patiala (Punjab) INDIA 147 002.