Tokenization and its challenges in Sindhi language

International Journal of Computer Science and Emerging Technologies

View Publication Info
Field Value
Title Tokenization and its challenges in Sindhi language
Creator Farooqui, Saira; Shaikh, Noor Ahmed; Rajper, Saima
Subject Sindhi, Tokenization, Tokens, SVR, POS, J. Mahar model
Description Natural language processing (NLP) is a branch of artificial intelligence (AI) comprising computational techniques for the analysis and synthesis of natural language, together with their applications. Natural language understanding is the ability to comprehend spoken language. Sindhi has polymorphic characteristics. It is an old and complex language, and its semantic features make tokenization a difficult task. Tokenization, also called word segmentation, splits text into words or script units (numbers, letters). This research discusses the issues of tokenization. Many languages, such as Urdu, Sindhi, and Arabic, suffer from space insertion and space omission errors. It is therefore important to evaluate different corpora with different algorithms; in this research we develop and apply the J. Mahar (JM) model to a corpus. Tested on 175,000 words (one lakh seventy-five thousand) of Sindhi text, the JM tokenizer achieves 96% accuracy.
Publisher Shah Abdul Latif University, Khairpur
Date 2019-09-02
Type info:eu-repo/semantics/article
Format application/pdf
Source International Journal of Computer Science and Emerging Technologies ; Vol 1 No 1 (2017): IJCET Vol 1 Issue 1 Dec 2017; 53-56
Language eng
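
The two error types named in the description, space omission and space insertion, can be illustrated with a minimal sketch. This is not the J. Mahar model, whose details are not given in this record; it is a generic lexicon-based approach shown only to make the problem concrete. The lexicon, the function names, and the English placeholder words (standing in for Sindhi words) are all assumptions made for illustration.

```python
# Toy lexicon: English placeholders standing in for Sindhi words (assumption).
LEXICON = {"good", "morning", "tokenizer"}


def split_omitted_spaces(chunk, lexicon):
    """Greedy longest-match split of a chunk whose internal spaces were omitted."""
    max_len = max(len(word) for word in lexicon)
    tokens, i = [], 0
    while i < len(chunk):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(chunk), i + max_len), i, -1):
            if chunk[i:j] in lexicon:
                tokens.append(chunk[i:j])
                i = j
                break
        else:
            # No lexicon word starts here: emit one character as unknown.
            tokens.append(chunk[i])
            i += 1
    return tokens


def merge_inserted_spaces(pieces, lexicon):
    """Join two adjacent pieces when only their concatenation is a known word."""
    merged, i = [], 0
    while i < len(pieces):
        if i + 1 < len(pieces):
            joined = pieces[i] + pieces[i + 1]
            both_known = pieces[i] in lexicon and pieces[i + 1] in lexicon
            if joined in lexicon and not both_known:
                merged.append(joined)
                i += 2
                continue
        merged.append(pieces[i])
        i += 1
    return merged


def tokenize(text, lexicon):
    """Whitespace split, then repair space insertion, then space omission."""
    tokens = []
    for piece in merge_inserted_spaces(text.split(), lexicon):
        if piece in lexicon:
            tokens.append(piece)
        else:
            tokens.extend(split_omitted_spaces(piece, lexicon))
    return tokens


result = tokenize("goodmorning token izer", LEXICON)
```

Here `goodmorning` models a space omission and `token izer` a space insertion. A real Sindhi tokenizer would also need to handle the script's joining behaviour and ambiguous segmentations, which this sketch does not attempt.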
