Developing OCR for ancient scripts like Tamizhi (Tamil-Brahmi) and for Kurdish historical texts is uniquely challenging due to character complexity, noise in source materials, and the lack of specialized datasets. Recent research using models such as LSTM and CNN, along with fine-tuned Tesseract systems, shows promising results, with Tamizhi OCR achieving over 91% accuracy. While no OCR system built specifically for historical Kurdish documents has been reported yet, leveraging pre-trained Arabic models offers a practical pathway. These findings highlight the importance of tailored datasets, advanced machine learning techniques, and ongoing research in preserving and digitizing historical documents.

Building OCR Systems for Tamizhi and Kurdish Historical Documents

Abstract and 1. Introduction

1.1 Printing Press in Iraq and Iraqi Kurdistan

1.2 Challenges in Historical Documents

1.3 Kurdish Language

  2. Related work and 2.1 Arabic/Persian

    2.2 Chinese/Japanese and 2.3 Coptic

    2.4 Greek

    2.5 Latin

    2.6 Tamizhi

  3. Method and 3.1 Data Collection

    3.2 Data Preparation and 3.3 Preprocessing

    3.4 Environment Setup, 3.5 Dataset Preparation, and 3.6 Evaluation

  4. Experiments, Results, and Discussion and 4.1 Processed Data

    4.2 Dataset and 4.3 Experiments

    4.4 Results and Evaluation

    4.5 Discussion

  5. Conclusion

    5.1 Challenges and Limitations

    Online Resources, Acknowledgments, and References

2.6 Tamizhi

According to Munivel and Enigo (2022), digitizing ancient documents typically involves OCR. OCR for Tamizhi documents, however, poses significant challenges because many characters closely resemble one another in shape and structure and differ only in subtle ways. The Tamizhi script, also known as Tamil-Brahmi, is the precursor to numerous modern Indian scripts and is recognized as one of the oldest scripts in India. The difficulty is compounded by the abundance of combined characters: a character can consist of a single vowel, a single consonant, or a combination of both. The authors describe an OCR system designed specifically for printed Tamizhi documents, intended to perform effectively despite poor document quality, noise, and diverse input formats. They report that their Tamizhi OCR achieves an accuracy of 91.12 percent on printed text, a promising result for recognizing Tamizhi characters.

To summarize, as of the time we publish this research, the literature reports no effort to develop OCR specifically for historical Kurdish documents. Nor is there currently an accessible dataset for training OCR systems designed to extract text from such documents. This significantly restricts our options when selecting the most suitable approach for our study.

To develop OCR systems tailored to historical documents, researchers have employed different techniques and strategies such as SVM, LSTM, and CNN. The reported results vary, reaching a maximum of 99.7% CLA, and this variability can be attributed to several factors: the quality of the dataset used, the methodology followed in developing the OCR system, and the intrinsic complexity of the documents being processed.
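The accuracy figures quoted in this section are easier to compare once a concrete metric is fixed. The source does not define how the cited accuracies were computed, so the following is only a minimal sketch of one common choice, assuming a character-level accuracy derived from the Levenshtein edit distance between a ground-truth line and the OCR output. The function names `edit_distance` and `character_accuracy` are illustrative and do not come from the reviewed papers.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning reference into hypothesis."""
    # previous[j] holds the distance between the processed prefix of
    # reference and the first j characters of hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed metric: 1 - CER, where CER = edit distance / reference length.
    Clamped at 0 because CER can exceed 1 for very noisy output."""
    if not reference:
        raise ValueError("reference text must not be empty")
    cer = edit_distance(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)


if __name__ == "__main__":
    # One wrong character out of four in the ground truth.
    print(character_accuracy("abcd", "abxd"))
```

Python strings are compared here code point by code point, so for scripts with combining marks the result depends on how the text is normalized before scoring.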

The studies reviewed in this chapter used both proprietary datasets created by the researchers themselves and publicly available datasets, including TWDB, HWDB, GT4HistOCR, Stockholm Archive, Dunhuang data, Tripitaka, TKH, MTH, and Kana-PRMU. According to the literature in this field, there are ongoing efforts to improve OCR techniques for different kinds of historical documents.

Based on our review, we identified LSTM as a widely adopted approach for developing OCR systems with acceptable accuracy. We therefore used the latest version of Tesseract, which integrates LSTM functionality, aiming for the best performance in our project. We also found that pre-trained models are available for fine-tuning on our dataset. Recognizing the similarities between the Kurdish and Arabic scripts, we decided to use an Arabic pre-trained model as our base model.
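As a rough illustration of what fine-tuning from an Arabic base model can look like in practice, the sketch below assembles a command for the `tesstrain` Makefile published by the Tesseract project, using its `MODEL_NAME`, `START_MODEL`, `TESSDATA`, `GROUND_TRUTH_DIR`, and `MAX_ITERATIONS` variables. This section does not state which training tooling was used, so this is an assumption and not a description of the authors' setup. The model name `kurdish_hist`, the directory paths, and the iteration count are hypothetical placeholders. The function only builds the command and does not run it.

```python
from typing import List


def build_finetune_command(
    model_name: str,
    tessdata_dir: str,
    ground_truth_dir: str,
    start_model: str = "ara",      # Arabic traineddata as the base model
    max_iterations: int = 10000,   # hypothetical value, tune for the dataset
) -> List[str]:
    """Assemble a `make training` invocation for the tesstrain Makefile.

    All argument values are illustrative; tessdata_dir is assumed to
    contain the `<start_model>.traineddata` file used as the starting point.
    """
    return [
        "make",
        "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata_dir}",
        f"GROUND_TRUTH_DIR={ground_truth_dir}",
        f"MAX_ITERATIONS={max_iterations}",
    ]


if __name__ == "__main__":
    command = build_finetune_command(
        model_name="kurdish_hist",                        # hypothetical name
        tessdata_dir="data/tessdata_best",                # hypothetical path
        ground_truth_dir="data/kurdish_hist-ground-truth" # hypothetical path
    )
    # Printed for inspection; it could be passed to subprocess.run from
    # inside a tesstrain checkout.
    print(" ".join(command))
```

Starting from an existing model lets training reuse what the network has already learned about a related script, which matters when, as noted above, no dedicated dataset for historical Kurdish documents is available.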

:::info Authors:

(1) Blnd Yaseen, University of Kurdistan Hewlêr, Kurdistan Region - Iraq (blnd.yaseen@ukh.edu.krd);

(2) Hossein Hassani, University of Kurdistan Hewlêr, Kurdistan Region - Iraq (hosseinh@ukh.edu.krd).

:::


:::info This paper is available on arXiv under the Attribution-NonCommercial-NoDerivs 4.0 International license.

:::
