Plagiarism Detection Tool for Khmer Language

Abstract

Plagiarism in Khmer-language writing remains a critical challenge in Cambodia: digitized texts are scarce, and continued reliance on hard-copy sources impedes the development of effective digital detection tools. This gap has enabled widespread plagiarism in research papers, books, and educational documents published by students, researchers, and authors from various academic and research organizations, compromising academic integrity and highlighting the urgent need for a digital solution. Although widely used plagiarism detection tools, such as Grammarly and Chegg, have helped ensure originality in many languages worldwide, they fail to detect plagiarism in under-resourced languages like Khmer. This project therefore aims to develop a plagiarism detection tool specifically for the Khmer language and to identify the most efficient approach by comparing four methods: Term Frequency–Inverse Document Frequency (TF-IDF) with cosine similarity; N-gram matching with Jaccard similarity, backed by an inverted index in PostgreSQL; Bidirectional Encoder Representations from Transformers (BERT); and Elasticsearch. To check hard-copy source documents for plagiarism, the documents are first scanned and then processed with Optical Character Recognition (OCR) to extract their text. The system enables educational institutions, libraries, and publishers to upload large volumes of documents, books, and academic papers to detect plagiarized content, receive a plagiarism score, and identify matched sources within seconds. All uploaded documents are kept in a centralized storage system, so users can easily access their digital copies. The tool is currently deployed on self-managed servers at the Cambodian Ministry of Education, Youth and Sport, the primary funder of this initiative.
It is being used to assess official educational content and academic papers from 13 universities in Cambodia, which helps evaluate the tool’s effectiveness and identify areas for further improvement. While the system effectively detects thousands of identical and similar plagiarized sentences, its accuracy is limited by the quality of OCR text extraction: recognition errors distort the extracted text and reduce the effectiveness of plagiarism detection. Future enhancements will focus on improving OCR performance and integrating internet-based plagiarism detection, expanding the system's capabilities and further strengthening research integrity in Cambodia.

Introduction

Problem Statement

Many authors, PhD students, book publishers, and researchers at more than thirteen universities in Cambodia have been found reusing existing content to produce additional publications. This practice continues despite growing awareness of the consequences of plagiarism and is evident in numerous official books and research publications. A 2023 report by the Cambodian Education Forum further highlights that plagiarism is widespread among both students and academic staff. The problem is worsened by the limited availability of digital Khmer texts: internet-based plagiarism detection tools cannot check content that exists only in hard copy.

Motivation

To address this challenge, a Khmer-language plagiarism detection system was developed to support Cambodian universities, publishers, and institutions in verifying the originality of academic documents. The system compares submitted files against a centralized database and significantly reduces manual workload by enabling fast, automated plagiarism detection through a web-based interface. It also helps overcome the lack of digitized content by using Optical Character Recognition (OCR) to extract searchable text from printed materials.

Project Funder

Funded and supported by the Cambodian Ministry of Education, Youth and Sport (MoEYS), this project was initiated to improve research integrity across academic institutions. The Secretary of State of the Ministry enabled 13 public universities to test the platform for detecting duplicated content in books, research papers, and academic publications.

Technology Stack and Development Environment

Implementation and Method Exploration

Explored Plagiarism Detection Methods

  1. TF-IDF and Cosine Similarity: Represent each document as a vector of TF-IDF term weights and measure similarity as the cosine of the angle between two vectors.
  2. N-gram, Jaccard Similarity, and Inverted Index in PostgreSQL: Split text into n-grams and compare the overlapping n-gram sets with Jaccard similarity, using a PostgreSQL-based inverted index for fast lookup.
  3. BERT: Use transformer-based embeddings to measure semantic similarity.
  4. Elasticsearch: Use full-text search to retrieve matching passages for plagiarism detection.
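
To make the first two methods concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not the project's implementation: the function names are hypothetical, the Jaccard example uses character n-grams to sidestep Khmer word segmentation, the TF-IDF example assumes the text has already been split into tokens, and the smoothed IDF formula is one common variant. The PostgreSQL inverted index is not shown.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of text, ignoring whitespace.

    Assumption: character n-grams are used so that no Khmer word
    segmenter is needed; the project may segment differently.
    """
    compact = "".join(text.split())
    if len(compact) < n:
        return {compact} if compact else set()
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tfidf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """Build one sparse TF-IDF vector per tokenized document.

    Assumption: a smoothed IDF, log((1 + N) / (1 + df)) + 1, is used here.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

In this sketch, two passages would be compared with `jaccard(char_ngrams(a), char_ngrams(b))`, or with `cosine` on the vectors that `tfidf_vectors` returns for a tokenized corpus. Both scores lie between 0 and 1, with higher values indicating more shared content.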

Final System Architecture and Technical Implementation

Elasticsearch was chosen as the final implementation method. The diagram below illustrates the overall workflow of the plagiarism detection system, from document upload to result delivery. It shows how Flask, Redis, MinIO, Tesseract, and Elasticsearch work together to automate both document upload and plagiarism detection.
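
As a rough illustration of the detection step only, the Python sketch below splits extracted text into sentences, builds an Elasticsearch `match` query for each sentence, and turns per-sentence match results into a percentage score. The index field name `content`, the `80%` threshold, the result size, and all function names are assumptions made for this example; the source does not state which query type or scoring rule the system uses. The Flask endpoints, Redis queueing, MinIO storage, and Tesseract OCR stages are omitted, and no Elasticsearch client is called here.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split Khmer text into sentences on the Khmer full stop (U+17D4)."""
    return [part.strip() for part in re.split("។", text) if part.strip()]


def build_sentence_query(sentence: str, field: str = "content",
                         min_match: str = "80%", size: int = 5) -> dict:
    """Build an Elasticsearch request body for one sentence.

    Uses a standard `match` query with `minimum_should_match`, so only
    documents sharing most of the sentence's analyzed terms are returned.
    The field name and threshold are hypothetical defaults.
    """
    return {
        "size": size,
        "query": {
            "match": {
                field: {
                    "query": sentence,
                    "minimum_should_match": min_match,
                }
            }
        },
        # Ask Elasticsearch to return the matching fragments as well.
        "highlight": {"fields": {field: {}}},
    }


def plagiarism_score(matched_flags: list[bool]) -> float:
    """Percentage of sentences that had at least one matching source."""
    if not matched_flags:
        return 0.0
    return 100.0 * sum(matched_flags) / len(matched_flags)
```

In a full pipeline, each request body would be sent to the search cluster, a sentence would be flagged as matched when at least one hit comes back, and the flags would be passed to `plagiarism_score`. How well `minimum_should_match` works for Khmer depends on how the index analyzer tokenizes the text, which this sketch does not address.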

System Architecture Diagram

Full Final Year Project Document

To view the full version of the final year project report, click the link below:

View Full Final Year Project Report (PDF)

Conclusion

Successfully deployed on a self-hosted server within the Cambodian Ministry of Education, Youth and Sport, this system addresses the lack of digital tools for plagiarism detection in Cambodia. Designed specifically for Khmer-language documents, it gives 13 public universities access to test the plagiarism detection tool through a web-based interface. Four detection methods (TF-IDF with cosine similarity, N-gram with Jaccard similarity, BERT, and Elasticsearch) were explored and evaluated for accuracy, scalability, and performance. Based on these evaluations, Elasticsearch was selected and integrated into the system to provide high-performance document retrieval. Future development will focus on improving OCR accuracy for scanned PDFs and expanding detection to include online sources, further strengthening research integrity nationwide.

About the Author

My name is Sothyro Meas, and I completed this work as my undergraduate final-year project in Information and Communication Technology. I am passionate about applying machine learning and artificial intelligence across various domains to address real-world challenges.