Abstract
Plagiarism in Khmer-language writing remains a significant concern in Cambodia. Effective detection tools are lacking, primarily because few Khmer texts have been digitized and most sources exist only as hard copies. As a result, plagiarism is widespread in books and documents published by government officers in the ministries, universities, and various organizations in Cambodia, as many authors take advantage of this gap. Although widely used plagiarism detection tools such as Grammarly and Chegg have contributed greatly to ensuring original work, they fail to detect plagiarism in under-resourced languages like Khmer, especially when documents are stored as scanned Portable Document Format (PDF) files or images, which impede accurate text extraction. This project aims to develop a plagiarism detection tool specifically for the Khmer language and to identify the most efficient approach by comparing five methods: Term Frequency-Inverse Document Frequency (TF-IDF) with cosine similarity, N-gram with Jaccard similarity, MinHash with Locality-Sensitive Hashing (LSH), Bidirectional Encoder Representations from Transformers (BERT), and Elasticsearch. To support text extraction from hard-copy sources such as scanned PDFs and images, Optical Character Recognition (OCR) is also used. The system allows educational institutions, libraries, and publishers to upload large volumes of documents, books, and academic papers, detect plagiarized content, receive a plagiarism score, and identify matched sources within seconds. All uploaded documents are also stored in a centralized storage system, so users can search and access their digital copies easily. The system is currently deployed on self-managed servers at the Cambodian Ministry of Education, which also funded this initiative. It is used to assess official educational content and to detect plagiarism in academic papers across 13 universities in Cambodia.
While the current implementation can detect thousands of similar or identical text matches, it is limited by imperfect OCR text extraction for the Khmer language. Future enhancements will focus on improving OCR accuracy and integrating internet-based plagiarism detection, a step toward strengthening research integrity in Cambodia.
Introduction
Problem Statement
Many authors, PhD researchers, professors at more than thirteen universities in Cambodia, and book publishers have been found reusing existing content to produce additional publications. Over a hundred documents containing substantial overlap were identified within the Ministry of Education’s annual books and research reports. A report from the Cambodian Education Forum (2023) also highlights that plagiarism is widespread among both students and academic staff. The problem is worsened by the limited availability of digital Khmer texts, which makes internet-based plagiarism tools ineffective for hard-copy documents.
Motivation
This tool was designed to help Cambodian universities, publishers, and institutions verify the originality of Khmer-language documents by comparing them against a centralized database. It significantly reduces manual workload by enabling fast plagiarism detection on large documents through a web-based interface. It also addresses the lack of digitized content by using OCR to extract searchable text from printed materials.
Project Background
Funded and supported by the Cambodian Ministry of Education, Youth, and Sport (MoEYS), this project was initiated to improve research integrity across academic institutions. The Secretary of State enabled 13 public universities to test the platform for detecting duplicated content in books, research papers, and academic publications.
Technologies Used
- Machine Learning: Scikit-learn (TF-IDF), Transformers (BERT)
- Text Extraction: Tesseract OCR (extracts Khmer text from scanned documents)
- Similarity Search: FAISS (semantic search with BERT), Elasticsearch (full-text retrieval), PostgreSQL (inverted index for n-gram matching)
- Backend & Task Management: Python, Flask (API), Redis (task queue)
- Frontend: React (web interface), Tailwind CSS
- Storage: MinIO (object storage for uploaded documents)
- Deployment: Docker (containerization), MoEYS server (production hosting)
- Development Tools: Visual Studio Code, Jupyter Notebook, Postman (API testing), Google Colab (running BERT on GPU)
Implementation
Plagiarism Detection Methods
Multiple methods were implemented:
- TF-IDF and Cosine Similarity: Measures document similarity based on term frequency and inverse document frequency.
- N-gram, Jaccard Similarity, and Inverted Index in PostgreSQL: Uses n-gram segmentation and set-based Jaccard similarity to compare overlapping sequences via fast lookup in a PostgreSQL-based inverted index.
- MinHash and LSH: Efficiently identifies similar documents using hashing techniques.
- BERT: Leverages transformer-based embeddings for semantic similarity.
- Elasticsearch: Utilizes full-text search capabilities for plagiarism detection.
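To make two of the set-based methods above concrete, here is a minimal sketch of n-gram Jaccard similarity and a MinHash estimate of the same quantity, using only the Python standard library. It is an illustration under stated assumptions, not the project's actual code: the function names, the choice of character trigrams (a common option for Khmer, which is written without spaces between words), the 128 hash seeds, and the use of BLAKE2b as the hash function are all hypothetical. The LSH banding step and the PostgreSQL inverted index are not shown.

```python
import hashlib
from typing import List, Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of `text`.

    Assumption: character-level n-grams with whitespace removed. The
    project may segment Khmer text differently.
    """
    cleaned = "".join(text.split())
    if not cleaned:
        return set()
    if len(cleaned) < n:
        return {cleaned}
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard similarity |A & B| / |A | B| of two n-gram sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def _seeded_hash(ngram: str, seed: int) -> int:
    """64-bit hash of an n-gram; each seed acts as one hash function."""
    data = f"{seed}:{ngram}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(ngrams: Set[str], num_hashes: int = 128) -> List[int]:
    """For each seed, keep the minimum hash value over all n-grams."""
    if not ngrams:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_seeded_hash(g, seed) for g in ngrams)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Estimate Jaccard similarity as the fraction of matching positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

In this sketch, `jaccard(char_ngrams(doc_a), char_ngrams(doc_b))` gives the exact overlap, while `estimate_jaccard` approximates it from two fixed-length signatures. That trade-off is what makes MinHash attractive for large collections: signatures can be computed once per document and compared cheaply, at the cost of some estimation error.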
System Architecture
The diagram below illustrates the overall workflow of the plagiarism detection system, from document upload to result delivery. It highlights how various technologies such as Flask, Redis, MinIO, Tesseract, and Elasticsearch work together to automate the detection process.

Full Final Year Project Document
You can download or view the full version of the final year project below:
View Full Final Year Project Report (PDF)
Conclusion
This system demonstrates a solution for detecting plagiarism in the Khmer language, addressing challenges unique to under-resourced languages and hard-copy texts. Successfully deployed at the Ministry of Education and across 13 Cambodian universities, it enables fast detection of duplicate content. Future development will focus on enhancing OCR accuracy and expanding semantic search capabilities to further support research integrity in Cambodia.
About the Author
My name is Sothyro Meas, and I completed this project as my undergraduate final year project in Information and Communication Technology. I am passionate about applying AI and NLP to build tools that serve my native language, Khmer.
📧 Email: meassothyro3@gmail.com