UAEU student creates new model for identifying plagiarism of Arabic text

Pioneering techniques designed to detect whether Arabic text is being reused online, and to help identify plagiarism, have been developed by a graduate student at United Arab Emirates University (UAEU).

A thesis produced by Leena Mahmoud Ahmed Lulu, who is studying for a Ph.D. and carrying out research at the UAEU College of Information Technology, outlines how a new method based on document fingerprinting can discover whether original Arabic content on the internet is being reused by others.

She conducted the research after noting that little or no work had been carried out on detecting plagiarism in Arabic, or instances of text reuse, where existing documents are used, partially or wholly, to make new ones. Her research paper, which has now been published, proposes a new web search tool to accompany the detection method, allowing lengthier queries to be entered when trying to assess whether content is entirely original or has been used before.

"The Arabic language is a rich, morphological language that is among the most widely-used in the world, and on the web," said Lulu in her thesis.

"While the local text reuse detection problem has been mostly studied for Western languages, it is still one of the biggest challenges in the Arabic language and the research has remained quite limited. The results of this research can be thought of as rich tools for information analysts, to validate and assess information coming from uncertain sources.

"It is also time for web-users to become ‘fact inspectors’, by providing them with a tool that allows people to quickly check the validity and originality of statements and sources."

A series of experiments was conducted to see how the "unique features" of the Arabic language affected the possibility of text reuse being detected using existing techniques. Lulu’s research paper explains that the most widely used and effective approach is to detect documents which share one or more "fingerprints", a reliable indicator that they share some reused text.
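
The general idea of shared fingerprints can be illustrated with a minimal Python sketch. This is a generic word-window hashing scheme, not the model described in the thesis. The function names, the window size `k=3`, the use of MD5 and the sample sentences are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

def fingerprints(text, k=3):
    """Hash every window of k consecutive words (illustrative scheme only)."""
    words = text.split()
    windows = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    return {hashlib.md5(w.encode("utf-8")).hexdigest()[:12] for w in windows}

def build_index(collection):
    """Map each fingerprint to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for fp in fingerprints(text):
            index[fp].add(doc_id)
    return index

def find_reuse(index, input_text):
    """Return (doc_id, shared fingerprint count) pairs, highest count first."""
    counts = defaultdict(int)
    for fp in fingerprints(input_text):
        for doc_id in index.get(fp, ()):
            counts[doc_id] += 1
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample data: a phrase that one page and the input both contain.
shared = "اللغة العربية من أكثر اللغات"
collection = {
    "page_a": shared + " استخداما في العالم",
    "page_b": "تم إجراء سلسلة من التجارب في الجامعة",
}
input_document = "يقول الباحثون إن " + shared + " انتشارا"

index = build_index(collection)
candidates = find_reuse(index, input_document)
print(candidates)
```

In this sketch only `page_a` is reported, because it is the only page that shares any three-word window with the input. A scheme this simple compares exact word sequences, so a change in word order or word form between two texts would leave them with no fingerprints in common.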

However, the paper also points out that the linkage between Arabic letters, the right-to-left writing direction of Arabic text, and the flexibility of its word order reduce the efficiency of such techniques. The new fingerprinting model developed through her research, and tailored for the Arabic language, aims to solve that problem. "Our proposed method proved to be more robust for detecting text reuse, particularly when the sentence length increases toward the average sentence length in the Arabic language," Lulu said.

"The system first creates an initial document collection obtained from the web, then applies the detection techniques for finding text reuse with a given input document from this collection."

Describing text reuse detection as "an interesting and challenging area which has not been given the attention it deserves in terms of the Arabic language", Lulu said future research could focus on areas such as developing new approaches that make document fingerprints more targeted and specific, enhancing the effectiveness of the method. She also suggested that a thesaurus of "paraphrased" Arabic sayings could be compiled, to find examples which might otherwise go undetected.