The use of sequences of linguistic categories in forensic written text comparison revisited

Bel, Nuria and Queralt, Sheila and Spassova, Maria and Turell, Maria Teresa (2012) The use of sequences of linguistic categories in forensic written text comparison revisited. In: Proceedings of The International Association of Forensic Linguists’ Tenth Biennial Conference. Centre for Forensic Linguistics, Aston University, Birmingham, pp. 192-209. ISBN 9781854494320

In recent years, the possibility of studying syntax use through computer-aided queries of annotated corpora has led researchers in the field of forensic written text comparison to explore a new possible marker of authorship, namely tag sequences as representations of combinations of linguistic categories. A series of studies carried out during the first research stage at ForensicLab using Spanish language data has shown that tag sequences exhibit a significant discriminatory capacity and can be applied to authorship attribution tasks more effectively. In the second research stage, reported in this paper, the analysis aims to identify which specific traits of each linguistic category encoded in the tags of the tag set play a major role in the correct classification of texts and which do not, bearing in mind that either excluding or including them in tag composition can help to improve this forensic linguistic comparison method. This paper reports on the findings from the statistical testing of several variants of the Institut Universitari de Lingüística Aplicada’s (IULA) tag set system and their evaluation in the context of authorship analysis. For testing purposes, a corpus of two types of written texts (novel fragments and newspaper articles) by six contemporary Spanish-speaking novelists was compiled. In addition, a subcorpus of anonymised texts written by one of the writers included in this study was used. Preliminary studies show that, in both types of written texts, the use of trigrams produces more statistically significant results than the use of bigrams, especially trigrams consisting of prepositional phrases and, to a lesser extent, verbal and compound adjective phrases.
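
As a rough illustration of what bigrams and trigrams of tags look like, the Python sketch below counts contiguous tag n-grams in a single tagged text and converts the counts to relative frequencies. It is not the authors' implementation: the function names are hypothetical, the one-letter tags are simplified placeholders rather than the IULA tag set, and the discriminant function analysis listed in the keywords is not shown.

```python
from collections import Counter
from typing import Sequence


def tag_ngrams(tags: Sequence[str], n: int) -> list[tuple[str, ...]]:
    """Return all contiguous n-grams over a sequence of tags."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def relative_frequencies(tags: Sequence[str], n: int) -> dict[tuple[str, ...], float]:
    """Relative frequency of each tag n-gram within one text."""
    grams = tag_ngrams(tags, n)
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}


# Hypothetical, simplified tags (P = preposition, D = determiner,
# N = noun, V = verb); these are NOT the IULA tag set.
example_tags = ["P", "D", "N", "V", "P", "D", "N"]

bigrams = relative_frequencies(example_tags, 2)
trigrams = relative_frequencies(example_tags, 3)
```

Frequency vectors of this kind, computed per text, are the sort of input a classifier could then compare across authors; a trigram such as `("P", "D", "N")` corresponds loosely to a prepositional phrase pattern.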

Item Type: Book Section
Uncontrolled Keywords: Morpho-syntactically annotated tag sequences, N-grams, discriminant function analysis, forensic written text comparison, authorship attribution, Spanish
Subjects: Language. Linguistics. Literature > Applied linguistics
ID Code: 2260
Deposited By: Asst. Dr. Maria Spassova
Deposited On: 15 Jul 2014 08:25
Last Modified: 31 Jul 2014 09:39
