The use of morpho-syntactically annotated tag sequences as markers of authorship

Spassova, Maria and Turell, Maria Teresa (2007) The use of morpho-syntactically annotated tag sequences as markers of authorship. In: Proceedings of the Second European IAFL Conference on Forensic Linguistics, Language and the Law. Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra, Barcelona, Spain, pp. 229-237. ISBN 9788496742284

Official URL: http://www.iula.upf.edu/publi082.htm

Abstract

Since the early days of Forensic Linguistics there has been a continual search for idiosyncratic features that would allow us to distinguish one author from another on the basis of their linguistic production. Studies in Authorship Attribution regularly report new findings and newly developed techniques that apply different linguistic units as measures in forensic stylometric analysis. In this quest for valid and reliable identification markers, syntactic structure has proved less appealing, which is easily explained: syntactic constructions are notorious for their structural complexity and processing difficulty. However, this does not seem to be a problem if we consider syntactic structure to be a simple sequence of grouped categories and set the analysis of their function aside. This paper presents the preliminary results of a series of experiments in author identification aimed at evaluating the discriminatory capacity of sequences of linguistic categories, Morpho-syntactically Annotated Tag Sequences (MATS), and at demonstrating which of the three types tested is the best candidate marker of authorship.
The hypothesis tested in the experiments is that the most frequent tag sequences will discriminate effectively between authors in limited-size samples of texts. All the experiments were carried out on a morpho-syntactically annotated corpus consisting of 15 newspaper articles and 15 novel fragments written in Spanish by 3 authors, 5 of each genre per author.
The occurrences of each type of MATS were extracted by means of the programming language AWK, and their frequencies were then calculated using R, the environment for statistical computing and graphics. An innovative combination of traditional statistical methods was applied to determine the authorship of a pseudo-anonymous text. Preliminary results show that MATS can become valid and reliable markers of authorship.
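
The extraction and frequency steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' AWK and R pipeline: the `word/TAG` input format, the tag names, the sequence length `n`, and the function names `parse_tagged`, `tag_ngrams` and `relative_frequencies` are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter


def parse_tagged(text, sep="/"):
    """Return the tags from 'word/TAG word/TAG ...' text (hypothetical format)."""
    return [token.rsplit(sep, 1)[1] for token in text.split() if sep in token]


def tag_ngrams(tags, n):
    """Return every contiguous sequence of n tags as a tuple."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def relative_frequencies(tags, n, top=None):
    """Map the `top` most frequent tag n-grams to their relative frequency.

    With top=None all n-grams are returned. Frequencies are relative to the
    total number of n-grams in the sample, so samples of different sizes
    can be compared.
    """
    counts = Counter(tag_ngrams(tags, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {seq: count / total for seq, count in counts.most_common(top)}


# Invented sample sentence with invented tags, for illustration only.
sample = (
    "el/DET perro/NOUN come/VERB el/DET hueso/NOUN "
    "y/CONJ el/DET gato/NOUN duerme/VERB"
)
tags = parse_tagged(sample)
freqs = relative_frequencies(tags, n=2, top=3)

for sequence, frequency in freqs.items():
    print(" ".join(sequence), round(frequency, 3))
```

Computing one such frequency profile per text gives the kind of per-author data on which statistical comparison could then be carried out; the specific statistical methods the authors combined are not stated in the abstract and are not reproduced here.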

Item Type: Book Section
Additional Information: We would like to express our gratitude to Dr. R. Harald Baayen of the Max Planck Institute at the University of Nijmegen (The Netherlands) for introducing us to the R software and for his valuable help in the statistical analysis of our data.
Subjects: Language. Linguistics. Literature > Applied linguistics
Language. Linguistics. Literature > Language
ID Code: 2259
Deposited By: Asst. Dr. Maria Spassova (ас. д-р Мария Спасова)
Deposited On: 15 Jul 2014 09:18
Last Modified: 15 Jul 2014 09:18
