Stylometry

Stylometry is the application of the study of linguistic style, usually to written language, but it has been applied successfully to music^[1] and to fine-art paintings^[2] as well.^[3] Another conceptualization defines it as the linguistic discipline that evaluates an author's style through the application of statistical analysis to a body of their work.^[4]

Stylometry is often used to attribute authorship to anonymous or disputed documents.^[5] It has legal as well as academic and literary applications, ranging from the question of the authorship of Shakespeare's works to forensic linguistics.

History

Stylometry grew out of earlier techniques of analyzing texts for evidence of authenticity, author identity, and other questions.

The modern practice of the discipline received publicity from the study of authorship problems in English Renaissance drama. Researchers and readers observed that some playwrights of the era had distinctive patterns of language preferences, and attempted to use those patterns to identify authors of uncertain or collaborative works. Early efforts were not always successful: in 1901, one researcher attempted to use John Fletcher's preference for "’em", the contracted form of "them", as a marker to distinguish between Fletcher and Philip Massinger in their collaborations, but he mistakenly employed an edition of Massinger's works in which the editor had expanded all instances of "’em" to "them".^[6]

The basics of stylometry were established by Polish philosopher Wincenty Lutosławski in Principes de stylométrie (1898). Lutosławski used this method to develop a chronology of Plato's Dialogues.^[7]

The development of computers and their capacities for analyzing large quantities of data enhanced this type of effort by orders of magnitude. The great capacity of computers for data analysis, however, did not guarantee good quality output. During the early 1960s, Rev. A. Q. Morton produced a computer analysis of the fourteen Epistles of the New Testament attributed to St. Paul, which indicated that six different authors had written that body of work. A check of his method, applied to the works of James Joyce, gave the result that Ulysses, Joyce's multi-perspective, multi-style novel, was composed by five separate individuals, none of whom apparently had any part in the crafting of Joyce's first novel, A Portrait of the Artist as a Young Man.^[8]

In time, however, and with practice, researchers and scholars have refined their methods, to yield better results. One notable early success was the resolution of disputed authorship of twelve of The Federalist Papers by Frederick Mosteller and David Wallace.^[9] While there are still questions concerning initial assumptions and methods (and, perhaps, always will be), few now dispute the basic premise that linguistic analysis of written texts can produce valuable information and insight. (Indeed, this was apparent even before the advent of computers: the successful application of a textual/linguistic analysis to the Fletcher canon by Cyrus Hoy and others yielded clear results during the late 1950s and early 1960s.)

Applications

Applications of stylometry include literary studies, historical studies, social studies, information retrieval, and many forensic cases and studies.^[10]^[11] It can also be applied to computer code^[12] and to intrinsic plagiarism detection, which detects plagiarism from changes in writing style within a document.^[13] Stylometry can also be used to predict whether someone is a native or non-native English speaker from their typing speed.^[14]

Stylometry as a method is vulnerable to the distortion of text during revision.^[15] An author may also adopt different styles over the course of a career, as was demonstrated in the case of Plato, who chose different stylistic policies, such as those adopted for the early and middle dialogues addressing the Socratic problem.^[16]

Current research

Modern stylometry relies on computers for statistical analysis, on artificial intelligence, and on access to the growing corpus of texts available via the Internet.^[17] Software systems such as Signature^[18] (freeware produced by Dr Peter Millican of Oxford University), JGAAP^[19] (the Java Graphical Authorship Attribution Program, freeware produced by Dr Patrick Juola of Duquesne University), stylo^[20]^[21] (an open-source R package for a variety of stylometric analyses, including authorship attribution, developed by Maciej Eder, Jan Rybicki, and Mike Kestemont) and Stylene^[22] for Dutch (online freeware by Prof Walter Daelemans of University of Antwerp and Dr Véronique Hoste of University of Ghent) make its use increasingly practicable, even for the non-expert.

Academic venues and events

Stylometric methods are used for several academic topics, as an application of linguistics, lexicography, or literary study,^[3] in conjunction with natural language processing and machine learning, and applied to plagiarism detection, authorship analysis, or information retrieval.^[17]

Forensic linguistics

The International Association of Forensic Linguists (IAFL) organises a conference series (13th edition in 2016 in Porto) and publishes a journal with forensic stylistics as one of its central topics.

AAAI

The Association for the Advancement of Artificial Intelligence (AAAI) has hosted several events on subjective and stylistic analysis of text.^[23]^[24]^[25]

PAN

PAN workshops (originally plagiarism analysis, authorship identification, and near-duplicate detection; later, more generally, a workshop on uncovering plagiarism, authorship, and social software misuse) have been organised since 2007, mainly in conjunction with information access conferences such as ACM SIGIR and CLEF. PAN formulates shared challenge tasks for plagiarism detection,^[26] authorship identification,^[27] author gender identification,^[28] author profiling,^[29] vandalism detection,^[30] and other related text analysis tasks, many of which hinge on stylometry.

Case studies of interest

In 1439, Lorenzo Valla showed that the Donation of Constantine was a forgery, an argument based partly on a comparison of the Latin with that used in authentic 4th-century documents.
In 1952, the Swedish priest Dick Helander was elected bishop of Strängnäs. The campaign was competitive, and Helander was accused of writing a series of some hundred anonymous libelous letters about other candidates to the electorate of the bishopric of Strängnäs. Helander was first convicted of writing the letters and lost his position as bishop, but was later partially exonerated. The letters were studied using a number of stylometric measures (and also typewriter characteristics), and the various court cases and further examinations, many contracted by Helander himself during the years until his death in 1978, discussed stylometric method and its value as evidence in some detail.^[31]^[32]
In 1975, after Ronald Reagan had served as governor of California, he began giving weekly radio commentaries syndicated to hundreds of stations. After his personal notes were made public on his 90th birthday in 2001, a study used stylostatistical methods to determine which of those talks were written by him and which were written by various aides.^[33]
In 1996, the stylometric analysis of the controversial, pseudonymously authored book Primary Colors, performed by Vassar College professor Donald Foster,^[34] brought the topic to the attention of a wider audience after correctly identifying the author as Joe Klein. (This case was resolved only after a handwriting analysis confirmed the authorship.)
In 1996, stylometric methods were used to compare the Unabomber manifesto with letters written by one of the suspects, Theodore Kaczynski, which resulted in Kaczynski's apprehension and later conviction.^[35]
In April 2015, researchers using stylometry techniques identified a play, Double Falsehood, as being the work of William Shakespeare.^[36]^[37] Researchers analyzed 54 plays by Shakespeare and John Fletcher, and compared average sentence length, studied the use of unusual words and quantified the complexity and psychological valence of their language.
In 2016, MacDonald P. Jackson, Emeritus Professor of English at the University of Auckland, New Zealand and a Fellow of the Royal Society of New Zealand, who had spent his entire academic career analyzing authorship attribution, wrote a book titled Who Wrote "The Night Before Christmas"?: Analyzing the Clement Clarke Moore Vs. Henry Livingston Question,^[38] in which he evaluates the opposing arguments and, for the first time, uses the author-attribution techniques of modern computational stylistics to examine the long-standing controversy. Jackson employs a range of tests and introduces a new one, statistical analysis of phonemes; he concludes that Livingston is the true author of the classic work.
In 2017, Simon Fuller and James O'Sullivan published a study claiming that bestselling author James Patterson does not do any writing in his apparently co-authored novels.^[39]^[40]^[41] According to O'Sullivan, his collaboration with former U.S. president Bill Clinton, The President is Missing, is an exception to this rule.^[42]
In 2017, a group of linguists, computer scientists, and scholars analysed the authorship of Elena Ferrante. Based on a corpus created at University of Padua containing 150 novels written by 40 authors, they analyzed Ferrante's style based on seven of her novels. They were able to compare her writing style with 39 other novelists using, for example, stylo.^[20] The conclusion was the same for all of them: Domenico Starnone is the secret author of Elena Ferrante.^[43]
In 2018, Mark Glickman, a senior lecturer in statistics at Harvard University, worked with Ryan Song, a former statistics student at Harvard, and Jason Brown, a professor at Dalhousie University in Nova Scotia, applying stylometry to find that, most likely, The Beatles' song "In My Life" was composed by John Lennon, but with a 50% chance that Paul McCartney wrote the middle eight.^[44]^[45]
In 2019, the ETSO project: Stylometry applied to the Spanish Golden Age Theater, directed by Álvaro Cuéllar González and Germán Vega García-Luengos (University of Valladolid), managed to gather more than 1200 plays of the Spanish Golden Age. After stylometric analysis was applied, the attribution of Mujeres y criados to Lope de Vega^[46]^[47] was ratified, and an authorship problem was detected in La monja alférez, a play attributed to Pérez de Montalbán which, thanks to these analyses and to historical and philological research, was eventually attributed to Juan Ruiz de Alarcón.^[48]^[49]^[50]^[51]
In 2020, Rachel McCarthy and James O'Sullivan argued that Emily Brontë is the true author of Wuthering Heights, ending speculation by some critics that the novel might have been written by one of her siblings, specifically either Branwell or Charlotte.^[52]
In 2020, Hartmut Ilsemann used Rolling Delta and Rolling Classify from the R Stylo program suite to show that the Marlowe corpus is stylistically inhomogeneous, and that the author of the two Tamburlaines was hardly present in the remaining official corpus of Marlowe.^[53]^[54]^[55]

Data and methods

Stylometry has both descriptive use cases, which characterise the content of a collection, and identificatory use cases, such as identifying authors or categories of texts. The methods used to analyse stylometric data and features accordingly range from those built to classify items into sets to those that distribute items in a space of feature variation. Most methods are statistical in nature, such as cluster analysis and discriminant analysis; they are typically based on philological data and features, and are fruitful application domains for modern machine learning methods.

Whereas in the past, stylometry emphasized the rarest or most striking elements of a text, contemporary techniques can isolate identifying patterns even in common parts of speech. Most systems are based on lexical statistics, i.e. using the frequencies of words and terms in the text to characterise the text (or its author). In this context, unlike for information retrieval, the observed occurrence patterns of the most common words are more interesting than the topical terms which are less frequent.^[56]^[57]

The primary stylometric method is the writer invariant: a property held in common by all texts, or at least all texts long enough to admit of analysis yielding statistically significant results, written by a given author. An example of a writer invariant is frequency of function words used by the writer.
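As a rough illustration of such a feature, the sketch below computes the rate of a few function words per thousand tokens. The word list, the tokenizer, and the sample sentence are assumptions made for the example; a real study would use a longer, curated list and much longer texts.

```python
import re
from collections import Counter

# Illustrative function-word list (an assumption, not a standard set).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "but", "upon"]


def function_word_profile(text, words=FUNCTION_WORDS):
    """Return occurrences per 1,000 tokens for each word in `words`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {w: 1000.0 * counts[w] / total for w in words}


# Hypothetical sample text, far too short for a real analysis.
sample = "The cat sat upon the mat, and the dog sat in the hall."
profile = function_word_profile(sample)
```

Profiles computed this way for several texts can be compared directly, since each value is normalised by text length.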

In one such method, the text is analyzed to find the 50 most common words. The text is then divided into 5,000-word chunks and each chunk is analyzed to find the frequency of those 50 words in that chunk. This generates a 50-number identifier for each chunk, which places the chunk at a point in a 50-dimensional space. This 50-dimensional space is flattened into a plane using principal components analysis (PCA). The result is a display of points that corresponds to an author's style. If two literary works are placed on the same plane, the resulting pattern may show whether both works were by the same author or by different authors.
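The procedure can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names are invented for the example, power iteration stands in for a library PCA routine, the code assumes no more axes are requested than there are words, and the toy corpus uses 5 words and 50-word chunks instead of 50 words and 5,000-word chunks.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def chunk_vectors(tokens, n_words=50, chunk_size=5000):
    """Relative frequency of the n_words most common words in each full chunk."""
    vocab = [w for w, _ in Counter(tokens).most_common(n_words)]
    vectors = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        counts = Counter(tokens[start:start + chunk_size])
        vectors.append([counts[w] / chunk_size for w in vocab])
    return vocab, vectors


def _orthonormalize(vec, axes):
    """Remove components along earlier axes, then scale to unit length."""
    for axis in axes:
        dot = sum(x * y for x, y in zip(vec, axis))
        vec = [x - dot * y for x, y in zip(vec, axis)]
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec] if norm > 1e-12 else None


def principal_axes(rows, k=2, iters=200):
    """Top-k principal axes by power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / max(n - 1, 1)
            for j in range(d)] for i in range(d)]
    axes = []
    for _ in range(k):
        v = _orthonormalize([1.0 / (i + 1) for i in range(d)], axes)
        for _ in range(iters):
            w = _orthonormalize(
                [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)], axes)
            if w is None:  # no variance left in this direction
                break
            v = w
        axes.append(v)
    return centered, axes


def project(centered, axes):
    """Coordinates of each chunk on the plane spanned by the axes."""
    return [tuple(sum(c * a for c, a in zip(row, axis)) for axis in axes)
            for row in centered]


# Toy corpus: two invented "authors" with different word habits.
text_a = "the cat and the dog " * 30
text_b = "a bird of a feather " * 30
vocab, vectors = chunk_vectors(tokenize(text_a + text_b), n_words=5, chunk_size=50)
centered, axes = principal_axes(vectors, k=2)
points = project(centered, axes)
```

In this toy run the first three points come from the first "author" and the last three from the second, and the two groups fall on opposite sides of the first axis. Real texts give much noisier clouds of points.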

1. Gaussian statistics

Stylometric data are distributed according to the Zipf-Mandelbrot law. The distribution is extremely spiky and leptokurtic, which is why researchers could not use Gaussian statistics to solve problems such as authorship attribution. Gaussian statistics can nevertheless be used once a suitable data transformation has been applied.^[58]
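For orientation, the Zipf-Mandelbrot law gives the word of rank r a frequency proportional to 1/(r + q)^s. The sketch below evaluates the normalised law for a finite vocabulary. The parameter names s and q follow common usage, the default values are illustrative assumptions rather than fitted values, and the data transformation of the cited work is not reproduced here.

```python
def zipf_mandelbrot(n_ranks, s=1.0, q=2.7):
    """Probability of each rank 1..n_ranks under the Zipf-Mandelbrot law."""
    weights = [(rank + q) ** -s for rank in range(1, n_ranks + 1)]
    total = sum(weights)  # normalising constant
    return [w / total for w in weights]


# Illustrative vocabulary of 1,000 ranked words.
probs = zipf_mandelbrot(1000)
top_share = sum(probs[:50])  # probability mass held by the 50 most common ranks
```

Comparing `top_share` with the mass of the remaining ranks shows how strongly the distribution is dominated by a small number of very frequent words.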

2. Neural networks

Neural networks, a special case of statistical machine learning methods, have been used to analyze authorship of texts. Texts of undisputed authorship are used to train a neural network by processes such as backpropagation, in which the training error is calculated and used to update the network's weights to increase accuracy. Through a process akin to non-linear regression, the network gains the ability to generalize its recognition ability to new texts to which it has not yet been exposed, classifying them to a stated degree of confidence. Such techniques were applied to the long-standing claims of collaboration of Shakespeare with his contemporaries John Fletcher and Christopher Marlowe,^[59]^[60] and confirmed the opinion, based on more conventional scholarship, that such collaboration had indeed occurred.

A 1999 study showed that a neural network program reached 70% accuracy in determining the authorship of poems it had not yet analyzed. This study from Vrije Universiteit examined identification of poems by three Dutch authors using only letter sequences such as "den".^[61]

One study used deep belief networks (DBN) to build an authorship verification model applicable to continuous authentication (CA).^[62]

One problem with this method of analysis is that the network can become biased based on its training set, possibly selecting authors the network has analyzed more often.^[61]
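The training idea can be illustrated with the simplest possible case: a single sigmoid unit trained by gradient descent on counts of letter sequences. This is a deliberately reduced sketch, not the architecture of any study cited above; the two-trigram vocabulary, the training snippets, and the labels are all invented for the example.

```python
import math


def trigram_counts(text, vocab):
    """Count (overlapping) occurrences of each letter trigram in `vocab`."""
    text = text.lower()
    return [sum(1 for i in range(len(text) - 2) if text[i:i + 3] == g)
            for g in vocab]


def train(samples, labels, epochs=500, lr=0.5):
    """Fit one sigmoid unit; `err` is the cross-entropy gradient at the output."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(text, vocab, weights, bias):
    """Estimated probability that `text` belongs to the class labelled 1."""
    x = trigram_counts(text, vocab)
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical training data: label 1 = "author A", label 0 = "author B".
vocab = ["den", "ing"]
train_texts = ["den denken eden", "de den laden", "singing ring", "king bringing"]
train_labels = [1, 1, 0, 0]
samples = [trigram_counts(t, vocab) for t in train_texts]
weights, bias = train(samples, train_labels)

p_a = predict("de tuin van eden", vocab, weights, bias)
p_b = predict("a passing thing", vocab, weights, bias)
```

The bias problem mentioned above applies to this sketch as well: if one class dominates the training set, the learned bias term shifts predictions toward that class.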

3. Genetic algorithms

The genetic algorithm is another machine learning technique used for stylometry. The method starts with a set of rules, for example: "If but appears more than 1.7 times in every thousand words, then the text is by author X". The program is presented with text and uses the rules to determine authorship. The rules are tested against a set of known texts and each rule is given a fitness score. In a population of 100 rules, the 50 rules with the lowest scores are discarded, the remaining 50 rules are given small changes, and 50 new rules are introduced. This is repeated until the evolved rules attribute the texts correctly.
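A minimal sketch of this loop is given below. The rule form (a word and a rate threshold), the threshold range, the mutation size, and the tiny labelled corpus are assumptions chosen for illustration; only the 100/50/50 population scheme follows the description above.

```python
import random
import re


def rate_per_thousand(text, word):
    tokens = re.findall(r"[a-z']+", text.lower())
    return 1000.0 * tokens.count(word) / max(len(tokens), 1)


def random_rule(rng, vocab):
    """A rule (word, threshold): 'author X if the word's rate exceeds threshold'."""
    return (rng.choice(vocab), rng.uniform(0.0, 500.0))


def fitness(rule, corpus):
    """Number of known texts the rule attributes correctly."""
    word, threshold = rule
    return sum((rate_per_thousand(text, word) > threshold) == is_x
               for text, is_x in corpus)


def mutate(rule, rng):
    """Apply a small random change to the threshold."""
    word, threshold = rule
    return (word, max(0.0, threshold + rng.gauss(0.0, 10.0)))


def evolve(corpus, vocab, generations=50, seed=0):
    rng = random.Random(seed)
    population = [random_rule(rng, vocab) for _ in range(100)]
    for _ in range(generations):
        population.sort(key=lambda r: fitness(r, corpus), reverse=True)
        if fitness(population[0], corpus) == len(corpus):
            break  # some rule already attributes every known text correctly
        survivors = [mutate(r, rng) for r in population[:50]]  # drop lowest 50
        population = survivors + [random_rule(rng, vocab) for _ in range(50)]
    return max(population, key=lambda r: fitness(r, corpus))


# Hypothetical labelled texts: True means "written by author X".
corpus = [
    ("but I went but he stayed but we left", True),
    ("but no but yes", True),
    ("and I went and he stayed", False),
    ("we left and they came", False),
]
best_rule = evolve(corpus, vocab=["but", "and", "the"])
```

On this toy corpus only rules about "but" can separate the two authors, so the search settles on one of them. With real data, fitness would be measured on held-out texts to avoid rules that merely memorise the training set.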

4. Rare pairs

One method for identifying style is termed "rare pairs", and relies upon individual habits of collocation. The use of certain words may, for a particular author, be associated idiosyncratically with the use of other, predictable words.
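The text above does not specify a procedure, so the following is only one possible reading, sketched under stated assumptions: a pair counts as characteristic when two words co-occur within a small window several times in an author's text but not at all in a reference text. The window size, the thresholds, and both sample texts are hypothetical.

```python
import re
from collections import Counter


def pair_counts(text, window=3):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pairs = Counter()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs


def rare_pairs(author_text, reference_text, min_author=2, max_reference=0, window=3):
    """Pairs habitual for the author but (nearly) absent from the reference text."""
    author = pair_counts(author_text, window)
    reference = pair_counts(reference_text, window)
    return sorted(pair for pair, count in author.items()
                  if count >= min_author and reference[pair] <= max_reference)


# Hypothetical texts for illustration only.
author_text = "the silver moon rose; the silver moon set"
reference_text = "the moon rose and the sun set"
pairs = rare_pairs(author_text, reference_text)
```

A disputed text could then be scored by how many of these pairs it contains, although any such score would need calibration against texts by other candidate authors.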

Authorship attribution in instant messaging

The diffusion of the internet has shifted the attention of authorship attribution towards online texts (web pages, blogs, etc.), electronic messages (e-mails, tweets, posts, etc.), and other types of written information that are far shorter than an average book, much less formal, and more diverse in terms of expressive elements such as colors, layout, fonts, graphics, and emoticons. Efforts to take such aspects into account at the level of both structure and syntax have been reported.^[63] In addition, content-specific and idiosyncratic cues (e.g., topic models and grammar checking tools) were introduced to unveil deliberate stylistic choices.^[64]

Standard stylometric features have been employed to categorize the content of a chat by instant messaging,^[65] or the behavior of the participants,^[66] but attempts to identify chat participants are still few and at an early stage. Furthermore, the similarity between spoken conversations and chat interactions has been neglected, although it is a major difference between chat data and any other type of written information.

Notes

^ Westcott, Richard (15 June 2006). "Making hit music into a science". BBC News.
^ Sethi, Ricky (2016-06-07). "Using computers to better understand art". The Conversation. Retrieved 2021-12-01.
^ ^a ^b Argamon, Shlomo, Kevin Burns, and Shlomo Dubnov, eds. The structure of style: algorithmic approaches to understanding manner and meaning. Springer Science & Business Media, 2010.
^ Yang, Christopher C.; Chen, Hsinchun; Chau, Michael; Chang, Kuiyu; Lang, Sheau-Dong; Chen, Patrick; Carley, Kathleen M.; Hsieh, Raymond; Zeng, Daniel (2008). Intelligence and Security Informatics: IEEE ISI 2008 International Workshops: PAISI, PACCF and SOCO 2008, Taipei, Taiwan, June 17, 2008, Proceedings. Berlin: Springer Science & Business Media. p. 252. ISBN 9783540691365.
^ Chen, Hsinchun; Yang, Christopher C.; Chau, Michael; Li, Shu-Hsing (2009). Intelligence and Security Informatics: Pacific Asia Workshop, PAISI 2009, Bangkok, Thailand, April 27, 2009. Proceedings. Berlin: Springer Science & Business Media. p. 15. ISBN 9783642013928.
^ Samuel Schoenbaum, Internal evidence and Elizabethan dramatic authorship; an essay in literary history and method, p. 171.
^ Lutoslawski, W. (1898). "Principes de stylométrie appliqués à la chronologie des œuvres de Platon". Revue des Études Grecques. 11 (41): 61–81. doi:10.3406/reg.1898.5847. ISSN 0035-2039.
^ Samuel Schoenbaum, Internal evidence and Elizabethan dramatic authorship; an essay in literary history and method, p. 196.
^ F. Mosteller & D. Wallace (1964). Inference and Disputed Authorship: The Federalist. Reading, MA: Addison-Wesley.
^ Chaski, Carole (2012). Solan, Lawrence M; Tiersma, Peter M (eds.). Author Identification in the Forensic Setting. The Oxford Handbook of Language and Law. Oxford University Press. doi:10.1093/oxfordhb/9780199572120.001.0001. ISBN 9780199572120.
^ Chaski, Carole (22 December 2005). Wecht, Cyril H.; Rago, John T. (eds.). Forensic Science and Law: Investigative Applications in Criminal, Civil and Family Justice. CRC Press. ISBN 978-1-4200-5811-6.
^ Claburn, Thomas (March 16, 2018). "FYI: AI tools can unmask anonymous coders from their binary executables". The Register. Retrieved August 2, 2018.
^ Bensalem, Imene; Rosso, Paolo; Chikhi, Salim (2019). "On the use of character n-grams as the only intrinsic evidence of plagiarism". Language Resources and Evaluation. 53 (3): 363–396. doi:10.1007/s10579-019-09444-w. hdl:10251/159151. S2CID 86630897.
^ Brizan, David (October 2015). "Utilizing linguistically enhanced keystroke dynamics to predict typist cognition and demographics". International Journal of Human-Computer Studies. 82: 57–68. doi:10.1016/j.ijhcs.2015.04.005.
^ Alican, Necip Fikri (2012). Rethinking Plato: A Cartesian Quest for the Real Plato. Amsterdam: Rodopi. p. 183. ISBN 9789042035379.
^ Rowe, Christopher (2000). The Cambridge History of Greek and Roman Political Thought. Cambridge, UK: Cambridge University Press. p. 160. ISBN 0521481368.
^ ^a ^b Argamon, Shlomo, Jussi Karlgren, et al. Stylistic analysis of text for information access. Papers from the workshop held in conjunction with the 28th Annual International ACM Conference on Research and Development in Information Retrieval, August 13–19, 2005, Salvador, Bahia, Brazil. Swedish institute of computer science, 2005.
^ "The Signature Stylometric System". PhiloComp. Retrieved 2014-01-03.
^ "JGAAP". JGAAP. 2012-09-04. Retrieved 2012-10-15.
^ ^a ^b "The stylo for R package". Computational Stylistics Group. 2014-10-24. Retrieved 2014-10-24.
^ Eder, Maciej; Rybicki, Jan; Kestemont, Mike (2016). "Stylometry with R: a package for computational text analysis" (PDF). R Journal. 8 (1): 107–121. doi:10.32614/RJ-2016-007.
^ Daelemans, Walter & Hoste, Véronique (2013). STYLENE: an Environment for Stylometry and Readability Research for Dutch (Technical report). CLiPS Technical Report Series. ISSN 2033-3544.
^ "Exploring attitude and affect in text: Theories and applications." AAAI Spring Symposium Technical report SS-04-07. AAAI Press, Menlo Park, CA. 2004.
^ Jussi Karlgren, Pentti Kanerva, et al. "Acquiring (and Using) Linguistic (and World) Knowledge for Information Access." (2002). AAAI Spring Symposium. Technical report SS-02-09. AAAI Press, Menlo Park, CA. 2002.
^ Shlomo Argamon, Shlomo Dubnov, et al. "Style and Meaning in Language, Art, Music, and Design" (2004). AAAI Fall Symposium. Technical report FS-04-07.
^ Potthast, Martin, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. "An evaluation framework for plagiarism detection." In Proceedings of the 23rd international conference on computational linguistics: Posters, pp. 997–1005. Association for Computational Linguistics, 2010.
^ Stamatatos, Efstathios, Walter Daelemans, Ben Verhoeven, Patrick Juola, Aurelio López-López, Martin Potthast, and Benno Stein. "Overview of the Author Identification Task at PAN 2014." In CLEF (Working Notes), pp. 877–897. 2014.
^ Rangel, Francisco, Paolo Rosso, Martin Potthast, and Benno Stein. "Overview of the 5th author profiling task at pan 2017: Gender and language variety identification in twitter." Working Notes Papers of the CLEF (2017).
^ Rangel Pardo, Francisco Manuel, Fabio Celli, Paolo Rosso, Martin Potthast, Benno Stein, and Walter Daelemans. "Overview of the 3rd Author Profiling Task at PAN 2015." In CLEF 2015 Evaluation Labs and Workshop Working Notes Papers, pp. 1–8. 2015.
^ Potthast, Martin, Benno Stein, and Teresa Holfeld. "Overview of the 1st International Competition on Wikipedia Vandalism Detection." In CLEF (Notebook Papers/LABs/Workshops). 2010.
^ Text processing text analysis and generation – text typology and attribution. Proceedings of Nobel symposium 51 / ed. by Sture Allén Stockholm : Almqvist & Wiksell international 1982 653 pp. Data linguistica; 16 Nobel symposium; 51 ISBN 91-22-00594-3
^ Karlgren, Jussi (2003). "Helander: An Authorship Attribution Case". Retrieved 4 October 2017.
^ Edoardo M. Airoldi; Stephen E. Fienberg; Kiron K. Skinner (July 2007). "Whose Ideas? Whose Words? Authorship of Ronald Reagan's Radio Addresses" (PDF). PS: Political Science & Politics. 40 (3): 501–506. CiteSeerX 10.1.1.190.5798. doi:10.1017/S1049096507070874.
^ McNett, Gavin (November 2, 2000). "Author Unknown". Salon.
^ Belluck, Pam (April 10, 1996). "In Unabom Case, Pain for Suspect's Family". The New York Times. Archived from the original on August 10, 2017. Retrieved July 5, 2008.
^ "Study finds a disputed Shakespeare play bears the master's mark". LATimes.com. 2015-04-10. Retrieved 2015-04-13.
^ Boyd, Ryan L.; Pennebaker, James W. "Did Shakespeare Write Double Falsehood? Identifying Individuals by Creating Psychological Signatures With Text Analysis". Psychological Science. 26 (5): 570–582. doi:10.1177/0956797614566658.
^ Jackson, MacDonald P (April 27, 2016). Who Wrote "The Night Before Christmas"? Analyzing the Clement Clarke Moore Vs. Henry Livingston Question. McFarland & Co. ISBN 978-1476664439.
^ Fuller, Simon; O'Sullivan, James (2017). "Structure over Style: Collaborative Authorship and the Revival of Literary Capitalism". Digital Humanities Quarterly. 11 (1). Retrieved April 20, 2017.
^ Lane, Anthony (June 18, 2018). "Bill Clinton and James Patterson's Concussive Collaboration". The New Yorker. Retrieved 2018-06-07.
^ "Why you don't need to write much to be the world's bestselling author". The Conversation. April 3, 2017. Retrieved April 20, 2017.
^ O'Sullivan, James (2018-06-07). "Bill Clinton and James Patterson are co-authors – but who did the writing?". The Guardian. Retrieved 2018-06-07.
^ Savoy, Jacques (2018). "Is Starnone really the author behind Ferrante?". Digital Scholarship in the Humanities. 33 (4): 902–918. doi:10.1093/llc/fqy016.
^ Peter Reuell. "You say John, I say Paul. But what does stylometry say?". https://news.harvard.edu/gazette/story/2018/09/harvard-statistician-examines-beatles-mystery/
^ Glickman, Mark; Brown, Jason; Song, Ryan (2019). "(A) Data in the Life: Authorship Attribution in Lennon-McCartney Songs". Harvard Data Science Review. 1 (1). doi:10.1162/99608f92.130f856e.
^ "Un monstruo de la naturaleza llamado Lope". abc (in Spanish). 2018-11-28. Retrieved 2019-08-11.
^ "Rastreadores digitales en el Siglo de Oro". El Norte de Castilla (in Spanish). 2018-12-23. Retrieved 2019-08-11.
^ Real, La Tribuna de Ciudad (2019-07-09). "Juan Ruiz de Alarcón aumenta su obra cinco siglos después". La Tribuna de Ciudad Real (in Spanish). Retrieved 2019-08-11.
^ Chamberí, PSOE. "PSOE | PSOE Chamberí | chamberí | suplemento cultural | domingo, 28 de julio 2019 | número 06 | Daniel Migueláñez | Pág nº 08 | El Holmes de la filología". Retrieved 2019-08-11.
^ "Sor Juana Inés centró las 42 Jornadas de Teatro Clásico". Lanza Digital (in European Spanish). 2019-07-14. Retrieved 2019-08-11.
^ "'La monja alférez' ya no es de Pérez de Montalbán, sino de Ruiz de Alarcón". El Norte de Castilla (in Spanish). 2019-07-10. Retrieved 2019-08-11.
^ McCarthy, Rachel; O'Sullivan, James (2020). "Who wrote Wuthering Heights?". Digital Scholarship in the Humanities. 36 (2): 383–391. doi:10.1093/llc/fqaa031.
^ Ilsemann, Hartmut (2020) "Phantom Marlowe: Paradigmenwechsel in Autorschaftsbestimmungen des englischen Renaissancedramas". Düren: Shaker, ISBN 978-3-8440-7412-3
^ Ilsemann, Hartmut (2020). "The Marlowe corpus revisited". Digital Scholarship in the Humanities. 36 (2): 333–360. doi:10.1093/llc/fqaa010.
^ Ilsemann, Hartmut (2021). "A brief supplement to 'The Marlowe Corpus Revisited' and Phantom Marlowe". Digital Scholarship in the Humanities. doi:10.1093/llc/fqab078.
^ Variation across speech and writing. Cambridge University Press, 1991.
^ Karlgren, Jussi; Cutting, Douglass (1994). "Recognizing Text Genres with Simple Metrics Using Discriminant Analysis". Proceedings of the International Conference on Computational Linguistics. 2: 1071. arXiv:cmp-lg/9410008. Bibcode:1994cmp.lg...10008K. doi:10.3115/991250.991324. S2CID 1297432.
^ Van Droogenbroeck F.J., 'An essential rephrasing of the Zipf-Mandelbrot law to solve authorship attribution applications by Gaussian statistics' (2019)
^ Matthews, Robert A. J.; Merriam, Thomas V. N (1993). "Neural Computation in Stylometry I: An Application to the Works of Shakespeare and Fletcher". Literary and Linguistic Computing. 8 (4): 203–209. doi:10.1093/llc/8.4.203.
^ Merriam, Thomas V. N; Matthews, Robert A. J. (1994). "Neural Computation in Stylometry II: An Application to the Works of Shakespeare and Marlowe". Literary and Linguistic Computing. 9 (1): 1–6. doi:10.1093/llc/9.1.1.
^ ^a ^b JF Hoorn; SL Frank; W Kowalczyk; F van der Ham (2012-09-03). "Neural network identification of poets using letter sequences". Literary and Linguistic Computing. 14 (3): 311–338. doi:10.1093/llc/14.3.311.
^ Brocardo, ML; Traore, I; Woungang, I; Obaidat, MS (2017). "Authorship verification using deep belief network systems". Int J Commun Syst. 30 (12): e3259. doi:10.1002/dac.3259.
^ de Vel, O.; Anderson, A.; Corney, M.; Mohay, G. (2001-12-01). "Mining e-Mail Content for Author Identification Forensics". SIGMOD Rec. 30 (4): 55–64. CiteSeerX 10.1.1.408.4231. doi:10.1145/604264.604272. ISSN 0163-5808. S2CID 1623521.
^ Argamon, Shlomo; Koppel, Moshe; Pennebaker, James W.; Schler, Jonathan (2009-02-01). "Automatically Profiling the Author of an Anonymous Text". Commun. ACM. 52 (2): 119–123. CiteSeerX 10.1.1.136.9952. doi:10.1145/1461928.1461959. ISSN 0001-0782. S2CID 5413411.
^ "Classification of Instant Messaging Communications for Forensics Analysis – TechRepublic". TechRepublic. Retrieved 2016-01-26.
^ Zhou, L.; Zhang, Dongsong (2004-01-01). Can online behavior unveil deceivers? – an exploratory investigation of deception in instant messaging. Proceedings of the 37th Annual Hawaii International Conference on System Sciences, 2004. pp. 9 pp.–. doi:10.1109/HICSS.2004.1265079. ISBN 978-0-7695-2056-8. S2CID 7154702.

References

Brocardo, Marcelo Luiz; Issa Traore; Sherif Saad; Isaac Woungang (2013). Authorship Verification for Short Messages Using Stylometry. IEEE Intl. Conference on Computer, Information and Telecommunication Systems (CITS). doi:10.1109/CITS.2013.6705711.
Can, Fazli; Patton, Jon M. (2004). "Change of writing style with time". Computers and the Humanities. 38 (1): 61–82. CiteSeerX 10.1.1.1.8850. doi:10.1023/b:chum.0000009225.28847.77. S2CID 38242388.
Brennan, Michael Robert; Greenstadt, Rachel. "Practical Attacks Against Authorship Recognition Techniques". Innovative Applications of Artificial Intelligence.
Hope, Jonathan (1994). The Authorship of Shakespeare's Plays. Cambridge: Cambridge University Press.
Hoy, Cyrus (1956–62). "The Shares of Fletcher and His Collaborators in the Beaumont and Fletcher Canon (I-VII)". Studies in Bibliography. 7–15.
Juola, Patrick (2006). "Authorship Attribution" (PDF). Foundations and Trends in Information Retrieval. 1 (3): 3. CiteSeerX 10.1.1.219.1605. doi:10.1561/1500000005.
Kenny, Anthony (1982). The Computation of Style: An Introduction to Statistics for Students of Literature and Humanities. Oxford: Pergamon Press.
Romaine, Suzanne (1982). Socio-Historical Linguistics. Cambridge: Cambridge University Press.
Samuels, M. L. (1972). Linguistic Evolution: With Special Reference to English. Cambridge: Cambridge University Press.
Schoenbaum, Samuel (1966). Internal Evidence and Elizabethan Dramatic Authorship: An Essay in Literary History and Method. Evanston, IL, USA: Northwestern University Press.
Van Droogenbroeck, Frans J. (2016) "Handling the Zipf distribution in computerized authorship attribution"
Van Droogenbroeck, Frans J. (2019) "An essential rephrasing of the Zipf-Mandelbrot law to solve authorship attribution applications by Gaussian statistics"
Zenkov, Andrei V. (2018). "A Method of Text Attribution Based on the Statistics of Numerals". Journal of Quantitative Linguistics. 25 (3): 256–270. doi:10.1080/09296174.2017.1371915.

External links[]

[1] Westcott, Richard (15 June 2006). "Making hit music into a science". BBC News.

[2] Sethi, Ricky (2016-06-07). "Using computers to better understand art". The Conversation. Retrieved 2021-12-01.

[structureofstyle-3] Argamon, Shlomo, Kevin Burns, and Shlomo Dubnov, eds. The structure of style: algorithmic approaches to understanding manner and meaning. Springer Science & Business Media, 2010.

[4] Yang, Christopher C.; Chen, Hsinchun; Chau, Michael; Chang, Kuiyu; Lang, Sheau-Dong; Chen, Patrick; Carley, Kathleen M.; Hsieh, Raymond; Zeng, Daniel (2008). Intelligence and Security Informatics: IEEE ISI 2008 International Workshops: PAISI, PACCF and SOCO 2008, Taipei, Taiwan, June 17, 2008, Proceedings. Berlin: Springer Science & Business Media. p. 252. ISBN 9783540691365.

[5] Chen, Hsinchun; Yang, Christopher C.; Chau, Michael; Li, Shu-Hsing (2009). Intelligence and Security Informatics: Pacific Asia Workshop, PAISI 2009, Bangkok, Thailand, April 27, 2009. Proceedings. Berlin: Springer Science & Business Media. p. 15. ISBN 9783642013928.

[6] Schoenbaum, Samuel. Internal Evidence and Elizabethan Dramatic Authorship: An Essay in Literary History and Method, p. 171.

[7] Lutoslawski, W. (1898). "Principes de stylométrie appliqués à la chronologie des œuvres de Platon". Revue des Études Grecques. 11 (41): 61–81. doi:10.3406/reg.1898.5847. ISSN 0035-2039.

[8] Schoenbaum, Samuel. Internal Evidence and Elizabethan Dramatic Authorship: An Essay in Literary History and Method, p. 196.

[9] F. Mosteller & D. Wallace (1964). Inference and Disputed Authorship: The Federalist. Reading, MA: Addison-Wesley.

[10] Chaski, Carole (2012). Solan, Lawrence M; Tiersma, Peter M (eds.). Author Identification in the Forensic Setting. The Oxford Handbook of Language and Law. Oxford University Press. doi:10.1093/oxfordhb/9780199572120.001.0001. ISBN 9780199572120.

[11] Chaski, Carole (22 December 2005). Wecht, Cyril H.; Rago, John T. (eds.). Forensic Science and Law: Investigative Applications in Criminal, Civil and Family Justice. CRC Press. ISBN 978-1-4200-5811-6.

[12] Claburn, Thomas (March 16, 2018). "FYI: AI tools can unmask anonymous coders from their binary executables". The Register. Retrieved August 2, 2018.

[13] Bensalem, Imene; Rosso, Paolo; Chikhi, Salim (2019). "On the use of character n-grams as the only intrinsic evidence of plagiarism". Language Resources and Evaluation. 53 (3): 363–396. doi:10.1007/s10579-019-09444-w. hdl:10251/159151. S2CID 86630897.

[14] Brizan, David (October 2015). "Utilizing linguistically enhanced keystroke dynamics to predict typist cognition and demographics". International Journal of Human-Computer Studies. 82: 57–68. doi:10.1016/j.ijhcs.2015.04.005.

[15] Alican, Necip Fikri (2012). Rethinking Plato: A Cartesian Quest for the Real Plato. Amsterdam: Rodopi. p. 183. ISBN 9789042035379.

[16] Rowe, Christopher (2000). The Cambridge History of Greek and Roman Political Thought. Cambridge, UK: Cambridge University Press. p. 160. ISBN 0521481368.

[17] Argamon, Shlomo, Jussi Karlgren, et al. Stylistic analysis of text for information access. Papers from the workshop held in conjunction with the 28th Annual International ACM Conference on Research and Development in Information Retrieval, August 13–19, 2005, Salvador, Bahia, Brazil. Swedish Institute of Computer Science, 2005.

[18] "The Signature Stylometric System". PhiloComp. Retrieved 2014-01-03.

[19] "JGAAP". JGAAP. 2012-09-04. Retrieved 2012-10-15.

[20] "The stylo for R package". Computational Stylistics Group. 2014-10-24. Retrieved 2014-10-24.

[21] Eder, Maciej; Rybicki, Jan; Kestemont, Mike (2016). "Stylometry with R: a package for computational text analysis" (PDF). R Journal. 8 (1): 107–121. doi:10.32614/RJ-2016-007.

[22] Daelemans, Walter & Hoste, Véronique (2013). STYLENE: an Environment for Stylometry and Readability Research for Dutch (Technical report). CLiPS Technical Report Series. ISSN 2033-3544.

[23] "Exploring attitude and affect in text: Theories and applications." AAAI Spring Symposium Technical report SS-04-07. AAAI Press, Menlo Park, CA. 2004.

[24] Jussi Karlgren, Pentti Kanerva, et al. "Acquiring (and Using) Linguistic (and World) Knowledge for Information Access." AAAI Spring Symposium. Technical report SS-02-09. AAAI Press, Menlo Park, CA. 2002.

[25] Shlomo Argamon, Shlomo Dubnov, et al. "Style and Meaning in Language, Art, Music, and Design." AAAI Fall Symposium. Technical report FS-04-07. 2004.

[26] Potthast, Martin, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. "An evaluation framework for plagiarism detection." In Proceedings of the 23rd international conference on computational linguistics: Posters, pp. 997–1005. Association for Computational Linguistics, 2010.

[27] Stamatatos, Efstathios, Walter Daelemans, Ben Verhoeven, Patrick Juola, Aurelio López-López, Martin Potthast, and Benno Stein. "Overview of the Author Identification Task at PAN 2014." In CLEF (Working Notes), pp. 877–897. 2014.

[28] Rangel, Francisco, Paolo Rosso, Martin Potthast, and Benno Stein. "Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter." Working Notes Papers of the CLEF (2017).

[29] Rangel Pardo, Francisco Manuel, Fabio Celli, Paolo Rosso, Martin Potthast, Benno Stein, and Walter Daelemans. "Overview of the 3rd Author Profiling Task at PAN 2015." In CLEF 2015 Evaluation Labs and Workshop Working Notes Papers, pp. 1–8. 2015.

[30] Potthast, Martin, Benno Stein, and Teresa Holfeld. "Overview of the 1st International Competition on Wikipedia Vandalism Detection." In CLEF (Notebook Papers/LABs/Workshops). 2010.

[31] Allén, Sture, ed. (1982). Text Processing: Text Analysis and Generation, Text Typology and Attribution. Proceedings of Nobel Symposium 51. Data Linguistica 16. Stockholm: Almqvist & Wiksell International. 653 pp. ISBN 91-22-00594-3.

[32] Karlgren, Jussi (2003). "Helander: An Authorship Attribution Case". Retrieved 4 October 2017.

[33] Edoardo M. Airoldi; Stephen E. Fienberg; Kiron K. Skinner (July 2007). "Whose Ideas? Whose Words? Authorship of Ronald Reagan's Radio Addresses" (PDF). PS: Political Science & Politics. 40 (3): 501–506. CiteSeerX 10.1.1.190.5798. doi:10.1017/S1049096507070874.

[34] McNett, Gavin (November 2, 2000). "Author Unknown". Salon.

[35] Belluck, Pam (April 10, 1996). "In Unabom Case, Pain for Suspect's Family". The New York Times. Archived from the original on August 10, 2017. Retrieved July 5, 2008.

[36] "Study finds a disputed Shakespeare play bears the master's mark". LATimes.com. 2015-04-10. Retrieved 2015-04-13.

[37] Boyd, Ryan L.; Pennebaker, James W. "Did Shakespeare Write Double Falsehood? Identifying Individuals by Creating Psychological Signatures With Text Analysis". Psychological Science. 26 (5): 570–582. doi:10.1177/0956797614566658.

[38] Jackson, MacDonald P (April 27, 2016). Who Wrote "The Night Before Christmas"? Analyzing the Clement Clarke Moore Vs. Henry Livingston Question. McFarland & Co. ISBN 978-1476664439.

[39] Fuller, Simon; O'Sullivan, James (2017). "Structure over Style: Collaborative Authorship and the Revival of Literary Capitalism". Digital Humanities Quarterly. 11 (1). Retrieved April 20, 2017.

[40] Lane, Anthony (June 18, 2018). "Bill Clinton and James Patterson's Concussive Collaboration". The New Yorker. Retrieved 2018-06-07.

[41] "Why you don't need to write much to be the world's bestselling author". The Conversation. April 3, 2017. Retrieved April 20, 2017.

[42] O'Sullivan, James (2018-06-07). "Bill Clinton and James Patterson are co-authors – but who did the writing?". The Guardian. Retrieved 2018-06-07.

[43] Savoy, Jacques (2018). "Is Starnone really the author behind Ferrante?". Digital Scholarship in the Humanities. 33 (4): 902–918. doi:10.1093/llc/fqy016.

[44] Peter Reuell. "You say John, I say Paul. But what does stylometry say?". https://news.harvard.edu/gazette/story/2018/09/harvard-statistician-examines-beatles-mystery/

[45] Glickman, Mark; Brown, Jason; Song, Ryan (2019). "(A) Data in the Life: Authorship Attribution in Lennon-McCartney Songs". Harvard Data Science Review. 1 (1). doi:10.1162/99608f92.130f856e.

[46] "Un monstruo de la naturaleza llamado Lope". abc (in Spanish). 2018-11-28. Retrieved 2019-08-11.

[47] "Rastreadores digitales en el Siglo de Oro". El Norte de Castilla (in Spanish). 2018-12-23. Retrieved 2019-08-11.

[48] "Juan Ruiz de Alarcón aumenta su obra cinco siglos después". La Tribuna de Ciudad Real (in Spanish). 2019-07-09. Retrieved 2019-08-11.

[49] Chamberí, PSOE. "PSOE | PSOE Chamberí | chamberí | suplemento cultural | domingo, 28 de julio 2019 | número 06 | Daniel Migueláñez | Pág nº 08 | El Holmes de la filología". Retrieved 2019-08-11.

[50] "Sor Juana Inés centró las 42 Jornadas de Teatro Clásico". Lanza Digital (in European Spanish). 2019-07-14. Retrieved 2019-08-11.

[51] "'La monja alférez' ya no es de Pérez de Montalbán, sino de Ruiz de Alarcón". El Norte de Castilla (in Spanish). 2019-07-10. Retrieved 2019-08-11.

[52] McCarthy, Rachel; O'Sullivan, James (2020). "Who wrote Wuthering Heights?". Digital Scholarship in the Humanities. 36 (2): 383–391. doi:10.1093/llc/fqaa031.

[53] Ilsemann, Hartmut (2020). Phantom Marlowe: Paradigmenwechsel in Autorschaftsbestimmungen des englischen Renaissancedramas. Düren: Shaker. ISBN 978-3-8440-7412-3.

[54] Ilsemann, Hartmut (2020). "The Marlowe corpus revisited". Digital Scholarship in the Humanities. 36 (2): 333–360. doi:10.1093/llc/fqaa010.

[55] Ilsemann, Hartmut (2021). "A brief supplement to 'The Marlowe Corpus Revisited' and Phantom Marlowe". Digital Scholarship in the Humanities. doi:10.1093/llc/fqab078.

[56] Biber, Douglas. Variation across speech and writing. Cambridge University Press, 1991.

[57] Karlgren, Jussi; Cutting, Douglass (1994). "Recognizing Text Genres with Simple Metrics Using Discriminant Analysis". Proceedings of the International Conference on Computational Linguistics. 2: 1071. arXiv:cmp-lg/9410008. Bibcode:1994cmp.lg...10008K. doi:10.3115/991250.991324. S2CID 1297432.

[58] Van Droogenbroeck, Frans J. (2019). "An essential rephrasing of the Zipf-Mandelbrot law to solve authorship attribution applications by Gaussian statistics".

[59] Matthews, Robert A. J.; Merriam, Thomas V. N (1993). "Neural Computation in Stylometry I: An Application to the Works of Shakespeare and Fletcher". Literary and Linguistic Computing. 8 (4): 203–209. doi:10.1093/llc/8.4.203.

[60] Merriam, Thomas V. N; Matthews, Robert A. J. (1994). "Neural Computation in Stylometry II: An Application to the Works of Shakespeare and Marlowe". Literary and Linguistic Computing. 9 (1): 1–6. doi:10.1093/llc/9.1.1.

[61] JF Hoorn; SL Frank; W Kowalczyk; F van der Ham (2012-09-03). "Neural network identification of poets using letter sequences". Literary and Linguistic Computing. 14 (3): 311–338. doi:10.1093/llc/14.3.311.

[62] Brocardo, ML; Traore, I; Woungang, I; Obaidat, MS (2017). "Authorship verification using deep belief network systems". Int J Commun Syst. 30 (12): e3259. doi:10.1002/dac.3259.

[63] Vel, O.; Anderson, A.; Corney, M.; Mohay, G. (2001-12-01). "Mining e-Mail Content for Author Identification Forensics". SIGMOD Rec. 30 (4): 55–64. CiteSeerX 10.1.1.408.4231. doi:10.1145/604264.604272. ISSN 0163-5808. S2CID 1623521.

[64] Argamon, Shlomo; Koppel, Moshe; Pennebaker, James W.; Schler, Jonathan (2009-02-01). "Automatically Profiling the Author of an Anonymous Text". Commun. ACM. 52 (2): 119–123. CiteSeerX 10.1.1.136.9952. doi:10.1145/1461928.1461959. ISSN 0001-0782. S2CID 5413411.

[65] "Classification of Instant Messaging Communications for Forensics Analysis – TechRepublic". TechRepublic. Retrieved 2016-01-26.

[66] Zhou, L.; Zhang, Dongsong (2004-01-01). Can online behavior unveil deceivers? – an exploratory investigation of deception in instant messaging. Proceedings of the 37th Annual Hawaii International Conference on System Sciences, 2004. pp. 9 pp.–. doi:10.1109/HICSS.2004.1265079. ISBN 978-0-7695-2056-8. S2CID 7154702.
