PyMotifs: a workflow designed to detect significant lexico-grammatical patterns for automatic stylistic analysis in French
This project, led by Antoine Silvestre de Sacy and Dominique Legallois (CNRS, Université Paris III - Sorbonne-Nouvelle), was supported by the European infrastructure DARIAH via a bi-annual thematic funding call on the theme of Workflows.
The aim of the PyMotifs workflow is to model French texts as lexico-grammatical patterns, also known as "motifs". For a long time, the objectives of stylometry were authorship verification, the dating of texts, and the measurement of similarity (generally lexical) between texts. Now, with the possibilities offered by computer science and the advent of corpus linguistics and the Digital Humanities, more qualitative analyses based on quantitative results are no longer merely envisaged but actually carried out, in order to characterize the properties of a text, of a textual genre, or of an author's textual production. A corpus stylistics can thus be developed: it retains the rigor and precision of traditional stylistics while relying on large amounts of quantified data.
PyMotifs sits at the frontier between stylometric and stylistic questions and is intended for use by both disciplines. It complements stylometry, whose orientation is essentially quantitative and which seeks to answer questions such as: what subjects does a given text address? By whom was it written? When was it written? It also complements stylistics, whose orientation is essentially qualitative and which focuses on particular facts of language: is this form significant for this author or this text? How can we interpret this form as a feature of the style of this author, text, genre or corpus? In contrast to many approaches in digital humanities and big data, which stick to distant reading, and to stylistics, which sticks to singular facts of language, PyMotifs proposes a mixed method: it relies on statistical methods but still enables a systematic return to the texts to interpret the results. PyMotifs thus allows for both inductive approaches (data-driven, bottom-up) and deductive approaches (top-down, qualitative stylistics).
All functionalities have been coded so as to be usable by non-specialists with some notion of programming. Each module is presented as a documented Jupyter notebook, which makes it possible to annotate the corpus with motifs and to apply a series of functions for analyzing and comparing texts. More advanced machine learning functionalities built on the scikit-learn framework have also been developed.
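To make the notion of a motif more concrete, here is a minimal sketch in Python of one possible encoding. It is not the PyMotifs implementation: the rule used (function words keep their lemma, all other words are replaced by their part-of-speech tag), the tiny `FUNCTION_LEMMAS` list, the tag names, and the helper functions `to_motif_tokens` and `motif_ngrams` are all assumptions made for illustration. The example sentence is annotated by hand; a real run would rely on a lemmatizer and a part-of-speech tagger.

```python
from collections import Counter

# Hypothetical, deliberately tiny list of function-word lemmas kept as-is.
FUNCTION_LEMMAS = {"le", "de", "un", "et", "à", "que", "qui", "ne", "pas", "être", "avoir"}


def to_motif_tokens(tagged):
    """Map (form, lemma, pos) triples to motif tokens.

    Function words keep their lemma; every other word is replaced by its
    part-of-speech tag. This rule is an illustrative simplification.
    """
    return [lemma if lemma in FUNCTION_LEMMAS else pos for _form, lemma, pos in tagged]


def motif_ngrams(tokens, n):
    """Count every contiguous sequence of n motif tokens."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hand-annotated example: "le silence de la nuit était profond".
sentence = [
    ("le", "le", "DET"),
    ("silence", "silence", "NC"),
    ("de", "de", "PREP"),
    ("la", "le", "DET"),
    ("nuit", "nuit", "NC"),
    ("était", "être", "V"),
    ("profond", "profond", "ADJ"),
]

tokens = to_motif_tokens(sentence)
trigrams = motif_ngrams(tokens, 3)

print(tokens)
print(sorted(trigrams))
```

Under these assumptions, the sentence is reduced to the token sequence `le NC de le NC être ADJ`, and three-token motifs such as `le NC de` can then be counted across a corpus and traced back to the passages in which they occur.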
Workflow steps (5)
1 Installation
2 Text preprocessing
3 Document prediction based on motifs
4 Canonical label prediction based on motifs
5 Corpus clustering based on motifs features
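As a rough illustration of how motif frequencies can serve as features for comparing or grouping texts, the sketch below computes cosine similarity between motif count profiles using only the Python standard library. It is a simplified stand-in, not the workflow's scikit-learn code: the motif counts are invented, and the names `cosine_similarity`, `nearest_neighbour`, and `text_a` to `text_c` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two motif count profiles (Counter objects)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbour(name, profiles):
    """Return the name of the other text whose profile is most similar."""
    others = (other for other in profiles if other != name)
    return max(others, key=lambda other: cosine_similarity(profiles[name], profiles[other]))


# Invented motif counts for three hypothetical texts.
profiles = {
    "text_a": Counter({"le NC de le NC": 12, "ne V pas": 3, "ADJ et ADJ": 5}),
    "text_b": Counter({"le NC de le NC": 10, "ne V pas": 4, "ADJ et ADJ": 6}),
    "text_c": Counter({"le NC de le NC": 1, "ne V pas": 15}),
}

neighbours = {name: nearest_neighbour(name, profiles) for name in profiles}
print(neighbours)
```

With these invented counts, `text_a` and `text_b` are each other's nearest neighbour, while `text_c` stands apart because it is dominated by a different motif. A clustering algorithm applied to the same kind of vectors would build on similarity or distance measures of this sort.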