Workflow

Linguistic annotation of corpora

This scenario explains the steps needed to annotate a corpus so that linguistic and statistical analyses can be conducted on it. It is aimed at people starting out with linguistic annotation and is deliberately generic: we refer to tools but do not explain how to use a particular one. Various tools and frameworks can perform the steps in this scenario, depending on the language(s) you are working with and your programming environment. A number of toolboxes for natural language processing (NLP) can perform several of the annotation steps in an integrated way; these resources are listed below under "Using an existing NLP pipeline". After performing the procedures described in this scenario, the usual next step is to load the annotated corpus into a corpus query engine in order to query and analyze it based on its annotations. Some popular query engines also provide a built-in pipeline that performs the basic processing steps in one go, taking most of the burden off the user.
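To illustrate what an "annotated corpus" ready for a query engine can look like, the sketch below emits a token-per-line vertical format (token, part-of-speech tag, lemma) of the kind many corpus query engines ingest. The tag and lemma dictionaries here are illustrative stand-ins, not real tagging or lemmatizing components.

```python
# Toy annotation pipeline: tokenize, then attach a POS tag and lemma to each
# token. The lookup tables are illustrative stand-ins for real NLP components.
POS = {"the": "DET", "cat": "NOUN", "cats": "NOUN", "sleep": "VERB", "sleeps": "VERB"}
LEMMA = {"cats": "cat", "sleeps": "sleep"}

def annotate(text: str) -> list[tuple[str, str, str]]:
    rows = []
    for token in text.lower().split():
        # Unknown tokens get the placeholder tag "X" and themselves as lemma.
        rows.append((token, POS.get(token, "X"), LEMMA.get(token, token)))
    return rows

def to_vertical(rows: list[tuple[str, str, str]]) -> str:
    # One token per line, tab-separated columns: token, POS tag, lemma.
    return "\n".join("\t".join(row) for row in rows)

print(to_vertical(annotate("The cat sleeps")))
# the	DET	the
# cat	NOUN	cat
# sleeps	VERB	sleep
```

A real pipeline would replace the lookup tables with trained components, but the output shape — one token per line with one column per annotation layer — is the common denominator that query engines build on.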


Workflow steps (7)

  1. Tokenizing
  2. Part-of-speech tagging
  3. Lemmatizing
  4. Stemming
  5. Named-entity recognition
  6. Manual annotation
  7. Using an existing NLP pipeline


The SSH Open Marketplace is maintained and will be further developed by three European Research Infrastructures - DARIAH, CLARIN and CESSDA - and their national partners. It was developed as part of the "Social Sciences and Humanities Open Cloud" (SSHOC) project, funded under the European Union's Horizon 2020 call H2020-INFRAEOSC-04-2018, grant agreement #823782.
