Social Sciences & Humanities Open Marketplace

Corpus composition requires a series of standard decisions. First, determine the corpus type: is it an opportunistic selection, a balanced corpus, or something else? If it is balanced, what are the criteria for balance? Closely related is the question of which domain the corpus should represent (e.g. should it contain only resources from a specific date range, of a certain text type, or by a specific author?). Based on these decisions, select the resources to include in the corpus, possibly with reference to a research question. Next, decide on the type of annotation to apply: lightweight or multi-layered, inline or standoff, etc. The final step of corpus composition is to determine the general tagset that the annotation should follow, i.e. the content of the planned markup.
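
The domain criteria mentioned above (date range, text type) can be written down as an explicit filter over resource metadata, which makes the selection repeatable. The following is only a minimal sketch: the `Resource` record, its fields, and the sample entries are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical metadata record for one candidate resource."""
    title: str
    author: str
    year: int
    text_type: str


def select_resources(resources, year_range, text_types):
    """Keep resources inside the (inclusive) year range and of an allowed text type."""
    first, last = year_range
    return [
        r for r in resources
        if first <= r.year <= last and r.text_type in text_types
    ]


# Invented sample candidates, for illustration only.
candidates = [
    Resource("Sermon A", "Author X", 1650, "sermon"),
    Resource("Letter B", "Author Y", 1720, "letter"),
    Resource("Sermon C", "Author Z", 1810, "sermon"),
]

selected = select_resources(candidates, (1600, 1750), {"sermon", "letter"})
print([r.title for r in selected])
```

Keeping the criteria in one function also documents them, which helps when the selection has to be justified or repeated later.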

It is advisable to include a phase of verification and cleanup after each step of corpus creation. This phase is in effect a "hidden" or "stepified" scenario of its own, which should take parameters. Verification can be automatic at first and then manual, carried out by annotators. It can loop the entire process back to the previous step (after some or all of the errors have been corrected). Important: preserve the data that is about to undergo cleanup (simplest: a zip archive with a date stamp; more elaborate: a version control system).
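
The simplest preservation option, a date-stamped zip archive, might look like the sketch below. The function name `snapshot` and the directory layout are assumptions made for illustration. The backup directory should lie outside the corpus directory so that the archive does not end up containing itself.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def snapshot(corpus_dir, backup_dir, today=None):
    """Zip the whole corpus directory into backup_dir with a date stamp in the name."""
    corpus_dir = Path(corpus_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = (today or date.today()).isoformat()  # e.g. 2024-01-31
    base = backup_dir / f"{corpus_dir.name}-{stamp}"
    # make_archive appends ".zip" and returns the path of the archive it wrote.
    return Path(shutil.make_archive(str(base), "zip", root_dir=corpus_dir))


# Demonstration in a throwaway directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus"
    corpus.mkdir()
    (corpus / "text1.xml").write_text("<TEI/>", encoding="utf-8")
    archive = snapshot(corpus, Path(tmp) / "backups", today=date(2024, 1, 31))
    print(archive.name)  # corpus-2024-01-31.zip
```

Note that a second snapshot on the same day overwrites the first; a time stamp or a version control system avoids this.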

For the project at hand, a TEI format that suits the markup requirements defined in the corpus composition step has to be chosen or created (the latter using the ODD language). If digitized data from other sources are to be re-used for corpus creation, they will very likely be available only in formats that differ from the TEI format selected for the project. External data may come in completely different formats or at least in different TEI dialects. In either case, the data have to be converted from their source formats into the TEI target format. Conversion may be carried out semi-automatically.
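
As an illustration of the automatic part of such a conversion, the sketch below wraps a plain-text source in a minimal TEI document, turning each blank-line-separated block into a `<p>` element. It assumes a very simple source format; the function name `text_to_tei` and the placeholder header statements are hypothetical, and a real project would target its own TEI schema and fill in a proper header.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def text_to_tei(title, plain_text):
    """Return a minimal TEI document (as a string) for a plain-text source."""
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace

    def el(parent, tag, text=None):
        node = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}")
        node.text = text
        return node

    tei = ET.Element(f"{{{TEI_NS}}}TEI")
    file_desc = el(el(tei, "teiHeader"), "fileDesc")
    el(el(file_desc, "titleStmt"), "title", title)
    # Placeholder statements; replace with real publication and source details.
    el(el(file_desc, "publicationStmt"), "p", "Unpublished conversion draft.")
    el(el(file_desc, "sourceDesc"), "p", "Converted from a plain-text source.")

    body = el(el(tei, "text"), "body")
    for block in plain_text.split("\n\n"):
        paragraph = " ".join(block.split())  # collapse line breaks and extra spaces
        if paragraph:
            el(body, "p", paragraph)
    return ET.tostring(tei, encoding="unicode")


print(text_to_tei("Sample", "First paragraph.\n\nSecond\nparagraph."))
```

The output of a converter like this is exactly what the following verification and cleanup step should check, ideally by validating against the project's schema.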

After the data have been gathered and homogenized into a single common format, they should be compiled into a proper corpus. This can be done simply by bundling the data into one repository, either a file directory or a GitHub repository.

The corpus texts, which already contain TEI text structuring, can be further enhanced with linguistic markup (e.g. information on tokens, lemmas, parts of speech, or the results of higher-level linguistic analysis). A wide range of NLP tools is available to perform linguistic analysis and tagging automatically. For example, the CLARIN infrastructure offers the WebLicht service, which allows data to be analysed with various NLP tools and analysis chains to be built. Its input and output format is the Text Corpus Format (TCF); for conversion between TEI and TCF, use the TEI2TCF-webservice. Another toolkit is Apache OpenNLP. Tools such as the Stanford Tokenizer or TreeTagger can be used for tokenization, POS tagging and more, and the corpus can be enriched with Named Entity Recognition annotation (e.g. using Apache Stanbol or Babelfy). Manual annotation can be performed in WebAnno, for example.
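
To show what inline token markup can look like in TEI, the sketch below splits the text of a paragraph into `<w>` (word) and `<pc>` (punctuation) elements. The regular-expression tokenizer is a deliberately naive stand-in for the tools named above, not a substitute for them, and it adds no lemma or part-of-speech information. It also assumes that the paragraph contains only text and no child elements; the function name `tokenize_paragraph` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Naive tokenizer: a run of word characters, or one punctuation character.
TOKEN = re.compile(r"(?P<word>\w+)|(?P<punct>[^\w\s])")


def tokenize_paragraph(p):
    """Replace the text of a TEI <p> element with <w> and <pc> child elements."""
    text = p.text or ""
    p.text = None
    for match in TOKEN.finditer(text):
        tag = "w" if match.group("word") else "pc"
        node = ET.SubElement(p, f"{{{TEI_NS}}}{tag}")
        node.text = match.group(0)
        # Keep a space after the token only where the source text had whitespace.
        if text[match.end():match.end() + 1].isspace():
            node.tail = " "


p = ET.Element(f"{{{TEI_NS}}}p")
p.text = "The corpus grows, slowly."
tokenize_paragraph(p)
print([(child.tag.split("}")[1], child.text) for child in p])
```

Preserving the original spacing in the element tails means the running text can still be read back from the annotated paragraph.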

Finalize corpus creation by making the corpus, its format, and its bibliographic information available to the community. Ensure that its publication meets the FAIR principles: the corpus should be findable (F), accessible (A), interoperable (I) and reusable (R).

Piotr Bański

Susanne Haaf

Klaus Illmayer

This scenario explains the steps to take in order to create a corpus based on the TEI tagset. The TEI guidelines have become a de facto standard for text annotation, providing solutions for a great variety of text and phrase structures, information on content types, linguistic information on words or phrases, etc. In many digital text collections and digital edition projects, annotation is based on the TEI. Linguistic corpora based on TEI may thus be re-used in projects of other disciplines, or may themselves benefit from the wide range of already existing resources.

Creation of a TEI-based corpus

Related items(1)

Creating an Interoperable TEI Annotation Schema

Workflow steps(9)

1 Corpus Composition

2 Verification and Cleanup

3 Conversion to TEI

4 Verification and Cleanup

5 Create Workbench

6 Verification and Cleanup

7 Linguistic Annotation

8 Verification and Cleanup

9 Finalize
