Social Sciences & Humanities Open Marketplace

Check whether a corpus already exists that contains your source material or that can be used to answer your research question(s). If so, consider using it instead of creating a new corpus.

Identify all source materials for the corpus and their associated metadata. Before undertaking digitization, take special care to verify whether your textual source material is already available in a machine-readable format.

Check the ethical and legal clearance for the source materials that you intend to use. Ensure that you are permitted to copy the data and use it in your research, and that you will be allowed to share the materials. If you can use the materials for your own research but cannot share the enhanced corpus based on them, the value of your work is diminished: your research becomes harder to verify and reproduce, and it does not contribute to open science.
Document any licences or other information that apply to the material.

Document how the data is processed (in a data management plan, DMP) and describe the source material (metadata). Choose a simple and consistent format for this documentation, e.g. CSV or TXT.
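As an illustration of a simple and consistent metadata format, the following minimal sketch writes one CSV row per source document using only the Python standard library. The field names (`file_name`, `title`, `language`, `source`, `licence`) and the sample record are assumptions made for this example, not a schema prescribed by the workflow.

```python
import csv
import io

# Hypothetical metadata fields; adapt them to your own project.
FIELDS = ["file_name", "title", "language", "source", "licence"]


def write_metadata(records, handle):
    """Write a header row followed by one CSV row per source document."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        writer.writerow(record)


# Invented sample record; "und" is the ISO 639 code for an undetermined language.
records = [
    {
        "file_name": "doc001.txt",
        "title": "Example text",
        "language": "und",
        "source": "private collection",
        "licence": "CC BY 4.0",
    },
]

# An in-memory buffer stands in for a real file such as metadata.csv.
buffer = io.StringIO()
write_metadata(records, buffer)
metadata_csv = buffer.getvalue()
print(metadata_csv)
```

Keeping the list of fields in one place makes it harder for individual records to drift away from the agreed structure: `csv.DictWriter` raises an error if a record contains a field that is not listed.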

If your sources are not already available in a machine-readable format, you might need to digitize them, using for example OCR (Optical Character Recognition), HTR (Handwritten Text Recognition) or ASR (Automatic Speech Recognition). The output of this step should be all of your materials in digital form, available for you to copy, transform and use.

Make sure to use a character set that is as standard as possible, such as Unicode. Be aware that not every script, and not every character of every script, has been encoded in Unicode, which can lead to OCR errors. Be prepared to make manual corrections.
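A minimal sketch of how such checks might be automated with Python's standard `unicodedata` module is given here. It normalizes text to Unicode Normalization Form C and flags characters that often point to encoding or OCR problems. The choice of NFC and the list of flagged categories (the replacement character U+FFFD, private-use and unassigned code points) are assumptions for illustration; your script or project may call for different rules.

```python
import unicodedata


def normalize_text(text):
    """Return the text in Unicode Normalization Form C (NFC)."""
    return unicodedata.normalize("NFC", text)


def suspicious_characters(text):
    """Return (position, code point) pairs for characters worth checking by hand.

    Flags the replacement character U+FFFD, private-use code points
    (category Co) and unassigned code points (category Cn).
    """
    flagged = []
    for position, char in enumerate(text):
        category = unicodedata.category(char)
        if char == "\ufffd" or category in ("Co", "Cn"):
            flagged.append((position, f"U+{ord(char):04X}"))
    return flagged


# Invented sample: a decomposed "e" + combining acute accent, a replacement
# character and a private-use character.
sample = "cafe\u0301 \ufffd \ue000"
clean = normalize_text(sample)
flagged = suspicious_characters(clean)
print(flagged)
```

The flagged positions are only candidates for manual review: a private-use character may be a deliberate workaround for a character that Unicode does not yet encode, in which case it should be documented and kept.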

Bring all of your digital materials into a common form, with the same file format, character encoding, metadata encoding, markup, etc. Check the integrity of your texts (make sure they are complete) and remove unwanted material. Keep precise documentation of what you remove and of anything that is missing. The corpus doesn't have to be perfect to be useful, but its users will need to know what is and what is not in it.
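One way to keep that documentation is to have the cleaning script record every removal as it happens. The sketch below is a hypothetical example: it assumes that lines consisting only of digits are stray page numbers, removes them, and returns a log of what was removed and why. The rule and the log structure are illustrative assumptions, not part of the workflow itself.

```python
import re

# Hypothetical rule: a line containing only digits is treated as a page number.
PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")


def clean_text(text):
    """Remove page-number lines and return (cleaned text, removal log)."""
    kept = []
    removed = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if PAGE_NUMBER.match(line):
            # Record the original line number, content and reason for removal.
            removed.append(
                {"line": line_number, "content": line, "reason": "page number"}
            )
        else:
            kept.append(line)
    return "\n".join(kept), removed


# Invented sample text with a stray page number on its second line.
raw = "First paragraph.\n12\nSecond paragraph."
cleaned, removal_log = clean_text(raw)
print(cleaned)
print(removal_log)
```

Saving the removal log alongside the cleaned files gives later users of the corpus a record of what is no longer in it.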

Make an archival package including the textual data and the metadata. The goal of your project might be to create a version of the corpus in a specialized format for a particular application, and maybe with further levels of annotation, but it is important to include a step where you capture and archive the basic textual data.
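As a rough sketch of what such a package could look like, the following example zips a corpus directory together with a checksum manifest, using only the Python standard library. The directory layout, the file names and the manifest name `manifest-sha256.txt` are assumptions for illustration; a repository or archive that you deposit with may require a different packaging format.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def build_package(source_dir, package_path):
    """Zip every file under source_dir and add a SHA-256 manifest to the archive."""
    source_dir = Path(source_dir)
    manifest_lines = []
    with zipfile.ZipFile(package_path, "w", zipfile.ZIP_DEFLATED) as package:
        for path in sorted(p for p in source_dir.rglob("*") if p.is_file()):
            relative = path.relative_to(source_dir).as_posix()
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest_lines.append(f"{digest}  {relative}")
            package.write(path, arcname=relative)
        # The manifest lets anyone verify later that no file has changed.
        package.writestr("manifest-sha256.txt", "\n".join(manifest_lines) + "\n")
    return manifest_lines


# Demonstration with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus"
    (corpus / "texts").mkdir(parents=True)
    (corpus / "texts" / "doc001.txt").write_text("Example text.\n", encoding="utf-8")
    (corpus / "metadata.csv").write_text(
        "file_name,language\ndoc001.txt,und\n", encoding="utf-8"
    )
    package_file = Path(tmp) / "corpus-archive.zip"
    manifest = build_package(corpus, package_file)
    with zipfile.ZipFile(package_file) as package:
        packaged_names = sorted(package.namelist())

print(packaged_names)
```

Packaging the plain texts and the metadata together, before any application-specific conversion, means the basic data can still be recovered if a specialized format later becomes obsolete.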

You are now ready to go to the Corpus Management Workflow! Further steps such as adding annotation to the corpus and creating annotated versions in different formats are covered there.

Martin Wynne

Alíz Horváth

Maroussia Bednarkiewicz

Péter Király

Cristina Vertan

Shih-Pei Chen

Calvin Yeh

Alexander König

Aleksandr Riaposov

Alexandre Arkhipov

Jonas Müller-Laackman

Gardy Stein

Francesco Gelati

Elena Lazarenko

Till Grallert

Nanette Rißler-Pipka

Monika Xenia Kudela

Merve Tekgürler

Femmy Admiraal

Emiliano Degl'Innocenti

Giorgio Maria Di Nunzio

Duncan Paterson

DARIAH-EU

Alessia Spadi

Françoise Gouzi

Georgios Vardakis

How to build textual corpora for research into under-resourced languages.
This workflow was designed during the workshop "Creating Managing and Archiving Textual Corpora in Under-resourced Languages". The workshop was conceived by the DARIAH Working Groups [Research Data Management](https://www.dariah.eu/activities/working-groups/research-data-management/) and [Multilingual DH](https://www.dariah.eu/activities/working-groups/multilingual-dh/), financed by the DARIAH-EU Funding Scheme for Working Group Activities 2023-25, and hosted by the University of Hamburg from 28 to 30 August 2024.

Building Textual Corpora for Under-resourced Languages

Related items (2)

Managing Textual Corpora in under-resourced languages

Archiving Textual Corpora

Workflow steps (9)

1 Data discovery 1

2 Data discovery 2

3 Ethical and legal considerations

4 Metadata documentation

5 Digitization

6 Character encoding

7 Data cleaning

8 Create archival package

9 Curate your corpus
