Workflow

Building Textual Corpora for Under-resourced Languages

This workflow describes how to build textual corpora for research into under-resourced languages. It was designed during the workshop "Creating, Managing and Archiving Textual Corpora in Under-resourced Languages". The workshop was conceived by the DARIAH Working Groups Research Data Management and Multilingual DH, financed by the DARIAH-EU Funding Scheme for Working Group Activities 2023-25, and hosted by the University of Hamburg from 28 to 30 August 2024.

Workflow steps (9)

  1. Data discovery 1

  2. Data discovery 2

  3. Ethical and legal considerations

  4. Metadata documentation

  5. Digitization

  6. Character encoding

  7. Data cleaning

  8. Create archival package

  9. Curate your corpus

The SSH Open Marketplace is maintained and will be further developed by three European Research Infrastructures (DARIAH, CLARIN and CESSDA) and their national partners. It was developed as part of the "Social Sciences and Humanities Open Cloud" (SSHOC) project, under the European Union's Horizon 2020 project call H2020-INFRAEOSC-04-2018, grant agreement #823782.
