Preparing minority or endangered language corpora for annotation in SIL FLEx

Depending on the nature of the source data, carry out one or more of the steps listed below. If the source material is an audio file, start at step 1. If it is an older transcript of an audio file, start at step 3. If it is a manuscript, start at step 5 (HTR). If it is a printed book, start at step 6 (OCR). The intended result:

a .txt file that follows the SIL FLEx/FieldWorks Standard Format Interlinear conventions
(if ELAN is involved) a .flextext XML file
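As a rough illustration of the first target format, the sketch below writes sentences as backslash-marked Standard Format records. The marker names (`\id`, `\ref`, `\tx`, `\ft`), the `Sentence` class and the `to_standard_format` function are assumptions made for this example and are not prescribed by the workflow; the markers actually used should match the mapping chosen when the file is imported into FLEx.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    """One unit of text; field names are hypothetical."""
    ref: str               # sentence identifier, e.g. "001"
    text: str              # baseline transcription
    translation: str = ""  # optional free translation


def to_standard_format(title, sentences):
    """Render sentences as backslash-marked records, one field per line.

    The markers \\id, \\ref, \\tx and \\ft are assumed here; adjust them
    to the marker mapping used for the actual import.
    """
    lines = [f"\\id {title}"]
    for s in sentences:
        lines.append("")  # blank line between records for readability
        lines.append(f"\\ref {s.ref}")
        # Collapse internal whitespace so each field stays on one line.
        lines.append(f"\\tx {' '.join(s.text.split())}")
        if s.translation:
            lines.append(f"\\ft {' '.join(s.translation.split())}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = [Sentence("001", "first sentence", "First sentence.")]
    # Write as UTF-8 so non-ASCII characters survive the import.
    with open("demo_text.txt", "w", encoding="utf-8") as fh:
        fh.write(to_standard_format("Demo text", demo))
```

Keeping one field per line and saving as UTF-8 is a design choice of this sketch; it makes the file easy to inspect by eye before import.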

The resulting data is to be imported into SIL FLEx for further morpheme-level analysis. This workflow was designed during the workshop "Creating, Managing and Archiving Textual Corpora in Under-resourced Languages". The workshop was conceived by the DARIAH Working Groups Research Data Management and Multilingual DH, financed by the DARIAH-EU Funding Scheme for Working Group Activities 2023-25, and hosted by the University of Hamburg from 28 to 30 August 2024.

Details

Properties

Access

License: Creative Commons Attribution 4.0 International

Categorisation

Activity: Annotating
Keyword: corpus building
Discipline: Corpus linguistics
Language: English

Context

See also: https://gitlab-ce.rrz.uni-hamburg.de/uahh-digitale-dienste/creating-managing-and-archiving-textual-corpora/

Actors

Author: Aleksandr Riaposov (University of Hamburg)
Elena Lazarenko (University of Hamburg)
Alexandre Arkhipov (University of Hamburg)
Francesco Gelati (University of Hamburg)
Jonas Müller-Laackmann
Gardy Stein (University of Hamburg)
Cristina Vertan
Péter Király
Shih-Pei Chen
Calvin Yeh
Alexander König (CLARIN ERIC)
Till Grallert
Maroussia Bednarkiewicz
Nanette Rißler-Pipka (Max Weber Foundation, DARIAH-DE)
Monika Xenia Kudela
Alíz Horváth
Alessia Spadi
Françoise Gouzi
Georgios Vardakis
Martin Wynne
Merve Tekgürler
Femmy Admiraal
Emiliano Degl'Innocenti (CNR-OVI)
Giorgio Maria Di Nunzio
Duncan Paterson
Funder: DARIAH-EU, https://www.dariah.eu/

Workflow steps (8)

1 Clean/improve audio file quality
2 Transcribe the audio
3 Align the transcript
4 Export the .flextext file from ELAN
5 HTR (Handwritten Text Recognition)
6 OCR (Optical Character Recognition)
7 Format the data
8 Unify the spelling for the main tier
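The last step, unifying the spelling of the main tier, can be approximated by Unicode normalization plus a small table of variant characters. The sketch below is only one possible approach under that assumption; the `VARIANTS` table and the `unify_spelling` function are hypothetical, and the real replacement rules depend on the orthography agreed for the language in question.

```python
import unicodedata

# Hypothetical variant table: map look-alike characters to one agreed form.
VARIANTS = {
    "\u2019": "'",  # right single quotation mark -> ASCII apostrophe
    "\u02bc": "'",  # modifier letter apostrophe -> ASCII apostrophe
}


def unify_spelling(line, variants=VARIANTS):
    """Return the line in NFC form with variant characters replaced.

    NFC composes base letters and combining marks where a precomposed
    character exists, so visually identical strings compare equal.
    """
    text = unicodedata.normalize("NFC", line)
    for old, new in variants.items():
        text = text.replace(old, new)
    # Collapse runs of whitespace left over from transcription or OCR.
    return " ".join(text.split())


if __name__ == "__main__":
    raw = "e\u0301  ta\u2019"          # "e" + combining acute, double space
    print(unify_spelling(raw) == "\u00e9 ta'")
```

Whether NFC or NFD is the better target form is a project decision; what matters for later analysis is that one form is applied consistently across the whole corpus.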
