Workflow

Preparing minority or endangered language corpora for annotation in SIL FLEx

Depending on the nature of the source data, follow one or more of the steps below. If the source material is an audio file, start at step 1. If it is an older transcript of an audio file, start at step 3. If it is a manuscript, start at step 5 (HTR). If it is a printed book, start at step 6 (OCR). The intended result:

  • a .txt file conforming to the SIL FLEx/FieldWorks Standard Format Interlinear
  • (if ELAN is involved) a .flextext XML

The resulting data is then imported into SIL FLEx for further morpheme-level analysis. This workflow was designed during the workshop "Creating, Managing and Archiving Textual Corpora in Under-resourced Languages". The workshop was conceived by the DARIAH Working Groups Research Data Management and Multilingual DH, financed by the DARIAH-EU Funding Scheme for Working Group Activities 2023-25, and hosted by the University of Hamburg from 28 to 30 August 2024.

Workflow steps (8)

  1. Clean/improve audio file quality

  2. Transcribe the audio

  3. Align the transcript

  4. Export the .flextext file from ELAN

  5. HTR (Handwritten Text Recognition)

  6. OCR (Optical Character Recognition)

  7. Format the data

  8. Unify the spelling for the main tier
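
As a rough illustration of the last two steps, the sketch below normalises the spelling of each segment and writes it out as backslash-coded Standard Format lines. It is a minimal sketch under stated assumptions, not the workshop's own tooling: the marker names (ref, tx, ft), the segment dictionary layout, and the function names are hypothetical, and the actual markers should match whatever mapping is chosen during the FLEx import. Unicode NFC normalisation stands in here for "unifying the spelling"; a real project may need language-specific replacement rules as well.

```python
import unicodedata

# Hypothetical marker names. FLEx lets you map markers during import,
# so these are placeholders rather than a required set.
MARKERS = {"ref": "ref", "text": "tx", "translation": "ft"}


def normalise(text: str) -> str:
    """Apply Unicode NFC and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def to_sfm(segments):
    """Render a list of segment dicts as backslash-coded Standard Format text."""
    blocks = []
    for seg in segments:
        lines = []
        for field, marker in MARKERS.items():
            value = seg.get(field)
            if value:
                lines.append(f"\\{marker} {normalise(value)}")
        blocks.append("\n".join(lines))
    # Blank line between segments, single trailing newline at the end.
    return "\n\n".join(blocks) + "\n"


# Placeholder segment (not real language data): the main tier contains a
# decomposed "e" + combining acute accent and a doubled space.
segments = [
    {"ref": "T1.001", "text": "e\u0301xample  line", "translation": "placeholder"},
]
print(to_sfm(segments))
```

Saving the returned string with UTF-8 encoding yields the .txt file; keeping normalisation in its own function makes it easy to swap in project-specific spelling rules later.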


The SSH Open Marketplace is maintained and will be further developed by three European Research Infrastructures - DARIAH, CLARIN and CESSDA - and their national partners. It was developed as part of the "Social Sciences and Humanities Open Cloud" SSHOC project, European Union's Horizon 2020 project call H2020-INFRAEOSC-04-2018, grant agreement #823782.
