Semi-automated extraction of information from textual documents in a domain-specific repository

The workflow addresses the machine processing and enhancement of archaeological textual data (e.g., grey literature, reports, field forms) obtained either from text recognition of scanned images or from born-digital PDFs. The goal is an integrated solution that facilitates the processing of both legacy data and new uploads within the selected repository system. It aims to improve document search and processing while streamlining archival procedures by automating key steps in common archiving and annotation workflows.

Applying this workflow enables multiple downstream applications, including:

  • automatic linking of related documents,
  • quality assessment,
  • data extraction,
  • interlinking with other documents,
  • and automated abstract generation.

These capabilities contribute to the long-term preservation and accessibility of archaeological documentary archives. Furthermore, the workflow supports natural language processing (NLP) applications, enabling the creation of corpora that can be analysed with large language models (LLMs) or used to train them.

The workflow emphasizes the sustainable integration of existing tools that are available as services via APIs. It uses the outputs of these tools to meet specific user needs, particularly within data archiving and publishing workflows, so that the approach stays adaptable and scalable across diverse use cases.

The Archaeological Map of the Czech Republic (AMCR) repository serves as the demonstrator for the workflow.

Workflow steps (8)

  1. Selection of a target repository and meeting basic requirements

  2. Integration into the data processing workflows of the selected repository

  3. Data enhancement for improved search

  4. Keyword extraction

  5. Named entity recognition

  6. Named entity recognition: optional training for enhanced results

  7. Automated translations

  8. Integration into user workflows
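
The enrichment steps listed above (keyword extraction, named entity recognition) can be pictured as a chain of interchangeable functions, each adding metadata to a document record. The following is only a minimal sketch under stated assumptions, not the workflow's actual implementation: the `Document` class, the `STOPWORDS` and `GAZETTEER` tables, and all function names are hypothetical, and the two step functions are naive local stand-ins for the external API services that the workflow description refers to.

```python
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """Hypothetical record for one repository document."""
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)


# A pipeline step takes a document and returns it with added metadata.
Step = Callable[[Document], Document]

# Illustrative tables only; a real service would supply these capabilities.
STOPWORDS = {"a", "an", "and", "at", "in", "of", "on", "the", "to", "was"}
GAZETTEER = {"Prague": "PLACE", "Bronze Age": "PERIOD"}


def extract_keywords(doc: Document, top_n: int = 5) -> Document:
    """Stand-in keyword extractor: most frequent non-stopword tokens."""
    tokens = [t for t in re.findall(r"[a-z]+", doc.text.lower())
              if t not in STOPWORDS]
    doc.metadata["keywords"] = [w for w, _ in Counter(tokens).most_common(top_n)]
    return doc


def recognise_entities(doc: Document) -> Document:
    """Stand-in entity recogniser: exact lookup in a tiny gazetteer."""
    doc.metadata["entities"] = [(term, label)
                                for term, label in GAZETTEER.items()
                                if term in doc.text]
    return doc


def run_pipeline(doc: Document, steps: List[Step]) -> Document:
    """Apply each enrichment step in order and return the enriched document."""
    for step in steps:
        doc = step(doc)
    return doc


if __name__ == "__main__":
    sample = Document(
        doc_id="demo-1",
        text=("Excavation in Prague uncovered a Bronze Age pit. "
              "The pit contained pottery; pottery sherds filled the pit."),
    )
    enriched = run_pipeline(sample, [extract_keywords, recognise_entities])
    print(enriched.metadata)
```

Keeping every step behind the same function signature means a stand-in can later be swapped for a call to a remote service, or a translation step appended, without changing `run_pipeline`.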

The SSH Open Marketplace is maintained and will be further developed by three European Research Infrastructures - DARIAH, CLARIN and CESSDA - and their national partners. It was developed as part of the "Social Sciences and Humanities Open Cloud" SSHOC project, European Union's Horizon 2020 project call H2020-INFRAEOSC-04-2018, grant agreement #823782.
