
Ontology-driven Information Extraction from Text
Description
This workflow provides an automated pipeline for information extraction from textual data. It employs machine-learning (ML) and rule-based methods to extract entities, relations, and metadata from published text. The entire process is ontology-driven: all semantic definitions of the populated entity and relationship types are provided by an ontology specifically designed for the task at hand, which can also serve as the schema of a knowledge graph comprising the extracted, interrelated entities.
Use case implementation
The workflow is implemented for a specific use case: extracting information about scholarly work from research publications. Specifically, as part of the ATRIUM Project, the workflow employs Deep Learning (DL) methods to automatically extract textual spans denoting research activities and their steps, as well as expressions describing either the intention (goal) of an activity or the way (method) it was carried out. These entities, together with additional information extracted from article metadata (author keywords, publication information) and the semantic relationships among them, provide the building blocks for a knowledge graph describing scholarly work processes. All semantic definitions of the populated entity and relationship types are provided by the Scholarly Ontology (SO), a CIDOC-CRM-compatible conceptual framework specifically designed for documenting scholarly work.
Goals
- Extract textual spans representing instances of the Scholarly Ontology (SO) classes: Activity (a scholarly process such as an archaeological excavation, a social study, or steps thereof), Goal (a research objective of an activity, denoting why it was conducted), and Method (a procedure, plan, or technique employed by an activity, denoting how it was conducted).
- Disambiguate and link the extracted entities to external reference resources (such as Wikipedia) when possible.
- Interrelate the extracted entities using properties and relationships provided by the SO, such as employs(Activity, Method) and hasObjective(Activity, Goal).
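The last goal can be illustrated as emitting subject-property-object triples. This is a minimal sketch: the property names employs and hasObjective come from the SO as cited above, but the namespace URI is a placeholder, not the official SO namespace.

```python
# Placeholder namespace; substitute the official Scholarly Ontology URI.
SO = "https://example.org/scholarly-ontology#"

def employs(activity, method):
    """Triple stating that an Activity employs a Method."""
    return (activity, SO + "employs", method)

def has_objective(activity, goal):
    """Triple stating that an Activity has a Goal as its objective."""
    return (activity, SO + "hasObjective", goal)
```

In a full implementation these triples would typically be added to an RDF graph (e.g. with rdflib) so the result can be queried as a knowledge graph.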
Technical requirements
- Fine-tuned Deep Learning (DL) models using the spaCy NLP framework. For demonstration purposes, the workflow uses pretrained Transformer models, downloaded from the Hugging Face Hub, that are already fine-tuned to recognise textual spans representing research activities along with their goals and research methods. Other models can be used interchangeably as long as they are compatible with the spaCy pipeline.
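Span extraction with such a pipeline might look like the sketch below. The model name "en_atrium_trf" and the label set ACTIVITY / GOAL / METHOD are assumptions for illustration; substitute whatever fine-tuned spaCy pipeline and labels the deployment actually uses.

```python
from collections import defaultdict

# Hypothetical labels produced by the fine-tuned model.
TARGET_LABELS = {"ACTIVITY", "GOAL", "METHOD"}

def group_spans(ents):
    """Group (text, label) pairs by label, keeping only target labels."""
    grouped = defaultdict(list)
    for text, label in ents:
        if label in TARGET_LABELS:
            grouped[label].append(text)
    return dict(grouped)

def extract_entities(text, model="en_atrium_trf"):
    """Run a fine-tuned spaCy pipeline and collect the labelled spans."""
    import spacy  # requires spaCy and the fine-tuned model to be installed
    nlp = spacy.load(model)
    doc = nlp(text)
    return group_spans((ent.text, ent.label_) for ent in doc.ents)
```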
- A framework for entity disambiguation and linking. The workflow uses the Zshot framework to disambiguate the extracted method names by linking them to their corresponding Wikipedia entities. In addition, the ORCID API is used to link author names to their ORCID records (ID, email, affiliations, and full name) when possible, and the Wikipedia, Wikidata, and DBpedia APIs are queried to retrieve each method's description, proper and alternate names (aliases), and corresponding URLs.
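The linking queries can be sketched as URL builders against the public Wikidata and ORCID endpoints. This only constructs the request URLs; the actual HTTP calls, response parsing, and the Zshot integration are left out, and the exact parameters a deployment needs may differ.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"
ORCID_API = "https://pub.orcid.org/v3.0"

def wikidata_search_url(method_name, language="en"):
    """URL searching Wikidata entities (labels and aliases) for a method name."""
    params = {
        "action": "wbsearchentities",
        "search": method_name,
        "language": language,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"

def orcid_search_url(author_name):
    """URL searching the public ORCID registry for an author name."""
    return f"{ORCID_API}/search/?{urlencode({'q': author_name})}"
```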
Workflow steps (4)
1 Entity Extraction
2 Entity Disambiguation
3 Entity Linking
4 Relation Extraction
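The four steps above compose into a simple pipeline. This is a structural sketch only: each step is passed in as a function standing in for the components described earlier (spaCy extraction, Zshot disambiguation, API-based linking, relation extraction).

```python
def run_pipeline(text, extract, disambiguate, link, relate):
    """Chain the four workflow steps over one input document."""
    entities = extract(text)        # 1. Entity Extraction
    entities = disambiguate(entities)  # 2. Entity Disambiguation
    entities = link(entities)       # 3. Entity Linking
    return relate(entities)         # 4. Relation Extraction
```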


