
Social Sciences & Humanities Open Marketplace


Bibliographical data is not primarily produced with data scientists or quantitative analysis in mind. It is therefore crucial for the researcher to understand the data composition first and to be aware of the peculiarities of the format, both in general and as applied to the particular dataset at hand. As a connoisseur of bibliographical data put it: "There are only two kinds of people who believe themselves able to read a MARC record without referring to a stack of manuals: a handful of our top catalogers and those on serious drugs." ([Roy Tennant: MARC must die. Library Journal, October 15, 2002](https://www.libraryjournal.com/story/marc-must-die))

Catalogue data are normally available through an array of channels, the most frequent being data dumps and repositories, the OAI-PMH and Z39.50 protocols, REST APIs, and FTP (see [Library APIs](http://pkiraly.github.io/2023/10/04/library-apis/)).
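As an illustration of one of these channels, here is a minimal sketch of paging through an OAI-PMH `ListRecords` response. The base URL `https://example.org/oai`, the `marcxml` metadata prefix and the sample response are assumptions made for the example. Real repositories advertise their own endpoints and supported prefixes, so check the provider's documentation first. The sketch parses a canned response rather than calling a live server.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample response, trimmed to the elements the sketch reads
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def build_list_records_url(base_url, metadata_prefix="marcxml", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # A resumption token is sent on its own with the verb, without metadataPrefix
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_list_records(xml_text):
    """Return the record identifiers on this page and the token for the next page, if any."""
    root = ET.fromstring(xml_text)
    identifiers = [
        element.text
        for element in root.iterfind(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_element = root.find(".//oai:resumptionToken", OAI_NS)
    has_token = token_element is not None and token_element.text
    token = token_element.text.strip() if has_token else None
    return identifiers, token


identifiers, next_token = parse_list_records(SAMPLE_RESPONSE)
next_url = build_list_records_url("https://example.org/oai", resumption_token=next_token)
print(identifiers, next_url)
```

In a real harvest the request/parse pair would run in a loop until no resumption token is returned, ideally with polite delays between requests.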

When acquiring the data, the researcher needs to take into account both the purpose of the research and the legal prerequisites for using the dataset. While some datasets are CC0-licensed and the researcher is free to reuse the data as well as publish the processed dataset, other licences are more restrictive and may pose a major obstacle to both processing the data and disseminating the research results.

The researcher also needs to consider whether the available data format is suitable for further processing and whether they have the tools to open and process the data.

This step consists of three complementary tasks: primary validation, conversion and sanity check.

First, the researcher considers the technical quality of the dataset and whether the format is suitable for further processing. It is essential that the dataset is not corrupt and contains what it is claimed to contain.

In the conversion part, the researcher should choose a suitable target format to work with. Examples of suitable formats are a CSV table, JSON, XML, a spreadsheet file (e.g. Excel) or another format that the researcher is familiar with and has the tools and capacity to work with. During the conversion, the researcher may also decide to select only those parts of the dataset that they deem necessary for their research project.
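A partial conversion of this kind might look like the following minimal sketch, which reads MARCXML and writes a CSV table with three selected columns. The choice of fields (100$a, 245$a, 260$c), the column names and the sample records are assumptions for illustration. The sketch keeps only the first occurrence of each field, which a real project would need to reconsider for repeatable fields.

```python
import csv
import io
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# (MARC tag, subfield code) -> column name; an illustrative selection
SELECTED = {("100", "a"): "author", ("245", "a"): "title", ("260", "c"): "pub_date"}

# Invented sample records
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="100" ind1="1" ind2=" "><subfield code="a">Doe, Jane</subfield></datafield>
    <datafield tag="245" ind1="1" ind2="0"><subfield code="a">An invented title</subfield></datafield>
    <datafield tag="260" ind1=" " ind2=" "><subfield code="c">1789</subfield></datafield>
  </record>
  <record>
    <datafield tag="245" ind1="0" ind2="0"><subfield code="a">Another invented title</subfield></datafield>
  </record>
</collection>"""


def marcxml_to_rows(xml_text):
    """Extract the selected subfields from each record; missing values become ''."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iterfind("marc:record", MARC_NS):
        row = {column: "" for column in SELECTED.values()}
        for (tag, code), column in SELECTED.items():
            path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
            subfield = record.find(path, MARC_NS)  # first occurrence only
            if subfield is not None and subfield.text:
                row[column] = subfield.text.strip()
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(SELECTED.values()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = marcxml_to_rows(SAMPLE)
print(rows_to_csv(rows))
```

Keeping the field selection in one mapping makes the scope of the partial conversion explicit and easy to document alongside the dataset.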

After the conversion, the researcher needs to run a preliminary sanity check of the data. The researcher should examine the quality of the data and consider whether the data are suitable for the research purpose. If not, the researcher should consider a different dataset. The sanity check should also show whether any data were lost in the conversion. This is especially true if the researcher opted for a partial conversion.
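A first sanity check can be as simple as comparing record counts before and after conversion and measuring how often each column is empty. The sketch below assumes the converted data is a list of dictionaries and that the number of source records is known from elsewhere (for example from the data provider's documentation). The function name, column names and sample rows are hypothetical.

```python
def sanity_report(source_count, rows, columns):
    """Compare record counts and compute the share of empty values per column."""
    converted_count = len(rows)
    empty_share = {}
    for column in columns:
        empty = sum(1 for row in rows if not (row.get(column) or "").strip())
        empty_share[column] = empty / converted_count if converted_count else 0.0
    return {
        "source_records": source_count,
        "converted_records": converted_count,
        "records_lost": source_count - converted_count,
        "empty_share": empty_share,
    }


# Invented converted rows; the source is assumed to have held four records
rows = [
    {"title": "An invented title", "pub_date": "1789"},
    {"title": "Another invented title", "pub_date": ""},
    {"title": "A third invented title", "pub_date": "1801"},
]
report = sanity_report(source_count=4, rows=rows, columns=["title", "pub_date"])
print(report)
```

A non-zero `records_lost` or an unexpectedly high empty share for a key column is a prompt to revisit the conversion before going further.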


Available bibliographic metadata is seldom readily amenable to quantitative analysis. Biases, inaccuracies, and gaps hinder productive research use of bibliographic metadata collections. Varying standards, conventions and languages pose challenges for data integration. The purpose of harmonisation is to turn the catalogue information into a dataset that can be used in quantitative humanities research.

It is important to understand that for library catalogue data we can use largely identical algorithms across all metadata collections. Examples of harmonisation include removing spelling errors, disambiguating and standardising terms, augmenting missing values, and developing custom algorithms that can, for instance, convert the raw MARC notation to numerical page count estimates. In data harmonisation we often have to deal with the challenges posed by different languages, especially if we want to work across catalogues from different countries. For example, identifying translations is a major research case and requires particular attention.
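To make the page count example concrete, here is a deliberately simplified sketch that sums the page sequences in an extent statement such as `xii, 345 p.`. It is not the algorithm used by any particular project. The regular expression, the function names and the decision to return `None` for unrecognised notation are assumptions of this illustration, and real catalogue data contains many more variants (plates, leaves, volumes, foliation).

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

# One page sequence: an Arabic or Roman number, optionally bracketed,
# optionally followed by "p." or "pp."
PAGE_SEQUENCE = re.compile(r"^\[?(\d+|[ivxlcdm]+)\]?(?:\s*pp?\.?)?$", re.IGNORECASE)


def roman_to_int(numeral):
    """Convert a Roman numeral by scanning right to left (subtractive notation)."""
    total = 0
    previous = 0
    for char in reversed(numeral.lower()):
        value = ROMAN_VALUES[char]
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total


def estimate_pages(extent):
    """Sum the page sequences in an extent statement, or return None if unsure."""
    # Keep only the pagination part, before illustrations (":") or dimensions (";")
    pagination = re.split(r"[:;]", extent, maxsplit=1)[0]
    total = 0
    for part in pagination.split(","):
        match = PAGE_SEQUENCE.match(part.strip())
        if match is None:
            return None  # unrecognised notation: flag for review instead of guessing
        token = match.group(1)
        total += int(token) if token.isdigit() else roman_to_int(token)
    return total


for extent in ["xii, 345 p. : ill. ; 20 cm", "[8], 120 p.", "2 v."]:
    print(extent, "->", estimate_pages(extent))
```

Returning `None` rather than a guess keeps the unresolved cases visible, so that the rules can be extended iteratively as new notation patterns are found.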

For harmonisation, it is common to use external data sources on authors, publishers, and places to enrich and verify bibliographic information. Automation, scalability, and quality control are critical, as the data collections may contain information on millions of documents. It is important to understand that harmonisation is an iterative process that combines the harmonisation itself, analysis of the data, and validation. In the end, improved understanding often leads to enhancements in data harmonisation and validation that can be incorporated in the automated processing steps.
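As a small illustration of enrichment against an external source, the sketch below maps raw place-of-publication strings to canonical names through a lookup table. The table here is hand-written and hypothetical. In practice it would be derived from an authority file or gazetteer, and the normalisation rules would be tuned to the catalogue at hand.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical variant table; a real one would come from an authority file
PLACE_VARIANTS = {
    "londini": "London",
    "london": "London",
    "lugduni batavorum": "Leiden",
    "leyden": "Leiden",
    "aboae": "Turku",
    "abo": "Turku",
}


def normalise_key(raw):
    """Lower-case, strip diacritics, brackets and punctuation, collapse spaces."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z ]", " ", text.lower())
    return " ".join(text.split())


def harmonise_place(raw):
    """Return the canonical place name, or None if the variant is unknown."""
    return PLACE_VARIANTS.get(normalise_key(raw))


def unresolved_places(raw_values):
    """Count the variants the table cannot resolve, most frequent first."""
    counts = Counter(normalise_key(v) for v in raw_values if harmonise_place(v) is None)
    return counts.most_common()


raw_places = ["[Londini] :", "Åbo,", "Lugduni Batavorum", "Paris", "Paris :"]
print([harmonise_place(place) for place in raw_places])
print(unresolved_places(raw_places))
```

Listing the unresolved variants by frequency supports the iterative loop described above: the most common gaps are fixed first, and the table grows with each pass.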

Ideally, such harmonisation and validation efforts are transparent both in terms of data and source code. In contrast to code availability, many of the most comprehensive bibliographic metadata collections are not yet generally available as open data, however, and they may be difficult to obtain even for research purposes. The lack of open data availability forms a major bottleneck for transparent and collaborative development of bibliographic data science.

**How** (option 1): In Bibliographic Data Science, thorough data harmonisation is essential. Researchers must be able to retrace their steps and ensure that the harmonisation workflow can be reproduced by anyone with access to both the data and the codebase. This methodology can be termed an “algorithmic harmonisation process”, in which all the steps are managed through a dedicated end-to-end pipeline. A practical example of this can be found here: https://github.com/COMHIS/fennica.

**How** (option 2): While the algorithmic approach is preferable, there are also less rigorous practices and various existing tools that can be used to transform library catalogues into research data. For instance, OpenRefine is frequently employed for data harmonisation. What truly matters is the data's quality in relation to the research objectives. When seeking a broad overview of a catalogue, a less harmonised dataset might suffice. However, for many specialised use cases (such as understanding the role of Swedish publications in Finnish book history in the early twentieth century), there is a greater need for comprehensive data harmonisation across various aspects of the dataset. Reproducibility in research is crucial, so when using ready-made tools that may contain black-box elements, the researcher needs to document their steps carefully.



In the analysis stage, the wealth of available metadata catalogues is an invaluable resource for tracing the diffusion of knowledge throughout history, particularly during the early modern period and beyond. With relative ease we can, for example, use bibliographic data to study publishing networks and vernacularisation processes, or take a statistical perspective on authors' relative popularity during a particular era. One of the foremost challenges is the quality, comprehensiveness, and interpretation of the data itself. Ensuring that the information captured in these catalogues is accurate and complete is a perpetual task for researchers. Moreover, deciphering the nuances and historical context embedded within these records requires careful consideration. These collections are not mere repositories of data but reflections of intricate historical processes. Consequently, they inherently carry biases, inaccuracies, and complexities that must be navigated. The term "bibliographic data science" captures this emphasis on treating bibliographic data as a robust foundation for quantitative research, with the aim of data reliability, completeness, and a deeper understanding of what these catalogues encode.

**How**: There are many ways to analyse bibliographic data retrieved from library catalogues. Such data enable distant reading and the tracing of major trends in the historical record, for example. This can be done algorithmically or with ready-made tools that can, for instance, visualise library data. It is important to understand that the development of use cases also feeds back into harmonisation and validation, so that these steps form an iterative relationship. Serious analysis cases make it possible to notice biases and data quality problems, which leads back to further questions of harmonisation and development of the bibliographic data.
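A very basic example of tracing a trend algorithmically is counting titles per decade. The sketch below assumes harmonised records in which `year` is already an integer (or `None` when the date could not be resolved). The record structure and the sample data are invented for illustration.

```python
from collections import Counter


def titles_per_decade(records):
    """Count records per decade of publication, skipping unresolved years."""
    counts = Counter()
    for record in records:
        year = record.get("year")
        if isinstance(year, int):
            counts[year // 10 * 10] += 1
    return dict(sorted(counts.items()))


# Invented sample of harmonised records
records = [
    {"title": "A", "year": 1701},
    {"title": "B", "year": 1709},
    {"title": "C", "year": 1712},
    {"title": "D", "year": None},
    {"title": "E", "year": 1725},
]
decade_counts = titles_per_decade(records)
print(decade_counts)
```

Records with unresolved years are silently skipped here. In a real analysis their share should be reported, since a large number of undated records can distort the apparent trend.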


Validation takes place on several levels. First comes technical validation, the purpose of which is to check the overall consistency of the dataset: is it readable, and are all records in the same format, semantic schema, and character encoding?
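A minimal sketch of such a technical check is shown below, assuming the dataset is a MARCXML file expected to be encoded in UTF-8. It tests three things only: that the bytes decode, that the XML is well-formed, and that every top-level element is a MARCXML record. The function name and the sample inputs are hypothetical, and other formats would need their own checks.

```python
import xml.etree.ElementTree as ET

MARC_RECORD = "{http://www.loc.gov/MARC21/slim}record"


def technical_validation(raw_bytes):
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError as error:
        return [f"not valid UTF-8: {error}"]
    try:
        root = ET.fromstring(text)
    except ET.ParseError as error:
        return [f"not well-formed XML: {error}"]
    problems = []
    for index, child in enumerate(root):
        if child.tag != MARC_RECORD:
            problems.append(f"element {index} is {child.tag}, not a MARCXML record")
    return problems


# Invented inputs: one clean file, one with a stray element
GOOD = b'<collection xmlns="http://www.loc.gov/MARC21/slim"><record/><record/></collection>'
MIXED = b'<collection xmlns="http://www.loc.gov/MARC21/slim"><record/><note/></collection>'

print(technical_validation(GOOD))
print(technical_validation(MIXED))
```

For large dumps the same checks would be better run with a streaming parser, since this sketch loads the whole file into memory.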

A deeper quality assessment involves generic quality dimensions such as completeness (are the data elements we would like to analyse available in all or the majority of the records?), conformity (do the records follow the rules of the bibliographic schema and the researcher's own custom criteria?), appropriateness (does the dataset contain the subset, e.g. 18th century books, that the researcher would like to analyse?), and consistency (do all the records use the same terms to describe the same information, e.g. a person or concept?).
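Two of these dimensions, completeness and consistency, lend themselves to simple measurements. The sketch below assumes records are dictionaries and that persons carry an identifier alongside a name. The field names `author_id` and `author` and the sample data are hypothetical.

```python
def completeness(records, fields):
    """Share of records in which each field has a non-empty value."""
    total = len(records)
    if total == 0:
        return {}
    return {
        field: sum(1 for record in records if record.get(field)) / total
        for field in fields
    }


def inconsistent_labels(records, id_field, label_field):
    """Identifiers that occur with more than one distinct label."""
    labels = {}
    for record in records:
        if record.get(id_field) and record.get(label_field):
            labels.setdefault(record[id_field], set()).add(record[label_field])
    return {key: sorted(values) for key, values in labels.items() if len(values) > 1}


# Invented sample records
records = [
    {"author_id": "a1", "author": "Doe, Jane", "year": 1789},
    {"author_id": "a1", "author": "Doe, J.", "year": None},
    {"author_id": "a2", "author": "Roe, Richard", "year": 1801},
    {"author_id": "", "author": "", "year": 1650},
]
scores = completeness(records, ["author", "year"])
conflicts = inconsistent_labels(records, "author_id", "author")
print(scores, conflicts)
```

What counts as an acceptable completeness score depends on the research question, so thresholds are best set per use case rather than globally.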

The slogan "fitness for use", frequently mentioned in the quality assessment literature, reflects the fact that we always measure how well the data can satisfy a functional requirement. The requirements of a library and of a researcher may differ, so data that are of high quality in one context may be of lower quality in another.

Validation can be run not only on the incoming data but also, if we disseminate our results, on our output dataset.



The researcher has several non-exclusive options for sharing their results. Here are some:
- Sharing the software code (or executables) that was used to produce the results (this is also important for research reproducibility),
- Publishing the output data as it is or in a standard data format that fits the particular research or professional domain. The researcher can do this in a research data repository (see the directory of research data repositories) or on other platforms such as core repositories.

In all cases the publication should follow the general rules of research output publication, taking care of the FAIR and CARE principles: for example, applying persistent identifiers, attaching proper metadata and a licence or terms of use, and giving credit to all who contributed to the process.

The researcher should inform not only the research community but also the data provider (the institution where the data comes from), so that it can implement the “metadata roundtrip”, i.e. fetch the data back and use it to enrich the original data source. For cultural heritage organisations, the credibility and reliability of their data are important. Traditional metadata schemas do not provide ways to record provenance information at the level of individual statements within a metadata record, but recent developments, such as the reification technique of RDF or the use of external identifiers for data sources in MARC, will hopefully help with this. Two good examples of this mix of cultural heritage and user-produced metadata are a) the Swedish National Heritage Board’s [Wikimedia Commons Data Roundtripping experiment](https://meta.wikimedia.org/wiki/Wikimedia_Commons_Data_Roundtripping) and b) the Consortium of European Research Libraries’ [Material Evidence in Incunabula database](https://data.cerl.org/mei/_search), which contains data from 50,000 books, contributed by a network of over 400 European and American libraries and over 200 editors. Its metadata schema contains data elements regarding the source and certainty of the information.


Mikko Tolonen

Ondřej Vimr

Charlotte Panušková

Péter Király

Library catalogues have been identified as a crucial resource for studying different aspects of book production, spanning from literature to intellectual history and informatics. At the same time, using them requires addressing challenges of data quality, completeness, and interpretation. An important aspect of the bibliographic data science workflow is that it is conceived as a multilingual and transnational way of approaching large humanities and social science data. Data from national libraries, for example, cover hundreds of years and several languages. When we look at the possibility of combining library datasets from multiple countries, we often face the challenge of dealing with language dependencies and deduplication.

The core of this type of work is to iterate between data harmonisation, analysis built upon different use cases, and validation of the data against other sources. We should imagine an open science ecosystem of different metadata collections in which work on one of them also eases the use of another. Ecosystem thinking is also important because the harmonisation step often depends on other linked sources such as authority files or other library catalogues. The approach through which library metadata catalogues become research data has been termed bibliographic data science. In this workflow description, we explain what it takes to produce research data out of library catalogues. The workflow can be described as an open science initiative because it takes questions of reproducibility and data quality seriously.

This workflow provides a step-by-step guide for researchers eager to use bibliographical data for research. More specifically, it looks into common issues of data acquisition, preprocessing, harmonisation, analysis and validation as well as an array of dissemination options.


Bibliographical Data Science: from Catalogues to Research Data


Related items(2)

Bibliographic Data Science and the History of the Book (c. 1500–1800)

R tools for Fennica (Finnish national bibliography)

Workflow steps(6)

1 Data acquisition

2 Preprocessing

3 Harmonisation

4 Analysis

5 Validation

6 Dissemination
