Social Sciences & Humanities Open Marketplace

Users can upload metadata and texts in several formats: [Zotero](https://www.zotero.org/) collections in CSV and RDF formats, and EPrints (library) repositories as (EP3) XML files (metadata only, or metadata with links to the full texts). AVOBMAT can also import full texts, for example from a zip file of documents uploaded together with a CSV file of their metadata. Documents from external databases can be imported by providing URLs to the full texts in the CSV. Because the Apache Tika library converts the files to plain text, AVOBMAT can process texts in several formats (e.g. TXT, PDF, DOC/X, XML).

AVOBMAT provides several key options for cleaning the text corpus. For example, users can

* remove non-alphabetical tokens (e.g. from OCR-ed texts);
* upload a list of words to replace words and characters (e.g. modernization of old word forms; correction of typical OCR errors; synonyms);
* make use of regular expressions (e.g. merging hyphenated words at line endings).
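
The three cleaning options listed above can be illustrated with a minimal Python sketch. This is not AVOBMAT's own code: the function names, the whitespace tokenization and the sample replacement list are assumptions made for illustration only.

```python
import re


def merge_hyphenated_line_breaks(text: str) -> str:
    """Join words split by a hyphen at a line ending ('print-\\ning' -> 'printing').

    Note: this simple pattern also joins genuine compounds that happen
    to be broken at a line ending.
    """
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", text)


def replace_words(tokens, replacements):
    """Replace tokens found in a user-supplied mapping (e.g. old -> modern forms)."""
    return [replacements.get(tok, tok) for tok in tokens]


def remove_non_alphabetical(tokens):
    """Drop tokens that contain anything other than letters (e.g. OCR noise, numbers)."""
    return [tok for tok in tokens if tok.isalpha()]


def clean(text, replacements):
    """Apply the three steps in sequence on whitespace-separated tokens."""
    text = merge_hyphenated_line_breaks(text)
    tokens = text.split()
    tokens = replace_words(tokens, replacements)
    return remove_non_alphabetical(tokens)


if __name__ == "__main__":
    sample = "The olde print-\ning presse #@! 1750"
    print(clean(sample, {"olde": "old", "presse": "press"}))
```

The order of the steps matters in this sketch: hyphenated words are merged before tokenization, so that the replacement list sees whole words.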

A context filter keeps the context of one or more keywords and removes all other parts of the document. Users can specify the search terms and the length of the context (number of words). This is useful for approximately isolating smaller parts of a document, e.g. the articles in a newspaper that contain the keyword(s).
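
The idea behind such a context filter can be sketched as follows. This is an assumed, simplified version that works on a list of tokens and matches keywords case-insensitively; the function name `context_filter` is hypothetical and AVOBMAT's actual matching rules may differ.

```python
def context_filter(tokens, keywords, window):
    """Keep only the tokens within `window` words of any keyword occurrence."""
    wanted = {k.lower() for k in keywords}
    keep = [False] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok.lower() in wanted:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                keep[j] = True  # overlapping contexts are merged naturally
    return [tok for tok, flag in zip(tokens, keep) if flag]


if __name__ == "__main__":
    text = "a b c lodge d e f g h lodge i"
    print(context_filter(text.split(), ["lodge"], window=1))
```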

Users can create and save different configurations for each analysis where the outcome depends on the language of the texts. There are two ways to assign a language to a document: 

* researchers can manually select a language for the full dataset (52 languages) or 
* choose the automatic language detection option. 

With the latter option, the system detects the language of each document independently. Based on the chosen language, AVOBMAT offers stopword and punctuation filtering, drawing on the [spaCy](https://spacy.io/) library, as well as lemmatization. Lemmatization can be useful, for example, for topic modelling, but researchers who want to investigate chronological changes in the different forms of a lemma can switch it off in the n-gram viewer. Extra stopword and punctuation lists can also be added. [spaCy language models](https://spacy.io/usage/models) are used for lemmatization, with [LemmaGen](https://pypi.org/project/lemmagen3/) models being used for languages not supported by spaCy.

The following preprocessing options are implemented:


* choose spaCy language model (small, large or transformer); 
* make text lowercase; 
* remove numbers; 
* set minimal character length. 

The metadata enrichment includes the identification of the gender of the authors (male, female, unknown gender or without author) and automatic language detection. Users can also upload a list of male and female first names, supplementing and replacing the ones found in the dictionaries of the programme. 

For topic modelling, users also have the option to split the documents into sections of equal size.
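
A minimal sketch of such a split, assuming the section size is given as a number of tokens and that the last section may be shorter than the others (how AVOBMAT treats the remainder is not stated here):

```python
def split_into_sections(tokens, section_size):
    """Split a token list into consecutive sections of `section_size` tokens.

    The final section holds the remainder and may be shorter.
    """
    if section_size <= 0:
        raise ValueError("section_size must be positive")
    return [tokens[i:i + section_size] for i in range(0, len(tokens), section_size)]


if __name__ == "__main__":
    print(split_into_sections(list(range(7)), 3))
```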

AVOBMAT calculates lexical diversity according to eight metrics. For the Mean segmental TTR (MSTTR) and the Moving average TTR (MATTR), users can specify the window length.
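
To show what the window length controls, here is a minimal sketch of the two measures using their usual textbook definitions: MSTTR averages the type-token ratio over consecutive non-overlapping segments, while MATTR averages it over a window that slides one token at a time. The function names are illustrative, and details such as discarding an incomplete final segment are assumptions rather than AVOBMAT's documented behaviour.

```python
def ttr(tokens):
    """Type-token ratio: number of distinct tokens divided by number of tokens."""
    return len(set(tokens)) / len(tokens)


def msttr(tokens, segment_length):
    """Mean TTR over consecutive, non-overlapping segments of equal length.

    A trailing incomplete segment is discarded (an assumption of this sketch).
    """
    last_start = len(tokens) - segment_length
    segments = [tokens[i:i + segment_length]
                for i in range(0, last_start + 1, segment_length)]
    if not segments:
        raise ValueError("text is shorter than the segment length")
    return sum(ttr(s) for s in segments) / len(segments)


def mattr(tokens, window_length):
    """Mean TTR over every window obtained by sliding one token at a time."""
    windows = [tokens[i:i + window_length]
               for i in range(len(tokens) - window_length + 1)]
    if not windows:
        raise ValueError("text is shorter than the window length")
    return sum(ttr(w) for w in windows) / len(windows)


if __name__ == "__main__":
    sample = ["a", "a", "b", "c"]
    print(msttr(sample, 2), mattr(sample, 2))
```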

Importing previously saved configuration settings is also possible.

Using the configuration settings specified in the previous step, AVOBMAT cleans and pre-processes a small sample of the uploaded database so that users can check whether the chosen parameters are appropriate. If the configuration is acceptable, the settings can be saved in a template. If the parameters need to be fine-tuned, users can start the cleaning and configuration process again. During this phase all the metadata of the given database is uploaded, so users can check which metadata fields were mapped.

Users can search and filter the metadata and texts in faceted, advanced and command-line modes, and perform all the subsequent analyses on the filtered dataset (powered by the [Elasticsearch](https://www.elastic.co/) engine). The NLP analyses of the documents semantically enrich the metadata. For example, recognized named entities such as persons appear in all types of searches. The tool supports fuzzy and proximity searches. Users can search for (disambiguated) named entities in different languages.

Having filtered the uploaded databases and selected the metadata field(s) to be explored, the users can, among other actions, 

* analyze and visualize the bibliographic data chronologically in line and area charts in normalized and aggregated formats; 
* create an interactive network analysis of the metadata fields; 
* make pie, horizontal and vertical bar charts. 

To foster the critical investigation of the bibliographic data, AVOBMAT also presents the number of missing and other values (not included in the dataset limited by the selected number of top items parameter) in the filtered corpus. Besides revealing overlooked connections and trends over time, the bibliographic data analysis can also highlight selection biases, errors in the bibliographic (meta)data (e.g. incorrect classifications) and can reveal missing values and gaps in the data.

The following options are available for interactive text analysis.

### __7.1. N-gram viewer__

The diachronic analysis of texts shows the yearly count of the specified n-grams. N-grams with a maximum length of five words are generated at the pre-processing stage. Users can display the results in aggregated and normalized views.
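
The yearly counting can be sketched as below. This is an illustration under stated assumptions, not AVOBMAT's implementation: documents are assumed to be `(year, tokens)` pairs, the raw yearly count stands in for the aggregated view, and the normalized value is taken to be the count divided by the number of n-grams of the same length in that year. The exact definitions used by the tool may differ.

```python
from collections import defaultdict

MAX_N = 5  # the maximum n-gram length mentioned in the text


def ngrams(tokens, n):
    """Return all consecutive n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_ngram_counts(documents, phrase):
    """Count `phrase` per year over an iterable of (year, tokens) pairs."""
    words = phrase.split()
    n = len(words)
    if not 1 <= n <= MAX_N:
        raise ValueError("phrase must contain between 1 and 5 words")
    target = " ".join(words)
    counts = defaultdict(int)
    totals = defaultdict(int)
    for year, tokens in documents:
        grams = ngrams(tokens, n)
        counts[year] += grams.count(target)
        totals[year] += len(grams)
    return {
        year: {
            "count": counts[year],
            # assumed normalization: share of all n-grams of this length in the year
            "normalized": counts[year] / totals[year] if totals[year] else 0.0,
        }
        for year in sorted(totals)
    }


if __name__ == "__main__":
    docs = [
        (1750, "the grand lodge met".split()),
        (1750, "grand lodge".split()),
        (1751, "a new lodge".split()),
    ]
    print(yearly_ngram_counts(docs, "grand lodge"))
```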

### __7.2. Frequency analysis__

Frequency analyses and word clouds can be efficient tools to highlight the prominent terms in a corpus. 

* The [significant text](https://www.elastic.co/guide/en/elasticsearch/reference/8.0/search-aggregations-bucket-significanttext-aggregation.html) analytical tool shows what differentiates a subset of the documents from the rest, using one of four metrics (e.g. Chi square). It highlights the terms most strongly related to a specific query. If users filter by a period of time or select an author using AVOBMAT's search options, this tool shows the words that are most strongly related to the selected subset.

* The [TagSphere](https://link.springer.com/chapter/10.1007/978-3-319-64870-5_10) analysis enables users to investigate the context of a word by creating tag clouds showing the co-occurring words of a specified search term within a specified word distance. 

Words can be interactively removed from the word clouds. Bar chart versions of the frequency analyses present the applied scores and frequencies.
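
The co-occurrence counting that underlies a TagSphere-style view can be sketched as follows. This is a simplified illustration, not the published TagSphere algorithm or AVOBMAT's code: it only counts the words found within a given distance of each occurrence of the search term, and the function name is hypothetical.

```python
from collections import Counter


def cooccurrences(tokens, term, max_distance):
    """Count words appearing within `max_distance` tokens of each `term` occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            start = max(0, i - max_distance)
            end = min(len(tokens), i + max_distance + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


if __name__ == "__main__":
    words = "the old lodge in the new lodge hall".split()
    print(cooccurrences(words, "lodge", 2).most_common())
```

In this sketch a word lying between two occurrences of the search term is counted once for each of them.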

### __7.3. Lexical diversity__

AVOBMAT calculates the lexical diversity of texts according to eight different metrics: Type-token ratio (TTR), Guiraud, Herdan, Mass TTR, Mean segmental TTR, Moving average TTR, Measure of Textual Lexical Diversity and Hypergeometric distribution Diversity.
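
Three of the simpler metrics can be written down directly from their standard definitions, with N the number of tokens and V the number of distinct types: TTR is V / N, Guiraud's index is V / sqrt(N), and Herdan's C is log V / log N. The sketch below is illustrative only; the function names are assumptions and AVOBMAT's own implementation may differ in tokenization and edge-case handling.

```python
import math


def type_token_ratio(tokens):
    """TTR = V / N."""
    return len(set(tokens)) / len(tokens)


def guiraud(tokens):
    """Guiraud's index = V / sqrt(N)."""
    return len(set(tokens)) / math.sqrt(len(tokens))


def herdan(tokens):
    """Herdan's C = log V / log N (undefined for texts of a single token)."""
    if len(tokens) < 2:
        raise ValueError("Herdan's C needs at least two tokens")
    return math.log(len(set(tokens))) / math.log(len(tokens))


if __name__ == "__main__":
    sample = ["a", "b", "a", "c"]
    print(type_token_ratio(sample), guiraud(sample), herdan(sample))
```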

### __7.4. Keyword-in-context__ 

The keyword-in-context function supports the close reading of texts. Users can provide the keyword, the length of the context and the number of passages to be displayed.
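
The three inputs map naturally onto a small function. The following is a minimal sketch assuming whitespace-tokenized text and case-insensitive matching; the name `kwic` and its parameters are hypothetical.

```python
def kwic(tokens, keyword, context_length, max_passages):
    """Return up to `max_passages` (left context, keyword, right context) triples."""
    passages = []
    for i, tok in enumerate(tokens):
        if len(passages) >= max_passages:
            break
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - context_length):i])
            right = " ".join(tokens[i + 1:i + 1 + context_length])
            passages.append((left, tok, right))
    return passages


if __name__ == "__main__":
    words = "the grand lodge met and the lodge voted".split()
    for left, key, right in kwic(words, "lodge", context_length=2, max_passages=5):
        print(f"{left:>12} [{key}] {right}")
```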

### __7.5. Topic modelling__

Topic modelling can be a powerful tool for identifying underlying themes within a filtered corpus by analyzing hidden semantic information. The [Latent Dirichlet Allocation](https://dl.acm.org/doi/10.5555/944919.944937) function calculates and graphically represents topic models. It shows the most relevant words and most relevant documents related to each topic, visualizes the distribution of these topics chronologically, highlights the correlation of different topics and exports the results in various formats. Users can interactively remove stopwords. It has the following parameters: 
* the minimum number of occurrences of words, 
* the number of topics,
* the number of iterations, 
* per-document topic distribution (alpha), 
* per-topic word distribution (beta). 

### __7.6. Parts-of-speech tagging__

AVOBMAT currently identifies part-of-speech tags in 9 languages using the spaCy language models. It produces different interactive visualizations and statistical tables of the results with the following information:

* the word form, 
* the lemma, 
* the part-of-speech tag, 
* the number of occurrences, 
* the relative frequency, 
* the number of documents in which the word form appears. 

Users can visualize the distribution of parts-of-speech tags over time.
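
A table with the columns listed above could be assembled as in the following sketch. It assumes the documents have already been tagged (for example by spaCy) and are given as lists of `(form, lemma, tag)` triples; it also assumes that rows are keyed by the whole triple and that the relative frequency is the share of all tokens in the corpus. These choices are illustrative and not taken from AVOBMAT.

```python
from collections import Counter


def pos_table(tagged_documents):
    """Build table rows from documents given as lists of (form, lemma, tag) triples."""
    occurrences = Counter()
    document_counts = Counter()
    total_tokens = 0
    for doc in tagged_documents:
        occurrences.update(doc)
        document_counts.update(set(doc))  # each document counted once per entry
        total_tokens += len(doc)
    return [
        {
            "form": form,
            "lemma": lemma,
            "pos": tag,
            "occurrences": count,
            "relative_frequency": count / total_tokens,
            "documents": document_counts[(form, lemma, tag)],
        }
        for (form, lemma, tag), count in occurrences.most_common()
    ]


if __name__ == "__main__":
    docs = [
        [("lodges", "lodge", "NOUN"), ("met", "meet", "VERB"), ("lodges", "lodge", "NOUN")],
        [("lodges", "lodge", "NOUN")],
    ]
    for row in pos_table(docs):
        print(row)
```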

### __7.7. Named entity recognition, disambiguation and linking__

AVOBMAT currently identifies named entities such as persons and places in 16 languages. The number and type of named entities differ by language. AVOBMAT creates different statistical tables and visualizations of these entities. In the case of English, German, French, Spanish and Portuguese, AVOBMAT disambiguates the named entities and links them to Wikidata, VIAF and ISNI. The identified named entities are displayed in different colours in the full-text view.

The ability to import and export the parameter settings in JSON format enhances the reproducibility and transparency of the experiments and results produced with the tool. Users can create templates for the pre-processing and analytical functions on the graphical interface. The tabular statistical data and visualizations of the performed analyses can be saved in PNG and different CSV formats, including a document-topic graph file for Gephi in the case of topic modelling, which enables researchers to use the generated data in other software. Users can share their databases and make them public.
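
As an illustration of why JSON settings help reproducibility, the sketch below round-trips a set of topic modelling parameters through JSON. The class name, field names and default values are hypothetical; AVOBMAT's actual JSON schema is not described here and will differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TopicModelSettings:
    """Hypothetical container mirroring the topic modelling parameters."""
    min_word_occurrences: int = 5
    num_topics: int = 20
    iterations: int = 1000
    alpha: float = 0.1   # per-document topic distribution
    beta: float = 0.01   # per-topic word distribution


def export_settings(settings: TopicModelSettings) -> str:
    """Serialize the settings to a JSON string."""
    return json.dumps(asdict(settings), indent=2, sort_keys=True)


def import_settings(text: str) -> TopicModelSettings:
    """Rebuild the settings from a JSON string produced by export_settings."""
    return TopicModelSettings(**json.loads(text))


if __name__ == "__main__":
    original = TopicModelSettings(num_topics=30)
    restored = import_settings(export_settings(original))
    print(restored == original)
```

Storing the exported file alongside the results lets another researcher rerun the same analysis with identical parameters.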

Róbert Péter

The [AVOBMAT](https://www.avobmat.hu/) (Analysis and Visualization of Bibliographic Metadata and Texts) multilingual research tool enables researchers to critically analyse bibliographic data and texts at scale with the help of data-driven methods supported by Natural Language Processing (NLP) techniques. This exploratory tool offers a range of dynamic text and data mining tasks and provides interactive parameter tuning and control from the preprocessing to the analytical stages. It can preprocess, analyse and (semantically) enrich a vast number of texts and metadata in several languages due to its scalable infrastructure. The implemented analytical and visualization tools provide close and distant reading of texts and bibliographic data. It combines bibliographic data and NLP research methods in one integrated, interactive, user-friendly web application, allowing users to ask complex research questions. 

The ["Multilingual Analysis and Visualization of Bibliographic Metadata and Texts With the AVOBMAT Research Tool"](https://openhumanitiesdata.metajnl.com/articles/10.5334/johd.175) article introduces the workflow of AVOBMAT with images and figures. 

Multilingual analysis and visualization of bibliographic metadata and texts with AVOBMAT

Related items(1)

AVOBMAT (Analysis and Visualization of Bibliographic Metadata and Texts) research tool

Workflow steps(9)

1 Uploading the corpus

2 Cleaning the corpus

3 Configuring the parameters

4 Testing and validating the configuration settings

5 Searching and filtering the corpus

6 Interactive metadata analysis

7 Interactive text analysis 1.

8 Interactive text analysis 2.

9 Exporting results, configurations and publicizing databases
