Date:

25 January 2021

In November 2020, CLARIN and SSHOC organised the workshop SSHOC Considerations for the Vocabulary Platforms. The workshop followed on from three preliminary sessions, each dedicated to a candidate vocabulary hosting and management platform. The final session allowed participants to compare and critically evaluate the proposed candidates, learn about ongoing project work on controlled vocabularies and thesauri, and discuss the potential for identifying a common SSHOC vocabulary platform.

Links to the workshop slides and a recording of the event appear at the end of the article.

Candidate Platforms

Prior to the workshop, the SSHOC team had evaluated several platforms through focused surveys and interviews with experts, and produced the following set of requirements to accommodate the diverse needs of the SSH scientific community:

Import and export capabilities for controlled vocabularies in SKOS / RDF format
Unified access to all vocabularies via API services
Editing capabilities with collaborative functionalities
Alignment functions between vocabularies and external resources
Terminology management interface (hierarchical structure, semantic relationships, translations, facets)
User role and workflow management
Version control and data provenance management
Features to support multilinguality and translation
Friendly and intuitive interface, suitable for non-expert users
Flexible design to be adapted to new needs and standards

While no single platform seemed to fulfil all the requirements, three of the eight platforms evaluated, namely the vocabulary services of CESSDA, CLARIN and ACDH-CH, looked very promising.

CESSDA Vocabulary service

The CESSDA Vocabulary Service platform supports advanced vocabulary management, translation and publication, concept hierarchy, synonyms and user management. However, certain improvements still need to be implemented to ensure full functionality of the platform. One example is the ability to include concepts specific to a research domain within a broader controlled vocabulary, which would support CESSDA organisations producing metadata from varied disciplines.

CLARIN Vocabulary service

CLARIN has been managing vocabularies using CLAVAS since 2017 and is currently looking for a follow-up vocabulary management and publishing platform. The new platform should be a sustainable, open-source, SKOS-compliant solution with a GUI and a vocabulary editor. It should offer advanced browsing functionalities and fast look-up via API, and it should support persistent identifiers and multilingualism. Preferably, the solution should come with preloaded and curated controlled vocabularies.

ACDH-CH Vocabulary service

DARIAH has its central vocabularies service. For them, vocabularies should be published as Linked Open Data (based on the SKOS data model), and provide comprehensive coverage of the domain through concept definitions and examples. Furthermore, users should be able to reuse existing vocabularies or link them to other artefacts, thus ensuring semantic interoperability. The presentation also covered best practices based on TaDiRAH (Taxonomy of Digital Research Activities in the Humanities) and the SSHOC Marketplace, as well as a snapshot of the vocabulary management workflow in the ACDH-CH vocabulary service. The need to share knowledge about vocabularies via dedicated training material was also highlighted, e.g. the Controlled Vocabularies and SKOS e-learning course available on DARIAH-CAMPUS.

For the full analysis, see this presentation available on Zenodo.

Controlled Vocabularies

The SSH VOCABULARY SURVEY was launched in the first half of 2020 to find out which vocabularies are used by the SSH community and identify the practices in terms of languages, alignment, availability and maintenance.

The survey results reveal that the most used vocabularies are the Data Documentation Initiative (DDI), the Getty Art & Architecture Thesaurus (AAT), the CESSDA Controlled Vocabularies, ELSST, and Dublin Core. The respondents expressed a desire for vocabularies to be matched with Getty and Wikidata. A deliverable report will be available soon with more details.

OVERCOMING THE COMMON BARRIERS IN ARIADNE Plus

One helpful approach to overcoming common barriers, such as language and level of specificity, came from the ARIADNE project, where the research team developed a vocabulary matching tool to match local subject terms and concepts to Getty AAT concepts.

The tool supports the SKOS mapping properties and has a multilingual user interface. Researchers can use it to search and browse through AAT vocabularies to make more informed decisions about the mappings. Users can export their data in JSON or delimited text (CSV) format for import into other applications. If users encounter terms/concepts without URIs, they can add them manually to facilitate Linked Open Data source vocabularies.
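As a rough illustration of what such an export could contain, the sketch below records term-to-AAT mappings using the SKOS mapping properties and writes them out as JSON and CSV. The field names, the local term URI and the AAT identifier are placeholders invented for this example; they are not the matching tool's actual export schema.

```python
import csv
import io
import json

# The five mapping properties defined by the SKOS recommendation.
SKOS_MAPPING_PROPERTIES = {
    "skos:exactMatch",
    "skos:closeMatch",
    "skos:broadMatch",
    "skos:narrowMatch",
    "skos:relatedMatch",
}

# Hypothetical column layout for one mapping record.
FIELDS = ["source_uri", "source_label", "relation", "target_uri"]


def make_mapping(source_uri, source_label, relation, target_uri):
    """Build one mapping record, rejecting non-SKOS relations."""
    if relation not in SKOS_MAPPING_PROPERTIES:
        raise ValueError(f"not a SKOS mapping property: {relation}")
    return {
        "source_uri": source_uri,
        "source_label": source_label,
        "relation": relation,
        "target_uri": target_uri,
    }


def to_json(mappings):
    """Serialise the mappings as a JSON array."""
    return json.dumps(mappings, indent=2, ensure_ascii=False)


def to_csv(mappings):
    """Serialise the mappings as delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(mappings)
    return buffer.getvalue()


mappings = [
    make_mapping(
        "https://example.org/terms/amphora",   # placeholder local term
        "amphora",
        "skos:closeMatch",
        "http://vocab.getty.edu/aat/0000000",  # placeholder AAT identifier
    )
]
print(to_csv(mappings))
print(to_json(mappings))
```

Restricting the relation to the five SKOS properties keeps the exported file usable by any other SKOS-aware application.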

Image source: Vocabulary Matching Tool - Help (d4science.org)

MANAGING CONTROLLED VOCABULARIES WITHIN AIOLI

The Aioli platform, a reality-based 3D annotation platform for collaborative documentation of cultural heritage, was offered as an example of an innovative approach to the management of controlled vocabularies. The platform integrates several controlled vocabularies that ensure consistent annotations, and it uses Opentheso, an ISO-compliant web-based thesaurus management tool, to create and manage the content. See the video gallery on the Aioli website to learn how you can create and annotate projects.

The concept management interface is intuitive and supports hierarchical relationships, synonyms and translation. Users can align concepts with other thesauri using semi-automatic alignment and pre-configured sources such as Wikidata and GeoNames. The platform also offers features for user management and collaboration. For more, see D4.16 Specification of the new feature of the Aioli platform.

ENSURING THESAURI INTEROPERABILITY WITH BACKBONE THESAURUS

A vocabulary platform aimed at storing and publishing the SSH thesauri should provide ways of ensuring semantic and technical interoperability. The BackBone Thesaurus (BBT), for example, is a platform designed to build an overarching thesaurus federation for the humanities that would incorporate specialist thesauri and structured vocabularies used across scientific communities. The aligned vocabularies keep their autonomy, and users can access the federated thesauri through the ACDH vocabulary repository service.

The platform supports high-level concepts, facets and hierarchies that do not exhaust the domain they classify. The BBT platform integrates BBTalk, a multilingual online service for thesauri management and maintenance. The service, developed by FORTH-ICS, supports RDF, includes a thesaurus alignment tool and collaborative features, and keeps track of versioning.

Image source: the Backbone Thesaurus landing page

SUPPORTING EXTERNAL VOCABULARIES IN DATAVERSE

Interoperability and alignment of various sources and services native to a particular research community are imperative for the SSHOC vocabulary platform. During the workshop, Dataverse was put forward as an example of a platform that supports external vocabularies with a focus on semantic and technical interoperability. However, a recent self-assessment analysis on the compliance with the FAIR principles revealed that the service is weak from an interoperability point of view while scoring high with regard to the other principles.

Using GRID (Global Research Identifier Database) in SKOS, Dataverse provides a convenient depositor web interface to link the metadata of the datasets stored in the Dataverse network to external controlled vocabularies. The platform uses the Skosmos API specification to ensure technical interoperability with other controlled vocabulary services (e.g. CESSDA). Furthermore, it connects to a Semantic Gateway application that enables users to query the vocabularies stored on different platforms (e.g. Skosmos). Finally, the platform supports multilingualism, and it allows researchers not only to enrich metadata but also to export it to the Linked Open Data Cloud to increase its findability, or to use it to train Machine Learning models.
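To give a sense of what querying a Skosmos-compatible service can look like, the sketch below builds a search URL for the Skosmos REST search endpoint (`/rest/v1/search` with the `query`, `lang` and `vocab` parameters). The host name and vocabulary identifier are placeholders, no request is actually sent, and this is not a description of how Dataverse itself is implemented.

```python
from urllib.parse import urlencode


def skosmos_search_url(base_url, query, lang=None, vocab=None):
    """Build a Skosmos REST search URL without sending any request."""
    params = {"query": query}
    if lang:
        params["lang"] = lang      # language of the labels to search
    if vocab:
        params["vocab"] = vocab    # restrict the search to one vocabulary
    return f"{base_url.rstrip('/')}/rest/v1/search?{urlencode(params)}"


# Placeholder host and vocabulary identifier, for illustration only.
url = skosmos_search_url(
    "https://vocabs.example.org/",
    "interview*",
    lang="en",
    vocab="example-vocab",
)
print(url)
```

Because the URL pattern is the same for every Skosmos-compatible service, a client written this way only needs a different base URL to query another platform.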

Image source: The search interface of the Dataverse

Panel Discussion

CHOOSING A SINGLE VOCABULARY PLATFORM

Sharing a vocabulary platform can mean either sharing a single instance or sharing the code and creating federated instances.

While using a single platform could help overcome some of the current challenges, for example the lack of a uniform API to access vocabularies, the panellists agreed that it is something that might not be feasible to achieve at this moment.

According to Matej Ďurčo (ACDH-CH), it would be challenging to unify the community and store all the vocabularies on one platform because there will always be stakeholders who prefer to control their own platforms and vocabularies. However, it would certainly be beneficial for the overall visibility of the sources to have at least a common registry of the existing SSH vocabularies.

Another suggestion was put forward by Menzo Windhouwer (KNAW) who believes that the developers of vocabulary platforms could try to agree on a couple of endpoints (e.g. for autocomplete) which could help tackle the interoperability challenges.
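To make the suggestion concrete, the sketch below shows what a minimal shared autocomplete contract could look like if every platform exposed the same operation. The names `Suggestion`, `AutocompleteService` and `InMemoryVocabulary`, as well as the example URIs, are hypothetical and were not proposed at the workshop.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Suggestion:
    uri: str
    label: str
    lang: str


class AutocompleteService(Protocol):
    """Contract that each vocabulary platform would agree to implement."""

    def autocomplete(self, prefix: str, lang: str, limit: int = 10) -> list[Suggestion]:
        ...


class InMemoryVocabulary:
    """Toy implementation of the contract backed by a plain list."""

    def __init__(self, concepts):
        self._concepts = list(concepts)

    def autocomplete(self, prefix, lang, limit=10):
        needle = prefix.casefold()
        hits = [
            c for c in self._concepts
            if c.lang == lang and c.label.casefold().startswith(needle)
        ]
        # Sort alphabetically so every platform returns a stable order.
        return sorted(hits, key=lambda c: c.label.casefold())[:limit]


vocab = InMemoryVocabulary([
    Suggestion("https://example.org/c/1", "Interview", "en"),
    Suggestion("https://example.org/c/2", "Internet survey", "en"),
    Suggestion("https://example.org/c/3", "Entretien", "fr"),
])
labels = [s.label for s in vocab.autocomplete("int", "en")]
print(labels)
```

A client coded against the contract alone would work with any platform that implements it, which is the interoperability gain the suggestion aims at.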

MEASURES TO BOOST FAIR IN SSH VOCABULARIES

Findability: Besides assigning a globally unique and persistent identifier, it is also important to describe the vocabularies coherently. Nowadays, researchers prefer to develop their own vocabularies, but often do not follow the standards and guidelines for vocabulary development. Therefore, the vocabulary service and the editing interface should guide researchers in their work and help them develop vocabularies in a consistent and standardised way. This will help enhance the findability and reusability of the resources, while at the same time overcoming the (impossible) requirement of a single platform.

In fact, to support the findability aspect, the vocabularies do not need to be published in one instance because there are always ways to harvest the metadata and create a shared catalogue of all the available vocabularies in order to discover them.
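The idea of a shared catalogue built from harvested metadata can be sketched as follows. The record layout (`uri`, `title`), the platform names and the URIs are assumptions made for this example; a real harvester would fetch the records from each platform rather than receive them as in-memory lists.

```python
def build_catalogue(harvested):
    """Merge vocabulary records from several platforms, keyed by URI."""
    catalogue = {}
    for platform, records in harvested.items():
        for record in records:
            entry = catalogue.setdefault(
                record["uri"],
                {"title": record["title"], "platforms": []},
            )
            # Remember every platform on which the vocabulary is published.
            if platform not in entry["platforms"]:
                entry["platforms"].append(platform)
    return catalogue


# Hypothetical harvest results from two independent platforms.
harvested = {
    "platform-a": [
        {"uri": "https://example.org/vocab/topics", "title": "Topics"},
    ],
    "platform-b": [
        {"uri": "https://example.org/vocab/topics", "title": "Topics"},
        {"uri": "https://example.org/vocab/methods", "title": "Methods"},
    ],
}
catalogue = build_catalogue(harvested)
for uri, entry in catalogue.items():
    print(uri, entry["title"], entry["platforms"])
```

Keying the catalogue on the persistent identifier lets the same vocabulary published in several places appear only once.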

Accessibility: Accessibility can be achieved via authentication and authorisation procedures, but Suzanne Dumouchel (OPERAS/TGIR Huma-Num) also suggested that in the case of vocabularies, it could be beneficial to replace the A-Accessible with A-Adaptable, since the vocabularies must keep up with the changes within the research community and include new concepts. Furthermore, in her opinion, more efforts should be made to identify and properly define the SSH disciplines.

Interoperability: To achieve interoperability with other applications or workflows for analysis, storage, and processing, (meta)data needs to use a formal, accessible, shared, and broadly applicable language for knowledge representation as well as vocabularies that follow the FAIR principles.

In the context of vocabularies, Daan Broeder (CLARIN ERIC) pointed out that there are tools to make the content of the vocabularies interoperable and support the mapping process between the vocabulary terms/concepts. Since it is time-consuming to map entire vocabularies, a pragmatic approach might be desirable, e.g. mapping only those parts of vocabularies that are relevant. He referred the audience to the SEMAF EOSC project, which proposes a flexible semantic mapping framework targeted at specific interoperability goals.

Matej Ďurčo reiterated that it is essential to distinguish between semantic and technical interoperability. Semantic interoperability could be achieved through an approach that accounts for plurality, for example, letting researchers create their own vocabularies and then use tools to link or match them. It was also noted that discussions around semantic interoperability have been going on for a long time and that similar projects are taking place in parallel. Hence, we should avoid reinventing the wheel and encourage the SSHOC researchers to build on existing initiatives as much as possible. These include the Linked Open Vocabularies (LOV), the BARTOC Vocabularies, and the KOS Observatory for Social Sciences and Humanities.

Sustainability: Menzo Windhouwer asked how the vocabularies could be maintained over time, for example, after the funding received for a research project had ended. He pointed out that it would be laborious to keep the resources up to date by relying on volunteers because the work requires specific knowledge engineering skills. However, it was indicated that some parties may be able to sustain the tools and linguistic resources that they are developing. For example, CNR has committed to hosting the ARIADNE mapping tool for at least five years after the SSHOC project has finished. There are also plans to make all the mapped vocabularies available as RDF downloads for reuse. BBT and BBTalk will continue to be maintained voluntarily by DARIAH's Thesaurus Maintenance Working Group.

THE IMPACT OF EDITORIAL AND CURATION PROCESSES

Menzo Windhouwer pointed out that end users do not know what types of relationships knowledge engineers apply during vocabulary mappings. He therefore proposed setting up a quality assessment of the mapping process that includes provenance metadata with confidence metrics.
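One way to picture this proposal is a mapping record that carries its own provenance and a confidence score, so that weak mappings can be flagged for review. The sketch below is only an assumption about what such metadata might contain; the field names, the 0 to 1 confidence scale and the 0.8 threshold are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MappingProvenance:
    source_uri: str
    target_uri: str
    relation: str       # e.g. a SKOS mapping property
    created_by: str     # who asserted the mapping
    created_on: date
    method: str         # "manual" or "semi-automatic" (hypothetical values)
    confidence: float   # 0.0 (no confidence) to 1.0 (certain)

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


def needs_review(mappings, threshold=0.8):
    """Return the mappings whose confidence falls below the threshold."""
    return [m for m in mappings if m.confidence < threshold]


mappings = [
    MappingProvenance(
        "https://example.org/terms/a", "https://example.org/other/a",
        "skos:exactMatch", "editor-1", date(2020, 11, 1), "manual", 0.95,
    ),
    MappingProvenance(
        "https://example.org/terms/b", "https://example.org/other/b",
        "skos:closeMatch", "editor-2", date(2020, 11, 2), "semi-automatic", 0.6,
    ),
]
flagged = needs_review(mappings)
print([m.source_uri for m in flagged])
```

Exposing the relation type, the method and the score alongside each mapping would let end users judge how far to trust it.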

Conclusions

REQUIREMENTS FOR VOCABULARY PLATFORMS

The panel concluded that more discussions are needed to come to a final recommendation for SSHOC vocabulary platform(s) which would host and publish the SSH vocabularies. Currently, none of the existing platforms includes all the desired requirements for the SSHOC vocabulary platform. However, there are some promising ones that could be built upon, e.g. ACDH-CH, CESSDA and CLARIN Vocabulary Services.

The discussions revealed that instead of looking for a single platform to host and publish vocabularies, we should rather aim to achieve interoperability between the different platforms at the following levels: data exchange, vocabulary identification schemes, vocabulary maintenance, and quality management.

REQUIREMENTS FOR VOCABULARIES

Semantic and technical interoperability seem to be the most important requirements. They could be fulfilled by reusing the existing vocabularies, linking them to other semantic artefacts, or integrating them in authoring environments. Furthermore, vocabularies should be published as Linked Open Data (i.e. RDF, SKOS, OWL formats) and provide comprehensive coverage of the domain through structured concept definitions and examples.

Since SSH is a very diverse domain, concept definitions may vary, and this can lead to ambiguity. Therefore, mappings between different vocabularies are needed, but since this is very time-consuming, it was proposed to adopt a flexible semantic mapping framework (e.g. the SEMAF EOSC project) targeted at specific interoperability goals. Moreover, since vocabularies are constantly evolving, a solid governance model is needed to ensure that the vocabularies are updated and maintained systematically and in a semi-automatic way.

Workshop slides and recording

If you’re interested in the webinar series preceding this workshop, you can read this summary of the takeaways or access the slides and the recordings of individual sessions via the following links:

To stay updated on SSHOC's latest activities, sign up for our newsletter, follow us on Twitter @SSHopenCloud or get in touch at info@sshopencloud.eu.

WORKSHOP NOTES: SSHOC Requirements - Vocabularies and Vocabulary Management Platforms