WEBINAR NOTES: Sharing Datasets of Pathological Speech

Date: 20 November 2020

Speech corpora of individuals with communication disorders (CSD) are gems in the realm of language resources. Because they are costly to obtain and hard to share due to the personal nature of the data, researchers often collect the data themselves. However, much time could be saved if it were possible to reuse existing data while respecting GDPR requirements, if only one knew how.

The DELAD initiative was brought to life specifically to help researchers share datasets of pathological speech, and in October 2020, experts from DELAD and SSHOC Tasks 5.4 and 6.5 organised a webinar to demonstrate best practice in obtaining, processing, and sharing CSD.

The DELAD initiative

Henk van den Heuvel (CLST, Radboud University) opened the webinar with a presentation of DELAD. The word means “shared” in Swedish, and the DELAD initiative works towards facilitating the exchange and investigation of CSD (corpora of speech of individuals with communication disorders) in compliance with the GDPR.

DELAD strives to connect to existing research infrastructures. To this end, it cooperates closely with the CLARIN Knowledge Centre for Atypical Communication Expertise (ACE), which collaborates with the CLARIN Data Centre at the Max Planck Institute for Psycholinguistics (The Language Archive), and with Talkbank for storage of sensitive data. DELAD also organises annual workshops on the ethical, legal and technical aspects of working with CSD. These cover everything from collecting, formatting, processing and sharing CSD to ensuring access to such data by collaborating with existing research infrastructures and providing a quality inventory of relevant datasets.

GDPR: a curse or a blessing?

Nicola Bessell (Department of Speech and Hearing Sciences, University College Cork) highlighted the ethical and GDPR considerations when collecting corpora of speech disorders. She underlined that the GDPR stipulates that processing of health-related data is only allowed for research purposes, while archiving of such data must be in the public interest in order to be legal.

In order to ensure GDPR-compliant use of CSD data, researchers and other users of such data must obtain consent from data owners. This can be done via consent forms which need to address the following aspects:

Use of data
- The data user must obtain explicit consent to use the data for the intended purpose.
- The consent form must outline how the confidentiality of data owners will be protected.
Dissemination of data
- The consent form must list the terms that will govern the dissemination of data.
- The consent form should also state what future use is envisaged for research purposes.
Archiving of data
- It is recommended to specify the archival period.

Among other points, the discussion produced a useful remark on the use of clinical data for research purposes: since the GDPR does not classify such use as repurposing, it is fully allowed.

Also of interest in this regard is the recent SSHOC webinar on the DARIAH ELDAH Consent Form Wizard, a tool that provides standardised consent form templates, enabling any user to quickly and easily obtain legal consent valid throughout the European Union.

Data storage and access: where and how?

Paul Trilsbeek (The Language Archive, Max Planck Institute for Psycholinguistics) presented the GDPR-compliant way in which data is stored and made accessible at The Language Archive. He put special emphasis on the issues surrounding anonymisation, a process that can often invalidate the data for many research purposes. He further stressed the need for legal agreements when archiving and sharing personal data, which are in essence of two types:

- deposit/processing agreements
- data use agreements/licenses

He underlined the need to examine the licenses used thoroughly, since many existing licenses are “perpetual” and may therefore be in conflict with the GDPR under certain conditions.

Paul Trilsbeek also elaborated on the technical and logistical requirements implemented at The Language Archive (TLA) in order to ensure “data protection by design and by default” as stated in the GDPR. These include up-to-date systems and software, secure transport of data (HTTPS) and an elaborate system of access policies and authorisation. At TLA, all archived copies also reside within the EU at trusted data centres within the Max Planck Society, which is another important aspect of ensuring data security.

The next speaker, Libby Bishop (GESIS - Leibniz Institute for Social Sciences), presented remote secure access, an innovative method of accessing CSD that is now being explored in the SSHOC project. This method brings the user to the data rather than the data to the user. The data stays on the server where it is hosted, and the user performs analyses with the tools available at that remote end. In this way, the user can download only aggregated analysis results, not the data itself. This ensures a higher level of data security while enabling easy data reuse by researchers.
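The idea behind remote secure access can be illustrated with a minimal Python sketch. It is not the system presented in the webinar or used in SSHOC: the class RemoteDataset, the field names and the MIN_GROUP_SIZE threshold are all hypothetical, and the suppression of small groups is an added assumption that goes beyond what the speaker described. The sketch only shows the principle that analyses run where the data is hosted and that the caller receives aggregates rather than records.

```python
from statistics import mean

# Hypothetical threshold (an assumption, not from the webinar): groups with
# fewer records than this are withheld so that individuals are harder to single out.
MIN_GROUP_SIZE = 5


class RemoteDataset:
    """Keeps records on the hosting side; callers only ever receive aggregates."""

    def __init__(self, records):
        # Private by convention: no method returns the raw records.
        self._records = list(records)

    def aggregate(self, field, group_by):
        """Return count and mean of `field` per group, omitting small groups."""
        groups = {}
        for record in self._records:
            groups.setdefault(record[group_by], []).append(record[field])
        return {
            key: {"n": len(values), "mean": mean(values)}
            for key, values in groups.items()
            if len(values) >= MIN_GROUP_SIZE
        }


# Invented example records: five recordings in group A, two in group B.
records = (
    [{"group": "A", "duration_s": d} for d in (10, 12, 14, 16, 18)]
    + [{"group": "B", "duration_s": d} for d in (20, 30)]
)
dataset = RemoteDataset(records)

# The user submits an analysis request and gets back only the summary.
summary = dataset.aggregate("duration_s", "group")
print(summary)
```

In this toy run, group B is left out of the summary because it has fewer than MIN_GROUP_SIZE records. A real deployment would additionally need authentication, auditing and disclosure review of the outputs, none of which is modelled here.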

Two examples

Libby Bishop first shared some insights into a decade-old data collection project called CAVA (Human Communication Audio-Video Archive), which includes data covering a wide range of disorders and is hosted at UCL. She addressed the legal and technical issues related to sustaining and possibly expanding such a collection. The main concern raised was that there is currently no reliable path to a sustainable infrastructure. Open cloud-based solutions, such as those (that will be) provided by SSHOC/EOSC, offer a promising way forward, but it remains to be seen whether this will really be the winning option.

The final contribution was given by Katarzyna Klessa (Adam Mickiewicz University) on a very recent curation project involving legacy data from Polish children with hearing impairment.

Katarzyna Klessa specifically highlighted the legal basis for sharing the data and the interoperability issues that arise with obsolete data formats. The CLARIN Knowledge Centre for Atypical Communication Expertise helped make this data accessible via a new and unique sharing model whereby all metadata and information on the dataset can be found at Talkbank, whereas the audio data is stored on European servers only, more specifically at The Language Archive. This is a novel and promising example of data storage and access that opens up new possibilities for European researchers, since it uses a well-established data centre in the USA for hosting the landing page and part of the CSD, whilst keeping the most sensitive data on European servers.

Missed the event or want to know more?

Many other useful insights were shared during the webinar presentations and discussion, so we invite you to watch the recording and view the presentation slides.

In addition, the DELAD initiative is organising a virtual workshop on 27 and 28 January 2021. The provisional programme is already online.

If you would like to join this workshop or get actively involved, send an email to h.vandnheuvel@let.ru.nl by 25 November with the following information:

1. Your research topics
2. Whether you have a CSD that you would like to share
3. Whether you would like to give a short presentation about points 1 and 2 during the workshop

Article written by Henk van den Heuvel and Kristina Pahor de Maiti
