Glossary | research-data-management-introductory-course

English glossary of terms related to Research Data Management

A

anonymisation [4]

The process of removing personally identifiable information (information that directly or indirectly relates to an identified or identifiable person) from datasets containing sensitive data. As a result, the data subject is no longer identifiable. As opposed to pseudonymisation, anonymisation is not reversible, which means that the re-identification of the data subject is not possible.

author’s moral rights [1]

Rights protecting the personal relationship between the author and their work. These include, among others, the right to be identified by name, the right to the integrity of the content and form of the work and its fair use, and the right to supervise the use of the work. Moral rights are unlimited in time and cannot be waived or transferred to other persons.

B

backup [4]

Data backup is a process of creating a copy of data in a digital format and storing it on another device to ensure that data are saved and to prevent data loss.
Backups can be full (all files are backed up whenever a backup is made) or partial (only a part of the files, e.g. new files, are backed up).
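
The difference between a full and a partial backup can be illustrated with a minimal Python sketch. The function names `full_backup` and `partial_backup`, the file names and the temporary directories are assumptions made for this example only; a real backup tool would also handle deletions, permissions and off-site storage.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path


def full_backup(source: Path, target: Path) -> list[str]:
    """Copy every file under source into target; return the relative paths copied."""
    copied = []
    for path in sorted(source.rglob("*")):
        if path.is_file():
            destination = target / path.relative_to(source)
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, destination)
            copied.append(path.relative_to(source).as_posix())
    return copied


def partial_backup(source: Path, target: Path) -> list[str]:
    """Copy only files that are missing from target or whose content differs."""
    copied = []
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        destination = target / path.relative_to(source)
        # Skip files whose backed-up copy is byte-for-byte identical.
        if destination.exists() and filecmp.cmp(path, destination, shallow=False):
            continue
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, destination)
        copied.append(path.relative_to(source).as_posix())
    return copied


# Demonstration in throwaway directories (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    source, target = Path(tmp, "data"), Path(tmp, "backup")
    source.mkdir()
    (source / "a.csv").write_text("id,value\n1,10\n")
    first = full_backup(source, target)      # copies everything
    (source / "b.csv").write_text("id,value\n2,20\n")
    second = partial_backup(source, target)  # copies only the new file
    print(first, second)
```

Comparing file contents keeps the sketch deterministic; production tools more commonly compare modification times or checksums to avoid reading every file.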

born digital

Digital materials that were originally created in digital form, as distinct from those created as a result of digitising analogue originals.

C

commercialisation of research results [1]

Deriving financial or economic benefits from the outcomes of scientific research. Most often associated with patenting an invention created on the basis of previously obtained research results, followed by the granting of licences for its use in return for payment. Commercialisation is subject to legal regulations, which may influence the possibilities for open access to related.

controlled vocabulary [4]

An organised and standardised arrangement of predefined terms (words and phrases) that are used to index content in an information system with the aim of facilitating information retrieval. Controlled vocabularies connect variant terms and synonyms for concepts, link concepts in a logical order and organise them into categories, so as to provide a consistent way to describe data. They can be general and discipline-specific, and can take the form of subject heading lists, thesauri, authority files, taxonomies and alphanumeric classification schemes.

copyright (economic rights) [1]

A set of exclusive rights that allow the author to use, reproduce, distribute, and profit from their work. These rights can be licensed to others under specified conditions, or fully transferred — with or without compensation. Economic rights typically expire 70 years after the end of the calendar year in which the author dies.

copyright exceptions and limitations / fair use / fair dealing [1]

Legal provisions allowing certain uses of copyrighted works without the permission of the rights holder, typically for purposes such as private study, education, research, criticism, or commentary. Such uses are limited by national legislation and differ across jurisdictions.

copyright law [1]

The exclusive right to works – manifestations of human creative activity. Copyright protection arises without the involvement of state authorities, as a result of the creation of a work in any form external to the creator. As a rule, the author is the only person entitled to copyright, although exceptions exist, e.g. in the case of works created by employees in the course of their duties, as well as computer programs.

Creative Commons licenses [1]

License templates developed by Creative Commons, a non-profit organisation founded in the United States [19 May 2025], currently cooperating with partners in many countries. A person holding intellectual property rights to specific content (e.g. a data set) may label it with the symbol of the license. In this way, they determine the scope of rights granted to users. Open licences are those Creative Commons licences that allow for both the creation of derivative works and their commercial use. These conditions are met by CC-BY licences (requiring the user to provide information about the authorship and licence conditions) and CC-BY-SA licence (which additionally requires the use of the same licence when sharing adaptations of the work), as well as the CC0 declaration (which is a very broad licence, imposing essentially no obligations on the licensee).

D

data access committee [2]

A group of people responsible for reviewing and evaluating requests for access to data.

Data Availability Statement

A section within a scientific publication that describes where and how the underlying research data can be accessed. It may specify data repositories, licence types, access restrictions, or conditions for reuse. Increasingly required by journals to ensure data transparency and reusability.

database

An organised collection of structured data stored according to a defined model that describes data types and relationships between them. Databases support efficient data access, management, and updating.

data curation [3]

Managed process throughout the data lifecycle, by which data/data collections are cleansed, documented, standardised, formatted and inter-related. This includes versioning data, or forming a new collection from several data sources, annotating with metadata, adding codes to raw data (e.g., classifying a galaxy image with a galaxy type such as “spiral”). Higher levels of curation involve maintaining links with annotation and with other published materials. Thus a dataset may include a citation link to a publication whose analysis was based on the data. The goal of curation is to manage and promote the use of data from its point of creation to ensure it is fit for contemporary purpose and available for discovery and re-use. For dynamic datasets this may mean continuous enrichment or updating to keep it fit for purpose. Special forms of curation may be available in data repositories. The data curation process itself must be documented as part of curation. Thus curation and provenance are highly related.

data documentation [4]

Includes various types of information that can help find, assess, understand/interpret, and (re)use research data – e.g. information about methods, protocols, datasets to be used and data files, preliminary findings, etc. Documentation helps understand the context in which data were created, as well as the structure and the content of data. Data should be documented through all stages of the research data lifecycle. Detailed and rich documentation ensures reproducibility and upholds research integrity. Documentation also includes metadata.

data integrity

A property of data that guarantees its completeness, consistency, accuracy and reliability. In practice, ensuring data integrity requires the implementation of appropriate procedures at the data collection, processing and storage stages to prevent erroneous and unauthorised changes, damage or deletion of data. The implementation of appropriate procedures in this area is particularly important in the context of digital data management due to the ease with which files can be modified.
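
One common procedure for detecting unintended changes to files is to record a checksum and compare it later. The sketch below uses SHA-256 from the Python standard library; the helper names `sha256_of` and `is_unchanged` and the file name are assumptions for illustration, and a checksum detects modification but does not by itself prevent it.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 checksum of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_checksum: str) -> bool:
    """Check whether the file still matches the checksum recorded earlier."""
    return sha256_of(path) == recorded_checksum


# Demonstration with a throwaway file (hypothetical name and content).
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp, "measurements.csv")
    data_file.write_text("id,value\n1,10\n")
    recorded = sha256_of(data_file)           # stored alongside the data
    intact_before = is_unchanged(data_file, recorded)
    data_file.write_text("id,value\n1,11\n")  # simulate an unnoticed edit
    intact_after = is_unchanged(data_file, recorded)
    print(intact_before, intact_after)
```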

data journal

A scientific journal that publishes articles describing research datasets. The aim is to increase dataset visibility, reproducibility, and reuse, often by including detailed metadata and links to repositories.

Data Management Plan (DMP) [2]

A formal document that outlines how research data will be handled during and after the lifecycle of a research project. It covers aspects such as data collection, storage, access, sharing, reuse, and long-term preservation. The DMP is considered a "living document" that should be updated as the research progresses and circumstances change.

data preservation

All activities and processes designed to ensure access to digital research data in the medium and long term. Appropriate data security should take into account both issues related to storage and ongoing technological changes, such as the emergence of new data formats and new forms of access to them, as well as the ability to run the software necessary for their analysis.

data quality [3]

Reliability and application efficiency of data. Perception or assessment of a dataset's fitness to serve its purpose in a given context. Aspects of data quality include: Accuracy, Completeness, Update status, Relevance, Consistency across data sources, Reliability, Appropriate presentation, Accessibility. Data quality is affected by the way data are entered, stored and managed. Maintaining data quality requires going through the data periodically and scrubbing it. Typically this involves updating, standardising, and de-duplicating records to create a single view of the data, even if it is stored in multiple disparate systems.
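
The standardising and de-duplicating step mentioned above might look like the following Python sketch. The record fields (`name`, `email`), the sample values and the rule that two records with the same e-mail address describe the same person are assumptions chosen for this example.

```python
def standardise(record: dict[str, str]) -> dict[str, str]:
    """Normalise whitespace and letter case so equivalent records compare equal."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
    }


def deduplicate(records: list[dict[str, str]]) -> list[dict[str, str]]:
    """Standardise records and keep the first occurrence of each e-mail address."""
    seen = set()
    unique = []
    for record in map(standardise, records):
        if record["email"] not in seen:
            seen.add(record["email"])
            unique.append(record)
    return unique


# Hypothetical records, as they might arrive from two separate systems.
raw_records = [
    {"name": "ada  lovelace", "email": "Ada@Example.org "},
    {"name": "Ada Lovelace", "email": "ada@example.org"},
    {"name": "alan turing", "email": "alan@example.org"},
]
clean_records = deduplicate(raw_records)
print(clean_records)
```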

dataset (or data set)

A set of files containing research data generated in the course of research as part of a research project and/or the preparation of a scientific article, together with accompanying metadata. These metadata describe the data and indicate who produced it and who can access it (Department of Agriculture, Data Management Glossary, 19 May 2025).

data stewardship

Responsible planning and execution of all activities related to digital data before, during and after a research project, with the aim of optimising the usability, reusability and reproducibility of the resulting data (see Dutch Techcentre for Life Science, Research Data Management, 19 May 2025).

digitisation [3]

The process of converting analogue materials into digital form, typically through scanning or other digitisation techniques.

digital object [3]

Machine-independent data structure consisting of one or more elements in digital form that can be parsed by different information systems; the structure helps to enable interoperability among diverse information systems. A digital object is composed of a structured sequence of bits/bytes.
The bit sequence realising the object can be identified and accessed by a unique and persistent identifier or by use of referencing attributes describing its properties.

DOI – Digital Object Identifier [3]

Type of digital Persistent Identifier (PID) issued by the International DOI Foundation. This permanent digital identifier is associated with an object that permits the object to be referenced reliably even if its location and metadata undergo change over time.

dynamic data [3]

Data that are changing frequently and at asynchronous moments.

E

embargo [2]

The period during which research data is not made publicly available. Embargoes are typically used to secure intellectual property rights (e.g. through patents) or to allow time for the preparation and publication of scientific articles based on the data. Once the embargo period ends, the data should be made openly accessible.

European Open Science Cloud, EOSC

An initiative aimed at developing a federated ecosystem of research data infrastructures across Europe. EOSC supports the creation of a "Web of FAIR Data and Services" and enables researchers to practice Open Science by sharing and reusing data, publications and software [EOSC Association (2024), Strategic Research and Innovation Agenda (SRIA) of the European Open Science Cloud (EOSC) (Version 1.3)].

F

FAIR [2]

Acronym for ‘findable’, ‘accessible’, ‘interoperable’ and ‘reusable’, specifying the requirements that shared research data should meet.

I

intellectual property right [1]

A set of legal rules or exclusive rights arising therefrom that regulate the use of intangible assets. This usually includes copyright and related rights, sui generis rights to databases and industrial property rights such as patent rights, trademark rights and industrial design rights. The term ‘intellectual property rights’ often also includes prohibitions and obligations arising from trade secret regulations or unfair competition law. Ownership in the strict legal sense refers only to tangible things, while the concept of ‘intellectual property’ has no strict legal definition and is rather a rhetorical figure emphasising the granting of exclusive rights to use intangible assets to specific entities, similar to property rights.

interoperability

The ability of devices, systems, applications and content to interact with other devices, systems, applications and content.

M

machine readability

Content (in particular data and metadata) being available in a form that can be used and interpreted by machines (computers).

machine-readable format

A data or metadata format that allows for machine readability. Common machine-readable formats include JSON, XML or CSV. Such formats support interoperability and are a prerequisite for implementing the FAIR principles.
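
As a small illustration, the same records can be read from CSV and written out as JSON using only the Python standard library. The column names and values are invented for this sketch.

```python
import csv
import io
import json

# Hypothetical measurements in CSV form.
CSV_TEXT = "sample_id,temperature_c\nS1,21.5\nS2,19.0\n"


def csv_to_records(text: str) -> list[dict[str, str]]:
    """Parse CSV text into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


records = csv_to_records(CSV_TEXT)
as_json = json.dumps(records, indent=2)
print(as_json)
```

Note that CSV carries no type information, so every value arrives as a string; converting `temperature_c` to a number would be a separate, documented step.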

metadata [4]

Metadata are data that provide information about other data, e.g. a description of the content of the data, the date when the data were produced or collected, tools and devices used to obtain data, file formats and sizes, the names of the people who created or collected data, relevant persistent identifiers, etc. Metadata should be created and provided in accordance with commonly used metadata standards, which may be general or discipline specific. This ensures that metadata can be understood by humans and processed and exchanged by machines.

metadata standard

A formally defined and widely used method of describing data. A metadata standard defines the structure of a metadata description and the meaning of the concepts used in it. Examples of such standards are DataCite (a general metadata standard supporting the assignment of persistent identifiers (DOIs) to data) and DDI (Data Documentation Initiative, a metadata standard for social sciences data).
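
A metadata standard typically defines which elements a description must contain. The Python sketch below checks a record against a list of required fields that is loosely modelled on the mandatory properties of the DataCite schema; the field spellings, the `missing_fields` helper and the sample record (including its DOI) are assumptions for illustration and not the schema's actual serialisation.

```python
# Required fields, loosely modelled on DataCite's mandatory properties (assumed spellings).
REQUIRED_FIELDS = (
    "identifier",
    "creators",
    "title",
    "publisher",
    "publication_year",
    "resource_type",
)


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in a metadata record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Hypothetical record; the DOI and names are placeholders.
record = {
    "identifier": "10.1234/example-dataset",
    "creators": ["Doe, Jane"],
    "title": "Example survey data",
    "publisher": "Example Repository",
    "publication_year": 2025,
}
print(missing_fields(record))
```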

O

ontology [3]

Shared and standardised list of words, terms and phrases to describe components of a particular discipline or domain, along with a taxonomy of their relations. Compare this to controlled vocabularies, which tend not to include a structure of relations between their terms. Ontologies are typically developed by domain-specific institutions or communities to aid in the precise referencing of elements. RELATED TERM controlled vocabulary.

Open Access [1]

Making specific information (e.g. research data) available on the Internet in a way that allows for unrestricted access. Open Access implies that there are no fees for users and no technical restrictions (such as the obligation to register in the system or the need to install specialised software). This does not mean that the information made available is not legally protected. Its use and further dissemination may be restricted, primarily by intellectual property law. However, in relation to research data, Open Access means that it has been made available under an open licence or in the public domain. In relation to scientific publications, a distinction is made between Open Access gratis and libre (gratis means availability free of charge and without technical restrictions, while libre means the additional granting of a free licence for publication).

open file format

A method of storing data on a digital medium specified in a published and publicly available specification, which can be implemented without technical or legal restrictions. Open formats that have undergone a standardisation (normalisation) process within an organisation that promotes pluralism, inclusiveness and transparency of this process, as well as the minimisation of technical and legal restrictions on the implementation of specifications, are called open standards. However, there is no single, universally accepted understanding of ‘openness’ in the context of formats and standards. The term ‘standard’ is also often used imprecisely to refer to popular formats that have not necessarily undergone a formal standardisation process.

open licences [1]

Licences under which any interested party is granted permission to use the licensed object in any scope (in all fields of exploitation, including commercial purposes, in its original form or in the form of adaptations/modifications), as well as to freely distribute and modify it. Open licences are free of charge and do not impose any obligations on licensees beyond the obligation to provide recipients with certain information (e.g. about the author, source, licence – so-called attribution clauses) or the obligation to apply the same licence when distributing modifications (so-called copyleft or sharealike clauses).

open research data

According to the UNESCO recommendations, open research data include, among others, digital and analogue data, both raw and processed, and accompanying metadata, as well as numerical records, text records, images and sounds, protocols, analysis code and procedures that can be used openly, reused, preserved and redistributed by anyone, provided that the authorship is acknowledged. Open research data are available in a timely and user-friendly, human- and machine-readable, and actionable format, in accordance with the principles of good data governance and data stewardship, notably the FAIR principles, supported by regular curation and maintenance [19 May 2025].

Open Science

A set of principles and practices for conducting research and sharing its results, with the aim of ensuring the widest possible access to research results and the possibility of their reuse. Open Science includes, among others, Open Access to publications, open research data, open software, open educational resources, and citizen science initiatives. In the UNESCO recommendations adopted in 2021, Open Science was defined as a concept that brings together various movements and practices aimed at making scientific knowledge openly available and reusable for all, strengthening scientific cooperation and sharing information for the benefit of science and society, and opening the processes of scientific knowledge creation and evaluation, as well as communication with social actors outside the traditionally understood scientific community [19 May 2025].

open source software

Software with a source code that is technically accessible and licensed to allow for its widespread reuse. According to the formal definition (posted on the Open Source Initiative website: 19 May 2025), such a licence must allow anyone to redistribute the software without restriction as part of a larger whole, both in source code and in executable form (with appropriate guarantees of access to the source code), in its original and modified form, for any purpose. Most open source software is also free software, the definition of which (as found on the GNU website, 19 May 2025) requires that users be granted all four freedoms: to run the programme for any purpose, to modify it, to distribute copies, and to publicly share improvements (which also requires access to the source code).

P

patent [1]

Intellectual property right protecting an invention – a solution to a technical problem, if it is new, involves an inventive step and is capable of industrial application. A patent grants its holder the exclusive right to use the invention for commercial or professional purposes for a limited period. Patents are granted by national or regional patent offices (e.g. the Patent Office in Poland or the European Patent Office).

persistent identifier [4]

A persistent identifier (PID) is a long-lasting reference to a resource that provides the information required to reliably identify, verify and locate the resource. In a digital environment, PIDs have the form of URLs. When pasted in a browser, they take users to the resource.
Apart from digital resources, PIDs can also relate to researchers (e.g. ORCID), institutions (e.g. ROR), grants, instruments and devices, etc. In this case, a PID leads to the record describing a researcher, an institution, etc. in the relevant registry.
Examples of PIDs include DOIs, ORCIDs, ISBNs, Handles, etc.
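
The URL form of a PID is usually the identifier appended to the address of a resolver service. The Python sketch below builds such URLs for DOIs, ORCID iDs and ROR IDs; the `pid_to_url` helper and the sample identifiers are hypothetical, and the function only constructs the URL without checking that the identifier is registered.

```python
# Resolver base URLs for three common PID types.
RESOLVERS = {
    "doi": "https://doi.org/",
    "orcid": "https://orcid.org/",
    "ror": "https://ror.org/",
}


def pid_to_url(scheme: str, identifier: str) -> str:
    """Build the resolvable URL for an identifier of the given scheme."""
    try:
        base = RESOLVERS[scheme.lower()]
    except KeyError:
        raise ValueError(f"Unknown PID scheme: {scheme!r}") from None
    return base + identifier.strip()


# Placeholder identifiers, not real registered records.
print(pid_to_url("doi", "10.1234/example-dataset"))
print(pid_to_url("orcid", "0000-0000-0000-0000"))
```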

personal data [1]

Any information relating to an identified or identifiable living individual. Examples include names, identification numbers (such as a national tax payer number), physical characteristics, or health-related information. The processing of personal data in the context of scientific research is governed by the General Data Protection Regulation (GDPR) in the European Union. Research use may require informed consent and specific disclosures to data subjects, depending on the legal basis and research context.

personal rights [1]

Certain characteristics of a person and manifestations of their activity (e.g. surname, first name, confidentiality of correspondence, personal dignity, scientific and artistic work) protected under civil law. This protection requires action on the part of the person concerned, who may bring an action against the person infringing their personal rights and demand, for example, an apology or compensation.

pseudonymisation [4]

Processing of personal data in such a way that the data can no longer be related to the data subject without the use of additional information. However, the additional information must be kept separately and subject to technical and organisational measures to ensure that data subjects remain unidentifiable. As opposed to anonymisation, pseudonymisation is a reversible process, which means that data subjects can be re-identified if access to the additional information is enabled.
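
The role of the separately kept additional information can be illustrated with a minimal Python sketch. Direct identifiers are replaced with random pseudonyms, and the table linking pseudonyms to the original values is returned separately so that it can be stored under stricter access controls. The `pseudonymise` function, the field names and the sample records are assumptions for this example; real projects must also consider indirect identifiers in the remaining fields.

```python
import secrets


def pseudonymise(records: list[dict], id_field: str) -> tuple[list[dict], dict[str, str]]:
    """Replace the identifying field with random pseudonyms.

    Returns the pseudonymised records and a key table (pseudonym -> original
    value) that must be stored separately from the data.
    """
    to_pseudonym: dict[str, str] = {}
    key_table: dict[str, str] = {}
    pseudonymised = []
    for record in records:
        original = record[id_field]
        if original not in to_pseudonym:
            pseudonym = "P-" + secrets.token_hex(4)
            while pseudonym in key_table:  # regenerate on the unlikely collision
                pseudonym = "P-" + secrets.token_hex(4)
            to_pseudonym[original] = pseudonym
            key_table[pseudonym] = original
        pseudonymised.append({**record, id_field: to_pseudonym[original]})
    return pseudonymised, key_table


# Hypothetical interview data with one participant appearing twice.
raw = [
    {"participant": "Jane Doe", "answer": "yes"},
    {"participant": "John Roe", "answer": "no"},
    {"participant": "Jane Doe", "answer": "maybe"},
]
shared, key_table = pseudonymise(raw, "participant")

# Re-identification is possible only with access to the key table.
reidentified = key_table[shared[0]["participant"]]
print(shared, reidentified)
```

Destroying the key table does not automatically make the data anonymous, because the remaining fields may still identify a person indirectly.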

R

research data [1, 3]

Various materials collected in the course of scientific research (among others, numerical data, text documents, notes, questionnaires, survey results, audio and video recordings, photographs, database content, software, computer simulation results, methodological protocols, laboratory observations), defined, among other things, in the following ways: 1) recorded factual material generally recognised by the scientific community as necessary for the evaluation of research results; 2) data collected, observed or produced as material for analysis in order to obtain original scientific results; 3) everything that has been produced or created in the course of research. Research data may include, for example, experimental data, observational data, operational data, third party data, public sector data, monitoring data, processed data, or repurposed data.

Research Data Management

A structured set of procedures, rules and good practices for collecting, storing, processing, sharing and archiving of research data. The goal of research data management is to ensure the long-term integrity, security, and usability of research data, as well as to define clear conditions for its access and reuse. A Data Management Plan is a tool for planning the proper management of research data.

research replication

The process of applying the same analysis on different datasets (The Turing Way, 19 May 2025).

research reproduction

The process of repeating the same analysis on the same datasets (The Turing Way, 19 May 2025).

repository [4]

A repository is a digital platform that ingests, stores, manages, preserves, and provides access to digital content. A repository should support a commonly accepted metadata standard and have a protocol enabling metadata exchange.

Repositories are usually classified into:
- subject/disciplinary,
- institutional and
- general-purpose repositories.

reuse [2]

A general term referring to the technical, legal and methodological conditions for the use of data by any person and/or institution, in particular those not involved in the data’s creation.

S

sensitive data

A special category of personal data subject to additional protection requirements under the GDPR. This includes data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, genetic data, biometric data, data concerning health, sexuality or sexual orientation.

storage [4]

Data storage is a computing technology that enables saving data in a digital format on computer components and recording media, including cloud services.

In the context of research data management, it is necessary to ensure that data are stored securely until the end of the project and throughout the minimum retention period. Storage options may include, for example:
- Portable devices,
- University network drives,
- Cloud services.

sui generis database right [1]

A legal right protecting an organised data set as a whole. This right does not apply to individual elements of the set, which may be protected under separate regimes (e.g. copyright). The right to a database is only granted if the creation, verification or presentation of the contents of the set required a significant investment. The person entitled is the one who incurred the risk of the aforementioned investment (the regulations refer to this person as the database producer). They may transfer the right to the database or grant a licence, thereby authorising its use within a specified scope.

V

version control [3]

Control over time of data, computer code, software, and documents that allows for the ability to revert to a previous revision, which is critical for data traceability, tracking edits, and correcting mistakes. Version control generates a (changed) copy of a data object that is uniquely labelled with a version number. The intent is to track changes to a data object, by making versioned copies. Note that a version is different from a backup copy, which is typically a copy made at a specific point in time, or a replica.
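
The idea of uniquely numbered versions that can be retrieved or reverted to can be sketched with a small in-memory Python class. `VersionedObject` and its methods are hypothetical names for this illustration; actual research workflows would normally rely on a dedicated version control system or on repository versioning.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedObject:
    """Keeps every saved state of a data object under an increasing version number."""

    versions: list[str] = field(default_factory=list)

    def commit(self, content: str) -> int:
        """Store a new version and return its version number (starting at 1)."""
        self.versions.append(content)
        return len(self.versions)

    def get(self, version: int) -> str:
        """Return the content stored under the given version number."""
        if not 1 <= version <= len(self.versions):
            raise ValueError(f"No such version: {version}")
        return self.versions[version - 1]

    def revert(self, version: int) -> int:
        """Make an earlier version current again by committing it as a new version."""
        return self.commit(self.get(version))


# Hypothetical dataset content edited twice, then reverted.
dataset = VersionedObject()
v1 = dataset.commit("id,value\n1,10\n")
v2 = dataset.commit("id,value\n1,11\n")  # an edit that turns out to be a mistake
v3 = dataset.revert(v1)                  # history is kept; nothing is overwritten
print(v1, v2, v3, dataset.get(v3) == dataset.get(v1))
```

Reverting by adding a new version, rather than deleting the mistaken one, preserves the full trail of edits that the definition describes as critical for traceability.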

Definitions marked with numbers have been taken or developed from the following sources [19 May 2025].

[1] K. Gaczyńska, N. Rycko, K. Siewicz, Prawne aspekty otwierania danych badawczych – poradnik (Eng. Legal Aspects of Opening Research Data – A Guide), Warszawa 2022. The brochure is available only in Polish on the Platforma Otwartej Nauki (Open Science Platform) website (under the Creative Commons Attribution 4.0 license).

[2] W. Fenrich, Selekcja i przygotowanie danych badawczych do udostępnienia, (Eng. Selection and Preparation of Research Data for Sharing), Warszawa 2019. The brochure is available only in Polish on the Platforma Otwartej Nauki (Open Science Platform) website (under the Creative Commons Attribution 4.0 license).

[3] Committee on Data of the International Science Council (CODATA), Research Data Management Terminology (under the Creative Commons Attribution 4.0 license).

[4] OpenAIRE, Research Data Management Glossary (under the Creative Commons Attribution 4.0 license).