Overview of the data pilot and OpenAIRE tools, Elly Dijk and Marjan Grootveld (OpenAIRE workshop, Ghent, Nov.2015)
1. The Data Pilot and
OpenAIRE tools
Update on Research Data Management
Elly Dijk and Marjan Grootveld
Data Archiving and Networked Services (DANS)
2. Outline
1. Introduction to the Open Research Data Pilot
2. Prepare for responsible research
3. During the research project
4. Data services
5. In summary
6. Introduction to the afternoon’s ‘In Practice sessions’
3. 1. Introduction to the Open
Research Data Pilot
Horizon 2020
Open Research Data Pilot
4. • Financing European research and innovation projects
• Increasing the competitive position of Europe, and finding solutions
for societal challenges, e.g. climate change, food security,
health and wellbeing, secure societies
• Successor of the FP7 programme (KP7)
• Period 2014 - 2020; the budget is € 80 billion
• National Contact Points for the H2020 programme
• http://ec.europa.eu/programmes/horizon2020
7. OpenAIRE support for data
All information is available via https://www.openaire.eu/opendatapilot
8. Open Research Data Pilot
• Aim: to make the research data generated by selected Horizon 2020 projects
accessible with as few restrictions as possible, while at the same time
protecting sensitive data from inappropriate access.
• EC: information already paid for by the public should not be paid for again.
• Open data is data that is free to access and reuse
• Two types of data:
1. Data, including metadata, needed to validate the results in scientific
publications
2. Other data, including metadata, as specified in the Data Management Plan,
like raw data
9. Which research has to participate in
the pilot?
• Future and Emerging Technologies
• Research infrastructures
• Leadership in enabling and industrial technologies
• Nanotechnologies, Advanced Materials, Advanced Manufacturing and
Processing, and Biotechnology
• Societal Challenge: Food security, sustainable agriculture and forestry,
marine and maritime and inland water research and the bioeconomy
• Societal Challenge: Climate Action, Environment, Resource Efficiency
and Raw materials
• Societal Challenge: Europe in a changing world – inclusive, innovative
and reflective Societies
• Science with and for Society
• Cross-cutting activities - focus areas – part Smart and Sustainable
Cities
10. Opting out / opting in
• Opting out of the pilot is possible
when motivated
• And opting in is also possible
11. Reasons for total or partial
opting out
• Incompatible with the Horizon 2020 obligation to protect
results if they can reasonably be expected to be
commercially or industrially exploited;
• Incompatible with the need for confidentiality in connection
with security issues;
• Incompatible with existing rules concerning the protection
of personal data;
• If the project will not generate / collect any research data;
• If there are other legitimate reasons to not take part in the
Pilot
12. Opting in
• Voluntary opting in also possible
• When a researcher wants to publish and share his/her data as
open access
• Under the mandate for open access to publications: aim to deposit at the
same time the research data needed to validate the results
(“underlying data”)
13. Opt in / Opt out numbers
Basis: 3,699 Horizon 2020 signed grant agreements
• Calls in core areas: opt out 34.6% (149/431 proposals)
• Other areas: voluntary opt in 12.5% (409/3,268 proposals)
Conclusion:
• These numbers in the proposals for the first calls of Horizon 2020 are
encouraging.
• Comprehensive follow up needed
• Numbers by Daniel Spichtinger, European Commission, at OpenCon, 14 November 2015
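The percentages on this slide can be re-derived from the counts it quotes. The following is a minimal sketch of that arithmetic; the variable and function names are chosen here for illustration and do not come from the presentation.

```python
# Counts as quoted on the slide.
core_opt_out, core_total = 149, 431      # core-area calls: opt-outs / all
other_opt_in, other_total = 409, 3268    # other areas: voluntary opt-ins / all


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


opt_out_pct = share(core_opt_out, core_total)
opt_in_pct = share(other_opt_in, other_total)
total_agreements = core_total + other_total

print(f"Opt out in core areas: {opt_out_pct}%")
print(f"Voluntary opt in elsewhere: {opt_in_pct}%")
print(f"Total in both groups: {total_agreements}")
```

The two group totals add up to the 3,699 stated as the basis, which suggests the percentages and the basis refer to the same set.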
14. Reasons for opting out
Numbers by Daniel Spichtinger, European Commission, at OpenCon, 14 November 2015
Chart values (%): 17.85, 35.37, 5.32, 24.96, 7.79, 8.71
Chart categories: No data generated, IPR protection, Confidentiality, Privacy, Jeopardize main objective, Other
15. Requirements Open Data
Pilot
1. Data Management Plan required within six
months after project grant
2. Deposit your data in a research data
repository
3. Open data is data that is free to access and
reuse: Creative Commons Licence CC-BY or
CC0
19. How to write a DMP
• Template available from https://dmponline.dcc.ac.uk/
• And from a few national DMPonline sites, e.g. in Spain and Belgium
See https://www.openaire.eu/opendatapilot-dmp
- Spain: http://pgd.consorciomadrono.es/
- Belgium: forthcoming
23. Briefly specify
• how data will be captured/created
• how it will be documented
• according to what standards
• who will be able to access it
• where it will be stored
• how it will be backed up, and
• where and how it will be shared and
preserved long-term
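The list above can double as a checklist while drafting. The sketch below assumes a draft DMP is held as a simple dictionary of free-text answers; the key names (`capture`, `documentation`, and so on) and the example draft are hypothetical and are not part of the DMPonline template.

```python
# Topics taken from the slide; the short keys are hypothetical labels.
DMP_TOPICS = {
    "capture": "how data will be captured/created",
    "documentation": "how it will be documented",
    "standards": "according to what standards",
    "access": "who will be able to access it",
    "storage": "where it will be stored",
    "backup": "how it will be backed up",
    "sharing_preservation": "where and how it will be shared and preserved long-term",
}


def missing_topics(draft: dict) -> list:
    """Return the topic keys that the draft omits or leaves blank."""
    return [key for key in DMP_TOPICS if not draft.get(key, "").strip()]


# Hypothetical draft with only two topics filled in.
draft = {
    "capture": "Sensor logs exported as CSV",
    "storage": "Institutional network drive",
}

for key in missing_topics(draft):
    print(f"Still to describe: {DMP_TOPICS[key]}")
```

A check like this only shows that a topic has been addressed, not that the answer is adequate; that remains a matter for the project partners and the reviewers.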
27. Let’s recall the goal:
• Open access to research data refers to the right
to access and re-use digital research data. Openly
accessible research data can typically be
accessed, mined, exploited, reproduced and
disseminated free of charge for the user.
• The use of a Data Management Plan (DMP) is
required for projects participating in the Open
Research Data Pilot, detailing what data the
project will generate, whether and how they will
be exploited or made accessible for verification
and re-use, and how they will be curated and
preserved.
http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
28. Negative intermezzo
• Stored data is not in itself “curated and preserved”
• Preserved (or: archived) data is not in itself findable
• Findable data is not in itself accessible
• Accessible data is not in itself understandable
• Understandable data is not in itself usable
What should be archived for
long-term reuse is a package
of data + context:
29. What should be deposited?
• The data needed to validate results in scientific publications (minimally!).
• The associated metadata: the dataset’s creator, title, year of publication, repository, identifier etc.
• Follow a metadata standard in your line of work, or a generic standard, e.g. Dublin Core or DataCite. Standards
are important for discovering and exchanging data.
• The repository will assign a persistent ID to the dataset: important for discovering and citing the data.
• Documentation like code books, lab journals, informed consent forms – domain-dependent, and
important for understanding the data and combining them with other data sources.
• Software, hardware, tools, syntax queries, machine configurations – domain-dependent, and important
for really using the data. (Alternative: information about the software etc.)
Basically, everything that is needed to replicate a study should be available for others.
Hence the name “replication package”, although the aspiration is reuse rather than replication: more is
most welcome. More data, more information in the package… and described in the DMP.
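The minimal metadata named above (creator, title, year of publication, repository, identifier) can be pictured as a small record. This is an illustrative sketch only: the field names, the example values, the DOI and the citation pattern are assumptions made here, not a rendering of the Dublin Core or DataCite schemas.

```python
from dataclasses import dataclass, asdict


@dataclass
class DatasetMetadata:
    """Minimal descriptive metadata, following the fields listed on the slide."""
    creator: str
    title: str
    publication_year: int
    repository: str
    identifier: str  # persistent ID assigned by the repository


def empty_fields(record: DatasetMetadata) -> list:
    """Return the names of fields that are still blank."""
    return [name for name, value in asdict(record).items() if value in ("", None)]


def citation(record: DatasetMetadata) -> str:
    """Build a simple data citation (an illustrative pattern, not a mandated one)."""
    return (f"{record.creator} ({record.publication_year}): {record.title}. "
            f"{record.repository}. {record.identifier}")


# Hypothetical dataset; the DOI below is made up for the example.
record = DatasetMetadata(
    creator="Doe, J.",
    title="Example vegetation survey",
    publication_year=2015,
    repository="Example Repository",
    identifier="https://doi.org/10.5072/example",
)

print(empty_fields(record))
print(citation(record))
```

In practice the repository's deposit form usually collects these fields and maps them to its own metadata standard, so a record like this mainly helps to prepare the information in advance.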
31. Open Access to all data, unless…
• Confidentiality and security issues can be good reasons not to
publish or share – all – data. Note in the DMP* the reasons for not
giving access, and deposit that part of the data under a Restricted
Access regime.
• E.g. when regenerating data would be cheaper than archiving,
don’t archive. Spend time on selecting what data you’ll need and
want to retain. Motivate your criteria in the DMP.
See http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
For selection criteria see https://www.openaire.eu/opendatapilot
Grant Agreement, Art. 29.3, Open Access to research data:
32. Repository, archive, ehm?
• A pilot requirement is to “deposit your data in a research data
repository”: a digital archive collecting and displaying datasets and
their metadata.
• Select a data repository that will preserve your data, metadata and
possibly tools in the long term. It is advisable to contact the
repository of your choice when writing the first version of your DMP.
Repositories may offer guidelines for sustainable data formats and
metadata standards, as well as support for dealing with sensitive data
and licensing.
But how to find a repository? More in a few minutes…
35. Deliver the DMP
• Send the initial DMP version to the Commission within six months.
• EC: “Since DMPs are expected to mature during the project, more
developed versions of the plan can be included as additional
deliverables at later stages. (…) New versions of the DMP should be
created whenever important changes to the project occur due to
inclusion of new data sets, changes in consortium policies or external
factors.”
36. 3. During the project
Data management is part of good research
38. Linking data and publications
• From a data-centric perspective publications are part of a dataset’s
context. However, there is no need to include publications in the
replication package:
• A lot of data repositories also accept publications, and allow linking
between publications and their underpinning data.
• By means of smart, persistent identifiers – consistently used – linking
is also possible across repositories.
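Why consistent use matters can be shown with a small sketch. It assumes the persistent identifiers are DOIs, which the slide does not state, and that the same DOI may be written with different prefixes or letter case; all DOIs below are made up for illustration.

```python
def normalise_doi(raw: str) -> str:
    """Strip common prefixes and lowercase, so equal DOIs compare equal."""
    doi = raw.strip()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.lower()


def resolver_url(raw: str) -> str:
    """Return the resolvable link for a DOI written in any of the forms above."""
    return "https://doi.org/" + normalise_doi(raw)


# Hypothetical publication and the datasets cited in it, written inconsistently.
publication = "doi:10.5072/PAPER-1"
datasets = ["https://doi.org/10.5072/data-a", "10.5072/DATA-A", "10.5072/data-b"]

# After normalisation the first two entries turn out to be the same dataset.
linked = sorted({normalise_doi(d) for d in datasets})

print(resolver_url(publication), "->", [resolver_url(d) for d in linked])
```

Normalising afterwards is a workaround; the links are far more reliable when authors cite the identifier in one agreed form from the start.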
43. Storage and Trust
• Local storage facilities during the research
• Network of trustworthy digital repositories for long-term preservation of
(a selection of) the data after the research is finished
• Certification of digital repositories in order to establish trust
• 4 certification standards available
44. Where to find a repository?
In order of preference: use
1. an external data archive or repository in your research domain
2. an institutional research data repository, or your research group’s
established data management facilities
3. Zenodo.org
4. or search for other data repositories at re3data.org
http://www.zenodo.org/ http://www.re3data.org/
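The order of preference above amounts to a simple fallback rule. The sketch below is one way to write it down; the function name and the example repository names are hypothetical, and the final fallback merely paraphrases the last two options on the slide.

```python
from typing import Optional

# Last resort according to the slide: Zenodo, or a search at re3data.org.
FALLBACK = "Zenodo, or another repository found via re3data.org"


def choose_repository(domain_repo: Optional[str],
                      institutional_repo: Optional[str]) -> str:
    """Pick the first available option in the slide's order of preference."""
    return domain_repo or institutional_repo or FALLBACK


# Hypothetical situations.
print(choose_repository("PANGAEA", "University data repository"))
print(choose_repository(None, "University data repository"))
print(choose_repository(None, None))
```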
45. Main criteria for choosing a data repository:
• Certification as a ‘Trustworthy Digital Repository’, with an explicit
ambition to keep the data available in the long term.
• Matches your particular data needs: e.g. formats accepted; mixture of
Open and Restricted Access.
• Gives your submitted dataset a persistent and globally unique
identifier: for sustainable citations – both for data and publications –
and to link back to particular researchers and grants.
• Provides guidance on how to cite the data that has been deposited.
How to select a repository?
https://www.openaire.eu/opendatapilot-repository
46. • re3data.org is a global registry of research data
repositories
• It covers different academic disciplines
• It presents repositories for the permanent storage
and access of data sets
• Funded by the German Research Foundation
(DFG)
• 2015: 1,368 reviewed repositories
54. HYPOX: FP7 PROJECT
52 Publications
from 20 different
OpenAIRE data
providers
392 datasets from
PANGAEA
Slide from Pedro Principe, University of Minho
55. FP7 projects: publications +
datasets
HYPOX >
https://www.openaire.eu/search/project?projectId=corda_______::abb5725eaf2617c39ae240b4ce1cce3e
http://hypox.net
Slide from Pedro Principe, University of Minho
56. FP7 projects: publications +
datasets
HYPOX >
https://www.openaire.eu/search/project?projectId=corda_______::abb5725eaf2617c39ae240b4ce1cce3e
Open Access funded Publications
aggregated from repositories & journals
Datasets from Data
Repositories
57. 5. Summary
• Research projects in 9 designated Horizon 2020 areas are automatically part of
the pilot, e.g. Future and emerging technologies; Nanotechnologies; Climate
action; Sustainable agriculture.
• Opting in / opting out is possible
• Data Management Plan required within six months after project grant
• Deposit the research data in a trusted research data repository
• Open data is data that is free to access and reuse: Creative Commons
Licence CC-BY or CC0
• 11,000 open datasets in OpenAIRE
58. Slogan EC: As open as
possible, as closed as needed
60. WP 4 Training and Support
• Task 4.3. Research Data Management training and support
• DANS (Data Archiving and Networked Services) is task leader
• Support kit for Open Research Data Pilot:
https://www.openaire.eu/opendatapilot
Briefing paper: Research Data management - Support for Open
Research Data Pilot OpenAIRE 2020
61. Programme ‘In Practice
session’
1. Situation of RDM in your country:
• Introduction of you and the situation in your country regarding RDM
2. Breakout sessions ‘Feedback Briefing paper RDM’
• Section 2: Reusability and data management
• Section 3: How to plan data management
• Section 5: Roles and responsibilities in RDM
3. Wrap up and other questions / suggestions for future support
materials
relates to controversial or security issues that might have undesired societal consequences if research results became known prematurely
Second, you can select your organisation, but no problem if it’s not on the list. Note that you may also find projects here, such as ELIXIR for life sciences.
You may want to include the guidance provided by the DCC. This is a good addition to the guidance that the EC provides on the questions of the template.
Next, click CREATE.
You’re asked to provide some basic information. Please note that the ID here is one that you enter yourself, for your convenience. I’ll show you in a second where I did this.
This page summarises that the DMP is a deliverable to be submitted within 6 months into the project. Below the orange bar it lists the topics of the initial DMP.
You’re asked to provide some basic information. Please note that the ID here is one that you enter yourself, for your convenience and that of your collaborators.
In this way the researcher proceeds to write the plan – more details follow in a second, but let’s first look ahead:
And make sure that you know what will be asked of you for the mid-term and the final review: the focus here is on enabling reuse of your data – by your future self and others.
In a couple of minutes I’ll tell you why this is a bit underspecified.
Okay, this is the easy part: there is a template. What’s really at stake of course is: what to write in the plan, and who should be involved?
The process of planning is also a process of communication, increasingly important in interdisciplinary / multi-partner research. Collaboration will be more harmonious if project partners (in industry, other universities, other countries…) are in accord.
Open Science encourages – and indeed requires – heterogeneous stakeholder groups to work together for a shared societal goal.
It’s worth bearing in mind that RDM and DMP are similarly hybrid activities, involving multiple stakeholder types…
The principal investigator (usually ultimately responsible for data)
Research assistants (may be more involved in day-to-day data management)
Ideally, they have a front office (FO) in the institute and/or in the domain: library, IT, legal or funding office. (The library may issue PIDs, or liaise with an external service that does this, e.g. DataCite. The funding office may have a compliance role.)
And the FO ideally relies on back-office services, such as long-term archives and high-capacity data transfer.
Partners based in other institutions
Commercial partners
Publishers
etc
Many of us have a role in the front office (FO) or the back office (BO): show of hands!
Remember, we are still in the early stage of a project.
Re Software etc: in many cases copyright will prevent the archiving of software and tools. The alternative is a sensible description.
More about this in the break-out session after the lunch break…
http://www.veryicon.com/icons/object/package-icons/packageicon-zip.html
At this point, usually a lively discussion ensues, because “yes, but this is different in our domain”. Exactly, and that’s why domains should maintain or develop and promote their standards. Researchers are quite capable of giving a sensible interpretation to the message “manage the data properly” for their own line of work, and help implement and foster measures to do so.
Now, should one really deposit and publish all data, raw, intermediate results and so on for eternity?
One size doesn’t fit all… When you’re a criminologist, your respondents will probably like to remain anonymous…
Recap: the researcher has made the planning, together with the stakeholders. Now finish the plan...
…and select an export format; for the EC, PDF is fine.
The DMP is a deliverable/milestone to be delivered in the first 6 months AFTER the start of the project. The project officer and reviewers will ask for it, will evaluate it and give it a mark like any other deliverable (excellent, good, needs revision, rejected). This usually happens at the first review, unless the Project Officer is quite meticulous.
In subsequent reviews (or any time they feel like it) the PO and reviewers may check to see if the DMP is followed (e.g., data files deposited, access status, metadata format, ...).
While the researchers and research assistants carry out the project, they might need some support in dealing with data:
For anonymising sensitive data
For deciding whether they can share particular data within their discipline with colleagues from institutions outside the project consortium (non-beneficiaries)
For dealing with unexpected data formats, due to new instruments
When institutional repositories turn out to be unable to sustainably preserve big data
For dealing with developments in the Open Access to publications sector, with potential ramifications for the underlying data
For dealing with changing policies or regulations, at the institutional, national or international level – the announced new European data protection regulation might have serious impact on using personal data in research
Et cetera
That’s OK, as long as you are around.
“consistently used”: as in many situations, this is only to a small extent a technical matter; to a larger extent it depends on organisation, agreements, and people citing responsibly. So, if you are in a position to stimulate this, please do so.
On top of this, the OpenAIRE project investigates and improves automatic linking, also often via the PID.
In several domains there are indications that publications WITH links to data receive more citations than papers WITHOUT data links. This study in astrophysics is a recent example.
During an expedition to Spitsbergen in 1977, a lot of data on vegetation and biomass was collected. This was used to make the map on the left-hand side. Fortunately, the underlying data was still available and interpretable when, a couple of months ago, another expedition went to Spitsbergen. Researchers from the Arctic Centre in Groningen in the Netherlands were able to reuse the data for plotting and analysing the changes that have occurred in four decades.
Proper data management won’t prevent theft of your laptop, but will help you to keep your data safe – even if they are not meant to go public.
mission to provide reliable, long-term access to managed digital resources to its designated community, now and into the future
constant monitoring, planning, and maintenance
understand threats to and risks within its systems
regular cycle of audit and/or certification
DIN 31644 / ISO 16363
International Council for Science – World Data System (ICSU-WDS). With this certification, WDS confirms that DANS is reliable with regard to the authenticity, integrity, trustworthiness and availability of data and data services.
Use an external data archive or repository already established for your research domain to preserve the data according to recognised standards in your discipline.
If available, use an institutional research data repository, or your research group’s established data management facilities.
Use a well-known data repository in your own country.
Use a cost-free (data) repository such as Zenodo.
Search for other data repositories here: re3data.org
Certification as a ‘trustworthy digital repository’ (TDR), with an explicit ambition to keep the data available in the long term. We know of course that several domains have longstanding archives that are not certified as TDR, because they are unsure how much effort a certification process entails. We think that’s a pity. … three-tiered process... And the Open Science, Open Access, Open Data effort should really encourage the willing repositories to apply for certification.
Matches your particular data needs (e.g. formats accepted; access, back-up and recovery, and sustainability of the service). Most of this information should be contained within the data repository’s policy pages.
Gives your submitted dataset a persistent and unique identifier: for sustainable citations – both for data and publications – and to link back to particular researchers and grants.
Lands visitors at the dataset or its metadata.
Helps to track how the data has been used by providing access and download statistics.
Offers clear terms and conditions that meet legal requirements (e.g. for data protection) and allow reuse without unnecessary licensing conditions.
Provides guidance on how to cite the data that has been deposited.
Elly will tell more about trustworthy repositories and also say a few words about storing data safely DURING the project, because that’s also part of data management.
No data available for Latvia, Belarus, Bulgaria, Moldova and some other countries
Zenodo is developed by CERN under the EU FP7 project OpenAIREplus
Research. Shared. – all research outputs from across all fields of science are welcome!
Citeable. Discoverable. – uploads get a Digital Object Identifier (DOI) to make them easily and uniquely citeable.
Community Collections – accept or reject uploads to your own community collections (e.g. workshops, EU projects or your own complete digital repository).
Funding – integrated in reporting lines for research funded by the European Commission via OpenAIRE.
Flexible licensing – because not everything is under Creative Commons.
Safe – your research output is stored safely for the future in the same cloud infrastructure as research data from CERN's Large Hadron Collider.
Dropbox integration – upload files straight from your Dropbox.
392 datasets from PANGAEA AND 52 Publications from:
Unknown Repository (30); Biogeosciences (BG) (11); Biogeosciences (7); Biogeosciences Disc... (6); OceanRep (6); Ocean Science (OS) (4); Europe PubMed Central (3); Open Repository and... (2); Electronic Publicat... (2); NERC Open Research ... (2); Ocean Science Discu... (1); PLoS ONE (1); Ocean Science (OS) (1); e-Prints Soton (1); University of South... (1); Research@StAndrews:... (1); ArchiMer - Institut... (1); Earth-prints Reposi... (1); Archive ouverte UNIGE (1); Ghent University Ac... (1)
392 datasets
I hoped that we could avoid this topic
It’s good that this is a pilot, because in this area the rules really should become clearer, also for other funders.
DIRECT costs primarily include personnel costs; costs of goods, works and services; depreciation costs of equipment, infrastructure and other assets; travel costs.
We ignore INDIRECT costs here – this slide is just to give you an impression of issues that have no clear answer yet.