After receiving an increasing number of enquiries about the new Engineering and Physical Sciences Research Council (EPSRC) guidelines, which come into effect in May 2015, I decided to catch up with Kevin Ashley, Director of the Digital Curation Centre; my colleagues Rory McNicholl and Timothy Miles-Board; and Matthew Addis, CTO at Arkivum, to get a better understanding of what the requirements are and how institutions can meet them.
Q When looking into the EPSRC guidelines on research funding, I couldn’t help but notice that they are not as tightly defined as I would have expected. Is that deliberate?
KA: I think it is, and there are perfectly valid reasons for it, but some are uncomfortable with that. The flexibility allows for different responses from larger and smaller institutions – you just need to be able to defend what you do.
MA: This is largely due to the variety of research projects and their differing objectives, both in terms of brief and data output. Some might have a mandate to be publicly accessible first; others will focus on the safety and security of the research data before being concerned about dissemination of research findings.
Q There is a requirement to include an ‘access statement’ in published research papers and for underlying data to be available for scrutiny by others. How can that be achieved?
KA: Putting the data somewhere that has permanence is key, inside or outside the institution. Fundamentally it isn’t difficult. Many researchers aren’t used to doing this yet, but there is a lot of evidence that whenever papers include links to the data behind them the research gets more widely cited. This is something that benefits the researchers, the research funders and the universities so there are incentives all round.
MA: DOIs are a great way to reference data so it’s more easily accessible; for example, a DOI might resolve to a link to an EPrints page which in turn links to the underlying data in storage. But it’s important that these links are persistent and don’t break, causing ‘404 errors’.
RM: The DataCite plugin we developed ensures that Digital Object Identifiers (DOIs) are created, and it offers a degree of flexibility: it can either create DOIs automatically or allow depositors to choose when a data set should receive one. It also has ‘sense check’ capabilities.
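As an illustration only, the kind of ‘sense check’ and DOI-to-link resolution discussed above might look like the following minimal sketch. It is not the DataCite plugin’s code, and the interview does not say what the plugin actually checks: the function names, the pattern used to judge whether a string looks like a DOI, and the example identifier are all assumptions made for this example. The only external fact relied on is that DOIs are resolved through the public resolver at https://doi.org/.

```python
import re
from urllib.parse import quote

# Assumed pattern: "10.", a 4-9 digit registrant prefix, "/", then a
# non-empty suffix. A common heuristic, not an official validation rule.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def looks_like_doi(value: str) -> bool:
    """Hypothetical sense check: does the string have the shape of a DOI?"""
    return DOI_PATTERN.fullmatch(value) is not None


def doi_to_url(doi: str) -> str:
    """Build the resolver link a paper's access statement could cite."""
    if not looks_like_doi(doi):
        raise ValueError(f"not a plausible DOI: {doi!r}")
    # Percent-encode unusual characters but keep the "/" separator.
    return "https://doi.org/" + quote(doi, safe="/")


# "10.1234/example-dataset" is a made-up identifier for illustration.
print(doi_to_url("10.1234/example-dataset"))
```

A check like this only confirms that an identifier is well formed. Whether the link still resolves to the data is a separate matter that depends on the repository keeping its landing pages in place.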
Q There is an expectation aimed at institutional level to ensure policies and processes are in place and maintained to meet the EPSRC guidelines. How can institutions ensure they are doing the right thing?
KA: This is where EPSRC differs from other funders: it places a duty on the university rather than the researcher, making clear that it is the university’s job to support researchers dealing with research data. It’s only been in the last four years that universities have begun producing policies, with Edinburgh, Oxford and Hertfordshire being amongst the first to publish theirs. The main difference we are seeing is between an ‘aspirational policy’, such as Edinburgh’s, which then requires further processes and services to be added later on, and a more mandatory policy, which prescribes everything at the outset. Both approaches can work, and which to adopt depends on the individual institution, its structure and culture.
TMB: Monitoring access helps institutions understand the impact of data sets and also informs the data retention policy. The latest version of IRstats, for example, has a pretty robust API which can be used to analyse the EPrints logs and produce useful statistics. One of our customers, the London School of Hygiene and Tropical Medicine (LSHTM), uses it internally to show the value of making information available, although in their case this is for their publications repository rather than their research data repository.
MA: The University of Bristol has produced a case study highlighting key areas which benefitted from establishing policies, procedures and internal awareness – increased data sharing, improved RDM skills and awareness, better funding applications, and improved efficiency and cost savings – with an increase in grant funding being just one of the compelling figures.
Q There is a big focus on ‘metadata for discovery’. Why is that important?
KA: Because if you can’t discover that data exists you are unlikely to reuse it, and reuse is the ultimate goal. This is another area where some people are requesting more prescriptive guidelines than exist at present. It’s not about following ONE prescribed standard but being able to ‘defend’ the standard you chose if challenged. Many scientific disciplines already have well-established standards and it wouldn’t be feasible to impose a one-size-fits-all.
RM: The EPrints Re-collect plugin, developed by the University of Essex and Jisc, produced a so-called metadata profile. Together with the University of East London, we extended the metadata that is collected to include pointers to the research publications that use the research data, which again helps improve and evidence the benefits of sharing data and making it discoverable.
MA: Access supports repeatable and verifiable science, meaning research results derived from data can be scrutinised. But this isn’t just about descriptive metadata – it includes knowing what formats data is in and that these formats are usable. Arkivum now has an integration with Archivematica, a tool that will automatically do file format analysis, metadata extraction and format conversions.
Q Data needs to be preserved for a minimum of 10 years from last access or creation, which seems a very long time.
KA: This is certainly the one that seems to worry most of the IT managers. It strikes me as a fear of the unknown – how much storage will this require? – but the DCC have developed tools to make it easier to estimate the scale of the problem. It’s important to understand the EPSRC doesn’t expect all data from all working versions of research projects to be preserved. The DCC produced guidance on choosing what to keep. Comparing your research data storage needs vs all other ‘business as usual’ data is a valuable exercise; the former can be a fraction of the latter.
MA: It’s being able to ensure that if the data is being used by people, i.e. it’s useful, then it remains available to the community. With the Arkivum service we support long-term access to data, including ongoing data integrity and authenticity with predictable costs.
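To make the ‘10 years from last access or creation’ rule concrete, here is a small sketch of how an earliest review date could be worked out. It reflects one reading of the requirement as summarised in the question above (the clock restarts whenever the data is accessed), not the wording of the EPSRC policy itself. The function names and the handling of 29 February are assumptions made for this example.

```python
from datetime import date
from typing import Optional


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 Feb to 28 Feb if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year.
        return d.replace(year=d.year + years, day=28)


def retention_end(created: date,
                  last_access: Optional[date] = None,
                  years: int = 10) -> date:
    """Earliest date the data could be reviewed for disposal.

    Assumption: the period runs from whichever is later, creation or
    the most recent access.
    """
    start = max(created, last_access) if last_access else created
    return add_years(start, years)


# Never accessed: ten years from creation.
print(retention_end(date(2015, 5, 1)))                     # 2025-05-01
# Accessed in 2018: the ten-year period restarts from that access.
print(retention_end(date(2015, 5, 1), date(2018, 3, 10)))  # 2028-03-10
```

Under this reading, data that keeps being used never reaches its review date, which is why access monitoring feeds directly into retention decisions.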
Q How can effective data curation throughout the full data lifecycle, as required by EPSRC, be achieved?
KA: Although this is likely to be the most difficult area in which institutions can be 100% compliant from the outset, most are already doing something towards meeting this requirement. The key consideration is to have processes and support in place to ensure that data curation issues are being considered and addressed at the outset of a research project, rather than once the research has concluded.
MA: Curation is about the usability of data, especially by those who didn’t create it in the first place. Much of this is simply good research practice and should be a normal part of doing research. But the job never stops, especially when dealing with the challenges of ongoing digital preservation. This is where services from ULCC, Arkivum and others can really help. Researchers can get on with doing what they are good at – research – and they don’t need to be experts at digital preservation. Institutions still take responsibility, but delegate execution to service providers, and can use a range of ways to establish trust, for example reputation, due diligence, ISO 27001, Janet Frameworks, ISO 9000 and so on.