#researchdata | Explore Tumblr posts and blogs

pittarchives · 5 years ago

Text

5 Tips for Preserving Your Data Long-Term

In celebration of World Digital Preservation Day 2020 on November 5, we’re sharing a series of posts by University of Pittsburgh Library System librarians and archivists that highlight their expertise and work to preserve the digital!

This post was written by Dominic Bordelon, Research Data Librarian

Like academics everywhere, at the University of Pittsburgh we hope to make valuable contributions to our fields through our publications, which we can expect will outlive us. More recently, thanks to new technological possibilities, we turn our attention to how other research outputs, such as data and software code, can also be stored for posterity.

How can you get started? Here are five tips from Pitt Libraries that you can begin using right away.

1. Use open file formats

Open file formats are those which are widely adopted, well documented, and unhindered by proprietary restrictions which monopolize the creation, editing, or reading of files. These are formats like CSV (comma-separated value) for tabular data, plain text for qualitative data (.txt), or PNG (Portable Network Graphics) for images. Proprietary formats tend to create a barrier to access and may even face obsolescence should the vendor go out of business. These factors have a negative influence on the probable longevity of the files’ contents.

For example, users of IBM’s SPSS statistical software will be familiar with .sav files for their data and analyses. However, .sav is a binary format rather than a character-based one, unreadable without special software (such as SPSS). Nor has IBM published official documentation for community use. Consider instead (or in addition) depositing a version of your data in CSV format, which should be easily readable to any future users.

To find out more about the preservability of the file formats you use, and to see which are recommended, you can see the Library of Congress’ Recommended Formats Statement.

"Keat takes notes" by geekcalendar is licensed under CC BY 2.0

2. Describe and annotate your dataset

In order for your data to be useful in the future, readers will need to be able to make sense of it. Data does not usually explain itself. What does the abbreviation in this column name mean? If an instrument was used to record your data, what model? What steps did you follow in your lab to run the experiment? The answers to these questions have important implications for researchers who want to replicate your study or integrate your data in a new study of their own.

There are several ways you can describe the important context around your data:

A detailed abstract in your data depository, and completion of all appropriate metadata fields

Data dictionaries and codebooks which describe column names and values

Documentation of your research protocols (perhaps with a tool like protocols.io)

3. For software, document your dependencies and computing environment

When you run code, it’s important to know what needs to be installed for it to work properly. Which version of Python did you use? If you used a library like Astropy in Python or osmdata in R in your analysis, what version of the library did you use? Without this information, it might be difficult—or even impossible—for future users to run your code, and for them to be confident that they are running it as intended. You can do this with a text file, but look also at tools like Docker (or the Dockter project for researchers specifically) to containerize and document your environment.

4. Deposit your data in a trustworthy repository

When choosing a data repository, consider how it is maintained and whether they seem to have plans for the future. You can find much of this information in their about pages. For example, is the repository run at a large research institution by a team of dedicated staff, or, at the other extreme of that spectrum, is it a lone researcher’s side project? Do you trust that the repository, or at least its owner, will exist in ten or twenty years? If run by a private company, does it seem well-established with many ties to the academic community? Do persistence and preservation seem to be high priorities for the repository? While other factors might affect one’s choice of repository, we should hold this sense of “trustworthiness” high on the list.

CoreTrustSeal is an organization that certifies research data repositories as trustworthy, i.e., apparently sustainable and stewardship-oriented. Checking their list of repositories is a safe bet. If the repository in question is not CoreTrustSeal certified, your local data librarians (for example, at Pitt, the ULS Digital Scholarship Services team and the HSLS Data Services team) can help you evaluate the repository.

"elephant ears." by brittanyhock is licensed under CC BY-NC 2.0

5. Dark archive your dataset in your institutional repository

Sharing is great, but preservation is important too. The practice of “dark archiving” is simply depositing material in a nonpublic repository, for purely preservationist purposes. If you are planning to share your data in an open repository, consider also investigating whether your institution has a repository where you could dark archive an additional copy. The idea is that, should the open repository eventually fail, the data could still be restored from the dark archive, and then pointers to the open deposit such as DOIs could be redirected to the restored copy.

Why dark? If your dataset is hosted in multiple places online, some users might find it confusing, especially without knowing any rationale. Intellectual property ownership may also be unclear. Furthermore, the user may reasonably wonder whether the two copies are truly identical.

Your institution may not advertise a “dark archive,” but look instead for your general institutional repository, such as Pitt’s D-Scholarship.

Let me know how these tips work for you. Happy preserving!

“5 Tips for Preserving Your Data Long-Term” by Dominic Bordelon is licensed under Creative Commons Attribution-ShareAlike 4.0 (https://creativecommons.org/licenses/by-sa/4.0/).

#WDPD2020 #researchdata #digital preservation #digitalscholarship #datasets #digital archives

4 notes · View notes

teamarcstechnologies · 2 years ago

Photo

Market research is a critical component of business decision-making, providing insights into consumer behavior, preferences, and opinions. However, collecting and analyzing large amounts of data can be time-consuming and complex, especially when dealing with diverse populations. This is where sample management platforms come in, offering a more streamlined and efficient approach to market research.

#market research #samplemanagement #researchdata #online research panel

0 notes

cdiscprogrammingservices · 2 years ago

Text

The Importance of CDISC Programming Services for Clinical Research

Clinical research plays a critical role in the development of new drugs and therapies. However, the process can be complex and time-consuming, especially when it comes to data management and analysis. That's where CDISC programming services come in. In this blog, we'll explore the importance of CDISC programming services for clinical research.

What is CDISC?

CDISC stands for Clinical Data Interchange Standards Consortium. It's a non-profit organization that develops and promotes data standards for clinical research. CDISC standards ensure that clinical data is consistent, transparent, and easily shareable across different research platforms. CDISC standards include:

Study Data Tabulation Model (SDTM): A standard format for organizing and presenting clinical trial data.

Analysis Data Model (ADaM): A standard format for analyzing clinical trial data.

Controlled Terminology (CT): A standard set of terms and definitions used in clinical research.

What are CDISC Programming Services?

CDISC programming services involve the use of CDISC standards to manage and analyze clinical trial data. CDISC programmers are experts in CDISC standards and use specialized software to implement these standards in clinical research. CDISC programming services include:

CDISC-compliant data conversion: CDISC programmers convert clinical data from various formats to SDTM or ADaM formats.

CDISC-compliant data mapping: CDISC programmers map clinical data to standardized CT terms.

CDISC-compliant data analysis: CDISC programmers analyze clinical data using CDISC-compliant software.

Why are CDISC Programming Services Important for Clinical Research?

CDISC programming services are essential for clinical research for several reasons:

Consistency: CDISC standards ensure that clinical data is consistent across different studies and platforms. This makes it easier to compare and combine data from different studies, leading to more reliable results.

Transparency: CDISC standards make it easy to understand and interpret clinical data. This promotes transparency in clinical research and helps build trust among stakeholders.

Efficiency: CDISC programming services save time and resources by automating data management and analysis. This allows researchers to focus on other critical aspects of clinical research.

Compliance: CDISC standards are required by regulatory authorities such as the FDA. CDISC programming services ensure that clinical data is compliant with these regulations.

In conclusion, CDISC programming services are essential for clinical research. They ensure consistency, transparency, efficiency, and compliance in clinical data management and analysis. CDISC programming services are a valuable investment for any organization conducting clinical research.

#CDISC #CDISCstandards #clinicalresearch #clinicaltrialdata #SDTM #ADaM #controlledterminology #datamanagement #dataanalysis #CDISCprogramming #CDISCprogrammingservices #researchdata #researchstandards #regulatorycompliance #transparency #efficiency #consistency #reliability #trustworthyresults

0 notes

romandavis-blog · 4 years ago

Link

Students working on their thesis are always on the lookout for various methods and guides to provide them with the most productive tools that allow them to produce quality work all the while saving them a lot of time. The internet is filled with guides and study material about how to write, conduct proper research, […]

#ResearchData for Thesis FreeTools Analyze

0 notes

acrazyyellowtheory-blog · 8 years ago

Photo

More #brushstrokes! These days is all about priming the canvas!!!! I can feel the inspiration coming after carnival! . . . #yellow #crazy #art #fashion #design #paintings #theory #life #love #energy #positive #acrazyyellowtheory #passion #sun #yellowart #abstraction #inspiration #researchdata #muse #photography #yellowstuff #yellowblog #painting #modernpainting #colorful (at South Bank London)

0 notes

prashantvmou-blog · 6 years ago

Text

Emergence of Research Data Literacy with Special Reference to India

Prashant Shrivastava

Dinesh K. Gupta

Affiliations

Department of Library and Information Science, Vardhman Mahaveer Open University, Kota – 324021, Rajasthan, India

DOI:

10.17821/srels/2019/v56i2/131578

Abstract

Preservation of research data is an important policy requirement for universities and research organizations so as to manage research data at organizational level and make it available lifelong. However, the importance research data management especially in the context of e-research and data intensive research is not widely recognized both by researchers and research organizations due to lack of research data literacy. In the Indian context, it is essential to formulate a national data sharing and accessibility policy so that Indian universities and research organizations can cope research data.

Keywords Research Data, Research Data Literacy, Research Data Management, Research Data Literacy-India

References

Achard, P., Ayris, P., Fdida, S., Graddmann, S. (2013). LERU roadmap for research data.

Albert, P., Alpi, K., Baxter, P., Brown, E., Corson-rikert, J., Hirtle, P. and Westbrooks, E. L. (2008). Digital research data curation: Overview of issues, current activities, and opportunities for the Cornell University Library . Cornell University Library. Retrieved from http://hdl.handle.net/1813/10903.

Averkamp, S., Xiaomei, G., and Rogers, B. (2014). Data management at the University of Iowa : A university libraries report on campus research data needs. https://doi.org/10.17077/t5j5-ry3i

Beitz, A., Dharmawardena, K. and Searle, S. (2012). Research data management strategy and strategic plan 2012-2015. Monash Research Committee.

Brewerton, G., Steering Committee and IT Services . (2015 Oct). Research data management : A case study. Ariadne, 74, 1–11.

CARL Data Management Sub-Committee. (2009). Research Data: Unseen opportunities. Retrieved from http://www.carl-abrc.ca/uploads/pdfs/ data_mgt_toolkit.pdf.

Clements, A. (2013). Research information meets research data management in the library? Insights: The UKSG Journal, 26(3), 298–304.

Courses: Research Data Management and Sharing. Retrieved from https://www.coursera.org/learn/data-management.

CSIR Central : Mandate. Retrieved from www.csircentral.net/ mandate.pdf.

Data guide. Retrieved from https://www.bu.edu/dev/data/ guides.

Defining Research. Retrieved from https://www.lib.ncsu.edu/ data-management/define.

Educause - Research Data Management Guidance. (2009). Michigan State University. Retrieved from http://net.educause.edu/ir/library/pdf/ers0908/rs/ers09087.pdf.

EPSRC expectations on Research Data Management. Retrieved from http://www.ucl.ac.uk/library/research-support/ research-data/policies/epsrc/guidance.

Gurria, A. (2007). OECD principles and guidelines for access to research data from public funding. Organisation for Economic Co-Operation and Development, USA.

Gurstein, M. (2011). Open data: Empowering the empowered or effective data use for everyone? First Monday, 16(2). https:// doi.org/10.5210/fm.v16i2.3316

Hattwig, D., Bussert, K., Medaille, A. & Burgess, J. (2013). Visual literacy standards in higher education: New opportunities for libraries and student learning. Libraries and Academy. 13(1), 61–89. https://doi.org/10.1353/pla.2013.0008

Hey, A., Tansley, S., & Tolle, K. M. (2009). The fourth paradigm: Data-intensive scientific discovery. Redmond, WA: Microsoft Research.

Johnsson, M., and Ahlfeldt, J. (2015). Research libraries and research data management within the humanities and social sciences. Lund University, Sweden.

Johnston, L. (2014). A workflow model for curating research data in the University of Minnesota Libraries: Report from the 2013 Data Curation Pilot, University of Minnesota, Digital Conservancy.

Lin, E. (2012). The responsible stewardship of research data : A roadmap for the University of California, Merced. University of California, Merced.

Managing research. Retrieved from https://www.monash.edu/ library/researchdata.

Mandinach, E. B., and Gummer, E. S. (2013). A systemic view of implementing data literacy in educator preparation. Educational Researcher, 42(1), 30–7. https://doi.org/10.3102/0013189X12459803

MANTRA: Research data management training. Retrieved from https://mantra.edina.ac.uk/.

Max, W., Howard, A., Lise, M., Brian, F., Shari, H., Lynch, H. et al. (2016). Research data management framework report, Te Pokai Tera University , New Zealand.

Nordenberg, M. (2009). A guidelines on research data management. University of Pittsburgh.

Research data policy. Retrieved from http://www.mopp.qut.edu.au/D/D_02_08.jsp.

Research data policy at Elsevier support services. Retrieved from https://www.elsevier.com/about/our-business/policies/ research-data.

Rice, R., Çuna, E., Jeff, H., Sarah, J., Stuart, L., Stuart, M., and Weir, T. (2013). Implementing the research data management policy: University of Edinburgh Roadmap. International Journal of Digital Curation, 8(2), 194–204. https://doi.org/10.2218/ijdc.v8i2.283

Ridsdale, C., Rothwell, J., Smit, M., Bliemel, M., et al. (2015). Strategies and best practices for data literacy education: Knowledge synthesis report.

Shearer, K., Argaez, D., and Swanson, M. (2010). Addressing the research data gap: A review of novel services for libraries. Retrieved from http://forskningsdata.deff.wikispaces.net.

Shrivastava, P., and Gupta, D. K. (2018). Research data preservation in India: An analysis based on research data registry. World Digital Library: An International Journal, 11(1).

Steen, L. (1999). Numeracy: The new literacy for a datadrenched society. Educational Leadership, 57(2), 8–13.

Trewhella, J. (2014). Research data management policy 2014, Trobe University.

University of Edinburgh data policy. Retrieved from https://www.ed.ac.uk/information-services/about/policiesand-regulations/research-data-policy.

Vahey, P., Yarnall, L., Patton, C., Zalles, D., Swan, K. (2006). Mathematizing middle school: Results from a crossdisciplinary study of data literacy. Annual Meeting of the American Educational Research Association, San Francisco, CA.

Woollard, M., and Corti, L. (2014). Case study 4: A national solution - the UK Data Service. Delivering Research Data Management Services: Fundamentals of good practice, Facet Publishing.

Wright, G., Abraham, S., and Shah, N. (2012). Open government data study: India. (2012).

DOI:

http://dx.doi.org/10.17821/srels%2F2019%2Fv56i2%2F131578

1 note · View note

aronasouris · 3 years ago

Video

tumblr

Gain access to top-notch primary research data with Real-Time Intelligence Solutions - BA Health

Borderless Access' BA Health provides a world-class synergy of primary #marketresearch, and #businessintelligence by capturing niche #businessinsights. Stay ahead of your competition in relevant #datacollection with our New-age #CIsolution and top-notch primary #researchdata.

Explore More: https://bit.ly/3oXTnyt

#Real Time Intelligence #Real Time Intelligence Solution #BA Health #Healthcare Primary Research #Healthcare Market Research #Primary Market Research

0 notes

sciedithub-services · 3 years ago

Text

importance of #citation in #researchpaper #researchpaperwriting #research #researchanddevelopment #researchpapers #researchers #researchstudy #reseauxsociaux #researchwriting #researchassistant #researchscholar #researchdata #researchimpact #researchmethods #researcharticle #researchanalyst #researchreport

#sciedithub.org #sciedithub #scientificediting #scieditserv #copy editing #thesis editing #dissertation editing #editing services #editingservices

0 notes

oliviamarket · 3 years ago

Photo

Visit: https://justpaste.it/44rto

Objectives of Data Processing in Market Research

Data Processing is worried about altering, coding, arranging, classifying and outlining, and charting research information. The substance of information handling in research is information decrease. There are many other services like Data Processing that come under Market Research like Survey Programming, Data Visualization, and Translation Services.

#market research #market research company #data mining #data analysis #data processing #data visualization #data science #survey programming #translation servcies #translation

0 notes

eliteprojects · 4 years ago

Text

Aquaponics: Is It More Efficient or Faster Than a Regular Soil Garden?

Over the years, population growth and increased urbanization worldwide have caused high-quality pieces of farming land to shrink. With space continuing to come down, more agriculturalists have looked for ways of boosting harvesting speedto keep up with the increasing international demand for food. Several farms do not have enough average crop yield per unit thanks to the seasonality and conventional growth rate of plants. To tackle this issue, several commercial farmers are exploring the scope of aquaponics in the form of a quicker, more efficient method of growing crops. Now, do crops grow so quicklythroughan aquaponics system in UAE that it allows farmers to give up soil gardening?

As per several pieces of research,plants grow more quickly with the system as compared to other methods of cultivation. Agriculturalists who use it have witnessed considerably higher crop yields, and they harvest more often in a year because of controlled environment aquaponics. Nevertheless, noaquaponics system in UAE or elsewhere is perfect, meaning it has its possible limitations.

Here, we will compare aquaponic plants and soil-grown plants based on growth as well as the factors with a direct impact on growth in an aquaponics system. For the uninitiated, this is a method of growing crops in the water utilized forthe cultivation of aquatic organisms, such as fish.Many ahydroponic system supplier in UAEoffersit and act as silent proponents of aquaponic gardening.

Soil Versus Aquaponics Growth Rate

As perpast researchdata and anecdotes, aquaponic crops grow more quickly as compared to soil-based crops as these can use a constant flow of water.The continuously circulating water in an aquaponics grow bed has nutrients from the aquatic animal wasteneeded for crop growth. Conversely, crops planted in soil require growers to use a chemical fertilizer generously and as per a schedule, meaning these plants cannot use nutrients round the clock.

Factors Impacts Plant Growth ThroughAn Aquaponics System

Understanding why crops grow more quickly in the system, necessitates examining the factors impacting the rate of growth related to it. Here are a few of those factors.

Nutrients

In this agricultural setup, fish waste is the source of the nutrients that are essential for crops to grow quickly. Fish waste acts as a quality source of phosphorous, nitrogen, calcium, magnesium, sulphur, and potassium.The system always makes the waste available to aquatic plants,so these often grow in a healthier and faster way.

Water

Despite using a fraction of the quantity of water that soil-based plants need, an aquaponics system supplies an abundance of nutrients to crops.It is important to consider temperature, pH level, light and some other factors for maintaining water quality while ensuring a quick growth rate.

Planting Design

It is an important factor to ensure that each plant gets nutrients in quantities that are enough for it to grow.Besides this, planting design can help make garden space as big as possible and boost production. In the planning stage, growers may draw a farm layout while consideringthe amount of space each crop would use according to its size. They may also seek the valuable inputs of a farm construction company in UAE to make this process easier and streamlined.

Final Words

There are many possible benefits to growing crops with an aquaponics system in UAE, including the better growth rate and the higher yield than its soil-based counterpart. Nevertheless, any reliable provider of agricultural construction service in UAE should not understate the potential pros of soil gardening while promoting their aquaponics system.

0 notes

exactlyloudcollector · 4 years ago

Photo

Wakefield Research: Data Value Scorecard Report Quantitative Research of Data Leaders on Data Value and ROI

Enterprises are pouring money into data management software – to the tune of $73 billion in 2020 – but are seeing very little return on their data investments. Data leaders are now under pressure to monetize their data investments and show a meaningful return on the investments (ROI) made.

Read More:- https://www.hqpubs.net/wakefield-research-data-value-scorecard-report-quantitative-research-of-data-leaders-on-data-value-and-roi-2/

#datavalue #researchdata #wakefield #software

0 notes

strathoa · 8 years ago

Text

Progress in the Digital Curation of Research Datasets

Part 2

Alan Slevin, Open Access and Research Data Manager, University of Strathclyde

This is Part 2 of a blog post on developments in the preservation and curation of datasets in the University.

After the end of our Archivematica project (discussed in Part 1) we have continued to develop our manual curation workflow, and work on automating steps where possible.

Initially we have had a fairly simple manual process in mind.

dataset upload to PURE

saving dataset to Archivematica transfer folder

processing through Archivematica (customised Archivematica workflow – format registry and administrator options)

extracting DIP from Archivematica storage area

re-upload to PURE

Checksum monitoring, handling large datasets, and the longer-term and ongoing preservation of files would also complicate the standard workflow.

Archivematica Ingest Workflow

These steps follow the routine deposit, editing, DOI minting and validation of the PURE dataset record and files.

The manual curation workflow has been perfected which involves removing datasets deposited in PURE and transferring them to the queue folder for Archivematica ingest. We process SIPs through the various micro-services and use the storage service to extract the DIP. The DIP is re-uploaded back into PURE. The AIP containing the DIP and SIP is stored. Fine tuning this approach has in the last few months led to two strands of technical development.

Checksums

The way dataset files are stored in PURE makes the monitoring of checksums difficult to achieve. In PURE, there is no indication in the checksum file of the dataset (or any file) to which they pertain. The file shows a checksum and a file-size but it’s not possible to determine which bit stream this applies to. A solution might involve storing all dataset checksums and file sizes for comparison with ingested datasets, but again this lack of curation friendliness and interoperability in PURE inhibits the ability to maintain and monitor checksums throughout the workflow. Unless we can get PURE to describe which .bin file relates to which bit stream, the datasets can’t be processed with their own checksums through Archivematica.

Ideally to operate a chain of custody for checksums we would start the monitoring process using the checksum as it originates at the first point in our workflow which is after deposit in PURE. As mentioned there is a basic problem in retrieving these checksums from PURE, so for the purposes of our current workflow we identified where the checksums were retrievable for the datasets transferred to the Archivematica transfer source folder. We can generate a checksum manually for the zip and include it in the transfer for Archivematica (and/or Archivematica will generate one at ingest). However, checksums are only really useful if they are being monitored for change. Leaving the checksums to be stored with the associated METS file and managed in Archivematica might be more practical than storing them with the re-uploaded file in PURE. Decisions on what level to record the checksum in terms of file granularity are still to be confirmed but storing along with the file would only really be useful if it was being monitored for changes. This is one of the drawbacks of our mainly non-interoperable infrastructure.

Dataset Transfer

How to get data from researchers after deposit in PURE into the Archivematica transfer source is an important aspect of our digital preservation workflow. Data is most at risk during transfer between different systems.

We are investigating ways to automate parts of the workflow and also to address issues relating to the reformatting of zip deposits after DIPs are created. For the former, the work developed at Lancaster on PUREE Gems which interrogate the xml exported by the PURE REST service offers the possibility of transferring dataset metadata and files without manual involvement to the transfer queue. A second Gem presents the metadata for ingest to Archivematica. We need to decide which Dublin Core and relational metadata should be transferred from PURE and what preservation and administrative metadata might be appended in a .csv file. We are engaging our local technical support team in this work with the aim to convert the RAILS coding used to create PUREE GEMS to C+ edge as this reflects local expertise and specialisms.

Another facet of technical development might involve reformatting the ingested DIP in the ZIP format in which most deposits are made. This could involve quite a detailed change to the software itself but there is also a possibility of holding all the original and normalised files in the AIP which is then stored.

The original intention was for normalised dissemination copies to be re-uploaded back to PURE. However this is a problem for zip files which is the main way that datasets are deposited by academics. While the contents of the zip file are normalised in by Archivematica, the normalised versions are not automatically re-zipped in the way that the files were originally organised. There is functionality with Archives (ICA AtoM) for reconstituting zip files from processed versions, but no equivalent as yet for the datasets pipeline. There has been discussions on pursuing this area as a modification within Archivematica.

An alternative approach could be to follow the Bentley Historical Library model in Archivematica 1.6. and therefore to minimise the development work in the short-term. This means any ingested zip and the normalised versions of files are all accessible in one zip file. This does sound attractive for our purposes considering we are getting a lot of proprietary files which are untouched by Archivematica. This zip file contains the contents of the objects directory of the AIP, and has all the original files and any normalised versions of those files in them. So in this way where no normalisation takes place the original file is available, although it has been through Archivematica virus-checking, check-summing and file characterisation processes. Also, any normalised versions are also present including normalised versions of the files contained in the original zip.

Metadata

Archivematica provides multiple ways to get metadata into the AIP METS profile, for example the PREMIS rights form for entry during ingest. We need to explore the possibility of using the Archivematica metadata fields as a supplement to the PURE metadata record in the short-term, and/or explore ways to parse and report based on the METS file. If we are able to transfer files using the new PURE REST services we could augment DC metadata from PURE with administrative and preservation metadata from Archivematica in an attached .csv file.

The AIP itself contains (in addition to the ingested digital objects) preservation versions of the objects, submission documentation, logs, and the METS file. As described, in the absence of a solution on reformatting normalised zip files for access we have considered offering this zip file for access, a capability now possible in the latest version (1.6) of Archivematica.

Archivematica uses a Format Policy Registry (FPR) to provide rules on how particular file formats are processed. While the normalisation rules used in the FPR were generally in line with our own data deposit policy we should aim to list preferred, supported and unsupported formats through our own tailored FPR. In so doing we should aim to customise Archivematica throughout its different sub-services in order to identify files using the most suitable tool; characterise, validate and normalise formats appropriately, and store relevant PREMIS (events) metadata important to the understanding of the dataset.

During testing and in noting the lack of reporting functionality, our attention was drawn to a new Archivematica fixity app which might merit further investigation for our workflow. This app is able to monitor the checksums in archival storage. If a DIP in PURE becomes corrupted and we need to replace it, we can re-ingest the AIP for the purpose of DIP generation. In the absence of intelligent object storage which might do this job and the fact that Archivematica itself does not monitor ongoing file integrity, such a tool could prove to be very useful if applied properly. During the project Artefactual invited us to write up our own requirements in this area for possible development recognising that each HEI’s workflow is likely to be different.

PURE User’s Future?

While all these developments continue we must keep a close eye on the broader picture with the prospect of a systems review in the near future. We might continue to use PURE in the same way. Considering Elsevier’s apparent focus now on Mendeley for Data (and integrating it with PURE) their continued support for the datasets module in PURE, and particularly in its further development, is questionable. So we need to consider the implications of a PURE-Mendeley-DANS workflow – are all the Data Lifecycle stages covered, what are the cost and workflow implications, how might researchers respond to a two-way deposit process? Then of course we are waiting for practical and interoperable solutions from the JISC Shared Services project or at least useful ideas which might be incorporated within our existing network of supporting tools.

The priority is to promote engagement throughout the data lifecycle and to provide systems which are easy for the academic to use and which address the main funder mandated requirements. At the dataset side we can respond quickly to DOI requirements for datasets via the datasets module in PURE and provide a repository for retained data and metadata. Our secure and duplicated storage safeguards data while our work with Archivematica while still in development, provides evidence that we are being alive in addressing the requirements for long-term preservation and active curation.

Longer term developments in PURE and local infrastructure have the potential to affect interoperability issues, while development work involving Archivematica users and the wider Digital Preservation community, who may also use PURE and associated tools, is continuing. We should get used to working with Archivematica in our existing workflow with real datasets while monitoring the broader picture. A fully fledged RDM solution is likely to consist of a variety of different systems performing different functions within the workflow; Archivematica will fit well into this modular architecture.

We have already met impediments to local development centred on the availability of local IT support and so progress is hard to gauge at this time. Also, work with Artefactual to customise the software would be on a different contract from our current maintenance agreement. Fortunately the parallel work in different JISC projects promises some relevant developments and we will stay involved through the Archivematica User Group.

While we have much customisation of Archivematica ahead and more to discover about the suitability of its many processes to our workflow, we can say at this stage that it provides the preservation workflow tools to support our Research Data Deposit Policy and our curation and preservation strategies. It allows us to set and use our storage pathways appropriately and fills a gap in the current services supporting the RDM data lifecycle.

The key requirements for further development are better interoperability if PURE remains the point of data deposit and further customisation of the system to ensure a good balance, depending on the file formats in question, between the effectiveness of the automated curation processes and the human intervention which is still crucial at key decision points in the workflow.

Mediation and Data Quality Issues

There are other soft or socio-technical issues to do with dataset deposit and data quality checking which must be factored into workflow decisions. These data management routines would take place at the initial deposit stage but will affect what is sent to Archivematica.

Depositor agreement – can we generate a click-through agreement? If not, do we need some signed paperwork from the academic? This is being looked at in the PURE user group.

Notwithstanding the Legal Requirements metadata in PURE, the administrator should refer to the DMP in checking on the disclosure of personal data.

Granularity: when do we suggest files should be organised as collections with different (child) templates created according to particular themes? This might be related to individual publications related to datasets.

Applying pre-ingest checks in a self-deposit workflow continues to be an issue. How stringent should we be on ‘quality’ checks e.g. folder names, (lack of) documentation, file structure evident? (with or without a DMP being available). Having a fairly rigorous approach to this stage probably mean contacting the depositor in most cases. The reuse potential of the data may be factor in this area.

Particularly with the last point here on documenting and checking datasets, we are aware that while Archivematica assesses, virus checks, characterises and where appropriate (according to the Format Registry) normalises files, the actual organisation and naming of files and folders, the structure of zip files, the existence of accompanying documentation to promote data reuse and replicability is a ‘human’ area of checking which is not generally covered by curatorial administrators. Our efforts in promoting best practice among researchers in these areas with training and good pre-ingest guidance is the main technique while self-deposit of datasets remains the norm.

In summary, we believe that we have made progress in this area and therefore in addressing funder expectations in the preservation and curation of research datasets to ensure their long-term accessibility. However, there are obvious gaps in the interoperability of systems and our own processes which need to be overcome in order to provide services which are truly reflective of OAIS best practice. We are operating in a limited technical environment in terms of the tools available and the technical support at hand, but we have developed a number of steps which will ensure that the datasets deposited here are preserved for longer term access. While we are still operating on the basis that PURE remains the point of self-deposit, these limitations will remain. The Elsevier proposal for using Mendeley Data as a more fully functioning and integrated Data Repository linked to PURE (which will retain the data registry role for compliance and reporting) might change this dynamic. However, it seems safe to assume that this decision – in tandem with the JISC Shared Services work – should follow on from a comprehensive evaluation of the interlinking systems available as this area of RDM continues to develop.

#digitalcuration #digitalpreservation #archivematica #researchdata

1 note · View note