KSU Digital Preservation Workflow: A case study in workflow adaptation

Introduction

For my final project for Digital Preservation at the University of Alabama, I chose to perform a digital preservation workflow on a set of personal files. These files were created over the past three years during my work as a music engraver and editor. The collection spans a variety of formats: primarily music engraving files (proprietary SIB and DORICO files) and PDFs, but also text, image, audio, and video files. I chose this collection for its heterogeneity, which makes it an excellent case study in how a typical collection of personal or professional files may be preserved using a digital preservation workflow.

Choosing the Workflow

Because this file collection was originally intended for public consumption in a digital music library, it was important that the chosen workflow end with the uploading of files and metadata to a publicly accessible repository. It was also important that the workflow be adaptable, so that parts of the process that are non-essential to my purposes could be removed, and parts that are not well suited to my purposes could be altered.

In light of these requirements, I chose the digital preservation workflow of Kennesaw State University (KSU). I chose this workflow due to its relative linearity and simplicity, and its focus on delivering an accessible collection of derivative files to its patrons. This workflow is presented in the form of a chart in Figure 1. Many of the individual processes in the workflow may be done simultaneously or skipped altogether, making it especially suitable for the purposes of this project.1

However, it is important to note that this workflow was developed for the KSU Digital Archives, and it was thus intended to ultimately furnish a different kind of repository than the one I intended to use. This necessitated various alterations to better suit the workflow to the context of a digital library. I will discuss these changes in detail as I outline the process of implementing this workflow.

Figure 1

The Preservation Process

I began the workflow by locating the folder on my computer that contained the files I wanted to preserve. Because the files already existed on my computer, there was no need to write-block the media. To begin working on these files, I needed to transfer the directory to a different, temporary location on my SSD. This transferred directory is what would ultimately become the Archival Information Package (AIP). The transfer was accomplished using the tool identified in the workflow: Bagger, which uses the BagIt protocol to package and move files from one location to another along with accompanying fixity and descriptive data. The package that Bagger creates, and that I worked with for the remainder of this project, will be referred to as a "bag" for the rest of this report. After creating the bag, associating basic provenance metadata with it, and transferring it to the temporary storage location, I validated the bag in Bagger to ensure that the checksums of its contents matched those of the originals.
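The fixity mechanism at the heart of BagIt can be sketched in a few lines of Python using only the standard library. This is not Bagger itself, just an illustration of the idea: record a checksum for every payload file in a manifest, then later re-hash the files and compare. All function names and paths here are my own, not part of any tool.

```python
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Stream a file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(bag_dir: Path) -> None:
    """Record a checksum for every payload file, as a BagIt manifest does."""
    lines = [f"{sha256(p)}  {p.relative_to(bag_dir)}"
             for p in sorted(bag_dir.rglob("*"))
             if p.is_file() and p.name != "manifest-sha256.txt"]
    (bag_dir / "manifest-sha256.txt").write_text("\n".join(lines) + "\n")

def validate(bag_dir: Path) -> bool:
    """Re-hash every listed file; any mismatch means the bag failed fixity."""
    for line in (bag_dir / "manifest-sha256.txt").read_text().splitlines():
        digest, rel = line.split("  ", 1)
        if sha256(bag_dir / rel) != digest:
            return False
    return True
```

Validation fails if even a single bit of a listed file changes, which is exactly the guarantee Bagger's validation step provided here.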

With the bag successfully validated, the next step in the workflow was to perform a virus scan over the bag. If any infected files were found, they would need to be quarantined apart from the rest of the bag's contents. To scan the bag, I used BitDefender, the proprietary anti-virus software I use to protect my PC from digital threats. Because it provides an option to scan a single directory, it was well suited to this workflow. After scanning the directory, I located the XML output file and copied it into the bag. No infected files were found, so no quarantining was necessary.
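Had any files been flagged, the quarantining step could look something like the following sketch: move each flagged file out of the bag into a separate quarantine directory, preserving its relative path so provenance is not lost. This is a hypothetical helper of my own, not part of BitDefender or the KSU workflow.

```python
import shutil
from pathlib import Path

def quarantine(infected, bag_dir, quarantine_dir):
    """Move flagged files out of the bag, preserving their relative paths."""
    bag_dir, quarantine_dir = Path(bag_dir), Path(quarantine_dir)
    for p in infected:
        p = Path(p)
        dest = quarantine_dir / p.relative_to(bag_dir)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(p), str(dest))
```

Keeping the relative path inside the quarantine directory makes it possible to restore a file to its original location if it is later cleared.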

Since no viral threats were identified, it was safe to move on to the next stage in the workflow, which was to run JHOVE and DROID over the content in the bag. This needed to be done to identify the significant properties and the file formats of the included files, respectively. As I was already familiar with these tools, it was simple to perform these tasks and export their respective reports. These reports were then added to the bag along with the other files.
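Format identification tools like DROID work by matching byte signatures ("magic numbers") against a registry such as PRONOM. A toy version of that idea, using a handful of well-known signatures rather than DROID's full registry, looks like this (the `SIGNATURES` table and `identify` function are my own illustration):

```python
# A tiny, hypothetical signature table; DROID consults PRONOM's
# registry of many hundreds of signatures instead.
SIGNATURES = {
    b"%PDF": "PDF",
    b"\xff\xd8\xff": "JPEG",
    b"II*\x00": "TIFF (little-endian)",
    b"MM\x00*": "TIFF (big-endian)",
    b"ID3": "MP3 (with ID3 tag)",
}

def identify(path) -> str:
    """Match the first bytes of a file against known format signatures."""
    with open(path, "rb") as f:
        header = f.read(8)
    for magic, fmt in SIGNATURES.items():
        if header.startswith(magic):
            return fmt
    return "unknown"
```

Identification by signature rather than by file extension matters for preservation, since extensions can be wrong or missing while the bytes themselves rarely lie.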

The next processes in the workflow were to find personally identifiable information and duplicate files. The first of these tasks can be achieved with a tool called BulkExtractor, an open-source program designed to identify email addresses, phone numbers, social security numbers, and other information that individuals may not want made publicly accessible. I ultimately deemed this step unnecessary because, as the original creator of the files, I know that they were created for public consumption. As such, they do not contain any information that I did not want released to the public.

Next, I performed de-duplication of the files in the bag. For this step, I used proprietary software called TreeSize, which is available as freeware or as a paid program. The free version allows users to see where data is located in their storage and to create helpful charts visualizing it, but this was not sufficient for my purposes. I therefore signed up for a free trial of the paid version, which includes a file de-duplication tool. With this tool, I scanned the bag for files with the same digital fingerprint, that is, files identical at the bit level. No duplicate files were found, so I moved on to the next stage of the workflow.
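De-duplication at the bit level amounts to hashing every file and grouping files that share a digest. A minimal sketch of what TreeSize's de-duplication scan does (the function name is mine, and TreeSize's actual implementation is of course more sophisticated):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root):
    """Group files by SHA-256 digest; any group with more than one
    member is a set of bit-identical duplicates."""
    groups = defaultdict(list)
    for p in Path(root).rglob("*"):
        if p.is_file():
            groups[hashlib.sha256(p.read_bytes()).hexdigest()].append(p)
    return [paths for paths in groups.values() if len(paths) > 1]
```

An empty result, as in my case, means every file in the bag is unique at the bit level.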

The final step in creating the AIP was to migrate the contents of the bag to their target archival formats. As can be seen in Table 1, most of the files were already in their target formats. Ideally, the JPG, MP3, and MP4 files would have been created in a higher-quality format; since they were not, migrating them to a higher-quality format such as TIF (images) or WAV (audio) would make little difference, because the quality lost in the original lossy encoding cannot be recovered.

Original               Target (Archival)    Target (Dissemination)
PDF (any version)      PDF/A 1.6            PDF/A 1.6
JPG                    JPG
TIF                    TIF
MP3                    MP3
WAV                    WAV
MP4                    MP4
GABC                   GABC
DORICO (any version)   DORICO 3.5
SIB (any version)      SIB 2021.2

Table 1

Therefore, the only files in need of migration were the SIB, DORICO, and PDF files. The SIB and DORICO files were migrated using their respective source software. Each file was opened in the program that created it, and, as long as the program was up-to-date, the file was automatically converted to the most recent version. To migrate the PDF files, I used the “standards” tool in Adobe Acrobat to convert each PDF to the PDF/A standard. This is an archival standard for PDFs designed to ensure that each element of the PDF remains stable across software and devices. After this process was completed for all of the necessary files, the AIP was complete. It was then transferred to a permanent storage location via Bagger.

The final step in the process was to upload access derivatives with their respective descriptive metadata to an access repository. This step is vitally important to any digital preservation workflow, as preservation means little without the promise of access. It was in this step that my workflow took the biggest departure from the KSU workflow, because the arrangement and descriptive requirements of archival materials are quite different from those of bibliographic materials. My goal in this project was to provide access to my musical editions as bibliographic items, not archival items. They need not be treated as evidence of the activities of the creator, nor do they need to be arranged and described as such. Therefore, I decided to describe them as musical works, with properties (with custom labels) such as "Composer," "Title," "Editor," and "Composition Date(s)." I determined that these access points would best meet the access expectations of prospective patrons. Any other method of description would likely hinder access to the collection, thereby undermining my preservation efforts.

This digital library was created on my personal website, using an installation of Omeka S. The repository is accessible via editions.davidroby.org.

Conclusion

The KSU digital preservation workflow showed great resiliency when challenged by a new context outside of archival practice. Only a few parts of the process needed to be changed to suit a new end goal. After completing the KSU workflow, I had a completed AIP and a populated digital library. With my digital items successfully preserved in the AIP, I can be confident that if the contents of the digital library ever become corrupted, I can create new access derivatives from the files in the AIP. This guarantees access to the resources for as long as the contents of the AIP are maintained.

Ultimately, a digital preservation workflow should always support access, if at all possible. The KSU digital preservation workflow lends itself well to adaptation, and it excels at supporting the accessibility of digital resources, regardless of the kind of repository that ultimately hosts them.


  1. Helms, Alissa Matheny. "Digital Preservation Workflows and Reference Models." Lecture presented at the Class Meeting for Week 10, The University of Alabama, March 18, 2021.
