BagIt

From CollectiveAccess Documentation
Revision as of 19:45, 6 November 2018 by Julia (talk | contribs) (Configuration)

See also: BagIt support notes

About BagIt

BagIt is a standard for storage and transfer of arbitrarily structured digital content. As defined by the BagIt standard, a "bag" consists of a "payload" of one or more content files and "tags" (metadata files) documenting the bag. Every "bag" contains a data directory holding the "payload" and a tag file that provides a manifest listing every file in the "payload" together with its checksum.

Its support for flexible payloads and the required inclusion of arbitrary metadata and checksums for data verification make BagIt well-suited for use in archival and digital preservation contexts.
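To make the bag layout described above concrete, the following is a minimal Python sketch that writes a bag by hand and then verifies its checksums. It is not how CollectiveAccess builds bags. The function names write_bag and verify_bag are hypothetical, and SHA-256 is an assumed choice of checksum algorithm. The file names bagit.txt, manifest-sha256.txt and the data/ directory follow the BagIt standard.

```python
import hashlib
from pathlib import Path


def write_bag(bag_dir: Path, payload: dict[str, bytes]) -> None:
    """Write a minimal bag: bagit.txt, a data/ payload and a SHA-256 manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel_path, content in sorted(payload.items()):
        target = data_dir / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest paths are relative to the bag root and use forward slashes.
        manifest_lines.append(f"{digest}  data/{rel_path}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def verify_bag(bag_dir: Path) -> list[str]:
    """Return the payload paths that are missing or whose checksum differs."""
    failures = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        file_path = bag_dir / rel_path
        if not file_path.is_file():
            failures.append(rel_path)
            continue
        actual = hashlib.sha256(file_path.read_bytes()).hexdigest()
        if actual != expected:
            failures.append(rel_path)
    return failures
```

A receiving system could run a check like verify_bag after transfer. An empty list means every payload file matches the manifest.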

BagIt support in CollectiveAccess

CollectiveAccess supports generation of BagIt files for any record, set of records or record hierarchy. Representation media may be included and optionally filtered on representation type, relationship type (where available), primary/non-primary status and/or version. Media attached to records using "media" metadata elements may be included and optionally filtered by metadata element, media version and other metadata values when the media element is part of a container. Files attached using "file" metadata elements may be included and optionally filtered by metadata element and other metadata values when the file element is part of a container.

Metadata from included records may be exported using any available export mapping.

When exporting a hierarchy of records, files included in the BagIt “payload” may be structured in a directory structure mirroring the record hierarchy.
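As an illustration of what a payload that mirrors the record hierarchy could look like, here is a small Python sketch. It only demonstrates the idea. The function payload_path, the parents and names mappings and the sample identifiers are all hypothetical, and the directory names CollectiveAccess actually generates may differ.

```python
def payload_path(record_id, parents, names, filename):
    """Build a payload path under data/ that follows the record's ancestry.

    parents maps a record id to its parent id (top-level records are absent);
    names maps a record id to the directory name used for that record.
    """
    parts = []
    seen = set()
    current = record_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at record {current!r}")
        seen.add(current)
        parts.append(names[current])
        current = parents.get(current)
    # parts runs from the record up to the root, so reverse it for the path.
    return "/".join(["data", *reversed(parts), filename])
```

With a fonds containing a series containing a file-level record, a scan attached to the file-level record would land three directories deep under data/.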

All CollectiveAccess BagIt output is serialized as either a ZIP file or a gzipped TAR file.

BagIt output may be created automatically when a CollectiveAccess record is created or changed, or manually during an export of selected records.

CollectiveAccess will support transmission of BagIt output to a range of targets. Target types will include:

  • Direct download from within CollectiveAccess to a user's local machine.
  • Locally mounted file systems (e.g. a local directory on the server, or a file server mounted on the server).
  • Remote file stores such as Dropbox, Amazon S3, SFTP, LOCKSS or GoogleAPI.

Configuration

The BagIt workflow in CollectiveAccess is configured in the file external_exports.conf located in /app/conf. This file contains settings for BagIt targets, outputs, options and more.

Here we'll walk through each part of the file and the parameters for each setting. To start, we must configure a target. Multiple targets may be configured within a single CollectiveAccess system. Let's begin by setting up a custom BagIt export that packages an EAD XML finding aid along with the collection's related media assets. First, under targets, we set the preliminary details:

targets = {
    ead_collections = {  
    
        label = EAD BagIt export to server,
        table = ca_collections,
        restrictToTypes = [],
        destination = {
            type = sftp,        
            hostname = 192.168.6.4,
            user = seth,
            password = a_password_goes_here,
            path = /data/exports   
        },

Within targets we've created a rule set called ead_collections. This is an arbitrary name for the configuration (that should contain no spaces or special characters). The name for the export that catalogers will see in CollectiveAccess is set via label (here we've called it "EAD BagIt export to server").

Next we set the CollectiveAccess table that is targeted by the export, in this case ca_collections. If desired, we can use the restrictToTypes setting to limit the BagIt option so that it only appears for certain record types within the ca_collections table. This setting can also be left empty, as shown above.

In destination we set where the "bag" will be exported to. Types that will be supported include sftp, path, dropbox and amazons3. In our example we've configured an sftp connection and have included the hostname, user, password and path for that connection.

triggers = { save, periodic = 1d },

Once the destination and basic target settings have been configured, we must determine when the "bags" will be generated. Above we've set the export to run whenever the record is saved (save) and also periodically, once a day (periodic = 1d).

output = {
            format = BagIt, 
            name = collection_^ca_collections.idno,
            content = {
                collection_data.xml = {
                    type = export,
                    exporter = ead_exporter  
                },
                media/ = {
                    type = file,
                    relativeTo = ca_objects,
                    restrictToTypes = [],
                    restrictToMimeTypes = [image/*],
                    files = { 
                        ca_object_representations.media.original = {
                            delimiter = .,
                            components = {^original_basename, "mymedia", ^extension }
                        },
                        ca_objects.install_instruction.original = ^original_filename
                    }
                }
            },

In the output section of the configuration we can customize the details of the "bag". After setting the format to BagIt, we can structure the filename for the export. In the example above the filename of the "bag" is set to the word collection, followed by an underscore, followed by the CollectiveAccess idno of the collection record from which the "bag" was generated. The format used here for the idno, ^ca_collections.idno, is display template syntax, which is described in more detail in the display template documentation.
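The effect of a name such as collection_^ca_collections.idno can be sketched in Python as a simple placeholder substitution. This is a deliberately simplified illustration, not the CollectiveAccess template engine, which supports far more than plain substitution. The function expand_template and the sample idno value are hypothetical.

```python
import re

# Matches ^ followed by a dotted identifier such as ca_collections.idno.
PLACEHOLDER = re.compile(r"\^([A-Za-z0-9_]+(?:\.[A-Za-z0-9_]+)*)")


def expand_template(template: str, values: dict[str, str]) -> str:
    """Replace each ^placeholder with its value; unknown placeholders stay as-is."""
    return PLACEHOLDER.sub(
        lambda match: values.get(match.group(1), match.group(0)), template
    )
```

For a collection whose idno is 2018.14, this sketch would turn the configured name into collection_2018.14.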

The EAD XML file we wish to generate within the "bag" requires a CollectiveAccess export mapping. This mapping tells the system how to structure the data. Data can be packaged in a variety of formats using the export mapping framework, including XML, MARC21 and CSV. After a data export mapping is uploaded to the CollectiveAccess system, its code can be referenced via the exporter setting within content.

Under media/ we can set the parameters for the related media files that will be included in the "bag". In the example above we are exporting the images (of all formats, e.g. jpeg, tif) attached to the objects that are related to the file level of the collection. Using restrictToMimeTypes we can be more or less specific about the media formats we wish to include.

In the files section we can set the media versions we wish to export and the associated filenames. Options that can be set in the components section include: original_filename, original_basename (filename without extension), filename, basename, extension and id.
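The Python sketch below shows one plausible reading of the delimiter and components settings: each component is either a literal string or a ^ placeholder, and the resolved parts are joined with the delimiter. That joining behavior is an assumption drawn from the example configuration, not confirmed CollectiveAccess behavior. The helpers file_tokens and build_filename are hypothetical, and only three of the available placeholders are modeled.

```python
from pathlib import PurePosixPath


def file_tokens(original_filename: str) -> dict[str, str]:
    """Derive a few placeholder values from an uploaded file's original name."""
    path = PurePosixPath(original_filename)
    return {
        "original_filename": path.name,
        "original_basename": path.stem,  # filename without extension
        "extension": path.suffix.lstrip("."),
    }


def build_filename(components: list[str], delimiter: str, tokens: dict[str, str]) -> str:
    """Resolve ^placeholders, keep literals unchanged, and join with the delimiter."""
    parts = []
    for component in components:
        if component.startswith("^"):
            parts.append(tokens[component[1:]])
        else:
            parts.append(component)
    return delimiter.join(parts)
```

Under this reading, the example configuration would export a file originally named poster.tif as poster.mymedia.tif.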

The ca_objects.install_instruction.original configuration above shows an example of the syntax used when targeting CollectiveAccess media stored as metadata (rather than media stored as a representation).

Finally we have the options section:

options = {
                allowFetch = 1,
                bag-info-data = {
                    title = ^ca_collections.preferred_labels.name,
                    contact-name = Your name here,
                    contact-organization = Your organization name here,
                    contact-email = Your email address here
                }
            }

Here we can set up metadata describing the contents of the "bag" via the bag-info-data option; the BagIt standard stores this kind of bag-level metadata in a tag file named bag-info.txt. In our example we've included the collection record's title (via the display template ^ca_collections.preferred_labels.name) along with some placeholder contact information for our repository.
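In the BagIt standard, bag-level metadata lives in a tag file named bag-info.txt, written as one "Label: value" line per field. The Python sketch below renders a dictionary of fields in that shape. It is illustrative only: render_bag_info and the sample values are hypothetical, and whether CollectiveAccess writes the labels exactly as configured or normalizes them is not specified here.

```python
def render_bag_info(fields: dict[str, str]) -> str:
    """Serialize metadata as one "Label: value" line per field, in insertion order."""
    lines = []
    for label, value in fields.items():
        # Collapse runs of whitespace, including line breaks, so that each
        # field stays on a single line.
        clean = " ".join(str(value).split())
        lines.append(f"{label}: {clean}")
    return "\n".join(lines) + "\n"
```

The title value would normally come from evaluating the display template against the collection record before the file is written.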

Please note that the allowFetch option is not currently supported, but will be developed soon.
