Digital Preservation and Media Integrity Features

From CollectiveAccess Documentation

Overview

CollectiveAccess supports a wide range of digital media types and provides tools to view, transform and convert them. It also offers functionality to verify file integrity, embed metadata and migrate media between formats.

Media file integrity checks

MD5 checksums are generated for all ingested media files, and for derivative media files as they are created. Checksums are stored in the database and can be compared at any time with freshly calculated checksums from files on disk to verify that all original media and derivatives have not changed since ingestion.
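
The comparison described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the code CollectiveAccess itself runs; the stored_checksums dictionary is a hypothetical stand-in for the checksums kept in the database.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(stored_checksums):
    """Compare stored checksums against freshly calculated ones.

    stored_checksums: dict mapping file path -> MD5 recorded at ingest
    (hypothetical; stands in for the values held in the database).
    Returns a list of (path, expected, actual) tuples for changed files.
    """
    mismatches = []
    for path, expected in stored_checksums.items():
        actual = md5_of_file(path)
        if actual != expected.lower():
            mismatches.append((path, expected, actual))
    return mismatches
```

An empty result means every file on disk still matches the checksum recorded for it.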

Media checksums may be verified using the check-media-fixity command in caUtils. Running the command with a file path (the file option) or email address (the email option) will result in a text report listing all checksum mismatches. The command offers a number of other options that provide control over the scope and format of the fixity report, including:

Option | Description | Default | Example
file | Path to file to write report into. The placeholder %date may be included to add a date/timestamp to the report's filename. | None; if not specified no file will be written. | --file=/home/bento/fixity_report_%date.txt
email | Email address to send report to. | None | --email=support@collectiveaccess.org
format | Output format for report. May be set to text, tab (tab-delimited) or csv (comma-separated values). Both tab and csv formats are suitable for use with spreadsheet and database applications. | text | --format=csv
versions | Limit checksum comparisons to specified media versions. Separate multiple versions with commas. | None; all versions will be checked. | --versions=small,medium,large,original
start_id | Representation id to start comparisons at. This allows limiting of checks to specific sets of media. | None | --start_id=5000
end_id | Representation id to stop comparisons at. This allows limiting of checks to specific sets of media. | None | --end_id=10000
id | A single representation id to compare checksums for. This can be useful when debugging a fixity issue. | None | --id=31415
ids | A comma-separated list of representation ids to compare checksums for. This can be useful when debugging a fixity issue. | None | --ids=534,631,7822
object_ids | A comma-separated list of object ids to compare representation checksums for. All representations linked to the specified objects will be checked. This can be useful when debugging a fixity issue. | None | --object_ids=100,543,653
kinds | Comma-separated list of kinds of media to check. Valid kinds are ca_object_representations (object representations) and ca_attributes (media and file metadata elements). You may also specify "all" to check all media regardless of type. | all | --kinds=ca_object_representations
quiet | Suppress progress messages. | false | --quiet
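
When fixity checks are scheduled from a script, it can help to assemble the option list programmatically. The following is a minimal sketch using the option names from the table above. The support/bin/caUtils location and the build_fixity_command helper are assumptions made for illustration, so adjust the path to match your installation.

```python
def build_fixity_command(ca_utils="support/bin/caUtils", **options):
    """Build an argument list for caUtils check-media-fixity.

    ca_utils is an assumed location for the caUtils script. Keyword
    names mirror the option names in the table above.
    """
    args = [ca_utils, "check-media-fixity"]
    for name, value in options.items():
        if value is True:
            # Flags such as quiet take no value.
            args.append(f"--{name}")
        elif isinstance(value, (list, tuple)):
            # List options (versions, ids, object_ids, kinds) are comma separated.
            args.append(f"--{name}=" + ",".join(str(v) for v in value))
        elif value is not None and value is not False:
            args.append(f"--{name}={value}")
    return args


command = build_fixity_command(
    file="/home/bento/fixity_report_%date.txt",
    format="csv",
    versions=["small", "original"],
    start_id=5000,
    end_id=10000,
    quiet=True,
)
print(" ".join(command))
# To run the check for real, pass the list to subprocess.run(command, check=True).
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.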

The generated report will include a row for each checksum mismatch. Each row will contain the following information about the mismatch:

Name | Description
Type | The type of error. This will almost always be "MD5 mismatch."
Error | A description of the error.
ID | The ID of the representation or attribute in error.
Version | The version of the representation or attribute in error.
File path | The absolute path on the server to the file in error.
Expected MD5 | The MD5 checksum of the original file as stored in the database.
Actual MD5 | The MD5 checksum calculated from the file on disk.
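
A report written with the csv format can be summarized with a short script. The sketch below assumes the file begins with a header row whose column names match the table above; that layout is an assumption, so check an actual report before relying on it. The sample rows are invented purely for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical report contents, invented for illustration only. The
# header row and the exact column names are assumptions.
SAMPLE_REPORT = """Type,Error,ID,Version,File path,Expected MD5,Actual MD5
MD5 mismatch,Checksums differ,31415,original,/path/to/media/a.tif,aaaa,bbbb
MD5 mismatch,Checksums differ,31415,small,/path/to/media/a_small.jpg,cccc,dddd
MD5 mismatch,Checksums differ,534,original,/path/to/media/b.tif,eeee,ffff
"""


def summarize_report(handle):
    """Count mismatches per ID and collect the affected file paths."""
    per_id = Counter()
    paths = []
    for row in csv.DictReader(handle):
        per_id[row["ID"]] += 1
        paths.append(row["File path"])
    return per_id, paths


per_id, paths = summarize_report(io.StringIO(SAMPLE_REPORT))
for rep_id, count in per_id.most_common():
    print(f"ID {rep_id}: {count} mismatched file(s)")
```

To read a real report, replace the io.StringIO object with an open file handle, for example open(report_path, newline="").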

When the report is configured to be sent via email, its formatting is set in /themes/your_theme/views/mailTemplates/check_media_fixity_report.tpl.

Media migration

CollectiveAccess can convert between a wide variety of audio, video, image and document formats. Conversion is performed at ingest time using format-specific rules specified in the media_processing.conf configuration file. Format-specific transformations, including watermarking and modifications to scale, quality and colorspace, may also be applied.

Modifications to the rules in media_processing.conf apply only to subsequently ingested media. To apply rule changes to previously ingested media, run the reprocess-media command in caUtils. When run without any options, reprocess-media re-converts all representation and attribute media in the database using the current media_processing.conf rules. Conversion takes place even if the resulting file would be no different than the existing one.

reprocess-media options provide control over which files get processed and can be used to cut down on processing time when only specific files are affected by a rules change:

Option | Description | Default | Example
mimetypes | Comma-separated list of mimetypes or mimetype stubs (the first part of a mimetype, such as "image") to limit reprocessing to. | None; if not specified files will be processed regardless of mimetype. | --mimetypes=image/jpeg,image/tiff
versions | Limit reprocessing to specified media versions. Separate multiple versions with commas. | None; all versions will be reprocessed. | --versions=small,medium,large,original
start_id | Representation id to start reprocessing at. This allows limiting of reprocessing to specific sets of media. | None | --start_id=5000
end_id | Representation id to stop reprocessing at. This allows limiting of reprocessing to specific sets of media. | None | --end_id=10000
id | A single representation id to reprocess. This can be useful when debugging a media processing issue. | None | --id=31415
ids | A comma-separated list of representation ids to reprocess. This can be useful when debugging a media processing issue. | None | --ids=534,631,7822
object_ids | A comma-separated list of object ids to reprocess media for. All representations linked to the specified objects will be reprocessed. This can be useful when debugging a media processing issue. | None | --object_ids=100,543,653
kinds | Comma-separated list of kinds of media to reprocess. Valid kinds are ca_object_representations (object representations) and ca_attributes (media and file metadata elements). You may also specify "all" to reprocess all media regardless of type. | all | --kinds=ca_object_representations
quiet | Suppress progress messages. | false | --quiet
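
To make the difference between a full mimetype and a mimetype stub concrete, the sketch below shows one plausible reading of the mimetypes option. It illustrates the description in the table and is not the matching logic CollectiveAccess actually uses; the matches_mimetype_filter function is hypothetical.

```python
def matches_mimetype_filter(mimetype, filters):
    """Return True if mimetype matches any full mimetype or stub in filters.

    An empty filter list matches everything, mirroring the documented
    default of processing files regardless of mimetype.
    """
    if not filters:
        return True
    mimetype = mimetype.lower()
    for entry in filters:
        entry = entry.strip().lower()
        if "/" in entry:
            # Full mimetype such as image/jpeg: require an exact match.
            if mimetype == entry:
                return True
        elif mimetype.split("/", 1)[0] == entry:
            # Stub such as image: compare against the part before the slash.
            return True
    return False


selected = "image/jpeg,image/tiff".split(",")
print(matches_mimetype_filter("image/tiff", selected))
print(matches_mimetype_filter("image/png", selected))
print(matches_mimetype_filter("image/png", ["image"]))
```

Under this reading, --mimetypes=image would select every image format, while --mimetypes=image/jpeg,image/tiff would select only those two.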