Embedded metadata extraction

From CollectiveAccess Documentation

Version 1.6

Overview

You can import EXIF, IPTC and XMP data embedded in uploaded media files (images, video, audio, etc.) into CollectiveAccess using the same data importer mapping template (File:Data Import Mapping template.xlsx) used for Excel, delimited, XML and other common data formats. All of the import and data transformation options available for text-, XML- and database-based data formats are also available for media-embedded metadata.

System requirements

To import embedded metadata, the free ExifTool application must be installed on your server. Make sure the ExifTool entry in your external_applications.conf configuration file points to the installed application.
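A quick way to confirm that ExifTool is installed and runnable is to ask it for its version. The following is a minimal sketch rather than part of CollectiveAccess: the function name exiftool_version is made up for illustration, and it assumes that ExifTool is either on the PATH or that you pass the same path you configured for the importer. It relies on ExifTool's -ver option, which prints the version number.

```python
import shutil
import subprocess
from typing import Optional


def exiftool_version(executable: str = "exiftool") -> Optional[str]:
    """Return the ExifTool version string, or None if it cannot be run.

    `executable` may be a bare command name (looked up on PATH) or a
    full path to the binary. This helper is illustrative only.
    """
    path = shutil.which(executable)
    if path is None:
        return None
    try:
        # "-ver" asks ExifTool to print its version number and exit.
        result = subprocess.run(
            [path, "-ver"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    version = exiftool_version()
    print(f"ExifTool {version}" if version else "ExifTool not found")
```

If this reports that ExifTool was not found, try passing the full path to the binary as the argument before checking the configuration file.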

Use

To import EXIF, IPTC and XMP embedded metadata, change the inputFormats setting of your mapping to "EXIF". The only real difference between writing an embedded metadata mapping and any other type of mapping is the process of deriving source element names. EXIF and the related IPTC and XMP embedded metadata formats define scores of elements, often with names that do not match (or even resemble) the labels shown in data entry interfaces such as Photoshop. It can take a bit of sleuthing on sample files to derive the actual source element names used internally. Perhaps the simplest way to do this is with the free ExifTool application. This command-line application is used internally by the CollectiveAccess importer to extract data, so the element names it reports are the ones the importer works with.

Running ExifTool like this on a file:

exiftool -json -a -g1 my_file.tiff

will return JSON-encoded metadata in the same format used by the CollectiveAccess importer (edited for length):

[{
  "SourceFile": "/media/MattressFactory/images/0/test.tiff",
  "ExifTool": {
    "ExifToolVersion": 9.40
  },
  "System": {
    "FileName": "test.tiff",
    "Directory": "/media/MattressFactory/images/0",
    "FileSize": "18 MB",
    "FileModifyDate": "2014:02:17 19:58:19-05:00",
    "FileAccessDate": "2014:05:17 09:32:29-04:00",
    "FileInodeChangeDate": "2015:02:08 09:51:41-05:00",
    "FilePermissions": "rw-r--r--"
  },
  "File": {
    "FileType": "TIFF",
    "MIMEType": "image/tiff",
    "ExifByteOrder": "Big-endian (Motorola, MM)",
    "CurrentIPTCDigest": "e95f98d10fe58342da21e7ecf0b0cf4b"
  },
  "XMP-x": {
    "XMPToolkit": "Adobe XMP Core 5.2-c004 1.136881, 2010/06/10-18:11:35        "
  },
  "XMP-xmp": {
    "ModifyDate": "2012:12:28 13:32:59-05:00",
    "CreateDate": "2005:05:21 21:21:53-04:00",
    "MetadataDate": "2012:12:28 13:53:23-05:00",
    "CreatorTool": "Adobe Photoshop CS5.1 Macintosh",
    "Label": "VRA metadata imported"
  },
  "XMP-dc": {
    "Format": "image/tiff",
    "Title": "Untitled",
    "Creator": "Mattress Factory",
    "Rights": "\"© Mattress Factory.  Other rights may apply.  For copyright information, contact the Mattress Factory.\"",
    "Subject": ["Jason Simmons","Edgar Um Bucholtz","untitled","performance","Mattress Factory","For Those about to Rock","born digital",2005],
    "Description": "\"Digital image IRC.2012.07452 (old filename: IMG_1873) showing Jason Simmons and Edgar Um Bucholtz's untitled performance at the Mattress Factory, 2005.\""
  },
  "XMP-aux": {
    "SerialNumber": 1260413208,
    "LensInfo": "18-55mm f/?",
    "Lens": "18.0-55.0 mm",
    "ImageNumber": 0,
    "ApproximateFocusDistance": 4294967295,
    "FlashCompensation": 0,
    "OwnerName": "Erik Garcia Gomez",
    "Firmware": "1.1.1"
  },
  "XMP-crs": {
    "AlreadyApplied": true
  },
  "XMP-photoshop": {
    "LegacyIPTCDigest": "E95F98D10FE58342DA21E7ECF0B0CF4B",
    "ColorMode": "RGB",
    "ICCProfileName": "sRGB IEC61966-2.1",
    "DateCreated": "2005:05:21 21:21:53",
    "CaptionWriter": "Lucy T. Jones"
  },
  "XMP-vrae": {
    "Workagent": "Jason Simmons; Edgar Um Bucholtz",
    "Worktitle": "Untitled",
    "Workdate": 2005,
    "WorkstylePeriod": "Contemporary",
    "Worktechnique": "Performance",
    "WorklocationExhibition": "For Those about to Rock",
    "Imagetitle": "\"View of Jason Simmons and Edgar Um Bucholtz's untitled performance, Mattress Factory, 2005\""
  },
  "XMP-xmpRights": {
    "Marked": true,
    "WebStatement": "http://www.mattress.org"
  },
  "XMP-exif": {
    "DateTimeOriginal": "2005:05:21 21:21:53-04:00",
    "ColorSpace": "sRGB",
    "CompressedBitsPerPixel": 3,
    "ExifImageWidth": 3072,
    "ExifImageHeight": 2048,
    "ExposureTime": "1/20",
    "FNumber": 5.6,
    "ShutterSpeedValue": "1/20",
    "ApertureValue": 5.6,
    "ExposureCompensation": 0,
    "MaxApertureValue": 3.5,
    "SubjectDistance": "4294967295 m",
    "MeteringMode": "Average",
    "FocalLength": "18.0 mm",
    "FocalPlaneXResolution": 3443.94618834081,
    "FocalPlaneYResolution": 3442.01680672269,
    "FocalPlaneResolutionUnit": "inches",
    "SensingMethod": "One-chip color area",
    "CustomRendered": "Normal",
    "ExposureMode": "Manual",
    "WhiteBalance": "Auto",
    "SceneCaptureType": "Standard",
    "ExifVersion": "0221",
    "FlashpixVersion": "0100",
    "FileSource": "Digital Camera",
    "ComponentsConfiguration": ["Y","Cb","Cr","-"],
    "ISO": 1600,
    "FlashFired": false,
    "FlashReturn": "No return detection",
    "FlashMode": "Unknown",
    "FlashFunction": false,
    "FlashRedEyeMode": false
  },
  "XMP-iptcCore": {
    "CreatorAddress": "500 Sampsonia Way",
    "CreatorCity": "Pittsburgh",
    "CreatorCountry": "United States of America",
    "CreatorPostalCode": 15212,
    "CreatorRegion": "PA"
  },
  "IPTC": {
    "CodedCharacterSet": "UTF8",
    "ApplicationRecordVersion": 2,
    "ObjectName": "Untitled",
    "Keywords": ["Jason Simmons","Edgar Um Bucholtz","untitled","performance","Mattress Factory","For Those about to Rock","born digital",2005],
    "DateCreated": "2005:05:21",
    "By-line": "Mattress Factory",
    "CopyrightNotice": "\"© Mattress Factory.  Other rights may apply.  For copyright information, contact the Mattress Factory.\"",
    "Caption-Abstract": "\"Digital image IRC.2012.07452 (old filename: IMG_1873) showing Jason Simmons and Edgar Um Bucholtz's untitled performance at the Mattress Factory, 2005.\"",
    "Writer-Editor": "Lucy T. Jones"
  }
}]

Once you have run ExifTool on a media file to see how the embedded metadata is stored, you can use the tag names to construct the "source" values in your mapping. Each source name is formed by writing the heading and the sub-entry separated by a slash.

For example, to extract the description field from the ExifTool output for the digital image above, you would look for the relevant block of data:

"XMP-dc": {
    "Format": "image/tiff",
    "Title": "Untitled",
    "Creator": "Mattress Factory",
    "Rights": "\"© Mattress Factory.  Other rights may apply.  For copyright information, contact the Mattress Factory.\"",
    "Subject": ["Jason Simmons","Edgar Um Bucholtz","untitled","performance","Mattress Factory","For Those about to Rock","born digital",2005],
    "Description": "\"Digital image IRC.2012.07452 (old filename: IMG_1873) showing Jason Simmons and Edgar Um Bucholtz's untitled performance at the Mattress Factory, 2005.\""
}

and join the heading ("XMP-dc") and the sub-entry ("Description") to form the source XMP-dc/Description.
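Joining headings and sub-entries by hand gets tedious for files with many tags. The sketch below shows one way to list every candidate source name from ExifTool's JSON output. It is an illustration, not the importer's own code: the function name list_source_names is made up, and the sample string is a shortened version of the output shown above.

```python
import json
from typing import Any, Dict, List


def list_source_names(exiftool_json: str) -> List[str]:
    """Flatten `exiftool -json -g1` output into heading/sub-entry names."""
    records: List[Dict[str, Any]] = json.loads(exiftool_json)
    names: List[str] = []
    for record in records:
        for heading, entries in record.items():
            # Top-level scalars such as "SourceFile" have no sub-entries.
            if not isinstance(entries, dict):
                continue
            for sub_entry in entries:
                names.append(f"{heading}/{sub_entry}")
    return names


# Shortened sample of the ExifTool output shown earlier.
sample = """[{
  "SourceFile": "/media/MattressFactory/images/0/test.tiff",
  "XMP-dc": {"Title": "Untitled", "Creator": "Mattress Factory"},
  "IPTC": {"ObjectName": "Untitled", "By-line": "Mattress Factory"}
}]"""

for name in list_source_names(sample):
    print(name)
```

For the sample this prints XMP-dc/Title, XMP-dc/Creator, IPTC/ObjectName and IPTC/By-line, one per line. To use it on a real file, save the output of the exiftool command above to a file and pass its contents to the function.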

Let's say you wanted to import description, title and rights to your object record's description, preferred label and copyright fields respectively. You would write the sources in your mapping as:

Example source        Example target
XMP-dc/Description    ca_objects.description
XMP-dc/Title          ca_objects.preferred_labels.name
XMP-dc/Rights         ca_objects.rights_statement

Writing the rest of your mapping is no different from writing a mapping for any other data format.
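Before running an import, it can help to check that each source in a mapping actually exists in a sample file's metadata. The following sketch looks up heading/sub-entry sources in one parsed ExifTool record. It only mimics the lookup for checking purposes: resolve_source is a hypothetical helper, the match is exact (including letter case) by design of this sketch, and the record is a trimmed stand-in for real ExifTool output.

```python
from typing import Any, Dict, Optional


def resolve_source(record: Dict[str, Any], source: str) -> Optional[Any]:
    """Look up a 'heading/sub-entry' source in one parsed ExifTool record."""
    heading, _, sub_entry = source.partition("/")
    group = record.get(heading)
    # Scalars such as "SourceFile" and unknown headings have no sub-entries.
    if not isinstance(group, dict):
        return None
    return group.get(sub_entry)


# Sources and targets from the example mapping.
mapping = {
    "XMP-dc/Description": "ca_objects.description",
    "XMP-dc/Title": "ca_objects.preferred_labels.name",
    "XMP-dc/Rights": "ca_objects.rights_statement",
}

# Trimmed stand-in for one record of parsed ExifTool JSON.
record = {
    "SourceFile": "/media/MattressFactory/images/0/test.tiff",
    "XMP-dc": {"Title": "Untitled", "Creator": "Mattress Factory"},
}

for source, target in mapping.items():
    value = resolve_source(record, source)
    status = repr(value) if value is not None else "not present in this file"
    print(f"{source} -> {target}: {status}")
```

Sources reported as not present are worth checking against the ExifTool output for spelling and capitalization before the mapping is used.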

Two more sources are available that may make it easier to pick up the media from your import directory and match the files with the objects being created during the import. Use __filename__ as a source if you wish to set any field in CollectiveAccess to the filename. More importantly, __filepath__ points to the media in the import directory and can be used to trigger ingestion of the media itself.

Source: __filename__
Description: This source value takes the filename of the media being imported.
Parameter notes: You can import filenames to any field in CollectiveAccess, including preferred_labels and idno.

Source: __filepath__
Description: This source takes the full server file path of the media in your import directory.
Parameter notes: Map this to ca_object_representations and use the objectRepresentationSplitter with parameters such as:

{
    "objectRepresentationType": "front",
    "attributes": {
        "media": "^__filepath__"
    }
}
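The parameters for objectRepresentationSplitter are written as JSON, where a missing brace or quote is easy to overlook. The sketch below checks a parameter string with Python's standard json module before it is pasted into a mapping. The helper name json_error is made up for illustration, and the check covers JSON syntax only, not whether CollectiveAccess accepts the individual settings.

```python
import json
from typing import Optional


def json_error(text: str) -> Optional[str]:
    """Return None if text parses as JSON, else a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None


# The splitter parameters from the __filepath__ notes.
parameters = """{
    "objectRepresentationType": "front",
    "attributes": {
        "media": "^__filepath__"
    }
}"""

print(json_error(parameters) or "parameters are valid JSON")
```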
