Data Exporter

From CollectiveAccess Documentation
Jump to: navigation, search

[Available for v1.4+]

Overview

Probably a step-by-step intro here ..

Supported output formats

Currently: XML, MARC21, CSV

Creating a mapping

To create a mapping, first download the Excel-based export mapping template available here (File:Data Export Mapping template.xlsx). Once all of the mappings and settings have been entered into the template it can be loaded directly into CollectiveAccess (see Running an export below). The mapping is automatically checked using format-specific rules before it is added so if your mapping has any errors or ambiguities, the mapping loader will let you know.

Creating the mapping is a very dependent on the format you want to export. Specific notes and examples can be found in the section about element values and formats.

Rule types

The first column of the main mapping spreadsheet is called "Rule type". What you set here basically qualifies what this row does. Most of the rows will end up being of the "Mapping" type but there are several options available:

Rule type Description
Mapping Maps a CollectiveAccess data source to a target column number, XML element/attribute or MARC field.
Constant Allows you to set an element in the target format (a CSV column or an XML element/attribute to a static/constant value. If this is set, the value is taken from the 6th column in the mapping sheet ("Source").
RepeatMappings Allows repeating a list of existing mappings in a different context. If this is set, the comma-delimited list of mappings is taken from the 6th column ("Source"). See Data_Exporter#Mapping repitition.
Setting Sets preferences for the mapping (see below).
Variable (Available for v1.5) Allows you, using all the available features of the exporter, to assign a value to a user-defined name for later usage. See Data_Exporter#Variables.

Hierarchical mappings

Some export formats support hierarchical relationships between mapping items. For XML this is a very core concept. To create a hierarchy, simply assign a number to a mapping in the 2nd column of the Mapping sheet and then reference that number in other rows (i.e. for other items) in the 3rd row, which is typically named "Parent ID". The second item will then become a direct child if the first one. In theory, those hierarchies can be nested very deep but in practice the format implementations may apply restrictions.

Source

The value for the 5th column in the mapping sheet can be any CollectiveAccess bundle specifier. See API:Getting_Data#Bundle_specifiers for details. This usually specifies the actual data that is pulled into this item. Can be set to arbitrary text for items with static content or be left empty for items without content (e.g. wrapping elements in XML or empty columns in CSV).

Note that if the context for the current mapping is changed, there are a couple of special keys available for the source column. For more information see the description for the "context" option in the table below.

Element values and general notes on specific formats

The 4th column of the mapping sheet is named 'Element'. This is a very format-specific setting where you enter the name of the element you want to put your field data in. See below for a description of the formats.

XML Element values

The XML format implementation allows valid XML element names as values for the "Element" column. If you want to specify an XML attribute, prefix the name with an @. The attribute will then be appended to the hierarchy parent (which can't be another attribute). The mapping item hierarchy pretty much represents the XML tree that will be constructed from it.

Say you have the following very simple part of a mapping sheet and you export a single object.

Rule type ID Parent ID Element Source Options
Mapping 1 object
Mapping 2 1 @idno ca_objects.idno
Mapping 3 1 title ca_objects.preferred_labels

What you end up with as export for a given objects is something like the following:

<?xml version="1.0" encoding="utf-8"?>
<object idno="00001">
  <title>My very cool object</title>
</object>

MARC Element values

Let's start off by saying that MARC is a very old and very specific format. Creating MARC mappings can be a bit painful. Make yourself familiar with the format before you dive into the following description.

In MARC mappings, the Element value is either a control field or a data field definition. For control field definitions, simply enter the field code (like '001') here. For data field definitions, enter the field code, followed by a forward slash and both indicator characters. For details on valid field codes and indicators, please refer to the MARC documentation. For empty/unused indicators, use the pound sign (#). Valid examples are 001 300/## 490/1#

Mapping items with data field definitions also shouldn't have any source definition or static data. The data resides in subfields, which should be separate mapping items with a hierarchical relationship (via Parent ID) to the field definition. For instance, you'd define an item for the data field "300/##". Suppose it had the ID 1. This field (like every data field) has a couple of subfields [1], namely a through g and 3, 6, 8 (leave out the $ character from the original documentation). Now create separate mapping items for each subfield you need, pull in the CA data you want using the 'Source' field in the mapping sheet and fill in the Parent ID "1", the identifier of the data field. Here's an example in table form (which may not make sense from a MARC standpoint but we're only trying to explain the format here, not the semantics of MARC fields):

Rule type ID Parent ID Element Source Options
Mapping 1 001 ca_objects.idno
Mapping 2 300/##
Mapping 3 2 b ca_objects.preferred_labels

An example export for a single object looks like this then. Note that we selected the 'readable' format for the MARC exporter, more info on format-specific settings are below.

LDR                         
001     00001
300 ## _bMy very cool object

CSV Element values

In this format you simply define column numbers as Element values. The first column is indexed 1, not 0. No hierarchical relationships are allowed. And that's about all you need to know to create a CSV export. A very simple example:

Rule type ID Parent ID Element Source Options
Mapping 1 ca_objects.idno
Mapping 2 ca_objects.preferred_labels

The resulting line looks like this. We used the defaults for delimiter and enclosure. More info on format-specific settings can be found below.

"00001","My very cool object"

Variables

This feature allows you, using all the available features of the exporter, to assign a value to a user-defined identifier for later usage. The value can be anything you can pull from the database. The '''identifier''' should '''only contain alphanumeric text, dashes and underscores'''. Otherwise the mapping spreadsheet will fail to load. For example: type, my_variable, some-value, somethingCamelCase.

The identifier (essentially the name) that you assign to the variable goes into the element column. Since variable don't end up in the export, this column has no other use. Below is a simple example.

The main (and for the moment only) use for variables are conditional mappings. Say you have two objects, a document and a photo. And say you have an attribute 'secret_info' that is valid for both object types but that you only want to have in your export for photos. You could build two different mappings for these cases or you could use a variable to assign the object type to a user-defined identifier and then use the skipIfExpression option for the mapping in question.

A good way to think of variables is that they are mappings that don't end up in the actual export. They respect the current context, the current place in the hierarchy, everything.

Rule type ID Parent ID Element Source Options
Variable type ca_objects.type_id
Mapping 1 object
Mapping 2 1 @idno ca_objects.idno
Mapping 3 1 secret ca_objects.top_secret { "skipIfExpression" : "^type!~/49/" }

We use the "type" variable in the skipIfExpression setting for the top_secret mapping. For more info on this setting, see the setting description below.

Settings

These are configuration options that apply to the whole exporter mapping.

Setting Description Parameter notes Example
exporter_format Sets the format used for this mapping. Restricted list, at the moment 'XML', 'MARC' and 'CSV' are supported. XML
code Alphanumeric code of the mapping Arbitrary, no special characters or spaces my_mapping
name Human readable name of the mapping Arbitrary text My mapping
table Sets the table for the exported data Corresponds to CollectiveAccess Basic Tables ca_objects
wrap_before If this exporter is used for an item set export (as opposed to a single item), the text set here will be inserted before the first item. This can for instance be used to wrap a repeating set of XML elements in a single global element. The text should be valid for the current exporter format. Arbitrary string value <rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" ...>
wrap_after If this exporter is used for an item set export (as opposed to a single item), the text set here will be inserted after the last item. This can for instance be used to wrap a repeating set of XML elements in a single global element. The text has to be valid for the current exporter format. Arbitrary string value </rdf:RDF>
wrap_before_record Same as wrap_before but only applies to single item/record exports. Arbitrary string value <mySingleRecordWrap xml:id="fooBar">
wrap_after_record Same as wrap_after but only applies to single item/record exports. Arbitrary string value </mySingleRecordWrap>
typeRestrictions If set, this mapping will only be available for these types. Multiple types are separated by commas or semicolons.

Note that this doesn't work very well for batch exports because search results or sets typically consist of records of multiple types. The exporter select dropdown always shows all exporters for that table, but when you actually run the export in batch mode, it will filter according to the restriction, which can get a little confusing when you look at the result.

comma- or semi-colon separated list of valid type codes for this table image,document

In addition to those globally available settings, the exporter can pull in format-specific settings if needed. Note that not all formats may have such settings. For instance, XML does not.

MARC-specific settings

Setting Description Parameter notes Example
MARC_outputFormat MARC supports a couple of different output formats for the same kinds of mapping. Set the format you want to use here. Default is 'readable'. See [2] for more details 'readable', 'raw' or 'xml'. readable refers to the typical more or less human-readable table-like format used for MARC records. raw is used to write MARC binary files for data exchange. The 3rd option uses MARCXML as output format. xml

CSV-specific settings

Setting Description Parameter notes Example
CSV_delimiter Sets the character used to separate field values. Typically a single character like ';' or ','. Default is ','. ;
CSV_enclosure Character used to enclose the text content in the export. Typical values are single or double quotes. "
CSV_print_field_names Output field names in first row of output. Name used is as set in the "id" column of the mapping. If no id is set then the configured label for the source field is used. 1 to print names, 0 to not print names. Default is 0 1

Options

Each mapping item (i.e. a line in the mapping spreadsheet) can have its own settings as well. To set these settings, you can fill out the 6th column of the mapping sheet, called 'Options'. The options must be filled in in JavaScript Object Notation. If you set this value and it's not formatted properly, the mapping loading tool will throw an error. Here's a description of the available options:

Options Description Parameter notes Example
default Value to use if data source value is blank Arbitrary text "No value"
delimiter Delimiter to used to concatenate repeating values Usually a single character like a comma or semi-colon
prefix Text to prepend to value prior to export Arbitrary text Dimensions are:
suffix Text to append to value prior to export Arbitrary text feet
template Format exported value with provided template. Template may include caret (^) prefixed placeholders that refer to data source values. See the [Bundle_Display_Templates] article for details on the syntax ^height
maxLength Truncate to specified length if value exceeds that length Integer 80
repeat_element_for_multiple_values Some source values may select multiple values, for instance for repeatable metadata elements. If this is the case and this option is set, the current mapping item is repeated for each value instead of them being put into a single string using the delimiter option 1 or 0. defaults to 0 1
filterByRegExp Allows filtering values by regular expression. Any value that does NOT match this PCRE regular expression is filtered and not exported Insert expression without delimiters. Has to be valid expression. [A-Za-z0-9]+
locale Locale code to use to get the field values from the database. If not set, the system/user default is used. Valid locale code de_DE
context This is used to switch the context for this mapping item (and all hierarchy children) to a different record (usually a set of records). A basic application of this feature is to create a kind of sub-export inside the mapping where you can pull in data from related items or hierarchical descendants. Once the context is switched, the 'source' values for this row and all children are relative to the new context, unless of course it is switched again (you can build cascades). This allows you, for instance, to list all works of the creator of a painting which you're exporting. The context-switched mapping item is always repeated for each record selected by the context switch! See also the 'restrictToTypes', 'restrictToRelationshipTypes' and 'checkAccess' settings to further specify the
Note that if the context is switched to a related table, there are a couple of special keys available for the '''source''' column to fetch the type of the relationship between the item in the current context and the item where the context was last switched. These keys are: '''relationship_type_id''', '''relationship_type_code''' and '''relationship_typename'''.
It is also possible to switch context to an attribute of the current record. This helps properly process repeatable containers as encapsuled sub-exports. If the context is switched to a container like ca_entities.address, all elements of the container are available in the source column for all child mappings. They are addressed by only their code(e.g. "city"). No table prefix.
Either a related table name like 'ca_entities', a metadata element bundle specifier (ca_entities.address) or one of the literals 'children' or 'parent' for hierarchy traversal. ca_entities
restrictToTypes Restricts the context of the mapping to only records of the designated type. Only valid when context setting is set list of valid type codes "restrictToTypes":["photograph", "other", "mixed", "text"]
restrictToRelationshipTypes Restricts the context of the mapping to only records related with the designated relationship type. Only valid when context is set. list of valid relationship type codes "restrictToRelationshipTypes":["creator", "depicts"]
checkAccess Restricts the context of the mapping to only records with one of the designated access values. Only valid when context is set. List of valid 'access' values "checkAccess": [1, 2]
omitIfEmpty Completely ignores this mapping if the selector doesn't return any value. This is primarily meant for XML exports where you don't want to end up with ugly empty XML elements like <relatedObjects></relatedObjects>.
Note: This works differently from the Data Importer Option "skipIfEmpty"!
Valid bundle specifier "omitIfEmpty": "ca_objects.description"
omitIfNotEmpty Omit this item and all its children if this CollectiveAccess bundle specifier returns a non-empty result. This is useful if you want to specify fallback-sections in your export mapping that are only used if certain data is not available. Valid bundle specifier "omitIfNotEmpty": "ca_objects.description"
convertCodesToDisplayText If set, id values refering to foreign keys are returned as preferred label text in the current locale. 0 or 1 "convertCodesToDisplayText": 1
convertCodesToIdno (from v1.6.1) If set, id values refering to foreign keys are returned as idno. 0 or 1 "convertCodesToIdno": 1
returnIdno If set, idnos are returned for List attribute values instead of primary key values. Should not be combined with convertCodesToDisplayText, as it overrides it and can produce unwanted results. Only applies to List attribute values! 0 or 1 "returnIdno": 1
skipIfExpression If the expression yields true, skip the mapping for the data. Note that all user-set variables are available under their identifiers. arbitrary text { "skipIfExpression" : "^foo!~/49/" }
start_as_iso8601 If you are exporting a range of dates, and wish for the start and end dates to be split and exported to separate elements, use this setting to grab the "start" date. 0 or 1 { "start_as_iso8601" : 1 }
end_as_iso8601 If you are exporting a range of dates, and wish for the start and end dates to be split and exported to separate elements, use this setting to grab the "end" date. 0 or 1 { "end_as_iso8601" : 1 }
coordinatesOnly If you are exporting georeference points, the default output is [lat,long]. If you wish to strip out the brackets, set coodinatesOnly to 1 0 or 1 { "coordinatesOnly" : 1 }
timeOmit By default, start_as_iso8601 and end_as_iso8601 will produced the timestamp as well as the date. To omit the time, use timeOmit. 0 or 1 { "timeOmit": 1 }
dontReturnValueIfOnSameDayAsStart This setting will ensure that the end_as_iso8601 will be skipped on single dates (where there is no end date). 0 or 1 { "dontReturnValueIfOnSameDayAsStart" : 1 }
sort Sorts the values returned for a context switch on these fields. Only valid when context is set to a related table. Must always be a list. List of valid field names for related table "sort" : [ "ca_entities.preferred_labels.surname" ]

Below is a properly formatted example in JSON that uses some of these options:

{
	"default" : "No value",
	"delimiter" : ";",
	"maxLength" : 80,
	"filterByRegExp" : "[A-Z]+"
}

Processing order

In some cases the order in which the options and replacements (see next sub-section) are applied to each value can make a significant difference so it's important to note it here:

1) skipIfExpression (available for v1.5)
2) filterByRegExp
3) Replacements (see below)
4a) If value is empty, respect 'default' setting
4b) If value is not empty, use prefix and suffix
5) Truncate if result is longer than maxLength

Replacements

While looking at the exporter mapping template you might have noticed that there's a second sheet called 'Replacements' in there. This can be used to assign replacements to each mapping item. The first column references the ID you set in the 2nd column of the mapping item table. The second column defines what is to be replaced. This again should be a PCRE-compatible regular expression without delimiters. The 3rd column defines what value should be inserted for the matched values. These conditions are applied to each matching value in the order they've been defined, i.e. if you have multiple replacements for the same mapping item, the incoming value is first passed through the first replacement, the result of this action is then passed in to the second replacement, and so on ...

[Useful note for advanced users and PHP programmers]

The values are passed through preg_replace, the 'pattern' being the 2nd column value (plus delimiters) and the 'replacement' being the value from the 3rd column. This allows you to do pretty nifty stuff, for instance rewriting dates:

Search column:  (\w+) (\d+), (\d+)
Replace column: $2 $1 $3
value: April 15, 2003
result: 15 April 2003

Mapping repitition

The 'RepeatMappings' rule type allows you to repeat a set list of mappings in a different context without actually defining them again. This is, for instance, very useful when creating EAD exports of hierarchical data where the basic structure is always the same (for archdesc, c01, c02, etc.) but the context changes. It's basically a shortcut that saves a lot of work in certain scenarios. Note that all hierarchy children of the listed items are repeated as well.

If you create a RepeatMappings rule, the mapping loader expects a comma-delimited list of references to the 2nd column in the Mapping sheet. It also really only makes sense to create this type of rule if you change the context in the same step. A simple example could look like this:

Rule type ID Parent ID Element Source Options
Mapping 1 root
Mapping 2 1 label ca_objects.preferred_labels
Mapping 3 1 idno ca_objects.idno
Mapping 4 1 children
RepeatMappings 4 child 2,3 { "context" : "children" }

In this case, the 'child' element would be repeated for each hierarchy child of the exported item because of the context switch and for each of those children, the exporter would add the label and idno elements.

Running an export

The export can be executed through caUtils. To see all utilities ask for help after cd-ing into support

cd /path_to_Providence/support
bin/caUtils help

To get further details about the load-export-mapping utility:

bin/caUtils help load-export-mapping

To load the mapping:

bin/caUtils load-export-mapping --file=~/my_export_mapping.xlsx

Next you’ll be using the utility export-data. First, have a look at the help for the command to get familiar with the available options.

bin/caUtils help export-data

Essentially there are 3 export modes:

1) Export a single record

Since the scope of a mapping is usually a single record, it's easy to use a mapping to export a record by its identifier. Suppose you have a ca_objects XML mapping with the code 'my_mapping'. To use this to export the ca_objects record with the primary key identifier (not the custom idno!) 550 to a new file ~/export.xml, you'd run this command:

bin/caUtils export-data -m my_mapping -i 550 -f ~/export.xml

2) Export a set of records found by custom search expression

In most real-world export projects you'll need to export a set of records or even all your records into a single file. The exporter utility allows this by letting you specify a search expression with the -s parameter that selects the set of records used for export. The records are simply exported sequentially in the order returned by the search engine. This sequence is wrapped in the wrap_before and wrap_after settings of the exporter, if set. If you want to export all your records, simply search for "*". This example exports all publicly accessible files to a file ~/export.xml:

bin/caUtils export-data -m my_mapping -s "access:1" -f ~/export.xml

3) Export a diverse set of records ("RDF mode")

[For advanced users] The error handling in this portion of the code is very poor so you're pretty much left on an island if something goes wrong.

Sometimes a limited export scope to for example ca_objects like in the previous example is not enough to meet the target format requirements. Occasionally you may want to build a kind of 'mixed' export where records from multiple database entities (objects, list items, places, ...) are treated equally. We have found this requirement when trying to use the exporter to generate an RDF graph, hence the name. The export framework originally wasn't designed for this case but the caUtils export-data command offers a way around that. The switch --rdf enables this so called "RDF mode". In this mode, you again use -f to specify the output file and you have to provide an additional configuration file (see Configuration_File_Syntax) which tells the exporter about the records and corresponding mappings which will be used for this export.

Here is a minimal example that uses all the available features:

wrap_before = ""
wrap_after = ""

nodes = {
	my_images = {
		mapping = object_mapping,
		restrictBySearch = "access:1",
		related = {
			concepts = {
				restrictToRelationshipTypes = [depicts],
				mapping = concept_mapping,
			},
			agents = {
				restrictToTypes = [person],
				mapping = agent_mapping,
			},
		}
	},
}

While processing this configuration, the exporter essentially builds one big list of records and corresponding mappings to export. There are no duplicates in this list, if object_id 23 is selected by two different node type definitions or by multiple related definitions, it is still only exported once, using the mapping provided by the first definition.

Here is an example of how to run an RDF mode export:

bin/caUtils export-data --rdf -c ~/rdf_mode.conf ~/export.xml

RDF Mode configuration file options

Setting Description
wrap_before Text to prepend before the export.
wrap_after Text to append after the export.
nodes List of primary node type definitions to be used for this export

Node type definition options

Setting Description
mapping Mapping to be used for this type of node. Has to be an existing mapping code.
restrictbySearch Restrict exported records using a search expression
related List of related records also to be included in the global node set. You can use this for example to make sure you only export list_items that are actually actively used as vocabulary terms for objects, meaning you don't have to create an extra node type (which would potentially export all list items in your database) for this.

'related' options

mapping Mapping to be used for this type of related item. Has to be an existing mapping code.
restrictToRelationshipTypes Restrict selected related records by relationship types. Has to be a list or empty.
restrictToTypes Restrict selected related records by record types (e.g. entity type). Has to be a list or empty.

Misc Setting and Options

Exporting values from Information Services (e.g Library of Congress, Getty

If your CollectiveAccess configuration includes information services, such as Library Of Congress Subject Headings or Getty's Art and Architecture Thesaurus, you can export these in the exact same way as you would export other kinds of metadata elements.

However, in order to comply with certain XML formats (like MODS of TEI) you may find that you need to extract the terms' URI and export these to an attribute while exporting the label name to an element.

To grab an information service term's URI, you can simply append ".uri" or ".url" to the Source.

For example, if your Getty AAT element happens to be called "ca_objects.aat" and you wish to export the URI, simply express the source as "ca_objects.aat.uri". This will give you the URI while the simple "ca_objects.aat" will get you the label name as before.

LC services work a little differently. For these, you must append to the source ".text" to get the label name and ".id" to get the URI.

For example:

ca_objects.lcsh_terms.text will get you the label name of all lcsh terms on the record. ca_objects.lcsh_terms.id will get you the URI for these terms.

update_sphinx

Namespaces

Variants
Actions
Navigation
Tools
User
Personal tools