Difference between revisions of "Data Exporter"

From CollectiveAccess Documentation
Jump to: navigation, search
(Creating a mapping)
(Mapping repitition)
Line 258: Line 258:
||Mapping|| 4|| 1 || children || || ||
||Mapping|| 4|| 1 || children || || ||
||Mapping|| || 4 || child || children || 2,3 ||
||RepeatMappings|| || 4 || child || children || 2,3 ||

Revision as of 16:33, 11 March 2013

[Available for v1.4]


Probably a step-by-step intro here ..

Supported output formats

Currently: XML, MARC21, CSV

Creating a mapping

To create a mapping, first download the Excel-based export mapping template available here (File:Data Export Mapping template.xlsx). Once all of the mappings and settings have been entered into the template it can be loaded directly into CollectiveAccess (see Running an export below). The mapping is automatically checked using format-specific rules before it is added so if your mapping has any errors or ambiguities, the mapping loader will let you know.

Creating the mapping is a very dependent on the format you want to export. Specific notes and examples can be found in the section about element values and formats.

Rule types

Maybe general notes here?

Rule type Description
Mapping Maps a CollectiveAccess data source to a target column number, XML element/attribute or MARC field.
Constant Allows you to set an element in the target format (a CSV column or an XML element/attribute to a static/constant value. If this is set, the value is taken from the 6th column in the mapping sheet ("Source").
RepeatMappings Allows repeating a list of existing mappings in a different context. If this is set, the comma-delimited list of mappings is taken from the 6th column ("Source"). See Data_Exporter#Mapping repitition.
Setting Sets preferences for the mapping (see below).

Hierarchical mappings

Some export formats support hierarchical relationships between mapping items. For XML this is a very core concept. To create a hierarchy, simply assign a number to a mapping in the 2nd column of the Mapping sheet and then reference that number in other rows (i.e. for other items) in the 3rd row, which is typically named "Parent ID". The second item will then become a direct child if the first one. In theory, those hierarchies can be nested very deep but in practice the format implementations may apply restrictions.

Context values and context switches

The Context column (#5) is used to switch the context for this mapping item (and all hierarchy children) to a different record (usually a set of records). This is used to create a kind of sub-export inside the mapping where you can pull in data from related items or hierarchical descendants and only makes sense in nested structures like XML. Don't use this feature in CSV or MARC exports! The Context field uses a kind of custom syntax similar to what we did with Bundle_Display_Templates. Here is an example that uses all available features


Which means the current mapping item is repeated for those related objects (to the record currently being processed) of type either image or print where the relationship type depicts or foobar is used and which are publicly accessible.

If you want to switch the context to items related via ad-hoc hierarchical relationships which are supported by several record types (objects, occurrences, places, ...), you can use either

children or parent as value for Context.

Once the context is switched, the 'Source' values for this row and all children are relative to the new context, unless of course it is switched again. This allows you, for instance, to list all works of the creator of a painting which you're exporting. The context-switched mapping item is always repeated for each record selected by the context switch!


The value for the 6th column in the mapping sheet can be any CollectiveAccess bundle specifier. See API:Getting_Data#Bundle_specifiers for details. This usually specifies the actual data that is pulled into this item. Can be set to arbitrary text for items with static content or be left empty for items without content (e.g. wrapping elements in XML or empty columns in CSV).

Element values and general notes on specific formats

The 4th column of the mapping sheet is named 'Element'. This is a very format-specific setting where you enter the name of the element you want to put your field data in. See below for a description of the formats.

XML Element values

The XML format implementation allows valid XML element names as values for the "Element" column. If you want to specify an XML attribute, prefix the name with an @. The attribute will then be appended to the hierarchy parent (which can't be another attribute). The mapping item hierarchy pretty much represents the XML tree that will be constructed from it.

Say you have the following very simple part of a mapping sheet and you export a single object.

Rule type ID Parent ID Element Context Source Options
Mapping 1 object
Mapping 2 1 @idno ca_objects.idno
Mapping 3 1 title ca_objects.preferred_labels

What you end up with as export for a given objects is something like the following:

<?xml version="1.0" encoding="utf-8"?>
<object idno="00001">
  <title>My very cool object</title>

MARC Element values

Let's start off by saying that MARC is a very old and very specific format. Creating MARC mappings can be a bit painful. Make yourself familiar with the format before you dive into the following description.

To use this feature, you have to have PEAR and its File_MARC package installed. Otherwise the exporter will throw an error.

In MARC mappings, the Element value is either a control field or a data field definition. For control field definitions, simply enter the field code (like '001') here. For data field definitions, enter the field code, followed by a forward slash and both indicator characters. For details on valid field codes and indicators, please refer to the MARC documentation. For empty/unused indicators, use the pound sign (#). Valid examples are 001 300/## 490/1#

Mapping items with data field definitions also shouldn't have any source definition or static data. The data resides in subfields, which should be separate mapping items with a hierarchical relationship (via Parent ID) to the field definition. For instance, you'd define an item for the data field "300/##". Suppose it had the ID 1. This field (like every data field) has a couple of subfields [1], namely a through g and 3, 6, 8 (leave out the $ character from the original documentation). Now create separate mapping items for each subfield you need, pull in the CA data you want using the 'Source' field in the mapping sheet and fill in the Parent ID "1", the identifier of the data field. Here's an example in table form (which may not make sense from a MARC standpoint but we're only trying to explain the format here, not the semantics of MARC fields):

Rule type ID Parent ID Element Context Source Options
Mapping 1 001 ca_objects.idno
Mapping 2 300/##
Mapping 3 2 b ca_objects.preferred_labels

An example export for a single object looks like this then. Note that we selected the 'readable' format for the MARC exporter, more info on format-specific settings are below.

001     00001
300 ## _bMy very cool object

CSV Element values

In this format you simply define column numbers as Element values. The first column is indexed 1, not 0. No hierarchical relationships are allowed. And that's about all you need to know to create a CSV export. A very simple example:

Rule type ID Parent ID Element Context Source Options
Mapping 1 ca_objects.idno
Mapping 2 ca_objects.preferred_labels

The resulting line looks like this. We used the defaults for delimiter and enclosure. More info on format-specific settings can be found below.

"00001","My very cool object"


These are configuration options that apply to the whole exporter mapping.

Setting Description Parameter notes Example
exporter_format Sets the format used for this mapping. Restricted list, at the moment 'XML', 'MARC' and 'CSV' are supported. XML
code Alphanumeric code of the mapping Arbitrary, no special characters or spaces my_mapping
name Human readable name of the mapping Arbitrary text My mapping
table Sets the table for the exported data Corresponds to CollectiveAccess Basic Tables ca_objects
wrap_before If this exporter is used for an item set export (as opposed to a single item), the text set here will be inserted before the first item. This can for instance be used to wrap a repeating set of XML elements in a single global element. The text should be valid for the current exporter format. Arbitrary string value <rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" ...>
wrap_after If this exporter is used for an item set export (as opposed to a single item), the text set here will be inserted after the last item. This can for instance be used to wrap a repeating set of XML elements in a single global element. The text has to be valid for the current exporter format. Arbitrary string value </rdf:RDF>

In addition to those globally available settings, the exporter can pull in format-specific settings if needed. Note that not all formats may have such settings. For instance, XML does not.

MARC-specific settings

Setting Description Parameter notes Example
MARC_outputFormat MARC supports a couple of different output formats for the same kinds of mapping. Set the format you want to use here. Default is 'readable'. See [2] for more details 'readable', 'raw' or 'MARCXML'. readable refers to the typical more or less human-readable table-like format used for MARC records. raw is used to write MARC binary files for data exchange. The 3rd option uses MARCXML as output format. MARCXML

CSV-specific settings

Setting Description Parameter notes Example
CSV_delimiter Sets the character used to separate field values. Typically a single character like ';' or ','. Default is ','. ;
CSV_enclosure Character used to enclose the text content in the export. Typical values are single or double quotes. "


Each mapping item can have its own settings as well. To set these settings, you can fill out the 7th column of the mapping sheet, called 'Options'. The options must be filled in in JavaScript Object Notation. If you set this value and it's not formatted properly, the mapping loading tool will throw an error. Here's a description of the available options:

Options Description Parameter notes Example
default Value to use if data source value is blank Arbitrary text "No value"
delimiter Delimiter to used to concatenate repeating values Usually a single character like a comma or semi-colon ;
prefix Text to prepend to value prior to export Arbitrary text Dimensions are:
suffix Text to append to value prior to export Arbitrary text feet
template Format exported value with provided template. Template may include caret (^) prefixed placeholders that refer to data source values. See the [Bundle_Display_Templates] article for details on the syntax ^height
maxLength Truncate to specified length if value exceeds that length Integer 80
repeat_element_for_multiple_values Some source values may select multiple values, for instance for repeatable metadata elements. If this is the case and this option is set, the current mapping item is repeated for each value instead of them being put into a single string using the delimiter option 1 or 0. defaults to 0 1
filterByRegExp Allows filtering values by regular expression. Any value that does NOT match this PCRE regular expression is filtered and not exported Insert expression without delimiters. Has to be valid expression. [A-Za-z0-9]+
locale Locale code to use to get the field values from the database. If not set, the system/user default is used. Valid locale code de_DE

Below is a properly formatted example in JSON that uses some of these options:

	"default" : "No value",
	"delimiter" : ";",
	"maxLength" : 80,
	"filterByRegExp" : "[A-Z]+"

Processing order

In some cases the order in which the options and replacements (see next sub-section) are applied to each value can make a significant difference so it's important to note it here:

1) filterByRegExp
2) Replacements (see below)
3a) If value is empty, respect 'default' setting
3b) If value is not empty, use prefix and suffix
4) Truncate if result is longer than maxLength


While looking at the exporter mapping template you might have noticed that there's a second sheet called 'Replacements' in there. This can be used to assign replacements to each mapping item. The first column references the ID you set in the 2nd column of the mapping item table. The second column defines what is to be replaced. This again should be a PCRE-compatible regular expression without delimiters. The 3rd column defines what value should be inserted for the matched values. These conditions are applied to each matching value in the order they've been defined, i.e. if you have multiple replacements for the same mapping item, the incoming value is first passed through the first replacement, the result of this action is then passed in to the second replacement, and so on ...

[Useful note for advanced users and PHP programmers]

The values are passed through preg_replace, the 'pattern' being the 2nd column value (plus delimiters) and the 'replacement' being the value from the 3rd column. This allows you to do pretty nifty stuff, for instance rewriting dates:

Search column:  (\w+) (\d+), (\d+)
Replace column: $2 $1 $3
value: April 15, 2003
result: 15 April 2003

Mapping repitition

The 'RepeatMappings' rule type allows you to repeat a set list of mappings in a different context without actually defining them again. This is, for instance, very useful when creating EAD exports of hierarchical data where the basic structure is always the same (for archdesc, c01, c02, etc.) but the context changes. It's basically a shortcut that saves a lot of work in certain scenarios. Note that all hierarchy children of the listed items are repeated as well.

If you create a RepeatMappings rule, the mapping loader expects a comma-delimited list of references to the 2nd column in the Mapping sheet. It also really only makes sense to create this type of rule if you change the context in the same step. A simple example could look like this:

Rule type ID Parent ID Element Context Source Options
Mapping 1 root
Mapping 2 1 label ca_objects.preferred_labels
Mapping 3 1 idno ca_objects.idno
Mapping 4 1 children
RepeatMappings 4 child children 2,3

In this case, the 'child' element would be repeated for each hierarchy child of the exported item because of the context switch and for each of those children, the exporter would add the label and idno elements.

Running an export

Currently the export framework can only be run on the command line. To do so, follow these instructions.

The export is executed through caUtils. To see all utilities ask for help after cd-ing into support

cd /path_to_Providence/support
bin/caUtils help

To get further details about the load-export-mapping utility:

bin/caUtils help load-export-mapping

To load the mapping:

bin/caUtils load-export-mapping --file=~/my_export_mapping.xlsx

Next you’ll be using the utility export-data. First, have a look at the help for the command to get familiar with the available options.

bin/caUtils help export-data

Essentially there are 3 export modes:

1) Export a single record

Since the scope of a mapping is usually a single record, it's easy to use a mapping to export a record by its identifier. Suppose you have a ca_objects XML mapping with the code 'my_mapping'. To use this to export the ca_objects record with the primary key identifier (not the custom idno!) 550 to a new file ~/export.xml, you'd run this command:

bin/caUtils export-data -m my_mapping -i 550 -f ~/export.xml

2) Export a set of records found by custom search expression

In most real-world export projects you'll need to export a set of records or even all your records into a single file. The exporter utility allows this by letting you specify a search expression with the -s parameter that selects the set of records used for export. The records are simply exported sequentially in the order returned by the search engine. This sequence is wrapped in the wrap_before and wrap_after settings of the exporter, if set. If you want to export all your records, simply search for "*". This example exports all publicly accessible files to a file ~/export.xml:

bin/caUtils export-data -m my_mapping -s "access:1" -f ~/export.xml

3) Export a diverse set of records ("RDF mode")

[For advanced users] The error handling in this portion of the code is very poor so you're pretty much left on an island if something goes wrong.

Sometimes a limited export scope to for example ca_objects like in the previous example is not enough to meet the target format requirements. Sometimes have to build a kind of 'mixed' export where records from multiple database entities (objects, list items, places, ...) are treated equally. We have found this requirement when trying to use the exporter to generate an RDF graph, hence the name. The export framework originally wasn't designed for this case but the caUtils export-data command offers a way around that. The switch --rdf enables this so called "RDF mode". In this mode, you again use -f to specify the output file and you have to provide an additional configuration file (see Configuration_File_Syntax) which tells the exporter about the records and corresponding mappings which will be used for this export.

Here is a minimal example that uses all the available features:

wrap_before = ""
wrap_after = ""

nodes = {
	my_images = {
		mapping = object_mapping,
		restrictBySearch = "access:1",
		related = {
			concepts = {
				restrictToRelationshipTypes = [depicts],
				mapping = concept_mapping,
			agents = {
				restrictToTypes = [person],
				mapping = agent_mapping,

While processing this configuration, the exporter essentially builds one big list of records and corresponding mappings to export. There are no duplicates in this list, if object_id 23 is selected by two different node type definitions or by multiple related definitions, it is still only exported once, using the mapping provided by the first definition.

Here is an example of how to run an RDF mode export:

bin/caUtils export-data --rdf -c ~/rdf_mode.conf ~/export.xml

RDF Mode configuration file options

Setting Description
wrap_before Text to prepend before the export.
wrap_after Text to append after the export.
nodes List of primary node type definitions to be used for this export

Node type definition options

Setting Description
mapping Mapping to be used for this type of node. Has to be an existing mapping code.
restrictbySearch Restrict exported records using a search expression
related List of related records also to be included in the global node set. You can use this for example to make sure you only export list_items that are actually actively used as vocabulary terms for objects, meaning you don't have to create an extra node type (which would potentially export all list items in your database) for this.

'related' options

mapping Mapping to be used for this type of related item. Has to be an existing mapping code.
restrictToRelationshipTypes Restrict selected related records by relationship types. Has to be a list or empty.
restrictToTypes Restrict selected related records by record types (e.g. entity type). Has to be a list or empty.

Personal tools