Data Import: Creating and Running a Mapping

From CollectiveAccess Documentation

<languages/> <translate> It's all well and good to have a CollectiveAccess configuration that perfectly accommodates your data, but if you can't accurately import that data it will be of little use. CollectiveAccess v1.4-v1.6 allows users to create their own mappings and use them to import source data into their systems either through the Command Line or through the User Interface. The overall process involves seven basic steps:

1 - First, create a mapping - a document that tells each bit of data in your source where to go in your new system.

2 - Then, if you already have other data in your system, create a backup of your database by executing a data dump (see here for more information).

3 - After taking this precaution, use your mapping document to run the import (either through the command line or the graphical user interface).

4 - When the data migration is complete, check it out in CollectiveAccess for inconsistencies and mistakes.

5 - Revise your mapping accordingly.

6 - Load the data dump so that the system returns to its pre-import state (thereby clearing out the data from the import you are about to re-run).

7 - Run the import again.

In order to follow these steps, of course, you will need to understand exactly how to write one of these mapping documents. Fortunately, we have created a template for you, available here: (File:Data Import Mapping template.xlsx) and we will go through each of the different aspects of the template below.


Writing a Mapping Document

If you have downloaded the template, you will have noticed that it's in Excel format. If you want to read this first and look at the template later, bear in mind that we're talking about an Excel spreadsheet when we mention rows and columns. This template can be used to import data in tab-delimited, CSV, XLSX, XLS, MySQL, FileMaker XML, Inmagic XML, MARC, CollectiveAccess data and more. No matter the source, the importer will assume two things:

1.) Each row in a data set corresponds to a single record - in other words, every cell in a given row of the Excel spreadsheet (for example) that you're using for your source data will pertain to one record. For information about how to map from other source file types, see here.

2.) Each column corresponds to a single metadata element.
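These two assumptions can be sketched in a few lines of Python. The file contents and column names below are hypothetical, not part of any real mapping; the point is simply that each row becomes one record and each column one metadata element:

```python
import csv
import io

# Hypothetical CSV source: each row is one record,
# each column is one metadata element.
source = io.StringIO(
    "idno,award year,created by\n"
    "2014.001,1998,Jane Doe\n"
    "2014.002,2001,John Smith\n"
)

# One dict per row, keyed by the column (metadata element) names.
records = list(csv.DictReader(source))

print(len(records))                  # two rows, so two records
print(records[0]["award year"])      # one column, one element value
```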

Rule Types

Rule Types are the most basic instructions you will give your data. At present there are five Rule Types, outlined below:

Mapping: Maps a data source column or table.field to a CollectiveAccess metadata element. Mappings can carry refineries (see below).
SKIP: Use SKIP to ignore a data source column or table.field.
Constant: Sets a data source column or table.field to an arbitrary constant value. Include the chosen value in the Source column on the mapping spreadsheet. Matches on CollectiveAccess list item idno.
Setting: Sets preferences for the mapping (see below).
Rule: Rules allow you to set record-level conditionals in your mapping, with target actions triggered by true or false outcomes. With Rules, you can manipulate the migration of specific data and/or set metadata based on expression statements.

We'll talk briefly about the first two in the following section; Constant and Setting are described later, under Settings and Constant.

Source, CA table.element, and Group

Let's take a look at this selection from a mapping:

Mapping example 3.png

and here is a selection from the source data spreadsheet:

First source example 2.png

This image doesn't show all of the columns in a mapping (we'll get to the rest later), but by taking a look at the first four columns, Rule Type (which we mentioned above), Source, CA table.element, and Group, we can begin to understand the road map for your data migration.

Take a look at the first row. Under the heading Rule Type you see the word "SKIP", and then under the heading Source you see the number 1. This means that, when importing data, all information coming from column 1 will be skipped. You can use this rule whenever you want your importer to ignore certain data.

Now look at the next row. You will see the word "Mapping" under Rule Type and the number 2 under Source. This means that this row on your mapping spreadsheet determines where all of the data contained in the metadata element described in column 2 of your original data source will go. That's a mouthful, so let's break this row down.

1.) You are using the Rule Type "Mapping." This simply means that this row will provide a roadmap for the transformation of a specific metadata element in your source data into the corresponding element in your CA configuration.

2.) Since the data is all coming from the second column in your source data (in this case the source is an Excel file, so we're using columns), the number 2 goes under Source.

3.) If you look at the data source sample above, you will see that column 2 is titled "award year." You want to tell all of the data in the column to go to the metadata element that's been created in CA to hold it, which, in this case, is:

<metadataElement code="award_year" datatype="DateRange">

4.) The metadata element in your CA configuration that is supposed to contain this data has been named "award_year," and it's placed on the "Collections" table. Therefore the complete path is ca_collections.award_year. This is what you write under the heading CA_table.element.

In this case, no further instructions are required. There is a clear 1 to 1 relationship between the column in your source data and the metadata element in CA.

But what if you're dealing with a slightly more complex Metadata element, like a container?

For an example, let's take another look at the bottom two rows in our sample above:

Mapping example 4.png

You can see that they have slightly longer "CA table.element" paths, and that both have the word "status" under the column heading "Group." The format for expressing a metadata element in a container is: CA table.element.subElement. In this example, the container is called "grantStatus" and the subElements are "grant_status" and "status_notes." However, it is not enough to simply write the path, you must also clarify that the data in question is destined for the same container. That's what the "Group" column is for. Simply put the same (essentially arbitrary) term under that heading for both subElements, and you will ensure that they go to the exact same container, rather than different versions of the same container. In this case, the term that links the two is "status."
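Written out, those two rows take roughly this shape. The element codes, table, and Group value are the ones from this example; the Source column numbers are hypothetical placeholders:

```
Rule Type   Source   CA table.element                          Group
Mapping     10       ca_collections.grantStatus.grant_status   status
Mapping     11       ca_collections.grantStatus.status_notes   status
```

Because both rows share the Group value "status," both subElements land in the same instance of the grantStatus container.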

Complex container structures

If there is more than one layer of containers (for example, a container and a sub-container like here), the bundle specifier namespace must be truncated to the main container.

Example:

Consider this container structure associated with an object:

 * container1
   * sub_container1
     * sub_container2
       * metadata

To access the metadata, the path (bundle specifier namespace) to put in the mapping file should be: ca_objects.container1.metadata (and not "ca_objects.container1.sub_container1.sub_container2.metadata").


Simple, right? Yes, but it gets a bit more complex in the next few columns of our template.

Options, Refinery, and Refinery Parameters

So what about more complicated metadata, such as relationships? Or, perhaps, metadata that needs to be combined from two elements in the source into one in CA? This is where Refineries come into play. A refinery, at a basic level, is what it sounds like - it refines an individual mapping and allows for greater complexity. Refineries can be one of the following: Splitters, Makers, Joiners, Getters, and Builders. For a detailed description with technical examples, see these tables.

Refineries

Refinery:Splitters

Let's begin with an example of Splitters. What do we mean by "Splitter," and when would we use it? A Splitter has several different potential uses.

1.) It could create a relationship by "splitting" the given data into a new record. In other words, if you need to create an entity relationship, and in so doing create a new record for the entity in question, you would use a Splitter. In the example below, the third column in the data source contained related entities, and we wanted to create related entity records for each of the names listed there. Don't worry, we'll talk about those "Refinery Parameters" in a moment.

Splitter example 1.png

Splitters used for scenarios like this (and the one below) include the entitySplitter, collectionSplitter, occurrenceSplitter, storageLocationSplitter and placeSplitter - as you can see, the formulation is "table type" followed by the word "Splitter." Also, take note of the fact that the CA_table.element code in this case is simply ca_entities.


2.) If there is already an existing record for the other side of the relationship, a Splitter can also be used to match data rather than just creating new records. This will be determined through Settings, which we'll discuss later.

3.) If a jumble of data in one cell or line in a source needs to be broken up into individual metadata elements, you can also use a Splitter (for example, if all of the dimensions of a given object are contained within one cell in an excel source but you want to break them up into individual components in CA, you would employ this tool using the measurementsSplitter).

Refinery:Maker

Maker refineries are used to create CollectiveAccess tour/tour stop, object lot/object and list/list item pairings. These relationships are different from other CollectiveAccess relationships for two reasons. Firstly, they don't carry relationship types. Secondly, these relationships are always single to multiple: a tour can have many tour stops, but a tour stop can never belong to more than one tour. Similarly, an object can never belong to more than one lot. List items belong to one and only one list. Here's an example of a Maker in use:

Maker.png

In this case, the source column (column 1) lists the tour name, which is the parent for the many tour stops that will each have their own records. Here is a sample from the source data:

Maker source.png

The "Maker" tells the data that it will be creating multiple records (for each of these places, or stops) under one tour parent.

Refinery:Joiner

A Joiner is used to merge data from two separate sources (different columns, for example) into a single field in CA. This can be useful for dates, for example, when the different sides of a date range are located in different cells. This joiner would be called the "dateJoiner." Similarly, it could be used to merge data from more than one source (column, etc.) into a single entity. For example, if first name and last name are in different columns in a spreadsheet, but you don't want to accidentally create a separate relationship for each, you would use the "entityJoiner."

Here's an example of the dateJoiner in use:

Joiner example 1.png

And here is a selection from the data source to which it refers:

Date join source.png

As you can see, the elements of the date are separated out in the original, and the dateJoiner needs to amalgamate them.
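What the dateJoiner accomplishes can be sketched in plain Python. This is only an illustration of the concept (the function name and output format here are made up, not the refinery's actual parameters): two source columns holding the ends of a date range are merged into one value.

```python
# Sketch of the idea behind a dateJoiner: merge separate
# start and end columns into a single date-range value.
def join_date_range(start: str, end: str) -> str:
    if start and end:
        return f"{start} - {end}"
    # If only one side is present, use it alone.
    return start or end

print(join_date_range("1942", "1945"))
print(join_date_range("1950", ""))
```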

Refinery:Getter

Getters are less frequently used; they are designed specifically for MySQL data mappings. These refineries map the repeating source data through the linking table to the correct CollectiveAccess elements. For more information, see the relevant table here.

Refinery:Builder

The collectionHierarchyBuilder is a new tool designed to allow you to import multiple levels of a collection hierarchy (e.g. fonds, series, file) from a single data source. This refinery creates new records, like a splitter, but instead of creating lateral relationships it creates hierarchical ones. Using a parameter called "Parents," which will be discussed below along with other parameters, the builder stacks higher-level collection records above lower ones. For example, if your source includes file and series in the same record, but you want a different collection record entry for each type, you can import on "file" and use the builder to create "series" records above that.

Refinery Parameters

As you may have noticed in the examples above, refineries are often accompanied by parameters that define their behavior. These can become complex, and attention to detail is important when writing them, so we'll dissect a few of them here. For complete tables that list each possible parameter, see Refineries.

Let's look at the entitySplitter example from above, as it has relatively simple parameters.

Splitter parameters.png

In this case, the parameter types are relationshipType, entityType, and delimiter. Let's look at each one and how it functions here.

1.) relationshipType - you need to define the relationship type between the new entity record you're creating with the Splitter and the record to which it's related. In some cases, such as this example, your source will all be of one relationship type. In the example above, all of the entities were in a column entitled "created by," and the relationshipType, defined in the CA configuration, was "creator." Sometimes your source data will have multiple relationship types to juggle in a single mapping. For example, take a look at this parameter:

{"relationshipType":"^32","entityType":"ind"}

Instead of saying something like "creator" under "relationshipType," you see "^32." What does that mean? Here is part of the Source for that mapping:

Parameter source.png

These "Associated Names" are all mapped as related entities, but, as you can see from the column to the right, they don't all have the same role (i.e. relationshipType). As a result, we need to set the parameters to draw the relationshipType from that column, column 32. When we do this however, we need to specify "Original Values" and "Replacement Values," a process that will be described below.


2.) entityType - you must define the type for the new record you're creating. In this case, the type is "ind," which is the code in the CA configuration for "individual." The type must match the code exactly.

3.) delimiter - in this case, a semicolon delineates between multiple entities in a single cell (name 1; name 2) that need to be separated; therefore, the delimiter is ";".
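The effect of the delimiter parameter can be sketched in a couple of lines of Python (the names in the cell are hypothetical):

```python
# A single source cell holding two entities, separated by the
# delimiter ";" - the splitter breaks it into individual values.
cell = "Jane Doe; John Smith"
entities = [name.strip() for name in cell.split(";")]
print(entities)
```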

You will also notice that a very particular formatting is employed when writing parameters. They must begin and end with curly brackets, the parameter type must be in quotes followed by a colon, and the parameter itself must be in quotes as well. We'll look at a few more examples and get the hang of the formatting while we're at it.

Let's take a look at a slightly more complicated case, also an entitySplitter.

Attributes example.png

There are a lot of curly brackets there, so let's break it down.

1.) relationshipType and entityType Again, this is a situation in which the relationshipType is simple and uniform. In this case, we're making related entity records for project contacts. entityType, similarly, is uniform - these are all individuals, and in this configuration the code for individuals is "ind."


2.) attributes Here's where things get a bit more complicated. In the data source, address information for the entities was included next to their names. Take a look:

Attributes source 1.png

You can pull a reasonable amount of data into a new record through the attributes parameter using the {"attributes":{"element_code":"^source"}} format. In this case, we wanted to grab all of the address information related to a given name, but we were dealing with multiple columns, all of which are contained within an address container in the configuration. The code for the container is "address" and the subElements are "address1," "address2," "city," "stateprovince," "postalcode," and "country." Because we're dealing with a container, the path specified in the parameter is slightly longer, with each subElement contained within brackets following "address," which is the code for the main element. As you can see, the main trick here is to keep track of your punctuation. Once you've pulled the source into the parameters using "^source" you don't need to write out an individual line for that source in the template spreadsheet. Here's how the results might look after an import:

Address container.png

3.) Parents This parameter can be used for both the collectionSplitter and the collectionHierarchyBuilder. In the case of the collectionSplitter, it is used to build records above the new laterally related records. In the collectionHierarchyBuilder, the parents parameter defines new records directly above the imported data. Within the Parents parameter you can map the idno, name, type, and (optional) metadata attributes of the parent record.

There are more parameters discussed here, along with examples. To check the formatting of your parameters (or of any other JSON) you can visit jsonlint.com. Often a slew of import errors can be attributed to a missing set of quotation marks, so it's always good to double-check.
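You can also spot-check a parameter locally with any JSON parser. For instance, in Python, a well-formed parameter (the one from the entitySplitter example above) parses cleanly, while a missing set of quotation marks raises an error:

```python
import json

# A well-formed refinery parameter parses without complaint.
params = '{"relationshipType":"^32","entityType":"ind"}'
print(json.loads(params)["entityType"])

# Dropping the quotation marks around ^32 makes it invalid JSON.
broken = '{"relationshipType":^32,"entityType":"ind"}'
try:
    json.loads(broken)
except json.JSONDecodeError as err:
    print("invalid JSON:", err.msg)
```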

Options

Refineries are not the only way to introduce greater complexity and accuracy to your import. You also have a column entitled "Options," in which you can set additional formatting and conditionals. Here's an example of that column in use:

Options 1.png

and here's the source column to which it refers:

Options source.png

As you can see, the data in that source generally says "posters," but there's a blank cell. This could be because of a cataloging mistake or any other number of reasons, but if you know that you have a clear default value you can protect against these occasional blanks with the option above, {"default":"value"}.
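For the example above, the Options cell would read something like the following (assuming "posters" is the default you want for blank cells):

```
{"default":"posters"}
```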

Another common use of "Options" is formatting. Here's an example:

Format options.png

And here's the source:

Source 15.png

In this case, a title for the record (preferred_label) needed to be formed from several disparate elements - the words "Oral History," the Catalog Number (from column 15 in the source, seen above), the word "with" and data from two additional columns, 2 and 3, seen below:

Source 2 3.png

Using this template would lead to: "Oral History 98.04.16JDMDGL with Jim Dauphinee."
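Assuming the standard template option (check the options tables for the exact name supported by your version), the Options cell for this example might look like this, where ^15, ^2, and ^3 pull from the source columns shown above:

```
{"formatWithTemplate":"Oral History ^15 with ^2 ^3"}
```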

There are many more options, which are listed (with examples) here.

Original Values and Replacement Values

Two columns in the mapping template that bear closer examination are "Original Values" and "Replacement Values." These are used to map list items, relationship types, or other variables to their corresponding codes. Let's look at an example:

Original replacement 2.png

This example contains another relatively complex parameter, but we'll boil it down to the relevant components for our purposes. As you can see, the parameters are pulling data from columns in the source - in this case, the source contained obituary information relating to an entity. Take a look at part of that source, the part that is referenced by "obituary":"^11", the last part of the attribute settings.

Replacement source.png

Now take a look at the list in the configuration that is contained in the relevant metadata element:

<list code="obituary" hierarchical="0" system="0" vocabulary="1">

     <labels>
       <label locale="en_US">
         <name>Obituary?</name>
       </label>
     </labels>
     <items>  
       <item idno="yes" rank="2" enabled="1" default="0">
         <labels>
           <label locale="en_US" preferred="1">
             <name_singular>yes</name_singular>
             <name_plural>yes</name_plural>
           </label>
         </labels>
       </item>
       <item idno="no" rank="1" enabled="1" default="1">
         <labels>
           <label locale="en_US" preferred="1">
             <name_singular>no</name_singular>
             <name_plural>no</name_plural>
           </label>
         </labels>
       </item>
     </items>
   </list>

As you can see, "Y" (the way that "yes" is indicated in the data source) is different from "yes" (the code in the configuration). As a result, you need to map the original value "Y" to its replacement value, "yes." You can add multiple values to a single cell, so long as each replacement value matches its original value line by line. This feature can be used to correct common misspellings in the source data, to normalize names and terms, or to conform to element codes for fixed list values in CollectiveAccess.
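The line-by-line matching of original to replacement values behaves like a simple lookup. A sketch in Python (the value lists are from this example; anything not listed passes through unchanged):

```python
# Original Values and Replacement Values are matched line by line,
# so they behave like a lookup table applied before import.
original_values = ["Y", "N"]
replacement_values = ["yes", "no"]
replacements = dict(zip(original_values, replacement_values))

print(replacements.get("Y", "Y"))          # replaced with the list item idno
print(replacements.get("maybe", "maybe"))  # unlisted values pass through as-is
```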


Settings and Constant

The remaining Rule Types mentioned above are Settings and Constant.

Settings, located at the bottom of your template, define overarching preferences for the mapping. For a list of all possible settings (at this time), see this table. They allow you to define the table and type, specify a code for your mapping spreadsheet (useful when uploading the importer), and dictate how this mapping will behave in regard to other data already in your database. Here's how they might look in context:

Settings.png

Here, the data source deals primarily with objects and there is no existing data with which to contend.

Constant is used in place of Mapping in your spreadsheet when you want to set an unchanging value to be used in every instance of a given field. For example:

Constant.png

In the example above, the two table elements are both within the same container. One, the dates themselves, is going to be a variable value - different records will have different creation dates. However, in this case the person writing the mapping knew that, in this particular container, which had a "type" drop-down, the type would always be "created." Here's that container in the User Interface:

Date element.png

As a result, the Rule Type "Constant" was used, and the value ("created," in this case) was typed where the Source would otherwise be indicated. The two rows are united in one container through the "Group" column, as discussed here.

Defining and running an Import

Now that you've crafted your mapping spreadsheet, you'll want to actually put it to use.

This is done in two steps:

  • define the import: add to imports list
  • run the import: actually import data


With CollectiveAccess version 1.4, there are two ways to go about this.

  • The recommended approach, detailed here, is to run the import through the command line. This gives you access to detailed logs, which you can share with the community if you run into trouble. It's also a good option if you're working with a less-than-dependable internet connection or a large data set, as you wouldn't want your browser to time out in the middle of an import.
  • However, if you're not comfortable using the command line, or if you simply prefer to work through the UI, CollectiveAccess now has a simple Data Import tool. Here's how it works:


To begin, navigate to Import --> Data in the Global Navigation Bar.

Data.upload.png

To upload your importer (mapping), click "+ add importers." You will then see a dotted outline around the words "Drag importer worksheets here to add or update."

Add importers.png

Simply drag the Excel worksheet containing your data map from your computer into the center of this box, and wait for the upload process to complete. Once uploaded, the importer (or importers) will appear in a list on your screen. To import data, you can either click the green arrow to the right of the importer's name, or click "Run Import" in the left-hand navigation.

Run import.png

On the "Run Import" screen, you must choose the appropriate (previously uploaded) importer, and the format (Excel, FMPro DSOResult, Inmagic XML, MARC, MySQL or CollectiveAccess) of your data source. Then you can attach the file containing your data.

Run import screen.png

Note for CollectiveAccess to CollectiveAccess migrations: in the Data URL field, you must include account credentials for the remote server. They are incorporated into the URL you use when you run the mapping. The url format is:

http://username:password@hostname/path/table?q=query

So if you were planning on importing all ca_objects records from the server julia.whirl-i-gig.com using account guest and password import your URL would read:

http://guest:import@julia.whirl-i-gig.com/ca_objects?q=*
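The URL can be assembled from its parts; here is a quick sketch using the credentials and host from this example:

```python
# Build a CollectiveAccess-to-CollectiveAccess data URL from its parts.
username, password = "guest", "import"
hostname, table, query = "julia.whirl-i-gig.com", "ca_objects", "*"

url = f"http://{username}:{password}@{hostname}/{table}?q={query}"
print(url)
```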

Once everything is set, click "Execute data import" to finish the process. </translate>
