Development:Synchronization Service


Overview

This page contains notes for the reimplementation of CollectiveAccess' master/slave sync'ing system. This system keeps one or more CollectiveAccess systems synchronized with changes made to one or more "master" systems. Typical use cases are:

  1. Sync a public-facing server running a front-end collections web site with a back-end cataloguing server. Changes are always made to the back-end and sync'ed periodically to the front-end for display to the public.
  2. Periodically sync several back-end servers to a single front-end collections web site. This is done for consortia portals such as NovaMuse, where 52 museum back-end systems are presented on a single web site.

This work is being performed within the scope of the 1.6 release planned for late 2015.

Current implementation

The current implementation is an ad-hoc script that uses the now-deprecated web service API. To perform a sync, the "slave" system executes a search on the "master" with a modified:"after <timestamp>" criterion, where the timestamp is the date/time of the last sync. The search expression can be configured but is typically "*", which returns all records modified since the last sync.
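A minimal sketch of the kind of request the current script makes, assuming a hypothetical find endpoint and parameter names (the deprecated web service API's actual paths and parameters are not reproduced here):

  import urllib.parse, urllib.request

  # Hypothetical master endpoint; the real deprecated API paths differ.
  MASTER_FIND_URL = "http://master.example.org/find/ca_objects"
  last_sync = "2015-06-01 00:00:00"  # date/time of the last successful sync

  # Search expression: everything modified since the last sync
  query = '* AND modified:"after %s"' % last_sync

  url = MASTER_FIND_URL + "?" + urllib.parse.urlencode({"q": query})
  with urllib.request.urlopen(url) as response:
      hits = response.read()  # set of modified records to pull item-level data for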

For each record in the returned set the script pulls the item-level data, then queues for sync any related records it is configured to sync. It will recursively spider related records until it hits a record with no related records, or with only related records it has already sync'ed in the current session.
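In outline, the spidering behaves roughly like the sketch below (function names and the record representation are illustrative, not the actual script):

  def spider(record_id, pull_record, get_related, synced=None):
      """Recursively pull a record and everything related to it.

      pull_record(record_id) -- fetches item-level data for one record (one service call)
      get_related(record_id) -- returns ids of related records configured for sync
      """
      if synced is None:
          synced = set()
      if record_id in synced:        # already sync'ed in this session: stop
          return synced
      pull_record(record_id)         # one discrete service call per record (slow)
      synced.add(record_id)
      for related_id in get_related(record_id):
          spider(related_id, pull_record, get_related, synced)
      return synced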

The current process makes the following assumptions:

  1. The configuration of the two systems is exactly the same in terms of metadata element codes, types and lists. Internal table primary key ids don't need to match but idno's, list codes and element codes do.
  2. Related media should be pulled over in its original form and reprocessed using local rules.
  3. All sync'ing is focused on a primary kind of record (Eg. ca_objects) with other kinds (Eg. ca_entities, ca_collections) pulled in as needed via relations to the primary.
  4. All communication is done via HTTP, typically on port 80, for simplicity and to avoid firewall headaches.
  5. Sync'ing is done periodically via cron job. While it could be run at any interval, it is typically run daily (see the example crontab entry after this list).
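For example, a daily sync might be scheduled with a crontab entry along these lines (the script path, log path and time are purely illustrative):

  # Run the sync script every night at 03:00 (paths are hypothetical)
  0 3 * * * /usr/bin/php /path/to/sync_script.php >> /var/log/ca_sync.log 2>&1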

The current process has several problems:

  1. It's relatively slow since it pulls item-level data for records one at a time using discrete service calls.
  2. It's relatively slow because it spiders across a network of related records. For example, sync'ing a single object record may cascade to a sync of hundreds of objects related to entities related to the initial object. The script does not analyze the change history of related records, but rather blindly syncs them.
  3. It can miss sync'ing changes:
    1. when records are changed in the period between when it starts and completes a sync.
    2. when the change in the master is to a related record that is not logged against the subject. (Eg. an entity is related to an object but the object's modified time is not incremented)
    3. when records are deleted. Some versions of the script can detect deletions but this is all very hacky.
    4. when the idno of the record being sync'ed changes on the master. The current script uses idno's to pair records on the master with their slave equivalents. If the master idno changes, the script will create a new record and orphan the old one, which will no longer get updates (not to mention being an unwanted duplicate).
  4. Sync will fail if the related list or list item used on the master is not on the slave.
  5. Sync cannot deal with two items with the same idno. Within a single dataset this should not happen, but it does with some datasets. When sync'ing several masters into a single slave (Eg. a consortium like NovaMuse) idno collisions are expected and do happen. The NovaMuse script has a hack to deal with this, but general support is needed.
  6. Sync replicates media by pulling the original by URL to the slave and reprocessing using slave-specific rules. Depending upon who you talk to this is a feature (the slave can have only the derivatives it needs; don't waste bandwidth pulling 8 or 10 or 12 files per record) or a bug (the slave needs to be able to do serious media processing and have all server-side processing applications installed).
  7. Handling of FT_MEDIA and FT_FILE metadata attributes seems to be broken.
  8. Have not tested sync with InformationService metadata attributes; probably broken.
  9. Not entirely sure sync'ing of hierarchical records works in all cases.

Some features are not currently standard, but have been hacked into various iterations of the script and should be standard in a new implementation:

  1. Rules-based quality control, allowing the sync to reject records that don't validate. When this has been done in the current script it's just a lump of project-specific code. For the new implementation it could be supported using expressions, plugins or both (see the configuration sketch after this list).
  2. Ability to report rejected records back to the master. These reports can be made available to cataloguers and system administrators.
  3. Filtering sync'ed records by access
  4. Filtering sync'ed records by type
  5. Filtering sync'ed records by source
  6. Ability to configure media sync'ing to use either on-slave processing (the current arrangement) or simple copy of all, or selected, derivatives.
  7. Ability to configure on-slave media processing to use a version other than the original. This can be useful for storage-constrained slaves, or for cases where it is not desirable to expose high-quality original media on a public server.
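As a rough illustration of how these options might be expressed, a hypothetical sync configuration (structure and key names invented for this sketch, written as a Python dict; not an existing CollectiveAccess configuration format) could cover the QC rules, filters and media handling like this:

  # Hypothetical sync configuration, for illustration only
  sync_config = {
      "quality_control": {
          # Placeholder rules; the real rule syntax (expressions, plugins or both)
          # is still to be defined
          "rules": [
              "idno is not empty",
              "record has at least one preferred label",
          ],
          "report_rejections_to_master": True,   # feature 2 above
      },
      "filters": {
          "access": [1],           # only sync publicly accessible records
          "types": ["artwork"],    # restrict by type
          "sources": [],           # empty = no source filter
      },
      "media": {
          "mode": "copy_derivatives",         # or "process_on_slave"
          "derivatives": ["thumbnail", "medium", "large"],
          "process_from_version": "large",    # use a version other than the original
      },
  }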

Features of new implementation

Key features:

  1. Use supported web service APIs. These may be ones that are currently available, or new ones optimized for sync.
  2. Devise a sync protocol that:
    1. Supports periodic or near-real-time change tracking.
    2. Does not miss changes on the master; this may require changing the master's change logging to log a wider variety of changes in related records.
    3. Properly handles deletes.
  3. Handle spidering more efficiently by:
    1. Only sync'ing related records that have actually changed.
    2. Pulling item-level information in batches rather than one record at a time (perhaps precompute the set of records that are needed before sync'ing?)
  4. Create a registry associating master internal ids with idno's, to make it possible to track idno changes on the master (see the sketch below).
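A minimal sketch of such a registry, assuming a simple SQLite table on the slave (table and column names are invented for illustration):

  import sqlite3

  # Hypothetical registry kept on the slave: maps (master system, master record id)
  # to the idno last seen for that record, so idno changes on the master can be
  # detected instead of silently creating an orphaned duplicate.
  conn = sqlite3.connect("sync_registry.db")
  conn.execute("""
      CREATE TABLE IF NOT EXISTS master_record_registry (
          system_id  TEXT NOT NULL,     -- unique id of the master system
          record_id  INTEGER NOT NULL,  -- primary key id of the record on the master
          idno       TEXT NOT NULL,     -- idno as last seen on the master
          PRIMARY KEY (system_id, record_id)
      )
  """)

  def previous_idno_if_changed(system_id, record_id, current_idno):
      """Return the previously recorded idno if it differs from current_idno, else None."""
      row = conn.execute(
          "SELECT idno FROM master_record_registry WHERE system_id = ? AND record_id = ?",
          (system_id, record_id),
      ).fetchone()
      if row and row[0] != current_idno:
          return row[0]   # idno changed on the master; re-pair instead of creating a new record
      return None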

Literature

  1. http://docs.couchdb.org/en/latest/replication/protocol.html
  2. https://github.com/git/git/blob/master/Documentation/technical/http-protocol.txt

Protocol

  • Need to devise a unique "system id" (hostname/config setting!?) and a GUID for each record (system_id + primary record id)
  • Leverage the sequential change log we already have to compute, in advance, the full set of records that needs to be pulled
  • Keep a sync log on the slave with the last sequence# (change log id) that was successfully applied for this master -- keep in mind slaves can have multiple masters
  • Treat relationships like every other record. That way we don't have to worry about "spidering" graphs for changed records. If a relationship is new or has changed, sync it.
    • Have to add change logging for relationship models
  • Replicator should be a discrete utility/script that talks to both sides using REST APIs. Some users may want to run it on the slave side, most will use the master -- but we shouldn't make assumptions about "pushing" and "pulling". (A rough sketch of the loop follows below.)
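A rough sketch of the replicator's main loop under these assumptions (the REST endpoints, payload shapes and GUID scheme shown here are placeholders, not a defined API):

  import requests  # assumption: a generic HTTP client; any would do

  MASTER_API = "http://master.example.org/replication"  # placeholder URL
  SLAVE_API  = "http://slave.example.org/replication"   # placeholder URL

  def make_guid(system_id, record_id):
      # GUID per record: master system id + primary record id, as proposed above
      return "%s:%s" % (system_id, record_id)

  def replicate(master_system_id):
      # 1. Ask the slave for the last change log sequence successfully applied
      #    for this master (a slave can have multiple masters).
      last_seq = requests.get(
          SLAVE_API + "/lastsequence", params={"system_id": master_system_id}
      ).json()["sequence"]

      # 2. Pull the change log from the master starting after that sequence.
      #    Relationships are treated as records, so they appear here too.
      entries = requests.get(
          MASTER_API + "/changes", params={"since": last_seq}
      ).json()["entries"]

      # 3. Apply each entry (insert/update/delete) on the slave, then record the
      #    new high-water mark in the slave's sync log.
      for entry in entries:
          guid = make_guid(master_system_id, entry["record_id"])
          requests.post(SLAVE_API + "/apply", json={"guid": guid, "entry": entry})
      if entries:
          requests.post(
              SLAVE_API + "/lastsequence",
              json={"system_id": master_system_id, "sequence": entries[-1]["log_id"]},
          )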
