Deduplication

This feature is available in v1.7. It is only available as a command-line utility and as part of the Replication protocol.

Usage

To run the deduplication by hand, use the remove-duplicate-records command in caUtils. It takes one mandatory parameter (-t) and one optional parameter (-d).

-t / --tables: a table or list of tables to run the duplicate records report for. Lists are separated by commas or semicolons.
-d: actually merge and delete duplicate records. Default is false.
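
To make the list syntax for -t concrete, here is a small Python sketch of how such a value could be split into table names. It only illustrates the separator rule stated above. The helper name parse_tables is hypothetical, and trimming whitespace and dropping empty entries are assumptions rather than documented caUtils behaviour.

```python
import re


def parse_tables(value):
    """Split a -t/--tables value on commas or semicolons (hypothetical helper)."""
    # Assumption: surrounding whitespace is ignored and empty entries are dropped.
    return [name.strip() for name in re.split(r"[,;]", value) if name.strip()]


tables = parse_tables("ca_entities,ca_places;ca_objects")
print(tables)  # ['ca_entities', 'ca_places', 'ca_objects']
```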

If you just want a report, run the script without -d:

$ support/bin/caUtils remove-duplicate-records -t ca_entities
CollectiveAccess 1.7 (133/GIT) Utilities
(c) 2013-2016 Whirl-i-Gig

Table ca_entities has 1 records that have potential duplicates.
	2 records have the checksum e6fece79354532493a45102948d80714bf773ef540c369de389a0483197f87ac
		entity_id: 1 (Homer J. Simpson)
		entity_id: 3 (Homer J. Simpson)

Once you have decided that these are in fact duplicates and want to merge them, back up your database and then rerun the script with the -d switch:

$ support/bin/caUtils remove-duplicate-records -t ca_entities -d
CollectiveAccess 1.7 (133/GIT) Utilities
(c) 2013-2016 Whirl-i-Gig

Table ca_entities has 1 records that have potential duplicates.
	2 records have the checksum e6fece79354532493a45102948d80714bf773ef540c369de389a0483197f87ac
		entity_id: 1 (Homer J. Simpson)
		entity_id: 3 (Homer J. Simpson)
	Successfully consolidated them under id 1

Configuration

The default settings should work in most cases. The script uses checksums to find potential duplicates, and a few settings control how these checksums are computed. To override the default values, add them to app.conf:

Setting name: <ca_tablename>_dont_use_idno_in_checksums
Description: If set to 1, the idno is excluded when computing the checksum for records from this table. This can be useful when merging two systems with automatically generated idnos (which will never line up). Defaults to 0.
Example: ca_entities_dont_use_idno_in_checksums = 1

Implementation details

The core implementation is in a trait "DeduplicateBaseModel", which is then mixed into BundlableLabelableBaseModelWithAttributes. It defines methods to compute checksums and some static utilities to list duplicates and merge records for the current table.
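
As a rough illustration of the listing and merging utilities, the Python sketch below groups record ids by checksum and picks one id per group to keep. It is not the trait's actual code, which is PHP. The function names and the input shape are hypothetical, the checksum strings are shortened placeholders, and the rule of keeping the lowest id is an assumption, chosen only because the sample run above consolidated records 1 and 3 under id 1.

```python
from collections import defaultdict


def find_duplicates(checksums):
    """Map checksum -> sorted ids, keeping only checksums shared by 2+ records.

    `checksums` maps record id -> checksum string (hypothetical input shape).
    """
    groups = defaultdict(list)
    for record_id, checksum in checksums.items():
        groups[checksum].append(record_id)
    return {c: sorted(ids) for c, ids in groups.items() if len(ids) > 1}


def plan_merges(duplicates):
    """Return (kept_id, ids_to_merge) per group; keeping the lowest id is an assumption."""
    return [(ids[0], ids[1:]) for ids in duplicates.values()]


# Shortened placeholder checksums, not real SHA-256 digests.
checksums = {1: "e6fece79", 2: "0badcafe", 3: "e6fece79"}
duplicates = find_duplicates(checksums)
merges = plan_merges(duplicates)
print(duplicates)  # {'e6fece79': [1, 3]}
print(merges)      # [(1, [3])]
```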

Computing the checksum

The record checksum is a SHA-256 hash of a serialized PHP array that contains:

  • the record idno (optional; this can be turned off, see above)
  • the type code, e.g. "image"
  • all preferred labels
  • all nonpreferred labels
  • the checksum of the hierarchy parent record (e.g. for places or storage locations)
  • the source code (source_id translated into a list item idno)
  • additional table-specific checksum components, for instance
    • ca_entities: lifespan
    • ca_lists: list_code
    • for anything under BaseRelationshipModel: the GUIDs of the left and right record, the relationship type code, and the effective date/source_info intrinsics
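
The component list above can be sketched in Python as follows. This is an illustration only, not the CollectiveAccess code: the real implementation is PHP and hashes a serialized PHP array, whereas this sketch substitutes deterministic JSON as the serialization, so its digests will not match real ones. The function name record_checksum, the field names, the sorting of labels, and the use_idno flag (standing in for the <ca_tablename>_dont_use_idno_in_checksums setting) are all assumptions.

```python
import hashlib
import json


def record_checksum(record, use_idno=True, parent_checksum=None):
    """Return a SHA-256 hex digest over the identifying parts of a record.

    Hypothetical sketch: JSON stands in for PHP's serialized array, and all
    field names are illustrative.
    """
    components = {
        "type": record["type"],
        # Assumption: labels are sorted so that their order does not matter.
        "preferred_labels": sorted(record.get("preferred_labels", [])),
        "nonpreferred_labels": sorted(record.get("nonpreferred_labels", [])),
        "parent": parent_checksum,
        "source": record.get("source"),
        # Table-specific extras, e.g. lifespan for ca_entities.
        "extra": record.get("extra", {}),
    }
    if use_idno:
        components["idno"] = record.get("idno")
    serialized = json.dumps(components, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


# Two made-up records that differ only in their idno.
a = {"idno": "2016.1", "type": "individual", "preferred_labels": ["Homer J. Simpson"]}
b = {"idno": "ENT-0003", "type": "individual", "preferred_labels": ["Homer J. Simpson"]}

same_with_idno = record_checksum(a) == record_checksum(b)
same_without_idno = record_checksum(a, use_idno=False) == record_checksum(b, use_idno=False)
print(same_with_idno, same_without_idno)  # False True
```

With the idno included, the two records hash differently. With it excluded they collapse to the same checksum, which is the situation the idno setting is meant for when merging systems whose generated idnos never line up.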
