Deduplication
This feature is available for v1.7. It can only be used as a command-line utility and as part of the Replication protocol.
Usage
If you want to run the deduplication by hand, there's a script in caUtils. It has one mandatory and one optional parameter.
- -t / --tables: a table or list of tables to run the duplicate records report for. Lists are separated by commas or semicolons.
- -d: actually merge and delete duplicate records. Default is false.
If you just want a report, run the script without -d:
    $ support/bin/caUtils remove-duplicate-records -t ca_entities
    CollectiveAccess 1.7 (133/GIT) Utilities
    (c) 2013-2016 Whirl-i-Gig

    Table ca_entities has 1 records that have potential duplicates.
    2 records have the checksum e6fece79354532493a45102948d80714bf773ef540c369de389a0483197f87ac
        entity_id: 1 (Homer J. Simpson)
        entity_id: 3 (Homer J. Simpson)
Once you have decided these are in fact duplicates and you want to merge them, back up your database and then rerun the script with the -d switch:
    $ support/bin/caUtils remove-duplicate-records -t ca_entities -d
    CollectiveAccess 1.7 (133/GIT) Utilities
    (c) 2013-2016 Whirl-i-Gig

    Table ca_entities has 1 records that have potential duplicates.
    2 records have the checksum e6fece79354532493a45102948d80714bf773ef540c369de389a0483197f87ac
        entity_id: 1 (Homer J. Simpson)
        entity_id: 3 (Homer J. Simpson)
    Successfully consolidated them under id 1
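The report-then-merge workflow described above could be wrapped in a couple of shell functions. This is only a sketch under stated assumptions: the backup step uses mysqldump and a database called "collectiveaccess", neither of which comes from this page, and the function names join_tables, dedup_report and dedup_merge are hypothetical helpers, not part of caUtils.

```shell
#!/bin/sh
# Sketch of a report-then-merge workflow around caUtils.
# Assumptions (not from the documentation): a MySQL/MariaDB backend that
# can be dumped with mysqldump, credentials supplied via the client's
# own configuration, and the hypothetical database name below.

CA_UTILS="support/bin/caUtils"
DB_NAME="collectiveaccess"   # hypothetical; use your own database name

# Join table names with commas, one of the list formats accepted by -t.
# The function body runs in a subshell so the IFS change does not leak.
join_tables() (
  IFS=,
  printf '%s\n' "$*"
)

# Report only: lists potential duplicates without changing anything.
dedup_report() {
  "$CA_UTILS" remove-duplicate-records -t "$(join_tables "$@")"
}

# Back up first, then merge and delete duplicates with -d.
# If the dump fails, stop before touching any records.
dedup_merge() {
  backup_file="ca_backup_$(date +%Y%m%d_%H%M%S).sql"
  mysqldump "$DB_NAME" > "$backup_file" || return 1
  "$CA_UTILS" remove-duplicate-records -t "$(join_tables "$@")" -d
}

# Example usage (commented out so that sourcing this file has no side effects):
#   dedup_report ca_entities ca_lists
#   dedup_merge  ca_entities ca_lists
```

Keeping the report and the merge in separate functions mirrors the advice above: review the report first, and only run the destructive step once a backup exists.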
Configuration
The default settings should work in most cases. The script uses checksums to find similar records, and a few settings control how these checksums are computed. To override the default values, add them to app.conf:
Setting name | Description | Example |
---|---|---|
<ca_tablename>_dont_use_idno_in_checksums | If set to one, the idno is excluded when computing the checksum for records from this table. This can be useful when two systems with automatically generated idnos (that will never line up) are being merged. Defaults to 0. | ca_entities_dont_use_idno_in_checksums = 1 |
Implementation details
The core implementation is in a trait "DeduplicateBaseModel", which is then mixed into BundlableLabelableBaseModelWithAttributes. It defines methods to compute checksums and some static utilities to list duplicates and merge records for the current table.
Computing the checksum
The record checksum is a simple SHA-256 hash of a serialized PHP array that contains:
- the record idno (or not; this can be turned off, see above)
- the type code, e.g. "image"
- all preferred labels
- all nonpreferred labels
- the checksum of the hierarchy parent record (e.g. for places or storage locations)
- the source code (source_id translated into a list item idno)
- additional table-specific checksum components, for instance
  - ca_entities: lifespan
  - ca_lists: list_code
  - for anything under BaseRelationshipModel: the GUIDs of the left and right record, the relationship type code, and the effective date/source_info intrinsics
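
To make the idea concrete, here is a rough shell sketch of how such a checksum behaves. It is not the actual implementation: the real code lives in the PHP trait DeduplicateBaseModel and hashes a serialized PHP array, whereas this sketch simply joins a few components line by line and pipes them through sha256sum (assumed to be available, as on most Linux systems). The function name record_checksum, the idnos "ENT.1" and "ENT.3", and the type code "individual" are made up for illustration.

```shell
#!/bin/sh
# Illustration only: NOT the code CollectiveAccess runs. It hashes a
# newline-joined list of components instead of a serialized PHP array,
# and covers only idno, type code and labels.

# record_checksum USE_IDNO IDNO TYPE_CODE LABEL...
# Prints the SHA-256 hex digest of the selected components.
# USE_IDNO=0 mimics <ca_tablename>_dont_use_idno_in_checksums = 1.
record_checksum() {
  use_idno="$1"; idno="$2"; type_code="$3"
  shift 3
  {
    # The idno is only part of the hashed data when it is enabled.
    if [ "$use_idno" = "1" ]; then
      printf 'idno=%s\n' "$idno"
    fi
    printf 'type=%s\n' "$type_code"
    # Remaining arguments are treated as labels.
    for label in "$@"; do
      printf 'label=%s\n' "$label"
    done
  } | sha256sum | cut -d' ' -f1
}

# Two hypothetical entity records that differ only in their idno.
a="$(record_checksum 0 ENT.1 individual "Homer J. Simpson")"
b="$(record_checksum 0 ENT.3 individual "Homer J. Simpson")"

if [ "$a" = "$b" ]; then
  echo "potential duplicates"
fi
```

With the idno excluded, both records produce the same checksum and are reported as potential duplicates. Passing 1 as the first argument includes the idno, so the two checksums differ and the records would no longer match.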