Search Indexing Configuration
The search_indexing.conf file controls which data in your CollectiveAccess database is searchable, and how. Only data elements configured in search_indexing.conf are searchable. Note that configuration of CollectiveAccess' browse system is completely independent from search. It is possible to search on data that are not browse-able, and browse on elements that are not indexed for search. See this page for details about configuring the browse.
Organization
At the top level, search_indexing.conf is structured as a series of blocks, one for each type of item to be indexed:
ca_objects = {
    ... indexing configuration for ca_objects records ...
},
ca_entities = {
    ... indexing configuration for ca_entities records ...
},
ca_places = {
    ... indexing configuration for ca_places records ...
},
ca_occurrences = {
    ... indexing configuration for ca_occurrences records ...
},
...
Within each block is a sub-block for item fields as well as sub-blocks for related items and access points (aliases and short cuts for selected data elements or groups of elements). Content in related items may be indexed against the item. For example, you may have an object record indexed by its various fields (accession number, condition, appraised value) as well as by content in related entities (name of artist, nationality of artist), places (place of manufacture), storage location, and more. Indexing for each type of item is configured independently. You may have objects indexed with content taken from related entities, while omitting related object data from entity indexing, for instance.
A typical ca_objects block might look like this:
ca_objects = {
    # ------------------------------------
    ca_objects = {
        fields = {
            _metadata = { }, # forces indexing of all attributes
            parent_id = { STORE, DONT_TOKENIZE, DONT_INCLUDE_IN_SEARCH_FORM },
            source_id = {},
            lot_id = {},
            idno = { STORE, DONT_TOKENIZE, INDEX_AS_IDNO, BOOST = 100 },
            type_id = { STORE, DONT_TOKENIZE },
            source_id = { STORE, DONT_TOKENIZE },
            hier_object_id = { STORE, DONT_TOKENIZE },
            access = { STORE, DONT_TOKENIZE },
            status = { STORE, DONT_TOKENIZE },
            deleted = { STORE, DONT_TOKENIZE },
            is_deaccessioned = { STORE, DONT_TOKENIZE },
            deaccession_notes = {},
            deaccession_date = {},
            circulation_status_id = { STORE, DONT_TOKENIZE }
        },
        # Index idno's of related objects
        related = {
            fields = {
                idno = { STORE, DONT_TOKENIZE, INDEX_AS_IDNO, BOOST = 100 }
            }
        }
    },
    # ------------------------------------
    ca_object_labels = {
        key = object_id,
        fields = {
            name = { BOOST = 100, INDEX_ANCESTORS, INDEX_ANCESTORS_START_AT_LEVEL = 0, INDEX_ANCESTORS_MAX_NUMBER_OF_LEVELS = 4, INDEX_ANCESTORS_AS_PATH_WITH_DELIMITER = . },
            name_sort = { DONT_INCLUDE_IN_SEARCH_FORM },
            _count = {}
        },
        # Index names of related objects
        related = {
            fields = {
                name = { BOOST = 100, INDEX_ANCESTORS, INDEX_ANCESTORS_START_AT_LEVEL = 0, INDEX_ANCESTORS_MAX_NUMBER_OF_LEVELS = 4, INDEX_ANCESTORS_AS_PATH_WITH_DELIMITER = . }
            }
        }
    },
    # ------------------------------------
    ca_objects_x_entities = {
        key = object_id,
        fields = {
            _count = { }
        }
    },
    # ------------------------------------
    ca_entities = {
        tables = {
            entities = [ca_objects_x_entities]
        },
        fields = {
            idno = { STORE, DONT_TOKENIZE, INDEX_AS_IDNO, BOOST = 100 },
            _count = { }
        }
    },
    # ------------------------------------
    ca_entity_labels = {
        tables = {
            entities = {
                ca_objects_x_entities = { },
                ca_entities = {}
            },
            annotations = [ca_objects_x_object_representations, ca_object_representations, ca_representation_annotations, ca_representation_annotations_x_entities, ca_entities]
        },
        fields = {
            entity_id = { DONT_INCLUDE_IN_SEARCH_FORM },
            displayname = { PRIVATE },
            forename = {},
            surname = {},
            middlename = {}
        }
    },
    # ------------------------------------
    _access_points = {
        label = {
            fields = [ca_object_labels.name],
            options = { DONT_INCLUDE_IN_SEARCH_FORM }
        },
        desc = {
            fields = [ca_objects.description],
            options = { }
        },
    }
    # ------------------------------------
}
This may look a bit intimidating, but there are actually only three types of sub-blocks present: indexing configuration for the item itself (the indented ca_objects key immediately following the first ca_objects that defines the block), indexing from related items (the ca_object_labels keys and those referencing other tables that follow) and access point definitions (the _access_points key at the end of the sub-block). These sub-blocks form the core of the configuration, and are discussed in detail below.
Item sub-sections
Within a section for a given item type are several sub-sections:
Fields
The fields sub-section determines which fields are indexed for search and which option(s) each field carries. By default, the configuration indexes every custom element created in the system by a user (via the "special field" _metadata) as well as a list of fields "baked into" the database, such as type (type_id), access and status. An element must be listed as a field in this sub-section, either explicitly or through a special field, in order to be indexed. You only need to list fields explicitly if the _metadata field is not used, or if you want to index an intrinsic bundle that is not indexed by default.
Special fields
In addition to the default and any user-defined fields, there are several "special fields." Special fields always start with an underscore character.
Special field | Description |
_metadata | Forces indexing of all attributes created in the system by a user. |
_count | Embeds the number of related rows for a given table in the index. You can specify this for both relationship and primary tables. The field is named <table_name>.count; for example, object_representations.count for table 'object_representations'. This can be used to find rows that have, or don't have, related rows in a given table. When specified on a primary table (e.g. ca_entities, ca_occurrences), counts are indexed in aggregate as well as for each type. For relationship tables (e.g. ca_objects_x_entities), counts are indexed in aggregate as well as for each relationship type. Examples of querying on a specific type or types: ca_entities.count/individual:3 finds records with exactly three related entities of type "individual"; ca_objects_x_entities.count/artist:[2 to 4] finds objects with between two and four entities related as artist. |
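As a rough sketch of how _count connects configuration to queries, the fragment below enables counts on a relationship table inside the ca_objects block. The choice of ca_objects_x_places, the relationship type "created", and the queries in the comments are illustrative assumptions that follow the naming pattern described above; check them against your own installation before relying on them.

```text
ca_objects = {
    # Assumed sub-block: relationship table between objects and places
    ca_objects_x_places = {
        key = object_id,
        fields = {
            # Index the number of places related to each object.
            # Following the <table_name>.count pattern, this should allow
            # queries along the lines of:
            #   ca_objects_x_places.count:0
            #       (objects with no related places)
            #   ca_objects_x_places.count/created:[1 to 3]
            #       ("created" is a hypothetical relationship type)
            _count = { }
        }
    }
}
```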
Field-level options
A variety of options are available for defined fields.
Option | Description | Example syntax |
STORE | Forces the value to be stored in the index, if possible. This can speed up display of the content in a search, but may slow down indexing and increase index size. | not applicable |
DONT_TOKENIZE | Indexes the value as-is, rather than breaking it into separate values on whitespace characters (such as spaces or line breaks) or on punctuation characters. | not applicable |
DONT_INCLUDE_IN_SEARCH_FORM | Excludes the field from the list of fields that can be added to user-defined search forms. | not applicable |
BOOST | A numeric "boost" value for the index field. Higher values will cause search hits on the boosted field to count for more when sorting by relevance. | BOOST = 100 |
INDEX_AS_IDNO | Causes the value to be indexed with various permutations for flexible retrieval as a record identifier. For example, if this option is used then a search for KA1 would return KA.0001. | not applicable |
INDEX_ANCESTORS | Enables hierarchical indexing for the field, assuming it is in a hierarchical table. All values for this field in records above the subject in the hierarchy are indexed against the subject. | not applicable |
INDEX_ANCESTORS_START_AT_LEVEL | Forces hierarchical indexing to start X levels down from the root. This allows you to omit the very highest, and least selective, levels of the hierarchy when indexing. If omitted, indexing starts from the hierarchy root. | INDEX_ANCESTORS_START_AT_LEVEL = 2 |
INDEX_ANCESTORS_MAX_NUMBER_OF_LEVELS | Sets the maximum number of levels above the subject to be indexed. If omitted, all levels of the hierarchy above the subject are indexed. | INDEX_ANCESTORS_MAX_NUMBER_OF_LEVELS = 3 |
INDEX_ANCESTORS_AS_PATH_WITH_DELIMITER | Sets a delimiter to place between each level of the hierarchy prior to indexing the entire hierarchy path above the subject. This is useful when you want to treat the hierarchy path as an identifier | INDEX_ANCESTORS_AS_PATH_WITH_DELIMITER = . |
Here's an example of a field, idno, that uses multiple options:
ca_objects = {
    fields = {
        idno = { STORE, DONT_TOKENIZE, INDEX_AS_IDNO, BOOST = 100 },
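The hierarchical options can be combined in the same way. The following is a minimal sketch for place names; it assumes a ca_places block with a ca_place_labels sub-block keyed on place_id, mirroring the ca_object_labels sub-block in the ca_objects example, and the level values are arbitrary choices for illustration.

```text
ca_places = {
    # Assumed label sub-block for a hierarchical table
    ca_place_labels = {
        key = place_id,
        fields = {
            # INDEX_ANCESTORS: names of places above a record are indexed against it
            # START_AT_LEVEL = 1: skip the hierarchy root, the least selective level
            # MAX_NUMBER_OF_LEVELS = 3: index at most three levels above the record
            name = { BOOST = 100, INDEX_ANCESTORS, INDEX_ANCESTORS_START_AT_LEVEL = 1, INDEX_ANCESTORS_MAX_NUMBER_OF_LEVELS = 3 }
        }
    }
}
```

With a configuration along these lines, a search on the name of a region would also be expected to find the places beneath it, up to the configured number of levels.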
Access Points
The access points sub-section (use key _access_points) defines aliases for specific indexed elements or groups of elements. It also allows a user to set attributes to be used in search forms as well as search shortcuts.
Search Shortcuts
With _access_points you can create shortcuts to be used in any search system-wide, including Basic Search, Quick Search, Find in the Hierarchy bundle, and Advanced Search.
Let's say you want to create a search shortcut for a "Materials" element on your object record. In the Access points sub-section of the objects section of your configuration file:
ca_objects = {
    # ------------------------------------
    _access_points = {
you would add the "Materials" access_point. Whatever you want the shortcut to be (let's say "mat") should be included on the left side of the equals sign:
ca_objects = {
    # ------------------------------------
    _access_points = {
        mat = {
            fields = [ca_objects.material],
            options = { DONT_INCLUDE_IN_SEARCH_FORM }
        },
Within the square brackets to the right of the fields equals sign, give the CA table name followed by a period and the attribute's elementCode (here, ca_objects.material).
Now you can quickly search for materials anywhere in your system using the syntax:
mat:stone
It is also possible to create shortcuts that bundle several elements together. A search on the access point searches all of the included fields at the same time. Separate the fields with commas:
style = {
    fields = [ca_objects.material, ca_objects.medium, ca_objects.technique],
    options = { DONT_INCLUDE_IN_SEARCH_FORM }
},
Remember that if you want to search for multiple words within your single access point, quotation marks should enclose the whole string:
style:"stone sculpture"
A search for simply:
style:stone sculpture
would instead search for stone in the Materials, Medium and Technique fields AND for sculpture anywhere else. That would most likely still return useful, but different, search results. Similarly, do not put a space between the colon and the search term (as in style: stone), because the search will "break" on the space and the search performed will be a universal query for stone.
If your target element for a search shortcut is a container, make sure to include the full path in the form ca_table.elementCode.subElementTarget, for example:
fields = [ca_objects.description.description_source],
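Placed in context, a container sub-element shortcut might look like the sketch below. The shortcut name descsrc is hypothetical, and the fragment assumes your objects have a description container with a description_source sub-element, as in the path above.

```text
ca_objects = {
    _access_points = {
        # "descsrc" is a hypothetical shortcut name; pick any short alias
        descsrc = {
            # Full path: table, then container element code, then sub-element code
            fields = [ca_objects.description.description_source],
            options = { DONT_INCLUDE_IN_SEARCH_FORM }
        }
    }
}
```

A search such as descsrc:catalogue would then be limited to the description source sub-element.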
Search forms
You may have noticed that in the code examples above an option was used:
options = { DONT_INCLUDE_IN_SEARCH_FORM }
This is because by default each defined metadata element will be pulled into the available elements for building search forms. Including your shortcut a second time would be redundant. However, if you're adding an access point that isn't already included (say, "filename" which until recently wasn't indexed by default but was stored in the database) you would define it here and remove the DONT_INCLUDE_IN_SEARCH_FORM option.
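For contrast, here is a sketch of an access point that is meant to appear in search forms. It uses the deaccession_notes field from the ca_objects fields list in the example configuration; the shortcut name deacc is hypothetical.

```text
ca_objects = {
    _access_points = {
        # "deacc" is a hypothetical shortcut name
        deacc = {
            # The field must already appear in the fields list so that it is indexed
            fields = [ca_objects.deaccession_notes],
            # Empty options: DONT_INCLUDE_IN_SEARCH_FORM is left out, so the
            # access point stays available when building search forms
            options = { }
        }
    }
}
```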
Note that all fields included in an access point must be included in the search index; in other words, they must appear in the fields list. All indexed fields automatically have access points created in the format tablename.fieldname (e.g. objects.title); indexed metadata also have access points in the format tablename.md_<element_id> (e.g. objects.md_5).