Search Indexing Configuration

From CollectiveAccess Documentation
Revision as of 17:07, 25 June 2013 by Julia

IN PROGRESS

The search_indexing.conf file controls which bundles (data elements) in your CollectiveAccess database are searchable, and how. Only elements listed in search_indexing.conf are searchable, although the special _metadata field forces the indexing of all attributes. Note that configuration of CollectiveAccess' browse system is completely independent of search: it is possible to search on data that are not browsable, and to browse on elements that are not indexed for search. See the browse configuration documentation for details about configuring browse.

Organization

The file is divided into sections for each item type to be indexed. The key for each item type is simply the table name. Within each section are sub-sections for related items whose content is to be indexed against the item at hand.
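
As a rough sketch of this layout (not a complete configuration), a section for the ca_objects table might be organized as shown here. The ca_object_labels sub-section, with its key and name entries, is an assumption added for illustration; check the search_indexing.conf shipped with your installation for the related-table entries it actually uses.

```
ca_objects = {
	# fields of the item itself
	fields = {
		_metadata = { },
		idno = { STORE, DONT_TOKENIZE, INDEX_AS_IDNO, BOOST = 100 }
	},
	# sub-section for a related table whose content is indexed
	# against the object (illustrative; names are assumptions)
	ca_object_labels = {
		key = object_id,
		fields = {
			name = { BOOST = 100 }
		}
	},
	# aliases and search shortcuts
	_access_points = {
		mat = {
			fields = [ca_objects.material],
			options = { DONT_INCLUDE_IN_SEARCH_FORM }
		}
	}
}
```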

Access Points

Access points, or _access_points, is a special sub-section that defines aliases for specific indexed elements or groups of elements. It also allows a user to set attributes to be used in search forms as well as search shortcuts.

Search Shortcuts

With _access_points you can create shortcuts to be used in any search system-wide, including Basic Search, Quick Search, Find in the Hierarchy bundle, and Advanced Search.

Let's say you want to create a search shortcut for a "Materials" element on your object record. In the _access_points sub-section of the ca_objects section of your configuration file:

ca_objects = {
	# ------------------------------------
	_access_points = {

you would add the "Materials" access point. The name of the shortcut (let's say "mat") goes on the left side of the equals sign:

ca_objects = {
	# ------------------------------------
	_access_points = {
		# "mat" is the shortcut typed before the colon in a search
		mat = {
			fields = [ca_objects.material],
			options = { DONT_INCLUDE_IN_SEARCH_FORM }
		}
	}
}

Within the square brackets to the right of the fields equals sign, each field is written as the CA table name, a period, and the attribute's element code (here, ca_objects.material).

Now you can quickly search for materials anywhere in your system using the syntax:

mat:stone

It is also possible to create shortcuts that bundle several elements together. A search on the access point will search all of the included fields at the same time. Separate the fields with commas:

style = {
	fields = [ca_objects.material, ca_objects.medium, ca_objects.technique]
},

Remember that if you want to search for multiple words within your single access point, quotation marks should enclose the whole string:

style:"stone sculpture"

A search for simply:

style:stone sculpture

would search for stone in the Materials, Medium and Technique fields AND for sculpture anywhere. That would most likely still return useful (but different) search results. Similarly, there should be no space between the colon and the search term (i.e. style: stone), because the search will "break" on the space and the search performed will be a universal query for stone.

If your target element for a search shortcut is a container, make sure to include the full path in the form ca_table.elementCode.subElementTarget, for example:

			fields = [ca_objects.description.description_source],	

Search forms

You may have noticed that in the code examples above an option was used:

options = { DONT_INCLUDE_IN_SEARCH_FORM }

This is because by default each defined metadata element will be pulled into the available elements for building search forms. Including your shortcut a second time would be redundant. However, if you're adding an access point that isn't already included (say, "filename" which until recently wasn't indexed by default but was stored in the database) you would define it here and remove the DONT_INCLUDE_IN_SEARCH_FORM option.
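
A minimal sketch of an access point that is meant to appear in search forms, so the DONT_INCLUDE_IN_SEARCH_FORM option is left out. The access point name file and the field path ca_object_representations.original_filename are hypothetical placeholders; substitute the table and field that hold the file name in your installation.

```
_access_points = {
	# hypothetical access point; the name and the field path are assumptions
	file = {
		fields = [ca_object_representations.original_filename],
		# no DONT_INCLUDE_IN_SEARCH_FORM, so the access point
		# stays available when building search forms
		options = { }
	}
}
```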

Note that all fields included in an access point must be included in the search index; in other words, they must appear in the fields list. All indexed fields automatically have access points created in the format tablename.fieldname (e.g. objects.title); indexed metadata also have access points in the format tablename.md_<element_id> (e.g. objects.md_5).

Fields

The next section of the configuration determines which fields are indexed for search and the option(s) each field carries. By default the configuration indexes every custom element created in the system by a user (via the "special field" _metadata) as well as a list of fields "baked into" the database, such as type (type_id), access, status, etc. An element must be defined as a field in this section of the configuration (via a "special field," by default, or explicitly by a user) in order to be indexed. User-defined fields are only necessary if the _metadata field is not used or if you want to index an intrinsic bundle that is not indexed by default.

Special fields

In addition to the default and any user-defined fields there are several "special fields." Special fields always start with an underscore character.

| Option | Description |
| --- | --- |
| _metadata | Forces indexing of all attributes created in the system by a user. |
| _count | Embeds the number of related rows for a given table in the index. It can only be specified for non-subject tables. The field is named <table_name>_count, for example object_representations_count for table 'object_representations'. This makes it possible to find rows that have, or don't have, related rows in a given table. Specifically, it is needed to implement "show only objects with media" functionality, since an INNER JOIN cannot be done in Lucene as it was in the old SQL-based search engine. |
| _hier_ancestors | Adds a number of specified fields of the ancestors of related rows to the index. This only works for hierarchical entities like place_names and voc_terms. For example, _hier_ancestors = { name } forces the indexer to look for ancestors of the current subject and add their "name" fields to a virtual field named _hier_ancestors. This enables you, for instance, to find objects related to the place "Madrid" while searching for "Spain". |
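
To illustrate where two of these special fields might sit, here is a hedged sketch rather than a working configuration. The ca_object_representations sub-section name is an assumption, and a real related-table sub-section also carries the entries that link it to the subject table, which are omitted here.

```
ca_objects = {
	fields = {
		# index every attribute created by a user
		_metadata = { }
	},
	# related (non-subject) table; sub-section name is an assumption
	ca_object_representations = {
		fields = {
			# embed the number of related rows in the index
			_count = { }
		}
	}
}
```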

Field-level options

A variety of options are available for defined fields.

| Option | Description | Example syntax |
| --- | --- | --- |
| STORE | Forces the value to be stored in the index, if possible. This can speed up display of the content in a search but may slow down indexing and increases index size. | not applicable |
| DONT_TOKENIZE | Indexes the value as-is, rather than breaking it into separate values on whitespace characters, such as spaces or line breaks, or on punctuation characters. | not applicable |
| DONT_INCLUDE_IN_SEARCH_FORM | As described above, prevents the field from being included in user-defined search forms. | not applicable |
| BOOST | A numeric "boost" value for the index field. Higher values cause search hits on the boosted field to count for more when sorting by relevance. | BOOST = 100 |
| INDEX_AS_IDNO | Causes the value to be indexed with various permutations for flexible retrieval as a record identifier. For example, if this option is used then a search for KA1 would return KA.0001. | not applicable |

Here's an example of a field, idno, that uses multiple options:

ca_objects = {
	fields = {
		# store as-is, index identifier permutations, and rank hits higher
		idno = { STORE, DONT_TOKENIZE, INDEX_AS_IDNO, BOOST = 100 }
	}
}
