Difference between revisions of "ElasticSearch"

From CollectiveAccess Documentation
Jump to: navigation, search
Line 1: Line 1:
 +
['''Valid for v1.6''']
 +
 
== About ElasticSearch ==
 
== About ElasticSearch ==
  
[[http://www.elasticsearch.org/ http://www.elasticsearch.org/]]. "It is an Open Source (Apache 2), Distributed, RESTful, Search Engine built on top of [http://lucene.apache.org/ Apache Lucene]."
+
[https://www.elasticsearch.org ElasticSearch] is an an Open Source (Apache 2), Distributed, RESTful, Search Engine built on top of [http://lucene.apache.org/ Apache Lucene]. CollectiveAccess added support in v1.3 and completely overhauled the plugin again in v1.6. It now uses the official [https://github.com/elastic/elasticsearch-php ElasticSearch PHP API] and no longer requires running a script to maintain the field mapping for the search engine.
 +
 
 +
'''Note that CollectiveAccess v1.6 only supports ElasticSearch 2.0 or higher!'''
  
 
== Setup ==
 
== Setup ==
  
Please refer to the ElasticSearch [http://www.elasticsearch.org/guide/reference/setup/installation.html website] for installation and setup notes. Once you have ElasticSearch set up, you will have to set aside an index for CollectiveAccess to use. By default that index is called "collectiveaccess" but that can be changed in the [[Search_Configuration|search.conf configuration file]] if you want to use one ElasticSearch setup for multiple CollectiveAccess instances. You also have to configure the communication endpoint you want CollectiveAccess to use. If you're running ElasticSearch locally with the default settings, the default values in the config file should work as is.
+
Please refer to the ElasticSearch [https://www.elastic.co/guide/en/elasticsearch/reference/current/setup.html website] for installation and setup notes. Prebuilt packages are available for apt/dpkg on Debian/Ubuntu, yum on CentOS/RedHat or homebrew on Mac OS X, and probably your favorite package manager too!
 +
 
 +
Once you have ElasticSearch set up, you will have to set aside an index for CollectiveAccess to use. By default that index is called "collectiveaccess" but that can be changed in the [[Search_Configuration|search.conf configuration file]] if you want to use one ElasticSearch setup for multiple CollectiveAccess instances. You also have to configure the communication endpoint you want CollectiveAccess to use. If you're running ElasticSearch locally with the default settings, the default values in the config file should work as is: localhost and port 9200.
  
 
<code><pre>
 
<code><pre>
Line 12: Line 18:
 
</pre></code>
 
</pre></code>
  
If ElasticSearch and CollectiveAccess are configured properly, the next step is to execute the following script:
+
== Interaction with the CollectiveAccess Providence Installer ==
  
<code><pre>
+
CollectiveAccess will try to maintain the mapping automatically and it uses a few local caches to do so. When running the installer, these cache interactions can get a little wonky and lead to seemingly random exceptions and errors. Try to make sure you '''delete''' your ElasticSearch index and '''clear''' all local caches (by deleting the contents of app/tmp) '''before''' you run the installer if you have CollectiveAccess configured to use ElasticSearch before you install.
<collectiveaccess_base_dir>/support/utils/createElasticSearchSchema.php
 
</pre></code>
 
  
As the name indicates, the script deletes the configured index and recreates it (to wipe all existing data) and then creates an ElasticSearch [http://www.elasticsearch.org/guide/reference/mapping/ mapping] which tells the search engine how the data we index is structured based on the current CollectiveAccess metadata schema (the fields you configured for your system) so that it can be indexed properly. Data like date ranges, geographic points and numbers has to be treated differently so that we can take full advantage of ElasticSearch capabilities like range searches.
+
You could also run the installer using SqlSearch and then switch to ElasticSearch later. Remember to reindex your database contents after you switch.
  
 
== Operation ==
 
== Operation ==
  
ElasticSearch is able to change its schema dynamically if new fields are added but it is recommended to recreate the index and the mapping with the script mentioned above when you do extensive changes to your metadata schema in CollectiveAccess because ElasticSearch's guesses regarding the data structure are not always optimal which can for example lead to dates being treated as pure strings.
+
The v1.6 implementation of the ElasticSearch Plugin should take care of mapping maintenance automatically. If you see mapping exceptions piling up in the ElasticSearch log, that's probably because you've changed things in search_indexing.conf.
 +
 
 +
'''One common problem''' is having different analyzer/tokenizer settings for the same field in different contexts. For instance, you could have ca_object_lots.idno_stub indexed for ca_objects with no additional settings but then use INDEX_AS_IDNO and BOOST when indexing it for ca_object_lots. ElasticSearch doesn't like that. Try streamlining these settings so that they're the same for every occurrence of a field in search_indexing.conf. The default config that ships with Providence should be fine but if you have local changes, keep an eye on the ElasticSearch log and change your indexing config accordingly.
  
 
[[Category:Setup_and_Configuration]]
 
[[Category:Setup_and_Configuration]]

Revision as of 14:38, 2 February 2016

[Valid for v1.6]

About ElasticSearch

ElasticSearch is an an Open Source (Apache 2), Distributed, RESTful, Search Engine built on top of Apache Lucene. CollectiveAccess added support in v1.3 and completely overhauled the plugin again in v1.6. It now uses the official ElasticSearch PHP API and no longer requires running a script to maintain the field mapping for the search engine.

Note that CollectiveAccess v1.6 only supports ElasticSearch 2.0 or higher!

Setup

Please refer to the ElasticSearch website for installation and setup notes. Prebuilt packages are available for apt/dpkg on Debian/Ubuntu, yum on CentOS/RedHat or homebrew on Mac OS X, and probably your favorite package manager too!

Once you have ElasticSearch set up, you will have to set aside an index for CollectiveAccess to use. By default that index is called "collectiveaccess" but that can be changed in the search.conf configuration file if you want to use one ElasticSearch setup for multiple CollectiveAccess instances. You also have to configure the communication endpoint you want CollectiveAccess to use. If you're running ElasticSearch locally with the default settings, the default values in the config file should work as is: localhost and port 9200.

search_elasticsearch_base_url = http://localhost:9200/
search_elasticsearch_index_name = collectiveaccess

Interaction with the CollectiveAccess Providence Installer

CollectiveAccess will try to maintain the mapping automatically and it uses a few local caches to do so. When running the installer, these cache interactions can get a little wonky and lead to seemingly random exceptions and errors. Try to make sure you delete your ElasticSearch index and clear all local caches (by deleting the contents of app/tmp) before you run the installer if you have CollectiveAccess configured to use ElasticSearch before you install.

You could also run the installer using SqlSearch and then switch to ElasticSearch later. Remember to reindex your database contents after you switch.

Operation

The v1.6 implementation of the ElasticSearch Plugin should take care of mapping maintenance automatically. If you see mapping exceptions piling up in the ElasticSearch log, that's probably because you've changed things in search_indexing.conf.

One common problem is having different analyzer/tokenizer settings for the same field in different contexts. For instance, you could have ca_object_lots.idno_stub indexed for ca_objects with no additional settings but then use INDEX_AS_IDNO and BOOST when indexing it for ca_object_lots. ElasticSearch doesn't like that. Try streamlining these settings so that they're the same for every occurrence of a field in search_indexing.conf. The default config that ships with Providence should be fine but if you have local changes, keep an eye on the ElasticSearch log and change your indexing config accordingly.

Namespaces

Variants
Actions
Navigation
Tools
User
Personal tools