ELASTICSEARCH UNIVERSE: FLYING OVER THE DATA-I

Hi all again!!

This is the third tech paper of this series dedicated to the optimization of an Elasticsearch cluster.

Here you will find how to configure your searches to make them faster and more efficient.

Avoid big HTTP requests

Even when a search does not return the document source, Elasticsearch still needs to fetch the _id of every matching document. Because of the way the filesystem cache works, this operation is more expensive for big documents.

The maximum size of an HTTP request is controlled by the following setting:

elasticsearch.yml

http.max_content_length (default value 100 MB): ES will refuse to accept HTTP requests bigger than the size configured in this value.
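As a rough client-side guard, a request body can be measured before it is sent. The sketch below is only illustrative: the function name and the sample document are made up, and the limit is assumed to be the 100 MB default mentioned above.

```python
import json

# Assumption: the cluster uses the 100 MB default of http.max_content_length.
DEFAULT_MAX_CONTENT_LENGTH = 100 * 1024 * 1024


def fits_http_limit(document: dict, limit_bytes: int = DEFAULT_MAX_CONTENT_LENGTH) -> bool:
    """Return True if the JSON-encoded document is within the HTTP size limit."""
    body = json.dumps(document).encode("utf-8")
    return len(body) <= limit_bytes


if __name__ == "__main__":
    sample = {"title": "hello", "body": "x" * 1000}  # hypothetical document
    print(fits_http_limit(sample))                   # small document, within the default limit
    print(fits_http_limit(sample, limit_bytes=100))  # artificially low limit, too big
```

If the check fails, it is usually a sign that the document should be split into smaller ones rather than that the limit should be raised.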

Model your data

Denormalize your data (create redundant copies of the most frequently searched data in different parts of the index itself). The goal of this modeling is to decrease search time by making the data easier for your queries to find.

When denormalizing is not practical, Elasticsearch offers 2 approaches at the mapping level that deserve a look, keeping in mind that both make queries slower than searching denormalized documents:

  • Nested Document /Query
  • Parent & Child Relationship
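To give an idea of what the nested approach looks like, here is a minimal sketch that builds the JSON bodies for a nested mapping and a nested query. The field names (comments, author) are hypothetical, and the typeless mapping format assumes Elasticsearch 7.x or later.

```python
import json


def nested_mapping(path: str) -> dict:
    """Index-creation body that declares `path` as a nested field."""
    return {"mappings": {"properties": {path: {"type": "nested"}}}}


def nested_query(path: str, field: str, value: str) -> dict:
    """Search body matching documents that have a nested object with field == value."""
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {"match": {f"{path}.{field}": value}},
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical example: blog posts with nested comments.
    print(json.dumps(nested_mapping("comments"), indent=2))
    print(json.dumps(nested_query("comments", "author", "felix"), indent=2))
```

The first body would be sent when creating the index and the second one to the _search endpoint of that index.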

Cache is your friend

To enable the cache for whole search requests (the shard request cache, which by default only caches requests with size=0):

index.requests.cache.enable

indices.requests.cache.size

To enable the cache for your queries (filter context):

index.queries.cache.enabled

indices.queries.cache.size
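The following sketch shows how the two caches are typically exercised from the client side: clauses placed under filter in a bool query run in filter context, and the request cache can be toggled per index through the index settings API. The helper names and the field names are assumptions made for the example.

```python
import json


def filtered_search(text_field: str, text: str, filters: dict) -> dict:
    """Search body with a scored match clause plus term clauses in filter context."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {text_field: text}}],
                # Filter clauses do not affect scoring and are candidates for the query cache.
                "filter": [{"term": {name: value}} for name, value in filters.items()],
            }
        }
    }


def request_cache_settings(enabled: bool) -> dict:
    """Body for PUT /<index>/_settings that switches the request cache on or off."""
    return {"index.requests.cache.enable": enabled}


if __name__ == "__main__":
    # Hypothetical fields: title (text) and status (keyword).
    print(json.dumps(filtered_search("title", "elasticsearch", {"status": "published"})))
    print(json.dumps(request_cache_settings(True)))
```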

Also, after a cluster restart, it is highly recommended to allow some "warm-up time": the filesystem cache will be empty, so give it enough time to load enough data to make search operations fast again.

Make the most of the different caches in Elasticsearch (the filesystem cache, the request cache and the query cache). When you execute the same search several times in a row, each execution can be served by different shard copies in the cluster, so one of the searches may land on a copy whose caches do not hold the data yet, and you lose search consistency. To avoid this behaviour, you can set the following search parameter:

preference
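As an illustration, the sketch below builds a search path that sends all the searches of one user to the same shard copies by passing a custom preference string. Using a session identifier as the value is an assumption of this example; the index name is hypothetical.

```python
from urllib.parse import urlencode


def search_path(index: str, session_id: str) -> str:
    """Build a _search path that pins a session to the same shard copies."""
    # Custom preference strings must not start with "_" (reserved for built-in values).
    if session_id.startswith("_"):
        raise ValueError("custom preference values cannot start with '_'")
    return f"/{index}/_search?" + urlencode({"preference": session_id})


if __name__ == "__main__":
    print(search_path("my-index", "user-42"))  # /my-index/_search?preference=user-42
```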

* I will dive deeper into the cache options in the next tech papers.

Be cautious

After rebooting your Elasticsearch cluster you do not have any cached data, but you have the option to tell Elasticsearch to load the index files you choose into the filesystem cache as soon as the indices are opened.*

In config/elasticsearch.yml:

index.store.preload: ["doc"]

*Do not overuse this setting; tune it according to your hardware and OS configuration.
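The same setting can also be given for a single index when it is created, which keeps the preloading limited to the indices that really need it. A minimal sketch of the request body, with a made-up helper name:

```python
import json


def preload_settings(extensions: list) -> dict:
    """Index-creation body that preloads files with the given extensions."""
    return {"settings": {"index.store.preload": list(extensions)}}


if __name__ == "__main__":
    # "doc" is the extension used in the example above.
    print(json.dumps(preload_settings(["doc"])))
```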

Limit your borders

It is always better to search a specific subset of the data (for example, only the indices you need) than to search all the data.
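For instance, with time-based indices a search can be limited to the days of interest instead of hitting every index. The sketch below assumes a hypothetical daily naming scheme such as logs-2018.03.01; adapt it to the names used in your cluster.

```python
from datetime import date, timedelta


def indices_for_range(prefix: str, start: date, end: date) -> str:
    """Return the comma-separated daily index names between start and end (inclusive)."""
    days = (end - start).days
    names = [f"{prefix}-{start + timedelta(days=i):%Y.%m.%d}" for i in range(days + 1)]
    return ",".join(names)


if __name__ == "__main__":
    target = indices_for_range("logs", date(2018, 3, 1), date(2018, 3, 3))
    print(f"/{target}/_search")
    # /logs-2018.03.01,logs-2018.03.02,logs-2018.03.03/_search
```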

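Keep an eye on the DNS cache

By default the JVM, when it runs with a security manager as Elasticsearch does, keeps positive hostname resolutions in cache indefinitely. In architectures where the nodes resolve each other by hostname, the DNS resolution might change over time, so a stale cache entry can become a problem.

The best option is to configure a TTL in your JVM options:

networkaddress.cache.ttl=<timeout>

In recent Elasticsearch versions the equivalent JVM option is es.networkaddress.cache.ttl, which already defaults to 60 seconds.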

Monitor your cache

You can check the current status of the cluster caches thanks to the indices section of the nodes stats API (it gives you quite a few different statistics about what is going on in your cluster).

GET /_nodes/stats/indices/request_cache
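A quick way to read those statistics is to compute the hit ratio of the request cache per node. The sketch below works on the JSON returned by the call above; the sample response is made up for the example and only contains the fields the function reads.

```python
def request_cache_hit_ratios(stats: dict) -> dict:
    """Map each node id to hit_count / (hit_count + miss_count) of its request cache."""
    ratios = {}
    for node_id, node in stats.get("nodes", {}).items():
        cache = node["indices"]["request_cache"]
        total = cache["hit_count"] + cache["miss_count"]
        ratios[node_id] = cache["hit_count"] / total if total else 0.0
    return ratios


if __name__ == "__main__":
    # Made-up sample, not real cluster output.
    sample = {
        "nodes": {
            "node-a": {"indices": {"request_cache": {"hit_count": 30, "miss_count": 10}}},
            "node-b": {"indices": {"request_cache": {"hit_count": 0, "miss_count": 0}}},
        }
    }
    print(request_cache_hit_ratios(sample))  # {'node-a': 0.75, 'node-b': 0.0}
```

A low ratio together with a growing eviction count is usually a hint that the cache size deserves a second look.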



 


Félix Rodríguez

SysAdmin who landed happily in Datio after a few years managing physical and cloud platforms. Words like performance optimization, control, availability, reliability, scalability and system integrity are hardcoded in my IT-DNA. When I am not in front of a black screen, I love to spend time with my 2 little khaleesis and also to run in the mountains where I live, which fills me with pure energy.
