This is the first article of a four-part series. The subject I am going to write about is Elasticsearch: how we can set up a proper architecture, and how we can optimize our configuration to improve the performance and reliability of the service.
A few details about Elasticsearch:
- Elasticsearch is a distributed, open-source search and analytics engine based on Lucene, designed for horizontal scalability, multi-tenancy, and easy management.
One of its most powerful features is the possibility to interact with the stored data in two different ways:
- API: a way for the client to interact with the server through HTTP methods (GET, POST, PUT, DELETE).
- Elasticsearch DSL: a query language that sits on top of Elasticsearch and enables us to write and run queries against the stored data.
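As a sketch of how the two modes fit together, the example below builds a simple Query DSL body and the REST endpoint it would be sent to (the index name, field name, and host are made-up examples, not from this post):

```python
import json

# A simple Query DSL body: match documents whose "title" field
# contains the word "architecture" (index and field are hypothetical).
query = {
    "query": {
        "match": {
            "title": "architecture"
        }
    }
}

# Through the REST API, this body would be sent to the _search endpoint,
# e.g. POST http://localhost:9200/my-index/_search
endpoint = "http://localhost:9200/my-index/_search"
body = json.dumps(query)
print(endpoint)
print(body)
```

The same JSON body works whether you send it with curl, a client library, or a query builder: the DSL is just structured JSON handed to the REST API.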
After this short description of Elasticsearch, the next step is to figure out how an Elasticsearch cluster should be set up in a production environment.
My first thoughts in the approach to the architecture are summarized in these terms:
- High Availability
- Fault tolerance
Diving into the configuration options of Elasticsearch, we find that Elastic offers us different roles inside a cluster:
- Master: an instance with this role enabled holds the cluster state and keeps track of where all the shards are located in the cluster. When the cluster master goes down, the cluster automatically starts a new master election and picks a new master from among the master-eligible nodes.
- Data: these instances store all the data (the Elasticsearch indices) and perform the CRUD, search, and aggregation operations.
- Coordinator: a node that does not hold data, is not master-eligible, and has ingest disabled. It “only” routes requests and distributes bulk indexing operations; we can say it works as a “load balancer”.
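As a reference, in pre-7.x versions of Elasticsearch these roles are enabled or disabled per node in elasticsearch.yml with boolean flags (newer versions use a single `node.roles` list instead, so check the docs for your version). A dedicated coordinator is simply a node with everything disabled:

```yaml
# elasticsearch.yml — dedicated master node
node.master: true
node.data: false
node.ingest: false

# Dedicated data node:
#   node.master: false
#   node.data: true
#   node.ingest: false

# Dedicated coordinator node (everything off, it only routes requests):
#   node.master: false
#   node.data: false
#   node.ingest: false
```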
So, now that we know the different roles we can configure in an Elasticsearch cluster, let’s keep moving!
Let’s start with how to set up a stable architecture.
First of all, we need to avoid what we call “split brain”. Let’s say we have a cluster with two instances, both with the master role enabled.
Something that “never happens” could happen, for instance a network failure, and the communication between both masters gets interrupted.
Both masters believe the other master node is unavailable, so each decides to become active:
The implication of this behaviour is that we have two nodes acting as master, so both will receive indexing requests and will try to allocate the shards (primary and replica). There will end up being two primary copies of the same shard, which makes our cluster inconsistent.
In order to avoid this behaviour, we need to follow these steps:
First of all, in the main configuration file, elasticsearch.yml:
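For a cluster with three master-eligible nodes, the relevant setting (in pre-7.x versions of Elasticsearch, where this option still exists; from 7.0 onwards the cluster manages this quorum automatically) would look like this:

```yaml
# elasticsearch.yml
# Minimum number of master-eligible nodes that must be visible
# to elect a master: N/2 + 1, i.e. 3/2 + 1 = 2 for three masters
discovery.zen.minimum_master_nodes: 2
```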
This setting defines how many master-eligible nodes must have connectivity between them before a master can be elected. The rule is N/2 + 1 (where N is the number of master-eligible nodes), so basically we need at least three nodes with the master role enabled to achieve fault tolerance and high availability.
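The N/2 + 1 rule can be sketched as a small helper (a hypothetical function, just to illustrate the arithmetic and why two masters are not enough):

```python
def minimum_master_nodes(master_eligible_nodes: int) -> int:
    """Quorum needed to elect a master: N/2 + 1, using integer division."""
    return master_eligible_nodes // 2 + 1

# With 2 masters the quorum is 2: losing either one blocks elections,
# so two masters give no fault tolerance at all.
print(minimum_master_nodes(2))  # 2
# With 3 masters the quorum is still 2: the cluster survives losing one.
print(minimum_master_nodes(3))  # 2
```

This is why three is the minimum sensible number of master-eligible nodes: it is the smallest N where the cluster can lose a master and still reach quorum.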
Once we have all this information, we can start to figure out what a proper physical architecture should look like, and the picture that appears is as follows:
That’s all for this first approach to the wonderful world of Elasticsearch, see you in the next chapter!
A SysAdmin who landed happily in Datio after a few years managing physical and cloud platforms. Words like performance optimization, control, availability, reliability, scalability and system integrity are hardcoded in my IT-DNA.
When I am not in front of a black screen, I love to spend time with my two little khaleesis and run in the mountains where I live, which fills me up with pure energy.
2 thoughts on “Elasticsearch universe: Architectural perspective”
Hello Felix and congratulations for your post!
If you’re kind enough I would like to ask you a question. You said that we need at least 3 Master nodes to achieve HA/fault tolerance, however taking a look into your diagram, looks like coordinator node could be a SPOF. How do you deal with a faulty coordinator node?
First of all, thanks for your words and for your interest in my post. Regarding your question: every node in the cluster behaves like a coordinator node (master and data nodes can route requests, handle the search reduce phase, and perform bulk indexing) in addition to its own duties. The main reason to set up a dedicated coordinator node is to offload the master and data nodes.
However, setting up too many coordinator nodes is not recommended, as the master node needs to keep every single node up to date with the cluster state, and this increases the global load.
Also, if a coordinator node goes down, the cluster stays alive.