Michael Cizmar
President And Managing Director @ MC+A
Elastic Cloud cluster for ELK Stack = Easy
Elastic Cloud makes deploying, operating, and scaling the Elastic Stack (ELK) in the cloud easy. Elastic Cloud is run by Elastic, the maker of Elasticsearch and related products. It runs on all the major public cloud providers (aka hyperscalers), and through its management console you can start with the minimal infrastructure your use case needs and then scale up to hundreds of nodes. In this article we’ll discuss the major considerations when optimizing your cloud cluster size, and therefore your bill.
How do you size the major components of your Elastic Cloud deployment?
Start by understanding your ELK use case.
Elastic Cloud serves a diverse range of use cases, including log analytics, enterprise search, APM (Application Performance Monitoring), and security monitoring. Your use case determines the hardware profile and redundancy you’ll need to configure. A deployment has 3 basic hardware profiles, along with additional variants depending on the hosting platform you choose. These hardware profiles determine the ratio of RAM, disk, and CPU resources. The following table shows the basic differences for search nodes running on AWS (Amazon Web Services).
Profile | Disk : RAM | vCPU per GB RAM | Notes |
---|---|---|---|
Storage Optimized | 30:1 | 0.138 | |
Storage Optimized (Dense) | 80:1 | 0.133 | Similar CPU and RAM, but much more disk. |
CPU Optimized | 12:1 | 0.529 | Roughly 2.5x the RAM per unit of disk and ~4x the vCPU of Storage Optimized. |
CPU Optimized (ARM) | 30:1 | 0.533 | Newer, faster CPUs. |
General Purpose | 10:1 | 0.267 | Best RAM-to-disk ratio. |
General Purpose (ARM) | 15:1 | 0.267 | Newer, faster CPUs. |
In general, if you are storing a large amount of data with a low query volume, lean toward the storage-optimized profiles. If your use case is traditional enterprise search or vector search, lean toward the CPU-optimized profiles.
Your use case is key for determining whether you need to plan for data growth or for significant query volume. You can further segment this by putting your data into specific data tiers and utilizing Index Lifecycle Management (ILM) policies to move data into the appropriate tier as it ages and becomes less likely to be accessed.
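As a concrete illustration, here is a minimal sketch of creating an ILM policy through the `_ilm/policy` REST API. The deployment URL, credentials, policy name, and the specific ages and actions are all hypothetical placeholders; tune them to your own retention requirements.

```python
import requests

# Hypothetical deployment URL and credentials -- substitute your own.
ES_URL = "https://my-deployment.es.us-east-1.aws.found.io:9243"
AUTH = ("elastic", "<password>")

# Illustrative policy: roll over hot indices by size, shrink in warm
# after 7 days, park in cold after 30 days, delete after 90 days.
policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {"rollover": {"max_primary_shard_size": "50gb"}}
            },
            "warm": {
                "min_age": "7d",
                "actions": {"shrink": {"number_of_shards": 1}},
            },
            "cold": {"min_age": "30d", "actions": {}},
            "delete": {"min_age": "90d", "actions": {"delete": {}}},
        }
    }
}

resp = requests.put(
    f"{ES_URL}/_ilm/policy/logs-tiering-policy", json=policy, auth=AUTH
)
resp.raise_for_status()
print(resp.json())  # {'acknowledged': True} on success
```

Indices (or data streams) that reference this policy will then migrate between the tiers automatically as they age.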
The data ingestion and query volume will be key factors in how much CPU capacity you need.
More Clusters = Better, right? ¯\_(ツ)_/¯
The business flexibility that comes with the ability to create clusters on demand can’t be overstated. But because of some limitations in cluster sizing, our recommendation is to stick with one cluster per use case. That logical separation has practical benefits when it comes to scaling, and narrowly focused clusters also let you take advantage of the most applicable, optimized hardware profile for a given use case. For example, a production search cluster powering your e-commerce website can use the CPU-optimized profile while an observability cluster on the storage-optimized profile monitors the performance of the other clusters and gathers the system’s logs.
Computing Resources - The Basic Costs
If you review the pricing table for Elastic Cloud, you can see that you are essentially paying for GB of RAM per hour. Based on the hardware profile, you get a corresponding allotment of disk and compute. The more RAM you consume, the more you are charged. There are other incidentals, including network I/O and snapshots, but these are typically less significant than the main cluster cost.
Memory
The fundamental sizing unit for Elasticsearch is RAM. The principal sizing metric for a cluster is maintaining a certain ratio of machine RAM to disk, what we call the “RAM-to-disk ratio”: the total RAM available across the cluster’s data nodes divided by the index size of the primary shards and their replicas. Our general rule is to target a specific ratio depending on the cluster’s use case. Standard ratios are:
- 1:15 – Autocomplete Query Completion
- 1:30 – Enterprise Search
- 1:100 – Warm Data
- 1:1000 – Cold or Frozen Data
But as mentioned above, this ratio is fixed by the hardware profile you chose and by the tier of the nodes (i.e. hot/warm/cold/frozen).
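To make the ratio concrete, here is a quick back-of-the-envelope sketch; the data volume and target ratio below are illustrative assumptions, not recommendations.

```python
# RAM sizing from a target RAM-to-disk ratio.
total_index_gb = 3_000   # primary shards + replicas on this tier (assumed)
target_ratio = 30        # 1:30 RAM:disk, e.g. enterprise search

ram_needed_gb = total_index_gb / target_ratio
print(f"Data-node RAM needed on this tier: ~{ram_needed_gb:.0f} GB")  # ~100 GB
```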
Storage
A cluster needs enough storage to hold all of the data needed to service your use case. Storage is the simplest component of a cluster to scale out, since the Elasticsearch Service (ESS) can autoscale it as your disk capacity runs low. Additionally, you can take advantage of the warm and cold tiers, moving data to a zone of the cluster that maintains more disk per unit of RAM. Moving data down the availability tiers lets you store data that is accessed less frequently at a lower price.
Ultimately, the data can be written into a snapshot and mounted read-only at the most extreme RAM-to-disk ratio using the frozen tier.
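For illustration, here is a minimal sketch of mounting an index from a snapshot onto the frozen tier via the searchable snapshots mount API (`storage=shared_cache` mounts it as a partially cached, read-only index). The deployment URL, credentials, snapshot name, and index names are hypothetical; `found-snapshots` is the default repository name on Elastic Cloud.

```python
import requests

# Hypothetical deployment URL and credentials -- substitute your own.
ES_URL = "https://my-deployment.es.us-east-1.aws.found.io:9243"
AUTH = ("elastic", "<password>")

# Mount an index from a snapshot as a partially cached, read-only
# frozen-tier index.
resp = requests.post(
    f"{ES_URL}/_snapshot/found-snapshots/daily-snap-2024.01.01/_mount",
    params={"storage": "shared_cache", "wait_for_completion": "true"},
    json={
        "index": "logs-2024.01.01",
        "renamed_index": "logs-2024.01.01-frozen",
    },
    auth=AUTH,
)
resp.raise_for_status()
```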
Compute
Cluster compute sizing varies with your use case and depends on a variety of factors, including shard configuration, query volume, index size, and the complexity of your queries. Compute capacity refers to the number of CPU threads available to your cluster; more threads mean more concurrent requests and processes.
A word of caution: threads are easily consumed by poorly designed indexes and queries. For example, a query against a single index with 8 shards consumes 8 threads. If that query takes 100 milliseconds and your cluster has 4 cores (8 threads with hyperthreading), the cluster can sustain only 10 QPS while being unable to do anything else (not a typical scenario, but it illustrates the math).
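Here is that arithmetic spelled out, using the same assumed numbers from the example above:

```python
# Back-of-the-envelope QPS ceiling (illustrative numbers from the text).
shards_per_query = 8    # one search thread consumed per shard searched
query_time_ms = 100     # latency of a single query
search_threads = 8      # 4 cores x 2 hyperthreads

# How many such queries can run at once before the pool is saturated:
concurrent_queries = search_threads // shards_per_query       # -> 1
# Each slot turns over every 100 ms, i.e. 10 times per second:
max_qps = concurrent_queries * (1000 / query_time_ms)         # -> 10.0
print(f"Sustainable throughput with zero headroom: ~{max_qps:.0f} QPS")
```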
Specialty Nodes
Which Version? It Matters
How do you determine if you are over- or undersized?
Like Goldilocks and the three bears, your cluster could be undersized, right-sized, or oversized. Typically, you will know when your cluster is undersized: Elastic generates a warning if there is an issue with disk or CPU utilization. But how do you know whether your cluster is right-sized (which would be great) or oversized and wasting resources and money?
Determining whether your ESS cluster has too much capacity (is oversized) involves evaluating multiple performance metrics and resource utilization against the requirements of your application and use case. These metrics can be gathered through a dedicated Elastic monitoring cluster or through another monitoring system such as Grafana.
Here is a short list of metrics you can use to assess whether your cluster is oversized (the sketch after this list shows one way to pull them):
- Monitor resource utilization
  - CPU usage: low average CPU usage with no spikes
    - Are your shards allocated across all nodes?
    - Does your search ratio actually utilize the nodes (searches * shards = concurrent CPU threads)?
  - Memory usage:
    - Monitor the JVM heap and ensure it remains within typical operational ranges
    - Compare your index size to the amount of RAM
  - Disk usage: monitor your overall disk usage and shard allocation
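As a starting point, a minimal sketch of pulling per-node utilization through the `_cat/nodes` API is shown below. The deployment URL and credentials are hypothetical placeholders; a single reading says little, so sample these numbers over days or weeks.

```python
import requests

# Hypothetical deployment URL and credentials -- substitute your own.
ES_URL = "https://my-deployment.es.us-east-1.aws.found.io:9243"
AUTH = ("elastic", "<password>")

# Per-node utilization snapshot: CPU, JVM heap, and disk usage.
resp = requests.get(
    f"{ES_URL}/_cat/nodes",
    params={"h": "name,cpu,heap.percent,disk.used_percent", "format": "json"},
    auth=AUTH,
)
resp.raise_for_status()
for node in resp.json():
    # Consistently low cpu and heap.percent across the data nodes,
    # with plenty of disk headroom, is a hint the cluster is oversized.
    print(node)
```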
A couple of key questions:
- Do you need the RAM to disk ratio that you selected in your profile?
- Are you seeing CPU spikes?
Remember, it’s always easier to scale up than down. Scaling down can cause reshuffling of shards and potentially the removal of nodes, which can have negative effects. Note also that autoscaling behavior differs by node type:
- Data nodes
  - Autoscale when the data exceeds the storage capacity
  - No CPU or RAM autoscaling
  - No downward scaling
- ML nodes
  - Autoscaling is based on loaded models and running jobs
  - Scale both up and down
Go Further with Expert Consulting
Launch your technology project with confidence. Our experts let you focus on your project’s business value by accelerating the technical implementation with a best-practice approach. We provide the expert guidance needed to enhance your users’ search experience, push past technology roadblocks, and leverage the full business potential of search technology.