EC2 Right-Sizing 101
A practical guide to right-sizing EC2 instances without breaking workloads — what to measure, what to ignore, and how to roll back.
May 5, 2026 · 7 min read · Refine Team

Right-sizing EC2 instances is the single highest-leverage cost optimization most teams can make. It's also the most likely to be done badly. The pattern is familiar: a tool recommends "m5.xlarge → t3.medium", someone applies it, and a week later application latency tanks during a traffic spike.
The fix isn't to avoid right-sizing — it's to do it with the metrics that matter and a rollback plan ready.
What to measure
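Three metrics drive correct right-sizing recommendations: CPU utilization, memory utilization, and network throughput. EC2 publishes CPU utilization and network throughput (NetworkIn/NetworkOut) to CloudWatch by default; memory utilization requires the CloudWatch agent (the agent itself is free, but the custom metrics it publishes are billed).
The window matters as much as the metrics. Sample at 1-minute granularity (EC2 publishes at 5-minute intervals unless detailed monitoring is enabled) and look at the p99 over the trailing 30 days, not the mean. CloudWatch retains 1-minute data points for only 15 days, so a full 30-day window at that resolution means exporting the data or accepting 5-minute samples for the older half. An instance averaging 15% CPU but hitting 85% p99 during morning peak is not over-provisioned; it's right-sized for its actual workload shape.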
A useful heuristic: if p99 utilization is below 40% of available capacity across all three metrics for 30+ days, the instance is a candidate for downsizing.
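As a rough sketch of this heuristic, the check below assumes the utilization samples have already been collected and expressed as percentages of available capacity, and that they cover the 30-day window. The function names (`p99`, `is_downsize_candidate`) and the sample numbers are illustrative, not part of any AWS API.

```python
from typing import Dict, Sequence


def p99(samples: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty sequence."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # 1-based nearest rank, ceil(0.99 * n), computed with integer math.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def is_downsize_candidate(
    utilization: Dict[str, Sequence[float]],
    threshold_pct: float = 40.0,
) -> bool:
    """True only if every metric's p99 is below the threshold.

    `utilization` maps a metric name (e.g. "cpu", "memory", "network")
    to its samples, each a percentage of available capacity.
    """
    if not utilization:
        raise ValueError("need at least one metric")
    return all(p99(series) < threshold_pct for series in utilization.values())


# Hypothetical instance: mostly idle CPU with a short daily peak.
spiky_cpu = [15.0] * 980 + [85.0] * 20
fleet_sample = {
    "cpu": spiky_cpu,
    "memory": [30.0] * 1000,
    "network": [10.0] * 1000,
}
print(sum(spiky_cpu) / len(spiky_cpu))      # mean looks low
print(p99(spiky_cpu))                       # p99 reflects the peak
print(is_downsize_candidate(fleet_sample))  # peak keeps it off the list
```

Requiring all three metrics to pass is the conservative choice: a single constrained dimension is enough to make a smaller instance unsafe.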
What to ignore
Two metrics that show up in right-sizing tools but should not drive decisions on their own:
Average utilization. Means almost nothing for workloads with variable load shapes. A web tier serving a global app will average 20% CPU while peaking at 80% during regional business hours — downsizing based on the average ruins peak performance.
Memory used by file system cache. Linux aggressively caches files in unused memory, making memory utilization look like 90% even when the workload needs only 30%. The relevant metric is "available memory" (free plus reclaimable cache), not "used memory".
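To make the cache effect concrete, here is a minimal sketch that reads `/proc/meminfo`-style text and compares the naive "used" figure with one based on available memory. The helper names and the sample values are assumptions for illustration; `MemTotal`, `MemFree`, `MemAvailable`, and `Cached` are standard `/proc/meminfo` fields, with `MemAvailable` absent on older kernels.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not rest.strip():
            continue  # skip blank or malformed lines
        fields[name.strip()] = int(rest.split()[0])
    return fields


def naive_used_pct(meminfo: Dict[str, int]) -> float:
    """'Used' memory that counts file system cache as used."""
    total = meminfo["MemTotal"]
    return 100.0 * (total - meminfo["MemFree"]) / total


def memory_pressure_pct(meminfo: Dict[str, int]) -> float:
    """Memory the workload needs, excluding reclaimable cache."""
    total = meminfo["MemTotal"]
    # Prefer the kernel's own estimate; fall back to free + cached.
    fallback = meminfo["MemFree"] + meminfo.get("Cached", 0)
    available = meminfo.get("MemAvailable", fallback)
    return 100.0 * (total - available) / total


# Hypothetical snapshot of a cache-heavy host.
SAMPLE = """MemTotal:       16000000 kB
MemFree:         1600000 kB
MemAvailable:   11200000 kB
Cached:          9600000 kB
"""

info = parse_meminfo(SAMPLE)
print(naive_used_pct(info))       # looks nearly full
print(memory_pressure_pct(info))  # actual pressure is far lower
```

On a real host the same functions could be fed the contents of `/proc/meminfo` directly.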
Right-sizing in practice: a workflow
Family transitions worth knowing
Some right-sizing moves are family transitions, not just size reductions. A few worth knowing:
m5 → m6i or m7i. Newer Intel instances offer 10–20% better price-performance than m5 at the same vCPU/memory ratio. Same OS compatibility, drop-in replacement for most workloads.
m5 → m6g or m7g. Graviton (ARM) instances offer 20–40% better price-performance for compatible workloads. Requires ARM-compiled binaries — straightforward for Go, Python, Node, JVM languages; harder for C++ with platform-specific dependencies.
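c5/m5 → t3/t4g for low-utilization workloads. Burstable instances are dramatically cheaper for workloads that idle most of the time and burst occasionally (t4g is Graviton, so the ARM caveat above applies). Watch CPU credit balance: if it's hitting zero, the workload needs a non-burstable instance. T3 and T4g default to unlimited mode, where an exhausted balance shows up as surplus-credit charges rather than throttling, so watch the surplus credit balance as well.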
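The credit-balance check for burstable instances can be sketched as follows. It assumes the samples were already pulled from the `CPUCreditBalance` and `CPUSurplusCreditBalance` CloudWatch metrics; the function name `credits_exhausted` and the `floor` parameter are hypothetical.

```python
from typing import Optional, Sequence


def credits_exhausted(
    credit_balance: Sequence[float],
    surplus_balance: Optional[Sequence[float]] = None,
    floor: float = 0.0,
) -> bool:
    """True if the workload has outgrown a burstable instance.

    credit_balance: CPUCreditBalance samples over the review window.
    surplus_balance: optional CPUSurplusCreditBalance samples; any
        positive value means the instance burst past its earned
        credits while running in unlimited mode.
    floor: balance at or below which credits count as exhausted.
    """
    if not credit_balance:
        raise ValueError("need at least one credit balance sample")
    if min(credit_balance) <= floor:
        return True
    if surplus_balance is not None and any(s > 0 for s in surplus_balance):
        return True
    return False


# Hypothetical samples: one instance drains its credits, one does not.
print(credits_exhausted([120.0, 40.0, 0.0]))
print(credits_exhausted([120.0, 96.0, 110.0]))
```

Raising `floor` above zero turns this into an early warning rather than an after-the-fact check.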
RI / Savings Plan coverage is the multiplier
Right-sizing alone saves the difference between instance types. Right-sizing plus appropriate Reserved Instance or Savings Plan coverage compounds that.
A typical pattern: right-size first, run for 30 days to validate the new instance shape is stable, then purchase a 1-year Compute Savings Plan covering the new baseline. The combined savings often beat right-sizing alone by another 40%.
The mistake to avoid is purchasing RIs before right-sizing — you end up paying for committed capacity at the old (oversized) shape.
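A back-of-the-envelope sketch of why the order matters. All rates and the discount are hypothetical round numbers, and the Savings Plan is simplified to a flat percentage discount on the hourly rate, which is not how commitments are actually billed.

```python
HOURS_PER_MONTH = 730  # common approximation for monthly billing math


def monthly_cost(hourly_rate: float, discount: float = 0.0,
                 hours: int = HOURS_PER_MONTH) -> float:
    """Monthly cost of one instance at a given rate and discount."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return hourly_rate * (1.0 - discount) * hours


# Hypothetical inputs, not real AWS prices.
OLD_RATE = 0.20     # oversized instance, $/hour
NEW_RATE = 0.10     # right-sized instance, $/hour
SP_DISCOUNT = 0.25  # assumed commitment discount

baseline = monthly_cost(OLD_RATE)
rightsized = monthly_cost(NEW_RATE)
rightsized_then_committed = monthly_cost(NEW_RATE, SP_DISCOUNT)
committed_at_old_shape = monthly_cost(OLD_RATE, SP_DISCOUNT)

print(f"on-demand, oversized:      ${baseline:.2f}")
print(f"right-sized only:          ${rightsized:.2f}")
print(f"right-sized, then commit:  ${rightsized_then_committed:.2f}")
print(f"committed at old shape:    ${committed_at_old_shape:.2f}")
```

Under these assumptions, committing at the old shape costs more per month than right-sizing with no commitment at all, which is the trap described above.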
What "doing it well" looks like at scale
Teams that get right-sizing right share three habits:
The numbers
For a typical fleet:
On a $50k/mo EC2 line item, that's $5–10k/mo recovered without changing the workload. Worth the afternoon.
---
Refine surfaces resource-level right-sizing recommendations from CUR data, with the p99 utilization context to apply them safely. [See cost optimization](/product/cost-optimization).