EC2 Right-Sizing 101

Right-sizing EC2 instances is the single highest-leverage cost optimization most teams can make. It's also the most likely to be done badly. The pattern is familiar: a tool recommends "m5.xlarge → t3.medium", someone applies it, and a week later application latency tanks during a traffic spike.

The fix isn't to avoid right-sizing — it's to do it with the metrics that matter and a rollback plan ready.

What to measure

Three metrics drive correct right-sizing recommendations: CPU utilization, memory utilization, and network throughput. EC2 publishes CPU utilization and network throughput (NetworkIn/NetworkOut) to CloudWatch by default. Memory utilization is not visible from outside the guest OS, so it requires the CloudWatch agent (the agent itself is free, but the metrics it publishes are billed as custom metrics).
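As a rough illustration of the agent side, the sketch below builds a minimal CloudWatch agent configuration that publishes memory metrics once a minute. The layout follows the agent's JSON configuration format; the choice of measurements and the 60-second interval are assumptions for this example, not a recommended production config.

```python
import json

# Minimal CloudWatch agent configuration that publishes memory metrics.
# The measurements and interval are illustrative choices.
agent_config = {
    "metrics": {
        # Tag each datapoint with the instance it came from.
        "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
        "metrics_collected": {
            "mem": {
                "measurement": ["mem_used_percent", "mem_available_percent"],
                "metrics_collection_interval": 60,  # seconds
            }
        },
    }
}


def render_agent_config(config: dict) -> str:
    """Serialize the config as the JSON document the agent reads."""
    return json.dumps(config, indent=2)


print(render_agent_config(agent_config))
```

Each measurement listed here becomes a billable custom metric per instance, so it is worth keeping the list short.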

The window matters as much as the metrics. Sample at 1-minute granularity (this requires detailed monitoring; basic monitoring reports at 5-minute intervals) and look at the p99 over the trailing 30 days, not the mean. An instance averaging 15% CPU but hitting 85% p99 during morning peak is not over-provisioned; it's right-sized for its actual workload shape.
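To make the gap between mean and p99 concrete, here is a small sketch using a nearest-rank percentile. The sample data is hypothetical: one day of 1-minute CPU readings with a 40-minute peak, chosen to mirror the 15% average and 85% p99 case above.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical day of 1-minute CPU samples: 1,400 quiet minutes and a 40-minute peak.
cpu = [13.0] * 1400 + [85.0] * 40

print(f"mean={mean(cpu):.1f}%  p99={percentile(cpu, 99):.1f}%")
# prints: mean=15.0%  p99=85.0%
```

The peak is under 3% of the day, which is why it barely moves the mean but fully determines the p99.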

A useful heuristic: if p99 utilization is below 40% of available capacity across all three metrics for 30+ days, the instance is a candidate for downsizing.
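The heuristic can be sketched as a simple filter. The `InstanceP99` record and the instance IDs are hypothetical, and expressing network throughput as a percentage of the instance's rated bandwidth is an assumption made here so that all three metrics share one scale.

```python
from dataclasses import dataclass

THRESHOLD_PCT = 40.0  # the 40% heuristic from the text


@dataclass
class InstanceP99:
    """30-day p99 figures for one instance (hypothetical record shape)."""
    instance_id: str
    cpu_pct: float
    mem_pct: float
    net_pct: float  # throughput as % of rated bandwidth (assumption)


def downsizing_candidates(fleet, threshold=THRESHOLD_PCT):
    """Return IDs of instances whose p99 is under the threshold on all three metrics."""
    return [
        i.instance_id
        for i in fleet
        if max(i.cpu_pct, i.mem_pct, i.net_pct) < threshold
    ]


fleet = [
    InstanceP99("i-web-1", cpu_pct=22.0, mem_pct=31.0, net_pct=8.0),
    InstanceP99("i-web-2", cpu_pct=18.0, mem_pct=72.0, net_pct=5.0),   # memory-bound
    InstanceP99("i-batch-1", cpu_pct=85.0, mem_pct=20.0, net_pct=3.0),  # CPU-bound
]
print(downsizing_candidates(fleet))
# prints: ['i-web-1']
```

Taking the maximum across the three metrics means a single constrained resource is enough to keep an instance off the list.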

What to ignore

Two metrics that show up in right-sizing tools but should not drive decisions on their own:

Average utilization. Means almost nothing for workloads with variable load shapes. A web tier serving a global app can average 20% CPU while peaking at 80% during regional business hours, so downsizing based on the average ruins peak performance.

Memory used by file system cache. Linux aggressively caches files in otherwise unused memory, which can make memory utilization look like 90% even when the workload only needs 30%. The relevant metric is "available memory" (free plus reclaimable cache; MemAvailable in /proc/meminfo, or mem_available_percent from the CloudWatch agent), not "used memory".
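A short sketch of the difference, parsing `/proc/meminfo`-style text. The sample values are hypothetical and chosen to reproduce the 90% versus 30% case; `MemAvailable` is the kernel's own estimate of memory that can be handed to applications without swapping.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def memory_pressure(info):
    """Return (naive used %, used % excluding reclaimable memory)."""
    total = info["MemTotal"]
    naive_used = 100 * (total - info["MemFree"]) / total
    real_used = 100 * (total - info["MemAvailable"]) / total
    return naive_used, real_used


# Hypothetical 16 GB host with a large page cache.
SAMPLE = """\
MemTotal:       16000000 kB
MemFree:         1600000 kB
MemAvailable:   11200000 kB
Buffers:          400000 kB
Cached:          9600000 kB
"""

naive, real = memory_pressure(parse_meminfo(SAMPLE))
print(f"naive used={naive:.0f}%  used excluding reclaimable cache={real:.0f}%")
# prints: naive used=90%  used excluding reclaimable cache=30%
```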

Right-sizing in practice: a workflow

Identify candidates. Pull 30-day p99 CPU, memory, and network metrics for all instances. Filter to instances where all three are under 40%.

Group by workload type. Right-sizing decisions differ by workload. A web tier can usually scale to a smaller instance with auto-scaling for headroom. A database tier needs more headroom for memory-intensive query patterns. A batch job tier can often move to spot instances entirely.

Stage in non-prod first. Apply the change in a staging environment and run real load for at least one full traffic cycle (24 hours minimum, ideally a week to catch weekly patterns).

Schedule the production change during a low-traffic window. Changing an instance type requires stopping and restarting the instance, so the timing matters. Most teams that get burned skip this step and apply the change on a Monday morning, hours before business traffic peaks.

Have a rollback plan ready. Document the original instance type, AMI, and any user-data scripts. If the new instance shows pathological latency, swap back immediately rather than debug live.
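The rollback documentation from the last step can be as simple as a small record saved before the change. A minimal sketch, assuming the input is one instance entry shaped like `Reservations[].Instances[]` in the EC2 DescribeInstances response; the `rollback_record` helper, its output field names, and the example IDs are all hypothetical.

```python
import json


def rollback_record(instance, user_data_b64=None):
    """Capture what is needed to restore an instance to its original shape.

    `instance` is a dict shaped like one entry of Reservations[].Instances[]
    from DescribeInstances. User data is fetched separately in EC2, so it is
    passed in here (base64-encoded) rather than read from `instance`.
    """
    return {
        "instance_id": instance["InstanceId"],
        "original_instance_type": instance["InstanceType"],
        "ami": instance["ImageId"],
        "user_data_b64": user_data_b64,
    }


# Hypothetical instance description.
example = {
    "InstanceId": "i-0123456789abcdef0",
    "InstanceType": "m5.xlarge",
    "ImageId": "ami-0abcdef1234567890",
}

record = rollback_record(example)
print(json.dumps(record, indent=2))
```

Storing the record next to the change ticket keeps the swap-back a lookup rather than an investigation.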

Family transitions worth knowing

Some right-sizing moves are family transitions, not just size reductions. A few worth knowing:

m5 → m6i or m7i. Newer Intel instances offer 10–20% better price-performance than m5 at the same vCPU/memory ratio. Same OS compatibility, drop-in replacement for most workloads.

m5 → m6g or m7g. Graviton (ARM) instances offer 20–40% better price-performance for compatible workloads. Requires ARM-compiled binaries — straightforward for Go, Python, Node, JVM languages; harder for C++ with platform-specific dependencies.

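c5/m5 → t3/t4g for low-utilization workloads. Burstable instances are dramatically cheaper for workloads that idle most of the time and burst occasionally. Note that t4g is Graviton-based, so it carries the same ARM requirement as m6g/m7g. Watch the CPU credit balance (CPUCreditBalance in CloudWatch): if it's hitting zero, the workload needs a non-burstable instance. T3 and T4g launch in unlimited mode by default, so an exhausted balance shows up as surplus-credit charges rather than throttling.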
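The credit-balance check for burstable instances can be sketched as a function over a series of `CPUCreditBalance` datapoints. The `floor` and `min_fraction` thresholds are illustrative assumptions, not AWS guidance.

```python
def credits_exhausted(balance_samples, floor=1.0, min_fraction=0.05):
    """Flag a burstable instance whose credit balance sits at or near zero too often.

    balance_samples: CPUCreditBalance datapoints over the review window.
    floor: balance at or below this counts as exhausted (assumed threshold).
    min_fraction: share of exhausted datapoints that triggers the flag (assumed).
    """
    if not balance_samples:
        return False
    at_floor = sum(1 for b in balance_samples if b <= floor)
    return at_floor / len(balance_samples) >= min_fraction


healthy = [120.0] * 100               # balance never drains
starved = [50.0] * 80 + [0.0] * 20    # drained for 20% of the window

print(credits_exhausted(healthy), credits_exhausted(starved))
# prints: False True
```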

RI / Savings Plan coverage is the multiplier

Right-sizing alone saves the difference between instance types. Right-sizing plus appropriate Reserved Instance or Savings Plan coverage compounds that.

A typical pattern: right-size first, run for 30 days to validate the new instance shape is stable, then purchase a 1-year Compute Savings Plan covering the new baseline. The combined savings often beat right-sizing alone by another 40%.

The mistake to avoid is purchasing RIs before right-sizing — you end up paying for committed capacity at the old (oversized) shape.
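The compounding works because the commitment discount applies to the already-reduced bill. A minimal sketch of the arithmetic; the 30% figures are placeholders, since actual Savings Plan discounts vary by instance family, region, term, and payment option.

```python
def combined_saving(rightsizing_saving, plan_discount):
    """Total saving versus the original on-demand bill, as a fraction.

    After right-sizing, (1 - rightsizing_saving) of the bill remains;
    the plan discount is then applied to that remainder.
    """
    return 1 - (1 - rightsizing_saving) * (1 - plan_discount)


# Placeholder rates: 30% from right-sizing, 30% commitment discount.
print(f"{combined_saving(0.30, 0.30):.1%}")
# prints: 51.0%
```

The two savings multiply rather than add, so 30% and 30% yield 51%, not 60%.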

What "doing it well" looks like at scale

Teams that get right-sizing right share three habits:

Continuous monitoring, not one-off pruning sessions. Fleets grow and workloads shift. Set up a quarterly review at minimum, monthly for fast-growing fleets.

Right-size workloads as units, not individual instances. A load-balanced web tier of 20 instances gets right-sized together with auto-scaling adjusted for the new shape.

Document the workload's expected p99, not its current actual utilization. The forward-looking question is "what does this need to serve well?", not "what is it using right now?".

The numbers

For a typical fleet:

20–30% of instances are over-provisioned by enough to warrant a size reduction.

Average savings per right-sized instance: 30–40% of that instance's monthly cost.

Combined with appropriate Savings Plan coverage: total savings often 50%+ on the right-sized portion.

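On a $50k/mo EC2 line item with spend spread evenly across instances, that's roughly $3–6k/mo from right-sizing alone and $5–7.5k/mo with Savings Plan coverage, recovered without changing the workload. Fleets where the oversized instances carry a larger share of spend land higher. Worth the afternoon.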
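A quick check of the fleet-level arithmetic, assuming spend is spread evenly across instances so that the share of oversized instances equals their share of the bill. That is a simplification: fleets where the oversized instances are also the expensive ones will recover more.

```python
def monthly_recovery(spend, share_oversized, saving_rate):
    """Monthly dollars recovered, assuming spend is evenly spread across instances."""
    return spend * share_oversized * saving_rate


SPEND = 50_000  # $/month EC2 line item from the text

# Right-sizing alone: 20–30% of instances, 30–40% saved on each.
low = monthly_recovery(SPEND, 0.20, 0.30)
high = monthly_recovery(SPEND, 0.30, 0.40)

# With Savings Plan coverage: 50% saved on the right-sized portion.
plan_low = monthly_recovery(SPEND, 0.20, 0.50)
plan_high = monthly_recovery(SPEND, 0.30, 0.50)

print(f"right-sizing alone: ${low:,.0f} to ${high:,.0f} per month")
print(f"with Savings Plan:  ${plan_low:,.0f} to ${plan_high:,.0f} per month")
```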

---

Refine surfaces resource-level right-sizing recommendations from CUR data, with the p99 utilization context to apply them safely. [See cost optimization](/product/cost-optimization).
