
EC2 Right-Sizing 101

A practical guide to right-sizing EC2 instances without breaking workloads — what to measure, what to ignore, and how to roll back.

May 5, 2026 · 7 min read · Refine Team

Right-sizing EC2 instances is the single highest-leverage cost optimization most teams can make. It's also the most likely to be done badly. The pattern is familiar: a tool recommends "m5.xlarge → t3.medium", someone applies it, and a week later application latency tanks during a traffic spike.

The fix isn't to avoid right-sizing — it's to do it with the metrics that matter and a rollback plan ready.

What to measure



Three metrics drive correct right-sizing recommendations: CPU utilization, memory utilization, and network throughput. EC2 publishes CPU utilization and network throughput (NetworkIn/NetworkOut) to CloudWatch by default; memory utilization requires the CloudWatch agent (the agent itself is free, but the custom metrics it publishes are billed).

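The window matters as much as the metrics. Sample at 1-minute granularity and look at the p99 over the trailing 30 days, not the mean. One-minute EC2 datapoints require detailed monitoring, and CloudWatch retains them for only 15 days, so either export them or fall back to 5-minute datapoints to cover the full window. An instance averaging 15% CPU but hitting 85% at p99 during the morning peak is not over-provisioned; it is right-sized for its actual workload shape.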
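To make the mean-versus-p99 gap concrete, here is a minimal sketch using only the standard library. The sample data is hypothetical (one day of 1-minute CPU readings with a 30-minute peak), and `percentile` is an illustrative nearest-rank helper, not a CloudWatch API.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical day of 1-minute CPU samples: mostly idle, with a 30-minute morning peak.
cpu_samples = [12.0] * 1410 + [85.0] * 30

avg_cpu = mean(cpu_samples)
p99_cpu = percentile(cpu_samples, 99)

print(f"mean CPU: {avg_cpu:.1f}%")
print(f"p99 CPU:  {p99_cpu:.1f}%")
```

The mean stays in the low teens while the p99 lands on the peak value, which is why a decision based on the mean alone would undersize this instance.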

A useful heuristic: if p99 utilization is below 40% of available capacity across all three metrics for 30+ days, the instance is a candidate for downsizing.
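The heuristic can be written down as a small predicate. This is a sketch under stated assumptions: `UtilizationP99` and `is_downsize_candidate` are hypothetical names, and each value is assumed to already be expressed as a percentage of the instance's available capacity.

```python
from dataclasses import dataclass


@dataclass
class UtilizationP99:
    """p99 utilization for one instance, as a percentage of available capacity."""
    cpu_pct: float
    memory_pct: float
    network_pct: float
    days_observed: int


def is_downsize_candidate(u, threshold_pct=40.0, min_days=30):
    """True only if all three p99 values sit below the threshold over a long enough window."""
    if u.days_observed < min_days:
        return False
    return max(u.cpu_pct, u.memory_pct, u.network_pct) < threshold_pct


idle_box = UtilizationP99(cpu_pct=22.0, memory_pct=31.0, network_pct=8.0, days_observed=30)
spiky_box = UtilizationP99(cpu_pct=85.0, memory_pct=31.0, network_pct=8.0, days_observed=30)

print(is_downsize_candidate(idle_box))
print(is_downsize_candidate(spiky_box))
```

Taking the maximum across the three metrics means one hot dimension is enough to keep an instance off the candidate list.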

What to ignore



Two metrics that show up in right-sizing tools but should not drive decisions on their own:

Average utilization. It means almost nothing for workloads with variable load shapes. A web tier serving a global app can average 20% CPU while peaking at 80% during regional business hours, and downsizing based on the average ruins peak performance.

Memory used by file system cache. Linux aggressively caches files in otherwise unused memory, which can make memory utilization look like 90% even when the workload only needs 30%. The relevant metric is "available memory" (free plus reclaimable cache), not "used memory".
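As an illustration of the difference, the sketch below parses `/proc/meminfo`-style text and compares a naive "used" figure with one based on `MemAvailable`, the kernel's own estimate of memory that can be reclaimed for new work. Using `MemAvailable` rather than summing free and cached by hand is a choice made here, not something the post prescribes; the sample values and function names are hypothetical.

```python
def parse_meminfo(meminfo_text):
    """Parse /proc/meminfo-style text into a dict of field name -> value (kB for sized fields)."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if rest.strip():
            fields[name.strip()] = int(rest.split()[0])
    return fields


def naive_used_pct(fields):
    """Counts page cache as used memory, which overstates what the workload needs."""
    return 100.0 * (fields["MemTotal"] - fields["MemFree"]) / fields["MemTotal"]


def workload_used_pct(fields):
    """Treats reclaimable cache as available, using the kernel's MemAvailable estimate."""
    return 100.0 * (fields["MemTotal"] - fields["MemAvailable"]) / fields["MemTotal"]


# Hypothetical snapshot of a 16 GB instance with a large page cache.
SAMPLE_MEMINFO = """\
MemTotal:       16000000 kB
MemFree:         1600000 kB
MemAvailable:   11200000 kB
Cached:          9000000 kB
"""

fields = parse_meminfo(SAMPLE_MEMINFO)
print(f"naive used:    {naive_used_pct(fields):.0f}%")
print(f"workload used: {workload_used_pct(fields):.0f}%")
```

On a real host the same functions could be fed the contents of `/proc/meminfo`; the point is that the two percentages can differ by a wide margin on a cache-heavy machine.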

Right-sizing in practice: a workflow



  • Identify candidates. Pull 30-day p99 CPU, memory, and network metrics for all instances. Filter to instances where all three are under 40%.


  • Group by workload type. Right-sizing decisions differ by workload. A web tier can usually scale to a smaller instance with auto-scaling for headroom. A database tier needs more headroom for memory-intensive query patterns. A batch job tier can often move to spot instances entirely.


  • Stage in non-prod first. Apply the change in a staging environment and run real load for at least one full traffic cycle (24 hours minimum, ideally a week to catch weekly patterns).


  • Schedule the production change during a low-traffic window. Most teams that get burned skip this step and apply the change on a Monday morning, hours before business traffic peaks.


  • Have a rollback plan ready. Document the original instance type, AMI, and any user-data scripts. If the new instance shows pathological latency, swap back immediately rather than debug live.
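The rollback step in the workflow above can be sketched as a small record plus a decision rule. Everything here is illustrative: `RollbackRecord`, `should_roll_back`, the 1.5x latency tolerance, and the instance and AMI identifiers are assumptions, not values from the post or from any AWS API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RollbackRecord:
    """What to write down before the change so the swap back needs no investigation."""
    instance_id: str
    original_instance_type: str
    new_instance_type: str
    ami_id: str
    user_data_path: str


def should_roll_back(baseline_p99_ms, current_p99_ms, tolerance=1.5):
    """True when post-change p99 latency exceeds the baseline by more than the tolerance factor."""
    return current_p99_ms > baseline_p99_ms * tolerance


# Placeholder identifiers for illustration only.
record = RollbackRecord(
    instance_id="i-0123456789abcdef0",
    original_instance_type="m5.xlarge",
    new_instance_type="m5.large",
    ami_id="ami-0123456789abcdef0",
    user_data_path="user-data/web-tier.sh",
)

print(json.dumps(asdict(record), indent=2))

if should_roll_back(baseline_p99_ms=120.0, current_p99_ms=310.0):
    print(f"Roll back {record.instance_id} to {record.original_instance_type}")
```

Agreeing on the tolerance before the change is what makes "swap back immediately" a mechanical decision rather than a debate during an incident.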


Family transitions worth knowing



Some right-sizing moves are family transitions, not just size reductions. A few worth knowing:

m5 → m6i or m7i. Newer Intel instances offer 10–20% better price-performance than m5 at the same vCPU/memory ratio. Same OS compatibility, drop-in replacement for most workloads.

m5 → m6g or m7g. Graviton (ARM) instances offer 20–40% better price-performance for compatible workloads. They require an arm64 AMI and ARM-compiled binaries, which is straightforward for Go, Python, Node, and JVM languages but harder for C++ with platform-specific dependencies.

c5/m5 → t3/t4g for low-utilization workloads. Burstable instances are dramatically cheaper for workloads that idle most of the time and burst occasionally. Watch the CPU credit balance: if it keeps hitting zero, the workload needs a non-burstable instance. T3 and T4g launch in unlimited mode by default, so an exhausted balance shows up as surplus-credit charges rather than throttling.
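One way to turn the credit-balance check into a number is to measure how often the balance sits at zero. The sketch below assumes the readings were already pulled from the `CPUCreditBalance` CloudWatch metric; the function name, the sample series, and the 5% cutoff are hypothetical.

```python
def credit_exhaustion_ratio(balances, floor=0.0):
    """Fraction of credit-balance samples at or below the floor."""
    if not balances:
        raise ValueError("balances must not be empty")
    exhausted = sum(1 for b in balances if b <= floor)
    return exhausted / len(balances)


# Hypothetical CPUCreditBalance readings for two burstable instances.
healthy = [288.0, 260.5, 240.0, 275.0, 288.0, 288.0]
starved = [40.0, 12.0, 0.0, 0.0, 0.0, 6.0, 0.0, 0.0]

for name, series in (("healthy", healthy), ("starved", starved)):
    ratio = credit_exhaustion_ratio(series)
    verdict = "consider non-burstable" if ratio > 0.05 else "burstable fits"
    print(f"{name}: {ratio:.0%} of samples at zero -> {verdict}")
```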

RI / Savings Plan coverage is the multiplier



Right-sizing alone saves the difference between instance types. Right-sizing plus appropriate Reserved Instance or Savings Plan coverage compounds that.

A typical pattern: right-size first, run for 30 days to validate that the new instance shape is stable, then purchase a 1-year Compute Savings Plan covering the new baseline. The combined savings often beat right-sizing alone by another 40%.

The mistake to avoid is purchasing RIs or Savings Plans before right-sizing: you end up paying for committed capacity at the old (oversized) shape.
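The compounding and the ordering mistake can both be shown with a few lines of arithmetic. This is a sketch with assumed inputs: the 35% right-sizing reduction and the 25% commitment discount are placeholder rates chosen for illustration, not published AWS pricing.

```python
def combined_savings(right_size_saving, plan_discount):
    """Total fractional saving when a commitment discount applies to the already-reduced cost."""
    return 1 - (1 - right_size_saving) * (1 - plan_discount)


old_monthly_cost = 1000.0   # hypothetical cost of the oversized instance
right_size_saving = 0.35    # assumed reduction from moving to the smaller shape
plan_discount = 0.25        # assumed commitment discount, for illustration only

# Right-size first, then commit at the new baseline.
right_size_first = old_monthly_cost * (1 - combined_savings(right_size_saving, plan_discount))

# Commit first: the discount applies, but to the old oversized shape.
commit_first = old_monthly_cost * (1 - plan_discount)

print(f"right-size, then commit: ${right_size_first:,.2f}/mo")
print(f"commit before right-sizing: ${commit_first:,.2f}/mo")
```

Because the discount multiplies the remaining cost, the two savings compound rather than add, and committing at the oversized shape locks in the larger base for the whole term.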

What "doing it well" looks like at scale



Teams that get right-sizing right share three habits:

  • Continuous monitoring, not one-off pruning sessions. New instances grow and shift. Set up a quarterly review at minimum, monthly for fast-growing fleets.


  • Right-size workloads as units, not individual instances. A load-balanced web tier of 20 instances gets right-sized together with auto-scaling adjusted for the new shape.


  • Document the workload's expected p99, not its current actual utilization. The forward-looking question is "what does this need to serve well?", not "what is it using right now?".


The numbers



For a typical fleet:

  • 20–30% of instances are over-provisioned by enough to warrant a size reduction.
  • Average savings per right-sized instance: 30–40% of that instance's monthly cost.
  • Combined with appropriate Savings Plan coverage: total savings often 50%+ on the right-sized portion.


On a $50k/mo EC2 line item, that's $5–10k/mo recovered without changing the workload. Worth the afternoon.
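A rough back-of-envelope check of that estimate, under one loud assumption: spend is spread evenly across instances, so the share of over-provisioned instances equals their share of the bill. The function name is hypothetical, and the inputs are the 20–30% and 50% figures from the list above.

```python
def monthly_recovery(monthly_bill, overprovisioned_share, savings_rate):
    """Dollars recovered per month if the given share of spend shrinks by the savings rate."""
    return monthly_bill * overprovisioned_share * savings_rate


bill = 50_000  # the $50k/mo line item from the text

low = monthly_recovery(bill, overprovisioned_share=0.20, savings_rate=0.50)
high = monthly_recovery(bill, overprovisioned_share=0.30, savings_rate=0.50)

print(f"estimated recovery: ${low:,.0f} to ${high:,.0f} per month")
```

At exactly 50% savings this covers the lower half of the quoted range; reaching the upper end requires savings above 50% or over-provisioned instances that carry more than their even share of spend.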

---

Refine surfaces resource-level right-sizing recommendations from CUR data, with the p99 utilization context to apply them safely. [See cost optimization](/product/cost-optimization).

    Refine is built and supported by HabileLabs, an AWS Advanced Tier Services Partner.