Chapter 09 · Infrastructure

Empirical / classifier-defined

Inbox warmup — engagement curves and the synthetic-warmup detection problem.

Warmup is not a volume problem. It is the deliberate construction of positive engagement signal on a sending identity that every major receiver's reputation classifier currently treats as untrusted. Operators who optimize the wrong variable — daily send count rather than per-message engagement rate — produce the exact training signal that classifies their domain as bulk for the subsequent quarter.

TL;DR

Sizing formula: (daily volume / 60) × 1.1 = number of domains. Three mailboxes per domain. Max 20 cold sends per inbox per day.
Ramp: start at 5 sends/day per inbox, increase by 5 every two to three days, target 40 to 65/day steady-state.
Wait 14 days minimum before sending any cold mail. 21 days is safer.
Warmup-pool quality matters. Some tools share pools with spammers and actively get you blacklisted (GWarm is one named example with past reputation issues). Pick a reputable provider.
Keep warmup running forever, alongside production cold volume. Reputation decays — rotate domains on a ~90-day cycle and always have new ones warming in the background.

The premise — receivers classify on engagement, not volume

A receiving mail server, on accepting an inbound message from a previously unknown sending identity, does not produce a deliverability verdict from message content alone. It produces a verdict from a per-sender reputation score, computed continuously from the engagement signal accumulated against every prior message from that identity. The signal categories ingested by the major mailbox providers are documented, in coarse form, in the receiver-side bulk sender guidance published in 2023 and revised in February 2024: opens, replies, archival without read, archival after read, deletion without read, manual movement out of spam, manual movement into spam, and explicit spam-flag complaints.

The observation that determines warmup methodology: the classifier does not score raw send volume against a threshold. It scores the ratio of positive to negative engagement, weighted by signal strength. A reply is approximately one to two orders of magnitude stronger as a positive signal than an open. A manual movement out of spam is the strongest positive signal a single message can produce. A spam-flag complaint at any meaningful rate is operationally terminal in single-digit days. Warmup is the deliberate construction of positive engagement signal on a new sending identity that trains the classifier toward "this sender's mail is wanted" before the sender begins producing the cold-volume profile that would train the opposite.

The cold-start asymmetry

A freshly provisioned mailbox is treated as untrusted by default. Not a content judgment — a structural prior. With no engagement history, the classifier weights early-volume signal disproportionately and applies an exponentially decaying skepticism that takes 3 to 6 weeks of consistent positive engagement to fully discharge.

Send 30 cold messages on day one from a new mailbox and you get classified as a “high-risk bulk sender” for 2 to 6 weeks beyond the day you stop. This is the failure mode of the founder who buys domains Friday afternoon and launches a sequence Monday morning. The mailboxes burn before they ever have a chance.

Positive signal accumulates logarithmically. Negative signal, especially spam complaints, accumulates linearly with a destructive multiplier. A week of correct warmup gets you somewhere. A morning of incorrect cold launch on an unwarmed domain takes weeks to recover from, and sometimes the damage is permanent.

Domain and inbox sizing — the formula

Before you warm anything, size the estate to the volume you actually need. The rule:

3 mailboxes per domain (e.g., jamie@, j.smith@, js@). Mailboxes should look like real variations of one person — not sales@ or hello@.
Max 20 cold sends per inbox per day at steady-state. Past that, the classifier reads aggregate per-mailbox volume as bulk regardless of engagement.
That gives you 60 cold sends per domain per day.
Formula: (desired daily volume / 60) × 1.1 = number of domains needed. The 1.1 multiplier is buffer for rotation, downtime, and the occasional domain that fails warmup.

A founder targeting 200 cold sends per day needs 4 domains and 12 mailboxes (the formula gives 3.67, rounded up). A 1,000-send-per-day operation needs 18 to 19 domains and 54 to 57 mailboxes (the formula gives 18.33, so rounding up lands on 19). The cost is ~$10-12/domain and ~$7/mailbox/month on Google Workspace or Microsoft 365.
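The sizing rule is simple enough to sketch as a helper. This is a minimal illustration, not a canonical tool: the function name `size_estate` is hypothetical, and rounding the domain count up (rather than to the nearest integer) is an assumption made here so the buffer is never undershot.

```python
import math

MAILBOXES_PER_DOMAIN = 3    # from the sizing rule
COLD_SENDS_PER_INBOX = 20   # steady-state cap per inbox per day
BUFFER = 1.1                # rotation, downtime, failed-warmup buffer


def size_estate(daily_volume: int) -> tuple[int, int]:
    """Return (domains, mailboxes) for a target daily cold-send volume."""
    per_domain = MAILBOXES_PER_DOMAIN * COLD_SENDS_PER_INBOX  # 60 sends/day
    # Assumption: round up, so a fractional domain always becomes a whole one.
    domains = math.ceil(daily_volume / per_domain * BUFFER)
    return domains, domains * MAILBOXES_PER_DOMAIN


print(size_estate(200))    # (4, 12)
print(size_estate(1000))   # (19, 57)
```

Rounding to the nearest integer instead would give 18 domains for the 1,000-send case; rounding up trades one extra domain for headroom.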

Never send cold mail from your primary corporate domain. When reputation slips — and at some point it will — your founder@yourcompany.com goes down with it. Internal email, customer threads, fundraising conversations all start landing in spam. The cost of that recovery dwarfs the $12 of insurance per lookalike domain.

The ramp curve

The empirically derived warmup curve, observed across thousands of new sending identities in production, is approximately the following:

Week	Sends/day per inbox	Target reply rate	What the curve is training
1	1–5	30% or higher	Initial positive-engagement accumulation against the cold-start skepticism prior
2	5–15	25% or higher	Volume ramp under preserved engagement ratio; classifier transitions from untrusted to neutral
3	15–30	20% or higher	Volume floor near production rate; classifier transitions from neutral to trusted-low-volume
4+	30 steady, warmup continues at 10–15	10% or higher on warmup mail	Maintenance — production cold volume layered over preserved warmup-engagement floor

The shape is not arbitrary. It is the approximate inverse of the receiver-side skepticism decay: positive engagement accumulates logarithmically, so early-week volume must remain low enough that every individual message produces measurable positive signal rather than incremental volume against a flat engagement floor. Doubling volume on a week where reply rate would halve produces, in classifier terms, identical engagement counts but worse training data — 15% reply rate at 10 sends/day is a worse signature than 30% at 5 sends/day. Operators who compress the curve to two weeks, or who start at week-three volume on day one, produce the predictable failure: a domain that lands in spam for the subsequent month under any volume.
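The arithmetic behind that comparison can be checked directly. The sketch below uses only the two scenarios from the paragraph above; the helper name `daily_replies` is hypothetical.

```python
def daily_replies(sends_per_day: int, reply_pct: int) -> float:
    """Expected replies per day for a given volume and reply percentage."""
    return sends_per_day * reply_pct / 100


# (sends per day, reply rate in percent)
scenarios = {"slow ramp": (5, 30), "doubled volume": (10, 15)}

for name, (sends, pct) in scenarios.items():
    replies = daily_replies(sends, pct)
    print(f"{name}: {sends} sends/day at {pct}% -> {replies} replies/day")
```

Both scenarios yield 1.5 replies per day, so a count-based dashboard cannot tell them apart; only the rate column shows the difference.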

The engagement-target inversion

Most operator mental models of warmup are volume-indexed: how many sends per day, how many inboxes, how long until full volume. The receiver-side classifier is engagement-indexed. A warmup pool that produces 30% reply rates during week one trains the classifier as a wanted sender; the same volume at 5% reply rate trains the opposite, and produces a sending identity that lands in spam despite having completed the prescribed runway.

This is the operational reason the historical category of warmup services — the ones that swap automated synthetic mail between participating mailboxes at industrial volume — has degraded in effectiveness over the 2023 to 2026 window. Those services optimized the visible variable (send count) at the expense of the load-bearing variable (engagement quality). The correct optimization target during warmup is the engagement rate, not the engagement count. Operators who instrument warmup against per-message reply percentage rather than total replies-per-week produce reliably better outcomes at the seed-list placement test in week three (Chapter 13).

The ramp in plain numbers

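If you want the ramp in operator-friendly form: start at 5 sends per day per inbox. Increase by 5 every two to three days. By the end of week two you should be at 25-30/day per inbox. By the end of week three, 40-65/day. Hold there. Cold-send volume layers on top of warmup volume, not in place of it. These figures run ahead of the weekly table above; where the two disagree, the lower number is the conservative choice, because the classifier scores reply rate rather than send count.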
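A step ramp like this is easy to generate programmatically. The following is a rough sketch under stated assumptions: `ramp_schedule` and its parameters are hypothetical names, the step cadence is configurable because the text allows two or three days, and the cap of 65 is the upper end of the stated steady-state range.

```python
def ramp_schedule(days: int, start: int = 5, step: int = 5,
                  step_every: int = 3, cap: int = 65) -> list[int]:
    """Sends per inbox for day 1..days, raised by `step` every `step_every` days."""
    return [min(cap, start + step * (day // step_every)) for day in range(days)]


slow = ramp_schedule(21, step_every=3)
fast = ramp_schedule(21, step_every=2)

# Index 0 is day 1, index 13 is day 14, index 20 is day 21.
print("3-day steps:", slow[0], slow[13], slow[20])
print("2-day steps:", fast[0], fast[13], fast[20])
```

With a three-day step the sketch reaches 25 sends on day 14 and 35 on day 21; with a two-day step it reaches 35 and 55. Either schedule should be held back whenever the reply rate for the current step falls below target.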

Two non-negotiables: do not send a single cold email before day 14. Day 21 is safer. And keep warmup running forever, even after you start cold sending — the warmup mail is what holds the engagement floor that production cold volume would otherwise erode.

Volume-based vs engagement-based warmup services

The category historically described as "automated warmup" joins a new sending identity into a network of participating mailboxes, generates synthetic conversational mail between them at the prescribed volume curve, and produces — on the surface — the engagement profile required to train the receiver's classifier. The technical premise was sound in approximately 2018 through 2022. It is increasingly unsound in 2025 and 2026. The receiver-side detection patterns the category has not adapted to are coarse, observable, and increasingly enforced:

Geographic uniformity. A warmup network operates from a finite set of data center IP ranges. A sender whose engagement signal originates exclusively from cloud-provider netblocks — never from the residential, mobile, or corporate IPs that real B2B mail recipients send from — is a recognizable signature.
Near-instant response times. Humans formulate responses on a timescale of minutes to hours. Automated warmup networks respond in seconds. The latency distribution is, in aggregate, separable.
Reciprocal-pair patterns. Real conversational mail produces A-replies-to-B-replies-to-A patterns at low frequency. A warmup network with N participants produces it at order-N-squared frequency.
Regular sending cadence. Human-operated sending produces a circadian distribution with sharp drop-offs overnight in the sender's timezone. Automated warmup produces a flatter distribution that does not match the local-business-hours envelope of the claimed identity.
Conversation-graph isolation. Warmup pool participants exchange mail with each other and nobody else. The graph of who-mails-whom for a real human is broadly connected to the public mail graph.
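Two of the signatures above, reply latency and overnight cadence, can be audited from a warmup tool's own export before a receiver does it for you. This is a rough self-audit sketch, not a reconstruction of any receiver's classifier: the function names, the 60-second cutoff, and the 00:00 to 06:00 overnight window are all illustrative assumptions, and both functions assume non-empty input.

```python
from statistics import median


def latency_profile(reply_delays_s: list[float],
                    fast_cutoff_s: float = 60.0) -> dict:
    """Summarize reply delays (in seconds) seen on warmup replies."""
    fast = sum(1 for d in reply_delays_s if d < fast_cutoff_s)
    return {
        "median_s": median(reply_delays_s),
        "fast_share": fast / len(reply_delays_s),  # share answered under the cutoff
    }


def overnight_share(send_hours_local: list[int],
                    night_start: int = 0, night_end: int = 6) -> float:
    """Fraction of sends whose local hour falls in [night_start, night_end)."""
    night = sum(1 for h in send_hours_local if night_start <= h < night_end)
    return night / len(send_hours_local)


print(latency_profile([2, 3, 5, 4]))            # machine-like: all replies within seconds
print(latency_profile([600, 5400, 30, 7200]))   # closer to human timing
print(overnight_share([9, 10, 14, 2]))
```

A pool whose replies mostly arrive under the cutoff, or whose sends are spread evenly across the night, matches the automated pattern described above.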

The result, observable in seed-list testing on identities warmed exclusively through automated pools, is a measurable placement penalty relative to identities warmed through human-equivalent traffic. The penalty compounds with the unfavorable engagement signature of cold mail itself, producing a domain that lands in promotions when the operator was targeting primary, or in spam when targeting promotions.

What "real warmup mail" looks like

The structural attributes of warmup mail that resembles human conversation, in descending order of importance to the classifier:

Threaded replies. A message and its reply share a References and In-Reply-To header chain (Chapter 14) with consistent Re:-prefixed Subject. The classifier reads threading as strong positive signal because spammers historically do not produce threaded mail at scale.
Conversational text. Variable-length, natural-language messages with referents to prior message content, not template-substituted boilerplate. Templated mail is recognizable in content-hashing streams even when individual words vary.
Mixed engagement actions. A real recipient does not reply to every message — they archive some, star some, reply to some, delete some without reading. A signal where 40% are replied to, 30% archived after a delay, 20% starred, and 10% deleted is more convincing training data than uniform 100% replies.
Message-ID consistency, varied subject lines and bodies. A pool with five subject templates and three body templates produces a recognizable repetition signature in the classifier's content-hashing stream.
Plausible inter-message timing. Reply delays in the minutes-to-hours range. Sending cadence respecting local business hours of the sender's claimed location.
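The threading attribute at the top of this list comes down to three headers. A minimal sketch using Python's standard `email` package shows what a correctly threaded reply carries; the helper name `build_reply` is hypothetical, and the sketch assumes the original message already has a Message-ID.

```python
from email.message import EmailMessage
from email.utils import make_msgid


def build_reply(original: EmailMessage, body: str, sender: str) -> EmailMessage:
    """Build a reply that threads under `original` via In-Reply-To and References."""
    parent_id = str(original["Message-ID"])          # assumed present
    prior_refs = str(original.get("References", ""))

    subject = str(original["Subject"] or "")
    if not subject.lower().startswith("re:"):
        subject = "Re: " + subject                   # one Re: prefix, never stacked

    reply = EmailMessage()
    reply["From"] = sender
    reply["To"] = str(original["From"])
    reply["Subject"] = subject
    reply["Message-ID"] = make_msgid(domain=sender.split("@")[-1])
    reply["In-Reply-To"] = parent_id
    # References carries the whole chain: earlier ancestors first, parent last.
    reply["References"] = (prior_refs + " " + parent_id).strip()
    reply.set_content(body)
    return reply


first = EmailMessage()
first["From"] = "jamie@example.org"
first["To"] = "pat@example.com"
first["Subject"] = "Quarterly sync"
first["Message-ID"] = "<m1@example.org>"
first.set_content("Does Thursday work?")

answer = build_reply(first, "Thursday works.", "pat@example.com")
print(answer["Subject"], answer["In-Reply-To"], answer["References"])
```

Replying to `answer` in turn appends its Message-ID to References, so the chain grows by one ID per hop while the subject keeps a single `Re:` prefix.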

The per-provider warmup curve

The three receiver families that determine B2B deliverability — Gmail, Microsoft (Outlook/Office 365), and Yahoo (which also operates AOL) — apply observably different cold-start curves. The differences are undocumented in any provider's published guidance, but stable enough across observed cohorts to inform planning.

Gmail applies the slowest and most rigorous cold-start. A new identity sending to Gmail recipients is observably penalized on engagement signal for 21 to 28 days of consistent warmup, with the steepest part of the trust curve in days 14 through 21. Gmail also applies the most aggressive engagement-rate gating: a sending identity whose Gmail reply rate falls below approximately 8 to 10% during the maintenance phase produces a measurable reputation drop within seven to ten days. Postmaster Tools (Chapter 11) is the diagnostic stream that surfaces this.

Microsoft applies a faster but less forgiving cold-start. The trust ramp completes in approximately 14 to 21 days, but recovery from a spam-classification event is observably worse than Gmail's — a domain that lands in Microsoft junk during week two often does not recover at Microsoft for the subsequent quarter even after the engagement profile normalizes. SNDS enrollment surfaces the IP-reputation dimension; the domain-reputation dimension has no published feedback channel.

Yahoo applies the most volume-sensitive cold-start. The reputation system tolerates lower engagement rates at low volume than Gmail's, but produces the sharpest penalty for volume jumps — a sender stepping from 5 to 30 messages per day in a single transition at Yahoo produces, with reliability, a multi-week routing to bulk regardless of engagement rate.

Maintaining warmup during production

The single largest source of post-launch reputation drops, in our observation: the operator completes the three-week warmup, launches the cold sequence at full volume, and stops warmup entirely on day one of production. The observed result is a reputation drop measurable in Postmaster Tools within seven to fourteen days, an open-rate decline of 30 to 50% over the same window, and a domain that has functionally re-entered the cold-start state at every receiver.

The mechanism: cold mail produces, by definition, an engagement profile worse than warmup mail — lower reply rates, higher archival without read, occasional spam flagging. A production identity sending only cold mail is, in the classifier's view, a sender whose engagement signature has collapsed from the warmup baseline. The collapse trains the classifier toward bulk, regardless of whether the cold volume itself is within the prescribed envelope. The operational requirement: warmup volume continues at 10 to 15 sends per day per producing inbox, indefinitely, alongside production cold volume. The warmup mail preserves the engagement-rate floor the classifier requires to maintain trust.
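The floor effect is a weighted average. The sketch below is illustrative only: `blended_reply_rate` is a hypothetical helper, and the 30% warmup and 2% cold reply rates are assumed example inputs, not measured figures.

```python
def blended_reply_rate(warmup_sends: int, warmup_rate: float,
                       cold_sends: int, cold_rate: float) -> float:
    """Per-inbox reply rate across warmup and cold mail combined."""
    total = warmup_sends + cold_sends
    if total == 0:
        return 0.0
    return (warmup_sends * warmup_rate + cold_sends * cold_rate) / total


# Assumed inputs: 15 warmup sends at 30%, 20 cold sends at 2%.
with_floor = blended_reply_rate(15, 0.30, 20, 0.02)
cold_only = blended_reply_rate(0, 0.30, 20, 0.02)

print(f"with warmup floor: {with_floor:.1%}")
print(f"cold only:         {cold_only:.1%}")
```

Under these assumed inputs the blended rate stays around 14%, above the 8 to 10% maintenance figure cited for Gmail earlier, while the cold-only identity sits at 2%.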

Picking a warmup tool — quality of the pool matters

The mainstream warmup tools — Instantly, Smartlead, Mailreach, Warmup Inbox — all do the same thing in principle: enroll your inbox into a shared network, exchange synthetic conversational mail at the prescribed ramp, and report engagement metrics back. The difference is the quality of the pool you've been joined into.

A clean pool is mostly active human-operated mailboxes with real engagement patterns. A dirty pool is full of other senders' burner mailboxes, IPs that have been blacklisted elsewhere, and synthetic addresses that the major receivers have already tagged. Joining a dirty pool actively gets you blacklisted faster than skipping warmup entirely — the receiver sees consistent engagement with known-bad addresses and concludes you're part of the same network.

There is no public ranking of warmup pool quality, and tool reputations shift quarter to quarter. The diagnostic that works: run a seed-list inbox placement test (Chapter 13) at the end of warmup. If a tool's three-week warmup produces under 80% inbox placement on a clean seed list, the pool is dirty and you switch providers before launching production volume. Do not trust the warmup tool's own reported metrics — every tool reports green.

Domain rotation — a permanent operating discipline

Even well-maintained sending domains lose effectiveness over time. The structural reason: as a domain accumulates cold-send volume, the engagement signature gradually shifts toward bulk regardless of how disciplined the operator is. Rotation is not a recovery move from a damaged domain — it is a routine operating discipline.

The practical pattern: rotate your sending domains on a ~90-day cycle. Always have new domains warming in the background, ready to be promoted into the active sending pool. When a domain shows the early signs of reputation drift (declining Postmaster reputation, falling seed-list placement, climbing complaint rate), retire it before damage compounds and rotate the next warmed domain into its slot.
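Keeping replacements ready means starting their warmup well before the active domain retires. A small date calculation makes the lead time explicit; `rotation_plan` is a hypothetical helper, and the 21-day lead is an assumption taken from the safer warmup window recommended earlier in this chapter.

```python
from datetime import date, timedelta


def rotation_plan(active_since: date, cycle_days: int = 90,
                  warmup_days: int = 21) -> dict[str, date]:
    """Planned retirement date and latest start date for the replacement's warmup."""
    retire_on = active_since + timedelta(days=cycle_days)
    return {
        "retire_on": retire_on,
        "start_replacement_warmup": retire_on - timedelta(days=warmup_days),
    }


plan = rotation_plan(date(2025, 1, 1))
print(plan["start_replacement_warmup"], plan["retire_on"])
```

These dates are a planning default; the drift signals listed above (Postmaster reputation, seed-list placement, complaint rate) should pull a retirement forward when they appear earlier.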

One bad domain can pull others down with it if they share naming patterns. If tryacme.com drops to a "Bad" reputation, the classifier looks more skeptically at getacme.com and acmehq.io. Diversify naming patterns when you size the estate so the failure modes are independent.

The IP-warmup distinction

Two reputation dimensions exist independently: sending IP reputation and sending domain reputation. For senders on dedicated IP addresses — typically only the highest-volume operators on transactional infrastructure — both must be warmed independently, with IP warmup running on a separate curve that may extend to six or eight weeks. For senders on shared IPs through a sequencing platform — the architecture of essentially every B2B cold operator — IP reputation is the platform's problem and domain reputation is the sender's. Confusing these is a common debugging failure: an operator on a shared IP attributes a Gmail placement drop to "the platform's IP reputation" when the actual cause is their own domain's engagement signature collapsing under cold volume.

Authentication prerequisites before warmup

Warmup on a domain with broken authentication builds zero reputation. The classifier discards engagement signal from messages that fail SPF and DKIM alignment (Chapter 3), because such messages cannot be reliably attributed to the claimed sending identity. An operator who runs three weeks of warmup on a domain with a misconfigured DKIM selector (a depressingly common production failure) produces, at week three, a domain with the same reputation as on day one. The prerequisites that must be verified operational before warmup begins:

SPF record published under the 10-DNS-lookup limit (Chapter 1)
DKIM signing operational on at least one selector with a 2048-bit key, verified end-to-end (Chapter 2)
DMARC published at p=none with an rua destination, alignment verified through the first aggregate reports (Chapter 3)
MX records pointing at the operational mailbox provider
Root domain resolving to a content page — not a placeholder, not a 404, not a parking page
Custom tracking domain configured (a shared blocklisted CNAME defeats the warmup before it begins)

A pre-warmup diagnostic pass through these six items takes thirty minutes per domain and prevents approximately 60% of observed "warmup didn't work" failures.
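The first item on that list, the SPF lookup limit, can be partially checked offline. The sketch below counts the lookup-causing terms in a single SPF record string: the `include`, `a`, `mx`, `ptr`, and `exists` mechanisms plus the `redirect` modifier, which are the terms RFC 7208 counts toward the limit of 10. It is a first-pass check only: `count_spf_lookups` is a hypothetical helper, and it does not resolve DNS or recurse into included records, so the true total can be higher.

```python
LOOKUP_MECHANISMS = ("include", "a", "mx", "ptr", "exists")


def count_spf_lookups(record: str) -> int:
    """Count DNS-lookup-causing terms in ONE SPF record (no recursion into includes)."""
    count = 0
    for term in record.split()[1:]:             # skip the "v=spf1" version tag
        term = term.lstrip("+-~?").lower()      # drop the qualifier, if any
        if term.startswith("redirect="):
            count += 1
            continue
        # Strip a ":domain" argument and any "/cidr" suffix to get the mechanism name.
        name = term.split(":", 1)[0].split("/", 1)[0]
        if name in LOOKUP_MECHANISMS:
            count += 1
    return count


record = "v=spf1 include:_spf.google.com include:example.net a mx ip4:192.0.2.0/24 ~all"
print(count_spf_lookups(record))   # 4
```

A record that already shows 8 or 9 at this level is very likely over the limit once nested includes are resolved.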

Measuring warmup progress

The signals that distinguish a successful warmup from an unsuccessful one are not visible in the warmup tooling itself — every warmup service reports its own engagement metrics, and reports them as healthy. The reliable diagnostic signals are external:

Seed-list inbox placement testing, weekly. A standardized seed list covering Gmail, Outlook, Yahoo, Apple iCloud, and a representative enterprise Exchange tenant, run at the end of each warmup week. Week-one placement of 40 to 60% is normal. Week-two should reach 60 to 80%. Week-three should reach 80 to 95% before the operator begins cold sending. Below 80% at week three is grounds to extend warmup by an additional week, not grounds to launch anyway. (Chapter 13)
Postmaster Tools reputation movement. The Gmail Postmaster domain-reputation dashboard requires a minimum daily volume to produce signal, typically around 100 messages per day across the domain. Once reached, the score should move from "Unknown" to "Low" by week one, and to "Medium" or "High" by week three. Persistent "Unknown" indicates the volume floor is not being reached; persistent "Low" indicates the engagement signature is failing. (Chapter 11)
Warmup-pool reply rates by week. Reported reply rate should be 30% or higher in week one and decline gradually as volume increases. A service reporting flat 60% reply rates throughout warmup is producing synthetic signal of a kind the receivers detect.
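The weekly placement thresholds above translate into a simple gate. In this sketch the floors are the lower bounds of the stated ranges; the function name and the returned labels ("on track", "investigate") are hypothetical, and only the week-three rule (launch at 80% or above, otherwise extend by a week) comes directly from the text.

```python
# Lower bounds of the weekly inbox-placement ranges given above.
WEEKLY_FLOOR = {1: 0.40, 2: 0.60, 3: 0.80}


def placement_decision(week: int, inbox_placement: float) -> str:
    """Map a weekly seed-list placement result (0.0 to 1.0) to a next action."""
    floor = WEEKLY_FLOOR[min(week, 3)]          # week 4+ reuses the week-three floor
    if week >= 3:
        return "launch" if inbox_placement >= floor else "extend warmup one week"
    return "on track" if inbox_placement >= floor else "investigate"


print(placement_decision(1, 0.45))
print(placement_decision(3, 0.78))
print(placement_decision(3, 0.85))
```

Feed it the external seed-list result, not the warmup tool's own dashboard number.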

Common operator failures observed in production

Skipping warmup on "new but legitimate" sending domains. The classifier does not distinguish "new because we just registered it" from "new because we're a spammer." Both signatures look identical at week one.
Warming 50 or more inboxes on a single sending domain. The classifier reads aggregate per-domain volume. A domain producing 1,500 sends per day at the end of warmup — regardless of how it is distributed across mailboxes — is, by signature, a bulk sender.
Starting cold sends before reputation has stabilized. The per-inbox jump from 5 warmup sends per day to 30 cold sends per day in a single transition is visible in the classifier's volume-stability signal.
Stopping warmup at launch. The single largest source of post-launch reputation drops. Warmup is not a phase. It is a continuous engagement floor that must persist for the operating life of the sending identity.
Warming on a domain with broken DKIM. Three weeks of operating cost, zero reputation accumulated. A pre-warmup authentication verification pass is non-optional.
Warming on the corporate root domain. The corporate root protects transactional and one-to-one mail. Cold sending on it produces a multi-quarter reputation incident on payroll, invoice, and password-reset flows. The sending tier is a separate organizational domain (Chapter 7).
Trusting the warmup service's reported metrics as the success criterion. Every warmup service reports green at every stage. The success criterion lives in external seed-list testing and provider-side reputation dashboards.

Pre-warmup checklist

SPF, DKIM, DMARC verified operational and aligned (Chapters 1–3)
MX records resolving to the operational mailbox provider
Root domain resolving to a content page, not a placeholder
Custom tracking subdomain configured with its own CNAME verified
No more than five inboxes per sending domain (three is the default from the sizing rule)
Seed-list placement service onboarded, baseline measurement taken on day zero
Postmaster Tools enrolled for the sending domain (Chapter 11)
Warmup volume curve scheduled to maintain 10 to 15 sends per day per inbox beyond launch
Documented criterion for extending warmup if week-three placement falls below 80%

Where warmup fits in the broader infrastructure

Warmup is the operational discipline that converts a technically correct sending estate (authenticated, isolated, monitored) into a sending identity the receivers' classifiers trust enough to route to primary. The authentication layer (Chapters 1–6) establishes that the sender is who they claim to be. The architecture layer (Chapters 7–8) establishes that the sending identity is appropriately isolated from corporate reputation. Warmup establishes that the sending identity, having been correctly identified and isolated, is also a sender whose mail recipients want to receive. The order is not interchangeable: a warmup runway on a domain that has not completed the authentication and architecture layers produces no durable reputation, because the classifier discards signal from unauthenticated mail.

The correct posture: warmup is the last preparatory step before production cold volume, executed only after authentication is verified, only on a correctly isolated sending tier, only with engagement-rate instrumentation in place, and maintained at a floor volume indefinitely thereafter. The operator who treats warmup as a one-time three-week event produces a sending estate that performs well at the seed-list test on day twenty-one and degrades through the subsequent quarter. The operator who treats it as a continuous discipline produces a sending estate that performs durably.

Related chapters

How Domain Age Affects Cold Email Deliverability — the starting reputation prior that warmup ramps from.
Postmaster Tools and SNDS — The Reputation Dashboards — the receiver-side visibility you need during warmup.
Seed List Inbox-Placement Testing — how to verify the warmup actually landed.
Gmail and Yahoo Bulk Sender Requirements — the volume threshold post-warmup that triggers compliance rules.
