Inbox warmup — engagement curves and the synthetic-warmup detection problem.
Warmup is not a volume problem. It is the deliberate construction of positive engagement signal on a sending identity that every major receiver's reputation classifier currently treats as untrusted. Operators who optimize the wrong variable — daily send count rather than per-message engagement rate — produce the exact training signal that classifies their domain as bulk for the subsequent quarter.
The premise — receivers classify on engagement, not volume
A receiving mail server, on accepting an inbound message from a previously unknown sending identity, does not produce a deliverability verdict from message content alone. It produces a verdict from a per-sender reputation score, computed continuously from the engagement signal accumulated against every prior message from that identity. The signal categories ingested by the major mailbox providers are documented, in coarse form, in the receiver-side bulk sender guidance published in 2023 and revised in February 2024: opens, replies, archival without read, archival after read, deletion without read, manual movement out of spam, manual movement into spam, and explicit spam-flag complaints.
The observation that determines warmup methodology: the classifier does not score raw send volume against a threshold. It scores the ratio of positive to negative engagement, weighted by signal strength. A reply is approximately one to two orders of magnitude stronger positive signal than an open. A manual movement out of spam is the strongest positive signal a single message can produce. A spam-flag complaint at any meaningful rate is operationally terminal in single-digit days. Warmup is the deliberate construction of positive engagement signal on a new sending identity that trains the classifier toward "this sender's mail is wanted" before the sender begins producing the cold-volume profile that would train the opposite.
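The weighting idea can be made concrete with a toy score. This is a minimal sketch, not any provider's actual model: the numeric weights in `SIGNAL_WEIGHTS` are hypothetical, and the choice to treat archival after read as mildly positive is an assumption. Only the relative ordering (a reply far stronger than an open, a rescue from spam strongest, a complaint heavily negative) comes from the text above.

```python
from collections import Counter

# Hypothetical weights; only their relative ordering reflects the text above.
SIGNAL_WEIGHTS = {
    "open": 1.0,
    "reply": 30.0,               # one to two orders of magnitude above an open
    "archive_after_read": 0.5,   # assumed mildly positive
    "moved_out_of_spam": 100.0,  # strongest single positive signal
    "archive_unread": -0.5,
    "delete_unread": -1.0,
    "moved_to_spam": -50.0,
    "spam_complaint": -100.0,
}


def engagement_ratio(events):
    """Return weighted positive signal divided by weighted negative signal."""
    counts = Counter(events)
    positive = sum(SIGNAL_WEIGHTS[e] * n for e, n in counts.items()
                   if SIGNAL_WEIGHTS[e] > 0)
    negative = sum(-SIGNAL_WEIGHTS[e] * n for e, n in counts.items()
                   if SIGNAL_WEIGHTS[e] < 0)
    if negative == 0:
        # No negative signal at all: unbounded if anything positive exists.
        return float("inf") if positive > 0 else 0.0
    return positive / negative
```

Under these assumed weights, a single spam complaint outweighs several replies, which is the asymmetry the rest of this chapter plans around.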
The cold-start asymmetry
A freshly-provisioned sending identity is treated by every major mailbox provider's classifier as untrusted by default. This is not a content judgment. It is a structural prior: in the absence of observed engagement history, the classifier weights early-volume signal disproportionately and applies an exponentially-decaying skepticism that does not fully discharge for between three and six weeks of consistent positive engagement.
Sending 30 cold messages on day one from a new mailbox produces, with high reliability, a classification of "high-risk bulk sender" that persists for between two and six weeks beyond the date the operator stops sending. This is the failure mode of the operator who provisions a sending estate on Friday afternoon and launches a sequence on Monday morning. The mailboxes burn before the open rate stabilizes. Positive signal accumulates approximately logarithmically against the skepticism prior; negative signal, especially the spam-flag complaint, accumulates linearly and with a destructive multiplier. A single week of correct warmup produces meaningful positive accumulation. A single morning of incorrect cold launch on an unwarmed domain produces several weeks of remediation, often without recovery.
The ramp curve
The empirically derived warmup curve, observed across thousands of new sending identities in production, is approximately the following:
| Week | Sends/day per inbox | Target reply rate | What the curve is training |
|---|---|---|---|
| 1 | 1–5 | 30% or higher | Initial positive-engagement accumulation against the cold-start skepticism prior |
| 2 | 5–15 | 25% or higher | Volume ramp under preserved engagement ratio; classifier transitions from untrusted to neutral |
| 3 | 15–30 | 20% or higher | Volume floor near production rate; classifier transitions from neutral to trusted-low-volume |
| 4+ | 30 steady, warmup continues at 10–15 | 10% or higher on warmup mail | Maintenance — production cold volume layered over preserved warmup-engagement floor |
The shape is not arbitrary. It is the approximate inverse of the receiver-side skepticism decay: positive engagement accumulates logarithmically, so early-week volume must remain low enough that every individual message produces measurable positive signal rather than incremental volume against a flat engagement floor. Doubling volume on a week where reply rate would halve produces, in classifier terms, identical engagement counts but worse training data — 15% reply rate at 10 sends/day is a worse signature than 30% at 5 sends/day. Operators who compress the curve to two weeks, or who start at week-three volume on day one, produce the predictable failure: a domain that lands in spam for the subsequent month under any volume.
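One way to instrument the curve is to encode the table as data and check each inbox-day against it. The sketch below copies the volume ceilings and reply-rate targets from the table; the names `RampWeek`, `RAMP`, and `check_day` are illustrative, and the week 4+ row is assumed to describe the warmup portion of traffic only, with production cold volume tracked separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RampWeek:
    min_sends: int
    max_sends: int
    min_reply_rate: float  # target floor, as a fraction


# Values copied from the table above. Week 4 stands in for "4+" and
# covers warmup mail only (assumption: cold volume is tracked elsewhere).
RAMP = {
    1: RampWeek(1, 5, 0.30),
    2: RampWeek(5, 15, 0.25),
    3: RampWeek(15, 30, 0.20),
    4: RampWeek(10, 15, 0.10),
}


def check_day(week, sends, replies):
    """Return a list of problems for one inbox-day; empty means on curve."""
    target = RAMP[min(week, 4)]
    problems = []
    if sends > target.max_sends:
        problems.append(
            f"volume {sends} exceeds week-{week} ceiling {target.max_sends}")
    rate = replies / sends if sends else 0.0
    if rate < target.min_reply_rate:
        problems.append(
            f"reply rate {rate:.0%} below target {target.min_reply_rate:.0%}")
    return problems
```

At one to five sends per day a single day's reply rate is noisy, so in practice the same check is probably more useful over a rolling week of sends and replies than over one day.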
The engagement-target inversion
Most operator mental models of warmup are volume-indexed: how many sends per day, how many inboxes, how long until full volume. The receiver-side classifier is engagement-indexed. A warmup pool that produces 30% reply rates during week one trains the classifier as a wanted sender; the same volume at 5% reply rate trains the opposite, and produces a sending identity that lands in spam despite having completed the prescribed runway.
This is the operational reason the historical category of warmup services — the ones that swap automated synthetic mail between participating mailboxes at industrial volume — has degraded in effectiveness over the 2023 to 2026 window. Those services optimized the visible variable (send count) at the expense of the load-bearing variable (engagement quality). The correct optimization target during warmup is the engagement rate, not the engagement count. Operators who instrument warmup against per-message reply percentage rather than total replies-per-week produce reliably better outcomes at the seed-list placement test in week three (Chapter 13).
Volume-based vs engagement-based warmup services
The category historically described as "automated warmup" joins a new sending identity into a network of participating mailboxes, generates synthetic conversational mail between them at the prescribed volume curve, and produces — on the surface — the engagement profile required to train the receiver's classifier. The technical premise was sound in approximately 2018 through 2022. It is increasingly unsound in 2025 and 2026. The receiver-side detection patterns the category has not adapted to are coarse, observable, and increasingly enforced:
- Geographic uniformity. A warmup network operates from a finite set of data center IP ranges. A sender whose engagement signal originates exclusively from cloud-provider netblocks, and never from the residential, mobile, or corporate IPs that real B2B recipients read and reply from, is a recognizable signature.
- Near-instant response times. Humans formulate responses on a timescale of minutes to hours. Automated warmup networks respond within seconds. The latency distribution is, in aggregate, separable.
- Reciprocal-pair patterns. Real conversational mail produces A-replies-to-B-replies-to-A patterns at low frequency. A warmup network with N participants produces it at order-N-squared frequency.
- Regular sending cadence. Human-operated sending produces a circadian distribution with sharp drop-offs overnight in the sender's timezone. Automated warmup produces a flatter distribution that does not match the local-business-hours envelope of the claimed identity.
- Conversation-graph isolation. Warmup pool participants exchange mail with each other and nobody else. The graph of who-mails-whom for a real human is broadly connected to the public mail graph.
The result, observable in seed-list testing on identities warmed exclusively through automated pools, is a measurable placement penalty relative to identities warmed through human-equivalent traffic. The penalty compounds with the unfavorable engagement signature of cold mail itself, producing a domain that lands in promotions when the operator was targeting primary, or in spam when targeting promotions.
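Two of the signatures listed above, reply latency and reciprocal pairs, are simple enough to measure on an operator's own warmup traffic before a receiver does. The sketch below is an illustration of the idea, not a reconstruction of any receiver's detector; the function names and the 60-second threshold are assumptions.

```python
def fast_reply_share(latencies_seconds, threshold_seconds=60.0):
    """Fraction of replies that arrived faster than the threshold."""
    if not latencies_seconds:
        return 0.0
    fast = sum(1 for s in latencies_seconds if s < threshold_seconds)
    return fast / len(latencies_seconds)


def reciprocal_pair_share(edges):
    """Fraction of distinct sender->recipient edges whose reverse also exists.

    `edges` is an iterable of (sender, recipient) address pairs.
    """
    edge_set = set(edges)
    if not edge_set:
        return 0.0
    mutual = sum(1 for a, b in edge_set if (b, a) in edge_set)
    return mutual / len(edge_set)
```

A pool where most replies land under the threshold, or where nearly every edge is reciprocated, is showing the aggregate separability described above.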
What "real warmup mail" looks like
The structural attributes of warmup mail that resembles human conversation, in descending order of importance to the classifier:
- Threaded replies. A message and its reply share a `References` and `In-Reply-To` header chain (Chapter 14) with a consistent `Re:`-prefixed `Subject`. The classifier reads threading as strong positive signal because spammers historically do not produce threaded mail at scale.
- Conversational text. Variable-length, natural-language messages with referents to prior message content, not template-substituted boilerplate. Templated mail is recognizable in content-hashing streams even when individual words vary.
- Mixed engagement actions. A real recipient does not reply to every message: they archive some, star some, reply to some, and delete some without reading. A pool in which 40% of messages are replied to, 30% archived after a delay, 20% starred, and 10% deleted is more convincing training data than uniform 100% replies.
- Message-ID consistency, varied subject lines and bodies. A pool with five subject templates and three body templates produces a recognizable repetition signature in the classifier's content-hashing stream.
- Plausible inter-message timing. Reply delays in the minutes-to-hours range. Sending cadence respecting local business hours of the sender's claimed location.
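Threading, the first attribute in the list above, is mechanical enough to show directly. The sketch below uses Python's standard `email` package to build a reply whose `In-Reply-To` and `References` headers chain to the parent message; `build_reply` is an illustrative helper, and it assumes the original message carries a `Message-ID` header.

```python
from email.message import EmailMessage
from email.utils import make_msgid, parseaddr


def build_reply(original, body, sender):
    """Build a reply that threads under `original`."""
    reply = EmailMessage()
    reply["From"] = sender
    reply["To"] = str(original["From"])

    # Keep a single "Re:" prefix however deep the thread goes.
    subject = str(original["Subject"] or "")
    if not subject.lower().startswith("re:"):
        subject = f"Re: {subject}"
    reply["Subject"] = subject

    # Fresh Message-ID under the sender's own domain.
    domain = parseaddr(sender)[1].rpartition("@")[2]
    reply["Message-ID"] = make_msgid(domain=domain)

    parent_id = str(original["Message-ID"])
    reply["In-Reply-To"] = parent_id
    # References = the parent's References chain, then the parent's own ID.
    prior = str(original.get("References", ""))
    reply["References"] = f"{prior} {parent_id}".strip()

    reply.set_content(body)
    return reply
```

Calling `build_reply` on a reply extends the `References` chain by one identifier each time, which is the shape mail clients use to group a conversation.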
The per-provider warmup curve
The three receiver families that determine B2B deliverability — Gmail, Microsoft (Outlook/Office 365), and Yahoo (which also operates AOL) — apply observably different cold-start curves. The differences are undocumented in any provider's published guidance, but stable enough across observed cohorts to inform planning.
Gmail applies the slowest and most rigorous cold-start. A new identity sending to Gmail recipients is observably penalized on engagement signal for 21 to 28 days of consistent warmup, with the steepest part of the trust curve in days 14 through 21. Gmail also applies the most aggressive engagement-rate gating: a sending identity whose Gmail reply rate falls below approximately 8 to 10% during the maintenance phase produces a measurable reputation drop within seven to ten days. Postmaster Tools (Chapter 11) is the diagnostic stream that surfaces this.
Microsoft applies a faster but less forgiving cold-start. The trust ramp completes in approximately 14 to 21 days, but recovery from a spam-classification event is observably worse than Gmail's — a domain that lands in Microsoft junk during week two often does not recover at Microsoft for the subsequent quarter even after the engagement profile normalizes. SNDS enrollment surfaces the IP-reputation dimension; the domain-reputation dimension has no published feedback channel.
Yahoo applies the most volume-sensitive cold-start. The reputation system tolerates lower engagement rates at low volume than Gmail's, but produces the sharpest penalty for volume jumps — a sender stepping from 5 to 30 messages per day in a single transition at Yahoo produces, with reliability, a multi-week routing to bulk regardless of engagement rate.
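Because a single large step is the pattern described as most costly at Yahoo, a planned schedule can be scanned for abrupt jumps before it runs. This is a minimal sketch: `volume_jumps` is an illustrative name and the default `max_ratio` of 1.5 is a hypothetical threshold, not a figure from any provider. The only data point taken from the text is that a 5-to-30 step in one transition is penalized.

```python
def volume_jumps(daily_sends, max_ratio=1.5):
    """Return (day_index, previous, current) for each step above max_ratio.

    `daily_sends` is the planned per-inbox send count for consecutive days.
    """
    jumps = []
    for day in range(1, len(daily_sends)):
        prev, cur = daily_sends[day - 1], daily_sends[day]
        if prev > 0 and cur / prev > max_ratio:
            jumps.append((day, prev, cur))
    return jumps
```

A schedule that moves from 5 to 30 in one day is flagged, while one that climbs through intermediate values in small steps passes under the assumed threshold.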
Maintaining warmup during production
The single largest source of post-launch reputation drops, in our observation: the operator completes the three-week warmup, launches the cold sequence at full volume, and stops warmup entirely on day one of production. The observed result is a reputation drop measurable in Postmaster Tools within seven to fourteen days, an open-rate decline of 30 to 50% over the same window, and a domain that has functionally re-entered the cold-start state at every receiver.
The mechanism: cold mail produces, by definition, an engagement profile worse than warmup mail — lower reply rates, higher archival without read, occasional spam flagging. A production identity sending only cold mail is, in the classifier's view, a sender whose engagement signature has collapsed from the warmup baseline. The collapse trains the classifier toward bulk, regardless of whether the cold volume itself is within the prescribed envelope. The operational requirement: warmup volume continues at 10 to 15 sends per day per producing inbox, indefinitely, alongside production cold volume. The warmup mail preserves the engagement-rate floor the classifier requires to maintain trust.
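The effect of the maintenance floor can be approximated with simple arithmetic. The sketch below treats the classifier's view as a volume-weighted blend of warmup and cold reply rates, which is a simplifying assumption; `blended_reply_rate` is an illustrative helper, and the 2% cold reply rate and 25% warmup reply rate used in the example are hypothetical inputs.

```python
def blended_reply_rate(warmup_sends, warmup_rate, cold_sends, cold_rate):
    """Volume-weighted reply rate across warmup and cold mail for one inbox."""
    total = warmup_sends + cold_sends
    if total == 0:
        return 0.0
    return (warmup_sends * warmup_rate + cold_sends * cold_rate) / total


# Hypothetical inbox: 30 cold sends/day at 2%, 12 warmup sends/day at 25%.
with_floor = blended_reply_rate(12, 0.25, 30, 0.02)
without_floor = blended_reply_rate(0, 0.25, 30, 0.02)
print(f"with warmup floor: {with_floor:.1%}, without: {without_floor:.1%}")
```

With these assumed inputs the blended rate is about 8.6% with the floor in place and 2% without it, which illustrates why stopping warmup at launch reads to the classifier as a collapse.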
The IP-warmup distinction
Two reputation dimensions exist independently: sending IP reputation and sending domain reputation. For senders on dedicated IP addresses — typically only the highest-volume operators on transactional infrastructure — both must be warmed independently, with IP warmup running on a separate curve that may extend to six or eight weeks. For senders on shared IPs through a sequencing platform — the architecture of essentially every B2B cold operator — IP reputation is the platform's problem and domain reputation is the sender's. Confusing these is a common debugging failure: an operator on a shared IP attributes a Gmail placement drop to "the platform's IP reputation" when the actual cause is their own domain's engagement signature collapsing under cold volume.
Authentication prerequisites before warmup
Warmup on a domain with broken authentication builds no reputation. The classifier discards engagement signal from messages that fail SPF and DKIM alignment (Chapter 3), because such messages cannot be reliably attributed to the claimed sending identity. An operator who runs three weeks of warmup on a domain with a misconfigured DKIM selector (a depressingly common production failure) produces, at week three, a domain with the same reputation as on day one. The prerequisites that must be verified operational before warmup begins:
- SPF record published under the 10-DNS-lookup limit (Chapter 1)
- DKIM signing operational on at least one selector with a 2048-bit key, verified end-to-end (Chapter 2)
- DMARC published at `p=none` with an `rua` destination, alignment verified through the first aggregate reports (Chapter 3)
- MX records pointing at the operational mailbox provider
- Root domain resolving to a content page — not a placeholder, not a 404, not a parking page
- Custom tracking domain configured (a shared blocklisted CNAME defeats the warmup before it begins)
A pre-warmup diagnostic pass through these six items takes thirty minutes per domain and prevents approximately 60% of observed "warmup didn't work" failures.
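Part of that diagnostic pass can be scripted against the record text itself. The sketch below counts the terms in a single SPF record that trigger DNS lookups (per RFC 7208 section 4.6.4: the `include`, `a`, `mx`, `ptr`, and `exists` mechanisms and the `redirect` modifier) and parses a DMARC record into its tags. It works on strings the operator supplies and does not query DNS or follow nested `include` records, so the SPF count is a lower bound; the function names and example records are illustrative.

```python
LOOKUP_TERMS = {"include", "a", "mx", "ptr", "exists", "redirect"}


def spf_lookup_terms(record):
    """Count terms in one SPF record that cause a DNS lookup."""
    count = 0
    for term in record.split()[1:]:          # skip the "v=spf1" version tag
        term = term.lstrip("+-~?").lower()   # drop the qualifier, if any
        # Strip the ":domain", "/cidr", or "=target" suffix to get the name.
        name = term.split(":", 1)[0].split("/", 1)[0].split("=", 1)[0]
        if name in LOOKUP_TERMS:
            count += 1
    return count


def parse_dmarc(record):
    """Split a DMARC record into a {tag: value} dict."""
    tags = {}
    for part in record.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip().lower()] = value.strip()
    return tags


spf = "v=spf1 include:_spf.example.net mx a:mail.example.com ip4:192.0.2.0/24 ~all"
dmarc = parse_dmarc("v=DMARC1; p=none; rua=mailto:dmarc@example.com")
print(spf_lookup_terms(spf), dmarc.get("p"), "rua" in dmarc)
```

A complete check would also resolve each `include` and `redirect` target and add their lookups to the total, since the limit of 10 applies across the whole evaluation.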
Measuring warmup progress
The signals that distinguish a successful warmup from an unsuccessful one are not visible in the warmup tooling itself — every warmup service reports its own engagement metrics, and reports them as healthy. The reliable diagnostic signals are external:
- Seed-list inbox placement testing, weekly. A standardized seed list covering Gmail, Outlook, Yahoo, Apple iCloud, and a representative enterprise Exchange tenant, run at the end of each warmup week. Week-one placement of 40 to 60% is normal. Week-two should reach 60 to 80%. Week-three should reach 80 to 95% before the operator begins cold sending. Below 80% at week three is grounds to extend warmup by an additional week, not grounds to launch anyway. (Chapter 13)
- Postmaster Tools reputation movement. The Gmail Postmaster domain-reputation dashboard requires a minimum daily volume to produce signal, typically around 100 messages per day across the domain. Once reached, the score should move from "Unknown" to "Low" by week one, and to "Medium" or "High" by week three. Persistent "Unknown" indicates the volume floor is not being reached; persistent "Low" indicates the engagement signature is failing. (Chapter 11)
- Warmup-pool reply rates by week. Reported reply rate should be 30% or higher in week one and decline gradually as volume increases. A service reporting flat 60% reply rates throughout warmup is producing synthetic signal of a kind the receivers detect.
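The seed-list criterion above can be written as a small gate. In this sketch the weekly floors are the lower ends of the placement ranges quoted in the list, and the 80% week-three rule comes directly from the text. The function names, the status labels for weeks one and two, and the choice to count only the folder named `"inbox"` as placed are assumptions.

```python
# Lower ends of the weekly placement ranges quoted above.
WEEKLY_FLOOR = {1: 0.40, 2: 0.60, 3: 0.80}


def placement_rate(results):
    """Share of seed addresses that landed in the inbox.

    `results` maps each seed address to the folder it landed in.
    """
    if not results:
        return 0.0
    landed = sum(1 for folder in results.values() if folder == "inbox")
    return landed / len(results)


def next_step(week, rate):
    """Return a status for the end of a warmup week."""
    below = rate < WEEKLY_FLOOR[min(week, 3)]
    if week >= 3:
        # Below 80% at week three: extend warmup, do not launch.
        return "extend" if below else "launch"
    return "behind" if below else "on_track"
```

For example, a week-three seed test with three of four addresses in the inbox gives a rate of 0.75, and `next_step(3, 0.75)` returns `"extend"`.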
Common operator failures observed in production
- Skipping warmup on "new but legitimate" sending domains. The classifier does not distinguish "new because we just registered it" from "new because we're a spammer." Both signatures look identical at week one.
- Warming 50 or more inboxes on a single sending domain. The classifier reads aggregate per-domain volume. A domain producing 1,500 sends per day at the end of warmup — regardless of how it is distributed across mailboxes — is, by signature, a bulk sender.
- Starting cold sends before reputation has stabilized. The per-inbox jump from 5 warmup sends per day to 30 cold sends per day in a single transition is visible in the classifier's volume-stability signal.
- Stopping warmup at launch. The single largest source of post-launch reputation drops. Warmup is not a phase. It is a continuous engagement floor that must persist for the operating life of the sending identity.
- Warming on a domain with broken DKIM. Three weeks of operating cost, zero reputation accumulated. A pre-warmup authentication verification pass is non-optional.
- Warming on the corporate root domain. The corporate root protects transactional and one-to-one mail. Cold sending on it produces a multi-quarter reputation incident on payroll, invoice, and password-reset flows. The sending tier belongs on a separate organizational domain (Chapter 7).
- Trusting the warmup service's reported metrics as the success criterion. Every warmup service reports green at every stage. The success criterion lives in external seed-list testing and provider-side reputation dashboards.
Pre-warmup checklist
- SPF, DKIM, DMARC verified operational and aligned (Chapters 1–3)
- MX records resolving to the operational mailbox provider
- Root domain resolving to a content page, not a placeholder
- Custom tracking subdomain configured with its own CNAME verified
- No more than five inboxes per sending domain
- Seed-list placement service onboarded, baseline measurement taken on day zero
- Postmaster Tools enrolled for the sending domain (Chapter 11)
- Warmup volume curve scheduled to maintain 10 to 15 sends per day per inbox beyond launch
- Documented criterion for extending warmup if week-three placement falls below 80%
Where warmup fits in the broader infrastructure
Warmup is the operational discipline that converts a technically correct sending estate — authenticated, isolated, monitored — into a sending identity the receivers' classifiers trust enough to route to primary. The authentication layer (Chapters 1–6) establishes that the sender is who they claim to be. The architecture layer (Chapters 7–8) establishes that the sending identity is appropriately isolated from corporate reputation. Warmup establishes that the sending identity, having been correctly identified and isolated, is also a sender whose mail recipients want to receive. The reverse is also true: a warmup runway on a domain that has not completed the authentication and architecture layers produces no durable reputation, because the classifier discards signal from unauthenticated mail.
The correct posture: warmup is the last preparatory step before production cold volume, executed only after authentication is verified, only on a correctly isolated sending tier, only with engagement-rate instrumentation in place, and maintained at a floor volume indefinitely thereafter. The operator who treats warmup as a one-time three-week event produces a sending estate that performs well at the seed-list test on day twenty-one and degrades through the subsequent quarter. The operator who treats it as a continuous discipline produces a sending estate that performs durably.
Allston Labs operates the full sending estate as a service.
We provision domains, configure the entire authentication record set, run warmup, and monitor reputation across providers. The stack lives under your entity. The engineer on call lives in your Slack.