Reply detection — threading, Message-IDs, and the false-bounce problem.

Reply detection is the operational layer where every upstream authentication, isolation, and warmup decision converts to a measurable outcome. It is also the layer most sequencing platforms implement at the eighty-percent-correct tier and ship anyway, because the failure modes are invisible to anyone not auditing the inbox by hand.

TL;DR

Reply detection is the layer where every upstream authentication and warmup decision converts to measurable pipeline. It is also the layer most sequencing platforms ship at the 80-percent-correct tier because the failure modes are invisible to anyone not auditing the inbox by hand.
The protocol-correct path: thread by RFC 5322 headers — Message-ID on the original, In-Reply-To and References on the reply. Generate UUID-grade Message-IDs, retain the outbound table for at least 180 days, attribute replies by Message-ID lookup not sender-address match.
Autoresponder detection: inspect Auto-Submitted: auto-replied per RFC 3834 plus the legacy Precedence: bulk. Body-content heuristics ("out of office," "currently away") catch the residual autoresponders that emit neither header.
Forwarded threads are a false-positive minefield: a champion forwards your pitch to procurement, the colleague replies, the reply's From: doesn't match the original recipient. Sender-address-match classifiers miss this and keep touching the original recipient after their colleague has replied on their behalf.
The reply-misclassified-as-bounce trap: a legitimate human reply containing the words "returned" or "undeliverable" — referring to a separate, unrelated issue — gets pattern-matched as a bounce and silently suppressed. Always gate body-content heuristics on at least one structural signal (DSN From:, message/delivery-status MIME part, Auto-Submitted).

Plain-English overview

Reply detection is the chapter where the entire upstream infrastructure investment realizes its return — or fails to. A sequencing estate with perfect authentication, isolated subdomains, a clean warmup curve, and seed-list-verified primary placement still produces zero measurable conversion if the reply pipeline misattributes its replies. Threading is mechanical when it works: the receiver's mail client populates In-Reply-To and References with the original Message-ID, the sequencing platform looks up the ID in its outbound table, the next-touch send job is canceled. The 80-percent-correct tier of platforms gets that case right and misses everything else — replies from mobile clients that strip the Re: prefix, replies forwarded from a colleague's mailbox, autoresponders that don't conform to RFC 3834, DSNs from corporate gateways that don't conform to RFC 3461. The chapter below covers the header chain, the auto-responder filtering, and the two specific misclassification failures (bounce-as-reply and reply-as-bounce) that silently destroy pipeline measurement. The header chain itself — Message-ID, In-Reply-To, References — is the closed-loop instrumentation that makes the entire sending estate measurable.

The premise

A sequencing platform's central operational requirement, after delivery, is to identify which inbound messages are replies and stop the next touch when one arrives. The detection is not trivial. A single mailbox receives, on the same port, replies from prospects, delivery-status notifications, vacation auto-responders, list-server notifications, colleague forwards, and the occasional human note that references a sequence in passing. All arrive as RFC 5322 messages. The platform classifies them in milliseconds against a taxonomy of at least eight distinct message classes.

Misclassification cuts in both directions and each direction is expensive. A reply misclassified as a non-reply produces over-sending: the platform keeps touching a contact who has already replied, generating relationship damage, complaints, and — under CAN-SPAM and its European equivalents — regulatory exposure once a recipient has functionally requested cessation. A bounce or auto-responder misclassified as a reply produces under-sending: the sequence stops on a contact who never received the message, and the reply-rate dashboard inflates with phantom conversions that never close. Across the production platforms we have audited, naive classification runs at an 8 to 15 percent misclassification rate. The false-positive cost — pipeline lost to sequences stopped on phantom replies — is consistently the larger error and consistently undetected until a quarterly close-rate review surfaces the gap.

The header chain that defines a thread

Threading in RFC 5322 (and its predecessor RFC 2822) is defined by three header fields acting together: Message-ID on the original outbound, In-Reply-To and References on the reply. A conforming client populates In-Reply-To with the original Message-ID and appends it to the References chain on the user's reply action. The thread is, mechanically, the transitive closure of References values across all messages sharing at least one entry. A correctly threaded reply, observed by the platform's IMAP or API consumer, looks approximately like this:

Message-ID: <a4f7b821-3c9d-4e1a-8f02-9b7c3d2e1f04@reply.prospect.com>
In-Reply-To: <seq-7a3f-touch-2@mail.sender-domain.com>
References: <seq-7a3f-touch-1@mail.sender-domain.com>
            <seq-7a3f-touch-2@mail.sender-domain.com>
Subject: Re: Quick question about your procurement workflow
From: sarah.chen@prospect.com
To: ae@mail.sender-domain.com

That message is unambiguously the second-touch reply on sequence 7a3f. The platform reads In-Reply-To, looks up the Message-ID in its outbound table, and stops the sequence. The next-touch send job, scheduled two business days out, is canceled before it runs.
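The lookup just described can be sketched with Python's standard-library email parser. This is a minimal illustration, not a production consumer: the `OUTBOUND` dictionary, its value layout, and the `find_outbound` name are hypothetical stand-ins for the platform's outbound table, and the one-line body is invented for the example.

```python
import re
from email import message_from_string

# Hypothetical outbound table: Message-ID -> (sequence id, touch number).
OUTBOUND = {
    "<seq-7a3f-touch-1@mail.sender-domain.com>": ("7a3f", 1),
    "<seq-7a3f-touch-2@mail.sender-domain.com>": ("7a3f", 2),
}

MSG_ID = re.compile(r"<[^<>\s]+>")

def find_outbound(raw_message: str):
    """Return (sequence, touch) for the outbound this reply answers, or None."""
    msg = message_from_string(raw_message)
    # In-Reply-To names the immediate parent. References lists ancestors
    # oldest first, so it is walked newest-first as a fallback.
    candidates = MSG_ID.findall(msg.get("In-Reply-To", ""))
    candidates += reversed(MSG_ID.findall(msg.get("References", "")))
    for message_id in candidates:
        if message_id in OUTBOUND:
            return OUTBOUND[message_id]
    return None

RAW = """\
Message-ID: <a4f7b821-3c9d-4e1a-8f02-9b7c3d2e1f04@reply.prospect.com>
In-Reply-To: <seq-7a3f-touch-2@mail.sender-domain.com>
References: <seq-7a3f-touch-1@mail.sender-domain.com>
            <seq-7a3f-touch-2@mail.sender-domain.com>
Subject: Re: Quick question about your procurement workflow
From: sarah.chen@prospect.com
To: ae@mail.sender-domain.com

Happy to talk next week.
"""

print(find_outbound(RAW))
```

A non-None result is the signal to cancel the scheduled next touch for that sequence; a None result hands the message to the fallback heuristics.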

Message-ID best practices

The Message-ID format is specified in RFC 5322 §3.6.4: a unique identifier in angle brackets, structured as <local-part@domain>. The local part is opaque to receivers but load-bearing for the sender's own thread attribution. The correct posture:

Globally unique. A collision across two outbound messages is mechanically a thread collision: replies attribute to whichever message the platform looks up first. UUID-grade entropy (a version-4 UUID carries 122 random bits) makes collisions vanishingly unlikely at any plausible volume.
Domain-anchored. The domain portion should be the sending domain, not the recipient's and not a generic placeholder. Receivers that sanity-check Message-ID domains (a meaningful subset do) will flag implausible values.
Opaque, no PII. The local part should not embed contact email, name, company, or any recipient identifier, and encoding does not make it opaque: a Message-ID that exposes the recipient's email as base64 in the local part has been observed in production and is a privacy regression. The Message-ID survives at every hop and may surface in spam reports, blocklist forensics, and aggregate reporting (chapter 3) for the lifetime of the message.
Persistent across the sequence lifecycle. Retain the outbound Message-ID table for at least 180 days and ideally indefinitely; do not garbage-collect it when the sequence completes. A reply landing 47 days after the last touch, observed regularly with enterprise procurement, should still attribute correctly.
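A generator that satisfies the first three properties fits in a few lines. This is a sketch under stated assumptions: `new_message_id` is a hypothetical helper name, and the sending domain shown is the illustrative one used earlier in this chapter.

```python
import uuid

def new_message_id(sending_domain: str) -> str:
    """Build an opaque, domain-anchored Message-ID with no recipient data."""
    # uuid4 is random, so the local part reveals nothing about the contact
    # and two outbound messages will not realistically share an ID.
    return f"<{uuid.uuid4()}@{sending_domain}>"

message_id = new_message_id("mail.sender-domain.com")
print(message_id)
```

The fourth property, persistence, is a storage decision rather than a formatting one: the value returned here is what gets written to the outbound table alongside the sequence and touch identifiers.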

The In-Reply-To header — semantic vs syntactic threading

A reply with a properly populated In-Reply-To pointing to a known outbound Message-ID is unambiguously a reply. This is the syntactic case and it is the easy one. The hard case is the fraction of replies that arrive with no In-Reply-To at all: replies from older clients, certain mobile clients, web-mail interfaces that compose a "new" message rather than threading from the inbox view, and any reply the recipient initiated by copying the sender's address into a fresh compose window. In seed-list testing across a representative B2B population, between 6 and 11 percent of legitimate replies arrive without In-Reply-To. A platform that relies exclusively on syntactic threading therefore misses up to one reply in ten, every day, silently. The correct posture is to attempt syntactic threading first, then fall back to a heuristic chain: subject-line normalization, sender-domain match, time-window proximity, and body-content signal (quoted-text matching, signature blocks, mail-client artifacts). Each heuristic in isolation produces false positives; the conjunction of three or more produces classification accuracy within a fraction of a point of syntactic threading on the same population.
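The conjunction rule for header-less messages can be expressed as a simple vote. The sketch below assumes each heuristic has already been evaluated upstream; the `Signals` fields and the `is_probable_reply` name are hypothetical, and the threshold of three mirrors the "three or more" figure in the text.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    subject_matches: bool   # normalized subject equals an outbound subject
    domain_matches: bool    # sender domain equals the original recipient's domain
    within_window: bool     # arrived inside the attribution time window
    quotes_outbound: bool   # body quotes text from an outbound touch

def is_probable_reply(signals: Signals, required: int = 3) -> bool:
    """Accept a header-less message as a reply only when enough signals agree."""
    hits = sum([signals.subject_matches, signals.domain_matches,
                signals.within_window, signals.quotes_outbound])
    return hits >= required

print(is_probable_reply(Signals(True, True, True, False)))
```

Requiring agreement rather than accepting any single hit is what keeps the fallback from inheriting the false-positive rate of its weakest heuristic.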

The References header — the full thread chain

Where In-Reply-To identifies the immediate parent, References contains the full ancestry chain: every Message-ID in the conversation, in order. Mail clients use it to render threads correctly when intermediate messages are missing locally. Sequencing platforms use it for multi-touch attribution. The precondition is on the outbound side: each follow-up touch has to be sent threaded onto the previous one, carrying its own In-Reply-To and References, because a replying client can only copy a chain the parent already carried. When that holds, a properly threaded reply to the third touch carries the Message-IDs of touches one, two, and three. A platform that reads only In-Reply-To attributes the reply to the immediate parent, touch three. A platform that reads References reconstructs the full engagement path. The difference shows up in any analysis asking "which touch converted." Platforms that drop the References chain (and there are several in the market) overweight last-touch and understate the value of the touches that did the actual relationship work.
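Reconstructing the engagement path from the two headers is a short exercise. The sketch below assumes a hypothetical `outbound` mapping from Message-ID to touch number; the `engagement_path` name and the shortened example IDs are illustrative only.

```python
import re

MSG_ID = re.compile(r"<[^<>\s]+>")

def engagement_path(references: str, in_reply_to: str, outbound: dict) -> list:
    """Return the outbound touches named in the thread headers, oldest first."""
    ids = MSG_ID.findall(references)
    # In case a client left the parent out of References, add it from In-Reply-To.
    for parent in MSG_ID.findall(in_reply_to):
        if parent not in ids:
            ids.append(parent)
    # Keep only IDs the platform itself sent; the prospect's own IDs are skipped.
    return [outbound[i] for i in ids if i in outbound]

touches = {"<t1@s.example>": 1, "<t2@s.example>": 2, "<t3@s.example>": 3}
print(engagement_path("<t1@s.example> <t2@s.example>", "<t3@s.example>", touches))
```

The last element of the result is the last-touch attribution; the full list is what a multi-touch model consumes.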

Subject-line heuristics

When threading headers are absent, the subject line is the most reliable fallback. The Re: prefix on a reply, the Fw: or Fwd: prefix on a forward, and their locale equivalents — German AW:, French Réf:, Polish Odp:, Italian R:, Japanese 返信:, Chinese 回复: — all carry threading semantics that mail clients populate automatically. A platform serving a multilingual base has to normalize against the full set, not just the English forms. Matching is necessary but insufficient: a meaningful fraction of mobile clients strip the prefix on reply, and certain iOS mail apps have shipped behavior in production that omits Re: under specific configurations. A platform that requires Re: for classification under-counts replies from any sequence with mobile engagement, which is every sequence.
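Normalization against the prefix set can be sketched with one regular expression applied repeatedly. The list below contains only the prefixes named above; accepting the full-width colon alongside the ASCII colon is an assumption of this sketch, as is the `normalize_subject` name.

```python
import re

# Reply and forward prefixes named in the text; extend per locale as needed.
PREFIXES = ["re", "fw", "fwd", "aw", "réf", "odp", "r", "返信", "回复"]
PREFIX_RE = re.compile(
    r"^\s*(?:" + "|".join(re.escape(p) for p in PREFIXES) + r")\s*[:：]\s*",
    re.IGNORECASE,
)

def normalize_subject(subject: str) -> str:
    """Strip any stack of reply/forward prefixes, then fold case and spacing."""
    previous = None
    while previous != subject:          # repeat until no prefix is left
        previous = subject
        subject = PREFIX_RE.sub("", subject, count=1)
    return " ".join(subject.split()).casefold()

print(normalize_subject("Re: AW: Quick question about your procurement workflow"))
```

Comparing the normalized inbound subject with the normalized outbound subject treats a stripped prefix and a stacked prefix the same way, which is why the prefix is a supporting signal here and not a requirement.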

Auto-responder detection

The protocol-correct way to identify automated mail is the Auto-Submitted header, specified in RFC 3834. A vacation auto-responder declares Auto-Submitted: auto-replied; a list-distributed notification declares Auto-Submitted: auto-generated; a human reply declares Auto-Submitted: no or omits the header entirely. The legacy Precedence header, with values of auto_reply and sometimes bulk or junk, predates RFC 3834 and persists on corporate gateways from a particular era. A robust filter inspects both. Beyond the headers, body-content heuristics ("out of office," "currently away," "limited access to email," "will respond when I return") catch the residual auto-responders that emit neither header. The conjunction of header and body content produces classification accuracy meaningfully above either signal alone. Abuse-feedback notifications, formatted per RFC 5965, should route to suppression, not the reply pipeline.
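The header-then-body order can be sketched as follows. The function name and the phrase list are illustrative (the phrases are the ones quoted above), and the sketch reads only single-part plain-text bodies; a real filter would also walk MIME parts and decode transfer encodings.

```python
from email import message_from_string

AUTO_PRECEDENCE = {"auto_reply", "bulk", "junk"}
OOO_PHRASES = ("out of office", "currently away",
               "limited access to email", "will respond when i return")

def is_auto_response(raw_message: str) -> bool:
    msg = message_from_string(raw_message)
    # Drop any parameters after ";" and treat every value except "no" as automatic.
    value = msg.get("Auto-Submitted", "no").split(";")[0].strip().lower()
    if value not in ("", "no"):
        return True
    if msg.get("Precedence", "").strip().lower() in AUTO_PRECEDENCE:
        return True
    # Residual case: neither header present, so fall back to body phrases.
    payload = msg.get_payload()
    body = payload.lower() if isinstance(payload, str) else ""
    return any(phrase in body for phrase in OOO_PHRASES)

print(is_auto_response("Auto-Submitted: auto-replied\n\nBack on Monday.\n"))
```

A body-only match is the weakest of the three paths, so it is reasonable to record which path fired and to treat body-only hits with lower confidence downstream.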

Out-of-office handling

The interesting question after detecting an auto-responder is what to do about it. The naive options (pause indefinitely, treat as hard exit, or send the next touch on schedule) are all wrong in some meaningful fraction of cases. A correctly implemented OOO handler parses the return date from the body. Most corporate OOO messages include it in a recognizable format, such as "I will return on June 14" or "out of office until 14/06/2026," and modern parsers extract it with reliability in the high nineties. The sequence pauses until the morning after the return date, at which point it resumes with the next touch. This produces meaningfully better reply rates than either continuing on schedule (the recipient never sees the touch and the sequence rolls past) or treating OOO as a hard exit (a 5-day vacation should not eliminate the contact). Hard exit is the correct posture for parental leave, extended medical leave, and "no longer with the company" responders; the classifier needs to distinguish those from short absences.
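Return-date parsing for the two formats quoted above can be sketched with two regular expressions. This is deliberately narrow: it assumes numeric dates are DD/MM/YYYY, reads a month and day that has already passed as next year, and covers English month names only. The `resume_date` name is hypothetical.

```python
import re
from datetime import date, timedelta

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]
NAMED = re.compile(r"return(?:ing)? on ([A-Za-z]+) (\d{1,2})", re.IGNORECASE)
NUMERIC = re.compile(r"until (\d{1,2})/(\d{1,2})/(\d{4})", re.IGNORECASE)

def resume_date(body: str, received: date):
    """Return the day after the stated return date, or None if none parses."""
    try:
        numeric = NUMERIC.search(body)
        if numeric:
            day, month, year = (int(g) for g in numeric.groups())  # assumes DD/MM/YYYY
            return date(year, month, day) + timedelta(days=1)
        named = NAMED.search(body)
        if named and named.group(1).lower() in MONTHS:
            month = MONTHS.index(named.group(1).lower()) + 1
            back = date(received.year, month, int(named.group(2)))
            if back < received:
                # A month and day already in the past is read as next year.
                back = back.replace(year=received.year + 1)
            return back + timedelta(days=1)
    except ValueError:
        pass  # impossible calendar date such as 31/02
    return None

print(resume_date("I will return on June 14.", date(2026, 6, 1)))
```

A None result is the cue to fall back to a default pause, or to the hard-exit check for long-term absences, instead of guessing a date.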

The bounce-misclassified-as-reply failure

Delivery Status Notifications (DSNs) are the standard bounce format: RFC 3461 defines the SMTP extension that requests them, and RFC 3464 defines the message format. A well-formed DSN carries Auto-Submitted: auto-generated, a From: at MAILER-DAEMON@ or postmaster@, and a structured message/delivery-status MIME part with the failure code. A pipeline that inspects any of these signals classifies the DSN correctly.

The failure mode is the DSN that does not. A meaningful fraction of corporate gateways — older on-prem Exchange tenants, regional ISPs, several enterprise anti-spam appliances — emit bounces that omit Auto-Submitted, use a human-looking From: ("Exchange Administrator"), and embed the failure reason in plain-text prose. To a heuristic that looks only at threading headers and subject-line prefixes, these are indistinguishable from short human replies. The classifier marks them as replies, the sequence stops, and the platform reports the bounce as a conversion. The damage is silent: the reply-rate dashboard inflates with phantom replies that never close, the bounce pipeline misses the signal that should have triggered suppression (chapter 12), and the next campaign re-targets the same dead address. A pre-deployment audit should test against the known DSN dialects that don't conform to RFC 3461.

The reply-misclassified-as-bounce failure

The inverse is rarer but more expensive. A legitimate human reply containing text like "I tried to email you last week but it was returned as undeliverable" (a recipient referring to a separate, unrelated mail issue) may match bounce heuristics scanning for "returned," "undeliverable," "bounce," "delivery failure." The classifier marks the reply as a bounce. What follows depends on how the bounce path is wired. If the match is treated as a hard bounce, the contact is suppressed and an engaged prospect is silently lost. If it is treated as a soft bounce, the sequence continues, because bounce handling and reply handling are separate code paths, and the relationship is damaged as the platform keeps sending despite the engagement. Either way the bounce-rate dashboard inflates with phantom bounces, sometimes triggering automated throttling on a sequence with no actual deliverability issue. The fix: gate any body-content bounce heuristic on at least one structural signal, namely a DSN From: address, the presence of a Content-Type: message/delivery-status part, or an Auto-Submitted value.
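The gating rule can be sketched with the standard-library parser. The helper names and word list are illustrative (the words are the ones quoted above). In this sketch a message/delivery-status part is treated as conclusive by itself, while body wording counts only when a DSN-style From: address or an Auto-Submitted value is also present.

```python
from email import message_from_string
from email.utils import parseaddr

BOUNCE_WORDS = ("returned", "undeliverable", "bounce", "delivery failure")
DSN_LOCAL_PARTS = {"mailer-daemon", "postmaster"}

def has_delivery_status_part(msg) -> bool:
    return any(p.get_content_type() == "message/delivery-status" for p in msg.walk())

def has_header_signal(msg) -> bool:
    local_part = parseaddr(msg.get("From", ""))[1].split("@")[0].lower()
    auto = msg.get("Auto-Submitted", "no").split(";")[0].strip().lower()
    return local_part in DSN_LOCAL_PARTS or auto not in ("", "no")

def body_text(msg) -> str:
    chunks = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            chunks.append(payload.decode("utf-8", errors="replace"))
    return " ".join(chunks).lower()

def is_bounce(raw_message: str) -> bool:
    msg = message_from_string(raw_message)
    if has_delivery_status_part(msg):
        return True                 # structured DSN part: no wording needed
    if not has_header_signal(msg):
        return False                # no structural signal: body words are ignored
    return any(word in body_text(msg) for word in BOUNCE_WORDS)

reply = ("From: sarah.chen@prospect.com\n\n"
         "I tried to email you last week but it was returned as undeliverable.\n")
print(is_bounce(reply))
```

The same gate means a non-conforming bounce with a human-looking From: and no structural signal is not caught here; those dialects need their own explicit rules rather than looser body matching.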

Forwarded-thread detection

When a recipient forwards the sequence to a colleague — common in B2B where a champion routes to procurement — the reply arrives from the forwarded-to mailbox, not the original recipient. The reply's References chain, if the forwarder behaves correctly, still contains the original outbound Message-ID. Its In-Reply-To may point to the original outbound or to the forward message; behavior varies by client. The attribution requirement is to recognize the colleague's reply as a reply on the original sequence even though the sender does not match. Threading by Message-ID handles this. Threading by sender-address does not.

The reply-from-different-address case

A contact at prospect@company.com may forward internally and reply from procurement@company.com, or from a personal address. The reply carries the correct References chain — the contact is mechanically replying to a thread they have participated in via the forward — but the From: is new. A platform that attributes by Message-ID lookup handles this. A platform that attributes by matching sender-address against recipient-address — and a surprising number do, because it is the simpler implementation — produces an orphan reply that does not stop the next touch. The original recipient continues receiving touches after their colleague has explicitly replied on their behalf. The damage shows up as complaint-rate inflation two touches later.
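Once the Message-ID lookup has produced an outbound record, attribution should follow that record and merely note who actually replied. The sketch below assumes a hypothetical record layout with `sequence_id` and `recipient` keys; the `Attribution` type and `attribute` function are illustrative names.

```python
from dataclasses import dataclass
from email.utils import parseaddr
from typing import Optional

@dataclass
class Attribution:
    sequence_id: str
    original_recipient: str
    alternate_address: Optional[str]   # set when someone else sent the reply

def attribute(reply_from: str, outbound_record: dict) -> Attribution:
    """Attribute by the matched outbound record, not by who sent the reply."""
    sender = parseaddr(reply_from)[1].lower()
    recipient = outbound_record["recipient"].lower()
    return Attribution(
        sequence_id=outbound_record["sequence_id"],
        original_recipient=recipient,
        alternate_address=None if sender == recipient else sender,
    )

record = {"sequence_id": "7a3f", "recipient": "prospect@company.com"}
print(attribute("Procurement <procurement@company.com>", record))
```

The sequence stops for the original recipient in both cases; the logged alternate address is what lets a human follow up with the person who actually answered.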

Inbox automation interaction

Major mail clients have, over the last 18 months, shipped automated inbox features the pipeline has to account for: automated unsubscribe-via-reply (the client sends a reply on the user's behalf containing "unsubscribe"), snooze-with-auto-reply (an auto-generated holding message), and rule-based forwarding (a reply that originates from the user's mailbox but was composed by a server-side filter rule, not the user). The auto-generated unsubscribe reply, in particular, is a hard exit, not engagement — but a platform that treats it as a regular reply attributes it to the conversion funnel and continues sending because the unsubscribe intent was not parsed. The correct posture is to detect the automated origin via Auto-Submitted (which recent clients emit correctly) and route to suppression.
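Routing on automated origin can be sketched as a three-way decision. The route names and the `route_inbound` function are hypothetical, the sketch relies on the client emitting Auto-Submitted as described above, and rule-based forwards are out of scope because nothing in the headers examined here identifies them.

```python
from email import message_from_string

def route_inbound(raw_message: str) -> str:
    """Return 'suppression', 'deferral', or 'reply_pipeline' for one message."""
    msg = message_from_string(raw_message)
    value = msg.get("Auto-Submitted", "no").split(";")[0].strip().lower()
    automated = value not in ("", "no")
    payload = msg.get_payload()
    body = payload.lower() if isinstance(payload, str) else ""
    if automated and "unsubscribe" in body:
        return "suppression"      # client-generated opt-out: hard exit
    if automated:
        return "deferral"         # holding message: pause, do not count as a reply
    return "reply_pipeline"       # human-composed: attribute normally

print(route_inbound("Auto-Submitted: auto-generated\n\nunsubscribe\n"))
```

Keeping the automated check ahead of attribution is the point of the design: an opt-out generated by the client never reaches the conversion funnel.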

Common deployment failures observed in production

Sender-address-match classification only. Colleague forwards, procurement routes, and personal-address responses are all missed. The platform over-sends to contacts who have effectively replied via a colleague, producing complaint inflation two touches later.
Requiring Re: for classification. Mobile clients that strip the prefix, replies composed as new messages, and non-English prefixes are all missed. The under-count runs 5 to 12 percent, invisible without seed-list testing.
No auto-responder filter. Vacation messages, "I no longer work here" notifications, and gateway acknowledgments classify as replies. The reply-rate dashboard inflates by 3 to 7 percent of outbound volume; the operator reads inflation as performance.
No DSN-dialect coverage. Well-formed RFC 3461 bounces classify correctly; the long tail of non-conforming bounces misclassifies as replies. Sequences stop on dead addresses; the next campaign re-targets them.
Body-content bounce heuristics without structural gating. Legitimate replies that mention undelivered mail in passing are classified as bounces, suppressed, and silently lost.
Garbage-collected outbound Message-ID table. Replies arriving weeks after the last touch — common in enterprise procurement — fail to resolve and either fall through to heuristic threading or classify as new conversations.
Dropped References chain. Multi-touch attribution collapses to last-touch; the touches that did the actual relationship work are undervalued; sequence optimization runs against a distorted picture.

Pre-deployment checklist

Outbound Message-ID generated with UUID-grade entropy, domain-anchored, no recoverable PII in the local part
Outbound Message-ID table retained for at least 180 days, ideally indefinitely
Inbound classifier reads In-Reply-To first, falls back to References, then to subject-line normalization across the major locale prefixes
Auto-Submitted (RFC 3834) and Precedence: headers both inspected before reply classification
Body-content auto-responder detection running alongside header detection, with OOO return-date parsing and resumption-on-return scheduling
DSN detection gated on at least one structural signal — From: address, message/delivery-status MIME part, or Auto-Submitted — not body-content heuristics alone
Reply attribution by Message-ID lookup, never by sender-address match alone; mismatch replies routed to the original sequence with the alternative address logged
Seed-list testing (chapter 13) used pre-deployment to measure classification accuracy against a labeled corpus across the major mailbox providers
Inbox-automation patterns (auto-unsubscribe, snooze-auto-reply, rule-based forwards) detected and routed to suppression, deferral, or attribution as appropriate

Where this fits in the broader infrastructure

Reply detection is the operational layer where the entire upstream infrastructure investment realizes its return — or fails to. A sequencing estate with perfect SPF, DKIM, DMARC, MTA-STS, BIMI, ARC, isolated subdomains, a clean warmup curve, bulk-sender compliance, an instrumented monitoring stack, a sane bounce taxonomy, and seed-list-verified primary-tab placement still produces no measurable conversion if the reply pipeline misattributes its replies. The campaign lands, the prospect engages, the engagement signal never reaches pipeline analytics, the operator concludes the campaign underperformed, and next quarter's investment is misallocated against a hallucinated reality.

Every other chapter is about delivering the message. This chapter is about hearing the response. A sender who has internalized the first thirteen chapters and skipped the fourteenth has built a one-way broadcast system, not a sales channel. The header chain defined by RFC 5322 — Message-ID, In-Reply-To, References — is the closed-loop instrumentation that makes the broadcast measurable. Platforms that respect the specification produce reliable conversion data. Platforms that do not, do not.

Related chapters

Email Bounce Codes Explained — distinguishing a real bounce from a false bounce on an auto-response.
Reply Classification — The Five Categories — what to do with replies once you reliably detect them.
Reply Routing Architecture — Messaging vs CRM — where reliably-detected replies should land for action.
Pipeline Conversion Math — The Per-Stage Funnel — why reply attribution is foundational for funnel measurement.
