Chapter 02 · Define the ICP

Your data first

First-party signal mining — your data first, at zero marginal cost.

The highest-quality ICP signal you can use is the signal you already control, at zero marginal acquisition cost. Closed-won customers (Chapter 1) are the most concentrated form. The broader first-party estate — web analytics, product usage, inbound forms, support tickets, newsletter engagement, event attendance, CRM history, social engagement — is the next layer. Most operators skip past it to a third-party data purchase.

TL;DR

Web analytics, product usage, support tickets, and demo requests are sitting in your stack at zero marginal cost. Mine them before you pay a vendor.
For PLG companies, in-product activation milestones are the dominant signal. The activated-but-not-yet-paid cohort produces 30-50% reply rates — the highest-converting outbound segment in the entire stack.
Apply a 90-day signal refresh. Per-source half-lives range from 14 days (trial usage) to 12 months (CRM touch). Store every signal with a timestamp and a decay weight, or your warehouse calcifies into noise.
For pre-product or pre-traction operators, the YC batch directory, alumni directory, and design-partner roster are your first-party signal. Treat them with the same discipline as a website analytics export.
Layer third-party intent on top of first-party, never as a substitute. Modeled signal shared across every customer of the provider is categorically lower quality than signal you observed yourself.

The premise

Every prospective customer who has interacted with a company — visited the marketing site, opened a trial, submitted a form, opened a ticket or a newsletter, attended a webinar, replied to a sales email two years ago, liked a post — has emitted a signal. That signal sits in the company's own systems at zero marginal acquisition cost. Its predictive quality exceeds any single third-party intent provider's signal by a wide margin, because the prospect has self-selected by interacting with the company's actual surface area rather than by appearing in a vendor's model of who might be interested.

The failure pattern is uniform: an operator decides to refresh the prospect list, opens a purchase conversation with a third-party data vendor, and never audits the first-party estate. The vendor data arrives, the sequence runs, and the conversion rate is that of a cold list against a generic firmographic filter. Meanwhile the same operator's analytics platform contains, untouched, a 90-day cohort of repeat visitors to the pricing page from companies that match the ICP, empirically one of the highest-converting sources in the entire stack.

Web-analytics as ICP signal

The marketing-site analytics platform is the most universally available first-party source. Every company has one; almost no outbound operator queries it for prospecting.

Per-page-visited segmentation. Visitors to /pricing, /security, /integrations/[specific-system], and /case-studies/[specific-vertical] are not behaving like homepage visitors. In production, a sequence targeted at someone who read three pricing-tier pages last week draws between 4x and 12x the reply rate of a sequence sent to a firmographic match with no behavioral signal.

Dwell-time signal. A 12-second homepage visit is not a signal. A 6-minute visit to a technical documentation page, followed by a 9-minute return to the integration-architecture page, is. The threshold above which dwell-time becomes operationally useful is approximately 90 seconds on a non-homepage URL — below that, the visitor is bouncing.

Repeat-visit patterns. A single visit is noise. Three or more visits within a 30-day window from the same identifier is the strongest behavioral signal in the analytics layer. The reply rate against the three-or-more-visits cohort is, empirically, between 8% and 22% — multiples above the ICP-firmographic-only rate.

The visitor-identity layer. The analytics platform on its own captures an anonymous identifier. A visitor-identification layer — products that resolve an anonymous web visit to a company domain via reverse-IP lookup, and in some cases to a specific person via deterministic matching — converts anonymous behavioral signal into a prospect record. The per-account match rate is between 25% and 55% of qualified traffic. The matched cohort is the highest-quality non-paying first-party source.
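The page, dwell-time, and repeat-visit filters described above can be combined in a single pass over exported page-view rows. The following is a minimal sketch, not a reference implementation: the PageView row shape and the path prefixes are hypothetical, each qualifying page view is counted as one visit (a simplification), and combining all three filters in one cohort is an assumption rather than something the text prescribes.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta

# Thresholds come from the text; the path prefixes are illustrative.
HIGH_INTENT_PREFIXES = ("/pricing", "/security", "/integrations/", "/case-studies/")
MIN_DWELL_SECONDS = 90
MIN_VISITS = 3
WINDOW = timedelta(days=30)


@dataclass(frozen=True)
class PageView:  # hypothetical export row shape
    visitor_id: str
    path: str
    dwell_seconds: int
    viewed_at: datetime


def repeat_visit_cohort(views, now):
    """Return visitor ids with at least MIN_VISITS qualifying views inside WINDOW."""
    counts = defaultdict(int)
    for v in views:
        recent = now - v.viewed_at <= WINDOW
        high_intent = v.path.startswith(HIGH_INTENT_PREFIXES)
        if recent and high_intent and v.dwell_seconds >= MIN_DWELL_SECONDS:
            counts[v.visitor_id] += 1
    return {vid for vid, n in counts.items() if n >= MIN_VISITS}


now = datetime(2026, 1, 31)
views = [
    PageView("a", "/pricing", 120, datetime(2026, 1, 5)),
    PageView("a", "/security", 200, datetime(2026, 1, 12)),
    PageView("a", "/pricing/enterprise", 95, datetime(2026, 1, 20)),
    PageView("b", "/", 12, datetime(2026, 1, 20)),  # short homepage visit: ignored
]
cohort = repeat_visit_cohort(views, now)
```

The resulting identifiers are still anonymous; they become prospect records only after the visitor-identification layer resolves them to a domain or person.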

Product-usage data

For product-led companies, the in-product telemetry stream is the densest first-party source the company operates. The free-tier or trial user has self-selected for problem-fit, has provided an email and usually a company affiliation, and has emitted signal at the per-feature granularity.

Per-feature-used segmentation. The cohort that has invoked the core value-delivery feature is distinct from the cohort that signed up and never returned. The cohort that has invoked a specific integration is distinct again. An outbound sequence framed around the specific feature the user invoked has the highest reply rate of any product-led outbound.

The activation-vs-non-activation signal. Every product-led company has an activation event — the behavior empirically correlated with eventual paid conversion. The activated-but-not-yet-paid cohort is the single highest-leverage outbound segment in the entire stack: reply rates of 30 to 50% are routine, meeting-conversion exceeds cold outbound by an order of magnitude. The non-activated cohort is a re-engagement segment, not a sales-outbound segment.

The activation-milestone hierarchy. Activation is rarely a single event. The cleanest PLG signal taxonomy is a hierarchy: signup, first meaningful action, second session, invited-a-teammate, hit-a-quota-limit, connected-an-integration. Each milestone is a separate cohort with its own outbound treatment. The "invited-a-teammate" and "hit-a-quota-limit" cohorts in particular are the densest expansion signal you'll have access to before a sales call exists at all.
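One way to turn the milestone hierarchy into cohorts is to assign each user to the deepest milestone they have reached. This is a sketch under stated assumptions: the event names are hypothetical identifiers for the milestones listed above, and the relative order of the last three milestones is a guess, since the text lists them without ranking them against each other.

```python
# Milestones from the text, ordered from shallowest to deepest.
# The relative order of the last three entries is an assumption.
MILESTONES = [
    "signup",
    "first_meaningful_action",
    "second_session",
    "invited_a_teammate",
    "hit_a_quota_limit",
    "connected_an_integration",
]
RANK = {name: i for i, name in enumerate(MILESTONES)}


def deepest_milestone(events):
    """Return the deepest known milestone among a user's event names, or None."""
    known = [e for e in events if e in RANK]
    return max(known, key=RANK.__getitem__) if known else None


def cohorts(events_by_user):
    """Group user ids by their deepest milestone."""
    out = {m: [] for m in MILESTONES}
    for user, events in events_by_user.items():
        milestone = deepest_milestone(events)
        if milestone is not None:
            out[milestone].append(user)
    return out


result = cohorts({
    "u1": ["signup"],
    "u2": ["signup", "second_session", "invited_a_teammate"],
    "u3": ["page_view"],  # no recognized milestone, so not placed in any cohort
})
```

Each key of the result maps to one outbound treatment, so the sequence copy can reference the specific milestone the user reached.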

Pre-product first-party signal — directories as your warehouse

If you have no web analytics yet, no product usage, and a CRM with three rows, your first-party signal is the people directories you can already access. Treat them with the same discipline as an analytics export.

YC batch directory (for YC founders): every founder in your batch is a candidate buyer, candidate introducer, or both. Index by company, by stage, by ICP fit. The batchmate density is why most early YC B2B startups source their first 5-10 customers from the directory.
YC alumni directory: every prior-batch alumnus at a target account is a warm path you would otherwise have written as cold. Run the join.
Design partner roster: design partners are your highest-fidelity first-party signal cohort by definition — they self-selected to give you product input. The roster doubles as the seed for ICP definition and the reference list for the next ten customers.
Founder personal network: former coworkers, classmates, advisors, prior customers from a past company. List every person you could plausibly send a cold email to and instead send them a warm one. Second-degree intros through this network convert at multiples of cold outbound.

The discipline is the same as the warehouse pattern below: one row per person, source captured, last-touch timestamp, relationship strength score. Pre-product operators who maintain this single table outperform pre-product operators who don't, on every measurable dimension of how fast the first 10 customers close.
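The single table can start as something very small. The sketch below assumes a hypothetical PersonRow shape with the four fields named above; the source labels and the 1-to-5 strength scale are illustrative choices, not part of the original guidance.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PersonRow:
    email: str
    source: str                  # e.g. "yc_batch", "design_partner" (illustrative labels)
    last_touch: date
    relationship_strength: int   # 1 (weak) to 5 (strong); the scale is an assumption


def upsert(table, row):
    """Keep one row per email, preferring the most recent touch."""
    key = row.email.strip().lower()
    current = table.get(key)
    if current is None or row.last_touch > current.last_touch:
        table[key] = row
    return table


def outreach_order(table):
    """Strongest relationships first, then most recently touched."""
    return sorted(
        table.values(),
        key=lambda r: (r.relationship_strength, r.last_touch),
        reverse=True,
    )


people = {}
upsert(people, PersonRow("A@x.com", "yc_batch", date(2025, 1, 1), 2))
upsert(people, PersonRow("a@x.com", "design_partner", date(2025, 6, 1), 5))
upsert(people, PersonRow("b@y.com", "network", date(2025, 3, 1), 3))
```

Normalizing the email before using it as the key is what enforces the one-row-per-person rule when the same contact appears in two directories.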

Demo-request and inbound-form signal

Inbound form submissions are the highest-intent first-party signal a company receives. Demo-request-to-closed-deal conversion exceeds outbound conversion by a factor of 5 to 15 in B2B SaaS. Extract the signal field by field, not just from the fact that the form was submitted.

The fields the prospect populated, the fields they skipped, the free-text "what brings you here," the self-reported company size, the use case dropdown — each informs both the sales conversation and the ICP refinement. The operator whose form fields exist only to populate a CRM record is wasting the highest-density first-party signal the company receives.

UTM parameters, referrer, and landing page, aggregated over hundreds of submissions, produce the empirical conversion rate by source. A demo request from organic search converts differently than one from a paid-social campaign, a partner referral, or a podcast appearance. The per-source decomposition feeds channel-allocation decisions and the segmentation architecture in Chapter 5. The operator reporting demo-requests in aggregate is producing a vanity metric.
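The per-source decomposition is a simple grouped count. A minimal sketch, assuming each submission has already been joined to its eventual deal outcome and reduced to a (utm_source, closed_won) pair; that input shape and the "(none)" bucket label are assumptions for illustration.

```python
from collections import Counter


def conversion_by_source(submissions):
    """submissions: iterable of (utm_source, closed_won) pairs.

    Returns {source: (submission_count, closed_won_count, rate)}.
    """
    total, won = Counter(), Counter()
    for source, closed_won in submissions:
        key = source or "(none)"  # missing UTM values get their own bucket
        total[key] += 1
        won[key] += int(closed_won)
    return {s: (total[s], won[s], won[s] / total[s]) for s in total}


rates = conversion_by_source([
    ("organic", True),
    ("organic", False),
    ("paid_social", False),
    (None, True),
])
```

With only a handful of submissions per source the rates are noisy, which is why the text specifies aggregating over hundreds of submissions before acting on them.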

Support-ticket data

Support tickets are the least-mined first-party source in the typical B2B stack. The cause is structural: support is operated by a different team than sales, the data does not flow into the prospecting pipeline, and per-account ticket history is invisible to the outbound operator.

Ticket-volume segmentation reveals the active-buyer cohorts. Accounts filing tickets are accounts actively using the product. Per-account ticket volume, segmented by user role and category, identifies accounts where expansion is plausible, the technical contact is engaged, and the product is meaningfully embedded. Expansion-sequence targeting derived from this segmentation outperforms generic post-sale outreach by 3 to 6x on meeting-booked rate. Ticket category is an independent signal: integration tickets indicate active integration work; permissions tickets, an active rollout; billing tickets, procurement engagement.
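The category-to-signal mapping above lends itself to a small lookup applied per account. This is a sketch only: the (account, category) input shape is hypothetical, and real ticketing exports will use their own category names that need mapping onto these three.

```python
from collections import Counter, defaultdict

# Category meanings as described in the text; the key names are assumptions.
CATEGORY_SIGNAL = {
    "integration": "active integration work",
    "permissions": "active rollout",
    "billing": "procurement engagement",
}


def account_ticket_signals(tickets):
    """tickets: iterable of (account, category) pairs.

    Returns {account: {"volume": int, "signals": [str, ...]}}.
    """
    by_account = defaultdict(Counter)
    for account, category in tickets:
        by_account[account][category] += 1
    return {
        account: {
            "volume": sum(counts.values()),
            "signals": sorted(
                CATEGORY_SIGNAL[c] for c in counts if c in CATEGORY_SIGNAL
            ),
        }
        for account, counts in by_account.items()
    }


signals = account_ticket_signals([
    ("acme.com", "integration"),
    ("acme.com", "integration"),
    ("acme.com", "billing"),
    ("globex.com", "how_to"),  # unmapped category: counts toward volume only
])
```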

Newsletter-engagement signal

The marketing newsletter is the largest opt-in audience most B2B companies directly control. Per-recipient open rate, click rate, links clicked, and recency of last engagement map directly to outbound prioritization.

A recipient who opened in the last 30 days is in the active cohort; a recipient whose last open was 18 months ago is dormant — and empirically less reachable than a cold prospect with no relationship, because the dormant subscriber has trained their mail client to suppress the sender. Outbound to the active newsletter cohort produces reply rates between 6 and 14%; outbound to the dormant cohort produces reply rates below 1% and degrades sending reputation.

The discipline: segment by recency of last engagement at the 30-day, 90-day, and 180-day marks; target the 30-day cohort with high-intent outbound; target the 90-day cohort with re-engagement; suppress anyone dormant beyond 180 days from cold outbound entirely.
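The recency rule can be expressed as a single function from last-engagement date to treatment. A minimal sketch with two labeled assumptions: subscribers who have never engaged are treated as dormant, and the 91-to-180-day band, for which the text prescribes no treatment, is given a placeholder "hold" label.

```python
from datetime import date, timedelta


def newsletter_segment(last_engaged, today):
    """Map a subscriber's last open or click date to an outbound treatment."""
    if last_engaged is None:
        return "suppress"              # never engaged: treated as dormant (assumption)
    age = (today - last_engaged).days
    if age <= 30:
        return "high_intent_outbound"
    if age <= 90:
        return "re_engagement"
    if age <= 180:
        return "hold"                  # 91 to 180 days: not prescribed by the text
    return "suppress"


today = date(2026, 1, 1)
segments = [
    newsletter_segment(today - timedelta(days=17), today),
    newsletter_segment(today - timedelta(days=78), today),
    newsletter_segment(today - timedelta(days=150), today),
    newsletter_segment(today - timedelta(days=540), today),
]
```

Enforcing the "suppress" outcome at the sender, rather than only in the list export, keeps dormants out even when a list is rebuilt by hand.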

Event-attendance signal

Webinar registrants, conference attendees with scanned booth badges, podcast guests, panel attendees, public roundtable participants — each is an event-attendance signal the company captured but, in most stacks, never routed to the prospecting pipeline. The signal is two-layered: registration (intent) and attendance (completed engagement).

The conference-attendee cohort is the most under-routed first-party source in the typical sales operation. Marketing collects the names, the names sit in a marketing-automation tool, sales runs against a separate list, the cohort is never targeted. A sequence sent to a conference cohort, framed around the specific event, exceeds the conversion rate of a generic cold sequence by a factor of 3 to 8.

CRM-historical-touch data

The CRM contains a multi-year archive of prior interaction, with three operationally distinct segments:

Never-replied-to. Touched in prior sequences and never engaged. Large cohort, weak signal. Suppress unless re-qualified — a contact who ignored four sequences in 2024 is not a high-probability target for the same approach in 2026.

Replied-but-didn't-convert. Responded but did not advance. Small cohort, high signal — a reply, even a negative one, indicates the contact reads outbound from the company and is willing to engage. The correct treatment is a fundamentally different sequence: new angle, new offer, new framing. Reply rates against this cohort, with the right reframe, exceed cold rates by 4 to 10x.

Lost-deal segments. Accounts that entered an active sales process, advanced to qualification or proposal, and were then lost. Smallest cohort, highest signal. Re-engaging lost deals at the 12-month and 24-month marks produces closed-won rates between 12 and 25%, empirically the highest-converting outbound segment in the CRM archive.
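The three segments can be assigned with a short precedence rule: a lost deal outranks a reply, and a reply outranks silence. The sketch below assumes a hypothetical set of CRM fields (lost_at_stage, replied, closed_won) and stage names; real CRMs will label these differently.

```python
from enum import Enum


class CrmSegment(Enum):
    LOST_DEAL = "lost_deal"
    REPLIED_NOT_CONVERTED = "replied_not_converted"
    NEVER_REPLIED = "never_replied"


DEAL_STAGES = {"qualification", "proposal"}  # stage names are assumptions


def crm_segment(lost_at_stage, replied, closed_won):
    """Classify a previously touched contact; returns None for customers.

    lost_at_stage is the stage at which a deal was lost, or None if the
    contact never entered a sales process.
    """
    if closed_won:
        return None
    if lost_at_stage in DEAL_STAGES:
        return CrmSegment.LOST_DEAL
    if replied:
        return CrmSegment.REPLIED_NOT_CONVERTED
    return CrmSegment.NEVER_REPLIED


examples = [
    crm_segment("proposal", True, False),
    crm_segment(None, True, False),
    crm_segment(None, False, False),
    crm_segment(None, True, True),
]
```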

Social-engagement signal

The professional-network and short-form-social presences are first-party surfaces. Who liked the post, who commented, who reshared, who replied to the founder's post: each is a signal the prospect emits on the company's own surface and the platform exposes. The cohort that engaged with the company's content in the prior 90 days is a warm-outbound segment with reply rates substantially above firmographic-cold, dependent primarily on the specificity of the reference back to the engagement.

The discipline is to extract the engagement stream — exportable from creator tools or public APIs — and route it to the warehouse as a per-contact signal with a timestamp. A sequence opening with "you commented on the post about X last month" produces a categorically different reply rate than the same sequence sent cold.

The per-source signal-quality ranking

The first-party signal sources are not equal-weight. The empirical ranking, from highest to lowest conversion-rate contribution in production:

Rank	Source	Typical reply rate to targeted outbound
1	Activated-but-not-paid product users	30–50%
2	Lost-deal re-engagement (12+ months)	18–30%
3	Repeat-visit identified web cohort (3+ visits, 30 days)	8–22%
4	Replied-but-didn't-convert CRM cohort	10–20%
5	Active newsletter cohort (30-day engagement)	6–14%
6	Conference-attendee event cohort	9–18%
7	Social-engagement 90-day cohort	5–12%
8	High-volume support-ticket account expansion	8–16% (meeting-booked)
9	Single-visit identified web cohort	2–5%
—	Firmographic-cold (no first-party signal)	0.8–2.5%

The implication is operational: an outbound program that has exhausted the top six rows of this table before purchasing a single third-party record is rare. An outbound program that purchases third-party data before auditing the top six rows is common, and is, in our observation, producing one to two orders of magnitude below its achievable conversion rate.

The unified first-party signal warehouse

The operational pattern is to build a unified first-party signal store before any third-party data purchase. The architecture is mundane: a single SQL store with one row per identified contact and a schema that accommodates per-source signal events with a timestamp. Minimum-viable: a contacts table keyed on email; a signal_events table with a foreign key to contacts plus source, event type, magnitude, and timestamp columns; and a contact_scores view that materializes the per-contact aggregate score with decay applied. Each signal source catalogued above feeds the events table on a daily ingest; the prospecting pipeline reads from the materialized score view and orders the outbound queue accordingly.
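The minimum-viable schema can be sketched with Python's built-in sqlite3 module. Several things here are assumptions for illustration: the sources table holding a half-life per source, the half_life_weight helper, and the exact column names. SQLite has no materialized views, so contact_scores is a plain view here; in a production warehouse it would more likely be a materialized view or a table rebuilt on a schedule.

```python
import sqlite3

SCHEMA = """
CREATE TABLE contacts (email TEXT PRIMARY KEY);
CREATE TABLE sources (
    source         TEXT PRIMARY KEY,
    half_life_days REAL NOT NULL
);
CREATE TABLE signal_events (
    id          INTEGER PRIMARY KEY,
    email       TEXT NOT NULL REFERENCES contacts(email),
    source      TEXT NOT NULL REFERENCES sources(source),
    event_type  TEXT NOT NULL,
    magnitude   REAL NOT NULL,
    occurred_at TEXT NOT NULL  -- UTC timestamp, e.g. '2026-01-15 09:30:00'
);
CREATE VIEW contact_scores AS
SELECT e.email,
       SUM(e.magnitude * half_life_weight(
               julianday('now') - julianday(e.occurred_at),
               s.half_life_days)) AS score
FROM signal_events e
JOIN sources s USING (source)
GROUP BY e.email;
"""


def open_warehouse(path=":memory:"):
    """Create the store and register the decay helper used by the view."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    # Not every SQLite build ships pow(), so the decay is computed in Python.
    conn.create_function(
        "half_life_weight", 2,
        lambda age_days, half_life: 0.5 ** (max(age_days, 0.0) / half_life),
    )
    conn.executescript(SCHEMA)
    return conn


conn = open_warehouse()
conn.execute("INSERT INTO contacts VALUES ('a@example.com')")
conn.execute("INSERT INTO sources VALUES ('web', 45)")
conn.execute(
    "INSERT INTO signal_events (email, source, event_type, magnitude, occurred_at) "
    "VALUES ('a@example.com', 'web', 'pricing_view', 2.0, datetime('now', '-45 days'))"
)
score = conn.execute("SELECT score FROM contact_scores").fetchone()[0]
```

In this example a magnitude-2.0 event that is one half-life old contributes roughly 1.0 to the contact's score; the prospecting pipeline would read contact_scores ordered by score descending.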

The cost of building this is, in practice, two to six weeks of engineering. The cost of skipping it is the foregone conversion of the top six rows of the ranking table, compounded over every campaign run before the warehouse exists.

The signal-decay problem

First-party signals decay. A pricing-page visit from 18 months ago is not predictive; a pricing-page visit from 8 days ago is. The operator who treats a 2-year-old form submission as equivalent to a 2-week-old form submission is producing noise.

The half-life varies by source. In our observation: web-analytics behavioral signals decay with a 30 to 60 day half-life; trial product-usage with a 14 to 30 day half-life; paid product-usage with a 90 to 180 day half-life; inbound-form with a 60 to 90 day half-life; newsletter-engagement with a 90 day half-life; CRM-touch decays slowly and re-engages usefully at the 12-month mark and beyond; social-engagement with a 45 to 90 day half-life.
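Reading "half-life" in its usual exponential sense, a signal's weight is 0.5 raised to the power of its age divided by the source's half-life. The sketch below picks one value inside each range given above; those specific picks are assumptions to be tuned against your own data, and CRM touch is left out because the text gives it no numeric half-life.

```python
# Half-lives in days. Each value is a point chosen inside the range the text
# gives; the specific picks are assumptions.
HALF_LIFE_DAYS = {
    "web_analytics": 45,   # text: 30 to 60
    "trial_usage": 21,     # text: 14 to 30
    "paid_usage": 135,     # text: 90 to 180
    "inbound_form": 75,    # text: 60 to 90
    "newsletter": 90,      # text: 90
    "social": 60,          # text: 45 to 90
}


def decay_weight(age_days, half_life_days):
    """Exponential decay: the weight halves every half_life_days."""
    return 0.5 ** (max(age_days, 0) / half_life_days)


def decayed_score(events, half_lives=HALF_LIFE_DAYS):
    """events: iterable of (source, magnitude, age_days) tuples."""
    return sum(m * decay_weight(age, half_lives[src]) for src, m, age in events)


total = decayed_score([
    ("web_analytics", 2.0, 45),  # one half-life old: contributes 1.0
    ("newsletter", 1.0, 0),      # fresh: contributes 1.0
])
```

Under the 45-day assumption, the 8-day-old pricing-page visit from the example above keeps a weight of roughly 0.88, while the 18-month-old visit (about 540 days, or twelve half-lives) falls to roughly 0.0002.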

The operational discipline is a 90-day refresh: every contact has their per-source signals recomputed every 90 days with decay weights applied. Contacts whose composite score has decayed below threshold drop out of the active queue and into a re-engagement segment. The discipline is the difference between a warehouse that compounds value and a warehouse that calcifies into stale noise within two quarters.
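The demotion step of the refresh reduces to a threshold split over the recomputed scores. A minimal sketch, assuming the composite decayed score per contact is already available as a dictionary; the threshold value is a parameter you would choose, not one the text specifies.

```python
def refresh_queue(scores, threshold):
    """Split contacts into the active queue and the re-engagement segment.

    scores: {email: composite decayed score}.
    Returns (active, re_engage); active is ordered highest score first.
    """
    active = sorted(
        (email for email, s in scores.items() if s >= threshold),
        key=lambda email: scores[email],
        reverse=True,
    )
    re_engage = sorted(email for email, s in scores.items() if s < threshold)
    return active, re_engage


active, re_engage = refresh_queue(
    {"a@example.com": 2.0, "b@example.com": 0.3, "c@example.com": 1.0},
    threshold=1.0,
)
```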

Where this connects — intent data (Chapter 6)

The first-party warehouse sits upstream of any meaningful integration of third-party intent data. The intent providers (Chapter 6) sell behavioral signal observed across the public web (search activity, technographic shifts, hiring patterns, funding events) that the company does not see directly. The signal is useful at the margin, particularly for accounts that have not yet appeared in the first-party estate, but it is categorically lower quality: modeled rather than observed, shared across every customer of the provider, and empirically in the 40 to 70% reliability range against ground truth.

The correct architecture layers third-party intent on top of the first-party warehouse, not as a substitute. First-party signal sets the baseline prioritization; third-party contributes a marginal adjustment for accounts where first-party signal is absent or stale. Chapter 6 treats the per-vendor differential in detail.
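One way to encode "baseline plus marginal adjustment" is to let third-party intent add a capped top-up only when the first-party score is weak, and to break ties in favor of first-party. The function names, the stale_below cutoff, and the cap are all hypothetical parameters chosen for illustration.

```python
def blended_priority(first_party, third_party, stale_below=1.0, cap=0.5):
    """First-party sets the baseline; third-party only tops up weak accounts."""
    if first_party >= stale_below:
        return first_party  # fresh first-party signal: vendor data is ignored
    return first_party + min(max(third_party, 0.0), cap)


def rank_accounts(accounts):
    """accounts: {domain: (first_party_score, third_party_score)}.

    Sorted by blended priority, with ties resolved toward the higher
    first-party score.
    """
    return sorted(
        accounts,
        key=lambda d: (blended_priority(*accounts[d]), accounts[d][0]),
        reverse=True,
    )


ranking = rank_accounts({
    "fresh.com": (3.0, 0.0),
    "stale.com": (0.5, 0.9),        # topped up to 1.0
    "vendor-only.com": (0.0, 0.9),  # capped at 0.5
    "tie.com": (1.0, 0.0),          # ties with stale.com, wins on first-party
})
```

The cap is the design choice that matters: it bounds how far modeled vendor signal can move an account, so a vendor-only account cannot outrank one with strong observed behavior.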

Common operator failures observed in production

Ignoring the first-party estate entirely. The operator opens a third-party purchase without auditing what the company already has. The per-source ranking table is unexamined, the warehouse is unbuilt, the conversion rate is the conversion rate of a cold list. The single most frequent failure pattern.
Treating all signals as equal weight. A single binary "has-signal" flag is assigned to every contact. A pricing-page visitor and a newsletter open score identically. The targeting layer cannot distinguish high-intent from incidental, and the conversion rate flattens to the average rather than concentrating on the top-ranked.
No decay model. The warehouse is built, signals are weighted, but the timestamp is not applied to the score. A 2-year-old signal is treated equivalently to a 2-week-old. The active queue is contaminated with stale contacts and the operator concludes — incorrectly — that the warehouse approach does not work.
Routing newsletter dormants into cold outbound. The 180-day dormants are added to the active list, spam complaints aggregate, sending reputation degrades, and the deliverability of the high-intent cohort regresses alongside.
Failing to route conference attendees. The booth-scan list is loaded into a marketing-automation tool for a single follow-up email; the names are never routed to the sales outbound queue with the event-attendance flag attached. The highest-conversion warm-outbound source in the quarter is invisible to the operator.
Suppressing the replied-but-didn't-convert cohort. The CRM marks any prior reply as "engaged" and excludes the contact from future sequences. The cohort with the highest reply rate per dollar of effort is permanently excluded.
Treating the visitor-identification match rate as a coverage problem. The operator evaluates a vendor at a 30% match rate, concludes coverage is too low, and does not deploy. The 30% that matches is, in production, among the highest-converting outbound sources in the stack. The wrong metric was optimized.

Pre-deployment checklist

Every first-party system catalogued by name, owner team, and export mechanism: web analytics, product telemetry, CRM, support ticketing, marketing automation, event registration, social platforms
A unified contacts and signal_events schema deployed to a SQL store accessible to the prospecting pipeline
Per-source ingest jobs running on at least a daily cadence, with monitoring on row-count delta to detect a stalled ingest
Per-source signal weights documented and version-controlled, with explicit rationale
A decay model applied to every per-source weight, with the half-life parameter exposed in configuration
A materialized contact_scores view refreshed at least daily, with audit columns showing per-source contribution
A 90-day refresh job that demotes contacts whose composite score has decayed below threshold to a re-engagement segment
An explicit cross-reference to third-party intent integration (Chapter 6), with first-party signals scored above third-party in any tie
A documented suppression rule for newsletter dormants beyond 180 days, enforced at the sender
A measurement loop that attributes booked meetings back to the per-source signal that contributed, refining weights quarterly
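The row-count monitoring item in the checklist can begin as a comparison of two daily snapshots. This is a sketch with a hypothetical input shape (per-source row counts keyed by source name); a source that is legitimately quiet for a day will also be flagged, so the output is a prompt to investigate rather than proof of a failed ingest.

```python
def stalled_sources(counts_previous, counts_current):
    """Return sources whose row count did not grow since the previous check.

    Both arguments are {source: total rows ingested so far}.
    """
    return sorted(
        source
        for source, before in counts_previous.items()
        if counts_current.get(source, 0) <= before
    )


stalled = stalled_sources(
    {"web": 100, "crm": 50, "support": 10},
    {"web": 140, "crm": 50},  # crm flat, support missing from today's snapshot
)
```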

Where this fits in the cluster

Chapter 1 deconstructs the closed-won base — the densest, smallest, highest-quality first-party source the company operates. This chapter expands the aperture to the broader estate around it: every prospect-identifying behavior the company has observed but not yet routed to the outbound pipeline. Chapter 3 takes the ICP refined by this analysis and treats it as a falsifiable hypothesis, tested against live outbound reply data on a weekly cycle. Chapter 6 layers third-party intent on top of the first-party baseline. The warehouse is the substrate on which every subsequent chapter operates.

Related chapters

How to Build Your ICP From Closed-Won Customers — the densest first-party source in the company's data.
Buyer Intent Data — Where the Signal Actually Is — the third-party layer that pairs with first-party.
ICP Hypothesis Testing — Falsifying the Definition — turning the signal into a falsifiable ICP.
Operational List Management — Hygiene at Production Scale — the 90-day decay discipline applied to first-party signal.
