Salesgraph’s open-source GTM skills — the cleanest research and scoring scaffolds in the field.

Ruhan Ponnada (YC P26) shipped two repos under Salesgraph that solve adjacent halves of the GTM operator’s problem. gtm-research-skills is prospect research as a slash-command pack. call-score is call scoring plus team-level playbook generation as a TypeScript CLI. The first is the cleanest research scaffold we have found anywhere in the open-source skill ecosystem. The second is infrastructure-grade engineering masquerading as a CLI — Zod-validated outputs, prompt-injection sanitization, evidence-quote enforcement. Together they are the most directly copyable open-source GTM IP in the field. We have borrowed from both across customer engagements. Here is what holds up, where it stops short, and how the techniques map to a full GTM loop.

TL;DR

Two Salesgraph repos: gtm-research-skills ships five slash commands (gtm-research, competitor-brief, value-prop, icp, discovery) on top of six prompt templates and a bash installer that symlinks into ~/.claude/commands/. call-score is a Bun/TypeScript CLI with analyze and playbook commands and pluggable Anthropic/OpenAI providers.
The standout technique in the research pack: a fanout-then-synthesize protocol that enumerates forbidden behaviors verbatim — “single search → synthesize → done” and “synthesizing from search snippets without scraping” are explicitly banned in prompts/fanout.md. The synthesize step pairs with a self-check that demands a citation on every claim. This is the most portable technique we have surfaced across the entire skills landscape.
The ICP prompt at prompts/icp.md is the single best ICP prompt the research surfaced — firmographics table, primary/secondary/blocker personas, an explicit anti-ICP section, and the framing “ICP is observed not aspirational.”
call-score ships a YAML pack model — methodology lives in packs/default/lenses/*.yaml (MEDDPICC, Command of the Message) with per-dimension rubric anchors on a 0–3 scale. Ponnada’s tagline: “methodology is data, not code.”
The two-tier architecture is the deeper point. Per-call coaching runs through src/pipeline/coach.ts; team-level playbook generation runs through src/playbook/build.ts. The split is enforced because, in Ponnada’s words, “a single call is enough to coach but not to generalize.”
Engineering discipline: Zod schemas require rationale + verbatim evidence quote + betterMove per dimension; prompt-injection sanitization lives in src/security.ts; the playbook builder only ingests evidence quotes that scored ≥ 2 so generalization happens off strong signal, not noise.
Where the repos stop short: research is upstream of campaigns, call-score is a CLI not a service, no CRM integration, no closed feedback loop from replies back to the scoring rubric, and the YAML packs assume an active maintainer who keeps them current.
What to borrow: the fanout anti-pattern enumeration, the ICP prompt, the lens-pack YAML model, the Zod schema with mandatory evidence quotes, and the coach-vs-playbook two-tier split. What to leave: the assumption that running the CLI weekly is operationally trivial.

Why read Salesgraph

The open-source GTM skill ecosystem is largely vibes. Operators ship slash-command packs that wrap a single prompt, claim methodology, and ship without engineering scaffolding. Ponnada is the rare technical founder whose repos read like production infrastructure rather than a weekend project. call-score has Zod validation, prompt-injection sanitization, evidence-quote enforcement, and a pipeline architecture that separates per-call coaching from cross-call generalization. gtm-research-skills ships an installer, a slash-command directory, a prompt-template directory, and an explicit anti-pattern enumeration that names the modal failure mode of AI research and bans it by name. These are not vibes. They are opinionated, engineered, and load-bearing in a way the rest of the field is not.

The two repos cover the upstream and the discovery-layer scoring sides of the loop. Research drives who you write to. Scoring drives what you learn from the calls that come back. Between them, they cover roughly 70% of what an early-stage operator needs to score discovery and prospect at the same level of rigor. The remaining 30% is the campaign-running layer that sits between research and scoring — the loop that turns one into the other — and that is the layer neither repo touches. Worth reading the repos with that gap in mind from the start.

One upstream point matters for the rest of this chapter. Ponnada bypasses the standard Anthropic Skills convention: no YAML frontmatter, no SKILL.md metadata block. The pack ships as plain markdown invoked through slash commands installed by the bash setup script. The convention skip is intentional: slash commands are more visible in a developer’s workflow than auto-triggered Skills, and the opinionated workflow Ponnada is teaching is easier to teach when the operator has to type the command. The trade-off is portability: the pack does not auto-discover. That is the right call for an opinionated pack and the wrong call for a general utility.

The fanout-then-synthesize protocol

The single most portable technique across either repo lives in prompts/fanout.md. The protocol decomposes a research question into a fan of parallel searches, scrapes the underlying pages rather than reading snippets, and synthesizes the results in a separate step that runs against a self-check. The standout move is the explicit anti-pattern enumeration. The prompt lists, verbatim, the behaviors the model is forbidden from exhibiting:

“single search → synthesize → done”
“Let me first search for X, then I’ll search for Y”
“synthesizing from search snippets without scraping”

The mechanism is psychological as much as procedural. Models default to the cheapest path that produces a plausible-sounding answer, and the cheapest path in research is exactly what Ponnada bans — one search, snippets-only synthesis, the appearance of completeness. Naming the failure mode explicitly forces the model to pause and recognize when it is about to fall into the cheap path. A general instruction to “research deeply” produces no such recognition because the model has no concrete failure pattern to check itself against. The anti-pattern list is the check.
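One way to make the anti-pattern list checkable outside the prompt is to log each research step and audit the trace before synthesis runs. The sketch below is our illustration, not code from the repo: the `Step` shape, the `batch` field, and `auditTrace` are all hypothetical names, and the rule that "every search in its own batch" means sequential searching is an assumption about how the fanout would be logged.

```typescript
// Hypothetical trace of a research run; this shape is not taken from the repo.
type Step =
  | { kind: "search"; query: string; batch: number } // batch: which parallel round issued the query
  | { kind: "scrape"; url: string }
  | { kind: "synthesize" };

type SearchStep = Extract<Step, { kind: "search" }>;

// Returns a label for each banned behavior the trace exhibits.
function auditTrace(steps: Step[]): string[] {
  const searches = steps.filter((s): s is SearchStep => s.kind === "search");
  const scraped = steps.some((s) => s.kind === "scrape");
  const violations: string[] = [];

  // One search (or none) followed by synthesis is the banned cheap path.
  if (searches.length <= 1) {
    violations.push("single search → synthesize → done");
  }
  // If every search sits in its own batch, the queries ran one after another.
  const batches = new Set(searches.map((s) => s.batch));
  if (searches.length > 1 && batches.size === searches.length) {
    violations.push("sequential searches instead of a parallel fan");
  }
  // No scrape step means the synthesis can only have read snippets.
  if (!scraped) {
    violations.push("synthesizing from search snippets without scraping");
  }
  return violations;
}
```

An empty result means the trace avoided all three behaviors. A non-empty result is a reason to rerun the research step, not to patch the brief.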

The synthesize step in prompts/synthesize.md applies the same discipline to the output. The self-check is a literal checklist the model is required to walk through before returning:

[ ] Every claim has a [N] citation
[ ] No claim asserts numbers without a source

The framing the prompt ships with is the operative phrase: “a brief with honest gaps is more useful than hallucinated completeness.” The line is doing work — it explicitly authorizes the model to leave a section blank rather than fabricate content to fill it, which is the modal failure mode of AI-driven research output. Operators who borrow only this line and bolt it onto their existing research prompts see a measurable drop in hallucinated numbers and inferred-but-uncited claims within a day.
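The two checklist items can also be enforced after the model returns. The following is a minimal sketch under stated assumptions: it treats each non-empty line of the brief as one claim, and it uses a hypothetical `[gap]` marker for a declared honest gap. Neither convention comes from prompts/synthesize.md.

```typescript
interface SelfCheckResult {
  uncited: string[];          // claims with no [N] marker
  unsourcedNumbers: string[]; // uncited claims that also contain a digit
  passed: boolean;
}

// Hypothetical convention: a line carrying this marker is a declared gap
// and is exempt from the citation rule.
const GAP_MARKER = "[gap]";

// Assumes one claim per non-empty line of the brief.
function selfCheck(brief: string): SelfCheckResult {
  const claims = brief
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && line.indexOf(GAP_MARKER) === -1);

  const citation = /\[\d+\]/; // matches [1], [12], and so on
  const uncited = claims.filter((claim) => !citation.test(claim));
  const unsourcedNumbers = uncited.filter((claim) => /\d/.test(claim));

  return { uncited, unsourcedNumbers, passed: uncited.length === 0 };
}
```

Exempting declared gaps keeps the checker from penalizing the blank sections the prompt authorizes.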

The protocol is tool-agnostic. The default path runs on pure Claude Code with WebSearch and WebFetch. Optional integrations with Exa MCP, Parallel, and Firecrawl compress runtime by parallelizing the fanout and accelerating scrape latency, but the discipline does not depend on them. An operator can run the protocol on Day 1 with the tooling they already have, which is the property that makes a technique adoptable rather than aspirational.

The ICP prompt

prompts/icp.md is the single best ICP prompt the research has surfaced across the field. The shape:

A firmographics table — industry, headcount range, revenue band, geography, tech stack signal, funding stage — populated from observed customer data, not speculation.
A primary persona, a secondary persona, and a blocker persona. The blocker — the role most likely to kill the deal during procurement, security, or legal review — is the asymmetric inclusion. Most ICP docs name the buyer; almost none name the killer.
An explicit anti-ICP section. The companies, sizes, or buying motions you should refuse to sell to. Naming who you are not for is a discipline that disqualifies bad-fit pipeline before it consumes calendar.
The framing line that anchors the whole prompt: “ICP is observed not aspirational.” The model is instructed to derive the ICP from closed-won data, not from the founder’s ambition. This is the line that separates a useful ICP from a marketing artifact.

The corresponding slash command (commands/icp) runs the prompt against a customer’s closed-won list and the public web. The output reproduces the shape above with citations on each row. The combination of explicit anti-ICP, blocker persona, and observed-not-aspirational framing is what makes this prompt operationally load-bearing rather than rhetorical. Operators who copy nothing else from gtm-research-skills should copy this file.
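For teams that want the ICP output in a structured form, the shape described above can be written down as a type with a gap check. This is a sketch of our own: the field names, the `Cited` wrapper, and `icpGaps` are assumptions for illustration, since the prompt itself produces markdown rather than a schema.

```typescript
// Field names are illustrative; the prompt describes the rows, not a schema.
const FIRMOGRAPHIC_FIELDS = [
  "industry", "headcountRange", "revenueBand",
  "geography", "techStackSignal", "fundingStage",
] as const;
const PERSONA_ROLES = ["primary", "secondary", "blocker"] as const;

interface Cited {
  value: string;
  sources: string[]; // e.g. URLs or closed-won record ids backing the row
}

interface IcpBrief {
  firmographics: Record<(typeof FIRMOGRAPHIC_FIELDS)[number], Cited>;
  personas: Record<(typeof PERSONA_ROLES)[number], Cited>;
  antiIcp: Cited[]; // who the team should refuse to sell to
}

// Lists every row that lacks a source, plus a missing anti-ICP section.
function icpGaps(icp: IcpBrief): string[] {
  const gaps: string[] = [];
  for (const field of FIRMOGRAPHIC_FIELDS) {
    if (icp.firmographics[field].sources.length === 0) gaps.push(`firmographics.${field}`);
  }
  for (const role of PERSONA_ROLES) {
    if (icp.personas[role].sources.length === 0) gaps.push(`personas.${role}`);
  }
  if (icp.antiIcp.length === 0) gaps.push("antiIcp");
  return gaps;
}
```

A row without sources is a candidate for the aspirational content the prompt is designed to keep out.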

The call-score lens architecture

call-score ships a pack model where the methodology lives in YAML rather than code. packs/default/pack.yaml declares the pack. packs/default/call-types.yaml declares thirteen call types split across pre-sales and post-sales — first-touch discovery, technical deep-dive, procurement review, security review, post-implementation QBR, churn-risk call, expansion conversation, and so on. packs/default/lenses/meddpicc.yaml and packs/default/lenses/command-of-message.yaml declare two scoring methodologies as data: each dimension (Metrics, Economic Buyer, Decision Criteria, Decision Process, Paper Process, Identify Pain, Champion, Competition) is a row with per-score rubric anchors describing what a 0, a 1, a 2, and a 3 look like in evidence terms.

Ponnada’s framing is exact: “methodology is data, not code.” The implication is that the pack is editable by a non-engineer — a head of sales or a RevOps lead can fork the default pack, rewrite the MEDDPICC rubric anchors to match how their team actually qualifies, add a custom lens for their specific motion, and re-run the analyzer without touching TypeScript. The shape mirrors how rubrics work in human teams: a shared document, opinionated anchors, room for the team to revise without filing a pull request.
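To make the "data, not code" claim concrete, here is a sketch of how a lens might look once the YAML is loaded, with a renderer that never needs to change when the rubric does. The `Lens` and `Dimension` types and `renderRubric` are our own hypothetical names, and every anchor string except the zero anchor is placeholder wording rather than text from the default pack.

```typescript
type Score = 0 | 1 | 2 | 3;
const SCORES: Score[] = [0, 1, 2, 3];

interface Dimension {
  name: string;
  anchors: Record<Score, string>; // what each score looks like in evidence terms
}

interface Lens {
  id: string;
  dimensions: Dimension[];
}

// Placeholder anchor wording, except the zero anchor, which the article quotes.
const meddpiccExcerpt: Lens = {
  id: "meddpicc",
  dimensions: [
    {
      name: "Metrics",
      anchors: {
        0: "No signal in transcript",
        1: "(placeholder) buyer mentions a goal without numbers",
        2: "(placeholder) buyer states a number the seller surfaced",
        3: "(placeholder) buyer ties a quantified outcome to the purchase",
      },
    },
  ],
};

// Lens-agnostic: editing the data changes the rubric; this function stays the same.
function renderRubric(lens: Lens): string {
  return lens.dimensions
    .map((d) => [`${d.name}:`].concat(SCORES.map((s) => `  ${s} = ${d.anchors[s]}`)).join("\n"))
    .join("\n\n");
}
```

A forked pack with rewritten anchors or an extra lens would flow through the same renderer, which is the property that lets a non-engineer own the methodology.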

The scoring is anchored to 0–3 per dimension with explicit rubric language for each score. The Zod schema enforces the shape of every coaching output:

A score on the 0–3 scale.
A rationale sentence explaining the score.
A verbatim evidence quote from the transcript supporting the score.
A betterMove describing what the seller should have done differently.

The schema is doing the work that prompt-only systems leave to the model’s discretion. The schema rejects an output that is missing the evidence quote, which forces the model to find one or to score the dimension as zero. The dedicated zero anchor — “No signal in transcript” — is the hard floor. The model cannot invent signal where the transcript contains none, because the only legal output in that case is a zero with the “No signal in transcript” rationale. This is the right shape for any audit deliverable. Evidence-anchored 0–3 scoring with a no-fabrication zero floor is the structural property that produces an audit a customer can trust six weeks later.
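The contract described above can be sketched without any dependencies. The repo uses Zod; this hand-rolled `parseDimensionResult` is a hypothetical stand-in that shows the same idea, and the rule that a quote may be empty only at score zero is our reading of the description, not a confirmed detail of the actual schema.

```typescript
interface DimensionResult {
  score: 0 | 1 | 2 | 3;
  rationale: string;
  evidenceQuote: string; // verbatim from the transcript; empty only when score is 0
  betterMove: string;
}

const nonEmpty = (v: unknown): v is string => typeof v === "string" && v.trim().length > 0;

// Throws on any output that breaks the contract, the way a schema parse would.
function parseDimensionResult(raw: unknown): DimensionResult {
  if (typeof raw !== "object" || raw === null) throw new Error("result must be an object");
  const { score, rationale, evidenceQuote, betterMove } = raw as Record<string, unknown>;

  if (typeof score !== "number" || [0, 1, 2, 3].indexOf(score) === -1) {
    throw new Error("score must be 0, 1, 2, or 3");
  }
  if (!nonEmpty(rationale)) throw new Error("rationale is required");
  if (!nonEmpty(betterMove)) throw new Error("betterMove is required");

  // The no-fabrication floor: without a quote, the only legal score is zero.
  const quote = typeof evidenceQuote === "string" ? evidenceQuote.trim() : "";
  if (score > 0 && quote === "") {
    throw new Error("a non-zero score needs a verbatim evidence quote");
  }
  return { score: score as DimensionResult["score"], rationale, evidenceQuote: quote, betterMove };
}
```

Rejecting at parse time means a missing quote surfaces as a failed run rather than as a confident score nobody can audit.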

The two-tier coach-vs-playbook architecture

The deeper architectural move in call-score is the split between per-call coaching and cross-call playbook generation. The two run through different code paths and against different inputs.

src/pipeline/coach.ts runs against a single transcript and produces per-dimension coaching for the rep who took the call. The output is granular, evidence-quoted, and actionable on the next call. The shape is what a sales coach would write up after listening to one recording — score the dimensions, point at what got dropped, prescribe the better move.

src/playbook/build.ts runs against an aggregated set of analyzed calls and produces team-level patterns. The output is the cross-call generalization: which dimensions the team is consistently strong or weak on, what high-scoring evidence quotes have in common, which call types the team handles well and which it fumbles. The shape is what a sales leader would produce after listening to twenty calls: the recurring pattern, not the single instance.

Ponnada’s rule for the split is exact: “a single call is enough to coach but not to generalize.” A per-call output that tries to teach team-level patterns is overfitting to one transcript. A cross-call output that tries to coach a specific rep’s last call is too noisy to be useful. The two outputs are different artifacts and the code path enforces the separation.

The detail that makes this architecture load-bearing: the playbook builder ingests only evidence quotes that scored ≥ 2 from the per-call analyses. The evidence-digest filter is a hard threshold that ensures generalization happens off strong signal, not off the noise of low-scoring dimensions. A 0 or 1 evidence quote is a marker that the rep failed to surface signal on that dimension; aggregating those quotes into the playbook would teach the team patterns of failure rather than patterns of strength. The filter is the discipline that makes the playbook output useful.
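The threshold filter is simple enough to sketch in a few lines. The types and the `buildEvidenceDigest` name below are hypothetical and do not reproduce src/playbook/build.ts; only the score ≥ 2 cutoff comes from the description above.

```typescript
interface ScoredEvidence {
  callId: string;
  dimension: string;
  score: 0 | 1 | 2 | 3;
  evidenceQuote: string;
}

// Keeps only strong-signal quotes and groups them by dimension for the playbook step.
function buildEvidenceDigest(
  results: ScoredEvidence[],
  minScore = 2,
): Record<string, string[]> {
  const digest: Record<string, string[]> = {};
  for (const r of results) {
    if (r.score < minScore) continue; // 0s and 1s mark missed signal, not a pattern to teach
    if (!digest[r.dimension]) digest[r.dimension] = [];
    digest[r.dimension].push(r.evidenceQuote);
  }
  return digest;
}
```

A dimension with no entry in the digest is itself a finding: across the window, nobody on the team surfaced strong signal there.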

Two more engineering details are worth flagging. Prompt-injection sanitization lives in src/security.ts and runs on every transcript before it reaches the model. Transcripts include verbatim buyer text, so a sufficiently sophisticated buyer could embed instructions in the transcript that hijack the scoring prompt. The sanitization is not theoretical; it is the right discipline for any deliverable that ingests untrusted text. And the playbook prompt itself ships with a hard-coded constraint: “Only use quotes present in the data... Do not fabricate.” The constraint is in the prompt and the Zod schema would reject a fabricated quote at parse time. Belt and suspenders.
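For readers building their own ingest, a minimal sanitization pass might look like the following. It does not reproduce src/security.ts: the `<transcript>` delimiter, the pattern list, and `sanitizeTranscript` are all assumptions of ours, and a short deny-list like this is a tripwire rather than a complete defense.

```typescript
// Illustrative patterns only; a real list would be broader and would still miss cases.
const SUSPICIOUS: RegExp[] = [
  /ignore (all )?(previous|prior|above) instructions/i,
  /disregard (the )?(rubric|instructions)/i,
  /you are now/i,
];

interface Sanitized {
  prompt: string;    // transcript wrapped in delimiters, ready to embed in a prompt
  flagged: string[]; // lines a human should review before trusting the score
}

function sanitizeTranscript(raw: string): Sanitized {
  // Remove anything resembling our delimiter so buyer text cannot close the block early.
  const body = raw.replace(/<\/?transcript>/gi, "[tag removed]");
  const flagged = body
    .split("\n")
    .filter((line) => SUSPICIOUS.some((pattern) => pattern.test(line)));
  return { prompt: `<transcript>\n${body}\n</transcript>`, flagged };
}
```

Flagged lines are returned rather than deleted so the transcript stays verbatim for evidence quoting.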

Where Salesgraph stops short

Both repos are rigorous on the scoped problem and silent on the operating loop. Five gaps an operator hits the moment they try to run either repo in production:

Research is upstream of campaigns. gtm-research-skills produces excellent briefs. Briefs do not produce replies. Without the campaign-running layer underneath — sequencing, deliverability, reply routing, persistence across touches — prospect research is just better-targeted spam. The operator who installs the research pack and continues writing outbound by hand sees calibration improvement on the briefs and no change in pipeline.
call-score is a CLI, not a service. Running the analyzer on one transcript is straightforward. Running it weekly across a team (twenty transcripts every Monday, with the playbook builder running against the rolling thirty-day window) requires infrastructure Salesgraph does not ship. A Bun CLI on a founder’s laptop is not a team workflow. The teams that get value out of the repo are the ones who wrap it in a scheduler, a transcript ingest, and a notification layer they build themselves.
No CRM integration. Scores live in JSON files until someone routes them. The dimensions and evidence quotes are exactly the shape that would update an opportunity record, qualify a deal stage, or trigger a reroute, and none of that happens because the repo terminates at the JSON output. The integration layer is the operator’s problem.
No closed feedback loop from replies to scoring rubric. The lens packs are static. A reply that says “your scoring rubric overweighted Champion and underweighted Paper Process” (empirically true the moment you start running call-score in production) produces no mechanism for refining the rubric. The YAML stays as it was checked in. The framework prescribes no path for letting outcomes update the methodology, and without that path the methodology drifts away from reality within a quarter.
The YAML pack model assumes an active maintainer. “Methodology is data, not code” is true and operationally lethal when there is no one updating the data. A team that forks the pack on Day 1 and ships it untouched is running rubrics that match their March selling motion six months later when the motion has shifted. Without a maintenance cadence, the editable YAML degrades faster than hard-coded methodology would, because the team assumes it can be updated without realizing no one is updating it.
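Part of the glue an operator ends up writing for the weekly run is the window selection itself. The sketch below shows one way to pick the rolling thirty-day set before calling the playbook step. `AnalyzedCall` and `rollingWindow` are hypothetical names, and where the analyzed outputs are stored is left out entirely.

```typescript
interface AnalyzedCall {
  callId: string;
  recordedAt: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns the calls recorded in the `days` before `now`, newest first.
function rollingWindow(calls: AnalyzedCall[], now: Date, days = 30): AnalyzedCall[] {
  const end = now.getTime();
  const start = end - days * DAY_MS;
  return calls
    .filter((c) => c.recordedAt.getTime() >= start && c.recordedAt.getTime() <= end)
    .sort((a, b) => b.recordedAt.getTime() - a.recordedAt.getTime());
}
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the Monday job reproducible when it has to be rerun for a past week.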

These are not criticisms of Salesgraph. Both repos are scoped at the methodology and engineering layer where Ponnada’s edge sits — the layer the rest of the open-source skill ecosystem skips. The gaps are operational and they are the operator’s to close.

How Allston applies it

The architecture we run across customer engagements borrows several Salesgraph techniques wholesale, with attribution.

The fanout-then-synthesize protocol runs inside Skylarq’s company-research flow. The anti-pattern enumeration from prompts/fanout.md is the discipline we enforce on every prospect-research run. The synthesize-step self-check — every claim a citation, no fabricated numbers, honest gaps over hallucinated completeness — is the discipline we enforce on every customer-facing brief. Credit to Ponnada in the internal documentation.
The YAML lens pattern powers our customer audit deliverables. Each audit dimension is a lens. Each finding is anchored to a 0–3 score with verbatim evidence from the customer’s campaigns, transcripts, or systems. The Zod-schema discipline (every score has a rationale, a verbatim evidence quote, and a betterMove) is the shape we ship audits in.
The two-tier coach-vs-playbook architecture is the right shape for team-level synthesis. Per-customer coaching runs against the individual engagement. Cross-customer playbooks run against the aggregated evidence digest filtered to score ≥ 2. The split prevents the per-engagement noise from polluting cross-engagement generalization, and prevents cross-engagement generalization from being applied too granularly to a specific engagement’s situation.
The engineering discipline (Zod validation, evidence-quote enforcement, prompt-injection sanitization) we have borrowed wholesale. Any deliverable that ingests buyer-authored text — transcripts, replies, RFP responses — flows through the sanitization layer. Any output that asserts a finding requires a verbatim quote in the schema. The pattern is too cheap to skip and too costly to skip well.
The campaign-running layer fills the gap Salesgraph leaves. The research briefs flow into the outbound sequencing. The call scores flow back into the ICP refinement. The reply data flows back into the lens packs — the rubric anchors get updated against actual outcomes on a monthly cadence so the methodology does not drift. The loop closes in days, not quarters.

Operator failures observed when adopting Salesgraph patterns

Running gtm-research-skills without scraping. The anti-pattern Ponnada explicitly bans in prompts/fanout.md is “synthesizing from search snippets without scraping.” Operators violate it within a day. The model is faster on the snippets-only path, the output looks complete, and the operator does not notice the brief is fabricated until a customer-facing send goes out with a wrong number. The fix is the literal anti-pattern enumeration, plus the discipline of treating a violation as a bug rather than a shortcut.
Treating call-score as a deliverable, not a pipeline. Running the analyzer on one transcript produces an interesting report. Running it weekly across twenty transcripts produces a coaching system. The first is a snapshot; the second is the value. Operators who ship a single-call analysis to their team as a one-time artifact see the analysis read once and ignored thereafter.
Lens packs that drift. The YAML lens files are checked in once, never updated against actual outcomes. Six months later the team is scoring against a rubric that matches their March selling motion, not their current one. The fix is a monthly maintenance cadence where the rubric anchors get revised against the highest-scoring and lowest-scoring evidence quotes from the prior 30 days.
Skipping the prompt-injection sanitization. Transcripts include verbatim buyer text. A sufficiently sophisticated buyer (or a sufficiently random transcript artifact) can include a phrase that hijacks the scoring prompt. The repo ships src/security.ts for exactly this reason and operators who fork the repo and route transcripts around the sanitization layer rediscover the problem the hard way.
Running the playbook builder on too few calls. The framework documents 20+ calls as the rough threshold for the playbook to produce signal rather than noise. Operators run it on three transcripts to “test the output” and get a playbook that overfits to those three calls. The output reads coherent because the model is good at producing coherent output, and the operator ships the playbook before realizing it generalizes from nothing. The threshold is real; treat the under-threshold output as instrumented noise, not as a deliverable.
Fabricating evidence quotes. The Zod schema demands verbatim quotes from the transcript. Operators paraphrasing for “readability” lose the audit trail that makes the score reproducible. The discipline is binary: the quote is verbatim or the dimension scores zero. There is no middle ground that preserves the audit property.
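The last two failures can be guarded mechanically. The sketch below checks that a quote actually appears in the transcript and that the playbook step has enough calls to work with. Both helpers are hypothetical and not part of call-score; the whitespace normalization is our assumption about what "verbatim" should tolerate, and the default of 20 follows the rough threshold cited above.

```typescript
// Collapse whitespace so line wrapping in a transcript export does not cause false rejects.
const normalize = (s: string): string => s.replace(/\s+/g, " ").trim();

// True only if the quote appears word for word in the transcript.
function isVerbatim(quote: string, transcript: string): boolean {
  const q = normalize(quote);
  return q.length > 0 && normalize(transcript).indexOf(q) !== -1;
}

// Gate for the playbook step; below the threshold the output is noise, not a deliverable.
function playbookReady(analyzedCalls: number, minCalls = 20): boolean {
  return analyzedCalls >= minCalls;
}
```

A paraphrased quote fails `isVerbatim` even when it preserves the meaning, which is the point: the audit trail depends on the exact words.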

Salesgraph-adoption checklist

The gtm-research-skills repo is installed via the bash setup script, the slash commands are symlinked into ~/.claude/commands/, and the operator has run gtm-research on at least one prospect end-to-end
The fanout protocol’s anti-pattern enumeration is read literally and the operator has caught themselves about to violate it at least once (the catch is the proof the discipline is internalized)
The ICP prompt (prompts/icp.md) has been run on the current target list and the output has been compared against the pre-adoption ICP doc — segments added, dropped, or reweighted on the basis of the observed-not-aspirational framing
The call-score CLI is installed, the default pack is forked, and the lens rubric anchors have been customized to the team’s actual sales motion (MEDDPICC and Command of the Message defaults are starting points, not finished states)
Transcripts are scored through call-score analyze on a weekly cadence (at minimum the prior week’s first-touch and discovery calls), with outputs persisted, not discarded
Zod-validated evidence quotes are enforced on every output and the team treats a missing quote as a hard reject, not a soft warning
Prompt-injection sanitization (src/security.ts) is on the path for every transcript ingest, including transcripts from third-party sources (Gong, Fathom, Granola exports)
The playbook builder (call-score playbook) runs only on a rolling window of 20+ analyzed calls and only on the evidence digest filtered to score ≥ 2
The lens YAML files have a maintenance owner and a monthly cadence — rubric anchors are revised against the prior month’s highest- and lowest-scoring evidence
The campaign-running layer underneath research is built, contracted, or honestly acknowledged as missing — the briefs are inputs to a loop, not outputs of one

Where this fits

Salesgraph is one chapter in our framework analysis series, and it sits in a specific position relative to the others. Rob Snyder’s PULL framework is the methodological foundation. call-score does not include it in the default lens packs, but it is the natural fork: a lenses/pull.yaml with Project, Unavoidable, Looking, Lacking on the 0–3 scale slots into the existing pack architecture without code changes. The two together are the discovery-layer pair we recommend to operators starting from scratch.

Richard Makara’s GTM Context OS is the operational scaffold that sits one layer above either Salesgraph repo: the directory structure and slash-command pack that holds the broader GTM context the research and scoring outputs feed into. Our ICP hypothesis testing protocol is the campaign-layer complement to the Salesgraph ICP prompt: the falsifiable testing loop that keeps an observed-not-aspirational ICP current against drift.

Related chapters

Rob Snyder’s PULL framework: the methodological foundation and the natural lens fork for call-score.
Richard Makara’s GTM Context OS: the operational scaffold one layer above the Salesgraph repos.
ICP hypothesis testing: the campaign-layer complement to the Salesgraph ICP prompt.
The principles of cold copy: the per-touch copy layer that turns a Salesgraph research brief into a reply.

Run the loop

Allston Labs runs Salesgraph's research protocol and scoring lenses inside a live campaign loop.

We install the fanout-then-synthesize protocol on every prospect, score every transcript through evidence-anchored lens packs, route the call scores back into ICP refinement and the lens packs back against actual reply outcomes, and run the campaign layer Salesgraph doesn't ship. The methodology is open source. The loop is operations.