Most lead scoring models are guesses dressed up as math. Someone in a room decides a demo request is worth +10, a webinar signup +5, an enterprise title +15, and the numbers go live without anyone ever checking them against a single deal that actually closed. A scoring model that works is built backward from closed-won, not forward from opinion: it reads fit and intent together, weights each input by how much it actually separates buyers from tire-kickers, and gets checked against who really converted before any rep ever sees a score.
Fit without intent ranks a perfect-profile account that will never buy. Intent without fit ranks a curious student downloading your whitepaper. You need both, multiplied, then validated. This is how to build it.
What you need before you start
- A Clay account.
- A list of your last 30 to 50 closed-won deals, plus a roughly equal set of closed-lost or no-decision leads to score against.
- Enriched lead data (firmographics, seniority, tech stack, and any intent signals you capture).
- A working ICP hypothesis, even an imperfect one.
You do not need a perfect model before you begin; you build a rough version, score historical outcomes, and tighten the weights until the scores line up with reality. The upstream half of this (capturing form fills, enriching them, and routing the scored result to a rep in minutes) is the inbound lead qualification workflow; this article builds the scoring engine that workflow runs on.
Step 1: Define your ICP from pain, not demographics
A demographic-first ICP is the first guess everyone gets wrong. Teams open a doc and start typing filters: 100 to 500 employees, North America, SaaS, $10M+ revenue. The list looks reasonable and predicts almost nothing, because it never answers why any of those traits would make someone buy.
The right starting point is the buying motivation underneath the firmographics. Before you write a single filter, answer five questions about who actually feels the problem you solve.
[Interactive exercise: answer five questions about who actually buys, and your answers build an ICP hypothesis split into fit inputs and intent inputs.]
An ICP built from buying motivation produces concrete, scoreable inputs that already split cleanly into fit and intent. An ICP built from demographics produces filters that predict nothing.
The payoff of starting here is that every answer becomes a measurable input you can score later. "RevOps leaders at companies that just raised" is not a vibe; it is a seniority check, a department match, and a funding-recency signal. Your ICP is a hypothesis, not a verdict. You will refine it against closed-won in Step 5, but it has to start from motivation or there is nothing real to refine.
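To make that concrete, here is a minimal Python sketch of how the one ICP statement above breaks into three separate checks. The field names, the set of tiers that count as "leader," and the 180-day window for "just raised" are all illustrative assumptions, not values from Clay or from any real dataset.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Lead:
    """Hypothetical lead record; the field names are illustrative only."""
    seniority_tier: str                 # e.g. "VP", "Director", "Manager"
    department: str                     # e.g. "Revenue Operations"
    last_funding_date: Optional[date]   # None when no funding data was found


LEADER_TIERS = {"C-level", "VP", "Director"}  # assumed definition of "leader"
FUNDING_WINDOW_DAYS = 180                     # assumed meaning of "just raised"


def icp_checks(lead: Lead, today: date) -> dict:
    """Turn 'RevOps leaders at companies that just raised' into three booleans."""
    department = lead.department.lower()
    recently_funded = (
        lead.last_funding_date is not None
        and 0 <= (today - lead.last_funding_date).days <= FUNDING_WINDOW_DAYS
    )
    return {
        "seniority_match": lead.seniority_tier in LEADER_TIERS,
        "department_match": "revenue operations" in department or "revops" in department,
        "funding_recent": recently_funded,
    }


example = Lead("VP", "Revenue Operations", date(2024, 3, 1))
print(icp_checks(example, today=date(2024, 6, 1)))
```

Each key in the returned dictionary is a candidate scoring input, which is what lets you weight and validate them individually later.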
Step 2: Separate fit from intent, then multiply them
A single number that mashes fit and intent together hides the one thing a rep needs to know. Fit answers "should we ever sell to this account," and intent answers "are they ready right now"; collapsing both into one additive score loses the distinction that decides what a rep does next. A perfect-fit account with zero intent is a nurture target. A scrappy account showing five buying signals is a call today. The same total score, two completely different plays.
The fix is to score the two dimensions on separate axes and let an account's position decide its action, rather than summing everything into one pile of points.
[Interactive exercise: place five leads on the fit-times-intent grid and see why a single score hides the play. Example lead: VP RevOps, 220 employees, ideal stack, no recent activity.]
Fit and intent are independent axes, and an account's quadrant, not its total point count, is what tells a rep whether to call, nurture, qualify, or drop it.
In practice you can keep two sub-scores, Fit and Intent, each on a 0 to 100 scale, and combine them as a product rather than a sum so that a near-zero on either axis pulls the whole lead down. A lead at 90 fit and 10 intent is not the same as 50 and 50: both sum to 100, but the products are 900 and 2,500, and multiplying keeps that difference honest in a way addition never does.
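The difference between summing and multiplying is easy to see in a few lines of Python. This is a sketch under stated assumptions: dividing the product by 100 to keep the result on a 0 to 100 scale is a convenience choice, the cutoff of 50 is a placeholder, and the quadrant-to-action labels are one assumed reading of the call, nurture, qualify, or drop split that you should adapt to your own plays.

```python
def combined_score(fit: float, intent: float) -> float:
    """Multiply two 0-100 sub-scores and rescale the product back to 0-100."""
    return fit * intent / 100


def quadrant(fit: float, intent: float, cutoff: float = 50) -> str:
    """Map a lead to an action by quadrant (assumed labels and cutoff)."""
    if fit >= cutoff and intent >= cutoff:
        return "call"
    if fit >= cutoff:
        return "nurture"   # strong fit, not ready yet
    if intent >= cutoff:
        return "qualify"   # active, but fit is unproven
    return "drop"


for fit, intent in [(90, 10), (50, 50)]:
    print(f"fit={fit} intent={intent} sum={fit + intent} product={combined_score(fit, intent)}")

# The example lead with ideal fit and no recent activity:
print(quadrant(fit=90, intent=10))
```

Both leads sum to 100, but the rescaled products are 9.0 and 25.0, so the lopsided lead no longer ties with the balanced one.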
Step 3: Assign weights based on what the data says predicts a purchase
Equal weights are a confession that you have not looked at your data. Most teams give every criterion the same importance because deciding otherwise feels arbitrary, but the whole point of a scoring model is that some signals separate buyers from non-buyers far more sharply than others. Seniority might be the strongest predictor in your data, or it might be tech stack, or funding recency. You do not know until you weight them and test.
Start with a weighted set of fit inputs, set them from your best current read of which signals correlate with closing, and treat those numbers as a draft you will correct in Step 5.
[Interactive exercise: set the dimension weights and watch the lead list re-rank, pushing your real closed-won deals to the top.]
The right weights are the ones that rank your actual closed-won deals near the top. You discover them by tuning against real outcomes, not by debating importance in a meeting.
If pushing your closed-won deals up the list forces you into weights that feel strange, that is the data correcting your instinct, which is exactly what it is for. The criteria themselves come straight from your ICP work in Step 1; the weights are the part you are now letting the outcomes decide.
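A rough Python sketch of that tuning loop is below. The dimension names, the draft weights, and the four sample leads are all invented for illustration; the point is only the shape of the check, which is to score every lead with the current weights and count how many known winners land at the top.

```python
# Draft weights (assumed values for illustration only).
WEIGHTS = {
    "seniority": 0.30,
    "headcount": 0.15,
    "tech_stack": 0.25,
    "funding": 0.20,
    "industry": 0.10,
}


def fit_score(dimension_scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of 0-10 dimension scores, rescaled to 0-100."""
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * dimension_scores[name] for name in weights)
    return round(10 * weighted / total_weight, 1)


def won_in_top(leads: list, top_n: int, weights: dict = WEIGHTS) -> int:
    """Count closed-won leads among the top_n leads ranked by fit score."""
    ranked = sorted(leads, key=lambda lead: fit_score(lead["scores"], weights), reverse=True)
    return sum(1 for lead in ranked[:top_n] if lead["closed_won"])


# Invented historical leads with known outcomes.
leads = [
    {"closed_won": True,  "scores": {"seniority": 9, "headcount": 6, "tech_stack": 8, "funding": 9, "industry": 5}},
    {"closed_won": False, "scores": {"seniority": 4, "headcount": 9, "tech_stack": 3, "funding": 2, "industry": 9}},
    {"closed_won": True,  "scores": {"seniority": 8, "headcount": 5, "tech_stack": 9, "funding": 7, "industry": 4}},
    {"closed_won": False, "scores": {"seniority": 6, "headcount": 8, "tech_stack": 4, "funding": 3, "industry": 8}},
]

print([fit_score(lead["scores"]) for lead in leads])
print("closed-won in top 2:", won_in_top(leads, top_n=2))
```

Changing a weight and re-running `won_in_top` is the whole experiment: keep the change if more of your real winners rise, and revert it if they do not.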
“Clay gave us the ability to define what a great customer looks like on our terms. Not just industry and title, but the signals that actually predict who will buy. Our reps are working better lists, closing faster, and generating 19% more revenue per head.”
Step 4: Build the scoring logic in Clay with deterministic rules plus AI
Not every criterion should be scored the same way. Firmographic checks (headcount band, funding stage, title keywords) are deterministic lookups: a clear rule returns a clear number. Judgment calls (does this messy industry value map to a category we win in, does this job title signal real buying authority despite a weird label) need AI. Build the deterministic 80% with native rules first, then layer an AI formula only over the 20% of cases that rules handle badly, so you spend credits where judgment actually changes the score.
Clay's native Score Row enrichment (Add Enrichment, then Score Row) handles firmographic scoring across up to 15 criteria: you set each factor, a comparison type, keywords, and the points to assign, and Clay returns a number plus its reasoning. Use it for everything a rule can decide cleanly. For the harder inputs, add an AI formula column. A common one is normalizing the industry field, since raw provider data is inconsistent (Ulta Beauty and Nike both come back as "Retail," and different vendors label the same sector "IT," "Software," or "Internet"). Map it to your own categories before you score it.
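If you want to prototype the normalization outside Clay first, a minimal Python sketch looks like the following. It is not how Clay implements anything: the category map and the title patterns are hypothetical and deliberately incomplete, and the "Unmapped" fallback marks the rows a rule cannot decide, which are the ones worth sending to an AI formula.

```python
import re

# Hypothetical mapping from raw provider labels to your own categories.
INDUSTRY_MAP = {
    "it": "Software",
    "software": "Software",
    "internet": "Software",
}

# Ordered patterns: the first match wins. Illustrative, not exhaustive.
SENIORITY_PATTERNS = [
    (r"\b(chief \w+ officer|ceo|cfo|coo|cro|cto|cmo)\b", "C-level"),
    (r"\b(vp|svp|evp|vice president)\b", "VP"),
    (r"\b(director|head of)\b", "Director"),
    (r"\bmanager\b", "Manager"),
]


def normalize_industry(raw: str) -> str:
    """Map a raw industry label to an internal category, or flag it as unmapped."""
    return INDUSTRY_MAP.get(raw.strip().lower(), "Unmapped")


def seniority_tier(title: str) -> str:
    """Return the first seniority tier whose pattern appears in the title."""
    lowered = title.lower()
    for pattern, tier in SENIORITY_PATTERNS:
        if re.search(pattern, lowered):
            return tier
    return "IC"


print(seniority_tier("VP RevOps"), seniority_tier("Vice President, Revenue Operations"))
print(normalize_industry("Internet"), normalize_industry("Retail"))
```

Note that "Retail" comes back as "Unmapped" here on purpose: a flat lookup cannot tell Ulta Beauty from Nike, so that distinction is a judgment call rather than a rule.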
The other input AI handles well is reading a lead's enriched fields and returning a fit score with its reasoning, so a human can audit why the model rated it that way.
You are scoring inbound and sourced leads for fit against our ICP.

Lead data:
- Job title: {{job_title}}
- Seniority tier: {{seniority_tier}}
- Department: {{department}}
- Company: {{company_name}}
- Headcount: {{headcount}}
- Industry (normalized): {{industry_normalized}}
- Tech stack: {{tech_stack_summary}}
- Funding stage: {{funding_stage}}

Score each dimension 0-10, then return JSON only:
{
  "seniority_fit": 0-10,
  "headcount_fit": 0-10,
  "tech_stack_fit": 0-10,
  "funding_fit": 0-10,
  "industry_fit": 0-10,
  "fit_score_0_100": 0-100,
  "strongest_signal": "one sentence: the single trait most like our best customers",
  "weakest_signal": "one sentence: the trait that argues against fit",
  "reasoning": "one sentence explaining the overall fit score"
}

Scoring guidance:
- Seniority: C-level 9-10, VP/Director 7-8, Manager 5-6, IC 3-4, entry 1-2.
- Headcount: 50-500 is our primary band (8-10); outside it scores lower.
- Tech stack: reward a mature stack with 5+ tools in the category.
- Do not invent signals. Use only the data provided. If a field is empty, score that dimension 5 and say so in reasoning.
Test this on 10 to 20 rows before running at scale, and read every output. If a meaningful share look generic or contradictory, tighten the prompt before you proceed. When an AI formula misfires, Clay's "Output is Wrong" button drops you into a flow to fix the logic; if it still resists, the formula is just generated code, so paste it into an AI assistant with your expected-versus-actual outputs and ask why it breaks.
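Reading every output is easier if a script flags the structurally broken ones first. The sketch below assumes you have exported the raw model responses as strings and checks each one against the JSON shape the prompt asks for; the function name and the wording of the problem messages are made up for this example.

```python
import json

DIMENSIONS = ["seniority_fit", "headcount_fit", "tech_stack_fit", "funding_fit", "industry_fit"]
TEXT_FIELDS = ["strongest_signal", "weakest_signal", "reasoning"]


def _is_number_in(value, low, high) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool) and low <= value <= high


def validate_fit_output(raw: str) -> list:
    """Return the problems found in one model response (an empty list means it passed)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    problems = []
    for key in DIMENSIONS:
        if not _is_number_in(data.get(key), 0, 10):
            problems.append(f"{key} missing or outside 0-10")
    if not _is_number_in(data.get("fit_score_0_100"), 0, 100):
        problems.append("fit_score_0_100 missing or outside 0-100")
    for key in TEXT_FIELDS:
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key} missing or empty")
    return problems


sample = json.dumps({
    "seniority_fit": 8, "headcount_fit": 9, "tech_stack_fit": 7,
    "funding_fit": 5, "industry_fit": 6, "fit_score_0_100": 72,
    "strongest_signal": "VP-level RevOps title.",
    "weakest_signal": "Funding stage was empty.",
    "reasoning": "Strong seniority and headcount; funding scored 5 because the field was empty.",
})
print(validate_fit_output(sample))
print(validate_fit_output("Here is the score you asked for"))
```

A check like this only catches malformed responses. Generic or contradictory reasoning still needs a human read, which is why the 10 to 20 row test comes first.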
Step 5: Validate the model against who actually closed
This is the step almost every team skips, and skipping it is why most scoring models quietly lie. A scoring model is unvalidated until you have run it across your historical closed-won and closed-lost leads and confirmed that your best customers actually score high; a model that has never met a real outcome is a guess with decimals. The test is simple: score the deals you already know the answer to, and see whether the model agrees with reality.
Take your last 30 to 50 closed-won deals and a matched set of closed-lost or no-decision leads, run them all through the model, and check the distributions. The bar to clear is that the large majority of your closed-won deals land above your sales-ready threshold. If they scatter all over the range, or your best customers cluster in the middle, the weights from Step 3 are wrong and you go back and re-tune.
[Interactive chart: drag a threshold across your closed-won deals (n=40) and closed-lost / no-decision leads (n=40) to see won captured (recall), precision above the line, and the counts of won and lost deals above the line, then compare to a guessed model.]
A model is only trustworthy once its scores actually separate the deals that closed from the ones that did not, and the threshold you can defend is the one where that separation holds.
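The separation check can be run as a short Python script once you have the two lists of scores. In this sketch the scores are invented, and "above the line" is taken to mean greater than or equal to the threshold; recall is the share of closed-won deals at or above the threshold, and precision is the share of everything at or above it that actually closed.

```python
def threshold_report(won_scores: list, lost_scores: list, threshold: float) -> dict:
    """Summarize how well a threshold separates closed-won from closed-lost scores."""
    won_above = sum(1 for score in won_scores if score >= threshold)
    lost_above = sum(1 for score in lost_scores if score >= threshold)
    total_above = won_above + lost_above
    return {
        "won_above": won_above,
        "lost_above": lost_above,
        "recall": won_above / len(won_scores) if won_scores else 0.0,
        "precision": won_above / total_above if total_above else 0.0,
    }


# Invented scores for illustration; use your own historical deals.
won = [82, 75, 91, 68, 88, 45]
lost = [30, 52, 41, 72, 25, 38]

for threshold in (40, 65, 80):
    print(threshold, threshold_report(won, lost, threshold))
```

Sweeping the threshold this way shows the trade: a low line captures every winner but lets losers through, and a high line is clean but misses real buyers. The defensible threshold sits where both numbers stay acceptable.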
There is a second, generative use of closed-won here. Once you know which companies actually bought, you can find more that look like them: mark a record closed-won in your CRM, trigger a Clay workflow that runs the company through Find Company Lookalikes (or Ocean.io), and write the 10 nearest matches to a new table to enrich and score. The same outcomes that validate your model also become a source for the next list. ElevenLabs built automated scoring on exactly this foundation, scoring every inbound lead so the right ones reached sales faster.
Lift in sales-qualified leads ElevenLabs saw after moving to automated lead scoring in Clay.
Step 6: Operationalize the score and keep it fresh
A validated model is worthless if it lives in a spreadsheet no rep opens. Once the scores separate your winners, the model has to run automatically on every new lead, write its score and reasoning where reps work, and get re-validated on a schedule, because a model that was right last quarter degrades as your market moves. Operationalizing is two jobs: wiring the score into the live flow, and keeping the model honest over time.
For the live flow, the scoring columns you built run on every new row as leads land in Clay, and the score plus its tier and one-line reasoning sync to your CRM as custom fields (ICP Tier, Fit Score, Strongest Signal). Standard CRM fields will not surface the right context, so build the custom fields before the sync runs. The routing and alerting that happen after the score is set (who gets the lead and how fast) are the job of the inbound qualification workflow; the scoring model is what feeds it a number it can trust.
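As a rough picture of what lands in the CRM, here is a small Python sketch that turns a fit score into the three custom fields named above. The sync itself is configured in Clay, not written by hand; the tier labels and the cutoffs of 70 and 40 are placeholders you would replace with thresholds taken from your Step 5 validation.

```python
def icp_tier(fit_score: float) -> str:
    """Bucket a 0-100 fit score into a tier (assumed labels and cutoffs)."""
    if fit_score >= 70:
        return "Tier 1"
    if fit_score >= 40:
        return "Tier 2"
    return "Tier 3"


def crm_fields(fit_score: float, strongest_signal: str) -> dict:
    """Build the custom-field values a rep would see on the CRM record."""
    return {
        "ICP Tier": icp_tier(fit_score),
        "Fit Score": round(fit_score),
        "Strongest Signal": strongest_signal.strip(),
    }


print(crm_fields(79.0, "VP-level RevOps title at a recently funded company. "))
```

Keeping the reasoning next to the number is the design choice that matters: a rep who can see why a lead scored 79 is far more likely to trust the score.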
Keeping it fresh is the part teams forget. Set a recurring date (every quarter is reasonable) to re-run the Step 5 validation on the most recent closed-won and closed-lost deals. Markets shift, your product moves upmarket, a new competitor changes which signals matter. When the recent winners stop scoring high, re-tune the weights. Scoring is not set-and-forget; it is a loop where each quarter of outcomes corrects the next quarter of scores.
Common failure modes
- Building forward from opinion instead of backward from closed-won: A demo request is worth +10 because someone decided it was, not because the data showed demo-requesters close. Always run new weights against historical outcomes before they go live.
- Adding fit and intent into one number: A 90-fit, 10-intent account and a 50-50 account both total 100 when summed, yet they call for completely different plays. Keep the two scores separate and combine them as a product, not a sum.
- Equal weights across every criterion: Flat weighting scatters your real winners through the middle of the distribution. Let the closed-won validation set decide which signals get the points.
- Scoring on raw, un-normalized data: Seniority rules that look for "VP" miss "Vice President," and industry filters miss because providers label the same sector three different ways. Normalize titles and industries before any scoring runs.
- Never re-validating: A model that nailed last year's buyers can quietly drift as your market changes. Re-run the closed-won validation every quarter and re-tune when recent winners stop scoring high.