
Clay GTM guide

How to Find and Remove Duplicate Contacts

The duplicate contacts that hurt you don't look like duplicates. Find the ones exact matching misses, merge them without losing data, and stop new ones from forming on the next import.

May 11, 2026 · 9 min read

The duplicate contacts that hurt you don't look like duplicates. Your CRM's built-in tool catches john.smith@acme.com listed twice. It does not catch j.smith@acme.com and john.smith@acme.com sitting in two records, or "Acme Inc" filed separately from "Acme Corp," or the same person who filled out a form as "Jon" in 2023 and "Jonathan" last quarter.

Those are the records that split a rep's history in half, double your account count, and route the same prospect to two SDRs. The consumer "merge all duplicates" button that dominates every search result solves exact-match collisions on a phone. It does nothing for the near-matches that make up most of the duplication in a B2B database. This is how to find the duplicates exact matching misses, merge them without losing data, and stop new ones from forming on the next import.

Step 1: Understand why duplicate contacts form in the first place

Duplicates are a side effect of having more than one way into your CRM. A contact enters from a form fill, then again from a trade-show list upload, then a third time when a rep adds them by hand after a call. Each entry uses a slightly different spelling, a different email alias, or a different company name, so the CRM treats them as three people. The problem is not that anyone made a mistake. It is that nothing normalized the values before they landed.

Once you see the entry points, the pattern is obvious: the same human or account arrives through channels that never agree on formatting. A name is "Bob" in one source and "Robert" in another. A company is "Acme, Inc." in a purchased list and "Acme" in a self-reported form. An email is personal in a PLG signup and work-domain in a sales-entered record. None of these match on a literal string compare, so every native dedupe tool walks right past them.

[Interactive demo: add each record and watch four sources become one person. First record shown: webinar form, Name: Bob Lee, Email: bob@acme.io, Company: Acme. Counters: 1 record created, 1 real person.]

Duplicates form because every entry channel formats the same person differently, so the CRM stores one human as several records.

The takeaway is that you cannot prevent variation at the source without rejecting half your inbound. You can only reconcile it after. That reconciliation is a matching problem, and it is the next step.

Step 2: Find the duplicates exact matching misses with fuzzy matching

Clay's own Dedupe is exact-match, case- and whitespace-sensitive, on a single column, so to catch near-matches you build a workflow: normalize the keys with formatters, compare with a Lookup Rows match and an AI similarity score, then pick the surviving values. That catches the easy 10% the exact-match tools find, then surfaces the duplicates that actually cost you.

Bring your contacts into Clay from HubSpot or Salesforce through the native integration, no CSV export required. Then build the comparison keys. Normalize the email by lowercasing it and stripping dots and plus-tags from the local part, so j.smith@acme.com reduces to jsmith@acme.com and John.Smith+crm@ACME.com reduces to johnsmith@acme.com. The two keys now share a domain and a surname but still differ, which is why the email key is paired with a name-similarity check rather than used alone. Normalize the company name by lowercasing and dropping legal suffixes (Inc, LLC, Corp, Ltd) and punctuation, so "Acme, Inc." and "Acme Corp" both reduce to acme. Match people on the normalized email plus a name-similarity check, and match accounts on the normalized domain. Two records that looked unrelated to the CRM now collapse into one match group.
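The normalization rules described here can be sketched in a few lines of Python. This is a minimal illustration, not Clay's formatter syntax: the function names are hypothetical, the suffix list is limited to the four suffixes named above, and lowercasing the local part of the email is an assumption added so that capitalization never blocks a match. Under these rules j.smith@acme.com becomes jsmith@acme.com and John.Smith+crm@ACME.com becomes johnsmith@acme.com, so the email key narrows the field and the name-similarity check finishes the match.

```python
import re

# Legal suffixes to drop; the guide names Inc, LLC, Corp, and Ltd.
LEGAL_SUFFIXES = {"inc", "llc", "corp", "ltd"}


def normalize_email(email: str) -> str:
    """Reduce an email to a comparison key: no plus-tag, no dots, lowercase."""
    local, _, domain = email.strip().rpartition("@")
    local = local.split("+", 1)[0]   # drop the plus-tag
    local = local.replace(".", "")   # drop dots in the local part
    # Lowercasing the local part is an assumption made for matching purposes.
    return f"{local.lower()}@{domain.lower()}"


def normalize_company(name: str) -> str:
    """Lowercase, replace punctuation with spaces, drop trailing legal suffixes."""
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    while words and words[-1] in LEGAL_SUFFIXES:
        words.pop()
    return " ".join(words)


def email_domain(email: str) -> str:
    """Account-level key: the lowercased domain of the address."""
    return email.strip().rpartition("@")[2].lower()
```

These keys are for comparison only. Keep the original values on the record, since the stripped form is not guaranteed to be a deliverable address.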

[Interactive demo: run exact match, then normalize, and watch the two records snap together. Record A: Email j.smith@acme.com, Name J. Smith, Company Acme Inc. Record B: Email John.Smith+crm@ACME.com, Name Jonathan Smith, Company Acme, Corp.]

Records that share no identical field can still be the same contact; normalizing each field before comparing is what surfaces them.
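Before any pair can be scored, something has to decide which pairs are worth comparing. One common approach, shown here as an assumption rather than a description of how Clay's Lookup Rows works, is to group records by normalized email domain and compare only within each group. The record shape and the `candidate_pairs` name are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records):
    """Yield pairs of record ids whose emails share a lowercased domain."""
    by_domain = defaultdict(list)
    for rec in records:
        domain = rec["email"].rpartition("@")[2].lower()
        by_domain[domain].append(rec["id"])
    for ids in by_domain.values():
        # Every unordered pair inside one domain group is a candidate.
        yield from combinations(sorted(ids), 2)


# Sample data for illustration only.
records = [
    {"id": "a1", "email": "j.smith@acme.com"},
    {"id": "a2", "email": "John.Smith+crm@ACME.com"},
    {"id": "b1", "email": "bob@other.io"},
]
pairs = list(candidate_pairs(records))
```

Grouping by domain keeps the number of comparisons manageable, but it will not pair a personal-email signup with a work-domain record for the same person. Those cases need a second grouping key, such as normalized name plus company.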

You can run the name-similarity judgment with AI rather than a brittle string-distance rule, which handles "Bob" versus "Robert" and "Liz" versus "Elizabeth" that pure character comparison gets wrong. Here is a prompt you can drop into an AI column to score whether two records are the same person:

AI same-person matching prompt
You are matching two CRM contact records. Decide if they are the same person.

Record A: {{name_a}}, {{email_a}}, {{company_a}}, {{title_a}}
Record B: {{name_b}}, {{email_b}}, {{company_b}}, {{title_b}}

Consider nicknames (Bob = Robert, Liz = Elizabeth), email aliases (j.smith and john.smith on the same domain), and company name variants (Acme Inc = Acme Corp = Acme). Ignore formatting, capitalization, and punctuation.

Return JSON only:
{"same_person": true|false, "confidence": 0-100, "reason": "<one short clause>"}

Score every candidate pair, then keep the groups above a confidence you trust: 90 and up for auto-merge, 70 to 89 for human review.
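The thresholds above translate into a small routing function. The sketch below assumes the AI column returns the JSON shape the prompt requests. The function name and the three outcome labels are hypothetical, and sending malformed output to human review is an added assumption, not a documented Clay behavior.

```python
import json

AUTO_MERGE_MIN = 90  # 90 and up: auto-merge
REVIEW_MIN = 70      # 70 to 89: human review


def route_match(raw_response: str) -> str:
    """Map one AI response to 'auto_merge', 'review', or 'no_match'."""
    try:
        result = json.loads(raw_response)
        same = result["same_person"] is True
        confidence = float(result["confidence"])
    except (ValueError, KeyError, TypeError):
        # Unparseable or incomplete output: let a human decide (assumption).
        return "review"
    if not same or confidence < REVIEW_MIN:
        return "no_match"
    return "auto_merge" if confidence >= AUTO_MERGE_MIN else "review"
```

Treating a `same_person` value of false as no match, even at high confidence, keeps a confident "different people" verdict from ever being merged.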

Step 3: Decide the survivor and merge field by field

Merging is choosing the best value for each field, not picking a row and deleting the other. The common failure is "keep the newest record." The newest record is often the thinnest: a rep typed a name and an email after a call and skipped everything else. The older record might hold the verified phone, the correct title, and the activity history. Survivorship means assembling one complete record from the best parts of each, not crowning a winner.

Go field by field. For each field, the surviving value is the most complete and most trustworthy one across the group, regardless of which record it came from. Email: keep the verified work address over the personal Gmail. Phone: keep the populated direct dial over the blank. Title: keep the specific "VP of Revenue Operations" over the generic "Manager." Company: keep the normalized canonical name. The result is a single record richer than any of its inputs.
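Field-level survivorship can be sketched as one small rule per field. Everything in this example is illustrative: the free-mail domain list stands in for a real "verified work email" flag, string length stands in for title specificity, and the sample records (including the phone number) are made up.

```python
# Illustrative stand-in for a real verification signal.
FREE_MAIL = {"gmail.com", "yahoo.com", "outlook.com", "hotmail.com"}


def best_email(values):
    """Prefer a work-domain address over a free-mail one."""
    work = [v for v in values if v.rpartition("@")[2].lower() not in FREE_MAIL]
    return (work or values)[0]


def best_longest(values):
    """Rough proxy for 'most specific': the longest value wins."""
    return max(values, key=len)


FIELD_RULES = {"email": best_email, "title": best_longest}


def build_survivor(records, fields=("email", "phone", "title", "company")):
    """Assemble one record from the best populated value of each field."""
    survivor = {}
    for field in fields:
        values = [(r.get(field) or "").strip() for r in records]
        values = [v for v in values if v]  # a populated value beats a blank
        if not values:
            survivor[field] = ""
            continue
        rule = FIELD_RULES.get(field, lambda vs: vs[0])
        survivor[field] = rule(values)
    return survivor


group = [
    {"email": "jon.smith@gmail.com", "phone": "", "title": "Manager", "company": "acme"},
    {"email": "john.smith@acme.com", "phone": "+1 555 0100",
     "title": "VP of Revenue Operations", "company": "acme"},
]
survivor = build_survivor(group)
```

Here the survivor takes the work email, the populated phone, and the specific title, regardless of which record each value came from or which record is newest.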

[Interactive demo: pick the surviving value for each field (work email, phone, title, company) and watch the survivor record fill in.]

The most complete record is assembled from the best value in each field across all duplicates, not copied from whichever row is newest. Activity history from all three records is preserved.

This is the difference that compounds across a whole database. At Sana, the team rebuilt CRM trust by enriching and reconciling accounts at scale rather than letting reps patch records one at a time.

"Reps used to spend hours validating account information because they couldn't trust the data. With Clay, reps are much more confident in our CRM data and most accounts in their books of business are now worth reaching out to."

That confidence is the real output of a clean merge: a record a rep does not have to re-verify before every call.

Step 4: Merge safely back into HubSpot or Salesforce

The merge is only safe if it preserves activity history and writes back without creating new records. A contact's emails, calls, meetings, and deal associations are the reason the record matters. Lose them in a sloppy merge and you have a clean-looking record with no memory. Before you write anything back, confirm the merge keeps every activity timeline from every record in the group attached to the survivor.

Work the write-back the way Clay's own team teaches it in the CRM enrichment course. Treat lookups as read-only: pull records into Clay, build and check your match groups and survivor values, and change nothing in production while you do. When you are ready to write, test in a Salesforce sandbox first. Writes to Salesforce cannot be undone, so a mistake should never reach live data. Use the Salesforce Update record action keyed on the Record ID so the surviving record updates in place. Clay prevents duplicate creation by default, and the duplicate records are merged or archived on the Salesforce side. Keying on the record ID is the safeguard that stops the cleanup from becoming the next source of duplicates.

Advance the write-back one stage at a time, no skipping the sandbox:

  1. Pull (read-only). Records flow from the CRM into Clay. Production is unchanged while you work.
  2. Match and build survivor
  3. Test in sandbox
  4. Write to production by record ID

Writing back keyed on the CRM record ID updates the survivor in place and preserves every activity timeline, so cleanup never spawns new duplicates.

The end state is one live record, zero new records, and the full history attached. That is the whole point of keying on the record ID rather than letting the cleanup insert fresh rows.
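The "one live record, zero new records" rule can be enforced in the plan you build before anything is written. The sketch below makes no Salesforce or Clay calls. It only assembles a write-back plan, and the function name, plan keys, and sample record IDs are all hypothetical.

```python
def build_writeback_plan(survivor_id, survivor_fields, group_ids):
    """Plan an in-place update of the survivor, keyed on its CRM record ID."""
    if not survivor_id:
        raise ValueError("refusing to write without a CRM record ID")
    if survivor_id not in group_ids:
        raise ValueError("survivor must be one of the records in the match group")
    return {
        # The survivor is updated in place, never re-inserted.
        "update": {"record_id": survivor_id, "fields": dict(survivor_fields)},
        # The rest of the group is merged or archived on the CRM side.
        "merge_or_archive": [rid for rid in group_ids if rid != survivor_id],
        # Always empty: a cleanup run should never create rows.
        "create": [],
    }


plan = build_writeback_plan(
    "003A",  # made-up record ID
    {"Title": "VP of Revenue Operations"},
    ["003A", "003B", "003C"],
)
```

Refusing to proceed without a record ID is the point of the exercise: an update with no key is the step that would otherwise fall back to inserting a fresh row.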

Step 5: Prevent duplicate contacts at the point of entry

Cleaning once doesn't last; the next import refills the database unless you dedupe at entry. A one-time cleanup feels like a win for about a week. Then a new list lands, the same near-matches slip in, and you are back where you started. The only durable fix is to run the same normalize-and-match logic on every new record before it gets written, not just on the batch you cleaned today.

Set the dedupe check as a standing step on the inbound path. New records (form fills, list uploads, manual adds) flow through Clay, get normalized on the same keys you used in Step 2, and get checked against existing contacts. A clear match updates the existing record instead of creating a new one. A near-match below the auto-merge threshold routes to a review queue rather than silently doubling an account. Run it on a schedule so it processes new arrivals continuously, the same way a refresh list keeps enriched data current.
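The entry check reduces to a decision made once per inbound record. In this sketch `existing` maps CRM record IDs to stored records that already carry a normalized `email_key`, and `score_fn` is a placeholder for whatever similarity scorer you use, returning 0 to 100. Both are assumptions, as is the final `create_new` outcome for records that match nothing.

```python
def route_inbound(record, existing, score_fn, auto_merge_min=90, review_min=70):
    """Return (action, matched_record_id) for one inbound record."""
    # 1. Exact match on the normalized key: update in place.
    for rec_id, stored in existing.items():
        if stored["email_key"] == record["email_key"]:
            return ("update_in_place", rec_id)

    # 2. Otherwise score against existing records and keep the best match.
    scored = [(score_fn(record, stored), rec_id) for rec_id, stored in existing.items()]
    best_score, best_id = max(scored, default=(0, None))

    if best_score >= auto_merge_min:
        return ("merge_into_existing", best_id)
    if best_score >= review_min:
        return ("review_queue", best_id)
    # 3. No plausible match: a genuinely new contact (assumption).
    return ("create_new", None)
```

Only the last branch creates a record, and it is reached only after both the exact check and the similarity check have failed to find a match.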

[Interactive demo: send records through the entry check, then switch the rule off.]

Dedupe at entry

Exact existing match: update in place, no new record.
High-confidence near-match: merge into existing, no new record.
Below threshold: route to review queue, no new record.

Duplicates prevented: 12. Duplicates created: 0.

A normalize-and-match check on every new record, run on a schedule, is what keeps a cleaned database from refilling on the next import.

With the rule on, new arrivals update, merge, or queue for review, and the duplicates-created counter stays at zero. Switch it off and the same records all create fresh rows. That gap is the difference between cleaning once and never having to run the project again.

Common failure modes when removing duplicate contacts

Three mistakes turn a dedupe project into wasted effort. The first is blind "merge all": trusting a one-click button that only collapses exact string matches, so every near-match duplicate survives and you declare victory over the easy 10%. The second is keeping the newest row instead of the best value: the new record overwrites a richer old one, and you lose the verified phone or the real title to save a worse version. The third is re-creating duplicates on the next import: you clean today, skip the entry-time check, and a list upload next month rebuilds the mess because nothing normalized the new arrivals.

All three come from treating deduplication as a cleanup task instead of a standing matching system. Exact-match tools miss the duplicates that hurt, newest-row merges destroy data, and one-time cleanups expire. The fix in each case is the same logic applied at the right moment: normalize before you compare, merge by best value per field, and run the check at entry so it never has to be a project again.


Frequently asked questions

How do I find duplicate contacts my CRM doesn't flag?

Native CRM finders compare values as literal strings, so they only catch records where every character matches. To find the rest, normalize each field to a canonical form first: strip dots and plus-tags from emails, drop legal suffixes and punctuation from company names, and resolve nicknames. Then compare the normalized values. Records that shared no identical field, like j.smith@acme.com and john.smith@acme.com, collapse into one match group.

How do you merge duplicate contacts without losing data?

Merge field by field instead of keeping one row and deleting the rest. For each field, keep the most complete and most trustworthy value across the group: the verified work email over the personal one, the populated phone over the blank, the specific title over the generic. Confirm the merge preserves every activity timeline from all records, then write back keyed on the CRM record ID so the survivor updates in place.

What causes duplicate contacts in a CRM?

Multiple entry points that never agree on formatting. The same person arrives through a form fill, a list import, a PLG signup, and a manual rep entry, each using a different spelling, email alias, or company name. Because no step normalizes the values before they land, the CRM stores one human as several records.

How do I remove duplicate contacts in Salesforce or HubSpot?

Pull records into Clay through the native integration as a read-only step, build your match groups and survivor values without touching production, and test the write-back in a Salesforce sandbox first. Then use update actions keyed on the record ID so Clay updates the surviving record in place rather than inserting new rows, and merge or archive the remaining duplicates on the CRM side.

How do you prevent duplicate contacts going forward?

Run the same normalize-and-match check on every new record before it gets written, not just on the batch you cleaned. New records flow through the check, an exact or high-confidence match updates the existing record instead of creating one, and a near-match below your threshold routes to a review queue. Run it on a schedule so it processes new arrivals continuously and the database stops refilling.