Match the source to the right capture method first
Most failed directory pulls come from reaching for one tool and forcing every site through it. The site decides the tool, not your habit.
A clean tabular member list behaves nothing like a marketplace behind a human-verification check, and neither behaves like a profile page written in prose. Pick wrong and you get garbage rows, a blocked run, or a tool that times out trying to read structure that was never there.
This guide covers four kinds of source: a clean paginated list, a page behind a bot wall, a specialized source you need at volume, and an unstructured detail page. The point of the four choices is not the tools. It is that you diagnose the source first and the tool falls out of that diagnosis. The rest of this guide walks each path.
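The diagnosis can be written down as a plain lookup. The sketch below is illustrative only: the enum and function names are hypothetical, and the mapping just restates the four paths this guide walks.

```python
from enum import Enum


class Source(Enum):
    """The four source types this guide distinguishes (names are illustrative)."""
    CLEAN_PAGINATED_LIST = "clean paginated list"
    BOT_WALL = "page behind a human-verification check"
    SPECIALIZED_AT_VOLUME = "specialized source needed at volume"
    UNSTRUCTURED_DETAIL_PAGE = "detail page written in prose"


# One tool per diagnosis, as described in the steps that follow.
TOOL_FOR_SOURCE = {
    Source.CLEAN_PAGINATED_LIST: "Clay for Chrome recipe",
    Source.BOT_WALL: "Zenrows enrichment",
    Source.SPECIALIZED_AT_VOLUME: "Apify actor",
    Source.UNSTRUCTURED_DETAIL_PAGE: "Use AI or ScrapeMagic",
}


def pick_tool(source: Source) -> str:
    """Return the capture method that matches the diagnosed source type."""
    return TOOL_FOR_SOURCE[source]
```

The lookup takes a source type, never a preferred tool, which is the whole argument of this section in code form.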
Step one: Capture a clean paginated list with a Chrome recipe
A clean, paginated directory is the easy case, and the most common one. Association member lists, expert directories, supplier catalogs, and event exhibitor lists usually render their rows in a repeating table or card structure. That structure is exactly what the Clay for Chrome extension is built to read.
Open the extension on the directory page and let auto-detect run first. On a tabular page it often finds the full list on its own; in Clay's own expert directory walkthrough, auto-detect pulled all 78 entries without any setup. When auto-detect misses, you build a custom recipe: click Select Data to Add from Page, choose Select a List, then click two or three items in the same position so the extension learns the pattern. You add an attribute for each field you want (name, location, category, website) by clicking an example of it. Save the recipe, and the extension captures the list straight into a Clay table.
Pagination is the part people get wrong. The extension captures lists across multiple pages, but not through a page-range setting, because there is no such setting. You teach it the URL pattern instead. In the recipe's URL Matching settings, you replace the part of the address that changes from page to page with a variable, so a recipe built on one page applies to every page that matches. That is what turns a single-page grab into a full-directory capture.
One recipe with a URL-pattern variable walks every page of a directory into a single table, not just the page you started on.
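To make the URL-pattern idea concrete, here is a minimal Python sketch of how one pattern with a variable can match every page of a directory. It is not how the extension is implemented. The `{page}` placeholder syntax, the `example.org` address, and the function names are all assumptions made for illustration.

```python
import re


def pattern_to_regex(pattern: str, variable: str = "{page}") -> re.Pattern:
    """Compile a URL containing one placeholder into a regex.

    The literal parts of the URL are escaped; the placeholder becomes a
    capture group that accepts any run of characters up to the next
    URL delimiter.
    """
    prefix, found, suffix = pattern.partition(variable)
    if not found:
        raise ValueError(f"pattern has no {variable} placeholder")
    return re.compile(
        re.escape(prefix) + r"(?P<value>[^/?&#]+)" + re.escape(suffix)
    )


def matches(pattern: str, url: str) -> bool:
    """True when the whole URL fits the pattern."""
    return pattern_to_regex(pattern).fullmatch(url) is not None


# Hypothetical five-page directory: only the page number changes.
pattern = "https://example.org/members?page={page}"
pages = [f"https://example.org/members?page={n}" for n in range(1, 6)]
covered = [url for url in pages if matches(pattern, url)]
```

A recipe tied to the literal page-one address would match only one of those five URLs. Replacing the changing part with a variable is what lets the same recipe apply to all of them.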
Once the rows land, you have a structured table: one row per directory entry, one column per attribute you mapped. That is your raw list. Everything after this is enrichment and cleanup.
Step two: Get past a bot wall with Zenrows
Some directories fight back. Marketplaces and premium databases often sit behind a human-verification check that fires the moment you click a listing, and even once you are through, the data is not laid out to be read cleanly. A Chrome recipe stalls here because it never gets a clean page to read.
This is what Zenrows is for. Inside a Clay table you add the Zenrows enrichment and feed it the listing URL. Then you turn on the settings that match the defense: Render JavaScript for pages that build themselves in the browser, Premium Proxy and Anti-Bot for stronger protections, and Autoparse to format the response. Clay's team uses exactly this configuration to pull funding rounds, investor names, and rankings off pages that block a normal click. The raw response then gets refined with Clay's column extraction or an AI formula into the fields you actually want.
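The choice of toggles can be sketched as a small decision function. This is only a way to reason about which settings to switch on. The field names below mirror the toggle labels named above but are hypothetical identifiers, not Zenrows or Clay API parameters.

```python
from dataclasses import dataclass, asdict


@dataclass
class ZenrowsSettings:
    """Hypothetical stand-in for the toggles in the Zenrows enrichment."""
    render_javascript: bool = False  # page builds itself in the browser
    premium_proxy: bool = False      # stronger protections
    anti_bot: bool = False           # stronger protections
    autoparse: bool = True           # format the response


def settings_for(builds_in_browser: bool, strong_protection: bool) -> ZenrowsSettings:
    """Turn on only the settings that match the defense the page puts up."""
    return ZenrowsSettings(
        render_javascript=builds_in_browser,
        premium_proxy=strong_protection,
        anti_bot=strong_protection,
        autoparse=True,
    )


# A marketplace that renders client-side and sits behind a verification check.
marketplace = asdict(settings_for(builds_in_browser=True, strong_protection=True))
```

Keeping each toggle tied to a specific defense makes it easier to see when a setting is unnecessary for a given page.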
Reach for Zenrows only when a simpler tool has already failed. If a normal page load gets you a readable list, a Chrome recipe is cheaper and faster. Zenrows earns its place specifically when the wall is the problem.
Step three: Pull specialized sources at volume with an Apify actor
When the source is specialized and you need volume, an Apify actor beats a generic scraper. Apify is a marketplace of purpose-built scrapers, called actors, each written for a particular kind of site: business directories like Yellow Pages, marketplaces, job boards, niche forums, real-estate listings. Someone has usually already solved the hard parts of a given source, and you run their actor on demand.
The setup is two-sided. In Apify you pick the actor, click Create Task to add it to your library, switch the input from Manual to JSON, and copy that input body. In Clay you connect your Apify account with an API key, add the Apify integration to your table, paste the input, and run the actor against your rows. Results come back into the table the same way any enrichment does. The advantage over a Chrome recipe is throughput and resilience: an actor maintained for a specific source handles that source's quirks and scale better than a recipe you build by hand for a one-time pull.
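For a sense of what the copied JSON input body looks like, here is a hedged Python sketch that builds one. Every actor defines its own input schema, so the field names here (`searchTerms`, `location`, `maxItems`) are hypothetical. Copy the real ones from the actor's JSON input view.

```python
import json


def build_actor_input(search_terms, location: str, max_items: int = 100) -> str:
    """Serialize a hypothetical actor input as a JSON body ready to paste.

    Field names are placeholders; the actor you pick decides the real schema.
    """
    if max_items <= 0:
        raise ValueError("max_items must be positive")
    payload = {
        "searchTerms": list(search_terms),  # hypothetical field
        "location": location,               # hypothetical field
        "maxItems": max_items,              # hypothetical field
    }
    return json.dumps(payload, indent=2)


body = build_actor_input(["electrical contractor"], "Austin, TX", max_items=250)
```

Generating the body from code rather than editing it by hand helps when the same actor runs against many rows with different search terms.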
“Contractors maintain accurate Google Maps listings because they need service calls. This insight helped us build contact lists with much higher accuracy than traditional B2B data providers offered. We never would have been able to do that in Zoominfo.”
Regency Supply found this directly. Their targets, local electrical contractors, barely existed in ZoomInfo or Apollo, so the premium database route returned thin, unverifiable rows. Sourcing from where the businesses actually keep current listings produced contact lists with far higher accuracy than the incumbent providers could, and now feeds a system that tracks more than 5,000 of those contacts for job changes.
Step four: Read unstructured detail pages with Use AI or ScrapeMagic
Not every directory entry is a tidy row. Plenty of detail pages are prose: an About paragraph, a bio, a service description with the data you want buried in sentences. A list recipe has nothing to latch onto here, because there is no repeating structure to read.
This is where you read the page with AI instead of parsing it. A Use AI column running Web research takes the detail-page URL and a description of what you want, then returns those fields. ScrapeMagic does the same job through its Parse Data from URL action. You name each field on the left and describe it on the right, for example headcount mapped to the number of employees the page lists. It then extracts that field from the page. Both turn a paragraph into columns without you writing a selector.
Use a precise prompt and ask for one field per output, so the result lands in its own column rather than as a blob you have to split later.
ROLE: You are extracting structured data from a single directory detail page.
INPUT: {{page_url}}
TASK: Read the page at the URL and return only these fields. If a field is not present on the page, return "not found" for that field. Do not guess.
RETURN (one value per field):
- business_name: the official name as written on the page
- primary_category: the single main category or service line
- year_established: four-digit year, or "not found"
- service_area: cities or regions the page says it serves
- contact_email: the email the page lists, or "not found"
The reason to name fields this tightly is that a vague prompt returns a paragraph, and a paragraph is just the unstructured problem again in a new column. Specific field instructions give you a table.
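If you ever handle such a response outside Clay, tightly named fields also make it trivial to parse. The sketch below assumes the model answers with one `field: value` line per field, which is an assumption about output shape and not something Clay guarantees. The sample text is invented for illustration.

```python
FIELDS = (
    "business_name",
    "primary_category",
    "year_established",
    "service_area",
    "contact_email",
)


def parse_response(text: str) -> dict:
    """Map 'field: value' lines onto the expected fields.

    Unknown keys are ignored; a missing field or a literal "not found"
    becomes None, so gaps stay visible as empty cells.
    """
    record = {field: None for field in FIELDS}
    for line in text.splitlines():
        # Tolerate a leading bullet dash, then split on the first colon only.
        key, found, value = line.strip().lstrip("-").strip().partition(":")
        key, value = key.strip(), value.strip()
        if found and key in record and value and value.lower() != "not found":
            record[key] = value
    return record


# Invented sample response for a hypothetical detail page.
sample = """business_name: Acme Law Group
primary_category: Family law
year_established: not found
service_area: Austin, Round Rock
contact_email: not found"""

row = parse_response(sample)
```

Each key becomes its own column, and a "not found" stays an honest gap to fill during enrichment.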
Step five: Enrich, verify, and dedupe before you trust the list
A captured directory is raw material, not a finished list. The rows are real, but they are dirty. Website columns arrive with tracking parameters or point at a social page instead of a real domain. The same business shows up twice under slightly different names, and contact details are missing or stale. Acting on that as-is wastes the work.
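As an illustration of the website cleanup, here is a small Python sketch using only the standard library. It reduces a captured URL to a bare domain, which drops tracking parameters along with the path, and rejects links that point at a social page. The list of social hosts and the `.example` domains are assumptions for the example, not a complete rule set.

```python
from typing import Optional
from urllib.parse import urlsplit

# Illustrative, not exhaustive.
SOCIAL_HOSTS = ("facebook.com", "instagram.com", "linkedin.com", "twitter.com", "x.com")


def normalize_domain(raw: str) -> Optional[str]:
    """Return a bare lowercase domain, or None when no real domain is present."""
    raw = raw.strip()
    if not raw:
        return None
    if "://" not in raw:
        raw = "https://" + raw  # urlsplit needs a scheme to locate the host
    host = (urlsplit(raw).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if not host:
        return None
    # A social profile is not the business's own domain.
    if any(host == s or host.endswith("." + s) for s in SOCIAL_HOSTS):
        return None
    return host
```

For example, `normalize_domain("https://www.acmelaw.example/contact?utm_source=dir")` yields `acmelaw.example`, while a Facebook page link yields `None` and flags the row for enrichment.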
Clean and complete it in the same table. Normalize the obvious fields first, then fill the gaps with a waterfall. Email and phone come from a waterfall that checks one provider and, when that comes back empty, moves on to the next, across dozens of sources, so a gap in one source does not leave the row blank. When even the waterfall comes up short, an AI read of the business's own site can pull the contact it publishes. Verify emails so you keep only Valid and Catch-all rows instead of mailing into dead inboxes, and dedupe on a stable key so two listings of one business collapse into a single row.
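The logic of that pass is simple enough to sketch in Python. This is a conceptual model of waterfall, verify, and dedupe, not Clay's implementation. The provider functions, the `email_status` and `domain` field names, and the choice of domain as the stable key are all assumptions.

```python
from typing import Callable, Iterable, List, Optional

# A provider takes a row and returns a value, or None when it has nothing.
Provider = Callable[[dict], Optional[str]]


def waterfall(row: dict, providers: Iterable[Provider]) -> Optional[str]:
    """Try providers in order and stop at the first one that returns a value."""
    for provider in providers:
        value = provider(row)
        if value:
            return value
    return None


KEEP_STATUSES = {"valid", "catch-all"}


def keep_verified(rows: List[dict]) -> List[dict]:
    """Keep only rows whose email verified as Valid or Catch-all."""
    return [r for r in rows if str(r.get("email_status", "")).lower() in KEEP_STATUSES]


def dedupe(rows: List[dict], key: str = "domain") -> List[dict]:
    """Collapse rows sharing a key, keeping the first; rows without a key stay."""
    kept, seen = [], set()
    for row in rows:
        value = row.get(key)
        if value is not None:
            if value in seen:
                continue
            seen.add(value)
        kept.append(row)
    return kept
```

Order matters in a waterfall, because later providers are only consulted when earlier ones come back empty. Deduping on a normalized domain rather than on the business name is what lets two differently spelled listings collapse into one row.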
Figure: a raw directory row becomes a clean, enriched record. The example row, Acme Law Group, starts unverified: a name, a dirty website, no email, no phone. It then passes through the cleanup gates into a record you can act on.
The captured row only becomes usable after a normalize-enrich-verify-dedupe pass. The scrape is the start of the work, not the end.
The scale this unlocks is real once the pipeline runs. By combining ten-plus data providers in Clay and standardizing the results before syncing to their CRM, Mistral built a foundational TAM in a fraction of the usual time, the kind of throughput that manual page-by-page collection never reaches.
Qualified global accounts Mistral sourced in two weeks by capturing and enriching at scale in Clay.
Step six: Watch for the failure modes that ruin a directory pull
Most directory extractions break in predictable ways, and all of them are avoidable.
- Forcing one tool onto every source: Running a Chrome recipe at a bot wall, or pointing AI research at a clean table it does not need. Match the tool to the source and these stop happening.
- Capturing page one and assuming you got the directory: Without a URL-pattern variable in your recipe, you collect a fraction of the directory and never notice.
- Trusting raw rows: Dirty domains and duplicate listings look fine in the table and only reveal themselves when a send bounces or a rep calls the same business twice.
- Scraping behind a login or paywall: It is unreliable and not what these tools are for. Stick to publicly listed data.
The cleanest way to avoid all four is to treat capture and enrichment as two separate jobs. Get the rows in with the right tool for the source, then run every row through the same normalize-enrich-verify-dedupe pass before anyone acts on it.
If you want the broader picture of capturing from any site, the companion guide on scraping data from any website covers the full tool ladder. The guide on scraping a website to a spreadsheet covers clean exports, and the Google Maps lead guide goes deep on local-business sourcing. This guide is the directory-and-database slice of that same toolkit.