How to Extract Data from Paginated Databases and Directories

A directory is a database someone forgot to give you an export button for. The names, the categories, the contact details are all there, structured, across page after page. The hard part is almost never the data. It is the pagination, and sometimes a bot wall in front of it. Here is how to match the source to the right capture tool, pull every page into one table, and enrich from there.

June 10, 2026 · 9 min read



Match the source to the right capture method first

Most failed directory pulls come from reaching for one tool and forcing every site through it. The site decides the tool, not your habit.

A clean tabular member list behaves nothing like a marketplace behind a human-verification check, and neither behaves like a profile page written in prose. Pick wrong and you get garbage rows, a blocked run, or a tool that times out trying to read structure that was never there.

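The four source types and the Clay tool that matches each

Clean, paginated list: A Chrome recipe built with the Clay for Chrome extension.
Bot wall or human-verification check: The Zenrows enrichment.
Specialized source at volume: An Apify actor.
Unstructured detail pages written in prose: A Use AI column or ScrapeMagic.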

The point of the four choices is not the tools. It is that you diagnose the source first and the tool falls out of that diagnosis. The rest of this guide walks each path.
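The diagnosis can be sketched as a small decision function. This is only an illustration of the reasoning: the `Source` fields and the order of the checks are assumptions made for the example, not something Clay exposes.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical description of a directory, filled in after inspecting it by hand."""
    has_bot_wall: bool = False           # a human-verification check blocks a normal click
    is_prose: bool = False               # detail pages are paragraphs, not repeating rows
    specialized_at_volume: bool = False  # niche source where a purpose-built scraper exists


def choose_tool(source: Source) -> str:
    """Return the capture method this guide pairs with the source type."""
    if source.has_bot_wall:
        return "Zenrows"
    if source.is_prose:
        return "Use AI or ScrapeMagic"
    if source.specialized_at_volume:
        return "Apify actor"
    # Default case: a clean, paginated list with repeating rows or cards.
    return "Chrome recipe"


print(choose_tool(Source(has_bot_wall=True)))
```

The default branch is the Chrome recipe because a clean paginated list is the most common case; the other branches only fire when the source shows a specific obstacle.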

Step one: Capture a clean paginated list with a Chrome recipe

A clean, paginated directory is the easy case, and the most common one. Association member lists, expert directories, supplier catalogs, and event exhibitor lists usually render their rows in a repeating table or card structure. That structure is exactly what the Clay for Chrome extension is built to read.

Open the extension on the directory page and let auto-detect run first. On a tabular page it often finds the full list on its own; in Clay's own expert directory walkthrough, auto-detect pulled all 78 entries without any setup. When auto-detect misses, you build a custom recipe: click Select Data to Add from Page, choose Select a List, then click two or three items in the same position so the extension learns the pattern. You add an attribute for each field you want (name, location, category, website) by clicking an example of it. Save the recipe, and the extension captures the list straight into a Clay table.

Pagination is the part people get wrong. The extension captures lists across multiple pages, but not through a page-range setting, because there is no such setting. You teach it the URL pattern instead. In the recipe's URL Matching settings, you replace the part of the address that changes from page to page with a variable, so a recipe built on one page applies to every page that matches. That is what turns a single-page grab into a full-directory capture.
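To make the URL-pattern idea concrete, here is a minimal Python sketch. It does not reproduce how the extension works internally; the `{page}` placeholder syntax, the example address, and both helper functions are assumptions chosen for illustration.

```python
import re


def pattern_to_regex(pattern: str, variable: str = "{page}") -> re.Pattern:
    """Turn a URL with one placeholder into a regex that accepts any page number."""
    if variable not in pattern:
        raise ValueError(f"pattern must contain {variable}")
    head, _, tail = pattern.partition(variable)
    return re.compile(re.escape(head) + r"(\d+)" + re.escape(tail))


def page_urls(pattern: str, first: int, last: int, variable: str = "{page}") -> list[str]:
    """Expand the pattern into one concrete URL per page."""
    return [pattern.replace(variable, str(n)) for n in range(first, last + 1)]


# Hypothetical directory address; only the page number changes between pages.
PATTERN = "https://example.org/members?page={page}"

matcher = pattern_to_regex(PATTERN)
urls = page_urls(PATTERN, 1, 5)

for url in urls:
    # Every generated page URL matches the same pattern, so one recipe applies to all.
    print(url, bool(matcher.fullmatch(url)))
```

The point of the sketch is the same as the recipe setting: everything in the address stays fixed except the one part that changes per page, and that part becomes the variable.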

One recipe with a URL-pattern variable walks every page of a directory into a single table, not just the page you started on.

Once the rows land, you have a structured table: one row per directory entry, one column per attribute you mapped. That is your raw list. Everything after this is enrichment and cleanup.

Step two: Get past a bot wall with Zenrows

Some directories fight back. Marketplaces and premium databases often sit behind a human-verification check that fires the moment you click a listing, and even once you are through, the data is not laid out to be read cleanly. A Chrome recipe stalls here because it never gets a clean page to read.

This is what Zenrows is for. Inside a Clay table you add the Zenrows enrichment and feed it the listing URL. Then you turn on the settings that match the defense: Render JavaScript for pages that build themselves in the browser, Premium Proxy and Anti-Bot for stronger protections, and Autoparse to format the response. Clay's team uses exactly this configuration to pull funding rounds, investor names, and rankings off pages that block a normal click. The raw response then gets refined with Clay's column extraction or an AI formula into the fields you actually want.

Reach for Zenrows only when a simpler tool has already failed. If a normal page load gets you a readable list, a Chrome recipe is cheaper and faster. Zenrows earns its place specifically when the wall is the problem.

Step three: Pull specialized sources at volume with an Apify actor

When the source is specialized and you need volume, an Apify actor beats a generic scraper. Apify is a marketplace of purpose-built scrapers, called actors, each written for a particular kind of site: business directories like Yellow Pages, marketplaces, job boards, niche forums, real-estate listings. Someone has usually already solved the hard parts of a given source, and you run their actor on demand.

The setup is two-sided. In Apify you pick the actor, click Create Task to add it to your library, switch the input from Manual to JSON, and copy that input body. In Clay you connect your Apify account with an API key, add the Apify integration to your table, paste the input, and run the actor against your rows. Results come back into the table the same way any enrichment does. The advantage over a Chrome recipe is throughput and resilience: an actor maintained for a specific source handles that source's quirks and scale better than a recipe you build by hand for a one-time pull.

“Contractors maintain accurate Google Maps listings because they need service calls. This insight helped us build contact lists with much higher accuracy than traditional B2B data providers offered. We never would have been able to do that in Zoominfo.”
Andrew Thomas, Director of Marketing, Regency Supply

Regency Supply found this directly. Their targets, local electrical contractors, barely existed in ZoomInfo or Apollo, so the premium database route returned thin, unverifiable rows. Sourcing from where the businesses actually keep current listings produced contact lists with far higher accuracy than the incumbent providers could, and now feeds a system that tracks more than 5,000 of those contacts for job changes.

Step four: Read unstructured detail pages with Use AI or ScrapeMagic

Not every directory entry is a tidy row. Plenty of detail pages are prose: an About paragraph, a bio, a service description with the data you want buried in sentences. A list recipe has nothing to latch onto here, because there is no repeating structure to read.

This is where you read the page with AI instead of parsing it. A Use AI column running Web research takes the detail-page URL and a description of what you want, then returns those fields. ScrapeMagic does the same job through its Parse Data from URL action. You name each field on the left and describe it on the right, for example headcount mapped to the number of employees the page lists. It then extracts that field from the page. Both turn a paragraph into columns without you writing a selector.

Use a precise prompt and ask for one field per output, so the result lands in its own column rather than as a blob you have to split later.

AI scrape: structured extraction from a directory detail page

ROLE: You are extracting structured data from a single directory detail page.
INPUT: {{page_url}}
TASK: Read the page at the URL and return only these fields. If a field is not present on the page, return "not found" for that field. Do not guess.
RETURN (one value per field):
- business_name: the official name as written on the page
- primary_category: the single main category or service line
- year_established: four-digit year, or "not found"
- service_area: cities or regions the page says it serves
- contact_email: the email the page lists, or "not found"

The reason to name fields this tightly is that a vague prompt returns a paragraph, and a paragraph is just the unstructured problem again in a new column. Specific field instructions give you a table.
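If you ever handle such a response outside Clay, the one-field-per-output idea can be sketched in Python as follows. The sketch assumes the model answers with one `name: value` line per field, which is an assumption about the response format rather than documented behavior; `build_prompt`, `parse_response`, and the sample reply are hypothetical.

```python
from typing import Optional

# Field names and descriptions taken from the prompt above.
FIELDS = {
    "business_name": "the official name as written on the page",
    "primary_category": "the single main category or service line",
    "year_established": 'four-digit year, or "not found"',
    "service_area": "cities or regions the page says it serves",
    "contact_email": 'the email the page lists, or "not found"',
}


def build_prompt(page_url: str, fields: dict[str, str]) -> str:
    """Render the extraction prompt for one detail page."""
    lines = [
        "ROLE: You are extracting structured data from a single directory detail page.",
        f"INPUT: {page_url}",
        "TASK: Read the page at the URL and return only these fields. If a field is "
        'not present on the page, return "not found" for that field. Do not guess.',
        "RETURN (one value per field):",
    ]
    lines += [f"- {name}: {desc}" for name, desc in fields.items()]
    return "\n".join(lines)


def parse_response(text: str, fields: dict[str, str]) -> dict[str, Optional[str]]:
    """Map 'name: value' lines to columns; 'not found' or a missing line becomes None."""
    row: dict[str, Optional[str]] = {name: None for name in fields}
    for line in text.splitlines():
        name, sep, value = line.lstrip("- ").partition(":")
        name, value = name.strip(), value.strip()
        if sep and name in row and value.lower() != "not found":
            row[name] = value
    return row


# Made-up reply, used only to show the parsing step.
reply = "business_name: Acme Law Group\nyear_established: not found"
print(parse_response(reply, FIELDS))
```

Turning "not found" into an empty value keeps the column clean, so a later enrichment step can tell a genuine gap from a filled cell.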

Step five: Enrich, verify, and dedupe before you trust the list

A captured directory is raw material, not a finished list. The rows are real, but they are dirty. Website columns arrive with tracking parameters or point at a social page instead of a real domain. The same business shows up twice under slightly different names, and contact details are missing or stale. Acting on that as-is wastes the work.
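As a rough sketch of that cleanup, the Python below strips a website value down to a bare domain, discards social-profile links, and collapses rows that share a domain. Using the normalized domain as the dedupe key is an assumption made for the example, and the `SOCIAL_HOSTS` set and sample rows are illustrative, not exhaustive or real data.

```python
from typing import Optional
from urllib.parse import urlsplit

# Illustrative set of hosts that indicate a social page rather than a real domain.
SOCIAL_HOSTS = {"facebook.com", "instagram.com", "linkedin.com", "x.com", "twitter.com"}


def normalize_domain(raw: Optional[str]) -> Optional[str]:
    """Reduce a website value to a bare domain, or None if it is empty or a social page."""
    raw = (raw or "").strip()
    if not raw:
        return None
    if "://" not in raw:
        raw = "https://" + raw  # urlsplit needs a scheme to find the host
    host = (urlsplit(raw).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if not host or host in SOCIAL_HOSTS:
        return None
    return host  # path, query string, and tracking parameters are dropped


def dedupe(rows: list[dict]) -> list[dict]:
    """Keep the first row per normalized domain; rows without a domain are all kept."""
    seen: set[str] = set()
    kept = []
    for row in rows:
        domain = normalize_domain(row.get("website"))
        if domain is not None:
            if domain in seen:
                continue
            seen.add(domain)
        kept.append({**row, "domain": domain})
    return kept


rows = [
    {"name": "Acme Law Group", "website": "acmelaw.com/?utm=dir&ref=99"},
    {"name": "Acme Law Group (dup)", "website": "https://www.acmelaw.com/"},
]
print(dedupe(rows))
```

Rows with no usable domain are kept rather than merged, because two businesses with missing websites are not evidence of a duplicate.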

Clean and complete it in the same table. Normalize the obvious fields first, then fill the gaps with a waterfall. Email and phone come from a waterfall that checks one provider and, when that comes back empty, moves to the next, working through dozens of sources so a gap in one does not leave the row blank. When even the waterfall comes up short, an AI read of the business's own site can pull the contact it publishes. Verify emails so you keep only Valid and Catch-all rows instead of mailing into dead inboxes, and dedupe on a stable key so two listings of one business collapse into a single row.
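The waterfall logic itself is simple enough to sketch in a few lines of Python. This is a conceptual illustration, not Clay's implementation: the provider functions are stand-ins backed by fake lookup tables, and the `email_status` field name is an assumption.

```python
from typing import Callable, Optional

# A provider takes a row and returns a value, or None when it has nothing.
Provider = Callable[[dict], Optional[str]]


def waterfall(row: dict, providers: list[Provider]) -> Optional[str]:
    """Try each provider in order and return the first non-empty answer."""
    for lookup in providers:
        value = lookup(row)
        if value:
            return value
    return None  # every provider came back empty


# Statuses worth keeping after verification, per the guide.
KEEP_STATUSES = {"valid", "catch-all"}


def keep_verified(rows: list[dict]) -> list[dict]:
    """Drop rows whose email did not verify as Valid or Catch-all."""
    return [r for r in rows if (r.get("email_status") or "").lower() in KEEP_STATUSES]


# Stand-in providers backed by fake data, for illustration only.
PROVIDER_A: dict[str, str] = {}
PROVIDER_B = {"acmelaw.com": "hello@acmelaw.com"}


def provider_a(row: dict) -> Optional[str]:
    return PROVIDER_A.get(row["domain"])


def provider_b(row: dict) -> Optional[str]:
    return PROVIDER_B.get(row["domain"])


print(waterfall({"domain": "acmelaw.com"}, [provider_a, provider_b]))
```

Because the loop stops at the first hit, the order of the provider list decides which source gets asked first and which ones only serve as fallbacks.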

Example of a raw captured row: Acme Law Group, unverified, with the website acmelaw.com/?utm=dir&ref=99, no email, no phone, and a near-duplicate listing, Acme Law Group (dup).

The captured row only becomes usable after a normalize-enrich-verify-dedupe pass. The scrape is the start of the work, not the end.

The payoff shows up at scale once the pipeline runs. By combining more than ten data providers in Clay and standardizing the results before syncing them to its CRM, Mistral built a foundational TAM (total addressable market) list in a fraction of the usual time. Manual page-by-page collection never reaches that kind of throughput.

25,000+ qualified global accounts Mistral sourced in two weeks by capturing and enriching at scale in Clay.

Step six: Watch for the failure modes that ruin a directory pull

Most directory extractions break in predictable ways, and all of them are avoidable.

Forcing one tool onto every source: Running a Chrome recipe at a bot wall, or pointing AI research at a clean table it does not need. Match the tool to the source and these stop happening.
Capturing page one and assuming you got the directory: Without a URL-pattern variable in your recipe, you collect a fraction of the list and never notice.
Trusting raw rows: Dirty domains and duplicate listings look fine in the table and only reveal themselves when a send bounces or a rep calls the same business twice.
Scraping behind a login or paywall: It is unreliable and not what these tools are for. Stick to publicly listed data.

The cleanest way to avoid all four is to treat capture and enrichment as two separate jobs. Get the rows in with the right tool for the source, then run every row through the same normalize-enrich-verify-dedupe pass before anyone acts on it.

If you want the broader picture of capturing from any site, the companion guide on scraping data from any website covers the full tool ladder. The guide on scraping a website to a spreadsheet covers clean exports, and the Google Maps lead guide goes deep on local-business sourcing. This guide is the directory-and-database slice of that same toolkit.


Frequently asked questions

How do I scrape a directory that has multiple pages?

Build a custom recipe in the Clay for Chrome extension on one page, then set the URL Matching so the part of the address that changes per page becomes a variable. The recipe then captures the list across every matching page into one Clay table. There is no page range field; the URL-pattern variable is what extends a single-page grab to the whole directory.

What if the directory blocks scraping or shows a verification check?

Use the Zenrows enrichment inside Clay. Feed it the listing URL and turn on Render JavaScript, Premium Proxy, and Anti-Bot to get past the defenses, with Autoparse to format the response. Reach for Zenrows only after a plain Chrome recipe fails: when a normal page load gives you a readable list, the recipe is cheaper and faster, and Zenrows is built specifically for sites that fight back.

Can I extract data from a directory without writing code?

Yes. None of the four methods requires code. The Chrome extension builds recipes by clicking example items, Apify actors run from a Clay column with an API key, Zenrows is a toggle-driven enrichment, and Use AI and ScrapeMagic extract fields from a plain-language description. You write field names, not selectors or scripts.

How do I handle a directory where each listing is a paragraph instead of a table?

Read it with AI rather than parsing structure that is not there. A Use AI column running Web research, or ScrapeMagic's Parse Data from URL action, takes the detail-page URL and the fields you name and returns them as columns. Name one field per output and tell the model to return "not found" when a field is absent, so you get a clean table instead of a paragraph.

Is the data I scrape from a directory ready to use right away?

No. Raw directory rows carry dirty website URLs, missing contact details, and duplicate listings. Run every row through a normalize-enrich-verify-dedupe pass: clean the domains, fill email and phone with a waterfall across multiple providers, verify email status to drop dead inboxes, and dedupe on a stable key. The capture is the start of the work; the enrichment pass is what makes the list usable.
