The hard part of scraping a website into a spreadsheet was never pulling text off the page. The hard part is getting clean, structured, deduplicated rows your spreadsheet can actually use. A copy-paste or a raw IMPORTXML gives you a brittle mess: merged cells, half-loaded pages, a formula that returns #N/A the morning the site changes one tag.
A scrape, structure, enrich pipeline gives you a table instead. This walks through getting real data into Excel or Google Sheets, why the consumer shortcuts break the moment you scale past one clean page, and the durable path that still works when the markup changes.
Why IMPORTHTML and IMPORTXML break the moment you scale
Spreadsheet scraping formulas only work on the one kind of page they were built for, and most pages are not that kind. IMPORTHTML reads a clean HTML table or list. IMPORTXML reads a static page and walks an XPath you hand it. Both assume the data is already sitting in the raw HTML the server sends, already structured, and frozen in place. That assumption holds for a Wikipedia table. It fails everywhere a real research target lives.
The failures stack up fast, and they are worth seeing as a set rather than discovering one at a time.
What a spreadsheet formula returns vs. what you actually needed
Formula handled: 0 / 5There is a quieter failure too. Even when IMPORTXML works today, it is pinned to an XPath like //div[@class='price']. The day the site renames that class, every cell goes blank with no error, and you find out when the report built on top is wrong.
Step 1: Decide what data you need before you scrape anything
Start from the question, not the page. The most common scraping mistake is collecting data for its own sake, then realizing the columns you grabbed are not the ones the work needs. In Clay's framework, web scraping serves two of the four data jobs, find and enrich, and it only earns its place when you already know what the finished rows are for.
Write the objective as a sentence with the columns baked in. "Every commercial HVAC contractor in three metros, with company name, website, phone, and a verified owner email, for an outbound list" tells you exactly what the table needs and which scrape and enrichment steps get you there. "Scrape this directory" does not. The objective decides the method, and the method is the easy part.
Step 2: Capture the page, even when a formula can't
Pick the capture method by the page in front of you, not by habit. A clean public table, a prose page that needs reading, a paginated directory, and a site that blocks bots each call for a different tool. Reaching for the same one every time is how people end up fighting a page that a different method would have handled in seconds.
Answer a few questions to find the right capture method
Which best describes the page you want to capture?
Clay's internal hierarchy is a useful default: start with Claygent for general public pages, fall back to Zenrows when a site fights back, reach for an Apify actor when you need a specialized source in bulk, and use the Chrome extension or native scraper for clean structured pages. The point is the same regardless of which one fires: the rows land in a table, not a static file, so the next steps can run on them.
Step 3: Read messy pages with AI when there's no table to point at
When a page has no clean structure, an AI scraper reads it like a person and returns the fields you named. This is the method that makes "any website" true. Instead of mapping an XPath, you give the agent a URL and a list of fields, and it pulls them from prose, mixed layouts, and pages no point-and-click tool could parse. It reaches any publicly available data that is not behind a login.
A prose page turned into spreadsheet columns by AI
Northwind Labs builds industrial IoT sensors for the factory floor, founded in 2017 and based in Austin, Texas.
The team has grown to roughly 120-150 people across engineering and field ops, and just opened a second facility in Q1.
Under the hood they run analytics on Snowflake and manage outreach in HubSpot. There is no public pricing page.
Structured fields
AI scraping reads an unstructured page and returns the named columns you asked for, so prose pages need no XPath, no selectors, and no code.
The field list is the whole interface. You write it the way you would brief a colleague, and the agent returns one value per column. Here is a reusable prompt for structured extraction:
Visit {{url}} and extract these as separate fields, one value per field:{{fields, e.g. "company name, HQ city, employee range, the tools theymention using, phone number, and the contact email if shown"}}.Return only what is present on the page. If a field is not found,return "Not found". Do not guess and do not pull from other pages.Keep each value clean enough to drop straight into a spreadsheet column.
Step 4: Structure and deduplicate so the rows are spreadsheet-ready
A scrape gives you raw rows; a usable table needs them cleaned, typed, and deduped first. This is the step the consumer tools skip and the reason a raw scrape lands in your sheet as a mess. Job titles arrive in fifteen formats, the same company shows up three times under three URL variants, and half the phone numbers carry stray characters. Dropped into Excel as-is, that is hours of manual cleanup before a single formula runs.
Doing it in a table before export turns the cleanup into rules that apply to every row at once: standardize messy fields, classify records (B2B vs. B2C, persona, industry) with an AI formula, merge fragmented columns into one, and collapse duplicates so each company or person appears once. Clay's transform step is built for exactly this, and it is where scraped content becomes consistent enough to trust. The output is a table whose columns line up with what you actually wanted in Step 1, not a pile of strings you will spend the afternoon untangling.
“We did an analysis of on-prem trends regionalized across the world in 3 days. This market research analysis would have taken my team 3 months to do without Clay.”
That speed comes from the structure happening inline. Mistral mapped global accounts and pulled regional firmographics in a fraction of the usual time because the data was being cleaned and shaped as it was collected, not after, in a separate cleanup pass that never quite finishes.
Step 5: Enrich the rows before they ever reach the spreadsheet
Scraped rows are thin; the columns that make a spreadsheet worth opening usually are not on the page you scraped. The directory gave you a company name and a URL. What the outbound list needs is a verified work email, a direct phone, headcount, and a tech stack, and none of that was sitting on the page. Enrichment is the difference between a list of names and a list you can act on, and it has to happen before export or you are back to manual research one row at a time.
This is where scraping into a workspace beats scraping into a file. Because the rows live in a table, each missing column gets filled by checking provider after provider until one returns a value, instead of relying on a single source that leaves gaps.
Each node adds a column the scraped page never had
| Company | URL |
|---|---|
| Northwind Labs | northwind.io |
| Cedar & Co | cedarco.com |
| Atlas Freight | atlasfreight.co |
Capture: Raw rows pulled straight from the page: company name and URL, the few fields the listing actually shows.
Each stage adds a column the scraped page never had, so the table that reaches your spreadsheet is complete rather than a list of names. That same coverage logic is what took one team's enrichment from roughly 40% to the high 80s, the difference between a half-blank export and a complete one.
Skip this step and you will do it by hand in the spreadsheet anyway, lookup by lookup. Run it in the table and the column is filled for every row before the file even exists.
Step 6: Export clean rows to CSV, Google Sheets, or Excel
The export format is a downstream decision, not a default, and the three options behave differently in ways that matter. A one-time file and a live connection are not interchangeable. Plan the destination by how the data will be used: who else needs it, whether it changes, and whether you want it to keep refreshing on its own.
Pick the export format by how you'll actually use the data
CSV download
One-timeA one-time transfer, a backup, or any tool with no native integration.
Universal: opens in Excel or Sheets and imports almost anywhere.
Google Sheets sync
LiveA collaborative, always-current sheet your team works in.
A live link: add, look up, or update rows on a schedule.
Excel
One-timeOffline analysis, pivots, and formula-heavy work in one file.
Open the exported CSV; refresh by re-exporting when the table changes.
The one that surprises people is the Google Sheets sync. It is not a download, it is a live link, so rows stay current instead of going stale the day after you pull them. Pair it with a scheduled re-scrape and a competitor price page or directory keeps refreshing on its own, which no spreadsheet formula and no one-time export can do.
Common failure modes, and how to avoid them
Most spreadsheet-scraping projects fail in the same few places, and all of them trace back to treating extraction as the whole job. Knowing the patterns is most of the fix.
Five ways spreadsheet scraping fails, and the fix for each
0 / 5 flipped