Clay logo, go to homepage

Clay GTM guide

How to Scrape a Website into Excel or Google Sheets

The hard part of scraping into a spreadsheet was never pulling text off the page. It is getting clean, structured, deduplicated rows your spreadsheet can actually use. Here is the scrape, structure, enrich, then export method that still works when the markup changes.

May 17, 20269 min read

The hard part of scraping a website into a spreadsheet was never pulling text off the page. The hard part is getting clean, structured, deduplicated rows your spreadsheet can actually use. A copy-paste or a raw IMPORTXML gives you a brittle mess: merged cells, half-loaded pages, a formula that returns #N/A the morning the site changes one tag.

A scrape, structure, enrich pipeline gives you a table instead. This walks through getting real data into Excel or Google Sheets, why the consumer shortcuts break the moment you scale past one clean page, and the durable path that still works when the markup changes.

Why IMPORTHTML and IMPORTXML break the moment you scale

Spreadsheet scraping formulas only work on the one kind of page they were built for, and most pages are not that kind. IMPORTHTML reads a clean HTML table or list. IMPORTXML reads a static page and walks an XPath you hand it. Both assume the data is already sitting in the raw HTML the server sends, already structured, and frozen in place. That assumption holds for a Wikipedia table. It fails everywhere a real research target lives.

The failures stack up fast, and they are worth seeing as a set rather than discovering one at a time.

What a spreadsheet formula returns vs. what you actually needed

Formula handled: 0 / 5

There is a quieter failure too. Even when IMPORTXML works today, it is pinned to an XPath like //div[@class='price']. The day the site renames that class, every cell goes blank with no error, and you find out when the report built on top is wrong.

Step 1: Decide what data you need before you scrape anything

Start from the question, not the page. The most common scraping mistake is collecting data for its own sake, then realizing the columns you grabbed are not the ones the work needs. In Clay's framework, web scraping serves two of the four data jobs, find and enrich, and it only earns its place when you already know what the finished rows are for.

Write the objective as a sentence with the columns baked in. "Every commercial HVAC contractor in three metros, with company name, website, phone, and a verified owner email, for an outbound list" tells you exactly what the table needs and which scrape and enrichment steps get you there. "Scrape this directory" does not. The objective decides the method, and the method is the easy part.

Step 2: Capture the page, even when a formula can't

Pick the capture method by the page in front of you, not by habit. A clean public table, a prose page that needs reading, a paginated directory, and a site that blocks bots each call for a different tool. Reaching for the same one every time is how people end up fighting a page that a different method would have handled in seconds.

Answer a few questions to find the right capture method

1

Which best describes the page you want to capture?

Clay's internal hierarchy is a useful default: start with Claygent for general public pages, fall back to Zenrows when a site fights back, reach for an Apify actor when you need a specialized source in bulk, and use the Chrome extension or native scraper for clean structured pages. The point is the same regardless of which one fires: the rows land in a table, not a static file, so the next steps can run on them.

Step 3: Read messy pages with AI when there's no table to point at

When a page has no clean structure, an AI scraper reads it like a person and returns the fields you named. This is the method that makes "any website" true. Instead of mapping an XPath, you give the agent a URL and a list of fields, and it pulls them from prose, mixed layouts, and pages no point-and-click tool could parse. It reaches any publicly available data that is not behind a login.

A prose page turned into spreadsheet columns by AI

Company about-page · prose, no table

Northwind Labs builds industrial IoT sensors for the factory floor, founded in 2017 and based in Austin, Texas.

The team has grown to roughly 120-150 people across engineering and field ops, and just opened a second facility in Q1.

Under the hood they run analytics on Snowflake and manage outreach in HubSpot. There is no public pricing page.

Structured fields

AI scraping reads an unstructured page and returns the named columns you asked for, so prose pages need no XPath, no selectors, and no code.

The field list is the whole interface. You write it the way you would brief a colleague, and the agent returns one value per column. Here is a reusable prompt for structured extraction:

Claygent: structured extraction for a spreadsheet
Visit {{url}} and extract these as separate fields, one value per field:{{fields, e.g. "company name, HQ city, employee range, the tools theymention using, phone number, and the contact email if shown"}}.Return only what is present on the page. If a field is not found,return "Not found". Do not guess and do not pull from other pages.Keep each value clean enough to drop straight into a spreadsheet column.

Step 4: Structure and deduplicate so the rows are spreadsheet-ready

A scrape gives you raw rows; a usable table needs them cleaned, typed, and deduped first. This is the step the consumer tools skip and the reason a raw scrape lands in your sheet as a mess. Job titles arrive in fifteen formats, the same company shows up three times under three URL variants, and half the phone numbers carry stray characters. Dropped into Excel as-is, that is hours of manual cleanup before a single formula runs.

Doing it in a table before export turns the cleanup into rules that apply to every row at once: standardize messy fields, classify records (B2B vs. B2C, persona, industry) with an AI formula, merge fragmented columns into one, and collapse duplicates so each company or person appears once. Clay's transform step is built for exactly this, and it is where scraped content becomes consistent enough to trust. The output is a table whose columns line up with what you actually wanted in Step 1, not a pile of strings you will spend the afternoon untangling.

We did an analysis of on-prem trends regionalized across the world in 3 days. This market research analysis would have taken my team 3 months to do without Clay.

That speed comes from the structure happening inline. Mistral mapped global accounts and pulled regional firmographics in a fraction of the usual time because the data was being cleaned and shaped as it was collected, not after, in a separate cleanup pass that never quite finishes.

Step 5: Enrich the rows before they ever reach the spreadsheet

Scraped rows are thin; the columns that make a spreadsheet worth opening usually are not on the page you scraped. The directory gave you a company name and a URL. What the outbound list needs is a verified work email, a direct phone, headcount, and a tech stack, and none of that was sitting on the page. Enrichment is the difference between a list of names and a list you can act on, and it has to happen before export or you are back to manual research one row at a time.

This is where scraping into a workspace beats scraping into a file. Because the rows live in a table, each missing column gets filled by checking provider after provider until one returns a value, instead of relying on a single source that leaves gaps.

Each node adds a column the scraped page never had

Capture
Structure
Enrich
Export
3 rows
CompanyURL
Northwind Labsnorthwind.io
Cedar & Cocedarco.com
Atlas Freightatlasfreight.co
1

Capture: Raw rows pulled straight from the page: company name and URL, the few fields the listing actually shows.

Each stage adds a column the scraped page never had, so the table that reaches your spreadsheet is complete rather than a list of names. That same coverage logic is what took one team's enrichment from roughly 40% to the high 80s, the difference between a half-blank export and a complete one.

Skip this step and you will do it by hand in the spreadsheet anyway, lookup by lookup. Run it in the table and the column is filled for every row before the file even exists.

Step 6: Export clean rows to CSV, Google Sheets, or Excel

The export format is a downstream decision, not a default, and the three options behave differently in ways that matter. A one-time file and a live connection are not interchangeable. Plan the destination by how the data will be used: who else needs it, whether it changes, and whether you want it to keep refreshing on its own.

Pick the export format by how you'll actually use the data

CSV download

One-time

A one-time transfer, a backup, or any tool with no native integration.

Universal: opens in Excel or Sheets and imports almost anywhere.

Google Sheets sync

Live

A collaborative, always-current sheet your team works in.

A live link: add, look up, or update rows on a schedule.

Excel

One-time

Offline analysis, pivots, and formula-heavy work in one file.

Open the exported CSV; refresh by re-exporting when the table changes.

The one that surprises people is the Google Sheets sync. It is not a download, it is a live link, so rows stay current instead of going stale the day after you pull them. Pair it with a scheduled re-scrape and a competitor price page or directory keeps refreshing on its own, which no spreadsheet formula and no one-time export can do.

Common failure modes, and how to avoid them

Most spreadsheet-scraping projects fail in the same few places, and all of them trace back to treating extraction as the whole job. Knowing the patterns is most of the fix.

Five ways spreadsheet scraping fails, and the fix for each

0 / 5 flipped

Scrape any page into clean, enriched spreadsheet rows

Capture, structure, and enrich web data in Clay, then export complete rows to CSV, Google Sheets, or Excel.

Frequently asked questions

Can I scrape any website into Excel or Google Sheets?

Almost any publicly available page, yes. A clean HTML table can go straight in with a built-in import; anything more complex (JavaScript-rendered grids, paginated directories, prose pages, or sites that block bots) needs a real scraper rather than a spreadsheet formula. The reliable pattern is to capture the page into a table, clean and enrich the rows there, then export to your sheet, so what lands is complete instead of raw. The boundaries are login-gated data, copyrighted content, and anything a site's terms prohibit.

Why does my IMPORTXML or IMPORTHTML formula keep breaking?

Because it is pinned to a specific HTML structure and assumes the data is static. IMPORTHTML only reads a clean table or list, and IMPORTXML follows an XPath that breaks the moment the site renames a class or ships a redesign, usually with no error, just blank cells. Both also miss anything loaded by JavaScript after the page renders. Reading the page with AI extraction instead of hard-coded selectors survives markup changes that snap a formula.

How do I scrape a website to Excel or Sheets automatically, on a schedule?

Scrape into a workspace that supports scheduled runs, then connect it to your destination. A live Google Sheets sync updates rows on a schedule so the sheet stays current, and a scheduled re-scrape keeps the source data fresh, which is something a one-time CSV or a manual formula cannot do. For Excel, re-export the table when it changes, or sync to Sheets and reference it. The key is that the schedule lives where the scrape and enrichment run, not in the spreadsheet.

What's the difference between scraping into a spreadsheet and scraping into a table first?

A spreadsheet is built for finished data, not for the work that makes data finished. Scraping straight into Excel or Sheets leaves you to dedupe, standardize, and enrich by hand, row by row. Scraping into a table first lets you clean, classify, and enrich every row as rules that run once, then export the finished result. So the spreadsheet receives complete, deduped, enriched rows instead of a pile of strings.

Do I need to know how to code to scrape a website into a spreadsheet?

No. Built-in import formulas handle simple static tables with no code, and for everything harder, no-code scrapers and AI extraction do the work a formula cannot. With AI extraction you describe the columns you want in plain language and the agent returns them, no XPath, no selectors, no script. The only real skill is deciding what data you need and picking the capture method that fits the page.