Free Data, Premium UX: The Business Model Hiding in Public Datasets

Public-data arbitrage: the raw material is free and the access layer is the product. Here’s how BuiltWith, ImportGenius, and 2026 micro-SaaS founders built $299–1,999/mo subscriptions on public datasets.


I looked across about ten of my own projects and noticed something: twice, I'd built essentially the same business without recognizing it. Free public data, a normalization layer, a B2B buyer who'd rather pay $300 a month than spend a weekend wrangling raw files. BuiltWith perfected this exact pattern in 2007; a fresh crop of founders rediscovered it in 2025 and 2026.

The model is called public-data arbitrage: the raw material is free and the access layer is the product. In 2026, with 361,525 public datasets on Data.gov and AI agent ecosystems creating a new distribution channel that didn't exist three years ago, the window is wider than most founders realize.

Key Takeaways

  • Public-data arbitrage: collect a legally free but operationally painful dataset, build the access layer, charge a B2B buyer to skip the work.
  • The moat is time-series history nobody can backfill, the proprietary join across two ordinary public sources, freshness, and UX, in roughly that order of defensibility.
  • BuiltWith, built on public web headers, runs at $14M+ ARR with one employee. FreshFilings launched a 3M-entity API for under $50 a month in infrastructure.
  • Price at 10–20% of what it would cost the buyer to replicate the product themselves. That number is almost always higher than founders expect.
  • Start collecting now. Historical depth is the one moat you cannot buy back later.

The Model Has a Name: Public-Data Arbitrage

Here's the base case: SEC EDGAR contains every financial filing from every US public company, free, forever. Crunchbase charges $49–$99 a month. The gap between "legally free" and "operationally usable" is where the business lives.

The same gap exists at nearly every major public data source. US Customs releases bills of lading for every import shipment. ImportGenius normalizes them and charges $229–$1,999 a month.

SAM.gov publishes every US government contract and grant. GovWin charges enterprise pricing in the thousands per year for a cleaned, searchable version. Instrumentl does the same for Grants.gov at $299–$499 a month.

McKinsey estimates $3T+ in annual economic value from open data across seven global domains. The EU valued its open data market at €184 billion in 2019. Whoever builds the access layer captures that value; the original publishers collect almost none of it.

Five hundred-plus US companies built their core business directly on open government data, documented in the GovLab/NYU Open Data 500 dataset: genuine product companies whose raw material just happens to be public.

The Four-Rule Filter

Not every public dataset becomes a product. Before writing a line of code, check four conditions:

A B2B buyer with a defined budget. Real estate developers, government contractors, nonprofit grant officers, compliance teams, and B2B sales teams all have line-item budgets for data subscriptions. Consumer plays rarely work; consumers know the data is technically free and won't pay $179 a month to avoid a weekend of searching.

The simplest validator: is there already a competitor charging money for something adjacent? Crunchbase at $99 a month is proof that someone pays for normalized company data.

Genuinely painful raw data. The raw source must be actively inaccessible, not just unclean. EDGAR ships 18GB XML dumps with inconsistent XBRL tagging.

SAM.gov requires account registration and bulk exports that need real engineering to parse. Building permit data is scattered across 522 separate municipal portals in incompatible formats. If the raw source is a clean API anyone can hit in a weekend, there's no wedge for a product.

At least one defensible moat type. Without history, joins, freshness, or UX on top of the raw data, you have a feature, not a product. A dataset a competitor can replicate next week with no structural disadvantage has no floor under the price.

Ships on your existing stack. FreshFilings launched on FastAPI and Postgres on Render for under $50 a month in infrastructure. The Municipal Signal Shop model runs on cron jobs, Playwright, and a simple subscription backend. Businesses that stall are the ones that build custom data infrastructure before talking to a single paying customer.
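The "incompatible formats" pain from the second rule is easier to judge with a concrete shape in mind. Here is a minimal sketch of a normalization layer, assuming two hypothetical permit portals (`portal_a`, `portal_b`) whose field names and date formats I made up for illustration; real portals will differ and usually need more cleanup than this.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Permit:
    """Unified record that every portal gets mapped into."""
    jurisdiction: str
    permit_id: str
    issued: date
    address: str


# Hypothetical per-portal mappings: which raw field feeds which unified field.
FIELD_MAPS = {
    "portal_a": {"id": "PermitNo", "issued": "IssueDate",
                 "address": "SiteAddr", "date_format": "%m/%d/%Y"},
    "portal_b": {"id": "permit_number", "issued": "date_issued",
                 "address": "location", "date_format": "%Y-%m-%d"},
}


def normalize(jurisdiction: str, raw: dict) -> Permit:
    """Map one raw portal row onto the unified Permit schema."""
    m = FIELD_MAPS[jurisdiction]
    issued = datetime.strptime(raw[m["issued"]].strip(), m["date_format"]).date()
    # Collapse repeated whitespace and uppercase so addresses compare cleanly.
    address = " ".join(raw[m["address"]].upper().split())
    return Permit(jurisdiction, str(raw[m["id"]]).strip(), issued, address)


row_a = {"PermitNo": " B-1001 ", "IssueDate": "03/09/2026", "SiteAddr": "12  main st"}
row_b = {"permit_number": "77", "date_issued": "2026-03-09", "location": "9 Oak Ave"}
permits = [normalize("portal_a", row_a), normalize("portal_b", row_b)]
```

Each new jurisdiction becomes one more entry in the mapping table rather than a new code path, which is roughly what keeps a 522-portal problem tractable for one person.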

The Map: Free Source to Product to Price

| Public source | What makes it painful | Product | Entry price | Top tier |
|---|---|---|---|---|
| SEC EDGAR | 18GB XML dumps, inconsistent XBRL | Crunchbase | $49/mo | $99/mo |
| US Customs bills of lading | 2B+ manifests, no entity resolution | ImportGenius | $229/mo (USA) | $1,999/mo |
| SAM.gov / USAspending.gov | Account-gated bulk exports, no analysis layer | GovWin (Deltek) | Enterprise | $1,000s/yr |
| Grants.gov / CORDIS | Thousands of programs, no matching layer | Instrumentl | $299/mo | $499/mo |
| Common Crawl + public web headers | Petabytes of unprocessed HTML | BuiltWith | $295/mo | $995/mo |
| Secretary of State databases (50 states) | 50 separate systems, mixed formats | FreshFilings | $29/mo | $299/mo |
| Municipal permit offices | 522+ jurisdictions, no unified API | PermitStack | Free tier | Paid |
| OpenStreetMap | Routing, freshness, and SDK layers missing | Mapbox | Usage-based | Enterprise |

The correct pricing frame is 10–20% of what it would cost the buyer to replicate the access layer themselves: ImportGenius Enterprise at $1,999 a month versus hiring a trade analyst at $60K+ a year. State that number out loud before setting a price. It will almost always be higher than your instinct.
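The 10–20% rule is simple arithmetic, and writing it down makes the number harder to flinch from. This is a small sketch; the function name and the choice to express the result per month are my own assumptions, and the replication cost is whatever you estimate for your buyer.

```python
def price_band(replication_cost_per_year: float,
               low: float = 0.10, high: float = 0.20) -> tuple[float, float]:
    """Return the (low, high) monthly price implied by the 10-20% rule."""
    if replication_cost_per_year <= 0:
        raise ValueError("replication cost must be positive")
    return (round(replication_cost_per_year * low / 12, 2),
            round(replication_cost_per_year * high / 12, 2))


# Using the $60K analyst salary from the text as the replication cost.
floor, ceiling = price_band(60_000)
```

With a $60,000 replication cost, the band comes out at $500 to $1,000 a month. Salary alone understates what it costs a buyer to rebuild the access layer, so treat that input as a floor: a fully loaded estimate (which you would have to work out yourself) pushes the band higher.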

The Moat Isn't the Model: What Actually Defends the Business

The a16z counterargument is worth taking seriously: data moats weaken for most products over time. But a16z also notes an exception: assembling and standardizing large pools of public datasets creates a scale effect that emerging competitors must recreate from the ground up. That carve-out covers nearly every business in the table above.

The four moat types form a hierarchy:

| Moat type | Defensibility | Why |
|---|---|---|
| Time-series history | Highest | Cannot be backfilled; a 17-year gap is permanent |
| Proprietary join | Most profitable | Two unremarkable public sources produce one premium product |
| Freshness | Most common | Lag between a public dump and real-time is directly monetizable |
| UX / API | Floor | Minimum differentiation from raw public access |

History is irreplicable. BuiltWith has 18+ years of tech-stack data across 491.9M+ root domains. A competitor starting today can replicate current state; nobody can backfill that span.

This moat starts accumulating the moment your first collection run completes. Every day you delay is a day of historical depth you cannot recover.
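In practice, "accumulating history" mostly means never overwriting yesterday's row. Below is a minimal sketch of an append-only snapshot store using SQLite from the standard library; the table layout, the `status` field, and the function names are assumptions for illustration, not how BuiltWith or anyone else actually stores history.

```python
import sqlite3
from datetime import date
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the snapshot table: one row per entity per collection day."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS snapshots (
               entity_id   TEXT NOT NULL,
               observed_on TEXT NOT NULL,  -- ISO date, so string order = date order
               status      TEXT NOT NULL,
               PRIMARY KEY (entity_id, observed_on)
           )"""
    )
    return conn


def record_run(conn: sqlite3.Connection, observed_on: date, rows: dict[str, str]) -> None:
    """Append one collection run; a rerun on the same day never rewrites history."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO snapshots VALUES (?, ?, ?)",
            [(eid, observed_on.isoformat(), status) for eid, status in rows.items()],
        )


def status_as_of(conn: sqlite3.Connection, entity_id: str, day: date) -> Optional[str]:
    """Answer the question a late competitor cannot: what did we see on that day?"""
    row = conn.execute(
        "SELECT status FROM snapshots WHERE entity_id = ? AND observed_on <= ? "
        "ORDER BY observed_on DESC LIMIT 1",
        (entity_id, day.isoformat()),
    ).fetchone()
    return row[0] if row else None


conn = open_store()
record_run(conn, date(2026, 1, 1), {"acme": "active"})
record_run(conn, date(2026, 1, 2), {"acme": "dissolved"})
```

The point of the design is the composite primary key: current state is just the latest row, while every earlier row is depth that cannot be collected retroactively.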

Joins are the most monetizable. Job postings plus tech-stack fingerprints equal "companies using Salesforce that are hiring sales reps": a sellable lead list that neither source produces alone. That combination is BuiltWith's core B2B product; neither job postings nor public web headers command much value separately, yet the combined product sells at $995 a month.
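Stripped to its core, that join is a set intersection on a shared key. Here is a toy sketch, assuming both sources have already been normalized to a company domain; the domains, field names, and the `hiring_leads` function are invented for illustration.

```python
def hiring_leads(tech_stack: dict[str, set[str]],
                 job_postings: list[dict],
                 technology: str,
                 role_keyword: str) -> list[str]:
    """Domains that use `technology` AND have an open role matching `role_keyword`."""
    users = {domain for domain, techs in tech_stack.items() if technology in techs}
    hiring = {p["domain"] for p in job_postings
              if role_keyword.lower() in p["title"].lower()}
    return sorted(users & hiring)


# Hypothetical, already-normalized inputs.
tech_stack = {
    "acme.example": {"Salesforce", "Shopify"},
    "globex.example": {"HubSpot"},
    "initech.example": {"Salesforce"},
}
job_postings = [
    {"domain": "acme.example", "title": "Sales Representative"},
    {"domain": "globex.example", "title": "Sales Rep"},
    {"domain": "initech.example", "title": "Backend Engineer"},
]

leads = hiring_leads(tech_stack, job_postings, "Salesforce", "sales")
```

Only `acme.example` survives both filters. The hard part in a real product is not the intersection but the entity resolution that gets both sources onto the same key.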

A new distribution angle appeared in 2026. FreshFilings ships a Model Context Protocol (MCP) server so AI agents can verify business registrations as a tool call, a format the builder expects will become table stakes for any public-data API. Founders who build MCP-compatible APIs for public datasets in 2026 gain distribution through AI agent ecosystems that neither BuiltWith nor ImportGenius could have plugged into at launch.
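To make "as a tool call" concrete: MCP runs over JSON-RPC, and an agent invokes a tool with a `tools/call` request. The sketch below handles that one message shape with only the standard library. It is not FreshFilings' implementation; the tool name `verify_registration`, its arguments, and the registry data are all hypothetical, and a real server would use an MCP SDK, which also covers initialization and tool listing.

```python
import json

# Hypothetical registry data; a real server would query its own database.
REGISTRY = {("DE", "ACME HOLDINGS LLC"): "active"}


def verify_registration(state: str, name: str) -> str:
    """Look up a business registration status by state and legal name."""
    return REGISTRY.get((state.upper(), name.upper()), "not_found")


def handle(request_json: str) -> str:
    """Answer a single JSON-RPC request for the one tool this sketch exposes."""
    req = json.loads(request_json)
    reply = {"jsonrpc": "2.0", "id": req.get("id")}
    if req.get("method") != "tools/call":
        reply["error"] = {"code": -32601, "message": "Method not found"}
    elif req["params"].get("name") != "verify_registration":
        reply["error"] = {"code": -32602, "message": "Unknown tool"}
    else:
        args = req["params"]["arguments"]
        status = verify_registration(args["state"], args["name"])
        # Tool results come back as a list of content items.
        reply["result"] = {"content": [{"type": "text", "text": status}],
                           "isError": False}
    return json.dumps(reply)


request = json.dumps({
    "jsonrpc": "2.0", "id": 7, "method": "tools/call",
    "params": {"name": "verify_registration",
               "arguments": {"state": "de", "name": "Acme Holdings LLC"}},
})
response = json.loads(handle(request))
```

The agent never sees your schema or your scraping pipeline, only a named tool with typed arguments, which is why the access layer translates so directly into this channel.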

Freshness is the most common differentiation point. ImportGenius tracks container shipments in near-real time, while the CBP bulk release runs days or weeks behind. Start indexing nightly at launch; don't wait until a customer asks.

UX is the floor. Raw EDGAR is 18GB XML. Crunchbase is a clean JSON API with webhooks, Salesforce integration, and CSV export.

The transformation from unusable to usable is the product. What these businesses actually defend is the cost to replicate the access layer: the engineering, normalization, entity resolution, and historical depth. Call it an access moat rather than a data moat; the description is more precise.

The Honest Part

Niche down or lose to incumbents. The Municipal Signal Shop model is the clearest argument for this. The instinct is to launch with the most comprehensive dataset possible.

The wedge is the opposite: one buyer type, one signal, one geography. Thirty subscribers at $299 a month from one metro plus one signal type (LLC filings, building permits, restaurant inspections) is $8,970 MRR before you touch features two through ten. Incumbents can't profitably serve thirty-subscriber micro-niches; a solo founder can.

Check licensing before building. Works of the US federal government (EDGAR, Census, SAM.gov, USPTO) are public domain under 17 U.S.C. § 105. Common Crawl is run by a nonprofit, not a federal agency, and is governed by its own terms of use. UK Companies House has a free API that explicitly encourages commercial use; EU TED and CORDIS operate under open data frameworks.

The GDPR risk sits in EU public registries that contain director names and addresses: document your legitimate interest basis before selling EU-targeted lead lists, and get counsel for anything involving individual identifiers at scale. Check each data source's terms for "commercial use" and "resale" language before building anything.

hiQ Labs v. LinkedIn (2022) held that scraping publicly accessible, logged-out pages likely does not violate the CFAA, and Meta v. Bright Data (2024) found that Meta's terms of service did not bar logged-out scraping of public data. Logged-in scraping and fake accounts are where the legal risk concentrates.

The hardest conversation is the "free data = free product" objection. Benjamin Cave of the Open Data Institute documented this as the top conversion barrier for open-data startups:

"An objection from customers is that if the data is already out there, we no longer have a problem accessing it, so why should we purchase a service which is offering essentially to do for money what we can do ourselves for free? Good value propositions for open data startups stress and emphasize the convenience and the time-saving and the cost-saving potentially of using them rather than going directly to the data."

Open Data Institute webinar, @19:42

Will Richards spent four years aggregating Australian startup funding data before building Deals OS. He described his reasoning to The Business of Content in 2026: make niche information accessible to the professionals who need it. Deals OS now covers more Australian deals than Crunchbase.

Your buyer isn't comparing your $299 a month to $0. They're comparing it to twelve hours of their weekend, every week, forever. Name the alternative accurately and the price holds.

Distribution doesn't solve itself. BuiltWith grew through B2B SEO on "what technology does X use?" queries; FreshFilings routes through developer communities on dev.to and GitHub. The access moat only matters if someone finds the product; first customers come from one narrow channel, not comprehensive coverage.

How to Start This Weekend

  1. Pick a buyer, not a dataset. Find someone with a defined, recurring problem, then trace it back to its public data source. Demand-led entry (buyer pain first) consistently outperforms supply-led (cool dataset, then find buyers).
  2. Understand the access model. EDGAR has a free REST API and bulk downloads over HTTPS, rate-limited to 10 requests per second; SAM.gov provides daily entity extracts; and Companies House has a commercial-use-friendly REST API. Municipal portals often require browser automation via Playwright or Skyvern; Common Crawl offers free S3 downloads under its own terms of use.
  3. Build the minimal access layer. The search or API you'd want to use yourself. No custom infrastructure before the first customer. FastAPI, Postgres, and a hosted platform like Render keep total infra costs under $50 a month.
  4. Charge before feature #2. Stripe plus one price tier. The unit economics are not complicated: thirty customers at $299 a month is $8,970 MRR before you add a second data source or a second geography.
  5. Start the nightly collection job today. The history moat accumulates from run one. Every week you wait is a week of historical depth your competitors can never replicate. This is the one task worth completing before anything else is validated.
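For steps 2 and 5 together, the first thing a nightly job needs is to stay inside the source's rate limit. Here is a small throttle sketch sized for a 10-requests-per-second ceiling like EDGAR's; the class and its injectable `clock` and `sleep` parameters are my own construction, and you should check each source's current access rules rather than trust a number hard-coded here.

```python
import time


class Throttle:
    """Space calls at least 1/rate seconds apart."""

    def __init__(self, rate_per_second: float,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate_per_second
        self.clock = clock          # injectable so the logic can be tested
        self.sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self.clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self.sleep(delay)
        # Schedule the following slot from whichever is later: now or the old slot.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay


throttle = Throttle(rate_per_second=10)
# In the nightly job, call throttle.wait() immediately before each request.
```

Run it from cron once a night and write the results to an append-only store, and the history moat starts building before anything else is validated.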
