Is it legal to build a business on public data?

US federal data (EDGAR, Census, SAM.gov) is public domain under 17 U.S.C. § 105 (no copyright, commercial use permitted). UK Companies House explicitly encourages commercial use; EU TED and CORDIS operate under open data directives. Check each source's terms for "commercial use" and "resale" restrictions before building. GDPR applies to EU registry data containing personal details (director names, addresses); document a legitimate interest basis before selling EU-targeted lead lists.

What makes a public-data product defensible against competitors?

History first: nobody can backfill a time-series you started three years ago. Then proprietary joins across two sources that are worth little separately. Then freshness (real-time versus dump lag), then UX and API quality. Build in that order of priority.

Is this the same as an AI wrapper?

No. The input is free public data rather than a paid API. COGS are infrastructure (storage, compute, indexing), not per-inference fees. The moat compounds with time rather than eroding when a model provider cuts prices or changes terms. The underlying engineering challenge is normalization, entity resolution, and historical depth, not prompt engineering.

What's the easiest public dataset to start with for a solo founder?

US federal sources have the cleanest access infrastructure. EDGAR has a documented REST API and bulk downloads, rate-limited to 10 requests per second. SAM.gov provides daily entity extracts via API. Both are public domain with no ToS surprises on commercial use. Common Crawl is not a federal source, but it is freely available on S3 under its own terms of use.
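A 10-requests-per-second ceiling is easy to respect with a small throttle. The sketch below is one possible approach, not anything EDGAR prescribes: `Throttle` and `fetch_all` are hypothetical names, and `fetch` stands in for whatever HTTP call you actually use.

```python
import time


class Throttle:
    """Space calls so they never exceed max_per_second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock  # injectable so the throttle can be tested without waiting
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval


def fetch_all(urls, fetch, throttle):
    """Call fetch(url) for each URL, pausing between calls as needed."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

Running at a rate slightly below the published limit, for example `Throttle(8)`, leaves headroom for retries.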

How do you price a public-data product?

Price at 10–20% of what it would cost the buyer to replicate the access layer themselves. Calculate that replacement cost first, then set your price as a fraction of it. The number is almost always higher than you expect.

Can one person actually build and run this?

BuiltWith runs at $14M+ ARR with one employee. The real constraint is distribution, not engineering. Solve the distribution problem before building deeper data infrastructure.

What's the first signal that a niche public-data product could work?

Is there already a competitor charging money for something adjacent? That validates willingness to pay. You don't need a new category; you need a narrower version of what already sells, in a market the incumbent can't serve profitably at your scale.

When I look at the businesses in this piece, the common thread is patience. BuiltWith launched in 2007; Instrumentl took eleven years to become the default grant-matching tool for nonprofits. FreshFilings launched in 2025 applying the identical pattern to a problem that's existed since the first Secretary of State database went online. What niche public dataset have you been ignoring because it seemed too boring to be a business?

Free Data, Premium UX: The Business Model Hiding in Public Datasets

Public-data arbitrage: the raw material is free and the access layer is the product. Here’s how BuiltWith, ImportGenius, and 2026 micro-SaaS founders built $299–1,999/mo subscriptions on public datasets.

Updated June 15, 2026 · 10 min read

I looked across about ten of my own projects and noticed something: twice, I'd built essentially the same business without recognizing it. Free public data, a normalization layer, a B2B buyer who'd rather pay $300 a month than spend a weekend wrangling raw files. BuiltWith perfected this exact pattern in 2007; a fresh crop of founders rediscovered it in 2025 and 2026.

The model is called public-data arbitrage: the raw material is free and the access layer is the product. In 2026, with 361,525 public datasets on Data.gov and AI agent ecosystems creating a new distribution channel that didn't exist three years ago, the window is wider than most founders realize.

Key Takeaways

Public-data arbitrage: collect a legally free but operationally painful dataset, build the access layer, charge a B2B buyer to skip the work.
The moat is time-series history nobody can backfill, the proprietary join across two ordinary public sources, freshness, and UX, in roughly that order of defensibility.
BuiltWith runs at $14M+ ARR with one employee on public web headers. FreshFilings launched a 3M-entity API for under $50 a month in infrastructure.
Price at 10–20% of what it would cost the buyer to replicate the product themselves. That number is almost always higher than founders expect.
Start collecting now. Historical depth is the one moat you cannot buy back later.

The Model Has a Name: Public-Data Arbitrage

Here's the base case: SEC EDGAR contains every financial filing from every US public company, free, forever. Crunchbase charges $49–$99 a month. The gap between "legally free" and "operationally usable" is where the business lives.

The same gap exists at nearly every major public data source. US Customs releases bills of lading for every import shipment. ImportGenius normalizes them and charges $229–$1,999 a month.

SAM.gov publishes every US government contract and grant. GovWin charges enterprise pricing in the thousands per year for a cleaned, searchable version. Instrumentl does the same for Grants.gov at $299–$499 a month.

McKinsey estimates $3T+ in annual economic value from open data across seven global domains. The EU valued its open data market at €184 billion in 2019. Whoever builds the access layer captures that value; the original publishers collect almost none of it.

The GovLab/NYU Open Data 500 dataset documents more than five hundred US companies that built their core business directly on open government data: genuine product companies whose raw material just happens to be public.

The Four-Rule Filter

Not every public dataset becomes a product. Before writing a line of code, check four conditions:

A B2B buyer with a defined budget. Real estate developers, government contractors, nonprofit grant officers, compliance teams, and B2B sales teams all have line-item budgets for data subscriptions. Consumer plays rarely work; consumers know the data is technically free and won't pay $179 a month to avoid a weekend of searching.

The simplest validator: is there already a competitor charging money for something adjacent? Crunchbase at $99 a month is proof that someone pays for normalized company data.

Genuinely painful raw data. The raw source must be actively inaccessible, not just unclean. EDGAR ships 18GB XML dumps with inconsistent XBRL tagging.

SAM.gov requires account registration and bulk exports that need real engineering to parse. Building permit data is scattered across 522 separate municipal portals in incompatible formats. If the raw source is a clean API anyone can hit in a weekend, there's no wedge for a product.

At least one defensible moat type. Without history, joins, freshness, or UX on top of the raw data, you have a feature, not a product. A dataset a competitor can replicate next week with no structural disadvantage has no floor under the price.

Ships on your existing stack. FreshFilings launched on FastAPI and Postgres on Render for under $50 a month in infrastructure. The Municipal Signal Shop model runs on cron jobs, Playwright, and a simple subscription backend. Businesses that stall are the ones that build custom data infrastructure before talking to a single paying customer.

The Map: Free Source to Product to Price

Public source	What makes it painful	Product	Entry price	Top tier
SEC EDGAR	18GB XML dumps, inconsistent XBRL	Crunchbase	$49/mo	$99/mo
US Customs bills of lading	2B+ manifests, no entity resolution	ImportGenius	$229/mo (USA)	$1,999/mo
SAM.gov / USAspending.gov	Account-gated bulk exports, no analysis layer	GovWin (Deltek)	Enterprise	$1,000s/yr
Grants.gov / CORDIS	Thousands of programs, no matching layer	Instrumentl	$299/mo	$499/mo
Common Crawl + public web headers	Petabytes of unprocessed HTML	BuiltWith	$295/mo	$995/mo
Secretary of State databases (50 states)	50 separate systems, mixed formats	FreshFilings	$29/mo	$299/mo
Municipal permit offices	522+ jurisdictions, no unified API	PermitStack	Free tier	Paid
OpenStreetMap	Routing, freshness, and SDK layers missing	Mapbox	Usage-based	Enterprise

The correct pricing frame is 10–20% of what it would cost the buyer to replicate the access layer themselves: ImportGenius Enterprise at $1,999 a month versus hiring a trade analyst at $60K+ a year. State that number out loud before setting a price. It will almost always be higher than your instinct.
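The 10–20% rule is simple arithmetic, and writing it down makes the replacement-cost assumption explicit. This is a minimal sketch: `monthly_price_band` is a hypothetical helper, and the $60,000 input is only the salary floor quoted above, with no overhead added.

```python
def monthly_price_band(annual_replacement_cost, low_pct=10, high_pct=20):
    """Return (low, high) monthly prices in whole dollars.

    Integer arithmetic avoids floating-point rounding surprises.
    """
    low = annual_replacement_cost * low_pct // 100 // 12
    high = annual_replacement_cost * high_pct // 100 // 12
    return low, high


# Assumed input: analyst salary alone, in dollars per year.
print(monthly_price_band(60_000))  # (500, 1000)
```

On salary alone the band tops out at $1,000 a month. A $1,999 tier sits at 20% only if the buyer's full replacement cost is closer to $120,000 a year, which is why the "+" in "$60K+" (tooling, engineering time, ongoing maintenance) deserves an honest estimate.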

The Moat Isn't the Model: What Actually Defends the Business

The a16z counterargument is worth taking seriously: data moats weaken for most products over time. But a16z also notes that assembling and standardizing large pools of public datasets creates a scale effect that emerging competitors must recreate from the ground up. That carve-out covers nearly every business in the table above.

The four moat types form a hierarchy:

Moat type	Defensibility	Why
Time-series history	Highest	Cannot be backfilled; an 18-year gap is permanent
Proprietary join	Most profitable	Two unremarkable public sources produce one premium product
Freshness	Most common	Lag between a public dump and real-time is directly monetizable
UX / API	Floor	Minimum differentiation from raw public access

History is irreplicable. BuiltWith has 18+ years of tech-stack data across 491.9M+ root domains. A competitor starting today can replicate current state; nobody can backfill that span.

This moat starts accumulating the moment your first collection run completes. Every day you delay is a day of historical depth you cannot recover.

Joins are the most monetizable. Job postings plus tech-stack fingerprints yield "companies using Salesforce that are hiring sales reps": a sellable lead list that neither source produces alone. That combination is BuiltWith's core B2B product; neither job-board listings nor Common Crawl headers command much value separately, yet the combined product sells at $995 a month.
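A toy version of that join shows how little code the core idea needs once both sources share a key. Everything here is illustrative: the sample rows, the `SALES_KEYWORDS` list, and the `hiring_sales_on` function are assumptions, and a real pipeline would spend most of its effort getting both feeds onto a reliable company-domain key.

```python
# Hypothetical sample rows; real inputs would come from a job-postings feed
# and a tech-stack crawl, both keyed by company domain.
job_postings = [
    {"domain": "acme.example", "title": "Sales Development Rep"},
    {"domain": "globex.example", "title": "Backend Engineer"},
    {"domain": "initech.example", "title": "Account Executive"},
]
tech_stacks = {
    "acme.example": {"Salesforce", "Stripe"},
    "globex.example": {"Salesforce"},
    "initech.example": {"HubSpot"},
}
SALES_KEYWORDS = ("sales", "account executive")


def hiring_sales_on(technology, postings, stacks):
    """Domains that run `technology` and have an open sales-type role."""
    leads = set()
    for post in postings:
        title = post["title"].lower()
        is_sales = any(keyword in title for keyword in SALES_KEYWORDS)
        if is_sales and technology in stacks.get(post["domain"], set()):
            leads.add(post["domain"])
    return sorted(leads)


print(hiring_sales_on("Salesforce", job_postings, tech_stacks))  # ['acme.example']
```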

A new distribution layer for this pattern appeared in 2026. FreshFilings ships a Model Context Protocol (MCP) server so AI agents can verify business registrations as a tool call, a format the builder expects will become table stakes for any public-data API. Founders who build MCP-compatible APIs for public datasets in 2026 gain distribution through AI agent ecosystems that neither BuiltWith nor ImportGenius could have plugged into at launch.

Freshness is the most common differentiation point. ImportGenius tracks container shipments in near-real time, while the CBP bulk release runs days or weeks behind. Start indexing nightly at launch; don't wait until a customer asks.

UX is the floor. Raw EDGAR is 18GB XML. Crunchbase is a clean JSON API with webhooks, Salesforce integration, and CSV export.

The transformation from unusable to usable is the product. What these businesses actually defend is the cost to replicate the access layer: the engineering, normalization, entity resolution, and historical depth. Call it an access moat rather than a data moat; the description is more precise.
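Entity resolution is the least glamorous part of that access layer, so a small sketch may help. The approach below is a deliberately naive assumption, not how any of the named products work: it normalizes company names by stripping punctuation and trailing legal suffixes, then groups records that share the normalized name. `LEGAL_SUFFIXES`, `normalize_name`, and `group_entities` are hypothetical.

```python
import re
from collections import defaultdict

# Illustrative suffix list; a production list would be far longer.
LEGAL_SUFFIXES = {"inc", "incorporated", "llc", "ltd", "limited",
                  "corp", "corporation", "co"}


def normalize_name(raw):
    """Lowercase, strip punctuation, and drop trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", raw.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def group_entities(records):
    """Map each normalized name to the source IDs that share it."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize_name(record["name"])].append(record["source_id"])
    return dict(groups)
```

Name matching alone will merge unrelated companies that happen to share a name, so a real resolver would add further signals such as registration numbers or addresses before treating two records as one entity.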

The Honest Part

Niche down or lose to incumbents. The Municipal Signal Shop model is the clearest argument for this. The instinct is to launch with the most comprehensive dataset possible.

The wedge is the opposite: one buyer type, one signal, one geography. Thirty subscribers at $299 a month from one metro plus one signal type (LLC filings, building permits, restaurant inspections) is $8,970 MRR before you touch features two through ten. Incumbents can't profitably serve thirty-subscriber micro-niches; a solo founder can.

Check licensing before building. US federal data (EDGAR, Census, SAM.gov, USPTO) is public domain under 17 U.S.C. § 105. Common Crawl is not a federal source; it is a nonprofit corpus distributed under its own terms of use, so read those separately. UK Companies House has a free API that explicitly encourages commercial use; EU TED and CORDIS operate under open data frameworks.

The GDPR risk sits in EU public registries that contain director names and addresses: document your legitimate interest basis before selling EU-targeted lead lists, and get counsel for anything involving individual identifiers at scale. Check each data source's terms for "commercial use" and "resale" language before building anything.

hiQ Labs versus LinkedIn (2022) and Meta versus Bright Data (2024) both came down on the side of scraping publicly accessible (logged-out) pages: the first found that such scraping likely does not violate the CFAA, the second that Meta's terms did not bar it. Logged-in scraping and fake accounts are where the legal risk concentrates.

The hardest conversation is the "free data = free product" objection. Benjamin Cave of the Open Data Institute documented this as the top conversion barrier for open-data startups:

"An objection from customers is that if the data is already out there, we no longer have a problem accessing it, so why should we purchase a service which is offering essentially to do for money what we can do ourselves for free? Good value propositions for open data startups stress and emphasize the convenience and the time-saving and the cost-saving potentially of using them rather than going directly to the data."

Open Data Institute webinar, @19:42

Will Richards spent four years aggregating Australian startup funding data before building Deals OS. He described his reasoning to The Business of Content in 2026: make niche information accessible to the professionals who need it. Deals OS now covers more Australian deals than Crunchbase.

Your buyer isn't comparing your $299 a month to $0. They're comparing it to twelve hours of their weekend, every week, forever. Name the alternative accurately and the price holds.

Distribution doesn't solve itself. BuiltWith grew through B2B SEO on "what technology does X use?" queries; FreshFilings routes through developer communities on dev.to and GitHub. The access moat only matters if someone finds the product; first customers come from one narrow channel, not comprehensive coverage.

How to Start This Weekend

Pick a buyer, not a dataset. Find someone with a defined, recurring problem, then trace it back to its public data source. Demand-led entry (buyer pain first) consistently outperforms supply-led (cool dataset, then find buyers).
Understand the access model. EDGAR has a free REST API and bulk downloads, rate-limited to 10 requests per second; SAM.gov provides daily entity extracts, and Companies House has a commercial-use-friendly REST API. Municipal portals often require browser automation via Playwright or Skyvern; Common Crawl offers free S3 downloads under its own terms of use.
Build the minimal access layer. The search or API you'd want to use yourself. No custom infrastructure before the first customer. FastAPI, Postgres, and a hosted platform like Render keep total infra costs under $50 a month.
Charge before feature #2. Stripe plus one price tier. The unit economics are not complicated: thirty customers at $299 a month is $8,970 MRR before you add a second data source or a second geography.
Start the nightly collection job today. The history moat accumulates from run one. Every week you wait is a week of historical depth your competitors can never replicate. This is the one task worth completing before anything else is validated.
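The nightly collection job in the last step can be very small. The sketch below is one way to do it under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in for the Postgres setup mentioned above, and the table layout and function names are hypothetical. The key property is that each run appends a dated snapshot and never overwrites an earlier day.

```python
import sqlite3
from datetime import date


def open_store(path=":memory:"):
    """Open (or create) the snapshot store."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS snapshots (
               snapshot_date TEXT NOT NULL,
               entity_id     TEXT NOT NULL,
               payload       TEXT NOT NULL,
               PRIMARY KEY (snapshot_date, entity_id)
           )"""
    )
    return conn


def record_snapshot(conn, rows, snapshot_date=None):
    """Append (entity_id, payload) rows for one day; existing days are kept."""
    day = (snapshot_date or date.today()).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO snapshots VALUES (?, ?, ?)",
            [(day, entity_id, payload) for entity_id, payload in rows],
        )


def history(conn, entity_id):
    """Return the full dated history for one entity, oldest first."""
    cursor = conn.execute(
        "SELECT snapshot_date, payload FROM snapshots "
        "WHERE entity_id = ? ORDER BY snapshot_date",
        (entity_id,),
    )
    return cursor.fetchall()
```

Scheduling `record_snapshot` from cron once a night is enough to start; the history query becomes the product feature a late entrant cannot reproduce.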

Frequently Asked Questions
