May 17, 2026

Historical Prediction Market Data for Backtesting: Kalshi and Polymarket Sources

Kalshi did $23.8 billion in trading volume across 97 million trades in 2025, up 1,108% and 1,680% year over year, respectively (RootData, 2025). Polymarket recorded 95 million on-chain transactions in 2025 alone, with monthly market creation peaking at 38,270 new markets in October (ChainCatcher, 2025; World Metrics, Oct 2025). The historical data exists. Knowing where to pull it from, and what each source actually costs, is the difference between a real backtest and a toy one.

Most "prediction market data" guides cover a single platform or pitch a single vendor. This post is the side-by-side map for historical prediction market data. It covers every viable source for Kalshi and Polymarket history: official APIs, third-party aggregators, on-chain queries, and the academic datasets you probably haven't seen. It also covers the trap that wrecks most Polymarket backtests if you don't know about it: Paradigm flagged in December 2025 that most Dune dashboards double-count Polymarket volume, overstating it by roughly 2x.

**Key Takeaways**

- Kalshi's public REST API is free with no auth required for market/orderbook/trade reads; rate limit is ~20 requests/second on the Basic tier ([Kalshi Docs](https://docs.kalshi.com/getting_started/rate_limits))
- Polymarket's official Subgraph on The Graph costs $0 up to 100k queries/month, then ~$0.00002/query ([The Graph Billing](https://thegraph.com/docs/en/subgraphs/billing/))
- Dune Analytics' free tier (2,500 credits/month) is enough for most Polymarket exploratory work, but full-history pulls require paid tiers ([Dune FAQ](https://docs.dune.com/learning/how-tos/pricing-faqs))
- **Paradigm published research in December 2025 showing most Dune dashboards double-count Polymarket volume**; backtesters need to divide reported volume by 2 ([Paradigm Research](https://www.paradigm.xyz/2025/12/polymarket-volume-is-being-double-counted), Dec 2025)

Historical prediction market data source map

Five tiers of access. Each trades cost, depth, and friction differently:

| Tier | Source | Free? | Coverage | Best For |
| --- | --- | --- | --- | --- |
| 1 | Official REST/WebSocket APIs | Yes (rate-limited) | Real-time + ~3mo live | Latest data, live execution |
| 2 | Official historical endpoints | Yes | Full history | Trade-level Kalshi backtests |
| 3 | The Graph subgraphs | Free under 100k queries/mo | Full Polymarket history | Polymarket queries at scale |
| 4 | Dune Analytics SQL | Free under 2,500 credits/mo | Full Polymarket history | Aggregations, dashboards |
| 5 | Third-party aggregators | Paid (Lychee, Kingsets) | Pre-built bulk datasets | Skip the ETL, pay for time |

Pick by what you're optimizing. If you can write Python, tiers 1-3 are free. If you want to skip a weekend of plumbing, tier 5 is faster. The DIY path is straightforward — we covered the cost math for small accounts in our affordable bots guide.

Kalshi Data Sources

Official REST API

The Kalshi REST API at api.kalshi.com/trade-api/v2 is the canonical source. Public reads (markets, orderbook, trades) require no auth. Portfolio data and fills require an API key.

Historical endpoints: Live data is available for 3 months through standard endpoints. Anything older lives at /historical/markets, /historical/markets/{ticker}/candlesticks, /historical/trades, /historical/fills, /historical/orders, and /historical/cutoff (Kalshi Docs).
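Walking any of these endpoints comes down to following a cursor until the API stops handing one back. The sketch below shows that loop in isolation. It assumes each page returns a list of records plus a next-page cursor; `fetch_page` is a hypothetical callable you would implement with your HTTP client of choice, and the in-memory `fake_fetch` only stands in for the network call.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# fetch_page is a hypothetical callable supplied by the caller. It takes a cursor
# (None for the first page) and returns (records, next_cursor). An empty or
# missing next_cursor is assumed to mean the last page has been reached.
FetchPage = Callable[[Optional[str]], Tuple[List[Dict], Optional[str]]]


def walk_pages(fetch_page: FetchPage, max_pages: int = 1_000_000) -> Iterator[Dict]:
    """Yield every record by following cursors until none is returned."""
    cursor: Optional[str] = None
    for _ in range(max_pages):  # hard cap so a misbehaving API cannot loop forever
        records, cursor = fetch_page(cursor)
        yield from records
        if not cursor:
            break


# Offline stand-in for the HTTP call: two pages, then no cursor.
def fake_fetch(cursor: Optional[str]) -> Tuple[List[Dict], Optional[str]]:
    pages = {
        None: ([{"id": 1}, {"id": 2}], "c1"),
        "c1": ([{"id": 3}], None),
    }
    return pages[cursor]


all_trades = list(walk_pages(fake_fetch))
```

Keeping the HTTP call behind a callable makes the loop easy to test offline and lets you add retries or rate limiting in one place.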

Rate limits by tier:

| Tier | Read tokens/sec | Write tokens/sec |
| --- | --- | --- |
| Basic | 200 | 100 |
| Advanced | 300 | 300 |
| Premier | 1,000 | 1,000 |
| Paragon | 2,000 | 2,000 |
| Prime | 4,000 | 4,000 |

Default request cost is 10 tokens, so the Basic tier works out to ~20 req/s, which is fine for backfilling historical data overnight (Kalshi Docs). Fetched one trade per request, the full 97M-trade history at 20 req/s would take roughly 56 days of continuous polling. In practice each paginated request returns a batch of trades, so walking Kalshi's pagination cursors brings that down to days, not weeks.
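The backfill arithmetic is easy to rerun for your own tier and page size. This sketch uses the Basic-tier numbers from the table above; the page size of 1,000 trades per request is an assumption for illustration, not a documented limit, so substitute whatever the endpoint actually allows.

```python
SECONDS_PER_DAY = 86_400


def backfill_days(total_trades: int, trades_per_request: int, requests_per_sec: float) -> float:
    """Days of continuous polling needed to fetch total_trades."""
    requests = -(-total_trades // trades_per_request)  # ceiling division
    return requests / requests_per_sec / SECONDS_PER_DAY


# Basic tier: 200 read tokens/sec at a default cost of 10 tokens per request.
basic_req_per_sec = 200 / 10

# Worst case from the text: one trade per request.
one_per_request_days = backfill_days(97_000_000, 1, basic_req_per_sec)

# Assumed page size of 1,000 trades per request (hypothetical value).
paged_days = backfill_days(97_000_000, 1_000, basic_req_per_sec)
paged_minutes = paged_days * 24 * 60
```

Under those assumptions the request budget itself stops being the bottleneck. Real pulls also spend time on retries, per-market iteration, and writing to disk, so treat the paged figure as a lower bound.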

Kalshi WebSocket

Real-time data lives at wss://api.kalshi.com/trade-api/ws/v2. Three channels matter for backtesters: ticker, orderbook_delta, and trade.

The server sends an initial orderbook_snapshot and then incremental orderbook_delta events. To reconstruct full historical order books from WebSocket data, you need to apply the deltas in sequence to the snapshot. A 2026 SSRN paper walks through the full LOB reconstruction methodology specifically for Kalshi (SSRN 6583921).
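A minimal sketch of the snapshot-plus-deltas replay looks like this. The message shapes are assumptions made for illustration (a `seq` counter, `yes`/`no` lists of `[price, quantity]` pairs in the snapshot, and `side`/`price`/`delta` fields on each delta); check them against the WebSocket docs before relying on them.

```python
from collections import defaultdict
from typing import Dict, List


def apply_snapshot(snapshot: Dict) -> Dict[str, Dict[int, int]]:
    """Build {side: {price: quantity}} from a snapshot message."""
    book = {"yes": defaultdict(int), "no": defaultdict(int)}
    for side in ("yes", "no"):
        for price, qty in snapshot.get(side, []):
            book[side][price] = qty
    return book


def apply_delta(book: Dict[str, Dict[int, int]], delta: Dict) -> None:
    """Apply one incremental change in place and drop emptied price levels."""
    levels = book[delta["side"]]
    levels[delta["price"]] += delta["delta"]
    if levels[delta["price"]] <= 0:
        del levels[delta["price"]]


def replay(snapshot: Dict, deltas: List[Dict]) -> Dict[str, Dict[int, int]]:
    """Rebuild the book by applying deltas in strict sequence order."""
    book = apply_snapshot(snapshot)
    expected = snapshot["seq"] + 1
    for d in sorted(deltas, key=lambda d: d["seq"]):
        if d["seq"] != expected:
            # A gap means a message was missed; the book can no longer be trusted.
            raise ValueError(f"sequence gap: expected {expected}, got {d['seq']}")
        apply_delta(book, d)
        expected += 1
    return book


snapshot = {"seq": 1, "yes": [[40, 100], [41, 50]], "no": [[58, 75]]}
deltas = [
    {"seq": 3, "side": "no", "price": 58, "delta": 25},
    {"seq": 2, "side": "yes", "price": 41, "delta": -50},
]
book = replay(snapshot, deltas)
```

Raising on a sequence gap is deliberate: silently skipping a missed delta produces a book that looks plausible but is wrong, which is worse for a backtest than a visible failure.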

WebSocket is real-time only — you can't pull history through it. If you want both live tick data and historical depth, you run WebSocket forward and REST backward.

Third-Party Kalshi Aggregators

Four sources worth knowing about:

Lychee Data: 36GB historical dataset, every trade since launch, no-code queries, CSV/Excel/JSON export. The fastest path if you don't want to write ETL code (Lychee Data Kalshi Guide).
Kingsets: Bulk CSV downloads of trades, markets, events, series. Updated daily by 03:00 UTC (Kingsets).
DeFi Rate: Free CSV export of aggregate volume charts. Good for sanity-checking your own pulls (DeFi Rate Kalshi Volume).
GitHub mickbransfield/kalshi: Community Python scripts to bulk-download market history. Free, MIT-licensed (GitHub).

For a working example of pulling 100 strategies' worth of Kalshi 15-minute data and backtesting them, see our Coinbase-spot edge case study — it walks through the data layer end-to-end.

Polymarket Data Sources

Polymarket is harder than Kalshi because the data lives in three places: smart contracts on Polygon, an off-chain orderbook (CLOB), and metadata APIs. Knowing which to query for what saves hours.

Polymarket Subgraph (The Graph)

The single most useful tool for systematic Polymarket backtests. Polymarket maintains an official subgraph on The Graph protocol with full history of trades, positions, and resolutions.

Main endpoint pattern: https://gateway.thegraph.com/api/{API_KEY}/subgraphs/id/Bx1W4S7kDVxs9gC3s2G6DS8kdNBJNVhMviCtin2DiBp

Five specialized subgraphs are hosted on Goldsky for different query patterns: Orders, Positions, Activity, Open Interest, PnL. The full GraphQL schema is on GitHub (Polymarket Subgraph GitHub).

Cost: The Graph's free tier gives you 100,000 queries/month. Overage is $20 per million queries — roughly $0.00002 per query (The Graph Billing). A full historical pull of Polymarket's 95M trades, paginated at 1,000 records per query, is ~95,000 queries. That fits under the free tier.
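The query-count estimate and the pagination pattern can be sketched together. Paging on `id_gt` rather than an offset is the usual approach for subgraphs. The entity name `trades` and its fields are hypothetical placeholders here, so swap in names from the published schema, and `run_query` is a stand-in for whatever function posts the payload to your gateway URL and returns the rows.

```python
import json
import math
from typing import Callable, Dict, Iterator, List

PAGE_SIZE = 1000  # records per query, matching the estimate in the text

# "trades", "id", and "timestamp" are hypothetical names used for illustration.
QUERY = """
query Page($first: Int!, $lastId: String!) {
  trades(first: $first, orderBy: id, orderDirection: asc, where: {id_gt: $lastId}) {
    id
    timestamp
  }
}
"""


def queries_needed(total_records: int, page_size: int = PAGE_SIZE) -> int:
    return math.ceil(total_records / page_size)


def build_payload(last_id: str) -> str:
    return json.dumps({"query": QUERY, "variables": {"first": PAGE_SIZE, "lastId": last_id}})


def paginate(run_query: Callable[[str], List[Dict]]) -> Iterator[Dict]:
    """Yield all rows, using the last seen id as the cursor for the next page."""
    last_id = ""
    while True:
        rows = run_query(build_payload(last_id))
        yield from rows
        if len(rows) < PAGE_SIZE:  # a short (or empty) page means we are done
            return
        last_id = rows[-1]["id"]


# Offline stand-in for the gateway: 2,500 rows with sortable string ids.
DATA = [{"id": f"{i:06d}", "timestamp": i} for i in range(2500)]
calls: List[str] = []


def fake_run(payload: str) -> List[Dict]:
    variables = json.loads(payload)["variables"]
    calls.append(variables["lastId"])
    matching = [r for r in DATA if r["id"] > variables["lastId"]]
    return matching[: variables["first"]]


fetched = list(paginate(fake_run))
full_history_queries = queries_needed(95_000_000)
```

With 95M records at 1,000 per query, `full_history_queries` comes to 95,000, which sits under the 100,000 free-tier cap but leaves little headroom for retries or schema exploration.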

Polymarket Gamma API

https://gamma-api.polymarket.com is the public metadata API. No auth needed. Rate limit is ~60 requests/minute unauthenticated (pm.wiki Polymarket API).

Best for: market slugs, condition IDs, question IDs, the ERC-1155 token IDs for Yes/No positions. Limited historical depth — primarily a snapshot of current state. Pair Gamma with the CLOB time-series endpoint for actual price history (Polymarket Gamma API Docs).

Polymarket CLOB API

https://clob.polymarket.com exposes the orderbook and price history. The endpoint backtesters care most about:

GET /prices-history takes parameters: market (asset ID), startTs, endTs, interval (max, 1d, 6h, 1h, 1w, 1m), and fidelity in minutes (default 1). Returns arrays of {t, p} (timestamp, price) (Polymarket CLOB Timeseries).
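As a rough sketch, building the request URL and parsing the response might look like the following. It uses only the parameters named above. The assumption that the `{t, p}` points sit under a top-level `history` key, with `t` in Unix seconds, should be verified against the timeseries docs, and the sample payload is made up for illustration.

```python
from datetime import datetime, timezone
from typing import Dict, List, Tuple
from urllib.parse import urlencode

BASE_URL = "https://clob.polymarket.com/prices-history"


def build_url(asset_id: str, start_ts: int, end_ts: int, fidelity: int = 60) -> str:
    """Compose the query string; fidelity is the bar size in minutes."""
    params = {"market": asset_id, "startTs": start_ts, "endTs": end_ts, "fidelity": fidelity}
    return f"{BASE_URL}?{urlencode(params)}"


def parse_history(payload: Dict) -> List[Tuple[datetime, float]]:
    """Convert {t, p} points into (UTC datetime, price) tuples.

    Assumes the points live under a "history" key and t is in Unix seconds.
    """
    return [
        (datetime.fromtimestamp(point["t"], tz=timezone.utc), float(point["p"]))
        for point in payload.get("history", [])
    ]


url = build_url("123", 1_700_000_000, 1_700_003_600)

# Made-up sample response used only to exercise the parser.
sample = {"history": [{"t": 1_700_000_000, "p": 0.42}, {"t": 1_700_003_600, "p": 0.45}]}
points = parse_history(sample)
```

Converting timestamps to timezone-aware UTC values up front avoids off-by-hours joins later when you line these prices up against Kalshi or spot data.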

For user-scoped trade data (your own fills), use GET /data/trades — this requires L2 auth.

Dune Analytics (Polymarket-Specific)

If you prefer SQL to GraphQL, Dune Analytics has the cleanest interface to on-chain Polymarket data.

Free tier: 2,500 credits/month plus API access. Paid tiers (Analyst $349/mo, Plus, Enterprise) raise the limit (Dune Pricing FAQ). Overage is $5/100 credits on Free, dropping to $1.596/100 on Plus.

Notable public dashboards:

dune.com/polymarket_analytics — official dashboards
dune.com/datadashboards/polymarket-overview — overview metrics
dune.com/filarm/polymarket-activity — user activity

For ready-to-fork SQL, the EigenAtlas GitHub repo has working queries for cumulative volume, 30-day moving averages, and user activity (EigenAtlas Polymarket SQL).

Polymarket On-Chain (Polygon Raw)

The deepest source: Polymarket's Conditional Tokens contract at 0x4D97DCd97eC945f40cF65F87097ACe5EA0476045 on Polygon (PolygonScan).

Index three events for resolution data: PositionSplit, PositionsMerge, PayoutRedemption. If The Graph indexer goes down or you want hard guarantees on data integrity, raw event logs from Polygon archive nodes are the source of truth. Bitquery is the most accessible alternative if you don't run your own indexer (Bitquery Polymarket API).

The Trap: Polymarket Volume Is Double-Counted

This one wrecks a lot of backtests. In December 2025, Paradigm published research showing most Dune dashboards double-count Polymarket volume because both sides of every trade are recorded separately on-chain (Paradigm Research, Dec 2025).

A buy from Alice and a sell from Bob in the same trade create two on-chain events. Naive Dune queries sum both, so "$22B" Polymarket volume is really ~$11B in unique trade value. Liquidity figures get inflated by the same factor.

**Practical fix:** When pulling Polymarket history from Dune for backtesting, divide reported volume by 2. Alternatively, query for `Trades` filtered to one side of the trade (e.g., `maker_side`) instead of aggregating both sides. The Paradigm methodology paper has the cleanest reference SQL.
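To make the correction concrete, here is a small sketch contrasting a naive sum with a single-side sum. The record layout (`trade_id`, `side`, `usd`) is hypothetical and the rows are invented; real Dune tables and subgraph entities use their own column names.

```python
from typing import Dict, List

# Invented rows: each trade appears twice, once per counterparty.
fills: List[Dict] = [
    {"trade_id": "t1", "side": "maker", "usd": 100.0},
    {"trade_id": "t1", "side": "taker", "usd": 100.0},
    {"trade_id": "t2", "side": "maker", "usd": 40.0},
    {"trade_id": "t2", "side": "taker", "usd": 40.0},
]


def naive_volume(rows: List[Dict]) -> float:
    """Sums every row, so each trade is counted once per side."""
    return sum(r["usd"] for r in rows)


def single_side_volume(rows: List[Dict], side: str = "maker") -> float:
    """Counts each trade once by keeping only one side's rows."""
    return sum(r["usd"] for r in rows if r["side"] == side)


naive = naive_volume(fills)
corrected = single_side_volume(fills)
```

Filtering to one side is generally safer than halving, because it stays correct even if some rows in your extract are already deduplicated.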

If your strategy assumes Polymarket markets are deeper than they actually are, your backtest fill rates will be wildly optimistic. This is exactly the kind of regime gap we covered in our overfitting playbook — the backtest doesn't know what it doesn't know.

Other Prediction Market Datasets Worth Knowing

Three sources outside Kalshi and Polymarket that systematic researchers use:

Manifold Markets (API docs) — Free, real-time, full history since December 2021. WebSocket at wss://api.manifold.markets/ws, REST at api.manifold.markets. Markets are play-money but the data is rich and the platform is alpha-stage so rate limits are generous. Useful for backtesting model logic before paying real fees on Kalshi or Polymarket.

PredictIt (rpredictit R package) — The classic political prediction market dataset (2014-present, mostly US elections). Public API at predictit.org/api/marketdata/all/ provides current quotes only. Historical CSV downloads are per-contract from their site, non-commercial license only — you can use it for research and academic work, not for live trading.

Iowa Electronic Markets (IEM) — Running since 1988. The University of Iowa research repository hosts replication datasets going back nearly four decades. Small markets, but the longest continuous history in prediction-market research.

For the deepest academic Polymarket dataset, the 2026 arXiv paper "Unlocking the Forecasting Economy" releases a three-layer dataset of market metadata, fills, and oracle resolutions (arXiv 2604.20421).

How to Bootstrap When Data Is Limited

The hardest backtest problem isn't sourcing data — it's having enough of it. Polymarket has 2,550 resolved markets out of 3,600 created (Polymarket Help). For some strategy types, that's not enough resolutions to get statistically significant validation. Four techniques:

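1. Cross-platform triangulation. Many Kalshi markets have Polymarket equivalents (BTC price predictions, election outcomes, Fed rate decisions). Build a dataset using both platforms' history together. That roughly doubles the price observations for the overlapping markets, though both venues settle on the same underlying event, so the count of independent resolutions stays the same.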

2. Spot data as proxy. Kalshi's BTC and ETH event contracts settle on spot prices. You can backtest signal logic on years of Coinbase or Binance spot data, then validate the prediction-market mechanics on the shorter Kalshi sample. We did exactly this in our Coinbase spot edge case study.

3. Synthetic resolution events. For sports or political markets, simulate resolutions by sampling from the historical distribution of similar markets. Less rigorous than real data — useful for parameter sensitivity testing.

4. Bootstrap resampling. Standard statistical technique: sample your existing trades with replacement to estimate the distribution of strategy returns. Lopez de Prado's PBO (probability of backtest overfitting) methodology, covered in our overfitting post, takes a related approach: it uses combinatorially symmetric cross-validation to reuse a thin dataset across many train/test splits.
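For the fourth technique, a percentile bootstrap fits in a few lines of standard-library Python. This is a sketch under simple assumptions: per-trade returns are treated as independent draws, which overlapping or correlated markets violate, and the sample returns are invented.

```python
import random
import statistics
from typing import List, Tuple


def bootstrap_mean_ci(
    returns: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean return."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    n = len(returns)
    means = sorted(
        statistics.fmean(rng.choices(returns, k=n))  # resample with replacement
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Invented per-trade returns for illustration only.
sample_returns = [0.05, -0.02, 0.03, 0.01, -0.04, 0.06, 0.02, -0.01]
ci_low, ci_high = bootstrap_mean_ci(sample_returns)
```

If the interval straddles zero, the sample cannot distinguish the strategy from noise, which is usually the case with only a handful of resolutions.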

The honest answer: if you have fewer than 30 resolutions for a strategy type, you don't have a strategy — you have a hypothesis. Bootstrap for sensitivity, paper trade for confirmation, then decide.

Survivorship Bias: The 1,050 Markets That Vanished

If you only pull current API state, you're missing the markets that were created, traded, and then never resolved or got delisted. Polymarket has roughly 1,050 such unresolved markets (Polymarket Help).

A backtest that ignores those is subject to survivorship bias — your historical universe is biased toward markets that resolved cleanly, which over-represents predictable outcomes. Eurekahedge's aggregation of hedge-fund survivorship research suggests survivorship bias can inflate returns by ~2%/year and Sharpe ratios by up to 0.5 (Eurekahedge).

For systematic backtests, pull market metadata via the subgraph using createdAt and resolvedAt filters and include the unresolved cohort in your universe. Treat unresolved markets as a separate failure mode in your strategy stats.
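Once the metadata is pulled, splitting the universe into cohorts is straightforward. The sketch below assumes each market record carries `createdAt` and a `resolvedAt` that is `None` when the market never resolved; the records themselves are synthetic, sized to match the counts cited above.

```python
from typing import Dict, List, Optional, Tuple


def split_universe(
    markets: List[Dict],
    created_from: Optional[int] = None,
    created_to: Optional[int] = None,
) -> Tuple[List[Dict], List[Dict]]:
    """Return (resolved, unresolved) markets created inside the given window."""
    in_window = [
        m for m in markets
        if (created_from is None or m["createdAt"] >= created_from)
        and (created_to is None or m["createdAt"] <= created_to)
    ]
    resolved = [m for m in in_window if m.get("resolvedAt") is not None]
    unresolved = [m for m in in_window if m.get("resolvedAt") is None]
    return resolved, unresolved


# Synthetic universe: 3,600 markets, of which the first 2,550 resolved.
markets = [
    {"id": i, "createdAt": i, "resolvedAt": i + 10 if i < 2550 else None}
    for i in range(3600)
]
resolved, unresolved = split_universe(markets)
unresolved_share = len(unresolved) / len(markets)
```

Reporting strategy stats separately for the unresolved cohort keeps capital that was locked up in never-settled markets visible instead of silently dropping it.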

What This Looks Like in Turbine Studio

We maintain a normalized Kalshi historical dataset internally: every trade since launch, indexed by market, ticker, and event. The Studio backtester runs against that data with proper fill-side modeling (no mid-quotes for taker simulation) and includes delisted markets in the universe.

For Polymarket-specific backtests, we mirror the subgraph and apply the Paradigm volume correction by default. Turbine users don't have to think about double-counting because the platform applies the fix automatically. See Turbine Studio plans.

The DIY path is fully viable — every source above is public. Building the ETL, schema normalization, and backtest infrastructure is the work. The point of paying for a managed platform is skipping that work, not getting access to data that's otherwise unavailable.

Frequently Asked Questions

What's the absolute minimum free data setup for Kalshi backtesting?

Kalshi's public REST API + a Python script using the historical endpoints. Total cost: $0 (plus a $5/month VPS if you want it running overnight). Coverage: every market since launch. The only friction is writing the pagination logic to walk through millions of trades efficiently.

Can I backtest Polymarket strategies without paying for Dune?

Yes. The Polymarket Subgraph on The Graph is free under 100,000 queries/month, which covers a full history pull for most strategies. Use GraphQL pagination (1,000 records per query) and you'll fit comfortably under the cap (The Graph Billing).

How recent is each platform's data?

Kalshi's /historical/cutoff endpoint tells you the exact cutoff timestamp. As of 2026, live data is the last 3 months; older data is in historical endpoints. Polymarket's subgraph is updated in near-real-time (block-by-block on Polygon, ~2-second blocks). Dune dashboards typically lag by 1-2 hours depending on the query.

Why does my Polymarket backtest show 2x the volume of the real market?

Almost certainly because you queried Dune without correcting for double-counting. Both sides of every Polymarket trade record separately on-chain. Divide reported volume by 2, or filter to single-side trades only (Paradigm Research, Dec 2025).

Is there a standard schema across Kalshi and Polymarket data?

No. Kalshi uses a binary YES/NO model with cents pricing (0-100¢). Polymarket uses a continuous 0-1 probability model with USDC settlement on Polygon. You have to normalize the schemas yourself. A workable common base is {event_id, market_id, side, price, size, timestamp, resolution_status}, with prices converted to a single scale.
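One way to express that common base is a small dataclass plus one adapter per venue. The target fields come from the schema above; the raw input keys (`price_cents`, `count`, `ts`, and so on) are hypothetical placeholders, since each API names its fields differently.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class NormalizedFill:
    event_id: str
    market_id: str
    side: str               # "yes" or "no"
    price: float            # probability scale, 0.0 to 1.0
    size: float
    timestamp: int          # Unix seconds
    resolution_status: str


def from_kalshi(raw: Dict) -> NormalizedFill:
    """Kalshi quotes in cents, so divide by 100 to reach the 0-1 scale."""
    return NormalizedFill(
        event_id=raw["event_id"],
        market_id=raw["ticker"],
        side=raw["side"].lower(),
        price=raw["price_cents"] / 100,
        size=float(raw["count"]),
        timestamp=int(raw["ts"]),
        resolution_status=raw.get("status", "unknown"),
    )


def from_polymarket(raw: Dict) -> NormalizedFill:
    """Polymarket prices are already on the 0-1 scale."""
    return NormalizedFill(
        event_id=raw["event_id"],
        market_id=raw["condition_id"],
        side=raw["outcome"].lower(),
        price=float(raw["price"]),
        size=float(raw["size"]),
        timestamp=int(raw["ts"]),
        resolution_status=raw.get("status", "unknown"),
    )


# Invented records for illustration.
k = from_kalshi({"event_id": "E1", "ticker": "MKT-1", "side": "YES",
                 "price_cents": 42, "count": 10, "ts": 1_700_000_000})
p = from_polymarket({"event_id": "E1", "condition_id": "0xabc", "outcome": "Yes",
                     "price": "0.42", "size": "10", "ts": 1_700_000_000})
```

Normalizing price to a single 0-1 scale at ingestion means downstream strategy code never has to know which venue a fill came from.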

The Bottom Line

The historical data exists, free or near-free, for both Kalshi and Polymarket. The work is knowing where to look and what each source actually delivers:

Kalshi: Public REST + WebSocket are free; full historical depth via /historical/* endpoints
Polymarket Subgraph: Free under 100k queries/month — fits most use cases
Dune Analytics: Free 2,500 credits/month for exploratory work; correct for double-counting
Third-party aggregators (Lychee, Kingsets): Pay to skip ETL
Manifold Markets: Free real-time, useful for model prototyping
Iowa Electronic Markets: Longest continuous history (since 1988)

If you're building systematic strategies, the data layer is the foundation. Skip it and your quant playbook sits on sand. Build it once, then you can backtest as many strategies as you want.

Or start with Turbine Studio and let us handle the data layer. We've already done the schema normalization, double-counting corrections, and survivorship-bias handling. The DIY path teaches you how everything works. The Studio path lets you skip to the strategy work.

This article is for educational purposes only. Trading prediction markets involves substantial risk of loss. Always verify data sources and pricing before relying on third-party tools.

By Ryan Bajollari
