Reading Backtest Results
A backtest result is a decision aid. It should help you decide whether to reject, revise, or cautiously run a Studio strategy.
Do not read the largest PnL number first and stop there. The useful information is usually in the assumptions, trade distribution, drawdown, and fill behavior.
Summary metrics
Common metrics include:
| Metric | How to read it |
|---|---|
| PnL | Simulated profit or loss under the backtest model. Useful, but easy to overfit. |
| ROI | PnL relative to configured strategy risk. Compare only against similar risk settings. |
| Sharpe | Average return relative to its volatility under the modeled path. Unreliable when the sample is small. |
| Max drawdown | Largest peak-to-trough loss. Often more important than headline PnL. |
| Win rate | Percent of profitable trades. Can be misleading without average win/loss size. |
| Trade count | How much evidence the result rests on. Too few trades means the result is fragile. |
| Fees | Modeled cost of trading. Edges that vanish after fees are weak. |
| Fill count | Number of simulated fills. Shows how often the strategy would actually act. |
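
As a minimal sketch of how several of these metrics relate, the Python below computes them from a list of per-trade PnL values. The `Summary` type, the `summarize` function, and the `configured_risk` parameter are hypothetical names for illustration, not part of Studio, and ROI is assumed here to mean net PnL divided by configured risk. The sample numbers are made up.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Summary:
    pnl: float
    roi: float
    win_rate: float
    avg_win: float
    avg_loss: float
    trade_count: int


def summarize(trade_pnls: list[float], configured_risk: float) -> Summary:
    """Summarize per-trade PnL values (hypothetical helper, not a Studio API)."""
    wins = [p for p in trade_pnls if p > 0]
    losses = [p for p in trade_pnls if p < 0]
    pnl = sum(trade_pnls)
    return Summary(
        pnl=pnl,
        # Assumption: ROI is PnL relative to the configured strategy risk.
        roi=pnl / configured_risk,
        win_rate=len(wins) / len(trade_pnls) if trade_pnls else 0.0,
        avg_win=mean(wins) if wins else 0.0,
        avg_loss=mean(losses) if losses else 0.0,
        trade_count=len(trade_pnls),
    )


# Four small wins and one large loss (illustrative data only).
summary = summarize([5.0, 5.0, 5.0, 5.0, -30.0], configured_risk=100.0)
print(summary)
```

In this sample the win rate is 80%, yet net PnL is negative because the single loss is six times the average win. That is the sense in which win rate misleads without average win and loss size.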
Equity curve
The equity curve shows the path, not just the ending.
Look for:
- one sudden jump that explains the whole result,
- long flat periods with no trades,
- repeated drawdowns,
- late recovery that hides poor earlier behavior,
- performance concentrated near market close,
- changes after a data source begins or ends.
An attractive final number with an ugly path may still be a bad strategy.
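
Two of the path checks above can be approximated numerically. The sketch below assumes the equity curve is available as a plain list of account values in time order; `max_drawdown` and `largest_step_share` are illustrative helper names, and the sample curve is invented.

```python
def max_drawdown(equity: list[float]) -> float:
    """Largest peak-to-trough drop along the curve, in the curve's own units."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, peak - value)
    return worst


def largest_step_share(equity: list[float]) -> float:
    """Share of the total gain explained by the single largest upward step.

    Returns 0.0 when the curve did not end higher than it started.
    """
    total = equity[-1] - equity[0]
    if total <= 0:
        return 0.0
    steps = [later - earlier for earlier, later in zip(equity, equity[1:])]
    return max(steps) / total


# Illustrative curve: mostly flat, with one jump near the end.
curve = [100.0, 102.0, 98.0, 99.0, 120.0, 118.0]
print(max_drawdown(curve))        # 4.0
print(largest_step_share(curve))  # greater than 1.0
```

A share above 1.0 means the single jump is larger than the whole net gain, so every other step combined lost money.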
Trade log
The trade log helps explain the result.
Review:
- entry timestamp,
- exit timestamp,
- side,
- simulated fill price,
- size,
- reason for entry,
- reason for exit,
- fees,
- PnL per trade,
- market or event identifier.
If the winning trades all come from one market or one short period, reduce confidence.
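
One way to check that kind of concentration is to group winning PnL by market. The sketch below assumes each trade log row is a dict with `market` and `pnl` keys. Those field names, the market identifiers, and the `win_concentration` helper are hypothetical, not a Studio export format.

```python
from collections import defaultdict


def win_concentration(trades: list[dict]) -> tuple[str, float]:
    """Return the market with the most winning PnL and its share of all winning PnL."""
    gains: dict[str, float] = defaultdict(float)
    for trade in trades:
        if trade["pnl"] > 0:
            gains[trade["market"]] += trade["pnl"]
    total = sum(gains.values())
    if total == 0:
        return "", 0.0
    top = max(gains, key=gains.get)
    return top, gains[top] / total


# Illustrative trade log rows (made-up identifiers and values).
trades = [
    {"market": "BTC-A", "pnl": 8.0},
    {"market": "BTC-A", "pnl": 6.0},
    {"market": "BTC-A", "pnl": -2.0},
    {"market": "ETH-B", "pnl": 1.0},
    {"market": "SOL-C", "pnl": 1.0},
]
top_market, share = win_concentration(trades)
print(top_market, share)  # BTC-A 0.875
```

The same grouping works for time buckets instead of markets if the concern is one short period rather than one market.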
Execution assumptions
Every backtest has assumptions. Studio should make them visible enough for a user or AI agent to reason about the result.
Important assumptions include:
- how fills are modeled,
- whether the strategy takes liquidity or posts maker quotes,
- how fees are applied,
- how stale edge data is handled,
- whether partial fills are modeled,
- how market close is treated,
- how unavailable data is skipped,
- which markets were eligible.
When assumptions are strict, a backtest may be conservative. When assumptions are loose, the result needs more skepticism.
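
To make the difference between strict and loose assumptions concrete, the sketch below prices the same round trip under two fill models: a loose one that fills at the mid price, and a stricter taker model that buys at the ask and sells at the bid. The function names, the quotes, and the fee model (a flat rate on traded notional) are assumptions for illustration and do not describe how Studio models fills or fees.

```python
def fill_price(side: str, bid: float, ask: float, model: str) -> float:
    """Simulated fill price for one order under a named fill model."""
    if model == "mid":
        # Loose assumption: fill at the midpoint, ignoring the spread.
        return (bid + ask) / 2
    if model == "taker":
        # Stricter assumption: cross the spread to take liquidity.
        return ask if side == "buy" else bid
    raise ValueError(f"unknown fill model: {model}")


def round_trip_pnl(entry_quote, exit_quote, size, model, fee_rate):
    """PnL of buying at entry_quote and selling at exit_quote; quotes are (bid, ask)."""
    entry = fill_price("buy", *entry_quote, model)
    exit_ = fill_price("sell", *exit_quote, model)
    # Assumed fee model: a flat rate on the notional of both fills.
    fees = fee_rate * size * (entry + exit_)
    return size * (exit_ - entry) - fees


entry_quote = (0.48, 0.52)  # illustrative bid and ask at entry
exit_quote = (0.53, 0.57)   # illustrative bid and ask at exit
loose = round_trip_pnl(entry_quote, exit_quote, size=100, model="mid", fee_rate=0.01)
strict = round_trip_pnl(entry_quote, exit_quote, size=100, model="taker", fee_rate=0.01)
print(round(loose, 2), round(strict, 2))  # 3.95 -0.05
```

With these made-up quotes, the trade is profitable at the mid price and slightly negative once the spread is crossed, which is why the fill model matters when reading the headline PnL.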
Comparing revisions
When you revise a strategy, change one major variable at a time:
- exposure cap,
- spread limit,
- entry threshold,
- exit timing,
- data source,
- market selector,
- stop-opening window.
Then compare:
- Did PnL improve because the thesis improved?
- Did PnL improve only because risk increased?
- Did drawdown fall?
- Did trade count become too low?
- Did fees become a smaller share of returns?
- Did the strategy become more or less dependent on one event?
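
The quantitative questions in that list can be sketched as a side-by-side check. The `Result` fields, the `compare` function, and the `min_trades` default of 20 are hypothetical choices for illustration. The sketch assumes PnL is reported net of fees, so fee share is computed against net PnL plus fees. Whether the thesis itself improved, and how dependent the result is on one event, still need a human or agent to read the trade log.

```python
from dataclasses import dataclass


@dataclass
class Result:
    net_pnl: float       # assumed to be PnL after fees
    risk: float          # configured strategy risk
    max_drawdown: float
    trade_count: int
    fees: float


def fee_share(result: Result) -> float:
    """Fees as a share of gross PnL (net PnL plus fees)."""
    gross = result.net_pnl + result.fees
    return result.fees / gross if gross > 0 else float("inf")


def compare(base: Result, revised: Result, min_trades: int = 20) -> dict[str, bool]:
    """Answer the yes/no comparison questions for two backtest results."""
    return {
        "pnl_improved": revised.net_pnl > base.net_pnl,
        # PnL per unit of risk separates a better rule from a bigger bet.
        "roi_improved": revised.net_pnl / revised.risk > base.net_pnl / base.risk,
        "drawdown_fell": revised.max_drawdown < base.max_drawdown,
        "enough_trades": revised.trade_count >= min_trades,
        "fee_share_fell": fee_share(revised) < fee_share(base),
    }


# Illustrative numbers: the revision doubles risk.
base = Result(net_pnl=50.0, risk=500.0, max_drawdown=40.0, trade_count=60, fees=25.0)
revised = Result(net_pnl=80.0, risk=1000.0, max_drawdown=90.0, trade_count=55, fees=45.0)
report = compare(base, revised)
print(report)
```

Here PnL rises, but PnL per unit of risk falls, drawdown grows, and fees take a larger share, which points to the second question in the list: the improvement came from more risk, not a better rule.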
Red flags
Reject or revise when:
- the strategy needs huge exposure to work,
- PnL disappears with realistic spreads,
- one trade explains the whole result,
- drawdown exceeds the user's tolerance,
- the rule depends on stale data,
- the strategy opens risk too close to resolution,
- the backtest has too few trades,
- performance improves only after removing guardrails.
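
Some of these red flags can be screened mechanically before a closer read. The sketch below covers only the ones that reduce to a number. The dictionary keys, the `red_flags` helper, and every threshold in `limits` are hypothetical and would need to be set from the user's own tolerance.

```python
def red_flags(result: dict, limits: dict) -> list[str]:
    """Return the numeric red flags that a backtest result trips."""
    flags = []
    if result["trade_count"] < limits["min_trades"]:
        flags.append("too few trades")
    if result["max_drawdown_pct"] > limits["max_drawdown_pct"]:
        flags.append("drawdown exceeds tolerance")
    if result["largest_trade_share"] > limits["max_trade_share"]:
        flags.append("one trade dominates the result")
    if result["pnl_with_realistic_spread"] <= 0:
        flags.append("PnL disappears with realistic spreads")
    return flags


# Illustrative result and limits (all values are made up).
result = {
    "trade_count": 24,
    "max_drawdown_pct": 18.0,
    "largest_trade_share": 0.6,
    "pnl_with_realistic_spread": 12.0,
}
limits = {"min_trades": 20, "max_drawdown_pct": 15.0, "max_trade_share": 0.5}
flags = red_flags(result, limits)
print(flags)
```

An empty list only means none of the numeric checks fired. Stale data dependence, risk opened too close to resolution, and removed guardrails still have to be checked against the rules and assumptions themselves.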
What a good report says
A good AI-generated backtest report should sound like this:
The strategy generated positive simulated PnL over the tested window, but most returns came from two BTC events and drawdown reached 18% of configured risk. Trade count is 24, so the sample is moderate. Fees consumed 31% of gross edge. I would not increase size. The next useful test is a stricter spread filter and a smaller max position.
Not this:
Backtest passed. Deploying should be profitable.
Decision labels
Use clear labels:
| Label | Meaning |
|---|---|
| Reject | The idea is weak, unsupported, or too fragile. |
| Revise | The thesis may be useful, but the current rules need work. |
| Paper follow | Watch live markets without placing risk. |
| Run small | Deploy with conservative limits and monitoring. |
| Scale later | Only after live behavior matches the backtest over time. |
Backtesting is valuable because it slows down bad ideas. Treat that as a feature.