Reading Backtest Results

A backtest result is a decision aid. It should help you decide whether to reject, revise, or cautiously run a Studio strategy.

Do not stop at the largest PnL number. The useful information is usually in the assumptions, trade distribution, drawdown, and fill behavior.

Summary metrics

Common metrics include:

Metric        How to read it
PnL           Simulated profit or loss under the backtest model. Useful, but easy to overfit.
ROI           PnL relative to configured strategy risk. Compare only against similar risk settings.
Sharpe        Return relative to its volatility under the modeled path. Unreliable when the sample is small.
Max drawdown  Largest peak-to-trough loss. Often more important than headline PnL.
Win rate      Percent of profitable trades. Can be misleading without average win/loss size.
Trade count   Evidence level. Too few trades means the result is fragile.
Fees          Modeled cost of trading. Edges that vanish after fees are weak.
Fill count    Number of simulated fills. Shows how often the strategy would actually act.
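
To make the metric definitions concrete, here is a minimal Python sketch that derives several of them from a list of per-trade results. The function name `summarize`, its inputs, and the output keys are illustrative assumptions, not Studio's schema. It assumes each PnL value is already net of fees and that the Sharpe-style figure is a simple per-trade ratio, not annualized.

```python
from statistics import mean, stdev

def summarize(net_pnls, fees, configured_risk):
    """Summarize per-trade results (hypothetical inputs, not a Studio schema)."""
    # Walk the running equity to find the largest peak-to-trough loss.
    equity = peak = max_drawdown = 0.0
    for pnl in net_pnls:
        equity += pnl
        peak = max(peak, equity)
        max_drawdown = max(max_drawdown, peak - equity)

    wins = [p for p in net_pnls if p > 0]
    losses = [p for p in net_pnls if p < 0]
    total_pnl = sum(net_pnls)
    total_fees = sum(fees)
    gross = total_pnl + total_fees  # PnL before fees were deducted
    spread = stdev(net_pnls) if len(net_pnls) > 1 else 0.0

    return {
        "pnl": total_pnl,
        "roi": total_pnl / configured_risk,
        # Per-trade mean over standard deviation; not annualized.
        "sharpe_per_trade": mean(net_pnls) / spread if spread > 0 else None,
        "max_drawdown": max_drawdown,
        "win_rate": len(wins) / len(net_pnls) if net_pnls else None,
        "avg_win": mean(wins) if wins else 0.0,
        "avg_loss": mean(losses) if losses else 0.0,
        "trade_count": len(net_pnls),
        "fee_share_of_gross": total_fees / gross if gross > 0 else None,
    }

# Made-up example data: five trades, a fee of 1.0 each, configured risk of 100.
summary = summarize([10.0, -4.0, 6.0, -8.0, 12.0], [1.0] * 5, configured_risk=100.0)
```

In this made-up example the win rate is 60% and the PnL is 16, but the max drawdown is 8, half of the final PnL. That is the kind of detail a headline number hides.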

Equity curve

The equity curve shows the path, not just the ending.

Look for:

  • one sudden jump that explains the whole result,
  • long flat periods with no trades,
  • repeated drawdowns,
  • late recovery that hides poor earlier behavior,
  • performance concentrated near market close,
  • changes after a data source begins or ends.

A strategy with an attractive final number but an ugly path may still be a bad strategy.
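
Two of the checks above, a single jump that explains the result and repeated drawdowns, can be approximated in a few lines. This is a sketch under stated assumptions: the helper names and the drawdown threshold are hypothetical, and the equity curve is taken to be the running sum of per-trade PnL starting from zero.

```python
from itertools import accumulate

def equity_curve(pnls):
    """Running total of per-trade PnL, in trade order."""
    return list(accumulate(pnls))

def largest_step_share(pnls):
    """Share of final PnL contributed by the single best trade (None if PnL <= 0)."""
    total = sum(pnls)
    return max(pnls) / total if total > 0 else None

def count_drawdowns(curve, threshold):
    """Count separate episodes where equity falls at least `threshold` below its peak."""
    peak, in_drawdown, count = 0, False, 0
    for equity in curve:
        if equity >= peak:
            # Back at or above the old peak: the current episode is over.
            peak, in_drawdown = equity, False
        elif peak - equity >= threshold and not in_drawdown:
            count += 1
            in_drawdown = True
    return count

# Made-up example: one trade of +20 dominates an otherwise losing sequence.
pnls = [2, -1, 1, 20, -3, 2, -4]
curve = equity_curve(pnls)
jump_share = largest_step_share(pnls)
episodes = count_drawdowns(curve, threshold=1)
```

Here `jump_share` is greater than 1, which means the strategy would have lost money without its single best trade.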

Trade log

The trade log shows which individual trades produced the result.

Review:

  • entry timestamp,
  • exit timestamp,
  • side,
  • simulated fill price,
  • size,
  • reason for entry,
  • reason for exit,
  • fees,
  • PnL per trade,
  • market or event identifier.

If the winning trades all come from one market or one short period, reduce confidence.
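
One way to check for this kind of concentration is to group winning PnL by market. The sketch below assumes a hypothetical `Trade` record whose fields mirror the review list above; the field names and the example rows are invented for illustration and are not Studio's trade log format.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Trade:
    # Hypothetical fields mirroring the review list; not a real Studio schema.
    entry_ts: str
    exit_ts: str
    side: str
    fill_price: float
    size: float
    entry_reason: str
    exit_reason: str
    fees: float
    pnl: float
    market_id: str

def winning_pnl_share_by_market(trades):
    """Fraction of all winning PnL that came from each market."""
    winners = defaultdict(float)
    for trade in trades:
        if trade.pnl > 0:
            winners[trade.market_id] += trade.pnl
    total = sum(winners.values())
    return {m: v / total for m, v in winners.items()} if total > 0 else {}

# Made-up trade log rows for illustration only.
trades = [
    Trade("2024-01-01T10:00", "2024-01-01T11:00", "buy", 0.40, 100, "edge", "target", 0.5, 30.0, "BTC-A"),
    Trade("2024-01-01T12:00", "2024-01-01T13:00", "buy", 0.45, 100, "edge", "target", 0.5, 10.0, "BTC-A"),
    Trade("2024-01-02T10:00", "2024-01-02T11:00", "sell", 0.60, 100, "edge", "target", 0.5, 10.0, "ETH-B"),
    Trade("2024-01-02T12:00", "2024-01-02T13:00", "buy", 0.55, 100, "edge", "stop", 0.5, -5.0, "ETH-B"),
]
shares = winning_pnl_share_by_market(trades)
```

In this example 80% of winning PnL comes from one market. The same grouping can be applied to a time bucket instead of `market_id` to check for concentration in one short period.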

Execution assumptions

Every backtest has assumptions. Studio should make them visible enough for a user or AI agent to reason about the result.

Important assumptions include:

  • how fills are modeled,
  • whether the strategy takes liquidity or posts maker quotes,
  • how fees are applied,
  • how stale edge data is handled,
  • whether partial fills are modeled,
  • how market close is treated,
  • how unavailable data is skipped,
  • which markets were eligible.

When assumptions are strict, a backtest may be conservative. When assumptions are loose, the result needs more skepticism.
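
One way to keep assumptions visible is to store them as an explicit record next to the result. The sketch below is an assumption about how that could look, not Studio's configuration format: `ExecutionAssumptions`, its field names, the `"mid_price"` fill model label, and the wording of the notes are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class ExecutionAssumptions:
    # Hypothetical fields, one per assumption listed above.
    fill_model: str                        # e.g. "cross_spread" or "mid_price"
    takes_liquidity: bool                  # False means the strategy posts maker quotes
    fee_bps: float                         # modeled fee in basis points
    max_edge_age_seconds: Optional[float]  # None means stale data is never rejected
    models_partial_fills: bool
    stop_opening_seconds_before_close: float
    skips_missing_data: bool
    eligible_markets: Tuple[str, ...]

def skepticism_notes(a):
    """List the loose assumptions that call for extra skepticism."""
    notes = []
    if a.fill_model == "mid_price":
        notes.append("fills at mid price ignore the spread")
    if a.fee_bps == 0:
        notes.append("no fees are modeled")
    if a.max_edge_age_seconds is None:
        notes.append("stale edge data is never rejected")
    if not a.models_partial_fills:
        notes.append("every order is assumed to fill completely")
    if a.stop_opening_seconds_before_close == 0:
        notes.append("positions can open right up to market close")
    return notes

loose = ExecutionAssumptions("mid_price", True, 0.0, None, False, 0.0, True, ("BTC-A",))
strict = ExecutionAssumptions("cross_spread", True, 10.0, 30.0, True, 300.0, True, ("BTC-A",))
```

With these example values, `skepticism_notes(loose)` returns five notes and `skepticism_notes(strict)` returns an empty list.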

Comparing revisions

When you revise a strategy, change one major variable at a time:

  • exposure cap,
  • spread limit,
  • entry threshold,
  • exit timing,
  • data source,
  • market selector,
  • stop-opening window.

Then compare:

  • Did PnL improve because the thesis improved?
  • Did PnL improve only because risk increased?
  • Did drawdown fall?
  • Did trade count become too low?
  • Did fees become a smaller share of returns?
  • Did the strategy become more or less dependent on one event?
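
Several of these questions can be answered mechanically from two result summaries. The following is a rough sketch: the dictionary keys, the `min_trades` default of 20, and the assumption that `pnl` is net of fees are illustrative choices, not Studio conventions.

```python
def fee_share(result):
    """Fees as a share of gross PnL (None when gross PnL is not positive)."""
    gross = result["pnl"] + result["fees"]
    return result["fees"] / gross if gross > 0 else None

def compare_revisions(baseline, revised, min_trades=20):
    """Answer the comparison questions for two hypothetical result summaries."""
    base_roi = baseline["pnl"] / baseline["configured_risk"]
    new_roi = revised["pnl"] / revised["configured_risk"]
    base_fee, new_fee = fee_share(baseline), fee_share(revised)
    return {
        "pnl_improved": revised["pnl"] > baseline["pnl"],
        # If PnL rose while ROI fell, the gain came from taking more risk.
        "roi_improved": new_roi > base_roi,
        "risk_increased": revised["configured_risk"] > baseline["configured_risk"],
        "drawdown_fell": revised["max_drawdown"] < baseline["max_drawdown"],
        "enough_trades": revised["trade_count"] >= min_trades,
        "fee_share_fell": (
            base_fee is not None and new_fee is not None and new_fee < base_fee
        ),
    }

# Made-up summaries: the revision doubles the exposure cap.
baseline = {"pnl": 100.0, "configured_risk": 1000.0, "max_drawdown": 80.0, "trade_count": 40, "fees": 30.0}
revised = {"pnl": 150.0, "configured_risk": 2000.0, "max_drawdown": 120.0, "trade_count": 35, "fees": 40.0}
comparison = compare_revisions(baseline, revised)
```

In this example PnL improved, but ROI fell and drawdown rose, so the improvement came from added risk and not from a better thesis.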

Red flags

Reject or revise when:

  • the strategy needs huge exposure to work,
  • PnL disappears with realistic spreads,
  • one trade explains the whole result,
  • drawdown exceeds the user's tolerance,
  • the rule depends on stale data,
  • the strategy opens risk too close to resolution,
  • the backtest has too few trades,
  • performance improves only after removing guardrails.
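
Some of these red flags can be checked automatically before a human or AI agent reads the full report. The sketch below covers four of them. The summary keys, including `pnl_at_realistic_spread` (PnL from a rerun with wider spreads), and the default thresholds are hypothetical and should be set to match the user's own tolerance.

```python
def red_flags(summary, drawdown_tolerance, min_trades=20, max_single_trade_share=0.5):
    """Return the red flags raised by a hypothetical result summary."""
    flags = []
    if summary["max_drawdown"] > drawdown_tolerance:
        flags.append("drawdown exceeds the user's tolerance")
    if summary["trade_count"] < min_trades:
        flags.append("the backtest has too few trades")
    # Only meaningful when the overall result is positive.
    if summary["pnl"] > 0 and summary["best_trade_pnl"] / summary["pnl"] > max_single_trade_share:
        flags.append("one trade explains most of the result")
    if summary["pnl_at_realistic_spread"] <= 0:
        flags.append("PnL disappears with realistic spreads")
    return flags

# Made-up summaries for illustration.
fragile = {"pnl": 50.0, "max_drawdown": 30.0, "trade_count": 8,
           "best_trade_pnl": 45.0, "pnl_at_realistic_spread": -5.0}
sturdy = {"pnl": 50.0, "max_drawdown": 10.0, "trade_count": 60,
          "best_trade_pnl": 5.0, "pnl_at_realistic_spread": 35.0}
fragile_flags = red_flags(fragile, drawdown_tolerance=20.0)
sturdy_flags = red_flags(sturdy, drawdown_tolerance=20.0)
```

An empty list does not mean the strategy is good. The remaining flags, such as dependence on stale data or removed guardrails, still need a review of the assumptions and the trade log.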

What a good report says

A good AI-generated backtest report should sound like this:

The strategy generated positive simulated PnL over the tested window, but most returns came from two BTC events and drawdown reached 18% of configured risk. Trade count is 24, so the sample is moderate. Fees consumed 31% of gross edge. I would not increase size. The next useful test is a stricter spread filter and a smaller max position.

Not this:

Backtest passed. Deploying should be profitable.

Decision labels

Use clear labels:

Label         Meaning
Reject        The idea is weak, unsupported, or too fragile.
Revise        The thesis may be useful, but the current rules need work.
Paper follow  Watch live markets without placing risk.
Run small     Deploy with conservative limits and monitoring.
Scale later   Only after live behavior matches the backtest over time.

Backtesting is valuable because it slows down bad ideas. Treat that as a feature.