Backtesting Forecasts That Use LLMs

Evaluating prediction accuracy when AI is trained on all world knowledge

This is the fourth post in a series about predicting stock returns. Previously, I shared our approach to forecasting stock fundamentals.

Today I want to talk about backtesting. For those not in quantitative finance, backtesting means evaluating a predictive model by asking: if this model had been run in the past, how would its predictions have measured up against what actually happened?

This method is absolutely core to how we develop our forecasting methods at FutureSearch. But because large language models are involved, the way it works may surprise people used to classical approaches like time-series forecasting or technical analysis.

First, a quick example of how backtesting normally works. Let's say you have a predictive model that says companies with a market cap of less than 10 times last year's earnings are undervalued. You can easily test this theory by taking historical financial data, listing all the companies that fell into this category at each point in time, and calculating what your returns would have been had you held them over that period.
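
To make this concrete, here is a minimal sketch of that screen as a backtest, assuming a hypothetical pandas DataFrame of historical fundamentals with one row per (date, ticker) and illustrative columns market_cap, trailing_earnings, and fwd_12m_return (the stock's return over the following 12 months); none of these names refer to a real dataset:

```python
import pandas as pd

def backtest_value_screen(fundamentals: pd.DataFrame) -> pd.Series:
    """Hypothetical backtest of a 'market cap < 10x earnings' screen.

    Expects one row per (date, ticker) with illustrative columns:
    market_cap, trailing_earnings, fwd_12m_return.
    """
    screened = fundamentals[
        fundamentals["market_cap"] < 10 * fundamentals["trailing_earnings"]
    ]
    # Equal-weight the screened names at each date and average their
    # forward returns to get the strategy's hypothetical return series.
    return screened.groupby("date")["fwd_12m_return"].mean()
```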

Of course, even if this synthetic strategy was profitable in the past, it may not be profitable in the future. In fact, as quants know, such strategies rarely are, for several reasons: overfitting to past sample data, miscalculating risk profiles, or the pattern simply breaking down because others notice it and trade it away.

Still, backtesting is the bread and butter of quantitative financial forecasting. But before we get to LLMs, you may ask: this works for quantitative models, but do people use it to evaluate human traders?

The answer is no, for a simple reason: a human trader cannot selectively forget the past. They cannot answer "What would I have done in 2016, before Brexit?" Human traders aren't evaluated by hypothetical past decisions; they are evaluated prospectively, based on the returns they generate over time.

And, despite the rise of high-frequency and algorithmic trading, a lot of trading is still done by humans. Warren Buffett has never been backtested, though trading strategies based on his philosophy have.

Backtesting Automated Warren Buffett

But what if you have an automated Warren Buffett, like Stockfisher? With regard to backtesting, it sits somewhere between quantitative models and human predictors. Our automated Warren Buffett is implemented in software (after extensive design, iteration, and QA from humans), yet it depends on LLMs, which are more like humans than conventional ML systems.

Backtesting comes down to the ability to forget. For statistical models, there's nothing to forget: the entire model is based on a fixed set of signals, and the "state of the world" is not part of the system. For humans, everything is done in the context of one's knowledge of the world, and there's no way to isolate a predictive theory to test.

LLMs can't forget or suppress knowledge either. (Though there is early research into selective forgetting in the mechanistic interpretability community. I'm keen to see the first "right to be forgotten" request from Europe lodged against a large language model!)

But LLMs do have training window cutoffs. Claude 4.5 Sonnet, our main LLM at FutureSearch and a key part of Stockfisher research, has a training window cutoff (also known as a knowledge cutoff) of July 2025, meaning it was not trained on any information generated after that point. Turn off web access, ask it who won the New York mayoral race in November 2025, and it's clear it doesn't have that information.
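
You can see this for yourself by calling the model with no tools attached, so it can only draw on its training data. A minimal sketch using the Anthropic Python SDK; the model identifier and prompt are illustrative:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# No web-search tool is attached, so the model can only answer from the
# knowledge baked in at training time.
response = client.messages.create(
    model="claude-sonnet-4-5",  # assumed identifier for Claude 4.5 Sonnet
    max_tokens=200,
    messages=[{
        "role": "user",
        "content": "Who won the November 2025 New York City mayoral race?",
    }],
)
print(response.content[0].text)  # expect an answer along the lines of "I don't know"
```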

This means you can evaluate a Claude 4.5 Sonnet-based forecasting system on predicting whether Mamdani will be the next mayor of New York. It doesn't know the answer, so it has a genuine chance to apply probabilistic forecasting techniques.

So how recent are the training window cutoffs of the LLMs that Stockfisher uses, or that any reasonable forecasting approach would use? Generally, they all fall within the last 12 months, and usually more recently than that. (GPT-5's training window cutoff, in 2024, is one of the oldest.)

This immediately tells us the time horizon over which LLMs can be backtested. A few months is doable, whereas backtesting events from a year ago or more would require using a previous generation of LLMs in the forecaster, which would be a drastic quality reduction.

There are a few caveats here. First, future LLMs available over APIs may have "continual learning", meaning constant retraining on news like Mamdani's NYC mayoral victory. That would effectively make them like humans, with whom backtesting is not possible. Second, LLMs can have data leakage. If the "July 2025" version of Claude actually receives minor fine-tuning updates to fix bugs or improve the chat experience, those updates could train on future data, such as people asking about Mamdani's victory, and that new information could be encoded into the model. In our tests, we have never seen this happen, so for now the approach works.

(Respected AI researcher Owain Evans has a fantastic proposal to train LLMs on data only up to 2019, and then have them try to predict things like Covid-19. This would enable backtesting LLM-based approaches over many years.)

So we can backtest LLM-based forecasters on their ability to predict the most recent events, like Mamdani's NYC mayoral victory. Except that as soon as the forecaster tries to research Mamdani's campaign on the internet, the winner will immediately be spoiled!

And this is exactly what any serious LLM-based forecaster would do when trying to predict the mayoral race. It would read Mamdani's Wikipedia article to learn about him, find the most recent polls, and so on. As I wrote previously in this series, a huge percentage of the tokens processed in forecasting goes to present-day research, not to judgment or modeling.

How can this be done without accessing the internet, given the vast number of sources that a forecaster needs to read to study the question?

Capturing the Internet for Backtesting

So to properly backtest an LLM-based forecaster that actually researches what it is forecasting, you need to prepare the data it will research from. Say, from the perspective of Claude 4.5 Sonnet's training window cutoff of July 2025: what is known about Mamdani as a NYC mayoral candidate?

Last year we wrote a piece poking holes in four AI forecasting evaluations that do this too lazily. Any information leaking in from the future completely breaks a backtest. Even the slightest hint about Mamdani's goings-on after July 2025 makes it invalid. So, for example, having the LLM agent do its research through a search engine like Google with a "date: before July 2025" filter leaks too much: articles written before that date can be updated afterwards, and even which results are returned, and their snippets, reflect the present.

Conventional backtesting doesn't require much data. In the quant strategy example from earlier, all you need are the P/E ratios of companies over time, and their stock prices, to set up a backtest. But an automated Warren Buffett must, as Tom wrote previously, spend many hours reading, reading, reading: more documents, more industry news, more analysis.

Our approach at FutureSearch is to capture this vast amount of research data ahead of time. So, in July 2025, we send our research agents out on the web to research everything about Mamdani we possibly can. Typically this produces about 10,000-20,000 URLs from across the web: mostly news, but also datasets, government filings, analyst reports, everything.

We then store the scraped content, 10,000+ URLs for each forecasting question, offline in a database. Then, when we train our forecasting agents, we have them "search" this corpus. To the agent, it's like searching the open web. But it has no internet access. It only has its trained knowledge of the world up to July 2025, and whatever it chooses to find and read from that dataset.
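
As an illustration of what such an offline "search" tool could look like, here is a minimal sketch that assumes the scraped pages were loaded into a SQLite FTS5 full-text index; the table and column names are ours for illustration, not FutureSearch's actual schema:

```python
import sqlite3

# Assumes pages were stored in a full-text table created roughly as:
#   CREATE VIRTUAL TABLE pages USING fts5(url, title, body);

def offline_search(db_path: str, query: str, limit: int = 10) -> list[dict]:
    """Return top matches from the frozen corpus, web-search style."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT url, title, snippet(pages, 2, '', '', '...', 32) "
        "FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
    conn.close()
    return [{"url": u, "title": t, "snippet": s} for u, t, s in rows]
```

The agent calls a tool like this in place of a live search engine, so all of its research stays confined to what was captured before the cutoff.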

If you'd like to read more, the details of this approach are in our paper Bench to the Future, from June 2025. It covers a dataset from spring 2025 with ~150 questions, hence a database of ~1.5 million URLs for the forecasting agents. We have recently completed the generation of our second question set, with 1,500 questions, so a database of ~15 million URLs. And we're designing new improvements to capture questions and research that will correlate more closely with Stockfisher's revenue, margin, and payout forecasts out to 2030 and 2035.

So How Well Does It Work?

The aforementioned paper has accuracy scores for our basic agents as of spring 2025. But we didn't share our top-performing approaches in that paper, and our approaches have improved since then.

And those accuracy scores do not directly translate to Stockfisher's accuracy, because (a) they measure predictions of major world events that inform company revenues and margins, not predictions of the revenues and margins themselves, (b) as described above, those forecasting questions resolve over months, not the 5+ years of the Stockfisher forecasts, and (c) the Stockfisher forecasting approach is based on our best general-purpose forecaster, but is tuned to be optimal for the Stockfisher use case.

And even if the training environment were completely faithful to the Stockfisher use case, you still shouldn't trust forecast accuracy based on benchmarks alone. All seasoned LLM users know that benchmarks never translate cleanly to real-world use, no matter how well designed they are.

So, you still have to read our forecasts and decide for yourself.

And if you want to see our prospective track record - forecasts we made in the past, and how they turned out - stay tuned.