This is the third post in a series about predicting stock returns. Previously, Tom shared his path from skepticism to getting value out of Stockfisher, and then how we iterated to a process that evaluates companies from scratch.
The stock valuation process Tom described in the last article requires 6 accurate forecasts for every company: revenue, margins, and payout ratios (dividends + share repurchases), both 5 years out and 10 years out.
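To make those inputs concrete, here is a minimal sketch of the six quantities as a data structure - in Python, with illustrative field names rather than Stockfisher's actual schema:

```python
from dataclasses import dataclass

@dataclass
class CompanyForecast:
    """The six forward-looking inputs the valuation described in the
    previous article depends on (field names are illustrative)."""
    ticker: str
    revenue_5y: float        # projected annual revenue 5 years out
    revenue_10y: float       # projected annual revenue 10 years out
    margin_5y: float         # projected margin 5 years out, as a fraction
    margin_10y: float        # projected margin 10 years out, as a fraction
    payout_ratio_5y: float   # dividends + buybacks as a share of earnings, 5y out
    payout_ratio_10y: float  # dividends + buybacks as a share of earnings, 10y out
```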
Today I want to share how we think about stock forecasting. What is the best possible accuracy we, or anyone, can reach on these forecasting questions?
Earlier, Tom wrote that his approach was informed by his Good Judgment Superforecaster background. Allow me to share a bit about mine. I first got interested in prediction markets in 2011 from early rationality writings on overcomingbias.com and lesswrong.com.
I've previously written about how, until FutureSearch started, making the most accurate forecasts possible was entirely a human art. The basic idea was to run competitions to select the best humans (who improve over time), and then give a large number of them the best incentives to study and forecast the questions you care about.
There have been explorations of using this aggregate-human approach to predict financial returns. Metaculus, where I served as CTO in 2022-2023, ran a trade signals tournament in 2020-2021. In the prediction market I ran at Google from 2019-2022, people submitted questions about stock prices so frequently that I eventually created a policy to blanket-reject them! And top forecasters I know have been hired by hedge funds.
But I don't think the aggregate-humans approach suits this domain for a few reasons:
- There are many stocks. Crowd approaches won't give you consistent forecasting methodologies across the different stocks, and Stockfisher has 900 and counting now.
- Each stock requires a vast amount of research. This can't be split across many people: for each company, every forecaster needs to read every critical document and form an assessment of the industry, the company's managers, and so on.
- Stock prices are a "Keynesian Beauty Contest". It's hard to predict how people will feel about a stock when how those people feel is a function of how other people feel, and so on.
Our solution, the one we use in Stockfisher, addresses exactly these three points.
The third point - why we forecast revenue, margins, and payouts instead of stock prices - was covered by Tom earlier in this series. So today I'll talk about the first two.
Both the number of stocks and the depth of research each one requires strongly point to systematizing the process. 900 stocks (with more coming) * 6 forecasts per stock is already about 5,400 forecasts, roughly 100 times larger than most forecasting tournaments. And that would only get us each forecast one time - we need them kept up to date. Even reading a single 10-K takes more time than forecasters put into most questions on a typical prediction market or forecasting platform.
Enter LLMs. The promise, and peril, of qualitative reasoning in software. As Tom wrote earlier, he, I, and the rest of FutureSearch are as skeptical of LLMs for this type of work as anyone.
It took significant developments at FutureSearch over the last two years before we even attempted applying LLMs to this domain. Our first forecaster, which we demoed to financial clients in Jan 2024, was clearly not up to the task. (It's amazing to think back to when we built with GPT-4-Turbo.) It did, however, beat human bettors on average on a series of 50 questions about the biggest world developments in 2024.
In mid-2024, we built our first "Deep Research" tool, long before any product existed with that name. It wasn't reliable enough to use in a domain as quality-sensitive as financial research, but it was a starting point for what agent loops, and reading hundreds of articles, could achieve with the LLMs available back then.
In Jan 2025 we started serving private clients with financial research using humans in the loop. We used our web research agents and orchestration both as thinking tools and to scale our research. For example, at one point, we researched all the major suppliers in the top 120 technology processes underlying the economy.
By June 2025, we had completed our two benchmarks: Deep Research Bench, for making our research agents accurate on present-day questions; and Bench to the Future, our past-casting framework (more on that in the next article). These let us quickly refine our basic research and forecasting agents and measure their progress.
One thing I now believe, though it is still controversial with Tom and other elite forecasters I know, is that the quality of present-day research matters more than judgment or calibration about future probabilities.
We see this in Stockfisher. We process tens of millions of tokens for each company, and well over half go to research - hunting through the 10-Ks, earnings calls, investor presentations, and historical financials for the most important information. (See our series of articles on our techniques here.) Each company uses dozens of web research agents, each performing many steps of analysis.
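To give a feel for the shape of that research stage, here is a minimal sketch of a single research pass - illustrative Python, with `llm_call` standing in for whatever model interface is used, not Stockfisher's production code:

```python
from typing import Callable, Dict

def research_company(ticker: str, sources: Dict[str, str],
                     llm_call: Callable[[str], str]) -> str:
    """Sketch of one research pass over a company's documents.
    `sources` maps a label ('10-K item 7', 'Q2 earnings call', ...) to its text;
    `llm_call(prompt) -> str` is whatever LLM interface you use."""
    notes = []
    for label, text in sources.items():
        # Extract only the facts that could move a long-horizon forecast.
        summary = llm_call(
            f"Company: {ticker}\nSource: {label}\n\n{text}\n\n"
            "List only the facts that would change a 5- or 10-year "
            "revenue, margin, or payout forecast for this company."
        )
        notes.append(f"[{label}]\n{summary}")
    # Synthesize the per-source notes into a single research brief.
    return llm_call(
        f"Company: {ticker}\nNotes:\n" + "\n\n".join(notes) +
        "\n\nWrite a concise research brief covering industry position, "
        "management quality, and historical financial trends."
    )
```

In practice each of those steps fans out into many agents and many more rounds of analysis; the point is simply that most of the token budget goes into reading and distilling what is already known.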
Our best forecasting approaches are token-intensive as well, though not quite as much. Each of the 6 forecasts per company is an ensemble of approaches, consuming large inputs and producing large outputs at each step. But still, as you can see in the app, the main body of work is studying the known, present-day information about the company.
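As one example of how an ensemble of estimates can be combined - a sketch assuming simple numeric point forecasts, not necessarily Stockfisher's aggregation method - a trimmed mean is a standard robust aggregate:

```python
from statistics import median

def aggregate_forecasts(estimates: list[float], trim: float = 0.2) -> float:
    """Combine several independent point forecasts for one quantity
    (e.g. revenue 5 years out) by trimming the extremes and averaging.
    A robust aggregate like this, or a plain median, is a common way
    to ensemble forecasting approaches."""
    if not estimates:
        raise ValueError("need at least one estimate")
    xs = sorted(estimates)
    k = int(len(xs) * trim)
    kept = xs[k:len(xs) - k] or [median(xs)]
    return sum(kept) / len(kept)

# Example: five approaches' estimates of revenue 5 years out, in $B
print(aggregate_forecasts([42.0, 45.5, 47.0, 48.2, 61.0]))  # ~46.9
```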
And finally, this process took months of running workflows, reading results, finding flaws, creating evals, fixing them, and iterating. There is no silver bullet. (And we don't think approaches like reinforcement learning will yet work in this domain - though again more in the next article on this.)
So, in brief, our stock "superforecasting" approach is a combination of crafting the right company research workflow according to forecasting best practices, improving the individual research and forecasting agents against evals, and a lot of elbow grease.
So, how well does it work? How accurate are our forecasts, and therefore our company valuations?
As always, we encourage you to read what Stockfisher says about stocks you know well, and decide for yourself. But next week we'll go into more detail on our past-casting framework, and the nature of back-testing systems that are built on LLMs.
