Implement base auto resolver by lbeuk · Pull Request #221 · Metaculus/forecasting-tools

lbeuk · 2026-03-10T03:54:47Z

Summary

Initial implementation of the auto-resolver, along with a basic TUI for interacting with the resolution output. The resolver uses the Agents SDK and follows this process:

Checks whether the question is presently resolvable: first by comparing the current date with the question's resolution date, and then by checking whether the question contains any implicit dates (e.g. "will event X happen before May 1st").
Runs an orchestration agent that has access to a researcher subagent and a resolver subagent.
The resolver subagent has its own subagent dedicated to cancelled questions, but, as noted below, this still needs work.
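The date-based pre-check in the first step can be sketched roughly as follows. This is a hypothetical illustration, not the PR's code: the function name is invented, the implicit deadlines are assumed to have already been extracted from the question text (the description does not say how), and it assumes that a passed implicit deadline makes a question checkable even when its formal resolution date is still in the future.

```python
from datetime import date


def is_presently_resolvable(
    today: date,
    scheduled_resolution_date: date,
    implicit_deadlines: list[date],
) -> bool:
    """Hypothetical two-stage gate run before any agent is invoked."""
    # Stage 1: the question's own scheduled resolution date has passed.
    if today >= scheduled_resolution_date:
        return True
    # Stage 2 (assumption): a deadline stated in the question text, such as
    # "before May 1st", has already passed, so the outcome should be settled
    # even though the formal resolution date is still in the future.
    return any(today >= deadline for deadline in implicit_deadlines)
```

Questions that fail this gate can be reported as "not yet resolvable" without spending any model calls, which matches how the ~67 spring questions below were handled.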

What works well:

Low rate of false positives/negatives
The TUI shows overall results and allows exploring the agent output on a per-question basis
The TUI allows exporting to a shortened markdown report

What needs work:

The resolver struggles with cancelled resolutions, both by cancelling questions that are not cancelled on Metaculus and by failing to cancel questions that are.
It similarly struggles with questions that are not yet resolvable (see the second image).

Supporting evidence

The following image shows the results of running on 60 randomly selected questions from the fall AIB tournament.

The following image shows the results of running on all current questions in the spring AIB tournament. Note that ~67 questions were automatically marked as not yet resolvable because their resolution dates had not yet passed.

The following images depict two instances where the auto-resolver picked up on an event that is not yet reflected on Metaculus for the spring AIB tournament.

(Backing validation)


hlbmtc

Looking great! A couple of small nits

hlbmtc · 2026-03-25T17:43:49Z

forecasting_tools/agents_and_tools/auto_resolver/tui/widgets/feed_panel.py

@@ -0,0 +1,189 @@
+"""Main content panel showing resolution status and live agent feed."""
+
+from __future__ import annotations


Let's remove this. The current Python version already supports annotations out of the box.

hlbmtc · 2026-03-25T18:30:00Z

forecasting_tools/agents_and_tools/auto_resolver/agentic/__init__.py

+        if isinstance(question, BinaryQuestion):
+            return await self._resolve_binary(question)
+        else:
+            return NotImplemented


I think we should change this to raise, so that an exception is triggered explicitly.
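A minimal sketch of the suggested change. The question classes are hypothetical stand-ins for the forecasting-tools types, the binary branch returns a placeholder instead of calling _resolve_binary, and NotImplementedError is one reasonable choice of exception rather than something the review specifies. The key difference: NotImplemented is a sentinel meant for binary dunder methods such as __eq__, so returning it hands the caller an object that looks like a result, whereas raising fails loudly at the call site.

```python
class Question:
    """Hypothetical stand-in for the question base class."""


class BinaryQuestion(Question):
    """Hypothetical stand-in for a binary question."""


class MultipleChoiceQuestion(Question):
    """Hypothetical stand-in for a question type the resolver does not support."""


async def resolve_question(question: Question) -> bool:
    if isinstance(question, BinaryQuestion):
        return True  # placeholder for `await self._resolve_binary(question)`
    # Raise instead of `return NotImplemented`, so unsupported types cannot
    # slip through as if they had been resolved.
    raise NotImplementedError(
        f"Auto-resolution is not implemented for {type(question).__name__}"
    )
```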

hlbmtc · 2026-03-25T18:43:31Z

forecasting_tools/agents_and_tools/auto_resolver/agentic/__init__.py

+    ) -> Optional[BinaryResolution]:
+
+        # Rephrase question if its time context has passed
+        question = await self._rephrase_question_if_needed(question)


Interesting. Isn’t Sonnet good enough to compose search queries without this extra step?

hlbmtc · 2026-03-25T18:55:41Z

forecasting_tools/agents_and_tools/minor_tools.py

+        )
+        searcher = AskNewsSearcher()
+        if cutoff_date is not None:
+            return await searcher.get_formatted_news_before_date_async(


Quick question: did you test this with a cutoff_date older than 48h? I remember AskNews had a bug where it returned [] when using historical=True together with end_timestamp. I might be mistaken, but I recall seeing something like that before.
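One way to answer this is a small regression check that searches with a cutoff well past the 48-hour mark and asserts that something comes back. This is a hypothetical sketch: check_historical_cutoff and the SearchFn shape are invented for illustration, the real AskNewsSearcher call would need a thin wrapper to fit it, and the assumption that cutoffs older than 48h take the historical path comes from the comment above, not from AskNews documentation.

```python
from collections.abc import Awaitable, Callable, Sized
from datetime import datetime, timedelta, timezone

# Hypothetical shape: an async search taking a query and a cutoff datetime,
# returning anything with a length (a list of articles or a formatted string).
SearchFn = Callable[[str, datetime], Awaitable[Sized]]


async def check_historical_cutoff(
    search: SearchFn, query: str, hours_back: int = 72
) -> bool:
    """Return True if a search with a cutoff older than 48h returns anything."""
    if hours_back <= 48:
        # Assumed threshold for the historical code path.
        raise ValueError("hours_back must exceed 48 to exercise historical search")
    cutoff = datetime.now(timezone.utc) - timedelta(hours=hours_back)
    results = await search(query, cutoff)
    # An empty result here is the symptom the reviewer remembers seeing.
    return len(results) > 0
```

Run against the live API with a query that is known to have older coverage, a False result would point to the bug described above.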

hlbmtc · 2026-03-25T19:30:40Z

forecasting_tools/agents_and_tools/auto_resolver/__init__.py

+       self.binary_theshold = binary_threshold
+       self.mc_threshold = mc_threshold
+
+    @abstractmethod


Nit: remove this. resolve_question is actually a concrete subclass's implementation here.

hlbmtc · 2026-03-25T20:49:13Z

forecasting_tools/agents_and_tools/auto_resolver/assess.py

+    Supports concurrent resolution of questions when assessing resolvers.
+    """
+
+    def __init__(self, resolver: AutoResolver, allowed_types: list[QuestionBasicType], questions: list[int | str] = [], tournaments: list[int | str] = [], max_concurrency: int = 3):


Mutable default arguments (= [])

hlbmtc · 2026-03-25T20:59:39Z

forecasting_tools/agents_and_tools/auto_resolver/agentic/__init__.py

Do we flag uncertain cases for human review? Do you think we should also add a certainty level for calibration purposes?
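A rough sketch of what both suggestions could look like, assuming the resolver agent is asked to report a certainty between 0 and 1 alongside its outcome. ScoredResolution, needs_human_review, the 0.8 threshold, and the outcome labels are all hypothetical. Routing every cancellation to review is a design choice motivated by the weak spot listed under "What needs work", not something the PR does. The Brier score is one standard way to check calibration once certainties are compared against Metaculus's actual resolutions.

```python
from dataclasses import dataclass
from typing import Literal

Outcome = Literal["yes", "no", "cancelled", "not_yet_resolvable"]


@dataclass(frozen=True)
class ScoredResolution:
    """Hypothetical output shape: a resolution plus a self-reported certainty."""

    outcome: Outcome
    certainty: float  # 0.0-1.0, as reported by the resolver agent
    rationale: str


def needs_human_review(resolution: ScoredResolution, threshold: float = 0.8) -> bool:
    """Flag low-certainty results, and all cancellations, for a person to check."""
    if not 0.0 <= resolution.certainty <= 1.0:
        raise ValueError("certainty must be between 0 and 1")
    return resolution.certainty < threshold or resolution.outcome == "cancelled"


def brier_score(scored: list[tuple[float, bool]]) -> float:
    """Mean squared gap between stated certainty and whether the call was right.

    Each pair is (certainty, matched_metaculus). Lower is better; 0 is perfect.
    """
    if not scored:
        raise ValueError("need at least one scored resolution")
    return sum((c - float(correct)) ** 2 for c, correct in scored) / len(scored)
```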

hlbmtc · 2026-03-25T21:50:59Z

forecasting_tools/agents_and_tools/auto_resolver/agentic/instructions.py

+
+        1. **Current Status**: What is the current state of affairs related to this question?
+        2. **Resolution Criteria**: Have the resolution criteria been met?
+        3. **Timeline Check**: Consider the scheduled resolution date and current date


Should we also pass the current date into the prompt, just in case?

hlbmtc · 2026-03-25T21:52:10Z

forecasting_tools/agents_and_tools/auto_resolver/agentic/__init__.py

+        model_for_resolver: str = "openrouter/anthropic/claude-sonnet-4.6",
+        model_for_output_structure: str = "openrouter/anthropic/claude-sonnet-4.6",
+        model_for_researcher: str = "openrouter/anthropic/claude-sonnet-4.6",
+        model_for_rephraser: str = "openrouter/anthropic/claude-sonnet-4.6",


A small nit: maybe we can switch it to something lighter, e.g. Haiku?

hlbmtc · 2026-03-25T21:59:56Z

forecasting_tools/agents_and_tools/auto_resolver/agentic/instructions.py

+
+        # Handoff
+
+        When you've gathered sufficient information, hand off to the resolver


Hm, I think a handoff is not defined for the research agent.
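In the Agents SDK, handoffs are declared on the agent itself (its handoffs list), so a prompt that says "hand off to the resolver" does nothing unless that list is populated. The fix is either to register the handoff or to drop the section from the researcher's prompt, since the orchestrator already coordinates both subagents. A cheap guard against this kind of drift is a test like the sketch below. AgentSpec and find_dangling_handoffs are hypothetical stand-ins so the check runs without the SDK installed; against real agents the same check would read each agent's instructions and handoffs.

```python
from dataclasses import dataclass, field


@dataclass
class AgentSpec:
    """Hypothetical stand-in for an agent definition (name, prompt, handoffs)."""

    name: str
    instructions: str
    handoffs: list["AgentSpec"] = field(default_factory=list)


def find_dangling_handoffs(agents: list[AgentSpec]) -> list[str]:
    """Names of agents whose prompt mentions a handoff but which define none."""
    dangling = []
    for agent in agents:
        text = agent.instructions.lower()
        mentions_handoff = "hand off" in text or "handoff" in text
        if mentions_handoff and not agent.handoffs:
            dangling.append(agent.name)
    return dangling
```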

https://github.com/lbeuk/forecasting-tools/blob/7d426c4a1402a75a9e6a832897a8600ce3f0e59a/forecasting_tools/agents_and_tools/auto_resolver/agentic/__init__.py#L332

Luke Beukelman and others added 16 commits February 14, 2026 08:55

Basic skeleton of a forecast resolver (46550cc)
Asyncio has been part of the standard library for awhile, removing from deps (576de9b)
Updated lock file for removing asyncio (40f1246)
Stashing changes for creating perplexity auto resolver (7e0551c)
Geting more detailed logs on failure (85149aa)
Pushing a report (9465ed8)
Commiting updates from yesterday (09f3fd8)
Added a tui (c5605b1)
Added a comment to __init__.py in the tui to indicate that it should not be treated as dependable software. (a9abc1b)
Merge branch 'main' into feat/auto-resolver (c8bff86)
Updates for the weekend, integrated asknews (77e77f3)
Addeding latest report (588340b)
Added a resolver for annulled/ambiguous specificity, also fixed some error logging (6154eb2)
Final commit for now (e9dbf00)
Merge remote-tracking branch 'origin/main' into feat/auto-resolver (77496d3)
Removing old reports (7d426c4)

lbeuk changed the title from Feat/auto resolver to Implement base auto resolver on Mar 10, 2026

hlbmtc reviewed Mar 25, 2026


lbeuk wants to merge 16 commits into Metaculus:main from lbeuk:feat/auto-resolver


2 participants
