Implement base auto resolver #221

Open
lbeuk wants to merge 16 commits into Metaculus:main from lbeuk:feat/auto-resolver

lbeuk commented Mar 10, 2026

Summary

Initial implementation of the auto-resolver, along with a basic TUI for interacting with the resolution output. The resolver uses the agents SDK and follows this process:

  • Checks whether the question is presently resolvable, first by comparing the current date with the question's resolution date, and then by checking for implicit dates in the question text (e.g. "will event X happen before May 1st")
  • Runs an orchestration agent that has access to researcher and resolver subagents
  • The resolver subagent has its own subagent dedicated to cancelled questions, which still needs work (see "What needs work")
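
For illustration, the date gate in the first step could look roughly like the sketch below. The function name `is_presently_resolvable` and its parameters are hypothetical and not taken from this PR; the sketch also assumes the implicit deadline has already been extracted from the question text by some earlier step.

```python
from datetime import datetime, timezone
from typing import Optional


def is_presently_resolvable(
    now: datetime,
    scheduled_resolution: datetime,
    implicit_deadline: Optional[datetime] = None,
) -> bool:
    """Return True when the date checks allow resolving the question at `now`.

    Hypothetical helper: the names and the exact rule are assumptions.
    """
    # Step 1: the scheduled resolution date has already passed.
    if now >= scheduled_resolution:
        return True
    # Step 2: an implicit deadline from the question text has already passed.
    return implicit_deadline is not None and now >= implicit_deadline


if __name__ == "__main__":
    now = datetime(2026, 3, 10, tzinfo=timezone.utc)
    scheduled = datetime(2026, 6, 1, tzinfo=timezone.utc)
    may_first = datetime(2026, 5, 1, tzinfo=timezone.utc)
    print(is_presently_resolvable(now, scheduled, may_first))
```

This only covers the date comparison; a question whose event has already happened before either date would still need the agent's research to be resolved early.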

What works well:

  • Low rate of false positives/negatives
  • TUI gives overall results and allows exploring the agent output on a per question basis
  • TUI allows exporting to a shortened markdown report

What needs work:

  • The resolver struggles with cancelled resolutions in both directions: it cancels questions that are not cancelled on Metaculus, and fails to cancel questions that are.
  • It similarly struggles with not-yet-resolvable questions; see the second image.

Supporting evidence

The following image shows the result of running on a random sample of 60 questions from the fall AIB tournament.

image

The following image shows the results of running on all current questions in the spring AIB tournament. Note that ~67 questions were automatically marked as not yet resolvable because their resolution date had not passed.

image

The following images depict two instances where the auto-resolver picked up on an event that is not yet reflected on Metaculus for the spring AIB tournament.

image image

(Backing validation)

image

lbeuk changed the title from "Feat/auto resolver" to "Implement base auto resolver" on Mar 10, 2026

hlbmtc left a comment

Looking great! A couple of small nits:

@@ -0,0 +1,189 @@
"""Main content panel showing resolution status and live agent feed."""

from __future__ import annotations

Let's remove this. The current Python version already supports annotations out of the box.

if isinstance(question, BinaryQuestion):
    return await self._resolve_binary(question)
else:
    return NotImplemented

I think we should change this to a raise, so that unsupported question types explicitly trigger an exception.
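
A minimal sketch of the suggested change, assuming `NotImplementedError` is the exception meant here. `BinaryQuestion` and `Resolver` below are simplified stand-ins, not the classes from this PR. Returning `NotImplemented` hands the caller a sentinel object intended for binary special methods such as `__eq__`, whereas raising fails at the call site.

```python
import asyncio


class BinaryQuestion:
    """Hypothetical stand-in for the real question type."""


class Resolver:
    """Simplified stand-in for the resolver under review."""

    async def _resolve_binary(self, question: BinaryQuestion) -> str:
        # Placeholder result for the sketch.
        return "resolved"

    async def resolve_question(self, question: object) -> str:
        if isinstance(question, BinaryQuestion):
            return await self._resolve_binary(question)
        # Raise instead of returning the NotImplemented sentinel, so the
        # unsupported type cannot be mistaken for a resolution result.
        raise NotImplementedError(
            f"Unsupported question type: {type(question).__name__}"
        )


if __name__ == "__main__":
    print(asyncio.run(Resolver().resolve_question(BinaryQuestion())))
```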

) -> Optional[BinaryResolution]:

    # Rephrase question if its time context has passed
    question = await self._rephrase_question_if_needed(question)

Interesting. Isn’t Sonnet good enough to compose search queries without this extra step?

)
searcher = AskNewsSearcher()
if cutoff_date is not None:
    return await searcher.get_formatted_news_before_date_async(

Quick question: did you test this with a cutoff_date more than 48 hours in the past? I remember AskNews had a bug where it returned [] when using historical=True together with end_timestamp. I might be mistaken, but I recall seeing something like that before.

self.binary_theshold = binary_threshold
self.mc_threshold = mc_threshold

@abstractmethod

Nit: remove this. resolve_question is actually a concrete subclass' implementation here

Supports concurrent resolution of questions when assessing resolvers.
"""

def __init__(self, resolver: AutoResolver, allowed_types: list[QuestionBasicType], questions: list[int | str] = [], tournaments: list[int | str] = [], max_concurrency: int = 3):

Mutable default arguments (= [])
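
For reference, a default of `[]` is created once at function definition time and shared across every call that omits the argument. A common fix is to default to `None` and build a fresh list per instance. The sketch below uses a hypothetical, trimmed-down `Assessor` class rather than the constructor from this PR.

```python
from typing import Optional, Union


class Assessor:
    """Hypothetical, simplified version of the constructor under review."""

    def __init__(
        self,
        questions: Optional[list[Union[int, str]]] = None,
        tournaments: Optional[list[Union[int, str]]] = None,
        max_concurrency: int = 3,
    ) -> None:
        # Copy into a new list so instances never share state with each
        # other or with the list object the caller passed in.
        self.questions = list(questions) if questions is not None else []
        self.tournaments = list(tournaments) if tournaments is not None else []
        self.max_concurrency = max_concurrency


if __name__ == "__main__":
    first = Assessor()
    second = Assessor()
    first.questions.append(123)
    print(second.questions)
```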

Do we flag uncertain cases for human review? Do you think we should also add a certainty level for calibration purposes?
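
One possible shape for this, as a rough sketch only: the result carries a certainty value and anything below a threshold is routed to a human. `ResolutionResult`, `needs_human_review`, and the 0.9 default are all hypothetical and not part of this PR.

```python
from dataclasses import dataclass


@dataclass
class ResolutionResult:
    """Hypothetical result type; field names are assumptions."""

    outcome: str      # e.g. "yes", "no", "cancelled"
    certainty: float  # 0.0 to 1.0, as reported by the resolver


def needs_human_review(result: ResolutionResult, threshold: float = 0.9) -> bool:
    """Flag results whose reported certainty falls below the threshold."""
    if not 0.0 <= result.certainty <= 1.0:
        raise ValueError("certainty must be between 0.0 and 1.0")
    return result.certainty < threshold


if __name__ == "__main__":
    print(needs_human_review(ResolutionResult("yes", 0.62)))
```

Storing the certainty alongside the outcome would also allow comparing it against the eventual Metaculus resolution to check calibration.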


1. **Current Status**: What is the current state of affairs related to this question?
2. **Resolution Criteria**: Have the resolution criteria been met?
3. **Timeline Check**: Consider the scheduled resolution date and current date

Should we also pass the current date into the prompt, just in case?
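
A small sketch of what that could look like, assuming the prompt is built from a string template. `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the template text is illustrative rather than the prompt from this PR.

```python
from datetime import datetime, timezone
from typing import Optional

# Illustrative template; the real prompt in the PR is longer.
PROMPT_TEMPLATE = (
    "Today's date (UTC): {today}\n"
    "Scheduled resolution date: {resolution_date}\n\n"
    "Question: {question_text}\n"
)


def build_prompt(
    question_text: str,
    resolution_date: str,
    now: Optional[datetime] = None,
) -> str:
    """Fill the template, stating the current date explicitly for the model."""
    # Accept an injected `now` so the output is reproducible in tests.
    current = now if now is not None else datetime.now(timezone.utc)
    return PROMPT_TEMPLATE.format(
        today=current.date().isoformat(),
        resolution_date=resolution_date,
        question_text=question_text,
    )


if __name__ == "__main__":
    print(build_prompt("Will event X happen before May 1st?", "2026-05-01"))
```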

model_for_resolver: str = "openrouter/anthropic/claude-sonnet-4.6",
model_for_output_structure: str = "openrouter/anthropic/claude-sonnet-4.6",
model_for_researcher: str = "openrouter/anthropic/claude-sonnet-4.6",
model_for_rephraser: str = "openrouter/anthropic/claude-sonnet-4.6",

A small nit: maybe we can switch it to something lighter, e.g. Haiku?


# Handoff

When you've gathered sufficient information, hand off to the resolver