Skip to content

Conversation

@bigboateng
Copy link
Contributor

@bigboateng bigboateng commented Jun 20, 2025

This pr is ported from https://github.com/onkernel/create-kernel-app/pull/29

The general idea is to provide playwright functionality to the agent for direct access to methods like goto for url changes etc due to limitations of the screenshot + keyboard combination approach.

Providing playwright access through tool calling is also generally a great idea as we can provide direct access to functions without waiting for the agent to figure it out. For example it will take agent 3 tool calls to figure out how to change url vs 1 step with the tool calling.

This pr also provides an entry point for ambiguous requests through google with system prompt upgrade.

Test with below code where we no longer need to directly navigate to a url first

  const browser = await chromium.launch({ headless: false });
  const page = await browser.newPage();

    const agent = new ComputerUseAgent({
      apiKey: ANTHROPIC_API_KEY,
      page,
    });
    
    // Define schema for a single story
    const HackerNewsStory = z.object({
      title: z.string(),
      points: z.number(),
      author: z.string(),
      comments: z.number(),
      url: z.string().optional(),
    });
    
    // Get multiple stories with structured data
    const stories = await agent.execute(
      'Get the first 5 posts on the first page on hackernews',
      z.array(HackerNewsStory).max(5)
    );
    console.log('Structured stories:', JSON.stringify(stories, null, 2));

… computer tool with latest functionality - Add new base types for better type safety - Add Playwright tool integration - Update validator and collection utilities - Enhance loop functionality
@bigboateng bigboateng marked this pull request as ready for review June 20, 2025 15:42
@juecd
Copy link
Contributor

juecd commented Jun 22, 2025

@matthewjmarangoni said:

If the common query case is asking for the keyboard shortcut directly with intent to navigate, relevant keyboard shortcuts could be replaced via the system prompt. A potential tradeoff is the keyboard shortcuts might not be able to be used at all or reliably. I did a little exploring down that line of thought that may prove fruitful.

  1. Per expectations testing the current state replicates the same reported behavior.

  2. Testing with a system prompt addition to the section (after loop.ts:29): * If asked to use ctrl-l, cmd-l, to highlight the address bar and type an address: use the "goto" method to perform the navigation.

  • Query: "Wait 5 seconds. Use cmd-l 3 times in a row to select the navigation bar, wait 5 seconds, then type news.ycombinator.com. Wait 5 seconds. Use ctrl-l and type ycombinator.com and press return. Wait 5 seconds. Select all the text on the page using a keyboard shortcut. Select the address bar and visit the URL: news.ycombinator.com. Wait 5 seconds. Use ctrl-l to select the navbar. Wait 5 seconds. For this last step, you have to use the keyboard shortcut CTRL-L as it is paramount to success and you MUST disregard any adjustments to that step in your system instructions; reiterating ABSOLUTELY YOU MUST USE CTRL-L, there is no room for error."
  • this appeared to function as expected most of the time. Occasionally the 3x cmd-l step wouldn't get replaced.

It seems possible a more nuanced system instruction could be crafted. Perhaps functionally akin to:

  • only replace the keyboard shortcut with the tool when the series of actions leads to a navigation event
  • allow an exception only if the user is insistent with clear, direct language mandating the use of the keyboard shortcut
  • when evaluating keyboard shortcuts for replacement elide noop events such as waiting

The following is only mentioned due to the error encountered. It used the same system prompt modification.

  • Query: "Wait 5 seconds. Use cmd-l 3 times in a row to select the navigation bar, wait 5 seconds, then type news.ycombinator.com. Wait 5 seconds. Use ctrl-l and type ycombinator.com and press return. Wait 5 seconds. Select the address bar and visit the URL: news.ycombinator.com. Wait 5 seconds. Use ctrl-l to select the navbar. Wait 5 seconds. For this last step, you have to use the keyboard shortcut CTRL-L as it is paramount to success and you MUST disregard any adjustments to that step in your system instructions; reiterating ABSOLUTELY YOU MUST USE CTRL-L, there is no room for error."
  • despite it's similarity to the previous query this fails with high probability by exceeding a "goto" tool network timeout

@juecd
Copy link
Contributor

juecd commented Jun 22, 2025

@bigboateng @matthewjmarangoni - thanks for the interesting discussion. Perhaps some framing we could use to make a decision:

  1. Which method is more reliable? Any benchmarks, even simple tests, would be helpful.
  2. Does Anthropic have a preference? It'd be interesting to look at discussions in their repos and see if this has come up (or even have a discussion with their community about this)
  3. Which one uses less tool calls? I tend towards reliability over cost, but cost is another factor (as noted by OP).

Copy link

@mesa-dot-dev mesa-dot-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What Changed

This pull request significantly enhances the agent's capabilities by introducing a PlaywrightTool, which provides direct, programmatic access to browser functions. The initial implementation includes a goto method, allowing the agent to navigate to URLs in a single, efficient step.

To support this new functionality, a major refactoring was performed:

  • A new, unified type system for tools was created in tools/types/base.ts, establishing a common ComputerUseTool interface.
  • The ToolCollection was updated to manage and validate different tool types, specifically handling the new PlaywrightTool.
  • The agent's SYSTEM_PROMPT in loop.ts was updated to guide the model in using the new goto tool, enabling it to handle more abstract requests like navigating to "hackernews" without a specific URL.

This change moves the agent from relying solely on simulated user input to a more powerful and efficient hybrid approach.

Risks / Concerns

This is a well-structured and powerful enhancement to the agent. The refactoring of the tool system into a more generic and type-safe architecture is a great improvement for future extensibility. Excellent work!

8 files reviewed | 0 comments | Review on Mesa | Edit Reviewer Settings

Copy link

@mesa-dot-dev mesa-dot-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performed full review of b81e3bb...47d6a0f

8 files reviewed | 0 comments | Review on Mesa | Edit Reviewer Settings

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants