
feat: Support native logprobs in LlamaChatSession.prompt() to avoid double inference #584

@steve02081504


Feature Description

When LlamaChatSession.prompt() generates a response, the model internally computes a full
probability distribution (via softmax) for every token it samples. This data already exists at
the native layer but is currently discarded before it reaches JavaScript.

Applications that need per-token logprobs (e.g. for confidence visualization, uncertainty
estimation, or OpenAI-compatible logprobs API emulation) therefore have no choice but to
replay the entire output sequence on a second LlamaContextSequence using
controlledEvaluate with generateNext: { probabilities: true }. This doubles the inference
cost:

  • Pass 1 (main generation): LlamaChatSession.prompt() on sequence
  • Pass 2 (logprob replay): controlledEvaluate on a dedicated replaySequence

Because a second sequence is needed simultaneously, the context must be created with
sequences: 2, which also splits the KV cache budget.

The Solution

Add a logprobs / topLogprobs option to LlamaChatSession.prompt() (and ideally to
LlamaCompletion as well) that captures the probability distribution during the original
generation pass and returns it alongside the text, similar to the OpenAI API:

const result = await session.prompt(input, {
  logprobs: true,
  topLogprobs: 5,
  // ...existing options
})

// result.logprobs.content[i] = { token, logprob, top_logprobs: [...] }

This would eliminate the replay pass entirely and halve the inference cost for any consumer
that needs token-level probabilities.
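On the consumer side, shaping per-token probabilities into the OpenAI-style
logprobs.content layout shown above is a pure transformation. A minimal sketch, assuming a
hypothetical TokenProbs shape for what the native layer might surface (this is an illustration,
not the actual node-llama-cpp API):

```typescript
// Hypothetical input shape: one entry per sampled token, carrying the
// sampled token's probability and the top candidate probabilities.
// (Assumed for illustration; not an existing node-llama-cpp type.)
type TokenProbs = {
    text: string;                                          // detokenized token text
    probability: number;                                   // probability of the sampled token
    topCandidates: Array<{text: string; probability: number}>;
};

// Target shape: one entry of the OpenAI-compatible logprobs.content array.
type OpenAiLogprobEntry = {
    token: string;
    logprob: number;
    top_logprobs: Array<{token: string; logprob: number}>;
};

// Convert raw softmax probabilities to natural-log probabilities in the
// OpenAI logprobs.content layout sketched in the feature request above.
function toOpenAiLogprobs(tokens: TokenProbs[]): OpenAiLogprobEntry[] {
    return tokens.map((t) => ({
        token: t.text,
        logprob: Math.log(t.probability),
        top_logprobs: t.topCandidates.map((c) => ({
            token: c.text,
            logprob: Math.log(c.probability)
        }))
    }));
}
```

With a conversion like this available, the feature only needs to expose the raw per-token
probabilities from the generation pass; the OpenAI-compatible shape can be derived entirely in
user space.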

Considered Alternatives

The only current workaround is to use controlledEvaluate on a parallel replaySequence after
(or concurrently with) the main generation. A reference implementation of this approach can be
found here:
https://github.com/steve02081504/fount/blob/master/src/public/parts/serviceGenerators/AI/local/src/localLogprobs.mjs

Additional Context

  • node-llama-cpp version: 3.18.1
  • Use case: OpenAI-compatible logprobs visualization in a local GGUF inference service

Related Features to This Feature Request

  • Metal support
  • CUDA support
  • Vulkan support
  • Grammar
  • Function calling

Are you willing to resolve this issue by submitting a Pull Request?

No, I don’t have the time and I’m okay to wait for the community / maintainers to resolve this issue.
