chat is the main text entry point. LLM is callable, so llm("hello") and llm.chat("hello") are equivalent:
import heavenbase as hb

llm = hb.LLM(model="ds-flash")

text = llm.chat("Summarize HeavenBase in one sentence.")
same = llm("Summarize HeavenBase in one sentence.")
Pass a system prompt with system=...:
answer = llm.chat("What should I do next?", system="You are concise and practical.")
Provider arguments can be set at construction time or call time:
llm = hb.LLM(model="ds-flash", temperature=0)
answer = llm.chat("Draft a title", max_tokens=24)
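Construction-time arguments become defaults; call-time arguments apply to a single request. The sketch below assumes a call-time value overrides the constructed default for that call only (precedence is an assumption here, not stated above):

llm = hb.LLM(model="ds-flash", temperature=0)
# temperature=0 from the constructor applies by default
title = llm.chat("Draft a title", max_tokens=24)
# assumption: this call runs at temperature=0.9 without changing the default
loose = llm.chat("Draft a looser title", temperature=0.9, max_tokens=24)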

Message inputs

chat and stream accept a single string, one OpenAI-style message dictionary, or a list of messages.
llm.chat("Hello")

llm.chat({"role": "user", "content": "Hello"})

llm.chat([
    {"role": "system", "content": "Be direct."},
    {"role": "user", "content": "Explain vector search."},
])
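The formatter also accepts message-like objects: anything exposing model_dump(), to_dict(), dict(), json(), or plain role and content attributes is normalized like a message dictionary. A minimal sketch (the Note dataclass is illustrative, not part of HeavenBase):

from dataclasses import dataclass

@dataclass
class Note:
    role: str
    content: str

# role/content attributes are read by the message formatter
llm.chat(Note(role="user", content="Hello"))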
For multimodal models, pass image inputs with images=. HeavenBase normalizes each input to LLMImage and appends OpenAI-compatible image_url content parts to the last user message.
answer = hb.LLM(model="qwen3.6", provider="openrouter").chat(
    "What is in this image?",
    images=["./sample.png"],
)

for chunk in hb.LLM(model="gpt-nano").stream("Describe this image.", images=b"...png bytes..."):
    print(chunk, end="")
images= accepts LLMImage, URLs, data URLs, base64 strings, local paths, bytes-like objects, binary file objects, provider-style dictionaries, Pillow images, and numpy-compatible ndarrays.
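For instance, a Pillow image can be passed directly; a sketch (requires Pillow; the model and path mirror the earlier example):

from PIL import Image

img = Image.open("./sample.png")
# normalized to LLMImage like any other accepted input form
hb.LLM(model="qwen3.6", provider="openrouter").chat("What is in this image?", images=[img])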

Streaming

Use stream when you want deltas as they arrive:
for chunk in llm.stream("Write a short haiku"):
    print(chunk, end="")
chat uses the same streaming path internally and gathers the final response. This keeps regular chat, reasoning streams, usage accounting, structured outputs, and tool calls on one response pipeline.
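Because chat gathers the same deltas, joining a stream reconstructs that call's full completion text (assuming chunks are plain strings, as the loop above suggests):

chunks = list(llm.stream("Write a short haiku"))
full_text = "".join(chunks)  # the complete assistant text for this call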

Include projection

The include argument selects response fields:
  • text: final assistant text.
  • think: reasoning or thinking content, when a provider streams it separately.
  • content: thinking plus visible answer, with <think> tags around reasoning.
  • message: OpenAI-format assistant response dictionary with role, content, and optional tool_calls; it is not the full history.
  • delta: new OpenAI-format messages produced by this inference call. It begins with message and, once tool execution is wired in, will also include the tool-result messages that follow it.
  • messages: the full conversation history, i.e. normalized input messages plus delta.
  • tool_calls: normalized OpenAI tool_calls from the assistant response.
  • usage: provider usage counters for this call. Common keys are prompt_tokens, completion_tokens, and total_tokens; streamed usage chunks are merged by summing numeric counters and keeping the first non-numeric value per key.
  • raw: raw provider payloads.
  • elapsed: request elapsed seconds.
  • created_at: local response creation timestamp.
  • structured: parsed structured output.
detail = llm.chat(
    "Reply with exactly: hb-ok",
    include=["text", "usage", "elapsed"],
    reduce=False,
    max_tokens=8,
)
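Assuming the non-reduced result maps each requested field to its value (the access pattern below is an assumption based on the field list, not documented output):

print(detail["text"])     # final assistant text
print(detail["usage"])    # e.g. prompt_tokens, completion_tokens, total_tokens
print(detail["elapsed"])  # request elapsed seconds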
When stream includes delta or messages, progressive content still arrives as normal, and HeavenBase emits one final metadata chunk with empty text/think and the completed message delta.

Reasoning presets enable the canonical think option. You can override the default per call, and HeavenBase converts it to gateway-level extra_body.reasoning for OpenAI-compatible gateways:
result = hb.LLM(preset="reason").chat(
    "Solve 17 * 23.",
    think=True,
    reasoning_effort="medium",
    include=["think", "text"],
    reduce=False,
)