What it is
The llm package is the thin waist of Phero: a minimal interface for chat models plus a small tool system.
Higher-level packages (like agent) build on this to orchestrate multi-turn loops and tool execution.
All message and content types are defined in this package — no dependency on any specific provider SDK is required.
Using an LLM backend
Any backend that implements llm.LLM can power agents and tools.
Phero includes an OpenAI-compatible client at llm/openai and an Anthropic Messages API client at llm/anthropic.
- Choose a backend (e.g. llm/openai, llm/anthropic, or your own implementation)
- Pass that client into agent.New
- Optionally attach tools created via llm.NewTool
Anthropic backend
llm/anthropic implements llm.LLM using Anthropic's Messages API.
The same Phero message/tool types are used at the boundary, so the rest of the framework (agents, tools, memory)
doesn't need to change.
import (
"os"
"github.com/henomis/phero/llm/anthropic"
)
llmClient := anthropic.New(
os.Getenv("ANTHROPIC_API_KEY"),
anthropic.WithModel("claude-sonnet-4-6"),
anthropic.WithMaxTokens(2048),
)
If you pass an empty API key, the underlying Anthropic SDK will fall back to its environment variable configuration.
Available options:
- WithModel(m): Anthropic model name (e.g. "claude-sonnet-4-6")
- WithMaxTokens(n): maximum completion tokens
- WithTemperature(t): sampling temperature (0–1)
- WithBaseURL(url): override the API endpoint (useful for proxies or tests)
- WithPromptCaching(): mark the system prompt and tool list as cacheable, so repeated requests reuse them at the cheaper cache-read rate. Cache token counts appear on Result.Usage (CacheReadTokens / CacheWriteTokens).
- WithThinking(budget): enable extended thinking with the given token budget. Reasoning is returned as ContentTypeReasoning parts (read it with Message.ReasoningContent()) and replayed on later turns so tool use keeps working. Reasoning never leaks into TextContent().
llmClient := anthropic.New(
os.Getenv("ANTHROPIC_API_KEY"),
anthropic.WithModel("claude-sonnet-4-6"),
anthropic.WithPromptCaching(),
anthropic.WithThinking(2048),
)
result, err := llmClient.Execute(ctx, messages, tools)
if err != nil {
	panic(err)
}
fmt.Println("reasoning:", result.Message.ReasoningContent())
fmt.Println("answer:", result.Message.TextContent())
OpenAI backend
llm/openai implements llm.LLM using the OpenAI Chat Completions API.
It also implements llm.Transcriber and llm.SpeechSynthesizer for audio (see below).
import (
"os"
"github.com/henomis/phero/llm/openai"
)
llmClient := openai.New(
os.Getenv("OPENAI_API_KEY"),
openai.WithModel("gpt-4o"),
openai.WithTemperature(0.7),
)
Available options:
- WithModel(m): model name (e.g. "gpt-4o", "gpt-4o-mini")
- WithTemperature(t): sampling temperature
- WithBaseURL(url): override the API base URL (e.g. point at a local proxy)
- WithOllamaBaseURL(): shortcut to point at a local Ollama server
Messages and content parts
Every message is an llm.Message containing a Role and a slice of llm.ContentPart values.
A content part is usually text or an image (given by URL or as base64-encoded data); backends with extended thinking enabled also return reasoning parts.
// Plain text
llm.Text("Hello, world!")
// Image by URL
llm.ImageURL("https://example.com/photo.png")
// Image from a local file (MIME type is detected automatically)
part, err := llm.ImageFile("/path/to/photo.jpg")
// Image from base64-encoded data
llm.ImageBase64("image/png", base64EncodedData)
Role constants and message constructors:
// Role constants
llm.RoleSystem // "system"
llm.RoleUser // "user"
llm.RoleAssistant // "assistant"
llm.RoleTool // "tool"
// Constructors
llm.SystemMessage("You are a helpful assistant.")
llm.UserMessage(llm.Text("Hello!"))
llm.UserMessage(llm.Text("Describe this image:"), llm.ImageURL("https://..."))
llm.AssistantMessage([]llm.ContentPart{llm.Text("Hi!")})
llm.ToolResultMessage(toolCallID, llm.Text("42"))
When a tool fails, set ToolError on its result message so the model is told the call errored
rather than succeeded. The agent loop does this automatically — a handler returning an error yields a tool-result
message with ToolError: true carrying the error text — but you can set the field yourself when
constructing messages by hand.
msg := llm.ToolResultMessage(toolCallID, llm.Text("division by zero"))
msg.ToolError = true
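The automatic conversion described above can be sketched in isolation. This is a self-contained illustration, not Phero's implementation: ToolResult, Handler, runTool and divide are hypothetical names that only mirror the idea of flagging a failed call and carrying the error text back to the model.

```go
package main

import (
	"errors"
	"fmt"
)

// ToolResult is a hypothetical stand-in for a tool-result message.
type ToolResult struct {
	ToolCallID string
	Text       string
	ToolError  bool
}

// Handler is a hypothetical tool handler: raw arguments in, text out.
type Handler func(arguments string) (string, error)

// runTool calls the handler and converts a failure into a result that is
// flagged as an error and carries the error text.
func runTool(callID string, h Handler, arguments string) ToolResult {
	out, err := h(arguments)
	if err != nil {
		return ToolResult{ToolCallID: callID, Text: err.Error(), ToolError: true}
	}
	return ToolResult{ToolCallID: callID, Text: out}
}

// divide is a toy handler that fails for one specific input.
func divide(arguments string) (string, error) {
	if arguments == "1/0" {
		return "", errors.New("division by zero")
	}
	return "42", nil
}

func main() {
	fmt.Printf("%+v\n", runTool("call_1", divide, "84/2"))
	fmt.Printf("%+v\n", runTool("call_2", divide, "1/0"))
}
```

The failed call still produces a result for the same call ID, so the conversation stays well-formed and the model can decide how to recover.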
To extract the plain text from a message or from loose parts, use TextContent:
// From a message
text := msg.TextContent()
// From loose content parts
text := llm.TextContent(parts...)
Multimodal input
Pass image parts alongside text to send multimodal messages to vision-capable models.
The agent's Run method accepts variadic ContentPart values:
imagePart, err := llm.ImageFile("screenshot.png")
if err != nil {
panic(err)
}
result, err := a.Run(ctx,
llm.Text("What does this image show?"),
imagePart,
)
if err != nil {
panic(err)
}
fmt.Println(result.TextContent())
See examples/multimodal for a complete working example.
LLM middleware
Just as tools support middleware, the llm package provides an LLMMiddleware type and an
llm.Use function to compose decorators around any llm.LLM.
This is the right place to add caching, rate limiting, logging, or automatic retries without modifying individual backends.
// A simple logging middleware
func loggingMiddleware(next llm.LLM) llm.LLM {
return llm.LLMFunc(func(ctx context.Context, msgs []llm.Message, tools []*llm.Tool) (*llm.Result, error) {
fmt.Printf("calling LLM with %d messages\n", len(msgs))
result, err := next.Execute(ctx, msgs, tools)
fmt.Printf("LLM responded; err=%v\n", err)
return result, err
})
}
// Wrap a base client with one or more middlewares
base := openai.New(os.Getenv("OPENAI_API_KEY"))
wrapped := llm.Use(base, loggingMiddleware)
Middlewares are applied in declaration order: llm.Use(base, m1, m2) means m1 is outermost and runs first.
See examples/llm-middleware for a full example.
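To make the ordering rule concrete, here is a minimal, self-contained sketch of one way such a composition can be written. It does not use the Phero types: Model, ModelFunc, Middleware, use and tag are hypothetical stand-ins with simplified signatures, and the real llm.Use may be implemented differently.

```go
package main

import (
	"fmt"
	"strings"
)

// Model is a hypothetical stand-in for llm.LLM, reduced to one method.
type Model interface {
	Execute(prompt string) (string, error)
}

// ModelFunc adapts a plain function to Model.
type ModelFunc func(prompt string) (string, error)

func (f ModelFunc) Execute(prompt string) (string, error) { return f(prompt) }

// Middleware decorates a Model.
type Middleware func(next Model) Model

// use wraps base so that the first middleware listed is outermost.
// It walks the list backwards: the last middleware wraps base first.
func use(base Model, mws ...Middleware) Model {
	wrapped := base
	for i := len(mws) - 1; i >= 0; i-- {
		wrapped = mws[i](wrapped)
	}
	return wrapped
}

// tag returns a middleware that records when it is entered and left.
func tag(name string, log *[]string) Middleware {
	return func(next Model) Model {
		return ModelFunc(func(prompt string) (string, error) {
			*log = append(*log, name+" in")
			out, err := next.Execute(prompt)
			*log = append(*log, name+" out")
			return out, err
		})
	}
}

// callOrder runs one request through m1 and m2 and returns the trace.
func callOrder() string {
	var log []string
	base := ModelFunc(func(prompt string) (string, error) {
		log = append(log, "base")
		return "ok", nil
	})
	_, _ = use(base, tag("m1", &log), tag("m2", &log)).Execute("hi")
	return strings.Join(log, " > ")
}

func main() {
	fmt.Println(callOrder()) // m1 in > m2 in > base > m2 out > m1 out
}
```

Because m1 is outermost, it sees the request first and the response last, which is usually what you want for logging or retries that should cover everything beneath them.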
The llm/middleware package ships ready-made middlewares: NewRetry (exponential back-off),
NewRateLimit, NewGuardrails, and NewSemanticCache.
Semantic response caching
middleware.NewSemanticCache wraps any backend with a cache keyed by the semantic similarity
of the conversation rather than an exact string match. Each request is embedded with an
embedding.Embedder and looked up in a
vectorstore.Store; when the nearest neighbour's cosine similarity meets
the configured threshold, the stored response is returned without calling the model. Cache failures degrade
gracefully to a normal, uncached call.
cacheMW, err := middleware.NewSemanticCache(
embedder, // embedding.Embedder
store, // vectorstore.Store
middleware.WithSimilarityThreshold(0.97),
)
if err != nil {
panic(err)
}
client := llm.Use(base, cacheMW)
By default a cache hit reports zero token usage (no model call was made), keeping cost accounting truthful, and
requests that carry tools bypass the cache so cached tool calls are never silently replayed. Both behaviors are
configurable via WithReportCachedUsage and WithSkipToolCalls.
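The lookup rule (nearest neighbour, accepted only when its cosine similarity meets the threshold) can be illustrated without any Phero types. The sketch below is an assumption-laden toy: entry, cosine and lookup are hypothetical names, the vectors are hand-written instead of coming from an embedder, and a linear scan stands in for a real vector store.

```go
package main

import (
	"fmt"
	"math"
)

// entry pairs a stored request embedding with its cached response.
type entry struct {
	vec      []float64
	response string
}

// cosine returns the cosine similarity of two equal-length vectors,
// or 0 when the lengths differ or either vector has zero magnitude.
func cosine(a, b []float64) float64 {
	if len(a) == 0 || len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// lookup scans the entries for the nearest neighbour and reports a hit
// only when its similarity meets the threshold.
func lookup(entries []entry, query []float64, threshold float64) (string, bool) {
	best, bestSim := -1, -1.0
	for i, e := range entries {
		if sim := cosine(e.vec, query); sim > bestSim {
			best, bestSim = i, sim
		}
	}
	if best < 0 || bestSim < threshold {
		return "", false
	}
	return entries[best].response, true
}

func main() {
	cache := []entry{
		{vec: []float64{1, 0}, response: "Paris"},
		{vec: []float64{0, 1}, response: "Berlin"},
	}
	// Very close to the first entry: similarity is about 0.999, so this hits.
	fmt.Println(lookup(cache, []float64{0.99, 0.05}, 0.97))
	// Halfway between both entries: similarity is about 0.707, so this misses.
	fmt.Println(lookup(cache, []float64{0.7, 0.7}, 0.97))
}
```

A high threshold such as 0.97 trades hit rate for safety: only near-paraphrases of an earlier request reuse its answer.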
Audio: transcription and speech
The llm/openai client also implements llm.Transcriber and llm.SpeechSynthesizer,
so the same client used for chat can also transcribe audio and synthesize speech.
// Speech-to-text (Transcriber)
result, err := llmClient.Transcribe(ctx, llm.TranscriptionRequest{
	Input: llm.AudioFile("recording.mp3"),
})
if err != nil {
	panic(err)
}
fmt.Println(result.Text)
// Text-to-speech (SpeechSynthesizer)
speech, err := llmClient.SynthesizeSpeech(ctx, llm.SpeechRequest{
	Input: "Hello from Phero!",
	Format: llm.SpeechResponseFormatMP3,
})
if err != nil {
	panic(err)
}
// speech.Data holds the raw MP3 bytes; speech.MIMEType is "audio/mpeg"
See examples/audio for a runnable example.
Function tools
The main way you integrate capabilities is via function tools.
In examples/conversational-agent,
a get_current_time tool is exposed to the agent.
type TimeInput struct{}
type TimeOutput struct {
CurrentTime string `json:"current_time" jsonschema:"description=The current local time in RFC3339 format"`
}
func getCurrentTime(_ context.Context, _ *TimeInput) (*TimeOutput, error) {
return &TimeOutput{CurrentTime: time.Now().Format(time.RFC3339)}, nil
}
tool, err := llm.NewTool(
"get_current_time",
"Get the current local time",
getCurrentTime,
)
if err != nil {
panic(err)
}
Tools are added to an agent with AddTool, and the agent will run them when the model requests a tool call.
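A typed handler such as getCurrentTime has to be bridged to the raw JSON arguments a model sends. The sketch below shows one plausible way to do that with generics. It is an assumption, not the actual llm.NewTool implementation: RawHandler, adapt, AddInput, AddOutput, add and callAdd are hypothetical names, and schema generation from struct tags is left out.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
)

// RawHandler is a hypothetical untyped handler: JSON arguments in, any value out.
type RawHandler func(ctx context.Context, arguments string) (any, error)

// adapt turns a typed function into a RawHandler by decoding the JSON
// arguments into a fresh In value before calling fn.
func adapt[In, Out any](fn func(context.Context, *In) (*Out, error)) RawHandler {
	return func(ctx context.Context, arguments string) (any, error) {
		in := new(In)
		// Treat empty arguments as "no input" so zero-field inputs work.
		if arguments != "" {
			if err := json.Unmarshal([]byte(arguments), in); err != nil {
				return nil, fmt.Errorf("decode arguments: %w", err)
			}
		}
		out, err := fn(ctx, in)
		if err != nil {
			return nil, err
		}
		return out, nil
	}
}

// AddInput and AddOutput are example payloads for a toy "add" tool.
type AddInput struct {
	A int `json:"a"`
	B int `json:"b"`
}

type AddOutput struct {
	Sum int `json:"sum"`
}

func add(_ context.Context, in *AddInput) (*AddOutput, error) {
	return &AddOutput{Sum: in.A + in.B}, nil
}

// callAdd runs the adapted handler on raw JSON and unwraps the sum.
func callAdd(arguments string) (int, error) {
	out, err := adapt(add)(context.Background(), arguments)
	if err != nil {
		return 0, err
	}
	return out.(*AddOutput).Sum, nil
}

func main() {
	fmt.Println(callAdd(`{"a": 2, "b": 40}`)) // 42 <nil>
	fmt.Println(callAdd(`{"a": "two"}`))      // 0 plus a decode error
}
```

Keeping the decode step in one adapter means individual handlers only deal with typed values, and malformed arguments surface as ordinary errors.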
Tool middleware
Tools support middleware via tool.Use(...).
This is the place to add validation, permission checks, logging, or other cross-cutting behavior without baking it into each tool handler.
timeTool, err := llm.NewTool(
"get_current_time",
"Get the current local time",
getCurrentTime,
)
if err != nil {
panic(err)
}
timeTool.Use(func(tool *llm.Tool, next llm.ToolHandler) llm.ToolHandler {
return func(ctx context.Context, arguments string) (any, error) {
fmt.Printf("running %s with args %s\n", tool.Name(), arguments)
return next(ctx, arguments)
}
})
Middleware order is preserved: if you call tool.Use(m1, m2), then m1 runs before m2.
This replaces older per-tool validation helpers and keeps approval logic at wiring time.
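As an illustration of approval logic living in middleware, the following self-contained sketch short-circuits a call before the handler runs. The types are simplified assumptions (Handler, Middleware, requireApproval, guarded and errDenied are hypothetical, and the context parameter is dropped for brevity), so treat it as a pattern, not as the Phero API.

```go
package main

import (
	"errors"
	"fmt"
)

// Handler is a hypothetical tool handler taking raw JSON arguments.
type Handler func(arguments string) (any, error)

// Middleware wraps a handler; toolName identifies the tool being wrapped.
type Middleware func(toolName string, next Handler) Handler

// errDenied is returned when a call is rejected before reaching the handler.
var errDenied = errors.New("tool call not approved")

// requireApproval short-circuits with errDenied unless approve returns true.
func requireApproval(approve func(toolName, arguments string) bool) Middleware {
	return func(toolName string, next Handler) Handler {
		return func(arguments string) (any, error) {
			if !approve(toolName, arguments) {
				return nil, errDenied
			}
			return next(arguments)
		}
	}
}

// guarded builds a trivial handler for toolName and wraps it with an
// allow-list check keyed by tool name.
func guarded(toolName string, allowed map[string]bool) Handler {
	base := func(arguments string) (any, error) { return "done", nil }
	mw := requireApproval(func(name, _ string) bool { return allowed[name] })
	return mw(toolName, base)
}

func main() {
	allowed := map[string]bool{"get_current_time": true}
	fmt.Println(guarded("get_current_time", allowed)("{}")) // done <nil>
	fmt.Println(guarded("delete_files", allowed)("{}"))     // <nil> tool call not approved
}
```

The handler itself stays unaware of the policy; swapping the approve function (for example, to prompt a human) changes behavior at wiring time only.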
Tracing raw LLM calls
The trace package can wrap any llm.LLM with trace.NewLLM. This is useful when you want
observability around direct Execute calls without going through an agent.
import (
"os"
"github.com/henomis/phero/trace"
"github.com/henomis/phero/trace/text"
)
traced := trace.NewLLM(llmClient, text.New(os.Stderr))
result, err := traced.Execute(ctx, messages, tools)
When called inside an agent, request and response events are automatically annotated with the agent name and iteration number.
Token usage and cost
Every Result carries Usage{InputTokens, OutputTokens, CacheReadTokens, CacheWriteTokens}
and the resolved Model name. Call usage.Cost(model) for a best-effort US-dollar
estimate using a built-in per-model pricing table; unknown models return 0. Override or add rates
with llm.RegisterPricing. Agents aggregate this into RunSummary.Usage.CostUSD.
// Override or add pricing (USD per 1M tokens).
llm.RegisterPricing("my-local-model", llm.Pricing{
InputPer1M: 0.50,
OutputPer1M: 1.50,
})
cost := result.Usage.Cost(result.Model)
fmt.Printf("this call cost $%.4f\n", cost)
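The arithmetic behind a per-1M-token estimate is simple enough to work through by hand. The sketch below uses a local Pricing struct that only mirrors the two fields shown above; costUSD is a hypothetical helper, and how the library prices cache-read and cache-write tokens is not covered here.

```go
package main

import "fmt"

// Pricing holds rates in USD per one million tokens (local stand-in type).
type Pricing struct {
	InputPer1M  float64
	OutputPer1M float64
}

// costUSD estimates the cost of one call from its token counts.
// Cache-read and cache-write tokens are left out of this sketch.
func costUSD(p Pricing, inputTokens, outputTokens int) float64 {
	return float64(inputTokens)/1e6*p.InputPer1M +
		float64(outputTokens)/1e6*p.OutputPer1M
}

func main() {
	p := Pricing{InputPer1M: 0.50, OutputPer1M: 1.50}
	// 4,000 input tokens:  0.004 * 0.50 = 0.0020
	// 1,000 output tokens: 0.001 * 1.50 = 0.0015
	fmt.Printf("$%.4f\n", costUSD(p, 4000, 1000)) // $0.0035
}
```

At these rates a single call costs a fraction of a cent, which is why aggregated per-run totals are usually the more useful number to watch.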
Streaming
Backends that support incremental responses implement llm.StreamingLLM (both the OpenAI and
Anthropic clients do). Use llm.StreamOrBuffer to consume any backend uniformly: it streams when
the client supports it and otherwise falls back to a single buffered chunk. Each StreamChunk
carries a TextDelta (or ReasoningDelta); the terminal chunk (Done) holds
the complete Message, Usage, and Model.
for chunk, err := range llm.StreamOrBuffer(ctx, llmClient, messages, tools) {
if err != nil {
panic(err)
}
fmt.Print(chunk.TextDelta) // print tokens as they arrive
if chunk.Done {
fmt.Printf("\n[tokens in=%d out=%d]\n", chunk.Usage.InputTokens, chunk.Usage.OutputTokens)
}
}
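The stream-or-fall-back behavior rests on a common Go idiom: checking for an optional interface with a type assertion. The sketch below shows that idiom in isolation. It uses a callback instead of a range-over-func iterator, and Chunk, Model, Streamer, streamOrBuffer, buffered, streaming and collect are all hypothetical simplifications, not the Phero types.

```go
package main

import "fmt"

// Chunk is a hypothetical stand-in for a stream chunk.
type Chunk struct {
	TextDelta string
	Done      bool
}

// Model is the minimal, non-streaming interface.
type Model interface {
	Execute(prompt string) (string, error)
}

// Streamer is the optional capability a backend may also implement.
type Streamer interface {
	Stream(prompt string, emit func(Chunk)) error
}

// streamOrBuffer streams when m implements Streamer; otherwise it calls
// Execute once and emits the whole reply as a single terminal chunk.
func streamOrBuffer(m Model, prompt string, emit func(Chunk)) error {
	if s, ok := m.(Streamer); ok {
		return s.Stream(prompt, emit)
	}
	text, err := m.Execute(prompt)
	if err != nil {
		return err
	}
	emit(Chunk{TextDelta: text, Done: true})
	return nil
}

// buffered only implements Model.
type buffered struct{}

func (buffered) Execute(string) (string, error) { return "hello world", nil }

// streaming implements both Model (via embedding) and Streamer.
type streaming struct{ buffered }

func (streaming) Stream(prompt string, emit func(Chunk)) error {
	for _, delta := range []string{"hello ", "world"} {
		emit(Chunk{TextDelta: delta})
	}
	emit(Chunk{Done: true}) // terminal chunk with no further text
	return nil
}

// collect joins all text deltas and counts the chunks received.
func collect(m Model) (string, int) {
	var text string
	n := 0
	if err := streamOrBuffer(m, "hi", func(c Chunk) {
		text += c.TextDelta
		n++
	}); err != nil {
		return "", 0
	}
	return text, n
}

func main() {
	fmt.Println(collect(buffered{}))  // hello world 1
	fmt.Println(collect(streaming{})) // hello world 3
}
```

Callers see the same text either way; only the number of chunks differs, so consuming code does not need a separate path for backends that cannot stream.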
At the agent level, use Agent.RunStream (see the agent docs) for streamed text and tool events.
Putting it together
The minimal loop is: create an LLM client, register one or more tools, then run an agent.
This is the core pattern used throughout examples/.
# from repo root
go run ./examples/simple-agent
go run ./examples/conversational-agent