What it is
When building RAG systems, you usually don’t embed an entire document at once. Instead, you split it into smaller chunks, embed each chunk, and store them for retrieval.
The textsplitter package defines the Splitter interface; concrete implementations
live in sub-packages. All splitters read a source file and emit document.Document values
via a lazy Go iterator (iter.Seq2).
```go
// Splitter is implemented by all strategies:
type Splitter interface {
	Split(ctx context.Context) iter.Seq2[document.Document, error]
}
```
Available splitters
- `textsplitter/recursive`: recursively splits on an ordered list of separators (`"\n\n"`, `"\n"`, `" "`, `""`). Good for generic plain-text documents.
- `textsplitter/markdown`: same strategy, but the default separator list prioritises Markdown heading levels (`##`, `###`, …) before falling back to paragraph and line boundaries.
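The fallback cascade can be sketched in a few lines. This is an illustration only, not the package's implementation: real splitters also merge adjacent pieces back up toward the chunk size and apply overlap, which this sketch omits:

```go
package main

import (
	"fmt"
	"strings"
)

// splitRecursive splits text so no piece exceeds max bytes,
// preferring earlier separators in seps and falling back to
// later ones only for pieces that are still too large.
func splitRecursive(text string, seps []string, max int) []string {
	if len(text) <= max {
		return []string{text}
	}
	if len(seps) == 0 {
		// last resort: hard cut at max bytes
		return append([]string{text[:max]}, splitRecursive(text[max:], seps, max)...)
	}
	var out []string
	for _, part := range strings.Split(text, seps[0]) {
		out = append(out, splitRecursive(part, seps[1:], max)...)
	}
	return out
}

func main() {
	seps := []string{"\n\n", "\n", " "}
	fmt.Println(splitRecursive("para one\n\npara two is a bit longer", seps, 12))
}
```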
Key ideas
- Chunk size: the maximum size of each chunk (measured by a configurable length function, default is byte length)
- Overlap: how much content is repeated between adjacent chunks to avoid hard cutoffs
- Metadata: each `document.Document` carries `source`, `chunk_index`, `start_offset`, and `end_offset` metadata keys automatically
- Downstream usage: chunks are fed into embeddings and stored in a vector store
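To make the size/overlap arithmetic concrete, here is a standalone sketch (byte-length measurement, no separator awareness, so not what the library does) showing how each chunk starts size - overlap bytes after the previous one:

```go
package main

import "fmt"

// naiveChunks cuts text into byte-length chunks of at most size bytes,
// where adjacent chunks share overlap bytes of content.
func naiveChunks(text string, size, overlap int) []string {
	step := size - overlap // each chunk starts step bytes after the previous one
	if step <= 0 {
		panic("overlap must be smaller than chunk size")
	}
	var chunks []string
	for start := 0; start < len(text); start += step {
		end := start + size
		if end > len(text) {
			end = len(text)
		}
		chunks = append(chunks, text[start:end])
		if end == len(text) {
			break
		}
	}
	return chunks
}

func main() {
	fmt.Println(naiveChunks("abcdefghij", 4, 2)) // [abcd cdef efgh ghij]
}
```

Note how "cd", "ef", and "gh" each appear in two adjacent chunks: that repetition is what keeps a fact intact when it happens to straddle a chunk boundary.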
Example: splitting a file into chunks
Splitters read from a file path given at construction time. The iterator yields
document.Document values one at a time so large files are not held in memory all at once.
```go
import (
	"context"

	"github.com/henomis/phero/textsplitter/recursive"
)

const (
	chunkSize    = 1000
	chunkOverlap = 200
)

splitter := recursive.New("./my-document.txt", chunkSize, chunkOverlap)

var chunks []string
for doc, err := range splitter.Split(context.Background()) {
	if err != nil {
		// handle the error; stop draining the iterator
		break
	}
	chunks = append(chunks, doc.Content)
}
// chunks are then embedded and ingested into a vector store
```
For Markdown files, use `markdown.New` from `textsplitter/markdown`:

```go
import "github.com/henomis/phero/textsplitter/markdown"

splitter := markdown.New("./README.md", chunkSize, chunkOverlap)
for doc, err := range splitter.Split(context.Background()) {
	if err != nil {
		// handle error
		break
	}
	// doc.Metadata["source"], doc.Metadata["chunk_index"], ...
}
```
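Those metadata keys let you trace a retrieved chunk back to its source span, for example to build citations. A minimal sketch; the `citation` helper is hypothetical, only the key names come from this package:

```go
package main

import "fmt"

// citation renders a human-readable source reference from the
// metadata keys attached to every chunk (source, chunk_index,
// start_offset, end_offset).
func citation(meta map[string]any) string {
	return fmt.Sprintf("%s [chunk %v, bytes %v-%v]",
		meta["source"], meta["chunk_index"], meta["start_offset"], meta["end_offset"])
}

func main() {
	meta := map[string]any{
		"source":       "./README.md",
		"chunk_index":  3,
		"start_offset": 2400,
		"end_offset":   3400,
	}
	fmt.Println(citation(meta)) // ./README.md [chunk 3, bytes 2400-3400]
}
```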
Run the example
The RAG Chatbot example exercises the splitter as part of a full ingestion + retrieval flow. Provider setup depends on your chosen LLM/embedder; follow the example README.
```shell
# from repo root
go run ./examples/rag-chatbot -file /path/to/your/file.txt

# tuning knobs: -chunk-size / -chunk-overlap
```
Practical tuning tips
- If answers miss context: increase overlap so facts aren’t split apart
- If retrieval feels noisy: decrease chunk size so chunks are more specific
- If ingestion is slow: increase chunk size (fewer chunks) and keep overlap modest
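These trade-offs are easier to reason about with a rough chunk-count estimate. Assuming byte-length measurement and ignoring separator boundaries, each chunk after the first advances size - overlap bytes, so (a hypothetical helper, not part of the package):

```go
package main

import "fmt"

// estimateChunks gives a rough chunk count for a document of n bytes:
// the first chunk covers size bytes, and each further chunk advances
// size-overlap bytes past the previous one.
func estimateChunks(n, size, overlap int) int {
	if n <= size {
		return 1
	}
	step := size - overlap
	return 1 + (n-size+step-1)/step // 1 + ceil((n-size)/step)
}

func main() {
	fmt.Println(estimateChunks(100_000, 1000, 200)) // 125 chunks for a 100 KB file
}
```

Doubling the overlap from 200 to 400 on the same file pushes the count toward 167 chunks, which is the ingestion cost you pay for the extra context.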
Related packages
- rag: ingests and retrieves chunks
- embedding: turns chunks/queries into vectors
- vectorstore: stores vectors for similarity search