building a document search engine in go

may 2026 · 8 min read · go · search · bleve

godochive started as a small itch. i had a folder full of documents — notes, pdfs flattened to text, scraps of documentation — and no good way to search across all of them. grep got me part of the way, but i wanted ranking, highlighting, and something that would not fall over at forty thousand files.

i reached for bleve, the full-text search library for go. it does the unglamorous work — tokenizing, building an inverted index, scoring results — and leaves the interesting decisions to you. this post is about the decisions that actually mattered.

mappings are the whole game

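the default mapping treats every text field the same way. that is fine until it is not. a title should count for more than a body; a tag is a keyword, not prose; a date should be searchable as a date, not a string. bleve lets you describe the analysis side of this up front with a document mapping: which type each field is and which analyzer it gets. the weighting is the one piece that lives elsewhere, as a boost on the query clause that targets the title.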

go
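// field mappings: analyzed text for prose, the keyword analyzer for tags
titleField := bleve.NewTextFieldMapping()
bodyField := bleve.NewTextFieldMapping()
keywordField := bleve.NewTextFieldMapping()
keywordField.Analyzer = "keyword" // index each tag as one exact token

indexMapping := bleve.NewIndexMapping()

doc := bleve.NewDocumentMapping()
doc.AddFieldMappingsAt("title", titleField)
doc.AddFieldMappingsAt("body", bodyField)
doc.AddFieldMappingsAt("tags", keywordField)

// applies only to documents whose type resolves to "doc"
indexMapping.AddDocumentMapping("doc", doc)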

once the mapping reflected what the documents actually were, relevance stopped feeling random. the right results moved to the top and stayed there.

batch your writes or wait forever

indexing one document at a time is the obvious first version and the slowest possible one — every call flushes to disk. the fix is batching: accumulate a few hundred documents, commit them in one transaction, repeat.

go
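batch := index.NewBatch()
for _, d := range docs {
    if err := batch.Index(d.ID, d); err != nil {
        return err // assumes an enclosing func that returns error
    }
    if batch.Size() >= 500 {
        if err := index.Batch(batch); err != nil {
            return err
        }
        batch.Reset() // reuse the same batch rather than allocating a new one
    }
}
// flush the remainder: the last partial batch is not written otherwise
if err := index.Batch(batch); err != nil {
    return err
}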

that one change took a full reindex from minutes to seconds. the trailing flush is easy to forget, and the bug is silent: everything works, except that the last few hundred documents are simply not there.
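the boundary logic is easy to check without a real index. the sketch below is a stand-in that uses only the standard library: `commitFunc` and `indexInBatches` are hypothetical names, and the commit callback plays the role that `index.Batch` plays in the loop above. it records the size of each commit so a missing trailing flush shows up right away.

```go
package main

import "fmt"

// commitFunc stands in for index.Batch: it receives one batch of document IDs.
// the slice is reused between calls, so an implementation must not retain it.
type commitFunc func(ids []string) error

// indexInBatches hands docs to commit in groups of at most size,
// including the final partial group.
func indexInBatches(docs []string, size int, commit commitFunc) error {
	if size <= 0 {
		return fmt.Errorf("batch size must be positive, got %d", size)
	}
	batch := make([]string, 0, size) // one buffer, reused across flushes
	for _, id := range docs {
		batch = append(batch, id)
		if len(batch) == size {
			if err := commit(batch); err != nil {
				return err
			}
			batch = batch[:0] // keep the capacity, drop the contents
		}
	}
	if len(batch) > 0 { // trailing flush for the partial batch
		return commit(batch)
	}
	return nil
}

func main() {
	docs := make([]string, 1234)
	for i := range docs {
		docs[i] = fmt.Sprintf("doc-%d", i)
	}

	var sizes []int
	err := indexInBatches(docs, 500, func(ids []string) error {
		sizes = append(sizes, len(ids))
		return nil
	})
	fmt.Println(sizes, err) // [500 500 234] <nil>
}
```

drop the trailing flush and the last entry disappears: 1,000 of 1,234 documents are committed and nothing reports an error.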

the bugs that only show up at scale

  • memory crept up during large reindexes until i reused a single batch instead of allocating one per loop.
  • highlighting broke on documents with unusual unicode — the byte offsets and the rune offsets disagreed.
  • a handful of enormous files dominated every result set until i capped how much any single document could contribute.
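the offset mismatch in the second bullet is easy to reproduce with the standard library alone. this is a sketch of the general problem, not the fix that went into godochive: `runeToByteOffset` and `highlightRunes` are hypothetical helpers, and the `<mark>` tags are only a placeholder for whatever the highlighter emits. go strings are indexed by byte, so an offset counted in runes has to be converted before slicing.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// runeToByteOffset converts an offset counted in runes into a byte offset into s.
// offsets at or past the end of the string map to len(s).
func runeToByteOffset(s string, runeOff int) int {
	n := 0
	for i := range s { // i is the byte index where each rune starts
		if n == runeOff {
			return i
		}
		n++
	}
	return len(s)
}

// highlightRunes wraps the rune range [start, end) of s in <mark> tags.
// it assumes 0 <= start <= end.
func highlightRunes(s string, start, end int) string {
	b0, b1 := runeToByteOffset(s, start), runeToByteOffset(s, end)
	return s[:b0] + "<mark>" + s[b0:b1] + "</mark>" + s[b1:]
}

func main() {
	s := "na\u00efve caf\u00e9 search" // ï and é take two bytes each in UTF-8

	byteOff := strings.Index(s, "search")
	runeOff := utf8.RuneCountInString(s[:byteOff])
	fmt.Println(byteOff, runeOff) // 13 11

	// treating the rune offset as a byte offset starts the slice inside é
	fmt.Println(utf8.ValidString(s[runeOff : runeOff+6])) // false

	// converting first puts the tags around the intended word
	fmt.Println(highlightRunes(s, runeOff, runeOff+6))
}
```

on pure ascii text the two offsets are identical, which is why a small test corpus never shows the problem.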

none of these were visible on ten documents. all of them were obvious on forty thousand. the lesson i keep relearning: scale does not introduce new bugs so much as it stops hiding the ones you already wrote.