Databricks on AWS

Updated Feb 18, 2025


Best practices for Mosaic AI Vector Search

This article provides tips for using Mosaic AI Vector Search effectively.

Recommendations for optimizing latency

  • Use the service principal authorization flow to take advantage of network-optimized routes.

  • Use the latest version of the Python SDK.

  • When testing, start with a concurrency of around 16 to 32; higher concurrency does not yield higher throughput.

  • Use a model served with provisioned throughput (for example, bge-large-en or a fine-tuned version), instead of a pay-per-token foundation model.
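To find the concurrency knee suggested above, you can sweep concurrency levels against your index and compare observed throughput. A minimal harness is sketched below; `query_fn` is a hypothetical wrapper you would write around a real query call (for example, `similarity_search` on an index obtained from the `databricks-vectorsearch` Python SDK), and the measurement logic itself is plain stdlib.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_throughput(query_fn, num_requests=64, concurrency=16):
    """Fire num_requests calls to query_fn at the given concurrency
    and return the observed rate in queries per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # list() forces all submitted calls to complete before timing stops
        list(pool.map(lambda _: query_fn(), range(num_requests)))
    return num_requests / (time.perf_counter() - start)
```

Sweeping `concurrency` over, say, 1, 16, and 32 shows where throughput stops improving, which is the point past which extra client concurrency only adds queueing latency.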

When to use GPUs

  • Use CPUs only for basic testing and for small datasets (up to hundreds of rows).

  • For GPU compute type, Databricks recommends using GPU-small or GPU-medium.

  • For GPU compute scale-out, increasing concurrency may improve ingestion times, but the benefit depends on factors such as total dataset size and index metadata.

Working with images, video, or non-text data

  • Pre-compute the embeddings and use a Delta Sync Index with self-managed embeddings.

  • Don’t store binary formats such as images as metadata, as this adversely affects latency. Instead, store the path of the file as metadata.
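The two bullets above combine into one pattern: embed each file up front and index only the path. A sketch under assumptions follows; `embed_image` is a hypothetical embedding function standing in for whatever vision model you use, and the resulting rows would feed a Delta Sync Index configured with self-managed embeddings.

```python
def build_index_rows(image_paths, embed_image):
    """Pre-compute embeddings and keep only the file path as metadata.

    Storing raw image bytes in the index inflates payloads and hurts
    query latency; storing the path lets callers fetch the file from
    its original location after retrieval.
    """
    rows = []
    for i, path in enumerate(image_paths):
        rows.append({
            "id": i,
            "image_path": path,              # metadata: the path, never the binary
            "embedding": embed_image(path),  # pre-computed, self-managed embedding
        })
    return rows
```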

Embedding sequence length

  • Check the embedding model sequence length to make sure documents are not being truncated. For example, BGE supports a context of 512 tokens. For longer context requirements, use gte-large-en-v1.5.
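A simple pre-ingestion check like the sketch below can surface documents at risk of truncation. The `tokenize` parameter is a stand-in for the embedding model's real tokenizer (whitespace splitting, used as the default here, is only a rough approximation of true token counts).

```python
def docs_at_truncation_risk(docs, max_tokens=512, tokenize=None):
    """Return the indices of documents whose approximate token count
    exceeds the embedding model's sequence length (e.g. 512 for BGE)."""
    tokenize = tokenize or str.split  # crude stand-in for the model tokenizer
    return [i for i, doc in enumerate(docs) if len(tokenize(doc)) > max_tokens]
```

Documents flagged this way should be chunked, or embedded with a longer-context model, before indexing.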

Use Triggered sync mode to reduce costs

  • The most cost-effective option for updating a vector search index is Triggered sync mode. Select Continuous only if you need the index to track changes in the source table with a latency of seconds. Both sync modes perform incremental updates: only data that has changed since the last sync is processed.
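The incremental behavior shared by both sync modes can be illustrated with a small model. This is not the service's implementation, just a sketch: `changes_by_version` maps a hypothetical Delta table version to the set of primary keys changed in that version, and a sync reprocesses only keys touched after the last synced version.

```python
def rows_to_resync(changes_by_version, last_synced_version, current_version):
    """Collect the primary keys changed since the last sync.

    Whether sync is Triggered or Continuous, only rows modified after
    last_synced_version are reprocessed; untouched rows incur no work.
    """
    changed = set()
    for version in range(last_synced_version + 1, current_version + 1):
        changed |= changes_by_version.get(version, set())
    return changed
```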


© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, and the Spark logo are trademarks of the Apache Software Foundation.
