# Scalable Asymmetric Retrieval with Voyage AI Embeddings in Vespa[¶](#scalable-asymmetric-retrieval-with-voyage-ai-embeddings-in-vespa) The [Voyage 4 model family](https://blog.voyageai.com/2026/01/15/voyage-4/) offers state-of-the-art embedding quality across a range of model sizes. Vespa recently added an integration to allow for seamless embedding through Voyage's API. This notebook demonstrates an **asymmetric retrieval** pattern, combining both this API-based integration, and a Vespa-local Open Source model: - **Indexing**: Use `voyage-4-large` (API-based, highest quality) to embed documents once via Vespa's [voyage-ai-embedder](https://docs.vespa.ai/en/reference/rag/embedding.html#voyageai-embedder). - **Querying**: Use `voyage-4-nano` (open-source, runs locally on the Vespa container) via [hugging-face-embedder](https://docs.vespa.ai/en/reference/rag/embedding.html#hugging-face-embedder) for zero-cost, low-latency queries. We combine [binary embeddings](https://docs.vespa.ai/en/rag/binarizing-vectors.html) for fast first-phase retrieval with **float reranking** for accuracy. Relevant resources: - [Vespa embedding documentation](https://docs.vespa.ai/en/rag/embedding.html) - [Embedding Tradeoffs, Quantified](https://blog.vespa.ai/embedding-tradeoffs-quantified/) — benchmarks of voyage-4-nano-int8 and other models on Vespa - [Nearest Neighbor Search](https://docs.vespa.ai/en/querying/nearest-neighbor-search.html) In \[ \]: Copied! ``` !pip3 install -U pyvespa vespacli ``` !pip3 install -U pyvespa vespacli ## Why Asymmetric Retrieval?[¶](#why-asymmetric-retrieval) Embedding documents and queries with the same API-based model works well, but at high query volumes the cost of embedding every query adds up. Asymmetric retrieval eliminates this cost entirely. ### The asymmetric insight[¶](#the-asymmetric-insight) - **Documents are embedded once** at indexing time. Use the best model (`voyage-4-large`) for maximum quality. - **Queries happen on every search**. Use a fast, local model (`voyage-4-nano`) with zero API cost and no rate limits. ### Example: 10K QPS[¶](#example-10k-qps) At 10,000 queries/sec with ~30-token queries, that's ~18M tokens per minute. Even at $0.02 per 1M tokens, this adds up to \*\*~$31K/month\*\* in embedding costs alone. Running `voyage-4-nano` locally on the Vespa container reduces this to **$0/month** with single-digit ms latency — the model runs as part of the serving infrastructure you're already paying for. The `voyage-4-nano` model from the same Voyage 4 family produces embeddings in the same vector space as `voyage-4-large`, making cross-model similarity meaningful. ### voyage-4-nano-int8 Quality Benchmarks[¶](#voyage-4-nano-int8-quality-benchmarks) From [Embedding Tradeoffs, Quantified](https://blog.vespa.ai/embedding-tradeoffs-quantified/), benchmarked on an AWS c7g.2xlarge instance. The model is 332 MB and supports a 32,768 token context with an embedding latency of 12.6-15.0 ms, running on CPU. It also supports [Matryoshka Representation Learning (MRL)](https://arxiv.org/abs/2205.13147) for flexible dimensionality. ## Define the Vespa Schema[¶](#define-the-vespa-schema) We define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with two document fields (`id`, `text`) and two synthetic embedding fields computed at indexing time by the `voyage-4-large` embedder: - `embedding_float`: Half-precision (bfloat16) embeddings (2048 dimensions) for accurate reranking. Uses [`paged` attribute](https://docs.vespa.ai/en/content/attributes.html#paged-attributes) to keep data on disk, reducing memory cost. - `embedding_binary`: [Binary (int8) embeddings](https://docs.vespa.ai/en/rag/binarizing-vectors.html) (2048/8 = 256 bytes) for fast [hamming-distance](https://docs.vespa.ai/en/reference/schema-reference.html#distance-metric) retrieval. In \[54\]: Copied! ``` from vespa.package import Schema, Document, Field SCHEMA_NAME = "doc" FEED_MODEL_ID = "voyage-4-large" QUERY_MODEL_ID = "voyage-4-nano-int8" schema = Schema( name=SCHEMA_NAME, document=Document( fields=[ Field(name="id", type="string", indexing=["summary", "attribute"]), Field(name="text", type="string", indexing=["index", "summary"]), ] ), ) # Synthetic fields: computed from 'text' at indexing time using the voyage-4-large embedder. # These are not part of the document type, so is_document_field=False. schema.add_fields( Field( name="embedding_float", type="tensor(x[2048])", indexing=["input text", f"embed {FEED_MODEL_ID}", "attribute"], attribute=["distance-metric: prenormalized-angular", "paged"], is_document_field=False, ) ) schema.add_fields( Field( name="embedding_binary", type="tensor(x[256])", # 2048 bits / 8 = 256 bytes indexing=["input text", f"embed {FEED_MODEL_ID}", "attribute"], attribute=["distance-metric: hamming"], is_document_field=False, ) ) ``` from vespa.package import Schema, Document, Field SCHEMA_NAME = "doc" FEED_MODEL_ID = "voyage-4-large" QUERY_MODEL_ID = "voyage-4-nano-int8" schema = Schema( name=SCHEMA_NAME, document=Document( fields=\[ Field(name="id", type="string", indexing=["summary", "attribute"]), Field(name="text", type="string", indexing=["index", "summary"]), \] ), ) # Synthetic fields: computed from 'text' at indexing time using the voyage-4-large embedder. # These are not part of the document type, so is_document_field=False. schema.add_fields( Field( name="embedding_float", type="tensor(x[2048])", indexing=["input text", f"embed {FEED_MODEL_ID}", "attribute"], attribute=["distance-metric: prenormalized-angular", "paged"], is_document_field=False, ) ) schema.add_fields( Field( name="embedding_binary", type="tensor(x[256])", # 2048 bits / 8 = 256 bytes indexing=["input text", f"embed {FEED_MODEL_ID}", "attribute"], attribute=["distance-metric: hamming"], is_document_field=False, ) ) ## Rank Profile: Binary Retrieval with Float Reranking[¶](#rank-profile-binary-retrieval-with-float-reranking) This [rank profile](https://docs.vespa.ai/en/ranking.html) implements a two-phase strategy: 1. **First phase**: Hamming distance on binary embeddings. This is extremely fast and scans many candidates cheaply. 1. **Second phase**: Cosine closeness on full float embeddings. This is more accurate and applied only to the top candidates from phase one. The query inputs (`q_float`, `q_bin`) are produced by the local `voyage-4-nano` model at query time. The [`rerank_count`](https://docs.vespa.ai/en/reference/schema-reference.html#rerank-count) controls how many first-phase candidates are rescored in the second phase. In \[55\]: Copied! ``` from vespa.package import RankProfile, Function, SecondPhaseRanking RERANK_COUNT = 2000 schema.add_rank_profile( RankProfile( name="binary-with-rerank", inputs=[ ("query(q_float)", "tensor(x[2048])"), ("query(q_bin)", "tensor(x[256])"), ], functions=[ Function( name="binary_closeness", expression="1 - (distance(field, embedding_binary) / 2048)", ), Function( name="float_closeness", expression="reduce(query(q_float) * attribute(embedding_float), sum, x)", ), ], first_phase="binary_closeness", second_phase=SecondPhaseRanking( expression="float_closeness", rerank_count=RERANK_COUNT ), summary_features=[ "binary_closeness", "float_closeness", ], ) ) ``` from vespa.package import RankProfile, Function, SecondPhaseRanking RERANK_COUNT = 2000 schema.add_rank_profile( RankProfile( name="binary-with-rerank", inputs=\[ ("query(q_float)", "tensor(x[2048])"), ("query(q_bin)", "tensor(x[256])"), \], functions=[ Function( name="binary_closeness", expression="1 - (distance(field, embedding_binary) / 2048)", ), Function( name="float_closeness", expression="reduce(query(q_float) * attribute(embedding_float), sum, x)", ), ], first_phase="binary_closeness", second_phase=SecondPhaseRanking( expression="float_closeness", rerank_count=RERANK_COUNT ), summary_features=[ "binary_closeness", "float_closeness", ], ) ) ### Why `paged` for float embeddings?[¶](#why-paged-for-float-embeddings) The `embedding_float` field uses the [`paged` attribute](https://docs.vespa.ai/en/content/attributes.html#paged-attributes), which lets Vespa page attribute data from memory to disk. This is critical for keeping memory costs manageable. **Napkin math** — memory per document at 2048 dimensions: | Representation | Type | Bytes/vector | 1M docs | 10M docs | 100M docs | | ------------------ | --------------------- | ------------ | -------- | -------- | --------- | | `embedding_float` | `bfloat16` (16-bit) | 4,096 B | ~3.8 GB | ~38 GB | ~381 GB | | `embedding_binary` | `int8` (1-bit packed) | 256 B | ~0.24 GB | ~2.4 GB | ~24 GB | The float embeddings are **16x larger** than the binary ones. Without `paged`, all float vectors must fit in memory. At 100M documents that's ~381 GB of RAM just for one field. With `paged`, the OS kernel manages what's in memory based on access patterns — only the vectors actually touched during reranking need to be resident. This works well here because **float vectors are only accessed during second-phase reranking**, not during first-phase retrieval. The first phase uses only the compact binary embeddings (always in memory), and the second phase touches at most `rerank-count` float vectors per query per content node. > **Important**: Do not combine `paged` with [HNSW indexing](https://docs.vespa.ai/en/approximate-nn-hnsw.html), as HNSW requires random access across the full graph during search, which would cause excessive disk I/O. Here we use `paged` safely because `embedding_float` has no HNSW index — it's accessed only via direct attribute lookups during reranking. ### Why `rerank-count` matters with `paged`[¶](#why-rerank-count-matters-with-paged) The [`rerank-count`](https://docs.vespa.ai/en/ranking/phased-ranking.html) parameter (set to 2000 above) controls how many first-phase candidates are re-scored in the second phase **per content node**. This is the knob that bounds cost: - **Too low** (e.g., 50): Fast, but the cheap binary first-phase may miss relevant documents that float reranking would have rescued. Recall suffers. - **Too high** (e.g., 50,000): More float vectors paged in from disk per query, increasing latency and disk I/O. The quality gains diminish quickly — most relevant documents are already in the top few thousand candidates. - **2000**: A reasonable default that balances recall, latency, and disk I/O. At 2000 candidates x 4,096 bytes per vector = ~8 MB of float data accessed per query per node — easily serviceable from the OS page cache for any reasonable query rate. The combination of `paged` + bounded `rerank-count` is what makes this architecture work: you get the storage efficiency of keeping float vectors on disk, with the performance guarantee that each query only touches a small, predictable number of them. ## Services Configuration[¶](#services-configuration) We configure two [embedder components](https://docs.vespa.ai/en/rag/embedding.html): 1. **`voyage-4-large`** ([voyage-ai-embedder](https://docs.vespa.ai/en/reference/rag/embedding.html#voyageai-embedder)): Calls the Voyage AI API. Used at document indexing time to produce high-quality embeddings. Requires an API key stored in Vespa Cloud's [secret store](https://cloud.vespa.ai/en/security/secret-store). 1. **`voyage-4-nano-int8`** ([hugging-face-embedder](https://docs.vespa.ai/en/reference/rag/embedding.html#hugging-face-embedder)): Runs locally on the Vespa container as an ONNX model. Used at query time for zero-cost, low-latency embedding. No API key needed. ### Batching for throughput[¶](#batching-for-throughput) The `voyage-ai-embedder` supports [dynamic batching](https://docs.vespa.ai/en/reference/rag/embedding.html#voyageai-embedder) of concurrent embedding requests into single API calls. We configure `max-size: 20` (up to 20 documents per batch) and `max-delay: 20ms` (maximum wait time before sending a partial batch). Since each batch counts as a single API call, this can reduce the number of calls by up to 20x — making it much easier to stay within the RPM (Requests Per Minute) limit of your Voyage API key. Combined with the increased `document-processing` threadpool (512 threads), this enables high-throughput parallel embedding at index time. ### Quantization[¶](#quantization) The `voyage-ai-embedder` also supports server-side `quantization` (with values `auto`, `float`, `int8`, or `binary`). When set to `auto` (the default), Vespa infers the appropriate quantization from the destination tensor's cell type and dimensions — so our `tensor` float field and `tensor` binary field are handled automatically. For this notebook we rely on `auto` quantization, which gives us full-precision bfloat16 embeddings paged to disk for accurate reranking, and compact binary embeddings in memory for fast retrieval. The `ServicesConfiguration` class below uses pyvespa's type-safe Python API for generating [`services.xml`](https://docs.vespa.ai/en/reference/services.html). For a deeper dive into all the configuration options available, see the [advanced configuration](https://vespa-engine.github.io/pyvespa/advanced-configuration.ipynb) notebook. In \[56\]: Copied! ``` from vespa.package import ServicesConfiguration from vespa.configuration.services import ( services, batching, container, content, search, document_api, document_processing, component, components, model, api_key_secret_ref, dimensions, documents, document, nodes, node, resources, secrets, threadpool, threads, redundancy, transformer_model, tokenizer_model, pooling_strategy, normalize, prepend, max_tokens, query, ) from vespa.configuration.vt import vt APPLICATION_NAME = "voyageai" # Replace with your Vespa Cloud secret store vault and secret name SECRET_STORE_VAULT_NAME = "pyvespa-testvault" VOYAGE_SECRET_NAME = "voyage_api_key" services_config = ServicesConfiguration( application_name=APPLICATION_NAME, services_config=services( container(id=f"{APPLICATION_NAME}_container", version="1.0")( secrets( vt( tag="apiKey", vault=SECRET_STORE_VAULT_NAME, name=VOYAGE_SECRET_NAME, ) ), search(), document_api(), # 256 threads per vCPU = 512 total with 2 vCPUs document_processing(threadpool(threads("256"))), components( # Local model for query-time embedding (zero API cost) component(id="voyage-4-nano-int8", type_="hugging-face-embedder")( transformer_model(model_id="voyage-4-nano-int8"), tokenizer_model(model_id="voyage-4-nano-vocab"), max_tokens("32768"), pooling_strategy("mean"), normalize("true"), prepend( query( "Represent the query for retrieving supporting documents: " ) ), ), # API-based model for index-time embedding (highest quality) component(id="voyage-4-large", type_="voyage-ai-embedder")( model("voyage-4-large"), api_key_secret_ref("apiKey"), dimensions("2048"), batching(max_size="20", max_delay="20ms"), ), ), nodes(count="1", required="true")( resources(vcpu="2", memory="8Gb", disk="50Gb", architecture="arm64") ), ), content(id=f"{APPLICATION_NAME}_content", version="1.0")( redundancy("1"), documents(document(type_="doc", mode="index")), nodes(node(distribution_key="0", hostalias="node1")), ), ), ) ``` from vespa.package import ServicesConfiguration from vespa.configuration.services import ( services, batching, container, content, search, document_api, document_processing, component, components, model, api_key_secret_ref, dimensions, documents, document, nodes, node, resources, secrets, threadpool, threads, redundancy, transformer_model, tokenizer_model, pooling_strategy, normalize, prepend, max_tokens, query, ) from vespa.configuration.vt import vt APPLICATION_NAME = "voyageai" # Replace with your Vespa Cloud secret store vault and secret name SECRET_STORE_VAULT_NAME = "pyvespa-testvault" VOYAGE_SECRET_NAME = "voyage_api_key" services_config = ServicesConfiguration( application_name=APPLICATION_NAME, services_config=services( container(id=f"{APPLICATION_NAME}\_container", version="1.0")( secrets( vt( tag="apiKey", vault=SECRET_STORE_VAULT_NAME, name=VOYAGE_SECRET_NAME, ) ), search(), document_api(), # 256 threads per vCPU = 512 total with 2 vCPUs document_processing(threadpool(threads("256"))), components( # Local model for query-time embedding (zero API cost) component(id="voyage-4-nano-int8", type\_="hugging-face-embedder")( transformer_model(model_id="voyage-4-nano-int8"), tokenizer_model(model_id="voyage-4-nano-vocab"), max_tokens("32768"), pooling_strategy("mean"), normalize("true"), prepend( query( "Represent the query for retrieving supporting documents: " ) ), ), # API-based model for index-time embedding (highest quality) component(id="voyage-4-large", type\_="voyage-ai-embedder")( model("voyage-4-large"), api_key_secret_ref("apiKey"), dimensions("2048"), batching(max_size="20", max_delay="20ms"), ), ), nodes(count="1", required="true")( resources(vcpu="2", memory="8Gb", disk="50Gb", architecture="arm64") ), ), content(id=f"{APPLICATION_NAME}_content", version="1.0")( redundancy("1"), documents(document(type_="doc", mode="index")), nodes(node(distribution_key="0", hostalias="node1")), ), ), ) ## Create and Deploy the Application Package[¶](#create-and-deploy-the-application-package) In \[57\]: Copied! ``` from vespa.package import ApplicationPackage app_package = ApplicationPackage( name=APPLICATION_NAME, schema=[schema], services_config=services_config, ) ``` from vespa.package import ApplicationPackage app_package = ApplicationPackage( name=APPLICATION_NAME, schema=[schema], services_config=services_config, ) Deploy to [Vespa Cloud](https://cloud.vespa.ai/en/). Create a tenant at [console.vespa-cloud.com](https://console.vespa-cloud.com/) if you don't have one. Before deploying, you need to configure a secret in the Vespa Cloud secret store with your Voyage AI API key. See [Vespa Cloud secret store](https://cloud.vespa.ai/en/security/secret-store) for instructions. > Deployments to dev expire after 14 days of inactivity. In \[58\]: Copied! ``` from vespa.deployment import VespaCloud from vespa.application import Vespa import os tenant_name = "vespa-team" # Replace with your tenant name key = os.getenv("VESPA_TEAM_API_KEY", None) if key is not None: key = key.replace(r"\n", "\n") vespa_cloud = VespaCloud( tenant=tenant_name, application=APPLICATION_NAME, # key_content=key, application_package=app_package, ) app: Vespa = vespa_cloud.deploy() ``` from vespa.deployment import VespaCloud from vespa.application import Vespa import os tenant_name = "vespa-team" # Replace with your tenant name key = os.getenv("VESPA_TEAM_API_KEY", None) if key is not None: key = key.replace(r"\\n", "\\n") vespa_cloud = VespaCloud( tenant=tenant_name, application=APPLICATION_NAME, # key_content=key, application_package=app_package, ) app: Vespa = vespa_cloud.deploy() ``` Setting application... Running: vespa config set application vespa-team.voyageai.default Setting target cloud... Running: vespa config set target cloud No api-key found for control plane access. Using access token. Checking for access token in auth.json... Access token expired. Please re-authenticate. Your Device Confirmation code is: RWHK-VXWW Automatically open confirmation page in your default browser? [Y/n] y Opened link in your browser: https://login.console.vespa-cloud.com/activate?user_code=RWHK-VXWW Waiting for login to complete in browser ... done;1m⣻ Success: Logged in auth.json created at /Users/thomas/.vespa/auth.json Successfully obtained access token for control plane access. Deployment started in run 9 of dev-aws-us-east-1c for vespa-team.voyageai. This may take a few minutes the first time. INFO [13:26:31] Deploying platform version 8.649.29 and application dev build 9 for dev-aws-us-east-1c of default ... INFO [13:26:31] Using CA signed certificate version 1 INFO [13:26:37] Session 404822 for tenant 'vespa-team' prepared and activated. INFO [13:27:04] ######## Details for all nodes ######## INFO [13:27:04] h136163a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP INFO [13:27:04] --- platform vespa/cloud-tenant-rhel8:8.649.29 INFO [13:27:04] --- logserver-container on port 4080 has not started INFO [13:27:04] --- metricsproxy-container on port 19092 has not started INFO [13:27:04] h136163b.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP INFO [13:27:04] --- platform vespa/cloud-tenant-rhel8:8.649.29 INFO [13:27:04] --- container-clustercontroller on port 19050 has not started INFO [13:27:04] --- metricsproxy-container on port 19092 has not started INFO [13:27:04] h136175a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP INFO [13:27:04] --- platform vespa/cloud-tenant-rhel8:8.649.29 INFO [13:27:04] --- container on port 4080 has not started INFO [13:27:04] --- metricsproxy-container on port 19092 has not started INFO [13:27:04] h136066a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP INFO [13:27:04] --- platform vespa/cloud-tenant-rhel8:8.649.29 INFO [13:27:04] --- storagenode on port 19102 has not started INFO [13:27:04] --- searchnode on port 19107 has not started INFO [13:27:04] --- distributor on port 19111 has not started INFO [13:27:04] --- metricsproxy-container on port 19092 has not started INFO [13:28:01] Waiting for convergence of 10 services across 4 nodes INFO [13:28:01] 1 nodes booting INFO [13:28:01] 1 application services still deploying DEBUG [13:28:01] h136175a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP DEBUG [13:28:01] --- platform vespa/cloud-tenant-rhel8:8.649.29 DEBUG [13:28:01] --- container on port 4080 has not started DEBUG [13:28:01] --- metricsproxy-container on port 19092 has config generation 404822, wanted is 404822 INFO [13:28:43] Found endpoints: INFO [13:28:43] - dev.aws-us-east-1c INFO [13:28:43] |-- https://ca603d84.b347094a.z.vespa-app.cloud/ (cluster 'voyageai_container') INFO [13:28:43] Deployment complete! Only region: aws-us-east-1c available in dev environment. Found mtls endpoint for voyageai_container URL: https://ca603d84.b347094a.z.vespa-app.cloud/ Application is up! ``` ## Feed Sample Documents[¶](#feed-sample-documents) We feed a few sample passages. At indexing time, Vespa calls the `voyage-4-large` API to generate both the float and binary embedding representations. In \[59\]: Copied! ``` from vespa.io import VespaResponse sample_docs = [ { "id": "1", "fields": { "id": "1", "text": "Retrieval-augmented generation (RAG) combines a retrieval system with a generative language model. The retriever finds relevant passages from a corpus, and the generator uses them as context to produce accurate, grounded answers.", }, }, { "id": "2", "fields": { "id": "2", "text": "Binary quantization reduces embedding storage by representing each dimension as a single bit. While this loses precision, combining binary retrieval with float reranking recovers most of the accuracy at a fraction of the memory cost.", }, }, { "id": "3", "fields": { "id": "3", "text": "Vespa is a fully featured search engine and vector database. It supports real-time indexing, structured and unstructured data, and advanced ranking with multiple retrieval and ranking phases.", }, }, { "id": "4", "fields": { "id": "4", "text": "Asymmetric retrieval uses different models for documents and queries. Documents are embedded once with an expensive, high-quality model, while queries use a smaller, faster model to keep latency low and costs down.", }, }, { "id": "5", "fields": { "id": "5", "text": "The Voyage 4 embedding model family includes voyage-4-large for maximum quality, voyage-4-lite for a balance of cost and quality, and voyage-4-nano as a small open-source model suitable for local deployment.", }, }, ] for doc in sample_docs: response: VespaResponse = app.feed_data_point( schema=SCHEMA_NAME, data_id=doc["id"], fields=doc["fields"] ) assert ( response.is_successful() ), f"Failed to feed doc {doc['id']}: {response.get_json()}" print(f"Fed document {doc['id']}") ``` from vespa.io import VespaResponse sample_docs = [ { "id": "1", "fields": { "id": "1", "text": "Retrieval-augmented generation (RAG) combines a retrieval system with a generative language model. The retriever finds relevant passages from a corpus, and the generator uses them as context to produce accurate, grounded answers.", }, }, { "id": "2", "fields": { "id": "2", "text": "Binary quantization reduces embedding storage by representing each dimension as a single bit. While this loses precision, combining binary retrieval with float reranking recovers most of the accuracy at a fraction of the memory cost.", }, }, { "id": "3", "fields": { "id": "3", "text": "Vespa is a fully featured search engine and vector database. It supports real-time indexing, structured and unstructured data, and advanced ranking with multiple retrieval and ranking phases.", }, }, { "id": "4", "fields": { "id": "4", "text": "Asymmetric retrieval uses different models for documents and queries. Documents are embedded once with an expensive, high-quality model, while queries use a smaller, faster model to keep latency low and costs down.", }, }, { "id": "5", "fields": { "id": "5", "text": "The Voyage 4 embedding model family includes voyage-4-large for maximum quality, voyage-4-lite for a balance of cost and quality, and voyage-4-nano as a small open-source model suitable for local deployment.", }, }, ] for doc in sample_docs: response: VespaResponse = app.feed_data_point( schema=SCHEMA_NAME, data_id=doc["id"], fields=doc["fields"] ) assert ( response.is_successful() ), f"Failed to feed doc {doc['id']}: {response.get_json()}" print(f"Fed document {doc['id']}") ``` Fed document 1 Fed document 2 Fed document 3 Fed document 4 Fed document 5 ``` ## Query with Binary Retrieval and Float Reranking[¶](#query-with-binary-retrieval-and-float-reranking) At query time, Vespa uses the local `voyage-4-nano` model to embed the query text. The [`embed()` function](https://docs.vespa.ai/en/rag/embedding.html#embedding-a-query-text) in the query invokes the local embedder, producing both float and binary query representations. The retrieval pipeline: 1. [`nearestNeighbor`](https://docs.vespa.ai/en/querying/nearest-neighbor-search.html) on `embedding_binary` scans candidates using fast hamming distance. 1. First-phase ranking scores by `binary_closeness`. 1. Second-phase reranking scores the top candidates by `float_closeness`. In \[60\]: Copied! ``` from vespa.io import VespaQueryResponse import vespa.querybuilder as qb import json query_text = "How does asymmetric embedding retrieval work?" response: VespaQueryResponse = app.query( yql=str( qb.select("*") .from_(SCHEMA_NAME) .where( qb.nearestNeighbor( field="embedding_binary", query_vector="q_bin", annotations={"targetHits": 100}, ) ) ), ranking="binary-with-rerank", body={ "input.query(q_bin)": f'embed({QUERY_MODEL_ID}, "{query_text}")', "input.query(q_float)": f'embed({QUERY_MODEL_ID}, "{query_text}")', "hits": 5, "presentation.timing": "true", }, ) assert response.is_successful() for hit in response.hits: print(json.dumps(hit, indent=2)) ``` from vespa.io import VespaQueryResponse import vespa.querybuilder as qb import json query_text = "How does asymmetric embedding retrieval work?" response: VespaQueryResponse = app.query( yql=str( qb.select("\*") .from\_(SCHEMA_NAME) .where( qb.nearestNeighbor( field="embedding_binary", query_vector="q_bin", annotations={"targetHits": 100}, ) ) ), ranking="binary-with-rerank", body={ "input.query(q_bin)": f'embed({QUERY_MODEL_ID}, "{query_text}")', "input.query(q_float)": f'embed({QUERY_MODEL_ID}, "{query_text}")', "hits": 5, "presentation.timing": "true", }, ) assert response.is_successful() for hit in response.hits: print(json.dumps(hit, indent=2)) ``` { "fields": { "documentid": "id:doc:doc::4", "id": "4", "sddocname": "doc", "summaryfeatures": { "binary_closeness": 0.63623046875, "float_closeness": 0.5481828630707257, "vespa.summaryFeatures.cached": 0.0 }, "text": "Asymmetric retrieval uses different models for documents and queries. Documents are embedded once with an expensive, high-quality model, while queries use a smaller, faster model to keep latency low and costs down." }, "id": "id:doc:doc::4", "relevance": 0.5481828630707257, "source": "voyageai_content" } { "fields": { "documentid": "id:doc:doc::2", "id": "2", "sddocname": "doc", "summaryfeatures": { "binary_closeness": 0.607421875, "float_closeness": 0.44831951343722665, "vespa.summaryFeatures.cached": 0.0 }, "text": "Binary quantization reduces embedding storage by representing each dimension as a single bit. While this loses precision, combining binary retrieval with float reranking recovers most of the accuracy at a fraction of the memory cost." }, "id": "id:doc:doc::2", "relevance": 0.44831951343722665, "source": "voyageai_content" } { "fields": { "documentid": "id:doc:doc::1", "id": "1", "sddocname": "doc", "summaryfeatures": { "binary_closeness": 0.58837890625, "float_closeness": 0.34075708348829004, "vespa.summaryFeatures.cached": 0.0 }, "text": "Retrieval-augmented generation (RAG) combines a retrieval system with a generative language model. The retriever finds relevant passages from a corpus, and the generator uses them as context to produce accurate, grounded answers." }, "id": "id:doc:doc::1", "relevance": 0.34075708348829004, "source": "voyageai_content" } { "fields": { "documentid": "id:doc:doc::5", "id": "5", "sddocname": "doc", "summaryfeatures": { "binary_closeness": 0.58154296875, "float_closeness": 0.31555799518827143, "vespa.summaryFeatures.cached": 0.0 }, "text": "The Voyage 4 embedding model family includes voyage-4-large for maximum quality, voyage-4-lite for a balance of cost and quality, and voyage-4-nano as a small open-source model suitable for local deployment." }, "id": "id:doc:doc::5", "relevance": 0.31555799518827143, "source": "voyageai_content" } { "fields": { "documentid": "id:doc:doc::3", "id": "3", "sddocname": "doc", "summaryfeatures": { "binary_closeness": 0.5703125, "float_closeness": 0.29142659282264916, "vespa.summaryFeatures.cached": 0.0 }, "text": "Vespa is a fully featured search engine and vector database. It supports real-time indexing, structured and unstructured data, and advanced ranking with multiple retrieval and ranking phases." }, "id": "id:doc:doc::3", "relevance": 0.29142659282264916, "source": "voyageai_content" } ``` The `summaryfeatures` in each hit show both scoring phases: - `binary_closeness`: The first-phase hamming-based score (fast, approximate). - `float_closeness`: The second-phase dot-product score between the query and document float embeddings. Since both embeddings are unit-normalized (`prenormalized-angular`), the dot product equals cosine similarity. The final `relevance` score is the second-phase float closeness for reranked candidates. In \[61\]: Copied! ``` response.json["timing"] ``` response.json["timing"] Out\[61\]: ``` {'querytime': 0.032, 'searchtime': 0.051000000000000004, 'summaryfetchtime': 0.015} ``` ## Cleanup[¶](#cleanup) In \[ \]: Copied! ``` vespa_cloud.delete() ``` vespa_cloud.delete()