# Vespa python API

> pyvespa is the official Python API for [Vespa.ai](https://vespa.ai/), the scalable open-source serving engine for storing, computing, and ranking big data at user serving time. pyvespa lets you create, modify, deploy, and interact with Vespa applications from Python, enabling rapid prototyping and access to Vespa features including vector search, hybrid retrieval, ranking, and real-time serving.

# Getting Started

[Vespa](https://vespa.ai/) is the scalable open-source serving engine to store, compute and rank big data at user serving time. `pyvespa` provides a Python API to Vespa. We aim for complete feature parity with Vespa and estimate that we cover more than 95% of Vespa features, with all of the most commonly used features supported. If you find a Vespa feature that you are not able to express or use with `pyvespa`, please [open an issue](https://github.com/vespa-engine/pyvespa/issues/new/choose).

## Quick start

To get a sense of the most basic functionality, check out the Hybrid Search quick start guides:

- [Hybrid search quick start - Docker](https://vespa-engine.github.io/pyvespa/getting-started-pyvespa.md)
- [Hybrid search quick start - Vespa Cloud](https://vespa-engine.github.io/pyvespa/getting-started-pyvespa-cloud.md)

## Overview of pyvespa features

There are two main interfaces to Vespa:

1. Control-plane API: used to deploy and manage Vespa applications.
    - [`VespaCloud`](https://vespa-engine.github.io/pyvespa/api/vespa/deployment.md#vespa.deployment.VespaCloud): control-plane interface to Vespa Cloud.
    - [`VespaDocker`](https://vespa-engine.github.io/pyvespa/api/vespa/deployment.md#vespa.deployment.VespaDocker): control-plane interface to a local Vespa instance (Docker/Podman).
2. Data-plane API: used to feed and query data in Vespa applications.
    - [`Vespa`](https://vespa-engine.github.io/pyvespa/api/vespa/application.md#vespa.application.Vespa)

Note that `VespaCloud` and `Vespa` require two separate authentication methods.
Refer to [Authenticating to Vespa Cloud](https://vespa-engine.github.io/pyvespa/authenticating-to-vespa-cloud.md) for details.

With `pyvespa` you can:

- Create and deploy application packages, including schemas, rank profiles, `services.xml`, query profiles, etc.
- [Feed and retrieve](https://vespa-engine.github.io/pyvespa/reads-writes.md) documents to/from Vespa, using the `/document/v1/` API.
- [Query](https://vespa-engine.github.io/pyvespa/query.md) Vespa applications, using the `/search/` API.
- [Build complex queries](https://vespa-engine.github.io/pyvespa/query.md#using-the-querybuilder-dsl-api) using the [`QueryBuilder`](https://vespa-engine.github.io/pyvespa/api/vespa/querybuilder/builder/builder.md) API.
- [Collect training data](https://vespa-engine.github.io/pyvespa/evaluating-vespa-application-cloud.md) for ML using [`VespaFeatureCollector`](https://vespa-engine.github.io/pyvespa/api/vespa/evaluation.md#vespa.evaluation.VespaFeatureCollector).
- [Evaluate](https://vespa-engine.github.io/pyvespa/evaluating-vespa-application-cloud.md) Vespa applications using [`VespaEvaluator`](https://vespa-engine.github.io/pyvespa/api/vespa/evaluation.md#vespa.evaluation.VespaEvaluator)/[`VespaMatchEvaluator`](https://vespa-engine.github.io/pyvespa/api/vespa/evaluation.md#vespa.evaluation.VespaMatchEvaluator).

## Requirements

Install `pyvespa`. We recommend using [`uv`](https://docs.astral.sh/uv/) to manage your Python environments:

```text
uv add pyvespa
```

or using `pip`:

```text
pip install pyvespa
```

## Check out the examples

Check out our wide variety of [Examples](https://vespa-engine.github.io/pyvespa/examples/index.md) that demonstrate how to use the Vespa Python API to serve various use cases.

# Hybrid Search - Quickstart

This tutorial creates a hybrid text search application combining traditional keyword matching with semantic vector search (dense retrieval).
It also demonstrates using [Vespa native embedder](https://docs.vespa.ai/en/embedding.html) functionality.

Refer to [troubleshooting](https://vespa-engine.github.io/pyvespa/troubleshooting.md) if you run into problems while running this guide.

[Install pyvespa](https://pyvespa.readthedocs.io/) and start the Docker daemon, then validate that a minimum of 6G of memory is available:

```
!pip3 install pyvespa
!docker info | grep "Total Memory"
```

## Create an application package

The [application package](https://vespa-engine.github.io/pyvespa/api/vespa/package.md) has all the Vespa configuration files - create one from scratch:

```python
from vespa.package import (
    ApplicationPackage,
    Field,
    Schema,
    Document,
    HNSW,
    RankProfile,
    Component,
    Parameter,
    FieldSet,
    GlobalPhaseRanking,
    Function,
)

package = ApplicationPackage(
    name="hybridsearch",
    schema=[
        Schema(
            name="doc",
            document=Document(
                fields=[
                    Field(name="id", type="string", indexing=["summary"]),
                    Field(
                        name="title",
                        type="string",
                        indexing=["index", "summary"],
                        index="enable-bm25",
                    ),
                    Field(
                        name="body",
                        type="string",
                        indexing=["index", "summary"],
                        index="enable-bm25",
                        bolding=True,
                    ),
                    Field(
                        name="embedding",
                        type="tensor(x[384])",
                        indexing=[
                            'input title . " " . input body',
                            "embed",
                            "index",
                            "attribute",
                        ],
                        ann=HNSW(distance_metric="angular"),
                        is_document_field=False,
                    ),
                ]
            ),
            fieldsets=[FieldSet(name="default", fields=["title", "body"])],
            rank_profiles=[
                RankProfile(
                    name="bm25",
                    inputs=[("query(q)", "tensor(x[384])")],
                    functions=[
                        Function(name="bm25sum", expression="bm25(title) + bm25(body)")
                    ],
                    first_phase="bm25sum",
                ),
                RankProfile(
                    name="semantic",
                    inputs=[("query(q)", "tensor(x[384])")],
                    first_phase="closeness(field, embedding)",
                ),
                RankProfile(
                    name="fusion",
                    inherits="bm25",
                    inputs=[("query(q)", "tensor(x[384])")],
                    first_phase="closeness(field, embedding)",
                    global_phase=GlobalPhaseRanking(
                        expression="reciprocal_rank_fusion(bm25sum, closeness(field, embedding))",
                        rerank_count=1000,
                    ),
                ),
            ],
        )
    ],
    components=[
        Component(
            id="e5",
            type="hugging-face-embedder",
            parameters=[
                Parameter(
                    "transformer-model",
                    {
                        "url": "https://data.vespa-cloud.com/sample-apps-data/e5-small-v2-int8/e5-small-v2-int8.onnx"
                    },
                ),
                Parameter(
                    "tokenizer-model",
                    {
                        "url": "https://data.vespa-cloud.com/sample-apps-data/e5-small-v2-int8/tokenizer.json"
                    },
                ),
            ],
        )
    ],
)
```

Note that the application name cannot contain `-` or `_`.

## Deploy the Vespa application

Deploy `package` on the local machine using Docker, without leaving the notebook, by creating an instance of [VespaDocker](https://vespa-engine.github.io/pyvespa/api/vespa/deployment#vespa.deployment.VespaDocker). `VespaDocker` connects to the local Docker daemon socket and starts the [Vespa docker image](https://hub.docker.com/r/vespaengine/vespa/).

If this step fails, check that the Docker daemon is running, and that the Docker daemon socket can be used by clients (configurable under advanced settings in Docker Desktop).
```python
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=package)
```

`app` now holds a reference to a [Vespa](https://vespa-engine.github.io/pyvespa/api/vespa/application.md#vespa.application.Vespa) instance.

## Feeding documents to Vespa

In this example we use the [HF Datasets](https://huggingface.co/docs/datasets/index) library to stream the [BeIR/nfcorpus](https://huggingface.co/datasets/BeIR/nfcorpus) dataset and index it in our newly deployed Vespa instance. Read more about [NFCorpus](https://huggingface.co/datasets/mteb/nfcorpus):

> NFCorpus is a full-text English retrieval data set for Medical Information Retrieval.

The following uses the [stream](https://huggingface.co/docs/datasets/stream) option of datasets to stream the data without downloading all the contents locally. The `map` functionality allows us to convert the dataset fields into the feed format expected by `pyvespa`, a dict with the keys `id` and `fields`:

`{ "id": "vespa-document-id", "fields": {"vespa_field": "vespa-field-value"}}`

```python
from datasets import load_dataset

dataset = load_dataset("BeIR/nfcorpus", "corpus", split="corpus", streaming=True)
vespa_feed = dataset.map(
    lambda x: {
        "id": x["_id"],
        "fields": {"title": x["title"], "body": x["text"], "id": x["_id"]},
    }
)
```

Now we can feed to Vespa using `feed_iterable`, which accepts any `Iterable` and an optional callback function where we can check the outcome of each operation.
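As a minimal sketch of the transform above, here is one hypothetical dataset row (made-up values; the `_id`/`title`/`text` field names match BeIR/nfcorpus) mapped to the pyvespa feed format:

```python
# A made-up BeIR/nfcorpus-style row, for illustration only.
row = {"_id": "MED-10", "title": "Example title", "text": "Example body text."}

# The pyvespa feed format: a dict with the two keys "id" and "fields".
feed_doc = {
    "id": row["_id"],
    "fields": {"title": row["title"], "body": row["text"], "id": row["_id"]},
}
```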
The application is configured to use [embedding](https://docs.vespa.ai/en/embedding.html) functionality, which produces a vector embedding from a concatenation of the title and the body input fields. This step is computationally expensive. Read more about embedding inference in Vespa in [Accelerating Transformer-based Embedding Retrieval with Vespa](https://blog.vespa.ai/accelerating-transformer-based-embedding-retrieval-with-vespa/).

```python
from vespa.io import VespaResponse, VespaQueryResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(f"Error when feeding document {id}: {response.get_json()}")


app.feed_iterable(vespa_feed, schema="doc", namespace="tutorial", callback=callback)
```

## Querying Vespa

Using the [Vespa Query language](https://docs.vespa.ai/en/query-language.html) we can query the indexed data.

- Use a context manager, `with app.syncio() as session`, to handle connection pooling ([best practices](https://cloud.vespa.ai/en/http-best-practices)).
- The query method accepts any valid Vespa [query API parameter](https://docs.vespa.ai/en/reference/query-api-reference.html) in `**kwargs`.
- Vespa API parameter names that contain `.` must be sent as `dict` parameters in the `body` method argument.

The following searches for `How Fruits and Vegetables Can Treat Asthma?` using different retrieval and [ranking](https://docs.vespa.ai/en/ranking.html) strategies.
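For example, `input.query(q)` contains a `.` and therefore cannot be passed as a Python keyword argument. A minimal sketch of how such a parameter ends up in the `body` dict (plain dict construction only, no pyvespa call):

```python
query = "How Fruits and Vegetables Can Treat Asthma?"

# Parameters whose names contain "." (e.g. input.query(q)) go in the body dict;
# the rest can be passed as regular keyword arguments to session.query().
body = {"input.query(q)": f"embed({query})"}
```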
First we define a simple routine that returns a dataframe of the results, for prettier display in the notebook:

```python
import pandas as pd


def display_hits_as_df(response: VespaQueryResponse, fields) -> pd.DataFrame:
    records = []
    for hit in response.hits:
        record = {}
        for field in fields:
            record[field] = hit["fields"][field]
        records.append(record)
    return pd.DataFrame(records)
```

### Plain Keyword search

The following uses plain keyword search functionality with [bm25](https://docs.vespa.ai/en/reference/bm25.html) ranking. The `bm25` rank profile was configured in the application package to use a linear combination of the bm25 scores of the query terms against the title and the body fields.

```python
with app.syncio(connections=1) as session:
    query = "How Fruits and Vegetables Can Treat Asthma?"
    response: VespaQueryResponse = session.query(
        yql="select * from sources * where userQuery() limit 5",
        query=query,
        ranking="bm25",
    )
    assert response.is_successful()
    print(display_hits_as_df(response, ["id", "title"]))
```

### Plain Semantic Search

The following uses dense vector representations of the query and the documents; matching is performed and accelerated by Vespa's support for [approximate nearest neighbor search](https://docs.vespa.ai/en/approximate-nn-hnsw.html). The vector embedding representation of the text is obtained using Vespa's [embedder functionality](https://docs.vespa.ai/en/embedding.html#embedding-a-query-text).
```python
with app.syncio(connections=1) as session:
    query = "How Fruits and Vegetables Can Treat Asthma?"
    response: VespaQueryResponse = session.query(
        yql="select * from sources * where ({targetHits:1000}nearestNeighbor(embedding,q)) limit 5",
        query=query,
        ranking="semantic",
        body={"input.query(q)": f"embed({query})"},
    )
    assert response.is_successful()
    print(display_hits_as_df(response, ["id", "title"]))
```

### Hybrid Search

This is one approach to combining the two retrieval strategies, using Vespa's support for [cross-hit feature normalization and reciprocal rank fusion](https://docs.vespa.ai/en/phased-ranking.html#cross-hit-normalization-including-reciprocal-rank-fusion). This functionality is exposed in the context of `global` re-ranking, after the distributed query retrieval execution, which might span thousands of nodes.

#### Hybrid search with the OR query operator

This combines the two methods using logical disjunction (OR). Note that the first-phase expression in our `fusion` profile uses only the semantic score, because semantic search usually provides better recall than sparse keyword search alone.

```python
with app.syncio(connections=1) as session:
    query = "How Fruits and Vegetables Can Treat Asthma?"
    response: VespaQueryResponse = session.query(
        yql="select * from sources * where userQuery() or ({targetHits:1000}nearestNeighbor(embedding,q)) limit 5",
        query=query,
        ranking="fusion",
        body={"input.query(q)": f"embed({query})"},
    )
    assert response.is_successful()
    print(display_hits_as_df(response, ["id", "title"]))
```

#### Hybrid search with the RANK query operator

This combines the two methods using the [rank](https://docs.vespa.ai/en/reference/query-language-reference.html#rank) query operator. Here we retrieve the top 1000 documents using vector search, and then have sparse features like BM25 calculated as well (from the second operand of the rank operator). Finally, the hits are re-ranked using reciprocal rank fusion.

```python
with app.syncio(connections=1) as session:
    query = "How Fruits and Vegetables Can Treat Asthma?"
    response: VespaQueryResponse = session.query(
        yql="select * from sources * where rank({targetHits:1000}nearestNeighbor(embedding,q), userQuery()) limit 5",
        query=query,
        ranking="fusion",
        body={"input.query(q)": f"embed({query})"},
    )
    assert response.is_successful()
    print(display_hits_as_df(response, ["id", "title"]))
```

#### Hybrid search with filters

In this example we add another query term to the YQL, restricting the nearest neighbor search to only consider documents that have "vegetable" in the title.

```python
with app.syncio(connections=1) as session:
    query = "How Fruits and Vegetables Can Treat Asthma?"
    response: VespaQueryResponse = session.query(
        yql='select * from sources * where title contains "vegetable" and rank({targetHits:1000}nearestNeighbor(embedding,q), userQuery()) limit 5',
        query=query,
        ranking="fusion",
        body={"input.query(q)": f"embed({query})"},
    )
    assert response.is_successful()
    print(display_hits_as_df(response, ["id", "title"]))
```

## Cleanup

```python
vespa_docker.container.stop()
vespa_docker.container.remove()
```

## Next steps

This is just an intro into the capabilities of Vespa and pyvespa. Browse the site to learn more about schemas, feeding and queries - find more complex applications in [examples](https://vespa-engine.github.io/pyvespa/examples/index.md).
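To build intuition for what the `fusion` profile's `reciprocal_rank_fusion` expression computes, here is a plain-Python sketch (illustrative only - Vespa evaluates this natively in the global ranking phase, and `k=60` is a commonly used constant, not necessarily what Vespa uses):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical top-3 lists from a keyword ranking and a semantic ranking:
fused = reciprocal_rank_fusion([["d1", "d2", "d3"], ["d3", "d1", "d4"]])
# d1 sits near the top of both lists, so it wins the fused ranking.
```

Documents ranked high by both strategies accumulate the largest fused score, which is why RRF is a robust way to combine rankings whose raw scores are not comparable.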
# Hybrid Search - Quickstart on Vespa Cloud

This is the same guide as [getting-started-pyvespa](https://vespa-engine.github.io/pyvespa/getting-started-pyvespa.md), deploying to Vespa Cloud instead.

Refer to [troubleshooting](https://vespa-engine.github.io/pyvespa/troubleshooting.md) if you run into problems while running this guide.

**Prerequisite**: Create a tenant at [cloud.vespa.ai](https://cloud.vespa.ai/) and save the tenant name.

## Install

Install [pyvespa](https://pyvespa.readthedocs.io/) >= 0.45 and the [Vespa CLI](https://docs.vespa.ai/en/vespa-cli.html). The Vespa CLI is used for data-plane and control-plane key management ([Vespa Cloud Security Guide](https://cloud.vespa.ai/en/security/guide)).

```
!pip3 install pyvespa vespacli
```

## Configure application

```python
# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
# Replace with your application name (does not need to exist yet)
application = "hybridsearch"
```

## Create an application package

The [application package](https://vespa-engine.github.io/pyvespa/api/vespa/package.md#vespa.package.ApplicationPackage) has all the Vespa configuration files - create one from scratch:
```python
from vespa.package import (
    ApplicationPackage,
    Field,
    Schema,
    Document,
    HNSW,
    RankProfile,
    Component,
    Parameter,
    FieldSet,
    GlobalPhaseRanking,
    Function,
)

package = ApplicationPackage(
    name=application,
    schema=[
        Schema(
            name="doc",
            document=Document(
                fields=[
                    Field(name="id", type="string", indexing=["summary"]),
                    Field(
                        name="title",
                        type="string",
                        indexing=["index", "summary"],
                        index="enable-bm25",
                    ),
                    Field(
                        name="body",
                        type="string",
                        indexing=["index", "summary"],
                        index="enable-bm25",
                        bolding=True,
                    ),
                    Field(
                        name="embedding",
                        type="tensor(x[384])",
                        indexing=[
                            'input title . " " . input body',
                            "embed",
                            "index",
                            "attribute",
                        ],
                        ann=HNSW(distance_metric="angular"),
                        is_document_field=False,
                    ),
                ]
            ),
            fieldsets=[FieldSet(name="default", fields=["title", "body"])],
            rank_profiles=[
                RankProfile(
                    name="bm25",
                    inputs=[("query(q)", "tensor(x[384])")],
                    functions=[
                        Function(name="bm25sum", expression="bm25(title) + bm25(body)")
                    ],
                    first_phase="bm25sum",
                ),
                RankProfile(
                    name="semantic",
                    inputs=[("query(q)", "tensor(x[384])")],
                    first_phase="closeness(field, embedding)",
                ),
                RankProfile(
                    name="fusion",
                    inherits="bm25",
                    inputs=[("query(q)", "tensor(x[384])")],
                    first_phase="closeness(field, embedding)",
                    global_phase=GlobalPhaseRanking(
                        expression="reciprocal_rank_fusion(bm25sum, closeness(field, embedding))",
                        rerank_count=1000,
                    ),
                ),
            ],
        )
    ],
    components=[
        Component(
            id="e5",
            type="hugging-face-embedder",
            parameters=[
                Parameter(
                    "transformer-model",
                    {
                        "url": "https://data.vespa-cloud.com/sample-apps-data/e5-small-v2-int8/e5-small-v2-int8.onnx"
                    },
                ),
                Parameter(
                    "tokenizer-model",
                    {
                        "url": "https://data.vespa-cloud.com/sample-apps-data/e5-small-v2-int8/tokenizer.json"
                    },
                ),
            ],
        )
    ],
)
```

Note that the application name cannot contain `-` or `_`.

## Deploy to Vespa Cloud

The app is now defined and ready to deploy to Vespa Cloud. Deploy `package` to Vespa Cloud by creating an instance of [VespaCloud](https://vespa-engine.github.io/pyvespa/api/vespa/deployment.md#vespa.deployment.VespaCloud):

```python
from vespa.deployment import VespaCloud
import os

# Key is only used for CI/CD. Can be removed if logging in interactively
key = os.getenv("VESPA_TEAM_API_KEY", None)
if key is not None:
    key = key.replace(r"\n", "\n")  # To parse key correctly

vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=application,
    key_content=key,  # Key is only used for CI/CD. Can be removed if logging in interactively
    application_package=package,
)
```

```
Setting application...
Running: vespa config set application vespa-team.hybridsearch
Setting target cloud...
Running: vespa config set target cloud

Api-key found for control plane access. Using api-key.
```

For more details on the different authentication options and methods, see [authenticating-to-vespa-cloud](https://vespa-engine.github.io/pyvespa/authenticating-to-vespa-cloud.md).

The following will upload the application package to the Vespa Cloud dev zone (`aws-us-east-1c`); read more about [Vespa Zones](https://cloud.vespa.ai/en/reference/zones.html). The Vespa Cloud dev zone is a sandbox environment where resources are down-scaled and idle deployments are expired automatically. For information about production deployments, see [deploy_to_prod](https://vespa-engine.github.io/pyvespa/api/vespa/deployment.md#vespa.deployment.VespaCloud.deploy_to_prod).

> Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.

Now deploy the app to the Vespa Cloud dev zone.
The first deployment typically takes 2 minutes until the endpoint is up. (Applications that, for example, refer to large ONNX models may take a bit longer.)

```python
app = vespa_cloud.deploy()
```

```
Deployment started in run 7 of dev-aws-us-east-1c for vespa-team.hybridsearch. This may take a few minutes the first time.
INFO    [07:04:51]  Deploying platform version 8.367.14 and application dev build 6 for dev-aws-us-east-1c of default ...
INFO    [07:04:51]  Using CA signed certificate version 3
INFO    [07:04:52]  Using 1 nodes in container cluster 'hybridsearch_container'
INFO    [07:04:53]  Validating Onnx models memory usage for container cluster 'hybridsearch_container', percentage of available memory too low (10 < 15) to avoid restart, consider a flavor with more memory to avoid this
WARNING [07:04:53]  Auto-overriding validation which would be disallowed in production: certificate-removal: Data plane certificate(s) from cluster 'hybridsearch_container' is removed (removed certificates: [CN=cloud.vespa.example]) This can cause client connection issues.. To allow this add certificate-removal to validation-overrides.xml, see https://docs.vespa.ai/en/reference/validation-overrides.html
INFO    [07:04:55]  Session 298587 for tenant 'vespa-team' prepared and activated.
INFO    [07:04:55]  ######## Details for all nodes ########
INFO    [07:04:55]  h94416a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO    [07:04:55]  --- platform vespa/cloud-tenant-rhel8:8.367.14
INFO    [07:04:55]  --- container on port 4080 has config generation 298580, wanted is 298587
INFO    [07:04:55]  --- metricsproxy-container on port 19092 has config generation 298587, wanted is 298587
INFO    [07:04:55]  h94249f.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO    [07:04:55]  --- platform vespa/cloud-tenant-rhel8:8.367.14
INFO    [07:04:55]  --- container-clustercontroller on port 19050 has config generation 298580, wanted is 298587
INFO    [07:04:55]  --- metricsproxy-container on port 19092 has config generation 298580, wanted is 298587
INFO    [07:04:55]  h93394a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO    [07:04:55]  --- platform vespa/cloud-tenant-rhel8:8.367.14
INFO    [07:04:55]  --- logserver-container on port 4080 has config generation 298587, wanted is 298587
INFO    [07:04:55]  --- metricsproxy-container on port 19092 has config generation 298580, wanted is 298587
INFO    [07:04:55]  h94419a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO    [07:04:55]  --- platform vespa/cloud-tenant-rhel8:8.367.14
INFO    [07:04:55]  --- storagenode on port 19102 has config generation 298587, wanted is 298587
INFO    [07:04:55]  --- searchnode on port 19107 has config generation 298587, wanted is 298587
INFO    [07:04:55]  --- distributor on port 19111 has config generation 298587, wanted is 298587
INFO    [07:04:55]  --- metricsproxy-container on port 19092 has config generation 298587, wanted is 298587
INFO    [07:05:02]  Found endpoints:
INFO    [07:05:02]  - dev.aws-us-east-1c
INFO    [07:05:02]   |-- https://f7f73182.eb1181f2.z.vespa-app.cloud/ (cluster 'hybridsearch_container')
INFO    [07:05:02]  Deployment of new application complete!
Found mtls endpoint for hybridsearch_container
URL: https://f7f73182.eb1181f2.z.vespa-app.cloud/
Connecting to https://f7f73182.eb1181f2.z.vespa-app.cloud/
Using mtls_key_cert Authentication against endpoint https://f7f73182.eb1181f2.z.vespa-app.cloud//ApplicationStatus
Application is up!
Finished deployment.
```

If the deployment failed, it is possible you forgot to add the key in the Vespa Cloud Console in the `vespa auth api-key` step above. If you can authenticate, you should see lines like the following:

```
Deployment started in run 1 of dev-aws-us-east-1c for mytenant.hybridsearch.
```

The deployment takes a few minutes the first time while Vespa Cloud sets up the resources for your Vespa application.

`app` now holds a reference to a [Vespa](https://vespa-engine.github.io/pyvespa/api/vespa/application.md#vespa.application.Vespa) instance. We can access the mTLS-protected endpoint name using the control-plane (`vespa_cloud`) instance, and query and feed to this endpoint (data-plane access) using the mTLS certificate generated in the previous steps.

```python
endpoint = vespa_cloud.get_mtls_endpoint()
endpoint
```

```
Found mtls endpoint for hybridsearch_container
URL: https://f7f73182.eb1181f2.z.vespa-app.cloud/
```

```
'https://f7f73182.eb1181f2.z.vespa-app.cloud/'
```

## Feeding documents to Vespa

In this example we use the [HF Datasets](https://huggingface.co/docs/datasets/index) library to stream the [BeIR/nfcorpus](https://huggingface.co/datasets/BeIR/nfcorpus) dataset and index it in our newly deployed Vespa instance. Read more about [NFCorpus](https://huggingface.co/datasets/mteb/nfcorpus):

> NFCorpus is a full-text English retrieval data set for Medical Information Retrieval.

The following uses the [stream](https://huggingface.co/docs/datasets/stream) option of datasets to stream the data without downloading all the contents locally.
The `map` functionality allows us to convert the dataset fields into the feed format expected by `pyvespa`: a dict with the keys `id` and `fields`:

`{ "id": "vespa-document-id", "fields": {"vespa_field": "vespa-field-value"}}`

In [7]:

```
from datasets import load_dataset

dataset = load_dataset("BeIR/nfcorpus", "corpus", split="corpus", streaming=True)
vespa_feed = dataset.map(
    lambda x: {
        "id": x["_id"],
        "fields": {"title": x["title"], "body": x["text"], "id": x["_id"]},
    }
)
```

Now we can feed to Vespa using `feed_iterable`, which accepts any `Iterable` and an optional callback function where we can check the outcome of each operation. The application is configured to use [embedding](https://docs.vespa.ai/en/embedding.html) functionality that produces a vector embedding from a concatenation of the title and the body input fields. This step is resource-intensive. Read more about embedding inference in Vespa in the [Accelerating Transformer-based Embedding Retrieval with Vespa](https://blog.vespa.ai/accelerating-transformer-based-embedding-retrieval-with-vespa/) blog post.

Default node resources in the Vespa Cloud dev zone have 2 vCPUs.

In [8]:
```
from vespa.io import VespaResponse, VespaQueryResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(f"Error when feeding document {id}: {response.get_json()}")


app.feed_iterable(vespa_feed, schema="doc", namespace="tutorial", callback=callback)
```

```
Using mtls_key_cert
Authentication against endpoint https://f7f73182.eb1181f2.z.vespa-app.cloud//ApplicationStatus
```

## Querying Vespa[¶](#querying-vespa)

Using the [Vespa Query language](https://docs.vespa.ai/en/query-language.html) we can query the indexed data.

- Use a context manager, `with app.syncio() as session`, to handle connection pooling ([best practices](https://cloud.vespa.ai/en/http-best-practices))
- The query method accepts any valid Vespa [query api parameter](https://docs.vespa.ai/en/reference/query-api-reference.html) in `**kwargs`
- Vespa api parameter names that contain `.` must be sent as `dict` parameters in the `body` method argument

The following searches for `How Fruits and Vegetables Can Treat Asthma?` using different retrieval and [ranking](https://docs.vespa.ai/en/ranking.html) strategies. Query the text search app using the [Vespa Query language](https://docs.vespa.ai/en/query-language.html) by sending the parameters to the body argument of [Vespa.query](https://vespa-engine.github.io/pyvespa/api/vespa/application.md#vespa.application.Vespa.query).

First we define a simple routine that returns a dataframe of the results, for prettier display in the notebook.

In [9]:
```
import pandas as pd


def display_hits_as_df(response: VespaQueryResponse, fields) -> pd.DataFrame:
    records = []
    for hit in response.hits:
        record = {}
        for field in fields:
            record[field] = hit["fields"][field]
        records.append(record)
    return pd.DataFrame(records)
```

### Plain Keyword search[¶](#plain-keyword-search)

The following uses plain keyword search functionality with [bm25](https://docs.vespa.ai/en/reference/bm25.html) ranking. The `bm25` rank-profile was configured in the application package to use a linear combination of the bm25 scores of the query terms against the title and the body fields.

In [10]:

```
with app.syncio(connections=1) as session:
    query = "How Fruits and Vegetables Can Treat Asthma?"
    response: VespaQueryResponse = session.query(
        yql="select * from sources * where userQuery() limit 5",
        query=query,
        ranking="bm25",
    )
    assert response.is_successful()
    print(display_hits_as_df(response, ["id", "title"]))
```

```
         id                                              title
0  MED-2450  Protective effect of fruits, vegetables and th...
1  MED-2464  Low vegetable intake is associated with allerg...
2  MED-1162  Pesticide residues in imported, organic, and "...
3  MED-2461  The association of diet with respiratory sympt...
4  MED-2085  Antiplatelet, anticoagulant, and fibrinolytic ...
```

### Plain Semantic Search[¶](#plain-semantic-search)

The following uses dense vector representations of the query and the document; matching is performed, and accelerated, by Vespa's support for [approximate nearest neighbor search](https://docs.vespa.ai/en/approximate-nn-hnsw.html). The vector embedding representation of the text is obtained using Vespa's [embedder functionality](https://docs.vespa.ai/en/embedding.html#embedding-a-query-text).

In [11]:

```
with app.syncio(connections=1) as session:
    query = "How Fruits and Vegetables Can Treat Asthma?"
    response: VespaQueryResponse = session.query(
        yql="select * from sources * where ({targetHits:5}nearestNeighbor(embedding,q)) limit 5",
        query=query,
        ranking="semantic",
        body={"input.query(q)": f"embed({query})"},
    )
    assert response.is_successful()
    print(display_hits_as_df(response, ["id", "title"]))
```

```
         id                                              title
0  MED-5072  Lycopene-rich treatments modify noneosinophili...
1  MED-2472  Vegan regimen with reduced medication in the t...
2  MED-2464  Low vegetable intake is associated with allerg...
3  MED-2458  Manipulating antioxidant intake in asthma: a r...
4  MED-2450  Protective effect of fruits, vegetables and th...
```

### Hybrid Search[¶](#hybrid-search)

This is one approach to combining the two retrieval strategies, using Vespa's support for [cross-hits feature normalization and reciprocal rank fusion](https://docs.vespa.ai/en/phased-ranking.html#cross-hit-normalization-including-reciprocal-rank-fusion).
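To build intuition for reciprocal rank fusion before running the queries: each ranked list contributes `1 / (k + rank)` per document, and documents ranked highly in several lists rise to the top. A self-contained sketch (the `k=60` default and the document ids are illustrative, not Vespa's implementation):

```python
# Sketch of reciprocal rank fusion: each ranking contributes 1/(k + rank)
# per document; documents that appear high in both lists get the largest sums.

def reciprocal_rank_fusion(*rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["MED-2450", "MED-2464", "MED-1162"]
semantic_hits = ["MED-5072", "MED-2464", "MED-2450"]
fused = reciprocal_rank_fusion(bm25_hits, semantic_hits)
# MED-2450 and MED-2464 appear in both lists, so they rank above the rest.
```

In Vespa, this fusion happens inside the `fusion` rank-profile rather than in client code.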
This functionality is exposed in the context of `global` re-ranking, after the distributed query retrieval execution, which might span thousands of nodes.

#### Hybrid search with the OR query operator[¶](#hybrid-search-with-the-or-query-operator)

This combines the two methods using logical disjunction (OR). Note that the first-phase expression in our `fusion` rank-profile uses only the semantic score, because semantic search usually provides better recall than sparse keyword search alone.

In [12]:

```
with app.syncio(connections=1) as session:
    query = "How Fruits and Vegetables Can Treat Asthma?"
    response: VespaQueryResponse = session.query(
        yql="select * from sources * where userQuery() or ({targetHits:1000}nearestNeighbor(embedding,q)) limit 5",
        query=query,
        ranking="fusion",
        body={"input.query(q)": f"embed({query})"},
    )
    assert response.is_successful()
    print(display_hits_as_df(response, ["id", "title"]))
```

```
         id                                              title
0  MED-2464  Low vegetable intake is associated with allerg...
1  MED-2450  Protective effect of fruits, vegetables and th...
2  MED-2458  Manipulating antioxidant intake in asthma: a r...
3  MED-2461  The association of diet with respiratory sympt...
4  MED-5072  Lycopene-rich treatments modify noneosinophili...
```

#### Hybrid search with the RANK query operator[¶](#hybrid-search-with-the-rank-query-operator)

This combines the two methods using the [rank](https://docs.vespa.ai/en/reference/query-language-reference.html#rank) query operator.
In this case we express that we want to retrieve the top 1000 documents using vector search, and then have sparse features like BM25 calculated as well (the second operand of the rank operator). Finally, the hits are re-ranked using reciprocal rank fusion.

In [13]:

```
with app.syncio(connections=1) as session:
    query = "How Fruits and Vegetables Can Treat Asthma?"
    response: VespaQueryResponse = session.query(
        yql="select * from sources * where rank({targetHits:1000}nearestNeighbor(embedding,q), userQuery()) limit 5",
        query=query,
        ranking="fusion",
        body={"input.query(q)": f"embed({query})"},
    )
    assert response.is_successful()
    print(display_hits_as_df(response, ["id", "title"]))
```

```
         id                                              title
0  MED-2464  Low vegetable intake is associated with allerg...
1  MED-2450  Protective effect of fruits, vegetables and th...
2  MED-2458  Manipulating antioxidant intake in asthma: a r...
3  MED-2461  The association of diet with respiratory sympt...
4  MED-5072  Lycopene-rich treatments modify noneosinophili...
```

#### Hybrid search with filters[¶](#hybrid-search-with-filters)

In this example we add another query term to the yql, restricting the nearest neighbor search to only consider documents that have *vegetable* in the title.

In [14]:

```
with app.syncio(connections=1) as session:
    query = "How Fruits and Vegetables Can Treat Asthma?"
    response: VespaQueryResponse = session.query(
        yql='select * from sources * where title contains "vegetable" and rank({targetHits:1000}nearestNeighbor(embedding,q), userQuery()) limit 5',
        query=query,
        ranking="fusion",
        body={"input.query(q)": f"embed({query})"},
    )
    assert response.is_successful()
    print(display_hits_as_df(response, ["id", "title"]))
```

```
         id                                              title
0  MED-2464  Low vegetable intake is associated with allerg...
1  MED-2450  Protective effect of fruits, vegetables and th...
2  MED-3199  Potential risks resulting from fruit/vegetable...
3  MED-2085  Antiplatelet, anticoagulant, and fibrinolytic ...
4  MED-4496  The effect of fruit and vegetable intake on ri...
```

## Next steps[¶](#next-steps)

This is just an intro to the capabilities of Vespa and pyvespa. Browse the site to learn more about schemas, feeding and queries, and find more complex applications in [examples](https://vespa-engine.github.io/pyvespa/examples).

## Example: Document operations using cert/key pair[¶](#example-document-operations-using-certkey-pair)

Above, we deployed to Vespa Cloud and, as part of that, generated a data-plane mTLS cert/key pair. This pair can be used to access the dataplane for document reads/writes and queries from many different clients. The following demonstrates this using the `requests` library.

Set up a dataplane connection using the cert/key pair:

In [15]:
```
import requests

cert_path = app.cert
key_path = app.key
session = requests.Session()
session.cert = (cert_path, key_path)
```

Get a document from the endpoint returned when we deployed to Vespa Cloud above. pyvespa wraps the Vespa [document api](https://docs.vespa.ai/en/document-v1-api-guide.html) internally; in these examples we use the document api directly, with the mTLS key/cert pair used when deploying the app.

In [16]:

```
url = "{0}/document/v1/{1}/{2}/docid/{3}".format(endpoint, "tutorial", "doc", "MED-10")
doc = session.get(url).json()
doc
```

Out[16]:

```
{'pathId': '/document/v1/tutorial/doc/docid/MED-10',
 'id': 'id:tutorial:doc::MED-10',
 'fields': {'body': 'Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear. We evaluated risk of breast cancer death among statin users in a population-based cohort of breast cancer patients. The study cohort included all newly diagnosed breast cancer patients in Finland during 1995–2003 (31,236 cases), identified from the Finnish Cancer Registry. Information on statin use before and after the diagnosis was obtained from a national prescription database. We used the Cox proportional hazards regression method to estimate mortality among statin users with statin use as time-dependent variable. A total of 4,151 participants had used statins. During the median follow-up of 3.25 years after the diagnosis (range 0.08–9.0 years) 6,011 participants died, of which 3,619 (60.2%) was due to breast cancer.
After adjustment for age, tumor characteristics, and treatment selection, both post-diagnostic and pre-diagnostic statin use were associated with lowered risk of breast cancer death (HR 0.46, 95% CI 0.38–0.55 and HR 0.54, 95% CI 0.44–0.67, respectively). The risk decrease by post-diagnostic statin use was likely affected by healthy adherer bias; that is, the greater likelihood of dying cancer patients to discontinue statin use as the association was not clearly dose-dependent and observed already at low-dose/short-term use. The dose- and time-dependence of the survival benefit among pre-diagnostic statin users suggests a possible causal effect that should be evaluated further in a clinical trial testing statins’ effect on survival in breast cancer patients.',
  'title': 'Statin Use and Breast Cancer Survival: A Nationwide Cohort Study from Finland',
  'id': 'MED-10'}}
```

Update the title and post the new version:

In [17]:

```
doc["fields"]["title"] = "Can you eat lobster?"
response = session.post(url, json=doc).json()
response
```

Out[17]:

```
{'pathId': '/document/v1/tutorial/doc/docid/MED-10', 'id': 'id:tutorial:doc::MED-10'}
```

Get the doc again to see the new title:

In [18]:

```
doc = session.get(url).json()
doc
```

Out[18]:

```
{'pathId': '/document/v1/tutorial/doc/docid/MED-10',
 'id': 'id:tutorial:doc::MED-10',
 'fields': {'body': 'Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear. We evaluated risk of breast cancer death among statin users in a population-based cohort of breast cancer patients.
The study cohort included all newly diagnosed breast cancer patients in Finland during 1995–2003 (31,236 cases), identified from the Finnish Cancer Registry. Information on statin use before and after the diagnosis was obtained from a national prescription database. We used the Cox proportional hazards regression method to estimate mortality among statin users with statin use as time-dependent variable. A total of 4,151 participants had used statins. During the median follow-up of 3.25 years after the diagnosis (range 0.08–9.0 years) 6,011 participants died, of which 3,619 (60.2%) was due to breast cancer. After adjustment for age, tumor characteristics, and treatment selection, both post-diagnostic and pre-diagnostic statin use were associated with lowered risk of breast cancer death (HR 0.46, 95% CI 0.38–0.55 and HR 0.54, 95% CI 0.44–0.67, respectively). The risk decrease by post-diagnostic statin use was likely affected by healthy adherer bias; that is, the greater likelihood of dying cancer patients to discontinue statin use as the association was not clearly dose-dependent and observed already at low-dose/short-term use. The dose- and time-dependence of the survival benefit among pre-diagnostic statin users suggests a possible causal effect that should be evaluated further in a clinical trial testing statins’ effect on survival in breast cancer patients.',
  'title': 'Can you eat lobster?',
  'id': 'MED-10'}}
```

## Example: Reconnect pyvespa using cert/key pair[¶](#example-reconnect-pyvespa-using-certkey-pair)

Above, we stored the dataplane credentials for later use. Deployment of an application usually happens when the schema changes, whereas accessing the dataplane is for document updates and user queries. One only needs to know the endpoint and the cert/key pair to enable a connection to a Vespa Cloud application:

In [19]:
```
# cert_path = "/Users/me/.vespa/mytenant.hybridsearch.default/data-plane-public-cert.pem"
# key_path = "/Users/me/.vespa/mytenant.hybridsearch.default/data-plane-private-key.pem"
from vespa.application import Vespa

the_app = Vespa(endpoint, cert=cert_path, key=key_path)

res = the_app.query(
    yql="select documentid, id, title from sources * where userQuery()",
    query="Can you eat lobster?",
    ranking="bm25",
)
res.hits[0]
```

```
Using mtls_key_cert
Authentication against endpoint https://f7f73182.eb1181f2.z.vespa-app.cloud//ApplicationStatus
```

Out[19]:

```
{'id': 'id:tutorial:doc::MED-10',
 'relevance': 25.27992205160453,
 'source': 'hybridsearch_content',
 'fields': {'documentid': 'id:tutorial:doc::MED-10',
  'id': 'MED-10',
  'title': 'Can you eat lobster?'}}
```

A common problem is a cert mismatch: the cert/key pair used when deploying is different from the pair used when making requests against Vespa. This will cause 40x errors. Make sure it is the same pair, or re-create it with `vespa auth cert -f` *and* redeploy. If you re-generate an mTLS certificate pair and use it when connecting to the Vespa Cloud endpoint, requests will fail until you have updated the deployment with the new public certificate.

### Delete application[¶](#delete-application)

The following will delete the application and data from the dev environment.

In [20]:
```
vespa_cloud.delete()
```

```
Deactivated vespa-team.hybridsearch in dev.aws-us-east-1c
Deleted instance vespa-team.hybridsearch.default
```

# Guides

# Advanced Configuration[¶](#advanced-configuration)

This notebook demonstrates how to use pyvespa's advanced configuration features to customize Vespa applications beyond the basic settings. You'll learn to express Vespa's XML configuration files using Python code for greater flexibility and control.

## What you'll learn[¶](#what-youll-learn)

1. **[services.xml Configuration](#services-xml-configuration)** - Configure `services.xml` using the `ServicesConfiguration` object to customize system behavior (document expiry, threading, tuning parameters). Available since `pyvespa=0.50.0`
2. **[Query Profiles Configuration](#query-profiles-configuration)** - Define multiple query profiles and query profile types programmatically using the new configuration approach. Available since `pyvespa=0.60.0`
3. **[deployment.xml Configuration](#deploymentxml-configuration)** - Configure deployment zones, regions and windows to block upgrades. Applicable to Vespa Cloud only. Available since `pyvespa=0.60.0`

## Why?[¶](#why)

pyvespa has proven to be a preferred framework for deploying and managing Vespa applications. With the legacy configuration methods, not all possible configurations were available. The new approach ensures full feature parity with the XML configuration options.
## Configuration Approach[¶](#configuration-approach)

The `vespa.configuration` modules in pyvespa provide a **Vespa Tag (VT)** system that mirrors Vespa's XML configuration structure:

- **Tags**: Python functions representing XML elements (e.g., `container()`, `content()`, `query_profile()`)
- **Attributes**: Function parameters that become XML attributes (hyphens become underscores: `garbage-collection` → `garbage_collection`)
- **Values**: Automatic type conversion and XML escaping (no manual escaping needed)
- **Structure**: Nested function calls create the XML hierarchy

**Example**: This Python code:

```
service_config = ServicesConfiguration(
    application_name="myapp",
    services_config=services(
        container(id="myapp_container", version="1.0")(
            search(),
            document_api(),
        )
    ),
)
service_config.to_xml()
```

Generates this XML:

```
<services>
  <container id="myapp_container" version="1.0">
    <search></search>
    <document-api></document-api>
  </container>
</services>
```

## Prerequisites[¶](#prerequisites)

- pyvespa installed and Docker running with at least 6GB memory
- Understanding of basic Vespa concepts (schemas, deployment)

For detailed XML configuration options, refer to:

- [Vespa services.xml reference](https://docs.vespa.ai/en/reference/services.html)
- [Query profiles reference](https://docs.vespa.ai/en/querying/query-profiles.html)
- [Deployment reference](https://docs.vespa.ai/en/reference/deployment.html)

Refer to [troubleshooting](https://vespa-engine.github.io/pyvespa/troubleshooting.md) for any problems when running this guide.

[Install pyvespa](https://pyvespa.readthedocs.io/) and start the Docker daemon; validate that a minimum of 6GB is available:

In [1]:

```
#!pip3 install pyvespa
#!docker info | grep "Total Memory"
```

## services.xml Configuration[¶](#servicesxml-configuration)

### Example 1 - Configure document-expiry[¶](#example-1-configure-document-expiry)

As an example of a common use case for advanced configuration, we will configure document-expiry. This feature allows you to set a time-to-live for documents in your Vespa application.
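The expiry condition used later in this example, `music.timestamp > now() - 86400`, reads: keep a document only while its timestamp is less than one day old. In plain Python terms (a sketch of the arithmetic only, not Vespa's actual garbage-collection logic):

```python
import time

# Sketch: a document survives the expiry selection while its timestamp
# is newer than now() minus the TTL (86400 seconds = 1 day).
def is_live(doc_timestamp: int, ttl_seconds: int = 86400, now: float = None) -> bool:
    now = time.time() if now is None else now
    return doc_timestamp > now - ttl_seconds

now = 1_000_000
assert is_live(now - 10, now=now)         # fed seconds ago: kept
assert not is_live(now - 86401, now=now)  # older than one day: expired
```

Vespa evaluates the equivalent selection expression itself, as a background maintenance job on the content nodes.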
This is useful when you have documents that are only relevant for a certain period of time, and you want to avoid serving stale data. For reference, see the [docs on document-expiry](https://docs.vespa.ai/en/documents.html#document-expiry).

#### Define a schema[¶](#define-a-schema)

We define a simple schema with a timestamp field that we will use in the document selection expression to set the document-expiry. Note that the fields referenced in the selection expression should be attributes (in-memory). Also, either the fields should be set with `fast-access`, or the number of searchable copies in the content cluster should be the same as the redundancy. Otherwise, the document selection maintenance will be slow and have a major performance impact on the system.

In [2]:

```
from vespa.package import Document, Field, Schema, ApplicationPackage

application_name = "music"

music_schema = Schema(
    name=application_name,
    document=Document(
        fields=[
            Field(
                name="artist",
                type="string",
                indexing=["attribute", "summary"],
            ),
            Field(
                name="title",
                type="string",
                indexing=["attribute", "summary"],
            ),
            Field(
                name="timestamp",
                type="long",
                indexing=["attribute", "summary"],
                attribute=["fast-access"],
            ),
        ]
    ),
)
```

### The `ServicesConfiguration` object[¶](#the-servicesconfiguration-object)

The `ServicesConfiguration` object allows you to define any configuration you want in the `services.xml` file. The syntax is as follows:

In [3]:
```
from vespa.package import ServicesConfiguration
from vespa.configuration.services import (
    services,
    container,
    search,
    document_api,
    document_processing,
    content,
    redundancy,
    documents,
    document,
    node,
    nodes,
)

# Create a ServicesConfiguration with document-expiry set to 1 day (timestamp > now() - 86400)
services_config = ServicesConfiguration(
    application_name=application_name,
    services_config=services(
        container(
            search(),
            document_api(),
            document_processing(),
            id=f"{application_name}_container",
            version="1.0",
        ),
        content(
            redundancy("1"),
            documents(
                document(
                    type=application_name,
                    mode="index",
                    # Note that the selection-expression does not need to be escaped, as it will be automatically escaped during xml-serialization
                    selection="music.timestamp > now() - 86400",
                ),
                garbage_collection="true",
            ),
            nodes(node(distribution_key="0", hostalias="node1")),
            id=f"{application_name}_content",
            version="1.0",
        ),
    ),
)

application_package = ApplicationPackage(
    name=application_name,
    schema=[music_schema],
    services_config=services_config,
)
```

There are some useful gotchas to keep in mind when constructing the `ServicesConfiguration` object. First, let's establish a common vocabulary through an example. Consider the following `services.xml` file, which is what we are actually representing with the `ServicesConfiguration` object from the previous cell:

```
<services>
  <container id="music_container" version="1.0">
    <search></search>
    <document-api></document-api>
    <document-processing></document-processing>
  </container>
  <content id="music_content" version="1.0">
    <redundancy>1</redundancy>
    <documents garbage-collection="true">
      <document type="music" mode="index" selection="music.timestamp &gt; now() - 86400"></document>
    </documents>
    <nodes>
      <node distribution-key="0" hostalias="node1"></node>
    </nodes>
  </content>
</services>
```

In this example, `services`, `container`, `search`, `document-api`, `document-processing`, `content`, `redundancy`, `documents`, `document`, and `nodes` are *tags*. The `id`, `version`, `type`, `mode`, `selection`, `distribution-key`, `hostalias`, and `garbage-collection` are *attributes*, with a corresponding *value*.

### Tag names[¶](#tag-names)

All tags as referenced in the [Vespa documentation](https://docs.vespa.ai/en/reference/services.html) are available in the `vespa.configuration.{services,query_profiles,deployment}` modules, with the following modifications:

- All `-` in tag names are replaced by `_` to avoid conflicts with Python syntax.
- Some tags that are Python reserved words (or commonly used objects) are constructed by adding a `_` at the end of the tag name. To see which tags are affected, you can check this variable:

In [4]:

```
from vespa.configuration.vt import replace_reserved

replace_reserved
```

Out[4]:

```
{'type': 'type_',
 'class': 'class_',
 'for': 'for_',
 'time': 'time_',
 'io': 'io_',
 'from': 'from_',
 'match': 'match_'}
```

Only valid tags are exported by the `vespa.configuration.` modules.

### Attributes[¶](#attributes)

- *Any* attribute can be passed to the tag constructor (no validation at construction time).
- The attribute name should be the same as in the Vespa documentation, but with `-` replaced by `_`. For example, the `garbage-collection` attribute in the `documents` tag should be passed as `garbage_collection`.
- In case the attribute name is a Python reserved word, the same rule as for tag names applies (add `_` at the end). An example is the `global` attribute, which should be passed as `global_`.
- Some attributes, such as `id` in the `container` tag, are mandatory and should be passed as keyword arguments to the tag constructor.

### Values[¶](#values)

- The value of an attribute can be a string, an integer, or a boolean. For types `bool` and `int`, the value is converted to a string (lowercased for `bool`). If you need to pass a float, convert it to a string before passing it to the tag constructor, e.g. `container(version="1.0")`.
- Note that we are *not* escaping the values. In the xml file, the value of the `selection` attribute in the `document` tag is `music.timestamp &gt; now() - 86400` (`&gt;` is the escaped form of `>`). When passing this value to the `document` tag constructor in Python, we should *not* escape the `>` character, i.e. `document(selection="music.timestamp > now() - 86400")`.

## Deploy the Vespa application[¶](#deploy-the-vespa-application)

Deploy `application_package` on the local machine using Docker, without leaving the notebook, by creating an instance of [VespaDocker](https://vespa-engine.github.io/pyvespa/api/vespa/deployment.md#vespa.deployment.VespaDocker). `VespaDocker` connects to the local Docker daemon socket and starts the [Vespa docker image](https://hub.docker.com/r/vespaengine/vespa/).

If this step fails, please check that the Docker daemon is running and that the Docker daemon socket can be used by clients (configurable under advanced settings in Docker Desktop).

In [5]:

```
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=application_package)
```

```
Waiting for configuration server, 0/60 seconds...
Waiting for application to come up, 0/300 seconds.
Waiting for application to come up, 5/300 seconds.
Waiting for application to come up, 10/300 seconds.
Application is up!
Finished deployment.
```

`app` now holds a reference to a [Vespa](https://vespa-engine.github.io/pyvespa/api/vespa/application.md#vespa.application.Vespa) instance. See this [notebook](https://vespa-engine.github.io/pyvespa/authenticating-to-vespa-cloud.md) for details on authenticating to Vespa Cloud.

## Feeding documents to Vespa[¶](#feeding-documents-to-vespa)

Now, let us feed some documents to Vespa. We will feed one document with a timestamp of 24 hours and 1 second (86401 seconds) ago, and another document with a timestamp of the current time. We will then inspect the documents to verify that the document-expiry is working as expected.

In [6]:

```
import time

docs_to_feed = [
    {
        "id": "1",
        "fields": {
            "artist": "Snoop Dogg",
            "title": "Gin and Juice",
            "timestamp": int(time.time()) - 86401,
        },
    },
    {
        "id": "2",
        "fields": {
            "artist": "Dr.Dre",
            "title": "Still D.R.E",
            "timestamp": int(time.time()),
        },
    },
]
```

In [7]:
```
from vespa.io import VespaResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(f"Error when feeding document {id}: {response.get_json()}")


app.feed_iterable(docs_to_feed, schema=application_name, callback=callback)
```

## Verify document expiry through visiting[¶](#verify-document-expiry-through-visiting)

[Visiting](https://docs.vespa.ai/en/visiting.html) is a feature to efficiently get or process a set of documents, identified by a [document selection](https://docs.vespa.ai/en/reference/document-select-language.html) expression. Here is how you can use visiting in pyvespa:

In [8]:

```
visit_results = []
for slice_ in app.visit(
    schema=application_name,
    content_cluster_name=f"{application_name}_content",
    timeout="5s",
):
    for response in slice_:
        visit_results.append(response.json)
visit_results
```

Out[8]:

```
[{'pathId': '/document/v1/music/music/docid/',
  'documents': [{'id': 'id:music:music::2',
    'fields': {'artist': 'Dr.Dre',
     'title': 'Still D.R.E',
     'timestamp': 1754981413}}],
  'documentCount': 1}]
```

We can see that the document with the timestamp of 24 hours ago is not returned by the visit, while the document with the current timestamp is returned.

### Clean up[¶](#clean-up)

In [9]:
```
vespa_docker.container.stop()
vespa_docker.container.remove()
```

### Example 2 - Configuring `requestthreads` per search[¶](#example-2-configuring-requestthreads-per-search)

In Vespa, there are several configuration options that can be tuned to optimize the serving latency of your application. For an overview, see the [Vespa documentation - Vespa Serving Scaling Guide](https://docs.vespa.ai/en/performance/sizing-search.html).

An example of a configuration that one might want to tune is the `requestthreads` `persearch` [parameter](https://docs.vespa.ai/en/reference/services-content.html#requestthreads). This parameter controls the number of search threads used to handle each search on the content nodes. The default value is 1. For applications where a significant portion of the work per query is linear in the number of documents, increasing `requestthreads` `persearch` can improve serving latency, as it allows more parallelism in the search phase.

Examples of potentially expensive work that scales linearly with the number of documents, and thus is likely to benefit from increasing `requestthreads` `persearch`:

- XGBoost inference with a large GBDT model.
- ONNX inference, e.g. with a cross-encoder.
- MaxSim operations for late-interaction scoring, as in ColBERT and ColPali.
- Exact nearest neighbor search.

Examples of query operators that are less likely to benefit from increasing `requestthreads` `persearch`:

- `wand`/`weakAnd`, see [Using wand with Vespa](https://docs.vespa.ai/en/using-wand-with-vespa.html).
- Approximate nearest neighbor search with HNSW.

In this example, we will configure `requestthreads` `persearch` to 4 for an application where a cross-encoder is used in first-phase ranking.
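For reference, this setting lives under the content cluster's `<tuning>` element in `services.xml`. A sketch of the fragment we will generate later in this example (the content cluster `id` follows the application name used below):

```
<content id="requestthreads" version="1.0">
    <engine>
        <proton>
            <tuning>
                <searchnode>
                    <requestthreads>
                        <persearch>4</persearch>
                    </requestthreads>
                </searchnode>
            </tuning>
        </proton>
    </engine>
</content>
```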
The demo is based on the [Cross-encoders for global reranking](https://vespa-engine.github.io/pyvespa/examples/cross-encoders-for-global-reranking.md) guide, but here we will use a cross-encoder in first-phase ranking instead of global-phase ranking. First-phase and second-phase ranking are executed on the content nodes, while global-phase ranking is executed on the container nodes. See [Phased ranking](https://docs.vespa.ai/en/phased-ranking.html) for more details.

### Download the cross-encoder model[¶](#download-the-crossencoder-model)

In [10]:

```
from pathlib import Path

import requests

from vespa.deployment import VespaDocker

# Download the model if it doesn't exist
url = "https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1/resolve/main/onnx/model.onnx"
local_model_path = "model/model.onnx"

if not Path(local_model_path).exists():
    print("Downloading the mxbai-rerank model...")
    r = requests.get(url)
    Path(local_model_path).parent.mkdir(parents=True, exist_ok=True)
    with open(local_model_path, "wb") as f:
        f.write(r.content)
    print(f"Downloaded model to {local_model_path}")
else:
    print("Model already exists, skipping download.")
```

```
Model already exists, skipping download.
```

### Define a schema[¶](#define-a-schema)

In [11]:
```
from vespa.package import (
    OnnxModel,
    RankProfile,
    Schema,
    ApplicationPackage,
    Field,
    FieldSet,
    Function,
    FirstPhaseRanking,
    Document,
)

application_name = "requestthreads"

# Define the reranking, as we will use it for two different rank profiles
reranking = FirstPhaseRanking(
    keep_rank_count=8,
    expression="sigmoid(onnx(crossencoder).logits{d0:0,d1:0})",
)

# Define the schema
schema = Schema(
    name="doc",
    document=Document(
        fields=[
            Field(name="id", type="string", indexing=["summary", "attribute"]),
            Field(
                name="text",
                type="string",
                indexing=["index", "summary"],
                index="enable-bm25",
            ),
            Field(
                name="body_tokens",
                type="tensor(d0[512])",
                indexing=[
                    "input text",
                    "embed tokenizer",
                    "attribute",
                    "summary",
                ],
                is_document_field=False,  # Indicates a synthetic field
            ),
        ],
    ),
    fieldsets=[FieldSet(name="default", fields=["text"])],
    models=[
        OnnxModel(
            model_name="crossencoder",
            model_file_path=f"{local_model_path}",
            inputs={
                "input_ids": "input_ids",
                "attention_mask": "attention_mask",
            },
            outputs={"logits": "logits"},
        )
    ],
    rank_profiles=[
        RankProfile(name="bm25", first_phase="bm25(text)"),
        RankProfile(
            name="reranking",
            inherits="default",
            inputs=[("query(q)", "tensor(d0[64])")],
            functions=[
                Function(
                    name="input_ids",
                    expression="customTokenInputIds(1, 2, 512, query(q), attribute(body_tokens))",
                ),
                Function(
                    name="attention_mask",
                    expression="tokenAttentionMask(512, query(q), attribute(body_tokens))",
                ),
            ],
            first_phase=reranking,
            summary_features=[
                "query(q)",
                "input_ids",
                "attention_mask",
                "onnx(crossencoder).logits",
            ],
        ),
        RankProfile(
            name="one-thread-profile",
            first_phase=reranking,
            inherits="reranking",
            num_threads_per_search=1,
        ),
    ],
)
```

### Define the ServicesConfiguration[¶](#define-the-servicesconfiguration)

Note that the `ServicesConfiguration` may be used to define any configuration in the `services.xml` file. In this example, we are only configuring the `requestthreads` `persearch` parameter, but you can use the same approach to configure any other parameter. For a full reference of the available configuration options, see the [Vespa documentation - services.xml](https://docs.vespa.ai/en/reference/services.html).

In [12]:
```
from vespa.configuration.services import *
from vespa.package import ServicesConfiguration

# Define services configuration with persearch threads set to 4
services_config = ServicesConfiguration(
    application_name=f"{application_name}",
    services_config=services(
        container(id=f"{application_name}_default", version="1.0")(
            component(
                model(
                    url="https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1/raw/main/tokenizer.json"
                ),
                id="tokenizer",
                type="hugging-face-tokenizer",
            ),
            document_api(),
            search(),
        ),
        content(id=f"{application_name}", version="1.0")(
            min_redundancy("1"),
            documents(document(type="doc", mode="index")),
            engine(
                proton(
                    tuning(
                        searchnode(requestthreads(persearch("4"))),
                    ),
                ),
            ),
        ),
        version="1.0",
        minimum_required_vespa_version="8.311.28",
    ),
)
```

Now, we are ready to deploy our application package with the defined `ServicesConfiguration`.

### Deploy the application package[¶](#deploy-the-application-package)

In [13]:

```
app_package = ApplicationPackage(
    name=f"{application_name}",
    schema=[schema],
    services_config=services_config,
)
```

In [14]:
```
vespa_docker = VespaDocker(port=8089)
app = vespa_docker.deploy(application_package=app_package)
```

```
Waiting for configuration server, 0/60 seconds...
Waiting for application to come up, 0/300 seconds.
Waiting for application to come up, 5/300 seconds.
Waiting for application to come up, 10/300 seconds.
Application is up!
Finished deployment.
```

### Feed some sample documents[¶](#feed-some-sample-documents)

In [15]:

```
sample_docs = [
    {"id": i, "fields": {"text": text}}
    for i, text in enumerate(
        [
            "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature. The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird'. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
            "was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961. Jane Austen was an English novelist known primarily for her six major novels, ",
            "which interpret, critique and comment upon the British landed gentry at the end of the 18th century. The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, ",
            "is among the most popular and critically acclaimed books of the modern era. 'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan.",
        ]
    )
]
app.feed_iterable(sample_docs, schema="doc")

# Define the query body
query_body = {
    "yql": "select * from sources * where userQuery();",
    "query": "who wrote to kill a mockingbird?",
    "timeout": "10s",
    "input.query(q)": "embed(tokenizer, @query)",
    "presentation.timing": "true",
}

# Warm-up query
app.query(body=query_body)

query_body_reranking = {
    **query_body,
    "ranking.profile": "reranking",
}

# Query with default persearch threads (set to 4)
with app.syncio() as sess:
    response_default = sess.query(body=query_body_reranking)

# Query with num-threads-per-search overridden to 1
query_body_one_thread = {
    **query_body,
    "ranking.profile": "one-thread-profile",
    # Could potentially also set "ranking.matching.numThreadsPerSearch": 1 in the query parameters.
}
with app.syncio() as sess:
    response_one_thread = sess.query(body=query_body_one_thread)

# Extract query times
timing_default = response_default.json["timing"]["querytime"]
timing_one_thread = response_one_thread.json["timing"]["querytime"]

# Print the query times and their ratio
print(f"Query time with 4 threads: {timing_default:.2f}s")
print(f"Query time with 1 thread: {timing_one_thread:.2f}s")
ratio = timing_one_thread / timing_default
print(f"4 threads is approximately {ratio:.2f}x faster than 1 thread")
```

```
Query time with 4 threads: 0.73s
Query time with 1 thread: 1.24s
4 threads is approximately 1.69x faster than 1 thread
```

## Query-profile Configuration[¶](#query-profile-configuration)

Until pyvespa version 0.60.0, this was the only way to add a query profile or query profile type to the application package:

In [16]:

```
from vespa.package import (
    QueryProfile,
    QueryProfileType,
    QueryTypeField,
    QueryField,
)

app_package = ApplicationPackage(
    name=f"{application_name}",
    schema=[music_schema],
    query_profile=QueryProfile(
        fields=[
            QueryField(
                name="hits",
                value="30",
            )
        ]
    ),
    query_profile_type=QueryProfileType(
        fields=[
            QueryTypeField(
                name="ranking.features.query(query_embedding)",
                type="tensor(x[512])",
            )
        ]
    ),
)
```

As you can see from the reference in the [Vespa Docs](https://docs.vespa.ai/en/querying/query-profiles.html), this approach makes it impossible to define multiple query profiles or query profile types in the application package, and there are many variants you are unable to express.
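To make the limitation concrete, the old approach always produces a single query profile and a single query profile type, roughly like the following sketch (exact attributes and inheritance may vary between pyvespa versions):

```
<!-- search/query-profiles/default.xml -->
<query-profile id="default">
    <field name="hits">30</field>
</query-profile>

<!-- search/query-profiles/types/root.xml -->
<query-profile-type id="root">
    <field name="ranking.features.query(query_embedding)" type="tensor(x[512])"/>
</query-profile-type>
```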
## Query-profiles - new approach[¶](#query-profiles-new-approach)

By importing the tag functions like this: `from vespa.configuration.query_profiles import *`, you can access all supported tags of a query profile or query profile type. Pass these (one or as many as you like) to the `query_profile_config` parameter of your `ApplicationPackage`, and they will be added to the application package as query profiles or query profile types.

Only two validations are done at construction time:

1. The `id` attribute is mandatory for both query profiles and query profile types, as it is used to create the file name in the application package.
1. The top-level tag of each element in the `query_profile_config` list must be either `query_profile` or `query_profile_type`.

By using the new `query_profile_config` parameter, you can now express any combination of query profiles and query profile types in Python code, and add it to your `ApplicationPackage`. Here are some examples:

In [17]:

```
from vespa.configuration.query_profiles import *

# From https://docs.vespa.ai/en/tutorials/rag-blueprint.html#training-a-first-phase-ranking-model
qp_hybrid = query_profile(
    field("doc", name="schema"),
    field("embed(@query)", name="ranking.features.query(embedding)"),
    field("embed(@query)", name="ranking.features.query(float_embedding)"),
    field(-7.798639, name="ranking.features.query(intercept)"),
    field(
        13.383840,
        name="ranking.features.query(avg_top_3_chunk_sim_scores_param)",
    ),
    field(
        0.203145,
        name="ranking.features.query(avg_top_3_chunk_text_scores_param)",
    ),
    field(0.159914, name="ranking.features.query(bm25_chunks_param)"),
    field(0.191867, name="ranking.features.query(bm25_title_param)"),
    field(10.067169, name="ranking.features.query(max_chunk_sim_scores_param)"),
    field(0.153392, name="ranking.features.query(max_chunk_text_scores_param)"),
    field(
        """select * from %{schema} where userInput(@query) or
        ({label:"title_label", targetHits:100}nearestNeighbor(title_embedding, embedding)) or
        ({label:"chunks_label", targetHits:100}nearestNeighbor(chunk_embeddings, embedding))""",
        name="yql",
    ),
    field(10, name="hits"),
    field("learned-linear", name="ranking.profile"),
    field("top_3_chunks", name="presentation.summary"),
    id="hybrid",
    type="hybrid-type",
)

qpt_hybrid = query_profile_type(
    field(
        name="ranking.features.query(embedding)",
        type="tensor(x[96])",
        mandatory=True,
        strict=True,
    ),
    field(
        name="ranking.features.query(float_embedding)",
        type="tensor(x[384])",
        mandatory=True,
        strict=True,
    ),
    id="hybrid-type",
)
```

As you can see below, we get type conversion (`True` -> `true`), XML-escaping, and correct indentation of the XML output.

In [18]:

```
print(qp_hybrid.to_xml())
```

```
<query-profile id="hybrid" type="hybrid-type">
    <field name="schema">doc</field>
    <field name="ranking.features.query(embedding)">embed(@query)</field>
    <field name="ranking.features.query(float_embedding)">embed(@query)</field>
    <field name="ranking.features.query(intercept)">-7.798639</field>
    <field name="ranking.features.query(avg_top_3_chunk_sim_scores_param)">13.38384</field>
    <field name="ranking.features.query(avg_top_3_chunk_text_scores_param)">0.203145</field>
    <field name="ranking.features.query(bm25_chunks_param)">0.159914</field>
    <field name="ranking.features.query(bm25_title_param)">0.191867</field>
    <field name="ranking.features.query(max_chunk_sim_scores_param)">10.067169</field>
    <field name="ranking.features.query(max_chunk_text_scores_param)">0.153392</field>
    <field name="yql">select * from %{schema} where userInput(@query) or
        ({label:&quot;title_label&quot;, targetHits:100}nearestNeighbor(title_embedding, embedding)) or
        ({label:&quot;chunks_label&quot;, targetHits:100}nearestNeighbor(chunk_embeddings, embedding))</field>
    <field name="hits">10</field>
    <field name="ranking.profile">learned-linear</field>
    <field name="presentation.summary">top_3_chunks</field>
</query-profile>
```

In [19]:

```
print(qpt_hybrid.to_xml())
```

```
<query-profile-type id="hybrid-type">
    <field name="ranking.features.query(embedding)" type="tensor(x[96])" mandatory="true" strict="true"/>
    <field name="ranking.features.query(float_embedding)" type="tensor(x[384])" mandatory="true" strict="true"/>
</query-profile-type>
```

### Query profile variant[¶](#query-profile-variant)

See Vespa documentation on [Query Profile Variants](https://docs.vespa.ai/en/query-profiles.html#query-profile-variants) for more details.

In [20]:

```
from vespa.configuration.query_profiles import *

qp_variant = query_profile(
    description("Multidimensional query profile"),
    dimensions("region,model,bucket"),
    field("My general a value", name="a"),
    query_profile(for_="us,nokia,test1")(
        field("My value of the combination us-nokia-test1-a", name="a"),
    ),
    query_profile(for_="us")(
        field("My value of the combination us-a", name="a"),
        field("My value of the combination us-b", name="b"),
    ),
    query_profile(for_="us,nokia,*")(
        field("My value of the combination us-nokia-a", name="a"),
        field("My value of the combination us-nokia-b", name="b"),
    ),
    query_profile(for_="us,*,test1")(
        field("My value of the combination us-test1-a", name="a"),
        field("My value of the combination us-test1-b", name="b"),
    ),
    id="multiprofile1",
)
```

In [21]:

```
from vespa.configuration.query_profiles import *

qpt_alias = query_profile_type(
    match_(path="true"),  # match is sanitized due to the Python keyword
    field(
        name="ranking.features.query(query_embedding)",
        type="tensor(x[512])",
        alias="q_emb query_emb",
    ),
    id="queryemb",
    inherits="native",
)
```

You can pass this configuration to the `ApplicationPackage` when creating it, and it will be written to the `search/query-profiles` directory of the application package. Or, you can add it to the `ApplicationPackage` after it has been created by using the `add_query_profile` method:

In [22]:

```
app_package.add_query_profile([qp_hybrid, qp_variant, qpt_hybrid, qpt_alias])
```

And by dumping the application package to files, we can see that all query profiles and query profile types are written to the `search/query-profiles` directory in the application package.

In [23]:
```
import tempfile
import os

temp_dir = tempfile.mkdtemp()
app_package.to_files(temp_dir)
print(f"Application package files written to {temp_dir}")
print("Files in the temporary directory:")
print(os.listdir(temp_dir))
print("Files in the `search/query-profiles` directory:")
print(os.listdir(os.path.join(temp_dir, "search", "query-profiles")))
```

```
Application package files written to /var/folders/vb/ch14y_kn4mqfz75bhc9_g5980000gn/T/tmpyzrfju5a
Files in the temporary directory:
['services.xml', 'models', 'schemas', 'search', 'files']
Files in the `search/query-profiles` directory:
['types', 'multiprofile1.xml', 'hybrid.xml', 'default.xml']
```

Note that this combination of query profiles would not make sense to deploy together in the same application; the point here is to demonstrate the flexibility of the new `query_profile_config` parameter, which should enable you to express any query profile or query profile type in Python code and add it to your `ApplicationPackage`.

The following XML tags are available to construct query profiles and query profile types:

In [24]:

```
queryprofile_tags
```

Out[24]:

```
['query-profile', 'query-profile-type', 'field', 'match', 'strict', 'description', 'dimensions', 'ref']
```

In order to avoid conflicts with Python reserved words, or commonly used objects, the following tags are (optionally) constructed by adding a `_` at the end of the tag name or attribute name:

In [25]:
```
from vespa.configuration.vt import restore_reserved

restore_reserved
```

Out[25]:

```
{'type_': 'type',
 'class_': 'class',
 'for_': 'for',
 'time_': 'time',
 'io_': 'io',
 'from_': 'from',
 'match_': 'match'}
```

Note that here, too, we must sanitize the name of the `match` tag to avoid conflicts with the Python keyword, so `match` should be passed as `match_`. Additionally, we use the same approach as for the `ServicesConfiguration` object, so any hyphens in tag names should be replaced with underscores.

## Configuring Deployment.xml[¶](#configuring-deploymentxml)

The `deployment.xml` configuration is used to specify how your Vespa application should be deployed across different environments and regions. This only applies to [Vespa Cloud](https://cloud.vespa.ai/) deployments, where you can specify deployment targets, regions, and deployment policies.

For a complete deployment configuration reference, see the [Vespa deployment.xml documentation](https://docs.vespa.ai/en/reference/deployment.html).

Similar to `services.xml` and query profiles, you can now express `deployment.xml` configuration in Python with the **Vespa Tag (VT)** syntax.

### Simple deployment configuration[¶](#simple-deployment-configuration)

Here's a basic example that deploys to two production regions:

In [26]:
```
from vespa.configuration.deployment import deployment, prod, region
from vespa.package import ApplicationPackage

# Simple deployment to multiple regions
simple_deployment = deployment(
    prod(region("aws-us-east-1c"), region("aws-us-west-2a")),
    version="1.0",
)

app_package = ApplicationPackage(name="myapp", deployment_config=simple_deployment)
```

This configuration will generate a `deployment.xml` file that looks like this:

In [28]:

```
print(app_package.deployment_config.to_xml())
```

```
<deployment version="1.0">
    <prod>
        <region>aws-us-east-1c</region>
        <region>aws-us-west-2a</region>
    </prod>
</deployment>
```

### Advanced deployment configuration[¶](#advanced-deployment-configuration)

For more complex scenarios, you can configure multiple instances, deployment delays, upgrade-blocking windows, and endpoints:

In [29]:
```
from vespa.configuration.deployment import (
    deployment,
    instance,
    prod,
    region,
    block_change,
    delay,
    parallel,
    steps,
    endpoints,
    endpoint,
)

# Complex deployment with multiple instances and advanced policies
complex_deployment = deployment(
    # Beta instance - simple deployment
    instance(prod(region("aws-us-east-1c")), id="beta"),
    # Default instance with advanced configuration
    instance(
        # Block changes during specific time windows
        block_change(
            revision="false", days="mon,wed-fri", hours="16-23", time_zone="UTC"
        ),
        prod(
            # First region
            region("aws-us-east-1c"),
            # Delay before next deployment
            delay(hours="3", minutes="7", seconds="13"),
            # Parallel deployment to multiple regions
            parallel(
                region("aws-us-west-1c"),
                # Sequential steps within parallel block
                steps(region("aws-eu-west-1a"), delay(hours="3")),
            ),
        ),
        # Configure endpoints for this instance
        endpoints(
            endpoint(region("aws-us-east-1c"), container_id="my-container-service")
        ),
        id="default",
    ),
    # Global endpoints across instances
    endpoints(
        endpoint(
            instance("beta", weight="1"),
            id="my-weighted-endpoint",
            container_id="my-container-service",
            region="aws-us-east-1c",
        )
    ),
    version="1.0",
)

app_package = ApplicationPackage(name="myapp", deployment_config=complex_deployment)
```

And the generated `deployment.xml` will include all specified configurations:

In [30]:

```
print(app_package.deployment_config.to_xml())
```

```
<deployment version="1.0">
    <instance id="beta">
        <prod>
            <region>aws-us-east-1c</region>
        </prod>
    </instance>
    <instance id="default">
        <block-change revision="false" days="mon,wed-fri" hours="16-23" time-zone="UTC"/>
        <prod>
            <region>aws-us-east-1c</region>
            <delay hours="3" minutes="7" seconds="13"/>
            <parallel>
                <region>aws-us-west-1c</region>
                <steps>
                    <region>aws-eu-west-1a</region>
                    <delay hours="3"/>
                </steps>
            </parallel>
        </prod>
        <endpoints>
            <endpoint container-id="my-container-service">
                <region>aws-us-east-1c</region>
            </endpoint>
        </endpoints>
    </instance>
    <endpoints>
        <endpoint id="my-weighted-endpoint" container-id="my-container-service" region="aws-us-east-1c">
            <instance weight="1">beta</instance>
        </endpoint>
    </endpoints>
</deployment>
```

This advanced configuration generates a comprehensive `deployment.xml` with:

- Multiple application instances (beta and default)
- Upgrade-blocking windows to prevent deployments during peak hours
- Deployment delays and parallel deployment strategies
- Regional and cross-instance endpoint configurations

To see the available tags for each configuration category, you can print the corresponding tag lists:

In [31]:
```python
from vespa.configuration.deployment import deployment_tags
from vespa.configuration.query_profiles import queryprofile_tags
from vespa.configuration.services import services_tags

print(deployment_tags)
print(queryprofile_tags)
print(services_tags)
```

```text
['deployment', 'instance', 'prod', 'region', 'block-change', 'delay', 'parallel', 'steps', 'endpoints', 'endpoint', 'staging']
['query-profile', 'query-profile-type', 'field', 'match', 'strict', 'description', 'dimensions', 'ref']
['abortondocumenterror', 'accesslog', 'admin', 'adminserver', 'age', 'binding', 'bucket-splitting', 'cache', 'certificate', 'chain', 'chunk', 'client', 'clients', 'cluster-controller', 'clustercontroller', 'clustercontrollers', 'component', 'components', 'compression', 'concurrency', 'config', 'configserver', 'configservers', 'conservative', 'container', 'content', 'coverage', 'disk', 'disk-limit-factor', 'diskbloatfactor', 'dispatch', 'dispatch-policy', 'distribution', 'document', 'document-api', 'document-processing', 'document-token-id', 'documentprocessor', 'documents', 'engine', 'environment-variables', 'execution-mode', 'federation', 'feeding', 'filtering', 'flush-on-shutdown', 'flushstrategy', 'gpu', 'gpu-device', 'group', 'groups-allowed-down-ratio', 'handler', 'http', 'ignore-undefined-fields', 'include', 'index', 'init-progress-time', 'initialize', 'interop-threads', 'interval', 'intraop-threads', 'io', 'jvm', 'level', 'lidspace', 'logstore', 'maintenance', 'max-bloat-factor', 'max-concurrent', 'max-document-tokens', 'max-hits-per-partition', 'max-premature-crashes', 'max-query-tokens', 'max-tokens', 'max-wait-after-coverage-factor', 'maxage', 'maxfilesize', 'maxmemorygain', 'maxpendingbytes', 'maxpendingdocs', 'maxsize', 'maxsize-percent', 'mbusport', 'memory', 'memory-limit-factor', 'merges', 'min-active-docs-coverage', 'min-distributor-up-ratio', 'min-node-ratio-per-group', 'min-redundancy', 'min-storage-up-ratio', 'min-wait-after-coverage-factor', 'minimum', 'model', 'model-evaluation', 'models', 'native', 'niceness', 'node', 'nodes', 'onnx', 'onnx-execution-mode', 'onnx-gpu-device', 'onnx-interop-threads', 'onnx-intraop-threads', 'persearch', 'persistence-threads', 'pooling-strategy', 'prepend', 'processing', 'processor', 'proton', 'provider', 'prune', 'query', 'query-timeout', 'query-token-id', 'read', 'redundancy', 'removed-db', 'renderer', 'requestthreads', 'resource-limits', 'resources', 'retrydelay', 'retryenabled', 'route', 'search', 'searchable-copies', 'searcher', 'searchnode', 'secret-store', 'server', 'services', 'slobrok', 'slobroks', 'stable-state-period', 'store', 'summary', 'sync-transactionlog', 'term-score-threshold', 'threadpool', 'threads', 'time', 'timeout', 'token', 'tokenizer-model', 'top-k-probability', 'total', 'tracelevel', 'transactionlog', 'transformer-attention-mask', 'transformer-end-sequence-token', 'transformer-input-ids', 'transformer-mask-token', 'transformer-model', 'transformer-output', 'transformer-pad-token', 'transformer-start-sequence-token', 'transition-time', 'tuning', 'type', 'unpack', 'visibility-delay', 'visitors', 'warmup', 'zookeeper']
```

### No proper validation until deploy time[¶](#no-proper-validation-until-deploy-time)

Note that any attribute can be passed to the tag constructors, with no validation at construction time. You will still get validation at deploy time as usual.

### Cleanup[¶](#cleanup)

```python
vespa_docker.container.stop()
vespa_docker.container.remove()
```

## Next steps[¶](#next-steps)

This is just an intro to the advanced configuration options available in Vespa.
For more details, see the [Vespa documentation](https://docs.vespa.ai/en/reference/services.html).

# Authenticating to Vespa Cloud[¶](#authenticating-to-vespa-cloud)

Security is a top priority for the Vespa Team. We understand that as a newcomer to Vespa, the different authentication methods may not always be immediately clear. This notebook is intended to provide some clarity on the different authentication methods needed when interacting with Vespa Cloud for different purposes.

Refer to [troubleshooting](https://vespa-engine.github.io/pyvespa/troubleshooting.md) if you run into problems with this guide.

**Prerequisite**: Create a tenant at [cloud.vespa.ai](https://cloud.vespa.ai/) and save the tenant name.

## Install[¶](#install)

Install [pyvespa](https://pyvespa.readthedocs.io/) >= 0.45 and the [Vespa CLI](https://docs.vespa.ai/en/vespa-cli.html).

```text
!pip3 install pyvespa vespacli
```

For background context, it is useful to read the [Vespa Cloud Security Guide](https://cloud.vespa.ai/en/security/guide).

## Control-plane vs Data-plane[¶](#control-plane-vs-data-plane)

This may be self-explanatory for some, but it is worth mentioning that Vespa Cloud has two main components: the control-plane and the data-plane, which provide access to different functionalities.
| | Control-plane | Data-plane | Comments |
| --- | --- | --- | --- |
| Deploy application | ✅ | ❌ | |
| Modify application (re-deploy) | ✅ | ❌ | |
| Add or modify data-plane certs or token(s) | ✅ | ❌ | |
| Feed data | ❌ | ✅ | |
| Query data | ❌ | ✅ | |
| Delete data | ❌ | ✅ | |
| [Visiting](https://docs.vespa.ai/en/visiting.html) | ❌ | ✅ | |
| [Monitoring](https://cloud.vespa.ai/en/monitoring) | ❌ | ✅ | |
| Get application package | ✅ | ❌ | |
| [vespa auth login](https://docs.vespa.ai/en/clients/vespa-cli.html) | ✅ | ❌ | Interactive control-plane login in browser |
| [vespa auth api-key](https://docs.vespa.ai/en/clients/vespa-cli.html) | ✅ | ❌ | Headless control-plane authentication with an API key generated in the Vespa Cloud console |
| [vespa auth cert](https://docs.vespa.ai/en/clients/vespa-cli.html) | ❌ | ✅ | Used to generate a certificate for a data-plane connection |
| [VespaCloud](https://vespa-engine.github.io/pyvespa/api/vespa/deployment.md#vespa.deployment.VespaCloud) | ✅ | ❌ | `VespaCloud` is a control-plane connection to Vespa Cloud |
| [VespaDocker](https://vespa-engine.github.io/pyvespa/api/vespa/deployment.md#vespa.deployment.VespaDocker) | ✅ | ❌ | `VespaDocker` is a control-plane connection to a Vespa server running in Docker |
| [Vespa](https://vespa-engine.github.io/pyvespa/api/vespa/application.md#vespa.application.Vespa) | ❌ | ✅ | `Vespa` is a data-plane connection to an existing Vespa application |

## Defining your application[¶](#defining-your-application)

To initialize a connection to Vespa Cloud, you need to define your tenant name and application name.
```python
# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
# Replace with your application name (does not need to exist yet)
application = "authnotebook"
```

## Defining your application package[¶](#defining-your-application-package)

An [application package](https://docs.vespa.ai/en/application-packages.html) is the whole Vespa application configuration. It can either be constructed directly from Python (as we will do below) or initialized from a path, for example by cloning a sample application from the [Vespa sample apps](https://github.com/vespa-engine/sample-apps).

Tip: You can use the command [vespa clone album-recommendation my-app](https://docs.vespa.ai/en/clients/vespa-cli.html) to clone a single sample app if you have the Vespa CLI installed.

For this guide, we will create a minimal application package. See other guides for more complex examples.
```python
from vespa.package import ApplicationPackage, Field, Schema, Document

schema_name = "doc"

schema = Schema(
    name=schema_name,
    document=Document(
        fields=[
            Field(name="id", type="string", indexing=["summary"]),
            Field(
                name="title",
                type="string",
                indexing=["index", "summary"],
                index="enable-bm25",
            ),
            Field(
                name="body",
                type="string",
                indexing=["index", "summary"],
                index="enable-bm25",
            ),
        ]
    ),
)

package = ApplicationPackage(name=application, schema=[schema])
```

## Control-plane authentication[¶](#control-plane-authentication)

Next, we need to authenticate to the Vespa Cloud control-plane. There are two ways to do this:

### 1. **Interactive login**[¶](#1-interactive-login)

This is the recommended way to authenticate to the control-plane. It opens a browser window for you to authenticate with either Google or GitHub. This method does not currently work on Windows. You can run `vespa auth login` in a terminal to authenticate first, and then use this method (which will then reuse the generated token).

(We will not run this method here, as the notebook is run in CI, but you should run it in your local environment.)

```python
from vespa.deployment import VespaCloud

vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=application,
    application_package=package,
    # Could also initialize from application_root (path to application package)
)
```

You should see something similar to this:

```text
Checking for access token in auth.json...
Access token expired. Please re-authenticate.
Your Device Confirmation code is: DRDT-ZZDC

Automatically open confirmation page in your default browser? [Y/n] y
Opened link in your browser: https://vespa.auth0.com/activate?user_code=DRDT-ZZDC
Waiting for login to complete in browser ... done
Success: Logged in
auth.json created at /Users/thomas/.vespa/auth.json
Successfully obtained access token for control plane access.
```

### 2. **API-key authentication**[¶](#2-api-key-authentication)

This is a headless way to authenticate to the control-plane. Note that the key must be generated first, either with `vespa auth api-key` or directly in the Vespa Cloud console.

```python
from vespa.deployment import VespaCloud
from vespa.application import Vespa
import os

# Key is only used for CI/CD. Can be removed if logging in interactively
key = os.getenv("VESPA_TEAM_API_KEY", None)
if key is not None:
    key = key.replace(r"\n", "\n")  # Replace literal "\n" in the env var with real newlines

vespa_cloud = VespaCloud(
    tenant=tenant_name,
    # Note that the name cannot contain the characters `-` or `_`.
    application=application,
    key_content=key,  # Prefer to use key_location=""
    application_package=package,
)
```

```text
Setting application...
Running: vespa config set application vespa-team.authnotebook
Setting target cloud...
Running: vespa config set target cloud

Api-key found for control plane access. Using api-key.
```

When you have authenticated to the control-plane of Vespa Cloud, a key/cert pair for data-plane authentication will be generated automatically for you, if none exists. The `data-plane-public-cert.pem` will be added to the application package (as `security/clients.pem`) that will be deployed. You should keep the key/cert pair safe, as any app or user that needs data-plane access to your Vespa application will need them.

For `dev` deployments, we allow redeploying an application with a different key/cert than the previous deployment. For `prod` deployments, however, this is not allowed, and will require a `validation-overrides` specification in the application package.

## Deploy to Vespa Cloud[¶](#deploy-to-vespa-cloud)

The app is now defined and ready to deploy to Vespa Cloud. Deploy `package` to Vespa Cloud, by creating an instance of [VespaCloud](https://vespa-engine.github.io/pyvespa/api/vespa/deployment#VespaCloud):

The following will upload the application package to the Vespa Cloud Dev Zone (`aws-us-east-1c`), read more about [Vespa Zones](https://cloud.vespa.ai/en/reference/zones.html). The Vespa Cloud Dev Zone is considered a sandbox environment where resources are down-scaled and idle deployments are expired automatically. For information about production deployments, see the [deployment docs](https://cloud.vespa.ai/en/reference/deployment).

> Note: Deployments to dev and perf expire after 14 days of inactivity, i.e., 14 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 14 more days.

```python
app: Vespa = vespa_cloud.deploy()
```

```text
Deployment started in run 1 of dev-aws-us-east-1c for vespa-team.authnotebook. This may take a few minutes the first time.
INFO [06:35:26] Deploying platform version 8.408.12 and application dev build 1 for dev-aws-us-east-1c of default ...
INFO [06:35:27] Using CA signed certificate version 1
INFO [06:35:27] Using 1 nodes in container cluster 'authnotebook_container'
INFO [06:35:30] Session 309490 for tenant 'vespa-team' prepared, but activation failed: 1/2 application hosts and 2/2 admin hosts for vespa-team.authnotebook have completed provisioning and bootstrapping, still waiting for h98840.dev.us-east-1c.aws.vespa-cloud.net
[... repeated "Deploying ..." / "still waiting ..." status lines omitted ...]
INFO [06:39:34] Deploying platform version 8.408.12 and application dev build 1 for dev-aws-us-east-1c of default ...
INFO [06:39:35] Session 309490 for vespa-team.authnotebook.default activated
INFO [06:39:56] ######## Details for all nodes ########
INFO [06:39:56] h98612b.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [06:39:56] --- platform vespa/cloud-tenant-rhel8:8.408.12
INFO [06:39:56] --- storagenode on port 19102 has not started
INFO [06:39:56] --- searchnode on port 19107 has not started
INFO [06:39:56] --- distributor on port 19111 has not started
INFO [06:39:56] --- metricsproxy-container on port 19092 has not started
INFO [06:39:56] h97566a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [06:39:56] --- platform vespa/cloud-tenant-rhel8:8.408.12
INFO [06:39:56] --- logserver-container on port 4080 has not started
INFO [06:39:56] --- metricsproxy-container on port 19092 has not started
INFO [06:39:56] h98840a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [06:39:56] --- platform vespa/cloud-tenant-rhel8:8.408.12
INFO [06:39:56] --- container on port 4080 has not started
INFO [06:39:56] --- metricsproxy-container on port 19092 has not started
INFO [06:39:56] h98621d.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [06:39:56] --- platform vespa/cloud-tenant-rhel8:8.408.12
INFO [06:39:56] --- container-clustercontroller on port 19050 has not started
INFO [06:39:56] --- metricsproxy-container on port 19092 has not
started
INFO [06:40:33] Found endpoints:
INFO [06:40:33] - dev.aws-us-east-1c
INFO [06:40:33] |-- https://ea8555a9.c6970ada.z.vespa-app.cloud/ (cluster 'authnotebook_container')
INFO [06:40:33] Deployment complete!
Only region: aws-us-east-1c available in dev environment.
Found mtls endpoint for authnotebook_container
URL: https://ea8555a9.c6970ada.z.vespa-app.cloud/
Application is up!
```

If the deployment failed, it is possible you forgot to add the key in the Vespa Cloud Console in the `vespa auth api-key` step above. If you can authenticate, you should see lines like the following:

```text
Deployment started in run 1 of dev-aws-us-east-1c for mytenant.authdemo.
```

The deployment takes a few minutes the first time while Vespa Cloud sets up the resources for your Vespa application.

`app` now holds a reference to a [Vespa](https://vespa-engine.github.io/pyvespa/api/vespa/application.md#vespa.application.Vespa) instance. We can access the mTLS-protected endpoint name using the control-plane (`vespa_cloud`) instance. We can query and feed to this endpoint (data-plane access) using the mTLS certificate generated in the previous steps.

```python
mtls_endpoint = vespa_cloud.get_mtls_endpoint()
mtls_endpoint
```

```text
Found mtls endpoint for authnotebook_container
URL: https://ea8555a9.c6970ada.z.vespa-app.cloud/
'https://ea8555a9.c6970ada.z.vespa-app.cloud/'
```

## Data-plane authentication[¶](#data-plane-authentication)

As we have mentioned, there are two ways to authenticate to the data-plane:

### 1. **mTLS - Certificate authentication**[¶](#1-mtls-certificate-authentication)

This is the default way to authenticate to the data-plane. It uses the certificate which was added to the application package upon deployment.

### 2. **Token-based authentication**[¶](#2-token-based-authentication)

A more convenient way to authenticate to the data-plane is to use a token.
A token must be generated in the Vespa Cloud console. For more details, see the [Security Guide](https://cloud.vespa.ai/en/security/guide#configure-tokens).

Set a reasonable expiry, and copy the token to a safe place, such as a password manager. You will not be able to see it again.

After the token is generated, you need to add it as an auth-client to the application you want to access. In pyvespa, this is done by adding the `AuthClient`s to the application package.

**NB! The method in the next cell applies to `dev` deployments.** For `prod` deployments, it is a little more complex, and you need to add the `AuthClient`s to a container cluster in your application package like this:

```python
from vespa.package import (
    ApplicationPackage,
    AuthClient,
    ContainerCluster,
    ContentCluster,
    DeploymentConfiguration,
    Nodes,
    Parameter,
)

auth_clients = [
    AuthClient(
        id="mtls",
        permissions=["read"],
        parameters=[Parameter("certificate", {"file": "security/clients.pem"})],
    ),
    AuthClient(
        id="token",
        permissions=["read"],  # Set the permissions you need
        parameters=[Parameter("token", {"id": CLIENT_TOKEN_ID})],
    ),
]

# Add prod deployment config
prod_region = "aws-us-east-1c"

clusters = [
    ContentCluster(
        id=f"{schema_name}_content",
        nodes=Nodes(count="2"),
        document_name=schema_name,
        min_redundancy="2",
    ),
    ContainerCluster(
        id=f"{schema_name}_container",
        nodes=Nodes(count="2"),
        auth_clients=auth_clients,  # Note that the auth_clients are added here for prod deployments
    ),
]

deployment_config = DeploymentConfiguration(environment="prod", regions=[prod_region])

app_package = ApplicationPackage(
    name=application,
    schema=[schema],
    clusters=clusters,
    deployment_config=deployment_config,
)
```

See the [Application Package reference](https://cloud.vespa.ai/en/reference/application-package) for more details.
```python
from vespa.package import AuthClient, Parameter

CLIENT_TOKEN_ID = "pyvespa_integration"  # Same as the token name from the Vespa Cloud Console

auth_clients = [
    AuthClient(
        id="mtls",  # Note that you still need to include the mtls client.
        permissions=["read", "write"],
        parameters=[Parameter("certificate", {"file": "security/clients.pem"})],
    ),
    AuthClient(
        id="token",
        permissions=["read"],
        parameters=[Parameter("token", {"id": CLIENT_TOKEN_ID})],
    ),
]

app_package = ApplicationPackage(
    name=application, schema=[schema], auth_clients=auth_clients
)
```

Notice that we added the `read` and `write` permissions to the mTLS client, and only `read` to the token client. Make sure to restrict the permissions to suit your needs.

Now, we can deploy a new instance of the application package with the new auth-client added. See [Tenants, apps, instances](https://cloud.vespa.ai/en/tenant-apps-instances) for details on Vespa Cloud terminology.

```python
instance = "token"
vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=application,
    key_content=key,
    application_package=app_package,
)
app = vespa_cloud.deploy(instance=instance)
```

```text
Setting application...
Running: vespa config set application vespa-team.authnotebook
Setting target cloud...
Running: vespa config set target cloud

Api-key found for control plane access. Using api-key.
Deployment started in run 60 of dev-aws-us-east-1c for vespa-team.authnotebook.token. This may take a few minutes the first time.
INFO [06:40:38] Deploying platform version 8.408.12 and application dev build 54 for dev-aws-us-east-1c of token ...
INFO [06:40:39] Using CA signed certificate version 1
INFO [06:40:39] Using 1 nodes in container cluster 'authnotebook_container'
WARNING [06:40:41] Auto-overriding validation which would be disallowed in production: certificate-removal: Data plane certificate(s) from cluster 'authnotebook_container' is removed (removed certificates: [CN=cloud.vespa.example]) This can cause client connection issues. To allow this add certificate-removal to validation-overrides.xml, see https://docs.vespa.ai/en/reference/validation-overrides.html
INFO [06:40:42] Session 309492 for tenant 'vespa-team' prepared and activated.
INFO [06:40:43] ######## Details for all nodes ########
INFO [06:40:43] h97526a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [06:40:43] --- platform vespa/cloud-tenant-rhel8:8.408.12
INFO [06:40:43] --- storagenode on port 19102 has config generation 309488, wanted is 309492
INFO [06:40:43] --- searchnode on port 19107 has config generation 309488, wanted is 309492
INFO [06:40:43] --- distributor on port 19111 has config generation 309488, wanted is 309492
INFO [06:40:43] --- metricsproxy-container on port 19092 has config generation 309488, wanted is 309492
INFO [06:40:43] h97566b.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [06:40:43] --- platform vespa/cloud-tenant-rhel8:8.408.12
INFO [06:40:43] --- logserver-container on port 4080 has config generation 309488, wanted is 309492
INFO [06:40:43] --- metricsproxy-container on port 19092 has config generation 309488, wanted is 309492
INFO [06:40:43] h97538e.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [06:40:43] --- platform vespa/cloud-tenant-rhel8:8.408.12
INFO [06:40:43] --- container-clustercontroller on port 19050 has config generation 309492, wanted is 309492
INFO [06:40:43] --- metricsproxy-container on port 19092 has config generation 309488, wanted is 309492
INFO [06:40:43] h97567a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [06:40:43] --- platform vespa/cloud-tenant-rhel8:8.408.12
INFO [06:40:43] --- container on port 4080 has config generation 309488, wanted is 309492
INFO [06:40:43] --- metricsproxy-container on port 19092 has config generation 309488, wanted is 309492
INFO [06:40:53] Found endpoints:
INFO [06:40:53] - dev.aws-us-east-1c
INFO [06:40:53] |-- https://ab50e0c2.c6970ada.z.vespa-app.cloud/ (cluster 'authnotebook_container')
INFO [06:40:53] Deployment of new application complete!
Only region: aws-us-east-1c available in dev environment.
Found mtls endpoint for authnotebook_container
URL: https://ab50e0c2.c6970ada.z.vespa-app.cloud/
Application is up!
```

Note that the connection returned by default will be the mTLS connection. If you want a connection using token-based authentication, you can get it like this:

```python
token_app = vespa_cloud.get_application(
    instance=instance,
    endpoint_type="token",
    vespa_cloud_secret_token=os.getenv("VESPA_CLOUD_SECRET_TOKEN"),
)
```

```text
Only region: aws-us-east-1c available in dev environment.
Found token endpoint for authnotebook_container
URL: https://c7f94a93.c6970ada.z.vespa-app.cloud/
Application is up!
```

```python
token_app.get_application_status()
```

Note that a Vespa application creates a separate URL endpoint for each auth-client added. Here is how you can retrieve the URL for the token endpoint:
```
token_endpoint = vespa_cloud.get_token_endpoint(instance=instance)
token_endpoint
```

```
Found token endpoint for authnotebook_container
URL: https://c7f94a93.c6970ada.z.vespa-app.cloud/
```

Out[11]:

```
'https://c7f94a93.c6970ada.z.vespa-app.cloud/'
```

## Re-connecting to a deployed application[¶](#re-connecting-to-a-deployed-application)

To connect to a deployed application, use the `Vespa` class, which is a data-plane connection to an existing Vespa application. The `Vespa` class requires the endpoint URL. Note that this class can also be instantiated without authentication, typically when connecting to an instance running in Docker; see [VespaDocker](https://vespa-engine.github.io/pyvespa/api/vespa/deployment.md#vespa.deployment.VespaDocker).

### Connecting using mTLS[¶](#connecting-using-mtls)

To connect to the Vespa application using mTLS, pass `key` and `cert` to the `Vespa` class. Both should be paths to the respective files, matching the certificate that was added to the application package upon deployment. A common error is to regenerate the key/cert after deployment, causing a mismatch between the key/cert you are authenticating with and the cert added to the application package.

In [12]:

```
import os

# Get user home directory
home = os.path.expanduser("~")
# Vespa key/cert directory
app_dir = f"{home}/.vespa/{tenant_name}.{application}.default/"
cert_path = f"{app_dir}/data-plane-public-cert.pem"
key_path = f"{app_dir}/data-plane-private-key.pem"
```

In [13]:
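The key/cert location above follows pyvespa's `~/.vespa/<tenant>.<application>.<instance>/` layout. A minimal sketch of deriving the two paths (the helper function is hypothetical, not a pyvespa API):

```python
from pathlib import Path


def data_plane_paths(tenant: str, application: str, instance: str = "default"):
    # Hypothetical helper: mirrors the path convention used in the cell above.
    app_dir = Path.home() / ".vespa" / f"{tenant}.{application}.{instance}"
    return (
        str(app_dir / "data-plane-public-cert.pem"),
        str(app_dir / "data-plane-private-key.pem"),
    )


cert_path, key_path = data_plane_paths("vespa-team", "authnotebook")
```

Keeping this derivation in one place helps avoid the key/cert mismatch described above: always reuse the files generated at deployment time instead of regenerating them.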
```
from vespa.application import Vespa

app = Vespa(url=mtls_endpoint, cert=cert_path, key=key_path)
app.get_application_status()
```

#### Using `requests`[¶](#using-requests)

It is often overlooked that all interactions with Vespa are HTTP API calls, so you are free to use any HTTP client you like. Below is an example of how to use the `requests` library to interact with Vespa, using `key` and `cert` for authentication, and the [/document/v1/](https://docs.vespa.ai/en/reference/document-v1-api-reference.html) endpoint to feed data to Vespa.

In [14]:

```
import requests

session = requests.Session()
session.cert = (cert_path, key_path)
url = f"{mtls_endpoint}/document/v1/doc/doc/docid/1"
data = {
    "fields": {
        "id": "id:doc:doc::1",
        "title": "the title",
        "body": "the body",
    }
}
resp = session.post(url, json=data).json()
resp
```

Out[14]:

```
{'pathId': '/document/v1/doc/doc/docid/1', 'id': 'id:doc:doc::1'}
```

## Connecting using token[¶](#connecting-using-token)

To connect to the Vespa application using a token, pass the token value to the `Vespa` class as `vespa_cloud_secret_token`.

In [15]:
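Token authentication is just an `Authorization: Bearer` header on each request. A stdlib-only sketch of the equivalent raw call (the endpoint URL is taken from the output above; assuming the standard `/ApplicationStatus` handler):

```python
import os
import urllib.request

# Read the secret token from the environment, as in the cells above.
token = os.getenv("VESPA_CLOUD_SECRET_TOKEN", "<your-token>")

# Build (but do not yet send) a request carrying the Bearer token.
req = urllib.request.Request(
    "https://c7f94a93.c6970ada.z.vespa-app.cloud/ApplicationStatus",
    headers={"Authorization": f"Bearer {token}"},
)
# urllib.request.urlopen(req) would perform the request against the live endpoint.
```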
```
app = Vespa(
    url=token_endpoint, vespa_cloud_secret_token=os.getenv("VESPA_CLOUD_SECRET_TOKEN")
)
app.get_application_status()
```

### Using cURL[¶](#using-curl)

Token authentication provides an even more convenient way to authenticate to the data-plane, as you do not need to handle key/cert files; just add the token to the HTTP header, as shown in the example below.

```
curl -H "Authorization: Bearer $TOKEN" https://{endpoint}/document/v1/{document-type}/{document-id}
```

## Next steps[¶](#next-steps)

This was a guide to the different authentication methods used when interacting with Vespa Cloud. Try deploying a frontend as an interface to your Vespa application. Examples of providers:

- [Cloudflare Workers](https://workers.cloudflare.com/)
- [Vercel](https://vercel.com/)
- [Railway](https://railway.app/)

## Cleanup[¶](#cleanup)

In [16]:

```
vespa_cloud.delete()
```

```
Deactivated vespa-team.authnotebook in dev.aws-us-east-1c
Deleted instance vespa-team.authnotebook.default
```

# Application packages[¶](#application-packages)

Vespa is configured using an [application package](https://docs.vespa.ai/en/application-packages.html). Pyvespa provides an API to generate a deployable application package. An application package has at a minimum a [schema](https://docs.vespa.ai/en/schemas.html) and [services.xml](https://docs.vespa.ai/en/reference/services.html).
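As a mental model, that minimal layout can be sketched as two files (contents abbreviated; the real files are generated by pyvespa in the cells that follow):

```python
from pathlib import Path
import tempfile

# Sketch of the minimal application package layout: one schema file plus services.xml.
root = Path(tempfile.mkdtemp())
(root / "schemas").mkdir()
(root / "schemas" / "myschema.sd").write_text(
    "schema myschema {\n    document myschema {\n    }\n}\n"
)
(root / "services.xml").write_text(
    "<!-- container and content cluster configuration goes here -->\n"
)

files = sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())
print(files)  # ['schemas/myschema.sd', 'services.xml']
```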
> ***NOTE: pyvespa generally does not support all indexing options in Vespa - it is made for easy experimentation. To set an unsupported indexing option (or any other unsupported configuration), export the application package, modify the schema or other files, and deploy the application package from the directory or as a zipped file. Find more details at the end of this notebook.***

Refer to [troubleshooting](https://vespa-engine.github.io/pyvespa/troubleshooting.md) for any problem when running this guide.

In [ ]:

```
!pip3 install pyvespa
```

By exporting to disk, one can see the generated files:

In [50]:

```
import os
import tempfile
from pathlib import Path
from vespa.package import ApplicationPackage

app_name = "myschema"
app_package = ApplicationPackage(name=app_name, create_query_profile_by_default=False)

temp_dir = tempfile.TemporaryDirectory()
app_package.to_files(temp_dir.name)
for p in Path(temp_dir.name).rglob("*"):
    if p.is_file():
        print(p)
```

```
/var/folders/9_/z105jyln7jz8h2vwsrjb7kxh0000gp/T/tmp6geo2dpg/services.xml
/var/folders/9_/z105jyln7jz8h2vwsrjb7kxh0000gp/T/tmp6geo2dpg/schemas/myschema.sd
```

## Schema[¶](#schema)

A schema is created with the same name as the application package:

In [51]:
```
os.environ["TMP_APP_DIR"] = temp_dir.name
os.environ["APP_NAME"] = app_name

!cat $TMP_APP_DIR/schemas/$APP_NAME.sd
```

```
schema myschema {
    document myschema {
    }
}
```

Configure the schema with [fields](https://docs.vespa.ai/en/schemas.html#field), [fieldsets](https://docs.vespa.ai/en/schemas.html#fieldset) and a [ranking function](https://docs.vespa.ai/en/ranking.html):

In [52]:

```
from vespa.package import Field, FieldSet, RankProfile

app_package.schema.add_fields(
    Field(name="id", type="string", indexing=["attribute", "summary"]),
    Field(
        name="title", type="string", indexing=["index", "summary"], index="enable-bm25"
    ),
    Field(
        name="body", type="string", indexing=["index", "summary"], index="enable-bm25"
    ),
)
app_package.schema.add_field_set(FieldSet(name="default", fields=["title", "body"]))
app_package.schema.add_rank_profile(
    RankProfile(name="default", first_phase="bm25(title) + bm25(body)")
)
```

Export the application package again and show the schema:

In [53]:
```
app_package.to_files(temp_dir.name)

!cat $TMP_APP_DIR/schemas/$APP_NAME.sd
```

```
schema myschema {
    document myschema {
        field id type string {
            indexing: attribute | summary
        }
        field title type string {
            indexing: index | summary
            index: enable-bm25
        }
        field body type string {
            indexing: index | summary
            index: enable-bm25
        }
    }
    fieldset default {
        fields: title, body
    }
    rank-profile default {
        first-phase {
            expression {
                bm25(title) + bm25(body)
            }
        }
    }
}
```

## Services[¶](#services)

`services.xml` configures container and content clusters - see the [Vespa Overview](https://docs.vespa.ai/en/overview.html). This is a file you will normally not change or need to know much about:

In [54]:

```
!cat $TMP_APP_DIR/services.xml
```

```
1
```

Observe:

- A *content cluster* (this is where the index is stored) called `myschema_content` is created. This information is not normally needed, unless you use [delete_all_docs](https://vespa-engine.github.io/pyvespa/api/vespa/application.md#vespa.application.Vespa.delete_all_docs) to quickly remove all documents from a schema.

## Deploy[¶](#deploy)

After completing the code for the fields and ranking, deploy the application into a Docker container - the container is started by pyvespa:

In [55]:

```
from vespa.deployment import VespaDocker

vespa_container = VespaDocker()
vespa_connection = vespa_container.deploy(application_package=app_package)
```

```
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Waiting for application status, 15/300 seconds...
Waiting for application status, 20/300 seconds...
Waiting for application status, 25/300 seconds...
Finished deployment.
```

## Deploy from modified files[¶](#deploy-from-modified-files)

To add configuration to the application which is not supported by the pyvespa code, export the files, modify them, then deploy by using `deploy_from_disk`. This example adds custom configuration to the `services.xml` file above and deploys it:

In [56]:

```
%%sh
cat << EOF > $TMP_APP_DIR/services.xml
<?xml version="1.0" encoding="utf-8" ?>
<services version="1.0">
    <container id="myschema_container" version="1.0">
        <search></search>
        <document-api></document-api>
    </container>
    <content id="myschema_content" version="1.0">
        <redundancy>1</redundancy>
        <documents>
            <document type="myschema" mode="index"></document>
        </documents>
        <nodes>
            <node distribution-key="0" hostalias="node1"></node>
        </nodes>
        <tuning>
            <resource-limits>
                <disk>0.90</disk>
            </resource-limits>
        </tuning>
    </content>
</services>
EOF
```

The [resource-limits](https://docs.vespa.ai/en/reference/services-content.html#resource-limits) setting in `tuning/resource-limits/disk` allows a higher disk usage.

Deploy using the exported files:

In [57]:

```
vespa_connection = vespa_container.deploy_from_disk(
    application_name=app_name, application_root=temp_dir.name
)
```

```
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Finished deployment.
```

One can also export a deployable zip-file, which can be deployed using the Vespa Cloud Console:

In [58]:

```
Path.mkdir(Path(temp_dir.name) / "zip", exist_ok=True, parents=True)
app_package.to_zipfile(temp_dir.name + "/zip/application.zip")

!find "$TMP_APP_DIR/zip" -type f
```

```
/var/folders/9_/z105jyln7jz8h2vwsrjb7kxh0000gp/T/tmp6geo2dpg/zip/application.zip
```

### Cleanup[¶](#cleanup)

Remove the container resources and the temporary application package export:

In [59]:
```
temp_dir.cleanup()
vespa_container.container.stop()
vespa_container.container.remove()
```

## Next step: Deploy, feed and query[¶](#next-step-deploy-feed-and-query)

Once the schema is ready for deployment, decide on a deployment option and deploy the application package:

- [Deploy to local container](https://vespa-engine.github.io/pyvespa/getting-started-pyvespa.md)
- [Deploy to Vespa Cloud](https://vespa-engine.github.io/pyvespa/getting-started-pyvespa-cloud.md)

Use the guides on the pyvespa site to feed and query data.

# Querying Vespa[¶](#querying-vespa)

This guide goes through how to query a Vespa instance using the Query API, with a sample app as the example. Refer to [troubleshooting](https://vespa-engine.github.io/pyvespa/troubleshooting.md) for any problem when running this guide. You can run this tutorial in Google Colab:

In [ ]:

```
!pip3 install pyvespa
```

Let us first deploy and get a connection to a Vespa instance.

In [2]:

```
from vespa.application import Vespa
from vespa.deployment import VespaDocker
from vespa.io import VespaQueryResponse
from vespa.exceptions import VespaError
from vespa.package import sample_package

vespa_docker = VespaDocker()
app: Vespa = vespa_docker.deploy(sample_package)
```

```
Waiting for configuration server, 0/60 seconds...
Waiting for configuration server, 5/60 seconds...
Waiting for application to come up, 0/300 seconds.
Waiting for application to come up, 5/300 seconds.
Application is up!
Finished deployment.
```

In [3]:
```
from datasets import load_dataset

dataset = load_dataset("BeIR/nfcorpus", "corpus", split="corpus", streaming=True)
vespa_feed = dataset.map(
    lambda x: {
        "id": x["_id"],
        "fields": {"title": x["title"], "body": x["text"], "id": x["_id"]},
    }
).take(100)
```

In [4]:

```
from vespa.io import VespaResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(f"Error when feeding document {id}: {response.get_json()}")


app.feed_iterable(vespa_feed, schema="doc", namespace="tutorial", callback=callback)
```

See the [Vespa query language](https://docs.vespa.ai/en/reference/query-api-reference.html) for Vespa Query API request parameters. The YQL [userQuery()](https://docs.vespa.ai/en/reference/query-language-reference.html#userquery) operator uses the query read from the `query` parameter. The query also specifies the app-specific [bm25 rank profile](https://docs.vespa.ai/en/reference/bm25.html). The code uses a [context manager](https://realpython.com/python-with-statement/) (`with app.syncio() as session`) to make sure that connection pools are released. This matters when making multiple queries, as each query then does not have to set up new connections.

In [5]:
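A query is a plain HTTP request; the GET URL that pyvespa prints for these parameters (via `response.url`, shown in the next cell) can be reproduced with the standard library alone:

```python
from urllib.parse import urlencode

# Reproduce the /search/ GET URL for the query parameters used in this guide.
params = {
    "yql": "select title, body from doc where userQuery()",
    "hits": 1,
    "query": "Is statin use connected to breast cancer?",
    "ranking": "bm25",
}
url = "http://localhost:8080/search/?" + urlencode(params)
print(url)
# http://localhost:8080/search/?yql=select+title%2C+body+from+doc+where+userQuery%28%29&hits=1&query=Is+statin+use+connected+to+breast+cancer%3F&ranking=bm25
```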
```
with app.syncio() as session:
    response: VespaQueryResponse = session.query(
        yql="select title, body from doc where userQuery()",
        hits=1,
        query="Is statin use connected to breast cancer?",
        ranking="bm25",
    )
    print(response.is_successful())
    print(response.url)
```

```
True
http://localhost:8080/search/?yql=select+title%2C+body+from+doc+where+userQuery%28%29&hits=1&query=Is+statin+use+connected+to+breast+cancer%3F&ranking=bm25
```

Alternatively, if a native [Vespa query parameter](https://docs.vespa.ai/en/reference/query-api-reference.html) contains ".", which cannot be used as a `kwarg`, the parameters can be sent as an HTTP POST with the `body` argument. In this case, `ranking` is an alias of `ranking.profile`, since using `ranking.profile` as a `**kwargs` argument is not allowed in Python. This will combine HTTP parameters with an HTTP POST body.

In [6]:

```
with app.syncio() as session:
    response: VespaQueryResponse = session.query(
        body={
            "yql": "select title, body from doc where userQuery()",
            "query": "Is statin use connected to breast cancer?",
            "ranking": "bm25",
            "presentation.timing": True,
        },
    )
    print(response.is_successful())
```

```
True
```

The query specified that we wanted one hit:

In [7]:
```
response.hits
```

Out[7]:

```
[{'id': 'index:sample_content/0/2deca9d7029f3a77c092dfeb', 'relevance': 21.850306796449487, 'source': 'sample_content', 'fields': {'body': 'Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear. We evaluated risk of breast cancer death among statin users in a population-based cohort of breast cancer patients. The study cohort included all newly diagnosed breast cancer patients in Finland during 1995–2003 (31,236 cases), identified from the Finnish Cancer Registry. Information on statin use before and after the diagnosis was obtained from a national prescription database. We used the Cox proportional hazards regression method to estimate mortality among statin users with statin use as time-dependent variable. A total of 4,151 participants had used statins. During the median follow-up of 3.25 years after the diagnosis (range 0.08–9.0 years) 6,011 participants died, of which 3,619 (60.2%) was due to breast cancer. After adjustment for age, tumor characteristics, and treatment selection, both post-diagnostic and pre-diagnostic statin use were associated with lowered risk of breast cancer death (HR 0.46, 95% CI 0.38–0.55 and HR 0.54, 95% CI 0.44–0.67, respectively). The risk decrease by post-diagnostic statin use was likely affected by healthy adherer bias; that is, the greater likelihood of dying cancer patients to discontinue statin use as the association was not clearly dose-dependent and observed already at low-dose/short-term use.
The dose- and time-dependence of the survival benefit among pre-diagnostic statin users suggests a possible causal effect that should be evaluated further in a clinical trial testing statins’ effect on survival in breast cancer patients.', 'title': 'Statin Use and Breast Cancer Survival: A Nationwide Cohort Study from Finland'}}, {'id': 'index:sample_content/0/d2f48bedc26e3838b2fc40d8', 'relevance': 20.71331154820049, 'source': 'sample_content', 'fields': {'body': 'BACKGROUND: Preclinical studies have shown that statins, particularly simvastatin, can prevent growth in breast cancer cell lines and animal models. We investigated whether statins used after breast cancer diagnosis reduced the risk of breast cancer-specific, or all-cause, mortality in a large cohort of breast cancer patients. METHODS: A cohort of 17,880 breast cancer patients, newly diagnosed between 1998 and 2009, was identified from English cancer registries (from the National Cancer Data Repository). This cohort was linked to the UK Clinical Practice Research Datalink, providing prescription records, and to the Office of National Statistics mortality data (up to 2013), identifying 3694 deaths, including 1469 deaths attributable to breast cancer. Unadjusted and adjusted hazard ratios (HRs) for breast cancer-specific, and all-cause, mortality in statin users after breast cancer diagnosis were calculated using time-dependent Cox regression models. Sensitivity analyses were conducted using multiple imputation methods, propensity score methods and a case-control approach. RESULTS: There was some evidence that statin use after a diagnosis of breast cancer had reduced mortality due to breast cancer and all causes (fully adjusted HR = 0.84 [95% confidence interval = 0.68-1.04] and 0.84 [0.72-0.97], respectively). These associations were more marked for simvastatin 0.79 (0.63-1.00) and 0.81 (0.70-0.95), respectively. 
CONCLUSIONS: In this large population-based breast cancer cohort, there was some evidence of reduced mortality in statin users after breast cancer diagnosis. However, these associations were weak in magnitude and were attenuated in some sensitivity analyses.', 'title': 'Statin use after diagnosis of breast cancer and survival: a population-based cohort study.'}}, {'id': 'index:sample_content/0/abb8c59b326f35dc406e914d', 'relevance': 10.546049129391914, 'source': 'sample_content', 'fields': {'body': 'BACKGROUND: Although high soy consumption may be associated with lower breast cancer risk in Asian populations, findings from epidemiological studies have been inconsistent. OBJECTIVE: We investigated the effects of soy intake on breast cancer risk among Korean women according to their menopausal and hormone receptor status. METHODS: We conducted a case-control study with 358 incident breast cancer patients and 360 age-matched controls with no history of malignant neoplasm. Dietary consumption of soy products was examined using a 103-item food frequency questionnaire. RESULTS: The estimated mean intakes of total soy and isoflavones from this study population were 76.5 g per day and 15.0 mg per day, respectively. Using a multivariate logistic regression model, we found a significant inverse association between soy intake and breast cancer risk, with a dose-response relationship (odds ratios (OR) (95% confidence interval (CI)) for the highest vs the lowest intake quartile: 0.36 (0.20-0.64)). When the data were stratified by menopausal status, the protective effect was observed only among postmenopausal women (OR (95% CI) for the highest vs the lowest intake quartile: 0.08 (0.03-0.22)). The association between soy and breast cancer risk did not differ according to estrogen receptor (ER)/progesterone receptor (PR) status, but the estimated intake of soy isoflavones showed an inverse association only among postmenopausal women with ER+/PR+ tumors. 
CONCLUSIONS: Our findings suggest that high consumption of soy might be related to lower risk of breast cancer and that the effect of soy intake could vary depending on several factors.', 'title': 'Effect of dietary soy intake on breast cancer risk according to menopause and hormone receptor status.'}}, {'id': 'index:sample_content/0/3b8b700c0db54b8272a2da54', 'relevance': 7.360259724997032, 'source': 'sample_content', 'fields': {'body': 'Docosahexaenoic acid (DHA) is an omega-3 fatty acid that comprises 22 carbons and 6 alternative double bonds in its hydrocarbon chain (22:6omega3). Previous studies have shown that DHA from fish oil controls the growth and development of different cancers; however, safety issues have been raised repeatedly about contamination of toxins in fish oil that makes it no longer a clean and safe source of the fatty acid. We investigated the cell growth inhibition of DHA from the cultured microalga Crypthecodinium cohnii (algal DHA [aDHA]) in human breast carcinoma MCF-7 cells. aDHA exhibited growth inhibition on breast cancer cells dose-dependently by 16.0% to 59.0% of the control level after 72-h incubations with 40 to 160 microM of the fatty acid. DNA flow cytometry shows that aDHA induced sub-G(1) cells, or apoptotic cells, by 64.4% to 171.3% of the control levels after incubations with 80 mM of the fatty acid for 24, 48, and 72 h. Western blot studies further show that aDHA did not modulate the expression of proapoptotic Bax protein but induced the downregulation of anti-apoptotic Bcl-2 expression time-dependently, causing increases of Bax/Bcl-2 ratio by 303.4% and 386.5% after 48- and 72-h incubations respectively with the fatty acid. 
Results from this study suggest that DHA from the cultured microalga is also effective in controlling cancer cell growth and that downregulation of antiapoptotic Bcl-2 is an important step in the induced apoptosis.', 'title': 'Docosahexaenoic acid from a cultured microalga inhibits cell growth and induces apoptosis by upregulating Bax/Bcl-2 ratio in human breast carcinoma...'}}, {'id': 'index:sample_content/0/9c2d39bb63ce85fcda9bfe6c', 'relevance': 5.441906201913548, 'source': 'sample_content', 'fields': {'body': 'Background Based on the hypothesized protective effect, we examined the effect of soy foods on estrogens in nipple aspirate fluid (NAF) and serum, possible indicators of breast cancer risk. Methods In a cross-over design, we randomized 96 women who produced ≥10 μL NAF to a high- or low-soy diet for 6-months. During the high-soy diet, participants consumed 2 soy servings of soy milk, tofu, or soy nuts (approximately 50 mg of isoflavones/day); during the low-soy diet, they maintained their usual diet. Six NAF samples were obtained using a FirstCyte© Aspirator. Estradiol (E2) and estrone sulfate (E1S) were assessed in NAF and estrone (E1) in serum only using highly sensitive radioimmunoassays. Mixed-effects regression models accounting for repeated measures and left-censoring limits were applied. Results Mean E2 and E1S were lower during the high-soy than the low-soy diet (113 vs. 313 pg/mL and 46 vs. 68 ng/mL, respectively) without reaching significance (p=0.07); the interaction between group and diet and was not significant. There was no effect of the soy treatment on serum E2 (p=0.76), E1 (p=0.86), or E1S (p=0.56). Within individuals, NAF and serum levels of E2 (rs=0.37; p<0.001) but not E1S (rs=0.004; p=0.97) were correlated. E2 and E1S in NAF and serum were strongly associated (rs=0.78 and rs=0.48; p<0.001). Conclusions Soy foods in amounts consumed by Asians did not significantly modify estrogen levels in NAF and serum. 
Impact The trend towards lower estrogens in NAF during the high-soy diet counters concerns about adverse effects of soy foods on breast cancer risk.', 'title': 'Estrogen levels in nipple aspirate fluid and serum during a randomized soy trial'}}, {'id': 'index:sample_content/0/449eccc1b30615316ab136bc', 'relevance': 5.241472721415711, 'source': 'sample_content', 'fields': {'body': 'The relation between various types of fiber and oral, pharyngeal and esophageal cancer was investigated using data from a case-control study conducted between 1992 and 1997 in Italy. Cases were 271 hospital patients with incident, histologically confirmed oral cancer, 327 with pharyngeal cancer and 304 with esophageal cancer. Controls were 1,950 subjects admitted to the same network of hospitals as the cases for acute, nonneoplastic diseases. Cases and controls were interviewed during their hospital stay using a validated food frequency questionnaire. Odds ratios (OR) were computed after allowance for age, sex, and other potential confounding factors, including alcohol, tobacco consumption, and energy intake. The ORs for the highest vs. the lowest quintile of intake of oral, pharyngeal and esophageal cancer combined were 0.40 for total (Englyst) fiber, 0.37 for soluble fiber, 0.52 for cellulose, 0.48 for insoluble non cellulose polysaccharide, 0.33 for total insoluble fiber and 0.38 for lignin. The inverse relation were similar for vegetable fiber (OR = 0.51), fruit fiber (OR = 0.60) and grain fiber (OR = 0.56), and were somewhat stronger for oral and pharyngeal cancer than for esophageal cancer. The ORs were similar for the two sexes and strata of age, education, alcohol and tobacco consumption, and total non-alcohol energy intake. 
Our study indicates that fiber intake may have a protective role on oral, pharyngeal and esophageal cancer.', 'title': 'Fiber intake and the risk of oral, pharyngeal and esophageal cancer.'}}, {'id': 'index:sample_content/0/c4cb3b969a89b81a3da71e9d', 'relevance': 5.0658599969730735, 'source': 'sample_content', 'fields': {'body': 'BACKGROUND & AIMS: Increasing evidence suggests that a low folate intake and impaired folate metabolism may be implicated in the development of gastrointestinal cancers. We conducted a systematic review with meta-analysis of epidemiologic studies evaluating the association of folate intake or genetic polymorphisms in 5,10-methylenetetrahydrofolate reductase (MTHFR), a central enzyme in folate metabolism, with risk of esophageal, gastric, or pancreatic cancer. METHODS: A literature search was performed using MEDLINE for studies published through March 2006. Study-specific relative risks were weighted by the inverse of their variance to obtain random-effects summary estimates. RESULTS: The summary relative risks for the highest versus the lowest category of dietary folate intake were 0.66 (95% confidence interval [CI], 0.53-0.83) for esophageal squamous cell carcinoma (4 case-control), 0.50 (95% CI, 0.39-0.65) for esophageal adenocarcinoma (3 case-control), and 0.49 (95% CI, 0.35-0.67) for pancreatic cancer (1 case-control, 4 cohort); there was no heterogeneity among studies. Results on dietary folate intake and risk of gastric cancer (9 case-control, 2 cohort) were inconsistent. In most studies, the MTHFR 677TT (variant) genotype, which is associated with reduced enzyme activity, was associated with an increased risk of esophageal squamous cell carcinoma, gastric cardia adenocarcinoma, noncardia gastric cancer, gastric cancer (all subsites), and pancreatic cancer; all but one of 22 odds ratios were >1, of which 13 estimates were statistically significant. Studies of the MTHFR A1298C polymorphism were limited and inconsistent. 
CONCLUSIONS: These findings support the hypothesis that folate may play a role in carcinogenesis of the esophagus, stomach, and pancreas.', 'title': 'Folate intake, MTHFR polymorphisms, and risk of esophageal, gastric, and pancreatic cancer: a meta-analysis.'}}, {'id': 'index:sample_content/0/bb0fe2bd511527ef78587e95', 'relevance': 4.780565525377517, 'source': 'sample_content', 'fields': {'body': 'Individual-based studies that investigated the relation between dietary alpha-linolenic acid (ALA) intake and prostate cancer risk have shown inconsistent results. We carried out a meta-analysis of prospective studies to examine this association. We systematically searched studies published up to December 2008. Log relative risks (RRs) were weighted by the inverse of their variances to obtain a pooled estimate with its 95% confidence interval (CI). We identified five prospective studies that met our inclusion criteria and reported risk estimates by categories of ALA intake. Comparing the highest to the lowest ALA intake category, the pooled RR was 0.97 (95% CI:0.86-1.10) but the association was heterogeneous. Using the reported numbers of cases and non-cases in each category of ALA intake, we found that subjects who consumed more than 1.5 g/day of ALA compared with subjects who consumed less than 1.5 g/day had a significant decreased risk of prostate cancer: RR = 0.95 (95% CI:0.91-0.99). Divergences in results could partly be explained by differences in sample sizes and adjustment but they also highlight limits in dietary ALA assessment in such prospective studies. 
Our findings support a weak protective association between dietary ALA intake and prostate cancer risk but further research is needed to conclude on this question.', 'title': 'Prospective studies of dietary alpha-linolenic acid intake and prostate cancer risk: a meta-analysis.'}}, {'id': 'index:sample_content/0/90efd2c6652f323a8244690d', 'relevance': 4.7044749035958535, 'source': 'sample_content', 'fields': {'body': 'High serum levels of testosterone and estradiol, the bioavailability of which may be increased by Western dietary habits, seem to be important risk factors for postmenopausal breast cancer. We hypothesized that an ad libitum diet low in animal fat and refined carbohydrates and rich in low-glycemic-index foods, monounsaturated and n-3 polyunsaturated fatty acids, and phytoestrogens, might favorably modify the hormonal profile of postmenopausal women. One hundred and four postmenopausal women selected from 312 healthy volunteers on the basis of high serum testosterone levels were randomized to dietary intervention or control. The intervention included intensive dietary counseling and specially prepared group meals twice a week over 4.5 months. Changes in serum levels of testosterone, estradiol, and sex hormone-binding globulin were the main outcome measures. In the intervention group, sex hormone-binding globulin increased significantly (from 36.0 to 45.1 nmol/liter) compared with the control group (25 versus 4%,; P < 0.0001) and serum testosterone decreased (from 0.41 to 0.33 ng/ml; -20 versus -7% in control group; P = 0.0038). Serum estradiol also decreased, but the change was not significant. The dietary intervention group also significantly decreased body weight (4.06 kg versus 0.54 kg in the control group), waist:hip ratio, total cholesterol, fasting glucose level, and area under insulin curve after oral glucose tolerance test. 
A radical modification in diet designed to reduce insulin resistance and also involving increased phytoestrogen intake decreases the bioavailability of serum sex hormones in hyperandrogenic postmenopausal women. Additional studies are needed to determine whether such effects can reduce the risk of developing breast cancer.', 'title': 'Reducing bioavailable sex hormones through a comprehensive change in diet: the diet and androgens (DIANA) randomized trial.'}}, {'id': 'index:sample_content/0/9b56be58163850a7b2ee2425', 'relevance': 3.896398317302996, 'source': 'sample_content', 'fields': {'body': 'Breast pain is a common condition affecting most women at some stage in their reproductive life. Mastalgia is resistant to treatment in 6% of cyclical and 26% non-cyclical patients. Surgery is not widely used to treat this condition and only considered in patients with severe mastalgia resistant to medication. The aims of this study were to audit the efficacy of surgery in severe treatment resistant mastalgia and to assess patient satisfaction following surgery. This is a retrospective review of the medical records of all patients seen in mastalgia clinic in the University Hospital of Wales, Cardiff since 1973. A postal questionnaire was distributed to all patients who had undergone surgery. Results showed that of the 1054 patients seen in mastalgia clinic, 12 (1.2%) had undergone surgery. Surgery included 8 subcutaneous mastectomies with implants (3 bilateral, 5 unilateral), 1 bilateral simple mastectomy and 3 quadrantectomies (1 having a further simple mastectomy). The median duration of symptoms was 6.5 years (range 2-16 years). Five patients (50%) were pain free following surgery, 3 developed capsular contractures and 2 wound infections with dehiscence. Pain persisted in both patients undergoing quadrantectomy. We conclude that surgery for mastalgia should only be considered in a minority of patients. 
Patients should be informed of possible complications inherent of reconstructive surgery and warned that in 50% cases their pain will not be improved.', 'title': 'Is there a role for surgery in the treatment of mastalgia?'}}]
```

Example of iterating over the returned hits obtained from `response.hits`, extracting the `title` field:

In \[8\]:

```
[hit["fields"]["title"] for hit in response.hits]
```

Out\[8\]:

```
['Statin Use and Breast Cancer Survival: A Nationwide Cohort Study from Finland', 'Statin use after diagnosis of breast cancer and survival: a population-based cohort study.', 'Effect of dietary soy intake on breast cancer risk according to menopause and hormone receptor status.', 'Docosahexaenoic acid from a cultured microalga inhibits cell growth and induces apoptosis by upregulating Bax/Bcl-2 ratio in human breast carcinoma...', 'Estrogen levels in nipple aspirate fluid and serum during a randomized soy trial', 'Fiber intake and the risk of oral, pharyngeal and esophageal cancer.', 'Folate intake, MTHFR polymorphisms, and risk of esophageal, gastric, and pancreatic cancer: a meta-analysis.', 'Prospective studies of dietary alpha-linolenic acid intake and prostate cancer risk: a meta-analysis.', 'Reducing bioavailable sex hormones through a comprehensive change in diet: the diet and androgens (DIANA) randomized trial.', 'Is there a role for surgery in the treatment of mastalgia?']
```

Access the full JSON response in the Vespa [default JSON result format](https://docs.vespa.ai/en/reference/default-result-format.html):

In \[9\]:
``` response.json ``` response.json Out\[9\]: ``` {'timing': {'querytime': 0.004, 'summaryfetchtime': 0.005, 'searchtime': 0.011}, 'root': {'id': 'toplevel', 'relevance': 1.0, 'fields': {'totalCount': 97}, 'coverage': {'coverage': 100, 'documents': 100, 'full': True, 'nodes': 1, 'results': 1, 'resultsFull': 1}, 'children': [{'id': 'index:sample_content/0/2deca9d7029f3a77c092dfeb', 'relevance': 21.850306796449487, 'source': 'sample_content', 'fields': {'body': 'Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear. We evaluated risk of breast cancer death among statin users in a population-based cohort of breast cancer patients. The study cohort included all newly diagnosed breast cancer patients in Finland during 1995–2003 (31,236 cases), identified from the Finnish Cancer Registry. Information on statin use before and after the diagnosis was obtained from a national prescription database. We used the Cox proportional hazards regression method to estimate mortality among statin users with statin use as time-dependent variable. A total of 4,151 participants had used statins. During the median follow-up of 3.25 years after the diagnosis (range 0.08–9.0 years) 6,011 participants died, of which 3,619 (60.2%) was due to breast cancer. After adjustment for age, tumor characteristics, and treatment selection, both post-diagnostic and pre-diagnostic statin use were associated with lowered risk of breast cancer death (HR 0.46, 95% CI 0.38–0.55 and HR 0.54, 95% CI 0.44–0.67, respectively). The risk decrease by post-diagnostic statin use was likely affected by healthy adherer bias; that is, the greater likelihood of dying cancer patients to discontinue statin use as the association was not clearly dose-dependent and observed already at low-dose/short-term use. 
The dose- and time-dependence of the survival benefit among pre-diagnostic statin users suggests a possible causal effect that should be evaluated further in a clinical trial testing statins’ effect on survival in breast cancer patients.', 'title': 'Statin Use and Breast Cancer Survival: A Nationwide Cohort Study from Finland'}}, {'id': 'index:sample_content/0/d2f48bedc26e3838b2fc40d8', 'relevance': 20.71331154820049, 'source': 'sample_content', 'fields': {'body': 'BACKGROUND: Preclinical studies have shown that statins, particularly simvastatin, can prevent growth in breast cancer cell lines and animal models. We investigated whether statins used after breast cancer diagnosis reduced the risk of breast cancer-specific, or all-cause, mortality in a large cohort of breast cancer patients. METHODS: A cohort of 17,880 breast cancer patients, newly diagnosed between 1998 and 2009, was identified from English cancer registries (from the National Cancer Data Repository). This cohort was linked to the UK Clinical Practice Research Datalink, providing prescription records, and to the Office of National Statistics mortality data (up to 2013), identifying 3694 deaths, including 1469 deaths attributable to breast cancer. Unadjusted and adjusted hazard ratios (HRs) for breast cancer-specific, and all-cause, mortality in statin users after breast cancer diagnosis were calculated using time-dependent Cox regression models. Sensitivity analyses were conducted using multiple imputation methods, propensity score methods and a case-control approach. RESULTS: There was some evidence that statin use after a diagnosis of breast cancer had reduced mortality due to breast cancer and all causes (fully adjusted HR = 0.84 [95% confidence interval = 0.68-1.04] and 0.84 [0.72-0.97], respectively). These associations were more marked for simvastatin 0.79 (0.63-1.00) and 0.81 (0.70-0.95), respectively. 
CONCLUSIONS: In this large population-based breast cancer cohort, there was some evidence of reduced mortality in statin users after breast cancer diagnosis. However, these associations were weak in magnitude and were attenuated in some sensitivity analyses.', 'title': 'Statin use after diagnosis of breast cancer and survival: a population-based cohort study.'}}, {'id': 'index:sample_content/0/abb8c59b326f35dc406e914d', 'relevance': 10.546049129391914, 'source': 'sample_content', 'fields': {'body': 'BACKGROUND: Although high soy consumption may be associated with lower breast cancer risk in Asian populations, findings from epidemiological studies have been inconsistent. OBJECTIVE: We investigated the effects of soy intake on breast cancer risk among Korean women according to their menopausal and hormone receptor status. METHODS: We conducted a case-control study with 358 incident breast cancer patients and 360 age-matched controls with no history of malignant neoplasm. Dietary consumption of soy products was examined using a 103-item food frequency questionnaire. RESULTS: The estimated mean intakes of total soy and isoflavones from this study population were 76.5 g per day and 15.0 mg per day, respectively. Using a multivariate logistic regression model, we found a significant inverse association between soy intake and breast cancer risk, with a dose-response relationship (odds ratios (OR) (95% confidence interval (CI)) for the highest vs the lowest intake quartile: 0.36 (0.20-0.64)). When the data were stratified by menopausal status, the protective effect was observed only among postmenopausal women (OR (95% CI) for the highest vs the lowest intake quartile: 0.08 (0.03-0.22)). The association between soy and breast cancer risk did not differ according to estrogen receptor (ER)/progesterone receptor (PR) status, but the estimated intake of soy isoflavones showed an inverse association only among postmenopausal women with ER+/PR+ tumors. 
CONCLUSIONS: Our findings suggest that high consumption of soy might be related to lower risk of breast cancer and that the effect of soy intake could vary depending on several factors.', 'title': 'Effect of dietary soy intake on breast cancer risk according to menopause and hormone receptor status.'}}, {'id': 'index:sample_content/0/3b8b700c0db54b8272a2da54', 'relevance': 7.360259724997032, 'source': 'sample_content', 'fields': {'body': 'Docosahexaenoic acid (DHA) is an omega-3 fatty acid that comprises 22 carbons and 6 alternative double bonds in its hydrocarbon chain (22:6omega3). Previous studies have shown that DHA from fish oil controls the growth and development of different cancers; however, safety issues have been raised repeatedly about contamination of toxins in fish oil that makes it no longer a clean and safe source of the fatty acid. We investigated the cell growth inhibition of DHA from the cultured microalga Crypthecodinium cohnii (algal DHA [aDHA]) in human breast carcinoma MCF-7 cells. aDHA exhibited growth inhibition on breast cancer cells dose-dependently by 16.0% to 59.0% of the control level after 72-h incubations with 40 to 160 microM of the fatty acid. DNA flow cytometry shows that aDHA induced sub-G(1) cells, or apoptotic cells, by 64.4% to 171.3% of the control levels after incubations with 80 mM of the fatty acid for 24, 48, and 72 h. Western blot studies further show that aDHA did not modulate the expression of proapoptotic Bax protein but induced the downregulation of anti-apoptotic Bcl-2 expression time-dependently, causing increases of Bax/Bcl-2 ratio by 303.4% and 386.5% after 48- and 72-h incubations respectively with the fatty acid. 
Results from this study suggest that DHA from the cultured microalga is also effective in controlling cancer cell growth and that downregulation of antiapoptotic Bcl-2 is an important step in the induced apoptosis.', 'title': 'Docosahexaenoic acid from a cultured microalga inhibits cell growth and induces apoptosis by upregulating Bax/Bcl-2 ratio in human breast carcinoma...'}}, {'id': 'index:sample_content/0/9c2d39bb63ce85fcda9bfe6c', 'relevance': 5.441906201913548, 'source': 'sample_content', 'fields': {'body': 'Background Based on the hypothesized protective effect, we examined the effect of soy foods on estrogens in nipple aspirate fluid (NAF) and serum, possible indicators of breast cancer risk. Methods In a cross-over design, we randomized 96 women who produced ≥10 μL NAF to a high- or low-soy diet for 6-months. During the high-soy diet, participants consumed 2 soy servings of soy milk, tofu, or soy nuts (approximately 50 mg of isoflavones/day); during the low-soy diet, they maintained their usual diet. Six NAF samples were obtained using a FirstCyte© Aspirator. Estradiol (E2) and estrone sulfate (E1S) were assessed in NAF and estrone (E1) in serum only using highly sensitive radioimmunoassays. Mixed-effects regression models accounting for repeated measures and left-censoring limits were applied. Results Mean E2 and E1S were lower during the high-soy than the low-soy diet (113 vs. 313 pg/mL and 46 vs. 68 ng/mL, respectively) without reaching significance (p=0.07); the interaction between group and diet and was not significant. There was no effect of the soy treatment on serum E2 (p=0.76), E1 (p=0.86), or E1S (p=0.56). Within individuals, NAF and serum levels of E2 (rs=0.37; p<0.001) but not E1S (rs=0.004; p=0.97) were correlated. E2 and E1S in NAF and serum were strongly associated (rs=0.78 and rs=0.48; p<0.001). Conclusions Soy foods in amounts consumed by Asians did not significantly modify estrogen levels in NAF and serum. 
Impact The trend towards lower estrogens in NAF during the high-soy diet counters concerns about adverse effects of soy foods on breast cancer risk.', 'title': 'Estrogen levels in nipple aspirate fluid and serum during a randomized soy trial'}}, {'id': 'index:sample_content/0/449eccc1b30615316ab136bc', 'relevance': 5.241472721415711, 'source': 'sample_content', 'fields': {'body': 'The relation between various types of fiber and oral, pharyngeal and esophageal cancer was investigated using data from a case-control study conducted between 1992 and 1997 in Italy. Cases were 271 hospital patients with incident, histologically confirmed oral cancer, 327 with pharyngeal cancer and 304 with esophageal cancer. Controls were 1,950 subjects admitted to the same network of hospitals as the cases for acute, nonneoplastic diseases. Cases and controls were interviewed during their hospital stay using a validated food frequency questionnaire. Odds ratios (OR) were computed after allowance for age, sex, and other potential confounding factors, including alcohol, tobacco consumption, and energy intake. The ORs for the highest vs. the lowest quintile of intake of oral, pharyngeal and esophageal cancer combined were 0.40 for total (Englyst) fiber, 0.37 for soluble fiber, 0.52 for cellulose, 0.48 for insoluble non cellulose polysaccharide, 0.33 for total insoluble fiber and 0.38 for lignin. The inverse relation were similar for vegetable fiber (OR = 0.51), fruit fiber (OR = 0.60) and grain fiber (OR = 0.56), and were somewhat stronger for oral and pharyngeal cancer than for esophageal cancer. The ORs were similar for the two sexes and strata of age, education, alcohol and tobacco consumption, and total non-alcohol energy intake. 
Our study indicates that fiber intake may have a protective role on oral, pharyngeal and esophageal cancer.', 'title': 'Fiber intake and the risk of oral, pharyngeal and esophageal cancer.'}}, {'id': 'index:sample_content/0/c4cb3b969a89b81a3da71e9d', 'relevance': 5.0658599969730735, 'source': 'sample_content', 'fields': {'body': 'BACKGROUND & AIMS: Increasing evidence suggests that a low folate intake and impaired folate metabolism may be implicated in the development of gastrointestinal cancers. We conducted a systematic review with meta-analysis of epidemiologic studies evaluating the association of folate intake or genetic polymorphisms in 5,10-methylenetetrahydrofolate reductase (MTHFR), a central enzyme in folate metabolism, with risk of esophageal, gastric, or pancreatic cancer. METHODS: A literature search was performed using MEDLINE for studies published through March 2006. Study-specific relative risks were weighted by the inverse of their variance to obtain random-effects summary estimates. RESULTS: The summary relative risks for the highest versus the lowest category of dietary folate intake were 0.66 (95% confidence interval [CI], 0.53-0.83) for esophageal squamous cell carcinoma (4 case-control), 0.50 (95% CI, 0.39-0.65) for esophageal adenocarcinoma (3 case-control), and 0.49 (95% CI, 0.35-0.67) for pancreatic cancer (1 case-control, 4 cohort); there was no heterogeneity among studies. Results on dietary folate intake and risk of gastric cancer (9 case-control, 2 cohort) were inconsistent. In most studies, the MTHFR 677TT (variant) genotype, which is associated with reduced enzyme activity, was associated with an increased risk of esophageal squamous cell carcinoma, gastric cardia adenocarcinoma, noncardia gastric cancer, gastric cancer (all subsites), and pancreatic cancer; all but one of 22 odds ratios were >1, of which 13 estimates were statistically significant. Studies of the MTHFR A1298C polymorphism were limited and inconsistent. 
CONCLUSIONS: These findings support the hypothesis that folate may play a role in carcinogenesis of the esophagus, stomach, and pancreas.', 'title': 'Folate intake, MTHFR polymorphisms, and risk of esophageal, gastric, and pancreatic cancer: a meta-analysis.'}}, {'id': 'index:sample_content/0/bb0fe2bd511527ef78587e95', 'relevance': 4.780565525377517, 'source': 'sample_content', 'fields': {'body': 'Individual-based studies that investigated the relation between dietary alpha-linolenic acid (ALA) intake and prostate cancer risk have shown inconsistent results. We carried out a meta-analysis of prospective studies to examine this association. We systematically searched studies published up to December 2008. Log relative risks (RRs) were weighted by the inverse of their variances to obtain a pooled estimate with its 95% confidence interval (CI). We identified five prospective studies that met our inclusion criteria and reported risk estimates by categories of ALA intake. Comparing the highest to the lowest ALA intake category, the pooled RR was 0.97 (95% CI:0.86-1.10) but the association was heterogeneous. Using the reported numbers of cases and non-cases in each category of ALA intake, we found that subjects who consumed more than 1.5 g/day of ALA compared with subjects who consumed less than 1.5 g/day had a significant decreased risk of prostate cancer: RR = 0.95 (95% CI:0.91-0.99). Divergences in results could partly be explained by differences in sample sizes and adjustment but they also highlight limits in dietary ALA assessment in such prospective studies. 
Our findings support a weak protective association between dietary ALA intake and prostate cancer risk but further research is needed to conclude on this question.', 'title': 'Prospective studies of dietary alpha-linolenic acid intake and prostate cancer risk: a meta-analysis.'}}, {'id': 'index:sample_content/0/90efd2c6652f323a8244690d', 'relevance': 4.7044749035958535, 'source': 'sample_content', 'fields': {'body': 'High serum levels of testosterone and estradiol, the bioavailability of which may be increased by Western dietary habits, seem to be important risk factors for postmenopausal breast cancer. We hypothesized that an ad libitum diet low in animal fat and refined carbohydrates and rich in low-glycemic-index foods, monounsaturated and n-3 polyunsaturated fatty acids, and phytoestrogens, might favorably modify the hormonal profile of postmenopausal women. One hundred and four postmenopausal women selected from 312 healthy volunteers on the basis of high serum testosterone levels were randomized to dietary intervention or control. The intervention included intensive dietary counseling and specially prepared group meals twice a week over 4.5 months. Changes in serum levels of testosterone, estradiol, and sex hormone-binding globulin were the main outcome measures. In the intervention group, sex hormone-binding globulin increased significantly (from 36.0 to 45.1 nmol/liter) compared with the control group (25 versus 4%,; P < 0.0001) and serum testosterone decreased (from 0.41 to 0.33 ng/ml; -20 versus -7% in control group; P = 0.0038). Serum estradiol also decreased, but the change was not significant. The dietary intervention group also significantly decreased body weight (4.06 kg versus 0.54 kg in the control group), waist:hip ratio, total cholesterol, fasting glucose level, and area under insulin curve after oral glucose tolerance test. 
A radical modification in diet designed to reduce insulin resistance and also involving increased phytoestrogen intake decreases the bioavailability of serum sex hormones in hyperandrogenic postmenopausal women. Additional studies are needed to determine whether such effects can reduce the risk of developing breast cancer.', 'title': 'Reducing bioavailable sex hormones through a comprehensive change in diet: the diet and androgens (DIANA) randomized trial.'}}, {'id': 'index:sample_content/0/9b56be58163850a7b2ee2425', 'relevance': 3.896398317302996, 'source': 'sample_content', 'fields': {'body': 'Breast pain is a common condition affecting most women at some stage in their reproductive life. Mastalgia is resistant to treatment in 6% of cyclical and 26% non-cyclical patients. Surgery is not widely used to treat this condition and only considered in patients with severe mastalgia resistant to medication. The aims of this study were to audit the efficacy of surgery in severe treatment resistant mastalgia and to assess patient satisfaction following surgery. This is a retrospective review of the medical records of all patients seen in mastalgia clinic in the University Hospital of Wales, Cardiff since 1973. A postal questionnaire was distributed to all patients who had undergone surgery. Results showed that of the 1054 patients seen in mastalgia clinic, 12 (1.2%) had undergone surgery. Surgery included 8 subcutaneous mastectomies with implants (3 bilateral, 5 unilateral), 1 bilateral simple mastectomy and 3 quadrantectomies (1 having a further simple mastectomy). The median duration of symptoms was 6.5 years (range 2-16 years). Five patients (50%) were pain free following surgery, 3 developed capsular contractures and 2 wound infections with dehiscence. Pain persisted in both patients undergoing quadrantectomy. We conclude that surgery for mastalgia should only be considered in a minority of patients. 
Patients should be informed of possible complications inherent of reconstructive surgery and warned that in 50% cases their pain will not be improved.', 'title': 'Is there a role for surgery in the treatment of mastalgia?'}}]}}
```

## Query Performance[¶](#query-performance)

There are several things that impact end-to-end query performance:

- HTTP layer performance: connection handling, mutual TLS handshake, and network round-trip latency.
- Make sure to re-use connections using the context manager `with app.syncio():` to avoid setting up a new connection for every query. See [HTTP best practices](https://cloud.vespa.ai/en/http-best-practices).
- The size of the fields and the number of hits requested also greatly impact network performance; a larger payload means higher latency.
- By adding `"presentation.timing": True` as a request parameter, the Vespa response includes the server-side processing time (including reading the query from the network, but excluding delivering the result over the network). This can be handy for debugging latency.
- Vespa performance itself, determined by the features used inside the Vespa instance.

In \[10\]:

```
with app.syncio(connections=12) as session:
    response: VespaQueryResponse = session.query(
        hits=1,
        body={
            "yql": "select title, body from doc where userQuery()",
            "query": "Is statin use connected to breast cancer?",
            "ranking": "bm25",
            "presentation.timing": True,
        },
    )
    print(response.is_successful())
```

```
True
```

## Compressing queries[¶](#compressing-queries)

The `VespaSync` class has a `compress` argument that can be used to compress the query before sending it to Vespa. This can be useful when the query is large and/or the network is slow.
The compression is done using `gzip`, which Vespa supports. By default, the `compress` argument is set to `"auto"`, which means the request will be compressed if its size exceeds 1024 bytes. The `compress` argument can also be set to `True` or `False` to force compression on or off, respectively. Compression is applied to both queries and feed operations (HTTP POST or PUT requests).

In \[11\]:

```
import time

# Will not compress the request, as body is less than 1024 bytes
with app.syncio(connections=1, compress="auto") as session:
    response: VespaQueryResponse = session.query(
        hits=1,
        body={
            "yql": "select title, body from doc where userQuery()",
            "query": "Is statin use connected to breast cancer?",
            "ranking": "bm25",
            "presentation.timing": True,
        },
    )
    print(response.is_successful())

# Will compress, as the size of the body exceeds 1024 bytes.
large_body = {
    "yql": "select title, body from doc where userQuery()",
    "query": "Is statin use connected to breast cancer?",
    "input.query(q)": "asdf" * 10000,
    "ranking": "bm25",
    "presentation.timing": True,
}
compress_time = {}
with app.syncio(connections=1, compress=True) as session:
    start_time = time.time()
    response: VespaQueryResponse = session.query(
        hits=1,
        body=large_body,
    )
    end_time = time.time()
    compress_time["force_compression"] = end_time - start_time
    print(response.is_successful())

with app.syncio(connections=1, compress="auto") as session:
    start_time = time.time()
    response: VespaQueryResponse = session.query(
        hits=1,
        body=large_body,
    )
    end_time = time.time()
    compress_time["auto"] = end_time - start_time
    print(response.is_successful())

# Force no compression
with app.syncio(compress=False) as session:
    start_time = time.time()
    response: VespaQueryResponse = session.query(
        hits=1,
        body=large_body,
        timeout="5s",
    )
    end_time = time.time()
    compress_time["no_compression"] = end_time - start_time
    print(response.is_successful())
```

```
True
True
True
True
```

In \[12\]:

```
compress_time
```

Out\[12\]:

```
{'force_compression': 0.02625894546508789, 'auto': 0.013608932495117188, 'no_compression': 0.009457826614379883}
```

The differences become more significant the larger the body and the slower the network. It may be worth running a proper benchmark if query performance is critical for your application.
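To get a feel for why `"auto"` only compresses bodies above the 1024-byte threshold, here is a small standalone sketch (plain `gzip` and `json` from the standard library, independent of pyvespa; the example bodies mirror the ones above) comparing raw and gzip-compressed payload sizes:

```python
import gzip
import json

# Illustrative only: compare raw vs. gzip-compressed payload sizes.
# A tiny body gains little from compression (while still costing CPU),
# whereas a large, repetitive body shrinks dramatically.
small_body = {
    "yql": "select title, body from doc where userQuery()",
    "query": "Is statin use connected to breast cancer?",
}
large_body = {**small_body, "input.query(q)": "asdf" * 10000}

for name, body in [("small", small_body), ("large", large_body)]:
    raw = json.dumps(body).encode("utf-8")
    compressed = gzip.compress(raw)
    print(f"{name}: {len(raw)} bytes raw -> {len(compressed)} bytes gzipped")
```

The small body stays well under the 1024-byte threshold, so skipping compression avoids pointless CPU work; the large body compresses to a small fraction of its raw size, which is where compression pays off on slow networks.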
## Running Queries asynchronously[¶](#running-queries-asynchronously)

If you want to benchmark the capacity of a Vespa application, we suggest using [vespa-fbench](https://docs.vespa.ai/en/performance/vespa-benchmarking.html#vespa-fbench), a load generator tool that lets you measure throughput and latency with a predefined number of clients. Vespa-fbench is not Vespa-specific and can be used to benchmark any HTTP service. Another option is the open-source [k6](https://k6.io/) load-testing tool.

If you want to run multiple queries from pyvespa, we suggest the convenience method `Vespa.query_many_async()`, which runs multiple queries in parallel using the async client. Below, we demonstrate a simple example of running 50 queries in parallel, capturing both the server-reported times and the client-reported time (which includes network latency).

In \[13\]:

```
# This cell is necessary when running async code in Jupyter Notebooks, as it already runs an event loop
import nest_asyncio

nest_asyncio.apply()
```

In \[14\]:
``` import time query = { "yql": "select title, body from doc where userQuery()", "query": "Is statin use connected to breast cancer?", "ranking": "bm25", "presentation.timing": True, } # List of queries with hits from 1 to 50 queries = [{**query, "hits": hits} for hits in range(1, 51)] # Run the queries concurrently start_time = time.time() responses = await app.query_many_async(queries=queries) end_time = time.time() print(f"Total time: {end_time - start_time:.2f} seconds") # Print QPS print(f"QPS: {len(queries) / (end_time - start_time):.2f}") ``` import time query = { "yql": "select title, body from doc where userQuery()", "query": "Is statin use connected to breast cancer?", "ranking": "bm25", "presentation.timing": True, } # List of queries with hits from 1 to 50 queries = [{\*\*query, "hits": hits} for hits in range(1, 51)] # Run the queries concurrently start_time = time.time() responses = await app.query_many_async(queries=queries) end_time = time.time() print(f"Total time: {end_time - start_time:.2f} seconds") # Print QPS print(f"QPS: {len(queries) / (end_time - start_time):.2f}") ``` Total time: 0.68 seconds QPS: 73.49 ``` In \[15\]: Copied! ``` dict_responses = [response.json for response in responses] ``` dict_responses = [response.json for response in responses] In \[16\]: Copied! 
``` # Create a pandas DataFrame with the responses import pandas as pd df = pd.DataFrame( [ { "hits": len( response.get("root", {}).get("children", []) ), # Some responses may not have 'children' "search_time": response["timing"]["searchtime"], "query_time": response["timing"]["querytime"], "summary_time": response["timing"]["summaryfetchtime"], } for response in dict_responses ] ) df ``` # Create a pandas DataFrame with the responses import pandas as pd df = pd.DataFrame( \[ { "hits": len( response.get("root", {}).get("children", []) ), # Some responses may not have 'children' "search_time": response["timing"]["searchtime"], "query_time": response["timing"]["querytime"], "summary_time": response["timing"]["summaryfetchtime"], } for response in dict_responses \] ) df Out\[16\]: | | hits | search_time | query_time | summary_time | | --- | ---- | ----------- | ---------- | ------------ | | 0 | 1 | 0.006 | 0.003 | 0.002 | | 1 | 2 | 0.014 | 0.006 | 0.006 | | 2 | 3 | 0.046 | 0.024 | 0.019 | | 3 | 4 | 0.037 | 0.015 | 0.010 | | 4 | 5 | 0.468 | 0.035 | 0.422 | | 5 | 6 | 0.199 | 0.014 | 0.177 | | 6 | 7 | 0.018 | 0.008 | 0.009 | | 7 | 8 | 0.041 | 0.012 | 0.025 | | 8 | 9 | 0.103 | 0.018 | 0.082 | | 9 | 10 | 0.288 | 0.022 | 0.265 | | 10 | 11 | 0.568 | 0.015 | 0.544 | | 11 | 12 | 0.507 | 0.026 | 0.480 | | 12 | 13 | 0.470 | 0.012 | 0.457 | | 13 | 14 | 0.566 | 0.025 | 0.535 | | 14 | 15 | 0.566 | 0.027 | 0.534 | | 15 | 16 | 0.213 | 0.018 | 0.194 | | 16 | 17 | 0.564 | 0.010 | 0.549 | | 17 | 18 | 0.543 | 0.025 | 0.516 | | 18 | 19 | 0.545 | 0.016 | 0.520 | | 19 | 20 | 0.329 | 0.017 | 0.308 | | 20 | 21 | 0.413 | 0.010 | 0.396 | | 21 | 22 | 0.088 | 0.010 | 0.078 | | 22 | 23 | 0.418 | 0.019 | 0.382 | | 23 | 24 | 0.401 | 0.021 | 0.379 | | 24 | 25 | 0.348 | 0.013 | 0.334 | | 25 | 26 | 0.554 | 0.020 | 0.527 | | 26 | 27 | 0.532 | 0.204 | 0.322 | | 27 | 28 | 0.550 | 0.023 | 0.524 | | 28 | 29 | 0.211 | 0.005 | 0.202 | | 29 | 30 | 0.524 | 0.312 | 0.208 | | 30 | 31 | 0.440 | 0.016 | 0.422 | | 
31 | 32 | 0.537 | 0.459 | 0.075 | | 32 | 33 | 0.532 | 0.285 | 0.232 | | 33 | 34 | 0.397 | 0.024 | 0.371 | | 34 | 35 | 0.398 | 0.046 | 0.345 | | 35 | 36 | 0.555 | 0.036 | 0.512 | | 36 | 37 | 0.545 | 0.009 | 0.525 | | 37 | 38 | 0.145 | 0.018 | 0.116 | | 38 | 39 | 0.418 | 0.022 | 0.394 | | 39 | 40 | 0.373 | 0.013 | 0.359 | | 40 | 41 | 0.426 | 0.044 | 0.381 | | 41 | 0 | 0.446 | 0.446 | 0.000 | | 42 | 43 | 0.292 | 0.014 | 0.267 | | 43 | 44 | 0.383 | 0.027 | 0.344 | | 44 | 45 | 0.422 | 0.012 | 0.409 | | 45 | 46 | 0.515 | 0.034 | 0.475 | | 46 | 47 | 0.518 | 0.039 | 0.475 | | 47 | 48 | 0.504 | 0.007 | 0.493 | | 48 | 49 | 0.505 | 0.012 | 0.488 | | 49 | 50 | 0.517 | 0.007 | 0.500 | ## Error handling[¶](#error-handling) Vespa's default query timeout is 500ms. By default, pyvespa retries up to 3 times for queries that return response codes such as 429, 500, 503 and 504. A `VespaError` is raised if all retries fail. In the following example, we set a very low [timeout](https://docs.vespa.ai/en/reference/query-api-reference.html#timeout) of `1ms`, which causes Vespa to time out the request and return a 504 HTTP error code. The underlying error is wrapped in a `VespaError` with the error message returned from Vespa: In \[17\]: Copied! 
``` with app.syncio(connections=12) as session: try: response: VespaQueryResponse = session.query( hits=1, body={ "yql": "select * from doc where userQuery()", "query": "Is statin use connected to breast cancer?", "timeout": "1ms", }, ) print(response.is_successful()) except VespaError as e: print(str(e)) ``` with app.syncio(connections=12) as session: try: response: VespaQueryResponse = session.query( hits=1, body={ "yql": "select * from doc where userQuery()", "query": "Is statin use connected to breast cancer?", "timeout": "1ms", }, ) print(response.is_successful()) except VespaError as e: print(str(e)) ``` [{'code': 12, 'summary': 'Timed out', 'message': 'No time left after waiting for 1ms to execute query'}] ``` In the following example, we forgot to include the `query` parameter but still reference it in the yql. This causes a bad client request response (400): In \[18\]: Copied! ``` with app.syncio(connections=12) as session: try: response: VespaQueryResponse = session.query( hits=1, body={"yql": "select * from doc where userQuery()"} ) print(response.is_successful()) except VespaError as e: print(str(e)) ``` with app.syncio(connections=12) as session: try: response: VespaQueryResponse = session.query( hits=1, body={"yql": "select * from doc where userQuery()"} ) print(response.is_successful()) except VespaError as e: print(str(e)) ``` [{'code': 3, 'summary': 'Illegal query', 'source': 'sample_content', 'message': 'No query'}] ``` ## Using the Querybuilder DSL API[¶](#using-the-querybuilder-dsl-api) From `pyvespa>=0.52.0`, we provide a Domain Specific Language (DSL) that allows you to build queries programmatically in the `vespa.querybuilder`-module. See [reference](https://vespa-engine.github.io/pyvespa/api/vespa/querybuilder/builder/builder.md) for full details. There are also many examples in our tests. This section demonstrates common query patterns using the querybuilder DSL. 
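The DSL renders string values as quoted YQL literals for you. If you instead interpolate user input into a hand-written YQL string, escaping matters. A minimal sketch of safe interpolation; the `yql_escape` and `contains_clause` helpers are our own, not part of pyvespa:

```python
def yql_escape(term: str) -> str:
    """Escape backslashes and double quotes for use inside a YQL string literal."""
    return term.replace("\\", "\\\\").replace('"', '\\"')


def contains_clause(field: str, term: str) -> str:
    """Build a `field contains "term"` clause with the term safely escaped."""
    return f'{field} contains "{yql_escape(term)}"'


# A user-supplied term containing quotes no longer breaks the query string
print(contains_clause("title", 'statin "use"'))
```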
All features of the Vespa Query Language are supported by the querybuilder DSL. Using the Querybuilder DSL is completely optional, and you can always use the Vespa Query Language directly by passing the query as a string, which might be more convenient for simple queries. We will use the Vespa documentation search app for some advanced examples that require specific schemas. For basic examples, our local sample app works well. In \[20\]: Copied! ``` # Example using QueryBuilder with our sample app import vespa.querybuilder as qb from vespa.querybuilder import QueryField title = QueryField("title") body = QueryField("body") # Build a query to find documents containing "asthma" in title or body q = ( qb.select(["title", "body"]) .from_("doc") .where(title.contains("asthma") | body.contains("asthma")) .set_limit(5) ) print(f"Query: {q}") with app.syncio() as session: resp = session.query(yql=q, ranking="bm25") print(f"Found {len(resp.hits)} documents") for hit in resp.hits: print(f"- {hit['fields']['title']}") ``` # Example using QueryBuilder with our sample app import vespa.querybuilder as qb from vespa.querybuilder import QueryField title = QueryField("title") body = QueryField("body") # Build a query to find documents containing "asthma" in title or body q = ( qb.select(["title", "body"]) .from\_("doc") .where(title.contains("asthma") | body.contains("asthma")) .set_limit(5) ) print(f"Query: {q}") with app.syncio() as session: resp = session.query(yql=q, ranking="bm25") print(f"Found {len(resp.hits)} documents") for hit in resp.hits: print(f"- {hit['fields']['title']}") ``` Query: select title, body from doc where title contains "asthma" or body contains "asthma" limit 5 Found 0 documents ``` In \[24\]: Copied! 
``` app = Vespa(url="https://api.search.vespa.ai") ``` app = Vespa(url="https://api.search.vespa.ai") ### Advanced QueryBuilder Examples[¶](#advanced-querybuilder-examples) For the following advanced examples, we'll switch to using Vespa's documentation search app, which has more complex schemas. The cell above connects `app` to that application. ### Example 1 - matches, order by and limit[¶](#example-1-matches-order-by-and-limit) We want to find the 10 documents with the most terms in the 'pyvespa'-namespace (the documentation search has a 'namespace'-field, which refers to the source of the documentation). Note that the documentation search operates on the 'paragraph'-schema, but for demo purposes, we will use the 'document'-schema. In \[25\]: Copied! ``` import vespa.querybuilder as qb from vespa.querybuilder import QueryField namespace = QueryField("namespace") q = ( qb.select(["title", "path", "term_count"]) .from_("doc") .where( namespace.matches("pyvespa") ) # matches is regex-match, see https://docs.vespa.ai/en/reference/query-language-reference.html#matches .order_by("term_count", ascending=False) .set_limit(10) ) print(f"Query: {q}") resp = app.query(yql=q) results = [hit["fields"] for hit in resp.hits] df = pd.DataFrame(results) df ``` import vespa.querybuilder as qb from vespa.querybuilder import QueryField namespace = QueryField("namespace") q = ( qb.select(["title", "path", "term_count"]) .from\_("doc") .where( namespace.matches("pyvespa") ) # matches is regex-match, see https://docs.vespa.ai/en/reference/query-language-reference.html#matches .order_by("term_count", ascending=False) .set_limit(10) ) print(f"Query: {q}") resp = app.query(yql=q) results = \[hit["fields"] for hit in resp.hits\] df = pd.DataFrame(results) df ``` Query: select title, path, term_count from doc where namespace matches "pyvespa" order by term_count desc limit 10 ``` Out\[25\]: | | path | title | term_count | | --- | ------------------------------------------------- | 
-------------------------------------------------- | ---------- | | 0 | /examples/feed_performance.html | Feeding performance¶ | 76669 | | 1 | /examples/simplified-retrieval-with-colpali-vl... | Scaling ColPALI (VLM) Retrieval¶ | 14393 | | 2 | /examples/pdf-retrieval-with-ColQwen2-vlm_Vesp... | PDF-Retrieval using ColQWen2 (ColPali) with Ve... | 14309 | | 3 | /examples/colpali-document-retrieval-vision-la... | Vespa 🤝 ColPali: Efficient Document Retrieval ... | 13996 | | 4 | /examples/colpali-benchmark-vqa-vlm_Vespa-clou... | ColPali Ranking Experiments on DocVQA¶ | 13692 | | 5 | /examples/visual_pdf_rag_with_vespa_colpali_cl... | Visual PDF RAG with Vespa - ColPali demo appli... | 8237 | | 6 | /examples/billion-scale-vector-search-with-coh... | Billion-scale vector search with Cohere binary... | 7880 | | 7 | /examples/video_search_twelvelabs_cloud.html | Video Search and Retrieval with Vespa and Twel... | 7605 | | 8 | /examples/chat_with_your_pdfs_using_colbert_la... | Chat with your pdfs with ColBERT, langchain, a... | 7501 | | 9 | /api/vespa/package.html | Package | 6059 | ### Example 2 - timestamp range, contains[¶](#example-2-timestamp-range-contains) We want to find the documents where one of the indexed fields contains the query term `embedding` and that were updated between Jan 1st 2024 and the current timestamp, and have the documents ranked with the 'documentation' rank profile. In \[26\]: Copied! 
``` import vespa.querybuilder as qb from vespa.querybuilder import QueryField from datetime import datetime queryterm = "embedding" # We need to instantiate a QueryField for fields that we want to call methods on last_updated = QueryField("last_updated") title = QueryField("title") headers = QueryField("headers") path = QueryField("path") namespace = QueryField("namespace") content = QueryField("content") from_ts = int(datetime(2024, 1, 1).timestamp()) to_ts = int(datetime.now().timestamp()) print(f"From: {from_ts}, To: {to_ts}") q = ( qb.select( [title, last_updated, content] ) # Select takes either a list of QueryField or strings, (or '*' for all fields) .from_("doc") .where( namespace.matches("op.*") & last_updated.in_range(from_ts, to_ts) # could also use > and < & qb.weakAnd( title.contains(queryterm), content.contains(queryterm), headers.contains(queryterm), path.contains(queryterm), ) ) .set_limit(3) ) print(f"Query: {q}") resp = app.query(yql=q, ranking="documentation") ``` import vespa.querybuilder as qb from vespa.querybuilder import QueryField from datetime import datetime queryterm = "embedding" # We need to instantiate a QueryField for fields that we want to call methods on last_updated = QueryField("last_updated") title = QueryField("title") headers = QueryField("headers") path = QueryField("path") namespace = QueryField("namespace") content = QueryField("content") from_ts = int(datetime(2024, 1, 1).timestamp()) to_ts = int(datetime.now().timestamp()) print(f"From: {from_ts}, To: {to_ts}") q = ( qb.select( [title, last_updated, content] ) # Select takes either a list of QueryField or strings, (or '\*' for all fields) .from\_("doc") .where( namespace.matches("op.\*") & last_updated.in_range(from_ts, to_ts) # could also use > and < & qb.weakAnd( title.contains(queryterm), content.contains(queryterm), headers.contains(queryterm), path.contains(queryterm), ) ) .set_limit(3) ) print(f"Query: {q}") resp = app.query(yql=q, ranking="documentation") ``` From: 
1704063600, To: 1749803562 Query: select title, last_updated, content from doc where namespace matches "op.*" and range(last_updated, 1704063600, 1749803562) and weakAnd(title contains "embedding", content contains "embedding", headers contains "embedding", path contains "embedding") limit 3 ``` In \[27\]: Copied! ``` df = pd.DataFrame([hit["fields"] | hit for hit in resp.hits]) df = pd.concat( [ df.drop(["matchfeatures", "fields"], axis=1), pd.json_normalize(df["matchfeatures"]), ], axis=1, ) df.T ``` df = pd.DataFrame(\[hit["fields"] | hit for hit in resp.hits\]) df = pd.concat( \[ df.drop(["matchfeatures", "fields"], axis=1), pd.json_normalize(df["matchfeatures"]), \], axis=1, ) df.T Out\[27\]: | | 0 | 1 | 2 | | --------------------------- | ------------------------------------------------- | ------------------------------------------------- | ------------------------------------------------- | | content | similar data by finding nearby points i... | Reference configuration for embedders... | basic news search application - applic... | | title | Embedding | Embedding Reference | News search and recommendation tutorial - embe... 
| | last_updated | 1749727838 | 1749727838 | 1749727839 | | id | index:documentation/0/5d6e77ca20d4e8ee29716747 | index:documentation/1/a03c4aef22fcde916804d3d9 | index:documentation/1/ad44f35cbd7b8214f88963e3 | | relevance | 23.259617 | 22.075122 | 16.505077 | | source | documentation | documentation | documentation | | bm25(content) | 2.385057 | 2.352575 | 2.384316 | | bm25(headers) | 7.476571 | 8.106656 | 5.46136 | | bm25(keywords) | 0.0 | 0.0 | 0.0 | | bm25(path) | 3.990027 | 3.349312 | 3.100325 | | bm25(title) | 4.703981 | 4.133289 | 2.779538 | | fieldLength(content) | 3825.0 | 2031.0 | 3273.0 | | fieldLength(title) | 1.0 | 2.0 | 6.0 | | fieldMatch(content) | 0.915753 | 0.892113 | 0.915871 | | fieldMatch(content).matches | 1.0 | 1.0 | 1.0 | | fieldMatch(title) | 1.0 | 0.933869 | 0.842758 | | query(contentWeight) | 1.0 | 1.0 | 1.0 | | query(headersWeight) | 1.0 | 1.0 | 1.0 | | query(pathWeight) | 1.0 | 1.0 | 1.0 | | query(titleWeight) | 2.0 | 2.0 | 2.0 | ### Example 3 - Basic grouping[¶](#example-3-basic-grouping) Vespa supports grouping and aggregation of matches through the [Vespa grouping language](https://docs.vespa.ai/en/grouping.html). We will use the [purchase schema](https://github.com/vespa-cloud/vespa-documentation-search/blob/main/src/main/application/schemas/purchase.sd) that is also deployed in the documentation search app. In \[28\]: Copied! 
``` from vespa.querybuilder import Grouping as G grouping = G.all( G.group("customer"), G.each(G.output(G.sum("price"))), ) q = qb.select("*").from_("purchase").where(True).set_limit(0).groupby(grouping) print(f"Query: {q}") resp = app.query(yql=q) group = resp.hits[0]["children"][0]["children"] # get value and sum(price) into a DataFrame df = pd.DataFrame([hit["fields"] | hit for hit in group]) df = df.loc[:, ["value", "sum(price)"]] df ``` from vespa.querybuilder import Grouping as G grouping = G.all( G.group("customer"), G.each(G.output(G.sum("price"))), ) q = qb.select("\*").from\_("purchase").where(True).set_limit(0).groupby(grouping) print(f"Query: {q}") resp = app.query(yql=q) group = resp.hits[0]["children"][0]["children"] # get value and sum(price) into a DataFrame df = pd.DataFrame(\[hit["fields"] | hit for hit in group\]) df = df.loc\[:, ["value", "sum(price)"]\] df ``` Query: select * from purchase where true limit 0 | all(group(customer) each(output(sum(price)))) ``` Out\[28\]: | | value | sum(price) | | --- | ----- | ---------- | | 0 | Brown | 20537 | | 1 | Jones | 39816 | | 2 | Smith | 19484 | ### Example 4 - Nested grouping[¶](#example-4-nested-grouping) Let's find out how much each customer has spent per day by grouping on customer, then date: In \[29\]: Copied! 
``` from vespa.querybuilder import Grouping as G # First, we construct the grouping expression: grouping = G.all( G.group("customer"), G.each( G.group(G.time_date("date")), G.each( G.output(G.sum("price")), ), ), ) # Then, we construct the query: q = qb.select("*").from_("purchase").where(True).groupby(grouping) print(f"Query: {q}") resp = app.query(yql=q) group_data = resp.hits[0]["children"][0]["children"] records = [ { "GroupId": group["value"], "Date": date_entry["value"], "Sum(price)": date_entry["fields"].get("sum(price)", 0), } for group in group_data for date_group in group.get("children", []) for date_entry in date_group.get("children", []) ] # Create DataFrame df = pd.DataFrame(records) df ``` from vespa.querybuilder import Grouping as G # First, we construct the grouping expression: grouping = G.all( G.group("customer"), G.each( G.group(G.time_date("date")), G.each( G.output(G.sum("price")), ), ), ) # Then, we construct the query: q = qb.select("\*").from\_("purchase").where(True).groupby(grouping) print(f"Query: {q}") resp = app.query(yql=q) group_data = resp.hits[0]["children"][0]["children"] records = \[ { "GroupId": group["value"], "Date": date_entry["value"], "Sum(price)": date_entry["fields"].get("sum(price)", 0), } for group in group_data for date_group in group.get("children", []) for date_entry in date_group.get("children", []) \] # Create DataFrame df = pd.DataFrame(records) df ``` Query: select * from purchase where true | all(group(customer) each(group(time.date(date)) each(output(sum(price))))) ``` Out\[29\]: | | GroupId | Date | Sum(price) | | --- | ------- | --------- | ---------- | | 0 | Brown | 2006-9-10 | 7540 | | 1 | Brown | 2006-9-11 | 1597 | | 2 | Brown | 2006-9-8 | 8000 | | 3 | Brown | 2006-9-9 | 3400 | | 4 | Jones | 2006-9-10 | 8900 | | 5 | Jones | 2006-9-11 | 20816 | | 6 | Jones | 2006-9-8 | 8000 | | 7 | Jones | 2006-9-9 | 2100 | | 8 | Smith | 2006-9-10 | 6100 | | 9 | Smith | 2006-9-11 | 2584 | | 10 | Smith | 2006-9-6 | 1000 | | 
11 | Smith | 2006-9-7 | 3000 | | 12 | Smith | 2006-9-9 | 6800 | ### Example 5 - Grouping with expressions[¶](#example-5-grouping-with-expressions) Instead of just grouping on some attribute value, the group clause may contain arbitrarily complex expressions - see [Grouping reference](https://vespa-engine.github.io/pyvespa/api/vespa/querybuilder/grouping/grouping.md#vespa.querybuilder.Grouping) for an exhaustive list. Examples: - Select the minimum or maximum of sub-expressions - Addition, subtraction, multiplication, division, and even modulo of sub-expressions - Bitwise operations on sub-expressions - Concatenation of the results of sub-expressions Let's use some of these expressions to sum the prices of purchases on a per-hour-of-day basis. In \[30\]: Copied! ``` from vespa.querybuilder import Grouping as G grouping = G.all( G.group(G.mod(G.div("date", G.mul(60, 60)), 24)), G.order(-G.sum("price")), G.each(G.output(G.sum("price"))), ) q = qb.select("*").from_("purchase").where(True).groupby(grouping) print(f"Query: {q}") resp = app.query(yql=q) group_data = resp.hits[0]["children"][0]["children"] df = pd.DataFrame([hit["fields"] | hit for hit in group_data]) df = df.loc[:, ["value", "sum(price)"]] df ``` from vespa.querybuilder import Grouping as G grouping = G.all( G.group(G.mod(G.div("date", G.mul(60, 60)), 24)), G.order(-G.sum("price")), G.each(G.output(G.sum("price"))), ) q = qb.select("\*").from\_("purchase").where(True).groupby(grouping) print(f"Query: {q}") resp = app.query(yql=q) group_data = resp.hits[0]["children"][0]["children"] df = pd.DataFrame(\[hit["fields"] | hit for hit in group_data\]) df = df.loc\[:, ["value", "sum(price)"]\] df ``` Query: select * from purchase where true | all(group(mod(div(date, mul(60, 60)),24)) order(-sum(price)) each(output(sum(price)))) ``` Out\[30\]: | | value | sum(price) | | --- | ----- | ---------- | | 0 | 10 | 26181 | | 1 | 9 | 23524 | | 2 | 8 | 22367 | | 3 | 11 | 6765 | | 4 | 7 | 1000 | # Read and write 
operations[¶](#read-and-write-operations) This notebook documents ways to feed, get, update and delete data: - Using a `with` context manager to manage resources efficiently - Feeding streams of data using `feed_iterable`, which can feed from streams, Iterables, Lists and files by using generators Refer to [troubleshooting](https://vespa-engine.github.io/pyvespa/troubleshooting.md) for any problem when running this guide. ## Deploy a sample application[¶](#deploy-a-sample-application) [Install pyvespa](https://pyvespa.readthedocs.io/) and start Docker, validating that a minimum of 4 GB of memory is available: In \[1\]: Copied! ``` !docker info | grep "Total Memory" ``` !docker info | grep "Total Memory" Define a simple application package with five fields: In \[1\]: Copied! ``` from vespa.application import ApplicationPackage from vespa.package import Schema, Document, Field, FieldSet, HNSW, RankProfile app_package = ApplicationPackage( name="vector", schema=[ Schema( name="doc", document=Document( fields=[ Field(name="id", type="string", indexing=["attribute", "summary"]), Field( name="title", type="string", indexing=["index", "summary"], index="enable-bm25", ), Field( name="body", type="string", indexing=["index", "summary"], index="enable-bm25", ), Field( name="popularity", type="float", indexing=["attribute", "summary"], ), Field( name="embedding", type="tensor(x[1536])", indexing=["attribute", "summary", "index"], ann=HNSW( distance_metric="innerproduct", max_links_per_node=16, neighbors_to_explore_at_insert=128, ), ), ] ), fieldsets=[FieldSet(name="default", fields=["title", "body"])], rank_profiles=[ RankProfile( name="default", inputs=[("query(q)", "tensor(x[1536])")], first_phase="closeness(field, embedding)", ) ], ) ], ) ``` from vespa.application import ApplicationPackage from vespa.package import Schema, Document, Field, FieldSet, HNSW, RankProfile app_package = ApplicationPackage( name="vector", schema=\[ Schema( name="doc", document=Document( fields=\[ Field(name="id", 
type="string", indexing=["attribute", "summary"]), Field( name="title", type="string", indexing=["index", "summary"], index="enable-bm25", ), Field( name="body", type="string", indexing=["index", "summary"], index="enable-bm25", ), Field( name="popularity", type="float", indexing=["attribute", "summary"], ), Field( name="embedding", type="tensor(x[1536])", indexing=["attribute", "summary", "index"], ann=HNSW( distance_metric="innerproduct", max_links_per_node=16, neighbors_to_explore_at_insert=128, ), ), \] ), fieldsets=\[FieldSet(name="default", fields=["title", "body"])\], rank_profiles=\[ RankProfile( name="default", inputs=\[("query(q)", "tensor(x[1536])")\], first_phase="closeness(field, embedding)", ) \], ) \], ) In \[2\]: Copied! ``` from vespa.deployment import VespaDocker vespa_docker = VespaDocker() app = vespa_docker.deploy(application_package=app_package) ``` from vespa.deployment import VespaDocker vespa_docker = VespaDocker() app = vespa_docker.deploy(application_package=app_package) ``` Waiting for configuration server, 0/60 seconds... Waiting for configuration server, 5/60 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 0/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 5/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 10/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 15/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 20/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Application is up! Finished deployment. 
``` ## Feed data by streaming over Iterable type[¶](#feed-data-by-streaming-over-iterable-type) This example notebook uses the [dbpedia-entities-openai-1M](https://huggingface.co/datasets/KShivendu/dbpedia-entities-openai-1M) dataset (1M OpenAI Embeddings (1536 dimensions) from June 2023). The `streaming=True` option allows paging the data on demand from HF S3. This is extremely useful for large datasets, where the data does not fit in memory and downloading the entire dataset is not needed. Read more about [datasets stream](https://huggingface.co/docs/datasets/stream). In \[ \]: Copied! ``` from datasets import load_dataset dataset = load_dataset( "KShivendu/dbpedia-entities-openai-1M", split="train", streaming=True ).take(1000) ``` from datasets import load_dataset dataset = load_dataset( "KShivendu/dbpedia-entities-openai-1M", split="train", streaming=True ).take(1000) ### Converting dataset field names to Vespa schema field names[¶](#converting-to-dataset-field-names-to-vespa-schema-field-names) We need to convert the dataset field names to the configured Vespa schema field names. We do this with a simple lambda function. The map function does not page the data; the map step is performed lazily once we start iterating over the dataset. This allows chaining of map operations, where the lambda yields the next document. In \[4\]: Copied! ``` pyvespa_feed_format = dataset.map( lambda x: {"id": x["_id"], "fields": {"id": x["_id"], "embedding": x["openai"]}} ) ``` pyvespa_feed_format = dataset.map( lambda x: {"id": x["\_id"], "fields": {"id": x["\_id"], "embedding": x["openai"]}} ) Feed using [feed_iterable](https://vespa-engine.github.io/pyvespa/api/vespa/application.md#vespa.application.Vespa.feed_iterable) which accepts an `Iterable`. `feed_iterable` accepts a callback that is called for every single data operation, so we can check the result. If the result `is_successful()`, the operation is persisted and applied in Vespa. In \[5\]: Copied! 
``` from vespa.io import VespaResponse def callback(response: VespaResponse, id: str): if not response.is_successful(): print( f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}" ) ``` from vespa.io import VespaResponse def callback(response: VespaResponse, id: str): if not response.is_successful(): print( f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}" ) In \[6\]: Copied! ``` app.feed_iterable( iter=pyvespa_feed_format, schema="doc", namespace="benchmark", callback=callback, max_queue_size=4000, max_workers=16, max_connections=16, ) ``` app.feed_iterable( iter=pyvespa_feed_format, schema="doc", namespace="benchmark", callback=callback, max_queue_size=4000, max_workers=16, max_connections=16, ) ### Feeding with generators[¶](#feeding-with-generators) The above handled streaming data from a remote repo; we can also use generators or plain lists. In this example, we generate synthetic data using a generator function. In \[7\]: Copied! ``` def my_generator() -> dict: for i in range(1000): yield { "id": str(i), "fields": { "id": str(i), "title": "title", "body": "this is body", "popularity": 1.0, }, } ``` def my_generator() -> dict: for i in range(1000): yield { "id": str(i), "fields": { "id": str(i), "title": "title", "body": "this is body", "popularity": 1.0, }, } In \[8\]: Copied! ``` app.feed_iterable( iter=my_generator(), schema="doc", namespace="benchmark", callback=callback, max_queue_size=4000, max_workers=12, max_connections=12, ) ``` app.feed_iterable( iter=my_generator(), schema="doc", namespace="benchmark", callback=callback, max_queue_size=4000, max_workers=12, max_connections=12, ) ### Visiting[¶](#visiting) Visiting is a feature to efficiently get or process a set of documents, identified by a document selection expression. Visit yields multiple slices (run concurrently), each yielding responses (depending on the number of documents in each slice). 
This allows for custom handling of each response. Visiting can be useful for exporting data, for example for ML training or for migrating a Vespa application. In \[9\]: Copied! ``` all_docs = [] for slice in app.visit( content_cluster_name="vector_content", schema="doc", namespace="benchmark", selection="true", # Document selection - see https://docs.vespa.ai/en/reference/document-select-language.html slices=4, wanted_document_count=300, ): for response in slice: print(response.number_documents_retrieved) all_docs.extend(response.documents) ``` all_docs = [] for slice in app.visit( content_cluster_name="vector_content", schema="doc", namespace="benchmark", selection="true", # Document selection - see https://docs.vespa.ai/en/reference/document-select-language.html slices=4, wanted_document_count=300, ): for response in slice: print(response.number_documents_retrieved) all_docs.extend(response.documents) ``` 300 196 303 185 309 191 303 213 ``` In \[10\]: Copied! ``` len(all_docs) ``` len(all_docs) Out\[10\]: ``` 2000 ``` In \[11\]: Copied! ``` for slice in app.visit( content_cluster_name="vector_content", wanted_document_count=1000 ): for response in slice: print(response.number_documents_retrieved) ``` for slice in app.visit( content_cluster_name="vector_content", wanted_document_count=1000 ): for response in slice: print(response.number_documents_retrieved) ``` 190 189 226 205 184 214 202 181 217 192 ``` ### Updates[¶](#updates) Using a similar generator, we can update the fake data we added. This performs [partial updates](https://docs.vespa.ai/en/partial-updates.html), assigning the `popularity` field to have the value `2.0`. In \[12\]: Copied! ``` def my_update_generator() -> dict: for i in range(1000): yield {"id": str(i), "fields": {"popularity": 2.0}} ``` def my_update_generator() -> dict: for i in range(1000): yield {"id": str(i), "fields": {"popularity": 2.0}} In \[13\]: Copied! 
``` app.feed_iterable( iter=my_update_generator(), schema="doc", namespace="benchmark", callback=callback, operation_type="update", max_queue_size=4000, max_workers=12, max_connections=12, ) ``` app.feed_iterable( iter=my_update_generator(), schema="doc", namespace="benchmark", callback=callback, operation_type="update", max_queue_size=4000, max_workers=12, max_connections=12, ) ## Other update operations[¶](#other-update-operations) We can also perform other update operations; see [Vespa docs on reads and writes](https://docs.vespa.ai/en/reads-and-writes.html). To achieve this, we need to set the `auto_assign` parameter to `False` in the `feed_iterable` method (which will pass it on to the `update_data_point` method). In \[14\]: Copied! ``` def my_increment_generator() -> dict: for i in range(1000): yield {"id": str(i), "fields": {"popularity": {"increment": 1.0}}} ``` def my_increment_generator() -> dict: for i in range(1000): yield {"id": str(i), "fields": {"popularity": {"increment": 1.0}}} In \[15\]: Copied! ``` app.feed_iterable( iter=my_increment_generator(), schema="doc", namespace="benchmark", callback=callback, operation_type="update", max_queue_size=4000, max_workers=12, max_connections=12, auto_assign=False, ) ``` app.feed_iterable( iter=my_increment_generator(), schema="doc", namespace="benchmark", callback=callback, operation_type="update", max_queue_size=4000, max_workers=12, max_connections=12, auto_assign=False, ) We can now query the data. Notice how we use a `with` context manager to close the connection after the query. This avoids resource leakage and allows for reuse of connections. In this case, we only do a single query, so there is no need for more than one connection. Setting more connections will just increase connection-level overhead. In \[16\]: Copied! 
```python
from vespa.io import VespaQueryResponse

with app.syncio(connections=1):
    response: VespaQueryResponse = app.query(
        yql="select id from doc where popularity > 2.5", hits=0
    )
    print(response.number_documents_retrieved)
```

```
1000
```

### Deleting[¶](#deleting)

Delete all the synthetic data with a custom generator. Now we don't need the `fields` key:

```python
def my_delete_generator() -> dict:
    for i in range(1000):
        yield {"id": str(i)}


app.feed_iterable(
    iter=my_delete_generator(),
    schema="doc",
    namespace="benchmark",
    callback=callback,
    operation_type="delete",
    max_queue_size=5000,
    max_workers=48,
    max_connections=48,
)
```

## Feeding operations from a file[¶](#feeding-operations-from-a-file)

This demonstrates how we can use `feed_iterable` to feed from a large file without reading the entire file into memory, again using a generator. First we dump some operations to a `jsonl` file:

```python
# Dump some operations to a jsonl file, stored in the format expected by pyvespa.
# This is to demonstrate feeding from a file in the next section.
import json

with open("documents.jsonl", "w") as f:
    for doc in dataset:
        d = {
            "id": doc["_id"],
            "fields": {"id": doc["_id"], "embedding": doc["openai"]},
        }
        f.write(json.dumps(d) + "\n")
```

Define the file generator that will yield one line at a time:

```python
import json


def from_file_generator() -> dict:
    with open("documents.jsonl") as f:
        for line in f:
            yield json.loads(line)
```

```python
app.feed_iterable(
    iter=from_file_generator(),
    schema="doc",
    namespace="benchmark",
    callback=callback,
    operation_type="feed",
    max_queue_size=4000,
    max_workers=32,
    max_connections=32,
)
```

### Get and Feed individual data points[¶](#get-and-feed-individual-data-points)

Feed a single data point to Vespa:

```python
with app.syncio(connections=1):
    response: VespaResponse = app.feed_data_point(
        schema="doc",
        namespace="benchmark",
        data_id="1",
        fields={
            "id": "1",
            "title": "title",
            "body": "this is body",
            "popularity": 1.0,
        },
    )
    print(response.is_successful())
    print(response.get_json())
```

```
True
{'pathId': '/document/v1/benchmark/doc/docid/1', 'id': 'id:benchmark:doc::1'}
```

Get the same document. Try also changing `data_id` to a document that does not exist, which will raise a 404 HTTP error.
```python
with app.syncio(connections=1):
    response: VespaResponse = app.get_data(
        schema="doc",
        namespace="benchmark",
        data_id="1",
    )
    print(response.is_successful())
    print(response.get_json())
```

```
True
{'pathId': '/document/v1/benchmark/doc/docid/1', 'id': 'id:benchmark:doc::1', 'fields': {'body': 'this is body', 'title': 'title', 'popularity': 1.0, 'id': '1'}}
```

### Upsert[¶](#upsert)

The following sends an update operation. If the document exists, the `popularity` field is updated to 3.0; if it does not exist, the document is created with `popularity` set to 3.0.

```python
with app.syncio(connections=1):
    response: VespaResponse = app.update_data(
        schema="doc",
        namespace="benchmark",
        data_id="does-not-exist",
        fields={"popularity": 3.0},
        create=True,
    )
    print(response.is_successful())
    print(response.get_json())
```

```
True
{'pathId': '/document/v1/benchmark/doc/docid/does-not-exist', 'id': 'id:benchmark:doc::does-not-exist'}
```
```python
with app.syncio(connections=1):
    response: VespaResponse = app.get_data(
        schema="doc",
        namespace="benchmark",
        data_id="does-not-exist",
    )
    print(response.is_successful())
    print(response.get_json())
```

```
True
{'pathId': '/document/v1/benchmark/doc/docid/does-not-exist', 'id': 'id:benchmark:doc::does-not-exist', 'fields': {'popularity': 3.0}}
```

## Cleanup[¶](#cleanup)

```python
vespa_docker.container.stop()
vespa_docker.container.remove()
```

## Next steps[¶](#next-steps)

Read more on writing to Vespa in [reads-and-writes](https://docs.vespa.ai/en/reads-and-writes.html).

# Evaluating a Vespa Application[¶](#evaluating-a-vespa-application)

We are often asked by users and customers what the best retrieval and ranking strategy is for a given use case. Even though we might sometimes have an intuition, we always recommend setting up experiments and doing a proper quantitative evaluation.

> Models are temporary; Evals are forever.
>
> -Eugene Yan

Without a proper evaluation setup, you run the risk of settling for `lgtm@10` (looks good to me @ 10). If you then deploy your application to users, you can be sure to get feedback about queries that do not produce relevant results. If you try to optimize for those without knowing whether your tweaks actually improve the overall quality of your search, you might end up with a system that is worse than the one you started with.

## So, what can you do?[¶](#so-what-can-you-do)

You can set up a proper evaluation pipeline, where you can test different ranking strategies and see how they perform on a set of evaluation queries that act as a proxy for your real users' queries.
This way, you can make informed decisions about what works best for your use case. If you collect real user interactions, that is even better, but it is important to keep the evaluation pipeline light enough that you can run it both during development and in CI pipelines (possibly at different scales).

This guide shows how you can easily evaluate a Vespa application using pyvespa's `VespaMatchEvaluator` and `VespaEvaluator` classes.

### Evaluate match-phase (retrieval) for recall[¶](#evaluate-match-phase-retrieval-for-recall)

The match-phase (or retrieval phase) in Vespa retrieves the candidate documents to rank. Here, what we care about is that all potentially relevant documents are retrieved fast, without matching too many documents. If we match too many documents, latency suffers, as all retrieved docs are exposed to ranking. For an introduction to phased retrieval in Vespa, see the [docs](https://docs.vespa.ai/en/phased-ranking.html).

For this tutorial, we will evaluate and compare `weakAnd`, `nearestNeighbor`, as well as the combination of the two (using the `OR` operator).

### Evaluate ranking[¶](#evaluate-ranking)

We will define and compare 4 different ranking strategies in this guide:

1. `bm25` - Keyword-based retrieval and ranking - the solid baseline.
1. `semantic` - Vector search using cosine similarity (using the `e5-small-v2` model for embeddings).
1. `fusion` - Hybrid search (semantic + keyword), combining BM25 and semantic with [reciprocal rank fusion](https://docs.vespa.ai/en/phased-ranking.html#cross-hit-normalization-including-reciprocal-rank-fusion).
1. `atan_norm` - Hybrid search, combining BM25 and semantic with [atan normalization](https://docs.vespa.ai/en/tutorials/hybrid-search.html#hybrid-ranking) as described in Aapo Tanskanen's [Guidebook to the State-of-the-Art Embeddings and Information Retrieval](https://www.linkedin.com/pulse/guidebook-state-of-the-art-embeddings-information-aapo-tanskanen-pc3mf/) (originally proposed by [Seo et al.
(2022)](https://www.mdpi.com/2227-7390/10/8/1335)).

Refer to [troubleshooting](https://vespa-engine.github.io/pyvespa/troubleshooting.md) for any problems when running this guide.

**Pre-requisite**: Create a tenant at [cloud.vespa.ai](https://cloud.vespa.ai/) and save the tenant name.

## Install[¶](#install)

Install [pyvespa](https://pyvespa.readthedocs.io/) >= 0.53.0 and the [Vespa CLI](https://docs.vespa.ai/en/vespa-cli.html). The Vespa CLI is used for data- and control-plane key management ([Vespa Cloud Security Guide](https://cloud.vespa.ai/en/security/guide)).

```text
!pip3 install pyvespa vespacli datasets pandas
```

## Configure application[¶](#configure-application)

```python
# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
# Replace with your application name (does not need to exist yet)
application = "evaluation"
schema_name = "doc"
```

## Create an application package[¶](#create-an-application-package)

The [application package](https://vespa-engine.github.io/pyvespa/api/vespa/package.md) has all the Vespa configuration files - create one from scratch:
```python
from vespa.package import (
    ApplicationPackage,
    Field,
    Schema,
    Document,
    HNSW,
    RankProfile,
    Component,
    Parameter,
    FieldSet,
    GlobalPhaseRanking,
    Function,
)
import pandas as pd

package = ApplicationPackage(
    name=application,
    schema=[
        Schema(
            name=schema_name,
            document=Document(
                fields=[
                    # Note that we need an id field as attribute to be able to do evaluation.
                    # Vespa's internal document id is used as fallback, but has some limitations,
                    # see https://docs.vespa.ai/en/document-v1-api-guide.html#query-result-id
                    Field(name="id", type="string", indexing=["summary", "attribute"]),
                    Field(
                        name="text",
                        type="string",
                        indexing=["index", "summary"],
                        index="enable-bm25",
                        bolding=True,
                    ),
                    Field(
                        name="embedding",
                        type="tensor(x[384])",
                        indexing=[
                            "input text",
                            "embed",  # uses default model
                            "index",
                            "attribute",
                        ],
                        ann=HNSW(distance_metric="angular"),
                        is_document_field=False,
                    ),
                ]
            ),
            fieldsets=[FieldSet(name="default", fields=["text"])],
            rank_profiles=[
                RankProfile(
                    name="match-only",
                    inputs=[("query(q)", "tensor(x[384])")],
                    first_phase="random",  # TODO: Remove when pyvespa supports empty first_phase
                ),
                RankProfile(
                    name="bm25",
                    inputs=[("query(q)", "tensor(x[384])")],
                    functions=[Function(name="bm25text", expression="bm25(text)")],
                    first_phase="bm25text",
                    match_features=["bm25text"],
                ),
                RankProfile(
                    name="semantic",
                    inputs=[("query(q)", "tensor(x[384])")],
                    functions=[
                        Function(name="cos_sim", expression="closeness(field, embedding)")
                    ],
                    first_phase="cos_sim",
                    match_features=["cos_sim"],
                ),
                RankProfile(
                    name="fusion",
                    inherits="bm25",
                    functions=[
                        Function(name="cos_sim", expression="closeness(field, embedding)")
                    ],
                    inputs=[("query(q)", "tensor(x[384])")],
                    first_phase="cos_sim",
                    global_phase=GlobalPhaseRanking(
                        expression="reciprocal_rank_fusion(bm25text, closeness(field, embedding))",
                        rerank_count=1000,
                    ),
                    match_features=["cos_sim", "bm25text"],
                ),
                RankProfile(
                    name="atan_norm",
                    inherits="bm25",
                    inputs=[("query(q)", "tensor(x[384])")],
                    functions=[
                        Function(
                            name="scale",
                            args=["val"],
                            expression="2*atan(val)/(3.14159)",
                        ),
                        Function(name="normalized_bm25", expression="scale(bm25(text))"),
                        Function(name="cos_sim", expression="closeness(field, embedding)"),
                    ],
                    first_phase="normalized_bm25",
                    global_phase=GlobalPhaseRanking(
                        expression="normalize_linear(normalized_bm25) + normalize_linear(cos_sim)",
                        rerank_count=1000,
                    ),
                    match_features=["cos_sim", "normalized_bm25"],
                ),
            ],
        )
    ],
    components=[
        Component(
            id="e5",
            type="hugging-face-embedder",
            parameters=[
                Parameter(
                    "transformer-model",
                    {"model-id": "e5-small-v2"},
                    # in Vespa Cloud, we can use the model-id for selected models,
                    # see https://cloud.vespa.ai/en/model-hub
                ),
                Parameter(
                    "tokenizer-model",
                    {"model-id": "e5-base-v2-vocab"},
                ),
            ],
        )
    ],
)
```

Note that the name cannot have `-` or `_`.

## Deploy to Vespa Cloud[¶](#deploy-to-vespa-cloud)

The app is now defined and ready to deploy to Vespa Cloud. Deploy `package` to Vespa Cloud by creating an instance of [VespaCloud](https://vespa-engine.github.io/pyvespa/api/vespa/deployment#VespaCloud):

```python
from vespa.deployment import VespaCloud
import os

vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=application,
    key_content=os.getenv(
        "VESPA_TEAM_API_KEY", None
    ),  # Key is only used for CI/CD. Can be removed if logging in interactively
    application_package=package,
)
```

```
Setting application...
Running: vespa config set application vespa-team.evaluation.default
Setting target cloud...
Running: vespa config set target cloud
Api-key found for control plane access. Using api-key.
```

For more details on authentication options and methods, see [authenticating-to-vespa-cloud](https://vespa-engine.github.io/pyvespa/authenticating-to-vespa-cloud.md).

The following will upload the application package to the Vespa Cloud Dev Zone (`aws-us-east-1c`); read more about [Vespa Zones](https://cloud.vespa.ai/en/reference/zones.html). The Vespa Cloud Dev Zone is a sandbox environment where resources are down-scaled and idle deployments are expired automatically. For information about production deployments, see [deploy_to_prod](https://vespa-engine.github.io/pyvespa/api/vespa/deployment#vespa.deployment.VespaCloud.deploy_to_prod).

> Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.

Now deploy the app to the Vespa Cloud dev zone. The first deployment typically takes 2 minutes until the endpoint is up. (Applications that, for example, refer to large ONNX models may take a bit longer.)
```python
from vespa.application import Vespa

app: Vespa = vespa_cloud.deploy()
```

```
Deployment started in run 52 of dev-aws-us-east-1c for vespa-team.evaluation. This may take a few minutes the first time.
INFO [06:52:41] Deploying platform version 8.586.25 and application dev build 50 for dev-aws-us-east-1c of default ...
INFO [06:52:41] Using CA signed certificate version 9
INFO [06:52:42] Using 1 nodes in container cluster 'evaluation_container'
INFO [06:52:44] Session 379645 for tenant 'vespa-team' prepared and activated.
INFO [06:52:44] ######## Details for all nodes ########
INFO [06:52:45] h125699b.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [06:52:45] --- platform vespa/cloud-tenant-rhel8:8.586.25
INFO [06:52:45] --- storagenode on port 19102 has config generation 379643, wanted is 379645
INFO [06:52:45] --- searchnode on port 19107 has config generation 379643, wanted is 379645
INFO [06:52:45] --- distributor on port 19111 has config generation 379645, wanted is 379645
INFO [06:52:45] --- metricsproxy-container on port 19092 has config generation 379645, wanted is 379645
INFO [06:52:45] h119183e.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [06:52:45] --- platform vespa/cloud-tenant-rhel8:8.586.25
INFO [06:52:45] --- container-clustercontroller on port 19050 has config generation 379643, wanted is 379645
INFO [06:52:45] --- metricsproxy-container on port 19092 has config generation 379645, wanted is 379645
INFO [06:52:45] h125689a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [06:52:45] --- platform vespa/cloud-tenant-rhel8:8.586.25
INFO [06:52:45] --- container on port 4080 has config generation 379643, wanted is 379645
INFO [06:52:45] --- metricsproxy-container on port 19092 has config generation 379643, wanted is 379645
INFO [06:52:45] h97530b.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [06:52:45] --- platform vespa/cloud-tenant-rhel8:8.586.25
INFO [06:52:45] --- logserver-container on port 4080 has config generation 379643, wanted is 379645
INFO [06:52:45] --- metricsproxy-container on port 19092 has config generation 379643, wanted is 379645
INFO [06:52:56] Found endpoints:
INFO [06:52:56] - dev.aws-us-east-1c
INFO [06:52:56]  |-- https://f4f49447.ccc9bd09.z.vespa-app.cloud/ (cluster 'evaluation_container')
INFO [06:52:56] Deployment of new application revision complete!
Only region: aws-us-east-1c available in dev environment.
Found mtls endpoint for evaluation_container
URL: https://f4f49447.ccc9bd09.z.vespa-app.cloud/
Application is up!
```

If the deployment failed, it is possible you forgot to add the key in the Vespa Cloud Console in the `vespa auth api-key` step above. If you can authenticate, you should see lines like the following:

```
Deployment started in run 1 of dev-aws-us-east-1c for mytenant.hybridsearch.
```

The deployment takes a few minutes the first time, while Vespa Cloud sets up the resources for your Vespa application.

`app` now holds a reference to a [Vespa](https://vespa-engine.github.io/pyvespa/api/vespa/application.md#vespa.application.Vespa) instance. We can access the mTLS-protected endpoint name using the control-plane (`vespa_cloud`) instance, and query and feed to this endpoint (data-plane access) using the mTLS certificate generated in previous steps. See [Authenticating to Vespa Cloud](https://vespa-engine.github.io/pyvespa/authenticating-to-vespa-cloud.md) for details on using token authentication instead of mTLS.

## Getting your evaluation data[¶](#getting-your-evaluation-data)

For evaluating information retrieval methods, in addition to the document corpus, we also need a set of queries and a mapping from queries to relevant documents. For this guide, we will use the [NanoMSMARCO](https://huggingface.co/datasets/zeta-alpha-ai/NanoMSMARCO) dataset, made available on Hugging Face by [Zeta Alpha](https://zeta-alpha.com/).
This dataset is a subset of their 🍺[NanoBEIR](https://huggingface.co/collections/zeta-alpha-ai/nanobeir-66e1a0af21dfd93e620cd9f6) collection, with 50 queries and up to 10K documents each. It is a great dataset for testing and evaluating information retrieval methods quickly, as it is small and easy to work with.

Note that for almost any real-world use case, we recommend creating your own evaluation dataset. See this [Vespa blog post](https://blog.vespa.ai/improving-retrieval-with-llm-as-a-judge/) on how an LLM can help with this. Creating 20-50 queries and annotating relevant documents for each query can be a good start, and is well worth the effort.

```python
from datasets import load_dataset

dataset_id = "zeta-alpha-ai/NanoMSMARCO"

dataset = load_dataset(dataset_id, "corpus", split="train", streaming=True)
vespa_feed = dataset.map(
    lambda x: {
        "id": x["_id"],
        "fields": {"text": x["text"], "id": x["_id"]},
    }
)
```

Note that we are only *evaluating* rank strategies here, so we consider it OK to use the `train` split for evaluation. If we were to make changes to our ranking strategies, such as adding weighting terms or training ML models for ranking, we would suggest adopting a `train`/`validation`/`test` split approach to avoid overfitting.

```python
query_ds = load_dataset(dataset_id, "queries", split="train")
qrels = load_dataset(dataset_id, "qrels", split="train")
```
```python
ids_to_query = dict(zip(query_ds["_id"], query_ds["text"]))
```

Let us print the first queries:

```python
for idx, (qid, q) in enumerate(ids_to_query.items()):
    print(f"qid: {qid}, query: {q}")
    if idx == 5:
        break
```

```
qid: 994479, query: which health care system provides all citizens or residents with equal access to health care services
qid: 1009388, query: what's right in health care
qid: 1088332, query: weather in oran
qid: 265729, query: how long keep financial records
qid: 1099433, query: how do hoa fees work
qid: 200600, query: heels or heal
```

```python
relevant_docs = dict(zip(qrels["query-id"], qrels["corpus-id"]))
```

Let us print the first query ids and their relevant documents:

```python
for idx, (qid, doc_id) in enumerate(relevant_docs.items()):
    print(f"qid: {qid}, doc_id: {doc_id}")
    if idx == 5:
        break
```

```
qid: 994479, doc_id: 7275120
qid: 1009388, doc_id: 7248824
qid: 1088332, doc_id: 7094398
qid: 265729, doc_id: 7369987
qid: 1099433, doc_id: 7255675
qid: 200600, doc_id: 7929603
```

We can see that this dataset only has one relevant document per query. The `VespaEvaluator` class handles this just fine, but you could also provide a set of relevant documents per query if there are multiple relevant docs:

```python
# multiple relevant docs per query
qrels = {
    "q1": {"doc1", "doc2"},
    "q2": {"doc3", "doc4"},
    # etc.
}
```

Now we can feed to Vespa using `feed_iterable`, which accepts any `Iterable` and an optional callback function where we can check the outcome of each operation.
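If your own qrels come as flat (query-id, corpus-id) pairs with several rows per query, the set-valued mapping sketched above can be built with a few lines (the helper name `build_qrels` is ours):

```python
from collections import defaultdict


def build_qrels(pairs):
    """Collect (query_id, doc_id) pairs into one set of relevant doc ids per query."""
    qrels = defaultdict(set)
    for query_id, doc_id in pairs:
        qrels[query_id].add(doc_id)
    return dict(qrels)


# Example: q1 has two relevant documents, q2 has one.
pairs = [("q1", "doc1"), ("q1", "doc2"), ("q2", "doc3")]
qrels_sets = build_qrels(pairs)
```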
The application is configured to use [embedding](https://docs.vespa.ai/en/embedding.html) functionality, which produces a vector embedding of the `text` input field. This step may be resource intensive, depending on the model size. Read more about embedding inference in Vespa in the [Accelerating Transformer-based Embedding Retrieval with Vespa](https://blog.vespa.ai/accelerating-transformer-based-embedding-retrieval-with-vespa/) blog post. Default node resources in the Vespa Cloud Dev Zone are 2 vCPUs.

```python
from vespa.io import VespaResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(f"Error when feeding document {id}: {response.get_json()}")


app.feed_iterable(vespa_feed, schema="doc", namespace="tutorial", callback=callback)
```

## Evaluate match-phase[¶](#evaluate-match-phase)

There are two separate classes provided for doing evaluations:

1. `VespaMatchEvaluator` is intended to evaluate only the *retrieval* (match) phase, without any ranking. This is useful to evaluate whether your relevant documents are retrieved at all (and thus exposed to ranking). It computes recall and the total number of matched documents per query, as well as `searchtime`.
1. `VespaEvaluator` is intended to evaluate a complete ranking strategy, across several common IR metrics.

Both APIs are inspired by the [SentenceTransformers](https://www.sbert.net/) [`InformationRetrievalEvaluator`](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#informationretrievalevaluator) class.
The difference is that `VespaMatchEvaluator` evaluates only the retrieval phase, while `VespaEvaluator` evaluates your whole retrieval and ranking *system* (the Vespa application) as opposed to a single model. Unlike the SentenceTransformers evaluator, they do not take in the document corpus; your application should be fed with the corpus in advance.

We have now created the app, the queries, and the relevant documents. The only thing missing before we can initialize the `VespaMatchEvaluator` is a set of functions that define the Vespa queries, each passed as `vespa_query_fn`.

We will use the `vespa.querybuilder` module to create the queries. See the [reference doc](https://vespa-engine.github.io/pyvespa/api/vespa/querybuilder/builder/builder.md) and the [example notebook](https://vespa-engine.github.io/pyvespa/query.md#Using-the-Querybuilder-DSL-API) for more details on usage. This module is a Python wrapper around the Vespa Query Language (YQL), and is an alternative to providing the YQL query as a string directly.
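For reference, a query function can also return a hand-written YQL string; a sketch of the `weakAnd` variant in that style (the exact string the querybuilder emits may differ in formatting, and `userQuery()` picks up the text passed in the `query` parameter):

```python
schema_name = "doc"  # defined in the configuration cell earlier


def match_weakand_query_fn_raw(query_text: str, top_k: int) -> dict:
    # Equivalent in spirit to the querybuilder-based function below:
    # userQuery() is converted to a weakAnd over the "query" parameter by default.
    return {
        "yql": f"select * from {schema_name} where userQuery()",
        "query": query_text,
        "ranking": "match-only",
        "input.query(q)": f"embed({query_text})",
    }
```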
```python
import vespa.querybuilder as qb


def match_weakand_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": str(qb.select("*").from_(schema_name).where(qb.userQuery(query_text))),
        "query": query_text,
        "ranking": "match-only",
        "input.query(q)": f"embed({query_text})",
    }


def match_hybrid_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": str(
            qb.select("*")
            .from_(schema_name)
            .where(
                qb.nearestNeighbor(
                    field="embedding",
                    query_vector="q",
                    annotations={"targetHits": 100},
                )
                | qb.userQuery(
                    query_text,
                )
            )
        ),
        "query": query_text,
        "ranking": "match-only",
        "input.query(q)": f"embed({query_text})",
    }


def match_semantic_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": str(
            qb.select("*")
            .from_(schema_name)
            .where(
                qb.nearestNeighbor(
                    field="embedding",
                    query_vector="q",
                    annotations={"targetHits": 100},
                )
            )
        ),
        "query": query_text,
        "ranking": "match-only",
        "input.query(q)": f"embed({query_text})",
    }
```

Now, let us run the evaluator:
```
from vespa.evaluation import VespaMatchEvaluator

match_results = {}
for evaluator_name, query_fn in [
    ("semantic", match_semantic_query_fn),
    ("weakand", match_weakand_query_fn),
    ("hybrid", match_hybrid_query_fn),
]:
    print(f"Evaluating {evaluator_name}...")
    match_evaluator = VespaMatchEvaluator(
        queries=ids_to_query,
        relevant_docs=relevant_docs,
        vespa_query_fn=query_fn,
        app=app,
        name="test-run",
        id_field="id",  # specify the id field used in the relevant_docs
        write_csv=True,
        write_verbose=True,  # optionally write verbose per-query metrics to CSV
    )
    results = match_evaluator()
    match_results[evaluator_name] = results
    print(f"Results for {evaluator_name}:")
    print(results)
```

```
Evaluating semantic...
Results for semantic: {'match_recall': 1.0, 'avg_recall_per_query': 1.0, 'total_relevant_docs': 50, 'total_matched_relevant': 50, 'avg_matched_per_query': 100.0, 'total_queries': 50, 'searchtime_avg': 0.0535, 'searchtime_q50': 0.053, 'searchtime_q90': 0.0786, 'searchtime_q95': 0.08700000000000001}
Evaluating weakand...
Results for weakand: {'match_recall': 0.98, 'avg_recall_per_query': 0.98, 'total_relevant_docs': 50, 'total_matched_relevant': 49, 'avg_matched_per_query': 809.86, 'total_queries': 50, 'searchtime_avg': 0.04391999999999998, 'searchtime_q50': 0.043000000000000003, 'searchtime_q90': 0.058300000000000005, 'searchtime_q95': 0.06665}
Evaluating hybrid...
Results for hybrid: {'match_recall': 1.0, 'avg_recall_per_query': 1.0, 'total_relevant_docs': 50, 'total_matched_relevant': 50, 'avg_matched_per_query': 833.18, 'total_queries': 50, 'searchtime_avg': 0.03699999999999999, 'searchtime_q50': 0.037, 'searchtime_q90': 0.0531, 'searchtime_q95': 0.058299999999999984}
```

By setting `write_csv=True` and `write_verbose=True`, we save a CSV file per query that lets us inspect the queries whose relevant documents were not matched. This is important for understanding how you could improve recall when relevant documents are missed.

```
results = pd.DataFrame(match_results)
results
```

| | semantic | weakand | hybrid |
| ---------------------- | -------- | --------- | -------- |
| match_recall | 1.0000 | 0.98000 | 1.0000 |
| avg_recall_per_query | 1.0000 | 0.98000 | 1.0000 |
| total_relevant_docs | 50.0000 | 50.00000 | 50.0000 |
| total_matched_relevant | 50.0000 | 49.00000 | 50.0000 |
| avg_matched_per_query | 100.0000 | 809.86000 | 833.1800 |
| total_queries | 50.0000 | 50.00000 | 50.0000 |
| searchtime_avg | 0.0535 | 0.04392 | 0.0370 |
| searchtime_q50 | 0.0530 | 0.04300 | 0.0370 |
| searchtime_q90 | 0.0786 | 0.05830 | 0.0531 |
| searchtime_q95 | 0.0870 | 0.06665 | 0.0583 |

Here we can see that the `semantic` and `hybrid` strategies match all of the relevant documents, while `weakand` misses one (match recall 0.98).
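The `avg_matched_per_query` column above shows the match-phase cost of each strategy. One knob for this trade-off is the `targetHits` annotation on the `nearestNeighbor` operator. A minimal sketch of a query function with a lower, hypothetical value (the `20` below is illustrative, not a recommendation); the YQL is written as a plain string here, whereas the notebook builds it with `vespa.querybuilder`:

```python
# Sketch: a semantic match-phase query function with a tunable targetHits.
def tuned_semantic_query_fn(query_text: str, top_k: int, target_hits: int = 20) -> dict:
    return {
        "yql": (
            "select * from sources * where "
            f"{{targetHits: {target_hits}}}nearestNeighbor(embedding, q)"
        ),
        "query": query_text,
        "ranking": "match-only",
        "input.query(q)": f"embed({query_text})",
    }
```

Lowering `targetHits` reduces the number of matched documents and the latency, at the risk of lower match recall; re-running `VespaMatchEvaluator` after each change quantifies the effect.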
To trade the number of matched documents against latency, we could tune the `targetHits` parameter of the `nearestNeighbor` operator and of `weakAnd` (our `userQuery` is converted to `weakAnd`, see the [query language reference](https://docs.vespa.ai/en/reference/query-language-reference.html)), as well as several additional `weakAnd` parameters. See this Vespa [blog post](https://blog.vespa.ai/tripling-the-query-performance-of-lexical-search/) for details; we will not go further into this in this notebook.

## Evaluate ranking[¶](#evaluate-ranking)

Now we will demonstrate how to evaluate the ranking strategies, using the `VespaEvaluator` class. Its interface is very similar to `VespaMatchEvaluator`, but it computes many more metrics. Note also that the number of `hits` requested affects the number of documents considered for evaluation.

## VespaEvaluator[¶](#vespaevaluator)

Let us take a look at its API and documentation:

```
from vespa.evaluation import VespaEvaluator

?VespaEvaluator
```

````
Init signature:
VespaEvaluator(
    queries: 'Dict[str, str]',
    relevant_docs: 'Union[Dict[str, Union[Set[str], Dict[str, float]]], Dict[str, str]]',
    vespa_query_fn: 'Callable[[str, int, Optional[str]], dict]',
    app: 'Vespa',
    name: 'str' = '',
    id_field: 'str' = '',
    accuracy_at_k: 'List[int]' = [1, 3, 5, 10],
    precision_recall_at_k: 'List[int]' = [1, 3, 5, 10],
    mrr_at_k: 'List[int]' = [10],
    ndcg_at_k: 'List[int]' = [10],
    map_at_k: 'List[int]' = [100],
    write_csv: 'bool' = False,
    csv_dir: 'Optional[str]' = None,
)
Docstring:
Evaluate retrieval performance on a Vespa application.

This class:
- Iterates over queries and issues them against your Vespa application.
- Retrieves top-k documents per query (with k = max of your IR metrics).
- Compares the retrieved documents with a set of relevant document ids.
- Computes IR metrics: Accuracy@k, Precision@k, Recall@k, MRR@k, NDCG@k, MAP@k.
- Logs vespa search times for each query.
- Logs/returns these metrics.
- Optionally writes out to CSV.

Note:
    The 'id_field' needs to be marked as an attribute in your Vespa schema,
    so filtering can be done on it.

Example usage:

```python
from vespa.application import Vespa
from vespa.evaluation import VespaEvaluator

queries = {
    "q1": "What is the best GPU for gaming?",
    "q2": "How to bake sourdough bread?",
    # ...
}
relevant_docs = {
    "q1": {"d12", "d99"},
    "q2": {"d101"},
    # ...
}
# relevant_docs can also be a dict of query_id => single relevant doc_id
# relevant_docs = {
#     "q1": "d12",
#     "q2": "d101",
#     # ...
# }
# Or, relevant_docs can be a dict of query_id => map of doc_id => relevance
# relevant_docs = {
#     "q1": {"d12": 1, "d99": 0.1},
#     "q2": {"d101": 0.01},
#     # ...
# }
# Note that for non-binary relevance, the relevance values should be in [0, 1], and that
# only the nDCG metric will be computed.

def my_vespa_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": 'select * from sources * where userInput("' + query_text + '");',
        "hits": top_k,
        "ranking": "your_ranking_profile",
    }

app = Vespa(url="http://localhost", port=8080)

evaluator = VespaEvaluator(
    queries=queries,
    relevant_docs=relevant_docs,
    vespa_query_fn=my_vespa_query_fn,
    app=app,
    name="test-run",
    accuracy_at_k=[1, 3, 5],
    precision_recall_at_k=[1, 3, 5],
    mrr_at_k=[10],
    ndcg_at_k=[10],
    map_at_k=[100],
    write_csv=True,
)

results = evaluator()
print("Primary metric:", evaluator.primary_metric)
print("All results:", results)
```

Args:
    queries (Dict[str, str]): A dictionary where keys are query IDs and values are query strings.
    relevant_docs (Union[Dict[str, Union[Set[str], Dict[str, float]]], Dict[str, str]]): A dictionary
        mapping query IDs to their relevant document IDs. Can be a set of doc IDs for binary relevance,
        a dict of doc_id to relevance score (float between 0 and 1) for graded relevance, or a single
        doc_id string.
    vespa_query_fn (Callable[[str, int, Optional[str]], dict]): A function that takes a query string,
        the number of hits to retrieve (top_k), and an optional query_id, and returns a Vespa query body dictionary.
    app (Vespa): An instance of the Vespa application.
    name (str, optional): A name for this evaluation run. Defaults to "".
    id_field (str, optional): The field name in the Vespa hit that contains the document ID.
        If empty, it tries to infer the ID from the 'id' field or 'fields.id'. Defaults to "".
    accuracy_at_k (List[int], optional): List of k values for which to compute Accuracy@k. Defaults to [1, 3, 5, 10].
    precision_recall_at_k (List[int], optional): List of k values for which to compute Precision@k and Recall@k. Defaults to [1, 3, 5, 10].
    mrr_at_k (List[int], optional): List of k values for which to compute MRR@k. Defaults to [10].
    ndcg_at_k (List[int], optional): List of k values for which to compute NDCG@k. Defaults to [10].
    map_at_k (List[int], optional): List of k values for which to compute MAP@k. Defaults to [100].
    write_csv (bool, optional): Whether to write the evaluation results to a CSV file. Defaults to False.
    csv_dir (Optional[str], optional): Directory to save the CSV file. Defaults to None (current directory).
File:           ~/Repos/pyvespa/vespa/evaluation.py
Type:           ABCMeta
Subclasses:
````
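To make the metric definitions above concrete, here is a minimal pure-Python computation of Recall@k and MRR@k for a single toy query (not pyvespa code; the doc ids are made up):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    # fraction of the relevant documents found in the top-k results
    top_k = ranked_ids[:k]
    return len(relevant_ids.intersection(top_k)) / len(relevant_ids)


def mrr_at_k(ranked_ids, relevant_ids, k):
    # reciprocal rank of the first relevant document within the top-k, else 0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


ranked = ["d7", "d12", "d3", "d99", "d5"]  # hypothetical ranked result list
relevant = {"d12", "d99"}

print(recall_at_k(ranked, relevant, 3))  # 0.5 -> only d12 is in the top 3
print(mrr_at_k(ranked, relevant, 10))  # 0.5 -> first relevant doc at rank 2
```

`VespaEvaluator` computes these (plus Accuracy@k, Precision@k, NDCG@k and MAP@k) over all queries and averages them.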
```
def semantic_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": str(
            qb.select("*")
            .from_(schema_name)
            .where(
                qb.nearestNeighbor(
                    field="embedding",
                    query_vector="q",
                    annotations={"targetHits": 100},
                )
            )
        ),
        "query": query_text,
        "ranking": "semantic",
        "input.query(q)": f"embed({query_text})",
        "hits": top_k,
    }


def bm25_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": "select * from sources * where userQuery();",  # provide the yql directly as a string
        "query": query_text,
        "ranking": "bm25",
        "hits": top_k,
    }


def fusion_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": str(
            qb.select("*")
            .from_(schema_name)
            .where(
                qb.nearestNeighbor(
                    field="embedding",
                    query_vector="q",
                    annotations={"targetHits": 100},
                )
                | qb.userQuery(query_text)
            )
        ),
        "query": query_text,
        "ranking": "fusion",
        "input.query(q)": f"embed({query_text})",
        "hits": top_k,
    }


def atan_norm_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": str(
            qb.select("*")
            .from_(schema_name)
            .where(
                qb.nearestNeighbor(
                    field="embedding",
                    query_vector="q",
                    annotations={"targetHits": 100},
                )
                | qb.userQuery(query_text)
            )
        ),
        "query": query_text,
        "ranking": "atan_norm",
        "input.query(q)": f"embed({query_text})",
        "hits": top_k,
    }
```

```
all_results = {}
for evaluator_name, query_fn in [
    ("semantic", semantic_query_fn),
    ("bm25", bm25_query_fn),
    ("fusion", fusion_query_fn),
    ("atan_norm", atan_norm_query_fn),
]:
    print(f"Evaluating {evaluator_name}...")
    evaluator = VespaEvaluator(
        queries=ids_to_query,
        relevant_docs=relevant_docs,
        vespa_query_fn=query_fn,
        app=app,
        name=evaluator_name,
        write_csv=True,  # optionally write metrics to CSV
    )
    results = evaluator.run()
    all_results[evaluator_name] = results
```

```
Evaluating semantic...
Evaluating bm25...
Evaluating fusion...
Evaluating atan_norm...
```

### Looking at the results[¶](#looking-at-the-results)

```
results = pd.DataFrame(all_results)
```
```
# take out all rows with "searchtime" to a separate dataframe
searchtime = results[results.index.str.contains("searchtime")]
results = results[~results.index.str.contains("searchtime")]


# Highlight the maximum value in each row
def highlight_max(s):
    is_max = s == s.max()
    return ["background-color: lightgreen; color: black;" if v else "" for v in is_max]


# Style the DataFrame: Highlight max values and format numbers to 4 decimals
styled_df = results.style.apply(highlight_max, axis=1).format("{:.4f}")
styled_df
```

| | semantic | bm25 | fusion | atan_norm |
| ------------ | -------- | ------ | ------ | --------- |
| accuracy@1 | 0.3800 | 0.3000 | 0.4400 | 0.4400 |
| accuracy@3 | 0.6400 | 0.6000 | 0.6800 | 0.7000 |
| accuracy@5 | 0.7200 | 0.6600 | 0.7200 | 0.7400 |
| accuracy@10 | 0.8200 | 0.7400 | 0.8000 | 0.8000 |
| precision@1 | 0.3800 | 0.3000 | 0.4400 | 0.4400 |
| recall@1 | 0.3800 | 0.3000 | 0.4400 | 0.4400 |
| precision@3 | 0.2133 | 0.2000 | 0.2267 | 0.2333 |
| recall@3 | 0.6400 | 0.6000 | 0.6800 | 0.7000 |
| precision@5 | 0.1440 | 0.1320 | 0.1440 | 0.1480 |
| recall@5 | 0.7200 | 0.6600 | 0.7200 | 0.7400 |
| precision@10 | 0.0820 | 0.0740 | 0.0800 | 0.0800 |
| recall@10 | 0.8200 | 0.7400 | 0.8000 | 0.8000 |
| mrr@10 | 0.5309 | 0.4501 | 0.5529 | 0.5738 |
| ndcg@10 | 0.6007 | 0.5206 | 0.6126 | 0.6296 |
| map@100 | 0.5393 | 0.4594 | 0.5630 | 0.5838 |

We can see that for this particular dataset, the hybrid strategy `atan_norm` is the best across
most of the metrics (pure `semantic` retrieval is slightly ahead at k=10).

```
results.plot(kind="bar", figsize=(12, 6))
```

### Looking at searchtimes[¶](#looking-at-searchtimes)

Ranking quality is not the only thing that matters. For many applications, search time is equally important.

```
# plot search time, add (ms) to the y-axis
# convert to ms
searchtime = searchtime * 1000
searchtime.plot(kind="bar", figsize=(12, 6)).set(ylabel="time (ms)")
```

We can see that the two hybrid strategies, `fusion` and `atan_norm`, are a bit slower on average than pure `bm25` or `semantic`, as expected. Depending on the latency budget of your application, this is likely still an attractive trade-off.

## Conclusion and next steps[¶](#conclusion-and-next-steps)

We have shown how to evaluate a Vespa application on two different levels:

1. Evaluating retrieval (the match phase) with the `VespaMatchEvaluator` class, where we checked match recall.
1. Evaluating ranking strategies with the `VespaEvaluator` class, where we defined and compared four different ranking strategies in terms of both ranking quality and search-time latency.

We hope this provides a good starting point for evaluating your own Vespa application. If you are ready to advance, you can try to optimize the ranking strategies further, for example by weighting the terms in the `atan_norm` strategy differently (`a * normalize_linear(normalized_bm25) + (1 - a) * normalize_linear(cos_sim)`), or by adding a [cross-encoder](https://vespa-engine.github.io/pyvespa/examples/cross-encoders-for-global-reranking.md) for re-ranking the top-k results.

## Cleanup[¶](#cleanup)
```
vespa_cloud.delete()
```

# Getting help from AI

All pyvespa documentation is available in plain Markdown for easy consumption by LLMs and AI coding assistants.

## LLM-optimized documentation

Following the [llms.txt](https://llmstxt.org/) standard, we publish:

- **[llms.txt](https://vespa-engine.github.io/pyvespa/llms.txt)** -- Concise index of all docs with descriptions
- **[llms-full.txt](https://vespa-engine.github.io/pyvespa/llms-full.txt)** -- Complete documentation in a single file (~2 MB)

Every documentation page also has a Markdown version available by replacing `.html` with `.md.txt` in the URL. Look for the Markdown icon button (top-right of each page) to view it.

## Instruct your AI assistant

Add the following to your project's `CLAUDE.md`, `AGENTS.md`, `.cursorrules`, or equivalent file to help your AI coding assistant use pyvespa docs effectively:

```markdown
## pyvespa

When looking up pyvespa documentation, prefer fetching the Markdown versions instead of scraping HTML pages:

- Documentation index: https://vespa-engine.github.io/pyvespa/llms.txt
- Full documentation: https://vespa-engine.github.io/pyvespa/llms-full.txt
- Per-page markdown: replace `.html` with `.md.txt` in any docs URL

Examples:

- https://vespa-engine.github.io/pyvespa/reads-writes.md.txt
- https://vespa-engine.github.io/pyvespa/api/vespa/application.md.txt
```

Compatibility

The snippet above works with [Claude Code](https://claude.ai/code) (`CLAUDE.md`), [GitHub Copilot](https://github.com/features/copilot) (`.github/copilot-instructions.md`), [Cursor](https://cursor.com) (`.cursor/rules/*.mdc`), [OpenAI Codex](https://openai.com/codex/) and most other AI coding tools that support [AGENTS.md](https://agents.md/). All of these read plain Markdown instructions.

# Troubleshooting

Also see the [Vespa FAQ](https://docs.vespa.ai/en/faq.html) and [Vespa support](https://cloud.vespa.ai/support) for more help resources.
## Vespa.ai and pyvespa Both [Vespa](https://vespa.ai/) and pyvespa APIs change regularly - make sure to use the latest version of [vespaengine/vespa](https://hub.docker.com/r/vespaengine/vespa) by running `docker pull vespaengine/vespa` and [install pyvespa](https://vespa-engine.github.io/pyvespa). To check the current version, run: ```bash python3 -m pip show pyvespa ``` ## Docker Memory Vespa requires at least 4GB to run - make sure Docker settings have at least this available. Use the Docker Desktop settings or `docker info | grep "Total Memory"` or `podman info | grep "memTotal"` to validate. pyvespa will by default start a container without any memory limit. To set an explicit memory limit (e.g., 8GB): ```python from vespa.deployment import VespaDocker vespa_docker = VespaDocker(port=8080, container_memory=8 * 1024**3) ``` ## Port conflicts / Docker Some of the notebooks run a Docker container. Make sure to stop running Docker containers before (re)running pyvespa notebooks - run `docker ps` and `docker ps -a -q -f status=exited` to list containers. ## Deployment Vespa has safeguards for incompatible deployments, and will warn with *validation-override* or *INVALID_APPLICATION_PACKAGE* in the deploy output. See [validation-overrides](https://docs.vespa.ai/en/reference/validation-overrides.html). This is most often due to pyvespa reusing a Docker container instance. The fix is to list (`docker ps`) and remove (`docker rm -f `) the existing Docker containers. Alternatively, use the Docker Dashboard application. Then deploy again. After deployment, validate status: - Config server state: - Container state: Look for `"status" : { "code" : "up"}` - both URLs must work before feeding or querying. ## Full disk Make sure to allocate enough disk space for Docker in Docker settings. 
If writes/queries fail or return no results, look in the `vespa.log` (output in the Docker dashboard): ```text WARNING searchnode proton.proton.server.disk_mem_usage_filter Write operations are now blocked: 'diskLimitReached: { action: "add more content nodes", reason: "disk used (0.939172) > disk limit (0.9)", stats: { capacity: 50406772736, used: 47340617728, diskUsed: 0.939172, diskLimit: 0.9}}' ``` Future pyvespa versions might throw an exception in these cases. See [Feed block](https://docs.vespa.ai/en/operations/feed-block.html) - Vespa stops writes before the disk goes full. Add more disk space, clean up, or follow the [example](https://vespa-engine.github.io/pyvespa/application-packages.md#Deploy-from-modified-files) to reconfigure for higher usage. ## Check number of indexed documents For query errors, check the number of documents indexed before debugging further: ```python app.query(yql='select * from sources * where true').number_documents_indexed ``` If this is zero, check that the deployment of the application worked, and that the subsequent feeding step completed successfully. ## Too many open files during batch feeding This is an OS-related issue. There are two options to solve the problem: 1. Reduce the number of connections via the `connections` parameter: ```python with app.syncio(connections=12): ``` 1. Increase the open file limit: `ulimit -n 10000`. Check if the limit was increased with `ulimit -Sn`. ## Data export `vespa visit` exports data from Vespa - see [Vespa CLI](https://docs.vespa.ai/en/vespa-cli.html#documents). Use this to validate data feeding and troubleshoot query issues. # ANN Parameter Tuning Approximate Nearest Neighbor (ANN) search is a powerful way to make vector search scalable and efficient. In Vespa, this is implemented by building HNSW graphs for embedding fields. 
For a search that uses *only* vector similarity for retrieval, this works very well as you can just query the HNSW index and get (enough) relevant results back very fast. However, most Vespa applications are more complex and often combine vector similarity with filtering on metadata fields. There are multiple strategies in Vespa for handling queries that combine ANN with filtering, and there are parameters that control the strategy selection and the strategies themselves. While Vespa has chosen default values for these parameters that work well in most use cases, one often can benefit from further tuning these parameters for the application/use case/data set at hand. ## ANN Parameter Optimizer The `vespa.evaluation` module provides a `VespaNNParameterOptimizer` class that, given a sufficient sample of queries using ANN with filtering, performs measurements to analyze the effect of various tuning parameters and, based on this, provides suggestions for these parameters. Running the optimizer can be as simple as this: ```python from vespa.evaluation import VespaNNParameterOptimizer optimizer = VespaNNParameterOptimizer( app=my_vespa_app, queries=my_list_of_queries, hits=number_of_target_hits_used_in_my_queries, ) report = optimizer.run() suggested_parameters = { "ranking.matching.approximateThreshold": report["approximateThreshold"][ "suggestion" ], "ranking.matching.filterFirstThreshold": report["filterFirstThreshold"][ "suggestion" ], "ranking.matching.filterFirstExploration": report["filterFirstExploration"][ "suggestion" ], "ranking.matching.postFilterThreshold": report["postFilterThreshold"][ "suggestion" ], } ``` See the [example](https://vespa-engine.github.io/pyvespa/examples/ann-parameter-tuning-vespa-cloud.md) for a full guide on how to use this class and how to interpret the report it produces. See the [documentation](https://vespa-engine.github.io/pyvespa/api/vespa/evaluation.md#vespa.evaluation.VespaNNParameterOptimizer) for further details. 
# Examples # Examples Here you can find a wide variety of examples that demonstrate how to use the Vespa Python API. These examples cover different use cases and functionalities, providing a practical understanding of how to interact with Vespa using Python. ## Vespa Cloud To create a free Vespa Cloud account, visit [Vespa Cloud](https://vespa.ai/free-trial/). - [ANN parameter tuning](https://vespa-engine.github.io/pyvespa/examples/ann-parameter-tuning-vespa-cloud.md) - [BGE-M3 - The Mother of all embedding models](https://vespa-engine.github.io/pyvespa/examples/mother-of-all-embedding-models-cloud.md) - [Billion-scale vector search with Cohere binary embeddings in Vespa](https://vespa-engine.github.io/pyvespa/examples/billion-scale-vector-search-with-cohere-embeddings-cloud.md) - [Building cost-efficient retrieval-augmented personal AI assistants](https://vespa-engine.github.io/pyvespa/examples/scaling-personal-ai-assistants-with-streaming-mode-cloud.md) - [Chat with your pdfs with ColBERT, langchain, and Vespa](https://vespa-engine.github.io/pyvespa/examples/chat_with_your_pdfs_using_colbert_langchain_and_Vespa-cloud.md) - [ColPali Ranking Experiments on DocVQA](https://vespa-engine.github.io/pyvespa/examples/colpali-benchmark-vqa-vlm_Vespa-cloud.md) - [Exploring the potential of OpenAI Matryoshka 🪆 embeddings with Vespa](https://vespa-engine.github.io/pyvespa/examples/Matryoshka_embeddings_in_Vespa-cloud.md) - [Feeding to Vespa Cloud](https://vespa-engine.github.io/pyvespa/examples/feed_performance_cloud.md) - [Multilingual Hybrid Search with Cohere binary embeddings and Vespa](https://vespa-engine.github.io/pyvespa/examples/multilingual-multi-vector-reps-with-cohere-cloud.md) - [PDF-Retrieval using ColQWen2 (ColPali) with Vespa](https://vespa-engine.github.io/pyvespa/examples/pdf-retrieval-with-ColQwen2-vlm_Vespa-cloud.md) - [RAG Blueprint tutorial](https://vespa-engine.github.io/pyvespa/examples/rag-blueprint-vespa-cloud.md) - [Scaling ColPALI (VLM) 
Retrieval](https://vespa-engine.github.io/pyvespa/examples/simplified-retrieval-with-colpali-vlm_Vespa-cloud.md) - [Standalone ColBERT + Vespa for long-context ranking](https://vespa-engine.github.io/pyvespa/examples/colbert_standalone_long_context_Vespa-cloud.md) - [Standalone ColBERT with Vespa for end-to-end retrieval and ranking](https://vespa-engine.github.io/pyvespa/examples/colbert_standalone_Vespa-cloud.md) - [Turbocharge RAG with LangChain and Vespa Streaming Mode for Partitioned Data](https://vespa-engine.github.io/pyvespa/examples/turbocharge-rag-with-langchain-and-vespa-streaming-mode-cloud.md) - [Using Cohere Binary Embeddings in Vespa](https://vespa-engine.github.io/pyvespa/examples/cohere-binary-vectors-in-vespa-cloud.md) - [Using Mixedbread.ai embedding model with support for binary vectors](https://vespa-engine.github.io/pyvespa/examples/mixedbread-binary-embeddings-with-sentence-transformers-cloud.md) - [Vespa 🤝 ColPali: Efficient Document Retrieval with Vision Language Models](https://vespa-engine.github.io/pyvespa/examples/colpali-document-retrieval-vision-language-models-cloud.md) - [Video Search and Retrieval with Vespa and TwelveLabs](https://vespa-engine.github.io/pyvespa/examples/video_search_twelvelabs_cloud.md) - [Visual PDF RAG with Vespa - ColPali demo application](https://vespa-engine.github.io/pyvespa/examples/visual_pdf_rag_with_vespa_colpali_cloud.md) ## Local deployment (docker/podman) - [Evaluating retrieval with Snowflake arctic embed](https://vespa-engine.github.io/pyvespa/examples/evaluating-with-snowflake-arctic-embed.md) - [Feeding performance](https://vespa-engine.github.io/pyvespa/examples/feed_performance.md) - [LightGBM: Mapping model features to Vespa features](https://vespa-engine.github.io/pyvespa/examples/lightgbm-with-categorical-mapping.md) - [LightGBM: Training the model with Vespa features](https://vespa-engine.github.io/pyvespa/examples/lightgbm-with-categorical.md) - [Multi-vector indexing with 
HNSW](https://vespa-engine.github.io/pyvespa/examples/multi-vector-indexing.md) - [Pyvespa examples](https://vespa-engine.github.io/pyvespa/examples/pyvespa-examples.md) - [Using Mixedbread.ai cross-encoder for reranking in Vespa.ai](https://vespa-engine.github.io/pyvespa/examples/cross-encoders-for-global-reranking.md) # Exploring the potential of OpenAI Matryoshka 🪆 embeddings with Vespa[¶](#exploring-the-potential-of-openai-matryoshka-embeddings-with-vespa) This notebook demonstrates the effectiveness of using the recently released (as of January 2024) OpenAI `text-embedding-3` embeddings with Vespa. Specifically, we are interested in the [Matryoshka Representation Learning](https://aniketrege.github.io/blog/2024/mrl/) technique used in training, which lets us "shorten embeddings (i.e. remove some numbers from the end of the sequence) without the embedding losing its concept-representing properties". This allows us to trade off a small amount of accuracy in exchange for much smaller embedding sizes, so we can store more documents and search them faster. [Exploring the potential of OpenAI Matryoshka 🪆 embeddings with Vespa](https://blog.vespa.ai/matryoshka-embeddings-in-vespa/) and [Matryoshka 🤝 Binary vectors: Slash vector search costs with Vespa](https://blog.vespa.ai/combining-matryoshka-with-binary-quantization-using-embedder/) are good reads on this subject. By using [phased ranking](https://docs.vespa.ai/en/phased-ranking.html), we can re-rank the top K results with the full embeddings in a second step. This produces accuracy on par with using the full embeddings! We'll use a standard information retrieval benchmark to evaluate result quality with different embedding sizes and retrieval/ranking strategies. Let's get started! First, install a few dependencies:
```
!pip3 install -U pyvespa ir_datasets openai pytrec_eval vespacli
```

## Examining the OpenAI embeddings[¶](#examining-the-openai-embeddings)

```
from openai import OpenAI

openai = OpenAI()


def embed(text, model="text-embedding-3-large", dimensions=3072):
    return (
        openai.embeddings.create(input=[text], model=model, dimensions=dimensions)
        .data[0]
        .embedding
    )
```

With these new embedding models, the API supports a `dimensions` parameter. Does this differ from just taking the first N dimensions?

```
test_input = "This is just a test sentence."
full = embed(test_input)
short = embed(test_input, dimensions=8)
print(full[:8])
print(short)
```

```
[0.0035371531266719103, 0.014166134409606457, -0.017565304413437843, 0.04296272248029709, 0.012746891938149929, -0.01731124334037304, -0.00855049304664135, 0.044189225882291794]
[0.05076185241341591, 0.20329885184764862, -0.2520805299282074, 0.6165600419044495, 0.18293125927448273, -0.24843446910381317, -0.1227085217833519, 0.634161651134491]
```

Numerically, they are not the same. But looking more closely, they differ only by a scaling factor:
```
scale = short[0] / full[0]
print([x * scale for x in full[:8]])
print(short)
```

```
[0.05076185241341591, 0.2032988673141365, -0.2520805173822377, 0.6165600695594861, 0.18293125124128834, -0.2484344748635628, -0.12270853156530777, 0.6341616780980419]
[0.05076185241341591, 0.20329885184764862, -0.2520805299282074, 0.6165600419044495, 0.18293125927448273, -0.24843446910381317, -0.1227085217833519, 0.634161651134491]
```

It seems the shortened vector has been L2 normalized to have a magnitude of 1. By cosine similarity, they are equivalent:

```
from numpy import dot
from numpy.linalg import norm


def cos_sim(e1, e2):
    return dot(e1, e2) / (norm(e1) * norm(e2))


print(norm(short))
cos_sim(short, full[:8])
```

```
0.9999999899058183
```

```
0.9999999999999996
```

This is great, because it means that in a single API call we can get the full embeddings, and easily produce shortened embeddings just by slicing the list of numbers. Note that `text-embedding-3-large` and `text-embedding-3-small` do **not** produce compatible embeddings when sliced to the same size:

```
cos_sim(
    embed(test_input, dimensions=1536),
    embed(test_input, dimensions=1536, model="text-embedding-3-small"),
)
```

```
-0.03217247156447633
```

## Getting a sample dataset[¶](#getting-a-sample-dataset)

Let's download a dataset so we have some real data to embed:

```
import ir_datasets

dataset = ir_datasets.load("beir/trec-covid")
print("Dataset has", dataset.docs_count(), "documents. Sample:")
dataset.docs_iter()[120]._asdict()
```

```
Dataset has 171332 documents. Sample:
```

```
{'doc_id': 'z2u5frvq', 'text': 'The authors discuss humoral immune responses to HIV and approaches to designing vaccines that induce viral neutralizing and other potentially protective antibodies.', 'title': 'Antibody-Based HIV-1 Vaccines: Recent Developments and Future Directions: A summary report from a Global HIV Vaccine Enterprise Working Group', 'url': 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2100141/', 'pubmed_id': '18052607'}
```

### Queries[¶](#queries)

This dataset also comes with a set of queries, and query/document relevance judgements:

```
print(next(dataset.queries_iter()))
print(next(dataset.qrels_iter()))
```

```
BeirCovidQuery(query_id='1', text='what is the origin of COVID-19', query='coronavirus origin', narrative="seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans")
TrecQrel(query_id='1', doc_id='005b2j4b', relevance=2, iteration='0')
```

We'll use these later to evaluate the result quality.

## Defining the Vespa application[¶](#defining-the-vespa-application)

First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their type.
```python
from vespa.package import Schema, Document, Field, FieldSet

my_schema = Schema(
    name="my_schema",
    mode="index",
    document=Document(
        fields=[
            Field(name="doc_id", type="string", indexing=["summary"]),
            Field(
                name="text",
                type="string",
                indexing=["summary", "index"],
                index="enable-bm25",
            ),
            Field(
                name="title",
                type="string",
                indexing=["summary", "index"],
                index="enable-bm25",
            ),
            Field(name="url", type="string", indexing=["summary", "index"]),
            Field(name="pubmed_id", type="string", indexing=["summary", "index"]),
            Field(
                name="shortened",
                type="tensor(x[256])",
                indexing=["attribute", "index"],
                attribute=["distance-metric: angular"],
            ),
            Field(
                name="embedding",
                type="tensor(x[3072])",
                indexing=["attribute"],
                attribute=["paged", "distance-metric: angular"],
            ),
        ],
    ),
    fieldsets=[FieldSet(name="default", fields=["title", "text"])],
)
```

The two fields of type `tensor(x[256])` and `tensor(x[3072])` are not in the dataset - they are tensor fields to hold the embeddings from OpenAI.

- `shortened`: This field holds the embedding shortened to 256 dimensions, requiring only **8.3%** of the memory. `index` here means we will build an [HNSW Approximate Nearest Neighbor index](https://docs.vespa.ai/en/approximate-nn-hnsw.html), by which we can find the closest vectors while exploring only a very small subset of the documents.
- `embedding`: This field contains the full-size embedding. It is [paged](https://docs.vespa.ai/en/attributes.html#paged-attributes): accesses to this field may require disk access, unless it has been cached by the kernel.

We must add the schema to a Vespa [application package](https://docs.vespa.ai/en/application-packages.html). This consists of configuration files, schemas, models, and possibly even custom code (plugins).

```python
from vespa.package import ApplicationPackage

vespa_app_name = "matryoshka"
vespa_application_package = ApplicationPackage(name=vespa_app_name, schema=[my_schema])
```

In the last step, we configure [ranking](https://docs.vespa.ai/en/ranking.html) by adding `rank-profile`s to the schema. Vespa has a rich set of built-in [rank-features](https://docs.vespa.ai/en/reference/rank-features.html), including many text-matching features such as:

- [BM25](https://docs.vespa.ai/en/reference/bm25.html),
- [nativeRank](https://docs.vespa.ai/en/reference/nativerank.html) and [many more](https://docs.vespa.ai/en/reference/rank-features.html).

Users can also define custom functions using [ranking expressions](https://docs.vespa.ai/en/reference/ranking-expressions.html). The following defines three runtime-selectable Vespa ranking profiles:

- `exact` uses the full-size embedding
- `shortened` uses only 256 dimensions (exact, or using the approximate nearest neighbor HNSW index)
- `rerank` uses the 256-dimension shortened embeddings (exact or ANN) in a first phase, and the full 3072-dimension embeddings in a second phase.
By default, the second phase is applied to the top 100 documents from the first phase.

```python
from vespa.package import RankProfile, Function, FirstPhaseRanking, SecondPhaseRanking

exact = RankProfile(
    name="exact",
    inputs=[("query(q3072)", "tensor(x[3072])")],
    functions=[Function(name="cos_sim", expression="closeness(field, embedding)")],
    first_phase=FirstPhaseRanking(expression="cos_sim"),
    match_features=["cos_sim"],
)
my_schema.add_rank_profile(exact)

shortened = RankProfile(
    name="shortened",
    inputs=[("query(q256)", "tensor(x[256])")],
    functions=[Function(name="cos_sim_256", expression="closeness(field, shortened)")],
    first_phase=FirstPhaseRanking(expression="cos_sim_256"),
    match_features=["cos_sim_256"],
)
my_schema.add_rank_profile(shortened)

rerank = RankProfile(
    name="rerank",
    inputs=[
        ("query(q3072)", "tensor(x[3072])"),
        ("query(q256)", "tensor(x[256])"),
    ],
    functions=[
        Function(name="cos_sim_256", expression="closeness(field, shortened)"),
        Function(
            name="cos_sim_3072",
            expression="cosine_similarity(query(q3072), attribute(embedding), x)",
        ),
    ],
    first_phase=FirstPhaseRanking(expression="cos_sim_256"),
    second_phase=SecondPhaseRanking(expression="cos_sim_3072"),
    match_features=["cos_sim_256", "cos_sim_3072"],
)
my_schema.add_rank_profile(rerank)
```

For an example of a `hybrid` rank-profile which combines semantic search with traditional text retrieval such as BM25, see the previous blog post: [Turbocharge RAG with LangChain and Vespa Streaming Mode for Sharded Data](https://blog.vespa.ai/turbocharge-rag-with-langchain-and-vespa-streaming-mode/).

## Deploy the application to Vespa Cloud[¶](#deploy-the-application-to-vespa-cloud)

With the configured application, we can deploy it to [Vespa Cloud](https://cloud.vespa.ai/en/). This requires a tenant: create one at [console.vespa-cloud.com](https://console.vespa-cloud.com/) (unless you already have one). This step requires a Google or GitHub account, and will start your [free trial](https://cloud.vespa.ai/en/free-trial). Make note of the tenant name; it is used in the next steps.

> Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.

```python
from vespa.deployment import VespaCloud
import os

# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
# Key is only used for CI/CD. Can be removed if logging in interactively
key = os.getenv("VESPA_TEAM_API_KEY", None)
if key is not None:
    key = key.replace(r"\n", "\n")  # To parse key correctly

vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=vespa_app_name,
    key_content=key,  # Key is only used for CI/CD. Can be removed if logging in interactively
    application_package=vespa_application_package,
)
```

Now deploy the app to the Vespa Cloud dev zone. The first deployment typically takes 2 minutes until the endpoint is up.

```python
from vespa.application import Vespa

app: Vespa = vespa_cloud.deploy()
```

```
Deployment started in run 3 of dev-aws-us-east-1c for vespa-team.matryoshka. This may take a few minutes the first time.
INFO    [15:51:53]  Deploying platform version 8.296.15 and application dev build 3 for dev-aws-us-east-1c of default ...
INFO    [15:51:53]  Using CA signed certificate version 0
INFO    [15:51:53]  Using 1 nodes in container cluster 'matryoshka_container'
INFO    [15:51:57]  Session 282395 for tenant 'vespa-team' prepared and activated.
INFO    [15:52:00]  ######## Details for all nodes ########
INFO    [15:52:09]  h88969c.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO    [15:52:09]  --- platform vespa/cloud-tenant-rhel8:8.296.15 <-- :
INFO    [15:52:09]  --- logserver-container on port 4080 has not started
INFO    [15:52:09]  --- metricsproxy-container on port 19092 has not started
INFO    [15:52:09]  h88972f.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO    [15:52:09]  --- platform vespa/cloud-tenant-rhel8:8.296.15 <-- :
INFO    [15:52:09]  --- container-clustercontroller on port 19050 has not started
INFO    [15:52:09]  --- metricsproxy-container on port 19092 has not started
INFO    [15:52:09]  h90002a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO    [15:52:09]  --- platform vespa/cloud-tenant-rhel8:8.296.15 <-- :
INFO    [15:52:09]  --- storagenode on port 19102 has not started
INFO    [15:52:09]  --- searchnode on port 19107 has not started
INFO    [15:52:09]  --- distributor on port 19111 has not started
INFO    [15:52:09]  --- metricsproxy-container on port 19092 has not started
INFO    [15:52:09]  h90512a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO    [15:52:09]  --- platform vespa/cloud-tenant-rhel8:8.296.15 <-- :
INFO    [15:52:09]  --- container on port 4080 has not started
INFO    [15:52:09]  --- metricsproxy-container on port 19092 has not started
INFO    [15:53:11]  Found endpoints:
INFO    [15:53:11]  - dev.aws-us-east-1c
INFO    [15:53:11]   |-- https://e5ba4967.b2349765.z.vespa-app.cloud/ (cluster 'matryoshka_container')
INFO    [15:53:12]  Installation succeeded!
Using mTLS (key,cert) Authentication against endpoint https://e5ba4967.b2349765.z.vespa-app.cloud//ApplicationStatus
Application is up!
Finished deployment.
```

## Get OpenAI embeddings for documents in the dataset[¶](#get-openai-embeddings-for-documents-in-the-dataset)

When producing the embeddings, we concatenate the title and text into a single string.
We could also have created two separate embedding fields for text and title, combining the rank scores for these fields in a Vespa [rank expression](https://docs.vespa.ai/en/ranking-expressions-features.html).

```python
import concurrent.futures

# only embed 100 docs while developing
sample_docs = list(dataset.docs_iter())[:100]


def embed_doc(doc):
    embedding = embed(
        (doc.title + " " + doc.text)[:8192]
    )  # we crop the ~25 documents which are longer than the context window
    shortened = embedding[0:256]
    return {
        "doc_id": doc.doc_id,
        "text": doc.text,
        "title": doc.title,
        "url": doc.url,
        "pubmed_id": doc.pubmed_id,
        "shortened": {"type": "tensor(x[256])", "values": shortened},
        "embedding": {"type": "tensor(x[3072])", "values": embedding},
    }


with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    my_docs_to_feed = list(executor.map(embed_doc, sample_docs))
```

## Feeding the dataset and embeddings into Vespa[¶](#feeding-the-dataset-and-embeddings-into-vespa)

Now that we have parsed the dataset and created an object with the fields that we want to add to Vespa, we must format the object into the format that PyVespa accepts. Notice the `fields`, `id` and `groupname` keys. The `groupname` is the key that is used to shard and co-locate the data, and is only relevant when using Vespa with [streaming mode](https://docs.vespa.ai/en/streaming-search.html).

```python
from typing import Iterable


def vespa_feed(user: str) -> Iterable[dict]:
    for doc in reversed(my_docs_to_feed):
        yield {"fields": doc, "id": doc["doc_id"], "groupname": user}
```

Now, we can feed to the Vespa instance (`app`) using the `feed_iterable` API, with the generator function above as input and a custom `callback` function.

```python
from vespa.io import VespaResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(
            f"Document {id} failed to feed with status code {response.status_code}, url={response.url} response={response.json}"
        )


app.feed_iterable(
    schema="my_schema",
    iter=vespa_feed(""),
    callback=callback,
    max_queue_size=2000,
    max_workers=32,
    max_connections=64,
)
```

### Embedding the queries[¶](#embedding-the-queries)

We need to obtain embeddings for the queries from OpenAI. If only using the shortened embedding for the query, you should specify this in the OpenAI API call to reduce latency.
```python
queries = []
for q in dataset.queries_iter():
    queries.append({"text": q.text, "embedding": embed(q.text), "id": q.query_id})
```

### Querying data[¶](#querying-data)

Now we can query our data. We'll do it in a few different ways, using the rank profiles we defined in the schema:

- Exhaustive (exact) nearest neighbor search with the full embeddings (3072 dimensions)
- Exhaustive (exact) nearest neighbor search with the shortened 256 dimensions
- Approximate nearest neighbor search, using the 256-dimension ANN HNSW index
- Approximate nearest neighbor search, using the 256-dimension ANN HNSW index in the first phase, then reranking the top 100 hits with the full embeddings

The query request uses the Vespa Query API, and the `Vespa.query()` function supports passing any of the Vespa query API parameters. Read more about querying Vespa in:

- [Vespa Query API](https://docs.vespa.ai/en/query-api.html)
- [Vespa Query API reference](https://docs.vespa.ai/en/reference/query-api-reference.html)
- [Vespa Query Language API (YQL)](https://docs.vespa.ai/en/query-language.html)

```python
import json


def query_exact(q):
    return session.query(
        yql="select doc_id, title from my_schema where ({targetHits: 10, approximate:false}nearestNeighbor(embedding,q3072)) limit 10",
        ranking="exact",
        timeout=10,
        body={"presentation.timing": "true", "input.query(q3072)": q["embedding"]},
    )


def query_256(q):
    return session.query(
        yql="select doc_id from my_schema where ({targetHits: 10, approximate:false}nearestNeighbor(shortened,q256)) limit 10",
        ranking="shortened",
        timeout=10,
        body={"presentation.timing": "true", "input.query(q256)": q["embedding"][:256]},
    )


def query_256_ann(q):
    return session.query(
        yql="select doc_id from my_schema where ({targetHits: 100, approximate:true}nearestNeighbor(shortened,q256)) limit 10",
        ranking="shortened",
        timeout=10,
        body={"presentation.timing": "true", "input.query(q256)": q["embedding"][:256]},
    )


def query_rerank(q):
    return session.query(
        yql="select doc_id from my_schema where ({targetHits: 100, approximate:true}nearestNeighbor(shortened,q256)) limit 10",
        ranking="rerank",
        timeout=10,
        body={
            "presentation.timing": "true",
            "input.query(q256)": q["embedding"][:256],
            "input.query(q3072)": q["embedding"],
        },
    )


print("Sample query:", queries[0]["text"])
with app.syncio() as session:
    print(json.dumps(query_rerank(queries[0]).hits[0], indent=2))
```

```
Sample query: what is the origin of COVID-19
{
  "id": "index:matryoshka_content/0/16c7e8749fb82d3b5e37bedb",
  "relevance": 0.6591723960884718,
  "source": "matryoshka_content",
  "fields": {
    "matchfeatures": {
      "cos_sim_256": 0.5481410972571522,
      "cos_sim_3072": 0.6591723960884718
    },
    "doc_id": "beguhous"
  }
}
```

Here's the top result from the first query. Notice `matchfeatures`, which returns the match-features from the rank-profile. Now, for each method of querying, we'll run all our queries and note the rank of each document in the response:
```python
def run_queries(query_function):
    global qt
    print("\nrun", query_function.__name__)
    results = {}
    for q in queries:
        response = query_function(q)
        assert response.is_successful()
        print(".", end="")
        results[q["id"]] = {}
        for pos, hit in enumerate(response.hits, start=1):
            qt += float(response.get_json()["timing"]["querytime"])
            results[q["id"]][hit["fields"]["doc_id"]] = pos
    return results


query_functions = (query_exact, query_256, query_256_ann, query_rerank)
runs = {}
with app.syncio() as session:
    for f in query_functions:
        qt = 0
        runs[f.__name__] = run_queries(f)
        print(" avg query time {:.4f} s".format(qt / len(queries)))
```

```
run query_exact
.................................................. avg query time 2.7918 s

run query_256
.................................................. avg query time 0.3040 s

run query_256_ann
.................................................. avg query time 0.0252 s

run query_rerank
.................................................. avg query time 0.0310 s
```

The query time numbers here are NOT a proper benchmark, but they illustrate some significant trends for this case:

- Doing exact NN with 3072 dimensions is too slow and expensive for many use cases
- Reducing dimensionality to 256 reduces latency by an order of magnitude
- Using an ANN index improves query time by another order of magnitude
- Re-ranking the top 100 results with the full embedding causes only a slight increase

We could use [more cores per search](https://docs.vespa.ai/en/performance/sizing-search.html#reduce-latency-with-multi-threaded-per-search-execution) or sharding over multiple nodes to improve latency and handle larger content volumes.

## Evaluating the query results[¶](#evaluating-the-query-results)

We need to get the query relevance judgements into the format supported by `pytrec_eval`:

```python
qrels = {}
for q in dataset.queries_iter():
    qrels[q.query_id] = {}
for qrel in dataset.qrels_iter():
    qrels[qrel.query_id][qrel.doc_id] = qrel.relevance
```

With that done, we can check the scores for the first query:

```python
for docid in runs["query_256_ann"]["1"]:
    score = qrels["1"].get(docid)
    print(docid, score or "-")
```

```
beguhous 2
k9lcpjyo 2
pl48ev5o 2
jwxt4ygt 2
dv9m19yk 1
ft4rbcxf 1
h8ahn8fw 2
6y1gwszn 2
3xusxrij -
2tyt8255 1
```

A lot of '2', that is, 'highly relevant' results: looks promising! Now we can use `pytrec_eval` to evaluate all the data for each run. The quality measure we use here is `nDCG@10` - [Normalized Discounted Cumulative Gain](https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG), computed for the first 10 results of each query. The evaluations are per-query, so we compute and report the average per run.
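As a reminder of what `nDCG@10` measures, here is a minimal sketch of the computation, using the common `rel / log2(rank + 1)` discount (illustrative only — not the `pytrec_eval` implementation, which handles edge cases and the exact trec_eval conventions):

```python
import math


def dcg(relevances):
    # DCG = sum of rel_i / log2(i + 1) over ranked positions i = 1..n
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))


def ndcg_at_10(ranked_rels, all_rels):
    # Normalize by the DCG of the ideal (relevance-sorted) ranking
    ideal = sorted(all_rels, reverse=True)[:10]
    return dcg(ranked_rels[:10]) / dcg(ideal)


# A run that ranks the relevance-2 document first scores higher
# than a run that buries it at the bottom:
print(ndcg_at_10([2, 1, 0], [2, 1, 0]))  # 1.0
print(ndcg_at_10([0, 1, 2], [2, 1, 0]))
```

The averages reported below are simply the mean of this per-query score over all queries in a run.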
```python
import pytrec_eval


def evaluate(run):
    evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10"})
    evaluation = evaluator.evaluate(run)
    total = 0
    for ev in evaluation:
        total += evaluation[ev]["ndcg_cut_10"]
    return total / len(evaluation)


for run in runs:
    print(run, "\tndcg_cut_10: {:.4f}".format(evaluate(runs[run])))
```

```
query_exact 	ndcg_cut_10: 0.7870
query_256 	ndcg_cut_10: 0.7574
query_256_ann 	ndcg_cut_10: 0.7552
query_rerank 	ndcg_cut_10: 0.7886
```

## Conclusions[¶](#conclusions)

What do the numbers mean? They are good: highly relevant results. This is no great surprise, as the OpenAI embedding models are reported to score high on the [Massive Text Embedding Benchmark](https://github.com/embeddings-benchmark/mteb), of which our [BEIR](https://github.com/beir-cellar/beir)/TREC-COVID dataset is a part.

More interesting to us, querying with the first 256 dimensions still gives quite good results, while requiring only **8.3%** of the memory. We also note that although the HNSW index is an approximation, result quality is impacted very little, while producing the results an order of magnitude faster.

When adding a second phase to re-rank the top 100 hits using the full embeddings, the results are as good as the exact search, while retaining the lower latency, giving us the best of both worlds.

## Summary[¶](#summary)

For those interested in learning more about Vespa, join the [Vespa community on Slack](https://vespatalk.slack.com/) to exchange ideas, seek assistance, or stay in the loop on the latest Vespa developments.

We can now delete the cloud instance:

```python
vespa_cloud.delete()
```

# Tuning ANN-parameters for your Vespa application[¶](#tuning-ann-parameters-for-your-vespa-application)

You probably want to be somewhat familiar with Vespa and ANN search before going through this example. Recommended background reading:

- [Vespa nearest neighbor search - a practical guide](https://docs.vespa.ai/en/nearest-neighbor-search-guide.html)
- [ANN Search in Vespa](https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/)
- [Query Time Constrained Approximate Nearest Neighbor Search](https://blog.vespa.ai/constrained-approximate-nearest-neighbor-search/)
- [Additions to HNSW in Vespa: ACORN-1 and Adaptive Beam Search](https://blog.vespa.ai/additions-to-hnsw/)
- [A Short Guide to Tweaking Vespa's ANN Parameters](https://blog.vespa.ai/tweaking-ann-parameters/)

Approximate Nearest Neighbor (ANN) search is a powerful way to make vector search scalable and efficient. In Vespa, this is implemented by building HNSW graphs for embedding fields. There are many parameters that control how an ANN query is processed in Vespa.

## Why different strategies are needed[¶](#why-different-strategies-are-needed)

For a search that uses *only* vector similarity for retrieval, ANN works very well: you can just query the HNSW index and get (enough) relevant results back very fast. However, most Vespa applications are more complex, and implement some sort of hybrid retrieval strategy, often combining vector similarity with filtering on metadata fields and/or lexical matching (weakAnd). In this case, it is not obvious whether applying the filter first and doing an exact search will be more efficient than doing HNSW (in different variations, as we will get back to). The *hit-ratio* of the filter for a given query determines the most efficient strategy for that query.
Therefore, while Vespa's default parameter values work well in most use cases, one can usually benefit from further tuning these parameters for each application / use case. This notebook will demonstrate how this tuning can be done with the recent addition of the `VespaNNParameterOptimizer` class. Hopefully, by stepping through this notebook, you will have learned how you can apply the same steps to tune *your* Vespa application's ANN configuration parameters to get faster search responses while still maintaining acceptable recall.

## A short note on `recall`[¶](#a-short-note-on-recall)

It is worth noting that our definition of `recall` in the context of tuning NN-parameters differs slightly from the `recall` definition used in e.g. `VespaMatchEvaluator`. When optimizing NN-parameters, the recall is the fraction of the top K (targetHits) documents scored by exact distance that are also retrieved by the given strategy. This approach does not need a set of `relevant_docs`, in contrast to the `VespaMatchEvaluator`, which calculates recall as the fraction of relevant documents that are retrieved. The good news is that a set of representative queries is all we need to tune these parameters to find the values that ensure a fast response time *and* maintain acceptable recall across your provided queries.

## The different strategies[¶](#the-different-strategies)

With the addition of ACORN-1 and Adaptive Beam Search to Vespa's ANN implementation, Vespa switches between one of the following four strategies:

1. HNSW Search with Pre-Filtering.
1. HNSW Search with Pre-Filtering - Using the Filter First/ACORN-1 strategy.
1. HNSW Search with Post-Filtering: Regular HNSW with the filter applied after HNSW retrieval.
1. Exact Nearest-Neighbor Search with Pre-Filtering.

See [Additions to HNSW in Vespa: ACORN-1 and Adaptive Beam Search](https://blog.vespa.ai/additions-to-hnsw/) for more details on these strategies.
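The tuning-time notion of recall described above can be made concrete with a minimal sketch (an illustrative helper, not part of pyvespa): compare the document ids returned by a strategy against the top-K ids from an exact nearest-neighbor search.

```python
def ann_recall(exact_top_k, retrieved):
    """Recall as used when tuning NN-parameters: the fraction of the
    top-K documents by exact distance that the strategy also returned.
    No relevance judgements are needed, only the exact-search baseline."""
    exact = set(exact_top_k)
    return len(exact & set(retrieved)) / len(exact)


# Exact search found these top-5 ids; the ANN strategy returned 4 of them.
print(ann_recall([1, 2, 3, 4, 5], [1, 2, 3, 5, 9]))  # 0.8
```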
*When* to switch between these strategies is determined by a combination of a per-query, internally calculated `hit-ratio` and the parameters below:

- `filter-first-threshold`: Threshold value (in the range [0.0, 1.0]) deciding if the filter is checked before computing a distance (filter-first heuristic) while searching the HNSW graph for approximate neighbors with filtering. This improves the response time at low hit ratios but causes a dip in recall. The heuristic is used when the filter hit ratio of the query is less than this threshold. The default value is 0.3.
- `approximate-threshold`: Threshold value (in the range [0.0, 1.0]) deciding if a query with an approximate nearestNeighbor operator combined with filters is evaluated by searching the HNSW graph for approximate neighbors with filtering, or performing an exact nearest neighbor search with pre-filtering. The fallback to exact search is chosen when the filter hit ratio of the query is less than this threshold. The default value is 0.02.
- `post-filter-threshold`: Threshold value (in the range [0.0, 1.0]) deciding if a query with an approximate nearestNeighbor operator combined with filters is evaluated using post-filtering instead of the default filtering. Post-filtering is chosen when the filter hit ratio for the query is larger than this threshold. The default value is 1.0, which disables post-filtering.

This is slightly simplified, as some of the hit ratios referred to above may be estimated hit ratios. The function below makes it easier to see which strategy is chosen for an ANN query with a given hit ratio.

In \[ \]:

```
def determine_ann_strategy(
    hit_ratio: float,
    filter_first_threshold: float = 0.3,
    approximate_threshold: float = 0.02,
    post_filter_threshold: float = 1.0,
) -> str:
    """
    Determine the chosen NN-strategy based on hit ratio and thresholds.

    Args:
        hit_ratio: The hit ratio for the current query.
        filter_first_threshold: The threshold below which the filter-first heuristic is preferred.
        approximate_threshold: The threshold below which approximate search is preferred.
        post_filter_threshold: The threshold above which post-filtering is preferred.

    Returns:
        A string indicating the recommended search strategy.
    """
    # 1. Check if the hit ratio is too low for approximate search
    if hit_ratio < approximate_threshold:
        return "Exact Search"
    # 2. Check post-filtering
    if hit_ratio >= post_filter_threshold:
        return "Post-Filtering"
    # 3. Check if the filter-first heuristic should be used
    if hit_ratio < filter_first_threshold:
        return "Filter-First/ACORN-1"
    # 4. Default: standard HNSW with pre-filtering
    return "HNSW with Pre-Filtering"
```

In addition to the parameters that control which strategy is used, we have also introduced a parameter that controls the behavior of the filter-first strategy when it is applied:

- `filter-first-exploration`: Value (in the range [0.0, 1.0]) specifying how aggressively the filter-first heuristic explores the graph when searching the HNSW graph for approximate neighbors with filtering. A higher value means that the graph is explored more aggressively, which improves recall at the cost of response time. The default value is 0.3.

These parameters can be configured per rank profile or overridden per query via query parameters, and they can have a big effect on the performance of queries that combine ANN search with filters. In this notebook, we will show how to tune these parameters to ensure low response time, without losing much recall compared to exact search. As a teaser, this is the report we will produce: (Keep reading if you want to understand more)

Refer to [troubleshooting](https://vespa-engine.github.io/pyvespa/troubleshooting.md) for any problem when running this guide.

**Pre-requisite**: Create a tenant at [cloud.vespa.ai](https://cloud.vespa.ai/), save the tenant name.

Now, let us get started with the practical part.

## The dataset[¶](#the-dataset)

The dataset we will use for this notebook is a subset of the [GIST1M dataset](http://corpus-texmex.irisa.fr/) commonly used for ANN benchmarks. We have enriched each document with a `filter` field of type `array`, which allows us to construct queries with a predictable hit ratio.
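Putting the two previous sections together: because each filter value in this dataset matches a known fraction of the documents, we can predict which strategy the default thresholds will select for each value. The snippet below is a small illustrative sketch; the `strategy` helper simply re-implements the decision order shown earlier inline, and the value-to-hit-ratio mapping follows the dataset description (a filter value `v` matches roughly `(100 - v)`% of the documents):

```python
# Each filter value v in this dataset matches roughly (100 - v)% of the
# documents, so filtering on v yields a predictable hit ratio.
filter_values = [1, 10, 50, 90, 95, 99]


def strategy(hit_ratio: float) -> str:
    # Same decision order as determine_ann_strategy above,
    # with Vespa's default threshold values.
    if hit_ratio < 0.02:  # approximate-threshold
        return "Exact Search"
    if hit_ratio >= 1.0:  # post-filter-threshold
        return "Post-Filtering"
    if hit_ratio < 0.3:  # filter-first-threshold
        return "Filter-First/ACORN-1"
    return "HNSW with Pre-Filtering"


for v in filter_values:
    hit_ratio = (100 - v) / 100
    print(f"filter={v:>2}  hit_ratio={hit_ratio:.2f}  -> {strategy(hit_ratio)}")
```

With the default thresholds, only the most selective filter (value 99, hit ratio 0.01) falls back to exact search, while values 90 and 95 use the filter-first heuristic.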
Here is an example document:

```
{
    "put": "id:test:test::499",
    "fields": {
        "id": 499,
        "filter": [1, 10, 50, 90, 95, 99],
        "vec_m16": {
            "values": [0.01345, .., 0.30322] // 960 float values
        }
    }
}
```

99% of the documents include the value 1 in the filter field, 90% of the documents include the value 10 in the filter field, and so on.

## Configuring your application[¶](#configuring-your-application)

In \[1\]:

```
# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
# Replace with your application name (does not need to exist yet)
application = "anntuning"
```

## Downloading the dataset[¶](#downloading-the-dataset)

In \[2\]:

```
import os
import pathlib

import requests
import matplotlib.pyplot as plt

from vespa.io import VespaResponse

data_base_url = (
    "https://data.vespa-cloud.com/tests/performance/nearest-neighbor/gist-data/"
)
# We use a smaller dataset if the SCALE_DOWN flag is set to True
SCALE_DOWN = False
docs_url = (
    data_base_url + "docs.1k.json" if SCALE_DOWN else data_base_url + "docs.300k.json"
)
query_url = (
    data_base_url + "query_vectors.10.txt"
    if SCALE_DOWN
    else data_base_url + "query_vectors.100.txt"
)
NUMBER_OF_HITS = 10 if SCALE_DOWN else 100
```

In \[3\]:
```
def download_file(url: str, dest_folder: str):
    local_filename = os.path.join(dest_folder, url.split("/")[-1])
    if os.path.exists(local_filename):
        print(f"File {local_filename} already exists, skipping download.")
        return local_filename
    print(f"Downloading {url} to {local_filename}...")
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return local_filename

data_path = "ann_test/"
pathlib.Path(data_path).mkdir(parents=True, exist_ok=True)
docs_path = download_file(docs_url, data_path)
query_path = download_file(query_url, data_path)
```

```
File ann_test/docs.300k.json already exists, skipping download.
File ann_test/query_vectors.100.txt already exists, skipping download.
```

## Defining the Vespa application[¶](#defining-the-vespa-application)

In \[4\]:
```
from vespa.package import (
    ApplicationPackage,
    Schema,
    Document,
    Field,
    RankProfile,
    HNSW,
    DocumentSummary,
    Summary,
)
from vespa.configuration.query_profiles import query_profile, query_profile_type, field

# Define the document with fields
doc = Document(
    fields=[
        Field(
            name="id",
            type="int",
            indexing=["attribute", "summary"],
        ),
        Field(
            name="filter",
            type="array",  # This is our filter field
            indexing=["attribute", "summary"],
            attribute=["fast-search"],
        ),
        Field(
            name="vec_m16",
            type="tensor(x[960])",  # The vector field that we will do ANN search on
            indexing=["attribute", "index", "summary"],
            ann=HNSW(
                distance_metric="euclidean",  # specific to this dataset
                max_links_per_node=16,
                # neighbors_to_explore_at_insert=500, # Specifies how many neighbors to explore when inserting a vector in the HNSW graph. The default value in Vespa is 200. This parameter is called efConstruction in the HNSW paper.
            ),
        ),
    ]
)

# Define the rank profile with HNSW tuning parameters
rank_profile = RankProfile(
    name="default",
    inputs=[
        ("query(q_vec)", "tensor(x[960])"),
    ],
    first_phase="closeness(label,nns)",
    # We will tune some of these by overriding them as query parameters later
    rank_properties=[
        ("approximate-threshold", 0.02),
        ("filter-first-threshold", 0.3),
        ("filter-first-exploration", 0.3),
        ("exploration-slack", 0.0),
    ],
)

# Define a minimal document summary to avoid unnecessary data transfer
minimal_summary = DocumentSummary(name="minimal", summary_fields=[Summary(name="id")])

# Create the schema
schema = Schema(
    name=application,
    document=doc,
    rank_profiles=[rank_profile],
    document_summaries=[minimal_summary],
)

# We also define a query profile type for the default query profile to enforce the type of the input tensor
# See https://docs.vespa.ai/en/query-profiles.html#query-profile-types
qp = query_profile(
    id="default",
    type="root",
)
qpt = query_profile_type(
    field(
        name="ranking.features.query(q_vec)",
        type="tensor(x[960])",
    ),
    id="root",
    inherits="native",
)

# Create the application package
app_package = ApplicationPackage(
    name=application, schema=[schema], query_profile_config=[qp, qpt]
)
```

It is often useful to dump the application package to files for inspection before deploying.

In \[5\]:

```
app_package.to_files("ann_test")
```

In \[6\]:

```
from vespa.deployment import VespaCloud
from vespa.application import Vespa

# Key is only used for CI/CD. Can be removed if logging in interactively
key = os.getenv("VESPA_TEAM_API_KEY", None)
if key is not None:
    key = key.replace(r"\n", "\n")  # To parse key correctly

vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=application,
    key_content=key,  # Key is only used for CI/CD. Can be removed if logging in interactively
    application_package=app_package,
)
```

```
Setting application...
Running: vespa config set application vespa-team.anntuning.default
Setting target cloud...
Running: vespa config set target cloud
Api-key found for control plane access. Using api-key.
```

Now deploy the app to Vespa Cloud dev zone. The first deployment typically takes a few minutes until the endpoint is up. (Applications that for example refer to large onnx-models may take a bit longer.)

In \[55\]:

```
app: Vespa = vespa_cloud.deploy()
```

```
Deployment started in run 21 of dev-aws-us-east-1c for vespa-team.anntuning. This may take a few minutes the first time.
INFO [12:54:05] Deploying platform version 8.608.33 and application dev build 19 for dev-aws-us-east-1c of default ...
INFO [12:54:05] Using CA signed certificate version 1
INFO [12:54:05] Using 1 nodes in container cluster 'anntuning_container'
INFO [12:54:09] Using 1 nodes in container cluster 'anntuning_container'
INFO [12:54:13] Session 388963 for tenant 'vespa-team' prepared and activated.
INFO [12:54:13] ######## Details for all nodes ########
INFO [12:54:13] h129348a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [12:54:13] --- platform vespa/cloud-tenant-rhel8:8.608.33
INFO [12:54:13] --- container on port 4080 has config generation 388960, wanted is 388963
INFO [12:54:13] --- metricsproxy-container on port 19092 has config generation 388963, wanted is 388963
INFO [12:54:13] h127903b.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [12:54:13] --- platform vespa/cloud-tenant-rhel8:8.608.33
INFO [12:54:13] --- storagenode on port 19102 has config generation 388963, wanted is 388963
INFO [12:54:13] --- searchnode on port 19107 has config generation 388963, wanted is 388963
INFO [12:54:13] --- distributor on port 19111 has config generation 388960, wanted is 388963
INFO [12:54:13] --- metricsproxy-container on port 19092 has config generation 388963, wanted is 388963
INFO [12:54:13] h128504f.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [12:54:13] --- platform vespa/cloud-tenant-rhel8:8.608.33
INFO [12:54:13] --- logserver-container on port 4080 has config generation 388963, wanted is 388963
INFO [12:54:13] --- metricsproxy-container on port 19092 has config generation 388960, wanted is 388963
INFO [12:54:13] h119183a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [12:54:13] --- platform vespa/cloud-tenant-rhel8:8.608.33
INFO [12:54:13] --- container-clustercontroller on port 19050 has config generation 388963, wanted is 388963
INFO [12:54:13] --- metricsproxy-container on port 19092 has config generation 388960, wanted is 388963
INFO [12:54:44] Found endpoints:
INFO [12:54:44] - dev.aws-us-east-1c
INFO [12:54:44] |-- https://cbd3ac16.e1b24e47.z.vespa-app.cloud/ (cluster 'anntuning_container')
INFO [12:54:44] Deployment of new application revision complete!
Only region: aws-us-east-1c available in dev environment.
Found mtls endpoint for anntuning_container
URL: https://cbd3ac16.e1b24e47.z.vespa-app.cloud/
Application is up!
```

## Feeding the documents[¶](#feeding-the-documents)

In \[8\]:

```
# Load and feed documents
import json

with open(docs_path, "r") as f:
    docs = json.load(f)
```

In \[9\]:

```
def docs_gen(docs):
    for doc in docs:
        yield {
            "id": str(doc["fields"]["id"]),
            "fields": doc["fields"],
        }
```

In \[ \]:

```
def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(
            "Id "
            + id
            + " failed : "
            + response.json["id"]
            + ", Status code: "
            + str(response.status_code)
        )

app.feed_iterable(docs_gen(docs), callback=callback)
```

Let us run a test query:

In \[60\]:

```
resp = app.query(yql="select * from sources * where true limit 1;")
resp.status_code
```

Out\[60\]:

```
200
```

## Constructing queries[¶](#constructing-queries)

As we noted in the introduction, a set of representative queries is all that is needed for tuning the ANN parameters. When measuring recall, we will compare against the results of an exact nearest neighbor search. We will use the [Querybuilder API](https://vespa-engine.github.io/pyvespa/query.md#using-the-querybuilder-dsl-api) to construct our queries.

In \[13\]:
```
import vespa.querybuilder as qb

def vector_to_query(vec_str: str, filter_value: int) -> dict:
    return {
        "yql": str(
            qb.select("*")
            .from_(application)
            .where(
                qb.nearestNeighbor(
                    "vec_m16",
                    "q_vec",
                    annotations={
                        "targetHits": NUMBER_OF_HITS,
                        "approximate": True,
                        "label": "nns",
                    },
                )
                & (qb.QueryField("filter") == filter_value),
            )
        ),
        "hits": 10,
        "presentation.summary": "minimal",
        "timeout": "20s",
        "ranking.features.query(q_vec)": vec_str.strip(),
    }
```

In \[14\]:

```
with open(query_path, "r") as f:
    query_vectors = f.readlines()

# Filter values
filter_percentage = [1, 10, 50, 90, 95, 99]

# We will construct queries for each combination of query vector and filter value
queries = []
# We will also construct a single query per filter value for hit ratio evaluation.
# The vector does not affect the hit ratio, so it is overkill to run a query for each vector
# just to determine the hit ratio
hitratio_queries = []  # will only have one query per filter value, choosing the last vector arbitrarily
for filter_value in filter_percentage:
    for vec in query_vectors:
        queries.append(vector_to_query(vec, filter_value))
    hitratio_queries.append(queries[-1])
print(len(queries), len(hitratio_queries))
```

```
600 6
```

Let us now run a test query against our Vespa app.

In \[15\]:

```
resp = app.query(queries[0])
resp.json
```

Out\[15\]:

```
{'root': {'id': 'toplevel', 'relevance': 1.0, 'fields': {'totalCount': 100}, 'coverage': {'coverage': 100, 'documents': 300000, 'full': True, 'nodes': 1, 'results': 1, 'resultsFull': 1}, 'children': [{'id': 'index:anntuning_content/0/546939e42a88c72c83bda0f7', 'relevance': 0.5514919500458465, 'source': 'anntuning_content', 'fields': {'sddocname': 'anntuning', 'id': 81360}}, {'id': 'index:anntuning_content/0/9e6b6ea9df591d1672385a96', 'relevance': 0.5493186760384602, 'source': 'anntuning_content', 'fields': {'sddocname': 'anntuning', 'id': 284987}}, {'id': 'index:anntuning_content/0/33ae6c85ba2829b2be268cd6', 'relevance': 0.548701224770975, 'source': 'anntuning_content', 'fields': {'sddocname': 'anntuning', 'id': 238816}}, {'id': 'index:anntuning_content/0/355d6a2adcb93b4614536016', 'relevance': 0.5464117465305897, 'source': 'anntuning_content', 'fields': {'sddocname': 'anntuning', 'id': 222086}}, {'id': 'index:anntuning_content/0/989ef83d7e89814cbdc07857', 'relevance': 0.5457934983925564, 'source': 'anntuning_content', 'fields': {'sddocname': 'anntuning', 'id': 118458}}, {'id': 'index:anntuning_content/0/f19866c7cc86f9a207a20b11', 'relevance': 0.5454133260745831, 'source': 'anntuning_content', 'fields': {'sddocname': 'anntuning', 'id': 124745}}, {'id': 'index:anntuning_content/0/d61de96b8be49977bcf98022', 'relevance': 0.5452071467202191, 'source': 'anntuning_content', 'fields': {'sddocname':
'anntuning', 'id': 147438}}, {'id': 'index:anntuning_content/0/e0b5114128914163c52411f2', 'relevance': 0.5437820844250533, 'source': 'anntuning_content', 'fields': {'sddocname': 'anntuning', 'id': 65659}}, {'id': 'index:anntuning_content/0/414e0dfb4e7f7529c8a25ba3', 'relevance': 0.5431368276710448, 'source': 'anntuning_content', 'fields': {'sddocname': 'anntuning', 'id': 115992}}, {'id': 'index:anntuning_content/0/64dd16fb44938771565d7cb6', 'relevance': 0.5430208957250813, 'source': 'anntuning_content', 'fields': {'sddocname': 'anntuning', 'id': 72752}}]}}
```

Great, we can see that we get some documents returned. The `relevance` score here is defined by the `closeness(label, nns)` expression we defined in our rank profile when creating our application package.

## Running the Optimizer[¶](#running-the-optimizer)

Running the optimization is as simple as this:

In \[16\]:

```
from vespa.evaluation import VespaNNParameterOptimizer

optimizer = VespaNNParameterOptimizer(
    app=app,
    queries=queries,
    hits=NUMBER_OF_HITS,
    buckets_per_percent=2,
    print_progress=True,
)
report = optimizer.run()
```

```
Distributing queries to buckets
No queries found with filtered-out ratio in [0.25,0.5)
Warning: Selection of queries might not cover enough hit ratios to get meaningful results.
{'buckets_per_percent': 2, 'bucket_interval_width': 0.005, 'non_empty_buckets': [2, 20, 100, 180, 190, 198], 'filtered_out_ratios': [0.01, 0.1, 0.5, 0.9, 0.95, 0.99], 'hit_ratios': [0.99, 0.9, 0.5, 0.09999999999999998, 0.050000000000000044, 0.010000000000000009], 'query_distribution': [100, 100, 100, 100, 100, 100]} Determining suggestion for filterFirstExploration Benchmarking: 100.0% Computing recall: 100.0% Benchmarking: 100.0% Computing recall: 100.0% Testing 0.5 Benchmarking: 100.0% Computing recall: 100.0% Testing 0.25 Benchmarking: 100.0% Computing recall: 100.0% Testing 0.375 Benchmarking: 100.0% Computing recall: 100.0% Testing 0.3125 Benchmarking: 100.0% Computing recall: 100.0% Testing 0.28125 Benchmarking: 100.0% Computing recall: 100.0% Testing 0.265625 Benchmarking: 100.0% Computing recall: 100.0% Testing 0.2734375 Benchmarking: 100.0% Computing recall: 100.0% {'suggestion': 0.27734375, 'benchmarks': {0.0: [5.586, 4.270000000000001, 4.047999999999999, 3.2960000000000003, 2.384999999999999, 1.5119999999999996], 1.0: [4.193000000000001, 3.9030000000000014, 3.9050000000000007, 4.005000000000001, 5.777999999999999, 9.513], 0.5: [4.010999999999999, 4.063, 3.6319999999999983, 3.8169999999999997, 4.911999999999999, 6.581999999999999], 0.25: [3.948000000000001, 3.9579999999999997, 3.653, 3.8739999999999997, 2.639000000000001, 2.095], 0.375: [3.998000000000001, 3.9939999999999993, 4.691, 3.543, 3.1250000000000004, 4.095999999999999], 0.3125: [3.8699999999999983, 6.257000000000001, 3.7429999999999994, 3.34, 2.6889999999999996, 2.9420000000000006], 0.28125: [5.361000000000001, 4.533000000000001, 3.4229999999999996, 3.113, 2.755999999999999, 2.407000000000002], 0.265625: [4.251000000000001, 3.9019999999999997, 3.737000000000001, 2.911000000000001, 2.6110000000000007, 2.2360000000000007], 0.2734375: [4.164999999999998, 3.9270000000000005, 3.8150000000000013, 3.035, 2.536000000000001, 2.3400000000000003]}, 'recall_measurements': {0.0: [0.8794000000000001, 
0.8786999999999999, 0.8868000000000003, 0.9470999999999998, 0.9058000000000003, 0.6365000000000002], 1.0: [0.8795000000000002, 0.8788999999999999, 0.8872000000000003, 0.9665999999999997, 0.9859999999999995, 0.9957999999999996], 0.5: [0.8795000000000002, 0.8788999999999999, 0.8872000000000003, 0.9647, 0.9775999999999996, 0.9897999999999998], 0.25: [0.8795000000000002, 0.8788999999999999, 0.8872000000000003, 0.9530999999999998, 0.9316999999999993, 0.8238999999999996], 0.375: [0.8795000000000002, 0.8788999999999999, 0.8872000000000003, 0.9599000000000002, 0.9594999999999996, 0.9498999999999996], 0.3125: [0.8795000000000002, 0.8788999999999999, 0.8872000000000003, 0.9566000000000001, 0.945, 0.903], 0.28125: [0.8795000000000002, 0.8788999999999999, 0.8872000000000003, 0.9550999999999997, 0.9384999999999998, 0.8628999999999997], 0.265625: [0.8795000000000002, 0.8788999999999999, 0.8872000000000003, 0.9540999999999998, 0.9339999999999996, 0.8418999999999996], 0.2734375: [0.8795000000000002, 0.8788999999999999, 0.8872000000000003, 0.9543999999999998, 0.9358999999999998, 0.8525999999999998]}} Determining suggestion for filterFirstThreshold Benchmarking: 100.0% Computing recall: 100.0% Benchmarking: 100.0% Computing recall: 100.0% {'suggestion': 0.48, 'benchmarks': {'hnsw': [3.0120000000000005, 2.733, 3.5680000000000005, 9.578, 15.017999999999994, 42.60899999999999], 'filter_first': [4.036, 3.971000000000002, 3.846000000000002, 3.095999999999999, 2.51, 2.8889999999999985]}, 'recall_measurements': {'hnsw': [0.8328000000000002, 0.8424000000000003, 0.9005999999999993, 0.9740999999999996, 0.9848999999999992, 0.9942999999999994], 'filter_first': [0.8795000000000002, 0.8788999999999999, 0.8872000000000003, 0.9547999999999998, 0.9371999999999998, 0.8579999999999997]}} Determining suggestion for approximateThreshold Benchmarking: 100.0% Benchmarking: 100.0% Computing recall: 100.0% {'suggestion': 0.015, 'benchmarks': {'exact': [72.68499999999996, 69.22399999999999, 
39.19899999999999, 10.81, 7.9639999999999995, 1.8649999999999998], 'filter_first': [2.707, 2.841999999999999, 3.6020000000000003, 3.011, 2.6660000000000004, 2.3310000000000004]}, 'recall_measurements': {'exact': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'filter_first': [0.8328000000000002, 0.8424000000000003, 0.9005999999999993, 0.9547999999999998, 0.9371999999999998, 0.8579999999999997]}} Determining suggestion for postFilterThreshold Benchmarking: 100.0% Benchmarking: 100.0% Computing recall: 100.0% Computing recall: 100.0% {'suggestion': 0.485, 'benchmarks': {'post_filtering': [1.87, 1.994, 3.204999999999998, 9.596, 15.147000000000002, 15.11], 'filter_first': [3.131, 2.759000000000001, 3.4420000000000006, 3.1869999999999994, 2.567000000000001, 1.6219999999999997]}, 'recall_measurements': {'post_filtering': [0.8333000000000002, 0.8403000000000002, 0.8968, 0.9518999999999997, 0.9507999999999996, 0.1916], 'filter_first': [0.8328000000000002, 0.8424000000000003, 0.9005999999999993, 0.9547999999999998, 0.9371999999999998, 1.0]}} ``` The returned `report` should give us a lot of valuable info. In \[17\]: Copied! 
``` report ``` report Out\[17\]: ``` {'buckets': {'buckets_per_percent': 2, 'bucket_interval_width': 0.005, 'non_empty_buckets': [2, 20, 100, 180, 190, 198], 'filtered_out_ratios': [0.01, 0.1, 0.5, 0.9, 0.95, 0.99], 'hit_ratios': [0.99, 0.9, 0.5, 0.09999999999999998, 0.050000000000000044, 0.010000000000000009], 'query_distribution': [100, 100, 100, 100, 100, 100]}, 'filterFirstExploration': {'suggestion': 0.27734375, 'benchmarks': {0.0: [5.586, 4.270000000000001, 4.047999999999999, 3.2960000000000003, 2.384999999999999, 1.5119999999999996], 1.0: [4.193000000000001, 3.9030000000000014, 3.9050000000000007, 4.005000000000001, 5.777999999999999, 9.513], 0.5: [4.010999999999999, 4.063, 3.6319999999999983, 3.8169999999999997, 4.911999999999999, 6.581999999999999], 0.25: [3.948000000000001, 3.9579999999999997, 3.653, 3.8739999999999997, 2.639000000000001, 2.095], 0.375: [3.998000000000001, 3.9939999999999993, 4.691, 3.543, 3.1250000000000004, 4.095999999999999], 0.3125: [3.8699999999999983, 6.257000000000001, 3.7429999999999994, 3.34, 2.6889999999999996, 2.9420000000000006], 0.28125: [5.361000000000001, 4.533000000000001, 3.4229999999999996, 3.113, 2.755999999999999, 2.407000000000002], 0.265625: [4.251000000000001, 3.9019999999999997, 3.737000000000001, 2.911000000000001, 2.6110000000000007, 2.2360000000000007], 0.2734375: [4.164999999999998, 3.9270000000000005, 3.8150000000000013, 3.035, 2.536000000000001, 2.3400000000000003]}, 'recall_measurements': {0.0: [0.8794000000000001, 0.8786999999999999, 0.8868000000000003, 0.9470999999999998, 0.9058000000000003, 0.6365000000000002], 1.0: [0.8795000000000002, 0.8788999999999999, 0.8872000000000003, 0.9665999999999997, 0.9859999999999995, 0.9957999999999996], 0.5: [0.8795000000000002, 0.8788999999999999, 0.8872000000000003, 0.9647, 0.9775999999999996, 0.9897999999999998], 0.25: [0.8795000000000002, 0.8788999999999999, 0.8872000000000003, 0.9530999999999998, 0.9316999999999993, 0.8238999999999996], 0.375: [0.8795000000000002, 
0.8788999999999999, 0.8872000000000003, 0.9599000000000002, 0.9594999999999996, 0.9498999999999996], 0.3125: [0.8795000000000002, 0.8788999999999999, 0.8872000000000003, 0.9566000000000001, 0.945, 0.903], 0.28125: [0.8795000000000002, 0.8788999999999999, 0.8872000000000003, 0.9550999999999997, 0.9384999999999998, 0.8628999999999997], 0.265625: [0.8795000000000002, 0.8788999999999999, 0.8872000000000003, 0.9540999999999998, 0.9339999999999996, 0.8418999999999996], 0.2734375: [0.8795000000000002, 0.8788999999999999, 0.8872000000000003, 0.9543999999999998, 0.9358999999999998, 0.8525999999999998]}}, 'filterFirstThreshold': {'suggestion': 0.48, 'benchmarks': {'hnsw': [3.0120000000000005, 2.733, 3.5680000000000005, 9.578, 15.017999999999994, 42.60899999999999], 'filter_first': [4.036, 3.971000000000002, 3.846000000000002, 3.095999999999999, 2.51, 2.8889999999999985]}, 'recall_measurements': {'hnsw': [0.8328000000000002, 0.8424000000000003, 0.9005999999999993, 0.9740999999999996, 0.9848999999999992, 0.9942999999999994], 'filter_first': [0.8795000000000002, 0.8788999999999999, 0.8872000000000003, 0.9547999999999998, 0.9371999999999998, 0.8579999999999997]}}, 'approximateThreshold': {'suggestion': 0.015, 'benchmarks': {'exact': [72.68499999999996, 69.22399999999999, 39.19899999999999, 10.81, 7.9639999999999995, 1.8649999999999998], 'filter_first': [2.707, 2.841999999999999, 3.6020000000000003, 3.011, 2.6660000000000004, 2.3310000000000004]}, 'recall_measurements': {'exact': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'filter_first': [0.8328000000000002, 0.8424000000000003, 0.9005999999999993, 0.9547999999999998, 0.9371999999999998, 0.8579999999999997]}}, 'postFilterThreshold': {'suggestion': 0.485, 'benchmarks': {'post_filtering': [1.87, 1.994, 3.204999999999998, 9.596, 15.147000000000002, 15.11], 'filter_first': [3.131, 2.759000000000001, 3.4420000000000006, 3.1869999999999994, 2.567000000000001, 1.6219999999999997]}, 'recall_measurements': {'post_filtering': [0.8333000000000002, 
0.8403000000000002, 0.8968, 0.9518999999999997, 0.9507999999999996, 0.1916], 'filter_first': [0.8328000000000002, 0.8424000000000003, 0.9005999999999993, 0.9547999999999998, 0.9371999999999998, 1.0]}}}
```

## Visualize Optimization Results[¶](#visualize-optimization-results)

Let's visualize the optimization report to better understand the recommendations.

Click to see the plotting code

```
from typing import Any, Dict, Tuple

import numpy as np

def plot_optimization_report( report: Dict[str, Any], figsize: Tuple[int, int] = (18, 14) ) -> plt.Figure: """ Create a comprehensive visualization of the VespaNNParameterOptimizer report. This function creates a Tufte-inspired multi-panel plot showing parameters in the order they are calculated: 1. filterFirstExploration (response time + recall) 2. filterFirstThreshold (response time + recall) 3. approximateThreshold (response time + recall) 4. postFilterThreshold (response time + recall) Args: report: The report dictionary generated by VespaNNParameterOptimizer.run() figsize: Figure size as (width, height) tuple Returns: matplotlib Figure object """ # Extract data buckets = report["buckets"] hit_ratios = np.array(buckets["hit_ratios"]) * 100 # Convert to percentage query_distribution = buckets["query_distribution"] # Create figure with subplots organized by parameter fig = plt.figure(figsize=figsize) gs = fig.add_gridspec( 5, 3, hspace=0.5, wspace=0.3, top=0.94, bottom=0.05, left=0.06, right=0.97 ) # Apply Tufte-inspired styling plt.rcParams["font.family"] = "serif" plt.rcParams["axes.linewidth"] = 0.5 plt.rcParams["xtick.major.width"] = 0.5 plt.rcParams["ytick.major.width"] = 0.5 def setup_tufte_axes(ax): """Apply minimal Tufte-style formatting to axes""" ax.spines["top"].set_visible(False) ax.spines["right"].set_visible(False) ax.spines["left"].set_linewidth(0.5) ax.spines["bottom"].set_linewidth(0.5) ax.tick_params(labelsize=9) ax.grid(True, alpha=0.2, linewidth=0.5, linestyle="-", color="gray") # Row 0: Query Distribution + Summary Table # 1.
Query Distribution (top left, spans 2 columns) ax_dist = fig.add_subplot(gs[0, 0:2]) # Use actual hit ratio values for x-positions _bars = ax_dist.bar( hit_ratios, query_distribution, width=0.5, color="#4A90E2", alpha=0.7, edgecolor="#2E5C8A", linewidth=0.5, ) ax_dist.set_xlabel("Hit Ratio (%)", fontsize=8) ax_dist.set_ylabel("Number of Queries", fontsize=10, fontweight="bold") ax_dist.set_title( "Query Distribution Across Hit Ratios", fontsize=11, fontweight="bold", pad=10 ) # Set x-axis limits to match other plots (0-100%) ax_dist.set_xlim(-2, 102) setup_tufte_axes(ax_dist) # Add value labels on bars for hr, height in zip(hit_ratios, query_distribution): if height > 0: # Only label bars with data ax_dist.text( hr, height, f"{int(height)}", ha="center", va="bottom", fontsize=8 ) # 2. Summary Table (top right) ax_summary = fig.add_subplot(gs[0, 2]) ax_summary.axis("off") summary_data = [ ["Parameter", "Suggested"], ["", ""], [ "filterFirstExploration", f"{report['filterFirstExploration']['suggestion']:.4f}", ], ["filterFirstThreshold", f"{report['filterFirstThreshold']['suggestion']:.4f}"], ["approximateThreshold", f"{report['approximateThreshold']['suggestion']:.4f}"], ["postFilterThreshold", f"{report['postFilterThreshold']['suggestion']:.4f}"], ["", ""], ["Total Queries", f"{sum(query_distribution)}"], ["Number of buckets", f"{len(buckets)}"], ["Buckets per %", f"{buckets['buckets_per_percent']}"], ] table = ax_summary.table( cellText=summary_data, cellLoc="left", loc="center", colWidths=[0.65, 0.35], bbox=[0, 0, 1, 1], ) table.auto_set_font_size(False) table.set_fontsize(8) # Style the table for i, row in enumerate(summary_data): for j in range(len(row)): cell = table[(i, j)] if i == 0: # Header cell.set_facecolor("#34495E") cell.set_text_props(weight="bold", color="white", size=8) elif i == 1 or i == 6: # Separator rows cell.set_facecolor("#ECF0F1") cell.set_height(0.02) elif i >= 7: # Info section cell.set_facecolor("#F8F9FA") else: 
cell.set_facecolor("white") cell.set_edgecolor("#BDC3C7") cell.set_linewidth(0.5) ax_summary.set_title("Summary", fontsize=11, fontweight="bold", pad=10, loc="left") # Row 1: filterFirstExploration (calculated first) # Response Time ax_ffe_rt = fig.add_subplot(gs[1, 0:2]) ffe_data = report["filterFirstExploration"] ffe_suggestion = ffe_data["suggestion"] exploration_values = sorted([k for k in ffe_data["benchmarks"].keys()]) colors = plt.cm.viridis(np.linspace(0.2, 0.9, len(exploration_values))) for i, exp_val in enumerate(exploration_values): benchmarks = ffe_data["benchmarks"][exp_val] label = f"{exp_val:.3f}" if abs(exp_val - ffe_suggestion) < 0.001: label += " ✓" ax_ffe_rt.plot( hit_ratios, benchmarks, "o-", color=colors[i], linewidth=2, markersize=6, label=label, zorder=10, ) else: ax_ffe_rt.plot( hit_ratios, benchmarks, "o-", color=colors[i], linewidth=1, markersize=4, alpha=0.6, label=label, ) ax_ffe_rt.set_xlabel("Hit Ratio (%)", fontsize=8) ax_ffe_rt.set_ylabel("Response Time (ms)", fontsize=10, fontweight="bold") ax_ffe_rt.set_title( f"1. 
filterFirstExploration: {ffe_suggestion:.4f} (Response Time)", fontsize=11, fontweight="bold", pad=10, ) ax_ffe_rt.legend( fontsize=7, frameon=False, loc="best", title="filterFirstExploration", title_fontsize=8, ) setup_tufte_axes(ax_ffe_rt) # Recall ax_ffe_recall = fig.add_subplot(gs[1, 2]) recall_measurements = ffe_data["recall_measurements"] for i, exp_val in enumerate(exploration_values): recalls = recall_measurements[exp_val] label = f"{exp_val:.3f}" if abs(exp_val - ffe_suggestion) < 0.001: label += " ✓" ax_ffe_recall.plot( hit_ratios, recalls, "o-", color=colors[i], linewidth=2, markersize=6, label=label, zorder=10, ) else: ax_ffe_recall.plot( hit_ratios, recalls, "o-", color=colors[i], linewidth=1, markersize=4, alpha=0.6, label=label, ) ax_ffe_recall.set_xlabel("Hit Ratio (%)", fontsize=8) ax_ffe_recall.set_ylabel("Recall", fontsize=10, fontweight="bold") ax_ffe_recall.set_title("Recall", fontsize=11, fontweight="bold", pad=10) ax_ffe_recall.legend( fontsize=7, frameon=False, loc="best", title="filterFirstExploration", title_fontsize=8, ) setup_tufte_axes(ax_ffe_recall) # Row 2: filterFirstThreshold (calculated second) ax_fft = fig.add_subplot(gs[2, 0:2]) fft_data = report["filterFirstThreshold"] fft_suggestion = fft_data["suggestion"] benchmarks_hnsw = fft_data["benchmarks"]["hnsw"] benchmarks_ff = fft_data["benchmarks"]["filter_first"] ax_fft.plot( hit_ratios, benchmarks_hnsw, "o-", color="#E74C3C", linewidth=1.5, markersize=5, label="HNSW only", alpha=0.8, ) ax_fft.plot( hit_ratios, benchmarks_ff, "s-", color="#27AE60", linewidth=1.5, markersize=5, label="Filter-First", alpha=0.8, ) # Add suggestion line ax_fft.axvline( x=fft_suggestion * 100, color="#555", linestyle=":", linewidth=1.5, label=f"Suggested: {fft_suggestion:.3f}", alpha=0.7, ) ax_fft.set_xlabel("Hit Ratio (%)", fontsize=8) ax_fft.set_ylabel("Response Time (ms)", fontsize=10, fontweight="bold") ax_fft.set_title( f"2. 
filterFirstThreshold: {fft_suggestion:.4f} (Response Time)", fontsize=11, fontweight="bold", pad=10, ) ax_fft.legend(fontsize=9, frameon=False, loc="best") setup_tufte_axes(ax_fft) # Recall for filterFirstThreshold ax_fft_recall = fig.add_subplot(gs[2, 2]) recall_hnsw = fft_data["recall_measurements"]["hnsw"] recall_ff = fft_data["recall_measurements"]["filter_first"] ax_fft_recall.plot( hit_ratios, recall_hnsw, "o-", color="#E74C3C", linewidth=1.5, markersize=5, label="HNSW only", alpha=0.8, ) ax_fft_recall.plot( hit_ratios, recall_ff, "s-", color="#27AE60", linewidth=1.5, markersize=5, label="Filter-First", alpha=0.8, ) # Add suggestion line ax_fft_recall.axvline( x=fft_suggestion * 100, color="#555", linestyle=":", linewidth=1.5, label=f"Suggested: {fft_suggestion:.3f}", alpha=0.7, ) ax_fft_recall.set_xlabel("Hit Ratio (%)", fontsize=8) ax_fft_recall.set_ylabel("Recall", fontsize=10, fontweight="bold") ax_fft_recall.set_title("Recall", fontsize=11, fontweight="bold", pad=10) ax_fft_recall.legend(fontsize=8, frameon=False, loc="best") setup_tufte_axes(ax_fft_recall) # Row 3: approximateThreshold (calculated third) ax_at = fig.add_subplot(gs[3, 0:2]) at_data = report["approximateThreshold"] at_suggestion = at_data["suggestion"] benchmarks_exact = at_data["benchmarks"]["exact"] benchmarks_ann = at_data["benchmarks"]["filter_first"] ax_at.plot( hit_ratios, benchmarks_exact, "o-", color="#9B59B6", linewidth=1.5, markersize=5, label="Exact search", alpha=0.8, ) ax_at.plot( hit_ratios, benchmarks_ann, "s-", color="#F39C12", linewidth=1.5, markersize=5, label="ANN (tuned)", alpha=0.8, ) # Add suggestion line ax_at.axvline( x=at_suggestion * 100, color="#555", linestyle=":", linewidth=1.5, label=f"Suggested: {at_suggestion:.3f}", alpha=0.7, ) ax_at.set_xlabel("Hit Ratio (%)", fontsize=8) ax_at.set_ylabel("Response Time (ms)", fontsize=10, fontweight="bold") ax_at.set_title( f"3. 
approximateThreshold: {at_suggestion:.4f} (Response Time)", fontsize=11, fontweight="bold", pad=10, ) ax_at.legend(fontsize=9, frameon=False, loc="best") setup_tufte_axes(ax_at) # Recall for approximateThreshold ax_at_recall = fig.add_subplot(gs[3, 2]) recall_exact = at_data["recall_measurements"]["exact"] recall_ann = at_data["recall_measurements"]["filter_first"] ax_at_recall.plot( hit_ratios, recall_exact, "o-", color="#9B59B6", linewidth=1.5, markersize=5, label="Exact search", alpha=0.8, ) ax_at_recall.plot( hit_ratios, recall_ann, "s-", color="#F39C12", linewidth=1.5, markersize=5, label="ANN (tuned)", alpha=0.8, ) # Add suggestion line ax_at_recall.axvline( x=at_suggestion * 100, color="#555", linestyle=":", linewidth=1.5, label=f"Suggested: {at_suggestion:.3f}", alpha=0.7, ) ax_at_recall.set_xlabel("Hit Ratio (%)", fontsize=8) ax_at_recall.set_ylabel("Recall", fontsize=10, fontweight="bold") ax_at_recall.set_title("Recall", fontsize=11, fontweight="bold", pad=10) ax_at_recall.legend(fontsize=8, frameon=False, loc="best") setup_tufte_axes(ax_at_recall) # Row 4: postFilterThreshold (calculated fourth) - Response Time + Recall ax_pft_rt = fig.add_subplot(gs[4, 0:2]) pft_data = report["postFilterThreshold"] pft_suggestion = pft_data["suggestion"] benchmarks_post = pft_data["benchmarks"]["post_filtering"] benchmarks_pre = pft_data["benchmarks"]["filter_first"] # Response time comparison ax_pft_rt.plot( hit_ratios, benchmarks_post, "o-", color="#E67E22", linewidth=1.5, markersize=5, label="Post-filtering", alpha=0.8, ) ax_pft_rt.plot( hit_ratios, benchmarks_pre, "s-", color="#1ABC9C", linewidth=1.5, markersize=5, label="Pre-filtering", alpha=0.8, ) ax_pft_rt.axvline( x=pft_suggestion * 100, color="#555", linestyle=":", linewidth=1.5, label=f"Suggested: {pft_suggestion:.3f}", alpha=0.7, ) ax_pft_rt.set_xlabel("Hit Ratio (%)", fontsize=8) ax_pft_rt.set_ylabel("Response Time (ms)", fontsize=10, fontweight="bold") ax_pft_rt.set_title( f"4. 
postFilterThreshold: {pft_suggestion:.4f} (Response Time)", fontsize=11, fontweight="bold", pad=10, ) ax_pft_rt.legend(fontsize=9, frameon=False, loc="best") setup_tufte_axes(ax_pft_rt) # Recall comparison ax_pft_recall = fig.add_subplot(gs[4, 2]) recall_post = pft_data["recall_measurements"]["post_filtering"] recall_pre = pft_data["recall_measurements"]["filter_first"] ax_pft_recall.plot( hit_ratios, recall_post, "o-", color="#E67E22", linewidth=1.5, markersize=5, label="Post-filtering", alpha=0.8, ) ax_pft_recall.plot( hit_ratios, recall_pre, "s-", color="#1ABC9C", linewidth=1.5, markersize=5, label="Pre-filtering", alpha=0.8, ) # Add suggestion line ax_pft_recall.axvline( x=pft_suggestion * 100, color="#555", linestyle=":", linewidth=1.5, label=f"Suggested: {pft_suggestion:.3f}", alpha=0.7, ) ax_pft_recall.set_xlabel("Hit Ratio (%)", fontsize=8) ax_pft_recall.set_ylabel("Recall", fontsize=10, fontweight="bold") ax_pft_recall.set_title("Recall", fontsize=11, fontweight="bold", pad=10) ax_pft_recall.legend(fontsize=8, frameon=False, loc="best") setup_tufte_axes(ax_pft_recall) # Overall title fig.suptitle( "Vespa NN Parameter Optimization Report (Calculation Order)", fontsize=14, fontweight="bold", y=0.98, ) return fig ```

In \[51\]:

```
# Generate the visualization
fig = plot_optimization_report(report, figsize=(18, 12))
plt.show()
```

## Analyzing the report[¶](#analyzing-the-report)

Now, let us take a closer look at what this actually tells us. If you prefer diving into the code yourself, feel free to expand the collapsed snippet below to see the steps of the `VespaNNParameterOptimizer.run()`-method.

Click to see the code for the run-method

```` def run(self) -> Dict[str, Any]: """ Determines suggestions for all parameters supported by this class. This method: 1. Determines the hit-ratios of supplied ANN queries. 2.
Sorts these queries into buckets based on the determined hit-ratio. 3. Determines a suggestion for filterFirstExploration. 4. Determines a suggestion for filterFirstThreshold. 5. Determines a suggestion for approximateThreshold. 6. Determines a suggestion for postFilterThreshold. 7. Reports the determined suggestions and all benchmarks and recall measurements performed. Returns: dict: A dictionary containing the suggested values, information about the query distribution, performed benchmarks, and recall measurements. Example: ```python { "buckets": { "buckets_per_percent": 2, "bucket_interval_width": 0.005, "non_empty_buckets": [ 2, 20, 100, 180, 190, 198 ], "filtered_out_ratios": [ 0.01, 0.1, 0.5, 0.9, 0.95, 0.99 ], "hit_ratios": [ 0.99, 0.9, 0.5, 0.09999999999999998, 0.050000000000000044, 0.010000000000000009 ], "query_distribution": [ 100, 100, 100, 100, 100, 100 ] }, "filterFirstExploration": { "suggestion": 0.39453125, "benchmarks": { "0.0": [ 4.265999999999999, 4.256000000000001, 3.9430000000000005, 3.246999999999998, 2.4610000000000003, 1.768 ], "1.0": [ 3.9259999999999984, 3.6010000000000004, 3.290999999999999, 3.78, 4.927000000000002, 8.415000000000001 ], "0.5": [ 3.6299999999999977, 3.417, 3.4490000000000007, 3.752, 4.257, 5.99 ], "0.25": [ 3.5830000000000006, 3.616, 3.3239999999999985, 3.3200000000000016, 2.654999999999999, 2.3789999999999996 ], "0.375": [ 3.465, 3.4289999999999994, 3.196999999999997, 3.228999999999999, 3.167, 3.700999999999999 ], "0.4375": [ 3.9880000000000013, 3.463000000000002, 3.4650000000000007, 3.5000000000000013, 3.7499999999999982, 4.724000000000001 ], "0.40625": [ 3.4990000000000006, 3.3680000000000003, 3.147000000000001, 3.33, 3.381, 4.083999999999998 ], "0.390625": [ 3.6060000000000008, 3.5269999999999992, 3.2820000000000005, 3.433999999999998, 3.2880000000000007, 3.8609999999999984 ], "0.3984375": [ 3.6870000000000016, 3.386000000000001, 3.336000000000001, 3.316999999999999, 3.5329999999999973, 4.719000000000002 ] }, 
"recall_measurements": { "0.0": [ 0.8758, 0.8768999999999997, 0.8915, 0.9489999999999994, 0.9045999999999998, 0.64 ], "1.0": [ 0.8757, 0.8768999999999997, 0.8909999999999999, 0.9675999999999998, 0.9852999999999996, 0.9957999999999998 ], "0.5": [ 0.8757, 0.8768999999999997, 0.8909999999999999, 0.9660999999999998, 0.9759999999999996, 0.9903 ], "0.25": [ 0.8757, 0.8768999999999997, 0.8909999999999999, 0.9553999999999995, 0.9323999999999996, 0.8123000000000004 ], "0.375": [ 0.8757, 0.8768999999999997, 0.8909999999999999, 0.9615999999999997, 0.9599999999999999, 0.9626000000000002 ], "0.4375": [ 0.8757, 0.8768999999999997, 0.8909999999999999, 0.9642999999999999, 0.9697999999999999, 0.9832 ], "0.40625": [ 0.8757, 0.8768999999999997, 0.8909999999999999, 0.9632, 0.9642999999999999, 0.9763999999999997 ], "0.390625": [ 0.8757, 0.8768999999999997, 0.8909999999999999, 0.9625999999999999, 0.9617999999999999, 0.9688999999999998 ], "0.3984375": [ 0.8757, 0.8768999999999997, 0.8909999999999999, 0.963, 0.9635000000000001, 0.9738999999999999 ] } }, "filterFirstThreshold": { "suggestion": 0.47, "benchmarks": { "hnsw": [ 2.779, 2.725000000000001, 3.151999999999999, 7.138999999999998, 11.362, 32.599999999999994 ], "filter_first": [ 3.543999999999999, 3.454, 3.443999999999999, 3.4129999999999994, 3.4090000000000003, 4.602999999999998 ] }, "recall_measurements": { "hnsw": [ 0.8284999999999996, 0.8368999999999996, 0.9007999999999996, 0.9740999999999996, 0.9852999999999993, 0.9937999999999992 ], "filter_first": [ 0.8757, 0.8768999999999997, 0.8909999999999999, 0.9627999999999999, 0.9630000000000001, 0.9718999999999994 ] } }, "approximateThreshold": { "suggestion": 0.03, "benchmarks": { "exact": [ 33.072, 31.99600000000001, 23.256, 9.155, 6.069000000000001, 2.0949999999999984 ], "filter_first": [ 2.9570000000000003, 2.91, 3.165000000000001, 3.396999999999998, 3.3310000000000004, 4.046 ] }, "recall_measurements": { "exact": [ 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 ], "filter_first": [ 
0.8284999999999996, 0.8368999999999996, 0.9007999999999996, 0.9627999999999999, 0.9630000000000001, 0.9718999999999994 ] } }, "postFilterThreshold": { "suggestion": 0.49, "benchmarks": { "post_filtering": [ 2.0609999999999995, 2.448, 3.097999999999999, 7.200999999999999, 11.463000000000006, 11.622999999999996 ], "filter_first": [ 3.177999999999999, 2.717000000000001, 3.177, 3.5000000000000004, 3.455, 2.1159999999999997 ] }, "recall_measurements": { "post_filtering": [ 0.8288999999999995, 0.8355, 0.8967999999999998, 0.9519999999999997, 0.9512999999999994, 0.19180000000000003 ], "filter_first": [ 0.8284999999999996, 0.8368999999999996, 0.9007999999999996, 0.9627999999999999, 0.9630000000000001, 1.0 ] } } } ``` """ print("Distributing queries to buckets") # Distribute queries to buckets self.determine_hit_ratios_and_distribute_to_buckets(self.queries) # Check if the queries we have are deemed sufficient if not self.has_sufficient_queries(): print( " Warning: Selection of queries might not cover enough hit ratios to get meaningful results." 
) if not self.buckets_sufficiently_filled(): print(" Warning: Only few queries for a specific hit ratio.") bucket_report = { "buckets_per_percent": self.buckets_per_percent, "bucket_interval_width": self.get_bucket_interval_width(), "non_empty_buckets": self.get_non_empty_buckets(), "filtered_out_ratios": self.get_filtered_out_ratios(), "hit_ratios": list(map(lambda x: 1 - x, self.get_filtered_out_ratios())), "query_distribution": self.get_query_distribution()[1], } if self.print_progress: print(bucket_report) # Determine filter-first parameters first # filterFirstExploration if self.print_progress: print("Determining suggestion for filterFirstExploration") filter_first_exploration_report = self.suggest_filter_first_exploration() filter_first_exploration = filter_first_exploration_report["suggestion"] if self.print_progress: print(filter_first_exploration_report) # filterFirstThreshold if self.print_progress: print("Determining suggestion for filterFirstThreshold") filter_first_threshold_report = self.suggest_filter_first_threshold( **{"ranking.matching.filterFirstExploration": filter_first_exploration} ) filter_first_threshold = filter_first_threshold_report["suggestion"] if self.print_progress: print(filter_first_threshold_report) # approximateThreshold if self.print_progress: print("Determining suggestion for approximateThreshold") approximate_threshold_report = self.suggest_approximate_threshold( **{ "ranking.matching.filterFirstThreshold": filter_first_threshold, "ranking.matching.filterFirstExploration": filter_first_exploration, } ) approximate_threshold = approximate_threshold_report["suggestion"] if self.print_progress: print(approximate_threshold_report) # postFilterThreshold if self.print_progress: print("Determining suggestion for postFilterThreshold") post_filter_threshold_report = self.suggest_post_filter_threshold( **{ "ranking.matching.approximateThreshold": approximate_threshold, "ranking.matching.filterFirstThreshold": filter_first_threshold, 
"ranking.matching.filterFirstExploration": filter_first_exploration, } ) if self.print_progress: print(post_filter_threshold_report) report = { "buckets": bucket_report, "filterFirstExploration": filter_first_exploration_report, "filterFirstThreshold": filter_first_threshold_report, "approximateThreshold": approximate_threshold_report, "postFilterThreshold": post_filter_threshold_report, } return report ````

Now, let's inspect the plots, one by one.

1. **Query Distribution Across Hit Ratios:** This plot simply shows how our queries are distributed across hit ratios. This is important to know, as the hit ratios are used to determine which ANN strategy is used. If, for example, we had a timestamp field, and all of our queries for a given use case included a filter that matched almost the whole timestamp range (i.e. had *high hit ratios*), we would know that we won't benefit from tweaking the `approximateThreshold`-parameter, as it only affects queries with *very low hit ratios*. *Note: For a real-world document-query set, you would not expect the queries to be distributed as evenly as this.*
1. **filterFirstExploration:** This plot shows the effect of tweaking filterFirstExploration. Decreasing it improves the response time at low hit ratios but also causes a drop in recall, and vice versa when the value is increased. Here, you probably want to choose a value that balances the two extremes.
1. **filterFirstThreshold:** Here, we can see the effect of the filterFirstThreshold parameter. If we set it to 0.00, i.e., disable it, we get a horrible spike in response time for queries with low hit ratios. On the other hand, we do not want to set it too high, as this would result in a slight increase in response time.
1. **approximateThreshold:** This plot shows when it makes sense to fall back to an exact search, i.e., for which hit ratios the exact search becomes cheaper than an approximate search. We want to set this to a value that is as large as possible while still offering acceptable response time.
Note that the suggestion for this value may result in a lower recall than what is acceptable for your particular case. It should thus be interpreted merely as a suggestion.
1. **postFilterThreshold:** The last plot shows the effect of enabling post-filtering. This offers a small benefit in terms of shorter response time when the hit ratio is high. When the hit ratio gets smaller, not only does the response time spike with post-filtering, but the recall also plummets. **Warning** *Do not* set this too low, as this will cause a degradation of quality (low recall).

Now, we can plot the strategy selection across different hit ratios again, with the suggested parameters:

Click to see the plotting code

``` def plot_ann_strategy_selection( title: str = "ANN Strategy Selection Based on Hit Ratio", filter_first_threshold: float = 0.3, approximate_threshold: float = 0.02, post_filter_threshold: float = 1.0, figsize: tuple = (14, 6), ) -> plt.Figure: """ Visualize which ANN strategy is selected across different hit ratios. Creates a plot showing the four strategy regions with colored backgrounds and annotated threshold values at switching points. Args: filter_first_threshold: Threshold for filter-first heuristic (default 0.3) approximate_threshold: Threshold for exact search fallback (default 0.02) post_filter_threshold: Threshold for post-filtering (default 1.0) figsize: Figure size as (width, height) tuple Returns: matplotlib Figure object Example: >>> fig = plot_ann_strategy_selection( ... filter_first_threshold=0.25, ... approximate_threshold=0.03, ... post_filter_threshold=0.95 ...
) >>> plt.show() """ # Apply Tufte-inspired styling plt.rcParams["font.family"] = "serif" plt.rcParams["axes.linewidth"] = 0.5 # Create figure fig, ax = plt.subplots(figsize=figsize) # Generate hit ratios from 0 to 1 hit_ratios = np.linspace(0, 1, 1000) # Define strategy colors with transparency colors = { "Exact Search": "#9B59B6", "Filter-First/ACORN-1": "#27AE60", "HNSW with Pre-Filtering": "#3498DB", "Post-Filtering": "#E67E22", } # Calculate strategy for each hit ratio strategies = [] for hr in hit_ratios: strategy = determine_ann_strategy( hr, filter_first_threshold, approximate_threshold, post_filter_threshold ) strategies.append(strategy) # Plot colored regions for each strategy current_strategy = strategies[0] start_idx = 0 for i in range(1, len(strategies)): if strategies[i] != current_strategy or i == len(strategies) - 1: end_idx = i if i < len(strategies) - 1 else i ax.axvspan( hit_ratios[start_idx] * 100, hit_ratios[end_idx] * 100, alpha=0.3, color=colors[current_strategy], label=current_strategy, ) # Add text label in the middle of the region mid_point = (hit_ratios[start_idx] + hit_ratios[end_idx]) / 2 * 100 ax.text( mid_point, 0.5, current_strategy, ha="center", va="center", fontsize=10, fontweight="bold", color=colors[current_strategy], alpha=0.8, ) current_strategy = strategies[i] start_idx = i # Add vertical lines and annotations at threshold points thresholds = [ (approximate_threshold, "approximateThreshold", "left"), (filter_first_threshold, "filterFirstThreshold", "left"), (post_filter_threshold, "postFilterThreshold", "right"), ] for threshold, name, align in thresholds: if 0 < threshold <= 1.0: # Only show if threshold is in valid range ax.axvline( x=threshold * 100, color="#2C3E50", linestyle="--", linewidth=2, alpha=0.8, zorder=10, ) # Add annotation with parameter name and value y_pos = 0.85 if name == "filterFirstThreshold" else 0.95 ax.annotate( f"{name}\n{threshold:.3f}", xy=(threshold * 100, y_pos), xytext=(10 if align == "left" else 
-10, 0), textcoords="offset points", ha=align, va="top", fontsize=9, fontweight="bold", bbox=dict( boxstyle="round,pad=0.5", facecolor="white", edgecolor="#2C3E50", linewidth=1.5, alpha=0.9, ), arrowprops=dict( arrowstyle="->", connectionstyle="arc3,rad=0", lw=1.5, color="#2C3E50", ), ) # Set labels and title ax.set_xlabel("Hit Ratio (%)", fontsize=11, fontweight="bold") ax.set_ylabel("Strategy Selection", fontsize=11, fontweight="bold") ax.set_title( title, fontsize=13, fontweight="bold", pad=15, ) # Remove y-axis ticks (not meaningful for this visualization) ax.set_yticks([]) ax.set_xlim(0, 100) ax.set_ylim(0, 1) # Style the axes ax.spines["top"].set_visible(False) ax.spines["right"].set_visible(False) ax.spines["left"].set_visible(False) ax.spines["bottom"].set_linewidth(0.5) ax.tick_params(labelsize=9) ax.grid(True, axis="x", alpha=0.2, linewidth=0.5, linestyle="-", color="gray") plt.tight_layout() # Return the figure, as promised by the docstring and expected by callers return fig ```

In \[74\]:

```
# with optimized parameters from report
fig = plot_ann_strategy_selection(
    filter_first_threshold=report["filterFirstThreshold"]["suggestion"],
    approximate_threshold=report["approximateThreshold"]["suggestion"],
    post_filter_threshold=report["postFilterThreshold"]["suggestion"],
)
plt.show()
```

## Comparing searchtime and recall[¶](#comparing-searchtime-and-recall)

The next thing we want to do is to compare our suggested parameters against the default ones in terms of searchtime and recall. To do this, we will make use of the `VespaNNParameterOptimizer.benchmark()`-method. This method runs a set of queries, and calculates the recall (vs.
exact NN search) and collects the Vespa-reported `searchtime` for the queries, bucketed by their hit ratio. First, let us extract the suggested parameters, as returned in the report.

In \[27\]:

```
def get_suggestions(report):
    suggested_parameters = {}
    prefix = "ranking.matching."
    for param, value in report.items():
        if "suggestion" in value:
            suggested_parameters[prefix + param] = value["suggestion"]
    return suggested_parameters


optimized_params = get_suggestions(report)
optimized_params
```

Out\[27\]:

```
{'ranking.matching.filterFirstExploration': 0.27734375,
 'ranking.matching.filterFirstThreshold': 0.48,
 'ranking.matching.approximateThreshold': 0.015,
 'ranking.matching.postFilterThreshold': 0.485}
```

In \[87\]:
```
def run_benchmark(app: Vespa, params: dict = {}, queries=queries):
    # Create a new optimizer instance for step-by-step execution
    opt = VespaNNParameterOptimizer(
        app=app,
        queries=queries,
        hits=NUMBER_OF_HITS,
        buckets_per_percent=2,
        print_progress=True,
    )
    # Distribute queries to buckets
    opt.determine_hit_ratios_and_distribute_to_buckets(queries)
    bench_results = opt.benchmark(**params)
    recall_results = opt.compute_average_recalls(**params)
    return bench_results, recall_results


default_params = {
    "ranking.matching.filterFirstExploration": 0.3,
    "ranking.matching.filterFirstThreshold": 0.3,
    "ranking.matching.approximateThreshold": 0.02,
    "ranking.matching.postFilterThreshold": 1.00,
}

searchtime_before, recall_before = run_benchmark(app, default_params, queries)
searchtime_after, recall_after = run_benchmark(app, optimized_params, queries)
```

```
Benchmarking: 100.0%
Computing recall: 100.0%
Benchmarking: 100.0%
Computing recall: 100.0%
```

Click to see the plotting code

``` def plot_benchmark_comparison( benchmark_data_1: dict, recall_data_1: dict, benchmark_data_2: dict, recall_data_2: dict, label_1: str = "Before", label_2: str =
"After", title: str = "Benchmark Comparison", figsize: Tuple[int, int] = (14, 10), ) -> plt.Figure: """ Create a visualization comparing two sets of benchmark and recall results. Each dataset is shown in its own row of subplots, with consistent y-axis scales for direct comparison. Args: benchmark_data_1: First dictionary containing searchtime statistics recall_data_1: First dictionary containing recall statistics benchmark_data_2: Second dictionary containing searchtime statistics recall_data_2: Second dictionary containing recall statistics label_1: Label for the first dataset (default: "Before") label_2: Label for the second dataset (default: "After") title: Title for the plot figsize: Figure size as (width, height) tuple Returns: matplotlib Figure object """ # Apply Tufte-inspired styling plt.rcParams["font.family"] = "serif" plt.rcParams["axes.linewidth"] = 0.5 plt.rcParams["xtick.major.width"] = 0.5 plt.rcParams["ytick.major.width"] = 0.5 def setup_tufte_axes(ax): """Apply minimal Tufte-style formatting to axes""" ax.spines["top"].set_visible(False) ax.spines["right"].set_visible(False) ax.spines["left"].set_linewidth(0.5) ax.spines["bottom"].set_linewidth(0.5) ax.tick_params(labelsize=9) ax.grid(True, alpha=0.2, linewidth=0.5, linestyle="-", color="gray") # Extract data hit_ratios = np.array(benchmark_data_1["filtered_out_ratios"]) * 100 # Create figure with 2x2 subplots fig, axes = plt.subplots(2, 2, figsize=figsize) # Extract statistics stats_1 = benchmark_data_1["statistics"] stats_2 = benchmark_data_2["statistics"] recall_stats_1 = recall_data_1["statistics"] recall_stats_2 = recall_data_2["statistics"] # Calculate y-axis limits for consistent scaling # Search time limits (across both datasets) all_searchtime_values = [] for stats in [stats_1, stats_2]: all_searchtime_values.extend(stats["mean"]) all_searchtime_values.extend(stats["median"]) all_searchtime_values.extend(stats["p95"]) all_searchtime_values.extend(stats["p99"]) searchtime_min = 
min(all_searchtime_values)
    searchtime_max = max(all_searchtime_values)
    searchtime_margin = (searchtime_max - searchtime_min) * 0.1
    searchtime_ylim = (
        max(0, searchtime_min - searchtime_margin),
        searchtime_max + searchtime_margin,
    )

    # Recall limits (should be 0-1.05)
    recall_ylim = (0, 1.05)

    # Define colors and markers for each statistic
    colors = {
        "mean": "#2C3E50",
        "median": "#3498DB",
        "p95": "#E74C3C",
        "p99": "#E67E22",
    }
    markers = {"mean": "o-", "median": "s-", "p95": "^-", "p99": "v-"}

    # One panel per (dataset, metric) combination:
    # row 1 = dataset 1 (before/baseline), row 2 = dataset 2 (after/optimized)
    panels = [
        (axes[0, 0], stats_1, benchmark_data_1, f"{label_1} - Search Time", "Search Time (ms)", searchtime_ylim),
        (axes[0, 1], recall_stats_1, recall_data_1, f"{label_1} - Recall", "Recall", recall_ylim),
        (axes[1, 0], stats_2, benchmark_data_2, f"{label_2} - Search Time", "Search Time (ms)", searchtime_ylim),
        (axes[1, 1], recall_stats_2, recall_data_2, f"{label_2} - Recall", "Recall", recall_ylim),
    ]
    for ax, stats, data, panel_title, ylabel, ylim in panels:
        for stat, marker in markers.items():
            is_mean = stat == "mean"
            ax.plot(
                hit_ratios,
                stats[stat],
                marker,
                color=colors[stat],
                linewidth=2 if is_mean else 1.5,
                markersize=6 if is_mean else 5,
                label=stat.capitalize(),
                alpha=1.0 if is_mean else 0.8,
                zorder=10 if is_mean else 2,
            )
        ax.axhline(
            y=data["summary"]["overall_mean"],
            color=colors["mean"],
            linestyle=":",
            linewidth=1,
            alpha=0.4,
        )
        ax.set_xlabel("Hit Ratio (%)", fontsize=10, fontweight="bold")
        ax.set_ylabel(ylabel, fontsize=10, fontweight="bold")
        ax.set_title(panel_title, fontsize=11, fontweight="bold", pad=10)
        ax.set_ylim(ylim)
        ax.legend(fontsize=8, frameon=False, loc="best")
        setup_tufte_axes(ax)

    # Overall title
    fig.suptitle(title, fontsize=13, fontweight="bold", y=0.995)
    plt.tight_layout()
    return fig
```

In \[90\]:

```
# Compare before and after optimization
fig = plot_benchmark_comparison(
    benchmark_data_1=searchtime_before.to_dict(),
    recall_data_1=recall_before.to_dict(),
    benchmark_data_2=searchtime_after.to_dict(),
    recall_data_2=recall_after.to_dict(),
    label_1="Default Parameters",
    label_2="Optimized Parameters",
    title="Performance Comparison: Before vs After Optimization",
    figsize=(14, 5),
)
plt.show()
```

In \[91\]:

```
import pandas as pd
from typing import Tuple


def calculate_metric_diff(
    after_dict: dict, before_dict: dict
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Calculate differences between after and before metrics (searchtime or recall).

    Works for both searchtime (where lower is better) and recall (where higher is better).
    Args:
        after_dict: Dictionary containing metrics after optimization
        before_dict: Dictionary containing metrics before optimization

    Returns:
        Tuple of (df_stats, df_summary) DataFrames with absolute differences
        and percentage changes
    """
    # Extract hit ratios (same for both)
    hit_ratios = after_dict["filtered_out_ratios"]
    metric_name = after_dict.get("metric_name", "metric")

    # Create DataFrame for statistics
    stats_data = []
    for stat_name in ["mean", "median", "p95", "p99"]:
        after_values = np.array(after_dict["statistics"][stat_name])
        before_values = np.array(before_dict["statistics"][stat_name])
        abs_diff = after_values - before_values
        pct_change = ((after_values - before_values) / before_values) * 100
        for i, hr in enumerate(hit_ratios):
            stats_data.append(
                {
                    "hit_ratio": hr,
                    "statistic": stat_name,
                    "before": before_values[i],
                    "after": after_values[i],
                    "abs_diff": abs_diff[i],
                    "pct_change": pct_change[i],
                }
            )
    df_stats = pd.DataFrame(stats_data)

    # Create DataFrame for summary statistics
    summary_data = []
    for metric in ["overall_mean", "overall_median"]:
        before_val = before_dict["summary"][metric]
        after_val = after_dict["summary"][metric]
        abs_diff = after_val - before_val
        pct_change = ((after_val - before_val) / before_val) * 100
        summary_data.append(
            {
                "metric": metric,
                "before": before_val,
                "after": after_val,
                "abs_diff": abs_diff,
                "pct_change": pct_change,
            }
        )
    df_summary = pd.DataFrame(summary_data)

    # Add metric name to both dataframes for clarity
    df_stats["metric_name"] = metric_name
    df_summary["metric_name"] = metric_name

    return df_stats, df_summary


# Calculate differences for searchtime
searchtime_stats, searchtime_summary = calculate_metric_diff(
    searchtime_after.to_dict(), searchtime_before.to_dict()
)

# Calculate differences for recall
recall_stats, recall_summary = calculate_metric_diff(
    recall_after.to_dict(), recall_before.to_dict()
)
searchtime_stats
```

Out\[91\]:

| | hit_ratio | statistic | before | after | abs_diff | pct_change | metric_name |
| --- | --------- | --------- | ------ | ----- | -------- | ---------- | ----------- |
| 0 | 0.01 | mean | 3.984 | 2.063 | -1.921 | -48.217871 | searchtime |
| 1 | 0.10 | mean | 3.228 | 2.173 | -1.055 | -32.682776 | searchtime |
| 2 | 0.50 | mean | 4.367 | 3.863 | -0.504 | -11.541104 | searchtime |
| 3 | 0.90 | mean | 3.791 | 3.365 | -0.426 | -11.237141 | searchtime |
| 4 | 0.95 | mean | 3.105 | 2.930 | -0.175 | -5.636071 | searchtime |
| 5 | 0.99 | mean | 1.851 | 1.782 | -0.069 | -3.727715 | searchtime |
| 6 | 0.01 | median | 3.900 | 2.100 | -1.800 | -46.153846 | searchtime |
| 7 | 0.10 | median | 3.200 | 2.200 | -1.000 | -31.250000 | searchtime |
| 8 | 0.50 | median | 4.400 | 3.800 | -0.600 | -13.636364 | searchtime |
| 9 | 0.90 | median | 3.700 | 3.300 | -0.400 | -10.810811 | searchtime |
| 10 | 0.95 | median | 3.100 | 2.850 | -0.250 | -8.064516 | searchtime |
| 11 | 0.99 | median | 1.800 | 1.700 | -0.100 | -5.555556 | searchtime |
| 12 | 0.01 | p95 | 4.900 | 2.600 | -2.300 | -46.938776 | searchtime |
| 13 | 0.10 | p95 | 4.205 | 2.905 | -1.300 | -30.915577 | searchtime |
| 14 | 0.50 | p95 | 5.300 | 5.400 | 0.100 | 1.886792 | searchtime |
| 15 | 0.90 | p95 | 4.600 | 4.100 | -0.500 | -10.869565 | searchtime |
| 16 | 0.95 | p95 | 3.820 | 3.900 | 0.080 | 2.094241 | searchtime |
| 17 | 0.99 | p95 | 2.200 | 2.205 | 0.005 | 0.227273 | searchtime |
| 18 | 0.01 | p99 | 5.212 | 2.803 | -2.409 | -46.220261 | searchtime |
| 19 | 0.10 | p99 | 4.601 | 3.102 | -1.499 | -32.579874 | searchtime |
| 20 | 0.50 | p99 | 5.613 | 5.601 | -0.012 | -0.213789 | searchtime |
| 21 | 0.90 | p99 | 5.013 | 4.800 | -0.213 | -4.248953 | searchtime |
| 22 | 0.95 | p99 | 4.604 | 4.401 | -0.203 | -4.409209 | searchtime |
| 23 | 0.99 | p99 | 2.504 | 2.501 | -0.003 | -0.119808 | searchtime |

We can see
that the search times improve with the suggested parameters. Note that these improvements may be artificially high due to our random selection of filters.

Now, let us take a look at the recall stats:

In \[92\]:

```
recall_stats
```

Out\[92\]:

| | hit_ratio | statistic | before | after | abs_diff | pct_change | metric_name |
| --- | --------- | --------- | ------ | ------ | -------- | ---------- | ----------- |
| 0 | 0.01 | mean | 0.8328 | 0.8333 | 0.0005 | 0.060038 | recall |
| 1 | 0.10 | mean | 0.8424 | 0.8403 | -0.0021 | -0.249288 | recall |
| 2 | 0.50 | mean | 0.9006 | 0.8968 | -0.0038 | -0.421941 | recall |
| 3 | 0.90 | mean | 0.9559 | 0.9548 | -0.0011 | -0.115075 | recall |
| 4 | 0.95 | mean | 0.9417 | 0.9372 | -0.0045 | -0.477859 | recall |
| 5 | 0.99 | mean | 1.0000 | 1.0000 | 0.0000 | 0.000000 | recall |
| 6 | 0.01 | median | 0.8600 | 0.8600 | 0.0000 | 0.000000 | recall |
| 7 | 0.10 | median | 0.8600 | 0.8600 | 0.0000 | 0.000000 | recall |
| 8 | 0.50 | median | 0.9300 | 0.9150 | -0.0150 | -1.612903 | recall |
| 9 | 0.90 | median | 0.9700 | 0.9700 | 0.0000 | 0.000000 | recall |
| 10 | 0.95 | median | 0.9600 | 0.9450 | -0.0150 | -1.562500 | recall |
| 11 | 0.99 | median | 1.0000 | 1.0000 | 0.0000 | 0.000000 | recall |
| 12 | 0.01 | p95 | 0.9605 | 0.9605 | 0.0000 | 0.000000 | recall |
| 13 | 0.10 | p95 | 0.9600 | 0.9600 | 0.0000 | 0.000000 | recall |
| 14 | 0.50 | p95 | 0.9900 | 0.9900 | 0.0000 | 0.000000 | recall |
| 15 | 0.90 | p95 | 1.0000 | 1.0000 | 0.0000 | 0.000000 | recall |
| 16 | 0.95 | p95 | 1.0000 | 1.0000 | 0.0000 | 0.000000 | recall |
| 17 | 0.99 | p95 | 1.0000 | 1.0000 | 0.0000 | 0.000000 | recall |
| 18 | 0.01 | p99 | 0.9901 | 0.9901 | 0.0000 | 0.000000 | recall |
| 19 | 0.10 | p99 | 0.9900 | 0.9900 | 0.0000 | 0.000000 | recall |
| 20 | 0.50 | p99 | 1.0000 | 1.0000 | 0.0000 | 0.000000 | recall |
| 21 | 0.90 | p99 | 1.0000 | 1.0000 | 0.0000 | 0.000000 | recall |
| 22 | 0.95 | p99 | 1.0000 | 1.0000 | 0.0000 | 0.000000 | recall |
| 23 | 0.99 | p99 | 1.0000 | 1.0000 | 0.0000 | 0.000000 | recall |

We can see that the tradeoff for the reduced latencies is a small reduction in recall.

## Conclusion and next steps[¶](#conclusion-and-next-steps)

We have provided example code and hopefully some intuition that you can use to optimize the ANN search for your own applications. We want to emphasize that there is no "correct" choice of values for these parameters, as it will always be about balancing tradeoffs. Use the report to understand those tradeoffs and make the right choices for your requirements, rather than blindly accepting the suggestions.

Tuning the parameters that determine the ANN strategy helps reduce the latency of some of your most "problematic" queries, as they are likely to suffer from a suboptimal strategy. It is wise to log these high-latency queries and add them to your query set in order to improve them. Make sure that they actually are high-latency queries by rerunning them, to verify that they were not just issued at a time when your application was under heavy load. Also, don't forget to include a representative sample of your other queries in the optimization as well, to make sure that you don't just switch strategy at the cost of those queries.

If you are willing to trade away more recall for even faster search times, you could reduce the [targetHits](https://docs.vespa.ai/en/nearest-neighbor-search.html#querying-using-nearestneighbor-query-operator)-parameter passed to the [nearestNeighbor](https://docs.vespa.ai/en/reference/query-language-reference.html#nearestneighbor)-query operator. (Or increase it if you want more recall at the cost of higher latencies.)

Now that you have tuned your ANN parameters for recall and performance, consider doing quality evaluation. Take a look at the [Evaluating a Vespa application](https://vespa-engine.github.io/pyvespa/evaluating-vespa-application-cloud.md)-notebook for some tips on getting started.
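To make the `targetHits` tradeoff concrete, here is a minimal sketch of a pyvespa query body that sets it explicitly, together with query-time overrides for the tuned ANN matching parameters. The document type (`doc`), field names (`embedding`, `q`), the placeholder vector, and the specific `ranking.matching.*` property values are illustrative assumptions — check the Vespa query API reference for your version before relying on them.

```python
# Sketch only: the schema name "doc", dense field "embedding", and query
# tensor "q" are hypothetical; adjust them to your own application.
target_hits = 100  # lower => faster searches, less recall; higher => the opposite

query_body = {
    "yql": (
        "select * from doc where "
        f"{{targetHits: {target_hits}}}nearestNeighbor(embedding, q)"
    ),
    "input.query(q)": [0.12, -0.05, 0.33],  # placeholder query vector
    "hits": 10,
    # Assumed query-time overrides for the tuned ANN parameters:
    "ranking.matching.filterFirstExploration": 0.3,
    "ranking.matching.explorationSlack": 0.1,
}

# Against a deployed application this would be issued as, e.g.,
# response = app.query(body=query_body), where `app` is a vespa.application.Vespa.
print(query_body["yql"])
# -> select * from doc where {targetHits: 100}nearestNeighbor(embedding, q)
```

Passing the parameters in the query body like this lets you experiment per-query before baking the chosen values into a rank profile.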
## FAQ[¶](#faq)

**Q: When should I consider re-tuning the ANN parameters?**

If either your corpus or your query load changes so that the typical hit ratios for your queries change.

**Q: Will this tuning guarantee low latency across all queries?**

No, but if you observe high-latency queries, consider adding them to your tuning queries and rerun the optimization to ensure they are accounted for.

**Q: What if I observe poor recall for some ANN queries?**

Check where on the hit-ratio scale these queries lie. Maybe this can be remedied by increasing `filterFirstExploration`? Otherwise, the recall can generally be improved by increasing `targetHits` or the `explorationSlack`. Note that this comes at the cost of latency.

**Q: What if I have questions about my ANN parameter choices?**

Please generate the report as shown in this notebook for your application and queries, and reach out to us in our [community slack](https://slack.vespa.ai/). This will allow us to help you much more easily.

### Delete application[¶](#delete-application)

The following will delete the application and data from the dev environment.

In \[93\]:

```
if os.getenv("CI") == "true":
    vespa_cloud.delete()
```

# Billion-scale vector search with Cohere binary embeddings in Vespa[¶](#billion-scale-vector-search-with-cohere-binary-embeddings-in-vespa)

Cohere just released a new embedding API with support for binary and `int8` vectors. Read the announcement in the blog post: [Cohere int8 & binary Embeddings - Scale Your Vector Database to Large Datasets](https://cohere.com/blog/int8-binary-embeddings).

> We are excited to announce that Cohere Embed is the first embedding model that natively supports int8 and binary embeddings.

This is huge because:

- Binarization reduces the storage (disk/memory) footprint from 1024 floats (4096 bytes) per vector to 128 bytes.
- Faster distance calculations using [hamming](https://docs.vespa.ai/en/reference/schema-reference.html#distance-metric) distance, which Vespa natively supports for bits packed into `int8` tensor cells. More on [hamming distance in Vespa](https://docs.vespa.ai/en/reference/schema-reference.html#hamming).
- Multiple vector representations allow for coarse retrieval in hamming space and subsequent phases using higher-resolution representations.
- Drastically reduces deployment cost thanks to tiered storage economics.

Vespa supports `hamming` distance with and without [HNSW indexing](https://docs.vespa.ai/en/approximate-nn-hnsw.html). For those wanting to learn more about binary vectors, we recommend our 2021 blog series on [Billion-scale vector search with Vespa](https://blog.vespa.ai/billion-scale-knn/) and [Billion-scale vector search with Vespa - part two](https://blog.vespa.ai/billion-scale-knn-part-two/).

This notebook demonstrates using the Cohere embeddings in a coarse-to-fine search and re-ranking pipeline that reduces costs while offering the same retrieval (nDCG) accuracy.

- The packed binary vector representation is stored in memory, with an optional HNSW index using hamming distance.
- The `int8` vector representation is stored on disk using Vespa's [paged](https://docs.vespa.ai/en/attributes.html#paged-attributes) option.

At query time:

- Retrieve in hamming space (1000 candidates) as the coarse-level search, using the compact binary representation.
- Re-rank using a dot product between the float version of the query vector (1024 dims) and an unpacked float version of the binary embedding (also 1024 dims).
- A final re-ranking phase using the 1024-dimensional `int8` representations. This phase pages the vector data in from disk using Vespa's [paged](https://docs.vespa.ai/en/attributes.html#paged-attributes) option (unless it is already cached).

Install the dependencies:

In \[ \]:
```
!pip3 install -U pyvespa cohere==4.57 vespacli
```

## Examining the Cohere embeddings[¶](#examining-the-cohere-embeddings)

Let us check out the Cohere embedding API and how we can obtain vector embeddings with different precisions for the same text input (without additional cost). See also the [Cohere embed API doc](https://docs.cohere.com/docs/embed-api).

In \[3\]:

```
import cohere

# Make sure that the environment variable CO_API_KEY is set to your API key
co = cohere.Client()
```

### Some sample documents[¶](#some-sample-documents)

Define a few sample documents that we want to embed.

In \[4\]:

```
documents = [
    "Alan Turing was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist.",
    "Albert Einstein was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time.",
    "Isaac Newton was an English polymath active as a mathematician, physicist, astronomer, alchemist, theologian, and author who was described in his time as a natural philosopher.",
    "Marie Curie was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity",
]
```

In \[5\]:
``` # Compute the embeddings of our sample documents. # Set input_type to "search_document" and embedding_types to "binary" and "int8" embeddings = co.embed( texts=documents, model="embed-english-v3.0", input_type="search_document", embedding_types=["binary", "int8"], ) ``` # Compute the embeddings of our sample documents. # Set input_type to "search_document" and embedding_types to "binary" and "int8" embeddings = co.embed( texts=documents, model="embed-english-v3.0", input_type="search_document", embedding_types=["binary", "int8"], ) In \[6\]: Copied! ``` print(embeddings) ``` print(embeddings) ``` cohere.Embeddings { response_type: embeddings_by_type embeddings: cohere.EmbeddingsByType { float: None int8: [[-23, -22, -52, 18, -42, -48, 2, -8, 6, 44, 73, 9, 3, -44, -25, 15, 19, 3, 18, -19, 6, 17, 0, -62, -14, 46, -8, -14, 20, 22, 10, -40, 10, 48, -20, 40, -8, 8, 29, 0, -27, 11, -39, -28, -93, 33, -89, 4, 15, -41, -12, -2, 7, -23, -15, -21, 47, -9, 88, -107, -91, -50, 65, 27, 5, 5, 52, 27, -15, -4, 14, -7, 6, -1, -17, 13, 17, 74, 26, -9, -4, -1, 56, -15, -7, 6, -17, -25, -23, -38, -38, 78, -61, -27, -53, -20, -3, -8, -9, -18, 9, -24, 14, -13, -40, 90, 40, 24, -48, -7, -11, -116, 36, -56, -15, -1, -6, 31, 31, 8, 44, 80, 36, -35, -24, -13, -36, -64, 44, -11, -35, 46, -43, -68, -40, 12, 32, -8, -1, 58, -9, -4, 49, 3, 9, 44, 45, -33, -52, -25, -53, 27, -67, 22, 33, 29, -32, 36, 37, 83, -17, 19, 66, -17, 4, -57, -57, 20, 19, -20, -3, 18, 43, -16, -8, 29, -45, -39, -42, 121, 73, -49, -128, 127, -19, 41, -10, 55, 38, 13, -66, 1, -52, -35, 59, 6, -60, -35, 20, -11, -20, 58, -50, 27, -1, -27, 0, 33, 36, 39, -22, -6, 0, -43, -34, -4, -2, -27, -37, -19, -48, 30, -59, 33, -79, 27, -51, 38, -46, 7, 99, 0, 46, 21, -39, -13, -1, -87, 22, 65, 42, -47, -66, -109, 73, 77, 47, -79, -17, 28, 8, -2, -2, -36, -12, 35, -41, 25, -1, 13, -17, 57, 98, -31, -26, -23, -3, -8, -13, -33, 22, -13, 6, 63, -64, -12, 5, -11, 0, 27, 5, 50, 35, -7, 11, 64, 9, 30, 31, -14, 2, 53, 23, 54, 21, -19, 
-30, 90, -20, -16, -69, -5, -7, -79, 6, -2, 23, 8, 18, -11, -14, 8, 21, 16, -14, -58, -37, -8, -86, -34, -22, 7, -39, -14, 5, 27, -78, -2, -5, -39, 42, 1, 4, 22, 16, -7, -2, 48, -26, -68, -48, -37, -7, -26, -27, 44, 9, 4, -33, 47, -59, -19, 10, 44, 56, -123, -38, 111, -25, -10, 18, 29, 8, -41, -26, -51, 9, 20, 68, 46, -44, 45, -67, 0, 41, 35, 39, -28, 13, 21, 25, 71, -28, 32, -18, 59, 14, 7, 10, -40, 20, 72, 51, -26, -18, -25, -35, 39, -34, 23, 127, -24, -26, 77, -88, 104, 45, 37, 31, -36, 23, -34, -1, -50, 0, -35, -45, -8, 40, 1, -51, 71, -60, 4, -18, -26, 19, 20, 1, 30, 6, -20, -13, 3, 23, 88, -14, -12, -31, -36, 51, 15, -4, 13, 5, -42, 17, 29, 13, 23, -17, 8, 23, 25, -36, -60, 22, 57, 4, 2, 29, -36, -41, 34, 12, -34, 46, 10, -28, 31, 18, 11, 4, 3, 7, 19, -30, 25, -56, 7, 7, 0, 64, -35, -33, 19, -72, -35, -20, -79, -81, 2, -1, -54, -17, -6, -24, 97, 47, -46, 48, -12, -33, -20, 43, 7, -16, 45, -5, 27, -7, -8, 19, 43, -43, 15, -21, 35, -35, -18, -39, -21, 18, 4, 13, 12, 12, 57, 0, -11, 121, 15, 58, 29, -86, 11, -42, 17, 47, -18, -27, -29, -26, 55, -19, 20, -6, 34, 0, -9, 4, 7, 27, -17, -35, -4, -20, 11, 4, 36, 5, -7, 27, -40, 127, 23, -30, -111, 37, -15, -35, -22, 5, -17, -23, -36, -23, 45, -38, 16, 47, 5, -49, 52, -28, -20, -6, -51, -50, -53, 33, 4, 16, -63, -2, 13, -36, -37, -19, -9, -42, 46, -14, -22, 72, 93, 106, -27, -5, 13, -23, -47, 4, 25, -6, -30, 22, -45, -96, -34, 22, -44, 43, 40, -2, -9, -45, 15, -11, 23, 18, 0, -44, 11, 25, -30, -29, -6, -19, -20, 47, 35, 39, -24, -19, 25, 19, -11, -13, 2, -50, -4, -9, -22, 17, -2, -65, 37, 15, 30, 15, 107, -47, 28, 11, 18, -22, 53, -41, 58, 8, -14, -28, -8, -10, 11, -18, 20, -38, 4, 0, -18, -13, 26, -51, 20, -23, 23, 52, 5, -3, 25, 3, 27, 28, 60, 1, -13, -21, -14, 10, 7, 12, 21, 0, -5, -39, 7, 3, -2, 4, 42, -45, -12, 38, 0, -10, -7, -39, 6, -37, 24, 17, -37, 26, 13, -60, -22, 27, 36, 5, 54, -21, -19, 30, -79, 17, 19, -24, 17, 111, -54, 61, -56, 7, 86, 17, 60, 11, 26, -6, 59, 16, 21, 25, -17, 13, 15, 7, -13, -83, -2, 
-17, 39, 21, 60, 33, 40, -69, 36, 14, 19, -3, -2, -37, 14, -4, -40, -9, 3, 49, 16, 54, -6, 3, -11, -4, 4, -6, 25, -65, 47, -25, -29, -41, 31, 57, -35, 30, -7, -3, -27, -36, -23, -34, 39, -2, -25, 2, 58, 11, 16, -14, -55, -7, -7, -110, -14, -47, -85, 77, 71, -10, 6, 13, -72, -32, 69, 7, -27, 9, -41, -40, -28, 30, -12, 26, -58, 74, -1, -50, 37, -81, -41, 42, -49, -22, 25, 0, 86, -8, -4, -1, -17, 1, 58, 12, -34, -42, -24, -33, 23, 2, 23, 3, -44, -33, -19, 14, -70, 7, 25, -13, -90, -57, -29, -11, -46, -34, 6, 14, 79, 108, 26, 31, 3, -9, 27, 66, 2, 41, -17, -19, 62, 23, 48, -20, 6, -88, 74, -59, -53, 67, -77, -32, 1, -3, -43, 22, -45, -34, 20, 60, 58, -65, -48, 116, 76, 127, 24, -29, 59, 10, -20, -57, -19, -3, 35, 19, 3, 34, 6, 55, 27, 35, -4, -55, 32, 22, -4, -12, -34, -50, -16, 0, -22, 75, -48, -51, -26, -12, 1, -9, -17, -26, -4, -60, -128, -3, -19, -23, -17, -4, -5, -5, 37, -8, -21, -1, -16, 49, 6, -31, -21, -18, -13, 33, -11, -29, 16, -31, 41, -19, 0, 57, -4, -9, 16, 27, -27, 6, 104, -53, 39, -6, -8, 3, 0, 4, 39, -46, 33, 10, 26, -19, 53, 41, 31, 15, 12, 2, 44, -67, -18, -88, -29, 27, 3, 55, -8, -6, 38, 0, 13], [-18, -43, -29, 10, -43, -28, 0, -20, -10, 81, 107, 17, 35, -44, -27, 54, -4, 31, 17, -23, 19, -18, -41, -67, 31, -74, -39, 18, -4, 2, 13, 37, -7, 51, -7, 42, 9, 11, 43, 22, 12, 0, -32, -20, -39, 9, -21, -28, 27, -33, -11, 25, 11, -44, -24, -38, 109, -75, 73, -125, -89, -59, 103, 43, 20, -14, -24, 8, -3, 55, 22, -23, -4, 8, -25, -1, 28, 37, 28, 0, -27, 13, 40, -8, -43, -16, -39, -13, 9, 7, -11, 42, -32, 63, 2, -42, 2, 14, -34, -30, 17, -45, 21, -19, -41, 123, 32, 55, -63, -7, 11, -128, 28, 7, -29, -18, -17, 51, 7, 46, 25, 70, 61, -86, -7, -8, -27, -92, 88, 8, -20, 19, -42, -29, -19, 5, -10, 38, -8, 68, -45, -51, 46, 0, 5, 39, 35, -16, 2, -56, -16, 16, -26, 6, 21, -12, -28, 6, 53, 31, -35, -5, 20, 20, -1, 0, 46, 14, 33, 30, 29, -4, 56, 8, -21, -3, -3, -122, -24, 127, 71, 5, -128, 83, 30, -52, -14, -49, 29, 23, -21, 4, -45, -22, 16, 39, -64, 29, -2, -31, 18, 
10, 27, 2, 4, 18, -13, 31, 91, 23, -37, -2, 2, -32, -69, 14, -7, 8, -38, 47, -45, 6, -52, -2, -24, 44, -50, 28, 18, 21, 98, -20, -25, 53, -2, 16, 68, 29, 14, -23, -4, -91, -40, 40, -30, 46, 17, 11, 37, 24, -18, -65, 13, -110, 39, 13, 15, 69, -78, 31, -39, 54, 43, -4, -21, 13, -36, -21, -62, 51, 56, -66, 8, 59, -80, 23, -13, 6, -2, 38, -17, 55, 11, -17, -19, -20, 23, -5, -13, -47, 31, -16, -21, 15, -26, -35, -39, 1, 28, -15, -52, 63, -3, 8, -9, 1, -20, 4, 0, -34, -19, 27, 17, -9, -11, -42, -10, 0, -66, -34, -7, -21, 17, -1, -11, 1, -7, 10, -5, 7, 127, 72, 37, 0, 49, -14, 28, -32, -11, 20, 31, 30, 0, -71, -50, 66, 9, 25, -28, 29, -43, -40, -27, -13, -1, -78, 23, 46, -33, -22, 1, -11, -22, -16, 36, -26, -24, 7, 5, 2, -29, 30, -87, -21, -5, 49, 0, -50, 23, -13, -11, 29, 24, 44, 3, 30, -44, -9, 13, 3, -10, -16, 16, -27, 54, -28, 6, 110, -99, -21, 127, 2, -1, 52, -86, 94, 23, 36, 22, -18, -14, 5, -59, 0, -26, -22, -103, 0, -18, 17, -50, 99, -72, 28, 48, 47, 9, -48, 51, 40, 45, -15, -34, -6, 14, 103, -1, 48, -21, 0, 41, -9, -6, 66, -11, 4, -33, -2, 52, 0, 16, -8, 58, 3, -33, -9, 50, 51, 20, 43, 64, 0, -53, 39, -41, -20, 98, 5, -49, 18, -39, 25, 5, 30, -9, 57, -31, 3, -41, 32, -2, 11, 33, -27, -47, 36, -76, -34, -4, -47, -51, -19, 31, -30, 14, -14, -30, 100, 42, -52, 47, -24, -77, -1, 45, 9, 20, 52, 4, 83, 44, 5, 45, 49, -15, -3, 81, 2, 22, -23, -39, -27, 20, 32, -14, 10, -21, 17, 13, 32, 77, -9, 45, 29, -51, -24, -4, 29, 22, -50, 7, 10, -25, -2, -20, 30, -35, 27, -12, -1, -9, -15, 27, -5, -29, -85, -52, -20, 16, 68, 48, -23, -6, -20, 92, 19, -63, -128, 30, -9, -51, -36, 54, 45, -24, -41, 10, 36, -32, -28, -2, 25, 3, 44, -61, 32, 33, -7, -31, -2, 20, 7, -31, -17, -2, 19, 63, 26, 61, -4, -18, 23, 3, -26, -11, 59, 45, -22, 14, 7, -11, -30, -21, 48, -18, -25, 2, -20, -25, -38, 15, -4, 5, 8, 18, -37, -42, -56, -41, -10, -67, -2, -54, -86, -4, 49, 1, -2, -21, -11, 59, 14, 10, -59, -62, -15, -15, -19, 3, 6, -19, -1, -46, 4, 51, -17, -32, 37, 1, 13, 19, 114, -11, 6, 21, -12, 1, 
21, -22, 10, -31, 11, -51, -39, -4, 2, 78, 30, 28, -97, -8, -53, -12, 15, -19, 30, -15, -2, 43, 15, 53, 93, 0, 55, 23, 19, -23, -51, -3, 1, 2, -26, 14, -27, -15, 61, 26, -16, 9, 4, 12, 24, -16, -14, 43, -20, 35, 34, -18, -14, -33, -20, 4, -29, 20, -6, -37, -21, 13, 40, -36, 30, 12, -34, -3, -52, -5, 4, -58, -21, 57, -29, 11, -48, -15, 12, -4, 26, -22, -43, 4, 41, -58, 25, 10, 17, -33, 33, -60, 30, -50, 20, 34, 12, 20, 65, -57, -1, -16, 41, 26, 92, 20, 16, -128, -11, 2, -30, -41, -17, 35, 9, 67, 3, -21, -13, 17, 19, 8, 26, -37, 47, 10, 33, 34, 35, 14, -12, 55, -5, -65, -14, -84, 5, -30, 35, -5, -27, 22, 34, 16, 32, 0, 33, -12, 72, -68, 0, -50, -50, 20, 37, -43, 74, 0, -11, 15, -43, 23, -49, -29, -35, 61, 47, 85, -3, -9, -53, 41, -20, -14, 11, -59, 1, -1, -56, -12, 11, 19, 4, 38, -76, -23, 18, 20, -15, -25, -55, -69, 53, 54, -82, -60, 6, 17, -74, -40, -25, -40, -83, 0, 33, -73, -101, -50, -39, -26, 43, -67, 1, -51, -11, 24, -26, 23, -43, 7, 11, 73, 3, 59, 2, -7, 89, 54, 58, -37, 1, -84, 88, -65, -56, -11, -60, -58, 21, 25, -54, 16, -42, -58, -14, 24, 17, -41, 27, 48, 40, 79, 13, -38, 31, -8, 15, 4, 6, 3, -33, -59, 45, 55, -7, -12, 0, 10, -22, -17, 35, 4, -7, 4, -23, -34, -23, -3, 58, 50, -32, -72, -37, -56, 36, -46, 6, 3, 12, -39, 20, 13, 37, 15, 9, 0, -28, -21, -14, 4, 17, 57, 1, -24, -12, -1, -14, -11, -47, -13, 0, -36, -44, -4, -43, -48, -33, -4, 8, -19, 12, -23, 24, -10, 18, 19, 38, -5, -6, 54, -28, 41, -19, -28, -19, 17, -26, -41, 9, -15, 90, 33, 20, -6, 82, -60, -40, -84, -36, -3, -62, 6, -34, -10, -31, -31, 20], [5, 6, -45, -35, -3, -37, -11, -10, -18, 36, 26, 57, -17, -33, 25, -7, -3, 66, 44, 8, 1, 40, -34, -57, 2, -34, -7, 12, 9, 2, 20, -4, 19, 37, 2, 58, -16, 28, 50, -19, 20, -14, -59, -48, -98, 61, -61, 48, 7, -50, -4, -35, 11, -32, -38, -37, 60, -75, 47, -48, -56, -41, 69, 12, 3, -1, 85, 32, -35, -21, 23, 17, 12, -33, -6, -30, -2, -14, 39, 12, 34, 64, -5, -65, 24, -60, -14, -20, -58, -27, -48, -33, -87, 6, 11, -23, -33, -87, 3, 22, 52, -50, 73, -2, -33, 
99, 4, 86, -9, 8, 18, -104, 40, -12, -64, -19, -3, -5, 11, -18, -4, 13, 77, -36, 32, 7, -56, -34, 65, -40, -24, 76, -62, -88, -58, 32, 4, 22, 23, 8, -2, -43, 39, 39, -6, -24, 8, 14, 13, 22, -42, 21, -36, 4, 26, 30, -25, 32, 0, 66, 12, 7, 16, 8, -16, -21, 9, 7, 29, -26, -4, 12, 55, 21, 16, 69, -2, -53, -50, 127, 98, -3, -128, 116, -8, -27, 11, 19, -4, 10, 16, 8, -38, -22, 66, 11, -37, -19, 30, -62, 29, 47, -34, -12, -57, -16, -14, 35, 47, 38, -16, -7, -6, -4, -18, -24, -23, -48, -25, -17, -7, 22, -27, 64, -23, 1, -29, 5, -32, 14, 18, 16, 23, 37, -1, 21, 46, -50, 19, 21, 49, -71, -17, -17, 34, 16, 33, -31, -4, 69, -57, 39, 3, -43, -22, 69, -50, 33, -32, 8, -7, 112, 84, 17, -23, -4, 1, 0, -9, -14, 26, -22, 17, 102, -47, -15, 26, -22, -32, 40, -22, 29, -7, -26, -21, -55, 11, -4, 16, -9, 39, 14, 8, 36, -13, -32, -88, 38, 22, -19, -69, 43, -15, -15, 8, 4, -29, 21, 23, -13, -55, 9, 23, -32, 21, -37, -23, -4, -55, -3, 28, -28, 19, -48, -1, -20, -59, 2, -30, 42, -9, 47, 24, 100, -12, 9, -9, 12, -41, -10, -49, -11, 16, -64, 21, 49, 33, 27, -46, 68, -75, -44, 3, 41, 62, -81, -31, 72, 13, -30, -28, 27, -22, 16, -12, -24, 8, 25, 16, 11, -64, 34, -13, -11, 8, 29, 16, -29, 16, 20, 38, 44, 22, 13, 12, 29, -23, -26, 25, -25, -8, 27, 41, -23, 10, -7, -45, 0, -63, 16, 127, -21, -8, 52, -59, 74, 55, 40, 18, 2, -12, -9, -42, -8, -11, -9, -71, 1, -2, 27, -50, 80, -62, 21, -4, 16, -25, 10, -8, -9, 0, -32, -8, -3, -11, 57, -5, 37, 0, -41, 52, 29, -20, 18, -18, -22, 46, 29, 36, 8, 21, -25, 42, 9, -30, 49, 22, 13, -3, 33, 35, 25, -75, -13, -33, -77, 95, -2, 1, -16, -49, 92, -27, 7, 13, 77, -13, -13, -42, 17, -57, 19, -30, -12, -45, 28, -45, -13, -8, 0, -16, 2, 47, -28, -9, 30, -38, 127, 39, -30, 15, -18, -16, 10, 14, -9, 41, 27, 18, 63, 14, 3, 45, 5, -24, 41, -36, 46, -32, -28, 4, -10, 18, 35, 0, -15, -15, -24, -29, 32, 43, 16, 23, -14, 7, -13, -54, 11, 69, 40, -2, -9, -26, 82, 0, -24, -27, 38, -94, 54, -31, -22, 20, -27, -13, -128, -39, -22, 47, 78, 36, 6, -4, -45, 33, 17, -37, -103, 55, 
… (the full int8 and binary vector values for each document are truncated here for readability) …
ubinary: None } meta: {'api_version': {'version': '1'}, 'billed_units': {'input_tokens': 106}} }
```

As we can see from the above, we got multiple vector representations for the same input strings.

In \[54\]:

```
print(
    "int8 dimensionality: {}, binary dimensionality {}".format(
        len(embeddings.embeddings.int8[0]), len(embeddings.embeddings.binary[0])
    )
)
```

```
int8 dimensionality: 1024, binary dimensionality 128
```

## Defining the Vespa application[¶](#defining-the-vespa-application)

First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their types. Notice the `binary_vector` field, which defines an indexed (dense) Vespa tensor with the dimension name `x[128]`. Its indexing specifies `index`, which means Vespa builds an HNSW graph for searching this vector field. Also notice the configuration of the [distance-metric](https://docs.vespa.ai/en/reference/schema-reference.html#distance-metric). We also want to store the `int8_vector` on disk; we use the `paged` attribute option to signal this.

In \[8\]:
```
from vespa.package import Schema, Document, Field, FieldSet

my_schema = Schema(
    name="doc",
    mode="index",
    document=Document(
        fields=[
            Field(name="doc_id", type="string", indexing=["summary"]),
            Field(
                name="text",
                type="string",
                indexing=["summary", "index"],
                index="enable-bm25",
            ),
            Field(
                name="binary_vector",
                type="tensor(x[128])",
                indexing=["attribute", "index"],
                attribute=["distance-metric: hamming"],
            ),
            Field(
                name="int8_vector",
                type="tensor(x[1024])",
                indexing=["attribute"],
                attribute=["paged"],
            ),
        ]
    ),
    fieldsets=[FieldSet(name="default", fields=["text"])],
)
```

We must add the schema to a Vespa [application package](https://docs.vespa.ai/en/application-packages.html). This consists of configuration files, schemas, models, and possibly even custom code (plugins).

In \[9\]:

```
from vespa.package import ApplicationPackage

vespa_app_name = "coherebillion"
vespa_application_package = ApplicationPackage(name=vespa_app_name, schema=[my_schema])
```

In the last step, we configure [ranking](https://docs.vespa.ai/en/ranking.html) by adding rank profiles to the schema.
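Before defining the profile, it is worth seeing how the 128 int8 values of `binary_vector` relate to the original 1024-dimensional float space: each float is sign-quantized to one bit, and 8 bits are packed per int8 cell. A minimal numpy sketch of this packing arithmetic (illustrative only; not Cohere's or Vespa's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
float_vec = rng.standard_normal(1024)  # a 1024-dim float embedding

# Sign-quantize to bits, then pack 8 bits per byte -> 128 int8 cells
bits = (float_vec > 0).astype(np.uint8)
packed = np.packbits(bits).astype(np.int8)  # the tensor(x[128]) payload

# Mimic unpacking the bits and mapping each bit b to 2*b - 1,
# which reconstructs a {-1, 1} vector of the original dimensionality
unpacked = np.unpackbits(packed.view(np.uint8)).astype(np.float32)
signed = 2 * unpacked - 1
```

A dot product between a float query vector and this reconstructed {-1, 1} vector is the kind of score the first-phase ranking expression below computes.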
`unpack_bits` unpacks the binary representation into a 1024-dimensional float vector ([reference](https://docs.vespa.ai/en/reference/ranking-expressions.html#unpack-bits)). We define three tensor inputs that we intend to send with the query request.

In \[10\]:

```
from vespa.package import RankProfile, FirstPhaseRanking, SecondPhaseRanking, Function

rerank = RankProfile(
    name="rerank",
    inputs=[
        ("query(q_binary)", "tensor(x[128])"),
        ("query(q_full)", "tensor(x[1024])"),
        ("query(q_int8)", "tensor(x[1024])"),
    ],
    functions=[
        Function(  # this returns a tensor(x[1024]) with values -1 or 1
            name="unpack_binary_representation",
            expression="2*unpack_bits(attribute(binary_vector)) -1",
        )
    ],
    first_phase=FirstPhaseRanking(
        # phase 1 ranking using the float query and the unpacked float version of the binary_vector
        expression="sum(query(q_full)*unpack_binary_representation)"
    ),
    second_phase=SecondPhaseRanking(
        # phase 2 using the int8 vector representations
        expression="cosine_similarity(query(q_int8),attribute(int8_vector),x)",
        rerank_count=30,  # number of hits to rerank, upper bound on number of random IO operations
    ),
    match_features=[
        "distance(field, binary_vector)",
        "closeness(field, binary_vector)",
        "firstPhase",
    ],
)
my_schema.add_rank_profile(rerank)
```

## Deploy the application to Vespa Cloud[¶](#deploy-the-application-to-vespa-cloud)

With the configured application, we can deploy it to [Vespa Cloud](https://cloud.vespa.ai/en/). To deploy the application to Vespa Cloud we need to create a tenant in the Vespa Cloud:

Create a tenant at [console.vespa-cloud.com](https://console.vespa-cloud.com/) (unless you already have one). This step requires a Google or GitHub account, and will start your [free trial](https://cloud.vespa.ai/en/free-trial). Make note of the tenant name; it is used in the next steps.

> Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.

In \[39\]:

```
from vespa.deployment import VespaCloud
import os

# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
# Key is only used for CI/CD. Can be removed if logging in interactively
key = os.getenv("VESPA_TEAM_API_KEY", None)
if key is not None:
    key = key.replace(r"\n", "\n")  # To parse key correctly

vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=vespa_app_name,
    key_content=key,  # Key is only used for CI/CD. Can be removed if logging in interactively
    application_package=vespa_application_package,
)
```

Now deploy the app to the Vespa Cloud dev zone. The first deployment typically takes 2 minutes until the endpoint is up.

In \[ \]:

```
from vespa.application import Vespa

app: Vespa = vespa_cloud.deploy()
```

## Feed our sample documents and their binary embedding representation[¶](#feed-our-sample-documents-and-their-binary-embedding-representation)

With few documents, we use the synchronous API. Read more in [reads and writes](https://vespa-engine.github.io/pyvespa/reads-writes.md).

In \[41\]:

```
for i, doc in enumerate(documents):
    response = app.feed_data_point(
        schema="doc",
        data_id=str(i),
        fields={
            "doc_id": str(i),
            "text": doc,
            "binary_vector": embeddings.embeddings.binary[i],
            "int8_vector": embeddings.embeddings.int8[i],
        },
    )
    assert response.is_successful()
```

### Querying data[¶](#querying-data)

Read more about querying Vespa in:

- [Vespa Query API](https://docs.vespa.ai/en/query-api.html)
- [Vespa Query API reference](https://docs.vespa.ai/en/reference/query-api-reference.html)
- [Vespa Query Language API (YQL)](https://docs.vespa.ai/en/query-language.html)
- [Practical Nearest Neighbor Search Guide](https://docs.vespa.ai/en/nearest-neighbor-search-guide.html)

We now need to invoke the embed API again to embed the query text; we ask for all three representations:

In \[30\]:

```
query = "Who discovered x-ray?"

# Make sure to set input_type="search_query" when getting the embeddings for the query.
# We ask for 3 versions (float, binary, and int8) of the embeddings.
query_emb = co.embed(
    [query],
    model="embed-english-v3.0",
    input_type="search_query",
    embedding_types=["float", "binary", "int8"],
)
```

In \[ \]:

```
print(query_emb)
```

Now, we use Vespa's [nearestNeighbor](https://docs.vespa.ai/en/reference/query-language-reference.html#nearestneighbor) query operator to expose up to 1000 hits to ranking using the configured distance-metric (hamming distance). This is the retrieval logic, or phase-0 search, as it only uses the hamming distance. See [phased ranking](https://docs.vespa.ai/en/phased-ranking.html) for more on phased ranking pipelines. The hits that are near in hamming space are exposed to the flexibility of the Vespa ranking framework:

- The first phase uses the unpacked version of the binary vector and computes the dot product against the float query version.
- The second and final phase re-ranks the 30 best hits from the previous phase, here using cosine similarity between the int8 embedding representations.

In \[47\]:
```
response = app.query(
    yql="select * from doc where {targetHits:1000}nearestNeighbor(binary_vector,q_binary)",
    ranking="rerank",
    body={
        "input.query(q_binary)": query_emb.embeddings.binary[0],
        "input.query(q_full)": query_emb.embeddings.float[0],
        "input.query(q_int8)": query_emb.embeddings.int8[0],
    },
)
assert response.is_successful()
```

In \[48\]:

```
response.hits
```

Out\[48\]:

```
[{'id': 'id:doc:doc::3', 'relevance': 0.45650564242263414, 'source': 'cohere_content',
  'fields': {'matchfeatures': {'closeness(field,binary_vector)': 0.0030303030303030303,
    'distance(field,binary_vector)': 329.0, 'firstPhase': 4.905200004577637},
   'sddocname': 'doc', 'documentid': 'id:doc:doc::3', 'doc_id': '3',
   'text': 'Marie Curie was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity'}},
 {'id': 'id:doc:doc::1', 'relevance': 0.337421116422118, 'source': 'cohere_content',
  'fields': {'matchfeatures': {'closeness(field,binary_vector)': 0.002544529262086514,
    'distance(field,binary_vector)': 391.99999999999994, 'firstPhase': 3.7868080139160156},
   'sddocname': 'doc', 'documentid': 'id:doc:doc::1', 'doc_id': '1',
   'text': 'Albert Einstein was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time.'}},
 {'id': 'id:doc:doc::2', 'relevance': 0.280400768492745, 'source': 'cohere_content',
  'fields': {'matchfeatures': {'closeness(field,binary_vector)': 0.0026595744680851063,
    'distance(field,binary_vector)': 375.0, 'firstPhase': 3.854860305786133},
   'sddocname': 'doc', 'documentid': 'id:doc:doc::2', 'doc_id': '2',
   'text': 'Isaac Newton was an English polymath active as a mathematician, physicist, astronomer, alchemist, theologian, and author who was described in his time as a natural philosopher.'}},
 {'id': 'id:doc:doc::0', 'relevance': 0.2570603626828106, 'source': 'cohere_content',
  'fields': {'matchfeatures': {'closeness(field,binary_vector)': 0.0024390243902439024,
    'distance(field,binary_vector)': 409.0, 'firstPhase': 2.845644474029541},
   'sddocname': 'doc', 'documentid': 'id:doc:doc::0', 'doc_id': '0',
   'text': 'Alan Turing was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist.'}}]
```

The `relevance` is the cosine similarity between the int8 vector representations, calculated in the second phase. Note also that we return the hamming `distance` and the `firstPhase` score, which is the dot product between the float query and the unpacked binary vector.

## Conclusions[¶](#conclusions)

These new Cohere binary embeddings are a huge step forward for cost-efficient vector search at scale, and they integrate perfectly with Vespa's features for building it out. Storing the `int8` vector representation on disk using the `paged` attribute option enables phased retrieval and ranking close to the data. First, one can use the compact in-memory binary representation for the coarse-level search to efficiently find a limited number of candidates. Then, the candidates from the coarse search can be re-scored and re-ranked with a more advanced scoring function at a finer resolution.

### Clean up[¶](#clean-up)

We can now delete the cloud instance:

In \[ \]:

```
vespa_cloud.delete()
```

# Chat with your pdfs with ColBERT, langchain, and Vespa[¶](#chat-with-your-pdfs-with-colbert-langchain-and-vespa)

This notebook illustrates using [Vespa streaming mode](https://docs.vespa.ai/en/streaming-search.html) to build cost-efficient RAG applications over naturally sharded data.
It also demonstrates how you can use ColBERT ranking natively in Vespa, which now handles the ColBERT embedding process for you with no custom code!

You can read more about Vespa vector streaming search in these blog posts:

- [Announcing vector streaming search: AI assistants at scale without breaking the bank](https://blog.vespa.ai/announcing-vector-streaming-search/)
- [Yahoo Mail turns to Vespa to do RAG at scale](https://blog.vespa.ai/yahoo-mail-turns-to-vespa-to-do-rag-at-scale/)
- [Hands-On RAG guide for personal data with Vespa and LLamaIndex](https://blog.vespa.ai/scaling-personal-ai-assistants-with-streaming-mode/)
- [Turbocharge RAG with LangChain and Vespa Streaming Mode for Sharded Data](https://blog.vespa.ai/turbocharge-rag-with-langchain-and-vespa-streaming-mode/)

### TLDR; Vespa streaming mode for partitioned data[¶](#tldr-vespa-streaming-mode-for-partitioned-data)

Vespa's streaming search solution enables you to integrate a user ID (or any sharding key) into the Vespa document ID. This setup allows Vespa to efficiently group each user's data on a small set of nodes and on the same disk chunk. Streaming mode enables low-latency searches over a user's data without keeping data in memory.

The key benefits of streaming mode:

- Eliminating the compromises in precision introduced by approximate algorithms.
- Achieving significantly higher write throughput, thanks to the absence of the index builds required to support approximate search.
- Optimizing efficiency by storing documents, including tensors and data, on disk, benefiting from the cost-effective economics of storage tiers.
- Storage cost is the primary cost driver of Vespa streaming mode; no data is kept in memory. Avoiding memory usage lowers deployment costs significantly.
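The sharding key described above is expressed through the Vespa document ID itself: in streaming mode the ID carries a group name (`g=`) that determines where a group's documents are placed. A small sketch of how such IDs are formed (the namespace, group, and local ID values here are made up for illustration):

```python
def streaming_doc_id(namespace: str, doctype: str, group: str, local_id: str) -> str:
    # In streaming mode the g= modifier places all of a group's documents together,
    # so a search scoped to that group only touches that group's data.
    return f"id:{namespace}:{doctype}:g={group}:{local_id}"

doc_id = streaming_doc_id("pdf_docs", "pdf", "user-123", "2112.01488-page-4")
print(doc_id)  # id:pdf_docs:pdf:g=user-123:2112.01488-page-4
```

At query time, the same group name is passed along with the query so Vespa streams through only that group's documents.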
### Connecting LangChain Retriever with Vespa for Context Retrieval from PDF Documents[¶](#connecting-langchain-retriever-with-vespa-for-context-retrieval-from-pdf-documents)

In this notebook, we seamlessly integrate a custom [LangChain](https://python.langchain.com/v0.1/docs/get_started/introduction) [retriever](https://python.langchain.com/v0.1/docs/modules/data_connection/) with a Vespa app, leveraging Vespa's streaming mode to extract meaningful context from PDF documents.

The workflow:

- Define and deploy a Vespa [application package](https://docs.vespa.ai/en/application-packages.html) using PyVespa.
- Utilize [LangChain PDF Loaders](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/pdf) to download and parse PDF files.
- Leverage [LangChain Document Transformers](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/) to convert each PDF page into multiple model context-sized parts.
- Feed the transformed representation to the running Vespa instance.
- Employ Vespa's built-in [ColBERT embedder functionality](https://blog.vespa.ai/announcing-long-context-colbert-in-vespa/) (using an open-source embedding model) to embed the contexts, resulting in a multi-vector representation per context.
- Develop a custom [Retriever](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/) to enable seamless retrieval for any unstructured text query.

Let's get started! First, install dependencies:

In \[ \]:

```
!uv pip install -q pyvespa langchain langchain-community langchain-openai langchain-text-splitters pypdf==5.0.1 openai vespacli
```

## Sample data[¶](#sample-data)

We love [ColBERT](https://blog.vespa.ai/pretrained-transformer-language-models-for-search-part-3/), so we'll use a few ColBERT-related papers as example PDFs in this notebook.
In \[2\]:

```
def sample_pdfs():
    return [
        {
            "title": "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction",
            "url": "https://arxiv.org/pdf/2112.01488.pdf",
            "authors": "Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, Matei Zaharia",
        },
        {
            "title": "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT",
            "url": "https://arxiv.org/pdf/2004.12832.pdf",
            "authors": "Omar Khattab, Matei Zaharia",
        },
        {
            "title": "On Approximate Nearest Neighbour Selection for Multi-Stage Dense Retrieval",
            "url": "https://arxiv.org/pdf/2108.11480.pdf",
            "authors": "Craig Macdonald, Nicola Tonellotto",
        },
        {
            "title": "A Study on Token Pruning for ColBERT",
            "url": "https://arxiv.org/pdf/2112.06540.pdf",
            "authors": "Carlos Lassance, Maroua Maachou, Joohee Park, Stéphane Clinchant",
        },
        {
            "title": "Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval",
            "url": "https://arxiv.org/pdf/2106.11251.pdf",
            "authors": "Xiao Wang, Craig Macdonald, Nicola Tonellotto, Iadh Ounis",
        },
    ]
```

## Defining the Vespa application[¶](#defining-the-vespa-application)

[PyVespa](https://vespa-engine.github.io/pyvespa/) helps us build the [Vespa application package](https://docs.vespa.ai/en/application-packages.html). A Vespa application package consists of configuration files, schemas, models, and code (plugins).

First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their types.

In \[3\]:

```
from vespa.package import Schema, Document, Field, FieldSet

pdf_schema = Schema(
    name="pdf",
    mode="streaming",
    document=Document(
        fields=[
            Field(name="id", type="string", indexing=["summary"]),
            Field(name="title", type="string", indexing=["summary", "index"]),
            Field(name="url", type="string", indexing=["summary", "index"]),
            Field(name="authors", type="array<string>", indexing=["summary", "index"]),
            Field(
                name="metadata",
                type="map<string,string>",
                indexing=["summary", "index"],
            ),
            Field(name="page", type="int", indexing=["summary", "attribute"]),
            Field(name="contexts", type="array<string>", indexing=["summary", "index"]),
            Field(
                name="embedding",
                type="tensor(context{}, x[384])",
                indexing=[
                    "input contexts",
                    'for_each { (input title || "") . " " . ( _ || "") }',
                    "embed e5",
                    "attribute",
                ],
                attribute=["distance-metric: angular"],
                is_document_field=False,
            ),
            Field(
                name="colbert",
                type="tensor(context{}, token{}, v[16])",
                indexing=["input contexts", "embed colbert context", "attribute"],
                is_document_field=False,
            ),
        ],
    ),
    fieldsets=[FieldSet(name="default", fields=["title", "contexts"])],
)
```

The above defines our `pdf` schema using mode `streaming`. Most fields are straightforward, but take note of:

- `metadata` using `map<string,string>` - here we can store and match over page-level metadata extracted by the PDF parser.
- `contexts` using `array<string>` - these are the context-sized text parts that we use LangChain document transformers to create.
- The `embedding` field of type `tensor(context{}, x[384])` allows us to store and search the 384-dimensional embeddings per context in the same document.
- The `colbert` field of type `tensor(context{}, token{}, v[16])` stores the ColBERT embeddings, retaining a (quantized) per-token representation of the text.

The observant reader might have noticed the `e5` and `colbert` arguments to the `embed` expression in the above `embedding` field. The `e5` argument references a component of the type [hugging-face-embedder](https://docs.vespa.ai/en/embedding.html#huggingface-embedder), and `colbert` references the new [colbert-embedder](https://docs.vespa.ai/en/embedding.html#colbert-embedder). We configure the application package and its name with the `pdf` schema and the `e5` and `colbert` embedder components.

In \[4\]:

```
from vespa.package import ApplicationPackage, Component, Parameter

vespa_app_name = "pdfs"
vespa_application_package = ApplicationPackage(
    name=vespa_app_name,
    schema=[pdf_schema],
    components=[
        Component(
            id="e5",
            type="hugging-face-embedder",
            parameters=[
                Parameter(
                    name="transformer-model",
                    args={
                        "url": "https://huggingface.co/intfloat/e5-small-v2/resolve/main/model.onnx"
                    },
                ),
                Parameter(
                    name="tokenizer-model",
                    args={
                        "url": "https://huggingface.co/intfloat/e5-small-v2/raw/main/tokenizer.json"
                    },
                ),
                Parameter(
                    name="prepend",
                    args={},
                    children=[
                        Parameter(name="query", args={}, children="query: "),
                        Parameter(name="document", args={}, children="passage: "),
                    ],
                ),
            ],
        ),
        Component(
            id="colbert",
            type="colbert-embedder",
            parameters=[
                Parameter(
                    name="transformer-model",
                    args={
                        "url": "https://huggingface.co/colbert-ir/colbertv2.0/resolve/main/model.onnx"
                    },
                ),
                Parameter(
                    name="tokenizer-model",
                    args={
                        "url": "https://huggingface.co/colbert-ir/colbertv2.0/raw/main/tokenizer.json"
                    },
                ),
            ],
        ),
    ],
)
```

In the last step, we configure [ranking](https://docs.vespa.ai/en/ranking.html) by adding rank profiles to the schema. Vespa supports [phased ranking](https://docs.vespa.ai/en/phased-ranking.html) and has a rich set of built-in [rank-features](https://docs.vespa.ai/en/reference/rank-features.html), including many text-matching features such as:

- [BM25](https://docs.vespa.ai/en/reference/bm25.html)
- [nativeRank](https://docs.vespa.ai/en/reference/nativerank.html) and many more

Users can also define custom functions using [ranking expressions](https://docs.vespa.ai/en/reference/ranking-expressions.html). The following defines a `colbert` Vespa ranking profile which uses the `e5` embedding in the first phase and the `max_sim` function in the second phase. The `max_sim` function performs the *late interaction* for the ColBERT ranking, and is by default applied to the top 100 documents from the first phase.

In \[5\]:
```
from vespa.package import RankProfile, Function, FirstPhaseRanking, SecondPhaseRanking

colbert = RankProfile(
    name="colbert",
    inputs=[
        ("query(q)", "tensor(x[384])"),
        ("query(qt)", "tensor(querytoken{}, v[128])"),
    ],
    functions=[
        Function(name="cos_sim", expression="closeness(field, embedding)"),
        Function(
            name="max_sim_per_context",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(colbert)), v
                        ),
                        max, token
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="max_sim", expression="reduce(max_sim_per_context, max, context)"
        ),
    ],
    first_phase=FirstPhaseRanking(expression="cos_sim"),
    second_phase=SecondPhaseRanking(expression="max_sim"),
    match_features=["cos_sim", "max_sim", "max_sim_per_context"],
)
pdf_schema.add_rank_profile(colbert)
```

Using [match-features](https://docs.vespa.ai/en/reference/schema-reference.html#match-features), Vespa returns selected features along with the highest-scoring documents. Here, we include `max_sim_per_context`, which we can later use to select the top N scoring contexts for each page.
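The late interaction computed by `max_sim_per_context` can be mirrored in plain numpy to see what it does: for each context, every query token takes the maximum dot product over that context's document tokens, and these per-query-token maxima are summed; the final `max_sim` then takes the best context. A toy sketch with made-up token counts (it skips the `unpack_bits` step that expands the quantized `v[16]` document tensor to 128 float dimensions):

```python
import numpy as np

rng = np.random.default_rng(42)
num_query_tokens, num_doc_tokens, dim = 4, 12, 128

qt = rng.standard_normal((num_query_tokens, dim))  # stand-in for query(qt)
# per-context token embedding matrices, as in tensor(context{}, token{}, v[...])
contexts = [rng.standard_normal((num_doc_tokens, dim)) for _ in range(3)]

def max_sim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    # sum over query tokens of the max dot product against any document token
    sims = query_tokens @ doc_tokens.T  # shape (query tokens, doc tokens)
    return float(sims.max(axis=1).sum())

per_context = [max_sim(qt, c) for c in contexts]  # max_sim_per_context
score = max(per_context)                          # reduce(..., max, context)
```

Returning `per_context` via match-features is what lets us later pick the top-scoring contexts within a page, not just the best page.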
For an example of a `hybrid` rank profile which combines semantic search with traditional text retrieval such as BM25, see the previous blog post: [Turbocharge RAG with LangChain and Vespa Streaming Mode for Sharded Data](https://blog.vespa.ai/turbocharge-rag-with-langchain-and-vespa-streaming-mode/)

## Deploy the application to Vespa Cloud

With the configured application, we can deploy it to [Vespa Cloud](https://cloud.vespa.ai/en/). To deploy the application to Vespa Cloud we need a tenant: create one at [console.vespa-cloud.com](https://console.vespa-cloud.com/) (unless you already have one). This step requires a Google or GitHub account, and will start your [free trial](https://cloud.vespa.ai/en/free-trial). Make a note of the tenant name; it is used in the next steps.

> Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.

```python
from vespa.deployment import VespaCloud
import os

# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
# Key is only used for CI/CD. Can be removed if logging in interactively
key = os.getenv("VESPA_TEAM_API_KEY", None)
if key is not None:
    key = key.replace(r"\n", "\n")  # To parse key correctly

vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=vespa_app_name,
    key_content=key,  # Key is only used for CI/CD. Can be removed if logging in interactively
    application_package=vespa_application_package,
)
```

Now deploy the app to the Vespa Cloud dev zone. The first deployment typically takes 2 minutes until the endpoint is up.

```python
from vespa.application import Vespa

app: Vespa = vespa_cloud.deploy()
```

```
Deployment started in run 1 of dev-aws-us-east-1c for vespa-team.pdfs. This may take a few minutes the first time.
INFO [19:04:30] Deploying platform version 8.314.57 and application dev build 1 for dev-aws-us-east-1c of default ...
INFO [19:04:30] Using CA signed certificate version 1
INFO [19:04:30] Using 1 nodes in container cluster 'pdfs_container'
INFO [19:04:35] Session 285265 for tenant 'vespa-team' prepared and activated.
INFO [19:04:39] ######## Details for all nodes ########
INFO [19:04:44] h88969d.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO [19:04:44] --- platform vespa/cloud-tenant-rhel8:8.314.57 <-- :
INFO [19:04:44] --- container-clustercontroller on port 19050 has not started
INFO [19:04:44] --- metricsproxy-container on port 19092 has not started
INFO [19:04:44] h88978a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO [19:04:44] --- platform vespa/cloud-tenant-rhel8:8.314.57 <-- :
INFO [19:04:44] --- logserver-container on port 4080 has not started
INFO [19:04:44] --- metricsproxy-container on port 19092 has not started
INFO [19:04:44] h90615b.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO [19:04:44] --- platform vespa/cloud-tenant-rhel8:8.314.57 <-- :
INFO [19:04:44] --- storagenode on port 19102 has not started
INFO [19:04:44] --- searchnode on port 19107 has not started
INFO [19:04:44] --- distributor on port 19111 has not started
INFO [19:04:44] --- metricsproxy-container on port 19092 has not started
INFO [19:04:44] h91135a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO [19:04:44] --- platform vespa/cloud-tenant-rhel8:8.314.57 <-- :
INFO [19:04:44] --- container on port 4080 has not started
INFO [19:04:44] --- metricsproxy-container on port 19092 has not started
INFO [19:05:52] Waiting for convergence of 10 services across 4 nodes
INFO [19:05:52] 1/1 nodes upgrading platform
INFO [19:05:52] 1 application services still deploying
DEBUG [19:05:52] h91135a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
DEBUG [19:05:52] --- platform vespa/cloud-tenant-rhel8:8.314.57 <-- :
DEBUG [19:05:52] --- container on port 4080 has not started
DEBUG [19:05:52] --- metricsproxy-container on port 19092 has config generation 285265, wanted is 285265
INFO [19:06:21] Found endpoints:
INFO [19:06:21] - dev.aws-us-east-1c
INFO [19:06:21] |--
https://bac3e5ad.c81e7b13.z.vespa-app.cloud/ (cluster 'pdfs_container')
INFO [19:06:22] Installation succeeded!
Using mTLS (key,cert) Authentication against endpoint https://bac3e5ad.c81e7b13.z.vespa-app.cloud//ApplicationStatus
Application is up!
Finished deployment.
```

### Processing PDFs with LangChain

[LangChain](https://python.langchain.com/) has a rich set of [document loaders](https://python.langchain.com/docs/how_to/#document-loaders) that can be used to load and process various file formats. In this notebook, we use the [PyPDFLoader](https://python.langchain.com/docs/how_to/document_loader_pdf/).

We also want to split the extracted text into *contexts* using a [text splitter](https://python.langchain.com/docs/how_to/#text-splitters). Most text embedding models have limited input lengths (typically fewer than 512 language-model tokens), so splitting the text into multiple contexts that each fit within the context limit of the embedding model is a common strategy.

For embedding text data, models based on the Transformer architecture have become the de facto standard. A challenge with Transformer-based models is their input length limitation, due to the quadratic computational complexity of self-attention. For example, a popular open-source text embedding model like [e5](https://huggingface.co/intfloat/e5-small) has an absolute maximum input length of 512 wordpiece tokens. Beyond this technical limitation, trying to fit more tokens than were used during fine-tuning of the model will also degrade the quality of the vector representation. One can view text embedding encoding as a lossy compression technique, where variable-length texts are compressed into a fixed-dimensional vector representation. Although this compressed representation is very useful, it can be imprecise, especially as the size of the text increases.
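To make the splitting step concrete, a naive fixed-width character chunker can be sketched as follows. This is an illustrative stand-in only: the notebook uses LangChain's `RecursiveCharacterTextSplitter`, which prefers splitting on paragraph and sentence separators before falling back to raw characters.

```python
def chunk_text(text: str, chunk_size: int = 1024) -> list[str]:
    # Naive fixed-width split by characters (not tokens), with no overlap,
    # mirroring chunk_size=1024 / chunk_overlap=0 used below.
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


# A 2500-character page becomes three contexts of 1024 + 1024 + 452 characters.
chunks = chunk_text("x" * 2500)
```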
By adding the ColBERT embedding, we also keep token-level information, which preserves more of the original meaning of the text and allows the richer *late interaction* between the query and the document text.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,  # chars, not llm tokens
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False,
)
```

The following iterates over the `sample_pdfs` and performs these steps:

- Load the URL and extract the text into pages. A page is the retrievable unit we will use in Vespa.
- For each page, use the text splitter to split the text into contexts. The contexts are represented as an `array` in the Vespa schema.
- Create the page-level Vespa `fields`. Note that we duplicate some content, like the title and URL, into the page-level representation.
```python
import hashlib
import unicodedata


def remove_control_characters(s):
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")


my_docs_to_feed = []

for pdf in sample_pdfs():
    url = pdf["url"]
    loader = PyPDFLoader(url)
    pages = loader.load_and_split()
    for index, page in enumerate(pages):
        source = page.metadata["source"]
        chunks = text_splitter.transform_documents([page])
        text_chunks = [chunk.page_content for chunk in chunks]
        text_chunks = [remove_control_characters(chunk) for chunk in text_chunks]
        page_number = index + 1
        vespa_id = f"{url}#{page_number}"
        hash_value = hashlib.sha1(vespa_id.encode()).hexdigest()
        fields = {
            "title": pdf["title"],
            "url": url,
            "page": page_number,
            "id": hash_value,
            "authors": [a.strip() for a in pdf["authors"].split(",")],
            "contexts": text_chunks,
            "metadata": page.metadata,
        }
        my_docs_to_feed.append(fields)
```

Now that we have parsed the input PDFs and created a list of pages that we want to add to Vespa, we must format the list into the format that PyVespa accepts. Notice the `fields`, `id` and `groupname` keys.
The `groupname` is the key used to shard and co-locate the data, and is only relevant when using Vespa with streaming mode.

```python
from typing import Iterable


def vespa_feed(user: str) -> Iterable[dict]:
    for doc in my_docs_to_feed:
        yield {"fields": doc, "id": doc["id"], "groupname": user}
```

```python
my_docs_to_feed[0]
```

```
{'title': 'ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction', 'url': 'https://arxiv.org/pdf/2112.01488.pdf', 'page': 1, 'id': 'a731a839198de04fa3d1a3cee6890d0d170ab025', 'authors': ['Keshav Santhanam', 'Omar Khattab', 'Jon Saad-Falcon', 'Christopher Potts', 'Matei Zaharia'], 'contexts': ['ColBERTv2:Effective and Efficient Retrieval via Lightweight Late InteractionKeshav Santhanam∗Stanford UniversityOmar Khattab∗Stanford UniversityJon Saad-FalconGeorgia Institute of TechnologyChristopher PottsStanford UniversityMatei ZahariaStanford UniversityAbstractNeural information retrieval (IR) has greatlyadvanced search and other knowledge-intensive language tasks. While many neuralIR methods encode queries and documentsinto single-vector representations, lateinteraction models produce multi-vector repre-sentations at the granularity of each token anddecompose relevance modeling into scalabletoken-level computations. This decompositionhas been shown to make late interaction moreeffective, but it inflates the space footprint ofthese models by an order of magnitude. In thiswork, we introduce ColBERTv2, a retrieverthat couples an aggressive residual compres-sion mechanism with a denoised supervisionstrategy to simultaneously improve the quality', 'and space footprint of late interaction.
Weevaluate ColBERTv2 across a wide rangeof benchmarks, establishing state-of-the-artquality within and outside the training domainwhile reducing the space footprint of lateinteraction models by 6–10 ×.1 IntroductionNeural information retrieval (IR) has quickly domi-nated the search landscape over the past 2–3 years,dramatically advancing not only passage and doc-ument search (Nogueira and Cho, 2019) but alsomany knowledge-intensive NLP tasks like open-domain question answering (Guu et al., 2020),multi-hop claim verification (Khattab et al., 2021a),and open-ended generation (Paranjape et al., 2022).Many neural IR methods follow a single-vectorsimilarity paradigm: a pretrained language modelis used to encode each query and each documentinto a single high-dimensional vector, and rele-vance is modeled as a simple dot product betweenboth vectors. An alternative is late interaction , in-troduced in ColBERT (Khattab and Zaharia, 2020),', 'where queries and documents are encoded at a finer-granularity into multi-vector representations, and∗Equal contribution.relevance is estimated using rich yet scalable in-teractions between these two sets of vectors. Col-BERT produces an embedding for every token inthe query (and document) and models relevanceas the sum of maximum similarities between eachquery vector and all vectors in the document.By decomposing relevance modeling into token-level computations, late interaction aims to reducethe burden on the encoder: whereas single-vectormodels must capture complex query–document re-lationships within one dot product, late interactionencodes meaning at the level of tokens and del-egates query–document matching to the interac-tion mechanism. This added expressivity comesat a cost: existing late interaction systems imposean order-of-magnitude larger space footprint thansingle-vector models, as they must store billionsof small vectors for Web-scale collections. 
Con-', 'sidering this challenge, it might seem more fruit-ful to focus instead on addressing the fragility ofsingle-vector models (Menon et al., 2022) by in-troducing new supervision paradigms for negativemining (Xiong et al., 2020), pretraining (Gao andCallan, 2021), and distillation (Qu et al., 2021).Indeed, recent single-vector models with highly-tuned supervision strategies (Ren et al., 2021b; For-mal et al., 2021a) sometimes perform on-par oreven better than “vanilla” late interaction models,and it is not necessarily clear whether late inter-action architectures—with their fixed token-levelinductive biases—admit similarly large gains fromimproved supervision.In this work, we show that late interaction re-trievers naturally produce lightweight token rep-resentations that are amenable to efficient storageoff-the-shelf and that they can benefit drasticallyfrom denoised supervision. We couple those inColBERTv2 ,1a new late-interaction retriever that'], 'metadata': {'source': 'https://arxiv.org/pdf/2112.01488.pdf', 'page': 0}}
```

Now we can feed the documents to the Vespa instance (`app`) with the `feed_iterable` API, using the generator function above as input, together with a custom `callback` function. Vespa performs embedding inference during this step, using the built-in Vespa [embedding](https://docs.vespa.ai/en/embedding.html#huggingface-embedder) functionality.
```python
from vespa.io import VespaResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(
            f"Document {id} failed to feed with status code {response.status_code}, url={response.url} response={response.json}"
        )


app.feed_iterable(
    schema="pdf", iter=vespa_feed("jo-bergum"), namespace="personal", callback=callback
)
```

Notice the `schema` and `namespace` arguments. PyVespa transforms the input operations into Vespa [document v1](https://docs.vespa.ai/en/document-v1-api-guide.html) requests.

### Querying data

Now we can also query our data. With [streaming mode](https://docs.vespa.ai/en/reference/query-api-reference.html#streaming), we must pass the `groupname` parameter, or the request will fail with an error. The query request uses the Vespa Query API, and the `Vespa.query()` function supports passing any of the Vespa query API parameters.

Read more about querying Vespa in:

- [Vespa Query API](https://docs.vespa.ai/en/query-api.html)
- [Vespa Query API reference](https://docs.vespa.ai/en/reference/query-api-reference.html)
- [Vespa Query Language API (YQL)](https://docs.vespa.ai/en/query-language.html)

Sample query request for `why is colbert effective?` for the user `jo-bergum`:
```python
from vespa.io import VespaQueryResponse
import json

response: VespaQueryResponse = app.query(
    yql="select id,title,page,contexts from pdf where ({targetHits:10}nearestNeighbor(embedding,q))",
    groupname="jo-bergum",
    ranking="colbert",
    query="why is colbert effective?",
    body={
        "presentation.format.tensors": "short-value",
        "input.query(q)": 'embed(e5, "why is colbert effective?")',
        "input.query(qt)": 'embed(colbert, "why is colbert effective?")',
    },
    timeout="2s",
)
assert response.is_successful()
print(json.dumps(response.hits[0], indent=2))
```

```
{ "id": "id:personal:pdf:g=jo-bergum:55ea3f735cb6748a2eddb9f76d3f0e7fff0c31a8", "relevance": 103.17699432373047, "source": "pdfs_content.pdf", "fields": { "matchfeatures": { "cos_sim": 0.6534222205340683, "max_sim": 103.17699432373047, "max_sim_per_context": { "0": 74.16375732421875, "1": 103.17699432373047 } }, "id": "55ea3f735cb6748a2eddb9f76d3f0e7fff0c31a8", "title": "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT", "page": 18, "contexts": [ "at least once.
While ColBERT encodes each document with BERTexactly once, existing BERT-based rankers would repeat similarcomputations on possibly hundreds of documents for each query.Se/t_ting Dimension( m) Bytes/Dim Space(GiBs) MRR@10Re-rank Cosine 128 4 286 34.9End-to-end L2 128 2 154 36.0Re-rank L2 128 2 143 34.8Re-rank Cosine 48 4 54 34.4Re-rank Cosine 24 2 27 33.9Table 4: Space Footprint vs MRR@10 (Dev) on MS MARCO.Table 4 reports the space footprint of ColBERT under variousse/t_tings as we reduce the embeddings dimension and/or the bytesper dimension. Interestingly, the most space-e\ufb03cient se/t_ting, thatis, re-ranking with cosine similarity with 24-dimensional vectorsstored as 2-byte /f_loats, is only 1% worse in MRR@10 than the mostspace-consuming one, while the former requires only 27 GiBs torepresent the MS MARCO collection.5 CONCLUSIONSIn this paper, we introduced ColBERT, a novel ranking model thatemploys contextualized late interaction over deep LMs (in particular,", "BERT) for e\ufb03cient retrieval. By independently encoding queriesand documents into /f_ine-grained representations that interact viacheap and pruning-friendly computations, ColBERT can leveragethe expressiveness of deep LMs while greatly speeding up queryprocessing. In addition, doing so allows using ColBERT for end-to-end neural retrieval directly from a large document collection. Ourresults show that ColBERT is more than 170 \u00d7faster and requires14,000\u00d7fewer FLOPs/query than existing BERT-based models, allwhile only minimally impacting quality and while outperformingevery non-BERT baseline.Acknowledgments. OK was supported by the Eltoukhy FamilyGraduate Fellowship at the Stanford School of Engineering. 
/T_hisresearch was supported in part by a\ufb03liate members and othersupporters of the Stanford DAWN project\u2014Ant Financial, Facebook,Google, Infosys, NEC, and VMware\u2014as well as Cisco, SAP, and the" ] } } ``` Notice the `matchfeatures` that returns the configured match-features from the rank-profile, including all the context similarities. ## LangChain Retriever[¶](#langchain-retriever) We use the [LangChain Retriever](https://python.langchain.com/docs/how_to/#retrievers) interface so that we can connect our Vespa app with the flexibility and power of the [LangChain](https://python.langchain.com/docs/get_started/introduction) LLM framework. > A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well. The retriever interface fits perfectly with Vespa, as Vespa can support a wide range of features and ways to retrieve and rank content. The following implements a custom retriever `VespaStreamingColBERTRetriever` that takes the following arguments: - `app:Vespa` The Vespa application we retrieve from. This could be a Vespa Cloud instance or a local instance, for example running on a laptop. 
- `user: str` - The user we want to retrieve for; this argument maps to the [Vespa streaming mode groupname parameter](https://docs.vespa.ai/en/reference/query-api-reference.html#streaming.groupname).
- `pages: int` - The target number of PDF pages we want to retrieve for a given query.
- `chunks_per_page: int` - The target number of relevant text chunks returned for each page.
- `chunk_similarity_threshold: float` - The chunk similarity threshold; only chunks with a similarity above this threshold are included.

The core idea is to *retrieve* pages using the `e5` embedding similarity (`cos_sim`) as the initial scoring function, then re-rank the top-K pages using the ColBERT `max_sim` late-interaction score. This re-ranking is handled by the second phase of the Vespa ranking expression defined above, and is transparent to the retriever code below.

```python
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from typing import List


class VespaStreamingColBERTRetriever(BaseRetriever):
    app: Vespa
    user: str
    pages: int = 5
    chunks_per_page: int = 3
    chunk_similarity_threshold: float = 0.8

    def _get_relevant_documents(self, query: str) -> List[Document]:
        response: VespaQueryResponse = self.app.query(
            yql="select id, url, title, page, authors, contexts from pdf where userQuery() or ({targetHits:20}nearestNeighbor(embedding,q))",
            groupname=self.user,
            ranking="colbert",
            query=query,
            hits=self.pages,
            body={
                "presentation.format.tensors": "short-value",
                "input.query(q)": f'embed(e5, "query: {query} ")',
                "input.query(qt)": f'embed(colbert, "{query}")',
            },
            timeout="2s",
        )
        if not response.is_successful():
            raise ValueError(
                f"Query failed with status code {response.status_code}, url={response.url} response={response.json}"
            )
        return self._parse_response(response)

    def _parse_response(self, response: VespaQueryResponse) -> List[Document]:
        documents: List[Document] = []
        for hit in response.hits:
            fields = hit["fields"]
            chunks_with_scores = self._get_chunk_similarities(fields)
            # Best k chunks from each page
            best_chunks_on_page = " ### ".join(
                [
                    chunk
                    for chunk, score in chunks_with_scores[0 : self.chunks_per_page]
                    if score > self.chunk_similarity_threshold
                ]
            )
            documents.append(
                Document(
                    id=fields["id"],
                    page_content=best_chunks_on_page,
                    title=fields["title"],
                    metadata={
                        "title": fields["title"],
                        "url": fields["url"],
                        "page": fields["page"],
                        "authors": fields["authors"],
                        "features": fields["matchfeatures"],
                    },
                )
            )
        return documents

    def _get_chunk_similarities(self, hit_fields: dict) -> List[tuple]:
        match_features = hit_fields["matchfeatures"]
        similarities = match_features["max_sim_per_context"]
        chunk_scores = []
        for i in range(0, len(similarities)):
            chunk_scores.append(similarities.get(str(i), 0))
        chunks = hit_fields["contexts"]
        chunks_with_scores = list(zip(chunks, chunk_scores))
        return sorted(chunks_with_scores, key=lambda x: x[1], reverse=True)
```

That's it! We can give our newborn retriever a spin for the user `jo-bergum`:

```python
vespa_hybrid_retriever = VespaStreamingColBERTRetriever(
    app=app, user="jo-bergum", pages=1, chunks_per_page=3
)
```

```python
vespa_hybrid_retriever.invoke("what is the maxsim operator in colbert?")
```

```
[Document(page_content='ture that precisely does so. As illustrated, every query embeddinginteracts with all document embeddings via a MaxSim operator,which computes maximum similarity (e.g., cosine similarity), andthe scalar outputs of these operators are summed across queryterms. /T_his paradigm allows ColBERT to exploit deep LM-basedrepresentations while shi/f_ting the cost of encoding documents of-/f_line and amortizing the cost of encoding the query once acrossall ranked documents.
Additionally, it enables ColBERT to lever-age vector-similarity search indexes (e.g., [ 1,15]) to retrieve thetop-kresults directly from a large document collection, substan-tially improving recall over models that only re-rank the output ofterm-based retrieval.As Figure 1 illustrates, ColBERT can serve queries in tens orfew hundreds of milliseconds. For instance, when used for re-ranking as in “ColBERT (re-rank)”, it delivers over 170 ×speedup(and requires 14,000 ×fewer FLOPs) relative to existing BERT-based ### models, while being more effective than every non-BERT baseline(§4.2 & 4.3). ColBERT’s indexing—the only time it needs to feeddocuments through BERT—is also practical: it can index the MSMARCO collection of 9M passages in about 3 hours using a singleserver with four GPUs ( §4.5), retaining its effectiveness with a spacefootprint of as li/t_tle as few tens of GiBs. Our extensive ablationstudy ( §4.4) shows that late interaction, its implementation viaMaxSim operations, and crucial design choices within our BERT-based encoders are all essential to ColBERT’s effectiveness.Our main contributions are as follows.(1)We propose late interaction (§3.1) as a paradigm for efficientand effective neural ranking.(2)We present ColBERT ( §3.2 & 3.3), a highly-effective modelthat employs novel BERT-based query and document en-coders within the late interaction paradigm.', metadata={'title': 'ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT', 'url': 'https://arxiv.org/pdf/2004.12832.pdf', 'page': 4, 'authors': ['Omar Khattab', 'Matei Zaharia'], 'features': {'cos_sim': 0.6664045997289173, 'max_sim': 124.19231414794922, 'max_sim_per_context': {'0': 124.19231414794922, '1': 92.21265411376953}}})]
```

## RAG

Finally, we can connect our custom retriever with the complete flexibility and power of the [LangChain](https://python.langchain.com/docs/get_started/introduction) LLM framework.
The following uses [LangChain Expression Language, or LCEL](https://python.langchain.com/docs/how_to/#langchain-expression-language-lcel), a declarative way to compose chains. We compose several steps into a chain:

- The prompt template and LLM model, in this case using OpenAI
- The retriever that provides the retrieved context for the question
- The formatting of the retrieved context

```python
vespa_hybrid_retriever = VespaStreamingColBERTRetriever(
    app=app, user="jo-bergum", chunks_per_page=3
)
```

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

prompt_template = """
Answer the question based only on the following context.
Cite the page number and the url of the document you are citing.

{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(prompt_template)
model = ChatOpenAI(model="gpt-4-0125-preview")


def format_prompt_context(docs) -> str:
    context = []
    for d in docs:
        context.append(f"{d.metadata['title']} by {d.metadata['authors']}\n")
        context.append(f"url: {d.metadata['url']}\n")
        context.append(f"page: {d.metadata['page']}\n")
        context.append(f"{d.page_content}\n\n")
    return "".join(context)


chain = (
    {
        "context": vespa_hybrid_retriever | format_prompt_context,
        "question": RunnablePassthrough(),
    }
    | prompt
    | model
    | StrOutputParser()
)
```

### Interact with the chain

Now we can start asking questions using the `chain` defined above.

```python
chain.invoke("what is colbert?")
```

```
'ColBERT, introduced by Omar Khattab and Matei Zaharia, is a novel ranking model that employs contextualized late interaction over deep language models (LMs), specifically focusing on BERT (Bidirectional Encoder Representations from Transformers) for efficient and effective passage search. It achieves this by independently encoding queries and documents into fine-grained representations that interact via cheap and pruning-friendly computations. This approach allows ColBERT to leverage the expressiveness of deep LMs while significantly speeding up query processing compared to existing BERT-based models. ColBERT also enables end-to-end neural retrieval directly from a large document collection, offering more than 170 times faster performance and requiring 14,000 times fewer FLOPs (floating-point operations) per query than previous BERT-based models, with minimal impact on quality.
It outperforms every non-BERT baseline in effectiveness (https://arxiv.org/pdf/2004.12832.pdf, page 18).\n\nColBERT differentiates itself with a mechanism that delays the query-document interaction, which allows for pre-computation of document representations for cheap neural re-ranking and supports practical end-to-end neural retrieval through pruning via vector-similarity search. This method preserves the effectiveness of state-of-the-art models that condition most of their computations on the joint query-document pair, making ColBERT a scalable solution for passage search challenges (https://arxiv.org/pdf/2004.12832.pdf, page 6).' ``` In \[32\]: Copied! ``` chain.invoke("what is the colbert maxsim operator") ``` chain.invoke("what is the colbert maxsim operator") Out\[32\]: ``` 'The ColBERT MaxSim operator is a mechanism for computing the maximum similarity between query embeddings and document embeddings. It operates by calculating the maximum similarity (e.g., cosine similarity) for each query embedding with all document embeddings, and then summing the scalar outputs of these operations across query terms. This paradigm enables the efficient and effective retrieval of documents by allowing for the interaction between deep language model-based representations of queries and documents to occur in a late stage of the processing pipeline, thereby shifting the cost of encoding documents offline and amortizing the cost of encoding the query across all ranked documents. Additionally, the MaxSim operator facilitates the use of vector-similarity search indexes to directly retrieve the top-k results from a large document collection, substantially improving recall over models that only re-rank the output of term-based retrieval. 
This operator is a key component of ColBERT\'s approach to efficient and effective passage search.\n\nSource: "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT" by Omar Khattab and Matei Zaharia, page 4, https://arxiv.org/pdf/2004.12832.pdf' ``` In \[33\]: Copied! ``` chain.invoke( "What is the difference between colbert and single vector representational models?" ) ``` chain.invoke( "What is the difference between colbert and single vector representational models?" ) Out\[33\]: ``` 'The main difference between ColBERT and single-vector representational models lies in their approach to handling document and query representations for information retrieval tasks. ColBERT utilizes a multi-vector representation for both queries and documents, whereas single-vector models encode each query and each document into a single, dense vector.\n\n1. **Multi-Vector vs. Single-Vector Representations**: ColBERT leverages a late interaction mechanism that allows for fine-grained matching between the multiple embeddings of query terms and document tokens. This approach enables capturing the nuanced semantics of the text by considering the contextualized representation of each term separately. On the other hand, single-vector models compress the entire content of a document or a query into a single dense vector, which might lead to a loss of detail and context specificity.\n\n2. **Efficiency and Effectiveness**: While single-vector models might be simpler and potentially faster in some scenarios due to their straightforward matching mechanism (e.g., cosine similarity between query and document vectors), this simplicity could come at the cost of effectiveness. ColBERT, with its detailed interaction between term-level vectors, can offer more accurate retrieval results because it preserves and utilizes the rich semantic relationships within and across the text of queries and documents. 
However, ColBERT\'s detailed approach initially required more storage and computational resources compared to single-vector models. Nonetheless, advancements like ColBERTv2 have significantly improved the efficiency, achieving competitive storage requirements and reducing the computational cost while maintaining or even enhancing retrieval effectiveness.\n\n3. **Compression and Storage**: Initial versions of multi-vector models like ColBERT required significantly more storage space compared to single-vector models due to storing multiple vectors per document. However, with the introduction of techniques like residual compression in ColBERTv2, the storage requirements have been drastically reduced to levels competitive with single-vector models. Single-vector models, while naturally more storage-efficient, can also be compressed, but aggressive compression might exacerbate the loss in quality.\n\n4. **Search Quality and Compression**: Despite the potential for aggressive compression in single-vector models, such approaches often lead to a more pronounced loss in quality compared to late interaction methods like ColBERTv2. ColBERTv2, even when employing compression techniques to reduce its storage footprint, can achieve higher quality across systems, showcasing the robustness of its retrieval capabilities even when optimizing for space efficiency.\n\nIn summary, the difference between ColBERT and single-vector representational models is primarily in their approach to encoding and matching queries and documents, with ColBERT focusing on detailed, term-level interactions for improved accuracy, and single-vector models emphasizing simplicity and compactness, which might come at the cost of retrieval effectiveness.\n\nCitations:\n- Santhanam et al., "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction," p. 14, 15, 17, https://arxiv.org/pdf/2112.01488.pdf' ``` In \[34\]: Copied! 
``` chain.invoke("Why does ColBERT work better for longer documents?") ``` chain.invoke("Why does ColBERT work better for longer documents?") Out\[34\]: ``` "ColBERT is designed to efficiently handle the interaction between query and document representations through a mechanism called late interaction, which is particularly beneficial when dealing with longer documents. This is because ColBERT independently encodes queries and documents into fine-grained representations using BERT, and then employs a cheap yet powerful interaction step that models their fine-grained similarity. This approach allows for the pre-computation of document representations offline, significantly speeding up query processing by avoiding the need to feed each query-document pair through a massive neural network at query time.\n\nFor longer documents, the benefits of this approach are twofold:\n\n1. **Efficiency in Handling Long Documents**: Since ColBERT encodes document representations offline, it can efficiently manage longer documents without a proportional increase in computational cost at query time. This is unlike traditional BERT-based models that might require more computational resources to process longer documents due to their size and complexity.\n\n2. **Effectiveness in Capturing Fine-Grained Semantics**: The fine-grained representations and the late interaction mechanism enable ColBERT to effectively capture the nuances and detailed semantics of longer documents. This is crucial for maintaining high retrieval quality, as longer documents often contain more information and require a more nuanced understanding to match relevant queries accurately.\n\nThus, ColBERT's architecture, which leverages the strengths of BERT for deep language understanding while introducing efficiencies through late interaction, makes it particularly adept at handling longer documents. 
It achieves this by pre-computing and efficiently utilizing detailed semantic representations of documents, enabling both high-quality retrieval and significant speed-ups in query processing times compared to traditional BERT-based models.\n\nReference: ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT by ['Omar Khattab', 'Matei Zaharia'] (https://arxiv.org/pdf/2004.12832.pdf), page 4." ``` ## Summary[¶](#summary) Vespa’s streaming mode is a game-changer, enabling the creation of highly cost-effective RAG applications for naturally partitioned data. Now it is also possible to use ColBERT for re-ranking, without having to integrate any custom embedder or re-ranking code. In this notebook, we delved into the hands-on application of [LangChain](https://python.langchain.com/docs/get_started/introduction), leveraging document loaders and transformers. Finally, we showcased a custom LangChain retriever that connected all the functionality of LangChain with Vespa. For those interested in learning more about Vespa, join the [Vespa community on Slack](https://vespatalk.slack.com/) to exchange ideas, seek assistance, or stay in the loop on the latest Vespa developments. We can now delete the cloud instance: In \[ \]: Copied! ``` vespa_cloud.delete() ``` vespa_cloud.delete() # Using Cohere Binary Embeddings in Vespa[¶](#using-cohere-binary-embeddings-in-vespa) Cohere just released a new embedding API supporting binary and `int8` vectors. Read the announcement in the blog post: [Cohere int8 & binary Embeddings - Scale Your Vector Database to Large Datasets](https://cohere.com/blog/int8-binary-embeddings). > We are excited to announce that Cohere Embed is the first embedding model that natively supports int8 and binary embeddings. This is significant because: - Binarization reduces the storage footprint from 1024 floats (4096 bytes) per vector to 128 int8 (128 bytes). 
- 32x less data to store - Faster distance calculations using [hamming](https://docs.vespa.ai/en/reference/schema-reference.html#distance-metric) distance, which Vespa natively supports for bits packed into int8 precision. More on [hamming distance in Vespa](https://docs.vespa.ai/en/reference/schema-reference.html#hamming). Vespa supports `hamming` distance with and without [hnsw indexing](https://docs.vespa.ai/en/approximate-nn-hnsw.html). For those wanting to learn more about binary vectors, we recommend our 2021 blog series on [Billion-scale vector search with Vespa](https://blog.vespa.ai/billion-scale-knn/) and [Billion-scale vector search with Vespa - part two](https://blog.vespa.ai/billion-scale-knn-part-two/). This notebook demonstrates how to use the Cohere binary vectors with Vespa, including a re-ranking phase that uses the float query vector version for improved accuracy. From the Cohere blog announcement: > To improve the search quality, the float query embedding can be compared with the binary document embeddings using dot-product. So we first retrieve 10\*top_k results with the binary query embedding, and then rescore the binary document embeddings with the float query embedding. This pushes the search quality from 90% to 95%. Install the dependencies: In \[ \]: Copied! ``` !pip3 install -U pyvespa cohere==4.57 vespacli ``` !pip3 install -U pyvespa cohere==4.57 vespacli ## Examining the Cohere embeddings[¶](#examining-the-cohere-embeddings) Let us check out the Cohere embedding API and how we can obtain binarized embeddings. See also the [Cohere embed API doc](https://docs.cohere.com/docs/embed-api). In \[2\]: Copied! 
``` import cohere # Make sure that the environment variable CO_API_KEY is set to your API key co = cohere.Client() ``` ### Some sample documents[¶](#some-sample-documents) Define a few sample documents that we want to embed. In \[3\]: Copied! ``` documents = [ "Alan Turing was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist.", "Albert Einstein was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time.", "Isaac Newton was an English polymath active as a mathematician, physicist, astronomer, alchemist, theologian, and author who was described in his time as a natural philosopher.", "Marie Curie was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity", ] ``` Notice that we ask for `embedding_types=["binary"]`. In \[4\]: Copied! ``` # Compute the binary embeddings of documents. # Set input_type to "search_document" and embedding_types to "binary" cohere_response = co.embed( documents, model="embed-english-v3.0", input_type="search_document", embedding_types=["binary"], ) ``` # Compute the binary embeddings of documents.
# Set input_type to "search_document" and embedding_types to "binary" cohere_response = co.embed( documents, model="embed-english-v3.0", input_type="search_document", embedding_types=["binary"], ) In \[5\]: Copied! ``` print(cohere_response.embeddings.binary) ``` print(cohere_response.embeddings.binary) ``` [[-110, 121, 110, -50, 87, -59, 8, 35, 114, 30, -92, -112, -118, -16, 7, 96, 17, 51, 97, -9, -23, 25, -103, -35, -78, -47, 64, -123, -41, 67, 14, -31, -42, -126, 75, 111, 62, -64, 57, 64, -52, -66, -64, -12, 100, 99, 87, 61, -5, 5, 23, 34, -75, -66, -16, 91, 92, 121, 55, 117, 100, -112, -24, 84, 84, -65, 61, -31, -45, 7, 44, 8, -35, -125, 16, -50, -52, 11, -105, -32, 102, -62, -3, 86, -107, 21, 95, 15, 27, -79, -20, 114, 90, 125, 110, -97, -15, -98, 21, -102, -124, 112, -115, 26, -86, -55, 67, 7, 11, -127, 125, 103, -46, -55, 79, -31, 126, -32, 33, -128, -124, -80, 21, 27, -49, -9, 112, 101], [-110, -7, -24, 23, -33, 68, 24, 35, 22, -50, -32, 86, 74, -14, 71, 96, 81, -45, 105, -25, -73, 108, -99, 13, -76, 125, 73, -44, -34, -34, -105, 75, 86, -58, 85, -30, -92, -27, -39, 0, -75, -2, 30, -12, -116, 9, 81, 39, 76, 44, 87, 20, -43, 110, -75, 20, 108, 125, -75, 85, -28, -118, -24, 127, 78, -75, 108, -20, -48, 3, 12, 12, 71, -29, -98, -26, 68, 11, 0, -104, 96, 70, -3, 53, -98, -108, 127, -102, -17, -84, -88, 88, -54, -45, -11, -4, -4, 15, -67, 122, -108, 117, -51, 40, 98, -47, 102, -103, 3, -123, -85, 119, -48, -24, 95, -34, -26, -24, -31, -9, 99, 64, -128, -43, 74, -91, 80, -95], [64, -14, -4, 30, 118, 5, 8, 35, 51, 3, 72, -122, -70, -10, 2, -20, 17, 115, -67, -9, 115, 31, -103, -73, -78, 65, 64, -123, -41, 91, 14, -39, -41, -78, 73, -62, 60, -28, 89, 32, 33, -35, -62, 116, 102, -45, 83, 63, 73, 37, 23, 64, -43, -46, -106, 83, 109, 92, -87, -15, -60, -39, -23, 63, 84, 56, -6, -15, 20, 3, 76, 3, 104, -16, -79, 70, -123, 15, -125, -111, 109, -105, -99, 82, -19, -27, 95, -113, 94, -74, 57, 82, -102, -7, -95, -21, -3, -66, 73, 95, -124, 37, -115, -81, 107, -55, -25, 6, 
19, -107, -120, 111, -110, -23, 79, -26, 106, -61, -96, -77, 9, 116, -115, -67, -63, -9, -43, 77], [-109, -7, -32, 19, 87, 116, 8, 35, 54, -102, -64, -106, -14, -10, 31, 78, -99, 59, -6, -45, 97, 96, -103, 37, 69, -35, -119, -59, 95, 27, 14, 73, 86, -9, -43, 110, -70, 96, 45, 32, -91, 62, -64, -12, 100, -55, 34, 62, 14, 5, 22, 67, -75, -17, -14, 81, 45, 125, -15, -11, -28, 75, -25, 20, 42, -78, -4, -67, -44, 11, 76, 3, 127, 40, 0, 103, 75, -62, -123, -111, 64, -13, -10, -5, -66, -89, 119, -70, -29, -95, -19, 82, 106, 127, -24, -11, -48, 15, -29, -102, -115, 107, -115, 55, -69, -61, 103, 11, 3, 25, -118, 63, -108, 11, 78, -28, 14, 124, 119, -61, 97, 84, 53, 69, 123, 89, -104, -127]] ``` As we can see from the above, we got an array of binary embeddings, using signed `int8` precision in the numeric range [-128 to 127]. Each embedding vector has 128 dimensions: In \[6\]: Copied! ``` len(cohere_response.embeddings.binary[0]) ``` len(cohere_response.embeddings.binary[0]) Out\[6\]: ``` 128 ``` ## Defining the Vespa application[¶](#defining-the-vespa-application) First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their type. Notice the `binary_vector` field that defines an indexed (dense) Vespa tensor with the dimension name `x[128]`. Indexing specifies `index` which means that Vespa will use HNSW indexing for this field. Also notice the configuration of [distance-metric](https://docs.vespa.ai/en/reference/schema-reference.html#distance-metric) where we specify `hamming`. In \[20\]: Copied! 
``` from vespa.package import Schema, Document, Field, FieldSet my_schema = Schema( name="doc", mode="index", document=Document( fields=[ Field( name="doc_id", type="string", indexing=["summary", "index"], match=["word"], rank="filter", ), Field( name="text", type="string", indexing=["summary", "index"], index="enable-bm25", ), Field( name="binary_vector", type="tensor(x[128])", indexing=["attribute", "index"], attribute=["distance-metric: hamming"], ), ] ), fieldsets=[FieldSet(name="default", fields=["text"])], ) ``` from vespa.package import Schema, Document, Field, FieldSet my_schema = Schema( name="doc", mode="index", document=Document( fields=\[ Field( name="doc_id", type="string", indexing=["summary", "index"], match=["word"], rank="filter", ), Field( name="text", type="string", indexing=["summary", "index"], index="enable-bm25", ), Field( name="binary_vector", type="tensor(x[128])", indexing=["attribute", "index"], attribute=["distance-metric: hamming"], ), \] ), fieldsets=\[FieldSet(name="default", fields=["text"])\], ) We must add the schema to a Vespa [application package](https://docs.vespa.ai/en/application-packages.html). This consists of configuration files, schemas, models, and possibly even custom code (plugins). In \[21\]: Copied! ``` from vespa.package import ApplicationPackage vespa_app_name = "cohere" vespa_application_package = ApplicationPackage(name=vespa_app_name, schema=[my_schema]) ``` from vespa.package import ApplicationPackage vespa_app_name = "cohere" vespa_application_package = ApplicationPackage(name=vespa_app_name, schema=[my_schema]) In the last step, we configure [ranking](https://docs.vespa.ai/en/ranking.html) by adding `rank-profile`'s to the schema. `unpack_bits` unpacks the binary representation into a 1024-dimensional float vector [doc](https://docs.vespa.ai/en/reference/ranking-expressions.html#unpack-bits). 
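The ranking expression `2*unpack_bits(attribute(binary_vector)) - 1` shown in the next cell runs server-side in Vespa, but the transform is easy to sketch locally. The numpy version below assumes big-endian bit order (the order used by `np.packbits`/`np.unpackbits`, which is also how Vespa documents `unpack_bits`); the `doc_binary` vector is a hypothetical stand-in for a Cohere binary embedding:

```python
import numpy as np

def unpack_binary_representation(binary_vector):
    # Expand 128 int8 values into 1024 bits, then map {0, 1} -> {-1, +1},
    # mirroring the ranking expression 2*unpack_bits(...) - 1.
    packed = np.array(binary_vector, dtype=np.int8)
    bits = np.unpackbits(packed.view(np.uint8))  # 1024 values in {0, 1}
    return 2.0 * bits - 1.0

# Hypothetical document vector standing in for a Cohere binary embedding.
doc_binary = np.random.randint(-128, 128, size=128).tolist()
unpacked = unpack_binary_representation(doc_binary)

# Second-phase score: dot product with the full float query vector,
# as in sum(query(q_full) * unpack_binary_representation).
q_full = np.random.rand(1024)
score = float(np.dot(q_full, unpacked))
```

This is why the second phase can rescore with the float query vector even though only 128 int8 values are stored per document.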
We define two tensor inputs, one compact binary representation that is used for the nearestNeighbor search and one full version that is used in ranking. In \[22\]: Copied! ``` from vespa.package import RankProfile, FirstPhaseRanking, SecondPhaseRanking, Function rerank = RankProfile( name="rerank", inputs=[ ("query(q_binary)", "tensor(x[128])"), ("query(q_full)", "tensor(x[1024])"), ], functions=[ Function( # this returns a tensor(x[1024]) with values -1 or 1 name="unpack_binary_representation", expression="2*unpack_bits(attribute(binary_vector)) -1", ) ], first_phase=FirstPhaseRanking( expression="closeness(field, binary_vector)" # 1/(1 + hamming_distance). Calculated between the binary query and the binary_vector ), second_phase=SecondPhaseRanking( expression="sum( query(q_full)* unpack_binary_representation )", # re-rank using the dot product between float query and the unpacked binary representation rerank_count=100, ), match_features=[ "distance(field, binary_vector)", "closeness(field, binary_vector)", ], ) my_schema.add_rank_profile(rerank) ``` from vespa.package import RankProfile, FirstPhaseRanking, SecondPhaseRanking, Function rerank = RankProfile( name="rerank", inputs=\[ ("query(q_binary)", "tensor(x[128])"), ("query(q_full)", "tensor(x[1024])"), \], functions=\[ Function( # this returns a tensor(x[1024]) with values -1 or 1 name="unpack_binary_representation", expression="2\*unpack_bits(attribute(binary_vector)) -1", ) \], first_phase=FirstPhaseRanking( expression="closeness(field, binary_vector)" # 1/(1 + hamming_distance). 
Calculated between the binary query and the binary_vector ), second_phase=SecondPhaseRanking( expression="sum( query(q_full)\* unpack_binary_representation )", # re-rank using the dot product between float query and the unpacked binary representation rerank_count=100, ), match_features=[ "distance(field, binary_vector)", "closeness(field, binary_vector)", ], ) my_schema.add_rank_profile(rerank) ## Deploy the application to Vespa Cloud[¶](#deploy-the-application-to-vespa-cloud) With the configured application, we can deploy it to [Vespa Cloud](https://cloud.vespa.ai/en/). To deploy the application to Vespa Cloud we need to create a tenant in the Vespa Cloud: Create a tenant at [console.vespa-cloud.com](https://console.vespa-cloud.com/) (unless you already have one). This step requires a Google or GitHub account, and will start your [free trial](https://cloud.vespa.ai/en/free-trial). Make note of the tenant name, it is used in the next steps. > Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days. In \[26\]: Copied! ``` from vespa.deployment import VespaCloud import os # Replace with your tenant name from the Vespa Cloud Console tenant_name = "vespa-team" # Key is only used for CI/CD. Can be removed if logging in interactively key = os.getenv("VESPA_TEAM_API_KEY", None) if key is not None: key = key.replace(r"\n", "\n") # To parse key correctly vespa_cloud = VespaCloud( tenant=tenant_name, application=vespa_app_name, key_content=key, # Key is only used for CI/CD. Can be removed if logging in interactively application_package=vespa_application_package, ) ``` from vespa.deployment import VespaCloud import os # Replace with your tenant name from the Vespa Cloud Console tenant_name = "vespa-team" # Key is only used for CI/CD. 
Can be removed if logging in interactively key = os.getenv("VESPA_TEAM_API_KEY", None) if key is not None: key = key.replace(r"\n", "\n") # To parse key correctly vespa_cloud = VespaCloud( tenant=tenant_name, application=vespa_app_name, key_content=key, # Key is only used for CI/CD. Can be removed if logging in interactively application_package=vespa_application_package, ) Now deploy the app to Vespa Cloud dev zone. The first deployment typically takes 2 minutes until the endpoint is up. In \[ \]: Copied! ``` from vespa.application import Vespa app: Vespa = vespa_cloud.deploy() ``` ## Feed our sample documents and their binary embedding representation[¶](#feed-our-sample-documents-and-their-binary-embedding-representation) With only a few documents, we use the synchronous API. Read more in [reads and writes](https://vespa-engine.github.io/pyvespa/reads-writes.md). In \[28\]: Copied! ``` from vespa.io import VespaResponse with app.syncio(connections=12) as sync: for i, doc in enumerate(documents): response: VespaResponse = sync.feed_data_point( schema="doc", data_id=str(i), fields={ "doc_id": str(i), "text": doc, "binary_vector": cohere_response.embeddings.binary[i], }, ) assert response.is_successful() ``` When we have lots of vector data, we can use the [hex format for binary indexed tensors](https://docs.vespa.ai/en/reference/document-json-format.html#tensor-hex-dump). In \[30\]: Copied!
``` from binascii import hexlify import numpy as np def to_hex_str(binary_vector): return str(hexlify(np.array(binary_vector, dtype=np.int8)), "utf-8") ``` from binascii import hexlify import numpy as np def to_hex_str(binary_vector): return str(hexlify(np.array(binary_vector, dtype=np.int8)), "utf-8") Feed using hex format In \[32\]: Copied! ``` with app.syncio() as sync: for i, doc in enumerate(documents): response: VespaResponse = sync.feed_data_point( schema="doc", data_id=str(i), fields={ "doc_id": str(i), "text": doc, "binary_vector": { "values": to_hex_str(cohere_response.embeddings.binary[i]) }, }, ) assert response.is_successful() ``` with app.syncio() as sync: for i, doc in enumerate(documents): response: VespaResponse = sync.feed_data_point( schema="doc", data_id=str(i), fields={ "doc_id": str(i), "text": doc, "binary_vector": { "values": to_hex_str(cohere_response.embeddings.binary[i]) }, }, ) assert response.is_successful() ### Querying data[¶](#querying-data) Read more about querying Vespa in: - [Vespa Query API](https://docs.vespa.ai/en/query-api.html) - [Vespa Query API reference](https://docs.vespa.ai/en/reference/query-api-reference.html) - [Vespa Query Language API (YQL)](https://docs.vespa.ai/en/query-language.html) - [Practical Nearest Neighbor Search Guide](https://docs.vespa.ai/en/nearest-neighbor-search-guide.html) In \[33\]: Copied! ``` query = "Who discovered x-ray?" # Make sure to set input_type="search_query" when getting the embeddings for the query. # We ask for both float and binary query embeddings cohere_query_response = co.embed( [query], model="embed-english-v3.0", input_type="search_query", embedding_types=["float", "binary"], ) ``` query = "Who discovered x-ray?" # Make sure to set input_type="search_query" when getting the embeddings for the query. 
# We ask for both float and binary query embeddings cohere_query_response = co.embed( [query], model="embed-english-v3.0", input_type="search_query", embedding_types=["float", "binary"], ) Now, we use nearestNeighbor search to retrieve 100 hits using hamming distance. These hits are then exposed to the Vespa ranking framework, where we re-rank using the dot product between the float query tensor and the unpacked binary vector (the unpacking returns a 1024-dimensional float version). In \[35\]: Copied! ``` response = app.query( yql="select * from doc where {targetHits:100}nearestNeighbor(binary_vector,q_binary)", ranking="rerank", body={ "input.query(q_binary)": to_hex_str(cohere_query_response.embeddings.binary[0]), "input.query(q_full)": cohere_query_response.embeddings.float[0], }, ) assert response.is_successful() ``` In \[36\]: Copied!
``` response.hits ``` response.hits Out\[36\]: ``` [{'id': 'id:doc:doc::3', 'relevance': 8.697503089904785, 'source': 'cohere_content', 'fields': {'matchfeatures': {'closeness(field,binary_vector)': 0.0029940119760479044, 'distance(field,binary_vector)': 333.0}, 'sddocname': 'doc', 'documentid': 'id:doc:doc::3', 'doc_id': '3', 'text': 'Marie Curie was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity'}}, {'id': 'id:doc:doc::1', 'relevance': 6.413589954376221, 'source': 'cohere_content', 'fields': {'matchfeatures': {'closeness(field,binary_vector)': 0.002551020408163265, 'distance(field,binary_vector)': 391.00000000000006}, 'sddocname': 'doc', 'documentid': 'id:doc:doc::1', 'doc_id': '1', 'text': 'Albert Einstein was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time.'}}, {'id': 'id:doc:doc::2', 'relevance': 6.379772663116455, 'source': 'cohere_content', 'fields': {'matchfeatures': {'closeness(field,binary_vector)': 0.002652519893899204, 'distance(field,binary_vector)': 376.0}, 'sddocname': 'doc', 'documentid': 'id:doc:doc::2', 'doc_id': '2', 'text': 'Isaac Newton was an English polymath active as a mathematician, physicist, astronomer, alchemist, theologian, and author who was described in his time as a natural philosopher.'}}, {'id': 'id:doc:doc::0', 'relevance': 4.5963287353515625, 'source': 'cohere_content', 'fields': {'matchfeatures': {'closeness(field,binary_vector)': 0.0024271844660194173, 'distance(field,binary_vector)': 411.00000000000006}, 'sddocname': 'doc', 'documentid': 'id:doc:doc::0', 'doc_id': '0', 'text': 'Alan Turing was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist.'}}] ``` Notice the returned hits. The `relevance` is the score assigned by the second-phase dot product between the full query version and the unpacked binary vector. 
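The relationship between the two match features can be checked by hand: for the hamming distance metric, `closeness(field, binary_vector)` is `1 / (1 + distance)`. A minimal check against the distances returned above:

```python
def closeness(distance: float) -> float:
    # Vespa's closeness for the hamming distance metric: 1 / (1 + distance)
    return 1.0 / (1.0 + distance)

# Values from the match features in the response above.
assert abs(closeness(333.0) - 0.0029940119760479044) < 1e-12  # top hit, doc 3
assert abs(closeness(391.0) - 0.002551020408163265) < 1e-12  # doc 1
```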
Also, we see the match features and the hamming distances. Notice that the re-ranking step has re-ordered doc 1 and doc 2. ## Conclusions[¶](#conclusions) These new Cohere binary embeddings are a huge step forward for cost-efficient vector search at scale and integrate perfectly with the rich feature set in Vespa. ### Clean up[¶](#clean-up) We can now delete the cloud instance: In \[ \]: Copied! ``` vespa_cloud.delete() ``` # Standalone ColBERT with Vespa for end-to-end retrieval and ranking[¶](#standalone-colbert-with-vespa-for-end-to-end-retrieval-and-ranking) This notebook illustrates using the [ColBERT](https://github.com/stanford-futuredata/ColBERT) package to produce token vectors, instead of using the native Vespa [colbert embedder](https://docs.vespa.ai/en/embedding.html#colbert-embedder). This guide illustrates how to feed and query using a single passage representation: - Compress token vectors using binarization compatible with the Vespa `unpack_bits` function used in ranking. This implements the binarization of token-level vectors using `numpy`. - Use the Vespa hex feed format for binary vectors [doc](https://docs.vespa.ai/en/reference/document-json-format.html#tensor). - Query examples. As a bonus, this also demonstrates how to use ColBERT end-to-end with Vespa for both retrieval and ranking. The retrieval step searches the binary token-level representations using hamming distance. This uses 32 nearestNeighbor operators in the same query, each finding 100 nearest hits in hamming space. Then the results are re-ranked using the full-blown MaxSim calculation. See [Announcing the Vespa ColBERT embedder](https://blog.vespa.ai/announcing-colbert-embedder-in-vespa/) for details on ColBERT and the binary quantization used to compress ColBERT's token-level vectors. In \[ \]: Copied!
``` !pip3 install -U pyvespa colbert-ai numpy torch "transformers<=4.49.0" ``` Load a ColBERT checkpoint and obtain document and query embeddings: In \[ \]: Copied! ``` from colbert.modeling.checkpoint import Checkpoint from colbert.infra import ColBERTConfig ckpt = Checkpoint( "colbert-ir/colbertv2.0", colbert_config=ColBERTConfig(root="experiments") ) ``` In \[139\]: Copied! ``` passage = [ "Alan Mathison Turing was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist." ] ``` In \[ \]: Copied! ``` vectors = ckpt.docFromText(passage)[0] ``` In \[129\]: Copied! ``` vectors.shape ``` Out\[129\]: ``` torch.Size([27, 128]) ``` In this case, we got 27 token-level embeddings, each using 128 float dimensions. This includes the CLS token and special tokens used to differentiate the query from the document encoding. In \[130\]: Copied! ``` query_vectors = ckpt.queryFromText(["Who was Alan Turing?"])[0] query_vectors.shape ``` Out\[130\]: ``` torch.Size([32, 128]) ``` Below are routines for binarization and for producing the Vespa tensor format, usable both in queries and in the JSON feed. In \[118\]: Copied!
``` import numpy as np import torch from binascii import hexlify from typing import Dict, List def binarize_token_vectors_hex(vectors: torch.Tensor) -> Dict[str, str]: binarized_token_vectors = np.packbits(np.where(vectors > 0, 1, 0), axis=1).astype( np.int8 ) vespa_token_feed = dict() for index in range(0, len(binarized_token_vectors)): vespa_token_feed[index] = str( hexlify(binarized_token_vectors[index].tobytes()), "utf-8" ) return vespa_token_feed def float_query_token_vectors(vectors: torch.Tensor) -> Dict[str, List[float]]: vespa_token_feed = dict() for index in range(0, len(vectors)): vespa_token_feed[index] = vectors[index].tolist() return vespa_token_feed ``` import numpy as np import torch from binascii import hexlify from typing import Dict, List def binarize_token_vectors_hex(vectors: torch.Tensor) -> Dict\[str, str\]: binarized_token_vectors = np.packbits(np.where(vectors > 0, 1, 0), axis=1).astype( np.int8 ) vespa_token_feed = dict() for index in range(0, len(binarized_token_vectors)): vespa_token_feed[index] = str( hexlify(binarized_token_vectors[index].tobytes()), "utf-8" ) return vespa_token_feed def float_query_token_vectors(vectors: torch.Tensor) -> Dict\[str, List[float]\]: vespa_token_feed = dict() for index in range(0, len(vectors)): vespa_token_feed[index] = vectors[index].tolist() return vespa_token_feed In \[ \]: Copied! ``` import json print(json.dumps(binarize_token_vectors_hex(vectors))) print(json.dumps(float_query_token_vectors(query_vectors))) ``` import json print(json.dumps(binarize_token_vectors_hex(vectors))) print(json.dumps(float_query_token_vectors(query_vectors))) ## Defining the Vespa application[¶](#defining-the-vespa-application) [PyVespa](https://vespa-engine.github.io/pyvespa/) helps us build the [Vespa application package](https://docs.vespa.ai/en/application-packages.html). A Vespa application package consists of configuration files, schemas, models, and code (plugins). 
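Before defining the schema, it is worth sanity-checking the hex encoding produced by the `binarize_token_vectors_hex` routine above. The sketch below uses a toy random 128-dim vector (not a real ColBERT embedding) and verifies that decoding the hex string recovers the exact sign-bit pattern:

```python
import numpy as np
from binascii import hexlify, unhexlify

# Toy 128-dim vector standing in for one token embedding (not a real one)
rng = np.random.default_rng(0)
vec = rng.standard_normal(128)

# Same scheme as binarize_token_vectors_hex: one sign bit per dimension,
# packed into 16 bytes and hex-encoded as 32 characters
bits = np.where(vec > 0, 1, 0)
hex_str = str(hexlify(np.packbits(bits).astype(np.int8).tobytes()), "utf-8")
print(len(hex_str))  # 32

# Decoding the hex string recovers the exact bit pattern
roundtrip = np.unpackbits(np.frombuffer(unhexlify(hex_str), dtype=np.uint8))
assert (roundtrip == bits).all()
```

This round-trip mirrors what Vespa does at feed time: the 32-character hex string is parsed back into 16 bytes for the `v[16]` dense dimension of the tensor field.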
First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their type. We use HNSW with hamming distance for retrieval. In \[151\]: Copied! ``` from vespa.package import Schema, Document, Field colbert_schema = Schema( name="doc", document=Document( fields=[ Field(name="id", type="string", indexing=["summary"]), Field(name="passage", type="string", indexing=["index", "summary"]), Field( name="colbert", type="tensor(token{}, v[16])", indexing=["attribute", "summary", "index"], attribute=["distance-metric:hamming"], ), ] ), ) ``` from vespa.package import Schema, Document, Field colbert_schema = Schema( name="doc", document=Document( fields=\[ Field(name="id", type="string", indexing=["summary"]), Field(name="passage", type="string", indexing=["index", "summary"]), Field( name="colbert", type="tensor(token{}, v[16])", indexing=["attribute", "summary", "index"], attribute=["distance-metric:hamming"], ), \] ), ) In \[152\]: Copied! ``` from vespa.package import ApplicationPackage vespa_app_name = "colbert" vespa_application_package = ApplicationPackage( name=vespa_app_name, schema=[colbert_schema] ) ``` from vespa.package import ApplicationPackage vespa_app_name = "colbert" vespa_application_package = ApplicationPackage( name=vespa_app_name, schema=[colbert_schema] ) We need to define all the query input tensors. We are going to input up to 32 query tensors in binary form; these are used for retrieval. In \[92\]: Copied! ``` query_binary_input_tensors = [] for index in range(0, 32): query_binary_input_tensors.append( ("query(binary_vector_{})".format(index), "tensor(v[16])") ) ``` query_binary_input_tensors = [] for index in range(0, 32): query_binary_input_tensors.append( ("query(binary_vector\_{})".format(index), "tensor(v[16])") ) Note that we use just MaxSim in the first-phase ranking over all the hits that are retrieved by the query. In \[153\]: Copied! 
``` from vespa.package import RankProfile, Function, FirstPhaseRanking colbert = RankProfile( name="default", inputs=[ ("query(qt)", "tensor(querytoken{}, v[128])"), *query_binary_input_tensors, ], functions=[ Function( name="max_sim", expression=""" sum( reduce( sum( query(qt) * unpack_bits(attribute(colbert)) , v ), max, token ), querytoken ) """, ) ], first_phase=FirstPhaseRanking(expression="max_sim"), ) colbert_schema.add_rank_profile(colbert) ``` from vespa.package import RankProfile, Function, FirstPhaseRanking colbert = RankProfile( name="default", inputs=\[ ("query(qt)", "tensor(querytoken{}, v[128])"), \*query_binary_input_tensors, \], functions=[ Function( name="max_sim", expression=""" sum( reduce( sum( query(qt) * unpack_bits(attribute(colbert)) , v ), max, token ), querytoken ) """, ) ], first_phase=FirstPhaseRanking(expression="max_sim"), ) colbert_schema.add_rank_profile(colbert) ## Deploy the application to Vespa Cloud[¶](#deploy-the-application-to-vespa-cloud) With the configured application, we can deploy it to [Vespa Cloud](https://cloud.vespa.ai/en/). It is also possible to deploy the app using Docker; see the [Hybrid Search - Quickstart](https://vespa-engine.github.io/pyvespa/getting-started-pyvespa.md) guide for an example of deploying it to a local Docker container. Install the Vespa CLI. In \[ \]: Copied! ``` !pip3 install vespacli ``` !pip3 install vespacli To deploy the application to Vespa Cloud we need to create a tenant in the Vespa Cloud: Create a tenant at [console.vespa-cloud.com](https://console.vespa-cloud.com/) (unless you already have one). This step requires a Google or GitHub account, and will start your [free trial](https://cloud.vespa.ai/en/free-trial). Make note of the tenant name; it is used in the next steps. ### Configure Vespa Cloud data-plane security[¶](#configure-vespa-cloud-data-plane-security) Create a Vespa Cloud data-plane mTLS cert/key-pair. 
The mutual certificate pair is used to talk to your Vespa cloud endpoints. See [Vespa Cloud Security Guide](https://cloud.vespa.ai/en/security/guide) for details. We save the paths to the credentials for later data-plane access without using pyvespa APIs. In \[ \]: Copied! ``` import os os.environ["TENANT_NAME"] = "vespa-team" # Replace with your tenant name vespa_cli_command = ( f'vespa config set application {os.environ["TENANT_NAME"]}.{vespa_app_name}' ) !vespa config set target cloud !{vespa_cli_command} !vespa auth cert -N ``` import os os.environ["TENANT_NAME"] = "vespa-team" # Replace with your tenant name vespa_cli_command = ( f'vespa config set application {os.environ["TENANT_NAME"]}.{vespa_app_name}' ) !vespa config set target cloud !{vespa_cli_command} !vespa auth cert -N Validate that we have the expected data-plane credential files: In \[52\]: Copied! ``` from os.path import exists from pathlib import Path cert_path = ( Path.home() / ".vespa" / f"{os.environ['TENANT_NAME']}.{vespa_app_name}.default/data-plane-public-cert.pem" ) key_path = ( Path.home() / ".vespa" / f"{os.environ['TENANT_NAME']}.{vespa_app_name}.default/data-plane-private-key.pem" ) if not exists(cert_path) or not exists(key_path): print( "ERROR: set the correct paths to security credentials. Correct paths above and rerun until you do not see this error" ) ``` from os.path import exists from pathlib import Path cert_path = ( Path.home() / ".vespa" / f"{os.environ['TENANT_NAME']}.{vespa_app_name}.default/data-plane-public-cert.pem" ) key_path = ( Path.home() / ".vespa" / f"{os.environ['TENANT_NAME']}.{vespa_app_name}.default/data-plane-private-key.pem" ) if not exists(cert_path) or not exists(key_path): print( "ERROR: set the correct paths to security credentials. 
Correct paths above and rerun until you do not see this error" ) Note that the subsequent Vespa Cloud deploy call below will add `data-plane-public-cert.pem` to the application before deploying it to Vespa Cloud, so that you have access to both the private key and the public certificate. At the same time, Vespa Cloud only knows the public certificate. ### Configure Vespa Cloud control-plane security[¶](#configure-vespa-cloud-control-plane-security) Authenticate to generate a tenant level control plane API key for deploying the applications to Vespa Cloud, and save the path to it. The generated tenant api key must be added in the Vespa Console before attempting to deploy the application. ``` To use this key in Vespa Cloud click 'Add custom key' at https://console.vespa-cloud.com/tenant/TENANT_NAME/account/keys and paste the entire public key including the BEGIN and END lines. ``` In \[ \]: Copied! ``` !vespa auth api-key from pathlib import Path api_key_path = Path.home() / ".vespa" / f"{os.environ['TENANT_NAME']}.api-key.pem" ``` !vespa auth api-key from pathlib import Path api_key_path = Path.home() / ".vespa" / f"{os.environ['TENANT_NAME']}.api-key.pem" ### Deploy to Vespa Cloud[¶](#deploy-to-vespa-cloud) Now that we have data-plane and control-plane credentials ready, we can deploy our application to Vespa Cloud! `PyVespa` supports deploying apps to the [development zone](https://cloud.vespa.ai/en/reference/environments#dev-and-perf). > Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days. In \[154\]: Copied! ``` from vespa.deployment import VespaCloud def read_secret(): """Read the API key from the environment variable. 
This is only used for CI/CD purposes.""" t = os.getenv("VESPA_TEAM_API_KEY") if t: return t.replace(r"\n", "\n") else: return t vespa_cloud = VespaCloud( tenant=os.environ["TENANT_NAME"], application=vespa_app_name, key_content=read_secret() if read_secret() else None, key_location=api_key_path, application_package=vespa_application_package, ) ``` from vespa.deployment import VespaCloud def read_secret(): """Read the API key from the environment variable. This is only used for CI/CD purposes.""" t = os.getenv("VESPA_TEAM_API_KEY") if t: return t.replace(r"\\n", "\\n") else: return t vespa_cloud = VespaCloud( tenant=os.environ["TENANT_NAME"], application=vespa_app_name, key_content=read_secret() if read_secret() else None, key_location=api_key_path, application_package=vespa_application_package, ) Now deploy the app to Vespa Cloud dev zone. The first deployment typically takes 2 minutes until the endpoint is up. In \[ \]: Copied! ``` from vespa.application import Vespa app: Vespa = vespa_cloud.deploy() ``` from vespa.application import Vespa app: Vespa = vespa_cloud.deploy() In \[156\]: Copied! ``` from vespa.io import VespaResponse vespa_feed_format = { "id": "1", "passage": passage[0], "colbert": binarize_token_vectors_hex(vectors), } with app.syncio() as sync: response: VespaResponse = sync.feed_data_point( data_id=1, fields=vespa_feed_format, schema="doc" ) ``` from vespa.io import VespaResponse vespa_feed_format = { "id": "1", "passage": passage[0], "colbert": binarize_token_vectors_hex(vectors), } with app.syncio() as sync: response: VespaResponse = sync.feed_data_point( data_id=1, fields=vespa_feed_format, schema="doc" ) ## Querying[¶](#querying) Now we create all the query token vectors in binary form and use 32 nearestNeighbor query operators that are combined with OR. These hits are then exposed to ranking where the final MaxSim is performed using the unpacked binary representations. In \[ \]: Copied! 
``` query_vectors = ckpt.queryFromText(["Who was Alan Turing?"])[0] binary_query_input_tensors = binarize_token_vectors_hex(query_vectors) ``` query_vectors = ckpt.queryFromText(["Who was Alan Turing?"])[0] binary_query_input_tensors = binarize_token_vectors_hex(query_vectors) In \[158\]: Copied! ``` binary_query_vectors = dict() nn_operators = list() for index in range(0, 32): name = "input.query(binary_vector_{})".format(index) nn_argument = "binary_vector_{}".format(index) value = binary_query_input_tensors[index] binary_query_vectors[name] = value nn_operators.append("({targetHits:100}nearestNeighbor(colbert, %s))" % nn_argument) ``` binary_query_vectors = dict() nn_operators = list() for index in range(0, 32): name = "input.query(binary_vector\_{})".format(index) nn_argument = "binary_vector\_{}".format(index) value = binary_query_input_tensors[index] binary_query_vectors[name] = value nn_operators.append("({targetHits:100}nearestNeighbor(colbert, %s))" % nn_argument) In \[159\]: Copied! 
``` nn_operators = " OR ".join(nn_operators) ``` nn_operators = " OR ".join(nn_operators) Out\[159\]: ``` '({targetHits:100}nearestNeighbor(colbert, binary_vector_0)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_1)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_2)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_3)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_4)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_5)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_6)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_7)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_8)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_9)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_10)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_11)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_12)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_13)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_14)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_15)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_16)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_17)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_18)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_19)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_20)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_21)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_22)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_23)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_24)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_25)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_26)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_27)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_28)) OR ({targetHits:100}nearestNeighbor(colbert, 
binary_vector_29)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_30)) OR ({targetHits:100}nearestNeighbor(colbert, binary_vector_31))' ``` In \[161\]: Copied! ``` from vespa.io import VespaQueryResponse import json response: VespaQueryResponse = app.query( yql="select * from doc where {}".format(nn_operators), ranking="default", body={ "presentation.format.tensors": "short-value", "input.query(qt)": float_query_token_vectors(query_vectors), **binary_query_vectors, }, ) assert response.is_successful() print(json.dumps(response.hits[0], indent=2)) ``` from vespa.io import VespaQueryResponse import json response: VespaQueryResponse = app.query( yql="select * from doc where {}".format(nn_operators), ranking="default", body={ "presentation.format.tensors": "short-value", "input.query(qt)": float_query_token_vectors(query_vectors), \*\*binary_query_vectors, }, ) assert response.is_successful() print(json.dumps(response.hits[0], indent=2)) ``` { "id": "id:doc:doc::1", "relevance": 100.57648777961731, "source": "colbert_content", "fields": { "sddocname": "doc", "documentid": "id:doc:doc::1", "id": "1", "passage": "Alan Mathison Turing was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist.", "colbert": { "0": [ 3, 120, 69, 0, 37, -60, -58, -95, -120, 32, -127, 67, -36, 68, -106, -12 ], "1": [ -106, 40, -119, -128, 96, -60, -58, 33, 48, 96, -127, 67, -100, 96, -106, -12 ], "2": [ -28, -84, 73, -18, 113, -60, -51, 40, -96, 121, 4, 24, -99, 68, -47, -60 ], "3": [ -13, 40, 75, -124, 65, 64, -32, -53, 12, 64, 125, 4, 24, -64, -69, 101 ], "4": [ 33, -54, 113, 24, 77, -36, -44, 3, -32, -72, 40, 41, -38, 102, 53, -35 ], "5": [ 3, -22, 73, -95, 73, -51, 85, -128, -121, 25, 17, 68, 90, 64, -113, -28 ], "6": [ -109, -72, -114, 0, 97, -58, -57, -95, 40, -96, -112, 67, -97, -85, -42, -12 ], "7": [ -112, 56, -114, 0, 97, -58, -57, -83, 40, -96, -127, 67, -97, 43, -42, -12 ], "8": [ 22, -71, 65, 96, 0, -60, 108, 37, 
16, 106, -55, 115, -117, -56, -28, -12 ], "9": [ -106, -72, 94, 30, 32, -60, -60, -19, 24, -56, -47, -63, -40, -53, -103, -11 ], "10": [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "11": [ -126, 121, 3, -103, 32, 70, 103, -23, 88, -55, -61, 71, -101, -106, -8, -68 ], "12": [ 18, 24, -106, 30, 36, -42, -60, 104, 57, -120, -128, -61, -67, -53, -100, -11 ], "13": [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "14": [ 22, 49, -38, 17, 36, -42, -25, 65, 25, -56, -45, -59, -102, -2, -65, 125 ], "15": [ -105, 25, -50, 16, 0, -42, -28, 45, 48, -56, -112, -55, -3, -87, -112, -11 ], "16": [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "17": [ 55, 43, -62, 33, -91, 68, 99, 32, 72, 10, -41, 70, -117, -78, -73, -11 ], "18": [ 3, 53, -117, 20, 36, -42, 79, 33, 9, -120, -41, 69, -36, -69, -111, 117 ], "19": [ 23, 16, -42, 20, 44, -42, -26, 33, 57, -120, -112, -63, -3, -24, -108, -11 ], "20": [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "21": [ -110, 53, -106, 28, 32, -42, -58, 77, 61, -56, -42, -15, -68, -5, -110, -11 ], "22": [ -109, 56, -114, 0, 96, -42, -58, -83, 40, -96, -128, -61, -99, -21, -44, -12 ], "23": [ 18, 57, -50, 30, 36, 86, -60, 69, 9, -120, -48, -63, -75, -22, -98, -11 ], "24": [ 30, -71, -106, 26, 32, -42, -50, 104, 56, 64, -48, -61, -4, -8, -104, -12 ], "25": [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "26": [ 7, 56, 70, 0, 36, -58, -42, 33, -104, 34, -127, 67, -99, 96, -105, -12 ] } } } ``` Another example where we brute-force "true" search without a retrieval step using nearestNeighbor or other filters. In \[ \]: Copied! 
``` from vespa.io import VespaQueryResponse import json response: VespaQueryResponse = app.query( yql="select * from doc where true", ranking="default", body={ "presentation.format.tensors": "short-value", "input.query(qt)": float_query_token_vectors(query_vectors), }, ) assert response.is_successful() print(json.dumps(response.hits[0], indent=2)) ``` from vespa.io import VespaQueryResponse import json response: VespaQueryResponse = app.query( yql="select * from doc where true", ranking="default", body={ "presentation.format.tensors": "short-value", "input.query(qt)": float_query_token_vectors(query_vectors), }, ) assert response.is_successful() print(json.dumps(response.hits[0], indent=2)) In \[ \]: Copied! ``` vespa_cloud.delete() ``` vespa_cloud.delete() # Standalone ColBERT + Vespa for long-context ranking[¶](#standalone-colbert-vespa-for-long-context-ranking) This is a guide on how to use the [ColBERT](https://github.com/stanford-futuredata/ColBERT) package to produce token-level vectors. This is an alternative to using the native Vespa [colbert embedder](https://docs.vespa.ai/en/embedding.html#colbert-embedder). This guide illustrates how to feed multiple passages per Vespa document (long-context): - Compress token vectors using binarization compatible with Vespa `unpack_bits` - Use the Vespa hex feed format for binary vectors with mixed Vespa tensors - How to query Vespa with the ColBERT query tensor representation Read more about [Vespa Long-Context ColBERT](https://blog.vespa.ai/announcing-long-context-colbert-in-vespa/). In \[ \]: Copied! ``` !pip3 install -U pyvespa colbert-ai numpy torch vespacli "transformers<=4.49.0" ``` !pip3 install -U pyvespa colbert-ai numpy torch vespacli "transformers\<=4.49.0" Load a checkpoint with ColBERT and obtain document and query embeddings In \[ \]: Copied! 
``` from colbert.modeling.checkpoint import Checkpoint from colbert.infra import ColBERTConfig ckpt = Checkpoint( "colbert-ir/colbertv2.0", colbert_config=ColBERTConfig(root="experiments") ) ``` from colbert.modeling.checkpoint import Checkpoint from colbert.infra import ColBERTConfig ckpt = Checkpoint( "colbert-ir/colbertv2.0", colbert_config=ColBERTConfig(root="experiments") ) A few sample documents: In \[50\]: Copied! ``` document_passages = [ "Alan Turing was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist.", "Born in Maida Vale, London, Turing was raised in southern England. He graduated from King's College, Cambridge, with a degree in mathematics.", "After the war, Turing worked at the National Physical Laboratory, where he designed the Automatic Computing Engine, one of the first designs for a stored-program computer.", "Turing has an extensive legacy with statues of him and many things named after him, including an annual award for computer science innovations.", ] ``` document_passages = [ "Alan Turing was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist.", "Born in Maida Vale, London, Turing was raised in southern England. He graduated from King's College, Cambridge, with a degree in mathematics.", "After the war, Turing worked at the National Physical Laboratory, where he designed the Automatic Computing Engine, one of the first designs for a stored-program computer.", "Turing has an extensive legacy with statues of him and many things named after him, including an annual award for computer science innovations.", ] In \[ \]: Copied! ``` document_token_vectors = ckpt.docFromText(document_passages) ``` document_token_vectors = ckpt.docFromText(document_passages) See the shape of the ColBERT document embeddings: In \[52\]: Copied! 
``` document_token_vectors.shape ``` document_token_vectors.shape Out\[52\]: ``` torch.Size([4, 35, 128]) ``` In \[53\]: Copied! ``` query_vectors = ckpt.queryFromText(["Who was Alan Turing?"])[0] query_vectors.shape ``` query_vectors = ckpt.queryFromText(["Who was Alan Turing?"])[0] query_vectors.shape Out\[53\]: ``` torch.Size([32, 128]) ``` The query is always padded to 32 tokens, so in the above we have 32 query token vectors. Routines for binarization and output in Vespa tensor format that can be used in queries and the JSON feed. In \[67\]: Copied! ``` import numpy as np import torch from binascii import hexlify from typing import List, Dict def binarize_token_vectors_hex(vectors: torch.Tensor) -> List[dict]: # Notice axis=2 to pack the bits in the last dimension, which holds the token-level vectors binarized_token_vectors = np.packbits(np.where(vectors > 0, 1, 0), axis=2).astype( np.int8 ) vespa_tensor = list() for chunk_index in range(0, len(binarized_token_vectors)): token_vectors = binarized_token_vectors[chunk_index] for token_index in range(0, len(token_vectors)): values = str(hexlify(token_vectors[token_index].tobytes()), "utf-8") if ( values == "00000000000000000000000000000000" ): # skip empty vectors due to padding with batch of passages continue vespa_tensor_cell = { "address": {"context": chunk_index, "token": token_index}, "values": values, } vespa_tensor.append(vespa_tensor_cell) return vespa_tensor def float_query_token_vectors(vectors: torch.Tensor) -> Dict[str, List[float]]: vespa_token_feed = dict() for index in range(0, len(vectors)): vespa_token_feed[index] = vectors[index].tolist() return vespa_token_feed ``` import numpy as np import torch from binascii import hexlify from typing import List, Dict def binarize_token_vectors_hex(vectors: torch.Tensor) -> List\[dict\]: # Notice axis=2 to pack the bits in the last dimension, which holds the token-level vectors binarized_token_vectors = np.packbits(np.where(vectors > 0, 1, 0), axis=2).astype( np.int8 
) vespa_tensor = list() for chunk_index in range(0, len(binarized_token_vectors)): token_vectors = binarized_token_vectors[chunk_index] for token_index in range(0, len(token_vectors)): values = str(hexlify(token_vectors[token_index].tobytes()), "utf-8") if ( values == "00000000000000000000000000000000" ): # skip empty vectors due to padding with batch of passages continue vespa_tensor_cell = { "address": {"context": chunk_index, "token": token_index}, "values": values, } vespa_tensor.append(vespa_tensor_cell) return vespa_tensor def float_query_token_vectors(vectors: torch.Tensor) -> Dict\[str, List[float]\]: vespa_token_feed = dict() for index in range(0, len(vectors)): vespa_token_feed[index] = vectors[index].tolist() return vespa_token_feed In \[ \]: Copied! ``` import json print(json.dumps(binarize_token_vectors_hex(document_token_vectors))) print(json.dumps(float_query_token_vectors(query_vectors))) ``` import json print(json.dumps(binarize_token_vectors_hex(document_token_vectors))) print(json.dumps(float_query_token_vectors(query_vectors))) ## Defining the Vespa application[¶](#defining-the-vespa-application) [PyVespa](https://vespa-engine.github.io/pyvespa/) helps us build the [Vespa application package](https://docs.vespa.ai/en/application-packages.html). A Vespa application package consists of configuration files, schemas, models, and code (plugins). First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their type. In \[60\]: Copied! 
``` from vespa.package import Schema, Document, Field colbert_schema = Schema( name="doc", document=Document( fields=[ Field(name="id", type="string", indexing=["summary"]), Field( name="passages", type="array<string>", indexing=["summary", "index"], index="enable-bm25", ), Field( name="colbert", type="tensor(context{}, token{}, v[16])", indexing=["attribute", "summary"], ), ] ), ) ``` from vespa.package import Schema, Document, Field colbert_schema = Schema( name="doc", document=Document( fields=\[ Field(name="id", type="string", indexing=["summary"]), Field( name="passages", type="array\<string\>", indexing=["summary", "index"], index="enable-bm25", ), Field( name="colbert", type="tensor(context{}, token{}, v[16])", indexing=["attribute", "summary"], ), \] ), ) In \[61\]: Copied! ``` from vespa.package import ApplicationPackage vespa_app_name = "colbertlong" vespa_application_package = ApplicationPackage( name=vespa_app_name, schema=[colbert_schema] ) ``` from vespa.package import ApplicationPackage vespa_app_name = "colbertlong" vespa_application_package = ApplicationPackage( name=vespa_app_name, schema=[colbert_schema] ) Note that we use MaxSim in the first-phase ranking over all the hits that are retrieved by the query logic. Also note the asymmetric MaxSim, where we use `unpack_bits` to obtain a 128-d float vector representation from the binary vector representation. In \[62\]: Copied! 
``` from vespa.package import RankProfile, Function, FirstPhaseRanking colbert_profile = RankProfile( name="default", inputs=[("query(qt)", "tensor(querytoken{}, v[128])")], functions=[ Function( name="max_sim_per_context", expression=""" sum( reduce( sum( query(qt) * unpack_bits(attribute(colbert)) , v ), max, token ), querytoken ) """, ), Function( name="max_sim", expression="reduce(max_sim_per_context, max, context)" ), ], first_phase=FirstPhaseRanking(expression="max_sim"), match_features=["max_sim_per_context"], ) colbert_schema.add_rank_profile(colbert_profile) ``` from vespa.package import RankProfile, Function, FirstPhaseRanking colbert_profile = RankProfile( name="default", inputs=\[("query(qt)", "tensor(querytoken{}, v[128])")\], functions=[ Function( name="max_sim_per_context", expression=""" sum( reduce( sum( query(qt) * unpack_bits(attribute(colbert)) , v ), max, token ), querytoken ) """, ), Function( name="max_sim", expression="reduce(max_sim_per_context, max, context)" ), ], first_phase=FirstPhaseRanking(expression="max_sim"), match_features=["max_sim_per_context"], ) colbert_schema.add_rank_profile(colbert_profile) ## Deploy the application to Vespa Cloud[¶](#deploy-the-application-to-vespa-cloud) With the configured application, we can deploy it to [Vespa Cloud](https://cloud.vespa.ai/en/). To deploy the application to Vespa Cloud we need to create a tenant in the Vespa Cloud: Create a tenant at [console.vespa-cloud.com](https://console.vespa-cloud.com/) (unless you already have one). This step requires a Google or GitHub account, and will start your [free trial](https://cloud.vespa.ai/en/free-trial). Make note of the tenant name, it is used in the next steps. > Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days. In \[63\]: Copied! 
``` from vespa.deployment import VespaCloud import os # Replace with your tenant name from the Vespa Cloud Console tenant_name = "vespa-team" # Key is only used for CI/CD. Can be removed if logging in interactively key = os.getenv("VESPA_TEAM_API_KEY", None) if key is not None: key = key.replace(r"\n", "\n") # To parse key correctly vespa_cloud = VespaCloud( tenant=tenant_name, application=vespa_app_name, key_content=key, # Key is only used for CI/CD. Can be removed if logging in interactively application_package=vespa_application_package, ) ``` from vespa.deployment import VespaCloud import os # Replace with your tenant name from the Vespa Cloud Console tenant_name = "vespa-team" # Key is only used for CI/CD. Can be removed if logging in interactively key = os.getenv("VESPA_TEAM_API_KEY", None) if key is not None: key = key.replace(r"\\n", "\\n") # To parse key correctly vespa_cloud = VespaCloud( tenant=tenant_name, application=vespa_app_name, key_content=key, # Key is only used for CI/CD. Can be removed if logging in interactively application_package=vespa_application_package, ) Now deploy the app to Vespa Cloud dev zone. The first deployment typically takes 2 minutes until the endpoint is up. In \[ \]: Copied! ``` from vespa.application import Vespa app: Vespa = vespa_cloud.deploy() ``` from vespa.application import Vespa app: Vespa = vespa_cloud.deploy() Use the Vespa tensor `blocks` format for mixed tensors (two mapped dimensions and one dense) [doc](https://docs.vespa.ai/en/reference/document-json-format.html#tensor). In \[65\]: Copied! ``` from vespa.io import VespaResponse vespa_feed_format = { "id": "1", "passages": document_passages, "colbert": {"blocks": binarize_token_vectors_hex(document_token_vectors)}, } # synchronous feed (blocking and slow, but we only feed a few docs) with app.syncio() as sync: response: VespaResponse = sync.feed_data_point( data_id=1, fields=vespa_feed_format, schema="doc" ) ``` from vespa.io import VespaResponse vespa_feed_format = { "id": "1", "passages": document_passages, "colbert": {"blocks": binarize_token_vectors_hex(document_token_vectors)}, } # synchronous feed (blocking and slow, but we only feed a few docs) with app.syncio() as sync: response: VespaResponse = sync.feed_data_point( data_id=1, fields=vespa_feed_format, schema="doc" ) ### Querying Vespa with ColBERT tensors[¶](#querying-vespa-with-colbert-tensors) This example uses brute-force "true" search without a retrieval step using nearestNeighbor or keywords. In \[ \]: Copied! ``` from vespa.io import VespaQueryResponse import json response: VespaQueryResponse = app.query( yql="select * from doc where true", ranking="default", body={ "presentation.format.tensors": "short-value", "input.query(qt)": float_query_token_vectors(query_vectors), }, ) assert response.is_successful() ``` from vespa.io import VespaQueryResponse import json response: VespaQueryResponse = app.query( yql="select * from doc where true", ranking="default", body={ "presentation.format.tensors": "short-value", "input.query(qt)": float_query_token_vectors(query_vectors), }, ) assert response.is_successful() You should see output similar to this: ``` { "id": "id:doc:doc::1", "relevance": 100.0651626586914, "source": "colbertlong_content", "fields": { "matchfeatures": { "max_sim_per_context": { "0": 100.0651626586914, "1": 62.7861328125, "2": 67.44772338867188, "3": 60.133323669433594 } }, "sddocname": "doc", "documentid": "id:doc:doc::1", "id": "1", "passages": [ "Alan Turing was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist.", "Born in Maida Vale, London, Turing was raised in southern England. 
He graduated from King's College, Cambridge, with a degree in mathematics.", "After the war, Turing worked at the National Physical Laboratory, where he designed the Automatic Computing Engine, one of the first designs for a stored-program computer.", "Turing has an extensive legacy with statues of him and many things named after him, including an annual award for computer science innovations." ], "colbert": [ { "address": { "context": "0", "token": "0" }, "values": [ 1, 120, 69, 0, 33, -60, -58, -95, -120, 32, -127, 67, -51, 68, -106, -12 ] }, { "address": { "context": "0", "token": "1" }, "values": [ -122, 60, 9, -128, 97, -60, -58, -95, -80, 112, -127, 67, -99, 68, -106, -28 ] }, "..." ], } } ``` As can be seen from the matchfeatures, the first context (index 0) scored the highest and this is the score that is used to score the entire document. In \[ \]: Copied! ``` vespa_cloud.delete() ``` vespa_cloud.delete() # ColPali Ranking Experiments on DocVQA[¶](#colpali-ranking-experiments-on-docvqa) This notebook demonstrates how to reproduce the ColPali results on [DocVQA](https://huggingface.co/datasets/vidore/docvqa_test_subsampled) with Vespa. The dataset consists of PDF documents with questions and answers. We demonstrate how we can binarize the patch embeddings and replace the float MaxSim scoring with a `hamming` based MaxSim without much loss in ranking accuracy but with a significant speedup (close to 4x) and reducing the memory (and storage) requirements by 32x. In this notebook, we represent one PDF page as one vespa document. 
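To make the float-versus-binary trade-off concrete before diving in, here is a small NumPy sketch (on random toy vectors, not real ColPali embeddings) of float MaxSim next to the binarized, hamming-based variant used in this notebook:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy embeddings: 4 query token vectors and 6 document patch vectors, 128-dim
q = rng.standard_normal((4, 128)).astype(np.float32)
d = rng.standard_normal((6, 128)).astype(np.float32)

# Float MaxSim: for each query token, take its best-matching patch, then sum
float_maxsim = (q @ d.T).max(axis=1).sum()

# Binary quantization: threshold at zero, pack 128 bits into 16 bytes per vector
qb = np.packbits((q > 0).astype(np.uint8), axis=1)  # shape (4, 16)
db = np.packbits((d > 0).astype(np.uint8), axis=1)  # shape (6, 16)

# Hamming distance for every (token, patch) pair via XOR + bit count,
# turned into a similarity with 1 / (1 + distance), as in the rank profiles below
xor = qb[:, None, :] ^ db[None, :, :]             # shape (4, 6, 16)
hamming = np.unpackbits(xor, axis=2).sum(axis=2)  # shape (4, 6)
hamming_maxsim = (1.0 / (1.0 + hamming)).max(axis=1).sum()
```

The binary variant replaces 128 float multiply-adds per vector pair with a 16-byte XOR and popcount, which is where the speedup comes from.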
See other notebooks for more information about using ColPali with Vespa:

- [Scaling ColPali (VLM) Retrieval](https://vespa-engine.github.io/pyvespa/examples/simplified-retrieval-with-colpali-vlm_Vespa-cloud.ipynb)
- [Vespa 🤝 ColPali: Efficient Document Retrieval with Vision Language Models](https://vespa-engine.github.io/pyvespa/examples/colpali-document-retrieval-vision-language-models-cloud.ipynb)

Install dependencies:

```
!pip3 install transformers==4.51.3 accelerate pyvespa vespacli requests numpy scipy ir_measures pillow datasets
```

```
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from PIL import Image
from transformers import ColPaliForRetrieval, ColPaliProcessor
```

### Load the model[¶](#load-the-model)

Choose the right device and precision to run the model on.

```
# Load model (bfloat16 support is limited; fall back to float32 if needed)
device = "cuda" if torch.cuda.is_available() else "cpu"
if torch.backends.mps.is_available():
    device = "mps"  # For Apple Silicon devices
dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
```

Load the base model and the adapter.
```
model_name = "vidore/colpali-v1.2-hf"
model = ColPaliForRetrieval.from_pretrained(
    model_name,
    torch_dtype=dtype,
    device_map=device,  # "cpu", "cuda", or "mps" for Apple Silicon
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)
```

### The ViDoRe Benchmark[¶](#the-vidore-benchmark)

We load the DocVQA test set, a subset of the ViDoRe dataset. It has 500 pages, with one question per page. The task is to retrieve the correct page from the 500 indexed pages.

```
from datasets import load_dataset

ds = load_dataset("vidore/docvqa_test_subsampled", split="test")
```

Now we use the ColPali model to generate embeddings for the images in the dataset. We use a dataloader to process each image and store the embeddings in a list. Batch size 4 requires a GPU with 16GB of memory and fits into a T4 GPU. If you have a smaller GPU, reduce the batch size to 2.
```
dataloader = DataLoader(
    ds["image"],
    batch_size=4,
    shuffle=False,
    collate_fn=lambda x: processor(images=x, return_tensors="pt"),
)
embeddings = []
for batch_doc in tqdm(dataloader):
    with torch.no_grad():
        batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
        embeddings_doc = model(**batch_doc).embeddings
        embeddings.extend(list(torch.unbind(embeddings_doc.to("cpu"))))
```

```
100%|██████████| 125/125 [29:29<00:00, 14.16s/it]
```

Generate embeddings for the queries in the dataset.

```
dummy_image = Image.new("RGB", (448, 448), (255, 255, 255))
dataloader = DataLoader(
    ds["query"],
    batch_size=1,
    shuffle=False,
    collate_fn=lambda x: processor(text=x, return_tensors="pt"),
)
query_embeddings = []
for batch_query in tqdm(dataloader):
    with torch.no_grad():
        batch_query = {k: v.to(model.device) for k, v in batch_query.items()}
        embeddings_query = model(**batch_query).embeddings
        query_embeddings.extend(list(torch.unbind(embeddings_query.to("cpu"))))
```

```
100%|██████████| 500/500 [01:45<00:00,  4.72it/s]
```

Now we have all the embeddings.
We'll define two helper functions: one that binarizes embeddings (binary quantization, BQ), and one that packs float values into a shorter hex representation for JSON. Both save bandwidth and improve feed performance.

```
import struct
import numpy as np


def binarize_tensor(tensor: torch.Tensor) -> str:
    """
    Binarize a floating-point 1-d tensor by thresholding at zero
    and packing the bits into bytes. Returns the hex str representation of the bytes.
    """
    if not tensor.is_floating_point():
        raise ValueError("Input tensor must be of floating-point type.")
    return (
        np.packbits(np.where(tensor > 0, 1, 0), axis=0).astype(np.int8).tobytes().hex()
    )
```
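To see the claimed 32x space saving concretely, here is a small sketch on random data: 128 float32 values take 512 bytes, while the packed sign bits take 16 bytes, hex-encoded as 32 characters.

```python
import numpy as np

patch = np.random.randn(128).astype(np.float32)  # stand-in for one patch vector

float_bytes = patch.nbytes  # 128 float32 values -> 512 bytes

# Threshold at zero, pack 128 bits into 16 bytes, hex-encode for the JSON feed
packed = np.packbits(np.where(patch > 0, 1, 0).astype(np.uint8))
hex_str = packed.tobytes().hex()

print(float_bytes, packed.nbytes, len(hex_str))  # 512 16 32: a 32x reduction
```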
```
def tensor_to_hex_bfloat16(tensor: torch.Tensor) -> str:
    if not tensor.is_floating_point():
        raise ValueError("Input tensor must be of floating-point type.")

    def float_to_bfloat16_hex(f: float) -> str:
        # bfloat16 is the upper 16 bits of a float32 (assumes little-endian)
        packed_float = struct.pack("=f", f)
        bfloat16_bits = struct.unpack("=H", packed_float[2:])[0]
        return format(bfloat16_bits, "04X")

    hex_list = [float_to_bfloat16_hex(float(val)) for val in tensor.flatten()]
    return "".join(hex_list)
```

### Patch Vector pooling[¶](#patch-vector-pooling)

This reduces the number of patch embeddings by a factor of 3, meaning that we go from 1030 patch vectors to 343 patch vectors. This reduces both the memory and the number of dot products we need to calculate. This function is not used in this notebook, but is included for reference.

```
from scipy.cluster.hierarchy import fcluster, linkage
from typing import Dict, List


def pool_embeddings(embeddings: torch.Tensor, pool_factor=3) -> torch.Tensor:
    """
    Pool embeddings using hierarchical clustering to reduce the number of patch embeddings.
    Adapted from https://github.com/illuin-tech/vidore-benchmark/blob/e3b4f456d50271c69bce3d2c23131f5245d0c270/src/vidore_benchmark/compression/token_pooling.py#L32
    Inspired by https://www.answer.ai/posts/colbert-pooling.html
    """
    pooled_embeddings = []
    token_length = embeddings.size(0)
    if token_length == 1:
        raise ValueError("The input tensor must have more than one token.")
    embeddings = embeddings.to(device)
    similarities = torch.mm(embeddings, embeddings.t())
    if similarities.dtype == torch.bfloat16:
        similarities = similarities.to(torch.float16)
    similarities = 1 - similarities.cpu().numpy()
    Z = linkage(similarities, metric="euclidean", method="ward")  # noqa: N806
    max_clusters = max(token_length // pool_factor, 1)
    cluster_labels = fcluster(Z, t=max_clusters, criterion="maxclust")
    cluster_id_to_indices: Dict[int, torch.Tensor] = {}
    with torch.no_grad():
        for cluster_id in range(1, max_clusters + 1):
            cluster_indices = torch.where(torch.tensor(cluster_labels == cluster_id))[0]
            cluster_id_to_indices[cluster_id] = cluster_indices
            if cluster_indices.numel() > 0:
                pooled_embedding = embeddings[cluster_indices].mean(dim=0)
                pooled_embedding = torch.nn.functional.normalize(
                    pooled_embedding, p=2, dim=-1
                )
                pooled_embeddings.append(pooled_embedding)
    pooled_embeddings = torch.stack(pooled_embeddings, dim=0)
    return pooled_embeddings
```

Create the Vespa feed format. We use hex formats for mixed tensors; see the [document JSON format reference](https://docs.vespa.ai/en/reference/document-json-format.html#tensor).
```
vespa_docs = []
for row, embedding in zip(ds, embeddings):
    embedding_full = dict()
    embedding_binary = dict()
    # You can experiment with pooling if you want to reduce the number of embeddings
    # pooled_embedding = pool_embeddings(embedding, pool_factor=2)  # reduce by a factor of 2
    for j, emb in enumerate(embedding):
        embedding_full[j] = tensor_to_hex_bfloat16(emb)
        embedding_binary[j] = binarize_tensor(emb)
    vespa_doc = {
        "id": row["docId"],
        "embedding": embedding_full,
        "binary_embedding": embedding_binary,
    }
    vespa_docs.append(vespa_doc)
```

### Configure Vespa[¶](#configure-vespa)

[PyVespa](https://vespa-engine.github.io/pyvespa/) helps us build the [Vespa application package](https://docs.vespa.ai/en/application-packages.html). A Vespa application package consists of configuration files, schemas, models, and code (plugins).

First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their types. This is a simple schema, which is all we need to evaluate the effectiveness of the model.
```
from vespa.package import Schema, Document, Field

colpali_schema = Schema(
    name="pdf_page",
    document=Document(
        fields=[
            Field(name="id", type="string", indexing=["summary", "attribute"]),
            Field(
                name="embedding",
                type="tensor(patch{}, v[128])",
                indexing=["attribute"],
            ),
            Field(
                name="binary_embedding",
                type="tensor(patch{}, v[16])",
                indexing=["attribute"],
            ),
        ]
    ),
)
```

```
from vespa.package import ApplicationPackage

vespa_app_name = "visionragtest"
vespa_application_package = ApplicationPackage(
    name=vespa_app_name, schema=[colpali_schema]
)
```

Now we define how we want to rank the pages. We have 4 ranking models that we want to evaluate. These are all MaxSim variants, but with different precision trade-offs:

1. **float-float** A regular MaxSim implementation that uses the float representation of both query and page embeddings.
2. **float-binary** Uses the binarized representation of the page embeddings, which we unpack into float representation at ranking time. The query representation is still float.
3. **binary-binary** Uses the binarized representation of both the page embeddings and the query embeddings, and replaces the dot product with inverted hamming distance.
4. **phased** Uses binary-binary in a first phase, then re-ranks using the float-binary representation. Only the top 20 pages are re-ranked (this can be overridden in the query request as well).
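The phased idea can be sketched outside Vespa with stand-in scores (the numbers here are random and purely for illustration): a cheap score orders all documents, and an expensive score re-orders only the top few.

```python
import numpy as np

rng = np.random.default_rng(1)
n_docs, rerank_count = 100, 20

# Stand-ins for the two scores: hamming-based (cheap) and float-binary (expensive)
cheap_score = rng.random(n_docs)
expensive_score = cheap_score + 0.05 * rng.standard_normal(n_docs)

# First phase: order all documents by the cheap score
order = np.argsort(-cheap_score)

# Second phase: re-score only the top rerank_count documents with the expensive score
top = order[:rerank_count]
reranked_top = top[np.argsort(-expensive_score[top])]

final_ranking = np.concatenate([reranked_top, order[rerank_count:]])
```

The expensive function can only reorder (not replace) the candidates the first phase surfaced, which is why the re-rank depth matters, as explored later in this notebook.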
```
from vespa.package import RankProfile, Function, FirstPhaseRanking, SecondPhaseRanking

colpali_profile = RankProfile(
    name="float-float",
    # We define both the float and binary query inputs here; the rest of the profiles inherit these inputs
    inputs=[
        ("query(qtb)", "tensor(querytoken{}, v[16])"),
        ("query(qt)", "tensor(querytoken{}, v[128])"),
    ],
    functions=[
        Function(
            name="max_sim",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * cell_cast(attribute(embedding), float), v
                        ),
                        max, patch
                    ),
                    querytoken
                )
            """,
        )
    ],
    first_phase=FirstPhaseRanking(expression="max_sim"),
)

colpali_binary_profile = RankProfile(
    name="float-binary",
    inherits="float-float",
    functions=[
        Function(
            name="max_sim",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(binary_embedding)), v
                        ),
                        max, patch
                    ),
                    querytoken
                )
            """,
        )
    ],
    first_phase=FirstPhaseRanking(expression="max_sim"),
)

colpali_hamming_profile = RankProfile(
    name="binary-binary",
    inherits="float-float",
    functions=[
        Function(
            name="max_sim",
            expression="""
                sum(
                    reduce(
                        1/(1 + sum(
                            hamming(query(qtb), attribute(binary_embedding)), v
                        )),
                        max, patch
                    ),
                    querytoken
                )
            """,
        )
    ],
    first_phase=FirstPhaseRanking(expression="max_sim"),
)

colpali_phased_hamming_profile = RankProfile(
    name="phased",
    inherits="float-float",
    functions=[
        Function(
            name="max_sim_hamming",
            expression="""
                sum(
                    reduce(
                        1/(1 + sum(
                            hamming(query(qtb), attribute(binary_embedding)), v
                        )),
                        max, patch
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="max_sim",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(binary_embedding)), v
                        ),
                        max, patch
                    ),
                    querytoken
                )
            """,
        ),
    ],
    first_phase=FirstPhaseRanking(expression="max_sim_hamming"),
    second_phase=SecondPhaseRanking(expression="max_sim", rerank_count=20),
)

colpali_schema.add_rank_profile(colpali_profile)
colpali_schema.add_rank_profile(colpali_binary_profile)
colpali_schema.add_rank_profile(colpali_hamming_profile)
colpali_schema.add_rank_profile(colpali_phased_hamming_profile)
```

### Deploy to Vespa Cloud[¶](#deploy-to-vespa-cloud)

With the configured application, we can deploy it to [Vespa Cloud](https://cloud.vespa.ai/en/). `PyVespa` supports deploying apps to the [development zone](https://cloud.vespa.ai/en/reference/environments#dev-and-perf).

> Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.

To deploy the application to Vespa Cloud, we need a tenant in the Vespa Cloud: create one at [console.vespa-cloud.com](https://console.vespa-cloud.com/) (unless you already have one). This step requires a Google or GitHub account, and will start your [free trial](https://cloud.vespa.ai/en/free-trial). Make note of the tenant name; it is used in the next steps.

```
from vespa.deployment import VespaCloud
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
key = os.getenv("VESPA_TEAM_API_KEY", None)
if key is not None:
    key = key.replace(r"\n", "\n")  # To parse key correctly

vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=vespa_app_name,
    key_content=key,  # Key is only used for CI/CD testing of this notebook. Can be removed if logging in interactively
    application_package=vespa_application_package,
)
```

Now deploy the app to Vespa Cloud dev zone. The first deployment typically takes about 2 minutes until the endpoint is up.

```
from vespa.application import Vespa

app: Vespa = vespa_cloud.deploy()
```

This example uses the asynchronous feed method and feeds one document at a time.

```
from vespa.io import VespaResponse

async with app.asyncio(connections=1, timeout=180) as session:
    for doc in tqdm(vespa_docs):
        response: VespaResponse = await session.feed_data_point(
            data_id=doc["id"], fields=doc, schema="pdf_page"
        )
        if not response.is_successful():
            print(response.json())
```

```
100%|██████████| 500/500 [01:13<00:00,  6.77it/s]
```

### Run queries and evaluate effectiveness[¶](#run-queries-and-evaluate-effectiveness)

We use ir_measures to evaluate the effectiveness of the retrieval model.

```
from ir_measures import calc_aggregate, nDCG, ScoredDoc, Qrel
```

A simple routine for querying Vespa. Note that we send both vector representations in the query regardless of the ranking method used; this is for simplicity. Not all the ranking models we evaluate need both representations.
```
from vespa.io import VespaQueryResponse
from vespa.application import VespaAsync


async def get_vespa_response(
    embedding: torch.Tensor,
    qid: str,
    session: VespaAsync,
    depth=20,
    profile="float-float",
) -> List[ScoredDoc]:
    # The query tensor API does not support hex formats yet
    float_embedding = {index: vector.tolist() for index, vector in enumerate(embedding)}
    binary_embedding = {
        index: np.packbits(np.where(vector > 0, 1, 0), axis=0).astype(np.int8).tolist()
        for index, vector in enumerate(embedding)
    }
    response: VespaQueryResponse = await session.query(
        yql="select id from pdf_page where true",  # brute force search, rank all pages
        ranking=profile,
        hits=5,
        timeout=10,
        body={
            "input.query(qt)": float_embedding,
            "input.query(qtb)": binary_embedding,
            "ranking.rerankCount": depth,
        },
    )
    assert response.is_successful()
    scored_docs = []
    for hit in response.hits:
        doc_id = hit["fields"]["id"]
        score = hit["relevance"]
        scored_docs.append(ScoredDoc(qid, doc_id, score))
    return scored_docs
```

Run a test query first.

```
async with app.asyncio() as session:
    for profile in ["float-float", "float-binary", "binary-binary", "phased"]:
        print(
            await get_vespa_response(
                query_embeddings[0], profile, session, profile=profile
            )
        )
```

```
[ScoredDoc(query_id='float-float', doc_id='4720', score=16.292504370212555), ScoredDoc(query_id='float-float', doc_id='4858', score=13.315170526504517), ScoredDoc(query_id='float-float', doc_id='14686', score=12.212152108550072), ScoredDoc(query_id='float-float', doc_id='4846', score=12.002869427204132), ScoredDoc(query_id='float-float', doc_id='864', score=11.308563649654388)]
[ScoredDoc(query_id='float-binary', doc_id='4720', score=82.99432492256165), ScoredDoc(query_id='float-binary', doc_id='4858', score=71.45464742183685), ScoredDoc(query_id='float-binary', doc_id='14686', score=68.46699643135071), ScoredDoc(query_id='float-binary', doc_id='4846', score=64.85357594490051), ScoredDoc(query_id='float-binary', doc_id='2161', score=63.85516130924225)]
[ScoredDoc(query_id='binary-binary', doc_id='4720', score=0.771387243643403), ScoredDoc(query_id='binary-binary', doc_id='4858', score=0.7132036704570055), ScoredDoc(query_id='binary-binary', doc_id='14686', score=0.6979007869958878), ScoredDoc(query_id='binary-binary', doc_id='6087', score=0.6534321829676628), ScoredDoc(query_id='binary-binary', doc_id='2161', score=0.6525899451225996)]
[ScoredDoc(query_id='phased', doc_id='4720', score=82.99432492256165), ScoredDoc(query_id='phased', doc_id='4858', score=71.45464742183685), ScoredDoc(query_id='phased', doc_id='14686', score=68.46699643135071), ScoredDoc(query_id='phased', doc_id='4846', score=64.85357594490051), ScoredDoc(query_id='phased', doc_id='2161', score=63.85516130924225)]
```

Now, run through all of the test queries for each of the ranking models.

```
qrels = []
profiles = ["float-float", "float-binary", "binary-binary", "phased"]
results = {profile: [] for profile in profiles}

async with app.asyncio(connections=3) as session:
    for row, embedding in zip(tqdm(ds), query_embeddings):
        qrels.append(Qrel(row["questionId"], str(row["docId"]), 1))
        for profile in profiles:
            scored_docs = await get_vespa_response(
                embedding, row["questionId"], session, profile=profile
            )
            results[profile].extend(scored_docs)
```

```
500it [11:32,  1.39s/it]
```

Calculate the effectiveness of the 4 different models.

```
for profile in profiles:
    score = calc_aggregate([nDCG @ 5], qrels, results[profile])[nDCG @ 5]
    print(f"nDCG@5 for {profile}: {100*score:.2f}")
```

```
nDCG@5 for float-float: 52.37
nDCG@5 for float-binary: 51.64
nDCG@5 for binary-binary: 49.48
nDCG@5 for phased: 51.70
```

This is encouraging, as the binary-binary representation is 4x faster than the float-float representation and saves 32x space. We can also largely retain the effectiveness of the float-binary representation by using the phased approach, where we re-rank the top 20 pages from the hamming (binary-binary) version using the float-binary representation.

Now we can explore the ranking depth and see how the phased approach performs with different ranking depths.
```
results = {
    profile: []
    for profile in [
        "phased-rerank-count=5",
        "phased-rerank-count=10",
        "phased-rerank-count=20",
        "phased-rerank-count=40",
    ]
}
async with app.asyncio(connections=3) as session:
    for row, embedding in zip(tqdm(ds), query_embeddings):
        qrels.append(Qrel(row["questionId"], str(row["docId"]), 1))
        for count in [5, 10, 20, 40]:
            scored_docs = await get_vespa_response(
                embedding, row["questionId"], session, profile="phased", depth=count
            )
            results["phased-rerank-count=" + str(count)].extend(scored_docs)
```

```
500it [08:18,  1.00it/s]
```

```
for profile in results.keys():
    score = calc_aggregate([nDCG @ 5], qrels, results[profile])[nDCG @ 5]
    print(f"nDCG@5 for {profile}: {100*score:.2f}")
```

```
nDCG@5 for phased-rerank-count=5: 50.77
nDCG@5 for phased-rerank-count=10: 51.58
nDCG@5 for phased-rerank-count=20: 51.70
nDCG@5 for phased-rerank-count=40: 51.64
```

### Conclusion[¶](#conclusion)

The binary representation of the patch embeddings reduces storage by 32x, and using hamming distance instead of dot product saves about 4x in computation compared to the float-float model or the float-binary model (which only saves storage).
Using a re-ranking step with a depth of only 10, we can improve the effectiveness of the binary-binary model to almost match the float-float MaxSim model. The additional re-ranking step only requires that we also pass the float version of the query embeddings, without any additional storage overhead.

# Vespa 🤝 ColPali: Efficient Document Retrieval with Vision Language Models[¶](#vespa-colpali-efficient-document-retrieval-with-vision-language-models)

For a simpler example of using ColPali, where one Vespa document = one PDF page, see [simplified-retrieval-with-colpali](https://vespa-engine.github.io/pyvespa/examples/simplified-retrieval-with-colpali-vlm_Vespa-cloud.md).

This notebook demonstrates how to represent [ColPali](https://huggingface.co/vidore/colpali) in Vespa. ColPali is a powerful visual language model that can generate embeddings for images and text. In this notebook, we will use ColPali to generate embeddings for images of PDF *pages* and store them in Vespa. We will also store the base64-encoded image of the PDF page and some metadata like title and url. We will then demonstrate how to retrieve the PDF pages using the embeddings generated by ColPali.

[ColPali: Efficient Document Retrieval with Vision Language Models. Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo](https://arxiv.org/abs/2407.01449v2)

ColPali is a combination of [ColBERT](https://blog.vespa.ai/announcing-colbert-embedder-in-vespa/) and [PaliGemma](https://huggingface.co/blog/paligemma):

> ColPali is enabled by the latest advances in Vision Language Models, notably the PaliGemma model from the Google Zürich team, and leverages multi-vector retrieval through late interaction mechanisms as proposed in ColBERT by Omar Khattab.
Quote from [ColPali: Efficient Document Retrieval with Vision Language Models 👀](https://huggingface.co/blog/manu/colpali)

The ColPali model achieves remarkable retrieval performance on the ViDoRe (Visual Document Retrieval) Benchmark, beating complex pipelines with a single model.

The TLDR of this notebook:

- Generate an image per PDF page using [pdf2image](https://pypi.org/project/pdf2image/) and also extract the text using [pypdf](https://pypdf.readthedocs.io/en/stable/user/extract-text.html).
- For each page image, use ColPali to obtain the visual multi-vector embeddings.

We then store the ColBERT-style embeddings in Vespa, using the [long-context variant](https://blog.vespa.ai/announcing-long-context-colbert-in-vespa/) where we represent the embeddings per document with the tensor `tensor(page{}, patch{}, v[128])`. This lets us use the PDF as the document (retrievable unit), storing all page embeddings in the same document. The upside is that we do not need to duplicate document-level metadata like title, url, etc. The downside is that we cannot retrieve using the ColPali embeddings directly; instead, we use the extracted text for retrieval and the ColPali embeddings only for reranking.

Consider following the [ColQWen2](https://vespa-engine.github.io/pyvespa/examples/pdf-retrieval-with-ColQwen2-vlm_Vespa-cloud.md) notebook instead, as it uses a better model with improved performance (both accuracy and speed).

We also store the base64-encoded image and page metadata like title and url, so that we can display them in the result page and use them for RAG with powerful LLMs with vision capabilities.
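The reranking signal described above is ColBERT-style MaxSim: for each query token, take the similarity of the best-matching image patch, then sum over the query tokens; the PDF is ranked by its best page. A toy numpy sketch (made-up shapes; the real scoring runs inside Vespa's ranking expressions):

```python
import numpy as np


def max_sim(query_tokens: np.ndarray, page_patches: np.ndarray) -> float:
    # query_tokens: (num_query_tokens, 128), page_patches: (num_patches, 128)
    # For each query token, keep the best-matching patch, then sum over tokens.
    sims = query_tokens @ page_patches.T  # (num_query_tokens, num_patches)
    return float(sims.max(axis=1).sum())


rng = np.random.default_rng(42)
q = rng.standard_normal((8, 128))                             # 8 query token embeddings
pages = [rng.standard_normal((1030, 128)) for _ in range(3)]  # 3 pages of patch embeddings

page_scores = [max_sim(q, p) for p in pages]
best_page = int(np.argmax(page_scores))  # the max page score ranks the PDF document
```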
At query time, we retrieve using [BM25](https://docs.vespa.ai/en/reference/bm25.html) over the text from all pages, then use the ColPali embeddings to rerank the results using the max page score.

Let us get started. Note that the python pdf2image package requires poppler-utils; see other installation options [here](https://pdf2image.readthedocs.io/en/latest/installation.html#installing-poppler).

Install dependencies:

```
!sudo apt-get update && sudo apt-get install poppler-utils -y
```

Install python packages:

```
!pip3 install transformers==4.51.3 accelerate vidore_benchmark==4.0.0 pdf2image google-generativeai pypdf==5.0.1 pyvespa vespacli requests
```

```
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from PIL import Image
from io import BytesIO
from transformers import ColPaliForRetrieval, ColPaliProcessor
from vidore_benchmark.utils.image_utils import scale_image, get_base64_image
```

## Load the model[¶](#load-the-model)

This requires the HF_TOKEN environment variable to be set, as the underlying PaliGemma model is hosted on Hugging Face under a [restrictive license](https://ai.google.dev/gemma/terms) that requires authentication.

Choose the right device to run the model.
```
# Load model (bfloat16 support is limited; fall back to float32 if needed)
device = "cuda" if torch.cuda.is_available() else "cpu"
if torch.backends.mps.is_available():
    device = "mps"  # For Apple Silicon devices
dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
```

```
model_name = "vidore/colpali-v1.2-hf"
model = ColPaliForRetrieval.from_pretrained(
    model_name,
    torch_dtype=dtype,
    device_map=device,  # "cpu", "cuda", or "mps" for Apple Silicon
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)
```

## Working with PDFs[¶](#working-with-pdfs)

We need to convert a PDF to an array of images, one image per page. We will use pdf2image for this. Second, we also extract the text content of the PDF using pypdf.

NOTE: This step requires that you have `poppler` installed on your system. Read more in the [pdf2image](https://pdf2image.readthedocs.io/en/latest/installation.html) docs.
```
import requests
from pdf2image import convert_from_path
from pypdf import PdfReader


def download_pdf(url):
    response = requests.get(url)
    if response.status_code == 200:
        return BytesIO(response.content)
    else:
        raise Exception(f"Failed to download PDF: Status code {response.status_code}")


def get_pdf_images(pdf_url):
    # Download the PDF
    pdf_file = download_pdf(pdf_url)
    # Save the PDF temporarily to disk (pdf2image requires a file path)
    with open("temp.pdf", "wb") as f:
        f.write(pdf_file.read())
    reader = PdfReader("temp.pdf")
    page_texts = []
    for page_number in range(len(reader.pages)):
        page = reader.pages[page_number]
        text = page.extract_text()
        page_texts.append(text)
    images = convert_from_path("temp.pdf")
    assert len(images) == len(page_texts)
    return (images, page_texts)
```

We define a few sample PDFs to work with.
```
sample_pdfs = [
    {
        "title": "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction",
        "url": "https://arxiv.org/pdf/2112.01488.pdf",
        "authors": "Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, Matei Zaharia",
    },
    {
        "title": "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT",
        "url": "https://arxiv.org/pdf/2004.12832.pdf",
        "authors": "Omar Khattab, Matei Zaharia",
    },
]
```

Now we can convert the PDFs to images and also extract the text content.

```
for pdf in sample_pdfs:
    page_images, page_texts = get_pdf_images(pdf["url"])
    pdf["images"] = page_images
    pdf["texts"] = page_texts
```

Let us look at the extracted image of the first PDF page. This is the input to ColPali.

```
from IPython.display import display

display(scale_image(sample_pdfs[0]["images"][0], 720))
```

Now we use the ColPali model to generate embeddings for the images.
```
for pdf in sample_pdfs:
    page_embeddings = []
    dataloader = DataLoader(
        pdf["images"],
        batch_size=2,
        shuffle=False,
        collate_fn=lambda x: processor(images=x, return_tensors="pt"),
    )
    for batch_doc in tqdm(dataloader):
        with torch.no_grad():
            batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
            embeddings_doc = model(**batch_doc).embeddings
            page_embeddings.extend(list(torch.unbind(embeddings_doc.to("cpu"))))
    pdf["embeddings"] = page_embeddings
```

Now that we are done with the document-side embeddings, we convert the custom dict to the [Vespa JSON feed](https://docs.vespa.ai/en/reference/document-json-format.html) format. We use binarization of the vector embeddings to reduce their size. Read more about binarization of multi-vector representations in the [colbert blog post](https://blog.vespa.ai/announcing-colbert-embedder-in-vespa/). This maps 128-dimensional floats to 128 bits, or 16 bytes per vector, reducing the size by 32x.
```
import numpy as np
from typing import Dict, List
from binascii import hexlify


def binarize_token_vectors_hex(vectors: List[torch.Tensor]) -> List[dict]:
    vespa_tensor = list()
    for page_id in range(0, len(vectors)):
        page_vector = vectors[page_id]
        binarized_token_vectors = np.packbits(
            np.where(page_vector > 0, 1, 0), axis=1
        ).astype(np.int8)
        for patch_index in range(0, len(page_vector)):
            values = str(
                hexlify(binarized_token_vectors[patch_index].tobytes()), "utf-8"
            )
            if (
                values == "00000000000000000000000000000000"
            ):  # skip empty vectors due to padding of the batch
                continue
            vespa_tensor_cell = {
                "address": {"page": page_id, "patch": patch_index},
                "values": values,
            }
            vespa_tensor.append(vespa_tensor_cell)
    return vespa_tensor
```

Iterate over the sample PDFs and create the Vespa JSON feed format, including the base64-encoded page images.
```
vespa_feed = []
for idx, pdf in enumerate(sample_pdfs):
    images_base_64 = []
    for image in pdf["images"]:
        images_base_64.append(get_base64_image(image, add_url_prefix=False))
    pdf["images_base_64"] = images_base_64
    doc = {
        "fields": {
            "url": pdf["url"],
            "title": pdf["title"],
            "images": pdf["images_base_64"],
            "texts": pdf["texts"],  # Array of text per page
            "colbert": {  # ColBERT embeddings per page
                "blocks": binarize_token_vectors_hex(pdf["embeddings"])
            },
        }
    }
    vespa_feed.append(doc)
```

```
vespa_feed[0]["fields"]["colbert"]["blocks"][0:5]
```

Above is the feed format for mixed tensors with more than one mapped dimension, see [details](https://docs.vespa.ai/en/ranking/tensor-user-guide.html). We have the `page` and `patch` dimensions, and for each combination we have a binary representation of the 128-dimensional embedding, packed into 16 bytes. For each page image, we have 1030 patches, each with a 128-dimensional embedding.

## Configure Vespa[¶](#configure-vespa)

[PyVespa](https://vespa-engine.github.io/pyvespa/) helps us build the [Vespa application package](https://docs.vespa.ai/en/application-packages.html). A Vespa application package consists of configuration files, schemas, models, and code (plugins).

First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their types.
```
from vespa.package import Schema, Document, Field, FieldSet

colbert_schema = Schema(
    name="doc",
    document=Document(
        fields=[
            Field(name="url", type="string", indexing=["summary"]),
            Field(
                name="title",
                type="string",
                indexing=["summary", "index"],
                index="enable-bm25",
            ),
            Field(
                name="texts",
                type="array<string>",
                indexing=["index"],
                index="enable-bm25",
            ),
            Field(
                name="images",
                type="array<string>",
                indexing=["summary"],
            ),
            Field(
                name="colbert",
                type="tensor<int8>(page{}, patch{}, v[16])",
                indexing=["attribute"],
            ),
        ]
    ),
    fieldsets=[FieldSet(name="default", fields=["title", "texts"])],
)
```

Notice the `colbert` field is a tensor field with the type `tensor<int8>(page{}, patch{}, v[16])`: each 128-bit binarized patch embedding is packed into 16 int8 values. This is the field that will store the embeddings generated by ColPali. It is an example of a mixed tensor, combining two mapped (sparse) dimensions with one dense (indexed) dimension. Read more in the [Tensor guide](https://docs.vespa.ai/en/tensor-user-guide.html).

We also enable [BM25](https://docs.vespa.ai/en/reference/bm25.html) for the `title` and `texts` fields.

Create the Vespa [application package](https://docs.vespa.ai/en/application-packages):
```
from vespa.package import ApplicationPackage

vespa_app_name = "visionrag"
vespa_application_package = ApplicationPackage(
    name=vespa_app_name, schema=[colbert_schema]
)
```

Now we define how we want to rank the pages. We use BM25 for the text, and late interaction with MaxSim for the image embeddings. This means that we retrieve using the text representations to find relevant PDF documents, then use the ColPali embeddings to rerank the pages within each document, ranking the document by the max of its page scores.

We also return all the page-level scores using `match-features`, so that we can render multiple scoring pages in the search result. As LLMs get longer context windows, we can input more than a single page per PDF.

```
from vespa.package import RankProfile, Function, FirstPhaseRanking, SecondPhaseRanking

colbert_profile = RankProfile(
    name="default",
    inputs=[("query(qt)", "tensor(querytoken{}, v[128])")],
    functions=[
        Function(
            name="max_sim_per_page",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(colbert)), v
                        ),
                        max, patch
                    ),
                    querytoken
                )
            """,
        ),
        Function(name="max_sim", expression="reduce(max_sim_per_page, max, page)"),
        Function(name="bm25_score", expression="bm25(title) + bm25(texts)"),
    ],
    first_phase=FirstPhaseRanking(expression="bm25_score"),
    second_phase=SecondPhaseRanking(expression="max_sim", rerank_count=10),
    match_features=["max_sim_per_page", "bm25_score"],
)
colbert_schema.add_rank_profile(colbert_profile)
```

Validate that certificates are ok and deploy the application to Vespa Cloud.

### Deploy to Vespa Cloud[¶](#deploy-to-vespa-cloud)

With the configured application, we can deploy it to [Vespa Cloud](https://cloud.vespa.ai/en/). `PyVespa` supports deploying apps to the [development zone](https://cloud.vespa.ai/en/reference/environments#dev-and-perf).

> Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.

```
from vespa.deployment import VespaCloud
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
key = os.getenv("VESPA_TEAM_API_KEY", None)
if key is not None:
    key = key.replace(r"\n", "\n")  # To parse key correctly

vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=vespa_app_name,
    # Key is only used for CI/CD testing of this notebook.
    # Can be removed if logging in interactively.
    key_content=key,
    application_package=vespa_application_package,
)
```

Now deploy the app to the Vespa Cloud dev zone. The first deployment typically takes 2 minutes until the endpoint is up.

```
from vespa.application import Vespa

app: Vespa = vespa_cloud.deploy()
```

This example uses the synchronous feed method and feeds one document at a time. For larger datasets, consider using the asynchronous feed method.

```
from vespa.io import VespaResponse

with app.syncio() as sync:
    for operation in vespa_feed:
        fields = operation["fields"]
        response: VespaResponse = sync.feed_data_point(
            data_id=fields["url"], fields=fields, schema="doc"
        )
        if not response.is_successful():
            print(response.json())
```

## Querying Vespa[¶](#querying-vespa)

Ok, so now we have indexed the PDF pages in Vespa. Let us now obtain ColPali embeddings for a text query and use them to match against the indexed PDF pages.

Our demo query: *Composition of the LoTTE benchmark*
```
queries = ["Composition of the LoTTE benchmark"]
```

Obtain the query embeddings using the ColPali model:

```
dataloader = DataLoader(
    queries,
    batch_size=1,
    shuffle=False,
    collate_fn=lambda x: processor(text=x, return_tensors="pt"),
)
qs = []
for batch_query in dataloader:
    with torch.no_grad():
        batch_query = {k: v.to(model.device) for k, v in batch_query.items()}
        embeddings_query = model(**batch_query).embeddings
        qs.extend(list(torch.unbind(embeddings_query.to("cpu"))))
```

A simple routine to format the ColPali multi-vector embeddings to a format that can be used in Vespa. See [querying with tensors](https://docs.vespa.ai/en/tensor-user-guide.html#querying-with-tensors) for more details.
```
def float_query_token_vectors(vectors: torch.Tensor) -> Dict[int, List[float]]:
    vespa_token_dict = dict()
    for index in range(0, len(vectors)):
        vespa_token_dict[index] = vectors[index].tolist()
    return vespa_token_dict


dataloader = DataLoader(
    queries,
    batch_size=1,
    shuffle=False,
    collate_fn=lambda x: processor(text=x, return_tensors="pt"),
)
qs = []
for batch_query in dataloader:
    with torch.no_grad():
        batch_query = {k: v.to(model.device) for k, v in batch_query.items()}
        embeddings_query = model(**batch_query).embeddings
        qs.extend(list(torch.unbind(embeddings_query.to("cpu"))))
```

We create a simple routine to display the results. Notice that each hit is a PDF document. Within a PDF document we have multiple pages, and we have the MaxSim score for each page. The PDF documents are ranked by the maximum page score, but we have access to all the page-level scores, and below we display the top two pages for each PDF document.

We convert the base64-encoded image to a PIL image for rendering. We could also render the extracted text, but we skip that for now.

```
from IPython.display import display, HTML
import base64


def display_query_results(query, response):
    """
    Displays the query result, including the two best matching pages per matched PDF.
    """
    html_content = f"<h3>Query text: {query}</h3>"
    for i, hit in enumerate(response.hits[:2]):  # Adjust to show more hits if needed
        title = hit["fields"]["title"]
        url = hit["fields"]["url"]
        match_scores = hit["fields"]["matchfeatures"]["max_sim_per_page"]
        images = hit["fields"]["images"]
        html_content += f"<h4>PDF Result {i + 1}</h4>"
        html_content += f'<p><strong>Title:</strong> <a href="{url}">{title}</a></p>'
        # Find the two best matching pages
        sorted_pages = sorted(match_scores.items(), key=lambda x: x[1], reverse=True)
        best_pages = sorted_pages[:2]
        for page, score in best_pages:
            page = int(page)
            image_data = base64.b64decode(images[page])
            image = Image.open(BytesIO(image_data))
            scaled_image = scale_image(image, 648)
            buffered = BytesIO()
            scaled_image.save(buffered, format="PNG")
            img_str = base64.b64encode(buffered.getvalue()).decode()
            html_content += f"<p>Best Matching Page {page + 1} for PDF document, with MaxSim score {score:.2f}</p>"
            html_content += (
                f'<img src="data:image/png;base64,{img_str}" style="max-width:100%;">'
            )
    display(HTML(html_content))
```

Query Vespa with a text query and display the results.

```
from vespa.io import VespaQueryResponse

for idx, query in enumerate(queries):
    response: VespaQueryResponse = app.query(
        yql="select title,url,images from doc where userInput(@userQuery)",
        ranking="default",
        userQuery=query,
        timeout=2,
        hits=3,
        body={
            "presentation.format.tensors": "short-value",
            "input.query(qt)": float_query_token_vectors(qs[idx]),
        },
    )
    assert response.is_successful()
    display_query_results(query, response)
```

## RAG with LLMs with vision capabilities[¶](#rag-with-llms-with-vision-capabilities)

Now we can use the top-k documents to answer the question using an LLM with vision capabilities. This becomes an end-to-end pipeline using vision-capable language models, where we use ColPali visual embeddings for retrieval and Gemini Flash to read the retrieved PDF pages and answer the question with that context.

We will use the [Gemini Flash](https://deepmind.google/technologies/gemini/flash/) model for reading and answering. In the following, we input the best matching PDF *page* image and the question.

```
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
```

Just extract the best page image from the first hit to demonstrate how to use the image with Gemini Flash to answer the question.
```
best_hit = response.hits[0]
pdf_url = best_hit["fields"]["url"]
pdf_title = best_hit["fields"]["title"]
match_scores = best_hit["fields"]["matchfeatures"]["max_sim_per_page"]
images = best_hit["fields"]["images"]
sorted_pages = sorted(match_scores.items(), key=lambda x: x[1], reverse=True)
best_page, score = sorted_pages[0]
best_page = int(best_page)
image_data = base64.b64decode(images[best_page])
image = Image.open(BytesIO(image_data))
scaled_image = scale_image(image, 720)
display(scaled_image)
```

Initialize the Gemini Flash model and answer the question.

```
model = genai.GenerativeModel(model_name="gemini-flash-lite-latest")
response = model.generate_content([queries[0], image])
```

Some formatting of the response from Gemini Flash.

```
from IPython.display import Markdown, display

markdown_text = response.candidates[0].content.parts[0].text
display(Markdown(markdown_text))
```

## Summary[¶](#summary)

In this notebook, we have demonstrated how to represent ColPali in Vespa. We have used ColPali to generate embeddings for images of PDF pages and stored them in Vespa. We have also stored the base64-encoded image of each PDF page and some metadata like title and url.
We have then demonstrated how to retrieve the PDF pages using the embeddings generated by ColPali, and how to use the top-k documents to answer a question using an LLM with vision capabilities.

## Cleanup[¶](#cleanup)

When this notebook is running in CI, we want to delete the application:

```
if os.getenv("CI", "false") == "true":
    vespa_cloud.delete()
```

# Using Mixedbread.ai cross-encoder for reranking in Vespa.ai[¶](#using-mixedbreadai-cross-encoder-for-reranking-in-vespaai)

First, let us recap what cross-encoders are and where they might fit in a Vespa application. In contrast to bi-encoders, it is important to know that cross-encoders do NOT produce an embedding. Instead, a cross-encoder acts on *pairs* of input sequences and produces a single scalar score between 0 and 1, indicating the similarity or relevance between the two sequences.

> The cross-encoder model is a transformer-based model with a classification head on top of the Transformer CLS token (classification token).
>
> The model has been fine-tuned using the MS Marco passage training set and is a binary classifier which classifies if a query,document pair is relevant or not.

The quote is from [this](https://blog.vespa.ai/pretrained-transformer-language-models-for-search-part-4/) blog post from 2021, which explains cross-encoders more in-depth. Note that the reference to the MS Marco dataset applies to the model used in that blog post, not the model we will use in this notebook.

## Properties of cross-encoders and where they fit in Vespa[¶](#properties-of-cross-encoders-and-where-they-fit-in-vespa)

Cross-encoders are great at comparing a query and a document, but the time complexity increases linearly with the number of documents each query is compared to. This is why cross-encoders are often part of solutions at the top of leaderboards for ranking performance, such as the MS MARCO Passage Ranking leaderboard.
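The linear cost argument can be made concrete with a back-of-the-envelope sketch (hypothetical numbers, for illustration only): a cheap first-phase score selects a small top-k slice, and only that slice gets the expensive cross-encoder inference.

```python
# Hypothetical numbers, for illustration only.
corpus_size = 1_000_000      # documents matched by the query
rerank_count = 100           # global-phase rerank window (top-k)
ms_per_inference = 10        # assumed cross-encoder latency per (query, document) pair

cost_full = corpus_size * ms_per_inference     # cross-encode everything: 10,000 seconds
cost_phased = rerank_count * ms_per_inference  # cross-encode only the top-k: 1 second

assert cost_full // cost_phased == 10_000  # phased ranking is 10,000x cheaper here
```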
However, this leaderboard does not evaluate a solution's latency, and for production systems, doing cross-encoder inference for all documents in a corpus becomes prohibitively expensive. With Vespa's phased ranking capabilities, doing cross-encoder inference for a subset of documents at a later stage in the ranking pipeline can be a good trade-off between ranking performance and latency.

For the remainder of this notebook, we will look at using a cross-encoder in *global-phase reranking*, introduced in [this](https://blog.vespa.ai/improving-llm-context-ranking-with-cross-encoders/) blog post. The inference can also be run on GPU in [Vespa Cloud](https://cloud.vespa.ai/) to accelerate it even further.

## Exploring the Mixedbread.ai cross-encoder[¶](#exploring-the-mixedbreadai-cross-encoder)

[mixedbread.ai](https://huggingface.co/mixedbread-ai) has done an amazing job of releasing both (binary) embedding models and rerankers on Hugging Face 🤗 in recent weeks.

> Check out our previous notebook on using binary embeddings from mixedbread.ai in Vespa Cloud [here](https://vespa-engine.github.io/pyvespa/examples/mixedbread-binary-embeddings-with-sentence-transformers-cloud.md)

For this demo, we will use [mixedbread-ai/mxbai-rerank-xsmall-v1](https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1), but you can experiment with the larger models, depending on how you want to balance speed, accuracy, and cost (if you want to use GPU). This model is really powerful despite its small size, and provides a good trade-off between speed and accuracy.
Accuracy on the [BEIR](http://beir.ai) benchmark (11 datasets):

| Model | Accuracy |
| -------------------------- | -------- |
| Lexical Search | 66.4 |
| bge-reranker-base | 66.9 |
| bge-reranker-large | 70.6 |
| cohere-embed-v3 | 70.9 |
| **mxbai-rerank-xsmall-v1** | **70.0** |
| mxbai-rerank-base-v1 | 72.3 |
| mxbai-rerank-large-v1 | 74.9 |

(Table from mixedbread.ai's introductory [blog post](https://www.mixedbread.ai/blog/mxbai-rerank-v1).)

As we can see, the `mxbai-rerank-xsmall-v1` model is almost on par with much larger models while being much faster and cheaper to run.

## Downloading the model[¶](#downloading-the-model)

We will use the quantized version of `mxbai-rerank-xsmall-v1` for this demo, as it is faster and cheaper to run, but feel free to change to the model of your choice.

In \[1\]:

```
import requests
from pathlib import Path

url = "https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1/resolve/main/onnx/model_quantized.onnx"
local_model_path = "model/model_quantized.onnx"

r = requests.get(url)
# Create path if it doesn't exist
Path(local_model_path).parent.mkdir(parents=True, exist_ok=True)
with open(local_model_path, "wb") as f:
    f.write(r.content)
print(f"Downloaded model to {local_model_path}")
```

```
Downloaded model to model/model_quantized.onnx
```

## Inspecting the model[¶](#inspecting-the-model)

It is useful to inspect the expected inputs and outputs, along with their shapes, before integrating the model into Vespa.
This can be done, for instance, with the `sentence_transformers` and `onnxruntime` libraries. One-off tasks like this are well suited for a Colab notebook. One example of how to do this in Colab can be found here:

## What does a cross-encoder do?[¶](#what-does-a-crossencoder-do)

Below, we have tried to visualize what goes on inside a cross-encoder, which helps us understand how to use it in Vespa. The input pair (query, document) is prefixed with a special `[CLS]` token and separated by a `[SEP]` token. In Vespa, we want to tokenize the document body at indexing time and the query at query time, then combine them during ranking in the same way the cross-encoder does. Let us see how we can achieve this in Vespa.

## Defining our Vespa application[¶](#defining-our-vespa-application)

In \[2\]:

```
from vespa.package import (
    Component,
    Document,
    Field,
    FieldSet,
    Function,
    GlobalPhaseRanking,
    OnnxModel,
    Parameter,
    RankProfile,
    Schema,
)

schema = Schema(
    name="doc",
    mode="index",
    document=Document(
        fields=[
            Field(name="id", type="string", indexing=["summary", "attribute"]),
            Field(
                name="text",
                type="string",
                indexing=["index", "summary"],
                index="enable-bm25",
            ),
            # Let's add a synthetic field (see https://docs.vespa.ai/en/schemas.html#field)
            # to define how the tokens are derived from the text field
            Field(
                name="body_tokens",
                type="tensor(d0[512])",
                # The tokenizer will be defined in the next cell
                indexing=["input text", "embed tokenizer", "attribute", "summary"],
                is_document_field=False,  # Indicates a synthetic field
            ),
        ],
    ),
    fieldsets=[FieldSet(name="default", fields=["text"])],
    models=[
        OnnxModel(
            model_name="crossencoder",
            model_file_path=f"{local_model_path}",
            inputs={
                "input_ids": "input_ids",
                "attention_mask": "attention_mask",
            },
            outputs={"logits": "logits"},
        )
    ],
    rank_profiles=[
        RankProfile(name="bm25", first_phase="bm25(text)"),
        RankProfile(
            name="reranking",
            inherits="default",
            # We truncate the query to 64 tokens, meaning we have 512-64=448 tokens left for the document.
            inputs=[("query(q)", "tensor(d0[64])")],
            # See https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1/blob/main/tokenizer_config.json
            functions=[
                Function(
                    name="input_ids",
                    # See https://docs.vespa.ai/en/cross-encoders.html#roberta-based-model and https://docs.vespa.ai/en/reference/rank-features.html
                    expression="customTokenInputIds(1, 2, 512, query(q), attribute(body_tokens))",
                ),
                Function(
                    name="attention_mask",
                    expression="tokenAttentionMask(512, query(q), attribute(body_tokens))",
                ),
            ],
            first_phase="bm25(text)",
            global_phase=GlobalPhaseRanking(
                rerank_count=10,
                # We use the sigmoid function to force the output to be between 0 and 1, converting logits to probabilities.
                expression="sigmoid(onnx(crossencoder).logits{d0:0,d1:0})",
            ),
            summary_features=[
                "query(q)",
                "input_ids",
                "attention_mask",
                "onnx(crossencoder).logits",
            ],
        ),
    ],
)
```

In \[3\]:

```
from vespa.package import ApplicationPackage

app_package = ApplicationPackage(
    name="reranking",
    schema=[schema],
    components=[
        Component(
            # See https://docs.vespa.ai/en/reference/embedding-reference.html#huggingface-tokenizer-embedder
            id="tokenizer",
            type="hugging-face-tokenizer",
            parameters=[
                Parameter(
                    "model",
                    {
                        "url": "https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1/raw/main/tokenizer.json"
                    },
                ),
            ],
        )
    ],
)
```

It is useful to inspect the generated schema file before deploying the application.

In \[4\]:
```
print(schema.schema_to_text)
```

```
schema doc {
    document doc {
        field id type string {
            indexing: summary | attribute
        }
        field text type string {
            indexing: index | summary
            index: enable-bm25
        }
    }
    field body_tokens type tensor(d0[512]) {
        indexing: input text | embed tokenizer | attribute | summary
    }
    fieldset default {
        fields: text
    }
    onnx-model crossencoder {
        file: files/crossencoder.onnx
        input input_ids: input_ids
        input attention_mask: attention_mask
        output logits: logits
    }
    rank-profile bm25 {
        first-phase {
            expression {
                bm25(text)
            }
        }
    }
    rank-profile reranking inherits default {
        inputs {
            query(q) tensor(d0[64])
        }
        function input_ids() {
            expression {
                customTokenInputIds(1, 2, 512, query(q), attribute(body_tokens))
            }
        }
        function attention_mask() {
            expression {
                tokenAttentionMask(512, query(q), attribute(body_tokens))
            }
        }
        first-phase {
            expression {
                bm25(text)
            }
        }
        global-phase {
            rerank-count: 10
            expression {
                sigmoid(onnx(crossencoder).logits{d0:0,d1:0})
            }
        }
        summary-features {
            query(q)
            input_ids
            attention_mask
            onnx(crossencoder).logits
        }
    }
}
```

It looks fine. Now, let's just save the application package first, so that we also have more insight into the other files that are part of the application package.

In \[5\]:

```
# Optionally, we can also write the application package to disk before deploying it.
app_package.to_files("crossencoder-demo")
```

In \[6\]:

```
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker(port=8080)
app = vespa_docker.deploy(application_package=app_package)
```

```
Waiting for configuration server, 0/60 seconds...
Using plain http against endpoint http://localhost:8089/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8089/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8089/ApplicationStatus
Waiting for application status, 10/300 seconds...
Using plain http against endpoint http://localhost:8089/ApplicationStatus
Application is up!
Finished deployment.
```

In \[7\]:

```
from docker.models.containers import Container


def download_and_analyze_model(url: str, container: Container) -> None:
    """
    Downloads an ONNX model from a specified URL and analyzes it within a Docker container.

    Parameters:
        url (str): The URL from where the ONNX model should be downloaded.
        container (Container): The Docker container in which the command will be executed.

    Raises:
        Exception: Raises an exception if the command execution fails or if there are issues in streaming the output.

    Note:
        This function assumes that 'curl' and 'vespa-analyze-onnx-model' are available in the container environment.
    """
    # Define the path inside the container where the model will be stored.
    model_path = "/opt/vespa/var/model.onnx"
    # Construct the command to download and analyze the model inside the container.
    command = f"bash -c 'curl -Lo {model_path} {url} && vespa-analyze-onnx-model {model_path}'"
    # Command to delete the model after analysis.
    delete_command = f"rm {model_path}"
    # Execute the command in the container and handle potential errors.
    try:
        exit_code, output = container.exec_run(command, stream=True)
        # Print the output from the command.
        for line in output:
            print(line.decode(), end="")
        # Remove the model after analysis.
        container.exec_run(delete_command)
    except Exception as e:
        print(f"An error occurred: {e}")
        raise


url = "https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1/resolve/main/onnx/model.onnx"
# Example usage:
# download_and_analyze_model(url, vespa_docker.container)
```

```
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1126  100  1126    0     0   5715      0 --:--:-- --:--:-- --:--:--  5686
100  271M  100  271M    0     0  15.8M      0  0:00:17  0:00:17 --:--:-- 16.3M
unspecified option[0](optimize model), fallback: true
vm_size: 166648 kB, vm_rss: 46700 kB, malloc_peak: 0 kb, malloc_curr: 1100 (before loading model)
vm_size: 517176 kB, vm_rss: 405592 kB, malloc_peak: 0 kb, malloc_curr: 351628 (after loading model)
model meta-data:
  input[0]: 'input_ids' long[batch_size][sequence_length]
  input[1]: 'attention_mask' long[batch_size][sequence_length]
  output[0]: 'logits' float[batch_size][1]
unspecified option[1](symbolic size 'batch_size'), fallback: 1
unspecified option[2](symbolic size 'sequence_length'), fallback: 1
1717140328.769314 localhost 1305/26134 - .eval.onnx_wrapper warning input 'input_ids' with element type 'long' is bound to vespa value with cell type 'double'; adding explicit conversion step (this conversion might be lossy)
1717140328.769336 localhost 1305/26134 - .eval.onnx_wrapper warning input 'attention_mask' with element type 'long' is bound to vespa value with cell type 'double'; adding explicit conversion step (this conversion might be lossy)
test setup:
  input[0]: tensor(d0[1],d1[1]) -> long[1][1]
  input[1]: tensor(d0[1],d1[1]) -> long[1][1]
  output[0]: float[1][1] -> tensor(d0[1],d1[1])
unspecified option[3](max concurrent evaluations), fallback: 1
vm_size: 517176 kB, vm_rss: 405592 kB, malloc_peak: 0 kb, malloc_curr: 351628 (no evaluations yet)
vm_size: 517176 kB, vm_rss: 405856 kB, malloc_peak: 0 kb, malloc_curr: 351628 (concurrent evaluations: 1)
estimated model evaluation time: 3.77819 ms
```

By doing this with the
different size models and their quantized versions, we get the table below. With a time budget of 200 ms for reranking, the last column shows how many documents we can rerank within that budget.

| Model | Model File | Inference Time (ms) | Size | N docs in 200ms |
| ------------------------------------ | -------------------- | ------------------- | ------ | --------------- |
| mixedbread-ai/mxbai-rerank-xsmall-v1 | model_quantized.onnx | 2.4 | 87MB | 82 |
| mixedbread-ai/mxbai-rerank-xsmall-v1 | model.onnx | 3.8 | 284MB | 52 |
| mixedbread-ai/mxbai-rerank-base-v1 | model_quantized.onnx | 5.4 | 244MB | 37 |
| mixedbread-ai/mxbai-rerank-base-v1 | model.onnx | 10.3 | 739MB | 19 |
| mixedbread-ai/mxbai-rerank-large-v1 | model_quantized.onnx | 16.0 | 643MB | 12 |
| mixedbread-ai/mxbai-rerank-large-v1 | model.onnx | 35.6 | 1.74GB | 5 |

In \[8\]:

```
# Feed a few sample documents to the application
sample_docs = [
    {"id": i, "fields": {"text": text}}
    for i, text in enumerate(
        [
            "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
            "The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
            "Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird', was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.",
            "Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.",
            "The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.",
            "'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan.",
        ]
    )
]
```

In \[9\]:
```
from vespa.io import VespaResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(
            f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}"
        )


app.feed_iterable(sample_docs, schema="doc", callback=callback)
```

In \[10\]:

```
from pprint import pprint

with app.syncio(connections=1) as sync_app:
    query = sync_app.query(
        body={
            "yql": "select * from sources * where userQuery();",
            "query": "who wrote to kill a mockingbird?",
            "input.query(q)": "embed(tokenizer, @query)",
            "ranking.profile": "reranking",
            "ranking.listFeatures": "true",
            "presentation.timing": "true",
        }
    )
    for hit in query.hits:
        pprint(hit["fields"]["text"])
        pprint(hit["relevance"])
```

```
("'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was "
 'immediately successful, winning the Pulitzer Prize, and has become a classic '
 'of modern American literature.')
0.9634037778787636
("Harper Lee, an American novelist widely known for her novel 'To Kill a "
 "Mockingbird', was born in 1926 in Monroeville, Alabama. She received the "
 'Pulitzer Prize for Fiction in 1961.')
0.8672221280618897
("'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, "
 'was published in 1925. The story is set in the Jazz Age and follows the life '
 'of millionaire Jay Gatsby and his pursuit of Daisy Buchanan.')
0.09325768904619067
("The novel 'Moby-Dick' was written by Herman Melville and first published in "
 '1851. It is considered a masterpiece of American literature and deals with '
 'complex themes of obsession, revenge, and the conflict between good and '
 'evil.')
0.010269765303083865
```

You will of course need to evaluate the performance of the cross-encoder in your specific use case, but this notebook should give you a good starting point.

## Next steps[¶](#next-steps)

Try using [hybrid search](https://vespa-engine.github.io/pyvespa/getting-started-pyvespa-cloud.md) for the first phase, and rerank with a cross-encoder.

## Cleanup[¶](#cleanup)

In \[11\]:

```
vespa_docker.container.stop()
vespa_docker.container.remove()
```

# Evaluating retrieval with Snowflake arctic embed[¶](#evaluating-retrieval-with-snowflake-arctic-embed)

This notebook demonstrates how different rank profiles in Vespa can be set up and evaluated. For the rank profiles that use semantic search, we will use the small version of [Snowflake's arctic embed model series](https://huggingface.co/Snowflake/snowflake-arctic-embed-s) for generating embeddings. Feel free to experiment with different sizes based on your needs and compute/latency constraints.

> The snowflake-arctic-embedding models achieve state-of-the-art performance on the MTEB/BEIR leaderboard for each of their size variants.

We will set up and compare the following rank profiles:

- **unranked**: No ranking at all, for a dummy baseline.
- **bm25**: The classic BM25 ranker.
- **semantic**: Using `closeness(query_embedding, document_embedding)` only.
- **hybrid**: Combining BM25 and semantic search - `closeness(query_embedding, document_embedding) + log10( bm25(doc) )`.
- **hybrid_filter**: Same as the previous, but with a filter to exclude hits based on some heuristics.

In \[1\]:

```
from vespa.package import (
    HNSW,
    ApplicationPackage,
    Component,
    Field,
    Parameter,
    Function,
)

app_name = "snowflake"
app_package = ApplicationPackage(
    name=app_name,
    components=[
        Component(
            id="snow",
            type="hugging-face-embedder",
            parameters=[
                Parameter(
                    "transformer-model",
                    {
                        "url": "https://huggingface.co/Snowflake/snowflake-arctic-embed-s/resolve/main/onnx/model_int8.onnx"
                    },
                ),
                Parameter(
                    "tokenizer-model",
                    {
                        "url": "https://huggingface.co/Snowflake/snowflake-arctic-embed-s/raw/main/tokenizer.json"
                    },
                ),
                Parameter(
                    "normalize",
                    {},
                    "true",
                ),
                Parameter(
                    "pooling-strategy",
                    {},
                    "cls",
                ),
            ],
        )
    ],
)
```

In \[2\]:

```
app_package.schema.add_fields(
    Field(name="id", type="int", indexing=["attribute", "summary"]),
    Field(
        name="doc", type="string", indexing=["index", "summary"], index="enable-bm25"
    ),
    Field(
        name="doc_embeddings",
        type="tensor(x[384])",
        indexing=["input doc", "embed", "index", "attribute"],
        ann=HNSW(distance_metric="prenormalized-angular"),
        is_document_field=False,
    ),
)
```

In \[3\]:

```
from vespa.package import (
    DocumentSummary,
    FieldSet,
    FirstPhaseRanking,
    RankProfile,
    SecondPhaseRanking,
    Summary,
)

app_package.schema.add_rank_profile(
    RankProfile(
        name="semantic",
        inputs=[("query(q)", "tensor(x[384])")],
        inherits="default",
        first_phase="closeness(field, doc_embeddings)",
        match_features=["closeness(field, doc_embeddings)"],
    )
)
app_package.schema.add_rank_profile(RankProfile(name="bm25", first_phase="bm25(doc)"))
```

In \[4\]:
```
app_package.schema.add_rank_profile(
    RankProfile(
        name="hybrid",
        inherits="semantic",
        # Guard against no keyword match -> bm25 = 0 -> log10(0) = undefined
        functions=[
            Function(
                name="log_guard", expression="if (bm25(doc) > 0, log10(bm25(doc)), 0)"
            )
        ],
        first_phase=FirstPhaseRanking(expression="closeness(field, doc_embeddings)"),
        # Notice that we use log10 here, as bm25 values under the natural logarithm tend to dominate the closeness values for these documents.
        second_phase=SecondPhaseRanking(expression="firstPhase + log_guard"),
        match_features=[
            "firstPhase",
            "bm25(doc)",
        ],
    )
)
```

In \[5\]:

```
app_package.schema.add_field_set(FieldSet(name="default", fields=["doc"]))
```

In \[6\]:

```
app_package.schema.add_document_summary(
    DocumentSummary(
        name="minimal",
        summary_fields=[Summary("id", "int"), Summary("doc", "string")],
    )
)
```

Create some sample documents that will help us see where the different ranking strategies have their strengths and weaknesses.

> These sample documents were created with a little help from ChatGPT.
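The hybrid profile's second-phase expression can be mimicked in plain Python to see how the log guard behaves. This is a sketch: the `closeness` and `bm25` values are made-up numbers, not Vespa calls.

```python
import math

def hybrid_score(closeness: float, bm25: float) -> float:
    # Mirrors: firstPhase + if (bm25(doc) > 0, log10(bm25(doc)), 0)
    log_guard = math.log10(bm25) if bm25 > 0 else 0.0
    return closeness + log_guard

# A strong keyword match boosts the semantic score...
print(hybrid_score(closeness=0.6, bm25=10.0))  # 0.6 + 1.0 = 1.6
# ...while a missing keyword match leaves it unchanged instead of
# producing log10(0), which is undefined.
print(hybrid_score(closeness=0.6, bm25=0.0))   # 0.6
```

Using `log10` rather than the natural logarithm shrinks the BM25 contribution further, so it nudges the semantic ranking instead of dominating it.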
Looking through the documents, we can see that ranking the documents in the order they are presented seems quite reasonable.

In \[7\]:

```
# Query that the user is searching for
query = "How does Vespa handle real-time indexing and search?"

documents = [
    "Vespa excels in real-time data indexing and its ability to search large datasets quickly.",
    "Instant data availability and maintaining query performance while simultaneously indexing are key features of the Vespa search engine.",
    "With our search solution, real-time updates are seamlessly integrated into the search index, enhancing responsiveness.",
    "While not as robust as Vespa, our vector database strives to meet your search needs, despite certain, shall we say, 'flexible' features.",
    "Search engines like ours utilize complex algorithms to handle immediate data querying and indexing.",
    "Modern search platforms emphasize quick data retrieval from continuously updated indexes.",
    "Discover the history and cultural impact of the classic Italian Vespa scooter brand.",
    "Tips for maintaining your Vespa to ensure optimal performance and longevity of your scooter.",
    "Review of different scooter brands including Vespa, highlighting how they handle features like speed, cost, and aesthetics, and how consumers search for the best options.",
    "Vespa scooter safety regulations and best practices for urban commuting.",
]
```

## Dumping the application package to files[¶](#dumping-the-application-package-to-files)

It is good practice to inspect the application package and schema files generated by pyvespa, to understand their structure.

In \[8\]:

```
app_package.to_files("snowflake")
```

In \[9\]:

```
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(app_package)
```

```
Waiting for configuration server, 0/60 seconds...
Waiting for configuration server, 5/60 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 10/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 15/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 20/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 25/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Application is up!
Finished deployment.
```

In \[10\]:

```
feed_docs = [
    {
        "id": str(i),
        "fields": {
            "doc": doc,
        },
    }
    for i, doc in enumerate(documents)
]
```

In \[11\]:

```
from vespa.io import VespaResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(
            f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}"
        )
```

In \[12\]:

```
app.feed_iterable(feed_docs, schema=app_package.schema.name, callback=callback)
```

## Choosing a metric[¶](#choosing-a-metric)

Check out [this](https://en.wikipedia.org/wiki/Evaluation_measures_%28information_retrieval%29) Wikipedia article for an overview of evaluation measures in information retrieval. In our case, we have a query and ranked documents as the ground truth.
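Since our ground truth is an ordering of documents rather than graded labels, one simple convention (the one used by the NDCG helper later in this notebook) assigns relevance `len(ideal_order) - position`, so the top document gets the highest grade and unknown documents get zero:

```python
# Map a ground-truth ordering of document ids to graded relevance labels.
# Convention: relevance = len(ideal_order) - position, rank 0 gets the top grade.
def relevance_labels(ideal_order):
    return {doc_id: len(ideal_order) - pos for pos, doc_id in enumerate(ideal_order)}

labels = relevance_labels([0, 1, 2, 4, 5, 3, 6, 7, 8, 9])
print(labels[0])          # 10 -> most relevant document
print(labels[9])          # 1  -> least relevant document
print(labels.get(42, 0))  # 0  -> documents outside the ground truth
```

This linear grading is one choice among several; graded judgments from human annotators would slot into the same place.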
When evaluating a ranking against our ground-truth ranking, the Normalized Discounted Cumulative Gain (NDCG) is a good choice. NDCG is a measure of ranking quality: it is the discounted cumulative gain (DCG) of the ranked documents, divided by the ideal DCG (IDCG). The IDCG is the DCG of the perfect ranking, where the documents are ordered by relevance.

> If you are already familiar with NDCG, feel free to skip this part. There is also an implementation in [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ndcg_score.html) that you can use.

The formula for NDCG is:

$$ NDCG = \\frac{DCG}{IDCG} $$

where, using the linear-gain variant that we implement below (another common variant uses an exponential gain, $2^{rel_i} - 1$):

$$ DCG = \\sum\_{i=1}^{n} \\frac{rel_i}{\\log_2(i + 1)} $$

Let us create a function to calculate the NDCG for a given ranking.

In \[13\]:

```
import math
from typing import List


def ndcg_at_k(rank_order: List[int], ideal_order: List[int], k: int) -> float:
    """
    Calculate the normalized Discounted Cumulative Gain (nDCG) at position k.

    Parameters:
    rank_order (List[int]): The list of document indices as ranked by the search system.
    ideal_order (List[int]): The list of document indices in the ideal order.
    k (int): The position up to which to calculate nDCG.

    Returns:
    float: The nDCG value at position k.
""" dcg = 0.0 idcg = 0.0 # Calculate DCG based on the ranked order up to k for i in range(min(k, len(rank_order))): rank_index = rank_order[i] # Find the rank index in the ideal order to assign relevance if rank_index in ideal_order: relevance = len(ideal_order) - ideal_order.index(rank_index) else: relevance = 0 dcg += relevance / math.log2(i + 2) # Calculate IDCG based on the ideal order up to k for i in range(min(k, len(ideal_order))): relevance = len(ideal_order) - i idcg += relevance / math.log2(i + 2) # Handle the case where IDCG is zero to avoid division by zero if idcg == 0: return 0.0 return dcg / idcg # Example usage rank_order = [5, 6, 1] # Example ranked order indices ideal_result_order = [0, 1, 2, 4, 5, 3, 6, 7, 8, 9] # Example ideal order indices # Calculate nDCG@3 result = ndcg_at_k(rank_order, ideal_result_order, 3) print(f"nDCG@3: {result:.4f}") assert ndcg_at_k([0, 1, 2], ideal_result_order, 3) == 1.0 ``` import math from typing import List def ndcg_at_k(rank_order: List[int], ideal_order: List[int], k: int) -> float: """ Calculate the normalized Discounted Cumulative Gain (nDCG) at position k. Parameters: rank_order (List[int]): The list of document indices as ranked by the search system. ideal_order (List[int]): The list of document indices in the ideal order. k (int): The position up to which to calculate nDCG. Returns: float: The nDCG value at position k. 
""" dcg = 0.0 idcg = 0.0 # Calculate DCG based on the ranked order up to k for i in range(min(k, len(rank_order))): rank_index = rank_order[i] # Find the rank index in the ideal order to assign relevance if rank_index in ideal_order: relevance = len(ideal_order) - ideal_order.index(rank_index) else: relevance = 0 dcg += relevance / math.log2(i + 2) # Calculate IDCG based on the ideal order up to k for i in range(min(k, len(ideal_order))): relevance = len(ideal_order) - i idcg += relevance / math.log2(i + 2) # Handle the case where IDCG is zero to avoid division by zero if idcg == 0: return 0.0 return dcg / idcg # Example usage rank_order = [5, 6, 1] # Example ranked order indices ideal_result_order = [0, 1, 2, 4, 5, 3, 6, 7, 8, 9] # Example ideal order indices # Calculate nDCG@3 result = ndcg_at_k(rank_order, ideal_result_order, 3) print(f"nDCG@3: {result:.4f}") assert ndcg_at_k([0, 1, 2], ideal_result_order, 3) == 1.0 ``` nDCG@3: 0.6618 ``` In \[14\]: Copied! ``` # Define the different rank profiles to evaluate rank_profiles = { "unranked": { "yql": f"select * from {app_name} where true", "ranking.profile": "unranked", }, "bm25": { "yql": f"select * from {app_name} where userQuery()", "ranking.profile": "bm25", }, "semantic": { "yql": f"select * from {app_name} where {{targetHits:5}}nearestNeighbor(doc_embeddings,q)", "ranking.profile": "semantic", "input.query(q)": f"embed({query})", }, "hybrid": { "yql": f"select * from {app_name} where userQuery() or ({{targetHits:5}}nearestNeighbor(doc_embeddings,q))", "ranking.profile": "hybrid", "input.query(q)": f"embed({query})", }, "hybrid_filtered": { # Here, we will add an heuristic, filtering out documents that contain the word "scooter" "yql": f'select * from {app_name} where !(doc contains "scooter") and userQuery() or ({{targetHits:5}}nearestNeighbor(doc_embeddings,q))', "ranking.profile": "hybrid", "input.query(q)": f"embed({query})", }, } # Define some common params that will be used for all queries common_params 
= { "query": query, "hits": 3, } ``` # Define the different rank profiles to evaluate rank_profiles = { "unranked": { "yql": f"select * from {app_name} where true", "ranking.profile": "unranked", }, "bm25": { "yql": f"select * from {app_name} where userQuery()", "ranking.profile": "bm25", }, "semantic": { "yql": f"select * from {app_name} where {{targetHits:5}}nearestNeighbor(doc_embeddings,q)", "ranking.profile": "semantic", "input.query(q)": f"embed({query})", }, "hybrid": { "yql": f"select * from {app_name} where userQuery() or ({{targetHits:5}}nearestNeighbor(doc_embeddings,q))", "ranking.profile": "hybrid", "input.query(q)": f"embed({query})", }, "hybrid_filtered": { # Here, we will add an heuristic, filtering out documents that contain the word "scooter" "yql": f'select * from {app_name} where !(doc contains "scooter") and userQuery() or ({{targetHits:5}}nearestNeighbor(doc_embeddings,q))', "ranking.profile": "hybrid", "input.query(q)": f"embed({query})", }, } # Define some common params that will be used for all queries common_params = { "query": query, "hits": 3, } In \[15\]: Copied! ``` from typing import List, Tuple from vespa.application import Vespa from vespa.io import VespaQueryResponse def evaluate_rank_profile( app: Vespa, rank_profile: str, params: dict, k: int ) -> Tuple[float, List[str]]: """ Run a query against a Vespa application using a specific rank profile and parameters. Evaluate the nDCG@3 of the search results based on the ideal order. Parameters: app (Vespa): The Vespa application to query. rank_profile (str): The name of the rank profile to use. params (dict): The common parameters to use in addition to the rank profile specific parameters. k (int): The position up to which to calculate nDCG. 
    Returns:
    Tuple[float, List[str]]: The nDCG@k score and the search results.
    """
    body_params = {
        **rank_profiles[rank_profile],
        **params,
    }
    response: VespaQueryResponse = app.query(body_params)
    rankings = [int(hit["id"][-1]) for hit in response.hits]
    docs = [hit["fields"]["doc"] for hit in response.hits]
    ndcg = ndcg_at_k(rankings, ideal_order=ideal_result_order, k=k)
    return ndcg, docs
```

In \[16\]:
```
import json

rank_results = {}
for rank_profile in rank_profiles:
    ndcg, docs = evaluate_rank_profile(
        app, rank_profile=rank_profile, params=common_params, k=3
    )
    rank_results[rank_profile] = ndcg
    print(f"Rank profile: {rank_profile}, nDCG@3: {ndcg:.2f}")
    print(json.dumps(docs, indent=2))
```

```
Rank profile: unranked, nDCG@3: 0.68
[
  "With our search solution, real-time updates are seamlessly integrated into the search index, enhancing responsiveness.",
  "Tips for maintaining your Vespa to ensure optimal performance and longevity of your scooter.",
  "Search engines like ours utilize complex algorithms to handle immediate data querying and indexing."
]
Rank profile: bm25, nDCG@3: 0.78
[
  "Vespa excels in real-time data indexing and its ability to search large datasets quickly.",
  "Review of different scooter brands including Vespa, highlighting how they handle features like speed, cost, and aesthetics, and how consumers search for the best options.",
  "With our search solution, real-time updates are seamlessly integrated into the search index, enhancing responsiveness."
]
Rank profile: semantic, nDCG@3: 0.94
[
  "Vespa excels in real-time data indexing and its ability to search large datasets quickly.",
  "With our search solution, real-time updates are seamlessly integrated into the search index, enhancing responsiveness.",
  "Search engines like ours utilize complex algorithms to handle immediate data querying and indexing."
] Rank profile: hybrid, nDCG@3: 0.82 [ "Vespa excels in real-time data indexing and its ability to search large datasets quickly.", "With our search solution, real-time updates are seamlessly integrated into the search index, enhancing responsiveness.", "Review of different scooter brands including Vespa, highlighting how they handle features like speed, cost, and aesthetics, and how consumers search for the best options." ] Rank profile: hybrid_filtered, nDCG@3: 0.94 [ "Vespa excels in real-time data indexing and its ability to search large datasets quickly.", "With our search solution, real-time updates are seamlessly integrated into the search index, enhancing responsiveness.", "Search engines like ours utilize complex algorithms to handle immediate data querying and indexing." ] ``` Uncomment the cell below to install dependencies needed to generate the plot. In \[17\]: Copied! ``` #!pip3 install pandas plotly ``` #!pip3 install pandas plotly In \[20\]: Copied! ``` import pandas as pd import plotly.express as px def plot_rank_profiles(rank_profiles): # Convert dictionary to DataFrame for easier manipulation data = pd.DataFrame(list(rank_profiles.items()), columns=["Rank Profile", "nDCG@3"]) colors = { "unranked": "#e74c3c", # Red "bm25": "#2ecc71", # Green "semantic": "#9b59b6", # Purple "hybrid": "#3498db", # Blue "hybrid_filtered": "#2980b9", # Darker Blue } # Map the colors to the dataframe data["Color"] = data["Rank Profile"].map(colors) # Create a bar chart using Plotly fig = px.bar( data, x="Rank Profile", y="nDCG@3", title="Rank Profile Performance - snowflake-arctic-embed-s", labels={"nDCG@3": "nDCG@3 Score"}, text="nDCG@3", template="simple_white", color="Color", color_discrete_map="identity", ) # Set bar width and update traces for individual colors fig.update_traces( marker_line_color="black", marker_line_width=1.5, width=0.4 ) # Less wide bars # Enhance chart design adhering to Tufte's principles fig.update_traces(texttemplate="%{text:.2f}", 
textposition="outside") fig.update_layout( plot_bgcolor="white", xaxis=dict( title="Rank Profile", showline=True, linewidth=2, linecolor="black", mirror=True, ), yaxis=dict( title="nDCG@3 Score", range=[0, 1.1], showline=True, linewidth=2, linecolor="black", mirror=True, ), title_font=dict(size=24), font=dict(family="Arial, sans-serif", size=18, color="black"), margin=dict(l=40, r=40, t=40, b=40), width=800, # Set the width of the plot ) # Show the plot fig.show() plot_rank_profiles(rank_profiles=rank_results) ``` import pandas as pd import plotly.express as px def plot_rank_profiles(rank_profiles): # Convert dictionary to DataFrame for easier manipulation data = pd.DataFrame(list(rank_profiles.items()), columns=["Rank Profile", "nDCG@3"]) colors = { "unranked": "#e74c3c", # Red "bm25": "#2ecc71", # Green "semantic": "#9b59b6", # Purple "hybrid": "#3498db", # Blue "hybrid_filtered": "#2980b9", # Darker Blue } # Map the colors to the dataframe data["Color"] = data["Rank Profile"].map(colors) # Create a bar chart using Plotly fig = px.bar( data, x="Rank Profile", y="nDCG@3", title="Rank Profile Performance - snowflake-arctic-embed-s", labels={"nDCG@3": "nDCG@3 Score"}, text="nDCG@3", template="simple_white", color="Color", color_discrete_map="identity", ) # Set bar width and update traces for individual colors fig.update_traces( marker_line_color="black", marker_line_width=1.5, width=0.4 ) # Less wide bars # Enhance chart design adhering to Tufte's principles fig.update_traces(texttemplate="%{text:.2f}", textposition="outside") fig.update_layout( plot_bgcolor="white", xaxis=dict( title="Rank Profile", showline=True, linewidth=2, linecolor="black", mirror=True, ), yaxis=dict( title="nDCG@3 Score", range=[0, 1.1], showline=True, linewidth=2, linecolor="black", mirror=True, ), title_font=dict(size=24), font=dict(family="Arial, sans-serif", size=18, color="black"), margin=dict(l=40, r=40, t=40, b=40), width=800, # Set the width of the plot ) # Show the plot fig.show() 
plot_rank_profiles(rank_profiles=rank_results)

For this particular small synthetic dataset, using the `snowflake-arctic-embed` model improved the results significantly compared to keyword search alone. Still, our experience with real-world data is that hybrid search is often the way to go. We also gave a little taste of how one can evaluate different rank profiles if you have a ground truth dataset available (or can create a synthetic one).

## Next steps[¶](#next-steps)

Check out global reranking strategies, and try to introduce a `global-phase` reranking strategy.

## Cleanup[¶](#cleanup)

In \[19\]:

```
vespa_docker.container.stop()
vespa_docker.container.remove()
```

# Feeding performance[¶](#feeding-performance)

This exploratory notebook sheds some light on the different modes of feeding documents to Vespa. We will look at these four methods:

1. Using `VespaSync`.
1. Using `VespaAsync`.
1. Using `feed_iterable()`.
1. Using the [Vespa CLI](https://docs.vespa.ai/en/vespa-cli).

Refer to [troubleshooting](https://vespa-engine.github.io/pyvespa/troubleshooting.md) for any problem when running this guide.

Install the [Vespa CLI](https://docs.vespa.ai/en/vespa-cli.html). The `vespacli` python package is just a thin wrapper that allows installation through PyPI.

> Do NOT install if you already have the Vespa CLI installed.

In \[ \]:

```
#!pip3 install vespacli
```

[Install pyvespa](https://pyvespa.readthedocs.io/), other dependencies, and start the Docker daemon.

In \[1\]:

```
#!pip3 install pyvespa datasets plotly>=5.20
#!docker info
```

## Create an application package[¶](#create-an-application-package)

The [application package](https://vespa-engine.github.io/pyvespa/api/vespa/package.md) has all the Vespa configuration files.
For this demo, we will just use a dummy application package without any indexing, to ease the load on the server: it is the clients we want to compare in this experiment.

In \[2\]:

```
from vespa.package import (
    ApplicationPackage,
    Field,
    Schema,
    Document,
    FieldSet,
)

package = ApplicationPackage(
    name="pyvespafeed",
    schema=[
        Schema(
            name="doc",
            document=Document(
                fields=[
                    Field(name="id", type="string", indexing=["summary"]),
                    Field(name="text", type="string", indexing=["summary"]),
                ]
            ),
            fieldsets=[FieldSet(name="default", fields=["text"])],
        )
    ],
)
```

Note that the `ApplicationPackage` name cannot have `-` or `_`.

## Deploy the Vespa application[¶](#deploy-the-vespa-application)

Deploy `package` on the local machine using Docker, without leaving the notebook, by creating an instance of [VespaDocker](https://vespa-engine.github.io/pyvespa/api/vespa/deployment#vespa.deployment.VespaDocker). `VespaDocker` connects to the local Docker daemon socket and starts the [Vespa docker image](https://hub.docker.com/r/vespaengine/vespa/). If this step fails, please check that the Docker daemon is running, and that the Docker daemon socket can be used by clients (configurable under advanced settings in Docker Desktop).

In \[3\]:

```
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=package)
```

```
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 10/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 15/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 20/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 25/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Application is up!
Finished deployment.
```

`app` now holds a reference to a [Vespa](https://vespa-engine.github.io/pyvespa/api/vespa/application.md#vespa.application.Vespa) instance.

## Preparing the data[¶](#preparing-the-data)

In this example we use the [HF Datasets](https://huggingface.co/docs/datasets/index) library to stream the ["Cohere/wikipedia-2023-11-embed-multilingual-v3"](https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3) dataset and index it in our newly deployed Vespa instance. The dataset contains Wikipedia pages and their corresponding embeddings.

> For this exploration we will just use the `id` and `text` fields.

The following uses the [stream](https://huggingface.co/docs/datasets/stream) option of datasets to stream the data without downloading all the contents locally. The `map` functionality allows us to convert the dataset fields into the expected feed format for `pyvespa`, which expects a dict with the keys `id` and `fields`: `{ "id": "vespa-document-id", "fields": {"vespa_field": "vespa-field-value"}}`

In \[4\]:
```
from datasets import load_dataset

dataset = load_dataset(
    "Cohere/wikipedia-2023-11-embed-multilingual-v3",
    "simple",
    split="train",
    streaming=True,
)
```

## Utility function to stream different number of documents[¶](#utility-function-to-stream-different-number-of-documents)

In \[5\]:

```
def get_dataset(n_docs: int = 1000):
    return (
        dataset.map(lambda x: {"id": x["_id"] + "-iter", "fields": {"text": x["text"]}})
        .select_columns(["id", "fields"])
        .take(n_docs)
    )
```

### A dataclass to store the parameters and results of the different feeding methods[¶](#a-dataclass-to-store-the-parameters-and-results-of-the-different-feeding-methods)

In \[6\]:

```
from dataclasses import dataclass
from typing import Callable, Optional, Iterable, Dict


@dataclass
class FeedParams:
    name: str
    num_docs: int
    max_connections: int
    function_name: str
    max_workers: Optional[int] = None
    max_queue_size: Optional[int] = None
    num_concurrent_requests: Optional[int] = None


@dataclass
class FeedResult(FeedParams):
    feed_time: Optional[float] = None
```

### A common callback function to notify if something goes wrong[¶](#a-common-callback-function-to-notify-if-something-goes-wrong)

In \[7\]:
```
from vespa.io import VespaResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(
            f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}"
        )
```

### Defining our feeding functions[¶](#defining-our-feeding-functions)

In \[8\]:

```
import time
import asyncio


def feed_sync(params: FeedParams, data: Iterable[Dict]) -> FeedResult:
    start_time = time.time()
    with app.syncio(connections=params.max_connections):
        for doc in data:
            app.feed_data_point(
                data_id=doc["id"],
                fields=doc["fields"],
                schema="doc",
                namespace="pyvespa-feed",
                callback=callback,
            )
    end_time = time.time()
    return FeedResult(
        **params.__dict__,
        feed_time=end_time - start_time,
    )


async def feed_async(params: FeedParams, data: Iterable[Dict]) -> FeedResult:
    start_time = time.time()
    tasks = []
    # We use a semaphore to limit the number of concurrent requests, which is useful
    # to avoid running into memory issues when feeding a large number of documents.
    # The semaphore must be held while the request is in flight, so we acquire it
    # inside the task coroutine rather than around task creation.
    semaphore = asyncio.Semaphore(params.num_concurrent_requests)
    async with app.asyncio(
        connections=params.max_connections, timeout=params.num_docs // 10
    ) as async_app:

        async def feed_one(doc):
            async with semaphore:
                return await async_app.feed_data_point(
                    data_id=doc["id"],
                    fields=doc["fields"],
                    schema="doc",
                    namespace="pyvespa-feed",
                    timeout=10,
                )

        for doc in data:
            tasks.append(asyncio.create_task(feed_one(doc)))
        await asyncio.wait(tasks, return_when=asyncio.ALL_COMPLETED)
    end_time = time.time()
    return FeedResult(
        **params.__dict__,
        feed_time=end_time - start_time,
    )


def feed_iterable(params: FeedParams, data: Iterable[Dict]) -> FeedResult:
    start = time.time()
    app.feed_iterable(
        data,
        schema="doc",
        namespace="pyvespa-feed",
        operation_type="feed",
        max_queue_size=params.max_queue_size,
        max_workers=params.max_workers,
        max_connections=params.max_connections,
        callback=callback,
    )
    end = time.time()
    sync_feed_time = end - start
    return FeedResult(
        **params.__dict__,
        feed_time=sync_feed_time,
    )
```

## Defining our hyperparameters[¶](#defining-our-hyperparameters)

In \[9\]:
```
from itertools import product

# We will only run for 1000 documents here as notebook is run as part of CI.
# But you will see some plots when run with 100k documents as well.
num_docs = [1000]

params_by_function = {
    "feed_sync": {
        "num_docs": num_docs,
        "max_connections": [16, 64],
    },
    "feed_async": {
        "num_docs": num_docs,
        "max_connections": [16, 64],
        "num_concurrent_requests": [1000, 10_000],
    },
    "feed_iterable": {
        "num_docs": num_docs,
        "max_connections": [64, 128],
        "max_workers": [16, 64],
        "max_queue_size": [1000, 10000],
    },
}

feed_params = []
# Create one FeedParams instance of each permutation
for func, parameters in params_by_function.items():
    print(f"Function: {func}")
    keys, values = zip(*parameters.items())
    for combination in product(*values):
        settings = dict(zip(keys, combination))
        print(settings)
        feed_params.append(
            FeedParams(
                name=f"{settings['num_docs']}_{settings['max_connections']}_{settings.get('max_workers', 0)}_{func}",
                function_name=func,
                **settings,
            )
        )
    print("\n")  # Just to add space between different functions
```

```
Function: feed_sync
{'num_docs': 1000, 'max_connections': 16}
{'num_docs': 1000, 'max_connections': 64}


Function: feed_async
{'num_docs': 1000, 'max_connections': 16, 'num_concurrent_requests': 1000}
{'num_docs': 1000, 'max_connections': 16, 'num_concurrent_requests': 10000}
{'num_docs': 1000, 'max_connections': 64, 'num_concurrent_requests': 1000}
{'num_docs': 1000, 'max_connections': 64, 'num_concurrent_requests': 10000}


Function: feed_iterable
{'num_docs': 1000, 'max_connections': 64, 'max_workers': 16, 'max_queue_size': 1000}
{'num_docs': 1000, 'max_connections': 64, 'max_workers': 16, 'max_queue_size': 10000}
{'num_docs': 1000, 'max_connections': 64, 'max_workers': 64, 'max_queue_size': 1000}
{'num_docs': 1000, 'max_connections': 64, 'max_workers': 64, 'max_queue_size': 10000}
{'num_docs': 1000, 'max_connections': 128, 'max_workers': 16, 'max_queue_size': 1000}
{'num_docs': 1000, 'max_connections': 128, 'max_workers': 16, 'max_queue_size': 10000}
{'num_docs': 1000, 'max_connections': 128, 'max_workers': 64, 'max_queue_size': 1000}
{'num_docs': 1000, 'max_connections': 128, 'max_workers': 64,
'max_queue_size': 10000} ``` In \[10\]: Copied! ``` print(f"Total number of feed_params: {len(feed_params)}") ``` print(f"Total number of feed_params: {len(feed_params)}") ``` Total number of feed_params: 14 ``` Now, we will need a way to retrieve the callable function from the function name. In \[11\]: Copied! ``` # Get reference to function from string name def get_func_from_str(func_name: str) -> Callable: return globals()[func_name] ``` # Get reference to function from string name def get_func_from_str(func_name: str) -> Callable: return globals()[func_name] ### Function to clean up after each feed[¶](#function-to-clean-up-after-each-feed) For a fair comparison, we will delete the data before feeding it again. In \[12\]: Copied! ``` from typing import Iterable, Dict def delete_data(data: Iterable[Dict]): app.feed_iterable( iter=data, schema="doc", namespace="pyvespa-feed", operation_type="delete", callback=callback, ) ``` from typing import Iterable, Dict def delete_data(data: Iterable[Dict]): app.feed_iterable( iter=data, schema="doc", namespace="pyvespa-feed", operation_type="delete", callback=callback, ) ## Main experiment loop[¶](#main-experiment-loop) The line below is used to make the code run in Jupyter, as it is already running an event loop In \[13\]: Copied! ``` import nest_asyncio nest_asyncio.apply() ``` import nest_asyncio nest_asyncio.apply() In \[14\]: Copied! 
``` results = [] for params in feed_params: print("-" * 50) print("Starting feed with params:") print(params) data = get_dataset(params.num_docs) if "async" not in params.function_name: feed_result = get_func_from_str(params.function_name)(params=params, data=data) else: feed_result = asyncio.run( get_func_from_str(params.function_name)(params=params, data=data) ) print(feed_result.feed_time) results.append(feed_result) print("Deleting data") delete_data(data) ``` results = [] for params in feed_params: print("-" * 50) print("Starting feed with params:") print(params) data = get_dataset(params.num_docs) if "async" not in params.function_name: feed_result = get_func_from_str(params.function_name)(params=params, data=data) else: feed_result = asyncio.run( get_func_from_str(params.function_name)(params=params, data=data) ) print(feed_result.feed_time) results.append(feed_result) print("Deleting data") delete_data(data) ``` -------------------------------------------------- Starting feed with params: FeedParams(name='1000_16_0_feed_sync', num_docs=1000, max_connections=16, function_name='feed_sync', max_workers=None, max_queue_size=None, num_concurrent_requests=None) 15.175757884979248 Deleting data -------------------------------------------------- Starting feed with params: FeedParams(name='1000_64_0_feed_sync', num_docs=1000, max_connections=64, function_name='feed_sync', max_workers=None, max_queue_size=None, num_concurrent_requests=None) 12.517201900482178 Deleting data -------------------------------------------------- Starting feed with params: FeedParams(name='1000_16_0_feed_async', num_docs=1000, max_connections=16, function_name='feed_async', max_workers=None, max_queue_size=None, num_concurrent_requests=1000) 4.953256130218506 Deleting data -------------------------------------------------- Starting feed with params: FeedParams(name='1000_16_0_feed_async', num_docs=1000, max_connections=16, function_name='feed_async', max_workers=None, max_queue_size=None, 
num_concurrent_requests=10000) 4.914812088012695 Deleting data -------------------------------------------------- Starting feed with params: FeedParams(name='1000_64_0_feed_async', num_docs=1000, max_connections=64, function_name='feed_async', max_workers=None, max_queue_size=None, num_concurrent_requests=1000) 4.711783170700073 Deleting data -------------------------------------------------- Starting feed with params: FeedParams(name='1000_64_0_feed_async', num_docs=1000, max_connections=64, function_name='feed_async', max_workers=None, max_queue_size=None, num_concurrent_requests=10000) 4.942464113235474 Deleting data -------------------------------------------------- Starting feed with params: FeedParams(name='1000_64_16_feed_iterable', num_docs=1000, max_connections=64, function_name='feed_iterable', max_workers=16, max_queue_size=1000, num_concurrent_requests=None) 5.707854270935059 Deleting data -------------------------------------------------- Starting feed with params: FeedParams(name='1000_64_16_feed_iterable', num_docs=1000, max_connections=64, function_name='feed_iterable', max_workers=16, max_queue_size=10000, num_concurrent_requests=None) 5.798462867736816 Deleting data -------------------------------------------------- Starting feed with params: FeedParams(name='1000_64_64_feed_iterable', num_docs=1000, max_connections=64, function_name='feed_iterable', max_workers=64, max_queue_size=1000, num_concurrent_requests=None) 5.706255674362183 Deleting data -------------------------------------------------- Starting feed with params: FeedParams(name='1000_64_64_feed_iterable', num_docs=1000, max_connections=64, function_name='feed_iterable', max_workers=64, max_queue_size=10000, num_concurrent_requests=None) 5.976051330566406 Deleting data -------------------------------------------------- Starting feed with params: FeedParams(name='1000_128_16_feed_iterable', num_docs=1000, max_connections=128, function_name='feed_iterable', max_workers=16, 
max_queue_size=1000, num_concurrent_requests=None)
5.959493160247803
Deleting data
--------------------------------------------------
Starting feed with params:
FeedParams(name='1000_128_16_feed_iterable', num_docs=1000, max_connections=128, function_name='feed_iterable', max_workers=16, max_queue_size=10000, num_concurrent_requests=None)
5.757789134979248
Deleting data
--------------------------------------------------
Starting feed with params:
FeedParams(name='1000_128_64_feed_iterable', num_docs=1000, max_connections=128, function_name='feed_iterable', max_workers=64, max_queue_size=1000, num_concurrent_requests=None)
5.612061023712158
Deleting data
--------------------------------------------------
Starting feed with params:
FeedParams(name='1000_128_64_feed_iterable', num_docs=1000, max_connections=128, function_name='feed_iterable', max_workers=64, max_queue_size=10000, num_concurrent_requests=None)
5.622947692871094
Deleting data
```

In \[15\]:

```
# Create a pandas DataFrame with the results
import pandas as pd

df = pd.DataFrame([result.__dict__ for result in results])
df["requests_per_second"] = df["num_docs"] / df["feed_time"]
df
```

Out\[15\]:

| | name | num_docs | max_connections | function_name | max_workers | max_queue_size | num_concurrent_requests | feed_time | requests_per_second |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1000_16_0_feed_sync | 1000 | 16 | feed_sync | NaN | NaN | NaN | 15.175758 | 65.894567 |
| 1 | 1000_64_0_feed_sync | 1000 | 64 | feed_sync | NaN | NaN | NaN | 12.517202 | 79.890059 |
| 2 | 1000_16_0_feed_async | 1000 | 16 | feed_async | NaN | NaN | 1000.0 | 4.953256 | 201.887400 |
| 3 | 1000_16_0_feed_async | 1000 | 16 | feed_async | NaN | NaN | 10000.0 | 4.914812 | 203.466579 |
| 4 | 1000_64_0_feed_async | 1000 | 64 | feed_async | NaN | NaN | 1000.0 | 4.711783 | 212.233875 |
| 5 | 1000_64_0_feed_async | 1000 | 64 | feed_async | NaN | NaN | 10000.0 | 4.942464 | 202.328227 |
| 6 | 1000_64_16_feed_iterable | 1000 | 64 | feed_iterable | 16.0 | 1000.0 | NaN | 5.707854 | 175.197185 |
| 7 | 1000_64_16_feed_iterable | 1000 | 64 | feed_iterable | 16.0 | 10000.0 | NaN | 5.798463 | 172.459499 |
| 8 | 1000_64_64_feed_iterable | 1000 | 64 | feed_iterable | 64.0 | 1000.0 | NaN | 5.706256 | 175.246266 |
| 9 | 1000_64_64_feed_iterable | 1000 | 64 | feed_iterable | 64.0 | 10000.0 | NaN | 5.976051 | 167.334573 |
| 10 | 1000_128_16_feed_iterable | 1000 | 128 | feed_iterable | 16.0 | 1000.0 | NaN | 5.959493 | 167.799505 |
| 11 | 1000_128_16_feed_iterable | 1000 | 128 | feed_iterable | 16.0 | 10000.0 | NaN | 5.757789 | 173.677774 |
| 12 | 1000_128_64_feed_iterable | 1000 | 128 | feed_iterable | 64.0 | 1000.0 | NaN | 5.612061 | 178.187656 |
| 13 | 1000_128_64_feed_iterable | 1000 | 128 | feed_iterable | 64.0 | 10000.0 | NaN | 5.622948 | 177.842664 |

In \[16\]:

```
import plotly.express as px


def plot_performance(df: pd.DataFrame):
    # Scatter plot with a logarithmic x-axis using Plotly Express
    fig = px.scatter(
        df,
        x="num_docs",
        y="requests_per_second",
        color="function_name",  # Color points by feeding function
        log_x=True,  # Logarithmic x-axis
        log_y=False,  # Set to True for a logarithmic y-axis as well
        title="Performance: Requests per Second vs. Number of Documents",
        labels={  # Custom axis labels
            "num_docs": "Number of Documents",
            "requests_per_second": "Requests per Second",
            "max_workers": "max_workers",
            "max_queue_size": "max_queue_size",
            "max_connections": "max_connections",
            "num_concurrent_requests": "num_concurrent_requests",
        },
        template="plotly_white",  # White background for a minimalist style
        hover_data=[
            "max_workers",
            "max_queue_size",
            "max_connections",
            "num_concurrent_requests",
        ],  # Additional information to show on hover
    )
    # Update layout for better readability
    fig.update_layout(
        font=dict(
            size=16,  # Larger font for better visibility
        ),
        legend_title_text="Function Details",  # Custom legend title
        legend=dict(
            title_font_size=16,
            x=0.8,  # Legend position in paper coordinates (0-1), similar to bbox_to_anchor in Matplotlib
            xanchor="auto",
            y=1,
            yanchor="auto",
        ),
        width=800,  # Plot width in pixels
    )
    fig.update_xaxes(
        tickvals=[1000, 10000, 100000],  # Set specific tick values
        ticktext=["1k", "10k", "100k"],  # Set corresponding tick labels
    )
    fig.update_traces(
        marker=dict(size=12, opacity=0.7)
    )  # Adjust marker size and opacity
    # Show plot
    fig.show()
    # Save plot as HTML file
    fig.write_html("performance.html")


plot_performance(df)
```

Here is the corresponding plot when run with 1k, 10k, and 100k documents:

Interesting. Let's summarize the insights from this experiment:

- The `feed_sync` method is the slowest, and does not benefit much from increasing `max_connections`. With no concurrency, each blocking request is bound by network latency, which will in many cases be a lot higher than against a local VespaDocker instance. For example, with 100ms of latency to your Vespa instance, you can only feed 10 documents per second using the `VespaSync` method.
- The `feed_async` method is the fastest, and benefits slightly from increasing `max_connections`, regardless of the number of documents. This method is non-blocking, and will likely be even more beneficial against a remote Vespa instance, such as [Vespa Cloud](https://cloud.vespa.ai/), where network latency is higher.
- The `feed_iterable` performance sits between the other two methods, and benefits a lot from increasing `max_workers` as the number of documents grows.

We have not looked at multiprocessing, but there is definitely room to utilize more cores to improve further on these results. There is, however, one more alternative that is interesting to compare against: the Vespa CLI.

## Feeding with Vespa CLI[¶](#feeding-with-vespa-cli)

[Vespa CLI](https://docs.vespa.ai/en/vespa-cli) is a command-line interface for interacting with Vespa. Among its many useful features is the `vespa feed` command, which is the recommended way of feeding large datasets into Vespa. It is optimized for high feeding performance, and it will be interesting to get a feel for how performant feeding to a local Vespa instance is using the CLI. Note that the comparison is not entirely fair, as the CLI reads prepared data files, while the pyvespa methods stream the data as they feed it.

## Prepare the data for Vespa CLI[¶](#prepare-the-data-for-vespa-cli)

Vespa CLI can feed data from either many .json files or a single .jsonl file with many documents. Each document must follow this format:

```
{
    "put": "id:namespace:document-type::document-id",
    "fields": {
        "field1": "value1",
        "field2": "value2"
    }
}
```

Here, `put` is the document operation. Other allowed operations are `get`, `update`, and `remove`. For reference, see the [document JSON format](https://docs.vespa.ai/en/reference/document-json-format.html) documentation.

### Getting the datasets as .jsonl files[¶](#getting-the-datasets-as-jsonl-files)

Now, let's save the dataset to three different .jsonl files of 1k, 10k, and 100k documents.
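As a quick illustration of the `put` format before the real conversion, here is a minimal sketch using only the standard library. The two records are made-up examples; the `pyvespa-feed` namespace and `doc` document type match the schema used in this guide:

```python
import json

# Sketch: convert simple records to the Vespa CLI "put" feed format (.jsonl).
records = [
    {"_id": "1", "text": "First document"},
    {"_id": "2", "text": "Second document"},
]

jsonl_lines = [
    json.dumps(
        {
            "put": f"id:pyvespa-feed:doc::{record['_id']}",
            "fields": {"text": record["text"]},
        }
    )
    for record in records
]

# Each line is one complete JSON document operation.
print(jsonl_lines[0])
```

Writing these lines to a file, one per line, produces a .jsonl file that `vespa feed` can consume directly.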
```
for n in num_docs:
    print(f"Getting dataset with {n} docs...")
    # Load the dataset in non-streaming mode this time, as we want to save it to a jsonl file
    dataset_cli = load_dataset(
        "Cohere/wikipedia-2023-11-embed-multilingual-v3",
        "simple",
        split=f"train[:{n}]",  # Notice the slicing here, see https://huggingface.co/docs/datasets/loading#slice-splits
        streaming=False,
    )
    # Map to the format expected by the CLI.
    # Note that this differs a little from the format expected by the Python API.
    dataset_cli = dataset_cli.map(
        lambda x: {
            "put": f"id:pyvespa-feed:doc::{x['_id']}-json",
            "fields": {"text": x["text"]},
        }
    ).select_columns(["put", "fields"])
    # Save to a jsonl file
    assert len(dataset_cli) == n
    dataset_cli.to_json(f"vespa_feed-{n}.json", orient="records", lines=True)
```

```
Getting dataset with 1000 docs...
```

> Do NOT install if you already have the Vespa CLI installed.

[Install pyvespa](https://pyvespa.readthedocs.io/), and other dependencies.

In \[1\]:
```
!pip3 install vespacli pyvespa datasets "plotly>=5.20"
```

## Create an application package[¶](#create-an-application-package)

The [application package](https://vespa-engine.github.io/pyvespa/api/vespa/package.md) has all the Vespa configuration files. For this demo, we will use a simple application package.

In \[2\]:

```
from vespa.package import (
    ApplicationPackage,
    Field,
    Schema,
    Document,
    FieldSet,
    HNSW,
)

# Define the application name (can NOT contain `_` or `-`)
application = "feedperformancecloud"

package = ApplicationPackage(
    name=application,
    schema=[
        Schema(
            name="doc",
            document=Document(
                fields=[
                    Field(name="id", type="string", indexing=["summary"]),
                    Field(name="text", type="string", indexing=["index", "summary"]),
                    Field(
                        name="embedding",
                        type="tensor(x[1024])",
                        # Note that we are NOT embedding with a vespa model here, but that is also possible.
                        indexing=["summary", "attribute", "index"],
                        ann=HNSW(distance_metric="angular"),
                    ),
                ]
            ),
            fieldsets=[FieldSet(name="default", fields=["text"])],
        )
    ],
)
```

Note that the `ApplicationPackage` name cannot contain `-` or `_`.
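Since a bad name only fails at deploy time, it can help to check it up front. A minimal sketch — the regex encodes the rule stated above plus an assumed lowercase-alphanumeric convention; it is not pyvespa's own validation:

```python
import re

def is_valid_application_name(name: str) -> bool:
    # No '_' or '-' allowed (rule stated above); assumes lowercase letters
    # and digits, starting with a letter.
    return re.fullmatch(r"[a-z][a-z0-9]*", name) is not None

print(is_valid_application_name("feedperformancecloud"))  # valid
print(is_valid_application_name("feed_performance"))  # invalid: underscore
print(is_valid_application_name("feed-performance"))  # invalid: hyphen
```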
## Deploy the Vespa application[¶](#deploy-the-vespa-application)

Deploy `package` to [Vespa Cloud](https://cloud.vespa.ai/), without leaving the notebook, by creating an instance of [VespaCloud](https://vespa-engine.github.io/pyvespa/api/vespa/deployment#vespa.deployment.VespaCloud).

Follow the instructions from the output of the cell below and add the control-plane key in the console at `https://console.vespa-cloud.com/tenant/TENANT_NAME/account/keys` (replace TENANT_NAME with your tenant name).

In \[3\]:

```
from vespa.deployment import VespaCloud
import os

# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
# Key is only used for CI/CD. Can be removed if logging in interactively
key = os.getenv("VESPA_TEAM_API_KEY", None)
if key is not None:
    key = key.replace(r"\n", "\n")  # To parse key correctly

vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=application,
    key_content=key,  # Key is only used for CI/CD. Can be removed if logging in interactively
    application_package=package,
)
```

```
Setting application...
Running: vespa config set application vespa-team.feedperformancecloud
Setting target cloud...
Running: vespa config set target cloud
Api-key found for control plane access. Using api-key.
```

Calling `deploy()` on the [VespaCloud](https://vespa-engine.github.io/pyvespa/api/vespa/deployment#vespa.deployment.VespaCloud) instance deploys the application and returns a data-plane [Vespa](https://vespa-engine.github.io/pyvespa/api/vespa/application.md#vespa.application.Vespa) instance, which `app` will reference.

In \[4\]:

```
from vespa.application import Vespa

app: Vespa = vespa_cloud.deploy()
```

```
Deployment started in run 9 of dev-aws-us-east-1c for vespa-team.feedperformancecloud. This may take a few minutes the first time.
INFO [07:22:29] Deploying platform version 8.392.14 and application dev build 7 for dev-aws-us-east-1c of default ...
INFO [07:22:30] Using CA signed certificate version 1
INFO [07:22:30] Using 1 nodes in container cluster 'feedperformancecloud_container'
WARNING [07:22:33] Auto-overriding validation which would be disallowed in production: certificate-removal: Data plane certificate(s) from cluster 'feedperformancecloud_container' is removed (removed certificates: [CN=cloud.vespa.example]) This can cause client connection issues.. To allow this add certificate-removal to validation-overrides.xml, see https://docs.vespa.ai/en/reference/validation-overrides.html
INFO [07:22:34] Session 304192 for tenant 'vespa-team' prepared and activated.
INFO [07:22:35] ######## Details for all nodes ########
INFO [07:22:35] h95731a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO [07:22:35] --- platform vespa/cloud-tenant-rhel8:8.392.14
INFO [07:22:35] --- container on port 4080 has not started
INFO [07:22:35] --- metricsproxy-container on port 19092 has config generation 304192, wanted is 304192
INFO [07:22:35] h95729b.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO [07:22:35] --- platform vespa/cloud-tenant-rhel8:8.392.14
INFO [07:22:35] --- storagenode on port 19102 has config generation 304192, wanted is 304192
INFO [07:22:35] --- searchnode on port 19107 has config generation 304192, wanted is 304192
INFO [07:22:35] --- distributor on port 19111 has config generation 304192, wanted is 304192
INFO [07:22:35] --- metricsproxy-container on port 19092 has config generation 304192, wanted is 304192
INFO [07:22:35] h93272g.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO [07:22:35] --- platform vespa/cloud-tenant-rhel8:8.392.14
INFO [07:22:35] --- logserver-container on port 4080 has config generation 304192, wanted is 304192
INFO [07:22:35] --- metricsproxy-container on port 19092 has config generation 304192, wanted is 304192
INFO [07:22:35] h93272h.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO [07:22:35] --- platform vespa/cloud-tenant-rhel8:8.392.14
INFO [07:22:35] --- container-clustercontroller on port 19050 has config generation 304192, wanted is 304192
INFO [07:22:35] --- metricsproxy-container on port 19092 has config generation 304192, wanted is 304192
INFO [07:22:42] Found endpoints:
INFO [07:22:42] - dev.aws-us-east-1c
INFO [07:22:42] |-- https://b48e8812.bc737822.z.vespa-app.cloud/ (cluster 'feedperformancecloud_container')
INFO [07:22:44] Deployment of new application complete!
Found mtls endpoint for feedperformancecloud_container
URL: https://b48e8812.bc737822.z.vespa-app.cloud/
Connecting to https://b48e8812.bc737822.z.vespa-app.cloud/
Using mtls_key_cert
Authentication against endpoint https://b48e8812.bc737822.z.vespa-app.cloud//ApplicationStatus
Application is up!
Finished deployment.
```

Note that if you already have a Vespa Cloud instance running, the recommended way to initialize a `Vespa` instance is directly, by passing the `url` and `tenant` parameters to the `Vespa` constructor, along with either:

1. Key/cert for data-plane authentication (generated as part of deployment, copied into the application package under `/security/clients.pem`, and into `~/.vespa/mytenant.myapplication/data-plane-public-cert.pem` and `~/.vespa/mytenant.myapplication/data-plane-private-key.pem`):

```
from vespa.application import Vespa

app: Vespa = Vespa(
    url="https://my-endpoint.z.vespa-app.cloud",
    tenant="my-tenant",
    key_file="path/to/private-key.pem",
    cert_file="path/to/certificate.pem",
)
```

2. A token (must be generated in the [Vespa Cloud Console](https://console.vespa-cloud.com/) and defined in the application package):

```
from vespa.application import Vespa
import os

app: Vespa = Vespa(
    url="https://my-endpoint.z.vespa-app.cloud",
    tenant="my-tenant",
    vespa_cloud_secret_token=os.getenv("VESPA_CLOUD_SECRET_TOKEN"),
)
```

In \[5\]:

```
app.get_application_status()
```

```
Using mtls_key_cert
Authentication against endpoint https://b48e8812.bc737822.z.vespa-app.cloud//ApplicationStatus
```

## Preparing the data[¶](#preparing-the-data)

In this example we use the [HF Datasets](https://huggingface.co/docs/datasets/index) library to stream the ["Cohere/wikipedia-2023-11-embed-multilingual-v3"](https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3) dataset and index it in our newly deployed Vespa instance.
The dataset contains Wikipedia pages and their corresponding embeddings.

> For this exploration, we will use the `id`, `text` and `embedding` fields.

The following uses the [stream](https://huggingface.co/docs/datasets/stream) option of datasets to stream the data without downloading all the contents locally. The `map` functionality lets us convert the dataset fields into the feed format expected by `pyvespa`, a dict with the keys `id` and `fields`:

`{"id": "vespa-document-id", "fields": {"vespa_field": "vespa-field-value"}}`

In \[ \]:

```
from datasets import load_dataset
```

## Utility function to create a dataset with different number of documents[¶](#utility-function-to-create-a-dataset-with-different-number-of-documents)

In \[7\]:

```
def get_dataset(n_docs: int = 1000):
    dataset = load_dataset(
        "Cohere/wikipedia-2023-11-embed-multilingual-v3",
        "simple",
        split=f"train[:{n_docs}]",
    )
    dataset = dataset.map(
        lambda x: {
            "id": x["_id"] + "-iter",
            "fields": {"text": x["text"], "embedding": x["emb"]},
        }
    ).select_columns(["id", "fields"])
    return dataset
```

### A dataclass to store the parameters and results of the different feeding methods[¶](#a-dataclass-to-store-the-parameters-and-results-of-the-different-feeding-methods)

In \[8\]:
```
from dataclasses import dataclass
from typing import Callable, Optional, Iterable, Dict


@dataclass
class FeedParams:
    name: str
    num_docs: int
    max_connections: int
    function_name: str
    max_workers: Optional[int] = None
    max_queue_size: Optional[int] = None


@dataclass
class FeedResult(FeedParams):
    feed_time: Optional[float] = None
```

### A common callback function to notify if something goes wrong[¶](#a-common-callback-function-to-notify-if-something-goes-wrong)

In \[9\]:

```
from vespa.io import VespaResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(
            f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}"
        )
```

### Defining our feeding functions[¶](#defining-our-feeding-functions)

In \[10\]:

```
import time
import asyncio

from vespa.application import Vespa
```

In \[11\]:
```
def feed_iterable(app: Vespa, params: FeedParams, data: Iterable[Dict]) -> FeedResult:
    start = time.time()
    app.feed_iterable(
        data,
        schema="doc",
        namespace="pyvespa-feed",
        operation_type="feed",
        max_queue_size=params.max_queue_size,
        max_workers=params.max_workers,
        max_connections=params.max_connections,
        callback=callback,
    )
    end = time.time()
    sync_feed_time = end - start
    return FeedResult(
        **params.__dict__,
        feed_time=sync_feed_time,
    )


def feed_async_iterable(
    app: Vespa, params: FeedParams, data: Iterable[Dict]
) -> FeedResult:
    start = time.time()
    app.feed_async_iterable(
        data,
        schema="doc",
        namespace="pyvespa-feed",
        operation_type="feed",
        max_queue_size=params.max_queue_size,
        max_workers=params.max_workers,
        max_connections=params.max_connections,
        callback=callback,
    )
    end = time.time()
    sync_feed_time = end - start
    return FeedResult(
        **params.__dict__,
        feed_time=sync_feed_time,
    )
```

## Defining our hyperparameters[¶](#defining-our-hyperparameters)

In \[12\]:

```
from itertools import product

# We will only run for up to 10 000 documents here as notebook is run as part of CI.
num_docs = [
    1000,
    5_000,
    10_000,
]

params_by_function = {
    "feed_async_iterable": {
        "num_docs": num_docs,
        "max_connections": [1],
        "max_workers": [64],
        "max_queue_size": [2500],
    },
    "feed_iterable": {
        "num_docs": num_docs,
        "max_connections": [64],
        "max_workers": [64],
        "max_queue_size": [2500],
    },
}

feed_params = []
# Create one FeedParams instance of each permutation
for func, parameters in params_by_function.items():
    print(f"Function: {func}")
    keys, values = zip(*parameters.items())
    for combination in product(*values):
        settings = dict(zip(keys, combination))
        print(settings)
        feed_params.append(
            FeedParams(
                name=f"{settings['num_docs']}_{settings['max_connections']}_{settings.get('max_workers', 0)}_{func}",
                function_name=func,
                **settings,
            )
        )
    print("\n")  # Just to add space between different functions
```

```
Function: feed_async_iterable
{'num_docs': 1000, 'max_connections': 1, 'max_workers': 64, 'max_queue_size': 2500}
{'num_docs': 5000, 'max_connections': 1, 'max_workers': 64, 'max_queue_size': 2500}
{'num_docs': 10000, 'max_connections': 1,
'max_workers': 64, 'max_queue_size': 2500}

Function: feed_iterable
{'num_docs': 1000, 'max_connections': 64, 'max_workers': 64, 'max_queue_size': 2500}
{'num_docs': 5000, 'max_connections': 64, 'max_workers': 64, 'max_queue_size': 2500}
{'num_docs': 10000, 'max_connections': 64, 'max_workers': 64, 'max_queue_size': 2500}
```

In \[13\]:

```
print(f"Total number of feed_params: {len(feed_params)}")
```

```
Total number of feed_params: 6
```

Now, we will need a way to retrieve the callable function from its name.

In \[14\]:

```
# Get reference to function from string name
def get_func_from_str(func_name: str) -> Callable:
    return globals()[func_name]
```

### Function to clean up after each feed[¶](#function-to-clean-up-after-each-feed)

For a fair comparison, we will delete the data before feeding it again.

In \[15\]:

```
from typing import Iterable, Dict

from vespa.application import Vespa


def delete_data(app: Vespa, data: Iterable[Dict]):
    app.feed_iterable(
        iter=data,
        schema="doc",
        namespace="pyvespa-feed",
        operation_type="delete",
        callback=callback,
        max_workers=16,
        max_connections=16,
    )
```

## Main experiment loop[¶](#main-experiment-loop)

The line below makes the code runnable in Jupyter, which is already running an event loop.

In \[16\]:

```
import nest_asyncio

nest_asyncio.apply()
```

In \[17\]:
```
results = []
for params in feed_params:
    print("-" * 50)
    print("Starting feed with params:")
    print(params)
    data = get_dataset(params.num_docs)
    # Both feeding functions are plain (synchronous) callables;
    # feed_async_iterable runs its async requests internally.
    feed_result = get_func_from_str(params.function_name)(
        app=app, params=params, data=data
    )
    print(feed_result.feed_time)
    results.append(feed_result)
    print("Deleting data")
    time.sleep(3)
    delete_data(app, data)
```

```
--------------------------------------------------
Starting feed with params:
FeedParams(name='1000_1_64_feed_async_iterable', num_docs=1000, max_connections=1, function_name='feed_async_iterable', max_workers=64, max_queue_size=2500)
Using mtls_key_cert
Authentication against endpoint https://b48e8812.bc737822.z.vespa-app.cloud//ApplicationStatus
7.062151908874512
Deleting data
--------------------------------------------------
Starting feed with params:
FeedParams(name='5000_1_64_feed_async_iterable', num_docs=5000, max_connections=1, function_name='feed_async_iterable', max_workers=64, max_queue_size=2500)
20.979923963546753
Deleting data
--------------------------------------------------
Starting feed with params:
FeedParams(name='10000_1_64_feed_async_iterable', num_docs=10000, max_connections=1,
function_name='feed_async_iterable', max_workers=64, max_queue_size=2500)
41.321199893951416
Deleting data
--------------------------------------------------
Starting feed with params:
FeedParams(name='1000_64_64_feed_iterable', num_docs=1000, max_connections=64, function_name='feed_iterable', max_workers=64, max_queue_size=2500)
16.278107166290283
Deleting data
--------------------------------------------------
Starting feed with params:
FeedParams(name='5000_64_64_feed_iterable', num_docs=5000, max_connections=64, function_name='feed_iterable', max_workers=64, max_queue_size=2500)
78.27990508079529
Deleting data
--------------------------------------------------
Starting feed with params:
FeedParams(name='10000_64_64_feed_iterable', num_docs=10000, max_connections=64, function_name='feed_iterable', max_workers=64, max_queue_size=2500)
156.38266611099243
Deleting data
```

In \[18\]:

```
# Create a pandas DataFrame with the results
import pandas as pd

df = pd.DataFrame([result.__dict__ for result in results])
df["requests_per_second"] = df["num_docs"] / df["feed_time"]
df
```

Out\[18\]:

| | name | num_docs | max_connections | function_name | max_workers | max_queue_size | feed_time | requests_per_second |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1000_1_64_feed_async_iterable | 1000 | 1 | feed_async_iterable | 64 | 2500 | 7.062152 | 141.599899 |
| 1 | 5000_1_64_feed_async_iterable | 5000 | 1 | feed_async_iterable | 64 | 2500 | 20.979924 | 238.323075 |
| 2 | 10000_1_64_feed_async_iterable | 10000 | 1 | feed_async_iterable | 64 | 2500 | 41.321200 | 242.006525 |
| 3 | 1000_64_64_feed_iterable | 1000 | 64 | feed_iterable | 64 | 2500 | 16.278107 | 61.432204 |
| 4 | 5000_64_64_feed_iterable | 5000 | 64 | feed_iterable | 64 | 2500 | 78.279905 | 63.873353 |
| 5 | 10000_64_64_feed_iterable | 10000 | 64 | feed_iterable | 64 | 2500 | 156.382666 | 63.945706 |

## Plotting the results[¶](#plotting-the-results)

Let's plot the results to see how the different methods compare.

In \[19\]:

```
import plotly.express as px


def plot_performance(df: pd.DataFrame):
    # Scatter plot with a logarithmic x-axis using Plotly Express
    fig = px.scatter(
        df,
        x="num_docs",
        y="requests_per_second",
        color="function_name",  # Color points by feeding function
        log_x=True,  # Logarithmic x-axis
        log_y=False,  # Set to True for a logarithmic y-axis as well
        title="Performance: Requests per Second vs. Number of Documents",
        labels={  # Custom axis labels
            "num_docs": "Number of Documents",
            "requests_per_second": "Requests per Second",
            "max_workers": "max_workers",
            "max_queue_size": "max_queue_size",
        },
        template="plotly_white",  # White background for a minimalist style
        hover_data=[
            "max_workers",
            "max_queue_size",
            "max_connections",
        ],  # Additional information to show on hover
    )
    # Update layout for better readability
    fig.update_layout(
        font=dict(
            size=16,  # Larger font for better visibility
        ),
        legend_title_text="Function Details",  # Custom legend title
        legend=dict(
            title_font_size=16,
            x=0.8,  # Legend position in paper coordinates (0-1), similar to bbox_to_anchor in Matplotlib
            xanchor="auto",
            y=1,
            yanchor="auto",
        ),
        width=800,  # Plot width in pixels
    )
    fig.update_xaxes(
        tickvals=[1000, 5000, 10000],  # Set specific tick values
        ticktext=["1k", "5k", "10k"],  # Set corresponding tick labels
    )
    fig.update_traces(
        marker=dict(size=12, opacity=0.7)
    )  # Adjust marker size and opacity
    # Show plot
    fig.show()
    # Save plot as HTML file
    fig.write_html("performance.html")
plot_performance(df) ``` import plotly.express as px def plot_performance(df: pd.DataFrame): # Create a scatter plot with logarithmic scale for both axes using Plotly Express fig = px.scatter( df, x="num_docs", y="requests_per_second", color="function_name", # Defines color based on different functions log_x=True, # Set x-axis to logarithmic scale log_y=False, # If you also want the y-axis in logarithmic scale, set this to True title="Performance: Requests per Second vs. Number of Documents", labels={ # Customizing axis labels "num_docs": "Number of Documents", "requests_per_second": "Requests per Second", "max_workers": "max_workers", "max_queue_size": "max_queue_size", }, template="plotly_white", # This sets the style to a white background, adhering to Tufte's minimalist principles hover_data=[ "max_workers", "max_queue_size", "max_connections", ], # Additional information to show on hover ) # Update layout for better readability, similar to 'talk' context in Seaborn fig.update_layout( font=dict( size=16, # Adjusting font size for better visibility, similar to 'talk' context ), legend_title_text="Function Details", # Custom legend title legend=dict( title_font_size=16, x=800, # Adjusting legend position similar to bbox_to_anchor in Matplotlib xanchor="auto", y=1, yanchor="auto", ), width=800, # Adjusting width of the plot ) fig.update_xaxes( tickvals=[1000, 5000, 10000], # Set specific tick values ticktext=["1k", "5k", "10k"], # Set corresponding tick labels ) fig.update_traces( marker=dict(size=12, opacity=0.7) ) # Adjust marker size and opacity # Show plot fig.show() # Save plot as HTML file fig.write_html("performance.html") plot_performance(df) Interesting. Let's try to summarize the insights we got from this experiment: - The `feed_async_iterable` method is approximately 3x faster than the `feed_iterable` method for this specific setup. - Note that this will vary depending on the network latency between the client and the Vespa instance. 
- If you are feeding from a cloud instance with lower latency to the Vespa instance, the difference between the methods will be smaller, and the `feed_iterable` method might even be faster.
- Still prefer the [Vespa CLI](https://docs.vespa.ai/en/vespa-cli) if you *really* care about performance. 🚀
- If you want to use pyvespa, prefer the `feed_async_iterable` method if you are I/O-bound.

## Cleanup[¶](#cleanup)

In \[26\]:

```
vespa_cloud.delete()
```

```
Deactivated vespa-team.feedperformancecloud in dev.aws-us-east-1c
Deleted instance vespa-team.feedperformancecloud.default
```

## Next steps[¶](#next-steps)

Check out some of the other [examples](https://vespa-engine.github.io/pyvespa/examples) in the documentation.

# LightGBM: Mapping model features to Vespa features[¶](#lightgbm-mapping-model-features-to-vespa-features)

The main goal of this tutorial is to show how to deploy a LightGBM model with feature names that do not match Vespa feature names. The following tasks will be accomplished throughout the tutorial:

1. Train a LightGBM classification model with generic feature names that will not be available in the Vespa application.
1. Create an application package and include a mapping from Vespa feature names to LightGBM model feature names.
1. Create Vespa application package files and export them to an application folder.
1. Export the trained LightGBM model to the Vespa application folder.
1. Deploy the Vespa application using the application folder.
1. Feed data to the Vespa application.
1. Assert that the LightGBM predictions from the deployed model are correct.

Refer to [troubleshooting](https://vespa-engine.github.io/pyvespa/troubleshooting.md) for any problem when running this guide.

## Setup[¶](#setup)

Install and load required packages.

In \[ \]:

```
!pip3 install numpy pandas pyvespa lightgbm
```

In \[3\]:
```
import json

import lightgbm as lgb
import numpy as np
import pandas as pd
```

## Create data[¶](#create-data)

Simulate data that will be used to train the LightGBM model. Note that Vespa does not automatically recognize the feature names `feature_1`, `feature_2` and `feature_3`. When creating the application package, we need to map those variables to something that the Vespa application recognizes, such as a document attribute or a query value.

In \[4\]:

```
# Create random training set
features = pd.DataFrame(
    {
        "feature_1": np.random.random(100),
        "feature_2": np.random.random(100),
        "feature_3": pd.Series(
            np.random.choice(["a", "b", "c"], size=100), dtype="category"
        ),
    }
)
features.head()
```

Out\[4\]:

| | feature_1 | feature_2 | feature_3 |
| --- | --------- | --------- | --------- |
| 0 | 0.856415 | 0.550705 | a |
| 1 | 0.615107 | 0.509030 | a |
| 2 | 0.089759 | 0.667729 | c |
| 3 | 0.161664 | 0.361693 | b |
| 4 | 0.841505 | 0.967227 | b |

Create a target variable that depends on `feature_1`, `feature_2` and `feature_3`:

In \[5\]:

```
numeric_features = pd.get_dummies(features)
targets = (
    (
        numeric_features["feature_1"]
        + numeric_features["feature_2"]
        - 0.5 * numeric_features["feature_3_a"]
        + 0.5 * numeric_features["feature_3_c"]
    )
    > 1.0
) * 1.0
targets
```

Out\[5\]:

```
0     0.0
1     0.0
2     1.0
3     0.0
4     1.0
     ...
95    1.0
96    1.0
97    0.0
98    1.0
99    1.0
Length: 100, dtype: float64
```

## Fit LightGBM model[¶](#fit-lightgbm-model)

Train the LightGBM model on the simulated data.

In \[6\]:

```
training_set = lgb.Dataset(features, targets)

# Train the model
params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "num_leaves": 3,
}
model = lgb.train(params, training_set, num_boost_round=5)
```

```
[LightGBM] [Info] Number of positive: 48, number of negative: 52
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000404 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 74
[LightGBM] [Info] Number of data points in the train set: 100, number of used features: 3
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.480000 -> initscore=-0.080043
[LightGBM] [Info] Start training from score -0.080043
```

## Vespa application package[¶](#vespa-application-package)

Create the application package and map the LightGBM feature names to the related Vespa names. In this example we assume that `feature_1` represents the document field `numeric`, and we map `feature_1` to `attribute(numeric)` through a Vespa `Function` in the corresponding `RankProfile`. `feature_2` maps to a `value` that will be sent along with the query, represented in Vespa by mapping `query(value)` to `feature_2`. Lastly, the categorical feature is mapped from `attribute(categorical)` to `feature_3`.

In \[7\]:
```
from vespa.package import ApplicationPackage, Field, RankProfile, Function

app_package = ApplicationPackage(name="lightgbm")

app_package.schema.add_fields(
    Field(name="id", type="string", indexing=["summary", "attribute"]),
    Field(name="numeric", type="double", indexing=["summary", "attribute"]),
    Field(name="categorical", type="string", indexing=["summary", "attribute"]),
)

app_package.schema.add_rank_profile(
    RankProfile(
        name="classify",
        functions=[
            Function(name="feature_1", expression="attribute(numeric)"),
            Function(name="feature_2", expression="query(value)"),
            Function(name="feature_3", expression="attribute(categorical)"),
        ],
        first_phase="lightgbm('lightgbm_model.json')",
    )
)
```

We can check what the Vespa schema definition file will look like. Note that `feature_1`, `feature_2` and `feature_3`, required by the LightGBM model, are now defined in the schema:

In \[8\]:
```
print(app_package.schema.schema_to_text)
```

```
schema lightgbm {
    document lightgbm {
        field id type string {
            indexing: summary | attribute
        }
        field numeric type double {
            indexing: summary | attribute
        }
        field categorical type string {
            indexing: summary | attribute
        }
    }
    rank-profile classify {
        function feature_1() {
            expression {
                attribute(numeric)
            }
        }
        function feature_2() {
            expression {
                query(value)
            }
        }
        function feature_3() {
            expression {
                attribute(categorical)
            }
        }
        first-phase {
            expression {
                lightgbm('lightgbm_model.json')
            }
        }
    }
}
```

We can export the application package files to disk:

In \[9\]:

```
from pathlib import Path

Path("lightgbm").mkdir(parents=True, exist_ok=True)
app_package.to_files("lightgbm")
```

Note that we don't have any models under the `models` folder yet. We need to export the LightGBM model that we trained earlier to `models/lightgbm_model.json`.

In \[13\]:

```
!tree lightgbm
```

```
lightgbm
├── files
├── models
│   └── lightgbm_model.json
├── schemas
│   └── lightgbm.sd
├── search
│   └── query-profiles
│       ├── default.xml
│       └── types
│           └── root.xml
└── services.xml

7 directories, 5 files
```

## Export the model[¶](#export-the-model)

In \[12\]:

```
with open("lightgbm/models/lightgbm_model.json", "w") as f:
    json.dump(model.dump_model(), f, indent=2)
```

Now we can see that the model is where Vespa expects it to be:

In \[14\]:
``` !tree lightgbm ``` !tree lightgbm ``` lightgbm ├── files ├── models │   └── lightgbm_model.json ├── schemas │   └── lightgbm.sd ├── search │   └── query-profiles │   ├── default.xml │   └── types │   └── root.xml └── services.xml 7 directories, 5 files ``` ## Deploy the application[¶](#deploy-the-application) Deploy the application package from disk with Docker: In \[15\]: Copied! ``` from vespa.deployment import VespaDocker vespa_docker = VespaDocker() app = vespa_docker.deploy_from_disk( application_name="lightgbm", application_root="lightgbm" ) ``` from vespa.deployment import VespaDocker vespa_docker = VespaDocker() app = vespa_docker.deploy_from_disk( application_name="lightgbm", application_root="lightgbm" ) ``` Waiting for configuration server, 0/300 seconds... Waiting for configuration server, 5/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 0/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 5/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 10/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 15/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 20/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 25/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Application is up! Finished deployment. ``` ## Feed the data[¶](#feed-the-data) Feed the simulated data. To feed data in batch we need to create a list of dictionaries containing id and fields keys: In \[16\]: Copied! 
```
feed_batch = [
    {
        "id": idx,
        "fields": {
            "id": idx,
            "numeric": row["feature_1"],
            "categorical": row["feature_3"],
        },
    }
    for idx, row in features.iterrows()
]
```

In \[17\]:

```
from vespa.io import VespaResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(f"Document {id} was not fed to Vespa due to error: {response.get_json()}")


app.feed_iterable(feed_batch, callback=callback)
```

## Model predictions[¶](#model-predictions)

Predict with the trained LightGBM model so that we can later compare with the predictions returned by Vespa.

In \[18\]:

```
features["model_prediction"] = model.predict(features)
```

In \[19\]:

```
features
```

Out\[19\]:

| | feature_1 | feature_2 | feature_3 | model_prediction |
| --- | --------- | --------- | --------- | ---------------- |
| 0 | 0.856415 | 0.550705 | a | 0.402572 |
| 1 | 0.615107 | 0.509030 | a | 0.356262 |
| 2 | 0.089759 | 0.667729 | c | 0.641578 |
| 3 | 0.161664 | 0.361693 | b | 0.388184 |
| 4 | 0.841505 | 0.967227 | b | 0.632525 |
| ... | ... | ... | ... | ... |
| 95 | 0.087768 | 0.451850 | c | 0.641578 |
| 96 | 0.839063 | 0.644387 | b | 0.632525 |
| 97 | 0.725573 | 0.327668 | a | 0.376350 |
| 98 | 0.937481 | 0.199995 | b | 0.376350 |
| 99 | 0.918530 | 0.734004 | a | 0.402572 |

100 rows × 4 columns

## Query[¶](#query)

Create a `compute_vespa_relevance` function that takes a document `id` and a query `value` and returns the relevance score computed by the deployed LightGBM model.

In \[20\]:
``` def compute_vespa_relevance(id_value: int): hits = app.query( body={ "yql": "select * from sources * where id = {}".format(str(id_value)), "ranking": "classify", "ranking.features.query(value)": features.loc[id_value, "feature_2"], "hits": 1, } ).hits return hits[0]["relevance"] compute_vespa_relevance(id_value=0) ``` def compute_vespa_relevance(id_value: int): hits = app.query( body={ "yql": "select * from sources * where id = {}".format(str(id_value)), "ranking": "classify", "ranking.features.query(value)": features.loc[id_value, "feature_2"], "hits": 1, } ).hits return hits[0]["relevance"] compute_vespa_relevance(id_value=0) Out\[20\]: ``` 0.4025720849980601 ``` Loop through the `features` to compute a vespa prediction for all the data points, so that we can compare it to the predictions made by the model outside Vespa. In \[21\]: Copied! ``` vespa_relevance = [] for idx, row in features.iterrows(): vespa_relevance.append(compute_vespa_relevance(id_value=idx)) features["vespa_relevance"] = vespa_relevance ``` vespa_relevance = [] for idx, row in features.iterrows(): vespa_relevance.append(compute_vespa_relevance(id_value=idx)) features["vespa_relevance"] = vespa_relevance In \[22\]: Copied! ``` features ``` features Out\[22\]: | | feature_1 | feature_2 | feature_3 | model_prediction | vespa_relevance | | --- | --------- | --------- | --------- | ---------------- | --------------- | | 0 | 0.856415 | 0.550705 | a | 0.402572 | 0.402572 | | 1 | 0.615107 | 0.509030 | a | 0.356262 | 0.356262 | | 2 | 0.089759 | 0.667729 | c | 0.641578 | 0.641578 | | 3 | 0.161664 | 0.361693 | b | 0.388184 | 0.388184 | | 4 | 0.841505 | 0.967227 | b | 0.632525 | 0.632525 | | ... | ... | ... | ... | ... | ... 
| 95 | 0.087768 | 0.451850 | c | 0.641578 | 0.641578 |
| 96 | 0.839063 | 0.644387 | b | 0.632525 | 0.632525 |
| 97 | 0.725573 | 0.327668 | a | 0.376350 | 0.376350 |
| 98 | 0.937481 | 0.199995 | b | 0.376350 | 0.376350 |
| 99 | 0.918530 | 0.734004 | a | 0.402572 | 0.402572 |

100 rows × 5 columns

## Compare model and Vespa predictions[¶](#compare-model-and-vespa-predictions)

Predictions from the model should be equal to predictions from Vespa, showing the model was correctly deployed to Vespa.

In \[ \]:

```
# Use numpy's allclose for floating-point comparison with tolerance
assert np.allclose(
    features["model_prediction"].values,
    features["vespa_relevance"].values,
    rtol=1e-9,
    atol=1e-15,
), "Model predictions and Vespa relevance values should be approximately equal"
```

## Clean environment[¶](#clean-environment)

In \[24\]:

```
!rm -fr lightgbm
vespa_docker.container.stop()
vespa_docker.container.remove()
```

# LightGBM: Training the model with Vespa features[¶](#lightgbm-training-the-model-with-vespa-features)

The main goal of this tutorial is to deploy and use a LightGBM model in a Vespa application. The following tasks will be accomplished throughout the tutorial:

1. Train a LightGBM classification model with variable names supported by Vespa.
1. Create Vespa application package files and export them to an application folder.
1. Export the trained LightGBM model to the Vespa application folder.
1. Deploy the Vespa application using the application folder.
1. Feed data to the Vespa application.
1. Assert that the LightGBM predictions from the deployed model are correct.
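The naming convention at the heart of step 1, training columns named after the Vespa rank features they will map to, can be sketched quickly. This is a minimal illustration with random data (not part of the original notebook); note how `pd.get_dummies` keeps the Vespa-style prefix in the derived one-hot column names:

```python
import numpy as np
import pandas as pd

# Name training columns after Vespa rank features:
# query(value) is sent with the query; attribute(...) are document fields.
features = pd.DataFrame(
    {
        "query(value)": np.random.random(5),
        "attribute(numeric)": np.random.random(5),
        "attribute(categorical)": pd.Categorical(
            np.random.choice(["a", "b", "c"], size=5), categories=["a", "b", "c"]
        ),
    }
)

# One-hot encoding yields e.g. attribute(categorical)_a, attribute(categorical)_b
print(pd.get_dummies(features).columns.tolist())
```

Declaring the categories explicitly guarantees all three dummy columns exist even when a category is absent from the small sample.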
Refer to [troubleshooting](https://vespa-engine.github.io/pyvespa/troubleshooting.md) for any problem when running this guide.

## Setup[¶](#setup)

Install and load required packages.

In \[ \]:

```
!pip3 install numpy pandas pyvespa lightgbm
```

In \[3\]:

```
import json

import lightgbm as lgb
import numpy as np
import pandas as pd
```

## Create data[¶](#create-data)

Generate a toy dataset to follow along. Note that we set the column names in a format that Vespa understands. `query(value)` means that the user will send a parameter named `value` along with the query. `attribute(field)` means that `field` is a document attribute defined in a schema. In the example below we have a query parameter named `value` and two document attributes, `numeric` and `categorical`. If we want `lightgbm` to handle categorical variables, we should use `dtype="category"` when creating the dataframe, as shown below.

In \[4\]:

```
# Create random training set
features = pd.DataFrame(
    {
        "query(value)": np.random.random(100),
        "attribute(numeric)": np.random.random(100),
        "attribute(categorical)": pd.Series(
            np.random.choice(["a", "b", "c"], size=100), dtype="category"
        ),
    }
)
features.head()
```

Out\[4\]:

| | query(value) | attribute(numeric) | attribute(categorical) |
| --- | ------------ | ------------------ | ---------------------- |
| 0 | 0.437748 | 0.442222 | c |
| 1 | 0.957135 | 0.323047 | b |
| 2 | 0.514168 | 0.426117 | a |
| 3 | 0.713511 | 0.886630 | b |
| 4 | 0.626918 | 0.663179 | c |

We generate the target variable as a function of the three features defined above:

In \[5\]:
```
numeric_features = pd.get_dummies(features)
targets = (
    (
        numeric_features["query(value)"]
        + numeric_features["attribute(numeric)"]
        - 0.5 * numeric_features["attribute(categorical)_a"]
        + 0.5 * numeric_features["attribute(categorical)_c"]
    )
    > 1.0
) * 1.0
targets
```

Out\[5\]:

```
0     1.0
1     1.0
2     0.0
3     1.0
4     1.0
     ...
95    0.0
96    1.0
97    0.0
98    0.0
99    1.0
Length: 100, dtype: float64
```

## Fit LightGBM model[¶](#fit-lightgbm-model)

Train a LightGBM model with a binary loss function:

In \[6\]:

```
training_set = lgb.Dataset(features, targets)

# Train the model
params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "num_leaves": 3,
}
model = lgb.train(params, training_set, num_boost_round=5)
```

```
[LightGBM] [Info] Number of positive: 48, number of negative: 52
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000484 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 74
[LightGBM] [Info] Number of data points in the train set: 100, number of used features: 3
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.480000 -> initscore=-0.080043
[LightGBM] [Info] Start training from score -0.080043
```

## Vespa application package[¶](#vespa-application-package)

Create a Vespa application package. The model expects two document attributes, `numeric` and `categorical`. We can use the model in first-phase ranking through the `lightgbm` rank feature.

In \[7\]:
```
from vespa.package import ApplicationPackage, Field, RankProfile

app_package = ApplicationPackage(name="lightgbm")

app_package.schema.add_fields(
    Field(name="id", type="string", indexing=["summary", "attribute"]),
    Field(name="numeric", type="double", indexing=["summary", "attribute"]),
    Field(name="categorical", type="string", indexing=["summary", "attribute"]),
)

app_package.schema.add_rank_profile(
    RankProfile(name="classify", first_phase="lightgbm('lightgbm_model.json')")
)
```

We can check what the Vespa schema definition file will look like:

In \[8\]:

```
print(app_package.schema.schema_to_text)
```

```
schema lightgbm {
    document lightgbm {
        field id type string {
            indexing: summary | attribute
        }
        field numeric type double {
            indexing: summary | attribute
        }
        field categorical type string {
            indexing: summary | attribute
        }
    }
    rank-profile classify {
        first-phase {
            expression {
                lightgbm('lightgbm_model.json')
            }
        }
    }
}
```

We can export the application package files to disk:

In \[9\]:

```
from pathlib import Path

Path("lightgbm").mkdir(parents=True, exist_ok=True)
app_package.to_files("lightgbm")
```

Note that we don't have any models under the `models` folder yet. We need to export the LightGBM model that we trained earlier to `models/lightgbm_model.json`.

In \[10\]:
``` !tree lightgbm ``` !tree lightgbm ``` lightgbm ├── files ├── models ├── schemas │   └── lightgbm.sd ├── search │   └── query-profiles │   ├── default.xml │   └── types │   └── root.xml └── services.xml 7 directories, 4 files ``` ## Export the model[¶](#export-the-model) In \[11\]: Copied! ``` with open("lightgbm/models/lightgbm_model.json", "w") as f: json.dump(model.dump_model(), f, indent=2) ``` with open("lightgbm/models/lightgbm_model.json", "w") as f: json.dump(model.dump_model(), f, indent=2) Now we can see that the model is where Vespa expects it to be: In \[12\]: Copied! ``` !tree lightgbm ``` !tree lightgbm ``` lightgbm ├── files ├── models │   └── lightgbm_model.json ├── schemas │   └── lightgbm.sd ├── search │   └── query-profiles │   ├── default.xml │   └── types │   └── root.xml └── services.xml 7 directories, 5 files ``` ## Deploy the application[¶](#deploy-the-application) Deploy the application package from disk with Docker: In \[13\]: Copied! ``` from vespa.deployment import VespaDocker vespa_docker = VespaDocker() app = vespa_docker.deploy_from_disk( application_name="lightgbm", application_root="lightgbm" ) ``` from vespa.deployment import VespaDocker vespa_docker = VespaDocker() app = vespa_docker.deploy_from_disk( application_name="lightgbm", application_root="lightgbm" ) ``` Waiting for configuration server, 0/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 0/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 5/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 10/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Application is up! Finished deployment. ``` ## Feed the data[¶](#feed-the-data) Feed the simulated data. 
To feed data in batch we need to create a list of dictionaries containing `id` and `fields` keys: In \[14\]: Copied! ``` feed_batch = [ { "id": idx, "fields": { "id": idx, "numeric": row["attribute(numeric)"], "categorical": row["attribute(categorical)"], }, } for idx, row in features.iterrows() ] ``` feed_batch = \[ { "id": idx, "fields": { "id": idx, "numeric": row["attribute(numeric)"], "categorical": row["attribute(categorical)"], }, } for idx, row in features.iterrows() \] Feed the batch of data: In \[15\]: Copied! ``` from vespa.io import VespaResponse def callback(response: VespaResponse, id: str): if not response.is_successful(): print(f"Document {id} was not fed to Vespa due to error: {response.get_json()}") app.feed_iterable(feed_batch, callback=callback) ``` from vespa.io import VespaResponse def callback(response: VespaResponse, id: str): if not response.is_successful(): print(f"Document {id} was not fed to Vespa due to error: {response.get_json()}") app.feed_iterable(feed_batch, callback=callback) ## Model predictions[¶](#model-predictions) Predict with the trained LightGBM model so that we can later compare with the predictions returned by Vespa. In \[16\]: Copied! ``` features["model_prediction"] = model.predict(features) ``` features["model_prediction"] = model.predict(features) In \[17\]: Copied! ``` features ``` features Out\[17\]: | | query(value) | attribute(numeric) | attribute(categorical) | model_prediction | | --- | ------------ | ------------------ | ---------------------- | ---------------- | | 0 | 0.437748 | 0.442222 | c | 0.645663 | | 1 | 0.957135 | 0.323047 | b | 0.645663 | | 2 | 0.514168 | 0.426117 | a | 0.354024 | | 3 | 0.713511 | 0.886630 | b | 0.645663 | | 4 | 0.626918 | 0.663179 | c | 0.645663 | | ... | ... | ... | ... | ... 
| 95 | 0.208583 | 0.103319 | c | 0.352136 |
| 96 | 0.882902 | 0.224213 | c | 0.645663 |
| 97 | 0.604831 | 0.675583 | a | 0.354024 |
| 98 | 0.278674 | 0.008019 | b | 0.352136 |
| 99 | 0.417318 | 0.616241 | b | 0.645663 |

100 rows × 4 columns

## Query[¶](#query)

Create a `compute_vespa_relevance` function that takes a document `id` and a query `value` and returns the relevance score computed by the deployed LightGBM model.

In \[18\]:

```
def compute_vespa_relevance(id_value: int):
    hits = app.query(
        body={
            "yql": "select * from sources * where id = {}".format(str(id_value)),
            "ranking": "classify",
            "ranking.features.query(value)": features.loc[id_value, "query(value)"],
            "hits": 1,
        }
    ).hits
    return hits[0]["relevance"]


compute_vespa_relevance(id_value=0)
```

Out\[18\]:

```
0.645662636917761
```

Loop through the `features` to compute a Vespa prediction for all the data points, so that we can compare it to the predictions made by the model outside Vespa.

In \[19\]:

```
vespa_relevance = []
for idx, row in features.iterrows():
    vespa_relevance.append(compute_vespa_relevance(id_value=idx))

features["vespa_relevance"] = vespa_relevance
```

In \[20\]:
``` features ``` features Out\[20\]: | | query(value) | attribute(numeric) | attribute(categorical) | model_prediction | vespa_relevance | | --- | ------------ | ------------------ | ---------------------- | ---------------- | --------------- | | 0 | 0.437748 | 0.442222 | c | 0.645663 | 0.645663 | | 1 | 0.957135 | 0.323047 | b | 0.645663 | 0.645663 | | 2 | 0.514168 | 0.426117 | a | 0.354024 | 0.354024 | | 3 | 0.713511 | 0.886630 | b | 0.645663 | 0.645663 | | 4 | 0.626918 | 0.663179 | c | 0.645663 | 0.645663 | | ... | ... | ... | ... | ... | ... | | 95 | 0.208583 | 0.103319 | c | 0.352136 | 0.352136 | | 96 | 0.882902 | 0.224213 | c | 0.645663 | 0.645663 | | 97 | 0.604831 | 0.675583 | a | 0.354024 | 0.354024 | | 98 | 0.278674 | 0.008019 | b | 0.352136 | 0.352136 | | 99 | 0.417318 | 0.616241 | b | 0.645663 | 0.645663 | 100 rows × 5 columns ## Compare model and Vespa predictions[¶](#compare-model-and-vespa-predictions) Predictions from the model should be equal to predictions from Vespa, showing the model was correctly deployed to Vespa. In \[ \]: Copied! ``` assert np.allclose(features["model_prediction"], features["vespa_relevance"]) ``` assert np.allclose(features["model_prediction"], features["vespa_relevance"]) ## Clean environment[¶](#clean-environment) In \[22\]: Copied! ``` !rm -fr lightgbm vespa_docker.container.stop() vespa_docker.container.remove() ``` !rm -fr lightgbm vespa_docker.container.stop() vespa_docker.container.remove() # Using Mixedbread.ai embedding model with support for binary vectors[¶](#using-mixedbreadai-embedding-model-with-support-for-binary-vectors) Check out the amazing blog post: [Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval](https://huggingface.co/blog/embedding-quantization) Binarization is significant because: - Binarization reduces the storage footprint from 1024 floats (4096 bytes) per vector to 128 int8 (128 bytes). 
- 32x less data to store
- Faster distance calculations using [hamming](https://docs.vespa.ai/en/reference/schema-reference.html#distance-metric) distance, which Vespa natively supports for bits packed into int8 precision. More on [hamming distance in Vespa](https://docs.vespa.ai/en/reference/schema-reference.html#hamming).

Vespa supports `hamming` distance with and without [hnsw indexing](https://docs.vespa.ai/en/approximate-nn-hnsw.html).

For those wanting to learn more about binary vectors, we recommend our 2021 blog series: [Billion-scale vector search with Vespa](https://blog.vespa.ai/billion-scale-knn/) and [Billion-scale vector search with Vespa - part two](https://blog.vespa.ai/billion-scale-knn-part-two/).

This notebook demonstrates how to use the Mixedbread [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) model, which supports binary vectors, with Vespa. The example also includes a re-ranking phase that uses the float version of the query vector for improved accuracy. The re-ranking step brings the model to 96.45% of the accuracy of the full float version, with a 32x decrease in storage footprint.

Install the dependencies:

In \[ \]:

```
!pip3 install -U pyvespa sentence-transformers vespacli
```

## Examining the embeddings using sentence-transformers[¶](#examining-the-embeddings-using-sentence-transformers)

Read the [blog post](https://huggingface.co/blog/embedding-quantization) for `sentence-transformers` usage, and see the [sentence-transformer API](https://sbert.net/docs/package_reference/SentenceTransformer.html). Model card: [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1).

Load the model using the sentence-transformers library:

In \[1\]:
```
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "mixedbread-ai/mxbai-embed-large-v1",
    prompts={
        "retrieval": "Represent this sentence for searching relevant passages: ",
    },
    default_prompt_name="retrieval",
)
```

```
Default prompt name is set to 'retrieval'. This prompt will be applied to all `encode()` calls, except if `encode()` is called with `prompt` or `prompt_name` parameters.
```

### Some sample documents[¶](#some-sample-documents)

Define a few sample documents that we want to embed.

In \[4\]:

```
documents = [
    "Alan Turing was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist.",
    "Albert Einstein was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time.",
    "Isaac Newton was an English polymath active as a mathematician, physicist, astronomer, alchemist, theologian, and author who was described in his time as a natural philosopher.",
    "Marie Curie was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity",
]
```
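Hamming distance over packed int8 vectors, which this notebook later configures as the Vespa distance metric, can be illustrated in plain numpy. This is an illustrative sketch with hypothetical random vectors, not part of the notebook's original flow or the Vespa API; each int8 value packs 8 bits, so 128 int8 values represent a 1024-bit vector:

```python
import numpy as np

# Hypothetical stand-ins for two packed binary embeddings (128 int8 = 1024 bits)
a = np.random.RandomState(0).randint(-128, 128, size=128).astype(np.int8)
b = np.random.RandomState(1).randint(-128, 128, size=128).astype(np.int8)


def hamming(u: np.ndarray, v: np.ndarray) -> int:
    """Number of differing bits between two packed binary vectors."""
    # XOR leaves a 1 exactly where the bits differ; unpackbits counts them
    return int(np.unpackbits((u ^ v).view(np.uint8)).sum())


print(hamming(a, a))  # identical vectors have distance 0
print(hamming(a, b))  # somewhere between 0 and 1024
```

Because the distance reduces to XOR plus popcount, it is much cheaper than float dot products, which is why retrieval over binary vectors is so fast.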
Run embedding inference; notice how we specify `precision="binary"`.

In \[5\]:

```
binary_embeddings = model.encode(documents, precision="binary")
```

In \[8\]:

```
print(
    "Binary embedding shape {} with type {}".format(
        binary_embeddings.shape, binary_embeddings.dtype
    )
)
```

```
Binary embedding shape (4, 128) with type int8
```

## Defining the Vespa application[¶](#defining-the-vespa-application)

First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their types. Notice the `binary_vector` field, which defines an indexed (dense) Vespa tensor with the dimension `x[128]`. The indexing statement includes `index`, which means that Vespa will build an HNSW index for this field. Also notice the [distance-metric](https://docs.vespa.ai/en/reference/schema-reference.html#distance-metric) configuration, where we specify `hamming`.

In \[9\]:
```
from vespa.package import Schema, Document, Field, FieldSet

my_schema = Schema(
    name="doc",
    mode="index",
    document=Document(
        fields=[
            Field(
                name="doc_id",
                type="string",
                indexing=["summary", "index"],
                match=["word"],
                rank="filter",
            ),
            Field(
                name="text",
                type="string",
                indexing=["summary", "index"],
                index="enable-bm25",
            ),
            Field(
                name="binary_vector",
                type="tensor(x[128])",
                indexing=["attribute", "index"],
                attribute=["distance-metric: hamming"],
            ),
        ]
    ),
    fieldsets=[FieldSet(name="default", fields=["text"])],
)
```

We must add the schema to a Vespa [application package](https://docs.vespa.ai/en/application-packages.html). This consists of configuration files, schemas, models, and possibly custom code (plugins).

In \[15\]:

```
from vespa.package import ApplicationPackage

vespa_app_name = "mixedbreadai"
vespa_application_package = ApplicationPackage(name=vespa_app_name, schema=[my_schema])
```

In the last step, we configure [ranking](https://docs.vespa.ai/en/ranking.html) by adding rank profiles to the schema. `unpack_bits` unpacks the binary representation into a 1024-dimensional float vector ([doc](https://docs.vespa.ai/en/reference/ranking-expressions.html#unpack-bits)).
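The effect of the `2*unpack_bits(...) - 1` function used in the rank profile can be mimicked in plain numpy. This is a sketch under assumed toy shapes (a 16-bit vector and a made-up query), not Vespa's actual implementation:

```python
import numpy as np

# Hypothetical packed binary document vector: 2 int8 values = 16 bits
packed = np.array([-127, 3], dtype=np.int8)

# unpack_bits equivalent: expand each int8 into its 8 bits (0/1 values)
bits = np.unpackbits(packed.view(np.uint8)).astype(np.float32)

# The rank profile's function rescales {0, 1} -> {-1, +1}
unpacked = 2 * bits - 1

# Second-phase score: dot product between a (hypothetical) float query
# vector and the unpacked +/-1 document vector
q = np.random.RandomState(0).randn(16).astype(np.float32)
score = float(q @ unpacked)
```

The rescaling to ±1 is what makes the dot product with the float query vector a meaningful similarity: bits that agree with the query's sign add to the score, bits that disagree subtract.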
We define two tensor inputs: a compact binary representation used for the nearestNeighbor search, and a full float version used in ranking.

In \[16\]:

```
from vespa.package import RankProfile, FirstPhaseRanking, SecondPhaseRanking, Function

rerank = RankProfile(
    name="rerank",
    inputs=[
        ("query(q_binary)", "tensor(x[128])"),
        ("query(q_full)", "tensor(x[1024])"),
    ],
    functions=[
        Function(  # this returns a tensor(x[1024]) with values -1 or 1
            name="unpack_binary_representation",
            expression="2*unpack_bits(attribute(binary_vector)) -1",
        )
    ],
    first_phase=FirstPhaseRanking(
        # 1/(1 + hamming_distance), calculated between the binary query and binary_vector
        expression="closeness(field, binary_vector)"
    ),
    second_phase=SecondPhaseRanking(
        # re-rank using the dot product between the float query and the unpacked binary representation
        expression="sum( query(q_full)* unpack_binary_representation )",
        rerank_count=100,
    ),
    match_features=["distance(field, binary_vector)"],
)
my_schema.add_rank_profile(rerank)
```

## Deploy the application to Vespa Cloud[¶](#deploy-the-application-to-vespa-cloud)

With the application configured, we can deploy it to [Vespa Cloud](https://cloud.vespa.ai/en/). To deploy, we first need a tenant in the Vespa Cloud:

Create a tenant at [console.vespa-cloud.com](https://console.vespa-cloud.com/) (unless you already have one). This step requires a Google or GitHub account, and will start your [free trial](https://cloud.vespa.ai/en/free-trial). Make a note of the tenant name; it is used in the next steps.

> Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.

In \[22\]:

```
from vespa.deployment import VespaCloud
import os

# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
# Key is only used for CI/CD. Can be removed if logging in interactively
key = os.getenv("VESPA_TEAM_API_KEY", None)
if key is not None:
    key = key.replace(r"\n", "\n")  # To parse key correctly

vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=vespa_app_name,
    key_content=key,  # Key is only used for CI/CD. Can be removed if logging in interactively
    application_package=vespa_application_package,
)
```

Now deploy the app to the Vespa Cloud dev zone. The first deployment typically takes about 2 minutes until the endpoint is up.

In \[23\]:

```
from vespa.application import Vespa

app: Vespa = vespa_cloud.deploy()
```

```
Deployment started in run 1 of dev-aws-us-east-1c for samples.mixedbreadai. This may take a few minutes the first time.
INFO    [22:14:39]  Deploying platform version 8.322.22 and application dev build 1 for dev-aws-us-east-1c of default ...
INFO    [22:14:39]  Using CA signed certificate version 0
INFO    [22:14:46]  Using 1 nodes in container cluster 'mixedbreadai_container'
INFO    [22:15:18]  Session 2205 for tenant 'samples' prepared and activated.
INFO [22:15:21] ######## Details for all nodes ######## INFO [22:15:35] h90193a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP INFO [22:15:35] --- platform vespa/cloud-tenant-rhel8:8.322.22 <-- : INFO [22:15:35] --- logserver-container on port 4080 has not started INFO [22:15:35] --- metricsproxy-container on port 19092 has not started INFO [22:15:35] h90971b.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP INFO [22:15:35] --- platform vespa/cloud-tenant-rhel8:8.322.22 <-- : INFO [22:15:35] --- container-clustercontroller on port 19050 has not started INFO [22:15:35] --- metricsproxy-container on port 19092 has not started INFO [22:15:35] h91168a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP INFO [22:15:35] --- platform vespa/cloud-tenant-rhel8:8.322.22 <-- : INFO [22:15:35] --- storagenode on port 19102 has not started INFO [22:15:35] --- searchnode on port 19107 has not started INFO [22:15:35] --- distributor on port 19111 has not started INFO [22:15:35] --- metricsproxy-container on port 19092 has not started INFO [22:15:35] h91567a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP INFO [22:15:35] --- platform vespa/cloud-tenant-rhel8:8.322.22 <-- : INFO [22:15:35] --- container on port 4080 has not started INFO [22:15:35] --- metricsproxy-container on port 19092 has not started INFO [22:16:41] Waiting for convergence of 10 services across 4 nodes INFO [22:16:41] 1/1 nodes upgrading platform INFO [22:16:41] 2 application services still deploying DEBUG [22:16:41] h91567a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP DEBUG [22:16:41] --- platform vespa/cloud-tenant-rhel8:8.322.22 <-- : DEBUG [22:16:41] --- container on port 4080 has not started DEBUG [22:16:41] --- metricsproxy-container on port 19092 has not started INFO [22:17:11] Found endpoints: INFO [22:17:11] - dev.aws-us-east-1c INFO [22:17:11] |-- https://cf949f23.b8a7f611.z.vespa-app.cloud/ (cluster 
'mixedbreadai_container')
INFO    [22:17:12]  Installation succeeded!
Using mTLS (key,cert) Authentication against endpoint https://cf949f23.b8a7f611.z.vespa-app.cloud//ApplicationStatus
Application is up!
Finished deployment.
```

## Feed our sample documents and their binary embedding representation[¶](#feed-our-sample-documents-and-their-binary-embedding-representation)

With few documents, we use the synchronous API. Read more in [reads and writes](https://vespa-engine.github.io/pyvespa/reads-writes.md).

In \[24\]:

```
from vespa.io import VespaResponse

for i, doc in enumerate(documents):
    response: VespaResponse = app.feed_data_point(
        schema="doc",
        data_id=str(i),
        fields={
            "doc_id": str(i),
            "text": doc,
            "binary_vector": binary_embeddings[i].tolist(),
        },
    )
    assert response.is_successful()
```

### Querying data[¶](#querying-data)

Read more about querying Vespa in:

- [Vespa Query API](https://docs.vespa.ai/en/query-api.html)
- [Vespa Query API reference](https://docs.vespa.ai/en/reference/query-api-reference.html)
- [Vespa Query Language API (YQL)](https://docs.vespa.ai/en/query-language.html)
- [Practical Nearest Neighbor Search Guide](https://docs.vespa.ai/en/nearest-neighbor-search-guide.html)

In this case, we use [quantization.quantize_embeddings](https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.quantize_embeddings) after first obtaining the float version, to avoid running model inference twice.

In \[54\]:

```
query = "Who was Isac Newton?"

# This returns the float version
query_embedding_float = model.encode([query])
```

In \[ \]:

```
from sentence_transformers.quantization import quantize_embeddings

query_embedding_binary = quantize_embeddings(query_embedding_float, precision="binary")
```

Now, we use nearestNeighbor search to retrieve 100 hits (`targetHits`) using the configured distance-metric (hamming distance). The retrieved hits are exposed to the Vespa ranking framework, where we re-rank using the dot product between the float tensor and the unpacked binary vector.

In \[55\]:

```
response = app.query(
    yql="select * from doc where {targetHits:100}nearestNeighbor(binary_vector,q_binary)",
    ranking="rerank",
    body={
        "input.query(q_binary)": query_embedding_binary[0].tolist(),
        "input.query(q_full)": query_embedding_float[0].tolist(),
    },
)
assert response.is_successful()
```

In \[56\]:

```
import json

print(json.dumps(response.hits, indent=2))
```

```
[
  {
    "id": "id:doc:doc::2",
    "relevance": 177.8957977294922,
    "source": "mixedbreadai_content",
    "fields": {
      "matchfeatures": {
        "closeness(field,binary_vector)": 0.003484320557491289,
        "distance(field,binary_vector)": 286.0
      },
      "sddocname": "doc",
      "documentid": "id:doc:doc::2",
      "doc_id": "2",
      "text": "Isaac Newton was an English polymath active as a mathematician, physicist, astronomer, alchemist, theologian, and author who was described in his time as a natural philosopher."
    }
  },
  {
    "id": "id:doc:doc::1",
    "relevance": 144.52731323242188,
    "source": "mixedbreadai_content",
    "fields": {
      "matchfeatures": {
        "closeness(field,binary_vector)": 0.002890173410404624,
        "distance(field,binary_vector)": 345.0
      },
      "sddocname": "doc",
      "documentid": "id:doc:doc::1",
      "doc_id": "1",
      "text": "Albert Einstein was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time."
    }
  },
  {
    "id": "id:doc:doc::0",
    "relevance": 138.78799438476562,
    "source": "mixedbreadai_content",
    "fields": {
      "matchfeatures": {
        "closeness(field,binary_vector)": 0.00273224043715847,
        "distance(field,binary_vector)": 365.0
      },
      "sddocname": "doc",
      "documentid": "id:doc:doc::0",
      "doc_id": "0",
      "text": "Alan Turing was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist."
    }
  },
  {
    "id": "id:doc:doc::3",
    "relevance": 115.2405776977539,
    "source": "mixedbreadai_content",
    "fields": {
      "matchfeatures": {
        "closeness(field,binary_vector)": 0.002652519893899204,
        "distance(field,binary_vector)": 376.0
      },
      "sddocname": "doc",
      "documentid": "id:doc:doc::3",
      "doc_id": "3",
      "text": "Marie Curie was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity"
    }
  }
]
```

## Summary[¶](#summary)

Binary embeddings are an exciting development: they cut storage by 32x and speed up vector search, since hamming distance is much cheaper to compute than distance metrics like angular or euclidean.

### Clean up[¶](#clean-up)

We can now delete the cloud instance:

In \[ \]:

```
vespa_cloud.delete()
```

# BGE-M3 - The Mother of all embedding models[¶](#bge-m3-the-mother-of-all-embedding-models)

BAAI released BGE-M3 on January 30th, a new member of the BGE model series.
> M3 stands for Multi-linguality (100+ languages), Multi-granularities (input length up to 8192), Multi-Functionality (unification of dense, lexical, multi-vec (colbert) retrieval).

This notebook demonstrates how to use the [BGE-M3](https://github.com/FlagOpen/FlagEmbedding/blob/master/research/BGE_M3/BGE_M3.pdf) embeddings and represent all three embedding representations in Vespa! Vespa is the only scalable serving engine that can handle all M3 representations. This code is inspired by the README from the model hub [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3).

Let's get started! First, install dependencies:

In \[ \]:

```
!pip3 install -U pyvespa FlagEmbedding vespacli
```

### Explore the multiple representations of M3[¶](#explore-the-multiple-representations-of-m3)

When encoding text, we can ask for the representations we want:

- Sparse vectors with weights for the token IDs (from the multilingual tokenization process)
- Dense (DPR) regular text embeddings
- Multi-Dense (ColBERT) - contextualized multi-token vectors

Let us dive into it. To use this model on CPU, we set `use_fp16` to False; for GPU inference, `use_fp16=True` is recommended for accelerated inference.

In \[ \]:

```
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=False)
```

## A demo passage[¶](#a-demo-passage)

Let us encode a simple passage.

In \[3\]:

```
passage = [
    "BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction."
]
```

In \[ \]:
```
passage_embeddings = model.encode(
    passage, return_dense=True, return_sparse=True, return_colbert_vecs=True
)
```

In \[5\]:

```
passage_embeddings.keys()
```

Out\[5\]:

```
dict_keys(['dense_vecs', 'lexical_weights', 'colbert_vecs'])
```

## Defining the Vespa application[¶](#defining-the-vespa-application)

[PyVespa](https://vespa-engine.github.io/pyvespa/) helps us build the [Vespa application package](https://docs.vespa.ai/en/application-packages.html). A Vespa application package consists of configuration files, schemas, models, and code (plugins).

First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their types. We use Vespa [tensors](https://docs.vespa.ai/en/tensor-user-guide.html) to represent the three different M3 representations:

- A mapped tensor, denoted by `t{}`, represents the sparse lexical representation
- An indexed tensor, denoted by `x[1024]`, represents the dense single-vector representation of 1024 dimensions
- For the colbert_rep (multi-vector), we use a mixed tensor that combines a mapped and an indexed dimension. This mixed tensor allows us to represent a variable number of token vectors

Tensor fields can also use the `bfloat16` cell type, saving 50% storage compared to `float`.

In \[6\]:
```
from vespa.package import Schema, Document, Field, FieldSet

m_schema = Schema(
    name="m",
    document=Document(
        fields=[
            Field(name="id", type="string", indexing=["summary"]),
            Field(
                name="text",
                type="string",
                indexing=["summary", "index"],
                index="enable-bm25",
            ),
            Field(
                name="lexical_rep",
                type="tensor(t{})",
                indexing=["summary", "attribute"],
            ),
            Field(
                name="dense_rep",
                type="tensor(x[1024])",
                indexing=["summary", "attribute"],
                attribute=["distance-metric: angular"],
            ),
            Field(
                name="colbert_rep",
                type="tensor(t{}, x[1024])",
                indexing=["summary", "attribute"],
            ),
        ],
    ),
    fieldsets=[FieldSet(name="default", fields=["text"])],
)
```

The above defines our `m` schema with the original text and the three different representations.

In \[7\]:

```
from vespa.package import ApplicationPackage

vespa_app_name = "m"
vespa_application_package = ApplicationPackage(name=vespa_app_name, schema=[m_schema])
```

In the last step, we configure [ranking](https://docs.vespa.ai/en/ranking.html) by adding rank profiles to the schema.
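The ColBERT MaxSim operation that the ranking configuration expresses as a Vespa tensor computation can be sketched in plain numpy. This is an illustrative sketch with hypothetical small dimensions (not the real 1024), evaluated client-side only to show the math:

```python
import numpy as np

rng = np.random.RandomState(0)
q_colbert = rng.randn(3, 8).astype(np.float32)  # 3 query token vectors, toy dim 8
d_colbert = rng.randn(5, 8).astype(np.float32)  # 5 document token vectors

# Token-to-token dot products: shape (query tokens, document tokens)
sims = q_colbert @ d_colbert.T

# For each query token, keep the best-matching document token,
# sum over query tokens, and normalize by the query length
max_sim = float(sims.max(axis=1).sum() / q_colbert.shape[0])
```

Vespa evaluates the equivalent expression server-side per hit, with the reduce over the document token dimension and the sum over the query token dimension.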
We define three functions that implement the scoring for the three different representations:

- dense (dense cosine similarity)
- sparse (sparse dot product)
- max_sim (the ColBERT MaxSim operation)

Then, we combine these three scoring functions using a linear combination with weights, as suggested by the authors [here](https://github.com/FlagOpen/FlagEmbedding/blob/master/research/BGE_M3/BGE_M3.pdf#compute-score-for-text-pairs).

In \[8\]:

```
from vespa.package import RankProfile, Function, FirstPhaseRanking

semantic = RankProfile(
    name="m3hybrid",
    inputs=[
        ("query(q_dense)", "tensor(x[1024])"),
        ("query(q_lexical)", "tensor(t{})"),
        ("query(q_colbert)", "tensor(qt{}, x[1024])"),
        ("query(q_len_colbert)", "float"),
    ],
    functions=[
        Function(
            name="dense",
            expression="cosine_similarity(query(q_dense), attribute(dense_rep),x)",
        ),
        Function(
            name="lexical", expression="sum(query(q_lexical) * attribute(lexical_rep))"
        ),
        Function(
            name="max_sim",
            expression="sum(reduce(sum(query(q_colbert) * attribute(colbert_rep) , x),max, t),qt)/query(q_len_colbert)",
        ),
    ],
    first_phase=FirstPhaseRanking(
        expression="0.4*dense + 0.2*lexical + 0.4*max_sim", rank_score_drop_limit=0.0
    ),
    match_features=["dense", "lexical", "max_sim", "bm25(text)"],
)
m_schema.add_rank_profile(semantic)
```

The `m3hybrid` rank-profile above defines the query input embedding types and uses Vespa [tensor compute functions](https://docs.vespa.ai/en/reference/ranking-expressions.html#tensor-functions) to calculate the M3 similarities for the dense, lexical, and ColBERT (max_sim) representations. The profile defines a single ranking phase, combining the features linearly with the suggested weights.

Using [match-features](https://docs.vespa.ai/en/reference/schema-reference.html#match-features), Vespa returns the selected features along with each hit in the SERP (result page).

We also include BM25. We can view BM25 as a fourth dimension; especially for long-context retrieval, it can be helpful compared to the neural representations.

## Deploy the application to Vespa Cloud[¶](#deploy-the-application-to-vespa-cloud)

With the application configured, we can deploy it to [Vespa Cloud](https://cloud.vespa.ai/en/). To deploy, we first need a tenant in the Vespa Cloud:

Create a tenant at [console.vespa-cloud.com](https://console.vespa-cloud.com/) (unless you already have one). This step requires a Google or GitHub account, and will start your [free trial](https://cloud.vespa.ai/en/free-trial). Make a note of the tenant name; it is used in the next steps.

> Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.

In \[13\]:

```
from vespa.deployment import VespaCloud
import os

# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
# Key is only used for CI/CD. Can be removed if logging in interactively
key = os.getenv("VESPA_TEAM_API_KEY", None)
if key is not None:
    key = key.replace(r"\n", "\n")  # To parse key correctly

vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=vespa_app_name,
    key_content=key,  # Key is only used for CI/CD. Can be removed if logging in interactively
    application_package=vespa_application_package,
)
```

Now deploy the app to the Vespa Cloud dev zone. The first deployment typically takes about 2 minutes until the endpoint is up.

In \[14\]:

```
from vespa.application import Vespa

app: Vespa = vespa_cloud.deploy()
```

```
Deployment started in run 1 of dev-aws-us-east-1c for samples.m. This may take a few minutes the first time.
INFO    [22:13:09]  Deploying platform version 8.299.14 and application dev build 1 for dev-aws-us-east-1c of default ...
INFO    [22:13:10]  Using CA signed certificate version 0
INFO    [22:13:10]  Using 1 nodes in container cluster 'm_container'
INFO    [22:13:14]  Session 939 for tenant 'samples' prepared and activated.
INFO [22:13:17] ######## Details for all nodes ######## INFO [22:13:31] h88976d.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP INFO [22:13:31] --- platform vespa/cloud-tenant-rhel8:8.299.14 <-- : INFO [22:13:31] --- container-clustercontroller on port 19050 has not started INFO [22:13:31] --- metricsproxy-container on port 19092 has not started INFO [22:13:31] h89388b.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP INFO [22:13:31] --- platform vespa/cloud-tenant-rhel8:8.299.14 <-- : INFO [22:13:31] --- storagenode on port 19102 has not started INFO [22:13:31] --- searchnode on port 19107 has not started INFO [22:13:31] --- distributor on port 19111 has not started INFO [22:13:31] --- metricsproxy-container on port 19092 has not started INFO [22:13:31] h90001a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP INFO [22:13:31] --- platform vespa/cloud-tenant-rhel8:8.299.14 <-- : INFO [22:13:31] --- logserver-container on port 4080 has not started INFO [22:13:31] --- metricsproxy-container on port 19092 has not started INFO [22:13:31] h90550a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP INFO [22:13:31] --- platform vespa/cloud-tenant-rhel8:8.299.14 <-- : INFO [22:13:31] --- container on port 4080 has not started INFO [22:13:31] --- metricsproxy-container on port 19092 has not started INFO [22:14:31] Found endpoints: INFO [22:14:31] - dev.aws-us-east-1c INFO [22:14:31] |-- https://d29bf3e7.f064e220.z.vespa-app.cloud/ (cluster 'm_container') INFO [22:14:32] Installation succeeded! Using mTLS (key,cert) Authentication against endpoint https://d29bf3e7.f064e220.z.vespa-app.cloud//ApplicationStatus Application is up! Finished deployment. ``` ## Feed the M3 representations[¶](#feed-the-m3-representations) We convert the three different representations to Vespa feed format In \[15\]: Copied! 
```
vespa_fields = {
    "text": passage[0],
    "lexical_rep": {
        key: float(value)
        for key, value in passage_embeddings["lexical_weights"][0].items()
    },
    "dense_rep": passage_embeddings["dense_vecs"][0].tolist(),
    "colbert_rep": {
        index: passage_embeddings["colbert_vecs"][0][index].tolist()
        for index in range(passage_embeddings["colbert_vecs"][0].shape[0])
    },
}
```

In \[17\]:

```
from vespa.io import VespaResponse

response: VespaResponse = app.feed_data_point(
    schema="m", data_id=0, fields=vespa_fields
)
assert response.is_successful()
```

### Querying data[¶](#querying-data)

Now, we can also query our data. Read more about querying Vespa in:

- [Vespa Query API](https://docs.vespa.ai/en/query-api.html)
- [Vespa Query API reference](https://docs.vespa.ai/en/reference/query-api-reference.html)
- [Vespa Query Language API (YQL)](https://docs.vespa.ai/en/query-language.html)

In \[ \]:

```
query = ["What is BGE M3?"]
query_embeddings = model.encode(
    query, return_dense=True, return_sparse=True, return_colbert_vecs=True
)
```

The M3 ColBERT scoring function needs the query length to normalize the score to the range 0 to 1. This helps when combining the score with the other scoring functions.

In \[19\]:
```
query_length = query_embeddings["colbert_vecs"][0].shape[0]
```

```
query_fields = {
    "input.query(q_lexical)": {
        key: float(value)
        for key, value in query_embeddings["lexical_weights"][0].items()
    },
    "input.query(q_dense)": query_embeddings["dense_vecs"][0].tolist(),
    "input.query(q_colbert)": str(
        {
            index: query_embeddings["colbert_vecs"][0][index].tolist()
            for index in range(query_embeddings["colbert_vecs"][0].shape[0])
        }
    ),
    "input.query(q_len_colbert)": query_length,
}
```

```
from vespa.io import VespaQueryResponse
import json

response: VespaQueryResponse = app.query(
    yql="select id, text from m where userQuery() or ({targetHits:10}nearestNeighbor(dense_rep,q_dense))",
    ranking="m3hybrid",
    query=query[0],
    body={**query_fields},
)
assert response.is_successful()
print(json.dumps(response.hits[0], indent=2))
```

```
{
  "id": "index:m_content/0/cfcd2084234135f700f08abf",
  "relevance": 0.5993361056332731,
  "source": "m_content",
  "fields": {
    "matchfeatures": {
      "bm25(text)": 0.8630462173553426,
      "dense": 0.6258970723760484,
      "lexical": 0.1941967010498047,
      "max_sim": 0.7753448411822319
    },
    "text": "BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction."
  }
}
```

Notice the `matchfeatures` field, which returns the configured match-features from the rank-profile. We can use these to compare the torch model scoring with the computations specified in Vespa.

Now we can compare the Vespa-computed scores with the torch model code and see that they line up closely:

```
model.compute_lexical_matching_score(
    passage_embeddings["lexical_weights"][0], query_embeddings["lexical_weights"][0]
)
```

```
0.19554455392062664
```

```
query_embeddings["dense_vecs"][0] @ passage_embeddings["dense_vecs"][0].T
```

```
0.6259037
```

```
model.colbert_score(
    query_embeddings["colbert_vecs"][0], passage_embeddings["colbert_vecs"][0]
)
```

```
tensor(0.7797)
```

### That is it![¶](#that-is-it)

That is how easy it is to represent the M3 FlagEmbedding representations in Vespa! Read more in the [M3 technical report](https://github.com/FlagOpen/FlagEmbedding/blob/master/research/BGE_M3/BGE_M3.pdf).

We can now delete the Vespa Cloud instance we deployed:

```
vespa_cloud.delete()
```

# Multi-vector indexing with HNSW[¶](#multi-vector-indexing-with-hnsw)

These are the pyvespa steps of the multi-vector-indexing sample application. Go to the [source](https://github.com/vespa-engine/sample-apps/tree/master/multi-vector-indexing) for a full description and prerequisites, and read the [blog post](https://blog.vespa.ai/semantic-search-with-multi-vector-indexing/).
Highlighted features:

- Approximate Nearest Neighbor Search - using HNSW or exact
- Use a Component to configure the Huggingface embedder.
- Using synthetic fields with auto-generated [embeddings](https://docs.vespa.ai/en/embedding.html) in data and query flow.
- Application package file export, model files in the application package, deployment from files.
- [Multiphased ranking](https://docs.vespa.ai/en/phased-ranking.html).
- How to control text search result highlighting.

For simpler examples, see [text search](https://vespa-engine.github.io/pyvespa/getting-started-pyvespa.md) and [pyvespa examples](https://vespa-engine.github.io/pyvespa/examples/pyvespa-examples.md).

Pyvespa is an add-on to Vespa, and this guide will export the application package containing `services.xml` and `wiki.sd`. The latter is the schema file for this application - knowing `services.xml` and schema files is useful when reading the Vespa documentation.

Refer to [troubleshooting](https://vespa-engine.github.io/pyvespa/troubleshooting.md) for any problems when running this guide.

This notebook requires [pyvespa >= 0.37.1](https://vespa-engine.github.io/pyvespa/index.md#requirements), ZSTD, and the [Vespa CLI](https://docs.vespa.ai/en/vespa-cli.html).

```
!pip3 install pyvespa
```

## Create the application[¶](#create-the-application)

Configure the Vespa instance with a component loading the E5-small model. Components are used to plug in code and models to a Vespa application - [read more](https://docs.vespa.ai/en/jdisc/container-components.html):
```
from vespa.package import (
    ApplicationPackage,
    Component,
    Parameter,
    Field,
    HNSW,
    RankProfile,
    Function,
    FirstPhaseRanking,
    SecondPhaseRanking,
    FieldSet,
    DocumentSummary,
    Summary,
)
from pathlib import Path
import json

app_package = ApplicationPackage(
    name="wiki",
    components=[
        Component(
            id="e5-small-q",
            type="hugging-face-embedder",
            parameters=[
                Parameter("transformer-model", {"path": "model/e5-small-v2-int8.onnx"}),
                Parameter("tokenizer-model", {"path": "model/tokenizer.json"}),
            ],
        )
    ],
)
```

## Configure fields[¶](#configure-fields)

Vespa has a variety of basic and complex [field types](https://docs.vespa.ai/en/reference/schema-reference.html#field). This application uses a combination of integer, text and tensor fields, making it easy to implement hybrid ranking use cases:
```
app_package.schema.add_fields(
    Field(name="id", type="int", indexing=["attribute", "summary"]),
    Field(
        name="title", type="string", indexing=["index", "summary"], index="enable-bm25"
    ),
    Field(
        name="url", type="string", indexing=["index", "summary"], index="enable-bm25"
    ),
    Field(
        name="paragraphs",
        type="array<string>",
        indexing=["index", "summary"],
        index="enable-bm25",
        bolding=True,
    ),
    Field(
        name="paragraph_embeddings",
        type="tensor(p{},x[384])",
        indexing=["input paragraphs", "embed", "index", "attribute"],
        ann=HNSW(distance_metric="angular"),
        is_document_field=False,
    ),
    # Alternatively, for exact distance calculation not using HNSW:
    #
    # Field(name="paragraph_embeddings", type="tensor(p{},x[384])",
    #       indexing=["input paragraphs", "embed", "attribute"],
    #       attribute=["distance-metric: angular"],
    #       is_document_field=False)
)
```

One field of particular interest is `paragraph_embeddings`. Note that we are *not* feeding embeddings to this instance. Instead, the embeddings are generated by using the [embed](https://docs.vespa.ai/en/embedding.html) feature, with the model configured at start.
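To build intuition for the `tensor(p{},x[384])` type: it is a mixed tensor that maps each paragraph label (the mapped `p{}` dimension) to one 384-dimensional vector (the indexed `x[384]` dimension). A toy sketch of that shape in plain Python - using our own helper names, 4 dimensions instead of 384, and a fake stand-in for the embedder:

```
# Toy sketch of a tensor(p{},x[4]) value: one labeled vector per paragraph.
paragraphs = ["first paragraph", "second paragraph", "third paragraph"]


def fake_embed(text: str) -> list:
    # Stand-in for the real embedder; in Vespa this step is performed
    # server-side by the `embed` indexing expression.
    return [float(len(text)), 1.0, 0.0, 0.0]


paragraph_embeddings = {str(i): fake_embed(p) for i, p in enumerate(paragraphs)}
```

The `closest(paragraph_embeddings)` rank feature used later in this guide returns the label of the best-matching vector in exactly this keyed form, e.g. `{"4": 1.0}`.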
Read more in [Text embedding made simple](https://blog.vespa.ai/text-embedding-made-simple/).

Looking closely at the code, `paragraph_embeddings` uses `is_document_field=False`, meaning it will read another field as input (here `paragraphs`) and run `embed` on it. As only one model is configured, `embed` will use that one - it is possible to configure more models and use `embed model-id` as well.

As the code comment illustrates, different distance metrics can be used, as well as an *exact* or *approximate* nearest neighbor search.

## Configure rank profiles[¶](#configure-rank-profiles)

A rank profile defines the computation for the ranking, with a wide range of possible features as input. Below you will find `first_phase` ranking using text ranking (`bm25`), semantic ranking using vector distance (consider a tensor a vector here), and combinations of the two:

```
app_package.schema.add_rank_profile(
    RankProfile(
        name="semantic",
        inputs=[("query(q)", "tensor(x[384])")],
        inherits="default",
        first_phase="cos(distance(field,paragraph_embeddings))",
        match_features=["closest(paragraph_embeddings)"],
    )
)
app_package.schema.add_rank_profile(
    RankProfile(name="bm25", first_phase="2*bm25(title) + bm25(paragraphs)")
)
app_package.schema.add_rank_profile(
    RankProfile(
        name="hybrid",
        inherits="semantic",
        functions=[
            Function(
                name="avg_paragraph_similarity",
                expression="""reduce(
                    sum(l2_normalize(query(q),x) * l2_normalize(attribute(paragraph_embeddings),x),x),
                    avg,
                    p
                )""",
            ),
            Function(
                name="max_paragraph_similarity",
                expression="""reduce(
                    sum(l2_normalize(query(q),x) * l2_normalize(attribute(paragraph_embeddings),x),x),
                    max,
                    p
                )""",
            ),
            Function(
                name="all_paragraph_similarities",
                expression="sum(l2_normalize(query(q),x) * l2_normalize(attribute(paragraph_embeddings),x),x)",
            ),
        ],
        first_phase=FirstPhaseRanking(
            expression="cos(distance(field,paragraph_embeddings))"
        ),
        second_phase=SecondPhaseRanking(
            expression="firstPhase + avg_paragraph_similarity() + log( bm25(title) + bm25(paragraphs) + bm25(url))"
        ),
        match_features=[
            "closest(paragraph_embeddings)",
            "firstPhase",
            "bm25(title)",
            "bm25(paragraphs)",
            "avg_paragraph_similarity",
            "max_paragraph_similarity",
            "all_paragraph_similarities",
        ],
    )
)
```

## Configure fieldset[¶](#configure-fieldset)

A [fieldset](https://docs.vespa.ai/en/reference/schema-reference.html#fieldset) is a way to configure search in multiple fields:
```
app_package.schema.add_field_set(
    FieldSet(name="default", fields=["title", "url", "paragraphs"])
)
```

## Configure document summary[¶](#configure-document-summary)

A [document summary](https://docs.vespa.ai/en/document-summaries.html) is the collection of fields to return in query results - the default summary is used unless otherwise specified in the query. Here we configure a `minimal` summary without the larger paragraph text/embedding fields:

```
app_package.schema.add_document_summary(
    DocumentSummary(
        name="minimal",
        summary_fields=[Summary("id", "int"), Summary("title", "string")],
    )
)
```

## Export the configuration[¶](#export-the-configuration)

At this point, the application is well defined. Remember that the Component configuration at start expects the model files to be found in a `model` directory. We must therefore export the configuration and add the models before we can deploy to the Vespa instance.

Export the [application package](https://docs.vespa.ai/en/application-packages.html):

```
Path("pkg").mkdir(parents=True, exist_ok=True)
app_package.to_files("pkg")
```

It is a good idea to inspect the files exported into `pkg` - these are the files referred to in the [Vespa Documentation](https://docs.vespa.ai/).

## Download model files[¶](#download-model-files)

At this point, we can save the model files into the application package:

```
! mkdir -p pkg/model
! curl -L -o pkg/model/tokenizer.json \
  https://raw.githubusercontent.com/vespa-engine/sample-apps/master/examples/model-exporting/model/tokenizer.json
! curl -L -o pkg/model/e5-small-v2-int8.onnx \
  https://github.com/vespa-engine/sample-apps/raw/master/examples/model-exporting/model/e5-small-v2-int8.onnx
```

```
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  694k  100  694k    0     0  2473k      0 --:--:-- --:--:-- --:--:-- 2508k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 32.3M  100 32.3M    0     0  27.1M      0  0:00:01  0:00:01 --:--:-- 53.0M
```

## Deploy the application[¶](#deploy-the-application)

As all the files in the app package are ready, we can start a Vespa instance - here using Docker. Deploy the app package:

```
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy_from_disk(application_name="wiki", application_root="pkg")
```

```
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Application is up!
Finished deployment.
```

## Feed documents[¶](#feed-documents)

Download the Wikipedia articles:

```
! curl -s -H "Accept:application/vnd.github.v3.raw" \
  https://api.github.com/repos/vespa-engine/sample-apps/contents/multi-vector-indexing/ext/articles.jsonl.zst | \
  zstdcat - > articles.jsonl
```

If you do not have ZSTD installed, get `articles.jsonl.zip` and unzip it instead.

Feed and index the Wikipedia articles using the [Vespa CLI](https://docs.vespa.ai/en/vespa-cli.html). As part of feeding, `embed` is called on each article, and the output of this is stored in the `paragraph_embeddings` field:

```
! vespa config set target local
! vespa feed articles.jsonl
```

```
{
  "feeder.seconds": 1.448,
  "feeder.ok.count": 8,
  "feeder.ok.rate": 5.524,
  "feeder.error.count": 0,
  "feeder.inflight.count": 0,
  "http.request.count": 8,
  "http.request.bytes": 12958,
  "http.request.MBps": 0.009,
  "http.exception.count": 0,
  "http.response.count": 8,
  "http.response.bytes": 674,
  "http.response.MBps": 0.000,
  "http.response.error.count": 0,
  "http.response.latency.millis.min": 728,
  "http.response.latency.millis.avg": 834,
  "http.response.latency.millis.max": 1446,
  "http.response.code.counts": {
    "200": 8
  }
}
```

Note that creating embeddings is computationally expensive, but this is a small dataset with only 8 articles, so it will be done in a few seconds.

The Vespa instance is now populated with the Wikipedia articles, with generated embeddings, and ready for queries. The next sections have examples of various kinds of queries to run on the dataset.

## Simple retrieve all articles with undefined ranking[¶](#simple-retrieve-all-articles-with-undefined-ranking)

Run a query selecting *all* documents, returning two of them.
The rank profile is the built-in `unranked`, which means no ranking calculations are done and the results are returned in no particular order:

```
from vespa.io import VespaQueryResponse

result: VespaQueryResponse = app.query(
    body={
        "yql": "select * from wiki where true",
        "ranking.profile": "unranked",
        "hits": 2,
    }
)
if not result.is_successful():
    raise ValueError(result.get_json())
if len(result.hits) != 2:
    raise ValueError("Expected 2 hits, got {}".format(len(result.hits)))
print(json.dumps(result.hits, indent=4))
```

## Traditional keyword search with BM25 ranking on the article level[¶](#traditional-keyword-search-with-bm25-ranking-on-the-article-level)

Run a text-search query and use the [bm25](https://docs.vespa.ai/en/reference/bm25.html) rank profile configured at the start of this guide: `2*bm25(title) + bm25(paragraphs)`. Here, we use BM25 on the `title` and `paragraphs` text fields, giving more weight to matches in the title:
```
result = app.query(
    body={
        "yql": "select * from wiki where userQuery()",
        "query": 24,
        "ranking.profile": "bm25",
        "hits": 2,
    }
)
if len(result.hits) != 2:
    raise ValueError("Expected 2 hits, got {}".format(len(result.hits)))
print(json.dumps(result.hits, indent=4))
```

## Semantic vector search on the paragraph level[¶](#semantic-vector-search-on-the-paragraph-level)

This query creates an embedding of the query "what does 24 mean in the context of railways" and specifies the `semantic` rank profile: `cos(distance(field,paragraph_embeddings))`. This will hence compute the distance between the vector in the query and the vectors computed when indexing: `"input paragraphs", "embed", "index", "attribute"`:
```
result = app.query(
    body={
        "yql": "select * from wiki where {targetHits:2}nearestNeighbor(paragraph_embeddings,q)",
        "input.query(q)": "embed(what does 24 mean in the context of railways)",
        "ranking.profile": "semantic",
        "presentation.format.tensors": "short-value",
        "hits": 2,
    }
)
if len(result.hits) != 2:
    raise ValueError("Expected 2 hits, got {}".format(len(result.hits)))
print(json.dumps(result.hits, indent=4))
```

``` [ { "id": "id:wikipedia:wiki::9985", "relevance": 0.8807156260391702, "source": "wiki_content", "fields": { "matchfeatures": { "closest(paragraph_embeddings)": { "4": 1.0 } }, "sddocname": "wiki", "paragraphs": [ "The 24-hour clock is a way of telling the time in which the day runs from midnight to midnight and is divided into 24 hours, numbered from 0 to 23. It does not use a.m. or p.m. This system is also referred to (only in the US and the English speaking parts of Canada) as military time or (only in the United Kingdom and now very rarely) as continental time. In some parts of the world, it is called railway time. Also, the international standard notation of time (ISO 8601) is based on this format.", "A time in the 24-hour clock is written in the form hours:minutes (for example, 01:23), or hours:minutes:seconds (01:23:45). Numbers under 10 have a zero in front (called a leading zero); e.g. 09:07. Under the 24-hour clock system, the day begins at midnight, 00:00, and the last minute of the day begins at 23:59 and ends at 24:00, which is identical to 00:00 of the following day. 12:00 can only be mid-day.
Midnight is called 24:00 and is used to mean the end of the day and 00:00 is used to mean the beginning of the day. For example, you would say \"Tuesday at 24:00\" and \"Wednesday at 00:00\" to mean exactly the same time.", "However, the US military prefers not to say 24:00 - they do not like to have two names for the same thing, so they always say \"23:59\", which is one minute before midnight.", "24-hour clock time is used in computers, military, public safety, and transport. In many Asian, European and Latin American countries people use it to write the time. Many European people use it in speaking.", "In railway timetables 24:00 means the \"end\" of the day. For example, a train due to arrive at a station during the last minute of a day arrives at 24:00; but trains which depart during the first minute of the day go at 00:00." ], "documentid": "id:wikipedia:wiki::9985", "title": "24-hour clock", "url": "https://simple.wikipedia.org/wiki?curid=9985" } }, { "id": "id:wikipedia:wiki::59079", "relevance": 0.7972394509946005, "source": "wiki_content", "fields": { "matchfeatures": { "closest(paragraph_embeddings)": { "4": 1.0 } }, "sddocname": "wiki", "paragraphs": [ "Logic gates are digital components. They normally work at only two levels of voltage, a positive level and zero level. Commonly they work based on two states: \"On\" and \"Off\". In the On state, voltage is positive. In the Off state, the voltage is at zero. The On state usually uses a voltage in the range of 3.5 to 5 volts. This range can be lower for some uses.", "Logic gates compare the state at their inputs to decide what the state at their output should be. A logic gate is \"on\" or active when its rules are correctly met. At this time, electricity is flowing through the gate and the voltage at its output is at the level of its On state.", "Logic gates are electronic versions of Boolean logic. Truth tables will tell you what the output will be, depending on the inputs.", "AND gates have two inputs. 
The output of an AND gate is on only if both inputs are on. If at least one of the inputs is off, the output will be off.", "Using the image at the right, if \"A\" and \"B\" are both in an On state, the output (out) will be an On state. If either \"A\" or \"B\" is in an Off state, the output will also be in an Off state. \"A\" and \"B\" must be On for the output to be On.", "OR gates have two inputs. The output of an OR gate will be on if at least one of the inputs are on. If both inputs are off, the output will be off.", "Using the image at the right, if either \"A\" or \"B\" is On, the output (\"out\") will also be On. If both \"A\" and \"B\" are Off, the output will be Off.", "The NOT logic gate has only one input. If the input is On then the output will be Off. In other words, the NOT logic gate changes the signal from On to Off or from Off to On. It is sometimes called an inverter.", "XOR (\"exclusive or\") gates have two inputs. The output of a XOR gate will be true only if the two inputs are different from each other. If both inputs are the same, the output will be off.", "NAND means not both. It is called NAND because it means \"not and.\" This means that it will always output true unless both inputs are on.", "XNOR means \"not exclusive or.\" This means that it will only output true if both inputs are the same. It is the opposite of a XOR logic gate." ], "documentid": "id:wikipedia:wiki::59079", "title": "Logic gate", "url": "https://simple.wikipedia.org/wiki?curid=59079" } } ] ```

An interesting question then is: of the paragraphs in the document, which one was the closest? When analysing ranking, using [match-features](https://docs.vespa.ai/en/reference/schema-reference.html#match-features) lets you export the scores used in the ranking calculations - see `closest(paragraph_embeddings)` in the result above:

```
"matchfeatures": {
    "closest(paragraph_embeddings)": {
        "4": 1.0
    }
}
```

This means the paragraph at index 4 is the closest match.
With this, it is straightforward to feed articles with an array of paragraphs and highlight the best matching paragraph in the document!

```
def find_best_paragraph(hit: dict) -> str:
    paragraphs = hit["fields"]["paragraphs"]
    match_features = hit["fields"]["matchfeatures"]
    index = int(list(match_features["closest(paragraph_embeddings)"].keys())[0])
    return paragraphs[index]
```

```
find_best_paragraph(result.hits[0])
```

```
'In railway timetables 24:00 means the "end" of the day. For example, a train due to arrive at a station during the last minute of a day arrives at 24:00; but trains which depart during the first minute of the day go at 00:00.'
```

## Hybrid search and ranking[¶](#hybrid-search-and-ranking)

Hybrid search, combining keyword search on the article level with vector search in the paragraph index:
```
result = app.query(
    body={
        "yql": "select * from wiki where userQuery() or ({targetHits:1}nearestNeighbor(paragraph_embeddings,q))",
        "input.query(q)": "embed(what does 24 mean in the context of railways)",
        "query": "what does 24 mean in the context of railways",
        "ranking.profile": "hybrid",
        "presentation.format.tensors": "short-value",
        "hits": 1,
    }
)
if len(result.hits) != 1:
    raise ValueError("Expected 1 hit, got {}".format(len(result.hits)))
print(json.dumps(result.hits, indent=4))
```

``` [ { "id": "id:wikipedia:wiki::9985", "relevance": 4.163399168193791, "source": "wiki_content", "fields": { "matchfeatures": { "bm25(paragraphs)": 10.468827250036052, "bm25(title)": 1.1272217840066168, "closest(paragraph_embeddings)": { "4": 1.0 }, "firstPhase": 0.8807156260391702, "all_paragraph_similarities": { "1": 0.8030083179473877, "2": 0.7992785573005676, "3": 0.8273358345031738, "4": 0.8807156085968018, "0": 0.849757194519043 }, "avg_paragraph_similarity": 0.8320191025733947, "max_paragraph_similarity": 0.8807156085968018 }, "sddocname": "wiki", "paragraphs": [ "The 24-hour clock is a way of telling the time in which the day runs from midnight to midnight and is divided into 24 hours, numbered from 0 to 23. It does not use a.m. or p.m. This system is also referred to (only in the US and the English speaking parts of Canada) as military time or (only in the United Kingdom and now very rarely) as continental time. In some parts of the world, it is called railway time.
Also, the international standard notation of time (ISO 8601) is based on this format.", "A time in the 24-hour clock is written in the form hours:minutes (for example, 01:23), or hours:minutes:seconds (01:23:45). Numbers under 10 have a zero in front (called a leading zero); e.g. 09:07. Under the 24-hour clock system, the day begins at midnight, 00:00, and the last minute of the day begins at 23:59 and ends at 24:00, which is identical to 00:00 of the following day. 12:00 can only be mid-day. Midnight is called 24:00 and is used to mean the end of the day and 00:00 is used to mean the beginning of the day. For example, you would say \"Tuesday at 24:00\" and \"Wednesday at 00:00\" to mean exactly the same time.", "However, the US military prefers not to say 24:00 - they do not like to have two names for the same thing, so they always say \"23:59\", which is one minute before midnight.", "24-hour clock time is used in computers, military, public safety, and transport. In many Asian, European and Latin American countries people use it to write the time. Many European people use it in speaking.", "In railway timetables 24:00 means the \"end\" of the day. For example, a train due to arrive at a station during the last minute of a day arrives at 24:00; but trains which depart during the first minute of the day go at 00:00." ], "documentid": "id:wikipedia:wiki::9985", "title": "24-hour clock", "url": "https://simple.wikipedia.org/wiki?curid=9985" } } ] ```

This case combines exact search with nearestNeighbor search. The `hybrid` rank-profile above also calculates several additional features using [tensor expressions](https://docs.vespa.ai/en/tensor-user-guide.html):

- `firstPhase` is the score of the first ranking phase, configured in the hybrid profile as `cos(distance(field, paragraph_embeddings))`.
- `all_paragraph_similarities` returns all the similarity scores for all paragraphs.
- `avg_paragraph_similarity` is the average similarity score across all the paragraphs.
- `max_paragraph_similarity` is the same as `firstPhase`, but computed using a tensor expression.

These additional features are calculated during [second-phase ranking](https://docs.vespa.ai/en/phased-ranking.html) to limit the number of vector computations. The [Tensor Playground](https://docs.vespa.ai/playground/) is useful for playing with tensor expressions. The [Hybrid Search](https://blog.vespa.ai/improving-zero-shot-ranking-with-vespa/) blog post series is a good read to learn more about hybrid ranking!

```
def find_paragraph_scores(hit: dict) -> list:
    paragraphs = hit["fields"]["paragraphs"]
    match_features = hit["fields"]["matchfeatures"]
    indexes = [int(v) for v in match_features["all_paragraph_similarities"]]
    scores = list(match_features["all_paragraph_similarities"].values())
    return list(zip([paragraphs[i] for i in indexes], scores))
```

```
find_paragraph_scores(result.hits[0])
```

``` [('A time in the 24-hour clock is written in the form hours:minutes (for example, 01:23), or hours:minutes:seconds (01:23:45). Numbers under 10 have a zero in front (called a leading zero); e.g. 09:07. Under the 24-hour clock system, the day begins at midnight, 00:00, and the last minute of the day begins at 23:59 and ends at 24:00, which is identical to 00:00 of the following day. 12:00 can only be mid-day. Midnight is called 24:00 and is used to mean the end of the day and 00:00 is used to mean the beginning of the day.
For example, you would say "Tuesday at 24:00" and "Wednesday at 00:00" to mean exactly the same time.', 0.8030083179473877), ('However, the US military prefers not to say 24:00 - they do not like to have two names for the same thing, so they always say "23:59", which is one minute before midnight.', 0.7992785573005676), ('24-hour clock time is used in computers, military, public safety, and transport. In many Asian, European and Latin American countries people use it to write the time. Many European people use it in speaking.', 0.8273358345031738), ('In railway timetables 24:00 means the "end" of the day. For example, a train due to arrive at a station during the last minute of a day arrives at 24:00; but trains which depart during the first minute of the day go at 00:00.', 0.8807156085968018), ('The 24-hour clock is a way of telling the time in which the day runs from midnight to midnight and is divided into 24 hours, numbered from 0 to 23. It does not use a.m. or p.m. This system is also referred to (only in the US and the English speaking parts of Canada) as military time or (only in the United Kingdom and now very rarely) as continental time. In some parts of the world, it is called railway time. Also, the international standard notation of time (ISO 8601) is based on this format.', 0.849757194519043)]
```

## Hybrid search and filter[¶](#hybrid-search-and-filter)

YQL is a structured query language. In the query examples, the user input is fed as-is using the `userQuery()` operator. Filters are normally kept separate from the user input; below is an example of adding a filter `url contains "9985"` to the YQL string. Finally, use the [Query API](https://docs.vespa.ai/en/query-api.html) for other options, like highlighting - here we disable [bolding](https://docs.vespa.ai/en/reference/schema-reference.html#bolding):

In [25]:
```
result = app.query(
    body={
        "yql": 'select * from wiki where url contains "9985" and userQuery() or ({targetHits:1}nearestNeighbor(paragraph_embeddings,q))',
        "input.query(q)": "embed(what does 24 mean in the context of railways)",
        "query": "what does 24 mean in the context of railways",
        "ranking.profile": "hybrid",
        "bolding": False,
        "presentation.format.tensors": "short-value",
        "hits": 1,
    }
)
if len(result.hits) != 1:
    raise ValueError("Expected one hit, got {}".format(len(result.hits)))
print(json.dumps(result.hits, indent=4))
```

```
[ { "id": "id:wikipedia:wiki::9985", "relevance": 4.307079208249452, "source": "wiki_content", "fields": { "matchfeatures": { "bm25(paragraphs)": 10.468827250036052, "bm25(title)": 1.1272217840066168, "closest(paragraph_embeddings)": { "type": "tensor(p{})", "cells": { "4": 1.0 } }, "firstPhase": 0.8807156260391702, "all_paragraph_similarities": { "type": "tensor(p{})", "cells": { "1": 0.8030083179473877, "2": 0.7992785573005676, "3": 0.8273358345031738, "4": 0.8807156085968018, "0": 0.849757194519043 } }, "avg_paragraph_similarity": 0.8320191025733947, "max_paragraph_similarity": 0.8807156085968018 }, "sddocname": "wiki", "paragraphs": [ "The 24-hour clock is a way of telling the time in which the day runs from midnight to midnight and is divided into 24 hours, numbered from 0 to 23. It does not use a.m. or p.m.
This system is also referred to (only in the US and the English speaking parts of Canada) as military time or (only in the United Kingdom and now very rarely) as continental time. In some parts of the world, it is called railway time. Also, the international standard notation of time (ISO 8601) is based on this format.", "A time in the 24-hour clock is written in the form hours:minutes (for example, 01:23), or hours:minutes:seconds (01:23:45). Numbers under 10 have a zero in front (called a leading zero); e.g. 09:07. Under the 24-hour clock system, the day begins at midnight, 00:00, and the last minute of the day begins at 23:59 and ends at 24:00, which is identical to 00:00 of the following day. 12:00 can only be mid-day. Midnight is called 24:00 and is used to mean the end of the day and 00:00 is used to mean the beginning of the day. For example, you would say \"Tuesday at 24:00\" and \"Wednesday at 00:00\" to mean exactly the same time.", "However, the US military prefers not to say 24:00 - they do not like to have two names for the same thing, so they always say \"23:59\", which is one minute before midnight.", "24-hour clock time is used in computers, military, public safety, and transport. In many Asian, European and Latin American countries people use it to write the time. Many European people use it in speaking.", "In railway timetables 24:00 means the \"end\" of the day. For example, a train due to arrive at a station during the last minute of a day arrives at 24:00; but trains which depart during the first minute of the day go at 00:00." ], "documentid": "id:wikipedia:wiki::9985", "title": "24-hour clock", "url": "https://simple.wikipedia.org/wiki?curid=9985" } } ] ``` In short, the above query demonstrates how easy it is to combine various ranking strategies, and also combine with filters. 
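The second-phase features returned in `matchfeatures` above can be sanity-checked by hand. A minimal sketch in plain Python, using the `all_paragraph_similarities` cells from the hit above; that the profile reduces over the paragraph dimension with `avg` and `max` is our reading of the feature names, not code shown here:

```
# Recompute avg_paragraph_similarity and max_paragraph_similarity from the
# "all_paragraph_similarities" cells returned in matchfeatures above.
cells = {
    "0": 0.849757194519043,
    "1": 0.8030083179473877,
    "2": 0.7992785573005676,
    "3": 0.8273358345031738,
    "4": 0.8807156085968018,
}

# Equivalent of reduce(similarities, avg, p) over the mapped dimension p
avg_paragraph_similarity = sum(cells.values()) / len(cells)
# Equivalent of reduce(similarities, max, p)
max_paragraph_similarity = max(cells.values())

print(round(avg_paragraph_similarity, 6))  # ~0.832019, as reported in the hit
print(max_paragraph_similarity)  # 0.8807156085968018, matching firstPhase up to float precision
```

This confirms that `avg_paragraph_similarity` and `max_paragraph_similarity` are simple reductions over the per-paragraph similarity tensor.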
To learn more about pre-filtering vs post-filtering, read [Filtering strategies and serving performance](https://blog.vespa.ai/constrained-approximate-nearest-neighbor-search/). [Semantic search with multi-vector indexing](https://blog.vespa.ai/semantic-search-with-multi-vector-indexing/) is a great overall read for this domain.

## Cleanup[¶](#cleanup)

In [26]:

```
vespa_docker.container.stop()
vespa_docker.container.remove()
```

# Multilingual Hybrid Search with Cohere binary embeddings and Vespa[¶](#multilingual-hybrid-search-with-cohere-binary-embeddings-and-vespa)

Cohere just released a new embedding API supporting binary vectors. Read the announcement in the blog post: [Cohere int8 & binary Embeddings - Scale Your Vector Database to Large Datasets](https://cohere.com/blog/int8-binary-embeddings).

> We are excited to announce that Cohere Embed is the first embedding model that natively supports int8 and binary embeddings.

This notebook demonstrates:

- Building a multilingual search application over a sample of the German split of Wikipedia using [binarized cohere embeddings](https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3-int8-binary)
- Indexing multiple binary embeddings per document, without having to split the chunks across multiple retrievable units
- Hybrid search, combining the lexical matching capabilities of Vespa with Cohere binary embeddings
- Re-scoring the binarized vectors for improved accuracy

Install the dependencies:

In [ ]:
```
!pip3 install -U pyvespa cohere==4.57 datasets vespacli
```

## Dataset exploration[¶](#dataset-exploration)

Cohere has released a large [Wikipedia dataset](https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3-int8-binary)

> This dataset contains the wikimedia/wikipedia dataset dump from 2023-11-01 from Wikipedia in all 300+ languages. The embeddings are provided as int8 and ubinary that allow quick search and reduction of your vector index size up to 32.

In [ ]:

```
from datasets import load_dataset

lang = "de"  # Use the first 10K chunks from the German Wikipedia subset
docs = load_dataset(
    "Cohere/wikipedia-2023-11-embed-multilingual-v3-int8-binary",
    lang,
    split="train",
    streaming=True,
).take(10000)
```

## Aggregate from chunks to pages[¶](#aggregate-from-chunks-to-pages)

We want to aggregate the chunk-level vector representations into their natural retrievable unit - a Wikipedia page. We can still search the chunks and the chunk vector representations, but retrieve pages instead of chunks. This avoids duplicating page-level metadata like url and title, while still having meaningful semantic search representations. For RAG applications, this also means that we have the full page-level context available when we retrieve information for the generative phase.

In [160]:

```
pages = dict()
for d in docs:
    url = d["url"]
    if url not in pages:
        pages[url] = [d]
    else:
        pages[url].append(d)
```

In [173]:
```
print(len(list(pages.keys())))
```

```
1866
```

## Defining the Vespa application[¶](#defining-the-vespa-application)

First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their type. We use Vespa's multi-vector indexing support - see [Revolutionizing Semantic Search with Multi-Vector HNSW Indexing in Vespa](https://blog.vespa.ai/semantic-search-with-multi-vector-indexing/) for details.

Highlights:

- language for language-specific [linguistic](https://docs.vespa.ai/en/linguistics.html) processing for keyword search
- Two named multi-vector representations with different precision, one held in memory and one paged out of memory
- The named multi-vector representations hold the chunk-level embeddings
- Chunks is an array of string where we enable BM25
- Metadata for the page (url, title)

In [174]:

```
from vespa.package import Schema, Document, Field, FieldSet

my_schema = Schema(
    name="page",
    mode="index",
    document=Document(
        fields=[
            Field(name="doc_id", type="string", indexing=["summary"]),
            Field(
                name="language",
                type="string",
                indexing=["summary", "index", "set_language"],
                match=["word"],
                rank="filter",
            ),
            Field(
                name="title",
                type="string",
                indexing=["summary", "index"],
                index="enable-bm25",
            ),
            Field(
                name="chunks",
                type="array<string>",
                indexing=["summary", "index"],
                index="enable-bm25",
            ),
            Field(
                name="url",
                type="string",
                indexing=["summary", "index"],
                index="enable-bm25",
            ),
            Field(
                name="binary_vectors",
                type="tensor<int8>(chunk{}, x[128])",
                indexing=["attribute", "index"],
                attribute=["distance-metric: hamming"],
            ),
            Field(
                name="int8_vectors",
                type="tensor<int8>(chunk{}, x[1024])",
                indexing=["attribute"],
                attribute=["paged"],
            ),
        ]
    ),
    fieldsets=[FieldSet(name="default", fields=["chunks", "title"])],
)
```

We must add the schema to a Vespa [application package](https://docs.vespa.ai/en/application-packages.html). This consists of configuration files, schemas, models, and possibly even custom code (plugins).

In [9]:

```
from vespa.package import ApplicationPackage

vespa_app_name = "wikipedia"
vespa_application_package = ApplicationPackage(name=vespa_app_name, schema=[my_schema])
```

In the last step, we configure [ranking](https://docs.vespa.ai/en/ranking.html) by adding rank-profiles to the schema. `unpack_bits` unpacks the binary representation into a 1024-dimensional float vector ([reference](https://docs.vespa.ai/en/reference/ranking-expressions.html#unpack-bits)). We define three tensor inputs: a compact binary representation used for the nearestNeighbor search, and int8 and full float versions used in ranking.

In [138]:
```
from vespa.package import RankProfile, FirstPhaseRanking, SecondPhaseRanking, Function

rerank = RankProfile(
    name="rerank",
    inputs=[
        ("query(q_binary)", "tensor<int8>(x[128])"),
        ("query(q_int8)", "tensor<int8>(x[1024])"),
        ("query(q_full)", "tensor<float>(x[1024])"),
    ],
    functions=[
        Function(
            # this returns a tensor(chunk{}, x[1024]) with values -1 or 1
            name="unpack_binary_representation",
            expression="2*unpack_bits(attribute(binary_vectors)) -1",
        ),
        Function(
            name="all_chunks_cosine",
            expression="cosine_similarity(query(q_int8), attribute(int8_vectors),x)",
        ),
        Function(
            name="int8_float_dot_products",
            expression="sum(query(q_full)*unpack_binary_representation,x)",
        ),
    ],
    first_phase=FirstPhaseRanking(
        expression="reduce(int8_float_dot_products, max, chunk)"
    ),
    second_phase=SecondPhaseRanking(
        expression="reduce(all_chunks_cosine, max, chunk)"  # rescoring using the int8 query and document vectors
    ),
    match_features=[
        "distance(field, binary_vectors)",
        "all_chunks_cosine",
        "firstPhase",
        "bm25(title)",
        "bm25(chunks)",
    ],
)
my_schema.add_rank_profile(rerank)
```

## Deploy the application to Vespa Cloud[¶](#deploy-the-application-to-vespa-cloud)

With the configured application, we can deploy it to [Vespa Cloud](https://cloud.vespa.ai/en/). To deploy the application to Vespa Cloud, we need a tenant: create one at [console.vespa-cloud.com](https://console.vespa-cloud.com/) (unless you already have one). This step requires a Google or GitHub account, and will start your [free trial](https://cloud.vespa.ai/en/free-trial). Make note of the tenant name; it is used in the next steps.

> Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.

In [24]:

```
from vespa.deployment import VespaCloud
import os

# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
# Key is only used for CI/CD. Can be removed if logging in interactively
key = os.getenv("VESPA_TEAM_API_KEY", None)
if key is not None:
    key = key.replace(r"\n", "\n")  # To parse key correctly

vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=vespa_app_name,
    key_content=key,  # Key is only used for CI/CD. Can be removed if logging in interactively
    application_package=vespa_application_package,
)
```

Now deploy the app to Vespa Cloud dev zone. The first deployment typically takes 2 minutes until the endpoint is up.

In [ ]:

```
from vespa.application import Vespa

app: Vespa = vespa_cloud.deploy()
```

## Feed the Wikipedia pages and the embedding representations[¶](#feed-the-wikipedia-pages-and-the-embedding-representations)

Read more about feeding with pyvespa in [PyVespa: reads and writes](https://vespa-engine.github.io/pyvespa/reads-writes.md). In this case, we use a generator to yield document operations.

In [153]:

```
def generate_vespa_feed_documents(pages):
    for url, chunks in pages.items():
        title = None
        text_chunks = []
        binary_vectors = {}
        int8_vectors = {}
        for chunk_id, chunk in enumerate(chunks):
            title = chunk["title"]
            text = chunk["text"]
            text_chunks.append(text)
            emb_ubinary = chunk["emb_ubinary"]
            emb_ubinary = [x - 128 for x in emb_ubinary]
            emb_int8 = chunk["emb_int8"]
            binary_vectors[chunk_id] = emb_ubinary
            int8_vectors[chunk_id] = emb_int8
        vespa_json = {
            "id": url,
            "fields": {
                "doc_id": url,
                "url": url,
                "language": lang,  # Assuming `lang` is defined somewhere
                "title": title,
                "chunks": text_chunks,
                "binary_vectors": binary_vectors,
                "int8_vectors": int8_vectors,
            },
        }
        yield vespa_json
```

In [154]:

```
from vespa.io import VespaResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(
            f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}"
        )
```

In [156]:

```
app.feed_iterable(
    iter=generate_vespa_feed_documents(pages),
    schema="page",
    callback=callback,
    max_queue_size=4000,
    max_workers=16,
    max_connections=16,
)
```

### Querying data[¶](#querying-data)

Read more about querying Vespa in:

- [Vespa Query API](https://docs.vespa.ai/en/query-api.html)
- [Vespa Query API reference](https://docs.vespa.ai/en/reference/query-api-reference.html)
- [Vespa Query Language API (YQL)](https://docs.vespa.ai/en/query-language.html)
- [Practical Nearest Neighbor Search Guide](https://docs.vespa.ai/en/nearest-neighbor-search-guide.html)

To obtain the query embedding we use the [Cohere embed API](https://docs.cohere.com/docs/embed-api).

In [48]:

```
import cohere

# Make sure that the environment variable CO_API_KEY is set to your API key
co = cohere.Client()
```

In [175]:

```
query = 'Welche britische Rockband hat das Lied "Spread Your Wings"?'

# Make sure to set input_type="search_query" when getting the embeddings for the query.
# We ask for 3 types of embeddings: float, binary, and int8
query_emb = co.embed(
    [query],
    model="embed-multilingual-v3.0",
    input_type="search_query",
    embedding_types=["float", "binary", "int8"],
)
```

Now, we use the [nearestNeighbor](https://docs.vespa.ai/en/reference/query-language-reference.html#nearestneighbor) query operator to retrieve 1000 pages using hamming distance. This phase uses the minimum chunk-level distance for selecting pages, essentially finding the best chunk in each page. This ensures diversity, as we retrieve pages, not chunks. These hits are exposed to the configured ranking phases that perform the re-ranking. Notice the language parameter, for language-specific processing of the query.

In [158]:
```
from vespa.io import VespaQueryResponse

response: VespaQueryResponse = app.query(
    yql="select * from page where userQuery() or ({targetHits:1000, approximate:true}nearestNeighbor(binary_vectors,q_binary))",
    ranking="rerank",
    query=query,
    language="de",  # don't guess the language of the query
    body={
        "presentation.format.tensors": "short-value",
        "input.query(q_binary)": query_emb.embeddings.binary[0],
        "input.query(q_full)": query_emb.embeddings.float[0],
        "input.query(q_int8)": query_emb.embeddings.int8[0],
    },
)
assert response.is_successful()
response.hits[0]
```

Out[158]:

```
{'id': 'id:page:page::https:/de.wikipedia.org/wiki/Spread Your Wings', 'relevance': 0.8184863924980164, 'source': 'wikipedia_content', 'fields': {'matchfeatures': {'bm25(chunks)': 28.125529605038967, 'bm25(title)': 7.345395294159827, 'distance(field,binary_vectors)': 170.0, 'firstPhase': 8.274434089660645, 'all_chunks_cosine': {'0': 0.8184863924980164, '1': 0.6203299760818481, '2': 0.643619954586029, '3': 0.6706648468971252, '4': 0.524447500705719, '5': 0.6730406880378723}}, 'sddocname': 'page', 'documentid': 'id:page:page::https:/de.wikipedia.org/wiki/Spread Your Wings', 'doc_id': 'https://de.wikipedia.org/wiki/Spread%20Your%20Wings', 'language': 'de', 'title': 'Spread Your Wings', 'chunks': ['Spread Your Wings ist ein Lied der britischen Rockband Queen, das von deren Bassisten John Deacon geschrieben wurde.
Es ist auf dem im Oktober 1977 erschienenen Album News of the World enthalten und wurde am 10. Februar 1978 in Europa als Single mit Sheer Heart Attack als B-Seite veröffentlicht. In Nordamerika wurde es nicht als Single veröffentlicht, sondern erschien stattdessen 1980 als B-Seite des Billboard Nummer-1-Hits Crazy Little Thing Called Love. Das Lied wurde zwar kein großer Hit in den Charts, ist aber unter Queen-Fans sehr beliebt.', 'Der Text beschreibt einen jungen Mann namens Sammy, der in einer Bar zum Putzen arbeitet (“You should’ve been sweeping/up the Emerald bar”). Während sein Chef ihn in den Strophen beschimpft und sagt, er habe keinerlei Ambitionen und solle sich mit dem zufriedengeben, was er hat (“You’ve got no real ambition,/you won’t get very far/Sammy boy don’t you know who you are/Why can’t you be happy/at the Emerald bar”), ermuntert ihn der Erzähler im Refrain, seinen Träumen nachzugehen (“spread your wings and fly away/Fly away, far away/Pull yourself together ‘cause you know you should do better/That’s because you’re a free man.”).', 'Das Lied ist im 4/4-Takt geschrieben, beginnt in der Tonart D-Dur, wechselt in der Bridge zu deren Paralleltonart h-Moll und endet wieder mit D-Dur. Es beginnt mit einem kurzen Piano-Intro, gefolgt von der ersten Strophe, die nur mit einer akustischen Gitarre, Piano und Hi-Hats begleitet wird, und dem Refrain, in dem die E-Gitarre und das Schlagzeug hinzukommen. Die Bridge besteht aus kurzen, langsamen Gitarrentönen. Die zweite Strophe enthält im Gegensatz zur ersten beinahe von Anfang an E-Gitarren-Klänge und Schlagzeugtöne. Darauf folgt nochmals der Refrain. Das Outro ist – abgesehen von zwei kurzen Rufen – instrumental. Es besteht aus einem längeren Gitarrensolo, in dem – was für Queen äußerst ungewöhnlich ist – dieselbe Akkordfolge mehrere Male wiederholt wird und ab dem vierten Mal langsam ausblendet. 
Das ganze Lied enthält keinerlei Hintergrundgesang, sondern nur den Leadgesang von Freddie Mercury.', 'Das Musikvideo wurde ebenso wie das zu We Will Rock You im Januar 1978 im Garten von Roger Taylors damaligen Anwesen Millhanger House gedreht, welches sich im Dorf Thursley im Südwesten der englischen Grafschaft Surrey befindet. Der Boden ist dabei von einer Eis- und Schneeschicht überzogen, auf der die Musiker spielten.', "Brian May sagte dazu später: “Looking back, it couldn't be done there – you couldn't do that!” („Wenn ich zurückschaue, hätte es nicht dort gemacht werden dürfen – man konnte das nicht tun!“)", 'Das Lied wurde mehrfach gecovert, unter anderem von der deutschen Metal-Band Blind Guardian auf ihrem 1992 erschienenen Album Somewhere Far Beyond. Weitere Coverversionen gibt es u. a. von Jeff Scott Soto und Shawn Mars.'], 'url': 'https://de.wikipedia.org/wiki/Spread%20Your%20Wings'}}
```

Notice the returned hits. The `relevance` is the score assigned by the second-phase expression. Also notice that we included [bm25](https://docs.vespa.ai/en/reference/bm25.html) scores in the match-features; in this case, they do not influence ranking. The bm25 over chunks is calculated across all the elements, as if the array were a single string field.

We now have the full Wikipedia context for all the retrieved pages: all the chunks and all the cosine similarity scores for the chunks in the Wikipedia page, with no need to duplicate title and url into separate retrievable units like with single-vector databases. In RAG applications, we can now choose how much context we want to input to the generative step:

- All the chunks
- Only the best k chunks with a threshold on the cosine similarity
- The adjacent chunks of the best chunk

Or combinations of the above.
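The context-selection strategies listed above are plain post-processing of the per-chunk `all_chunks_cosine` match-feature. A minimal sketch using the chunk scores from the hit above; the helper names, `k`, and the threshold value are arbitrary illustrations, and the placeholder chunk strings stand in for `hit["fields"]["chunks"]`:

```
# Per-chunk cosine scores from the "all_chunks_cosine" match-feature above.
all_chunks_cosine = {
    "0": 0.8184863924980164,
    "1": 0.6203299760818481,
    "2": 0.643619954586029,
    "3": 0.6706648468971252,
    "4": 0.524447500705719,
    "5": 0.6730406880378723,
}


def best_k_chunks(scores: dict, chunks: list, k: int = 2, threshold: float = 0.6) -> list:
    """Top-k chunks above a similarity threshold, returned in page order."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    keep = sorted(int(i) for i, s in ranked[:k] if s >= threshold)
    return [chunks[i] for i in keep]


def with_adjacent(scores: dict, chunks: list) -> list:
    """The best chunk plus its adjacent chunks."""
    best = int(max(scores, key=scores.get))
    lo, hi = max(best - 1, 0), min(best + 1, len(chunks) - 1)
    return chunks[lo : hi + 1]


chunks = [f"chunk-{i}" for i in range(6)]  # placeholder for hit["fields"]["chunks"]
print(best_k_chunks(all_chunks_cosine, chunks))  # ['chunk-0', 'chunk-5']
print(with_adjacent(all_chunks_cosine, chunks))  # ['chunk-0', 'chunk-1']
```

Because the whole page is a single retrievable unit, these selections need no extra queries - everything is available from the one hit.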
## Conclusions[¶](#conclusions)

These new Cohere binary embeddings are a huge step forward for cost-efficient vector search at scale, and they integrate perfectly with the rich feature set in Vespa, including multilingual text search capabilities and hybrid search.

### Clean up[¶](#clean-up)

We can now delete the cloud instance:

In [ ]:

```
vespa_cloud.delete()
```

# PDF-Retrieval using ColQWen2 (ColPali) with Vespa[¶](#pdf-retrieval-using-colqwen2-colpali-with-vespa)

This notebook is a continuation of our notebooks related to the ColPali models for complex document retrieval. This notebook demonstrates using the new [ColQWen2](https://huggingface.co/vidore/colqwen2-v0.1) model checkpoint.

> ColQwen is a model based on a novel model architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It is a Qwen2-VL-2B extension that generates ColBERT-style multi-vector representations of text and images. It was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository

ColQWen2 is better than the previous ColPali model in the following ways:

- It's more accurate on the ViDoRe dataset (+5 nDCG@5 points)
- It's permissively licensed, as both the base model and the adapter use open-source licences (Apache 2.0 and MIT)
- It uses fewer patch embeddings than ColPaliGemma (from 1024 to 768), which reduces both compute and storage

See also [Scaling ColPali to billions of PDFs with Vespa](https://blog.vespa.ai/scaling-colpali-to-billions/)

The TL;DR of this notebook:

- Generate an image per PDF page using [pdf2image](https://pypi.org/project/pdf2image/) and also extract the text using [pypdf](https://pypdf.readthedocs.io/en/stable/user/extract-text.html).
- For each page image, use ColPali to obtain the visual multi-vector embeddings.

Then we store the visual embeddings in Vespa as an `int8` tensor, where we use a binary compression technique to reduce the storage footprint by 32x compared to float representations. See [Scaling ColPali to billions of PDFs with Vespa](https://blog.vespa.ai/scaling-colpali-to-billions/) for details on binarization and using hamming distance for retrieval.

At query time, we use the same ColPali model to generate embeddings for the query, and then use Vespa's `nearestNeighbor` query operator to retrieve the most similar documents per query token vector, using the binary representation with hamming distance. Then we re-rank the results in two phases:

- In the retrieval phase (phase 0), we use hamming distance to retrieve the k closest pages per query token vector representation; this is expressed using multiple nearestNeighbor query operators in Vespa.
- The nearestNeighbor operators expose pages to the first-phase ranking function, which uses an approximate MaxSim with inverted hamming distance instead of cosine similarity. This is done to reduce the number of pages that are re-ranked in the second phase.
- In the second phase, we perform the full MaxSim operation, using float representations of the embeddings to re-rank the top-k pages from the first phase.

This allows us to scale ColPali to very large collections of PDF pages, while still providing accurate and fast retrieval.

Let us get started. Install dependencies:

Note that the python pdf2image package requires poppler-utils; see other installation options [here](https://pdf2image.readthedocs.io/en/latest/installation.html#installing-poppler). For macOS, the simplest install option is `brew install poppler` if you are using [Homebrew](https://brew.sh/).

In [ ]:
```
!sudo apt-get update && sudo apt-get install poppler-utils -y
```

Now install the required python packages:

In [ ]:

```
!pip3 install colpali-engine==0.3.1 pdf2image pypdf pyvespa vespacli requests numpy tqdm
```

In [ ]:

```
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from io import BytesIO

from colpali_engine.models import ColQwen2, ColQwen2Processor
```

### Load the model[¶](#load-the-model)

We use `device_map="auto"` to load the model on a GPU if available, otherwise on MPS or the CPU.

In [ ]:

```
model_name = "vidore/colqwen2-v0.1"
model = ColQwen2.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = ColQwen2Processor.from_pretrained(model_name)
model = model.eval()
```

### Working with pdfs[¶](#working-with-pdfs)

We need to convert a PDF to an array of images, one image per page. We use the `pdf2image` library for this task. Additionally, we extract the text contents of the PDF using `pypdf`.

NOTE: This step requires that you have `poppler` installed on your system. Read more in the [pdf2image](https://pdf2image.readthedocs.io/en/latest/installation.html) docs.

In [ ]:
``` import requests from pdf2image import convert_from_path from pypdf import PdfReader def download_pdf(url): response = requests.get(url) if response.status_code == 200: return BytesIO(response.content) else: raise Exception(f"Failed to download PDF: Status code {response.status_code}") def get_pdf_images(pdf_url): # Download the PDF pdf_file = download_pdf(pdf_url) # Save the PDF temporarily to disk (pdf2image requires a file path) temp_file = "temp.pdf" with open(temp_file, "wb") as f: f.write(pdf_file.read()) reader = PdfReader(temp_file) page_texts = [] for page_number in range(len(reader.pages)): page = reader.pages[page_number] text = page.extract_text() page_texts.append(text) images = convert_from_path(temp_file) assert len(images) == len(page_texts) return (images, page_texts) ``` import requests from pdf2image import convert_from_path from pypdf import PdfReader def download_pdf(url): response = requests.get(url) if response.status_code == 200: return BytesIO(response.content) else: raise Exception(f"Failed to download PDF: Status code {response.status_code}") def get_pdf_images(pdf_url): # Download the PDF pdf_file = download_pdf(pdf_url) # Save the PDF temporarily to disk (pdf2image requires a file path) temp_file = "temp.pdf" with open(temp_file, "wb") as f: f.write(pdf_file.read()) reader = PdfReader(temp_file) page_texts = [] for page_number in range(len(reader.pages)): page = reader.pages[page_number] text = page.extract_text() page_texts.append(text) images = convert_from_path(temp_file) assert len(images) == len(page_texts) return (images, page_texts) We define a few sample PDFs to work with. The PDFs are discovered from [this url](https://www.conocophillips.com/company-reports-resources/sustainability-reporting/). In \[ \]: Copied! 
``` sample_pdfs = [ { "title": "ConocoPhillips Sustainability Highlights - Nature (24-0976)", "url": "https://static.conocophillips.com/files/resources/24-0976-sustainability-highlights_nature.pdf", }, { "title": "ConocoPhillips Managing Climate Related Risks", "url": "https://static.conocophillips.com/files/resources/conocophillips-2023-managing-climate-related-risks.pdf", }, { "title": "ConocoPhillips 2023 Sustainability Report", "url": "https://static.conocophillips.com/files/resources/conocophillips-2023-sustainability-report.pdf", }, ] ``` sample_pdfs = [ { "title": "ConocoPhillips Sustainability Highlights - Nature (24-0976)", "url": "https://static.conocophillips.com/files/resources/24-0976-sustainability-highlights_nature.pdf", }, { "title": "ConocoPhillips Managing Climate Related Risks", "url": "https://static.conocophillips.com/files/resources/conocophillips-2023-managing-climate-related-risks.pdf", }, { "title": "ConocoPhillips 2023 Sustainability Report", "url": "https://static.conocophillips.com/files/resources/conocophillips-2023-sustainability-report.pdf", }, ] Now we can convert the PDFs to images and also extract the text content. In \[ \]: Copied! ``` for pdf in sample_pdfs: page_images, page_texts = get_pdf_images(pdf["url"]) pdf["images"] = page_images pdf["texts"] = page_texts ``` for pdf in sample_pdfs: page_images, page_texts = get_pdf_images(pdf["url"]) pdf["images"] = page_images pdf["texts"] = page_texts Let us look at the extracted image of the first PDF page. This is the document side input to ColPali, one image per page. In \[ \]: Copied! 
``` from IPython.display import display def resize_image(image, max_height=800): width, height = image.size if height > max_height: ratio = max_height / height new_width = int(width * ratio) new_height = int(height * ratio) return image.resize((new_width, new_height)) return image display(resize_image(sample_pdfs[0]["images"][0])) ``` Let us also look at the extracted text content of the first PDF page. In \[ \]: Copied! ``` print(sample_pdfs[0]["texts"][0]) ``` Notice how the layout and order of the text are different from the image representation. Note that: - The headlines NATURE and Sustainability have been combined into one word (NATURESustainability). - The 0.03% has been converted to 0.03, and order is not preserved in the text representation. - The data in the infographics is not represented in the text representation. Now we use the ColPali model to generate embeddings of the images. In \[ \]: Copied!
``` for pdf in sample_pdfs: page_embeddings = [] dataloader = DataLoader( pdf["images"], batch_size=2, shuffle=False, collate_fn=lambda x: processor.process_images(x), ) for batch_doc in tqdm(dataloader): with torch.no_grad(): batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()} embeddings_doc = model(**batch_doc) page_embeddings.extend(list(torch.unbind(embeddings_doc.to("cpu")))) pdf["embeddings"] = page_embeddings ``` Now that we are done with the document-side embeddings, we convert them to Vespa JSON format so we can store (and index) them in Vespa. Details in [Vespa JSON feed format doc](https://docs.vespa.ai/en/reference/document-json-format.html). We use binary quantization (BQ) of the page-level ColPali vector embeddings to reduce their size by 32x. Read more about binarization of multi-vector representations in the [colbert blog post](https://blog.vespa.ai/announcing-colbert-embedder-in-vespa/). The binarization step maps 128-dimensional float vectors to 128 bits, or 16 bytes per vector, reducing the size by 32x. On the [DocVQA benchmark](https://huggingface.co/datasets/vidore/docvqa_test_subsampled), binarization results in a small drop in ranking accuracy. We also demonstrate how to store the image data in Vespa using the [raw](https://docs.vespa.ai/en/reference/schema-reference.html#raw) type for binary data. To encode the binary data in JSON, we use base64 encoding. In \[ \]: Copied!
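As a standalone illustration of the binarization described above, here is a sketch using a synthetic vector; the notebook applies the same transformation to each real ColPali patch embedding:

```python
import numpy as np

# Synthetic stand-in for one 128-dimensional ColPali patch embedding.
rng = np.random.default_rng(42)
patch_embedding = rng.standard_normal(128).astype(np.float32)

# Keep only the sign of each dimension, then pack the 128 bits into 16 bytes:
# 128 four-byte floats -> 16 bytes, a 32x reduction.
binary = np.packbits(np.where(patch_embedding > 0, 1, 0)).astype(np.int8)
hex_str = binary.tobytes().hex()  # hex-encoded form, as fed to Vespa

print(binary.shape)  # (16,)
print(len(hex_str))  # 32 (two hex characters per byte)
```

The query side applies the identical transformation later in this notebook when building binary query tensors.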
``` import base64 def get_base64_image(image): buffered = BytesIO() image.save(buffered, format="JPEG") return str(base64.b64encode(buffered.getvalue()), "utf-8") ``` import base64 def get_base64_image(image): buffered = BytesIO() image.save(buffered, format="JPEG") return str(base64.b64encode(buffered.getvalue()), "utf-8") In \[ \]: Copied! ``` import numpy as np vespa_feed = [] for pdf in sample_pdfs: url = pdf["url"] title = pdf["title"] for page_number, (page_text, embedding, image) in enumerate( zip(pdf["texts"], pdf["embeddings"], pdf["images"]) ): base_64_image = get_base64_image(resize_image(image, 640)) embedding_dict = dict() for idx, patch_embedding in enumerate(embedding): binary_vector = ( np.packbits(np.where(patch_embedding > 0, 1, 0)) .astype(np.int8) .tobytes() .hex() ) embedding_dict[idx] = binary_vector page = { "id": hash(url + str(page_number)), "url": url, "title": title, "page_number": page_number, "image": base_64_image, "text": page_text, "embedding": embedding_dict, } vespa_feed.append(page) ``` import numpy as np vespa_feed = [] for pdf in sample_pdfs: url = pdf["url"] title = pdf["title"] for page_number, (page_text, embedding, image) in enumerate( zip(pdf["texts"], pdf["embeddings"], pdf["images"]) ): base_64_image = get_base64_image(resize_image(image, 640)) embedding_dict = dict() for idx, patch_embedding in enumerate(embedding): binary_vector = ( np.packbits(np.where(patch_embedding > 0, 1, 0)) .astype(np.int8) .tobytes() .hex() ) embedding_dict[idx] = binary_vector page = { "id": hash(url + str(page_number)), "url": url, "title": title, "page_number": page_number, "image": base_64_image, "text": page_text, "embedding": embedding_dict, } vespa_feed.append(page) ### Configure Vespa[¶](#configure-vespa) [PyVespa](https://vespa-engine.github.io/pyvespa/) helps us build the [Vespa application package](https://docs.vespa.ai/en/application-packages.html). 
A Vespa application package consists of configuration files, schemas, models, and code (plugins). First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their type. In \[ \]: Copied! ``` from vespa.package import Schema, Document, Field, FieldSet, HNSW colpali_schema = Schema( name="pdf_page", document=Document( fields=[ Field( name="id", type="string", indexing=["summary", "index"], match=["word"] ), Field(name="url", type="string", indexing=["summary", "index"]), Field( name="title", type="string", indexing=["summary", "index"], match=["text"], index="enable-bm25", ), Field(name="page_number", type="int", indexing=["summary", "attribute"]), Field(name="image", type="raw", indexing=["summary"]), Field( name="text", type="string", indexing=["index"], match=["text"], index="enable-bm25", ), Field( name="embedding", type="tensor(patch{}, v[16])", indexing=[ "attribute", "index", ], # adds HNSW index for candidate retrieval. ann=HNSW( distance_metric="hamming", max_links_per_node=32, neighbors_to_explore_at_insert=400, ), ), ] ), fieldsets=[FieldSet(name="default", fields=["title", "text"])], ) ``` from vespa.package import Schema, Document, Field, FieldSet, HNSW colpali_schema = Schema( name="pdf_page", document=Document( fields=\[ Field( name="id", type="string", indexing=["summary", "index"], match=["word"] ), Field(name="url", type="string", indexing=["summary", "index"]), Field( name="title", type="string", indexing=["summary", "index"], match=["text"], index="enable-bm25", ), Field(name="page_number", type="int", indexing=["summary", "attribute"]), Field(name="image", type="raw", indexing=["summary"]), Field( name="text", type="string", indexing=["index"], match=["text"], index="enable-bm25", ), Field( name="embedding", type="tensor(patch{}, v[16])", indexing=[ "attribute", "index", ], # adds HNSW index for candidate retrieval. 
ann=HNSW( distance_metric="hamming", max_links_per_node=32, neighbors_to_explore_at_insert=400, ), ), \] ), fieldsets=\[FieldSet(name="default", fields=["title", "text"])\], ) Notice the `embedding` field, which is a tensor field with the type `tensor(patch{}, v[16])`. This is the field we use to represent the page-level patch embeddings from ColPali. We also enable [HNSW indexing](https://docs.vespa.ai/en/approximate-nn-hnsw.html) for this field to enable fast nearest neighbor search, which is used for candidate retrieval. We use [binary hamming distance](https://docs.vespa.ai/en/nearest-neighbor-search.html#using-binary-embeddings-with-hamming-distance) as an approximation of the cosine similarity. Hamming distance is a good approximation for binary representations, and it is much faster to compute than cosine similarity/dot product. The `embedding` field is an example of a mixed tensor, where we combine a mapped (sparse) dimension with a dense dimension. Read more in [Tensor guide](https://docs.vespa.ai/en/tensor-user-guide.html). We also enable [BM25](https://docs.vespa.ai/en/reference/bm25.html) for the `title` and `text` fields. Notice that the `image` field uses type `raw` to store the binary image data, encoded as a base64 string. Create the Vespa [application package](https://docs.vespa.ai/en/application-packages): In \[ \]: Copied! ``` from vespa.package import ApplicationPackage vespa_app_name = "visionrag6" vespa_application_package = ApplicationPackage( name=vespa_app_name, schema=[colpali_schema] ) ``` Now we define how we want to rank the pages for a query. We use Vespa's support for [BM25](https://docs.vespa.ai/en/reference/bm25.html) for the text, and late interaction with Max Sim for the image embeddings.
This means that we use the text representation as the candidate retrieval phase, and then use the ColPali embeddings with MaxSim to re-rank the pages. In \[ \]: Copied! ``` from vespa.package import RankProfile, Function, FirstPhaseRanking, SecondPhaseRanking colpali_profile = RankProfile( name="default", inputs=[("query(qt)", "tensor(querytoken{}, v[128])")], functions=[ Function( name="max_sim", expression=""" sum( reduce( sum( query(qt) * unpack_bits(attribute(embedding)) , v ), max, patch ), querytoken ) """, ), Function(name="bm25_score", expression="bm25(title) + bm25(text)"), ], first_phase=FirstPhaseRanking(expression="bm25_score"), second_phase=SecondPhaseRanking(expression="max_sim", rerank_count=100), ) colpali_schema.add_rank_profile(colpali_profile) ``` The first phase uses a linear combination of BM25 scores for the text fields, and the second phase uses the MaxSim function with the image embeddings. Notice that Vespa supports an `unpack_bits` function to convert each compressed 16-byte binary vector to a 128-dimensional float vector for the MaxSim function. The query input tensor is not compressed and uses full float resolution. ### Deploy the application to Vespa Cloud[¶](#deploy-the-application-to-vespa-cloud) With the configured application, we can deploy it to [Vespa Cloud](https://cloud.vespa.ai/en/).
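Before deploying, it can help to see concretely what the `max_sim` function computes. Below is a toy numpy sketch with synthetic data; `np.unpackbits` plays the role of Vespa's `unpack_bits`, restoring 128 0/1 values from each packed 16-byte patch vector (the dimensions and data here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
query_tokens = rng.standard_normal((4, 128))  # 4 float query-token vectors
# 6 binarized page-patch vectors, packed to 16 bytes each.
patches_binary = np.packbits(rng.standard_normal((6, 128)) > 0, axis=1)

# MaxSim: unpack the bits, dot each query token with each patch,
# take the max over patches, then sum over query tokens.
patches_unpacked = np.unpackbits(patches_binary, axis=1).astype(np.float32)  # (6, 128)
sim = query_tokens @ patches_unpacked.T  # (4, 6) token-patch similarities
max_sim_score = sim.max(axis=1).sum()
```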
To deploy the application to Vespa Cloud we need to create a tenant in the Vespa Cloud: Create a tenant at [console.vespa-cloud.com](https://console.vespa-cloud.com/) (unless you already have one). This step requires a Google or GitHub account, and will start your [free trial](https://cloud.vespa.ai/en/free-trial). In \[ \]: Copied! ``` from vespa.deployment import VespaCloud import os os.environ["TOKENIZERS_PARALLELISM"] = "false" # Replace with your tenant name from the Vespa Cloud Console tenant_name = "vespa-team" key = os.getenv("VESPA_TEAM_API_KEY", None) if key is not None: key = key.replace(r"\n", "\n") # To parse key correctly vespa_cloud = VespaCloud( tenant=tenant_name, application=vespa_app_name, key_content=key, # Key is only used for CI/CD testing of this notebook. Can be removed if logging in interactively application_package=vespa_application_package, ) ``` from vespa.deployment import VespaCloud import os os.environ["TOKENIZERS_PARALLELISM"] = "false" # Replace with your tenant name from the Vespa Cloud Console tenant_name = "vespa-team" key = os.getenv("VESPA_TEAM_API_KEY", None) if key is not None: key = key.replace(r"\\n", "\\n") # To parse key correctly vespa_cloud = VespaCloud( tenant=tenant_name, application=vespa_app_name, key_content=key, # Key is only used for CI/CD testing of this notebook. Can be removed if logging in interactively application_package=vespa_application_package, ) Now deploy the app to Vespa Cloud dev zone. The first deployment typically takes 2 minutes until the endpoint is up. In \[ \]: Copied! ``` from vespa.application import Vespa app: Vespa = vespa_cloud.deploy() ``` from vespa.application import Vespa app: Vespa = vespa_cloud.deploy() In \[ \]: Copied! ``` print("Number of PDF pages:", len(vespa_feed)) ``` print("Number of PDF pages:", len(vespa_feed)) Index the documents in Vespa using the Vespa HTTP API. In \[ \]: Copied! 
``` from vespa.io import VespaResponse async with app.asyncio(connections=1, timeout=180) as session: for page in tqdm(vespa_feed): response: VespaResponse = await session.feed_data_point( data_id=page["id"], fields=page, schema="pdf_page" ) if not response.is_successful(): print(response.json()) ``` ### Querying Vespa[¶](#querying-vespa) OK, now we have indexed the PDF pages in Vespa. Let us now obtain ColPali embeddings for a few text queries and use them during ranking of the indexed PDF pages. Now we can query Vespa with the text query and rerank the results using the ColPali embeddings. In \[ \]: Copied! ``` queries = [ "Percentage of non-fresh water as source?", "Policies related to nature risk?", "How much of produced water is recycled?", ] ``` Obtain the query embeddings using the ColPali model: In \[ \]: Copied!
``` dataloader = DataLoader( queries, batch_size=1, shuffle=False, collate_fn=lambda x: processor.process_queries(x), ) qs = [] for batch_query in dataloader: with torch.no_grad(): batch_query = {k: v.to(model.device) for k, v in batch_query.items()} embeddings_query = model(**batch_query) qs.extend(list(torch.unbind(embeddings_query.to("cpu")))) ``` We create a simple routine to display the results. We render the image and the title of the retrieved page/document. In \[ \]: Copied!

```
from IPython.display import display, HTML

def display_query_results(query, response, hits=5):
    query_time = response.json.get("timing", {}).get("searchtime", -1)
    query_time = round(query_time, 2)
    count = response.json.get("root", {}).get("fields", {}).get("totalCount", 0)
    html_content = f"<h3>Query text: '{query}', query time {query_time}s, count={count}, top results:</h3>"
    for i, hit in enumerate(response.hits[:hits]):
        title = hit["fields"]["title"]
        url = hit["fields"]["url"]
        page = hit["fields"]["page_number"]
        image = hit["fields"]["image"]
        score = hit["relevance"]
        html_content += f"<h4>PDF Result {i + 1}</h4>"
        html_content += f'<p>Title: <a href="{url}">{title}</a>, page {page + 1} with score {score:.2f}</p>'
        html_content += f'<img src="data:image/jpeg;base64,{image}" style="max-width:100%;">'
    display(HTML(html_content))
```

Query Vespa with the queries and display the results; here we are using the `default` rank profile. Note that we retrieve using the textual representation with `userInput(@userQuery)`. This means that we use BM25 ranking for the extracted text in the first ranking phase and then re-rank the top-k pages using the ColPali embeddings. Later in this notebook we will use Vespa's support for approximate nearest neighbor search (`nearestNeighbor`) to retrieve directly using the ColPali embeddings. In \[ \]: Copied! ``` from vespa.io import VespaQueryResponse async with app.asyncio(connections=1, timeout=120) as session: for idx, query in enumerate(queries): query_embedding = {k: v.tolist() for k, v in enumerate(qs[idx])} response: VespaQueryResponse = await session.query( yql="select title,url,image,page_number from pdf_page where userInput(@userQuery)", ranking="default", userQuery=query, timeout=120, hits=3, body={"input.query(qt)": query_embedding, "presentation.timing": True}, ) assert response.is_successful() display_query_results(query, response) ``` ### Using nearestNeighbor for retrieval[¶](#using-nearestneighbor-for-retrieval) In the above example, we used the ColPali embeddings in ranking, while using the text query for retrieval.
This is a reasonable approach for text-heavy documents where the text representation is the most important and where ColPali embeddings are used to re-rank the top-k documents from the text retrieval phase. In some cases, the ColPali embeddings are the most important, and we want to demonstrate how we can use HNSW indexing with binary hamming distance to retrieve the most similar pages to a query and then have two steps of re-ranking using the ColPali embeddings. All the phases here are executed locally inside the Vespa content node(s) so that no vector data needs to cross the network. Let us add a new rank profile to the schema. The `nearestNeighbor` operator takes a query tensor and a field tensor as arguments, so we need to define the query tensor types in the rank profile. In \[ \]: Copied! ``` from vespa.package import RankProfile, Function, FirstPhaseRanking, SecondPhaseRanking input_query_tensors = [] MAX_QUERY_TERMS = 64 for i in range(MAX_QUERY_TERMS): input_query_tensors.append((f"query(rq{i})", "tensor(v[16])")) input_query_tensors.append(("query(qt)", "tensor(querytoken{}, v[128])")) input_query_tensors.append(("query(qtb)", "tensor(querytoken{}, v[16])")) colpali_retrieval_profile = RankProfile( name="retrieval-and-rerank", inputs=input_query_tensors, functions=[ Function( name="max_sim", expression=""" sum( reduce( sum( query(qt) * unpack_bits(attribute(embedding)) , v ), max, patch ), querytoken ) """, ), Function( name="max_sim_binary", expression=""" sum( reduce( 1/(1 + sum( hamming(query(qtb), attribute(embedding)) ,v) ), max, patch ), querytoken ) """, ), ], first_phase=FirstPhaseRanking(expression="max_sim_binary"), second_phase=SecondPhaseRanking(expression="max_sim", rerank_count=10), ) colpali_schema.add_rank_profile(colpali_retrieval_profile) ``` from vespa.package import RankProfile, Function, FirstPhaseRanking, SecondPhaseRanking input_query_tensors = [] MAX_QUERY_TERMS = 64 for i in range(MAX_QUERY_TERMS):
input_query_tensors.append((f"query(rq{i})", "tensor(v[16])")) input_query_tensors.append(("query(qt)", "tensor(querytoken{}, v[128])")) input_query_tensors.append(("query(qtb)", "tensor(querytoken{}, v[16])")) colpali_retrieval_profile = RankProfile( name="retrieval-and-rerank", inputs=input_query_tensors, functions=[ Function( name="max_sim", expression=""" sum( reduce( sum( query(qt) * unpack_bits(attribute(embedding)) , v ), max, patch ), querytoken ) """, ), Function( name="max_sim_binary", expression=""" sum( reduce( 1/(1 + sum( hamming(query(qtb), attribute(embedding)) ,v) ), max, patch ), querytoken ) """, ), ], first_phase=FirstPhaseRanking(expression="max_sim_binary"), second_phase=SecondPhaseRanking(expression="max_sim", rerank_count=10), ) colpali_schema.add_rank_profile(colpali_retrieval_profile) We define two functions, one for the first phase and one for the second phase. Instead of the float representations, we use the binary representations with inverted hamming distance in the first phase. Now, we need to re-deploy the application to Vespa Cloud. In \[ \]: Copied! ``` from vespa.application import Vespa app: Vespa = vespa_cloud.deploy() ``` from vespa.application import Vespa app: Vespa = vespa_cloud.deploy() Now we can query Vespa with the text queries and use the `nearestNeighbor` operator to retrieve the most similar pages to the query and pass the different query tensors. In \[ \]: Copied! 
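The `max_sim_binary` function in the profile above replaces the dot product with an inverted hamming distance over the packed bit vectors. A minimal numpy sketch of that scoring idea, on synthetic data (popcount done via `np.unpackbits`):

```python
import numpy as np

rng = np.random.default_rng(1)
q = np.packbits(rng.standard_normal((4, 128)) > 0, axis=1)  # 4 packed query tokens
p = np.packbits(rng.standard_normal((6, 128)) > 0, axis=1)  # 6 packed page patches

def hamming(a, b):
    # Number of differing bits between two packed vectors (XOR + popcount).
    return int(np.unpackbits(a ^ b).sum())

# For each query token: invert the hamming distance to the closest patch,
# then sum over query tokens (higher score = more similar page).
score = sum(max(1.0 / (1 + hamming(qt, pt)) for pt in p) for qt in q)
```

With four query tokens, each term lies in (0, 1], so the score is bounded by the number of query tokens.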
``` from vespa.io import VespaQueryResponse target_hits_per_query_tensor = ( 20 # this is a hyperparameter that can be tuned for speed versus accuracy ) async with app.asyncio(connections=1, timeout=180) as session: for idx, query in enumerate(queries): float_query_embedding = {k: v.tolist() for k, v in enumerate(qs[idx])} binary_query_embeddings = dict() for k, v in float_query_embedding.items(): binary_query_embeddings[k] = ( np.packbits(np.where(np.array(v) > 0, 1, 0)).astype(np.int8).tolist() ) # The mixed tensors used in MaxSim calculations # We use both binary and float representations query_tensors = { "input.query(qtb)": binary_query_embeddings, "input.query(qt)": float_query_embedding, } # The query tensors used in the nearest neighbor calculations for i in range(0, len(binary_query_embeddings)): query_tensors[f"input.query(rq{i})"] = binary_query_embeddings[i] nn = [] for i in range(0, len(binary_query_embeddings)): nn.append( f"({{targetHits:{target_hits_per_query_tensor}}}nearestNeighbor(embedding,rq{i}))" ) # We use an OR operator to combine the nearestNeighbor operators nn = " OR ".join(nn) response: VespaQueryResponse = await session.query( yql=f"select title, url, image, page_number from pdf_page where {nn}", ranking="retrieval-and-rerank", timeout=120, hits=3, body={**query_tensors, "presentation.timing": True}, ) assert response.is_successful() display_query_results(query, response) ``` Depending on the scale, we can experiment with different numbers of targetHits per nearestNeighbor operator and with the ranking depths in the two phases. We can also parallelize the ranking phases by using more threads per query request to reduce latency. ## Summary[¶](#summary) In this notebook, we have demonstrated how to represent the new ColQwen2 model in Vespa. We have generated embeddings for images of PDF pages using ColQwen2 and stored the embeddings in Vespa using [mixed tensors](https://docs.vespa.ai/en/tensor-user-guide.html). We demonstrated how to store the base64-encoded image using the `raw` Vespa field type, plus metadata like title and url. We have demonstrated how to retrieve relevant pages for a query using the embeddings generated by ColPali. This notebook can be extended to include more complex ranking models, more complex queries, and more complex data structures, including metadata and other fields which can be filtered on or used for ranking. # Pyvespa examples[¶](#pyvespa-examples) This is a notebook with short examples one can build applications from.
Refer to [troubleshooting](https://vespa-engine.github.io/pyvespa/troubleshooting.md) for any problem when running this guide; it also has utilities for debugging. In \[ \]: Copied! ``` !pip3 install pyvespa ``` ## Neighbors[¶](#neighbors) Explore distance between points in 3D vector space. These are simple examples, feeding documents with a tensor representing a point in space, and a rank profile calculating the distance between a point in the query and the point in the documents. The examples start with simple ranking expressions like [euclidean-distance](https://docs.vespa.ai/en/reference/ranking-expressions.html#euclidean-distance-t), then use rank features like `closeness()` and different [distance-metrics](https://docs.vespa.ai/en/nearest-neighbor-search.html#distance-metrics-for-nearest-neighbor-search). ### Distant neighbor[¶](#distant-neighbor) First, find the point that is **most** distant from a point in the query - deploy the Application Package: In \[14\]: Copied!
``` from vespa.package import ApplicationPackage, Field, RankProfile from vespa.deployment import VespaDocker from vespa.io import VespaResponse app_package = ApplicationPackage(name="neighbors") app_package.schema.add_fields( Field(name="point", type="tensor(d[3])", indexing=["attribute", "summary"]) ) app_package.schema.add_rank_profile( RankProfile( name="max_distance", inputs=[("query(qpoint)", "tensor(d[3])")], first_phase="euclidean_distance(attribute(point), query(qpoint), d)", ) ) vespa_docker = VespaDocker() app = vespa_docker.deploy(application_package=app_package) ``` from vespa.package import ApplicationPackage, Field, RankProfile from vespa.deployment import VespaDocker from vespa.io import VespaResponse app_package = ApplicationPackage(name="neighbors") app_package.schema.add_fields( Field(name="point", type="tensor(d[3])", indexing=["attribute", "summary"]) ) app_package.schema.add_rank_profile( RankProfile( name="max_distance", inputs=\[("query(qpoint)", "tensor(d[3])")\], first_phase="euclidean_distance(attribute(point), query(qpoint), d)", ) ) vespa_docker = VespaDocker() app = vespa_docker.deploy(application_package=app_package) ``` Waiting for configuration server, 0/300 seconds... Waiting for configuration server, 5/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 0/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 5/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 10/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 15/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 20/300 seconds... 
Using plain http against endpoint http://localhost:8080/ApplicationStatus Waiting for application status, 25/300 seconds... Using plain http against endpoint http://localhost:8080/ApplicationStatus Application is up! Finished deployment. ``` Feed points in 3D space using a 3-dimensional [indexed tensor](https://docs.vespa.ai/en/tensor-user-guide.html). Pyvespa feeds using the [/document/v1/ API](https://docs.vespa.ai/en/reference/document-v1-api-reference.html), refer to [document format](https://docs.vespa.ai/en/reference/document-json-format.html): In \[15\]: Copied! ``` def get_feed(field_name): return [ {"id": 0, "fields": {field_name: [0.0, 1.0, 2.0]}}, {"id": 1, "fields": {field_name: [1.0, 2.0, 3.0]}}, {"id": 2, "fields": {field_name: [2.0, 3.0, 4.0]}}, ] with app.syncio(connections=1) as session: for u in get_feed("point"): response: VespaResponse = session.update_data( data_id=u["id"], schema="neighbors", fields=u["fields"], create=True ) if not response.is_successful(): print( "Update failed for document {}".format(u["id"]) + " with status code {}".format(response.status_code) + " with response {}".format(response.get_json()) ) ``` **Note:** The feed above uses [create-if-nonexistent](https://docs.vespa.ai/en/document-v1-api-guide.html#create-if-nonexistent), i.e. update a document, create it if it does not exist.
Later in this notebook we will add a field and update it, so using an update to feed data makes it easier.

Query from origo using [YQL](https://docs.vespa.ai/en/query-language.html). The rank profile will rank the most distant points highest, here `sqrt(2*2 + 3*3 + 4*4) = 5.385`:

In \[16\]:

```
import json
from vespa.io import VespaQueryResponse

result: VespaQueryResponse = app.query(
    body={
        "yql": "select point from neighbors where true",
        "input.query(qpoint)": "[0.0, 0.0, 0.0]",
        "ranking.profile": "max_distance",
        "presentation.format.tensors": "short-value",
    }
)
if not result.is_successful():
    print(
        "Query failed with status code {}".format(result.status_code)
        + " with response {}".format(result.get_json())
    )
    raise Exception("Query failed")
if len(result.hits) != 3:
    raise Exception("Expected 3 hits, got {}".format(len(result.hits)))
print(json.dumps(result.hits, indent=4))
```

```
[
    {
        "id": "index:neighbors_content/0/c81e728dfde15fa4e8dfb3d3",
        "relevance": 5.385164807134504,
        "source": "neighbors_content",
        "fields": {
            "point": [
                2.0,
                3.0,
                4.0
            ]
        }
    },
    {
        "id": "index:neighbors_content/0/c4ca4238db266f395150e961",
        "relevance": 3.7416573867739413,
        "source": "neighbors_content",
        "fields": {
            "point": [
                1.0,
                2.0,
                3.0
            ]
        }
    },
    {
        "id": "index:neighbors_content/0/cfcd20845b10b1420c6cdeca",
        "relevance": 2.23606797749979,
        "source": "neighbors_content",
        "fields": {
            "point": [
                0.0,
                1.0,
                2.0
            ]
        }
    }
]
```

Query from `[1.0, 2.0, 2.9]` - find that `[2.0, 3.0, 4.0]` is most distant:

In \[17\]:

```
result = app.query(
    body={
        "yql": "select point from neighbors where true",
        "input.query(qpoint)": "[1.0, 2.0, 2.9]",
        "ranking.profile": "max_distance",
        "presentation.format.tensors": "short-value",
    }
)
print(json.dumps(result.hits, indent=4))
```

```
[
    {
        "id": "index:neighbors_content/0/c81e728dfde15fa4e8dfb3d3",
        "relevance": 1.7916472308265357,
        "source": "neighbors_content",
        "fields": {
            "point": [
                2.0,
                3.0,
                4.0
            ]
        }
    },
    {
        "id": "index:neighbors_content/0/cfcd20845b10b1420c6cdeca",
        "relevance": 1.6763055154708881,
        "source": "neighbors_content",
        "fields": {
            "point": [
                0.0,
                1.0,
                2.0
            ]
        }
    },
    {
        "id": "index:neighbors_content/0/c4ca4238db266f395150e961",
        "relevance": 0.09999990575011103,
        "source": "neighbors_content",
        "fields": {
            "point": [
                1.0,
                2.0,
                3.0
            ]
        }
    }
]
```

### Nearest neighbor[¶](#nearest-neighbor)

The [nearestNeighbor](https://docs.vespa.ai/en/reference/query-language-reference.html#nearestneighbor) query operator calculates distances between points in vector space. Here, we are using the default distance metric (euclidean), as it is not specified. The [closeness()](https://docs.vespa.ai/en/reference/rank-features.html) rank feature can be used to rank results - add a new rank profile:

In \[18\]:
```
app_package.schema.add_rank_profile(
    RankProfile(
        name="nearest_neighbor",
        inputs=[("query(qpoint)", "tensor(d[3])")],
        first_phase="closeness(field, point)",
    )
)
app = vespa_docker.deploy(application_package=app_package)
```

```
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Application is up!
Finished deployment.
```

Read more in [nearest neighbor search](https://docs.vespa.ai/en/nearest-neighbor-search.html).

Query using the nearestNeighbor query operator:

In \[19\]:
```
result = app.query(
    body={
        "yql": "select point from neighbors where {targetHits: 3}nearestNeighbor(point, qpoint)",
        "input.query(qpoint)": "[1.0, 2.0, 2.9]",
        "ranking.profile": "nearest_neighbor",
        "presentation.format.tensors": "short-value",
    }
)
print(json.dumps(result.hits, indent=4))
```

```
[
    {
        "id": "index:neighbors_content/0/c4ca4238db266f395150e961",
        "relevance": 0.9090909879069752,
        "source": "neighbors_content",
        "fields": {
            "point": [
                1.0,
                2.0,
                3.0
            ]
        }
    },
    {
        "id": "index:neighbors_content/0/cfcd20845b10b1420c6cdeca",
        "relevance": 0.37364941905256455,
        "source": "neighbors_content",
        "fields": {
            "point": [
                0.0,
                1.0,
                2.0
            ]
        }
    },
    {
        "id": "index:neighbors_content/0/c81e728dfde15fa4e8dfb3d3",
        "relevance": 0.35821144946644456,
        "source": "neighbors_content",
        "fields": {
            "point": [
                2.0,
                3.0,
                4.0
            ]
        }
    }
]
```

### Nearest neighbor - angular[¶](#nearest-neighbor-angular)

So far, we have used the default [distance-metric](https://docs.vespa.ai/en/nearest-neighbor-search.html#distance-metrics-for-nearest-neighbor-search), which is euclidean - now try with another. Add a new field with the "angular" distance metric:

In \[20\]:
```
app_package.schema.add_fields(
    Field(
        name="point_angular",
        type="tensor(d[3])",
        indexing=["attribute", "summary"],
        attribute=["distance-metric: angular"],
    )
)
app_package.schema.add_rank_profile(
    RankProfile(
        name="nearest_neighbor_angular",
        inputs=[("query(qpoint)", "tensor(d[3])")],
        first_phase="closeness(field, point_angular)",
    )
)
app = vespa_docker.deploy(application_package=app_package)
```

```
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 10/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Application is up!
Finished deployment.
```

Feed the same data to the `point_angular` field:

In \[21\]:
```
with app.syncio(connections=1) as session:
    for u in get_feed("point_angular"):
        response: VespaResponse = session.update_data(
            data_id=u["id"], schema="neighbors", fields=u["fields"]
        )
        if not response.is_successful():
            print(
                "Update failed for document {}".format(u["id"])
                + " with status code {}".format(response.status_code)
                + " with response {}".format(response.get_json())
            )
```

Observe that the documents now have *two* vectors. Notice that we pass [native Vespa document v1 API parameters](https://docs.vespa.ai/en/reference/document-v1-api-reference.html) to reduce the tensor verbosity.

In \[24\]:

```
from vespa.io import VespaResponse

response: VespaResponse = app.get_data(
    schema="neighbors", data_id=0, **{"format.tensors": "short-value"}
)
print(json.dumps(response.get_json(), indent=4))
```

```
{
    "pathId": "/document/v1/neighbors/neighbors/docid/0",
    "id": "id:neighbors:neighbors::0",
    "fields": {
        "point": [
            0.0,
            1.0,
            2.0
        ],
        "point_angular": [
            0.0,
            1.0,
            2.0
        ]
    }
}
```

In \[25\]:
```
result = app.query(
    body={
        "yql": "select point_angular from neighbors where {targetHits: 3}nearestNeighbor(point_angular, qpoint)",
        "input.query(qpoint)": "[1.0, 2.0, 2.9]",
        "ranking.profile": "nearest_neighbor_angular",
        "presentation.format.tensors": "short-value",
    }
)
print(json.dumps(result.hits, indent=4))
```

```
[
    {
        "id": "index:neighbors_content/0/c4ca4238db266f395150e961",
        "relevance": 0.983943389010042,
        "source": "neighbors_content",
        "fields": {
            "point_angular": [
                1.0,
                2.0,
                3.0
            ]
        }
    },
    {
        "id": "index:neighbors_content/0/c81e728dfde15fa4e8dfb3d3",
        "relevance": 0.9004871017951954,
        "source": "neighbors_content",
        "fields": {
            "point_angular": [
                2.0,
                3.0,
                4.0
            ]
        }
    },
    {
        "id": "index:neighbors_content/0/cfcd20845b10b1420c6cdeca",
        "relevance": 0.7638041096953281,
        "source": "neighbors_content",
        "fields": {
            "point_angular": [
                0.0,
                1.0,
                2.0
            ]
        }
    }
]
```

In the output above, observe the difference in "relevance" compared to the query using `'ranking.profile': 'nearest_neighbor'` above - this is the difference in `closeness()` using different distance metrics.

## Next steps[¶](#next-steps)

- Try the [multi-vector-indexing](https://vespa-engine.github.io/pyvespa/examples/multi-vector-indexing.md) notebook to explore using an HNSW index for *approximate* nearest neighbor search.
- Explore using the [distance()](https://docs.vespa.ai/en/reference/rank-features.html) rank feature - this should give the same results as the ranking expressions using `euclidean_distance` above.
- `label` is useful when you have multiple vector fields - read more about the [nearestNeighbor](https://docs.vespa.ai/en/reference/query-language-reference.html#nearestneighbor) query operator.

## Cleanup[¶](#cleanup)

In \[ \]:
```
vespa_docker.container.stop()
vespa_docker.container.remove()
```

# RAG Blueprint tutorial[¶](#rag-blueprint-tutorial)

Many of our users use Vespa to power large-scale RAG applications. This blueprint aims to exemplify many of the best practices we have learned while supporting these users.

While many RAG tutorials exist, this blueprint provides a customizable template that:

- Can [(auto)scale](https://docs.vespa.ai/en/cloud/autoscaling.html) with your data size and/or query load.
- Is fast and [production grade](https://docs.vespa.ai/en/cloud/production-deployment.html).
- Enables you to build RAG applications with state-of-the-art quality.

This tutorial will show how we can develop a *high-quality* RAG application with an evaluation-driven mindset, while being a resource you can revisit for making informed choices for your own use case.

We will guide you through the following steps:

1. [Installing dependencies](#installing-dependencies)
1. [Cloning the RAG Blueprint](#cloning-the-rag-blueprint)
1. [Inspecting the RAG Blueprint](#inspecting-the-rag-blueprint)
1. [Deploying to Vespa Cloud](#deploying-to-vespa-cloud)
1. [Our use case](#our-use-case)
1. [Data modeling](#data-modeling)
1. [Structuring your Vespa application](#structuring-your-vespa-application)
1. [Configuring match-phase (retrieval)](#configuring-match-phase-retrieval)
1. [First-phase ranking](#first-phase-ranking)
1. [Second-phase ranking](#second-phase-ranking)
1. [(Optional) Global-phase reranking](#optional-global-phase-reranking)

All the accompanying code can be found in our [sample app](https://github.com/vespa-engine/sample-apps/tree/master/rag-blueprint) repo, but we will also clone the repo and run the code in this notebook. Some of the python scripts from the sample app will be adapted and shown inline in this notebook instead of running them separately.
Each step will contain reasoning behind the choices and design of the blueprint, as well as pointers for customizing to your own application. This is not a **'Deploy RAG in 5 minutes'** tutorial (although you *can* technically do that by just running the notebook). The focus is on providing you with the insights and tools to apply this to your own use case. We therefore suggest taking your time to look at the code in the sample app, and run the described steps.

Here is an overview of the retrieval and ranking pipeline we will build in this tutorial:

## Installing dependencies[¶](#installing-dependencies)

In \[1\]:

```
!pip3 install "pyvespa>=0.58.0" vespacli scikit-learn lightgbm pandas
```

Note the quotes around the version specifier - without them, the shell interprets `>` as output redirection.

## Cloning the RAG Blueprint[¶](#cloning-the-rag-blueprint)

Although you *could* define all components of the application with python code only from pyvespa, this would go against our advice for an application of this size; see the [Advanced Configuration](https://vespa-engine.github.io/pyvespa/advanced-configuration.md) notebook for a guide if you want to do that. Here, we will use pyvespa to deploy an application package from the existing files.

Let us start by cloning the RAG Blueprint application from the [Vespa sample-apps repository](https://github.com/vespa-engine/sample-apps/tree/master/rag-blueprint).

In \[1\]:

```
# Clone the RAG Blueprint sample application
!git clone --depth 1 --filter=blob:none --sparse https://github.com/vespa-engine/sample-apps.git src && cd src && git sparse-checkout set rag-blueprint
```

```
Cloning into 'src'...
remote: Enumerating objects: 640, done.
remote: Counting objects: 100% (640/640), done.
remote: Compressing objects: 100% (350/350), done.
remote: Total 640 (delta 7), reused 557 (delta 5), pack-reused 0 (from 0)
Receiving objects: 100% (640/640), 62.63 KiB | 1.01 MiB/s, done.
Resolving deltas: 100% (7/7), done.
remote: Enumerating objects: 15, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 15 (delta 2), reused 8 (delta 2), pack-reused 0 (from 0)
Receiving objects: 100% (15/15), 92.91 KiB | 318.00 KiB/s, done.
Resolving deltas: 100% (2/2), done.
Updating files: 100% (15/15), done.
remote: Enumerating objects: 37, done.
remote: Counting objects: 100% (37/37), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 37 (delta 8), reused 21 (delta 6), pack-reused 0 (from 0)
Receiving objects: 100% (37/37), 111.45 KiB | 401.00 KiB/s, done.
Resolving deltas: 100% (8/8), done.
Updating files: 100% (37/37), done.
```

## Inspecting the RAG Blueprint[¶](#inspecting-the-rag-blueprint)

First, let's examine the structure of the RAG Blueprint application we just cloned:

In \[2\]:

```
from pathlib import Path

def tree(
    root: str | Path = ".", *, show_hidden: bool = False, max_depth: int | None = None
) -> str:
    """
    Return a Unix-style 'tree' listing for *root*.

    Parameters
    ----------
    root : str | Path
        Directory to walk (default: ".")
    show_hidden : bool
        Include dotfiles and dot-dirs? (default: False)
    max_depth : int | None
        Limit recursion depth; None = no limit.

    Returns
    -------
    str
        A newline-joined string identical to `tree` output.
    """
    root_path = Path(root).resolve()
    lines = [root_path.as_posix()]

    def _walk(current: Path, prefix: str = "", depth: int = 0) -> None:
        if max_depth is not None and depth >= max_depth:
            return
        entries = sorted(
            (e for e in current.iterdir() if show_hidden or not e.name.startswith(".")),
            key=lambda p: (not p.is_dir(), p.name.lower()),
        )
        last = len(entries) - 1
        for idx, entry in enumerate(entries):
            connector = "└── " if idx == last else "├── "
            lines.append(f"{prefix}{connector}{entry.name}")
            if entry.is_dir():
                extension = "    " if idx == last else "│   "
                _walk(entry, prefix + extension, depth + 1)

    _walk(root_path)
    return "\n".join(lines)
```

In \[3\]:
```
# Let's explore the RAG Blueprint application structure
print(tree("src/rag-blueprint"))
```

```
/Users/thomas/Repos/pyvespa/docs/sphinx/source/examples/src/rag-blueprint
├── app
│   ├── models
│   │   └── lightgbm_model.json
│   ├── schemas
│   │   ├── doc
│   │   │   ├── base-features.profile
│   │   │   ├── collect-second-phase.profile
│   │   │   ├── collect-training-data.profile
│   │   │   ├── learned-linear.profile
│   │   │   ├── match-only.profile
│   │   │   └── second-with-gbdt.profile
│   │   └── doc.sd
│   ├── search
│   │   └── query-profiles
│   │       ├── deepresearch-with-gbdt.xml
│   │       ├── deepresearch.xml
│   │       ├── hybrid-with-gbdt.xml
│   │       ├── hybrid.xml
│   │       ├── rag-with-gbdt.xml
│   │       └── rag.xml
│   └── services.xml
├── dataset
│   └── docs.jsonl
├── eval
│   ├── output
│   │   ├── Vespa-training-data_match_first_phase_20250623_133241.csv
│   │   ├── Vespa-training-data_match_first_phase_20250623_133241_logreg_coefficients.txt
│   │   ├── Vespa-training-data_match_rank_second_phase_20250623_135819.csv
│   │   └── Vespa-training-data_match_rank_second_phase_20250623_135819_feature_importance.csv
│   ├── collect_pyvespa.py
│   ├── evaluate_match_phase.py
│   ├── evaluate_ranking.py
│   ├── pyproject.toml
│   ├── README.md
│   ├── resp.json
│   ├── train_lightgbm.py
│   └── train_logistic_regression.py
├── queries
│   ├── queries.json
│   └── test_queries.json
├── deploy-locally.md
├── generation.md
├── query-profiles.md
├── README.md
└── relevance.md
```

We can see that the RAG Blueprint includes a complete application package with:

- `schemas/doc.sd` - The document schema with chunking and embeddings
- `schemas/doc/*.profile` - Ranking profiles for collecting training data, first-phase ranking, and second-phase ranking
- `services.xml` - Services configuration with embedder and LLM integration
- `search/query-profiles/*.xml` - Pre-configured query profiles for different use cases
- `models/` - Pre-trained ranking models

## Deploying to Vespa Cloud[¶](#deploying-to-vespa-cloud)

### Create a free trial[¶](#create-a-free-trial)

Create a tenant from [here](https://vespa.ai/free-trial/). The trial includes $300 credit. Take note of your tenant name, and input it below.

In \[5\]:

```
from vespa.deployment import VespaCloud
from vespa.application import Vespa
from pathlib import Path
import os
import json
```

In \[6\]:

```
VESPA_TENANT_NAME = "vespa-team"  # Replace with your tenant name
```

Here, set your desired application name. (It will be created in later steps.) Note that the application name cannot contain underscore `_`.

In \[7\]:

```
VESPA_APPLICATION_NAME = "rag-blueprint"  # Replace with your application name
VESPA_SCHEMA_NAME = "doc"  # The RAG Blueprint uses the 'doc' schema
```

In \[8\]:

```
repo_root = Path("src/rag-blueprint")
application_root = repo_root / "app"
```

Note, you could also enable a token endpoint for easier connection after deployment, see [Authenticating to Vespa Cloud](https://vespa-engine.github.io/pyvespa/authenticating-to-vespa-cloud.md) for details. We will stick to the default mTLS key/cert authentication for this notebook.

### Adding secret to Vespa Cloud Secret Store[¶](#adding-secret-to-vespa-cloud-secret-store)

In order to use the LLM integration, you need to add your OpenAI API key to the Vespa Cloud [Secret Store](https://docs.vespa.ai/en/cloud/security/secret-store.html). Then, we can reference this secret in our `services.xml` file, so that Vespa can use it to access the OpenAI API.
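As a sketch of what that reference can look like in `services.xml` (the vault and secret names below are the ones used in this notebook; consult the Secret Store documentation for the exact element names for your Vespa version):

```xml
<!-- Inside the <container> cluster: expose the Secret Store secret
     "openai-dev" from vault "sample-apps" under the local name
     "openai-api-key", which other components can then reference. -->
<secrets>
    <openai-api-key vault="sample-apps" name="openai-dev"/>
</secrets>
```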
Below we have added a vault called `sample-apps` and a secret named `openai-dev` that contains the OpenAI API key. Make sure that the vault and secret names match the ones in the `services.xml` file.

Let us first take a look at the original `services.xml` file, which contains the configuration for the Vespa application services, including the LLM integration and embedder.

!!! note
    It is also possible to define the services.xml configuration in python code, see [Advanced Configuration](https://vespa-engine.github.io/pyvespa/advanced-configuration.md).

In \[21\]:

````
from IPython.display import display, Markdown

def display_md(text: str, tag: str = "txt"):
    text = text.rstrip()
    md = f"""```{tag}
{text}
```"""
    display(Markdown(md))

services_content = (application_root / "services.xml").read_text()
display_md(services_content, "xml")
````

*(The rendered `services.xml` output is not reproduced here; it references the `openai-api-key` secret, a `token_embeddings` embedder with `search_query:`/`search_document:` prefixes and a max length of 8192, an `openai` LLM client, and 2 content nodes.)*

## Deploy the application to Vespa Cloud[¶](#deploy-the-application-to-vespa-cloud)

Now let's deploy the RAG Blueprint application to Vespa Cloud:

In \[10\]:

```
# This is only needed for CI.
VESPA_TEAM_API_KEY = os.getenv("VESPA_TEAM_API_KEY", None)
```

In \[11\]:

```
vespa_cloud = VespaCloud(
    tenant=VESPA_TENANT_NAME,
    application=VESPA_APPLICATION_NAME,
    key_content=VESPA_TEAM_API_KEY,
    application_root=application_root,
)
```

```
Setting application...
Running: vespa config set application vespa-team.rag-blueprint.default
Setting target cloud...
Running: vespa config set target cloud
Api-key found for control plane access. Using api-key.
```

Now, we will deploy the application to Vespa Cloud. This will take a few minutes, so feel free to skip ahead to the next section while waiting for the deployment to complete.

In \[12\]:

```
# Deploy the application
app: Vespa = vespa_cloud.deploy(disk_folder=application_root)
```

```
Deployment started in run 85 of dev-aws-us-east-1c for vespa-team.rag-blueprint. This may take a few minutes the first time.
INFO [09:40:36] Deploying platform version 8.586.25 and application dev build 85 for dev-aws-us-east-1c of default ...
INFO [09:40:36] Using CA signed certificate version 5
INFO [09:40:43] Session 379704 for tenant 'vespa-team' prepared and activated.
INFO [09:40:43] ######## Details for all nodes ########
INFO [09:40:43] h125699b.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [09:40:43] --- platform vespa/cloud-tenant-rhel8:8.586.25
INFO [09:40:43] --- storagenode on port 19102 has config generation 379704, wanted is 379704
INFO [09:40:43] --- searchnode on port 19107 has config generation 379704, wanted is 379704
INFO [09:40:43] --- distributor on port 19111 has config generation 379699, wanted is 379704
INFO [09:40:43] --- metricsproxy-container on port 19092 has config generation 379704, wanted is 379704
INFO [09:40:43] h125755a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [09:40:43] --- platform vespa/cloud-tenant-rhel8:8.586.25
INFO [09:40:43] --- container on port 4080 has config generation 379699, wanted is 379704
INFO [09:40:43] --- metricsproxy-container on port 19092 has config generation 379704, wanted is 379704
INFO [09:40:43] h97530b.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [09:40:43] --- platform vespa/cloud-tenant-rhel8:8.586.25
INFO [09:40:43] --- logserver-container on port 4080 has config generation 379704, wanted is 379704
INFO [09:40:43] --- metricsproxy-container on port 19092 has config generation 379704, wanted is 379704
INFO [09:40:43] h119190c.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [09:40:43] --- platform vespa/cloud-tenant-rhel8:8.586.25
INFO [09:40:43] --- container-clustercontroller on port 19050 has config generation 379699, wanted is 379704
INFO [09:40:43] --- metricsproxy-container on port 19092 has config generation 379699, wanted is 379704
INFO [09:40:51] Found endpoints:
INFO [09:40:51] - dev.aws-us-east-1c
INFO [09:40:51] |-- https://fe5fe13c.fe19121d.z.vespa-app.cloud/ (cluster 'default')
INFO [09:40:51] Deployment of new application revision complete!
Only region: aws-us-east-1c available in dev environment.
Found mtls endpoint for default
URL: https://fe5fe13c.fe19121d.z.vespa-app.cloud/
Application is up!
```

## Feed Sample Data[¶](#feed-sample-data)

The RAG Blueprint comes with sample data. Let's load and feed it to test our deployment:

In \[16\]:

```
doc_file = repo_root / "dataset" / "docs.jsonl"
with open(doc_file, "r") as f:
    docs = [json.loads(line) for line in f.readlines()]
```

In \[17\]:
``` docs[:2] ``` docs[:2] Out\[17\]: ```` [{'put': 'id:doc:doc::1', 'fields': {'created_timestamp': 1675209600, 'modified_timestamp': 1675296000, 'text': '# SynapseCore Module: Custom Attention Implementation\n\n```python\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nclass CustomAttention(nn.Module):\n def __init__(self, hidden_dim):\n super(CustomAttention, self).__init__()\n self.hidden_dim = hidden_dim\n self.query_layer = nn.Linear(hidden_dim, hidden_dim)\n self.key_layer = nn.Linear(hidden_dim, hidden_dim)\n self.value_layer = nn.Linear(hidden_dim, hidden_dim)\n # More layers and logic here\n\n def forward(self, query_input, key_input, value_input, mask=None):\n # Q, K, V projections\n Q = self.query_layer(query_input)\n K = self.key_layer(key_input)\n V = self.value_layer(value_input)\n\n # Scaled Dot-Product Attention\n attention_scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.hidden_dim ** 0.5)\n if mask is not None:\n attention_scores = attention_scores.masked_fill(mask == 0, -1e9)\n \n attention_probs = F.softmax(attention_scores, dim=-1)\n context_vector = torch.matmul(attention_probs, V)\n return context_vector, attention_probs\n\n# Example Usage:\n# attention_module = CustomAttention(hidden_dim=512)\n# output, probs = attention_module(q_tensor, k_tensor, v_tensor)\n```\n\n## Design Notes:\n- Optimized for speed with batched operations.\n- Includes optional masking for variable sequence lengths.\n## ', 'favorite': True, 'last_opened_timestamp': 1717308000, 'open_count': 25, 'title': 'custom_attention_impl.py.md', 'id': '1'}}, {'put': 'id:doc:doc::2', 'fields': {'created_timestamp': 1709251200, 'modified_timestamp': 1709254800, 'text': "# YC Workshop Notes: Scaling B2B Sales (W25)\nDate: 2025-03-01\nSpeaker: [YC Partner Name]\n\n## Key Takeaways:\n1. **ICP Definition is Crucial:** Don't try to sell to everyone. 
Narrow down your Ideal Customer Profile.\n - Characteristics: Industry, company size, pain points, decision-maker roles.\n2. **Outbound Strategy:**\n - Personalized outreach > Mass emails.\n - Tools mentioned: Apollo.io, Outreach.io.\n - Metrics: Open rates, reply rates, meeting booked rates.\n3. **Sales Process Stages:**\n - Prospecting -> Qualification -> Demo -> Proposal -> Negotiation -> Close.\n - Define clear entry/exit criteria for each stage.\n4. **Value Proposition:** Clearly articulate how you solve the customer's pain and deliver ROI.\n5. **Early Hires:** First sales hire should be a 'hunter-farmer' hybrid if possible, or a strong individual contributor.\n\n## Action Items for SynapseFlow:\n- [ ] Refine ICP based on beta user feedback.\n- [ ] Experiment with a small, targeted outbound campaign for 2 specific verticals.\n- [ ] Draft initial sales playbook outline.\n## ", 'favorite': True, 'last_opened_timestamp': 1717000000, 'open_count': 12, 'title': 'yc_b2b_sales_workshop_notes.md', 'id': '2'}}] ```` In \[18\]: Copied! 
``` vespa_feed = [] for doc in docs: vespa_doc = doc.copy() vespa_doc["id"] = doc["fields"]["id"] vespa_doc.pop("put") vespa_feed.append(vespa_doc) vespa_feed[:2] ``` vespa_feed = [] for doc in docs: vespa_doc = doc.copy() vespa_doc["id"] = doc["fields"]["id"] vespa_doc.pop("put") vespa_feed.append(vespa_doc) vespa_feed[:2] Out\[18\]: ```` [{'fields': {'created_timestamp': 1675209600, 'modified_timestamp': 1675296000, 'text': '# SynapseCore Module: Custom Attention Implementation\n\n```python\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nclass CustomAttention(nn.Module):\n def __init__(self, hidden_dim):\n super(CustomAttention, self).__init__()\n self.hidden_dim = hidden_dim\n self.query_layer = nn.Linear(hidden_dim, hidden_dim)\n self.key_layer = nn.Linear(hidden_dim, hidden_dim)\n self.value_layer = nn.Linear(hidden_dim, hidden_dim)\n # More layers and logic here\n\n def forward(self, query_input, key_input, value_input, mask=None):\n # Q, K, V projections\n Q = self.query_layer(query_input)\n K = self.key_layer(key_input)\n V = self.value_layer(value_input)\n\n # Scaled Dot-Product Attention\n attention_scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.hidden_dim ** 0.5)\n if mask is not None:\n attention_scores = attention_scores.masked_fill(mask == 0, -1e9)\n \n attention_probs = F.softmax(attention_scores, dim=-1)\n context_vector = torch.matmul(attention_probs, V)\n return context_vector, attention_probs\n\n# Example Usage:\n# attention_module = CustomAttention(hidden_dim=512)\n# output, probs = attention_module(q_tensor, k_tensor, v_tensor)\n```\n\n## Design Notes:\n- Optimized for speed with batched operations.\n- Includes optional masking for variable sequence lengths.\n## ', 'favorite': True, 'last_opened_timestamp': 1717308000, 'open_count': 25, 'title': 'custom_attention_impl.py.md', 'id': '1'}, 'id': '1'}, {'fields': {'created_timestamp': 1709251200, 'modified_timestamp': 1709254800, 'text': "# YC Workshop Notes: 
Scaling B2B Sales (W25)\nDate: 2025-03-01\nSpeaker: [YC Partner Name]\n\n## Key Takeaways:\n1. **ICP Definition is Crucial:** Don't try to sell to everyone. Narrow down your Ideal Customer Profile.\n - Characteristics: Industry, company size, pain points, decision-maker roles.\n2. **Outbound Strategy:**\n - Personalized outreach > Mass emails.\n - Tools mentioned: Apollo.io, Outreach.io.\n - Metrics: Open rates, reply rates, meeting booked rates.\n3. **Sales Process Stages:**\n - Prospecting -> Qualification -> Demo -> Proposal -> Negotiation -> Close.\n - Define clear entry/exit criteria for each stage.\n4. **Value Proposition:** Clearly articulate how you solve the customer's pain and deliver ROI.\n5. **Early Hires:** First sales hire should be a 'hunter-farmer' hybrid if possible, or a strong individual contributor.\n\n## Action Items for SynapseFlow:\n- [ ] Refine ICP based on beta user feedback.\n- [ ] Experiment with a small, targeted outbound campaign for 2 specific verticals.\n- [ ] Draft initial sales playbook outline.\n## ", 'favorite': True, 'last_opened_timestamp': 1717000000, 'open_count': 12, 'title': 'yc_b2b_sales_workshop_notes.md', 'id': '2'}, 'id': '2'}] ```` Now, let us feed the data to Vespa. If you have a large dataset, you could also do this async, with `feed_async_iterable()`, see [Feeding Vespa cloud](https://vespa-engine.github.io/pyvespa/examples/feed_performance_cloud.md) for a detailed comparison. In \[19\]: Copied! 
```
from vespa.io import VespaResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(
            f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}"
        )


# Feed data into Vespa synchronously
app.feed_iterable(vespa_feed, schema=VESPA_SCHEMA_NAME, callback=callback)
```

## Test a query to the Vespa application[¶](#test-a-query-to-the-vespa-application) Let us test some queries to see whether the application works as expected. We will use one of the pre-configured query profiles, which we will explain in more detail later. For now, let us just confirm that we get some results back from the application. In \[20\]:
``` query = "What is SynapseFlows strategy" body = { "query": query, "queryProfile": "hybrid", "hits": 2, } with app.syncio() as sess: response = sess.query(body) response.json ``` query = "What is SynapseFlows strategy" body = { "query": query, "queryProfile": "hybrid", "hits": 2, } with app.syncio() as sess: response = sess.query(body) response.json Out\[20\]: ``` {'root': {'id': 'toplevel', 'relevance': 1.0, 'fields': {'totalCount': 100}, 'coverage': {'coverage': 100, 'documents': 100, 'full': True, 'nodes': 1, 'results': 1, 'resultsFull': 1}, 'children': [{'id': 'index:content/0/e369853debf684767dff1f16', 'relevance': 1.7111883427143333, 'source': 'content', 'fields': {'sddocname': 'doc', 'chunks_top3': ['# YC Application Draft Snippets - SynapseFlow (Late 2024)\n\n**Q: Describe what your company does in 50 characters or less.**\n- AI model deployment made easy for developers.\n- Effortless MLOps for startups.\n- Deploy ML models in minutes, not weeks.\n\n**Q: What is your company going to make?**\nSynapseFlow is building a PaaS solution that radically simplifies the deployment, management, and scaling of machine learning models. We provide a developer-first API and intuitive UI that abstracts away the complexities of MLOps infrastructure (Kubernetes, model servers, monitoring), allowing data scientists and developers ', "to focus on building models, not wrestling with ops. Our vision is to be the Heroku for AI.\n\n**Q: Why did you pick this idea to work on?**\nAs an AI engineer, I've experienced firsthand the immense friction and time wasted in operationalizing ML models. Existing solutions are often too complex for smaller teams (e.g., full SageMaker/Vertex AI) or lack the flexibility needed for custom model development. 
We believe there's a huge unmet need for a simple, powerful, and affordable MLOps platform.\n\n## (More Q&A drafts, team background notes)"], 'summaryfeatures': {'top_3_chunk_sim_scores': {'type': 'tensor(chunk{})', 'cells': {'0': 0.36166757345199585, '1': 0.21831661462783813}}, 'vespa.summaryFeatures.cached': 0.0}}}, {'id': 'index:content/0/98f13708aca18c358d9d52d0', 'relevance': 1.309791587164871, 'source': 'content', 'fields': {'sddocname': 'doc', 'chunks_top3': ["# Ideas for SynapseFlow Blog Post - 'Demystifying MLOps'\n\n**Target Audience:** Developers, data scientists new to MLOps, product managers.\n**Goal:** Explain what MLOps is, why it's important, and how SynapseFlow helps.\n\n## Outline:\n1. **Introduction: The AI/ML Development Lifecycle is More Than Just Model Training**\n * Analogy: Building a model is like writing code; MLOps is like DevOps for ML.\n2. **What is MLOps? (The Core Pillars)**\n * Data Management (Versioning, Lineage, Quality)\n * Experiment Tracking & Model Versioning\n * CI/CD for ML (Continuous Integration, Continuous Delivery, Continuous Training)\n * Model Deployment & Serving\n * Monitoring & Observability (Performance, Drift, Data Quality)\n * Governance & Reproducibility\n3. **Why is MLOps Hard? (The Challenges)", "**\n * Complexity of the ML lifecycle.\n * Bridging the gap between data science and engineering.\n * Tooling fragmentation.\n * Need for specialized skills.\n4. **How SynapseFlow Addresses These Challenges (Subtle Product Weave-in)**\n * Focus on ease of deployment (our current strength).\n * Streamlined workflow from experiment to production (our vision).\n * (Mention specific features that align with MLOps pillars without being overly salesy).\n5. **Getting Started with MLOps - Practical Tips**\n * Start simple, iterate.\n * Focus on automation early.\n * Choose tools that fit your team's scale and expertise.\n6. 
**Conclusion: MLOps is an Enabler for Realizing AI Value**\n\n## (Draft paragraphs, links to reference articles, potential graphics ideas)"], 'summaryfeatures': {'top_3_chunk_sim_scores': {'type': 'tensor(chunk{})', 'cells': {'0': 0.3064674735069275, '1': 0.29259079694747925}}, 'vespa.summaryFeatures.cached': 0.0}}}]}} ``` And by changing to the `rag` query profile, and adding the `streaming=True` parameter, we can stream the results from the LLM as server-sent events (SSE). In \[21\]: Copied! ``` query = "What is SynapseFlows strategy" body = { "query": query, "queryProfile": "rag", "hits": 2, } resp_string = "" # Adding a string variable to use for asserting the response in CI. with app.syncio() as sess: stream_resp = sess.query( body, streaming=True, ) for line in stream_resp: if line.startswith("data: "): event = json.loads(line[6:]) token = event.get("token", "") resp_string += token print(token, end="") assert len(resp_string) > 10, "Response string should be longer than 10 characters." ``` query = "What is SynapseFlows strategy" body = { "query": query, "queryProfile": "rag", "hits": 2, } resp_string = "" # Adding a string variable to use for asserting the response in CI. with app.syncio() as sess: stream_resp = sess.query( body, streaming=True, ) for line in stream_resp: if line.startswith("data: "): event = json.loads(line[6:]) token = event.get("token", "") resp_string += token print(token, end="") assert len(resp_string) > 10, "Response string should be longer than 10 characters." ``` SynapseFlow's strategy revolves around simplifying the deployment, management, and scaling of machine learning (ML) models through a developer-first platform-as-a-service (PaaS) solution. The key elements of their strategy include: 1. **Developer-Focused Solution:** SynapseFlow aims to provide a user-friendly API and intuitive user interface that abstracts the complexities associated with MLOps infrastructure (such as Kubernetes and model servers). 
This allows developers and data scientists to focus primarily on building models rather than dealing with operational challenges. 2. **Addressing Market Gaps:** The founders identified a significant pain point in the existing MLOps landscape, particularly for smaller teams. Many current solutions are either too complex or not flexible enough for custom model development. SynapseFlow targets this unmet need for a straightforward, powerful, and cost-effective platform. 3. **Vision of Simplified MLOps:** By positioning itself as "the Heroku for AI," SynapseFlow aims to offer an all-in-one solution that streamlines the workflow from experimentation to production, thus enhancing efficiency and speed in ML project deployment. 4. **Education and Support:** Their strategy also includes educational initiatives, as outlined in their blog post ideas targeting developers and product managers new to MLOps. By demystifying MLOps and discussing its challenges and the way SynapseFlow addresses them, they plan to enhance user understanding and adoption of their platform. 5. **Continuous Improvement:** SynapseFlow emphasizes a relentless focus on ease of deployment and improving automation capabilities, suggesting an iterative approach to platform development that responds to user feedback and evolving industry needs. Overall, SynapseFlow's strategy is centered on providing user-friendly solutions that reduce operational complexity, enabling faster deployment of machine learning models and supporting teams in successfully realizing the value of AI. ``` Great, we got some results. The quality is not very good yet, but we will show how to improve it in the next steps. But first, let us explain the use case we are trying to solve with this RAG application. 
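The token-assembly loop above can be factored into a small reusable helper. This is a sketch that assumes the same `data: {...}` event shape with a `token` field as in the notebook's parsing code; the sample lines here are synthetic, not actual Vespa output:

```python
import json


def collect_tokens(sse_lines):
    """Concatenate LLM tokens from server-sent-event lines of the form 'data: {...}'."""
    text = ""
    for line in sse_lines:
        # SSE lines that are not data events (comments, keep-alives) are skipped
        if line.startswith("data: "):
            event = json.loads(line[len("data: "):])
            text += event.get("token", "")
    return text


# Synthetic events for illustration:
lines = ['data: {"token": "Hello"}', ': keep-alive', 'data: {"token": " world"}']
print(collect_tokens(lines))  # → Hello world
```

The same helper works whether you print tokens as they arrive or accumulate them for a final assertion, as the notebook does for CI.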
## Our use case[¶](#our-use-case) The sample use case is a document search application, for a user who wants to get answers and insights quickly from a document collection containing company documents, notes, learning material, training logs. To make the blueprint more realistic, we required a dataset with more structured fields than are commonly found in public datasets. Therefore, we used a Large Language Model (LLM) to generate a custom one. It is a toy example, with only 100 documents, but we think it will illustrate the necessary concepts. You can also feel confident that the blueprint will provide a starting point that can scale as you want, with minimal changes. Below you can see a sample document from the dataset. In \[22\]: Copied! ``` import json docs_file = repo_root / "dataset" / "docs.jsonl" with open(docs_file) as f: docs = [json.loads(line) for line in f] docs[10] ``` import json docs_file = repo_root / "dataset" / "docs.jsonl" with open(docs_file) as f: docs = [json.loads(line) for line in f] docs[10] Out\[22\]: ``` {'put': 'id:doc:doc::11', 'fields': {'created_timestamp': 1698796800, 'modified_timestamp': 1698796800, 'text': "# Journal Entry - 2024-11-01\n\nFeeling the YC pressure cooker, but in a good way. The pace is insane. It reminds me of peaking for a powerlifting meet – everything has to be precise, every session counts, and you're constantly pushing your limits.\n\nThinking about **periodization** in lifting – how you structure macrocycles, mesocycles, and microcycles. Can this apply to startup sprints? We have our big YC Demo Day goal (macro), then maybe 2-week sprints are mesocycles, and daily tasks are microcycles. Need to ensure we're not just redlining constantly but building in phases of intense work, focused development, and even 'deload' (strategic rest/refinement) to avoid burnout and make sustainable progress.\n\n**RPE (Rate of Perceived Exertion)** is another concept. 
In the gym, it helps auto-regulate training based on how you feel. For the startup, maybe we need an RPE check for the team? Are we pushing too hard on a feature that's yielding low returns (high RPE, low ROI)? Can we adjust the 'load' (scope) or 'reps' (iterations) based on team capacity and feedback?\n\nIt's interesting how the discipline and structured thinking from strength training can offer mental models for tackling the chaos of a startup. Both require consistency, grit, and a willingness to fail and learn.\n\n## (More reflections on YC, specific project challenges)", 'favorite': False, 'last_opened_timestamp': 1700000000, 'open_count': 5, 'title': 'journal_2024_11_01_yc_and_lifting.md', 'id': '11'}} ``` In order to evaluate the quality of the RAG application, we also need a set of representative queries, with annotated relevant documents. Crucially, you need a set of representative queries that thoroughly cover your expected use case. More is better, but *some* eval is always better than none. We used `gemini-2.5-pro` to create our queries and relevant document labels. Please check out our [blog post](https://blog.vespa.ai/improving-retrieval-with-llm-as-a-judge/) to learn more about using LLM-as-a-judge. We decided to generate some queries that need several documents to provide a good answer, and some that only need one document. If these queries are representative of the use case, we will show that they can be a great starting point for creating an (initial) ranking expression that can be used for retrieving and ranking candidate documents. But, it can (and should) also be improved, for example by collecting user interaction data, human labeling and/ or using an LLM to generate relevance feedback following the initial ranking expression. In \[ \]: Copied! 
```
queries_file = repo_root / "queries" / "queries.json"
with open(queries_file) as f:
    queries = json.load(f)
queries[10]
```

Out\[ \]:

```
{'query_id': 'alex_q_11',
 'query_text': "Where's that journal entry where I compared YC to powerlifting?",
 'category': 'Navigational - Personal',
 'description': 'Finding a specific personal reflection in his journal.',
 'relevant_document_ids': ['11', '58', '100']}
```

## Data modeling[¶](#data-modeling) Here is the schema that we will use for our sample application. In \[24\]:

```
schema_file = repo_root / "app" / "schemas" / "doc.sd"
schema_content = schema_file.read_text()
display_md(schema_content)
```

``` txt
# Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
schema doc {
    document doc {
        field id type string {
            indexing: summary | attribute
        }
        field title type string {
            indexing: index | summary
            index: enable-bm25
        }
        field text type string {
        }
        field created_timestamp type long {
            indexing: attribute | summary
        }
        field modified_timestamp type long {
            indexing: attribute | summary
        }
        field last_opened_timestamp type long {
            indexing: attribute | summary
        }
        field open_count type int {
            indexing: attribute | summary
        }
        field favorite type bool {
            indexing: attribute | summary
        }
    }
    field title_embedding type tensor<int8>(x[96]) {
        indexing: input title | embed | pack_bits | attribute | index
        attribute {
            distance-metric: hamming
        }
    }
    field chunks type array<string> {
        indexing: input text | chunk fixed-length 1024 | summary | index
        index: enable-bm25
    }
    field chunk_embeddings type tensor<int8>(chunk{}, x[96]) {
        indexing: input text | chunk fixed-length 1024 | embed | pack_bits | attribute | index
        attribute {
            distance-metric: hamming
        }
    }
    fieldset default {
        fields: title, chunks
    }
    document-summary no-chunks {
        summary id {}
        summary title {}
        summary created_timestamp {}
        summary modified_timestamp {}
        summary last_opened_timestamp {}
        summary open_count {}
        summary favorite {}
        summary chunks {}
    }
    document-summary top_3_chunks {
        from-disk
        summary chunks_top3 {
            source: chunks
            select-elements-by: top_3_chunk_sim_scores  # this needs to be added as a summary-feature in the rank-profile
        }
    }
}
```

Keep reading for an explanation and reasoning behind the choices in the schema. ### Picking your searchable unit[¶](#picking-your-searchable-unit) When building a RAG application, your first key decision is choosing the "searchable unit." This is the basic block of information your system will search through and return as context to the LLM. For instance, if you have millions of documents, some hundreds of pages long, what should be your searchable unit?
Consider these points when selecting your searchable unit:

- **Too fine-grained (e.g., individual sentences or very small paragraphs):**
    - Leads to duplication of context and metadata across many small units.
    - May result in units lacking sufficient context for the LLM to make good selections or generate relevant responses.
    - Increases overhead for managing many small document units.
- **Too coarse-grained (e.g., very long chapters or entire large documents):**
    - Can cause performance issues due to the size of the units being processed.
    - May lead to some large documents appearing relevant to too many queries, reducing precision.
    - Embedding a whole, very long document as a single vector reduces retrieval quality.

We recommend erring on the side of slightly larger units:

- LLMs are increasingly capable of handling larger contexts.
- In Vespa, you can index larger units while avoiding data duplication and performance issues, by returning only the most relevant parts.

With Vespa, it is now possible to return only the top-k most relevant chunks of a document, and to include and combine both document-level and chunk-level features in ranking. ### Chunk selection[¶](#chunk-selection) Assume you have chosen the document as your searchable unit. Your documents may then contain text index fields of highly variable lengths. Consider, for example, a corpus of web pages: some might be very long, while the average is well within the recommended size. See [scaling retrieval size](https://docs.vespa.ai/en/performance/sizing-search.html#scaling-retrieval-size) for more details. While we recommend guarding against overly long documents in your feeding pipeline, you probably still do not want to return every chunk of the top-k documents to an LLM for RAG. In Vespa, we now have a solution for this problem.
Below, we show how you can score both documents and individual chunks, and use the chunk score to select the best chunks to return in a summary, instead of returning all chunks belonging to the top-k ranked documents. Compute closeness per chunk in a ranking function, and use `elementwise(bm25(chunks), i, double)` for a per-chunk text signal; see the [rank feature reference](https://docs.vespa.ai/en/reference/rank-features.html#elementwise-bm25). This allows you to pick a large document as the searchable unit, while still addressing the potential drawbacks many encounter, as follows:

- Pick your (larger) document as your searchable unit.
- Chunk the text fields automatically on indexing.
- Embed each chunk (enabled through Vespa's multivector support).
- Calculate chunk-level features (e.g. bm25 and embedding similarity) as well as document-level features, and combine them as you want.
- Limit the chunks that are returned to the ones that are actually relevant context for the LLM.

Vespa also supports automatic [chunking](https://docs.vespa.ai/en/reference/indexing-language-reference.html#converters) in the [indexing language](https://docs.vespa.ai/en/indexing.html).
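To build intuition for what the `chunk fixed-length 1024` indexing expression produces, here is a minimal Python sketch. Note this is an illustration only: Vespa's actual chunker runs inside the indexing pipeline and may handle boundaries differently.

```python
def fixed_length_chunks(text: str, size: int = 1024) -> list:
    """Split text into consecutive chunks of at most `size` characters,
    roughly what `chunk fixed-length 1024` does during indexing."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# A 2500-character document yields two full chunks and one remainder chunk:
chunks = fixed_length_chunks("x" * 2500, 1024)
print([len(c) for c in chunks])  # → [1024, 1024, 452]
```

Each such chunk then gets its own entry in the `chunks` array and its own vector in the mapped `chunk_embeddings` tensor.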
Here are the parts of the schema that define the searchable unit as a document with a text field, automatically chunked into smaller parts of 1024 characters, each of which is embedded and indexed separately:

``` txt
field chunks type array<string> {
    indexing: input text | chunk fixed-length 1024 | summary | index
    index: enable-bm25
}
field chunk_embeddings type tensor<int8>(chunk{}, x[96]) {
    indexing: input text | chunk fixed-length 1024 | embed | pack_bits | attribute | index
    attribute {
        distance-metric: hamming
    }
}
```

In Vespa, we can specify which chunks to return with a summary feature; see the [docs](https://docs.vespa.ai/en/reference/schema-reference.html#select-elements-by) for details. For this blueprint, we will return the top 3 chunks based on the similarity score of the chunk embeddings, which is calculated in the ranking phase. Note that this could be any chunk-level summary feature defined in your rank-profile. Here is how the summary feature is calculated in the rank-profile:

``` txt
# This function unpacks the bits of each dimension of the mapped chunk_embeddings attribute tensor
function chunk_emb_vecs() {
    expression: unpack_bits(attribute(chunk_embeddings))
}

# This function calculates the dot product between the query embedding vector and the chunk embeddings (both are now float) over the x dimension
function chunk_dot_prod() {
    expression: reduce(query(float_embedding) * chunk_emb_vecs(), sum, x)
}

# This function calculates the L2 norm of an input tensor
function vector_norms(t) {
    expression: sqrt(sum(pow(t, 2), x))
}

# Here we calculate cosine similarity by dividing the dot product by the product of the norms of the query embedding and the chunk embeddings
function chunk_sim_scores() {
    expression: chunk_dot_prod() / (vector_norms(chunk_emb_vecs()) * vector_norms(query(float_embedding)))
}

function top_3_chunk_text_scores() {
    expression: top(3, chunk_text_scores())
}

function top_3_chunk_sim_scores() {
    expression: top(3,
chunk_sim_scores())
}

summary-features {
    top_3_chunk_sim_scores
}
```

The ranking expression may seem a bit complex: we chose to embed each chunk independently, store the embeddings in a binarized format, and then unpack them to calculate similarity based on their float representations. For single-dimension dense vector similarity between same-precision embeddings, this can be simplified significantly using the [closeness](https://docs.vespa.ai/en/reference/rank-features.html) convenience feature. Note that we want to use the float representation of the query embedding, and thus also need to convert the binary chunk embeddings to float. After that, we can calculate the similarity score between the query embedding and the chunk embeddings using cosine similarity (the dot product, normalized by the product of the norms of the embeddings). See [ranking expressions](https://docs.vespa.ai/en/reference/ranking-expressions.html#non-primitive-functions) for more details on the `top` function, and other functions available for ranking expressions. Now we can use this summary feature in our document summary to return the top 3 chunks of the document, which will be used as context for the LLM. Note that we can also define a document summary that returns all chunks, which might be useful for another use case, such as deep research.

``` txt
document-summary top_3_chunks {
    from-disk
    summary chunks_top3 {
        source: chunks
        select-elements-by: top_3_chunk_sim_scores  # this needs to be added as a summary-feature in the rank-profile
    }
}
```

### Use multiple text fields, consider multiple embeddings[¶](#use-multiple-text-fields-consider-multiple-embeddings) We recommend indexing different textual content as separate indexes. These can be searched together using [field-sets](https://docs.vespa.ai/en/reference/schema-reference.html#fieldset). In our schema, this is exemplified by the sections below, which define the `title` and `chunks` fields as separate indexed text fields.

``` txt
...
field title type string {
    indexing: index | summary
    index: enable-bm25
}
field chunks type array<string> {
    indexing: input text | chunk fixed-length 1024 | summary | index
    index: enable-bm25
}
```

Whether you should have separate embedding fields depends on whether the added memory usage is justified by the quality improvement you could get from the additional embedding field. We chose to index both a `title_embedding` and a `chunk_embeddings` field for this blueprint, as we minimize cost by using binarized embeddings.

``` txt
field title_embedding type tensor<int8>(x[96]) {
    indexing: input title | embed | pack_bits | attribute | index
    attribute {
        distance-metric: hamming
    }
}
field chunk_embeddings type tensor<int8>(chunk{}, x[96]) {
    indexing: input text | chunk fixed-length 1024 | embed | pack_bits | attribute | index
    attribute {
        distance-metric: hamming
    }
}
```

Indexing several embedding fields may not be worth the cost for you; evaluate whether the cost-quality trade-off is justified for your application. If you have different vector-space representations of your document (e.g. images), indexing them separately is likely worth it, as they will provide signals complementary to the text-based embeddings. ### Model Metadata and Signals Using Structured Fields[¶](#model-metadata-and-signals-using-structured-fields) We recommend modeling metadata and signals as structured fields in your schema. Below are some general recommendations, as well as the implementation in our blueprint schema.
**Metadata** — knowledge about your data: - Authors, publish time, source, links, category, price, … - Usage: filters, ranking, grouping/aggregation - Index only metadata that are strong filters In our blueprint schema, we include these metadata fields to demonstrate these concepts: - `id` - document identifier - `title` - document name/filename for display and text matching - `created_timestamp`, `modified_timestamp` - temporal metadata for filtering and ranking by recency **Signals** — observations about your data: - Popularity, quality, spam probability, click_probability, … - Usage: ranking - Often updated separately via partial updates - Multiple teams can add their own signals independently In our blueprint schema, we include several of these signals: - `last_opened_timestamp` - user engagement signal for personalization - `open_count` - popularity signal indicating document importance - `favorite` - explicit user preference signal, can be used for boosting relevant content These fields are configured as `attribute | summary` to enable efficient filtering, sorting, and grouping operations while being returned in search results. The timestamp fields allow for temporal filtering (e.g., "recent documents") and recency-based ranking, while usage signals like `open_count` and `favorite` can boost frequently accessed or explicitly marked important documents. Consider [parent-child](https://docs.vespa.ai/en/parent-child.html) relationships for low-cardinality metadata. Most large scale RAG application schemas contain at least a hundred structured fields. ## LLM-generation with OpenAI-client[¶](#llm-generation-with-openai-client) Vespa supports both Local LLMs, and any OpenAI-compatible API for LLM generation. For details, see [LLMs in Vespa](https://docs.vespa.ai/en/llms-in-vespa.html) The recommended way to provide an API key is by using the [secret store](https://docs.vespa.ai/en/cloud/security/secret-store.html) in Vespa Cloud. 
To enable this, you need to create a vault (if you don't have one already) and a secret through the [Vespa Cloud console](https://cloud.vespa.ai/). If your vault is named `sample-apps` and contains a secret with the name `openai-api-key`, you would use configuration along these lines in your `services.xml` to set up the OpenAI client to use that secret (see [LLMs in Vespa](https://docs.vespa.ai/en/llms-in-vespa.html) for the exact, version-specific configuration):

```
<component id="openai" class="ai.vespa.llm.clients.OpenAI">
    <config name="ai.vespa.llm.clients.llm-client">
        <apiKeySecretName>openai-api-key</apiKeySecretName>
    </config>
</component>
```

Alternatively, for local deployments, you can set the `X-LLM-API-KEY` header in your query to use the OpenAI client for generation. To test generation using the OpenAI client, post a query that runs the `openai` search chain, with `format=sse`. (Use `format=json` for a streaming json response including both the search hits and the LLM-generated tokens.)

```
vespa query \
    --timeout 60 \
    --header="X-LLM-API-KEY:" \
    yql='select * from doc where userInput(@query) or ({label:"title_label", targetHits:100}nearestNeighbor(title_embedding, embedding)) or ({label:"chunks_label", targetHits:100}nearestNeighbor(chunk_embeddings, embedding))' \
    query="Summarize the key architectural decisions documented for SynapseFlow's v0.2 release." \
    searchChain=openai \
    format=sse \
    hits=5
```

## Structuring your Vespa application[¶](#structuring-your-vespa-application) This section provides recommendations for structuring your Vespa application package. See also the [application package docs](https://docs.vespa.ai/en/application-packages.html) for more details on the application package structure. Note that this is not mandatory, and it might be simpler to start without query profiles and rank profiles, but as you scale out your application, it will be beneficial to have a well-structured application package. Consider the following structure for our application package: In \[25\]:
```
# Let's explore the RAG Blueprint application structure
print(tree("src/rag-blueprint"))
```

```
/Users/thomas/Repos/pyvespa/docs/sphinx/source/examples/src/rag-blueprint
├── app
│   ├── models
│   │   └── lightgbm_model.json
│   ├── schemas
│   │   ├── doc
│   │   │   ├── base-features.profile
│   │   │   ├── collect-second-phase.profile
│   │   │   ├── collect-training-data.profile
│   │   │   ├── learned-linear.profile
│   │   │   ├── match-only.profile
│   │   │   └── second-with-gbdt.profile
│   │   └── doc.sd
│   ├── search
│   │   └── query-profiles
│   │       ├── deepresearch-with-gbdt.xml
│   │       ├── deepresearch.xml
│   │       ├── hybrid-with-gbdt.xml
│   │       ├── hybrid.xml
│   │       ├── rag-with-gbdt.xml
│   │       └── rag.xml
│   ├── security
│   │   └── clients.pem
│   └── services.xml
├── dataset
│   ├── docs.jsonl
│   ├── queries.json
│   └── test_queries.json
├── eval
│   ├── output
│   │   ├── Vespa-training-data_match_first_phase_20250623_133241.csv
│   │   ├── Vespa-training-data_match_first_phase_20250623_133241_logreg_coefficients.txt
│   │   ├── Vespa-training-data_match_rank_second_phase_20250623_135819.csv
│   │   └── Vespa-training-data_match_rank_second_phase_20250623_135819_feature_importance.csv
│   ├── collect_pyvespa.py
│   ├── evaluate_match_phase.py
│   ├── evaluate_ranking.py
│   ├── pyproject.toml
│   ├── README.md
│   ├── resp.json
│   ├── train_lightgbm.py
│   └── train_logistic_regression.py
├── deploy-locally.md
├── generation.md
├── query-profiles.md
├── README.md
└── relevance.md
```

You can see that we have separated the [query profiles](https://docs.vespa.ai/en/query-profiles.html) and [rank profiles](https://docs.vespa.ai/en/ranking.html#rank-profiles) into their own directories. ### Manage queries in query profiles[¶](#manage-queries-in-query-profiles) Query profiles let you maintain collections of query parameters in one file. Clients choose a query profile → the profile sets everything else. This lets us change behavior for a use case without involving clients.
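From the client's perspective, switching behavior is then just a matter of naming a different profile in the request body. The helper below is hypothetical (not part of pyvespa), mirroring the query bodies used earlier in this notebook:

```python
def make_query_body(query: str, profile: str, **overrides) -> dict:
    """Build a /search/ request body where the named query profile supplies
    everything else (YQL, rank profile, hits, timeout, ...).
    Explicit parameters still override the profile's values."""
    body = {"query": query, "queryProfile": profile}
    body.update(overrides)
    return body


print(make_query_body("What is SynapseFlows strategy", "hybrid", hits=2))
# → {'query': 'What is SynapseFlows strategy', 'queryProfile': 'hybrid', 'hits': 2}
```

Changing `"hybrid"` to `"rag"` or `"deepresearch"` is the only client-side change needed to move between use cases.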
Let us take a closer look at 3 of the query profiles in our sample application:

1. `hybrid`
1. `rag`
1. `deepresearch`

### ***hybrid*** query profile[¶](#hybrid-query-profile) This query profile will be the one used by clients for traditional search, where the user is presented a limited number of hits. Our other query profiles will inherit this one (but may override some fields). In \[26\]:

```
qp_dir = repo_root / "app" / "search" / "query-profiles"
hybrid_qp = (qp_dir / "hybrid.xml").read_text()
display_md(hybrid_qp, tag="xml")
```

``` doc embed(@query) embed(@query) -7.798639 13.383840 0.203145 0.159914 0.191867 10.067169 0.153392 select * from %{schema} where userInput(@query) or ({label:"title_label", targetHits:100}nearestNeighbor(title_embedding, embedding)) or ({label:"chunks_label", targetHits:100}nearestNeighbor(chunk_embeddings, embedding)) 10 learned-linear top_3_chunks ```

### ***rag*** query profile[¶](#rag-query-profile) This is the query profile where the `openai` searchChain is added, to generate a response based on the retrieved context. Here, we set some configuration that is specific to this use case. In \[27\]:

```
rag_blueprint_qp = (qp_dir / "rag.xml").read_text()
display_md(rag_blueprint_qp, tag="xml")
```

``` 50 openai sse ```

### ***deepresearch*** query profile[¶](#deepresearch-query-profile) Again, we inherit from the `hybrid` query profile, but override with a `targetHits` value of 10 000 (originally 100), prioritizing recall over latency. We also increase the number of hits to be returned, and increase the timeout to 5 seconds. In \[28\]:
```
deep_qp = (qp_dir / "deepresearch.xml").read_text()
display_md(deep_qp, tag="xml")
```

```
select * from %{schema} where userInput(@query) or ({label:"title_label", targetHits:10000}nearestNeighbor(title_embedding, embedding)) or ({label:"chunks_label", targetHits:10000}nearestNeighbor(chunk_embeddings, embedding)) 100 5s
```

We will leave out the LLM generation for this one, and let an LLM agent on the client side be responsible for using this API call as a tool, and for determining whether enough relevant context has been retrieved to answer. Note that the `targetHits` value set here does not really make sense until your dataset reaches a certain scale.

As we add more rank profiles, we can also inherit the existing query profiles, only overriding the `ranking.profile` field to use a different rank profile. This is what we have done for the `rag-with-gbdt` and `deepresearch-with-gbdt` query profiles, which use the `second-with-gbdt` rank profile instead of the `learned-linear` rank profile.

In \[29\]:

```
rag_gbdt_qp = (qp_dir / "rag-with-gbdt.xml").read_text()
display_md(rag_gbdt_qp, tag="xml")
```

```
50 openai sse
```

### Separating out rank profiles[¶](#separating-out-rank-profiles)

To build a great RAG application, assume you'll need many ranking models. This allows you to bucket-test alternatives continuously and to serve different use cases, including data collection for different phases as well as the rank profiles used in production. Separate common functions/setup into parent rank profiles and use `.profile` files.

## Phased ranking in Vespa[¶](#phased-ranking-in-vespa)

Before we move on, it might be useful to recap Vespa's [phased ranking](https://docs.vespa.ai/en/phased-ranking.html) approach.
Below is a schematic overview of how to think about retrieval and ranking for this RAG blueprint.

Since we are developing this as a tutorial using a small toy dataset, the application can be deployed on a single machine, using a single Docker container, where only one container node and one content node will run. This is obviously not the case for most real-world RAG applications, so this is crucial to keep in mind when you want to scale your application. It is worth noting that parameters such as `targetHits` (for the match phase) and `rerank-count` (for the first and second phase) are applied **per content node**. Also note that the stateless container nodes can be [scaled independently](https://docs.vespa.ai/en/performance/sizing-search.html) to handle increased query load.

## Configuring match-phase (retrieval)[¶](#configuring-match-phase-retrieval)

This section contains important considerations for the retrieval phase of a RAG application in Vespa. The goal of the retrieval phase is to retrieve candidate documents efficiently and maximize recall, without exposing too many documents to ranking.

### Choosing a Retrieval Strategy: Vector, Text, or Hybrid?[¶](#choosing-a-retrieval-strategy-vector-text-or-hybrid)

As you could see from the schema, we create and index both a text representation and a vector representation for each chunk of the document. This allows us to use both text-based features and semantic features for both recall and ranking.
The text and vector representations complement each other well:

- **Text-only** → misses recall of semantically similar content
- **Vector-only** → misses recall of specific content not well understood by the embedding model

Our recommendation is to default to hybrid retrieval:

```
select * from doc where userInput(@query) or ({label:"title_label", targetHits:1000}nearestNeighbor(title_embedding, embedding)) or ({label:"chunks_label", targetHits:1000}nearestNeighbor(chunk_embeddings, embedding))
```

In generic domains, or if you have fine-tuned an embedding model for your specific data, you might consider a vector-only approach:

```
select * from doc where rank({targetHits:10000}nearestNeighbor(embeddings_field, query_embedding), userInput(@query))
```

Notice that only the first argument of the [rank](https://docs.vespa.ai/en/reference/query-language-reference.html#rank) operator is used to determine whether a document is a match, while all arguments are used for calculating rank features. This means we can match on vectors only, but still use text-based features such as `bm25` and `nativeRank` for ranking. Note that if you do this, it makes sense to increase `targetHits` for the `nearestNeighbor` operator.

For our sample application, we add three different retrieval operators (combined with `OR`): one with `weakAnd` for text matching, and two `nearestNeighbor` operators for vector matching, one for the title and one for the chunks. This allows us to retrieve relevant documents based on both text and vector similarity, while also allowing us to return the most relevant chunks of the documents.
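The hybrid select is, in the end, just a string. As a minimal sketch (plain string formatting, no pyvespa required; field and label names taken from the blueprint above), it can be assembled like this:

```python
def hybrid_yql(schema: str, target_hits: int = 100) -> str:
    """Build the hybrid retrieval YQL shown above: weakAnd text matching
    (via userInput) OR'ed with two nearestNeighbor operators."""

    def nn(field: str, label: str) -> str:
        # {label:..., targetHits:...} annotations on the nearestNeighbor operator
        return (
            f'({{label:"{label}", targetHits:{target_hits}}}'
            f"nearestNeighbor({field}, embedding))"
        )

    return (
        f"select * from {schema} where userInput(@query) "
        f"or {nn('title_embedding', 'title_label')} "
        f"or {nn('chunk_embeddings', 'chunks_label')}"
    )

print(hybrid_yql("doc", target_hits=1000))
```

In the notebook itself, the same query is built with `vespa.querybuilder`, as shown in the evaluation cells further down.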
```
select * from doc where userInput(@query) or ({targetHits:100}nearestNeighbor(title_embedding, embedding)) or ({targetHits:100}nearestNeighbor(chunk_embeddings, embedding))
```

### Choosing your embedding model (and strategy)[¶](#choosing-your-embedding-model-and-strategy)

The choice of embedding model is a trade-off between inference time (at both indexing and query time), memory usage (embedding dimensions), and quality. There are many good open-source models available, and we recommend checking out the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard), looking at the `Retrieval` column to gauge quality, while also considering the memory usage, vector dimensions, and context length of the model.

See the [model hub](https://docs.vespa.ai/en/cloud/model-hub.html) for a list of provided models ready to use with Vespa. See also [Huggingface Embedder](https://docs.vespa.ai/en/embedding.html#huggingface-embedder) for details on using other models (exported as ONNX) with Vespa.

In addition to dense vector representations, Vespa supports sparse embeddings (token weights) and multi-vector (ColBERT-style) embeddings. See our [example notebook](https://vespa-engine.github.io/pyvespa//examples/mother-of-all-embedding-models-cloud.md#bge-m3-the-mother-of-all-embedding-models) on using the bge-m3 model, which supports both, with Vespa.

Vespa also supports [Matryoshka embeddings](https://blog.vespa.ai/combining-matryoshka-with-binary-quantization-using-embedder/), which can be a great way of reducing inference cost in the retrieval phases, by using a subset of the embedding dimensions, while using more dimensions for increased precision in the later ranking phases.

For domain-specific applications or less popular languages, you may want to consider fine-tuning a model on your own data.

### Consider binary vectors for recall[¶](#consider-binary-vectors-for-recall)

Another decision to make is which precision to use for your embeddings.
See the [binarization docs](https://docs.vespa.ai/en/binarizing-vectors.html) for an introduction to binarization in Vespa. For most cases, binary vectors (in Vespa, packed into an `int8` representation) provide an attractive tradeoff, especially for recall during the match phase. Consider these factors to determine whether this holds true for your application:

- Reduces vector memory cost by 5 – 30 ×
- Reduces query and indexing cost by 30 ×
- Often reduces quality by only a few percentage points

```txt
field binary_chunk_embeddings type tensor(chunk{}, x) {
    indexing: input text | chunk fixed-length 1024 | embed | pack_bits | attribute | index
    attribute {
        distance-metric: hamming
    }
}
```

If you need higher-precision vector similarity, you should use bfloat16 precision, and consider paging these vectors to disk to avoid large memory cost. Note that this means that when this field is accessed in ranking, it will also need to be read from disk, so you need to restrict the number of hits that access it to avoid performance issues.

```txt
field chunk_embeddings type tensor(chunk{}, x) {
    indexing: input text | chunk fixed-length 1024 | embed | attribute
    attribute: paged
}
```

For example, if you want to calculate `closeness` for a paged embedding vector in first-phase ranking, consider configuring your retrieval operators (typically `weakAnd` and/or `nearestNeighbor`, optionally combined with filters) so that not too many hits are matched. Another option is to enable match-phase limiting, see the [match-phase docs](https://docs.vespa.ai/en/reference/schema-reference.html#match-phase). In essence, you restrict the number of matches by specifying an attribute field.

### Consider float-binary for ranking[¶](#consider-float-binary-for-ranking)

In our blueprint, we chose to index binary vectors for the documents. This does not prevent us from using the float representation of the query embedding, though.
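To make the binary-vector trade-off concrete, here is a plain-Python sketch of sign binarization, packed hamming distance, and unpacking back to 0/1 floats for a float-binary dot product. This is illustrative only (no Vespa involved, and bit order is arbitrary here); the dimensions mirror the blueprint's 768-dim embeddings, which pack into 96 bytes:

```python
import math
import random

DIM = 768  # float dimensions; packs into 768 / 8 = 96 bytes

def pack_bits(vec):
    """Binarize by sign and pack 8 dimensions per byte (sketch of pack_bits)."""
    bits = [1 if v > 0 else 0 for v in vec]
    return bytes(
        sum(bit << (7 - i) for i, bit in enumerate(bits[b : b + 8]))
        for b in range(0, len(bits), 8)
    )

def hamming(a: bytes, b: bytes) -> int:
    """Bitwise hamming distance between two packed vectors (cheap match-phase metric)."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def unpack_bits(packed: bytes):
    """Unpack back to 0/1 float values (sketch of unpack_bits for ranking)."""
    return [float((byte >> (7 - i)) & 1) for byte in packed for i in range(8)]

def cosine(q, d):
    dot = sum(x * y for x, y in zip(q, d))
    return dot / (math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(y * y for y in d)))

random.seed(0)
query = [random.gauss(0, 1) for _ in range(DIM)]
doc = [random.gauss(0, 1) for _ in range(DIM)]

packed = pack_bits(doc)
print(len(packed))                         # 96 bytes, vs 768 * 4 bytes as float32
print(hamming(pack_bits(query), packed))   # cheap binary-binary distance (match phase)
print(cosine(query, unpack_bits(packed)))  # float-binary similarity (ranking)
```

The 32× memory reduction (3072 bytes of float32 down to 96 bytes) is where the cost savings quoted above come from.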
By unpacking the binary document chunk embeddings to their float representations (using [`unpack_bits`](https://docs.vespa.ai/en/reference/ranking-expressions.html#unpack-bits)), we can calculate the similarity between query and document with slightly higher precision, using a `float-binary` dot product instead of hamming distance (`binary-binary`). Below, you can see how we do this:

```txt
rank-profile collect-training-data {
    inputs {
        query(embedding) tensor(x[96])
        query(float_embedding) tensor(x[768])
    }
    function chunk_emb_vecs() {
        expression: unpack_bits(attribute(chunk_embeddings))
    }
    function chunk_dot_prod() {
        expression: reduce(query(float_embedding) * chunk_emb_vecs(), sum, x)
    }
    function vector_norms(t) {
        expression: sqrt(sum(pow(t, 2), x))
    }
    function chunk_sim_scores() {
        expression: chunk_dot_prod() / (vector_norms(chunk_emb_vecs()) * vector_norms(query(float_embedding)))
    }
    function top_3_chunk_text_scores() {
        expression: top(3, chunk_text_scores())
    }
    function top_3_chunk_sim_scores() {
        expression: top(3, chunk_sim_scores())
    }
}
```

### Use complex linguistics/recall only for precision[¶](#use-complex-linguisticsrecall-only-for-precision)

Vespa gives you extensive control over [linguistics](https://docs.vespa.ai/en/linguistics.html). You can decide the [match mode](https://docs.vespa.ai/en/reference/schema-reference.html#match), stemming, and normalization, or control derived tokens. It is also possible to use more specific operators than [weakAnd](https://docs.vespa.ai/en/reference/query-language-reference.html#weakand) to match only close occurrences ([near](https://docs.vespa.ai/en/reference/query-language-reference.html#near)/[onear](https://docs.vespa.ai/en/reference/query-language-reference.html#near)), match multiple alternatives ([equiv](https://docs.vespa.ai/en/query-rewriting.html#equiv)), weight items, set connectivity, and apply [query-rewrite](https://docs.vespa.ai/en/query-rewriting.html) rules.
**Don’t use this to increase recall — improve your embedding model instead.** Consider using it to improve precision when needed.

### Evaluating recall of the retrieval phase[¶](#evaluating-recall-of-the-retrieval-phase)

To know whether your retrieval phase is working well, you need to measure recall, the total number of matches, and the reported time spent. We can use [`VespaMatchEvaluator`](https://vespa-engine.github.io/pyvespa/api/vespa/evaluation.md#vespa.evaluation.VespaMatchEvaluator) from the pyvespa client library to do this.

For this sample application, we set up an evaluation script that compares three different retrieval strategies; let us call them "retrieval arms":

1. **Semantic-only**: Uses only vector similarity through `nearestNeighbor` operators.
1. **WeakAnd-only**: Uses only text-based matching with `userQuery()`.
1. **Hybrid**: Combines both approaches with OR logic.

Note that these are only generic suggestions, and you are of course free to include [filter clauses](https://docs.vespa.ai/en/reference/query-language-reference.html#where), [grouping](https://docs.vespa.ai/en/grouping), [predicates](https://docs.vespa.ai/en/predicate-fields.html), [geosearch](https://docs.vespa.ai/en/geo-search) etc. to support your specific use cases.

It is recommended to use a rank profile that does not do any first-phase ranking, to run the match-phase evaluation faster.

The evaluation will output metrics like:

- Recall (percentage of relevant documents matched)
- Total number of matches per query
- Query latency statistics
- Per-query detailed results (when `write_verbose=True`) to identify "offending" queries with regards to recall or performance.

This will be valuable input for tuning each of the arms. Run the cells below to evaluate all three retrieval strategies on your dataset.

In \[30\]:
```
ids_to_query = {query["query_id"]: query["query_text"] for query in queries}
relevant_docs = {
    query["query_id"]: set(query["relevant_document_ids"])
    for query in queries
    if "relevant_document_ids" in query
}
```

In \[31\]:

```
from vespa.evaluation import VespaMatchEvaluator
from vespa.application import Vespa
import vespa.querybuilder as qb
import json
from pathlib import Path


def match_weakand_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": str(
            qb.select("*").from_(VESPA_SCHEMA_NAME).where(qb.userQuery(query_text))
        ),
        "query": query_text,
        "ranking": "match-only",
        "input.query(embedding)": f"embed({query_text})",
        "presentation.summary": "no-chunks",
    }


def match_hybrid_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": str(
            qb.select("*")
            .from_(VESPA_SCHEMA_NAME)
            .where(
                qb.nearestNeighbor(
                    field="title_embedding",
                    query_vector="embedding",
                    annotations={"targetHits": 100},
                )
                | qb.nearestNeighbor(
                    field="chunk_embeddings",
                    query_vector="embedding",
                    annotations={"targetHits": 100},
                )
                | qb.userQuery(
                    query_text,
                )
            )
        ),
        "query": query_text,
        "ranking": "match-only",
        "input.query(embedding)": f"embed({query_text})",
        "presentation.summary": "no-chunks",
    }


def match_semantic_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": str(
            qb.select("*")
            .from_(VESPA_SCHEMA_NAME)
            .where(
                qb.nearestNeighbor(
                    field="title_embedding",
                    query_vector="embedding",
                    annotations={"targetHits": 100},
                )
                | qb.nearestNeighbor(
                    field="chunk_embeddings",
                    query_vector="embedding",
                    annotations={"targetHits": 100},
                )
            )
        ),
        "query": query_text,
        "ranking": "match-only",
        "input.query(embedding)": f"embed({query_text})",
        "presentation.summary": "no-chunks",
    }


match_results = {}
for evaluator_name, query_fn in [
    ("semantic", match_semantic_query_fn),
    ("weakand", match_weakand_query_fn),
    ("hybrid", match_hybrid_query_fn),
]:
    print(f"Evaluating {evaluator_name}...")
    match_evaluator = VespaMatchEvaluator(
        queries=ids_to_query,
        relevant_docs=relevant_docs,
        vespa_query_fn=query_fn,
        app=app,
        name="test-run",
        id_field="id",
        write_csv=False,
        write_verbose=False,  # optionally write verbose metrics to CSV
    )
    results = match_evaluator()
    match_results[evaluator_name] = results
```

```
Evaluating semantic...
Evaluating weakand...
Evaluating hybrid...
```

In \[32\]:

```
import pandas as pd

df = pd.DataFrame(match_results)
df
```

Out\[32\]:

| | semantic | weakand | hybrid |
| ---------------------- | --------- | ------- | --------- |
| match_recall | 1.00000 | 1.0000 | 1.00000 |
| avg_recall_per_query | 1.00000 | 1.0000 | 1.00000 |
| total_relevant_docs | 51.00000 | 51.0000 | 51.00000 |
| total_matched_relevant | 51.00000 | 51.0000 | 51.00000 |
| avg_matched_per_query | 100.00000 | 88.7500 | 100.00000 |
| total_queries | 20.00000 | 20.0000 | 20.00000 |
| searchtime_avg | 0.06275 | 0.0330 | 0.04395 |
| searchtime_q50 | 0.03200 | 0.0290 | 0.03750 |
| searchtime_q90 | 0.06400 | 0.0511 | 0.08500 |
| searchtime_q95 | 0.10055 | 0.0703 | 0.08800 |

### Tuning the retrieval phase[¶](#tuning-the-retrieval-phase)

We can see that all queries match all relevant documents, which is expected, since we use `targetHits:100` in the `nearestNeighbor` operators, and this is also the default for `weakAnd` (and `userQuery`). By setting `targetHits` lower, we would see recall drop. In general, you have these options if you want to increase recall:

1. Increase `targetHits` in your retrieval operators (e.g., `nearestNeighbor`, `weakAnd`).
1. Improve your embedding model (use a better model or fine-tune it on your data).
1.
You can also consider tuning the HNSW parameters, see the [docs on HNSW](https://docs.vespa.ai/en/approximate-nn-hnsw.html#using-vespas-approximate-nearest-neighbor-search).

Conversely, if you want to reduce the latency of one of your retrieval "arms" at the cost of a small trade-off in recall, you can:

1. Tune the `weakAnd` parameters. This has the potential to 3x the performance of the `weakAnd` part of your query, see this [blog post](https://blog.vespa.ai/tripling-the-query-performance-of-lexical-search/). Below are some empirically found default parameters that work well for most use cases:

```txt
rank-profile optimized inherits baseline {
    filter-threshold: 0.05
    weakand {
        stopword-limit: 0.6
        adjust-target: 0.01
    }
}
```

See the [reference](https://docs.vespa.ai/en/reference/schema-reference.html#weakand) for more details on the `weakAnd` parameters. These can also be set as query parameters.

1. As already [mentioned](#consider-binary-vectors-for-recall), consider binary vectors for your embeddings.
1. Consider using an embedding model with fewer dimensions, or using only a subset of the dimensions (e.g., with [Matryoshka embeddings](https://blog.vespa.ai/combining-matryoshka-with-binary-quantization-using-embedder/)).

## First-phase ranking[¶](#first-phase-ranking)

For first-phase ranking, we must use a computationally cheap function, as it is applied to all documents matched in the retrieval phase. For many applications, this can amount to millions of candidate documents. Common options include a (learned) linear combination of features, including text similarity features, vector closeness, and metadata. It could also be a heuristic handwritten function. Text features should include [nativeRank](https://docs.vespa.ai/en/reference/nativerank.html#nativerank) or [bm25](https://docs.vespa.ai/en/reference/bm25.html#ranking-function) — not [fieldMatch](https://docs.vespa.ai/en/reference/rank-features.html#field-match-features-normalized) (it is too expensive).
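To see why such a linear combination is cheap, here it is as plain Python. The coefficients and feature values below are made up for illustration; in Vespa the same computation lives in the rank profile's `first-phase` expression:

```python
# Hypothetical learned coefficients for a linear first-phase expression.
COEFFICIENTS = {
    "bm25(title)": 0.20,
    "bm25(chunks)": 0.15,
    "max_chunk_sim_scores": 10.0,
    "max_chunk_text_scores": 0.19,
}

def first_phase_score(features: dict, intercept: float = -7.8) -> float:
    """Cheap per-document score: one multiply-add per feature."""
    return intercept + sum(
        coef * features.get(name, 0.0) for name, coef in COEFFICIENTS.items()
    )

# Feature values as they might arrive from match-features for one document:
doc_features = {
    "bm25(title)": 4.33,
    "bm25(chunks)": 23.01,
    "max_chunk_sim_scores": 0.39,
    "max_chunk_text_scores": 20.58,
}
print(first_phase_score(doc_features))
```

A handful of multiply-adds per document is what makes this affordable over millions of matched candidates.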
Considerations for deciding whether to choose `bm25` or `nativeRank`:

- **bm25**: cheapest, strong significance, no proximity, not normalized.
- **nativeRank**: 2 – 3 × costlier, truncated significance, includes proximity, normalized.

For this blueprint, we opted to use `bm25` in the first phase, but you could evaluate and compare to see whether the additional cost of `nativeRank` is justified by increased quality.

### Collecting training data for first-phase ranking[¶](#collecting-training-data-for-first-phase-ranking)

The features we will use for first-phase ranking are not normalized (i.e., they have values in different ranges). This means we can't just weight them equally and expect that to be a good proxy for relevance. Below, we show how to find (learn) optimal weights (coefficients) for each feature, so that we can combine them into a ranking expression of the form:

```
a * bm25(title) + b * bm25(chunks) + c * max_chunk_sim_scores() + d * max_chunk_text_scores() + e * avg_top_3_chunk_sim_scores() + f * avg_top_3_chunk_text_scores()
```

The first thing we need to do is collect training data. We do this using the [VespaFeatureCollector](https://vespa-engine.github.io/pyvespa/api/vespa/evaluation.md#vespa.evaluation.VespaFeatureCollector) from the pyvespa library. These are the features we will include:

```txt
rank-profile collect-training-data {
    match-features {
        bm25(title)
        bm25(chunks)
        max_chunk_sim_scores
        max_chunk_text_scores
        avg_top_3_chunk_sim_scores
        avg_top_3_chunk_text_scores
    }

    # Since we need both binary embeddings (for match-phase) and float embeddings (for ranking) we define it as two inputs.
    inputs {
        query(embedding) tensor(x[96])
        query(float_embedding) tensor(x[768])
    }

    rank chunks {
        element-gap: 0  # Fixed-length chunking should not cause any positional gap between elements
    }

    function chunk_text_scores() {
        expression: elementwise(bm25(chunks), chunk, float)
    }

    function chunk_emb_vecs() {
        expression: unpack_bits(attribute(chunk_embeddings))
    }

    function chunk_dot_prod() {
        expression: reduce(query(float_embedding) * chunk_emb_vecs(), sum, x)
    }

    function vector_norms(t) {
        expression: sqrt(sum(pow(t, 2), x))
    }

    function chunk_sim_scores() {
        expression: chunk_dot_prod() / (vector_norms(chunk_emb_vecs()) * vector_norms(query(float_embedding)))
    }

    function top_3_chunk_text_scores() {
        expression: top(3, chunk_text_scores())
    }

    function top_3_chunk_sim_scores() {
        expression: top(3, chunk_sim_scores())
    }

    function avg_top_3_chunk_text_scores() {
        expression: reduce(top_3_chunk_text_scores(), avg, chunk)
    }

    function avg_top_3_chunk_sim_scores() {
        expression: reduce(top_3_chunk_sim_scores(), avg, chunk)
    }

    function max_chunk_text_scores() {
        expression: reduce(chunk_text_scores(), max, chunk)
    }

    function max_chunk_sim_scores() {
        expression: reduce(chunk_sim_scores(), max, chunk)
    }

    first-phase {
        expression {
            # Not used in this profile
            bm25(title) + bm25(chunks) + max_chunk_sim_scores() + max_chunk_text_scores()
        }
    }

    second-phase {
        expression: random
    }
}
```

As you can see, we rely on `bm25` and different vector similarity features (both document-level and chunk-level) for the first-phase ranking. These are relatively cheap to calculate, and will likely provide good enough ranking signals for the first-phase ranking.

Running the cell below collects the features (pass `write_csv=True` to also save them to a .csv file), which can be used to train a ranking model for the first-phase ranking.

In \[33\]:
```
from vespa.application import Vespa
from vespa.evaluation import VespaFeatureCollector
from typing import Dict, Any
import json
from pathlib import Path


def feature_collection_second_phase_query_fn(
    query_text: str, top_k: int = 10, query_id: str = None
) -> Dict[str, Any]:
    """
    Convert plain text into a JSON body for Vespa query with
    'feature-collection' rank profile.
    Includes both semantic similarity and BM25 matching with match features.
    """
    return {
        "yql": str(
            qb.select("*")
            .from_("doc")
            .where(
                (
                    qb.nearestNeighbor(
                        field="title_embedding",
                        query_vector="embedding",
                        annotations={
                            "targetHits": 100,
                            "label": "title_label",
                        },
                    )
                    | qb.nearestNeighbor(
                        field="chunk_embeddings",
                        query_vector="embedding",
                        annotations={
                            "targetHits": 100,
                            "label": "chunk_label",
                        },
                    )
                    | qb.userQuery(
                        query_text,
                    )
                )
            )
        ),
        "query": query_text,
        "ranking": "collect-second-phase",
        "input.query(embedding)": f"embed({query_text})",
        "input.query(float_embedding)": f"embed({query_text})",
        "hits": top_k,
        "timeout": "10s",
        "presentation.summary": "no-chunks",
        "presentation.timing": True,
    }


def feature_collection_first_phase_query_fn(
    query_text: str, top_k: int = 10, query_id: str = None
) -> Dict[str, Any]:
    """
    Convert plain text into a JSON body for Vespa query with
    'feature-collection' rank profile.
    Includes both semantic similarity and BM25 matching with match features.
    """
    return {
        "yql": str(
            qb.select("*")
            .from_("doc")
            .where(
                (
                    qb.nearestNeighbor(
                        field="title_embedding",
                        query_vector="embedding",
                        annotations={
                            "targetHits": 100,
                            "label": "title_label",
                        },
                    )
                    | qb.nearestNeighbor(
                        field="chunk_embeddings",
                        query_vector="embedding",
                        annotations={
                            "targetHits": 100,
                            "label": "chunk_label",
                        },
                    )
                    | qb.userQuery(
                        query_text,
                    )
                )
            )
        ),
        "query": query_text,
        "ranking": "collect-training-data",
        "input.query(embedding)": f"embed({query_text})",
        "input.query(float_embedding)": f"embed({query_text})",
        "hits": top_k,
        "timeout": "10s",
        "presentation.summary": "no-chunks",
        "presentation.timing": True,
    }


def generate_collector_name(
    collect_matchfeatures: bool,
    collect_rankfeatures: bool,
    collect_summaryfeatures: bool,
    second_phase: bool,
) -> str:
    """
    Generate a collector name based on feature collection settings and phase.

    Args:
        collect_matchfeatures: Whether match features are being collected
        collect_rankfeatures: Whether rank features are being collected
        collect_summaryfeatures: Whether summary features are being collected
        second_phase: Whether using second phase (True) or first phase (False)

    Returns:
        Generated collector name string
    """
    features = []
    if collect_matchfeatures:
        features.append("match")
    if collect_rankfeatures:
        features.append("rank")
    if collect_summaryfeatures:
        features.append("summary")
    features_str = "_".join(features) if features else "nofeatures"
    phase_str = "second_phase" if second_phase else "first_phase"
    return f"{features_str}_{phase_str}"


feature_collector = VespaFeatureCollector(
    queries=ids_to_query,
    relevant_docs=relevant_docs,
    vespa_query_fn=feature_collection_first_phase_query_fn,
    app=app,
    name="first-phase",
    id_field="id",
    collect_matchfeatures=True,
    collect_summaryfeatures=False,
    collect_rankfeatures=False,
    write_csv=False,
    random_hits_strategy="ratio",
    random_hits_value=1,
)
results = feature_collector.collect()
```

In \[34\]:
```
feature_df = pd.DataFrame(results["results"])
feature_df
```

Out\[34\]:

| | query_id | doc_id | relevance_label | relevance_score | match_avg_top_3_chunk_sim_scores | match_avg_top_3_chunk_text_scores | match_bm25(chunks) | match_bm25(title) | match_max_chunk_sim_scores | match_max_chunk_text_scores |
| --- | --------- | ------ | --------------- | --------------- | -------------------------------- | --------------------------------- | ------------------ | ----------------- | -------------------------- | --------------------------- |
| 0 | alex_q_01 | 1 | 1.0 | 0.734995 | 0.358027 | 15.100841 | 23.010389 | 4.333828 | 0.391143 | 20.582403 |
| 1 | alex_q_01 | 82 | 1.0 | 0.262686 | 0.225300 | 12.327676 | 18.611592 | 2.453409 | 0.258905 | 15.644889 |
| 2 | alex_q_01 | 50 | 1.0 | 0.060615 | 0.248329 | 8.444725 | 7.717984 | 0.000000 | 0.268457 | 8.444725 |
| 3 | alex_q_01 | 64 | 0.0 | 0.994799 | 0.238926 | 3.608304 | 4.940433 | 0.000000 | 0.262717 | 4.063323 |
| 4 | alex_q_01 | 21 | 0.0 | 0.986948 | 0.265199 | 3.424351 | 3.615531 | 0.000000 | 0.265199 | 3.424351 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 97 | alex_q_19 | 4 | 0.0 | 0.958641 | 0.210284 | 1.256423 | 2.238139 | 0.000000 | 0.229001 | 1.967774 |
| 98 | alex_q_20 | 20 | 1.0 | 0.656100 | 0.337411 | 8.959117 | 12.534452 | 9.865092 | 0.402615 | 12.799867 |
| 99 | alex_q_20 | 35 | 1.0 | 0.306241 | 0.227978 | 8.462585 | 13.478890 | 0.000000 | 0.239757 | 13.353056 |
| 100 | alex_q_20 | 2 | 0.0 | 0.999038 | 0.200672 | 0.942418 | 0.871042 | 0.000000 | 0.206993 | 0.942418 |
| 101 | alex_q_20 | 45 | 0.0 | 0.964807 | 0.151361 | 2.288041 | 2.695306 | 0.000000 | 0.151361 | 2.288041 |

102 rows × 10 columns

Note that the `relevance_score` in this table is just the random expression we used in the `second-phase` of the `collect-training-data` rank profile, and will be dropped before training the model.
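Dropping the id columns and the random score before training can be sketched as follows (plain Python with two hand-made rows; the actual training script operates on the pandas DataFrame above):

```python
# Two illustrative rows shaped like the collected feature table (values made up).
rows = [
    {"query_id": "q1", "doc_id": 1, "relevance_label": 1.0,
     "relevance_score": 0.73, "match_bm25(title)": 4.3},
    {"query_id": "q1", "doc_id": 64, "relevance_label": 0.0,
     "relevance_score": 0.99, "match_bm25(title)": 0.0},
]

# Columns that must not leak into the model: ids and the random second-phase score.
DROP = {"query_id", "doc_id", "relevance_score"}

X = [{k: v for k, v in row.items() if k not in DROP and k != "relevance_label"}
     for row in rows]
y = [row["relevance_label"] for row in rows]

print(X[0])  # only match features remain
print(y)     # [1.0, 0.0]
```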
### Training a first-phase ranking model[¶](#training-a-first-phase-ranking-model) As you recall, a first-phase ranking expression must be cheap to evaluate. This most often means a heuristic handwritten combination of match features, or a linear model trained on match features. We will demonstrate how to train a simple Logistic Regression model to predict relevance based on the collected match features. The full training script can be found in the [sample-apps repository](https://github.com/vespa-engine/sample-apps/blob/master/rag-blueprint/eval/train_logistic_regression.py). Some "gotchas" to be aware of: - We sample an equal number of relevant and random documents for each query, to avoid class imbalance. - We make sure that we drop `query_id` and `doc_id` columns before training. - We apply standard scaling to the features before training the model. We apply the inverse transform to the model coefficients after training, so that we can use them in Vespa. - We do 5-fold stratified cross-validation to evaluate the model performance, ensuring that each fold has a balanced number of relevant and random documents. - We also make sure to have an unseen set of test queries to evaluate the model on, to avoid overfitting. Run the cell below to train the model and get the coefficients. In \[35\]: Copied! ``` import pandas as pd import numpy as np from sklearn.linear_model import LogisticRegression from sklearn.model_selection import StratifiedKFold from sklearn.preprocessing import StandardScaler from sklearn.metrics import ( accuracy_score, precision_score, recall_score, f1_score, log_loss, roc_auc_score, average_precision_score, ) def get_coefficients_info(model, features, intercept, scaler): """ Returns the model coefficients as a dictionary that accounts for standardization. The transformation allows the model to be expressed in terms of the original, unscaled features. """ # For standardized features, the transformation is z = (x - mean) / std. 
# The original expression 'coef * z + intercept' becomes: # (coef / std) * x + (intercept - coef * mean / std) transformed_coefs = model.coef_[0] / scaler.scale_ transformed_intercept = intercept - np.sum( model.coef_[0] * scaler.mean_ / scaler.scale_ ) # Create a mathematical expression for the model using original (unscaled) features expression_parts = [f"{transformed_intercept:.6f}"] for feature, coef in zip(features, transformed_coefs): expression_parts.append(f"{coef:+.6f}*{feature}") expression = "".join(expression_parts) # Return a dictionary containing scaling parameters and coefficient information return { "expression": expression, "feature_means": dict(zip(features, scaler.mean_)), "feature_stds": dict(zip(features, scaler.scale_)), "original_coefficients": dict(zip(features, model.coef_[0])), "original_intercept": float(intercept), "transformed_coefficients": dict(zip(features, transformed_coefs)), "transformed_intercept": float(transformed_intercept), } def perform_cross_validation(df: pd.DataFrame): """ Loads data, applies standardization, and performs 5-fold stratified cross-validation. Args: df: A pandas DataFrame with features and a 'relevance_label' target column. Returns: A tuple containing two pandas DataFrames: - cv_results_df: The mean and standard deviation of evaluation metrics. - coef_df: The model coefficients for both scaled and unscaled features. 
""" # Define and drop irrelevant columns columns_to_drop = ["doc_id", "query_id", "relevance_score"] # Drop only the columns that exist in the DataFrame df = df.drop(columns=[col for col in columns_to_drop if col in df.columns]) df["relevance_label"] = df["relevance_label"].astype(int) # Define features (X) and target (y) X = df.drop(columns=["relevance_label"]) features = X.columns.tolist() y = df["relevance_label"] # Initialize StandardScaler, model, and cross-validator scaler = StandardScaler() N_SPLITS = 5 skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=42) model = LogisticRegression(C=0.001, tol=1e-2, random_state=42) # Lists to store metrics for each fold metrics = { "Accuracy": [], "Precision": [], "Recall": [], "F1-Score": [], "Log Loss": [], "ROC AUC": [], "Avg Precision": [], } # Perform 5-Fold Stratified Cross-Validation for train_index, test_index in skf.split(X, y): X_train, X_test = X.iloc[train_index], X.iloc[test_index] y_train, y_test = y.iloc[train_index], y.iloc[test_index] # Fit scaler on training data and transform both sets X_train_scaled = scaler.fit_transform(X_train) X_test_scaled = scaler.transform(X_test) # Train the model and make predictions model.fit(X_train_scaled, y_train) y_pred = model.predict(X_test_scaled) y_pred_proba = model.predict_proba(X_test_scaled)[:, 1] # Calculate and store metrics for the fold metrics["Accuracy"].append(accuracy_score(y_test, y_pred)) metrics["Precision"].append(precision_score(y_test, y_pred, zero_division=0)) metrics["Recall"].append(recall_score(y_test, y_pred, zero_division=0)) metrics["F1-Score"].append(f1_score(y_test, y_pred, zero_division=0)) metrics["Log Loss"].append(log_loss(y_test, y_pred_proba)) metrics["ROC AUC"].append(roc_auc_score(y_test, y_pred_proba)) metrics["Avg Precision"].append(average_precision_score(y_test, y_pred_proba)) # --- Prepare Results DataFrames --- # Create DataFrame for cross-validation results cv_results = { "Metric": list(metrics.keys()), 
"Mean": [np.mean(v) for v in metrics.values()], "Std Dev": [np.std(v) for v in metrics.values()], } cv_results_df = pd.DataFrame(cv_results) # Retrain on full standardized data to get final coefficients X_scaled = scaler.fit_transform(X) model.fit(X_scaled, y) # Get transformed coefficients for original (unscaled) features coef_info = get_coefficients_info(model, features, model.intercept_[0], scaler) # Create DataFrame for coefficients coef_data = { "Feature": features + ["Intercept"], "Coefficient (Standardized)": np.append(model.coef_[0], model.intercept_[0]), "Coefficient (Original)": np.append( list(coef_info["transformed_coefficients"].values()), coef_info["transformed_intercept"], ), } coef_df = pd.DataFrame(coef_data) return cv_results_df, coef_df # Perform cross-validation and get the results cv_results_df, coefficients_df = perform_cross_validation(feature_df) # Print the results print("--- Cross-Validation Results ---") print(cv_results_df.to_string(index=False)) print("\n" + "=" * 40 + "\n") print("--- Model Coefficients ---") print(coefficients_df.to_string(index=False)) ``` import pandas as pd import numpy as np from sklearn.linear_model import LogisticRegression from sklearn.model_selection import StratifiedKFold from sklearn.preprocessing import StandardScaler from sklearn.metrics import ( accuracy_score, precision_score, recall_score, f1_score, log_loss, roc_auc_score, average_precision_score, ) def get_coefficients_info(model, features, intercept, scaler): """ Returns the model coefficients as a dictionary that accounts for standardization. The transformation allows the model to be expressed in terms of the original, unscaled features. """ # For standardized features, the transformation is z = (x - mean) / std. 
# The original expression 'coef * z + intercept' becomes: # (coef / std) * x + (intercept - coef * mean / std) transformed_coefs = model.coef\_[0] / scaler.scale\_ transformed_intercept = intercept - np.sum( model.coef\_[0] * scaler.mean\_ / scaler.scale\_ ) # Create a mathematical expression for the model using original (unscaled) features expression_parts = [f"{transformed_intercept:.6f}"] for feature, coef in zip(features, transformed_coefs): expression_parts.append(f"{coef:+.6f}\*{feature}") expression = "".join(expression_parts) # Return a dictionary containing scaling parameters and coefficient information return { "expression": expression, "feature_means": dict(zip(features, scaler.mean\_)), "feature_stds": dict(zip(features, scaler.scale\_)), "original_coefficients": dict(zip(features, model.coef\_[0])), "original_intercept": float(intercept), "transformed_coefficients": dict(zip(features, transformed_coefs)), "transformed_intercept": float(transformed_intercept), } def perform_cross_validation(df: pd.DataFrame): """ Loads data, applies standardization, and performs 5-fold stratified cross-validation. Args: df: A pandas DataFrame with features and a 'relevance_label' target column. Returns: A tuple containing two pandas DataFrames: - cv_results_df: The mean and standard deviation of evaluation metrics. - coef_df: The model coefficients for both scaled and unscaled features. 
""" # Define and drop irrelevant columns columns_to_drop = ["doc_id", "query_id", "relevance_score"] # Drop only the columns that exist in the DataFrame df = df.drop(columns=[col for col in columns_to_drop if col in df.columns]) df["relevance_label"] = df["relevance_label"].astype(int) # Define features (X) and target (y) X = df.drop(columns=["relevance_label"]) features = X.columns.tolist() y = df["relevance_label"] # Initialize StandardScaler, model, and cross-validator scaler = StandardScaler() N_SPLITS = 5 skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=42) model = LogisticRegression(C=0.001, tol=1e-2, random_state=42) # Lists to store metrics for each fold metrics = { "Accuracy": [], "Precision": [], "Recall": [], "F1-Score": [], "Log Loss": [], "ROC AUC": [], "Avg Precision": [], } # Perform 5-Fold Stratified Cross-Validation for train_index, test_index in skf.split(X, y): X_train, X_test = X.iloc[train_index], X.iloc[test_index] y_train, y_test = y.iloc[train_index], y.iloc[test_index] # Fit scaler on training data and transform both sets X_train_scaled = scaler.fit_transform(X_train) X_test_scaled = scaler.transform(X_test) # Train the model and make predictions model.fit(X_train_scaled, y_train) y_pred = model.predict(X_test_scaled) y_pred_proba = model.predict_proba(X_test_scaled)[:, 1] # Calculate and store metrics for the fold metrics["Accuracy"].append(accuracy_score(y_test, y_pred)) metrics["Precision"].append(precision_score(y_test, y_pred, zero_division=0)) metrics["Recall"].append(recall_score(y_test, y_pred, zero_division=0)) metrics["F1-Score"].append(f1_score(y_test, y_pred, zero_division=0)) metrics["Log Loss"].append(log_loss(y_test, y_pred_proba)) metrics["ROC AUC"].append(roc_auc_score(y_test, y_pred_proba)) metrics["Avg Precision"].append(average_precision_score(y_test, y_pred_proba)) # --- Prepare Results DataFrames --- # Create DataFrame for cross-validation results cv_results = { "Metric": list(metrics.keys()), 
"Mean": [np.mean(v) for v in metrics.values()], "Std Dev": [np.std(v) for v in metrics.values()], } cv_results_df = pd.DataFrame(cv_results) # Retrain on full standardized data to get final coefficients X_scaled = scaler.fit_transform(X) model.fit(X_scaled, y) # Get transformed coefficients for original (unscaled) features coef_info = get_coefficients_info(model, features, model.intercept\_[0], scaler) # Create DataFrame for coefficients coef_data = { "Feature": features + ["Intercept"], "Coefficient (Standardized)": np.append(model.coef\_[0], model.intercept\_[0]), "Coefficient (Original)": np.append( list(coef_info["transformed_coefficients"].values()), coef_info["transformed_intercept"], ), } coef_df = pd.DataFrame(coef_data) return cv_results_df, coef_df # Perform cross-validation and get the results cv_results_df, coefficients_df = perform_cross_validation(feature_df) # Print the results print("--- Cross-Validation Results ---") print(cv_results_df.to_string(index=False)) print("\\n" + "=" * 40 + "\\n") print("--- Model Coefficients ---") print(coefficients_df.to_string(index=False)) ``` --- Cross-Validation Results --- Metric Mean Std Dev Accuracy 0.659524 0.115234 Precision 0.623102 0.085545 Recall 1.000000 0.000000 F1-Score 0.764337 0.065585 Log Loss 0.639436 0.014668 ROC AUC 0.974949 0.019901 Avg Precision 0.979207 0.018465 ======================================== --- Model Coefficients --- Feature Coefficient (Standardized) Coefficient (Original) match_avg_top_3_chunk_sim_scores 0.034383 0.421609 match_avg_top_3_chunk_text_scores 0.031768 0.006793 match_bm25(chunks) 0.031909 0.004862 match_bm25(title) 0.021095 0.008671 match_max_chunk_sim_scores 0.034131 0.352846 match_max_chunk_text_scores 0.032141 0.005228 Intercept 0.158401 -0.143366 ``` In \[36\]: Copied! 
```
coefficients_df
```

Out\[36\]:

| | Feature | Coefficient (Standardized) | Coefficient (Original) |
| --- | --------------------------------- | -------------------------- | ---------------------- |
| 0 | match_avg_top_3_chunk_sim_scores | 0.034383 | 0.421609 |
| 1 | match_avg_top_3_chunk_text_scores | 0.031768 | 0.006793 |
| 2 | match_bm25(chunks) | 0.031909 | 0.004862 |
| 3 | match_bm25(title) | 0.021095 | 0.008671 |
| 4 | match_max_chunk_sim_scores | 0.034131 | 0.352846 |
| 5 | match_max_chunk_text_scores | 0.032141 | 0.005228 |
| 6 | Intercept | 0.158401 | -0.143366 |

These coefficients look quite good. With such a small dataset, however, it is easy to overfit. Let us evaluate on the unseen test queries to see how well the model generalizes.

First, we need to add the learned coefficients as inputs to a new rank profile in our schema, so that we can use them in Vespa.

In \[37\]: Copied!

```
learned_linear_rp = (
    repo_root / "app" / "schemas" / "doc" / "learned-linear.profile"
).read_text()
display_md(learned_linear_rp, tag="txt")
```

```txt
rank-profile learned-linear inherits base-features {
    match-features:
    inputs {
        query(embedding) tensor(x[96])
        query(float_embedding) tensor(x[768])
        query(intercept) double
        query(avg_top_3_chunk_sim_scores_param) double
        query(avg_top_3_chunk_text_scores_param) double
        query(bm25_chunks_param) double
        query(bm25_title_param) double
        query(max_chunk_sim_scores_param) double
        query(max_chunk_text_scores_param) double
    }
    first-phase {
        expression {
            query(intercept) +
            query(avg_top_3_chunk_sim_scores_param) * avg_top_3_chunk_sim_scores() +
            query(avg_top_3_chunk_text_scores_param) * avg_top_3_chunk_text_scores() +
            query(bm25_title_param) * bm25(title) +
            query(bm25_chunks_param) * bm25(chunks) +
            query(max_chunk_sim_scores_param) * max_chunk_sim_scores() +
            query(max_chunk_text_scores_param) *
            max_chunk_text_scores()
        }
    }
    summary-features {
        top_3_chunk_sim_scores
    }
}
```

To allow changing the parameters without redeploying the application, we also add the coefficient values as query parameters to a new query profile.

In \[38\]: Copied!

```
display_md(hybrid_qp, tag="xml")
```

```xml
<query-profile id="hybrid">
    <field name="schema">doc</field>
    <field name="input.query(embedding)">embed(@query)</field>
    <field name="input.query(float_embedding)">embed(@query)</field>
    <field name="input.query(intercept)">-7.798639</field>
    <field name="input.query(avg_top_3_chunk_sim_scores_param)">13.383840</field>
    <field name="input.query(avg_top_3_chunk_text_scores_param)">0.203145</field>
    <field name="input.query(bm25_chunks_param)">0.159914</field>
    <field name="input.query(bm25_title_param)">0.191867</field>
    <field name="input.query(max_chunk_sim_scores_param)">10.067169</field>
    <field name="input.query(max_chunk_text_scores_param)">0.153392</field>
    <field name="yql">select * from %{schema} where userInput(@query) or ({label:"title_label", targetHits:100}nearestNeighbor(title_embedding, embedding)) or ({label:"chunks_label", targetHits:100}nearestNeighbor(chunk_embeddings, embedding))</field>
    <field name="hits">10</field>
    <field name="ranking.profile">learned-linear</field>
    <field name="presentation.summary">top_3_chunks</field>
</query-profile>
```

### Evaluating first-phase ranking[¶](#evaluating-first-phase-ranking)

Now we are ready to evaluate our first-phase ranking function on the unseen test queries, using the [VespaEvaluator](https://vespa-engine.github.io/pyvespa/evaluating-vespa-application-cloud.md#vespaevaluator).

In \[ \]: Copied!

```
test_queries_file = repo_root / "queries" / "test_queries.json"
with open(test_queries_file) as f:
    test_queries = json.load(f)
test_ids_to_query = {query["query_id"]: query["query_text"] for query in test_queries}
test_relevant_docs = {
    query["query_id"]: set(query["relevant_document_ids"])
    for query in test_queries
    if "relevant_document_ids" in query
}
```

We need to parse the coefficients into the required format for input.

In \[40\]: Copied!
```
coefficients_df
```

Out\[40\]:

| | Feature | Coefficient (Standardized) | Coefficient (Original) |
| --- | --------------------------------- | -------------------------- | ---------------------- |
| 0 | match_avg_top_3_chunk_sim_scores | 0.034383 | 0.421609 |
| 1 | match_avg_top_3_chunk_text_scores | 0.031768 | 0.006793 |
| 2 | match_bm25(chunks) | 0.031909 | 0.004862 |
| 3 | match_bm25(title) | 0.021095 | 0.008671 |
| 4 | match_max_chunk_sim_scores | 0.034131 | 0.352846 |
| 5 | match_max_chunk_text_scores | 0.032141 | 0.005228 |
| 6 | Intercept | 0.158401 | -0.143366 |

In \[41\]: Copied!

```
coef_dict = coefficients_df.to_dict()
coef_dict
```

Out\[41\]:

```
{'Feature': {0: 'match_avg_top_3_chunk_sim_scores',
  1: 'match_avg_top_3_chunk_text_scores',
  2: 'match_bm25(chunks)',
  3: 'match_bm25(title)',
  4: 'match_max_chunk_sim_scores',
  5: 'match_max_chunk_text_scores',
  6: 'Intercept'},
 'Coefficient (Standardized)': {0: 0.03438259396169029,
  1: 0.031767760839597856,
  2: 0.03190853104175455,
  3: 0.021094809721098663,
  4: 0.03413143203194206,
  5: 0.0321408033796812,
  6: 0.1584007329169953},
 'Coefficient (Original)': {0: 0.421609061801165,
  1: 0.0067931485936015825,
  2: 0.004861617295220699,
  3: 0.008671224628375315,
  4: 0.3528463496849927,
  5: 0.005227988942349101,
  6: -0.14336597939520906}}
```

In \[42\]: Copied!
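A detail worth noting before the parsing step: `DataFrame.to_dict()` with the default orientation returns a `{column: {row_index: value}}` nesting, which is why the conversion code indexes coefficients by position. A minimal illustration with toy columns:

```python
import pandas as pd

df = pd.DataFrame({"Feature": ["a", "b"], "Coef": [0.1, 0.2]})
print(df.to_dict())  # {'Feature': {0: 'a', 1: 'b'}, 'Coef': {0: 0.1, 1: 0.2}}

# A flat feature -> coefficient mapping, without positional indexing:
flat = dict(zip(df["Feature"], df["Coef"]))
print(flat)          # {'a': 0.1, 'b': 0.2}
```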
```
def format_key(feature):
    """Formats the feature string into the desired key format."""
    if feature == "Intercept":
        return "input.query(intercept)"
    name = feature.removeprefix("match_").replace("(", "_").replace(")", "")
    return f"input.query({name}_param)"


linear_params = {
    format_key(feature): coef_dict["Coefficient (Original)"][i]
    for i, feature in enumerate(coef_dict["Feature"].values())
}
linear_params
```

Out\[42\]:

```
{'input.query(avg_top_3_chunk_sim_scores_param)': 0.421609061801165,
 'input.query(avg_top_3_chunk_text_scores_param)': 0.0067931485936015825,
 'input.query(bm25_chunks_param)': 0.004861617295220699,
 'input.query(bm25_title_param)': 0.008671224628375315,
 'input.query(max_chunk_sim_scores_param)': 0.3528463496849927,
 'input.query(max_chunk_text_scores_param)': 0.005227988942349101,
 'input.query(intercept)': -0.14336597939520906}
```

We run the evaluation on the unseen test queries, and get the following output:

In \[43\]: Copied!

```
# Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
from vespa.evaluation import VespaEvaluator from vespa.application import Vespa import json from pathlib import Path def rank_first_phase_query_fn(query_text: str, top_k: int) -> dict: return { "yql": str( qb.select("*") .from_(VESPA_SCHEMA_NAME) .where( qb.nearestNeighbor( field="title_embedding", query_vector="embedding", annotations={"targetHits": 100}, ) | qb.nearestNeighbor( field="chunk_embeddings", query_vector="embedding", annotations={"targetHits": 100}, ) | qb.userQuery( query_text, ) ) ), "hits": top_k, "query": query_text, "ranking.profile": "learned-linear", "input.query(embedding)": f"embed({query_text})", "input.query(float_embedding)": f"embed({query_text})", "presentation.summary": "no-chunks", } | linear_params first_phase_evaluator = VespaEvaluator( queries=test_ids_to_query, relevant_docs=test_relevant_docs, vespa_query_fn=rank_first_phase_query_fn, id_field="id", app=app, name="first-phase-evaluation", write_csv=False, precision_recall_at_k=[10, 20], ) first_phase_results = first_phase_evaluator() ``` # Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. 
In \[44\]: Copied!

```
first_phase_results
```

Out\[44\]:

```
{'accuracy@1': 1.0,
 'accuracy@3': 1.0,
 'accuracy@5': 1.0,
 'accuracy@10': 1.0,
 'precision@10': 0.23500000000000001,
 'recall@10': 0.9405303030303032,
 'precision@20': 0.1275,
 'recall@20': 0.990909090909091,
 'mrr@10': 1.0,
 'ndcg@10': 0.8893451868887793,
 'map@100': 0.8183245416199961,
 'searchtime_avg': 0.04085000000000001,
 'searchtime_q50': 0.0425,
 'searchtime_q90': 0.06040000000000004,
 'searchtime_q95': 0.08305000000000001}
```

In \[45\]: Copied!
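Among the metrics reported above, recall@k is the one that matters most for first-phase ranking, as discussed below. For reference, it can be computed from a ranked result list like this (the document ids here are hypothetical):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


ranked = ["d3", "d7", "d1", "d9", "d2"]
relevant = {"d1", "d2", "d8"}
print(recall_at_k(ranked, relevant, 3))  # 1/3: only d1 is in the top 3
print(recall_at_k(ranked, relevant, 5))  # 2/3: d1 and d2 are in the top 5
```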
```
first_phase_df = pd.DataFrame(first_phase_results, index=["value"]).T
first_phase_df
```

Out\[45\]:

| | value |
| -------------- | -------- |
| accuracy@1 | 1.000000 |
| accuracy@3 | 1.000000 |
| accuracy@5 | 1.000000 |
| accuracy@10 | 1.000000 |
| precision@10 | 0.235000 |
| recall@10 | 0.940530 |
| precision@20 | 0.127500 |
| recall@20 | 0.990909 |
| mrr@10 | 1.000000 |
| ndcg@10 | 0.889345 |
| map@100 | 0.818325 |
| searchtime_avg | 0.040850 |
| searchtime_q50 | 0.042500 |
| searchtime_q90 | 0.060400 |
| searchtime_q95 | 0.083050 |

For first-phase ranking, we care most about recall, as we just want to make sure that the relevant documents are ranked high enough to be included in second-phase ranking (by default, the top 10,000 documents are exposed to the second phase; this can be controlled with the `rerank-count` parameter).

We can see that our results are already very good. This is largely due to our small, synthetic dataset; in reality, you should align your metric expectations with your dataset and test queries.

We can also see that our search time is quite fast, with an average of about 41 ms. Consider whether this is well within your latency budget, as you want some headroom for second-phase ranking.

## Second-phase ranking[¶](#second-phase-ranking)

For the second-phase ranking, we can afford to use a more expensive ranking expression, since we will only run it on the top-k documents from the first-phase ranking (defined by the `rerank-count` parameter, which defaults to 10,000 documents). This is where we can significantly improve ranking quality by using more sophisticated models and features that would be too expensive to compute for all matched documents.
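Besides setting `rerank-count` in the rank profile, Vespa lets you override it per query via the `ranking.rerankCount` query parameter, which is handy when experimenting with the quality/latency trade-off. A hedged sketch of such a query body (the YQL, query text, and value here are illustrative):

```python
# Illustrative query body; assumes a "doc" schema and the learned-linear
# rank profile from this tutorial.
query_body = {
    "yql": "select * from doc where userInput(@query)",
    "query": "how to tune second phase ranking",
    "ranking.profile": "learned-linear",
    # Expose only the top-100 first-phase hits to the second phase:
    "ranking.rerankCount": 100,
}
print(query_body["ranking.rerankCount"])  # 100
```

This could then be passed to `app.query(body=query_body)` like any other query body.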
### Collecting features for second-phase ranking[¶](#collecting-features-for-second-phase-ranking)

For second-phase ranking, we request Vespa's default set of rank features, which includes a comprehensive set of text features. See the [rank features documentation](https://docs.vespa.ai/en/reference/rank-features.html) for complete details.

We can collect both match features and rank features by running the same code as we did for first-phase ranking, with some additional parameters to collect rank features as well.

In \[46\]: Copied!

```
second_phase_collector = VespaFeatureCollector(
    queries=ids_to_query,
    relevant_docs=relevant_docs,
    vespa_query_fn=feature_collection_second_phase_query_fn,
    app=app,
    name="second-phase",
    id_field="id",
    collect_matchfeatures=True,
    collect_summaryfeatures=False,
    collect_rankfeatures=True,
    write_csv=False,
    random_hits_strategy="ratio",
    random_hits_value=1,
)
second_phase_features = second_phase_collector.collect()
```

In \[47\]: Copied!

```
second_phase_df = pd.DataFrame(second_phase_features["results"])
second_phase_df
```

Out\[47\]:

| | query_id | doc_id | relevance_label | relevance_score | match_avg_top_3_chunk_sim_scores | match_avg_top_3_chunk_text_scores | match_bm25(chunks) | match_bm25(title) | match_is_favorite | match_max_chunk_sim_scores | ...
| rank_term(3).significance | rank_term(3).weight | rank_term(4).connectedness | rank_term(4).significance | rank_term(4).weight | rank_textSimilarity(title).fieldCoverage | rank_textSimilarity(title).order | rank_textSimilarity(title).proximity | rank_textSimilarity(title).queryCoverage | rank_textSimilarity(title).score | | --- | --------- | ------ | --------------- | --------------- | -------------------------------- | --------------------------------- | ------------------ | ----------------- | ----------------- | -------------------------- | --- | ------------------------- | ------------------- | -------------------------- | ------------------------- | ------------------- | ---------------------------------------- | -------------------------------- | ------------------------------------ | ---------------------------------------- | -------------------------------- | | 0 | alex_q_01 | 1 | 1.0 | 0.928815 | 0.358027 | 15.100841 | 23.010389 | 4.333828 | 1.0 | 0.391143 | ... | 0.524369 | 100.0 | 0.1 | 0.560104 | 100.0 | 0.400000 | 1.0 | 1.00 | 0.133333 | 0.620000 | | 1 | alex_q_01 | 50 | 1.0 | 0.791824 | 0.248329 | 8.444725 | 7.717984 | 0.000000 | 0.0 | 0.268457 | ... | 0.524369 | 100.0 | 0.1 | 0.560104 | 100.0 | 0.000000 | 0.0 | 0.00 | 0.000000 | 0.000000 | | 2 | alex_q_01 | 82 | 1.0 | 0.271836 | 0.225300 | 12.327676 | 18.611592 | 2.453409 | 1.0 | 0.258905 | ... | 0.524369 | 100.0 | 0.1 | 0.560104 | 100.0 | 0.200000 | 0.0 | 0.75 | 0.066667 | 0.322500 | | 3 | alex_q_01 | 34 | 0.0 | 0.982272 | 0.231970 | 5.111429 | 7.128779 | 0.000000 | 0.0 | 0.257180 | ... | 0.524369 | 100.0 | 0.1 | 0.560104 | 100.0 | 0.000000 | 0.0 | 0.00 | 0.000000 | 0.000000 | | 4 | alex_q_01 | 24 | 0.0 | 0.975659 | 0.201503 | 2.404518 | 2.680087 | 0.000000 | 1.0 | 0.201503 | ... | 0.524369 | 100.0 | 0.1 | 0.560104 | 100.0 | 0.000000 | 0.0 | 0.00 | 0.000000 | 0.000000 | | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... 
| ... | | 97 | alex_q_19 | 58 | 0.0 | 0.990156 | 0.136911 | 2.231116 | 2.606189 | 0.000000 | 0.0 | 0.136911 | ... | 0.548752 | 100.0 | 0.1 | 0.558248 | 100.0 | 0.000000 | 0.0 | 0.00 | 0.000000 | 0.000000 | | 98 | alex_q_20 | 20 | 1.0 | 0.618527 | 0.337411 | 8.959117 | 12.534452 | 9.865092 | 0.0 | 0.402615 | ... | 0.558248 | 100.0 | 0.1 | 0.524369 | 100.0 | 0.833333 | 1.0 | 1.00 | 0.555556 | 0.833333 | | 99 | alex_q_20 | 35 | 1.0 | 0.617958 | 0.227978 | 8.462585 | 13.478890 | 0.000000 | 0.0 | 0.239757 | ... | 0.558248 | 100.0 | 0.1 | 0.524369 | 100.0 | 0.000000 | 0.0 | 0.00 | 0.000000 | 0.000000 | | 100 | alex_q_20 | 63 | 0.0 | 0.979987 | 0.182378 | 3.131521 | 5.032468 | 0.000000 | 1.0 | 0.183292 | ... | 0.558248 | 100.0 | 0.1 | 0.524369 | 100.0 | 0.000000 | 0.0 | 0.00 | 0.000000 | 0.000000 | | 101 | alex_q_20 | 32 | 0.0 | 0.977501 | 0.157868 | 2.246247 | 2.442976 | 1.388680 | 0.0 | 0.157868 | ... | 0.558248 | 100.0 | 0.1 | 0.524369 | 100.0 | 0.200000 | 0.0 | 0.75 | 0.111111 | 0.335833 | 102 rows × 198 columns This collects 195 features (excluding ids and labels), providing a rich feature set for training more sophisticated ranking models. ### Training a GBDT model for second-phase ranking[¶](#training-a-gbdt-model-for-second-phase-ranking) With the expanded feature set, we can train a Gradient Boosted Decision Tree (GBDT) model to predict document relevance. We use [LightGBM](https://docs.vespa.ai/en/lightgbm.html) for this purpose. Vespa also supports [XGBoost](https://docs.vespa.ai/en/xgboost.html) and [ONNX](https://docs.vespa.ai/en/onnx.html) models. 
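The training script shown next renames columns to LightGBM-safe names (`feature_0`, `feature_1`, ...) before training, so the original Vespa feature names, such as `bm25(title)`, must be restored in the dumped model JSON before deployment. The core of that restore step can be sketched standalone, using a made-up miniature model dict:

```python
import json

# Toy stand-in for a dumped LightGBM model whose features were renamed
# for training; restore the original Vespa feature names before saving
# the JSON for deployment.
model_json = {
    "feature_names": ["feature_0", "feature_1"],
    "tree_info": [{"split_feature": "feature_0"}],
}
mapping = {"feature_0": "bm25(title)", "feature_1": "max_chunk_sim_scores"}

s = json.dumps(model_json)
for renamed, original in mapping.items():
    # Replace only quoted occurrences, so substrings in other values are safe.
    s = s.replace(f'"{renamed}"', f'"{original}"')
restored = json.loads(s)

print(restored["feature_names"])  # ['bm25(title)', 'max_chunk_sim_scores']
```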
To train the model, run the following code ([link to training script](https://github.com/vespa-engine/sample-apps/blob/master/rag-blueprint/eval/train_lightgbm.py)):

The training process includes several important considerations:

- **Cross-validation**: We use 5-fold stratified cross-validation to evaluate model performance and prevent overfitting
- **Hyperparameter tuning**: We set conservative hyperparameters to prevent growing overly large and deep trees, which is especially important for smaller datasets
- **Feature selection**: Features with zero importance during cross-validation are excluded from the final model
- **Early stopping**: Training stops when the validation score does not improve for 50 rounds

In [48]:

```python
import json
import re
from typing import Dict, Any, Tuple

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder


def strip_feature_prefix(feature_name: str) -> str:
    """Strips 'rank_' or 'match_' prefix from a feature name."""
    return re.sub(r"^(rank_|match_)", "", feature_name)


def calculate_mean_importance(
    importance_frames: list,
) -> pd.DataFrame:
    """Calculates and returns the mean feature importance from all folds."""
    if not importance_frames:
        return pd.DataFrame(columns=["feature", "gain"])
    imp_all = pd.concat(importance_frames, axis=0)
    imp_mean = (
        imp_all.groupby("feature")["gain"]
        .mean()
        .sort_values(ascending=False)
        .reset_index()
    )
    return imp_mean


def perform_cross_validation(
    df: pd.DataFrame, args: Dict[str, Any]
) -> Tuple[pd.DataFrame, pd.DataFrame, Dict]:
    """
    Performs stratified cross-validation with LightGBM on a DataFrame.

    Args:
        df: Input pandas DataFrame containing features and the target column.
        args: A dictionary of parameters for the training process.

    Returns:
        A tuple containing:
        - cv_results_df: DataFrame with the cross-validation metrics (Mean and Std Dev).
        - feature_importance_df: DataFrame with the mean feature importance (gain).
        - final_model_dict: The final trained LightGBM model, exported as a dictionary.
    """
    # --- Parameter setup ---
    target_col = args.get("target", "relevance_label")
    drop_cols = args.get("drop_cols", ["query_id", "doc_id", "relevance_score"])
    folds = args.get("folds", 5)
    seed = args.get("seed", 42)
    max_rounds = args.get("max_rounds", 1000)
    early_stop = args.get("early_stop", 50)
    learning_rate = args.get("learning_rate", 0.05)

    np.random.seed(seed)

    # --- Data Cleaning ---
    df = df.copy()
    constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    cols_to_drop = [c for c in drop_cols if c in df.columns]
    feature_cols = df.columns.difference(
        constant_cols + cols_to_drop + [target_col]
    ).tolist()

    # Strip prefixes from feature names and rename columns
    stripped_feature_mapping = {
        original_col: strip_feature_prefix(original_col)
        for original_col in feature_cols
    }
    df = df.rename(columns=stripped_feature_mapping)
    feature_cols = list(stripped_feature_mapping.values())

    # --- Handle Categorical Variables ---
    cat_cols = [
        c
        for c in df.select_dtypes(include=["object", "category"]).columns
        if c in feature_cols
    ]
    for c in cat_cols:
        df[c] = df[c].astype(str)
        df[c] = LabelEncoder().fit_transform(df[c])
    categorical_feature_idx = [feature_cols.index(c) for c in cat_cols]

    # --- Prepare X and y ---
    X = df[feature_cols]
    y = df[target_col].astype(int)

    # Store original names and rename columns for LightGBM compatibility
    original_feature_names = X.columns.tolist()
    X.columns = [f"feature_{i}" for i in range(len(X.columns))]
    feature_name_mapping = dict(zip(X.columns, original_feature_names))

    # --- Stratified K-Fold Cross-Validation ---
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    oof_pred = np.zeros(len(df))
    importance_frames = []
    fold_metrics = {"Accuracy": [], "ROC AUC": []}
    best_iterations = []

    print(f"Performing {folds}-Fold Stratified Cross-Validation...")
    for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), 1):
        X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
        X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]

        lgb_train = lgb.Dataset(
            X_train, y_train, categorical_feature=categorical_feature_idx
        )
        lgb_val = lgb.Dataset(X_val, y_val, reference=lgb_train)

        params = dict(
            objective="binary",
            metric="auc",
            seed=seed,
            verbose=-1,
            learning_rate=learning_rate,
            num_leaves=10,
            max_depth=3,
            feature_fraction=0.8,
            bagging_fraction=0.8,
            bagging_freq=5,
        )

        callbacks = [lgb.early_stopping(early_stop, verbose=False)]
        model = lgb.train(
            params,
            lgb_train,
            num_boost_round=max_rounds,
            valid_sets=[lgb_val],
            callbacks=callbacks,
        )
        best_iterations.append(model.best_iteration)

        val_preds = model.predict(X_val, num_iteration=model.best_iteration)
        oof_pred[val_idx] = val_preds

        fold_metrics["ROC AUC"].append(roc_auc_score(y_val, val_preds))
        fold_metrics["Accuracy"].append(
            accuracy_score(y_val, (val_preds > 0.5).astype(int))
        )
        print(
            f"Fold {fold}: AUC = {fold_metrics['ROC AUC'][-1]:.4f}, ACC = {fold_metrics['Accuracy'][-1]:.4f}"
        )

        importance_frames.append(
            pd.DataFrame(
                {
                    "feature": original_feature_names,
                    "gain": model.feature_importance(importance_type="gain"),
                }
            )
        )

    # --- Compile Results ---
    cv_results_df = pd.DataFrame(
        {
            "Metric": list(fold_metrics.keys()),
            "Mean": [np.mean(v) for v in fold_metrics.values()],
            "Std Dev": [np.std(v) for v in fold_metrics.values()],
        }
    )
    feature_importance_df = calculate_mean_importance(importance_frames)

    # --- Train Final Model ---
    final_features = feature_importance_df[feature_importance_df["gain"] > 0][
        "feature"
    ].tolist()
    print(
        f"\nTraining final model on {len(final_features)} features with non-zero importance."
    )

    # Map selected original names back to 'feature_i' names
    final_feature_indices = [
        key for key, val in feature_name_mapping.items() if val in final_features
    ]
    X_final = X[final_feature_indices]
    final_categorical_idx = [
        X_final.columns.get_loc(c)
        for c in X_final.columns
        if feature_name_mapping[c] in cat_cols
    ]

    full_dataset = lgb.Dataset(X_final, y, categorical_feature=final_categorical_idx)
    final_boost_rounds = int(np.mean(best_iterations))
    final_model = lgb.train(params, full_dataset, num_boost_round=final_boost_rounds)

    # Export model with original feature names
    model_json = final_model.dump_model()
    model_json_str = json.dumps(model_json)
    for renamed_feature, original_feature in feature_name_mapping.items():
        model_json_str = model_json_str.replace(
            f'"{renamed_feature}"', f'"{original_feature}"'
        )
    final_model_dict = json.loads(model_json_str)

    print("Training completed successfully!")
    return cv_results_df, feature_importance_df, final_model_dict


# 2. Define arguments as a dictionary
training_args = {
    "target": "relevance_label",
    "drop_cols": ["query_id", "doc_id", "relevance_score"],
    "folds": 5,
    "seed": 42,
    "max_rounds": 500,
    "early_stop": 25,
    "learning_rate": 0.05,
}

# 3. Run the cross-validation and get the results
cv_results, feature_importance, final_model = perform_cross_validation(
    df=second_phase_df, args=training_args
)
```

```text
Performing 5-Fold Stratified Cross-Validation...
Fold 1: AUC = 0.9727, ACC = 0.8095
Fold 2: AUC = 0.9636, ACC = 0.8571
Fold 3: AUC = 0.9798, ACC = 0.9000
Fold 4: AUC = 0.9798, ACC = 0.8500
Fold 5: AUC = 1.0000, ACC = 0.8000

Training final model on 14 features with non-zero importance.
Training completed successfully!
```

In [49]:

```python
cv_results
```

Out[49]:

| | Metric | Mean | Std Dev |
| --- | -------- | -------- | -------- |
| 0 | Accuracy | 0.843333 | 0.035964 |
| 1 | ROC AUC | 0.979192 | 0.011979 |

In [50]:
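With only five folds, the mean and standard deviation in the `cv_results` table above can be turned into a rough confidence band for each metric. A minimal sketch (not part of the original tutorial; the ±1.96·std/√n band assumes roughly normal fold scores, a strong assumption at n=5):

```python
import math


def fold_confidence_band(mean: float, std: float, folds: int) -> tuple:
    """Approximate 95% band for a cross-validated mean metric."""
    half_width = 1.96 * std / math.sqrt(folds)
    return (mean - half_width, mean + half_width)


# ROC AUC row from the cv_results table
low, high = fold_confidence_band(0.979192, 0.011979, folds=5)
print(f"ROC AUC ~ [{low:.3f}, {high:.3f}]")  # → ROC AUC ~ [0.969, 0.990]
```

A wide band here would suggest collecting more labeled queries before trusting small differences between rank profiles.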
```python
feature_importance[:15]
```

Out[50]:

| | feature | gain |
| --- | --------------------------------------------- | ---------- |
| 0 | nativeProximity | 183.686466 |
| 1 | firstPhase | 131.138263 |
| 2 | avg_top_3_chunk_sim_scores | 58.646572 |
| 3 | max_chunk_sim_scores | 40.141040 |
| 4 | elementCompleteness(chunks).queryCompleteness | 37.331087 |
| 5 | nativeRank | 13.850518 |
| 6 | avg_top_3_chunk_text_scores | 1.838134 |
| 7 | bm25(chunks) | 0.463590 |
| 8 | modified_freshness | 0.386416 |
| 9 | fieldMatch(title).absoluteProximity | 0.374392 |
| 10 | fieldMatch(title).orderness | 0.363286 |
| 11 | elementSimilarity(chunks) | 0.214760 |
| 12 | max_chunk_text_scores | 0.183127 |
| 13 | nativeFieldMatch | 0.119759 |
| 14 | fieldTermMatch(title,3).weight | 0.000000 |

### Feature importance analysis[¶](#feature-importance-analysis)

The trained model reveals which features are most important for ranking quality. (As this notebook runs in CI, and not everything from data_collection and training is deterministic, the exact feature importances may vary, but we *expect* the observations below to hold for most runs.)

Key observations:

- **Text proximity features** ([nativeProximity](https://docs.vespa.ai/en/reference/nativerank.html#nativeProximity)) are highly valuable for understanding query-document relevance
- **First-phase score** (`firstPhase`) being important validates that our first-phase ranking provides a good foundation
- **Chunk-level features** (both text and semantic) contribute significantly to ranking quality
- **Traditional text features** like [nativeRank](https://docs.vespa.ai/en/reference/nativerank.html#nativeRank) and [bm25](https://docs.vespa.ai/en/reference/bm25.html#ranking-function) remain important

In [51]:
```python
final_model
```

Out[51]:

```text
{'name': 'tree', 'version': 'v4', 'num_class': 1, 'num_tree_per_iteration': 1, 'label_index': 0, 'max_feature_idx': 16, 'objective': 'binary sigmoid:1', 'average_output': False, 'feature_names': ['avg_top_3_chunk_sim_scores', 'avg_top_3_chunk_text_scores', 'bm25(chunks)', 'bm25(chunks)', 'max_chunk_sim_scores', 'max_chunk_text_scores', 'modified_freshness', 'bm25(chunks)', 'bm25(chunks)', 'elementCompleteness(chunks).queryCompleteness', 'elementSimilarity(chunks)', 'fieldMatch(title).absoluteProximity', 'fieldMatch(title).orderness', 'firstPhase', 'nativeFieldMatch', 'nativeProximity', 'nativeRank'], 'monotone_constraints': [], 'feature_infos': {'avg_top_3_chunk_sim_scores': {'min_value': 0.08106629550457, 'max_value': 0.4134707450866699, 'values': []}, 'avg_top_3_chunk_text_scores': {'min_value': 0, 'max_value': 20.105823516845703, 'values': []}, 'bm25(chunks)': {'min_value': 0, 'max_value': 25.04552896302937, 'values': []}, 'max_chunk_sim_scores': {'min_value': 0.08106629550457, 'max_value': 0.4462931454181671, 'values': []}, 'max_chunk_text_scores': {'min_value': 0, 'max_value': 21.62700843811035, 'values': []}, 'modified_freshness': {'min_value': 0, 'max_value': 0.5671891292958484, 'values': []}, 'elementCompleteness(chunks).queryCompleteness': {'min_value': 0, 'max_value': 0.7777777777777778, 'values': []}, 'elementSimilarity(chunks)': {'min_value': 0, 'max_value': 0.7162878787878787, 'values': []}, 'fieldMatch(title).absoluteProximity': {'min_value': 0, 'max_value': 0.10000000149011612, 'values': []}, 'fieldMatch(title).orderness': {'min_value': 0, 'max_value': 1, 'values': []}, 'firstPhase': {'min_value': -5.438998465840945, 'max_value': 14.07283096376979, 'values': []}, 'nativeFieldMatch': {'min_value': 0, 'max_value': 0.3354072940571937, 'values': []}, 'nativeProximity': {'min_value': 0, 'max_value': 0.1963793884211417, 'values': []}, 'nativeRank': {'min_value': 0.0017429193899782137, 'max_value':
0.17263275990663562, 'values': []}}, 'tree_info': [{'tree_index': 0, 'num_leaves': 2, 'num_cat': 0, 'shrinkage': 1, 'tree_structure': {'split_index': 0, 'split_feature': 15, 'split_gain': 50.4098014831543, 'threshold': 0.02084435169178268, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.165181, 'internal_weight': 18.8831, 'internal_count': 76, 'left_child': {'leaf_index': 0, 'leaf_value': 0.08130811914532406, 'leaf_weight': 9.193098649382593, 'leaf_count': 37}, 'right_child': {'leaf_index': 1, 'leaf_value': 0.24475291179584288, 'leaf_weight': 9.690022900700567, 'leaf_count': 39}}}, {'tree_index': 1, 'num_leaves': 3, 'num_cat': 0, 'shrinkage': 0.05, 'tree_structure': {'split_index': 0, 'split_feature': 0, 'split_gain': 44.23429870605469, 'threshold': 0.18672376126050952, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.00762683, 'internal_weight': 18.8402, 'internal_count': 76, 'left_child': {'leaf_index': 0, 'leaf_value': -0.10463142349527131, 'leaf_weight': 5.986800223588946, 'leaf_count': 24}, 'right_child': {'split_index': 1, 'split_feature': 9, 'split_gain': 7.076389789581299, 'threshold': 0.44949494949494956, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.0599142, 'internal_weight': 12.8534, 'internal_count': 52, 'left_child': {'leaf_index': 1, 'leaf_value': 0.013179562064110115, 'leaf_weight': 4.968685954809187, 'leaf_count': 20}, 'right_child': {'leaf_index': 2, 'leaf_value': 0.08936491628319639, 'leaf_weight': 7.884672373533249, 'leaf_count': 32}}}}, {'tree_index': 2, 'num_leaves': 2, 'num_cat': 0, 'shrinkage': 0.05, 'tree_structure': {'split_index': 0, 'split_feature': 15, 'split_gain': 42.20650100708008, 'threshold': 0.02084435169178268, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.00729477, 'internal_weight': 18.7478, 'internal_count': 76, 'left_child': {'leaf_index': 0, 'leaf_value': 
-0.06880462126513588, 'leaf_weight': 9.240163266658785, 'leaf_count': 37}, 'right_child': {'leaf_index': 1, 'leaf_value': 0.08125312744778718, 'leaf_weight': 9.507659405469893, 'leaf_count': 39}}}, {'tree_index': 3, 'num_leaves': 2, 'num_cat': 0, 'shrinkage': 0.05, 'tree_structure': {'split_index': 0, 'split_feature': 15, 'split_gain': 38.436100006103516, 'threshold': 0.02084435169178268, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.00699584, 'internal_weight': 18.6093, 'internal_count': 76, 'left_child': {'leaf_index': 0, 'leaf_value': -0.06538935309867093, 'leaf_weight': 9.236633136868479, 'leaf_count': 37}, 'right_child': {'leaf_index': 1, 'leaf_value': 0.07833036395826393, 'leaf_weight': 9.372678577899931, 'leaf_count': 39}}}, {'tree_index': 4, 'num_leaves': 3, 'num_cat': 0, 'shrinkage': 0.05, 'tree_structure': {'split_index': 0, 'split_feature': 0, 'split_gain': 35.5458984375, 'threshold': 0.18672376126050952, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.00672514, 'internal_weight': 18.4298, 'internal_count': 76, 'left_child': {'leaf_index': 0, 'leaf_value': -0.09372889424381685, 'leaf_weight': 5.958949193358424, 'leaf_count': 24}, 'right_child': {'split_index': 1, 'split_feature': 9, 'split_gain': 5.318920135498047, 'threshold': 0.44949494949494956, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.0547252, 'internal_weight': 12.4708, 'internal_count': 52, 'left_child': {'leaf_index': 1, 'leaf_value': 0.014303727398432995, 'leaf_weight': 4.924616768956183, 'leaf_count': 20}, 'right_child': {'leaf_index': 2, 'leaf_value': 0.08110403985734628, 'leaf_weight': 7.546211168169975, 'leaf_count': 32}}}}, {'tree_index': 5, 'num_leaves': 3, 'num_cat': 0, 'shrinkage': 0.05, 'tree_structure': {'split_index': 0, 'split_feature': 0, 'split_gain': 38.138301849365234, 'threshold': 0.18672376126050952, 'decision_type': '<=', 'default_left': True, 
'missing_type': 'None', 'internal_value': 0.00466505, 'internal_weight': 17.5394, 'internal_count': 73, 'left_child': {'leaf_index': 0, 'leaf_value': -0.08973068432306786, 'leaf_weight': 6.64585913717747, 'leaf_count': 27}, 'right_child': {'split_index': 1, 'split_feature': 9, 'split_gain': 1.3554699420928955, 'threshold': 0.4641025641025642, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.0622534, 'internal_weight': 10.8935, 'internal_count': 46, 'left_child': {'leaf_index': 1, 'leaf_value': 0.04350337739463364, 'leaf_weight': 5.113931432366369, 'leaf_count': 21}, 'right_child': {'leaf_index': 2, 'leaf_value': 0.07884389694057212, 'leaf_weight': 5.779602885246277, 'leaf_count': 25}}}}, {'tree_index': 6, 'num_leaves': 3, 'num_cat': 0, 'shrinkage': 0.05, 'tree_structure': {'split_index': 0, 'split_feature': 0, 'split_gain': 34.902099609375, 'threshold': 0.18672376126050952, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.004498, 'internal_weight': 17.3039, 'internal_count': 73, 'left_child': {'leaf_index': 0, 'leaf_value': -0.08633609429142271, 'leaf_weight': 6.563828170299533, 'leaf_count': 27}, 'right_child': {'split_index': 1, 'split_feature': 15, 'split_gain': 1.338919997215271, 'threshold': 0.04231842199421151, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.0600115, 'internal_weight': 10.7401, 'internal_count': 46, 'left_child': {'leaf_index': 1, 'leaf_value': 0.04135593626110073, 'leaf_weight': 5.074008285999296, 'leaf_count': 21}, 'right_child': {'leaf_index': 2, 'leaf_value': 0.07671780288029927, 'leaf_weight': 5.66606205701828, 'leaf_count': 25}}}}, {'tree_index': 7, 'num_leaves': 3, 'num_cat': 0, 'shrinkage': 0.05, 'tree_structure': {'split_index': 0, 'split_feature': 0, 'split_gain': 32.02009963989258, 'threshold': 0.18672376126050952, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.00434441, 
'internal_weight': 17.0374, 'internal_count': 73, 'left_child': {'leaf_index': 0, 'leaf_value': -0.08334419516313175, 'leaf_weight': 6.4620268940925625, 'leaf_count': 27}, 'right_child': {'split_index': 1, 'split_feature': 13, 'split_gain': 1.350219964981079, 'threshold': 2.3306006116972546, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.0579262, 'internal_weight': 10.5754, 'internal_count': 46, 'left_child': {'leaf_index': 1, 'leaf_value': 0.039874616438302576, 'leaf_weight': 5.23301127552986, 'leaf_count': 22}, 'right_child': {'leaf_index': 2, 'leaf_value': 0.075608344236657, 'leaf_weight': 5.342339798808098, 'leaf_count': 24}}}}, {'tree_index': 8, 'num_leaves': 3, 'num_cat': 0, 'shrinkage': 0.05, 'tree_structure': {'split_index': 0, 'split_feature': 0, 'split_gain': 29.436899185180664, 'threshold': 0.18672376126050952, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.00420139, 'internal_weight': 16.7481, 'internal_count': 73, 'left_child': {'leaf_index': 0, 'leaf_value': -0.08069001048178517, 'leaf_weight': 6.343828111886981, 'leaf_count': 27}, 'right_child': {'split_index': 1, 'split_feature': 9, 'split_gain': 1.3577200174331665, 'threshold': 0.4641025641025642, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.0559624, 'internal_weight': 10.4043, 'internal_count': 46, 'left_child': {'leaf_index': 1, 'leaf_value': 0.03721400081314201, 'leaf_weight': 5.008224830031393, 'leaf_count': 21}, 'right_child': {'leaf_index': 2, 'leaf_value': 0.07336338756704952, 'leaf_weight': 5.396055206656456, 'leaf_count': 25}}}}, {'tree_index': 9, 'num_leaves': 3, 'num_cat': 0, 'shrinkage': 0.05, 'tree_structure': {'split_index': 0, 'split_feature': 0, 'split_gain': 27.117399215698242, 'threshold': 0.18672376126050952, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.00406947, 'internal_weight': 16.4361, 'internal_count': 
73, 'left_child': {'leaf_index': 0, 'leaf_value': -0.0783218588683625, 'leaf_weight': 6.212180107831958, 'leaf_count': 27}, 'right_child': {'split_index': 1, 'split_feature': 13, 'split_gain': 1.3397400379180908, 'threshold': 2.3306006116972546, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.0541313, 'internal_weight': 10.2239, 'internal_count': 46, 'left_child': {'leaf_index': 1, 'leaf_value': 0.03614212999194114, 'leaf_weight': 5.143270537257193, 'leaf_count': 22}, 'right_child': {'leaf_index': 2, 'leaf_value': 0.07234219952515168, 'leaf_weight': 5.080672308802605, 'leaf_count': 24}}}}, {'tree_index': 10, 'num_leaves': 3, 'num_cat': 0, 'shrinkage': 0.05, 'tree_structure': {'split_index': 0, 'split_feature': 15, 'split_gain': 24.532800674438477, 'threshold': 0.02681743703534994, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': -0.0040159, 'internal_weight': 17.9796, 'internal_count': 81, 'left_child': {'split_index': 1, 'split_feature': 1, 'split_gain': 7.316380023956299, 'threshold': 3.092608213424683, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': -0.0496308, 'internal_weight': 11.1677, 'internal_count': 48, 'left_child': {'leaf_index': 0, 'leaf_value': -0.0856005281817455, 'leaf_weight': 6.239090889692308, 'leaf_count': 27}, 'right_child': {'leaf_index': 2, 'leaf_value': -0.004096688964982691, 'leaf_weight': 4.92857152223587, 'leaf_count': 21}}, 'right_child': {'leaf_index': 1, 'leaf_value': 0.07076665519154234, 'leaf_weight': 6.811910331249236, 'leaf_count': 33}}}, {'tree_index': 11, 'num_leaves': 3, 'num_cat': 0, 'shrinkage': 0.05, 'tree_structure': {'split_index': 0, 'split_feature': 13, 'split_gain': 23.044300079345703, 'threshold': -0.9175117702774908, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': -0.00387752, 'internal_weight': 17.6602, 'internal_count': 81, 'left_child': {'leaf_index': 0, 
'leaf_value': -0.07470333072738213, 'leaf_weight': 6.959094658493998, 'leaf_count': 31}, 'right_child': {'split_index': 1, 'split_feature': 13, 'split_gain': 3.699049949645996, 'threshold': 1.8772808596672073, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.0421818, 'internal_weight': 10.7011, 'internal_count': 50, 'left_child': {'leaf_index': 1, 'leaf_value': 0.011210025880369016, 'leaf_weight': 5.071562081575392, 'leaf_count': 22}, 'right_child': {'leaf_index': 2, 'leaf_value': 0.07008390819526038, 'leaf_weight': 5.629503101110458, 'leaf_count': 28}}}}, {'tree_index': 12, 'num_leaves': 3, 'num_cat': 0, 'shrinkage': 0.05, 'tree_structure': {'split_index': 0, 'split_feature': 15, 'split_gain': 21.399799346923828, 'threshold': 0.02681743703534994, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': -0.00374963, 'internal_weight': 17.3372, 'internal_count': 81, 'left_child': {'split_index': 1, 'split_feature': 2, 'split_gain': 5.836999893188477, 'threshold': 3.5472756680480115, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': -0.046492, 'internal_weight': 10.89, 'internal_count': 48, 'left_child': {'leaf_index': 0, 'leaf_value': -0.08218247103000542, 'leaf_weight': 5.5828584283590335, 'leaf_count': 25}, 'right_child': {'leaf_index': 2, 'leaf_value': -0.008947176009566292, 'leaf_weight': 5.307131439447403, 'leaf_count': 23}}, 'right_child': {'leaf_index': 1, 'leaf_value': 0.06844637890571116, 'leaf_weight': 6.447218477725982, 'leaf_count': 33}}}, {'tree_index': 13, 'num_leaves': 3, 'num_cat': 0, 'shrinkage': 0.05, 'tree_structure': {'split_index': 0, 'split_feature': 13, 'split_gain': 19.988399505615234, 'threshold': -0.9175117702774908, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': -0.00362511, 'internal_weight': 17.0069, 'internal_count': 81, 'left_child': {'leaf_index': 0, 'leaf_value': -0.07099178346638545, 
'leaf_weight': 6.683696135878566, 'leaf_count': 31}, 'right_child': {'split_index': 1, 'split_feature': 13, 'split_gain': 3.370919942855835, 'threshold': 1.8772808596672073, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.0399912, 'internal_weight': 10.3232, 'internal_count': 50, 'left_child': {'leaf_index': 1, 'leaf_value': 0.010651145320500731, 'leaf_weight': 5.024654343724249, 'leaf_count': 22}, 'right_child': {'leaf_index': 2, 'leaf_value': 0.06781494678857775, 'leaf_weight': 5.298499584197998, 'leaf_count': 28}}}}, {'tree_index': 14, 'num_leaves': 3, 'num_cat': 0, 'shrinkage': 0.05, 'tree_structure': {'split_index': 0, 'split_feature': 15, 'split_gain': 18.75670051574707, 'threshold': 0.02681743703534994, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': -0.00351166, 'internal_weight': 16.6706, 'internal_count': 81, 'left_child': {'split_index': 1, 'split_feature': 1, 'split_gain': 5.915229797363281, 'threshold': 3.092608213424683, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': -0.0436897, 'internal_weight': 10.592, 'internal_count': 48, 'left_child': {'leaf_index': 0, 'leaf_value': -0.07794227861102893, 'leaf_weight': 5.755450502038004, 'leaf_count': 27}, 'right_child': {'leaf_index': 2, 'leaf_value': -0.002928969366291567, 'leaf_weight': 4.836504548788071, 'leaf_count': 21}}, 'right_child': {'leaf_index': 1, 'leaf_value': 0.06649753931977596, 'leaf_weight': 6.0786804407835, 'leaf_count': 33}}}, {'tree_index': 15, 'num_leaves': 3, 'num_cat': 0, 'shrinkage': 0.05, 'tree_structure': {'split_index': 0, 'split_feature': 9, 'split_gain': 19.521400451660156, 'threshold': 0.44949494949494956, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': -0.00670763, 'internal_weight': 16.4224, 'internal_count': 83, 'left_child': {'split_index': 1, 'split_feature': 1, 'split_gain': 2.7174599170684814, 'threshold': 
2.4830845594406132, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': -0.0505802, 'internal_weight': 9.96688, 'internal_count': 47, 'left_child': {'leaf_index': 0, 'leaf_value': -0.0748234090211124, 'leaf_weight': 5.352049484848978, 'leaf_count': 26}, 'right_child': {'leaf_index': 2, 'leaf_value': -0.022464201444285303, 'leaf_weight': 4.614828139543533, 'leaf_count': 21}}, 'right_child': {'leaf_index': 1, 'leaf_value': 0.06102882167971128, 'leaf_weight': 6.455503240227698, 'leaf_count': 36}}}, {'tree_index': 16, 'num_leaves': 4, 'num_cat': 0, 'shrinkage': 0.05, 'tree_structure': {'split_index': 0, 'split_feature': 15, 'split_gain': 20.915599822998047, 'threshold': 0.02084435169178268, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': -0.00650951, 'internal_weight': 16.0734, 'internal_count': 83, 'left_child': {'split_index': 1, 'split_feature': 1, 'split_gain': 0.7167580127716064, 'threshold': 2.181384921073914, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': -0.0584402, 'internal_weight': 8.78815, 'internal_count': 42, 'left_child': {'leaf_index': 0, 'leaf_value': -0.07283105883520614, 'leaf_weight': 4.359884411096575, 'leaf_count': 22}, 'right_child': {'leaf_index': 2, 'leaf_value': -0.044271557748743806, 'leaf_weight': 4.428262785077095, 'leaf_count': 20}}, 'right_child': {'split_index': 2, 'split_feature': 6, 'split_gain': 0.27922600507736206, 'threshold': 0.48491415168876384, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.0561343, 'internal_weight': 7.28523, 'internal_count': 41, 'left_child': {'leaf_index': 1, 'leaf_value': 0.046566645885268335, 'leaf_weight': 3.725804477930068, 'leaf_count': 21}, 'right_child': {'leaf_index': 3, 'leaf_value': 0.06614921330100301, 'leaf_weight': 3.559424474835396, 'leaf_count': 20}}}}, {'tree_index': 17, 'num_leaves': 4, 'num_cat': 0, 'shrinkage': 0.05, 'tree_structure': 
{'split_index': 0, 'split_feature': 15, 'split_gain': 19.341999053955078, 'threshold': 0.02084435169178268, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': -0.0063281, 'internal_weight': 15.7046, 'internal_count': 83, 'left_child': {'split_index': 1, 'split_feature': 1, 'split_gain': 0.7211930155754089, 'threshold': 2.181384921073914, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': -0.0566291, 'internal_weight': 8.62062, 'internal_count': 42, 'left_child': {'leaf_index': 0, 'leaf_value': -0.07136146887938256, 'leaf_weight': 4.2304699271917325, 'leaf_count': 22}, 'right_child': {'leaf_index': 2, 'leaf_value': -0.04243262371555604, 'leaf_weight': 4.390146732330322, 'leaf_count': 20}}, 'right_child': {'split_index': 2, 'split_feature': 4, 'split_gain': 0.17738600075244904, 'threshold': 0.3187254816293717, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.054884, 'internal_weight': 7.08399, 'internal_count': 41, 'left_child': {'leaf_index': 1, 'leaf_value': 0.04723412069233138, 'leaf_weight': 3.6613249629735956, 'leaf_count': 20}, 'right_child': {'leaf_index': 3, 'leaf_value': 0.06306728950350501, 'leaf_weight': 3.4226654171943665, 'leaf_count': 21}}}}, {'tree_index': 18, 'num_leaves': 4, 'num_cat': 0, 'shrinkage': 0.05, 'tree_structure': {'split_index': 0, 'split_feature': 15, 'split_gain': 17.89940071105957, 'threshold': 0.02084435169178268, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': -0.00615586, 'internal_weight': 15.3347, 'internal_count': 83, 'left_child': {'split_index': 1, 'split_feature': 9, 'split_gain': 0.660440981388092, 'threshold': 0.22649572649572652, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': -0.0549116, 'internal_weight': 8.45071, 'internal_count': 42, 'left_child': {'leaf_index': 0, 'leaf_value': -0.06930617045321565, 'leaf_weight': 4.101271376013754, 
'leaf_count': 22}, 'right_child': {'leaf_index': 2, 'leaf_value': -0.04133840308087882, 'leaf_weight': 4.349441319704056, 'leaf_count': 20}}, 'right_child': {'split_index': 2, 'split_feature': 15, 'split_gain': 0.189178004860878, 'threshold': 0.05606487282356567, 'decision_type': '<=', 'default_left': True, 'missing_type': 'None', 'internal_value': 0.0536959, 'internal_weight': 6.88402, 'internal_count': 41, 'left_child': {'leaf_index': 1, 'leaf_value': 0.04578324148730414, 'leaf_weight': 3.6016914695501336, 'leaf_count': 20}, 'right_child': {'leaf_index': 3, 'leaf_value': 0.062378436439081024, 'leaf_weight': 3.282333254814148, 'leaf_count': 21}}}}], 'feature_importances': {'avg_top_3_chunk_sim_scores': 7, 'avg_top_3_chunk_text_scores': 5, 'bm25(chunks)': 1, 'max_chunk_sim_scores': 1, 'modified_freshness': 1, 'elementCompleteness(chunks).queryCompleteness': 6, 'firstPhase': 6, 'nativeProximity': 11}, 'pandas_categorical': []}
```

### Integrating the GBDT model into Vespa[¶](#integrating-the-gbdt-model-into-vespa)

The trained LightGBM model can be exported and added to your Vespa application package:

```txt
app/
├── models/
│   └── lightgbm_model.json
```

In [52]:

```python
# Write the final model to a file
model_file = repo_root / "app" / "models" / "lightgbm_model.json"
with open(model_file, "w") as f:
    json.dump(final_model, f, indent=2)
```

Create a new rank profile that uses this model:

In [53]:
``` second_gbdt_rp = ( repo_root / "app" / "schemas" / "doc" / "second-with-gbdt.profile" ).read_text() display_md(second_gbdt_rp, tag="txt") ``` second_gbdt_rp = ( repo_root / "app" / "schemas" / "doc" / "second-with-gbdt.profile" ).read_text() display_md(second_gbdt_rp, tag="txt") ``` txt rank-profile second-with-gbdt inherits collect-second-phase { match-features { max_chunk_sim_scores max_chunk_text_scores avg_top_3_chunk_text_scores avg_top_3_chunk_sim_scores bm25(title) modified_freshness open_count firstPhase } # nativeProximity,168.84977385997772 # firstPhase,151.73823466300965 # max_chunk_sim_scores,69.43774781227111 # avg_top_3_chunk_text_scores,56.507930064201354 # avg_top_3_chunk_sim_scores,31.87002867460251 # nativeRank,20.071615393646063 # nativeFieldMatch,15.991393876075744 # elementSimilarity(chunks),9.700291919708253 # bm25(chunks),3.8777143508195877 # max_chunk_text_scores,3.6405647873878477 # "fieldTermMatch(chunks,4).firstPosition",1.2615019798278808 # "fieldTermMatch(chunks,4).occurrences",1.0542740106582642 # "fieldTermMatch(chunks,4).weight",0.7263560056686401 # term(3).significance,0.5077840089797974 rank-features { nativeProximity nativeFieldMatch nativeRank elementSimilarity(chunks) fieldTermMatch(chunks, 4).firstPosition fieldTermMatch(chunks, 4).occurrences fieldTermMatch(chunks, 4).weight term(3).significance } second-phase { expression: lightgbm("lightgbm_model.json") } summary-features: top_3_chunk_sim_scores } ``` Then redeploy the application. We wrap the deployment in a try/except block in case your authentication token has expired. In \[54\]: Copied!
``` try: app: Vespa = vespa_cloud.deploy(disk_folder=application_root) except Exception: vespa_cloud = VespaCloud( tenant=VESPA_TENANT_NAME, application=VESPA_APPLICATION_NAME, key_content=VESPA_TEAM_API_KEY, application_root=application_root, ) app: Vespa = vespa_cloud.deploy(disk_folder=application_root) ``` try: app: Vespa = vespa_cloud.deploy(disk_folder=application_root) except Exception: vespa_cloud = VespaCloud( tenant=VESPA_TENANT_NAME, application=VESPA_APPLICATION_NAME, key_content=VESPA_TEAM_API_KEY, application_root=application_root, ) app: Vespa = vespa_cloud.deploy(disk_folder=application_root) ``` Deployment started in run 87 of dev-aws-us-east-1c for vespa-team.rag-blueprint. This may take a few minutes the first time. INFO [09:43:43] Deploying platform version 8.586.25 and application dev build 87 for dev-aws-us-east-1c of default ... INFO [09:43:43] Using CA signed certificate version 5 INFO [09:43:52] Session 379708 for tenant 'vespa-team' prepared and activated. INFO [09:43:52] ######## Details for all nodes ######## INFO [09:43:52] h125699b.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP INFO [09:43:52] --- platform vespa/cloud-tenant-rhel8:8.586.25 INFO [09:43:52] --- storagenode on port 19102 has config generation 379705, wanted is 379708 INFO [09:43:52] --- searchnode on port 19107 has config generation 379708, wanted is 379708 INFO [09:43:52] --- distributor on port 19111 has config generation 379708, wanted is 379708 INFO [09:43:52] --- metricsproxy-container on port 19092 has config generation 379708, wanted is 379708 INFO [09:43:52] h125755a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP INFO [09:43:52] --- platform vespa/cloud-tenant-rhel8:8.586.25 INFO [09:43:52] --- container on port 4080 has config generation 379708, wanted is 379708 INFO [09:43:52] --- metricsproxy-container on port 19092 has config generation 379708, wanted is 379708 INFO [09:43:52] h97530b.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP INFO 
[09:43:52] --- platform vespa/cloud-tenant-rhel8:8.586.25 INFO [09:43:52] --- logserver-container on port 4080 has config generation 379708, wanted is 379708 INFO [09:43:52] --- metricsproxy-container on port 19092 has config generation 379708, wanted is 379708 INFO [09:43:52] h119190c.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP INFO [09:43:52] --- platform vespa/cloud-tenant-rhel8:8.586.25 INFO [09:43:52] --- container-clustercontroller on port 19050 has config generation 379708, wanted is 379708 INFO [09:43:52] --- metricsproxy-container on port 19092 has config generation 379708, wanted is 379708 INFO [09:43:59] Found endpoints: INFO [09:43:59] - dev.aws-us-east-1c INFO [09:43:59] |-- https://fe5fe13c.fe19121d.z.vespa-app.cloud/ (cluster 'default') INFO [09:43:59] Deployment of new application revision complete! Only region: aws-us-east-1c available in dev environment. Found mtls endpoint for default URL: https://fe5fe13c.fe19121d.z.vespa-app.cloud/ Application is up! ``` ### Evaluating second-phase ranking performance[¶](#evaluating-second-phase-ranking-performance) Let us run the ranking evaluation to evaluate the GBDT-powered second-phase ranking on unseen test queries: In \[55\]: Copied! 
``` def rank_second_phase_query_fn(query_text: str, top_k: int) -> dict: return { "yql": str( qb.select("*") .from_(VESPA_SCHEMA_NAME) .where( qb.nearestNeighbor( field="title_embedding", query_vector="embedding", annotations={"targetHits": 100}, ) | qb.nearestNeighbor( field="chunk_embeddings", query_vector="embedding", annotations={"targetHits": 100}, ) | qb.userQuery( query_text, ) ) ), "hits": top_k, "query": query_text, "ranking": "second-with-gbdt", "input.query(embedding)": f"embed({query_text})", "input.query(float_embedding)": f"embed({query_text})", "presentation.summary": "no-chunks", } second_phase_evaluator = VespaEvaluator( queries=test_ids_to_query, relevant_docs=test_relevant_docs, vespa_query_fn=rank_second_phase_query_fn, id_field="id", app=app, name="second-phase-evaluation", write_csv=False, precision_recall_at_k=[10, 20], ) second_phase_results = second_phase_evaluator() ``` def rank_second_phase_query_fn(query_text: str, top_k: int) -> dict: return { "yql": str( qb.select("\*") .from\_(VESPA_SCHEMA_NAME) .where( qb.nearestNeighbor( field="title_embedding", query_vector="embedding", annotations={"targetHits": 100}, ) | qb.nearestNeighbor( field="chunk_embeddings", query_vector="embedding", annotations={"targetHits": 100}, ) | qb.userQuery( query_text, ) ) ), "hits": top_k, "query": query_text, "ranking": "second-with-gbdt", "input.query(embedding)": f"embed({query_text})", "input.query(float_embedding)": f"embed({query_text})", "presentation.summary": "no-chunks", } second_phase_evaluator = VespaEvaluator( queries=test_ids_to_query, relevant_docs=test_relevant_docs, vespa_query_fn=rank_second_phase_query_fn, id_field="id", app=app, name="second-phase-evaluation", write_csv=False, precision_recall_at_k=[10, 20], ) second_phase_results = second_phase_evaluator() In \[56\]: Copied! 
``` second_phase_results ``` second_phase_results Out\[56\]: ``` {'accuracy@1': 0.75, 'accuracy@3': 0.95, 'accuracy@5': 0.95, 'accuracy@10': 1.0, 'precision@10': 0.24000000000000005, 'recall@10': 0.9651515151515152, 'precision@20': 0.12999999999999998, 'recall@20': 0.9954545454545455, 'mrr@10': 0.8404761904761905, 'ndcg@10': 0.8391408637111896, 'map@100': 0.7673197781750414, 'searchtime_avg': 0.03360000000000001, 'searchtime_q50': 0.0285, 'searchtime_q90': 0.05120000000000001, 'searchtime_q95': 0.0534} ``` In \[57\]: Copied! ``` second_phase_df = pd.DataFrame(second_phase_results, index=["value"]).T second_phase_df ``` second_phase_df = pd.DataFrame(second_phase_results, index=["value"]).T second_phase_df Out\[57\]: | | value | | -------------- | -------- | | accuracy@1 | 0.750000 | | accuracy@3 | 0.950000 | | accuracy@5 | 0.950000 | | accuracy@10 | 1.000000 | | precision@10 | 0.240000 | | recall@10 | 0.965152 | | precision@20 | 0.130000 | | recall@20 | 0.995455 | | mrr@10 | 0.840476 | | ndcg@10 | 0.839141 | | map@100 | 0.767320 | | searchtime_avg | 0.033600 | | searchtime_q50 | 0.028500 | | searchtime_q90 | 0.051200 | | searchtime_q95 | 0.053400 | Let us compare these results with the first-phase results: In \[58\]: Copied!
``` total_df = pd.concat( [ first_phase_df.rename(columns={"value": "first_phase"}), second_phase_df.rename(columns={"value": "second_phase"}), ], axis=1, ) # Add diff total_df["diff"] = total_df["second_phase"] - total_df["first_phase"] total_df = total_df.round(4) # highlight recall@10 row and recall@20 row # Define a function to apply the style def highlight_rows_by_index(row, indices_to_highlight): if row.name in indices_to_highlight: return ["background-color: lightblue; color: black"] * len(row) return [""] * len(row) total_df.style.apply( highlight_rows_by_index, indices_to_highlight=["recall@10", "recall@20"], axis=1, ) ``` total_df = pd.concat( [ first_phase_df.rename(columns={"value": "first_phase"}), second_phase_df.rename(columns={"value": "second_phase"}), ], axis=1, ) # Add diff total_df["diff"] = total_df["second_phase"] - total_df["first_phase"] total_df = total_df.round(4) # highlight recall@10 row and recall@20 row # Define a function to apply the style def highlight_rows_by_index(row, indices_to_highlight): if row.name in indices_to_highlight: return ["background-color: lightblue; color: black"] * len(row) return [""] * len(row) total_df.style.apply( highlight_rows_by_index, indices_to_highlight=["recall@10", "recall@20"], axis=1, ) Out\[58\]: | | first_phase | second_phase | diff | | -------------- | ----------- | ------------ | --------- | | accuracy@1 | 1.000000 | 0.750000 | -0.250000 | | accuracy@3 | 1.000000 | 0.950000 | -0.050000 | | accuracy@5 | 1.000000 | 0.950000 | -0.050000 | | accuracy@10 | 1.000000 | 1.000000 | 0.000000 | | precision@10 | 0.235000 | 0.240000 | 0.005000 | | recall@10 | 0.940500 | 0.965200 | 0.024600 | | precision@20 | 0.127500 | 0.130000 | 0.002500 | | recall@20 | 0.990900 | 0.995500 | 0.004500 | | mrr@10 | 1.000000 | 0.840500 | -0.159500 | | ndcg@10 | 0.889300 | 0.839100 | -0.050200 | | map@100 | 0.818300 | 0.767300 | -0.051000 | | searchtime_avg | 0.040900 | 0.033600 | -0.007200 | | searchtime_q50 | 0.042500 | 
0.028500 | -0.014000 | | searchtime_q90 | 0.060400 | 0.051200 | -0.009200 | | searchtime_q95 | 0.083100 | 0.053400 | -0.029700 | For a larger dataset, we would expect to see a significant improvement over first-phase ranking. Since our first-phase ranking is already quite good, we cannot see that here, but we leave the comparison code for you to run on a real-world dataset. We also observe a slight decrease in average search time in this run (from about 41 ms to 34 ms, see `searchtime_avg` above), although in general the added complexity of the GBDT model can be expected to increase latency. ### Query profiles with GBDT ranking[¶](#query-profiles-with-gbdt-ranking) Create new query profiles that leverage the improved ranking: In \[59\]: Copied! ``` hybrid_with_gbdt_qp = (qp_dir / "hybrid-with-gbdt.xml").read_text() display_md(hybrid_with_gbdt_qp, tag="xml") ``` hybrid_with_gbdt_qp = (qp_dir / "hybrid-with-gbdt.xml").read_text() display_md(hybrid_with_gbdt_qp, tag="xml") ``` 20 second-with-gbdt top_3_chunks ``` In \[60\]: Copied! ``` rag_with_gbdt_qp = (qp_dir / "rag-with-gbdt.xml").read_text() display_md(rag_with_gbdt_qp, tag="xml") ``` rag_with_gbdt_qp = (qp_dir / "rag-with-gbdt.xml").read_text() display_md(rag_with_gbdt_qp, tag="xml") ``` 50 openai sse ``` Test the improved ranking: In \[61\]: Copied! ``` query = "what are key points learned for finetuning llms?" query_profile = "hybrid-with-gbdt" body = { "query": query, "queryProfile": query_profile, } with app.syncio() as sess: result = sess.query(body=body) result.hits[0] ``` query = "what are key points learned for finetuning llms?"
query_profile = "hybrid-with-gbdt" body = { "query": query, "queryProfile": query_profile, } with app.syncio() as sess: result = sess.query(body=body) result.hits[0] Out\[61\]: ``` {'id': 'index:content/0/a3f390d8c35680335e3aebe1', 'relevance': 0.8034803261636057, 'source': 'content', 'fields': {'matchfeatures': {'bm25(title)': 0.0, 'firstPhase': 1.9722333906160157, 'avg_top_3_chunk_sim_scores': 0.2565740570425987, 'avg_top_3_chunk_text_scores': 4.844822406768799, 'max_chunk_sim_scores': 0.2736895978450775, 'max_chunk_text_scores': 7.804652690887451, 'modified_freshness': 0.5275786815220422, 'open_count': 7.0}, 'sddocname': 'doc', 'chunks_top3': ["# Parameter-Efficient Fine-Tuning (PEFT) Techniques - Overview\n\n**Goal:** Fine-tune large pre-trained models with significantly fewer trainable parameters, reducing computational cost and memory footprint.\n\n**Key Techniques I've Researched/Used:**\n\n1. **LoRA (Low-Rank Adaptation):**\n * Freezes pre-trained model weights.\n * Injects trainable rank decomposition matrices into Transformer layers.\n * Significantly reduces trainable parameters.\n * My default starting point for LLM fine-tuning (see `llm_finetuning_pitfalls_best_practices.md`).\n\n2. **QLoRA:**\n * Builds on LoRA.\n * Quantizes pre-trained model to 4-bit.\n * Uses LoRA for fine-tuning the quantized model.\n * Further reduces memory usage, enabling fine-tuning of larger models on ", 'consumer GPUs.\n\n3. **Adapter Modules:**\n * Inserts small, trainable neural network modules (adapters) between existing layers of the pre-trained model.\n * Only adapters are trained.\n\n4. 
**Prompt Tuning / Prefix Tuning:**\n * Keeps model parameters frozen.\n * Learns a small set of continuous prompt embeddings (virtual tokens) that are prepended to the input sequence.\n\n**Benefits for SynapseFlow (Internal Model Dev):**\n- Faster iteration on fine-tuning tasks.\n- Ability to experiment with larger models on available hardware.\n- Easier to manage multiple fine-tuned model versions (smaller delta to store).\n\n## (Links to papers, Hugging Face PEFT library notes)'], 'summaryfeatures': {'top_3_chunk_sim_scores': {'type': 'tensor(chunk{})', 'cells': {'0': 0.2736895978450775, '1': 0.23945851624011993}}, 'vespa.summaryFeatures.cached': 0.0}}} ``` Let us summarize our best practices for second-phase ranking. ### Best practices for second-phase ranking[¶](#best-practices-for-second-phase-ranking) **Model complexity considerations:** - Use more sophisticated models (GBDT, neural networks) that would be too expensive for first-phase - Take advantage of the reduced candidate set (typically 100-10,000 documents) - Include expensive text features like `nativeProximity` and `fieldMatch` **Feature engineering:** - Combine first-phase scores with additional text and semantic features - Use chunk-level aggregations (max, average, top-k) to capture document structure - Include metadata signals **Training data quality:** - Use the first-phase ranking to generate better training data - Consider having LLMs generate relevance judgments for top-k results - Iteratively improve with user interaction data when available **Performance monitoring:** - Monitor latency impact of second-phase ranking - Adjust `rerank-count` based on quality vs. performance trade-offs - Consider using different models for different query types or use cases The second-phase ranking represents a crucial step in building high-quality RAG applications, providing the precision needed for effective LLM context while maintaining reasonable query latencies. 
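The chunk-level aggregations recommended above (e.g. `max_chunk_sim_scores` and `avg_top_3_chunk_sim_scores` used in this tutorial) are computed as tensor expressions inside the Vespa rank profile. As a minimal illustration of the aggregation itself, here is a pure-Python sketch with hypothetical per-chunk scores:

```python
# Sketch: aggregate per-chunk scores into document-level ranking features,
# mirroring features like max_chunk_sim_scores / avg_top_3_chunk_sim_scores.
# In Vespa these are tensor expressions in the rank profile; this is only
# to illustrate the aggregation logic.

def chunk_aggregation_features(chunk_scores: list[float], k: int = 3) -> dict:
    """Reduce a document's per-chunk scores to max and average-of-top-k."""
    if not chunk_scores:
        return {"max_chunk_score": 0.0, f"avg_top_{k}_chunk_scores": 0.0}
    top_k = sorted(chunk_scores, reverse=True)[:k]
    return {
        "max_chunk_score": max(chunk_scores),
        f"avg_top_{k}_chunk_scores": sum(top_k) / len(top_k),
    }

# Hypothetical similarity scores for the chunks of one document
print(chunk_aggregation_features([0.27, 0.24, 0.11, 0.05]))
```

The top-k average smooths over a single lucky chunk match, while the max captures the best matching passage; using both gives the GBDT model complementary views of document structure.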
## (Optional) Global-phase ranking[¶](#optional-global-phase-ranking) We also have the option of configuring [global-phase](https://docs.vespa.ai/en/reference/schema-reference.html#globalphase-rank) ranking, which reranks the top k documents (as set by the `rerank-count` parameter) from the second-phase ranking. Common options for the global phase are [cross-encoders](https://docs.vespa.ai/en/cross-encoders.html) or another GBDT model trained to better separate the top-ranked documents, using objectives such as [LambdaMART](https://xgboost.readthedocs.io/en/latest/tutorials/learning_to_rank.html). For RAG applications, we consider this less important than for search applications where the results are mainly consumed by a human, as LLMs are less sensitive to the ordering of the results. See also our notebook on using [cross-encoders for global reranking](https://vespa-engine.github.io/pyvespa/examples/cross-encoders-for-global-reranking.md). ## Further improvements[¶](#further-improvements) Finally, we will sketch out some opportunities for further improvement. Note that we have not discussed what most people think of first when discussing RAG evals: evaluating the "Generation" step. Several tools are available for this, for example [ragas](https://docs.ragas.io/en/stable/) and [ARES](https://github.com/stanford-futuredata/ARES); we refer to other sources for details, as this tutorial is probably enough to digest as it is. As you have seen, we started out with only binary relevance labels for a few queries, and trained a model based on the relevant docs and a set of random documents. This was useful initially, as we had no better way to retrieve the candidate documents. Now that we have a reasonably good second-phase ranking, we could generate a new set of relevance labels for queries that previously had none, by having an LLM do relevance judgments of the top k returned hits.
This training dataset would likely be even better in separating the top documents. ## Structured output from the LLM[¶](#structured-output-from-the-llm) Let us also show how we can request structured JSON output from the LLM, which can be useful for several reasons, the most common probably being citations. In \[62\]: Copied! ``` from vespa.io import VespaResponse import json schema = { "type": "object", "properties": { "answer": { "type": "string", "description": "The answer to the query if it is contained in the documents. If not, it say that you are not allowed to answer based on the documents.", }, "citations": { "type": "array", "description": "List of returned and cited document IDs", "items": {"type": "string"}, }, }, "required": ["answer", "citations"], "additionalProperties": False, } query = "What is SynapseFlows strategy" body = { "query": query, "queryProfile": "hybrid", "searchChain": "openai", "llm.json_schema": json.dumps(schema), "presentation.format": "json", } with app.syncio() as sess: resp = sess.query(body=body) def response_to_string(response: VespaResponse): """ Convert a Vespa response to a string of the returned tokens. """ children = response.json.get("root", {}).get("children", []) tokens = "" for child in children: if child.get("id") == "event_stream": for stream_child in child.get("children", []): tokens += stream_child.get("fields", {}).get("token", "") return tokens tokens = response_to_string(resp) json.loads(tokens) ``` from vespa.io import VespaResponse import json schema = { "type": "object", "properties": { "answer": { "type": "string", "description": "The answer to the query if it is contained in the documents. 
If not, it say that you are not allowed to answer based on the documents.", }, "citations": { "type": "array", "description": "List of returned and cited document IDs", "items": {"type": "string"}, }, }, "required": ["answer", "citations"], "additionalProperties": False, } query = "What is SynapseFlows strategy" body = { "query": query, "queryProfile": "hybrid", "searchChain": "openai", "llm.json_schema": json.dumps(schema), "presentation.format": "json", } with app.syncio() as sess: resp = sess.query(body=body) def response_to_string(response: VespaResponse): """ Convert a Vespa response to a string of the returned tokens. """ children = response.json.get("root", {}).get("children", []) tokens = "" for child in children: if child.get("id") == "event_stream": for stream_child in child.get("children", []): tokens += stream_child.get("fields", {}).get("token", "") return tokens tokens = response_to_string(resp) json.loads(tokens) Out\[62\]: ``` {'answer': "SynapseFlow's strategy focuses on simplifying the deployment, management, and scaling of machine learning models for developers and small teams. The key components of their strategy include:\n\n1. **Target Audience**: They target individual developers, startups, and SMEs with a particular emphasis on those new to MLOps, allowing them to leverage AI deployment without needing deep Ops knowledge.\n\n2. **Customer Pain Points**: SynapseFlow aims to address common challenges such as complex deployment processes, reliance on DevOps teams for model deployment, and slow, bureaucratic workflows. They provide a solution that minimizes infrastructure overhead and streamlines the journey from model experimentation to production.\n\n3. **Developer-First Approach**: Offering a developer-first API and intuitive UI, they ensure that users can deploy models quickly, focusing on easing the operational burden of MLOps.\n\n4. 
**Marketing and Outreach**: Their go-to-market strategy includes content marketing to educate potential users, leveraging developer communities, and building relationships through the YC network. They're also focused on SEO for high visibility within relevant search terms.\n\n5. **Feature Differentiators**: The platform differentiates itself through ease of deployment, a simple user interface, and a transparent pricing model tailored for startups and small businesses, making it more accessible than traditional MLOps solutions like SageMaker or Vertex AI.\n\n6. **Feedback and Iteration**: SynapseFlow is committed to continuous improvement based on user feedback, refining their offerings, and iteratively enhancing their product based on real-world user experiences and needs. \n\n7. **Future Growth**: Plans for future growth include targeting additional user segments and functionalities, such as integrating advanced monitoring solutions and data drift detection.\n\nOverall, SynapseFlow's strategy is to be the go-to platform for AI deployment, with a focus on simplifying processes for those who may not have extensive technical resources, thereby enabling more teams to harness the power of AI effectively.", 'citations': ['1', '4', '5', '8', '9']} ``` ## Summary[¶](#summary) In this tutorial, we have built a complete RAG application using Vespa, providing our recommendations for how to approach both retrieval phase with binary vectors and text matching, first-phase ranking with a linear combination of relatively cheap features to a more sophisticated second-phase ranking system with more expensive features and a GBDT model. We hope that this tutorial, along with the provided code in our [sample-apps repository](https://github.com/vespa-engine/sample-apps/tree/master/rag-blueprint), will serve as a useful reference for building your own RAG applications, with an evaluation-driven approach. 
By using the principles demonstrated in this tutorial, you are empowered to build high-quality RAG applications that can scale to any dataset size and query load. ## FAQ[¶](#faq) - **Q: Which embedding models can I use with Vespa?** A: Vespa supports a variety of embedding models. For a list of Vespa-provided models on Vespa Cloud, see [Model hub](https://docs.vespa.ai/en/cloud/model-hub.html). See also the [embedding reference](https://docs.vespa.ai/en/embedding.html#provided-embedders) for how to use embedders. You can also use private models (gated by authentication with a Bearer token from the Vespa Cloud secret store). - **Q: Why don't you use ColBERT for ranking?** A: We love ColBERT, and it has shown great performance. We do support ColBERT-style models in Vespa. The challenge is the added memory cost, especially for large-scale applications. If you use it, we recommend binarizing the vectors, which reduces memory usage by 32x compared to float vectors. If you want to improve ranking quality and can accept the additional cost, we encourage you to evaluate it. Here are some resources if you want to learn more about using ColBERT with Vespa: - [Announcing ColBERT embedder](https://blog.vespa.ai/announcing-colbert-embedder-in-vespa/#what-is-colbert?)
- [Long context ColBERT](https://blog.vespa.ai/announcing-long-context-colbert-in-vespa/) - [Long context ColBERT sample app](https://github.com/vespa-engine/sample-apps/tree/master/colbert-long/#vespa-sample-applications---long-context-colbert) - [ColBERT sample app](https://github.com/vespa-engine/sample-apps/tree/master/colbert) - [ColBERT embedder reference](https://docs.vespa.ai/en/embedding.html#colbert-embedder) - [ColBERT standalone python example notebook](https://vespa-engine.github.io/pyvespa/examples/colbert_standalone_Vespa-cloud.md) - [ColBERT standalone long context example notebook](https://vespa-engine.github.io/pyvespa/examples/colbert_standalone_long_context_Vespa-cloud.md) - **Q: Do I need to use an LLM with Vespa?** A: No, you are free to use Vespa as a search engine. We provide the option of calling out to LLMs from within a Vespa application, which reduces latency compared to sending large search result sets several times over the network, as well as the option to deploy local LLMs, optionally in your own infrastructure if you prefer. See [Vespa Cloud Enclave](https://docs.vespa.ai/en/cloud/enclave/enclave.html). - **Q: Why do we use binary vectors for the document embeddings?** A: Binary vectors take up far less memory and are faster to compute distances on, with only a slight reduction in quality. See this blog [post](https://blog.vespa.ai/combining-matryoshka-with-binary-quantization-using-embedder/) for details. - **Q: How can you say that Vespa can scale to any data and query load?** A: Vespa can scale both the stateless container nodes and the content nodes of your application. See [overview](https://docs.vespa.ai/en/overview.html) and [elasticity](https://docs.vespa.ai/en/elasticity.html) for details. ## Clean up[¶](#clean-up) As this tutorial is running in a CI environment, we will clean up the resources created. In \[63\]: Copied!
``` if os.getenv("CI", "false") == "true": vespa_cloud.delete() ``` if os.getenv("CI", "false") == "true": vespa_cloud.delete() # Building cost-efficient retrieval-augmented personal AI assistants[¶](#building-cost-efficient-retrieval-augmented-personal-ai-assistants) This notebook demonstrates how to use [Vespa streaming mode](https://docs.vespa.ai/en/streaming-search.html) for cost-efficient retrieval for applications that store and retrieve personal data. You can read more about Vespa vector streaming search in these two blog posts: - [Announcing vector streaming search: AI assistants at scale without breaking the bank](https://blog.vespa.ai/announcing-vector-streaming-search/) - [Yahoo Mail turns to Vespa to do RAG at scale](https://blog.vespa.ai/yahoo-mail-turns-to-vespa-to-do-rag-at-scale/) ## A summary of Vespa streaming mode[¶](#a-summary-of-vespa-streaming-mode) Vespa’s streaming search solution lets you make the user id a part of the document ID so that Vespa can use it to co-locate the data of each user on a small set of nodes and the same chunk of disk. This allows you to do searches over a user’s data with low latency without keeping any user’s data in memory or paying the cost of managing indexes. - There is no accuracy drop for vector search as it uses exact vector search - Several orders of magnitude higher throughput (No expensive index builds to support approximate search) - Documents (including vector data) are disk-based. - Ultra-low memory requirements (fixed per document) This notebook connects a custom [LlamaIndex](https://docs.llamaindex.ai/) [Retriever](https://docs.llamaindex.ai/) with a [Vespa](https://vespa.ai/) app using streaming mode to retrieve personal data. The focus is on how to use the streaming mode feature. First, install dependencies: In \[ \]: Copied! 
``` !pip3 install -U pyvespa llama-index vespacli ``` !pip3 install -U pyvespa llama-index vespacli ## Synthetic Mail & Calendar Data[¶](#synthetic-mail-calendar-data) There are few public email datasets because people care about their privacy, so this notebook uses synthetic data to examine how to use Vespa streaming mode. We create two generator functions that return Python `dict`s with synthetic mail and calendar data. Notice that each dict has three keys: - `id` - `groupname` - `fields` This is the feed format expected by [PyVespa](https://vespa-engine.github.io/pyvespa/reads-writes.md) feed operations; PyVespa uses these keys to build Vespa [document v1 API](https://docs.vespa.ai/en/document-v1-api-guide.html) requests. The `groupname` key is only used in streaming mode. In \[2\]: Copied! ``` from typing import List def synthetic_mail_data_generator() -> List[dict]: synthetic_mails = [ { "id": 1, "groupname": "bergum@vespa.ai", "fields": { "subject": "LlamaIndex news, 2023-11-14", "to": "bergum@vespa.ai", "body": """Hello Llama Friends 🦙 LlamaIndex is 1 year old this week! 🎉 To celebrate, we're taking a stroll down memory lane on our blog with twelve milestones from our first year. Be sure to check it out.""", "from": "news@llamaindex.ai", "display_date": "2023-11-15T09:00:00Z", }, }, { "id": 2, "groupname": "bergum@vespa.ai", "fields": { "subject": "Dentist Appointment Reminder", "to": "bergum@vespa.ai", "body": "Dear Jo Kristian ,\nThis is a reminder for your upcoming dentist appointment on 2023-12-04 at 09:30. Please arrive 15 minutes early.\nBest regards,\nDr. Dentist", "from": "dentist@dentist.no", "display_date": "2023-11-15T15:30:00Z", }, }, { "id": 1, "groupname": "giraffe@wildlife.ai", "fields": { "subject": "Wildlife Update: Giraffe Edition", "to": "giraffe@wildlife.ai", "body": "Dear Wildlife Enthusiasts 🦒, We're thrilled to share the latest insights into giraffe behavior in the wild.
Join us on an adventure as we explore their natural habitat and learn more about these majestic creatures.", "from": "updates@wildlife.ai", "display_date": "2023-11-12T14:30:00Z", }, }, { "id": 1, "groupname": "penguin@antarctica.ai", "fields": { "subject": "Antarctica Expedition: Penguin Chronicles", "to": "penguin@antarctica.ai", "body": "Greetings Explorers 🐧, Our team is embarking on an exciting expedition to Antarctica to study penguin colonies. Stay tuned for live updates and behind-the-scenes footage as we dive into the world of these fascinating birds.", "from": "expedition@antarctica.ai", "display_date": "2023-11-11T11:45:00Z", }, }, { "id": 1, "groupname": "space@exploration.ai", "fields": { "subject": "Space Exploration News: November Edition", "to": "space@exploration.ai", "body": "Hello Space Enthusiasts 🚀, Join us as we highlight the latest discoveries and breakthroughs in space exploration. From distant galaxies to new technologies, there's a lot to explore!", "from": "news@exploration.ai", "display_date": "2023-11-01T16:20:00Z", }, }, { "id": 1, "groupname": "ocean@discovery.ai", "fields": { "subject": "Ocean Discovery: Hidden Treasures Unveiled", "to": "ocean@discovery.ai", "body": "Dear Ocean Explorers 🌊, Dive deep into the secrets of the ocean with our latest discoveries. From undiscovered species to underwater landscapes, our team is uncovering the wonders of the deep blue.", "from": "discovery@ocean.ai", "display_date": "2023-10-01T10:15:00Z", }, }, ] for mail in synthetic_mails: yield mail ``` In \[3\]: Copied! ``` from typing import List def synthetic_calendar_data_generator() -> List[dict]: calendar_data = [ { "id": 1, "groupname": "bergum@vespa.ai", "fields": { "subject": "Dentist Appointment", "to": "bergum@vespa.ai", "body": "Dentist appointment at 2023-12-04 at 09:30 - 1 hour duration", "from": "dentist@dentist.no", "display_date": "2023-11-15T15:30:00Z", "duration": 60, }, }, { "id": 2, "groupname": "bergum@vespa.ai", "fields": { "subject": "Public Cloud Platform Events", "to": "bergum@vespa.ai", "body": "The cloud team continues to push new features and improvements to the platform.
Join us for a live demo of the latest updates", "from": "public-cloud-platform-events", "display_date": "2023-11-21T09:30:00Z", "duration": 60, }, }, ] for event in calendar_data: yield event ``` ## Defining a Vespa application[¶](#defining-a-vespa-application) [PyVespa](https://vespa-engine.github.io/pyvespa/) helps us build the [Vespa application package](https://docs.vespa.ai/en/application-packages.html). A Vespa application package consists of configuration files. First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html). [PyVespa](https://vespa-engine.github.io/pyvespa/) offers a programmatic API for creating the schema; in the end, it is serialized to a schema file (`.sd`) before it can be deployed to Vespa. Vespa is statically typed, so we need to define the fields and their types in the schema before we can start feeding documents. Note that we set `mode` to `streaming`, which enables [Vespa streaming mode for this schema](https://docs.vespa.ai/en/streaming-search.html). Other valid modes are `indexed` and `store-only`. In \[4\]: Copied!
``` from vespa.package import Schema, Document, Field, FieldSet, HNSW mail_schema = Schema( name="mail", mode="streaming", document=Document( fields=[ Field(name="id", type="string", indexing=["summary", "index"]), Field(name="subject", type="string", indexing=["summary", "index"]), Field(name="to", type="string", indexing=["summary", "index"]), Field(name="from", type="string", indexing=["summary", "index"]), Field(name="body", type="string", indexing=["summary", "index"]), Field(name="display_date", type="string", indexing=["summary"]), Field( name="timestamp", type="long", indexing=[ "input display_date", "to_epoch_second", "summary", "attribute", ], is_document_field=False, ), Field( name="embedding", type="tensor(x[384])", indexing=[ 'input subject ." ". input body', "embed e5", "attribute", "index", ], ann=HNSW(distance_metric="angular"), is_document_field=False, ), ], ), fieldsets=[FieldSet(name="default", fields=["subject", "body", "to", "from"])], ) ``` In the `mail` schema, we have six document fields; these are provided by us when we feed documents of type `mail` to this app. The [fieldset](https://docs.vespa.ai/en/schemas.html#fieldset) defines which fields are matched when a query does not mention explicit field names. We can add as many fieldsets as we like without duplicating content. In addition to the fields within the `document`, there are two synthetic fields in the schema, `timestamp` and `embedding`, using Vespa [indexing expressions](https://docs.vespa.ai/en/reference/indexing-language-reference.html) that take inputs from the document and perform conversions. - The `timestamp` field takes the input `display_date` and uses the [to_epoch_second converter](https://docs.vespa.ai/en/reference/indexing-language-reference.html#converters) to convert the display date into an epoch timestamp. This is useful because we can then calculate the document's age and use the `freshness(timestamp)` rank feature during ranking phases. - The `embedding` tensor field takes the subject and body as input and feeds that into an [embed](https://docs.vespa.ai/en/embedding.html#embedding-a-document-field) function that uses an embedding model to map the string input into a 384-dimensional embedding vector. Vectors in Vespa are represented as [Tensors](https://docs.vespa.ai/en/tensor-user-guide.html). In \[5\]: Copied!
``` from vespa.package import Schema, Document, Field, FieldSet, HNSW calendar_schema = Schema( name="calendar", inherits="mail", mode="streaming", document=Document( inherits="mail", fields=[ Field(name="duration", type="int", indexing=["summary", "index"]), Field(name="guests", type="array<string>", indexing=["summary", "index"]), Field(name="location", type="string", indexing=["summary", "index"]), Field(name="url", type="string", indexing=["summary", "index"]), Field(name="address", type="string", indexing=["summary", "index"]), ], ), ) ``` The observant reader might have noticed the `e5` argument to the `embed` expression in the `embedding` field above. The `e5` argument references a component of the type [hugging-face-embedder](https://docs.vespa.ai/en/embedding.html#huggingface-embedder). We configure the application package with its name, the two schemas, and the `e5` embedder component. In \[6\]: Copied!
``` from vespa.package import ApplicationPackage, Component, Parameter vespa_app_name = "assistant" vespa_application_package = ApplicationPackage( name=vespa_app_name, schema=[mail_schema, calendar_schema], components=[ Component( id="e5", type="hugging-face-embedder", parameters=[ Parameter( name="transformer-model", args={ "url": "https://github.com/vespa-engine/sample-apps/raw/master/examples/model-exporting/model/e5-small-v2-int8.onnx" }, ), Parameter( name="tokenizer-model", args={ "url": "https://raw.githubusercontent.com/vespa-engine/sample-apps/master/examples/model-exporting/model/tokenizer.json" }, ), Parameter( name="prepend", args={}, children=[ Parameter(name="query", args={}, children="query: "), Parameter(name="document", args={}, children="passage: "), ], ), ], ) ], ) ``` In the last step, we configure [ranking](https://docs.vespa.ai/en/ranking.html) by adding rank profiles to the schemas. Vespa supports [phased ranking](https://docs.vespa.ai/en/phased-ranking.html) and has a rich set of built-in [rank-features](https://docs.vespa.ai/en/reference/rank-features.html).
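The idea behind phased ranking can be illustrated with a minimal, non-Vespa Python sketch (all document names and scores below are made up for illustration): a cheap first-phase score is computed for every matched document, and only the top candidates are re-scored with a more expensive expression.

```python
# Conceptual sketch of phased ranking (illustrative only, not Vespa code).

def first_phase(doc):
    # Cheap score, computed for every matched document.
    return doc["keyword_score"]

def global_phase(doc):
    # Expensive score, computed only for the top rerank-count hits.
    return doc["keyword_score"] + 2.0 * doc["semantic_score"]

docs = [
    {"id": 1, "keyword_score": 0.9, "semantic_score": 0.2},
    {"id": 2, "keyword_score": 0.5, "semantic_score": 0.95},
    {"id": 3, "keyword_score": 0.1, "semantic_score": 0.1},
]

# Phase 1: score everything cheaply, keep the top 2 (a "rerank_count" of 2).
top = sorted(docs, key=first_phase, reverse=True)[:2]
# Phase 2: rerank only those survivors with the expensive function.
reranked = sorted(top, key=global_phase, reverse=True)
print([d["id"] for d in reranked])  # → [2, 1]: doc 2 overtakes doc 1 after reranking
```

Vespa evaluates the real phases server-side and content-node-local, so the expensive expression only ever runs on a bounded number of hits.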
Users can also define custom functions with [ranking expressions](https://docs.vespa.ai/en/reference/ranking-expressions.html). In \[7\]: Copied! ``` from vespa.package import RankProfile, Function, GlobalPhaseRanking, FirstPhaseRanking keywords_and_freshness = RankProfile( name="default", functions=[ Function( name="my_function", expression="nativeRank(subject) + nativeRank(body) + freshness(timestamp)", ) ], first_phase=FirstPhaseRanking(expression="my_function", rank_score_drop_limit=0.02), match_features=[ "nativeRank(subject)", "nativeRank(body)", "my_function", "freshness(timestamp)", ], ) semantic = RankProfile( name="semantic", functions=[ Function(name="cosine", expression="max(0,cos(distance(field, embedding)))") ], inputs=[("query(q)", "tensor(x[384])"), ("query(threshold)", "", "0.75")], first_phase=FirstPhaseRanking( expression="if(cosine > query(threshold), cosine, -1)", rank_score_drop_limit=0.1, ), match_features=[ "cosine", "freshness(timestamp)", "distance(field, embedding)", "query(threshold)", ], ) fusion = RankProfile( name="fusion", inherits="semantic", functions=[ Function( name="keywords_and_freshness", expression="nativeRank(subject) + nativeRank(body) + freshness(timestamp)", ), Function(name="semantic", expression="cos(distance(field,embedding))"), ], inputs=[("query(q)", "tensor(x[384])"), ("query(threshold)", "", "0.75")], first_phase=FirstPhaseRanking( expression="if(cosine > query(threshold), cosine, -1)", rank_score_drop_limit=0.1, ), match_features=[ "nativeRank(subject)", "keywords_and_freshness", "freshness(timestamp)", "cosine", "query(threshold)", ], global_phase=GlobalPhaseRanking( rerank_count=1000, expression="reciprocal_rank_fusion(semantic, keywords_and_freshness)", ), ) ``` The `default` rank profile defines a custom function `my_function` that computes a linear combination of three different features: - `nativeRank(subject)`: a text matching feature, scoped to the `subject` field. - `nativeRank(body)`: the same, but scoped to the `body` field. - `freshness(timestamp)`: a built-in [rank-feature](https://docs.vespa.ai/en/reference/rank-features.html#freshness) that returns a number close to 1 if the timestamp is recent compared to the current query time. In \[8\]: Copied!
``` mail_schema.add_rank_profile(keywords_and_freshness) mail_schema.add_rank_profile(semantic) mail_schema.add_rank_profile(fusion) calendar_schema.add_rank_profile(keywords_and_freshness) calendar_schema.add_rank_profile(semantic) calendar_schema.add_rank_profile(fusion) ``` Finally, we have our basic Vespa schemas and application package. We can serialize the representation to application package files, which is handy when moving towards production deployments and when managing the application with version control. In \[9\]: Copied! ``` import os application_directory = "my-assistant-vespa-app" vespa_application_package.to_files(application_directory) def print_files_in_directory(directory): for root, _, files in os.walk(directory): for file in files: print(os.path.join(root, file)) print_files_in_directory(application_directory) ``` ``` my-assistant-vespa-app/services.xml my-assistant-vespa-app/schemas/mail.sd my-assistant-vespa-app/schemas/calendar.sd my-assistant-vespa-app/search/query-profiles/default.xml my-assistant-vespa-app/search/query-profiles/types/root.xml ``` ## Deploy the application to Vespa Cloud[¶](#deploy-the-application-to-vespa-cloud) With the configured application, we can deploy it to [Vespa Cloud](https://cloud.vespa.ai/en/).
To deploy the application to Vespa Cloud, we need a tenant: create one at [console.vespa-cloud.com](https://console.vespa-cloud.com/) (unless you already have one). This step requires a Google or GitHub account, and will start your [free trial](https://cloud.vespa.ai/en/free-trial). Make a note of the tenant name; it is used in the next steps. > Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days. In \[15\]: Copied! ``` from vespa.deployment import VespaCloud # Replace with your tenant name from the Vespa Cloud Console tenant_name = "vespa-team" # Key is only used for CI/CD. Can be removed if logging in interactively key = os.getenv("VESPA_TEAM_API_KEY", None) if key is not None: key = key.replace(r"\n", "\n") # To parse key correctly vespa_cloud = VespaCloud( tenant=tenant_name, application=vespa_app_name, key_content=key, # Key is only used for CI/CD. Can be removed if logging in interactively application_package=vespa_application_package, ) ``` Now deploy the app to the Vespa Cloud dev zone. The first deployment typically takes 2 minutes until the endpoint is up. In \[ \]: Copied!
``` from vespa.application import Vespa app: Vespa = vespa_cloud.deploy(disk_folder=application_directory) ``` ## Feeding data to Vespa[¶](#feeding-data-to-vespa) With the app up and running in Vespa Cloud, we can start feeding and querying our data. We use the [feed_iterable](https://vespa-engine.github.io/pyvespa/api/vespa/application.md#vespa.application.Vespa.feed_iterable) API of pyvespa with a custom `callback` that prints the URL and an error if the operation fails. We pass the `synthetic_*generator()` and call `feed_iterable` with the specific `schema` and `namespace`. Read more about [Vespa document IDs](https://docs.vespa.ai/en/documents.html#id-scheme). In \[ \]: Copied! ``` from vespa.io import VespaResponse def callback(response: VespaResponse, id: str): if not response.is_successful(): print(f"Error {response.url} : {response.get_json()}") else: print(f"Success {response.url}") app.feed_iterable( synthetic_mail_data_generator(), schema="mail", namespace="assistant", callback=callback, ) app.feed_iterable( synthetic_calendar_data_generator(), schema="calendar", namespace="assistant", callback=callback, ) ``` ### Querying data[¶](#querying-data) Now, we can also query our data. With [streaming mode](https://docs.vespa.ai/en/reference/query-api-reference.html#streaming), we must pass the `groupname` parameter, or the request will fail with an error.
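The group assignment is visible in the document ids: with streaming mode, each fed document's id encodes the group it belongs to, following the pattern `id:<namespace>:<schema>:g=<groupname>:<local-id>`, and the `groupname` we pass at query time selects that group. A small illustrative helper (the `doc_id` function is ours, not part of pyvespa):

```python
# Streaming-mode document id pattern: id:<namespace>:<schema>:g=<groupname>:<local-id>
def doc_id(namespace: str, schema: str, groupname: str, local_id) -> str:
    return f"id:{namespace}:{schema}:g={groupname}:{local_id}"

# The dentist reminder fed above (namespace "assistant", schema "mail", id 2):
print(doc_id("assistant", "mail", "bergum@vespa.ai", 2))
# → id:assistant:mail:g=bergum@vespa.ai:2
```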
The query request uses the Vespa Query API, and the `Vespa.query()` function supports passing any of the Vespa query API parameters. Read more about querying Vespa in: - [Vespa Query API](https://docs.vespa.ai/en/query-api.html) - [Vespa Query API reference](https://docs.vespa.ai/en/reference/query-api-reference.html) - [Vespa Query Language API (YQL)](https://docs.vespa.ai/en/query-language.html) Sample query request for `when is my dentist appointment` for the user `bergum@vespa.ai`: In \[18\]: Copied! ``` from vespa.io import VespaQueryResponse import json response: VespaQueryResponse = app.query( yql="select subject, display_date, to from sources mail where userQuery()", query="when is my dentist appointment", groupname="bergum@vespa.ai", ranking="default", timeout="2s", ) assert response.is_successful() print(json.dumps(response.hits[0], indent=2)) ``` ``` { "id": "id:assistant:mail:g=bergum@vespa.ai:2", "relevance": 1.134783932836458, "source": "assistant_content.mail", "fields": { "matchfeatures": { "freshness(timestamp)": 0.9232458847736625, "nativeRank(body)": 0.09246780326887034, "nativeRank(subject)": 0.11907024479392506, "my_function": 1.134783932836458 }, "subject": "Dentist Appointment Reminder", "to": "bergum@vespa.ai", "display_date": "2023-11-15T15:30:00Z" } } ``` For the above query request, Vespa searched the `default` fieldset, which we defined in the schema to match against several fields, including the body and the subject.
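As a quick sanity check, the `matchfeatures` values in the response above reproduce the hit's `relevance` exactly, since `my_function` is just their sum:

```python
# my_function = nativeRank(subject) + nativeRank(body) + freshness(timestamp),
# using the matchfeatures values returned for the top hit above.
match_features = {
    "freshness(timestamp)": 0.9232458847736625,
    "nativeRank(body)": 0.09246780326887034,
    "nativeRank(subject)": 0.11907024479392506,
}
score = sum(match_features.values())
assert abs(score - 1.134783932836458) < 1e-9  # equals the hit's "relevance"
```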
The `default` rank-profile calculated the relevance score as the sum of three rank-features, `nativeRank(body) + nativeRank(subject) + freshness(timestamp)`, and the result of this computation is the `relevance` score of the hit. In addition, we asked Vespa to return `matchfeatures`, which are handy for debugging the final `relevance` score or for feature logging. Now, we can try the `semantic` rank-profile, using Vespa's support for nearestNeighbor search. This also exemplifies using the configured `e5` embedder to embed the user query into an embedding representation. See [embedding a query text](https://docs.vespa.ai/en/embedding.html#embedding-a-query-text) for more usage examples of using Vespa embedders. In \[19\]: Copied! ``` from vespa.io import VespaQueryResponse import json response: VespaQueryResponse = app.query( yql="select subject, display_date from mail where {targetHits:10}nearestNeighbor(embedding,q)", groupname="bergum@vespa.ai", ranking="semantic", body={ "input.query(q)": 'embed(e5, "when is my dentist appointment")', }, timeout="2s", ) assert response.is_successful() print(json.dumps(response.hits[0], indent=2)) ``` ``` { "id": "id:assistant:mail:g=bergum@vespa.ai:2", "relevance": 0.9079386507883569, "source": "assistant_content.mail", "fields": { "matchfeatures": { "distance(field,embedding)": 0.4324572498488368, "freshness(timestamp)": 0.9232457561728395, "query(threshold)": 0.75, "cosine": 0.9079386507883569 }, "subject": "Dentist Appointment Reminder", "display_date": "2023-11-15T15:30:00Z" } } ``` ## LlamaIndex Retrievers
Introduction[¶](#llamaindex-retrievers-introduction) Now we have a basic Vespa app using streaming mode. To build an end-to-end assistant, we likely want to use an LLM framework like [LangChain](https://www.langchain.com/) or [LlamaIndex](https://www.llamaindex.ai/). In this example notebook, we use LlamaIndex retrievers. The LlamaIndex [retriever](https://docs.llamaindex.ai/) abstraction allows developers to add custom retrievers that retrieve information in Retrieval-Augmented Generation (RAG) pipelines. For an excellent introduction to LlamaIndex and its concepts, see [LlamaIndex High-Level Concepts](https://docs.llamaindex.ai/). To create a custom LlamaIndex retriever, we implement a class that inherits from `llama_index.retrievers.BaseRetriever.BaseRetriever` and implements `_retrieve(query)`. A simple `PersonalAssistantVespaRetriever` could look like the following: In \[ \]: Copied! ``` from llama_index.legacy.core.base_retriever import BaseRetriever from llama_index.legacy.schema import NodeWithScore, QueryBundle, TextNode from llama_index.legacy.callbacks.base import CallbackManager from vespa.application import Vespa from vespa.io import VespaQueryResponse from typing import List, Union, Optional class PersonalAssistantVespaRetriever(BaseRetriever): def __init__( self, app: Vespa, user: str, hits: int = 5, vespa_rank_profile: str = "default", vespa_score_cutoff: float = 0.70, sources: List[str] = ["mail"], fields: List[str] = ["subject", "body"], callback_manager: Optional[CallbackManager] = None, ) -> None: """Sample Retriever for a personal assistant application. Args: param: app: Vespa application object param: user: user id to retrieve documents for (used for Vespa streaming groupname) param: hits: number of hits to retrieve from Vespa app param: vespa_rank_profile: Vespa rank profile to use param: vespa_score_cutoff: Vespa score cutoff to use during first-phase ranking param: sources: sources to retrieve documents from param: fields: fields to retrieve """ self.app = app self.hits = hits self.user = user self.vespa_rank_profile = vespa_rank_profile self.vespa_score_cutoff = vespa_score_cutoff self.fields = fields self.summary_fields = ",".join(fields) self.sources = ",".join(sources) super().__init__(callback_manager) def _retrieve(self, query: Union[str, QueryBundle]) -> List[NodeWithScore]: """Retrieve documents from Vespa application.""" if isinstance(query, QueryBundle): query = query.query_str if self.vespa_rank_profile == "default": yql: str = f"select {self.summary_fields} from mail where userQuery()" else: yql = f"select {self.summary_fields} from sources {self.sources} where {{targetHits:10}}nearestNeighbor(embedding,q) or userQuery()" vespa_body_request = { "yql": yql, "query": query, "hits": self.hits, "ranking.profile": self.vespa_rank_profile, "timeout": "2s", "input.query(threshold)": self.vespa_score_cutoff, } if self.vespa_rank_profile != "default": vespa_body_request["input.query(q)"] = f'embed(e5, "{query}")' with self.app.syncio(connections=1) as session: response: VespaQueryResponse = session.query( body=vespa_body_request, groupname=self.user ) if not response.is_successful(): raise ValueError( f"Query request failed: {response.status_code}, response payload: {response.get_json()}" ) nodes: List[NodeWithScore] = [] for hit in response.hits: response_fields: dict = hit.get("fields", {}) text: str = "" for field in response_fields.keys(): if isinstance(response_fields[field], str) and field in self.fields: text += response_fields[field] + " " id = hit["id"] doc = TextNode( id_=id, text=text, metadata=response_fields, ) nodes.append(NodeWithScore(node=doc, score=hit["relevance"])) return nodes ``` The above defines a `PersonalAssistantVespaRetriever`, which most importantly accepts a [pyvespa](https://vespa-engine.github.io/pyvespa/) `Vespa` application instance. The YQL specifies a hybrid retrieval query that combines embedding-based retrieval (vector search), using Vespa's nearest neighbor search operator, with traditional keyword matching. With the above, we can connect to the running Vespa app and initialize the `PersonalAssistantVespaRetriever` for the user `bergum@vespa.ai`. The `user` argument maps to the [streaming search groupname parameter](https://docs.vespa.ai/en/reference/query-api-reference.html#streaming.groupname). In \[21\]: Copied!
``` retriever = PersonalAssistantVespaRetriever( app=app, user="bergum@vespa.ai", vespa_rank_profile="default" ) retriever.retrieve("When is my dentist appointment?") ``` retriever = PersonalAssistantVespaRetriever( app=app, user="bergum@vespa.ai", vespa_rank_profile="default" ) retriever.retrieve("When is my dentist appointment?") Out\[21\]: ``` [NodeWithScore(node=TextNode(id_='id:assistant:mail:g=bergum@vespa.ai:2', embedding=None, metadata={'matchfeatures': {'freshness(timestamp)': 0.9232454989711935, 'nativeRank(body)': 0.09246780326887034, 'nativeRank(subject)': 0.11907024479392506, 'my_function': 1.1347835470339889}, 'subject': 'Dentist Appointment Reminder', 'body': 'Dear Jo Kristian ,\nThis is a reminder for your upcoming dentist appointment on 2023-12-04 at 09:30. Please arrive 15 minutes early.\nBest regards,\nDr. Dentist'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='269fe208f8d43a967dc683e1c9b832b18ddfb0b2efd801ab7e428620c8163021', text='Dentist Appointment Reminder Dear Jo Kristian ,\nThis is a reminder for your upcoming dentist appointment on 2023-12-04 at 09:30. Please arrive 15 minutes early.\nBest regards,\nDr. Dentist ', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=1.1347835470339889), NodeWithScore(node=TextNode(id_='id:assistant:mail:g=bergum@vespa.ai:1', embedding=None, metadata={'matchfeatures': {'freshness(timestamp)': 0.9202362397119341, 'nativeRank(body)': 0.02919821398130037, 'nativeRank(subject)': 1.3512214436142505e-38, 'my_function': 0.9494344536932345}, 'subject': 'LlamaIndex news, 2023-11-14', 'body': "Hello Llama Friends 🦙 LlamaIndex is 1 year old this week! 🎉 To celebrate, we're taking a stroll down memory \n lane on our blog with twelve milestones from our first year. 
Be sure to check it out."}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='5e975eaece761d46956c9d301138f29b5c067d3da32fd013bb79c6ee9c033d3d', text="LlamaIndex news, 2023-11-14 Hello Llama Friends 🦙 LlamaIndex is 1 year old this week! 🎉 To celebrate, we're taking a stroll down memory \n lane on our blog with twelve milestones from our first year. Be sure to check it out. ", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.9494344536932345)] ``` These `NodeWithScore` objects, retrieved with the `default` rank-profile, can then be used for the next steps in a generative chain. We can also try the `semantic` rank-profile, which has rank-score-drop functionality, allowing us to set a score threshold per query. Raising the threshold removes low-scoring context. In \[22\]: Copied! ``` retriever = PersonalAssistantVespaRetriever( app=app, user="bergum@vespa.ai", vespa_rank_profile="semantic", vespa_score_cutoff=0.6, hits=20, ) retriever.retrieve("When is my dentist appointment?") ``` Out\[22\]: ``` [NodeWithScore(node=TextNode(id_='id:assistant:mail:g=bergum@vespa.ai:2', embedding=None, metadata={'matchfeatures': {'distance(field,embedding)': 0.43945494361938975, 'freshness(timestamp)': 0.9232453703703704, 'query(threshold)': 0.6, 'cosine': 0.9049836898369259}, 'subject': 'Dentist Appointment Reminder', 'body': 'Dear Jo Kristian ,\nThis is a reminder for your upcoming dentist appointment on 2023-12-04 at 09:30. Please arrive 15 minutes early.\nBest regards,\nDr.
Dentist'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='e89f669e6c9cf64ab6a856d9857915481396e2aa84154951327cd889c23f7c4f', text='Dentist Appointment Reminder Dear Jo Kristian ,\nThis is a reminder for your upcoming dentist appointment on 2023-12-04 at 09:30. Please arrive 15 minutes early.\nBest regards,\nDr. Dentist ', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.9049836898369259), NodeWithScore(node=TextNode(id_='id:assistant:mail:g=bergum@vespa.ai:1', embedding=None, metadata={'matchfeatures': {'distance(field,embedding)': 0.69930099954744, 'freshness(timestamp)': 0.9202361111111111, 'query(threshold)': 0.6, 'cosine': 0.7652923088511814}, 'subject': 'LlamaIndex news, 2023-11-14', 'body': "Hello Llama Friends 🦙 LlamaIndex is 1 year old this week! 🎉 To celebrate, we're taking a stroll down memory \n lane on our blog with twelve milestones from our first year. Be sure to check it out."}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='cb9b588e5b53dbdd0fbe6f7aadfa689d84a5bea23239293bd299347ee9ecd853', text="LlamaIndex news, 2023-11-14 Hello Llama Friends 🦙 LlamaIndex is 1 year old this week! 🎉 To celebrate, we're taking a stroll down memory \n lane on our blog with twelve milestones from our first year. Be sure to check it out. ", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.7652923088511814)] ``` Create a new retriever with sources including both mail and calendar data: In \[23\]: Copied! 
``` retriever = PersonalAssistantVespaRetriever( app=app, user="bergum@vespa.ai", vespa_rank_profile="fusion", sources=["calendar", "mail"], vespa_score_cutoff=0.80, ) retriever.retrieve("When is my dentist appointment?") ``` retriever = PersonalAssistantVespaRetriever( app=app, user="bergum@vespa.ai", vespa_rank_profile="fusion", sources=["calendar", "mail"], vespa_score_cutoff=0.80, ) retriever.retrieve("When is my dentist appointment?") Out\[23\]: ``` [NodeWithScore(node=TextNode(id_='id:assistant:calendar:g=bergum@vespa.ai:1', embedding=None, metadata={'matchfeatures': {'freshness(timestamp)': 0.9232447273662552, 'nativeRank(subject)': 0.11907024479392506, 'query(threshold)': 0.8, 'cosine': 0.8872983644178517, 'keywords_and_freshness': 1.1606592237923947}, 'subject': 'Dentist Appointment', 'body': 'Dentist appointment at 2023-12-04 at 09:30 - 1 hour duration'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='b30948011cbe9bbf29135384efbc72f85a6eb65113be0eb9762315a022f11ba1', text='Dentist Appointment Dentist appointment at 2023-12-04 at 09:30 - 1 hour duration ', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.03278688524590164), NodeWithScore(node=TextNode(id_='id:assistant:mail:g=bergum@vespa.ai:2', embedding=None, metadata={'matchfeatures': {'freshness(timestamp)': 0.9232447273662552, 'nativeRank(subject)': 0.11907024479392506, 'query(threshold)': 0.8, 'cosine': 0.9049836898369259, 'keywords_and_freshness': 1.1347827754290507}, 'subject': 'Dentist Appointment Reminder', 'body': 'Dear Jo Kristian ,\nThis is a reminder for your upcoming dentist appointment on 2023-12-04 at 09:30. Please arrive 15 minutes early.\nBest regards,\nDr. 
Dentist'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='21c501ccdc6e4b33d388eefa244c5039a0e1ed4b81e4f038916765e22be24705', text='Dentist Appointment Reminder Dear Jo Kristian ,\nThis is a reminder for your upcoming dentist appointment on 2023-12-04 at 09:30. Please arrive 15 minutes early.\nBest regards,\nDr. Dentist ', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.03278688524590164)] ``` In \[24\]: Copied! ``` app.query( yql="select subject, display_date from calendar where duration > 0", ranking="default", groupname="bergum@vespa.ai", timeout="2s", ).json ``` app.query( yql="select subject, display_date from calendar where duration > 0", ranking="default", groupname="bergum@vespa.ai", timeout="2s", ).json Out\[24\]: ``` {'root': {'id': 'toplevel', 'relevance': 1.0, 'fields': {'totalCount': 2}, 'coverage': {'coverage': 100, 'documents': 2, 'full': True, 'nodes': 1, 'results': 1, 'resultsFull': 1}, 'children': [{'id': 'id:assistant:calendar:g=bergum@vespa.ai:2', 'relevance': 0.987133487654321, 'source': 'assistant_content.calendar', 'fields': {'matchfeatures': {'freshness(timestamp)': 0.987133487654321, 'nativeRank(body)': 0.0, 'nativeRank(subject)': 0.0, 'my_function': 0.987133487654321}, 'subject': 'Public Cloud Platform Events', 'display_date': '2023-11-21T09:30:00Z'}}, {'id': 'id:assistant:calendar:g=bergum@vespa.ai:1', 'relevance': 0.9232445987654321, 'source': 'assistant_content.calendar', 'fields': {'matchfeatures': {'freshness(timestamp)': 0.9232445987654321, 'nativeRank(body)': 0.0, 'nativeRank(subject)': 0.0, 'my_function': 0.9232445987654321}, 'subject': 'Dentist Appointment', 'display_date': '2023-11-15T15:30:00Z'}}]}} ``` ## Conclusion[¶](#conclusion) In this notebook, we have demonstrated: - Configuring and using Vespa's streaming mode - Using multiple document types and schema to organize our data - 
Running embedding inference in Vespa - Hybrid retrieval techniques - combined with score thresholding to filter irrelevant contexts - Creating a custom LlamaIndex retriever and connecting it with our Vespa app - Vespa Cloud deployments to sandbox/dev zone We can now delete the cloud instance: In \[ \]: Copied! ``` vespa_cloud.delete() ``` # Scaling ColPali (VLM) Retrieval[¶](#scaling-colpali-vlm-retrieval) This notebook demonstrates how to represent [ColPali](https://huggingface.co/vidore/colpali) in Vespa and how to scale to large collections. Also see the blog post: [Scaling ColPali to billions of PDFs with Vespa](https://blog.vespa.ai/scaling-colpali-to-billions/) Consider following the [ColQWen2](https://vespa-engine.github.io/pyvespa/examples/pdf-retrieval-with-ColQwen2-vlm_Vespa-cloud.md) notebook instead, as it uses a better model with improved accuracy and speed. ColPali is a powerful visual language model that can generate embeddings for images (screenshots of PDF pages) and text queries. In this notebook, we will use ColPali to generate embeddings for images of PDF *pages* and store the embeddings in Vespa. We will also store the base64-encoded image of the PDF page and metadata like title and URL. We demonstrate how to retrieve relevant pages for a query using the embeddings generated by ColPali. The TLDR of this notebook: - Generate an image per PDF page using [pdf2image](https://pypi.org/project/pdf2image/) and also extract the text using [pypdf](https://pypdf.readthedocs.io/en/stable/user/extract-text.html). - For each page image, use ColPali to obtain the visual multi-vector embeddings. - Store the visual embeddings in Vespa as an `int8` tensor, using a binary compression technique to reduce the storage footprint by 32x compared to float representations.
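The binarization idea can be sketched in a few lines of numpy. This is a sketch with a random stand-in for a single 128-dimensional ColPali patch embedding; the actual feed-preparation code appears later in this notebook:

```python
import numpy as np

# Random stand-in for one ColPali patch embedding (128 float32 dimensions).
rng = np.random.default_rng(0)
patch_embedding = rng.standard_normal(128).astype(np.float32)

# Binarize: keep only the sign of each dimension, then pack 128 bits
# into 16 bytes -> a 32x reduction versus float32 (512 bytes).
bits = np.where(patch_embedding > 0, 1, 0).astype(np.uint8)
packed = np.packbits(bits)  # 16 bytes

# Hamming distance between two packed vectors is cheap: XOR, then popcount.
other = np.packbits(np.where(rng.standard_normal(128) > 0, 1, 0).astype(np.uint8))
hamming = int(np.unpackbits(packed ^ other).sum())

print(patch_embedding.nbytes, "->", packed.nbytes, "bytes; hamming distance:", hamming)
```

The XOR-plus-popcount distance computation is why hamming-based retrieval over these compressed vectors is so much cheaper than float dot products.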
See [Scaling ColPali to billions of PDFs with Vespa](https://blog.vespa.ai/scaling-colpali-to-billions/) for details on binarization and using hamming distance for retrieval. At retrieval time, we use the same ColPali model to generate embeddings for the query, and then use Vespa's `nearestNeighbor` query operator to retrieve the most similar documents per query token vector, using the binary representation with hamming distance. Then we re-rank the results in two phases: - In the retrieval phase, we use hamming distance to retrieve the k closest pages per query token vector representation; this is expressed using multiple nearestNeighbor query operators in Vespa. - The nearestNeighbor operators expose pages to the first-phase ranking function, which uses an approximate MaxSim with inverted hamming distance instead of cosine similarity. This is done to reduce the number of pages that are re-ranked in the second phase. - In the second phase, we perform the full MaxSim operation, using float representations of the embeddings to re-rank the top-k pages from the first phase. This allows us to scale ColPali to very large collections of PDF pages, while still providing accurate and fast retrieval. Let us get started. Install dependencies: Note that the python pdf2image package requires poppler-utils; see other installation options [here](https://pdf2image.readthedocs.io/en/latest/installation.html#installing-poppler). In \[ \]: Copied! ``` !sudo apt-get update && sudo apt-get install poppler-utils -y ``` Now install the required python packages: In \[ \]: Copied! ``` !pip3 install transformers==4.51.3 accelerate vidore_benchmark==4.0.0 pdf2image pypdf==5.0.1 pyvespa vespacli requests numpy ``` In \[ \]: Copied!
``` import torch from torch.utils.data import DataLoader from tqdm import tqdm from io import BytesIO from transformers import ColPaliForRetrieval, ColPaliProcessor from vidore_benchmark.utils.image_utils import scale_image, get_base64_image ``` ### Load the model[¶](#load-the-model) This requires that the `HF_TOKEN` environment variable is set, as the underlying PaliGemma model is hosted on Hugging Face under a [restrictive license](https://ai.google.dev/gemma/terms) that requires authentication. Choose the right device to run the model. In \[ \]: Copied! ``` # Load model (bfloat16 support is limited; fallback to float32 if needed) device = "cuda" if torch.cuda.is_available() else "cpu" if torch.backends.mps.is_available(): device = "mps" # For Apple Silicon devices dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32 ``` Load the model and the processor. In \[ \]: Copied!
``` model_name = "vidore/colpali-v1.2-hf" model = ColPaliForRetrieval.from_pretrained( model_name, torch_dtype=dtype, device_map=device, # "cpu", "cuda", or "mps" for Apple Silicon ).eval() processor = ColPaliProcessor.from_pretrained(model_name) ``` ### Working with pdfs[¶](#working-with-pdfs) We need to convert a PDF to an array of images, one image per page. We use the `pdf2image` library for this task. We also extract the text contents of the PDF using `pypdf`. NOTE: This step requires that you have `poppler` installed on your system. Read more in the [pdf2image](https://pdf2image.readthedocs.io/en/latest/installation.html) docs. In \[ \]: Copied! ``` import requests from pdf2image import convert_from_path from pypdf import PdfReader def download_pdf(url): response = requests.get(url) if response.status_code == 200: return BytesIO(response.content) else: raise Exception(f"Failed to download PDF: Status code {response.status_code}") def get_pdf_images(pdf_url): # Download the PDF pdf_file = download_pdf(pdf_url) # Save the PDF temporarily to disk (pdf2image requires a file path) temp_file = "temp.pdf" with open(temp_file, "wb") as f: f.write(pdf_file.read()) reader = PdfReader(temp_file) page_texts = [] for page_number in range(len(reader.pages)): page = reader.pages[page_number] text = page.extract_text() page_texts.append(text) images = convert_from_path(temp_file) assert len(images) == len(page_texts) return (images, page_texts) ``` We define a few sample PDFs to work with. The PDFs are discovered from [this URL](https://www.conocophillips.com/company-reports-resources/sustainability-reporting/). In \[ \]: Copied! ``` sample_pdfs = [ { "title": "ConocoPhillips Sustainability Highlights - Nature (24-0976)", "url": "https://static.conocophillips.com/files/resources/24-0976-sustainability-highlights_nature.pdf", }, { "title": "ConocoPhillips Managing Climate Related Risks", "url": "https://static.conocophillips.com/files/resources/conocophillips-2023-managing-climate-related-risks.pdf", }, { "title": "ConocoPhillips 2023 Sustainability Report", "url": "https://static.conocophillips.com/files/resources/conocophillips-2023-sustainability-report.pdf", }, ] ``` Now we can convert the PDFs to images and also extract the text content. In \[ \]: Copied!
``` for pdf in sample_pdfs: page_images, page_texts = get_pdf_images(pdf["url"]) pdf["images"] = page_images pdf["texts"] = page_texts ``` Let us look at the extracted image of the first PDF page. This is the document side input to ColPali, one image per page. In \[ \]: Copied! ``` from IPython.display import display display(scale_image(sample_pdfs[0]["images"][0], 720)) ``` Let us also look at the extracted text content of the first PDF page. In \[ \]: Copied! ``` print(sample_pdfs[0]["texts"][0]) ``` Notice how the layout and order of the text are different from the image representation. Note that: - The headlines NATURE and Sustainability have been combined into one word (NATURESustainability). - The 0.03% has been converted to 0.03, and the order is not preserved in the text representation. - The data in the infographics is not represented in the text; for example, the source water distribution is missing. Now we use the ColPali model to generate embeddings of the images. In \[ \]: Copied!
``` for pdf in sample_pdfs: page_embeddings = [] dataloader = DataLoader( pdf["images"], batch_size=2, shuffle=False, collate_fn=lambda x: processor(images=x, return_tensors="pt"), ) for batch_doc in tqdm(dataloader): with torch.no_grad(): batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()} embeddings_doc = model(**batch_doc).embeddings page_embeddings.extend(list(torch.unbind(embeddings_doc.to("cpu")))) pdf["embeddings"] = page_embeddings ``` Now that we are done with the document-side embeddings, we convert them to Vespa JSON format so we can store (and index) them in Vespa. Details are in the [Vespa JSON feed format doc](https://docs.vespa.ai/en/reference/document-json-format.html). We use binary quantization (BQ) of the page-level ColPali vector embeddings to reduce their size by 32x. Read more about binarization of multi-vector representations in the [ColBERT blog post](https://blog.vespa.ai/announcing-colbert-embedder-in-vespa/). The binarization step maps 128-dimensional floats to 128 bits, or 16 bytes per vector, reducing the size by 32x. On the [DocVQA benchmark](https://huggingface.co/datasets/vidore/docvqa_test_subsampled), binarization results in a small drop in ranking accuracy. In \[ \]: Copied!
``` import numpy as np vespa_feed = [] for pdf in sample_pdfs: url = pdf["url"] title = pdf["title"] for page_number, (page_text, embedding, image) in enumerate( zip(pdf["texts"], pdf["embeddings"], pdf["images"]) ): base_64_image = get_base64_image(scale_image(image, 640), add_url_prefix=False) embedding_dict = dict() for idx, patch_embedding in enumerate(embedding): binary_vector = ( np.packbits(np.where(patch_embedding > 0, 1, 0)) .astype(np.int8) .tobytes() .hex() ) embedding_dict[idx] = binary_vector page = { "id": hash(url + str(page_number)), "fields": { "url": url, "title": title, "page_number": page_number, "image": base_64_image, "text": page_text, "embedding": embedding_dict, }, } vespa_feed.append(page) ``` ### Configure Vespa[¶](#configure-vespa) [PyVespa](https://vespa-engine.github.io/pyvespa/) helps us build the [Vespa application package](https://docs.vespa.ai/en/application-packages.html). A Vespa application package consists of configuration files, schemas, models, and code (plugins). First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their type. In \[ \]: Copied!
``` from vespa.package import Schema, Document, Field, FieldSet, HNSW colpali_schema = Schema( name="pdf_page", document=Document( fields=[ Field( name="id", type="string", indexing=["summary", "index"], match=["word"] ), Field(name="url", type="string", indexing=["summary", "index"]), Field( name="title", type="string", indexing=["summary", "index"], match=["text"], index="enable-bm25", ), Field(name="page_number", type="int", indexing=["summary", "attribute"]), Field(name="image", type="raw", indexing=["summary"]), Field( name="text", type="string", indexing=["index"], match=["text"], index="enable-bm25", ), Field( name="embedding", type="tensor(patch{}, v[16])", indexing=[ "attribute", "index", ], # adds HNSW index for candidate retrieval. ann=HNSW( distance_metric="hamming", max_links_per_node=32, neighbors_to_explore_at_insert=400, ), ), ] ), fieldsets=[FieldSet(name="default", fields=["title", "text"])], ) ``` Notice the `embedding` field which is a tensor field with the type `tensor(patch{}, v[16])`.
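For context, a document's value for this mixed tensor type is fed as a map from patch index to a 16-byte binary vector in hex form. The following is a sketch with random stand-in vectors, mirroring the feed-preparation code above:

```python
import numpy as np

# One binarized patch embedding -> 16 bytes -> 32 hex characters,
# as produced by the feed-preparation loop above.
patch = np.random.default_rng(1).standard_normal(128)
hex_vector = np.packbits(np.where(patch > 0, 1, 0)).astype(np.int8).tobytes().hex()

# The tensor(patch{}, v[16]) field value maps patch index -> hex string.
embedding_field = {0: hex_vector, 1: hex_vector}
print(len(hex_vector))  # 32 hex characters per patch
```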
This is the field we use to represent the patch embeddings from ColPali. We also enable [HNSW indexing](https://docs.vespa.ai/en/approximate-nn-hnsw.html) for this field to enable fast nearest neighbor search, which is used for candidate retrieval. We use [binary hamming distance](https://docs.vespa.ai/en/nearest-neighbor-search.html#using-binary-embeddings-with-hamming-distance) as an approximation of the cosine similarity. Hamming distance is a good approximation for binary representations, and it is much faster to compute than cosine similarity/dot product. The `embedding` field is an example of a mixed tensor where we combine one mapped (sparse) dimension with a dense dimension. Read more in the [Tensor guide](https://docs.vespa.ai/en/tensor-user-guide.html). We also enable [BM25](https://docs.vespa.ai/en/reference/bm25.html) for the `title` and `text` fields. Notice that the `image` field uses the type `raw` to store the binary image data, encoded as a base64 string. Create the Vespa [application package](https://docs.vespa.ai/en/application-packages): In \[ \]: Copied! ``` from vespa.package import ApplicationPackage vespa_app_name = "visionrag5" vespa_application_package = ApplicationPackage( name=vespa_app_name, schema=[colpali_schema] ) ``` Now we define how we want to rank the pages for a query. We use Vespa's support for [BM25](https://docs.vespa.ai/en/reference/bm25.html) for the text, and late interaction with MaxSim for the image embeddings. This means that we use the text representations for the candidate retrieval phase, then use the ColPali embeddings with MaxSim to rerank the pages. In \[ \]: Copied!
``` from vespa.package import RankProfile, Function, FirstPhaseRanking, SecondPhaseRanking colpali_profile = RankProfile( name="default", inputs=[("query(qt)", "tensor(querytoken{}, v[128])")], functions=[ Function( name="max_sim", expression=""" sum( reduce( sum( query(qt) * unpack_bits(attribute(embedding)) , v ), max, patch ), querytoken ) """, ), Function(name="bm25_score", expression="bm25(title) + bm25(text)"), ], first_phase=FirstPhaseRanking(expression="bm25_score"), second_phase=SecondPhaseRanking(expression="max_sim", rerank_count=100), ) colpali_schema.add_rank_profile(colpali_profile) ``` The first phase uses a linear combination of BM25 scores for the text fields, and the second phase uses the MaxSim function with the image embeddings. Notice that Vespa supports an `unpack_bits` function to convert the 16-byte compressed binary vectors to 128-dimensional float vectors for the MaxSim function. The query input tensor is not compressed and uses full float resolution. ### Deploy the application to Vespa Cloud[¶](#deploy-the-application-to-vespa-cloud) With the configured application, we can deploy it to [Vespa Cloud](https://cloud.vespa.ai/en/). To deploy the application to Vespa Cloud we need to create a tenant in the Vespa Cloud: Create a tenant at [console.vespa-cloud.com](https://console.vespa-cloud.com/) (unless you already have one).
This step requires a Google or GitHub account, and will start your [free trial](https://cloud.vespa.ai/en/free-trial). In \[ \]: Copied! ``` from vespa.deployment import VespaCloud import os os.environ["TOKENIZERS_PARALLELISM"] = "false" # Replace with your tenant name from the Vespa Cloud Console tenant_name = "vespa-team" key = os.getenv("VESPA_TEAM_API_KEY", None) if key is not None: key = key.replace(r"\n", "\n") # To parse key correctly vespa_cloud = VespaCloud( tenant=tenant_name, application=vespa_app_name, key_content=key, # Key is only used for CI/CD testing of this notebook. Can be removed if logging in interactively application_package=vespa_application_package, ) ``` Now deploy the app to Vespa Cloud dev zone. The first deployment typically takes 2 minutes until the endpoint is up. In \[ \]: Copied! ``` from vespa.application import Vespa app: Vespa = vespa_cloud.deploy() ``` In \[ \]: Copied! ``` print("Number of PDF pages:", len(vespa_feed)) ``` Index the documents in Vespa using the Vespa HTTP API. In \[ \]: Copied!
``` from vespa.io import VespaResponse def callback(response: VespaResponse, id: str): if not response.is_successful(): print( f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}" ) # Feed data into Vespa synchronously app.feed_iterable(vespa_feed, schema="pdf_page", callback=callback) ``` from vespa.io import VespaResponse def callback(response: VespaResponse, id: str): if not response.is_successful(): print( f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}" ) # Feed data into Vespa synchronously app.feed_iterable(vespa_feed, schema="pdf_page", callback=callback) ### Querying Vespa[¶](#querying-vespa) Ok, so now we have indexed the PDF pages in Vespa. Let us now obtain ColPali embeddings for a few text queries and use it during ranking of the indexed pdf pages. Now we can query Vespa with the text query and rerank the results using the ColPali embeddings. In \[ \]: Copied! ``` queries = [ "Percentage of non-fresh water as source?", "Policies related to nature risk?", "How much of produced water is recycled?", ] dataloader = DataLoader( queries, batch_size=1, shuffle=False, collate_fn=lambda x: processor(text=x, return_tensors="pt"), ) qs = [] for batch_query in dataloader: with torch.no_grad(): batch_query = {k: v.to(model.device) for k, v in batch_query.items()} embeddings_query = model(**batch_query).embeddings qs.extend(list(torch.unbind(embeddings_query.to("cpu")))) ``` queries = [ "Percentage of non-fresh water as source?", "Policies related to nature risk?", "How much of produced water is recycled?", ] dataloader = DataLoader( queries, batch_size=1, shuffle=False, collate_fn=lambda x: processor(text=x, return_tensors="pt"), ) qs = [] for batch_query in dataloader: with torch.no_grad(): batch_query = {k: v.to(model.device) for k, v in batch_query.items()} embeddings_query = model(\*\*batch_query).embeddings 
qs.extend(list(torch.unbind(embeddings_query.to("cpu")))) Obtain the query embeddings using the ColPali model: In \[ \]: Copied! ``` dataloader = DataLoader( queries, batch_size=1, shuffle=False, collate_fn=lambda x: processor(text=x, return_tensors="pt"), ) qs = [] for batch_query in dataloader: with torch.no_grad(): batch_query = {k: v.to(model.device) for k, v in batch_query.items()} embeddings_query = model(**batch_query).embeddings qs.extend(list(torch.unbind(embeddings_query.to("cpu")))) ``` We create a simple routine to display the results. We render the image and the title of each retrieved page/document. In \[ \]: Copied!

```
from IPython.display import display, HTML


def display_query_results(query, response, hits=5):
    query_time = response.json.get("timing", {}).get("searchtime", -1)
    query_time = round(query_time, 2)
    count = response.json.get("root", {}).get("fields", {}).get("totalCount", 0)
    html_content = f"<h3>Query text: '{query}', query time {query_time}s, count={count}, top results:</h3>"
    for i, hit in enumerate(response.hits[:hits]):
        title = hit["fields"]["title"]
        url = hit["fields"]["url"]
        page = hit["fields"]["page_number"]
        image = hit["fields"]["image"]
        score = hit["relevance"]
        html_content += f"<h4>PDF Result {i + 1}</h4>"
        html_content += f'<p>Title: <a href="{url}">{title}</a>, page {page + 1} with score {score:.2f}</p>'
        html_content += (
            f'<img src="data:image/png;base64,{image}" style="max-width:100%;">'
        )
    display(HTML(html_content))
```

Query Vespa with the queries and display the results. Here we use the `default` rank profile. Note that we retrieve using the textual representation with `userInput(@userQuery)`: the first ranking phase uses BM25 over the extracted text, and the top-k pages are then re-ranked using the ColPali embeddings. Later in this notebook we will use Vespa's support for approximate nearest neighbor search (`nearestNeighbor`) to retrieve directly using the ColPali embeddings. In \[ \]: Copied! ``` from vespa.io import VespaQueryResponse async with app.asyncio(connections=1, timeout=120) as session: for idx, query in enumerate(queries): query_embedding = {k: v.tolist() for k, v in enumerate(qs[idx])} response: VespaQueryResponse = await session.query( yql="select title,url,image,page_number from pdf_page where userInput(@userQuery)", ranking="default", userQuery=query, timeout=120, hits=3, body={"input.query(qt)": query_embedding, "presentation.timing": True}, ) assert response.is_successful() display_query_results(query, response) ``` ### Using nearestNeighbor for retrieval[¶](#using-nearestneighbor-for-retrieval) In the above example, we used the ColPali embeddings for ranking, but the text query for retrieval.
This is a reasonable approach for text-heavy documents, where the text representation matters most and the ColPali embeddings re-rank the top-k documents from the text retrieval phase. In other cases, the ColPali embeddings carry most of the signal, so here we demonstrate how to use HNSW indexing with binary hamming distance to retrieve the most similar pages to a query, followed by two steps of re-ranking using the ColPali embeddings. All phases execute locally inside the Vespa content node(s), so no vector data needs to cross the network.

Let us add a new rank-profile to the schema. The `nearestNeighbor` operator takes a query tensor and a field tensor as arguments, so we need to define the query tensor types in the rank-profile.

```
from vespa.package import RankProfile, Function, FirstPhaseRanking, SecondPhaseRanking

input_query_tensors = []
MAX_QUERY_TERMS = 64
for i in range(MAX_QUERY_TERMS):
    input_query_tensors.append((f"query(rq{i})", "tensor(v[16])"))

input_query_tensors.append(("query(qt)", "tensor(querytoken{}, v[128])"))
input_query_tensors.append(("query(qtb)", "tensor(querytoken{}, v[16])"))

colpali_retrieval_profile = RankProfile(
    name="retrieval-and-rerank",
    inputs=input_query_tensors,
    functions=[
        Function(
            name="max_sim",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(embedding)), v
                        ),
                        max, patch
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="max_sim_binary",
            expression="""
                sum(
                    reduce(
                        1 / (1 + sum(
                            hamming(query(qtb), attribute(embedding)), v
                        )),
                        max, patch
                    ),
                    querytoken
                )
            """,
        ),
    ],
    first_phase=FirstPhaseRanking(expression="max_sim_binary"),
    second_phase=SecondPhaseRanking(expression="max_sim", rerank_count=10),
)
colpali_schema.add_rank_profile(colpali_retrieval_profile)
```

We define two functions: one for the first ranking phase and one for the second. In the first phase, instead of the float representations, we use the binary representations with inverted hamming distance.

Now, we need to re-deploy the application to Vespa Cloud.

```
from vespa.application import Vespa

app: Vespa = vespa_cloud.deploy()
```

Now we can query Vespa with the text queries, using the `nearestNeighbor` operator to retrieve the most similar pages to the query and passing the different query tensors.
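To build intuition for the two ranking functions above, here is a small numpy sketch of the same MaxSim computations with toy dimensions (4 query tokens, 8 patches, 16-dimensional vectors instead of 128; all names and sizes are illustrative, not what Vespa executes internally):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy vector dimension (the notebook uses 128 floats / 128 bits)
q = rng.standard_normal((4, D))  # query-token embeddings
p = rng.standard_normal((8, D))  # patch embeddings of one page

# Float MaxSim (the `max_sim` function): for each query token,
# take the max dot product over patches, then sum over query tokens.
float_max_sim = (q @ p.T).max(axis=1).sum()

# Binary MaxSim (the `max_sim_binary` function): binarize both sides,
# score each (token, patch) pair with 1 / (1 + hamming distance),
# take the max over patches, then sum over query tokens.
qb = np.packbits((q > 0).astype(np.uint8), axis=1)
pb = np.packbits((p > 0).astype(np.uint8), axis=1)
hamming = np.array(
    [[np.unpackbits(token ^ patch).sum() for patch in pb] for token in qb]
)
binary_max_sim = (1.0 / (1.0 + hamming)).max(axis=1).sum()

print(float_max_sim, binary_max_sim)
```

The binary score is a cheap approximation used only to select candidates in the first phase; the second phase re-scores the survivors with the float MaxSim.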
```
from vespa.io import VespaQueryResponse

target_hits_per_query_tensor = (
    20  # this is a hyperparameter that can be tuned for speed versus accuracy
)
async with app.asyncio(connections=1, timeout=180) as session:
    for idx, query in enumerate(queries):
        float_query_embedding = {k: v.tolist() for k, v in enumerate(qs[idx])}
        binary_query_embeddings = dict()
        for k, v in float_query_embedding.items():
            binary_query_embeddings[k] = (
                np.packbits(np.where(np.array(v) > 0, 1, 0)).astype(np.int8).tolist()
            )

        # The mixed tensors used in MaxSim calculations
        # We use both binary and float representations
        query_tensors = {
            "input.query(qtb)": binary_query_embeddings,
            "input.query(qt)": float_query_embedding,
        }
        # The query tensors used in the nearest neighbor calculations
        for i in range(0, len(binary_query_embeddings)):
            query_tensors[f"input.query(rq{i})"] = binary_query_embeddings[i]
        nn = []
        for i in range(0, len(binary_query_embeddings)):
            nn.append(
                f"({{targetHits:{target_hits_per_query_tensor}}}nearestNeighbor(embedding,rq{i}))"
            )
        # We combine the nearestNeighbor operators with OR
        nn = " OR ".join(nn)
        response: VespaQueryResponse = await session.query(
            yql=f"select title, url, image, page_number from pdf_page where {nn}",
            ranking="retrieval-and-rerank",
            timeout=120,
            hits=3,
            body={**query_tensors, "presentation.timing": True},
        )
        assert response.is_successful()
        display_query_results(query, response)
```

Depending on the scale, we can tune the number of `targetHits` per `nearestNeighbor` operator and the re-ranking depths of the two phases. We can also parallelize the ranking phases by using more threads per query request to reduce latency.

## Summary[¶](#summary)

In this notebook, we have demonstrated how to represent ColPali in Vespa. We generated embeddings for images of PDF pages using ColPali and stored them in Vespa using [mixed tensors](https://docs.vespa.ai/en/tensor-user-guide.html). We demonstrated how to store the base64-encoded image using the `raw` Vespa field type, plus metadata like title and url, and how to retrieve relevant pages for a query using the embeddings generated by ColPali. This notebook can be extended with more complex ranking models, more complex queries, and more complex data structures, including metadata and other fields that can be filtered on or used for ranking.

## Cleanup[¶](#cleanup)

When this notebook is running in CI, we want to delete the application.
```
if os.getenv("CI", "false") == "true":
    vespa_cloud.delete()
```

# Turbocharge RAG with LangChain and Vespa Streaming Mode for Partitioned Data[¶](#turbocharge-rag-with-langchain-and-vespa-streaming-mode-for-partitioned-data)

This notebook illustrates using [Vespa streaming mode](https://docs.vespa.ai/en/streaming-search.html) to build cost-efficient RAG applications over naturally sharded data. You can read more about Vespa vector streaming search in these blog posts:

- [Announcing vector streaming search: AI assistants at scale without breaking the bank](https://blog.vespa.ai/announcing-vector-streaming-search/)
- [Yahoo Mail turns to Vespa to do RAG at scale](https://blog.vespa.ai/yahoo-mail-turns-to-vespa-to-do-rag-at-scale/)
- [Hands-On RAG guide for personal data with Vespa and LLamaIndex](https://blog.vespa.ai/scaling-personal-ai-assistants-with-streaming-mode/)

This notebook is also available in blog form: [Turbocharge RAG with LangChain and Vespa Streaming Mode for Sharded Data](https://blog.vespa.ai/turbocharge-rag-with-langchain-and-vespa-streaming-mode/)

### TLDR; Vespa streaming mode for partitioned data[¶](#tldr-vespa-streaming-mode-for-partitioned-data)

Vespa's streaming search solution enables you to integrate a user ID (or any sharding key) into the Vespa document ID. This setup allows Vespa to efficiently group each user's data on a small set of nodes and the same disk chunk. Streaming mode enables low-latency searches over a user's data without keeping data in memory.

The key benefits of streaming mode:

- Eliminates the precision compromises introduced by approximate algorithms.
- Achieves significantly higher write throughput, thanks to the absence of index builds required to support approximate search.
- Optimizes efficiency by storing documents, including tensors and data, on disk, benefiting from the cost-effective economics of storage tiers.
- Storage cost is the primary cost driver of Vespa streaming mode; no data is kept in memory. Avoiding memory usage lowers deployment costs significantly.

### Connecting LangChain Retriever with Vespa for Context Retrieval from PDF Documents[¶](#connecting-langchain-retriever-with-vespa-for-context-retrieval-from-pdf-documents)

In this notebook, we seamlessly integrate a custom [LangChain](https://python.langchain.com/v0.1/docs/get_started/introduction) [retriever](https://python.langchain.com/v0.1/docs/modules/data_connection/) with a Vespa app, leveraging Vespa's streaming mode to extract meaningful context from PDF documents.

The workflow:

- Define and deploy a Vespa [application package](https://docs.vespa.ai/en/application-packages.html) using PyVespa.
- Utilize [LangChain PDF Loaders](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/pdf) to download and parse PDF files.
- Leverage [LangChain Document Transformers](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/) to convert each PDF page into multiple text chunks.
- Feed the transformed representation to the running Vespa instance.
- Employ Vespa's built-in embedder functionality (using an open-source embedding model) to embed the text chunks per page, resulting in a multi-vector representation.
- Develop a custom [Retriever](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/) to enable seamless retrieval for any unstructured text query.

Let's get started! First, install dependencies:
```
!uv pip install -q pyvespa langchain langchain-community langchain-openai langchain-text-splitters pypdf==5.0.1 openai vespacli
```

## Sample data[¶](#sample-data)

We love [ColBERT](https://blog.vespa.ai/pretrained-transformer-language-models-for-search-part-3/), so we'll use a few ColBERT-related papers as example PDFs in this notebook.

```
def sample_pdfs():
    return [
        {
            "title": "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction",
            "url": "https://arxiv.org/pdf/2112.01488.pdf",
            "authors": "Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, Matei Zaharia",
        },
        {
            "title": "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT",
            "url": "https://arxiv.org/pdf/2004.12832.pdf",
            "authors": "Omar Khattab, Matei Zaharia",
        },
        {
            "title": "On Approximate Nearest Neighbour Selection for Multi-Stage Dense Retrieval",
            "url": "https://arxiv.org/pdf/2108.11480.pdf",
            "authors": "Craig Macdonald, Nicola Tonellotto",
        },
        {
            "title": "A Study on Token Pruning for ColBERT",
            "url": "https://arxiv.org/pdf/2112.06540.pdf",
            "authors": "Carlos Lassance, Maroua Maachou, Joohee Park, Stéphane Clinchant",
        },
        {
            "title": "Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval",
            "url": "https://arxiv.org/pdf/2106.11251.pdf",
            "authors": "Xiao Wang, Craig Macdonald, Nicola Tonellotto, Iadh Ounis",
        },
    ]
```

## Defining the Vespa application[¶](#defining-the-vespa-application)

[PyVespa](https://vespa-engine.github.io/pyvespa/) helps us build the [Vespa application package](https://docs.vespa.ai/en/application-packages.html). A Vespa application package consists of configuration files, schemas, models, and code (plugins).

First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their types.
```
from vespa.package import Schema, Document, Field, FieldSet, HNSW

pdf_schema = Schema(
    name="pdf",
    mode="streaming",
    document=Document(
        fields=[
            Field(name="id", type="string", indexing=["summary", "index"]),
            Field(name="title", type="string", indexing=["summary", "index"]),
            Field(name="url", type="string", indexing=["summary", "index"]),
            Field(name="authors", type="array<string>", indexing=["summary", "index"]),
            Field(name="page", type="int", indexing=["summary", "index"]),
            Field(
                name="metadata",
                type="map<string,string>",
                indexing=["summary", "index"],
            ),
            Field(name="chunks", type="array<string>", indexing=["summary", "index"]),
            Field(
                name="embedding",
                type="tensor<float>(chunk{}, x[384])",
                indexing=["input chunks", "embed e5", "attribute", "index"],
                ann=HNSW(distance_metric="angular"),
                is_document_field=False,
            ),
        ],
    ),
    fieldsets=[FieldSet(name="default", fields=["chunks", "title"])],
)
```

The above defines our `pdf` schema using mode `streaming`. Most fields are straightforward, but take note of:

- `metadata` using `map<string,string>` - here we can store and match over page-level metadata extracted by the PDF parser.
- `chunks` using `array<string>` - these are the text chunks we create with the LangChain document transformers.
- The `embedding` field of type `tensor<float>(chunk{}, x[384])` allows us to store and search the 384-dimensional embeddings per chunk in the same document.

The observant reader might have noticed the `e5` argument to the `embed` expression in the `embedding` field above. The `e5` argument references a component of type [hugging-face-embedder](https://docs.vespa.ai/en/embedding.html#huggingface-embedder). We configure the application package and its name with the `pdf` schema and the `e5` embedder component.

```
from vespa.package import ApplicationPackage, Component, Parameter

vespa_app_name = "ragpdfs"
vespa_application_package = ApplicationPackage(
    name=vespa_app_name,
    schema=[pdf_schema],
    components=[
        Component(
            id="e5",
            type="hugging-face-embedder",
            parameters=[
                Parameter(
                    "transformer-model",
                    {
                        "url": "https://github.com/vespa-engine/sample-apps/raw/master/examples/model-exporting/model/e5-small-v2-int8.onnx"
                    },
                ),
                Parameter(
                    "tokenizer-model",
                    {
                        "url": "https://raw.githubusercontent.com/vespa-engine/sample-apps/master/examples/model-exporting/model/tokenizer.json"
                    },
                ),
            ],
        )
    ],
)
```

In the last step, we configure [ranking](https://docs.vespa.ai/en/ranking.html) by adding rank profiles to the schema.
Vespa supports [phased ranking](https://docs.vespa.ai/en/phased-ranking.html) and has a rich set of built-in [rank-features](https://docs.vespa.ai/en/reference/rank-features.html), including many text-matching features such as:

- [BM25](https://docs.vespa.ai/en/reference/bm25.html)
- [nativeRank](https://docs.vespa.ai/en/reference/nativerank.html) and many more

Users can also define custom functions using [ranking expressions](https://docs.vespa.ai/en/reference/ranking-expressions.html). The following defines a `hybrid` Vespa ranking profile.

```
from vespa.package import RankProfile, Function, FirstPhaseRanking

semantic = RankProfile(
    name="hybrid",
    inputs=[("query(q)", "tensor<float>(x[384])")],
    functions=[
        Function(
            name="similarities",
            expression="cosine_similarity(query(q), attribute(embedding),x)",
        )
    ],
    first_phase=FirstPhaseRanking(
        expression="nativeRank(title) + nativeRank(chunks) + reduce(similarities, max, chunk)",
        rank_score_drop_limit=0.0,
    ),
    match_features=[
        "closest(embedding)",
        "similarities",
        "nativeRank(chunks)",
        "nativeRank(title)",
        "elementSimilarity(chunks)",
    ],
)
pdf_schema.add_rank_profile(semantic)
```

The `hybrid` rank-profile above defines the query input embedding type and a `similarities` function that uses a Vespa [tensor compute function](https://docs.vespa.ai/en/reference/ranking-expressions.html#tensor-functions) to calculate the cosine similarity between all the chunk embeddings and the query embedding. The profile defines a single ranking phase, using a linear combination of multiple features. Using [match-features](https://docs.vespa.ai/en/reference/schema-reference.html#match-features), Vespa returns the selected features along with each hit in the SERP (result page).

## Deploy the application to Vespa Cloud[¶](#deploy-the-application-to-vespa-cloud)

With the application configured, we can deploy it to [Vespa Cloud](https://cloud.vespa.ai/en/). To deploy the application to Vespa Cloud, we need a tenant: create one at [console.vespa-cloud.com](https://console.vespa-cloud.com/) (unless you already have one). This step requires a Google or GitHub account, and will start your [free trial](https://cloud.vespa.ai/en/free-trial). Make a note of the tenant name; it is used in the next steps.

> Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.

```
from vespa.deployment import VespaCloud
import os

# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
# Key is only used for CI/CD. Can be removed if logging in interactively
key = os.getenv("VESPA_TEAM_API_KEY", None)
if key is not None:
    key = key.replace(r"\n", "\n")  # To parse key correctly

vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=vespa_app_name,
    key_content=key,  # Key is only used for CI/CD. Can be removed if logging in interactively
    application_package=vespa_application_package,
)
```

Now deploy the app to the Vespa Cloud dev zone. The first deployment typically takes about 2 minutes until the endpoint is up.

```
from vespa.application import Vespa

app: Vespa = vespa_cloud.deploy()
```

```
Deployment started in run 2 of dev-aws-us-east-1c for samples.pdfs. This may take a few minutes the first time.
INFO    [17:23:35]  Deploying platform version 8.270.8 and application dev build 2 for dev-aws-us-east-1c of default ...
INFO    [17:23:35]  Using CA signed certificate version 0
WARNING [17:23:35]  For schema 'pdf', field 'page': Changed to attribute because numerical indexes (field has type int) is not currently supported. Index-only settings may fail. Ignore this warning for streaming search.
INFO    [17:23:35]  Using 1 nodes in container cluster 'pdfs_container'
WARNING [17:23:36]  For streaming search cluster 'pdfs_content.pdf', SD field 'embedding': hnsw index is not relevant and not supported, ignoring setting
WARNING [17:23:36]  For streaming search cluster 'pdfs_content.pdf', SD field 'embedding': hnsw index is not relevant and not supported, ignoring setting
INFO    [17:23:38]  Deployment successful.
INFO    [17:23:38]  Session 3239 for tenant 'samples' prepared and activated.
INFO    [17:23:38]  ######## Details for all nodes ########
INFO    [17:23:38]  h88963a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO    [17:23:38]  --- platform vespa/cloud-tenant-rhel8:8.270.8
INFO    [17:23:38]  --- storagenode on port 19102 has config generation 3239, wanted is 3239
INFO    [17:23:38]  --- searchnode on port 19107 has config generation 3239, wanted is 3239
INFO    [17:23:38]  --- distributor on port 19111 has config generation 3238, wanted is 3239
INFO    [17:23:38]  --- metricsproxy-container on port 19092 has config generation 3239, wanted is 3239
INFO    [17:23:38]  h88969g.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO    [17:23:38]  --- platform vespa/cloud-tenant-rhel8:8.270.8
INFO    [17:23:38]  --- logserver-container on port 4080 has config generation 3239, wanted is 3239
INFO    [17:23:38]  --- metricsproxy-container on port 19092 has config generation 3239, wanted is 3239
INFO    [17:23:38]  h88972i.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO    [17:23:38]  --- platform vespa/cloud-tenant-rhel8:8.270.8
INFO    [17:23:38]  --- container-clustercontroller on port 19050 has config generation 3239, wanted is 3239
INFO    [17:23:38]  --- metricsproxy-container on port 19092 has config generation 3239, wanted is 3239
INFO    [17:23:38]  h89461a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO    [17:23:38]  --- platform vespa/cloud-tenant-rhel8:8.270.8
INFO    [17:23:38]  --- container on port 4080 has config generation 3239, wanted is 3239
INFO    [17:23:38]  --- metricsproxy-container on port 19092 has config generation 3239, wanted is 3239
INFO    [17:23:51]  Found endpoints:
INFO    [17:23:51]  - dev.aws-us-east-1c
INFO    [17:23:51]   |-- https://c4f42a1b.bfbdb4fd.z.vespa-app.cloud/ (cluster 'pdfs_container')
INFO    [17:23:52]  Installation succeeded!
Using mTLS (key,cert) Authentication against endpoint https://c4f42a1b.bfbdb4fd.z.vespa-app.cloud//ApplicationStatus
Application is up!
Finished deployment.
```

## Processing PDFs with LangChain[¶](#processing-pdfs-with-langchain)

[LangChain](https://python.langchain.com/) has a rich set of [document loaders](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/) that can be used to load and process various file formats. In this notebook, we use the [PyPDFLoader](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/pdf).

We also want to split the extracted text into *chunks* using a [text splitter](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/). Most text embedding models have limited input lengths (typically less than 512 language model tokens), so splitting the text into multiple chunks that fit the context limit of the embedding model is a common strategy.

For embedding text data, models based on the Transformer architecture have become the de facto standard. A challenge with Transformer-based models is their input length limitation, due to the quadratic computational complexity of self-attention. For example, a popular open-source text embedding model like [e5](https://huggingface.co/intfloat/e5-small) has an absolute maximum input length of 512 wordpiece tokens. Beyond this technical limitation, feeding more tokens than were used during fine-tuning of the model will degrade the quality of the vector representation. One can view text embedding encoding as a lossy compression technique, where variable-length texts are compressed into a fixed-dimensional vector representation.
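As a rough sanity check that character-based chunks stay within the 512-token limit, one can apply the common heuristic of roughly 4 characters per English token. This is only a heuristic, not the model's actual wordpiece tokenizer, but it explains why a 1024-character chunk size is a comfortable choice:

```python
# Rough token estimate: ~4 characters per English token.
# This is a heuristic, not the e5 wordpiece tokenizer.
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    return int(len(text) / chars_per_token)


chunk = "x" * 1024  # a chunk at the splitter's chunk_size limit
print(estimate_tokens(chunk))  # 256 - well under the 512-token limit
```

For precise counts, one would run the model's own tokenizer, but the heuristic is sufficient for picking a chunk size.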
```
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,  # chars, not llm tokens
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False,
)
```

The following iterates over the `sample_pdfs` and performs these steps:

- Load the URL and extract the text into pages. A page is the retrievable unit we will use in Vespa.
- For each page, use the text splitter to split the text into chunks. The chunks are represented as an `array<string>` in the Vespa schema.
- Create the page-level Vespa `fields`; note that we duplicate some content, like the title and URL, into the page-level representation.
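The feed-preparation cell below derives each document id from a sha1 hash of the URL plus page number. A deterministic id makes re-feeding idempotent: the same page always maps to the same Vespa document, so re-running the pipeline updates documents instead of duplicating them. A minimal illustration (the URL is one of the sample PDFs above):

```python
import hashlib

# Deterministic id: the same (url, page) pair always hashes to the same id.
vespa_id = "https://arxiv.org/pdf/2004.12832.pdf#1"
hash_value = hashlib.sha1(vespa_id.encode()).hexdigest()
print(hash_value)  # a 40-character hex digest
```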
```
import hashlib
import unicodedata


def remove_control_characters(s):
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")


my_docs_to_feed = []

for pdf in sample_pdfs():
    url = pdf["url"]
    loader = PyPDFLoader(url)
    pages = loader.load_and_split()
    for index, page in enumerate(pages):
        source = page.metadata["source"]
        chunks = text_splitter.transform_documents([page])
        text_chunks = [chunk.page_content for chunk in chunks]
        text_chunks = [remove_control_characters(chunk) for chunk in text_chunks]
        page_number = index + 1
        vespa_id = f"{url}#{page_number}"
        hash_value = hashlib.sha1(vespa_id.encode()).hexdigest()
        fields = {
            "title": pdf["title"],
            "url": url,
            "page": page_number,
            "id": hash_value,
            "authors": [a.strip() for a in pdf["authors"].split(",")],
            "chunks": text_chunks,
            "metadata": page.metadata,
        }
        my_docs_to_feed.append(fields)
```

Now that we have parsed the input PDFs and created a list of pages that we want to add to Vespa, we must format the list into the structure that PyVespa accepts. Notice the `fields`, `id` and `groupname` keys.
The `groupname` is the key used to shard and co-locate the data, and is only relevant when using Vespa with streaming mode.

In \[12\]:

```
from typing import Iterable


def vespa_feed(user: str) -> Iterable[dict]:
    for doc in my_docs_to_feed:
        yield {"fields": doc, "id": doc["id"], "groupname": user}
```

Now, we can feed to the Vespa instance (`app`) with the `feed_iterable` API, passing the generator function above as input together with a custom `callback` function. Vespa also performs embedding inference during this step, using the built-in Vespa [embedding](https://docs.vespa.ai/en/embedding.html#huggingface-embedder) functionality.

In \[13\]:

```
from vespa.io import VespaResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(
            f"Document {id} failed to feed with status code {response.status_code}, url={response.url} response={response.json}"
        )


app.feed_iterable(
    schema="pdf", iter=vespa_feed("jo-bergum"), namespace="personal", callback=callback
)
```

Notice the `schema` and `namespace` arguments. PyVespa transforms the input operations into Vespa [document v1](https://docs.vespa.ai/en/document-v1-api-guide.html) requests.

### Querying data[¶](#querying-data)

Now, we can also query our data. With [streaming mode](https://docs.vespa.ai/en/reference/query-api-reference.html#streaming), we must pass the `groupname` parameter, or the request will fail with an error.
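The `groupname` is required because, in streaming mode, each document is stored under a grouped document id. The sketch below illustrates the grouped id scheme and the corresponding `/document/v1` path form; the helper functions are ours for illustration, not part of the pyvespa API (pyvespa builds these requests for you):

```
# Illustration only: how a feed/query groupname maps to Vespa's grouped
# document id scheme and the /document/v1 REST path in streaming mode.
def grouped_doc_id(namespace: str, schema: str, group: str, docid: str) -> str:
    # Grouped ids use the g=<groupname> modifier of the document id scheme.
    return f"id:{namespace}:{schema}:g={group}:{docid}"


def document_v1_path(namespace: str, schema: str, group: str, docid: str) -> str:
    return f"/document/v1/{namespace}/{schema}/group/{group}/{docid}"


print(grouped_doc_id("personal", "pdf", "jo-bergum", "a4b2ced8"))
# -> id:personal:pdf:g=jo-bergum:a4b2ced8
print(document_v1_path("personal", "pdf", "jo-bergum", "a4b2ced8"))
# -> /document/v1/personal/pdf/group/jo-bergum/a4b2ced8
```

You can see ids of exactly this form in the query responses below.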
The query request uses the Vespa Query API, and the `Vespa.query()` function supports passing any of the Vespa query API parameters. Read more about querying Vespa in:

- [Vespa Query API](https://docs.vespa.ai/en/query-api.html)
- [Vespa Query API reference](https://docs.vespa.ai/en/reference/query-api-reference.html)
- [Vespa Query Language API (YQL)](https://docs.vespa.ai/en/query-language.html)

Sample query request for `why is colbert effective?` for the user `jo-bergum`:

In \[15\]:

```
from vespa.io import VespaQueryResponse
import json

response: VespaQueryResponse = app.query(
    yql="select id,title,page,chunks from pdf where userQuery() or ({targetHits:10}nearestNeighbor(embedding,q))",
    groupname="jo-bergum",
    ranking="hybrid",
    query="why is colbert effective?",
    body={
        "presentation.format.tensors": "short-value",
        "input.query(q)": 'embed(e5, "why is colbert effective?")',
    },
    timeout="2s",
)
assert response.is_successful()
print(json.dumps(response.hits[0], indent=2))
```

``` { "id": "id:personal:pdf:g=jo-bergum:a4b2ced87807ee9cb0325b7a1c64a070d05a31f7", "relevance": 1.1412738851962692, "source": "pdfs_content.pdf", "fields": { "matchfeatures": { "closest(embedding)": { "0": 1.0 }, "elementSimilarity(chunks)": 0.5006379585326953, "nativeRank(chunks)": 0.15642522855051508, "nativeRank(title)": 0.1341324233922751, "similarities": { "1": 0.7731813192367554, "2": 0.8196794986724854, "3": 0.796222984790802, "4": 0.7699441909790039, "0": 0.850716233253479 } }, "id":
"a4b2ced87807ee9cb0325b7a1c64a070d05a31f7", "title": "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT", "page": 9, "chunks": [ "Sq,d:=\u00d5i\u2208[|Eq|]maxj\u2208[|Ed|]Eqi\u00b7ETdj(3)ColBERT is di\ufb00erentiable end-to-end. We /f_ine-tune the BERTencoders and train from scratch the additional parameters (i.e., thelinear layer and the [Q] and [D] markers\u2019 embeddings) using theAdam [ 16] optimizer. Notice that our interaction mechanism hasno trainable parameters. Given a triple \u27e8q,d+,d\u2212\u27e9with query q,positive document d+and negative document d\u2212, ColBERT is usedto produce a score for each document individually and is optimizedvia pairwise so/f_tmax cross-entropy loss over the computed scoresofd+andd\u2212.3.4 O\ufb00line Indexing: Computing & StoringDocument EmbeddingsBy design, ColBERT isolates almost all of the computations betweenqueries and documents, largely to enable pre-computing documentrepresentations o\ufb04ine. At a high level, our indexing procedure isstraight-forward: we proceed over the documents in the collectionin batches, running our document encoder fDon each batch andstoring the output embeddings per document. Although indexing", "a set of documents is an o\ufb04ine process, we incorporate a fewsimple optimizations for enhancing the throughput of indexing. Aswe show in \u00a74.5, these optimizations can considerably reduce theo\ufb04ine cost of indexing.To begin with, we exploit multiple GPUs, if available, for fasterencoding of batches of documents in parallel. When batching, wepad all documents to the maximum length of a document withinthe batch.3To make capping the sequence length on a per-batchbasis more e\ufb00ective, our indexer proceeds through documents ingroups of B(e.g., B=100,000) documents. It sorts these documentsby length and then feeds batches of b(e.g., b=128) documents ofcomparable length through our encoder. 
/T_his length-based bucket-ing is sometimes refered to as a BucketIterator in some libraries(e.g., allenNLP). Lastly, while most computations occur on the GPU,we found that a non-trivial portion of the indexing time is spent onpre-processing the text sequences, primarily BERT\u2019s WordPiece to-", "kenization. Exploiting that these operations are independent acrossdocuments in a batch, we parallelize the pre-processing across theavailable CPU cores.Once the document representations are produced, they are savedto disk using 32-bit or 16-bit values to represent each dimension.As we describe in \u00a73.5 and 3.6, these representations are eithersimply loaded from disk for ranking or are subsequently indexedfor vector-similarity search, respectively.3.5 Top- kRe-ranking with ColBERTRecall that ColBERT can be used for re-ranking the output of an-other retrieval model, typically a term-based model, or directlyfor end-to-end retrieval from a document collection. In this sec-tion, we discuss how we use ColBERT for ranking a small set ofk(e.g., k=1000) documents given a query q. Since kis small, werely on batch computations to exhaustively score each document", "3/T_he public BERT implementations we saw simply pad to a pre-de/f_ined length.(unlike our approach in \u00a73.6). To begin with, our query serving sub-system loads the indexed documents representations into memory,representing each document as a matrix of embeddings.Given a query q, we compute its bag of contextualized embed-dings Eq(Equation 1) and, concurrently, gather the document repre-sentations into a 3-dimensional tensor Dconsisting of kdocumentmatrices. We pad the kdocuments to their maximum length tofacilitate batched operations, and move the tensor Dto the GPU\u2019smemory. On the GPU, we compute a batch dot-product of EqandD, possibly over multiple mini-batches. /T_he output materializes a3-dimensional tensor that is a collection of cross-match matricesbetween qand each document. 
To compute the score of each docu-ment, we reduce its matrix across document terms via a max-pool(i.e., representing an exhaustive implementation of our MaxSim", "computation) and reduce across query terms via a summation. Fi-nally, we sort the kdocuments by their total scores." ] } } ```

Notice the `matchfeatures` field, which returns the configured match-features from the rank-profile, including all the chunk similarities.

## LangChain Retriever[¶](#langchain-retriever)

We use the [LangChain Retriever](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/) interface so that we can connect our Vespa app with the flexibility and power of the [LangChain](https://python.langchain.com/v0.1/docs/get_started/introduction) LLM framework.

> A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.

The retriever interface fits perfectly with Vespa, as Vespa can support a wide range of features and ways to retrieve and rank content. The following implements a custom retriever `VespaStreamingHybridRetriever` that takes the following arguments:

- `app: Vespa` The Vespa application we retrieve from. This could be a Vespa Cloud instance or a local instance, for example running on a laptop.
- `user: str` The user we retrieve for; this argument maps to the [Vespa streaming mode groupname parameter](https://docs.vespa.ai/en/reference/query-api-reference.html#streaming.groupname).
- `pages: int` The target number of PDF pages we want to retrieve for a given query.
- `chunks_per_page: int` The target number of relevant text chunks associated with each page.
- `chunk_similarity_threshold: float` The chunk similarity threshold; only chunks with a similarity above this threshold are included.

The core idea is to *retrieve* pages using maximum chunk similarity as the initial scoring function, then consider other chunks on the same page as potentially relevant.

In \[19\]:

```
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from typing import List


class VespaStreamingHybridRetriever(BaseRetriever):
    app: Vespa
    user: str
    pages: int = 5
    chunks_per_page: int = 3
    chunk_similarity_threshold: float = 0.8

    def _get_relevant_documents(self, query: str) -> List[Document]:
        response: VespaQueryResponse = self.app.query(
            yql="select id, url, title, page, authors, chunks from pdf where userQuery() or ({targetHits:20}nearestNeighbor(embedding,q))",
            groupname=self.user,
            ranking="hybrid",
            query=query,
            hits=self.pages,
            body={
                "presentation.format.tensors": "short-value",
                "input.query(q)": f'embed(e5, "query: {query} ")',
            },
            timeout="2s",
        )
        if not response.is_successful():
            raise ValueError(
                f"Query failed with status code {response.status_code}, url={response.url} response={response.json}"
            )
        return self._parse_response(response)

    def _parse_response(self, response: VespaQueryResponse) -> List[Document]:
        documents: List[Document] = []
        for hit in response.hits:
            fields = hit["fields"]
            chunks_with_scores = self._get_chunk_similarities(fields)
            # Best k chunks from each page
            best_chunks_on_page = " ### ".join(
                [
                    chunk
                    for chunk, score in chunks_with_scores[0 : self.chunks_per_page]
                    if score > self.chunk_similarity_threshold
                ]
            )
            documents.append(
                Document(
                    id=fields["id"],
                    page_content=best_chunks_on_page,
                    title=fields["title"],
                    metadata={
                        "title": fields["title"],
                        "url": fields["url"],
                        "page": fields["page"],
                        "authors": fields["authors"],
                        "features": fields["matchfeatures"],
                    },
                )
            )
        return documents

    def _get_chunk_similarities(self, hit_fields: dict) -> List[tuple]:
        match_features = hit_fields["matchfeatures"]
        similarities = match_features["similarities"]
        chunk_scores = []
        for i in range(0, len(similarities)):
            chunk_scores.append(similarities.get(str(i), 0))
        chunks = hit_fields["chunks"]
        chunks_with_scores = list(zip(chunks, chunk_scores))
        return sorted(chunks_with_scores, key=lambda x: x[1], reverse=True)
```

That's it! We can give our newborn retriever a spin for the user `jo-bergum`:

In \[20\]:

```
vespa_hybrid_retriever = VespaStreamingHybridRetriever(
    app=app, user="jo-bergum", pages=1, chunks_per_page=1
)
```

In \[21\]:

```
vespa_hybrid_retriever.invoke("what is the maxsim operator in colbert?")
```

Out\[21\]:

``` [Document(page_content='ture that precisely does so. As illustrated, every query embeddinginteracts with all document embeddings via a MaxSim operator,which computes maximum similarity (e.g., cosine similarity), andthe scalar outputs of these operators are summed across queryterms. /T_his paradigm allows ColBERT to exploit deep LM-basedrepresentations while shi/f_ting the cost of encoding documents of-/f_line and amortizing the cost of encoding the query once acrossall ranked documents.
Additionally, it enables ColBERT to lever-age vector-similarity search indexes (e.g., [ 1,15]) to retrieve thetop-kresults directly from a large document collection, substan-tially improving recall over models that only re-rank the output ofterm-based retrieval.As Figure 1 illustrates, ColBERT can serve queries in tens orfew hundreds of milliseconds. For instance, when used for re-ranking as in “ColBERT (re-rank)”, it delivers over 170 ×speedup(and requires 14,000 ×fewer FLOPs) relative to existing BERT-based', metadata={'title': 'ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT', 'url': 'https://arxiv.org/pdf/2004.12832.pdf', 'page': 4, 'authors': ['Omar Khattab', 'Matei Zaharia'], 'features': {'closest(embedding)': {'0': 1.0}, 'elementSimilarity(chunks)': 0.41768707482993195, 'nativeRank(chunks)': 0.1401101487033024, 'nativeRank(title)': 0.0520403737720047, 'similarities': {'1': 0.8369992971420288, '0': 0.8730311393737793}}})] ```

## RAG[¶](#rag)

Finally, we can connect our custom retriever with the complete flexibility and power of the [LangChain](https://python.langchain.com/v0.1/docs/get_started/introduction) LLM framework. The following uses [LangChain Expression Language, or LCEL](https://python.langchain.com/v0.1/docs/expression_language/), a declarative way to compose chains. We have several steps composed into a chain:

- The prompt template and LLM model, in this case using OpenAI
- The retriever that provides the retrieved context for the question
- The formatting of the retrieved context

In \[22\]:

```
vespa_hybrid_retriever = VespaStreamingHybridRetriever(
    app=app, user="jo-bergum", pages=3, chunks_per_page=3
)
```

In \[ \]:
```
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

prompt_template = """
Answer the question based only on the following context.
Cite the page number and the url of the document you are citing.

{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(prompt_template)
model = ChatOpenAI()


def format_prompt_context(docs) -> str:
    context = []
    for d in docs:
        context.append(f"{d.metadata['title']} by {d.metadata['authors']}\n")
        context.append(f"url: {d.metadata['url']}\n")
        context.append(f"page: {d.metadata['page']}\n")
        context.append(f"{d.page_content}\n\n")
    return "".join(context)


chain = (
    {
        "context": vespa_hybrid_retriever | format_prompt_context,
        "question": RunnablePassthrough(),
    }
    | prompt
    | model
    | StrOutputParser()
)
```

### Interact with the chain[¶](#interact-with-the-chain)

Now, we can start asking questions using the `chain` defined above.

In \[26\]:
```
chain.invoke("what is colbert?")
```

Out\[26\]:

``` 'ColBERT is a ranking model that adapts deep language models, specifically BERT, for efficient retrieval. It introduces a late interaction architecture that independently encodes queries and documents using BERT and then uses a cheap yet powerful interaction step to model their fine-grained similarity. This allows ColBERT to leverage the expressiveness of deep language models while also being able to pre-compute document representations offline, significantly speeding up query processing. ColBERT can be used for re-ranking documents retrieved by a traditional model or for end-to-end retrieval directly from a large document collection. It has been shown to be effective and efficient compared to existing models. (source: ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT by Omar Khattab, Matei Zaharia, page 1, url: https://arxiv.org/pdf/2004.12832.pdf)' ```

In \[27\]:

```
chain.invoke("what is the colbert maxsim operator")
```

Out\[27\]:

``` "The ColBERT model utilizes the MaxSim operator, which computes the maximum similarity (e.g., cosine similarity) between query embeddings and document embeddings. The scalar outputs of these operators are summed across query terms, allowing ColBERT to exploit deep LM-based representations while reducing the cost of encoding documents offline and amortizing the cost of encoding the query once across all ranked documents.\n\nSource: \nColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT by ['Omar Khattab', 'Matei Zaharia']\nURL: https://arxiv.org/pdf/2004.12832.pdf\nPage: 4" ```

In \[28\]:

```
chain.invoke(
    "What is the difference between colbert and single vector representational models?"
)
```

Out\[28\]:

``` 'The difference between ColBERT and single vector representational models is that ColBERT utilizes a late interaction architecture that independently encodes the query and the document using BERT, while single vector models use a single embedding vector for both the query and the document. This late interaction mechanism in ColBERT allows for fine-grained similarity estimation, which leads to more effective retrieval. (Source: ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT by Omar Khattab and Matei Zaharia, page 17, url: https://arxiv.org/pdf/2004.12832.pdf)' ```

## Summary[¶](#summary)

Vespa’s streaming mode is a game-changer, enabling the creation of highly cost-effective RAG applications for naturally partitioned data. In this notebook, we delved into the hands-on application of [LangChain](https://python.langchain.com/v0.1/docs/get_started/introduction), leveraging document loaders and transformers. Finally, we showcased a custom LangChain retriever that connected all the functionality of LangChain with Vespa.

For those interested in learning more about Vespa, join the [Vespa community on Slack](https://vespatalk.slack.com/) to exchange ideas, seek assistance, or stay in the loop on the latest Vespa developments.

We can now delete the cloud instance:

In \[ \]:

```
vespa_cloud.delete()
```

# Video Search and Retrieval with Vespa and TwelveLabs[¶](#video-search-and-retrieval-with-vespa-and-twelvelabs)

In the following notebook, we will demonstrate how to leverage [TwelveLabs](https://www.twelvelabs.io/) `Marengo-retrieval-2.7`, a SOTA multimodal embedding model, to demonstrate a use case of video embedding storage and semantic search retrieval using Vespa.ai. The steps we will take in this notebook are:

1. Setup and configuration
1. Generate attributes and embeddings for 3 sample videos using the TwelveLabs python SDK.
1.
Deploy the Vespa application to Vespa Cloud and feed the data
1. Perform a semantic search with hybrid multi-phase ranking on the videos
1. Review the results
1. Cleanup

All the steps needed to provision the Vespa application, including feeding the data, can be done by running this notebook. We have tried to make it easy for others to run this notebook, so you can create your own video semantic search application using TwelveLabs models with Vespa.

## 1. Setup and Configuration[¶](#1-setup-and-configuration)

For reference, this is the Python version used for this notebook.

In \[1\]:

```
!python --version
```

```
Python 3.12.4
```

### 1.1 Install libraries[¶](#11-install-libraries)

Install the required Python dependencies: the TwelveLabs python SDK and the pyvespa python API.

In \[2\]:

```
!pip3 install pyvespa vespacli twelvelabs pandas
```

```
Requirement already satisfied: pyvespa in /opt/anaconda3/envs/vespa-env/lib/python3.12/site-packages (0.55.0)
Requirement already satisfied: vespacli in /opt/anaconda3/envs/vespa-env/lib/python3.12/site-packages (8.391.23)
Requirement already satisfied: twelvelabs in /opt/anaconda3/envs/vespa-env/lib/python3.12/site-packages (0.4.10)
Requirement already satisfied: pandas in /opt/anaconda3/envs/vespa-env/lib/python3.12/site-packages (2.2.2)
Requirement already satisfied: requests in /opt/anaconda3/envs/vespa-env/lib/python3.12/site-packages (from pyvespa) (2.32.3)
Requirement already satisfied: requests_toolbelt in /opt/anaconda3/envs/vespa-env/lib/python3.12/site-packages (from pyvespa) (1.0.0)
Requirement already satisfied: docker in /opt/anaconda3/envs/vespa-env/lib/python3.12/site-packages (from pyvespa) (7.1.0)
Requirement already satisfied: jinja2 in /opt/anaconda3/envs/vespa-env/lib/python3.12/site-packages (from pyvespa) (3.1.4)
Requirement already satisfied: cryptography in /opt/anaconda3/envs/vespa-env/lib/python3.12/site-packages (from
pyvespa) (43.0.3)
[... remaining dependency output truncated ...]
```

Import all the required packages in this notebook.

In \[3\]:
```
import os
import hashlib
import json

from vespa.package import (
    ApplicationPackage,
    Field,
    Schema,
    Document,
    HNSW,
    RankProfile,
    FieldSet,
    SecondPhaseRanking,
    Function,
)
from vespa.deployment import VespaCloud
from vespa.io import VespaResponse, VespaQueryResponse

from twelvelabs import TwelveLabs
from twelvelabs.models.embed import EmbeddingsTask

import pandas as pd
from datetime import datetime
```

### 1.2 Get a TwelveLabs API key[¶](#12-get-a-twelvelabs-api-key)

[Sign up](https://auth.twelvelabs.io/u/signup) for TwelveLabs. After logging in, navigate to your profile and get your [API key](https://playground.twelvelabs.io/dashboard/api-key). Copy it and paste it below. The Free plan includes indexing of 600 minutes of video, which should be sufficient to explore the capabilities of the API.

In \[8\]:

```
TL_API_KEY = os.getenv("TL_API_KEY") or input("Enter your TL_API key: ")
```

### 1.3 Sign up for a Vespa Trial Account[¶](#13-sign-up-for-a-vespa-trial-account)

**Prerequisites**:

- Spin up a Vespa Cloud [Trial](https://vespa.ai/free-trial) account.
- Log in to the account you just created and create a tenant at [console.vespa-cloud.com](https://console.vespa-cloud.com/).
- Save the tenant name.

### 1.4 Set up the tenant name and the application name[¶](#14-setup-the-tenant-name-and-the-application-name)

- Paste the name of your tenant below.
- Give your application a name. Note that the name cannot contain `-` or `_`.

In \[ \]:
```
# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
# Replace with your application name (does not need to exist yet)
application = "videosearch"
```

## 2. Generate Attributes and Embeddings for sample videos using TwelveLabs Embedding API[¶](#2-generate-attributes-and-embeddings-for-sample-videos-using-twelvelabs-embedding-api)

### 2.1 Generate attributes on the videos[¶](#21-generate-attributes-on-the-videos)

In this section, we will leverage the [Pegasus 1.2](https://docs.twelvelabs.io/v1.3/docs/concepts/models/pegasus) generative model to generate attributes about our videos to store as part of the searchable information in Vespa. Attributes we want to store with the videos include:

- Keywords
- Summaries

For video samples, we select the 3 videos in the array below from the [Internet Archive](https://archive.org/). You can customize this code with the URLs of your choice. Note that there are certain restrictions, such as on the resolution of the videos.

In \[10\]:
```
VIDEO_URLs = [
    "https://archive.org/download/the-end-blue-sky-studios/The%20End%281080P_60FPS%29.ia.mp4",
    "https://ia601401.us.archive.org/1/items/twas-the-night-before-christmas-1974-full-movie-freedownloadvideo.net/twas-the-night-before-christmas-1974-full-movie-freedownloadvideo.net.mp4",
    "https://archive.org/download/The_Worm_in_the_Apple_Animation_Test/AnimationTest.mov",
]
```

To generate text about the videos, the prerequisite is to upload and index them. Let's first create an index below:

```
# Spin-up session
client = TwelveLabs(api_key=TL_API_KEY)

# Generating Index Name
timestamp = int(datetime.now().timestamp())
index_name = "Vespa_" + str(timestamp)

# Create Index
print("Creating Index:" + index_name)
index = client.index.create(
    name=index_name,
    models=[
        {
            "name": "pegasus1.2",
            "options": ["visual", "audio"],
        }
    ],
    addons=["thumbnail"],  # Optional
)
print(f"Created index: id={index.id} name={index.name} models={index.models}")
```

```
Creating Index:Vespa_1752595622
Created index: id=68767ca6e01b53f51c3f2ac5 name=Vespa_1752595622 models=root=[Model(name='pegasus1.2', options=['visual', 'audio'], addons=None, finetuned=False)]
```

We can now upload the videos:

```
# Capturing index id for upload
index_id = index.id


def on_task_update(task: EmbeddingsTask):
    print(f"  Status={task.status}")


for video_url in VIDEO_URLs:
    # Create a video indexing task
    task = client.task.create(index_id=index_id, url=video_url)
    print(f"Task created successfully! Task ID: {task.id}")
    status = task.wait_for_done(sleep_interval=10, callback=on_task_update)
    print(f"Indexing done: {status}")
    if task.status != "ready":
        raise RuntimeError(f"Indexing failed with status {task.status}")
    print(
        f"Uploaded {video_url}. The unique identifier of your video is {task.video_id}."
    )
```

```
Task created successfully! Task ID: 68767caa47c93cd3ab1e4b05
  Status=pending
  Status=pending
  Status=pending
  Status=pending
  Status=ready
Indexing done: Task(id='68767caa47c93cd3ab1e4b05', created_at='2025-07-15T16:07:08.998Z', updated_at='2025-07-15T16:07:08.998Z', index_id='68767ca6e01b53f51c3f2ac5', video_id='68767caa47c93cd3ab1e4b05', status='ready', system_metadata={'filename': 'The End(1080P_60FPS).ia.mp4', 'duration': 34.667392, 'width': 1920, 'height': 1080}, hls=TaskHLS(video_url='', thumbnail_urls=[], status='PROCESSING', updated_at='2025-07-15T16:07:08.998Z'))
Uploaded https://archive.org/download/the-end-blue-sky-studios/The%20End%281080P_60FPS%29.ia.mp4. The unique identifier of your video is 68767caa47c93cd3ab1e4b05.
Task created successfully! Task ID: 68767ce06c4253f85f0820d0
  Status=indexing
  Status=indexing
  Status=indexing
  Status=indexing
  Status=indexing
  Status=indexing
  Status=indexing
  Status=indexing
  Status=indexing
  Status=indexing
  Status=indexing
  Status=indexing
  Status=indexing
  Status=indexing
  Status=ready
Indexing done: Task(id='68767ce06c4253f85f0820d0', created_at='2025-07-15T16:08:01.059Z', updated_at='2025-07-15T16:08:01.059Z', index_id='68767ca6e01b53f51c3f2ac5', video_id='68767ce06c4253f85f0820d0', status='ready', system_metadata={'filename': 'twas-the-night-before-christmas-1974-full-movie-freedownloadvideo.net.mp4', 'duration': 1448.88, 'width': 640, 'height': 480}, hls=TaskHLS(video_url='', thumbnail_urls=[], status='PROCESSING', updated_at='2025-07-15T16:08:01.059Z'))
Uploaded https://ia601401.us.archive.org/1/items/twas-the-night-before-christmas-1974-full-movie-freedownloadvideo.net/twas-the-night-before-christmas-1974-full-movie-freedownloadvideo.net.mp4. The unique identifier of your video is 68767ce06c4253f85f0820d0.
Task created successfully! Task ID: 68767d7a03f1a1f6cd14797d
  Status=pending
  Status=indexing
  Status=ready
Indexing done: Task(id='68767d7a03f1a1f6cd14797d', created_at='2025-07-15T16:10:37.601Z', updated_at='2025-07-15T16:10:37.601Z', index_id='68767ca6e01b53f51c3f2ac5', video_id='68767d7a03f1a1f6cd14797d', status='ready', system_metadata={'filename': 'AnimationTest.mov', 'duration': 24.45679, 'width': 720, 'height': 405}, hls=TaskHLS(video_url='', thumbnail_urls=[], status='PROCESSING', updated_at='2025-07-15T16:10:37.601Z'))
Uploaded https://archive.org/download/The_Worm_in_the_Apple_Animation_Test/AnimationTest.mov. The unique identifier of your video is 68767d7a03f1a1f6cd14797d.
```

Now that the videos have been uploaded, we can generate the keywords and summaries for the videos below.
You will notice in the output that the video uploaded last is processed first at this stage. This matters, since we store the other attributes of the videos (e.g. URLs, titles) in arrays.

```
import textwrap

client = TwelveLabs(api_key=TL_API_KEY)

summaries = []
keywords_array = []

# Get all videos in an Index
videos = client.index.video.list(index_id)
for video in videos:
    print(f"Generating text for {video.id}")

    res = client.summarize(
        video_id=video.id,
        type="summary",
        prompt="Generate an abstract of the video serving as metadata on the video, up to five sentences.",
    )
    wrapped = textwrap.wrap(res.summary, width=110)
    print("Summary:")
    print("\n".join(wrapped))
    summaries.append(res.summary)

    keywords = client.analyze(
        video_id=video.id,
        prompt="Based on this video, I want to generate five keywords for SEO (Search Engine Optimization). Provide just the keywords as a comma delimited list without any additional text.",
    )
    print(f"Open-ended Text: {keywords.data}")
    keywords_array.append(keywords.data)
```

```
Generating text for 68767d7a03f1a1f6cd14797d
Summary:
The video titled "The Worm in the Apple Animation Test" showcases a whimsical scene where a segmented worm
emerges from a red apple, positioned on the left side of the frame, and moves across a green field under a
cloudy sky. As the worm progresses, its segments detach one by one, leaving the head connected to the last
segment, with the detached parts scattered around the base of the hill where the apple rests. The camera
zooms out to reveal more of the grassy terrain and then focuses closely on the worm's face, which exhibits a
range of expressions from surprise to anger, enhancing the animated narrative. The worm's journey ends as it
crawls off-screen, leaving behind a visually engaging and animated sequence. The video is accompanied by a
repetitive, light-hearted musical score that adds to the playful tone of the animation.
Open-ended Text: worm, apple, animation, test, victor lyuboslavsky
Generating text for 68767ce06c4253f85f0820d0
Summary:
The video is an animated adaptation of "Twas The Night Before Christmas," featuring a blend of human and
mouse characters. It begins with a snowy night scene and transitions to a clockmaker's workshop, where the
clockmaker, Joshua Trundle, and his family face challenges after a critical letter to Santa is written by
Albert, Trundle's son. The story unfolds with the town's efforts to reconcile with Santa through a special
clock designed to play a welcoming song on Christmas Eve, but complications arise when the clock
malfunctions. Despite the setbacks, the family and community work together to fix the clock and restore
belief in Santa, culminating in his magical arrival, bringing joy and gifts to all. The video concludes with
a heartfelt message about the power of belief and the importance of making amends.
Open-ended Text: snowy village, clock tower, Santa Claus, mechanical gears, Christmas chimes
Generating text for 68767caa47c93cd3ab1e4b05
Summary:
The video captures a serene snowy landscape with pine trees under a cloudy sky, where a squirrel emerges
from behind a rock formation carrying an acorn. Upon noticing another acorn in the foreground, the squirrel
appears momentarily surprised, as indicated by its vocalization "Oh...". It then drops one acorn and begins
to nibble on the other, eventually discarding fragments of it before leaping away. The scene concludes with
the squirrel's departure, leaving behind the remnants of the acorn, as darkness gradually engulfs the snowy
setting.
Open-ended Text: squirrel, acorn, winter, snow, forest
```

We also need to store the titles of the videos as an additional attribute.

```
# Creating array with titles
titles = [
    "The Worm in the Apple Animation Test",
    "Twas the night before Christmas",
    "The END (Blue Sky Studios)",
]
```

### 2.2 Generate Embeddings[¶](#22-generate-embeddings)

The following code leverages the [Embed API](https://docs.twelvelabs.io/docs/create-video-embeddings) to create an asynchronous embedding task to embed the sample videos.

Twelve Labs video embeddings capture all the subtle cues and interactions between different modalities, including the visual expressions, body language, spoken words, and the overall context of the video, encapsulating the essence of all these modalities and their interrelations over time.
```
client = TwelveLabs(api_key=TL_API_KEY)

# Initialize an array to store the task IDs as strings
task_ids = []

for url in VIDEO_URLs:
    task = client.embed.task.create(model_name="Marengo-retrieval-2.7", video_url=url)
    print(
        f"Created task: id={task.id} model_name={task.model_name} status={task.status}"
    )
    # Append the task ID to the array
    task_ids.append(str(task.id))
    status = task.wait_for_done(sleep_interval=10, callback=on_task_update)
    print(f"Embedding done: {status}")
    if task.status != "ready":
        raise RuntimeError(f"Embedding failed with status {task.status}")
```

```
Created task: id=6876856e4fc16ea9b2fdb823 model_name=Marengo-retrieval-2.7 status=processing
  Status=processing
  Status=processing
  Status=ready
Embedding done: ready
Created task: id=68768593de7e2a0235058cc6 model_name=Marengo-retrieval-2.7 status=processing
  Status=processing
  Status=processing
  Status=processing
  Status=processing
  Status=processing
  Status=processing
  Status=processing
  Status=processing
  Status=processing
  Status=processing
  Status=ready
Embedding done: ready
Created task: id=6876860547c93cd3ab1e4cd7 model_name=Marengo-retrieval-2.7 status=processing
  Status=processing
  Status=ready
Embedding done: ready
```

### 2.3 Retrieve Embeddings[¶](#23-retrieve-embeddings)

Once the embedding tasks have completed, we can retrieve their results based on the task_ids.
```
# Spin-up session
client = TwelveLabs(api_key=TL_API_KEY)

# Initialize an array to store the task objects directly
tasks = []

for task_id in task_ids:
    # Retrieve the task
    task = client.embed.task.retrieve(task_id)
    tasks.append(task)

    # Print task details
    print(f"Task ID: {task.id}")
    print(f"Status: {task.status}")
```

```
Task ID: 6876856e4fc16ea9b2fdb823
Status: ready
Task ID: 68768593de7e2a0235058cc6
Status: ready
Task ID: 6876860547c93cd3ab1e4cd7
Status: ready
```

We can now review the output structure of the first segment of each of these videos. This output will help us define the schema to store the embeddings in Vespa in the second part of this notebook.

From this output we see that each video has been embedded into chunks of 6 seconds (the default, configurable value in the Embed API). Each embedding is a float vector of dimension 1024. The number of segments varies with the length of the video, ranging from 8 to 484 segments for our sample videos.
```
for task in tasks:
    print(task.id)
    # Display data types of each field
    for key, value in task.video_embedding.segments[0]:
        if isinstance(value, list):
            print(
                f"{key}: list of size {len(value)} (truncated to 5 items): {value[:5]} "
            )
        else:
            print(f"{key}: {type(value).__name__} : {value}")
    print(f"Total Number of segments: {len(task.video_embedding.segments)}")
```

```
6876856e4fc16ea9b2fdb823
start_offset_sec: float : 0.0
end_offset_sec: float : 6.0
embedding_scope: str : clip
embedding_option: str : visual-text
embeddings_float: list of size 1024 (truncated to 5 items): [0.0227238, -0.002079417, 0.01519275, -0.009030234, -0.00162781] 
Total Number of segments: 12
68768593de7e2a0235058cc6
start_offset_sec: float : 0.0
end_offset_sec: float : 6.0
embedding_scope: str : clip
embedding_option: str : visual-text
embeddings_float: list of size 1024 (truncated to 5 items): [0.024328815, -0.0035867887, 0.016065866, 0.02501548, 0.007778642] 
Total Number of segments: 484
6876860547c93cd3ab1e4cd7
start_offset_sec: float : 0.0
end_offset_sec: float : 6.0
embedding_scope: str : clip
embedding_option: str : visual-text
embeddings_float: list of size 1024 (truncated to 5 items): [0.05419811, -0.0018933096, 0.008044507, -0.01940344, 0.013152712] 
Total Number of segments: 8
```

## 3. Deploy a Vespa Application[¶](#3-deploy-a-vespa-application)

At this point, we are ready to deploy a Vespa application. We have generated the attributes we need for each video, as well as the embeddings.
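Before defining the schema, it can help to sanity-check that every segment embedding has the dimensionality the schema will declare (1024). A minimal sketch, using plain dicts standing in for the SDK's segment objects:

```
def check_embedding_dims(segments, dim=1024):
    """Return True if every segment embedding has the expected dimension."""
    return all(len(s["embeddings_float"]) == dim for s in segments)


# Synthetic segments standing in for task.video_embedding.segments
segments = [{"embeddings_float": [0.0] * 1024}, {"embeddings_float": [0.1] * 1024}]
print(check_embedding_dims(segments))  # True
```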
### 3.1 Create an Application Package[¶](#31-create-an-application-package)

The [application package](https://vespa-engine.github.io/pyvespa/api/vespa/package.md) has all the Vespa configuration files - create one from scratch.

The Vespa schema deployed as part of the package is called `videos`. The fields match the output of the TwelveLabs Embed API above. Refer to the [Vespa documentation](https://docs.vespa.ai/en/reference/schema-reference.html) for more information on the schema specification.

We can first define the schema using pyvespa:

```
videos_schema = Schema(
    name="videos",
    document=Document(
        fields=[
            Field(name="video_url", type="string", indexing=["summary"]),
            Field(
                name="title",
                type="string",
                indexing=["index", "summary"],
                match=["text"],
                index="enable-bm25",
            ),
            Field(
                name="keywords",
                type="string",
                indexing=["index", "summary"],
                match=["text"],
                index="enable-bm25",
            ),
            Field(
                name="video_summary",
                type="string",
                indexing=["index", "summary"],
                match=["text"],
                index="enable-bm25",
            ),
            Field(
                name="embedding_scope", type="string", indexing=["attribute", "summary"]
            ),
            Field(
                name="start_offset_sec",
                type="array<float>",
                indexing=["attribute", "summary"],
            ),
            Field(
                name="end_offset_sec",
                type="array<float>",
                indexing=["attribute", "summary"],
            ),
            Field(
                name="embeddings",
                type="tensor(p{},x[1024])",
                indexing=["index", "attribute"],
                ann=HNSW(distance_metric="angular"),
            ),
        ]
    ),
    fieldsets=[
        FieldSet(
            name="default",
            fields=["title", "keywords", "video_summary"],
        ),
    ],
)

mapfunctions = [
    Function(
        name="similarities",
        expression="""
                sum(
                    query(q) * attribute(embeddings), x
                )
                """,
    ),
    Function(
        name="bm25_score",
        expression="bm25(title) + bm25(keywords) + bm25(video_summary)",
    ),
]

semantic_rankprofile = RankProfile(
    name="hybrid",
    inputs=[("query(q)", "tensor(x[1024])")],
    first_phase="bm25_score",
    second_phase=SecondPhaseRanking(
        expression="closeness(field, embeddings)", rerank_count=10
    ),
    match_features=["closest(embeddings)"],
    summary_features=["similarities"],
    functions=mapfunctions,
)
videos_schema.add_rank_profile(semantic_rankprofile)
```

We can now create the package based on the previous schema.
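To build intuition for the rank profile above: the `similarities` function computes, for each mapped segment `p`, the dot product between the query tensor and that segment's embedding, mirroring `sum(query(q) * attribute(embeddings), x)`; the second phase then rewards the segment closest to the query (Vespa defines `closeness` as `1/(1+distance)`). A minimal pure-Python sketch of the per-segment dot products, with tiny 3-dimensional vectors standing in for the 1024-dimensional ones:

```
def segment_similarities(query, embeddings):
    """Dot product of the query vector against each mapped segment cell."""
    return {
        seg: sum(q * e for q, e in zip(query, cells))
        for seg, cells in embeddings.items()
    }


# Illustrative stand-ins for query(q) and the embeddings tensor cells
query = [1.0, 0.0, 0.5]
embeddings = {"0": [0.2, 0.1, 0.4], "1": [0.9, 0.0, 0.2]}
sims = segment_similarities(query, embeddings)
print(max(sims, key=sims.get))  # segment "1" has the highest dot product
```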
```
# Create the Vespa application package
package = ApplicationPackage(name=application, schema=[videos_schema])
```

### 3.2 Deploy the Application Package[¶](#32-deploy-the-application-package)

The app is now defined and ready to deploy to Vespa Cloud. Deploy `package` to Vespa Cloud by creating an instance of [VespaCloud](https://vespa-engine.github.io/pyvespa/api/vespa/deployment#VespaCloud):

```
vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=application,
    application_package=package,
    key_content=os.getenv("VESPA_TEAM_API_KEY", None),
)
```

```
Setting application...
Running: vespa config set application vespa-presales.videosearch.default
Setting target cloud...
Running: vespa config set target cloud
No api-key found for control plane access. Using access token.
Checking for access token in auth.json...
Access token expired. Please re-authenticate.
Your Device Confirmation code is: MJKL-VTBW
Automatically open confirmation page in your default browser? [Y/n]
Opened link in your browser: https://login.console.vespa-cloud.com/activate?user_code=MJKL-VTBW
Waiting for login to complete in browser ... done
Success: Logged in
auth.json created at /Users/zohar/.vespa/auth.json
Successfully obtained access token for control plane access.
```

```
app = vespa_cloud.deploy()
```

```
Deployment started in run 19 of dev-aws-us-east-1c for vespa-presales.videosearch. This may take a few minutes the first time.
INFO [16:48:18] Deploying platform version 8.547.15 and application dev build 11 for dev-aws-us-east-1c of default ...
INFO [16:48:18] Using CA signed certificate version 3
INFO [16:48:18] Using 1 nodes in container cluster 'videosearch_container'
INFO [16:48:21] Session 7523 for tenant 'vespa-presales' prepared and activated.
INFO [16:48:21] ######## Details for all nodes ########
INFO [16:48:21] h121570a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [16:48:21] --- platform vespa/cloud-tenant-rhel8:8.547.15
INFO [16:48:21] --- container on port 4080 has config generation 7522, wanted is 7523
INFO [16:48:21] --- metricsproxy-container on port 19092 has config generation 7522, wanted is 7523
INFO [16:48:21] h119160h.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [16:48:21] --- platform vespa/cloud-tenant-rhel8:8.547.15
INFO [16:48:21] --- container-clustercontroller on port 19050 has config generation 7522, wanted is 7523
INFO [16:48:21] --- metricsproxy-container on port 19092 has config generation 7523, wanted is 7523
INFO [16:48:21] h117409h.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [16:48:21] --- platform vespa/cloud-tenant-rhel8:8.547.15
INFO [16:48:21] --- logserver-container on port 4080 has config generation 7523, wanted is 7523
INFO [16:48:21] --- metricsproxy-container on port 19092 has config generation 7522, wanted is 7523
INFO [16:48:21] h121486b.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [16:48:21] --- platform vespa/cloud-tenant-rhel8:8.547.15
INFO [16:48:21] --- storagenode on port 19102 has config generation 7522, wanted is 7523
INFO [16:48:21] --- searchnode on port 19107 has config generation 7523, wanted is 7523
INFO [16:48:21] --- distributor on port 19111 has config generation 7523, wanted is 7523
INFO [16:48:21] --- metricsproxy-container on port 19092 has config generation 7523, wanted is 7523
INFO [16:48:29] Found endpoints:
INFO [16:48:29] - dev.aws-us-east-1c
INFO [16:48:29] |-- https://d4ed0f5e.ee8b6819.z.vespa-app.cloud/ (cluster 'videosearch_container')
INFO [16:48:30] Deployment of new
application revision complete!
Only region: aws-us-east-1c available in dev environment.
Found mtls endpoint for videosearch_container
URL: https://d4ed0f5e.ee8b6819.z.vespa-app.cloud/
Application is up!
```

### 3.3 Feed the Vespa Application[¶](#33-feed-the-vespa-application)

The `vespa_feed` feed format for `pyvespa` expects a dict with the keys `id` and `fields`:

`{ "id": "vespa-document-id", "fields": {"vespa_field": "vespa-field-value"}}`

For the id, we will use an md5 hash of the video URL and a segment index. The video embedding output segments are added to the `fields` in `vespa_feed`.

```
# Initialize a list to store Vespa feed documents
vespa_feed = []

# Need to reverse VIDEO_URLs as keywords/summaries were generated in reverse order
VIDEO_URLs.reverse()

# Iterate through each task and corresponding metadata
for i, task in enumerate(tasks):
    video_url = VIDEO_URLs[i]
    title = titles[i]
    keywords = keywords_array[i]
    summary = summaries[i]

    start_offsets = []  # Reset for each video
    end_offsets = []  # Reset for each video
    embeddings = {}  # Reset for each video

    # Iterate through the video embedding segments
    for index, segment in enumerate(task.video_embedding.segments):
        # Append start and end offsets as floats
        start_offsets.append(float(segment.start_offset_sec))
        end_offsets.append(float(segment.end_offset_sec))
        # Add embedding to a mapped tensor dict with the segment index as the key
        embeddings[str(index)] = list(map(float, segment.embeddings_float))

    # Create a unique ID by hashing the URL and segment index
    id_hash = hashlib.md5(f"{video_url}_{index}".encode()).hexdigest()

    # Create one Vespa document per video
    document = {
        "id": id_hash,
        "fields": {
            "video_url": video_url,
            "title": title,
            "keywords": keywords,
            "video_summary": summary,
            "embedding_scope": segment.embedding_scope,
            "start_offset_sec": start_offsets,
            "end_offset_sec": end_offsets,
            "embeddings": embeddings,
        },
    }
    vespa_feed.append(document)
```

We can quickly validate the number of documents created (one for each video), and visually check the first record.
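The document-id scheme used in the feed above (an md5 hash over the video URL and a segment index) can be sketched in isolation; the URL below is illustrative only:

```
import hashlib


def make_doc_id(video_url: str, index: int) -> str:
    # Deterministic 32-character hex id, mirroring the "id" field above
    return hashlib.md5(f"{video_url}_{index}".encode()).hexdigest()


doc_id = make_doc_id("https://example.com/video.mp4", 0)
print(len(doc_id))  # 32
```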
```
# Print the Vespa feed size
print(f"Total documents created: {len(vespa_feed)}")
```

```
Total documents created: 3
```

```
# Iterate through the documents in vespa_feed
for i in range(
    min(3, len(vespa_feed))
):  # Ensure we don't exceed the length of vespa_feed
    # Limit the embedding to the first 3 keys and the first 3 values for each key
    embedding = vespa_feed[i]["fields"]["embeddings"]
    embedding_sample = {key: values[:3] for key, values in list(embedding.items())[:3]}

    # Beautify and print the document with truncated offsets and embedding values
    pretty_json = json.dumps(
        {
            "id": vespa_feed[i]["id"],
            "fields": {
                "video_url": vespa_feed[i]["fields"]["video_url"],
                "title": vespa_feed[i]["fields"]["title"],
                "keywords": vespa_feed[i]["fields"]["keywords"],
                "video_summary": vespa_feed[i]["fields"]["video_summary"],
                "embedding_scope": vespa_feed[i]["fields"]["embedding_scope"],
                "start_offset_sec": vespa_feed[i]["fields"]["start_offset_sec"][:3],
                "end_offset_sec": vespa_feed[i]["fields"]["end_offset_sec"][:3],
                "embedding": embedding_sample,
            },
        },
        indent=4,
    )
    print(pretty_json)
```

```
{
    "id": "93d8476bee530eb39a2122f586d0d13a",
    "fields": {
        "video_url": "https://archive.org/download/the-end-blue-sky-studios/The%20End%281080P_60FPS%29.ia.mp4",
        "title": "The END (Blue Sky Studios)",
        "keywords": "squirrel, acorn, winter, snow, forest",
        "video_summary": "The video captures a serene snowy landscape with pine trees under a cloudy sky, where a squirrel emerges from behind a rock formation carrying an acorn. Upon noticing another acorn in the foreground, the squirrel appears momentarily surprised, as indicated by its vocalization \"Oh...\". It then drops one acorn and begins to nibble on the other, eventually discarding fragments of it before leaping away. The scene concludes with the squirrel's departure, leaving behind the remnants of the acorn, as darkness gradually engulfs the snowy setting.",
        "embedding_scope": "clip",
        "start_offset_sec": [
            0.0,
            6.0,
            12.0
        ],
        "end_offset_sec": [
            6.0,
            12.0,
            18.0
        ],
        "embedding": {
            "0": [
                0.05419811,
                -0.0018933096,
                0.008044507
            ],
            "1": [
                0.016035125,
                -0.015930071,
                0.022429857
            ],
            "2": [
                0.014023403,
                -0.012773005,
                0.019988379
            ]
        }
    }
}
```

Now we can feed to Vespa using `feed_iterable`, which accepts any `Iterable` and an optional callback function where we can check the outcome of each operation.
``` def callback(response: VespaResponse, id: str): if not response.is_successful(): print( f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}" ) # Feed data into Vespa synchronously app.feed_iterable(vespa_feed, schema="videos", callback=callback) ``` def callback(response: VespaResponse, id: str): if not response.is_successful(): print( f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}" ) # Feed data into Vespa synchronously app.feed_iterable(vespa_feed, schema="videos", callback=callback) # 4. Performing search on the videos[¶](#4-performing-search-on-the-videos) ## 4.1 Performing a hybrid search on the video[¶](#41-performing-a-hybrid-search-on-the-video) As an example query, we will retrieve all the chunks which shows Santa Claus on his sleigh. The first step is to generate a text embedding for `Santa Claus on his sleigh` using the `Marengo-retrieval-2.7` model. In \[28\]: Copied! ``` client = TwelveLabs(api_key=TL_API_KEY) user_query = "Santa Claus on his sleigh" res = client.embed.create( model_name="Marengo-retrieval-2.7", text=user_query, ) print("Created a text embedding") print(f" Model: {res.model_name}") if res.text_embedding is not None and res.text_embedding.segments is not None: q_embedding = res.text_embedding.segments[0].embeddings_float print(f" Embedding Dimension: {len(q_embedding)}") print(f" Sample 5 values from array: {q_embedding[:5]}") ``` client = TwelveLabs(api_key=TL_API_KEY) user_query = "Santa Claus on his sleigh" res = client.embed.create( model_name="Marengo-retrieval-2.7", text=user_query, ) print("Created a text embedding") print(f" Model: {res.model_name}") if res.text_embedding is not None and res.text_embedding.segments is not None: q_embedding = res.text_embedding.segments[0].embeddings_float print(f" Embedding Dimension: {len(q_embedding)}") print(f" Sample 5 values from array: {q_embedding[:5]}") ``` Created a text embedding 
 Model: Marengo-retrieval-2.7
 Embedding Dimension: 1024
 Sample 5 values from array: [-0.018066406, -0.0065307617, 0.05859375, -0.033447266, -0.02368164]
```

The following query uses the dense vector representation of the query embedding obtained above; matching is performed and accelerated by Vespa's support for [approximate nearest neighbor search](https://docs.vespa.ai/en/approximate-nn-hnsw.html). The output is limited to the top 1 hit, as we only have a sample of 3 videos. The top hit is scored by the `hybrid` rank profile, which combines a first-phase bm25 score from a lexical search over the text, keywords and summary of the video with a similarity search over the embeddings. From the `match-features`, we can see that segment 212 of the video provided the highest match. We also compute the similarities for the remaining segments as part of the `summary-features`, so we can optionally look up the top N segments within a video.

In \[29\]:
```
with app.syncio(connections=1) as session:
    response: VespaQueryResponse = session.query(
        yql="select * from videos where userQuery() OR ({targetHits:100}nearestNeighbor(embeddings,q))",
        query=user_query,
        ranking="hybrid",
        hits=1,
        body={"input.query(q)": q_embedding},
    )
    assert response.is_successful()

hit = response.hits[0]

# Extract metadata
doc_id = hit.get("id")
relevance = hit.get("relevance")
source = hit.get("source")
fields = hit.get("fields", {})

# Extract the embedding match cell index (first key in matchfeatures)
match_cells = (
    fields.get("matchfeatures", {}).get("closest(embeddings)", {}).get("cells", {})
)
if not match_cells:
    raise ValueError("No cells found in matchfeatures.closest(embeddings)")

# Get the first (and only) cell key and value
cell_index, cell_value = next(iter(match_cells.items()))
cell_index = int(cell_index)  # Convert key from string to int

# Extract aligned fields using the index
start_offset = fields.get("start_offset_sec", [])[cell_index]
end_offset = fields.get("end_offset_sec", [])[cell_index]
similarity = (
    fields.get("summaryfeatures", {})
    .get("similarities", {})
    .get("cells", {})
    .get(str(cell_index))
)

# Print full info
print("Document Metadata:")
print(f"documentid: {doc_id}")
print(f"Relevance: {relevance}")
print(f"Source: {source}")
print(f"Match Features: {fields.get('matchfeatures', 'N/A')}")
print()
print(f"Title: {fields.get('title', 'N/A')}")
print(f"Keywords: {fields.get('keywords', 'N/A')}")
print(f"Video URL: {fields.get('video_url', 'N/A')}")
print(f"Video Summary: {fields.get('video_summary', 'N/A')}")
print(f"Embedding Scope: {fields.get('embedding_scope', 'N/A')}")
print()
# Print details for the matched cell
print(f"Details for cell {cell_index}:")
print(f"Start offset: {start_offset} sec")
print(f"End offset: {end_offset} sec")
print(f"Similarity score: {similarity}")
print(f"Match feature score: {cell_value}")
```

```
Document Metadata:
documentid: id:videos:videos::d4175516790d7e55a79eb7f190495a92
Relevance: 0.47162757625475055
Source: videosearch_content
Match Features: {'closest(embeddings)': {'type': 'tensor(p{})', 'cells': {'212': 1.0}}}

Title: Twas the night before Christmas
Keywords: snowy village, clock tower, Santa Claus, mechanical gears, Christmas chimes
Video URL: https://ia601401.us.archive.org/1/items/twas-the-night-before-christmas-1974-full-movie-freedownloadvideo.net/twas-the-night-before-christmas-1974-full-movie-freedownloadvideo.net.mp4
Video Summary: The video is an animated adaptation of "Twas The Night Before Christmas," featuring a blend of human and mouse characters. It begins with a snowy night scene and transitions to a clockmaker's workshop, where the clockmaker, Joshua Trundle, and his family face challenges after a critical letter to Santa is written by Albert, Trundle's son. The story unfolds with the town's efforts to reconcile with Santa through a special clock designed to play a welcoming song on Christmas Eve, but complications arise when the clock malfunctions. Despite the setbacks, the family and community work together to fix the clock and restore belief in Santa, culminating in his magical arrival, bringing joy and gifts to all. The video concludes with a heartfelt message about the power of belief and the importance of making amends.
Embedding Scope: clip

Details for cell 212:
Start offset: 1272.0 sec
End offset: 1278.0 sec
Similarity score: 0.43537065386772156
Match feature score: 1.0
```

To process the results above in a more consumable format and extract the top N segments by similarity, we can do this conveniently with a pandas dataframe:

In \[37\]:
```
def get_top_n_similarity_matches(data, N=5):
    """
    Extract the top N similarity scores and their corresponding start and end offsets.

    Args:
    - data (dict): Input JSON-like structure containing similarities and offsets.
    - N (int): The number of top similarity scores to return.

    Returns:
    - pd.DataFrame: A DataFrame with the top N similarity scores and their corresponding offsets.
    """
    # Extract relevant fields
    similarities = data["fields"]["summaryfeatures"]["similarities"]["cells"]
    start_offset_sec = data["fields"]["start_offset_sec"]
    end_offset_sec = data["fields"]["end_offset_sec"]

    # Convert similarity scores to a list of tuples (index, similarity_score) and sort by similarity score
    sorted_similarities = sorted(similarities.items(), key=lambda x: x[1], reverse=True)

    # Extract top N similarity scores
    top_n_similarities = sorted_similarities[:N]

    # Prepare results
    results = []
    for index_str, score in top_n_similarities:
        index = int(index_str)
        if index < len(start_offset_sec):
            result = {
                "index": index,
                "similarity_score": score,
                "start_offset_sec": start_offset_sec[index],
                "end_offset_sec": end_offset_sec[index],
            }
        else:
            result = {
                "index": index,
                "similarity_score": score,
                "start_offset_sec": None,
                "end_offset_sec": None,
            }
        results.append(result)

    # Convert results to a DataFrame
    df = pd.DataFrame(results)
    return df
```

In \[38\]:

```
df_result = get_top_n_similarity_matches(response.hits[0], N=10)
df_result
```

Out\[38\]:

| | index | similarity_score | start_offset_sec | end_offset_sec |
| --- | ----- | ---------------- | ---------------- | -------------- |
| 0 | 212 | 0.435371 | 1272.0 | 1278.0 |
| 1 | 230 | 0.418007 | 1380.0 | 1386.0 |
| 2 | 210 | 0.411242 | 1260.0 | 1266.0 |
| 3 | 211 | 0.409344 | 1266.0 | 1272.0 |
| 4 | 208 | 0.408644 | 1248.0 | 1254.0 |
| 5 | 231 | 0.406000 | 1386.0 | 1392.0 |
| 6 | 209 | 0.404767 | 1254.0 | 1260.0 |
| 7 | 229 | 0.403729 | 1374.0 | 1380.0 |
| 8 | 203 | 0.403292 | 1218.0 | 1224.0 |
| 9 | 207 | 0.391671 | 1242.0 | 1248.0 |

## 5. Review results (Optional)[¶](#5-review-results-optional)

We can review the results by spinning up a video player in the notebook, checking the segments identified, and judging for ourselves.
First, we need to obtain the contiguous segments, add a 3-second overlap to the consolidated segments, and convert the offsets to MM:SS so we can quickly find the segments to watch in the player. Let's write a function that takes the response as input and returns the consolidated segments to view in the player.

In \[40\]:

```
def concatenate_contiguous_segments(df):
    """
    Concatenate contiguous segments based on their start and end offsets.
    Converts the concatenated segments to MM:SS format.

    Args:
    - df (pd.DataFrame): DataFrame with columns 'start_offset_sec' and 'end_offset_sec'.

    Returns:
    - List of tuples with concatenated segments in MM:SS format as (start_time, end_time).
    """
    if df.empty:
        return []

    # Sort by start_offset_sec for ordered processing
    df = df.sort_values(by="start_offset_sec").reset_index(drop=True)

    # Initialize the list to hold concatenated segments
    concatenated_segments = []

    # Initialize the first segment
    start = df.iloc[0]["start_offset_sec"]
    end = df.iloc[0]["end_offset_sec"]

    for i in range(1, len(df)):
        current_start = df.iloc[i]["start_offset_sec"]
        current_end = df.iloc[i]["end_offset_sec"]

        # Check if the current segment is contiguous with the previous one
        if current_start <= end:
            # Extend the segment if it is contiguous
            end = max(end, current_end)
        else:
            # Add the previous segment to the result list in MM:SS format
            concatenated_segments.append(
                (convert_seconds_to_mmss(start - 3), convert_seconds_to_mmss(end + 3))
            )
            # Start a new segment
            start = current_start
            end = current_end

    # Add the final segment
    concatenated_segments.append(
        (convert_seconds_to_mmss(start - 3), convert_seconds_to_mmss(end + 3))
    )

    return concatenated_segments


def convert_seconds_to_mmss(seconds):
    """
    Converts seconds to MM:SS format.

    Args:
    - seconds (float): Time in seconds.

    Returns:
    - str: Time in MM:SS format.
    """
    minutes = int(seconds // 60)
    seconds = int(seconds % 60)
    return f"{minutes:02}:{seconds:02}"
```

In \[41\]:
```
segments = concatenate_contiguous_segments(df_result)
segments
```

Out\[41\]:

```
[('20:15', '20:27'), ('20:39', '21:21'), ('22:51', '23:15')]
```

We can now spin up the player and review the segments of interest. The video player is set to start in the middle of the first segment.

In \[42\]:

```
from IPython.display import HTML

video_url = "https://ia601401.us.archive.org/1/items/twas-the-night-before-christmas-1974-full-movie-freedownloadvideo.net/twas-the-night-before-christmas-1974-full-movie-freedownloadvideo.net.mp4"

# Minimal HTML5 player (a reconstruction; adjust as needed).
# The media fragment #t=1221 starts playback mid-way through the first segment (20:21).
video_player = f"""
<video width="640" controls>
  <source src="{video_url}#t=1221" type="video/mp4">
  Your browser does not support the video tag.
</video>
"""
HTML(video_player)
```

Out\[42\]:

\[ Your browser does not support the video tag. \](https://ia601401.us.archive.org/1/items/twas-the-night-before-christmas-1974-full-movie-freedownloadvideo.net/twas-the-night-before-christmas-1974-full-movie-freedownloadvideo.net.mp4)

## 6. Clean-up[¶](#6-clean-up)

The following will delete the application and data from the dev environment.

In \[35\]:

```
vespa_cloud.delete()
```

```
Deactivated vespa-presales.videosearch in dev.aws-us-east-1c
Deleted instance vespa-presales.videosearch.default
```

The following will delete the index created earlier, where the videos were uploaded:

In \[36\]:

```
# Creating a client
client = TwelveLabs(api_key=TL_API_KEY)

client.index.delete(index_id)
```

# Visual PDF RAG with Vespa - ColPali demo application[¶](#visual-pdf-rag-with-vespa-colpali-demo-application)

We created an end-to-end demo application for visual retrieval of PDF pages using Vespa, including a frontend web application.
To see the live demo, visit .

The main goal of the demo is to make it easy for *you* to create your own PDF Enterprise Search application using Vespa.

To deploy a full demo, you need two main components:

1. A Vespa application that lets you index and search PDF pages using ColPali embeddings.
1. A live web application that lets you interact with the Vespa application.

After running this notebook, you will have set up a Vespa application and indexed some PDF pages. You can then test that you are able to query the Vespa application, and you will be ready to deploy the web application, including the frontend.

Some of the features we want to highlight in this demo are:

- Visual retrieval of PDF pages using ColPali embeddings.
- Explainability by displaying similarity maps over the patches in the PDF pages for each query token.
- Extracting queries and questions from the PDF pages using the `gemini-1.5-8b` model.
- Type-ahead search suggestions based on the extracted queries and questions.
- Comparison of different retrieval and ranking strategies (BM25, ColPali MaxSim, and a combination of both).
- AI-generated responses to the query based on the top-ranked PDF pages, also using the `gemini-1.5-8b` model.

We also want to give an idea of what latency one can expect using Vespa for this use case. Even though your users might not state it explicitly, we consider it important to provide a snappy user experience.

In this notebook, we will prepare the Vespa backend application for our visual retrieval demo. We will use ColPali as the model to extract patch vectors from images of pdf pages. At query time, we use MaxSim to retrieve and/or (based on the configuration) rank the page results.

The steps we will take in this notebook are:

1. Setup and configuration
1. Download PDFs
1. Convert PDFs to images
1. Generate queries and questions
1. Generate ColPali embeddings
1. Prepare the Vespa application package
1. Deploy the Vespa application to Vespa Cloud
1. Feed the data to the Vespa application
1. Test a query to the Vespa application

All the steps needed to provision the Vespa application, including feeding the data, can be done by running this notebook. We have tried to make this notebook easy for others to run, so you can create your own PDF Enterprise Search application using Vespa.

If you want to run this notebook in Colab, you can do so by clicking the button below:

## 1. Setup and Configuration[¶](#1-setup-and-configuration)

In \[ \]:

```
!python --version
```

Install dependencies:

Note that the python pdf2image package requires poppler-utils; see other installation options [here](https://pdf2image.readthedocs.io/en/latest/installation.html#installing-poppler).

In \[ \]:

```
!sudo apt-get update && sudo apt-get install poppler-utils -y
```

Now install the required python packages (note that `pyvespa>=0.50.0` is quoted so the shell does not interpret `>` as redirection):

In \[ \]:

```
!pip3 install colpali-engine==0.3.10 pdf2image pypdf==5.0.1 "pyvespa>=0.50.0" vespacli numpy==1.26.4 pillow==10.4.0 google-generativeai==0.8.3 transformers python-dotenv
```

In \[ \]:
```
import os
import json
from typing import Tuple
import hashlib
import numpy as np

# Vespa
from vespa.package import (
    ApplicationPackage,
    Field,
    Schema,
    Document,
    HNSW,
    RankProfile,
    Function,
    FieldSet,
    SecondPhaseRanking,
    Summary,
    DocumentSummary,
)
from vespa.deployment import VespaCloud
from vespa.application import Vespa
from vespa.io import VespaResponse

# Google Generative AI for Google Gemini interaction
import google.generativeai as genai

# Torch and other ML libraries
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from pdf2image import convert_from_path
from pypdf import PdfReader

# ColPali model and processor
from colpali_engine.models import ColPali, ColPaliProcessor
from colpali_engine.utils.torch_utils import get_torch_device

# Load environment variables
from dotenv import load_dotenv

load_dotenv()

# Avoid warning from huggingface tokenizers
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

### Create a free trial in Vespa Cloud[¶](#create-a-free-trial-in-vespa-cloud)

Create a tenant from [here](https://vespa.ai/free-trial/). The trial includes $300 in credit. Take note of your tenant name and input it below.

In \[ \]:

```
VESPA_TENANT_NAME = "vespa-team"  # Replace with your tenant name
```

Here, set your desired application name (it will be created in a later step). Note that you cannot have a hyphen `-` or underscore `_` in the application name.

In \[ \]:

```
VESPA_APPLICATION_NAME = "colpalidemodev"
VESPA_SCHEMA_NAME = "pdf_page"
```

Next, you can create a token. This is an optional authentication method (the default is mTLS) that will be used for feeding data to, and querying, the application. For details, see [Authenticating to Vespa Cloud](https://vespa-engine.github.io/pyvespa/authenticating-to-vespa-cloud.md).

For now, we will use a single token with both read and write permissions. For production, we recommend separate tokens for feeding and querying (the former with write permission, the latter with read permission). Tokens can be created from the [Vespa Cloud console](https://console.vespa-cloud.com/) in the 'Account' -> 'Tokens' section. Make sure to save both the token id and its value somewhere safe - you'll need them when connecting to your app.

In \[ \]:

```
# Replace this with the id of your token
VESPA_TOKEN_ID = "pyvespa_integration"  # This needs to match the token_id that you created in the Vespa Cloud Console
```

We also need to set the token value (`VESPA_CLOUD_SECRET_TOKEN`) to be able to feed data to the Vespa application. Run the cell below to set the variable.

In \[ \]:
```
VESPA_CLOUD_SECRET_TOKEN = os.getenv("VESPA_CLOUD_SECRET_TOKEN") or input(
    "Enter Vespa cloud secret token: "
)
```

We will use Google's Gemini API to create sample queries for our images. Create a Gemini API key from [here](https://aistudio.google.com/app/apikey). Once you have the key, run the cell below. You can also use other VLMs to create these queries.

In \[ \]:

```
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY") or input(
    "Enter Google Generative AI API key: "
)

# Configure Google Generative AI
genai.configure(api_key=GOOGLE_API_KEY)
```

### Loading the ColPali model from huggingface 🤗[¶](#loading-the-colpali-model-from-huggingface)

In \[ \]:

```
MODEL_NAME = "vidore/colpali-v1.2"

# Set device for Torch
device = get_torch_device("auto")
print(f"Using device: {device}")

# Load the ColPali model and processor
model = ColPali.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float32,
    device_map=device,
).eval()

processor = ColPaliProcessor.from_pretrained(MODEL_NAME)
```

## 2. Download PDFs[¶](#2-download-pdfs)

We are going to use public reports from the Norwegian Government Pension Fund Global (also known as the Oil Fund). The fund puts transparency at the forefront and publishes reports on its investments, holdings, and returns, as well as its strategy and governance. These reports are the ones we are going to use for this showcase.
Here are some sample images:

As we can see, a lot of the information is in the form of tables, charts and numbers. These are not easily extractable using pdf-readers or OCR tools.

In \[ \]:

```
import requests

pdfs = [
    {
        "url": "https://drive.google.com/uc?export=download&id=1nDO0KN_BjyFu42xFAfhJagOeeaJ8fhki",
        "path": "pdfs/gpfg-half-year-report-2024.pdf",
        "year": "2024",
    },
    {
        "url": "https://drive.google.com/uc?export=download&id=1Saw_wM8RI6Zej5qkWDDpeM-3tyOQQTwR",
        "path": "pdfs/gpfg-annual-report_2023.pdf",
        "year": "2023",
    },
]
```

### Downloading the PDFs[¶](#downloading-the-pdfs)

We create a function to download the PDFs from the web to the provided directory.

In \[ \]:

```
PDFS_DIR = "pdfs"
os.makedirs(PDFS_DIR, exist_ok=True)


def download_pdf(url: str, path: str):
    r = requests.get(url, stream=True)
    with open(path, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
    return path


# Download the pdfs
for pdf in pdfs:
    download_pdf(pdf["url"], pdf["path"])
```

## 3. Convert PDFs to Images[¶](#3-convert-pdfs-to-images)

In \[ \]:
```
def get_pdf_images(pdf_path):
    reader = PdfReader(pdf_path)
    page_texts = []
    for page_number in range(len(reader.pages)):
        page = reader.pages[page_number]
        text = page.extract_text()
        page_texts.append(text)
    # Convert to PIL images
    images = convert_from_path(pdf_path)
    assert len(images) == len(page_texts)
    return images, page_texts


pdf_folder = "pdfs"
pdf_pages = []
for pdf in tqdm(pdfs):
    pdf_file = pdf["path"]
    title = os.path.splitext(os.path.basename(pdf_file))[0]
    images, texts = get_pdf_images(pdf_file)
    for page_no, (image, text) in enumerate(zip(images, texts)):
        pdf_pages.append(
            {
                "title": title,
                "year": pdf["year"],
                "url": pdf["url"],
                "path": pdf_file,
                "image": image,
                "text": text,
                "page_no": page_no,
            }
        )
```

In \[ \]:

```
len(pdf_pages)
```

In \[ \]:

```
MAX_PAGES = 10  # Set to None to use all pages

pdf_pages = pdf_pages[:MAX_PAGES] if MAX_PAGES else pdf_pages
```

The two reports contain 176 pages in total (when `MAX_PAGES` is `None`); each page will be the entity we define as one document in Vespa. Let us look at the extracted text from the pages displayed above.

In \[ \]:

```
pdf_pages[8]["image"]
```

In \[ \]:
```
print(pdf_pages[8]["text"])
```

```
# print(pdf_pages[95]["text"])
```

As we can see, the extracted text fails to capture the visual information we see in the image, and it would be difficult for an LLM to correctly answer questions such as *'Price development in Technology sector from April 2023?'* based on the text alone.

## 4. Generate Queries[¶](#4-generate-queries)

In this step, we want to generate queries for each page image. These will be useful for 2 reasons:

1. We can use these queries as typeahead suggestions in the search bar.
1. We could potentially use the queries to generate an evaluation dataset. See [Improving Retrieval with LLM-as-a-judge](https://blog.vespa.ai/improving-retrieval-with-llm-as-a-judge/) for a deeper dive into this topic. This will not be within the scope of this notebook, though.

The prompt for generating queries is adapted from [this](https://danielvanstrien.xyz/posts/post-with-code/colpali/2024-09-23-generate_colpali_dataset.html#an-update-retrieval-focused-prompt) wonderful blog post by Daniel van Strien. We have modified the prompt to also generate keyword-based queries, in addition to the question-based queries.

We will use the Gemini API to generate these queries, with `gemini-flash-lite-latest` as the model.

```
from pydantic import BaseModel


class GeneratedQueries(BaseModel):
    broad_topical_question: str
    broad_topical_query: str
    specific_detail_question: str
    specific_detail_query: str
    visual_element_question: str
    visual_element_query: str


def get_retrieval_prompt() -> Tuple[str, GeneratedQueries]:
    prompt = """You are an investor, stock analyst and financial expert. You will be presented an image of a document page from a report published by the Norwegian Government Pension Fund Global (GPFG). The report may be annual or quarterly reports, or policy reports, on topics such as responsible investment, risk etc.
Your task is to generate retrieval queries and questions that you would use to retrieve this document (or ask based on this document) in a large corpus.
Please generate 3 different types of retrieval queries and questions.
A retrieval query is a keyword based query, made up of 2-5 words, that you would type into a search engine to find this document.
A question is a natural language question that you would ask, for which the document contains the answer.
The queries should be of the following types:
1. A broad topical query: This should cover the main subject of the document.
2. A specific detail query: This should cover a specific detail or aspect of the document.
3. A visual element query: This should cover a visual element of the document, such as a chart, graph, or image.

Important guidelines:
- Ensure the queries are relevant for retrieval tasks, not just describing the page content.
- Use a fact-based natural language style for the questions.
- Frame the queries as if someone is searching for this document in a large corpus.
- Make the queries diverse and representative of different search strategies.

Format your response as a JSON object with the structure of the following example:
{
    "broad_topical_question": "What was the Responsible Investment Policy in 2019?",
    "broad_topical_query": "responsible investment policy 2019",
    "specific_detail_question": "What is the percentage of investments in renewable energy?",
    "specific_detail_query": "renewable energy investments percentage",
    "visual_element_question": "What is the trend of total holding value over time?",
    "visual_element_query": "total holding value trend"
}

If there are no relevant visual elements, provide an empty string for the visual element question and query.
Here is the document image to analyze:
Generate the queries based on this image and provide the response in the specified JSON format.
Only return JSON. Don't return any extra explanation text.
"""
    return prompt, GeneratedQueries


prompt_text, pydantic_model = get_retrieval_prompt()
```

```
gemini_model = genai.GenerativeModel("gemini-flash-lite-latest")


def generate_queries(image, prompt_text, pydantic_model):
    try:
        response = gemini_model.generate_content(
            [image, "\n\n", prompt_text],
            generation_config=genai.GenerationConfig(
                response_mime_type="application/json",
                response_schema=pydantic_model,
            ),
        )
        queries = json.loads(response.text)
    except Exception as _e:
        print(_e)
        queries = {
            "broad_topical_question": "",
            "broad_topical_query": "",
            "specific_detail_question": "",
            "specific_detail_query": "",
            "visual_element_question": "",
            "visual_element_query": "",
        }
    return queries
```

```
for pdf in tqdm(pdf_pages):
    image = pdf.get("image")
    pdf["queries"] = generate_queries(image, prompt_text, pydantic_model)
```

Let's take a look at the queries and questions generated for the page displayed above.

```
pdf_pages[8]["queries"]
```

## 5. Generate embeddings[¶](#5-generate-embeddings)

Now that we have the queries, we can use the ColPali model to generate embeddings for each page image.

```
def generate_embeddings(images, model, processor, batch_size=1) -> np.ndarray:
    """
    Generate embeddings for a list of images.
    Move to CPU only once per batch.

    Args:
        images (List[PIL.Image]): List of PIL images.
        model (nn.Module): The model to generate embeddings.
        processor: The processor to preprocess images.
        batch_size (int, optional): Batch size for processing. Defaults to 1.

    Returns:
        np.ndarray: Embeddings for the images, shape
            (len(images), processor.max_patch_length (1030 for ColPali),
            model.config.hidden_size (patch embedding dimension - 128 for ColPali)).
    """

    def collate_fn(batch):
        # Batch is a list of images
        return processor.process_images(batch)  # Should return a dict of tensors

    dataloader = DataLoader(
        images,
        batch_size=batch_size,
        shuffle=False,
        collate_fn=collate_fn,
    )

    embeddings_list = []
    for batch in tqdm(dataloader):
        with torch.no_grad():
            batch = {k: v.to(model.device) for k, v in batch.items()}
            embeddings_batch = model(**batch)
            # Convert each tensor to a numpy array and append to the list
            embeddings_list.extend(
                [t.cpu().numpy() for t in torch.unbind(embeddings_batch)]
            )
    # Stack all embeddings into a single numpy array
    all_embeddings = np.stack(embeddings_list, axis=0)
    return all_embeddings
```
```
# Generate embeddings for all images
images = [pdf["image"] for pdf in pdf_pages]
embeddings = generate_embeddings(images, model, processor)
```

Now, we have one embedding vector of dimension 128 for each patch of each image (1024 patches + some special tokens).

```
embeddings.shape
```

```
assert len(pdf_pages) == embeddings.shape[0]
assert embeddings.shape[1] > 1028  # Number of patches (including special tokens)
assert embeddings.shape[2] == 128  # Embedding dimension per patch
```

## 6. Prepare Data on Vespa Format[¶](#6-prepare-data-on-vespa-format)

Now that we have all the data we need, all that remains is to make sure it is in the right format for Vespa. We convert the embeddings to Vespa JSON format so we can store (and index) them in Vespa. Details in the [Vespa JSON feed format doc](https://docs.vespa.ai/en/reference/document-json-format.html).

We use binary quantization (BQ) of the page-level ColPali vector embeddings to reduce their size by 32x. Read more about binarization of multi-vector representations in the [ColBERT blog post](https://blog.vespa.ai/announcing-colbert-embedder-in-vespa/). The binarization step maps each 128-dimensional float vector to 128 bits, or 16 bytes per vector. On the [DocVQA benchmark](https://huggingface.co/datasets/vidore/docvqa_test_subsampled), binarization results in only a small drop in ranking accuracy.
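The binarization described above can be sketched in plain numpy. This is an illustrative toy example with an 8-dimensional vector (a real ColPali patch embedding has 128 dimensions and yields 16 packed bytes):

```python
import numpy as np

# Toy "patch embedding": 8 floats (a real ColPali patch embedding has 128)
v = [0.5, -0.2, 0.1, -0.9, 0.3, 0.0, -0.1, 0.7]

# Threshold at zero, then pack each group of 8 bits into one signed byte
bits = np.where(np.array(v) > 0, 1, 0)
packed = np.packbits(bits).astype(np.int8).tolist()

print(bits.tolist())  # [1, 0, 1, 0, 1, 0, 0, 1]
print(packed)         # [-87]  (0b10101001 interpreted as signed int8)
```

A 128-dimensional float vector becomes 16 such bytes, which is where the `v[16]` dimension of the binary embedding tensor comes from.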
```
def float_to_binary_embedding(float_query_embedding: dict) -> dict:
    """Utility function to convert float query embeddings to binary query embeddings."""
    binary_query_embeddings = {}
    for k, v in float_query_embedding.items():
        binary_vector = (
            np.packbits(np.where(np.array(v) > 0, 1, 0)).astype(np.int8).tolist()
        )
        binary_query_embeddings[k] = binary_vector
    return binary_query_embeddings
```

We also need a couple of image processing helper functions. These are borrowed from the [vidore-benchmark](https://github.com/illuin-tech/vidore-benchmark/blob/v4.0.0/src/vidore_benchmark/utils/image_utils.py) repo.

```
import base64
import io
from pathlib import Path
from typing import Union

from PIL import Image


def scale_image(image: Image.Image, new_height: int = 1024) -> Image.Image:
    """
    Scale an image to a new height while maintaining the aspect ratio.
    """
    # Calculate the scaling factor
    width, height = image.size
    aspect_ratio = width / height
    new_width = int(new_height * aspect_ratio)

    # Resize the image
    scaled_image = image.resize((new_width, new_height))
    return scaled_image


def get_base64_image(img: Union[str, Image.Image], add_url_prefix: bool = True) -> str:
    """
    Convert an image (from a filepath or a PIL.Image object) to a JPEG-base64 string.
    """
    if isinstance(img, str):
        img = Image.open(img)
    elif isinstance(img, Image.Image):
        pass
    else:
        raise ValueError("`img` must be a path to an image or a PIL Image object.")

    buffered = io.BytesIO()
    img.save(buffered, format="jpeg")
    b64_data = base64.b64encode(buffered.getvalue()).decode("utf-8")
    return f"data:image/jpeg;base64,{b64_data}" if add_url_prefix else b64_data
```

Note that we also store a scaled-down (blurred) version of the image in Vespa. The purpose of this is to return it quickly in the first results to the frontend, to provide a snappy user experience, and then load the full-resolution image asynchronously in the background.
```
vespa_feed = []
for pdf, embedding in zip(pdf_pages, embeddings):
    url = pdf["url"]
    year = pdf["year"]
    title = pdf["title"]
    image = pdf["image"]
    text = pdf.get("text", "")
    page_no = pdf["page_no"]
    query_dict = pdf["queries"]
    questions = [v for k, v in query_dict.items() if "question" in k and v]
    queries = [v for k, v in query_dict.items() if "query" in k and v]
    base_64_image = get_base64_image(
        scale_image(image, 32), add_url_prefix=False
    )  # Scaled down image to return fast on search (~1kb)
    base_64_full_image = get_base64_image(image, add_url_prefix=False)
    embedding_dict = {k: v for k, v in enumerate(embedding)}
    binary_embedding = float_to_binary_embedding(embedding_dict)
    # id_hash should be md5 hash of url and page_number
    id_hash = hashlib.md5(f"{url}_{page_no}".encode()).hexdigest()
    page = {
        "id": id_hash,
        "fields": {
            "id": id_hash,
            "url": url,
            "title": title,
            "year": year,
            "page_number": page_no,
            "blur_image": base_64_image,
            "full_image": base_64_full_image,
            "text": text,
            "embedding": binary_embedding,
            "queries": queries,
            "questions": questions,
        },
    }
    vespa_feed.append(page)
```

### [Optional] Saving the feed file[¶](#optional-saving-the-feed-file)

If you have a large dataset, you can optionally save the file and feed it using the Vespa CLI, which is more performant than the pyvespa client. See [Feeding to Vespa Cloud](https://vespa-engine.github.io/pyvespa/examples/feed_performance_cloud.md) for more details.

Uncomment the cell below if you want to save the feed file.

```
# os.makedirs("output", exist_ok=True)
# with open("output/vespa_feed.jsonl", "w") as f:
#     vespa_feed_to_save = []
#     for page in vespa_feed:
#         document_id = page["id"]
#         put_id = f"id:{VESPA_APPLICATION_NAME}:{VESPA_SCHEMA_NAME}::{document_id}"
#         vespa_feed_to_save.append({"put": put_id, "fields": page["fields"]})
#     json.dump(vespa_feed_to_save, f)
```

## 7. Prepare Vespa Application[¶](#7-prepare-vespa-application)

### Configuring the application package[¶](#configuring-the-application-package)

[PyVespa](https://vespa-engine.github.io/pyvespa/) helps us build the [Vespa application package](https://docs.vespa.ai/en/application-packages.html). A Vespa application package consists of configuration files, schemas, models, and code (plugins).

Here are some of the key components of this application package:

1. We store images (and a scaled-down version of each image) as a `raw` field.
1. We store the binarized ColPali embeddings as a `tensor` field.
1. We store the queries and questions as `array<string>` fields.
1. We define 3 different rank profiles:
   - `default`: Uses BM25 for first-phase ranking and MaxSim for second-phase ranking.
   - `bm25`: Uses `bm25(title) + bm25(text)` (first phase only) for ranking.
   - `retrieval-and-rerank`: Uses `nearestNeighbor` of the query embedding over the document embeddings for retrieval, `max_sim_binary` for first-phase ranking, and `max_sim` over the float query embeddings for second-phase ranking. Vespa's [phased ranking](https://docs.vespa.ai/en/phased-ranking.html) allows us to use different ranking strategies for retrieval and reranking, to choose attractive trade-offs between latency, cost, and accuracy.
1. We also calculate the dot product between the query and each document, so that it can be returned with the results, to generate the similarity maps, which show which patches of the image are most similar to the query token embeddings.

First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their types.
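The two-phase MaxSim scoring used by these profiles can be sketched in plain numpy before reading the Vespa tensor expressions. This is a toy example with random vectors, not real embeddings; it shows the structure of the computation only:

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 128))   # 4 query-token embeddings (float)
d = rng.normal(size=(30, 128))  # 30 patch embeddings for one page (float)

# Second phase (max_sim): dot product of every query token with every patch,
# take the best-matching patch per token, then sum over tokens
sims = q @ d.T                  # shape (tokens, patches)
max_sim = sims.max(axis=1).sum()

# First phase (max_sim_binary): same structure on binarized vectors,
# with 1 / (1 + hamming distance) as the per-pair similarity
qb = np.packbits(q > 0, axis=1)           # (4, 16) bytes per token
db = np.packbits(d > 0, axis=1)           # (30, 16) bytes per patch
hamming = np.unpackbits(qb[:, None, :] ^ db[None, :, :], axis=2).sum(axis=2)
max_sim_binary = (1.0 / (1.0 + hamming)).max(axis=1).sum()
```

The binary first phase is a cheap approximation used to select candidates, which the float `max_sim` then reranks.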
```
colpali_schema = Schema(
    name=VESPA_SCHEMA_NAME,
    document=Document(
        fields=[
            Field(
                name="id",
                type="string",
                indexing=["summary", "index"],
                match=["word"],
            ),
            Field(name="url", type="string", indexing=["summary", "index"]),
            Field(name="year", type="int", indexing=["summary", "attribute"]),
            Field(
                name="title",
                type="string",
                indexing=["summary", "index"],
                match=["text"],
                index="enable-bm25",
            ),
            Field(name="page_number", type="int", indexing=["summary", "attribute"]),
            Field(name="blur_image", type="raw", indexing=["summary"]),
            Field(name="full_image", type="raw", indexing=["summary"]),
            Field(
                name="text",
                type="string",
                indexing=["summary", "index"],
                match=["text"],
                index="enable-bm25",
            ),
            Field(
                name="embedding",
                type="tensor<int8>(patch{}, v[16])",
                indexing=[
                    "attribute",
                    "index",
                ],
                ann=HNSW(
                    distance_metric="hamming",
                    max_links_per_node=32,
                    neighbors_to_explore_at_insert=400,
                ),
            ),
            Field(
                name="questions",
                type="array<string>",
                indexing=["summary", "attribute"],
                summary=Summary(fields=["matched-elements-only"]),
            ),
            Field(
                name="queries",
                type="array<string>",
                indexing=["summary", "attribute"],
                summary=Summary(fields=["matched-elements-only"]),
            ),
        ]
    ),
    fieldsets=[
        FieldSet(
            name="default",
            fields=["title", "text"],
        ),
    ],
    document_summaries=[
        DocumentSummary(
            name="default",
            summary_fields=[
                Summary(
                    name="text",
                    fields=[("bolding", "on")],
                ),
                Summary(
                    name="snippet",
                    fields=[("source", "text"), "dynamic"],
                ),
            ],
            from_disk=True,
        ),
        DocumentSummary(
            name="suggestions",
            summary_fields=[
                Summary(name="questions"),
            ],
            from_disk=True,
        ),
    ],
)

# Define similarity functions used in all rank profiles
mapfunctions = [
    Function(
        name="similarities",  # computes similarity scores between each query token and image patch
        expression="""
            sum(
                query(qt) * unpack_bits(attribute(embedding)), v
            )
        """,
    ),
    Function(
        name="normalized",  # normalizes the similarity scores to [-1, 1]
        expression="""
            (similarities - reduce(similarities, min)) / (reduce((similarities - reduce(similarities, min)), max)) * 2 - 1
        """,
    ),
    Function(
        name="quantized",  # quantizes the normalized similarity scores to signed 8-bit integers [-128, 127]
        expression="""
            cell_cast(normalized * 127.999, int8)
        """,
    ),
]

# Define the 'bm25' rank profile
bm25 = RankProfile(
    name="bm25",
    inputs=[("query(qt)", "tensor<float>(querytoken{}, v[128])")],
    first_phase="bm25(title) + bm25(text)",
    functions=mapfunctions,
)


# A function to create an inherited rank profile which also returns quantized similarity scores
def with_quantized_similarity(rank_profile: RankProfile) -> RankProfile:
    return RankProfile(
        name=f"{rank_profile.name}_sim",
        first_phase=rank_profile.first_phase,
        inherits=rank_profile.name,
        summary_features=["quantized"],
    )


colpali_schema.add_rank_profile(bm25)
colpali_schema.add_rank_profile(with_quantized_similarity(bm25))

# Update the 'colpali' rank profile
input_query_tensors = []
MAX_QUERY_TERMS = 64
for i in range(MAX_QUERY_TERMS):
    input_query_tensors.append((f"query(rq{i})", "tensor<int8>(v[16])"))

input_query_tensors.extend(
    [
        ("query(qt)", "tensor<float>(querytoken{}, v[128])"),
        ("query(qtb)", "tensor<int8>(querytoken{}, v[16])"),
    ]
)

colpali = RankProfile(
    name="colpali",
    inputs=input_query_tensors,
    first_phase="max_sim_binary",
    second_phase=SecondPhaseRanking(expression="max_sim", rerank_count=10),
    functions=mapfunctions
    + [
        Function(
            name="max_sim",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(embedding)), v
                        ),
                        max, patch
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="max_sim_binary",
            expression="""
                sum(
                    reduce(
                        1 / (1 + sum(
                            hamming(query(qtb), attribute(embedding)), v)
                        ),
                        max, patch
                    ),
                    querytoken
                )
            """,
        ),
    ],
)
colpali_schema.add_rank_profile(colpali)
colpali_schema.add_rank_profile(with_quantized_similarity(colpali))

# Update the 'hybrid' rank profile
hybrid = RankProfile(
    name="hybrid",
    inputs=input_query_tensors,
    first_phase="max_sim_binary",
    second_phase=SecondPhaseRanking(
        expression="max_sim + 2 * (bm25(text) + bm25(title))", rerank_count=10
    ),
    functions=mapfunctions
    + [
        Function(
            name="max_sim",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(embedding)), v
                        ),
                        max, patch
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="max_sim_binary",
            expression="""
                sum(
                    reduce(
                        1 / (1 + sum(
                            hamming(query(qtb), attribute(embedding)), v)
                        ),
                        max, patch
                    ),
                    querytoken
                )
            """,
        ),
    ],
)
colpali_schema.add_rank_profile(hybrid)
colpali_schema.add_rank_profile(with_quantized_similarity(hybrid))
```

### Configuring the `services.xml`[¶](#configuring-the-servicesxml)

[services.xml](https://docs.vespa.ai/en/reference/services.html) is the primary configuration file for a Vespa application, with a plethora of options to configure the application. Since `pyvespa` version `0.50.0`, these configuration options are also available in `pyvespa`. See [Pyvespa - Advanced configuration](https://vespa-engine.github.io/pyvespa/advanced-configuration.md) for more details. (Note that this configuration is optional, and pyvespa will use basic defaults if you opt out.)

We will use the advanced configuration to set up [dynamic snippets](https://docs.vespa.ai/en/document-summaries.html#dynamic-snippets). This allows us to highlight matched terms in the search results and generate a `snippet` to display, rather than the full text of the document.
```
from vespa.configuration.services import (
    services,
    container,
    search,
    document_api,
    document_processing,
    clients,
    client,
    config,
    content,
    redundancy,
    documents,
    node,
    certificate,
    token,
    document,
    nodes,
)
from vespa.configuration.vt import vt
from vespa.package import ServicesConfiguration

service_config = ServicesConfiguration(
    application_name=VESPA_APPLICATION_NAME,
    services_config=services(
        container(
            search(),
            document_api(),
            document_processing(),
            clients(
                client(
                    certificate(file="security/clients.pem"),
                    id="mtls",
                    permissions="read,write",
                ),
                client(
                    token(id=f"{VESPA_TOKEN_ID}"),
                    id="token_write",
                    permissions="read,write",
                ),
            ),
            config(
                vt("tag")(
                    vt("bold")(
                        vt("open", "<strong>"),
                        vt("close", "</strong>"),
                    ),
                    vt("separator", "..."),
                ),
                name="container.qr-searchers",
            ),
            id=f"{VESPA_APPLICATION_NAME}_container",
            version="1.0",
        ),
        content(
            redundancy("1"),
            documents(document(type="pdf_page", mode="index")),
            nodes(node(distribution_key="0", hostalias="node1")),
            config(
                vt("max_matches", "2", replace_underscores=False),
                vt("length", "1000"),
                vt("surround_max", "500", replace_underscores=False),
                vt("min_length", "300", replace_underscores=False),
                name="vespa.config.search.summary.juniperrc",
            ),
            id=f"{VESPA_APPLICATION_NAME}_content",
            version="1.0",
        ),
        version="1.0",
    ),
)
```

```
# Create the Vespa application package
vespa_application_package = ApplicationPackage(
    name=VESPA_APPLICATION_NAME,
    schema=[colpali_schema],
    services_config=service_config,
)
```

## 8. Deploy Vespa Application[¶](#8-deploy-vespa-application)

```
# This is only needed for CI.
VESPA_TEAM_API_KEY = os.getenv("VESPA_TEAM_API_KEY", None)
```

```
vespa_cloud = VespaCloud(
    tenant=VESPA_TENANT_NAME,
    application=VESPA_APPLICATION_NAME,
    key_content=VESPA_TEAM_API_KEY,
    application_package=vespa_application_package,
)

# Deploy the application
vespa_cloud.deploy()

# Output the endpoint URL
endpoint_url = vespa_cloud.get_token_endpoint()
print(f"Application deployed. Token endpoint URL: {endpoint_url}")
```

Make sure to take note of the token `endpoint_url`. You need to put this in your `.env` file for your web application - `VESPA_APP_TOKEN_URL=https://abcd.vespa-app.cloud` - to access the Vespa application from your web application.

## 9. Feed Data to Vespa[¶](#9-feed-data-to-vespa)

We will need the `endpoint_url` and the `colpalidemo_write` token to feed the data to the Vespa application.

```
# Instantiate Vespa connection using token
app = Vespa(url=endpoint_url, vespa_cloud_secret_token=VESPA_CLOUD_SECRET_TOKEN)
app.get_application_status()
```

Now, let us feed the data to Vespa. If you have a large dataset, you could also do this asynchronously with `feed_async_iterable()`; see [Feeding Vespa cloud](https://vespa-engine.github.io/pyvespa/examples/feed_performance_cloud.md) for a detailed comparison.

```
def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(
            f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}"
        )


# Feed data into Vespa synchronously
app.feed_iterable(vespa_feed, schema=VESPA_SCHEMA_NAME, callback=callback)
```

## 10. Test a query to the Vespa application[¶](#10-test-a-query-to-the-vespa-application)

For now, we will just run a query with the default rank profile. We will need a utility function to generate embeddings for the query, and pass these to Vespa to use for calculating MaxSim.
In the web application, we also provide a function to generate binary embeddings, allowing the user to choose different rank profiles at query time.

```
query = "Price development in Technology sector from April 2023?"
```

```
def get_q_embs_vespa_format(query: str):
    inputs = processor.process_queries([query]).to(model.device)
    with torch.no_grad():
        embeddings_query = model(**inputs)
    q_embs = embeddings_query.to("cpu")[0]  # Extract the single embedding
    return {idx: emb.tolist() for idx, emb in enumerate(q_embs)}
```

```
q_emb = get_q_embs_vespa_format(query)
```

```
with app.syncio() as sess:
    response = sess.query(
        body={
            "yql": (
                f"select id, url, title, year, full_image, quantized from {VESPA_SCHEMA_NAME} where userQuery();"
            ),
            "ranking": "default",
            "query": query,
            "timeout": "10s",
            "hits": 3,
            "input.query(qt)": q_emb,
            "presentation.timing": True,
        }
    )
```

```
assert len(response.json["root"]["children"]) == 3
```

Great. You have now deployed the Vespa application, fed the data to it, and made sure you can query it using the Vespa endpoint and a token.
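For intuition, MaxSim late-interaction scoring sums, for each query token embedding, the best dot product over all document patch embeddings. The sketch below is a toy pure-Python illustration of the scoring idea only (the tiny 2-dimensional vectors are made up; this is not Vespa's implementation):

```python
# Toy MaxSim: for each query token vector, take the max dot product
# over all document patch vectors, then sum over query tokens.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def max_sim(query_embs, doc_embs):
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)


query_embs = [[1.0, 0.0], [0.0, 1.0]]  # two query token vectors
doc_embs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # three document patch vectors

print(round(max_sim(query_embs, doc_embs), 2))  # best matches 0.9 and 0.8 -> 1.7
```

In the actual application, this computation happens inside Vespa's rank profile, using the query tensor we pass as `input.query(qt)`.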
### Saving the generated key/cert files[¶](#saving-the-generated-keycert-files)

A key and cert file are generated for you as an alternative to using tokens for authentication. We advise you to save these files in a secure location in case you want to use them for authentication in the future.

```
key_path = Path(
    f"~/.vespa/{VESPA_TENANT_NAME}.{VESPA_APPLICATION_NAME}.default/data-plane-private-key.pem"
).expanduser()
cert_path = Path(
    f"~/.vespa/{VESPA_TENANT_NAME}.{VESPA_APPLICATION_NAME}.default/data-plane-public-cert.pem"
).expanduser()
assert key_path.exists() and cert_path.exists()
```

## 11. Deploying your web app[¶](#11-deploying-your-web-app)

To deploy a frontend that lets users interact with the Vespa application, you can clone the sample app from the [sample-apps repo](https://github.com/vespa-engine/sample-apps/blob/master/visual-retrieval-colpali/README.md). It includes instructions for running and connecting your web application to your Vespa app.

```
!git clone --depth 1 --filter=blob:none --sparse https://github.com/vespa-engine/sample-apps.git src && cd src && git sparse-checkout set visual-retrieval-colpali
```

Now you have the code for the web app in your `src/visual-retrieval-colpali` directory.
```
os.listdir("src/visual-retrieval-colpali")
```

### Setting environment variables for your web app[¶](#setting-environment-variables-for-your-web-app)

Now, you need to set the following variables in the `src/visual-retrieval-colpali/.env.example` file:

```
VESPA_APP_TOKEN_URL=https://abcde.z.vespa-app.cloud # Your token endpoint URL you got after deploying your Vespa app.
VESPA_CLOUD_SECRET_TOKEN=vespa_cloud_xxxxxxxx # The value of the token you created in this notebook.
GEMINI_API_KEY=your_api_key # The same as GOOGLE_API_KEY in this notebook.
HF_TOKEN=hf_xxxx # If you want to deploy your web app to Hugging Face Spaces - https://huggingface.co/settings/tokens
```

After that, rename the file to `.env`.

```
# Rename src/visual-retrieval-colpali/.env.example to .env
os.rename(
    "src/visual-retrieval-colpali/.env.example", dst="src/visual-retrieval-colpali/.env"
)
```

You're now ready to spin up your web app locally, and to deploy it to Hugging Face Spaces if you want. Navigate to the `src/visual-retrieval-colpali/` directory and follow the instructions in the `README.md` to continue. 🚀

## Cleanup[¶](#cleanup)

As this notebook runs in CI, we delete the Vespa application after running the notebook. DO NOT run the cell below unless you are sure you want to delete the Vespa application.

```
if os.getenv("CI", "false") == "true":
    vespa_cloud.delete()
```

# Scalable Asymmetric Retrieval with Voyage AI Embeddings in Vespa[¶](#scalable-asymmetric-retrieval-with-voyage-ai-embeddings-in-vespa)

The [Voyage 4 model family](https://blog.voyageai.com/2026/01/15/voyage-4/) offers state-of-the-art embedding quality across a range of model sizes.
Vespa recently added an integration that allows for seamless embedding through Voyage's API. This notebook demonstrates an **asymmetric retrieval** pattern, combining this API-based integration with a Vespa-local open-source model:

- **Indexing**: Use `voyage-4-large` (API-based, highest quality) to embed documents once via Vespa's [voyage-ai-embedder](https://docs.vespa.ai/en/reference/rag/embedding.html#voyageai-embedder).
- **Querying**: Use `voyage-4-nano` (open-source, runs locally on the Vespa container) via the [hugging-face-embedder](https://docs.vespa.ai/en/reference/rag/embedding.html#hugging-face-embedder) for zero-cost, low-latency queries.

We combine [binary embeddings](https://docs.vespa.ai/en/rag/binarizing-vectors.html) for fast first-phase retrieval with **float reranking** for accuracy.

Relevant resources:

- [Vespa embedding documentation](https://docs.vespa.ai/en/rag/embedding.html)
- [Embedding Tradeoffs, Quantified](https://blog.vespa.ai/embedding-tradeoffs-quantified/) — benchmarks of voyage-4-nano-int8 and other models on Vespa
- [Nearest Neighbor Search](https://docs.vespa.ai/en/querying/nearest-neighbor-search.html)

```
!pip3 install -U pyvespa vespacli
```

## Why Asymmetric Retrieval?[¶](#why-asymmetric-retrieval)

Embedding documents and queries with the same API-based model works well, but at high query volumes the cost of embedding every query adds up. Asymmetric retrieval eliminates this cost entirely.

### The asymmetric insight[¶](#the-asymmetric-insight)

- **Documents are embedded once** at indexing time. Use the best model (`voyage-4-large`) for maximum quality.
- **Queries happen on every search**. Use a fast, local model (`voyage-4-nano`) with zero API cost and no rate limits.

### Example: 10K QPS[¶](#example-10k-qps)

At 10,000 queries/sec with ~30-token queries, that's ~18M tokens per minute. Even at $0.02 per 1M tokens, this adds up to **~$15.5K/month** in embedding costs alone.
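The figures above can be sanity-checked with a few lines of arithmetic (the QPS, tokens-per-query, and price are the stated assumptions, not measurements):

```python
# Back-of-envelope query-embedding cost (assumed: 10k QPS, 30 tokens/query,
# $0.02 per 1M tokens, 30-day month).
qps = 10_000
tokens_per_query = 30
price_per_million_tokens = 0.02

tokens_per_minute = qps * tokens_per_query * 60
tokens_per_month = tokens_per_minute * 60 * 24 * 30
monthly_cost = tokens_per_month / 1_000_000 * price_per_million_tokens

print(f"{tokens_per_minute / 1e6:.0f}M tokens/minute")  # 18M tokens/minute
print(f"${monthly_cost:,.0f}/month")  # $15,552/month
```

Exact totals scale linearly with each assumption, so doubling either query length or price doubles the monthly bill.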
Running `voyage-4-nano` locally on the Vespa container reduces this to **$0/month** with single-digit ms latency — the model runs as part of the serving infrastructure you're already paying for.

The `voyage-4-nano` model from the same Voyage 4 family produces embeddings in the same vector space as `voyage-4-large`, making cross-model similarity meaningful.

### voyage-4-nano-int8 Quality Benchmarks[¶](#voyage-4-nano-int8-quality-benchmarks)

From [Embedding Tradeoffs, Quantified](https://blog.vespa.ai/embedding-tradeoffs-quantified/), benchmarked on an AWS c7g.2xlarge instance: the model is 332 MB and supports a 32,768-token context, with an embedding latency of 12.6-15.0 ms running on CPU. It also supports [Matryoshka Representation Learning (MRL)](https://arxiv.org/abs/2205.13147) for flexible dimensionality.

## Define the Vespa Schema[¶](#define-the-vespa-schema)

We define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with two document fields (`id`, `text`) and two synthetic embedding fields computed at indexing time by the `voyage-4-large` embedder:

- `embedding_float`: Half-precision (bfloat16) embeddings (2048 dimensions) for accurate reranking. Uses the [`paged` attribute](https://docs.vespa.ai/en/content/attributes.html#paged-attributes) to keep data on disk, reducing memory cost.
- `embedding_binary`: [Binary (int8) embeddings](https://docs.vespa.ai/en/rag/binarizing-vectors.html) (2048/8 = 256 bytes) for fast [hamming-distance](https://docs.vespa.ai/en/reference/schema-reference.html#distance-metric) retrieval.
```
from vespa.package import Schema, Document, Field

SCHEMA_NAME = "doc"
FEED_MODEL_ID = "voyage-4-large"
QUERY_MODEL_ID = "voyage-4-nano-int8"

schema = Schema(
    name=SCHEMA_NAME,
    document=Document(
        fields=[
            Field(name="id", type="string", indexing=["summary", "attribute"]),
            Field(name="text", type="string", indexing=["index", "summary"]),
        ]
    ),
)

# Synthetic fields: computed from 'text' at indexing time using the voyage-4-large embedder.
# These are not part of the document type, so is_document_field=False.
schema.add_fields(
    Field(
        name="embedding_float",
        type="tensor<bfloat16>(x[2048])",
        indexing=["input text", f"embed {FEED_MODEL_ID}", "attribute"],
        attribute=["distance-metric: prenormalized-angular", "paged"],
        is_document_field=False,
    )
)
schema.add_fields(
    Field(
        name="embedding_binary",
        type="tensor<int8>(x[256])",  # 2048 bits / 8 = 256 bytes
        indexing=["input text", f"embed {FEED_MODEL_ID}", "attribute"],
        attribute=["distance-metric: hamming"],
        is_document_field=False,
    )
)
```

## Rank Profile: Binary Retrieval with Float Reranking[¶](#rank-profile-binary-retrieval-with-float-reranking)

This [rank profile](https://docs.vespa.ai/en/ranking.html) implements a two-phase strategy:

1. **First phase**: Hamming distance on binary embeddings. This is extremely fast and scans many candidates cheaply.
1. **Second phase**: Cosine closeness on full float embeddings. This is more accurate and applied only to the top candidates from phase one.

The query inputs (`q_float`, `q_bin`) are produced by the local `voyage-4-nano` model at query time. The [`rerank_count`](https://docs.vespa.ai/en/reference/schema-reference.html#rerank-count) controls how many first-phase candidates are rescored in the second phase.
```
from vespa.package import RankProfile, Function, SecondPhaseRanking

RERANK_COUNT = 2000

schema.add_rank_profile(
    RankProfile(
        name="binary-with-rerank",
        inputs=[
            ("query(q_float)", "tensor<float>(x[2048])"),
            ("query(q_bin)", "tensor<int8>(x[256])"),
        ],
        functions=[
            Function(
                name="binary_closeness",
                expression="1 - (distance(field, embedding_binary) / 2048)",
            ),
            Function(
                name="float_closeness",
                expression="reduce(query(q_float) * attribute(embedding_float), sum, x)",
            ),
        ],
        first_phase="binary_closeness",
        second_phase=SecondPhaseRanking(
            expression="float_closeness", rerank_count=RERANK_COUNT
        ),
        summary_features=[
            "binary_closeness",
            "float_closeness",
        ],
    )
)
```

### Why `paged` for float embeddings?[¶](#why-paged-for-float-embeddings)

The `embedding_float` field uses the [`paged` attribute](https://docs.vespa.ai/en/content/attributes.html#paged-attributes), which lets Vespa page attribute data from memory to disk. This is critical for keeping memory costs manageable.
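For intuition about what the first phase computes, here is an illustrative pure-Python sketch (not Vespa's internal code) of sign-bit binarization (each float dimension becomes one bit, packed eight per byte) and the `1 - hamming/2048` closeness used above:

```python
import random

DIMS = 2048  # matches the float embedding dimensionality above


def binarize(vec):
    """Pack a float vector into DIMS // 8 bytes: one sign bit per dimension."""
    packed = []
    for i in range(0, len(vec), 8):
        byte = 0
        for bit, value in enumerate(vec[i : i + 8]):
            if value > 0:
                byte |= 1 << (7 - bit)
        packed.append(byte)
    return bytes(packed)


def binary_closeness(a, b):
    hamming = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return 1 - hamming / DIMS


random.seed(0)
doc = [random.uniform(-1, 1) for _ in range(DIMS)]
query = [v + random.uniform(-0.5, 0.5) for v in doc]  # a noisy copy of doc

print(len(binarize(doc)))  # 256 bytes, as in the embedding_binary field
print(binary_closeness(binarize(doc), binarize(query)))  # high (noisy copy)
```

In Vespa the packed bits live in the int8 `embedding_binary` attribute and the hamming distance is computed by the engine; the sketch only mirrors the arithmetic.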
**Napkin math** — memory per document at 2048 dimensions:

| Representation | Type | Bytes/vector | 1M docs | 10M docs | 100M docs |
| ------------------ | --------------------- | ------------ | -------- | -------- | --------- |
| `embedding_float` | `bfloat16` (16-bit) | 4,096 B | ~3.8 GB | ~38 GB | ~381 GB |
| `embedding_binary` | `int8` (1-bit packed) | 256 B | ~0.24 GB | ~2.4 GB | ~24 GB |

The float embeddings are **16x larger** than the binary ones. Without `paged`, all float vectors must fit in memory. At 100M documents that's ~381 GB of RAM just for one field. With `paged`, the OS kernel manages what's in memory based on access patterns — only the vectors actually touched during reranking need to be resident.

This works well here because **float vectors are only accessed during second-phase reranking**, not during first-phase retrieval. The first phase uses only the compact binary embeddings (always in memory), and the second phase touches at most `rerank-count` float vectors per query per content node.

> **Important**: Do not combine `paged` with [HNSW indexing](https://docs.vespa.ai/en/approximate-nn-hnsw.html), as HNSW requires random access across the full graph during search, which would cause excessive disk I/O. Here we use `paged` safely because `embedding_float` has no HNSW index — it's accessed only via direct attribute lookups during reranking.

### Why `rerank-count` matters with `paged`[¶](#why-rerank-count-matters-with-paged)

The [`rerank-count`](https://docs.vespa.ai/en/ranking/phased-ranking.html) parameter (set to 2000 above) controls how many first-phase candidates are re-scored in the second phase **per content node**. This is the knob that bounds cost:

- **Too low** (e.g., 50): Fast, but the cheap binary first-phase may miss relevant documents that float reranking would have rescued. Recall suffers.
- **Too high** (e.g., 50,000): More float vectors paged in from disk per query, increasing latency and disk I/O.
The quality gains diminish quickly — most relevant documents are already in the top few thousand candidates.
- **2000**: A reasonable default that balances recall, latency, and disk I/O. At 2000 candidates x 4,096 bytes per vector = ~8 MB of float data accessed per query per node — easily serviceable from the OS page cache for any reasonable query rate.

The combination of `paged` + bounded `rerank-count` is what makes this architecture work: you get the storage efficiency of keeping float vectors on disk, with the performance guarantee that each query only touches a small, predictable number of them.

## Services Configuration[¶](#services-configuration)

We configure two [embedder components](https://docs.vespa.ai/en/rag/embedding.html):

1. **`voyage-4-large`** ([voyage-ai-embedder](https://docs.vespa.ai/en/reference/rag/embedding.html#voyageai-embedder)): Calls the Voyage AI API. Used at document indexing time to produce high-quality embeddings. Requires an API key stored in Vespa Cloud's [secret store](https://cloud.vespa.ai/en/security/secret-store).
1. **`voyage-4-nano-int8`** ([hugging-face-embedder](https://docs.vespa.ai/en/reference/rag/embedding.html#hugging-face-embedder)): Runs locally on the Vespa container as an ONNX model. Used at query time for zero-cost, low-latency embedding. No API key needed.

### Batching for throughput[¶](#batching-for-throughput)

The `voyage-ai-embedder` supports [dynamic batching](https://docs.vespa.ai/en/reference/rag/embedding.html#voyageai-embedder) of concurrent embedding requests into single API calls. We configure `max-size: 20` (up to 20 documents per batch) and `max-delay: 20ms` (maximum wait time before sending a partial batch). Since each batch counts as a single API call, this can reduce the number of calls by up to 20x — making it much easier to stay within the RPM (Requests Per Minute) limit of your Voyage API key.
Combined with the increased `document-processing` threadpool (512 threads), this enables high-throughput parallel embedding at index time.

### Quantization[¶](#quantization)

The `voyage-ai-embedder` also supports server-side `quantization` (with values `auto`, `float`, `int8`, or `binary`). When set to `auto` (the default), Vespa infers the appropriate quantization from the destination tensor's cell type and dimensions — so our `tensor<bfloat16>` float field and `tensor<int8>` binary field are handled automatically.

For this notebook we rely on `auto` quantization, which gives us bfloat16 float embeddings paged to disk for accurate reranking, and compact binary embeddings in memory for fast retrieval.

The `ServicesConfiguration` class below uses pyvespa's type-safe Python API for generating [`services.xml`](https://docs.vespa.ai/en/reference/services.html). For a deeper dive into all the configuration options available, see the [advanced configuration](https://vespa-engine.github.io/pyvespa/advanced-configuration.ipynb) notebook.
```
from vespa.package import ServicesConfiguration
from vespa.configuration.services import (
    services,
    batching,
    container,
    content,
    search,
    document_api,
    document_processing,
    component,
    components,
    model,
    api_key_secret_ref,
    dimensions,
    documents,
    document,
    nodes,
    node,
    resources,
    secrets,
    threadpool,
    threads,
    redundancy,
    transformer_model,
    tokenizer_model,
    pooling_strategy,
    normalize,
    prepend,
    max_tokens,
    query,
)
from vespa.configuration.vt import vt

APPLICATION_NAME = "voyageai"

# Replace with your Vespa Cloud secret store vault and secret name
SECRET_STORE_VAULT_NAME = "pyvespa-testvault"
VOYAGE_SECRET_NAME = "voyage_api_key"

services_config = ServicesConfiguration(
    application_name=APPLICATION_NAME,
    services_config=services(
        container(id=f"{APPLICATION_NAME}_container", version="1.0")(
            secrets(
                vt(
                    tag="apiKey",
                    vault=SECRET_STORE_VAULT_NAME,
                    name=VOYAGE_SECRET_NAME,
                )
            ),
            search(),
            document_api(),
            # 256 threads per vCPU = 512 total with 2 vCPUs
            document_processing(threadpool(threads("256"))),
            components(
                # Local model for query-time embedding (zero API cost)
                component(id="voyage-4-nano-int8", type_="hugging-face-embedder")(
                    transformer_model(model_id="voyage-4-nano-int8"),
                    tokenizer_model(model_id="voyage-4-nano-vocab"),
                    max_tokens("32768"),
                    pooling_strategy("mean"),
                    normalize("true"),
                    prepend(
                        query(
                            "Represent the query for retrieving supporting documents: "
                        )
                    ),
                ),
                # API-based model for index-time embedding (highest quality)
                component(id="voyage-4-large", type_="voyage-ai-embedder")(
                    model("voyage-4-large"),
                    api_key_secret_ref("apiKey"),
                    dimensions("2048"),
                    batching(max_size="20", max_delay="20ms"),
                ),
            ),
            nodes(count="1", required="true")(
                resources(vcpu="2", memory="8Gb", disk="50Gb", architecture="arm64")
            ),
        ),
        content(id=f"{APPLICATION_NAME}_content", version="1.0")(
            redundancy("1"),
            documents(document(type_="doc", mode="index")),
            nodes(node(distribution_key="0", hostalias="node1")),
        ),
    ),
)
```

## Create and Deploy the Application Package[¶](#create-and-deploy-the-application-package)
```
from vespa.package import ApplicationPackage

app_package = ApplicationPackage(
    name=APPLICATION_NAME,
    schema=[schema],
    services_config=services_config,
)
```

Deploy to [Vespa Cloud](https://cloud.vespa.ai/en/). Create a tenant at [console.vespa-cloud.com](https://console.vespa-cloud.com/) if you don't have one. Before deploying, you need to configure a secret in the Vespa Cloud secret store with your Voyage AI API key. See [Vespa Cloud secret store](https://cloud.vespa.ai/en/security/secret-store) for instructions.

> Deployments to dev expire after 14 days of inactivity.

```
from vespa.deployment import VespaCloud
from vespa.application import Vespa
import os

tenant_name = "vespa-team"  # Replace with your tenant name

key = os.getenv("VESPA_TEAM_API_KEY", None)
if key is not None:
    key = key.replace(r"\n", "\n")

vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=APPLICATION_NAME,
    # key_content=key,
    application_package=app_package,
)

app: Vespa = vespa_cloud.deploy()
```

```
Setting application... Running: vespa config set application vespa-team.voyageai.default
Setting target cloud... Running: vespa config set target cloud
No api-key found for control plane access. Using access token.
Checking for access token in auth.json...
Access token expired. Please re-authenticate.
Your Device Confirmation code is: RWHK-VXWW
Automatically open confirmation page in your default browser? [Y/n] y
Opened link in your browser: https://login.console.vespa-cloud.com/activate?user_code=RWHK-VXWW
Waiting for login to complete in browser ... done
Success: Logged in
auth.json created at /Users/thomas/.vespa/auth.json
Successfully obtained access token for control plane access.
Deployment started in run 9 of dev-aws-us-east-1c for vespa-team.voyageai. This may take a few minutes the first time.
INFO [13:26:31] Deploying platform version 8.649.29 and application dev build 9 for dev-aws-us-east-1c of default ...
INFO [13:26:31] Using CA signed certificate version 1
INFO [13:26:37] Session 404822 for tenant 'vespa-team' prepared and activated.
INFO [13:27:04] ######## Details for all nodes ########
INFO [13:27:04] h136163a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [13:27:04] --- platform vespa/cloud-tenant-rhel8:8.649.29
INFO [13:27:04] --- logserver-container on port 4080 has not started
INFO [13:27:04] --- metricsproxy-container on port 19092 has not started
INFO [13:27:04] h136163b.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [13:27:04] --- platform vespa/cloud-tenant-rhel8:8.649.29
INFO [13:27:04] --- container-clustercontroller on port 19050 has not started
INFO [13:27:04] --- metricsproxy-container on port 19092 has not started
INFO [13:27:04] h136175a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [13:27:04] --- platform vespa/cloud-tenant-rhel8:8.649.29
INFO [13:27:04] --- container on port 4080 has not started
INFO [13:27:04] --- metricsproxy-container on port 19092 has not started
INFO [13:27:04] h136066a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [13:27:04] --- platform vespa/cloud-tenant-rhel8:8.649.29
INFO [13:27:04] --- storagenode on port 19102 has not started
INFO [13:27:04] --- searchnode on port 19107 has not started
INFO [13:27:04] --- distributor on port 19111 has not started
INFO [13:27:04] --- metricsproxy-container on port 19092 has not started
INFO [13:28:01] Waiting for convergence of 10 services across 4 nodes
INFO [13:28:01] 1 nodes booting
INFO [13:28:01] 1 application services still deploying
DEBUG [13:28:01] h136175a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
DEBUG [13:28:01] --- platform vespa/cloud-tenant-rhel8:8.649.29
DEBUG [13:28:01] --- container on port 4080 has not started
DEBUG [13:28:01] --- metricsproxy-container on port 19092 has config generation 404822, wanted is 404822
INFO [13:28:43] Found endpoints:
INFO [13:28:43] - dev.aws-us-east-1c
INFO [13:28:43] |-- https://ca603d84.b347094a.z.vespa-app.cloud/ (cluster 'voyageai_container')
INFO [13:28:43] Deployment complete!
Only region: aws-us-east-1c available in dev environment.
Found mtls endpoint for voyageai_container
URL: https://ca603d84.b347094a.z.vespa-app.cloud/
Application is up!
```

## Feed Sample Documents[¶](#feed-sample-documents)

We feed a few sample passages. At indexing time, Vespa calls the `voyage-4-large` API to generate both the float and binary embedding representations.

```
from vespa.io import VespaResponse

sample_docs = [
    {
        "id": "1",
        "fields": {
            "id": "1",
            "text": "Retrieval-augmented generation (RAG) combines a retrieval system with a generative language model. The retriever finds relevant passages from a corpus, and the generator uses them as context to produce accurate, grounded answers.",
        },
    },
    {
        "id": "2",
        "fields": {
            "id": "2",
            "text": "Binary quantization reduces embedding storage by representing each dimension as a single bit. While this loses precision, combining binary retrieval with float reranking recovers most of the accuracy at a fraction of the memory cost.",
        },
    },
    {
        "id": "3",
        "fields": {
            "id": "3",
            "text": "Vespa is a fully featured search engine and vector database. It supports real-time indexing, structured and unstructured data, and advanced ranking with multiple retrieval and ranking phases.",
        },
    },
    {
        "id": "4",
        "fields": {
            "id": "4",
            "text": "Asymmetric retrieval uses different models for documents and queries. Documents are embedded once with an expensive, high-quality model, while queries use a smaller, faster model to keep latency low and costs down.",
        },
    },
    {
        "id": "5",
        "fields": {
            "id": "5",
            "text": "The Voyage 4 embedding model family includes voyage-4-large for maximum quality, voyage-4-lite for a balance of cost and quality, and voyage-4-nano as a small open-source model suitable for local deployment.",
        },
    },
]

for doc in sample_docs:
    response: VespaResponse = app.feed_data_point(
        schema=SCHEMA_NAME, data_id=doc["id"], fields=doc["fields"]
    )
    assert (
        response.is_successful()
    ), f"Failed to feed doc {doc['id']}: {response.get_json()}"
    print(f"Fed document {doc['id']}")
```

```
Fed document 1
Fed document 2
Fed document 3
Fed document 4
Fed document 5
```

## Query with Binary Retrieval and Float Reranking[¶](#query-with-binary-retrieval-and-float-reranking)

At query time, Vespa uses the local `voyage-4-nano` model to embed the query text. The [`embed()` function](https://docs.vespa.ai/en/rag/embedding.html#embedding-a-query-text) in the query invokes the local embedder, producing both float and binary query representations.

The retrieval pipeline:

1. [`nearestNeighbor`](https://docs.vespa.ai/en/querying/nearest-neighbor-search.html) on `embedding_binary` scans candidates using fast hamming distance.
1. First-phase ranking scores by `binary_closeness`.
1. Second-phase reranking scores the top candidates by `float_closeness`.

```
from vespa.io import VespaQueryResponse
import vespa.querybuilder as qb
import json

query_text = "How does asymmetric embedding retrieval work?"
response: VespaQueryResponse = app.query(
    yql=str(
        qb.select("*")
        .from_(SCHEMA_NAME)
        .where(
            qb.nearestNeighbor(
                field="embedding_binary",
                query_vector="q_bin",
                annotations={"targetHits": 100},
            )
        )
    ),
    ranking="binary-with-rerank",
    body={
        "input.query(q_bin)": f'embed({QUERY_MODEL_ID}, "{query_text}")',
        "input.query(q_float)": f'embed({QUERY_MODEL_ID}, "{query_text}")',
        "hits": 5,
        "presentation.timing": "true",
    },
)
assert response.is_successful()
for hit in response.hits:
    print(json.dumps(hit, indent=2))
```

```
{
  "fields": {
    "documentid": "id:doc:doc::4",
    "id": "4",
    "sddocname": "doc",
    "summaryfeatures": {
      "binary_closeness": 0.63623046875,
      "float_closeness": 0.5481828630707257,
      "vespa.summaryFeatures.cached": 0.0
    },
    "text": "Asymmetric retrieval uses different models for documents and queries. Documents are embedded once with an expensive, high-quality model, while queries use a smaller, faster model to keep latency low and costs down."
  },
  "id": "id:doc:doc::4",
  "relevance": 0.5481828630707257,
  "source": "voyageai_content"
}
{
  "fields": {
    "documentid": "id:doc:doc::2",
    "id": "2",
    "sddocname": "doc",
    "summaryfeatures": {
      "binary_closeness": 0.607421875,
      "float_closeness": 0.44831951343722665,
      "vespa.summaryFeatures.cached": 0.0
    },
    "text": "Binary quantization reduces embedding storage by representing each dimension as a single bit. While this loses precision, combining binary retrieval with float reranking recovers most of the accuracy at a fraction of the memory cost."
  },
  "id": "id:doc:doc::2",
  "relevance": 0.44831951343722665,
  "source": "voyageai_content"
}
{
  "fields": {
    "documentid": "id:doc:doc::1",
    "id": "1",
    "sddocname": "doc",
    "summaryfeatures": {
      "binary_closeness": 0.58837890625,
      "float_closeness": 0.34075708348829004,
      "vespa.summaryFeatures.cached": 0.0
    },
    "text": "Retrieval-augmented generation (RAG) combines a retrieval system with a generative language model. The retriever finds relevant passages from a corpus, and the generator uses them as context to produce accurate, grounded answers."
  },
  "id": "id:doc:doc::1",
  "relevance": 0.34075708348829004,
  "source": "voyageai_content"
}
{
  "fields": {
    "documentid": "id:doc:doc::5",
    "id": "5",
    "sddocname": "doc",
    "summaryfeatures": {
      "binary_closeness": 0.58154296875,
      "float_closeness": 0.31555799518827143,
      "vespa.summaryFeatures.cached": 0.0
    },
    "text": "The Voyage 4 embedding model family includes voyage-4-large for maximum quality, voyage-4-lite for a balance of cost and quality, and voyage-4-nano as a small open-source model suitable for local deployment."
  },
  "id": "id:doc:doc::5",
  "relevance": 0.31555799518827143,
  "source": "voyageai_content"
}
{
  "fields": {
    "documentid": "id:doc:doc::3",
    "id": "3",
    "sddocname": "doc",
    "summaryfeatures": {
      "binary_closeness": 0.5703125,
      "float_closeness": 0.29142659282264916,
      "vespa.summaryFeatures.cached": 0.0
    },
    "text": "Vespa is a fully featured search engine and vector database.
It supports real-time indexing, structured and unstructured data, and advanced ranking with multiple retrieval and ranking phases."
  },
  "id": "id:doc:doc::3",
  "relevance": 0.29142659282264916,
  "source": "voyageai_content"
}
```

The `summaryfeatures` in each hit show both scoring phases:

- `binary_closeness`: The first-phase hamming-based score (fast, approximate).
- `float_closeness`: The second-phase dot-product score between the query and document float embeddings. Since both embeddings are unit-normalized (`prenormalized-angular`), the dot product equals cosine similarity.

The final `relevance` score is the second-phase float closeness for reranked candidates.

```
response.json["timing"]
```

```
{'querytime': 0.032, 'searchtime': 0.051000000000000004, 'summaryfetchtime': 0.015}
```

## Cleanup[¶](#cleanup)

```
vespa_cloud.delete()
```

# API Reference

- [vespa](https://vespa-engine.github.io/pyvespa/api/vespa/index.md)
- [application](https://vespa-engine.github.io/pyvespa/api/vespa/application.md)
- [deployment](https://vespa-engine.github.io/pyvespa/api/vespa/deployment.md)
- [evaluation](https://vespa-engine.github.io/pyvespa/api/vespa/evaluation/index.md)
- [\_base](https://vespa-engine.github.io/pyvespa/api/vespa/evaluation/_base.md)
- [\_mteb](https://vespa-engine.github.io/pyvespa/api/vespa/evaluation/_mteb.md)
- [exceptions](https://vespa-engine.github.io/pyvespa/api/vespa/exceptions.md)
- [io](https://vespa-engine.github.io/pyvespa/api/vespa/io.md)
- [models](https://vespa-engine.github.io/pyvespa/api/vespa/models.md)
- [package](https://vespa-engine.github.io/pyvespa/api/vespa/package.md)
- [querybuilder](https://vespa-engine.github.io/pyvespa/api/vespa/querybuilder/index.md)
- [builder](https://vespa-engine.github.io/pyvespa/api/vespa/querybuilder/builder/index.md)
- [builder](https://vespa-engine.github.io/pyvespa/api/vespa/querybuilder/builder/builder.md)
- [grouping](https://vespa-engine.github.io/pyvespa/api/vespa/querybuilder/grouping/index.md)
- [grouping](https://vespa-engine.github.io/pyvespa/api/vespa/querybuilder/grouping/grouping.md)
- [throttling](https://vespa-engine.github.io/pyvespa/api/vespa/throttling.md)

## `vespa.application`

### `Vespa(url, port=None, deployment_message=None, cert=None, key=None, vespa_cloud_secret_token=None, output_file=sys.stdout, application_package=None, additional_headers=None)`

Bases: `object`

Establish a connection with an existing Vespa application.

Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `url` | `str` | Vespa endpoint URL. | *required* | | `port` | `int` | Vespa endpoint port. | `None` | | `deployment_message` | `str` | Message returned by Vespa engine after deployment. Used internally by deploy methods. | `None` | | `cert` | `str` | Path to data plane certificate and key file in case the 'key' parameter is None. If 'key' is not None, this should be the path of the certificate file. Typically generated by Vespa-cli with 'vespa auth cert'. | `None` | | `key` | `str` | Path to the data plane key file. Typically generated by Vespa-cli with 'vespa auth cert'. | `None` | | `vespa_cloud_secret_token` | `str` | Vespa Cloud data plane secret token. | `None` | | `output_file` | `str` | Output file to write output messages. | `stdout` | | `application_package` | `str` | Application package definition used to deploy the application. | `None` | | `additional_headers` | `dict` | Additional headers to be sent to the Vespa application.
| `None` |

Example usage

```python
Vespa(url="https://cord19.vespa.ai")  # doctest: +SKIP
Vespa(url="http://localhost", port=8080)
Vespa(url="https://token-endpoint..z.vespa-app.cloud", vespa_cloud_secret_token="your_token")  # doctest: +SKIP
Vespa(url="https://mtls-endpoint..z.vespa-app.cloud", cert="/path/to/cert.pem", key="/path/to/key.pem")  # doctest: +SKIP
Vespa(url="https://mtls-endpoint..z.vespa-app.cloud", cert="/path/to/cert.pem", key="/path/to/key.pem", additional_headers={"X-Custom-Header": "test"})  # doctest: +SKIP
```

#### `application_package`

Get application package definition, if available.

#### `asyncio(connections=1, total_timeout=None, timeout=30.0, client=None, **kwargs)`

Access Vespa asynchronous connection layer. Should be used as a context manager.

Example usage

```python
async with app.asyncio() as async_app:
    response = await async_app.query(body=body)

# passing kwargs with custom timeout
async with app.asyncio(connections=5, timeout=60.0) as async_app:
    response = await async_app.query(body=body)
```

See `VespaAsync` for more details on the parameters.

Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `connections` | `int` | Number of maximum_keepalive_connections. | `1` | | `total_timeout` | `int` | Deprecated. Will be ignored. Use timeout instead. | `None` | | `timeout` | `float \| int \| Timeout` | Timeout for each request, in seconds. | `30.0` | | `client` | `AsyncClient` | Reusable httpx.AsyncClient to use instead of creating a new one. When provided, the caller is responsible for closing the client. | `None` | | `**kwargs` | `dict` | Additional arguments to be passed to the httpx.AsyncClient.
| `{}` |

Returns: | Name | Type | Description | | --- | --- | --- | | `VespaAsync` | `VespaAsync` | Instance of Vespa asynchronous layer. |

#### `get_async_session(connections=1, total_timeout=None, timeout=30.0, **kwargs)`

Return a configured `httpx.AsyncClient` for reuse. The client is created with the same configuration as `VespaAsync` and is HTTP/2 enabled by default. Callers are responsible for closing the client via `await client.aclose()` when finished.

Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `connections` | `int` | Number of logical connections to keep alive. | `1` | | `timeout` | `float \| int \| Timeout` | Timeout for each request, in seconds. | `30.0` | | `**kwargs` | | Additional keyword arguments forwarded to httpx.AsyncClient. | `{}` |

Returns: | Type | Description | | --- | --- | | `AsyncClient` | httpx.AsyncClient: Configured asynchronous HTTP client. |

#### `syncio(connections=8, compress='auto', session=None)`

Access Vespa synchronous connection layer. Should be used as a context manager.

Example usage:

```python
with app.syncio() as sync_app:
    response = sync_app.query(body=body)
```

See `VespaSync` for more details.

Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `connections` | `int` | Number of allowed concurrent connections. | `8` | | `compress` | `Union[str, bool]` | Whether to compress the request body. Defaults to "auto", which will compress if the body is larger than 1024 bytes. | `'auto'` | | `session` | `Client` | Reusable httpr client to utilise for all requests made within the context manager. When provided, the caller is responsible for closing the session.
| `None` |

Returns: | Name | Type | Description | | --- | --- | --- | | `VespaSync` | `VespaSync` | Instance of Vespa synchronous layer. |

#### `get_sync_session(connections=8, compress='auto')`

Return a configured httpr.Client for reuse. The returned client is configured with the same headers, authentication, and mTLS certificates as the VespaSync context manager. Callers are responsible for closing the client when it is no longer needed.

Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `connections` | `int` | Kept for API compatibility (httpr manages pooling). | `8` | | `compress` | `Union[str, bool]` | Whether to compress request bodies. | `'auto'` |

Returns: | Type | Description | | --- | --- | | `Client` | httpr.Client: Configured HTTP client. |

#### `wait_for_application_up(max_wait=300)`

Wait for application endpoint ready (/ApplicationStatus).

Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `max_wait` | `int` | Seconds to wait for the application endpoint. | `300` |

Raises: | Type | Description | | --- | --- | | `RuntimeError` | If not able to reach endpoint within max_wait or the client fails to authenticate. |

Returns: | Type | Description | | --- | --- | | `None` | None |

#### `get_application_status()`

Get application status (/ApplicationStatus).

Returns: | Type | Description | | --- | --- | | `Optional[Response]` | None |

#### `get_model_endpoint(model_id=None)`

Get stateless model evaluation endpoints.

#### `query(body=None, groupname=None, streaming=False, profile=False, **kwargs)`

Send a query request to the Vespa application. Send 'body' containing all the request parameters.

Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `body` | `dict` | Dictionary containing request parameters. | `None` | | `groupname` | `str` | The groupname used with streaming search. | `None` | | `streaming` | `bool` | Whether to use streaming mode (SSE). Defaults to False.
| `False` | | `profile` | `bool` | Add profiling parameters to the query (response may be large). Defaults to False. | `False` | | `**kwargs` | `dict` | Extra Vespa Query API parameters. | `{}` |

Returns: | Type | Description | | --- | --- | | `Union[VespaQueryResponse, Generator[str, None, None]]` | VespaQueryResponse when streaming=False, or a generator of decoded lines when streaming=True. |

#### `feed_data_point(schema, data_id, fields, namespace=None, groupname=None, compress='auto', **kwargs)`

Feed a data point to a Vespa app. Creates a new VespaSync session per call, which adds connection overhead.

Example usage

```python
app = Vespa(url="localhost", port=8080)

data_id = "1"
fields = {
    "field1": "value1",
}
with VespaSync(app) as sync_app:
    response = sync_app.feed_data_point(
        schema="schema_name", data_id=data_id, fields=fields
    )
print(response)
```

Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `schema` | `str` | The schema that we are sending data to. | *required* | | `data_id` | `str` | Unique id associated with this data point. | *required* | | `fields` | `dict` | Dictionary containing all the fields required by the schema. | *required* | | `namespace` | `str` | The namespace that we are sending data to. | `None` | | `groupname` | `str` | The groupname that we are sending data to. | `None` | | `compress` | `Union[str, bool]` | Whether to compress the request body. Defaults to "auto", which will compress if the body is larger than 1024 bytes. | `'auto'` |

Returns: | Name | Type | Description | | --- | --- | --- | | `VespaResponse` | `VespaResponse` | The response of the HTTP POST request.
|

#### `feed_iterable(iter, schema=None, namespace=None, callback=None, operation_type='feed', max_queue_size=1000, max_workers=8, max_connections=16, compress='auto', **kwargs)`

Feed data from an Iterable of Dict with the keys 'id' and 'fields' to be used in the `feed_data_point` function. Uses a queue to feed data in parallel with a thread pool. The result of each operation is forwarded to the user-provided callback function that can process the returned `VespaResponse`.

Example usage

```python
app = Vespa(url="localhost", port=8080)

data = [
    {"id": "1", "fields": {"field1": "value1"}},
    {"id": "2", "fields": {"field1": "value2"}},
]

def callback(response, id):
    print(f"Response for id {id}: {response.status_code}")

app.feed_iterable(data, schema="schema_name", callback=callback)
```

Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `iter` | `Iterable[dict]` | An iterable of Dict containing the keys 'id' and 'fields' to be used in the feed_data_point. Note that this 'id' is only the last part of the full document id, which will be generated automatically by pyvespa. | *required* | | `schema` | `str` | The Vespa schema name that we are sending data to. | `None` | | `namespace` | `str` | The Vespa document id namespace. If no namespace is provided, the schema is used. | `None` | | `callback` | `function` | A callback function to be called on each result. Signature callback(response: VespaResponse, id: str). | `None` | | `operation_type` | `str` | The operation to perform. Defaults to feed. Valid values are feed, update, or delete. | `'feed'` | | `max_queue_size` | `int` | The maximum size of the blocking queue and max in-flight operations.
| `1000` | | `max_workers` | `int` | The maximum number of workers in the threadpool executor. | `8` | | `max_connections` | `int` | The maximum number of persisted connections to the Vespa endpoint. | `16` | | `compress` | `Union[str, bool]` | Whether to compress the request body. Defaults to "auto", which will compress if the body is larger than 1024 bytes. | `'auto'` | | `**kwargs` | `dict` | Additional parameters passed to the respective operation type specific function (\_data_point). | `{}` |

Returns: | Type | Description | | --- | --- | | | None |

#### `feed_async_iterable(iter, schema=None, namespace=None, callback=None, operation_type='feed', max_queue_size=1000, max_workers=64, max_connections=1, **kwargs)`

Feed data asynchronously using httpx.AsyncClient with HTTP/2. Feed from an Iterable of Dict with the keys 'id' and 'fields' to be used in the `feed_data_point` function. The result of each operation is forwarded to the user-provided callback function that can process the returned `VespaResponse`. Prefer this method over `feed_iterable` when the operation is I/O bound from the client side.

Example usage

```python
app = Vespa(url="localhost", port=8080)

data = [
    {"id": "1", "fields": {"field1": "value1"}},
    {"id": "2", "fields": {"field1": "value2"}},
]

def callback(response, id):
    print(f"Response for id {id}: {response.status_code}")

app.feed_async_iterable(data, schema="schema_name", callback=callback)
```

Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `iter` | `Iterable[dict]` | An iterable of Dict containing the keys 'id' and 'fields' to be used in the feed_data_point.
Note that this 'id' is only the last part of the full document id, which will be generated automatically by pyvespa. | *required* | | `schema` | `str` | The Vespa schema name that we are sending data to. | `None` | | `namespace` | `str` | The Vespa document id namespace. If no namespace is provided, the schema is used. | `None` | | `callback` | `function` | A callback function to be called on each result. Signature callback(response: VespaResponse, id: str). | `None` | | `operation_type` | `str` | The operation to perform. Defaults to feed. Valid values are feed, update, or delete. | `'feed'` | | `max_queue_size` | `int` | The maximum number of tasks waiting to be processed. Useful to limit memory usage. Default is 1000. | `1000` | | `max_workers` | `int` | Maximum number of concurrent requests to have in-flight, bound by an asyncio.Semaphore, that needs to be acquired by a submit task. Increase if the server is scaled to handle more requests. | `64` | | `max_connections` | `int` | The maximum number of connections passed to httpx.AsyncClient to the Vespa endpoint. As HTTP/2 is used, only one connection is needed. | `1` | | `**kwargs` | `dict` | Additional parameters passed to the respective operation type-specific function (\_data_point). | `{}` |

Returns: | Type | Description | | --- | --- | | | None |

#### `query_many_async(queries, num_connections=1, max_concurrent=100, adaptive=True, client_kwargs={}, **query_kwargs)`

Execute many queries asynchronously using httpx.AsyncClient. The number of concurrent requests is controlled by the `max_concurrent` parameter. Each query will be retried up to 3 times using an exponential backoff strategy. When adaptive=True (default), an AdaptiveThrottler is used that starts with a conservative concurrency limit and automatically adjusts based on server responses to prevent overloading Vespa with expensive operations.
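The `max_concurrent` cap behaves like a semaphore around each in-flight request. A minimal stdlib sketch of that pattern (illustrative only — `query_many_sketch` and `fake_query` are hypothetical stand-ins, not pyvespa code):

```python
import asyncio

async def query_many_sketch(queries, max_concurrent=100):
    """Run one coroutine per query, never more than max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fake_query(body):
        # Stand-in for a single HTTP query; tracks peak observed concurrency.
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # simulate awaiting I/O
            in_flight -= 1
            return {"query": body, "status": 200}

    results = await asyncio.gather(*(fake_query(q) for q in queries))
    return results, peak

queries = [{"yql": f"select * from sources * where id = {i}"} for i in range(50)]
results, peak = asyncio.run(query_many_sketch(queries, max_concurrent=8))
print(len(results), peak)
```

pyvespa layers retries and the optional AdaptiveThrottler on top of this basic bound, so treat this only as a mental model of what `max_concurrent` limits.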
Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `queries` | `Iterable[dict]` | Iterable of query bodies (dictionaries) to be sent. | *required* | | `num_connections` | `int` | Number of connections to be used in the asynchronous client (uses HTTP/2). Defaults to 1. | `1` | | `max_concurrent` | `int` | Maximum concurrent requests to be sent. Defaults to 100. Be careful about increasing this too much. | `100` | | `adaptive` | `bool` | Use adaptive throttling. Defaults to True. When True, starts with lower concurrency and adjusts based on error rates. | `True` | | `client_kwargs` | `dict` | Additional arguments to be passed to the httpx.AsyncClient. | `{}` | | `**query_kwargs` | `dict` | Additional arguments to be passed to the query method. | `{}` |

Returns: | Type | Description | | --- | --- | | `List[VespaQueryResponse]` | List\[VespaQueryResponse\]: List of VespaQueryResponse objects. |

#### `query_many(queries, num_connections=1, max_concurrent=100, adaptive=True, client_kwargs={}, **query_kwargs)`

Execute many queries from synchronous code. This method is a wrapper around the `query_many_async` method that uses the asyncio event loop to run the coroutine. The number of concurrent requests is controlled by the `max_concurrent` parameter. Each query will be retried up to 3 times using an exponential backoff strategy. When adaptive=True (default), an AdaptiveThrottler is used that starts with a conservative concurrency limit and automatically adjusts based on server responses to prevent overloading Vespa with expensive operations.
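The sync-over-async wrapper described above can be sketched with the stdlib alone (illustrative; `work` and `work_async` are hypothetical stand-ins for `query_many` and `query_many_async`):

```python
import asyncio

async def work_async(bodies):
    # Stand-in for the async implementation: one "response" per query body.
    await asyncio.sleep(0)  # simulate awaiting I/O
    return [{"query": b, "status": 200} for b in bodies]

def work(bodies):
    # Synchronous wrapper: drive the coroutine to completion on an event loop,
    # mirroring how a sync method can delegate to its async counterpart.
    return asyncio.run(work_async(bodies))

responses = work([{"yql": "select * from sources *"}, {"yql": "select * from doc"}])
print(len(responses))
```

Note that `asyncio.run` cannot be called from code that is already running inside an event loop (for example, a coroutine in a Jupyter cell); in that situation use the async variant directly.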
Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `queries` | `Iterable[dict]` | Iterable of query bodies (dictionaries) to be sent. | *required* | | `num_connections` | `int` | Number of connections to be used in the asynchronous client (uses HTTP/2). Defaults to 1. | `1` | | `max_concurrent` | `int` | Maximum concurrent requests to be sent. Defaults to 100. Be careful about increasing this too much. | `100` | | `adaptive` | `bool` | Use adaptive throttling. Defaults to True. When True, starts with lower concurrency and adjusts based on error rates. | `True` | | `client_kwargs` | `dict` | Additional arguments to be passed to the httpx.AsyncClient. | `{}` | | `**query_kwargs` | `dict` | Additional arguments to be passed to the query method. | `{}` |

Returns: | Type | Description | | --- | --- | | `List[VespaQueryResponse]` | List\[VespaQueryResponse\]: List of VespaQueryResponse objects. |

#### `delete_data(schema, data_id, namespace=None, groupname=None, **kwargs)`

Delete a data point from a Vespa app.

Example usage

```python
app = Vespa(url="localhost", port=8080)
response = app.delete_data(schema="schema_name", data_id="1")
print(response)
```

Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `schema` | `str` | The schema that we are deleting data from. | *required* | | `data_id` | `str` | Unique id associated with this data point. | *required* | | `namespace` | `str` | The namespace that we are deleting data from. If no namespace is provided, the schema is used. | `None` | | `groupname` | `str` | The groupname that we are deleting data from.
| `None` | | `**kwargs` | `dict` | Additional arguments to be passed to the HTTP DELETE request. See Vespa API documentation for more details. | `{}` |

Returns: | Name | Type | Description | | --- | --- | --- | | `Response` | `VespaResponse` | The response of the HTTP DELETE request. |

#### `delete_all_docs(content_cluster_name, schema, namespace=None, slices=1, **kwargs)`

Delete all documents associated with the schema. This might block for a long time as it requires sending multiple delete requests to complete.

Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `content_cluster_name` | `str` | Name of content cluster to GET from, or visit. | *required* | | `schema` | `str` | The schema that we are deleting data from. | *required* | | `namespace` | `str` | The namespace that we are deleting data from. If no namespace is provided, the schema is used. | `None` | | `slices` | `int` | Number of slices to use for parallel delete requests. Defaults to 1. | `1` | | `**kwargs` | `dict` | Additional arguments to be passed to the HTTP DELETE request. See Vespa API documentation for more details. | `{}` |

Returns: | Name | Type | Description | | --- | --- | --- | | `Response` | `Response` | The response of the HTTP DELETE request. |

#### `visit(content_cluster_name, schema=None, namespace=None, slices=1, selection='true', wanted_document_count=500, slice_id=None, **kwargs)`

Visit all documents associated with the schema and matching the selection. Will run each slice on a separate thread; for each slice, yields the response for each page.
Example usage

```python
for slice in app.visit(content_cluster_name="content", schema="schema_name", slices=2):
    for response in slice:
        print(response.json)
```

Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `content_cluster_name` | `str` | Name of content cluster to GET from. | *required* | | `schema` | `str` | The schema that we are visiting data from. | `None` | | `namespace` | `str` | The namespace that we are visiting data from. | `None` | | `slices` | `int` | Number of slices to use for parallel GET. | `1` | | `selection` | `str` | Selection expression to filter documents. | `'true'` | | `wanted_document_count` | `int` | Best effort number of documents to retrieve for each request. May contain less if there are not enough documents left. | `500` | | `slice_id` | `int` | Slice id to use for the visit. If None, all slices will be used. | `None` | | `**kwargs` | `dict` | Additional HTTP request parameters. See Vespa API documentation. | `{}` |

Yields: | Type | Description | | --- | --- | | `Generator[VespaVisitResponse, None, None]` | Generator\[Generator[Response]\]: A generator of slices, each containing a generator of responses. |

Raises: | Type | Description | | --- | --- | | `HTTPError` | If an HTTP error occurred. |

#### `get_data(schema, data_id, namespace=None, groupname=None, raise_on_not_found=False, **kwargs)`

Get a data point from a Vespa app.

Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `data_id` | `str` | Unique id associated with this data point.
| *required* | | `schema` | `str` | The schema that we are getting data from. Will attempt to infer schema name if not provided. | *required* | | `namespace` | `str` | The namespace that we are getting data from. If no namespace is provided, the schema is used. | `None` | | `groupname` | `str` | The groupname that we are getting data from. | `None` | | `raise_on_not_found` | `bool` | Raise an exception if the data_id is not found. Default is False. | `False` | | `**kwargs` | `dict` | Additional arguments to be passed to the HTTP GET request. See Vespa API documentation. | `{}` |

Returns: | Name | Type | Description | | --- | --- | --- | | `Response` | `VespaResponse` | The response of the HTTP GET request. |

#### `update_data(schema, data_id, fields, create=False, auto_assign=True, namespace=None, groupname=None, compress='auto', **kwargs)`

Update a data point in a Vespa app.

Example usage

```python
vespa = Vespa(url="localhost", port=8080)

fields = {"mystringfield": "value1", "myintfield": 42}
response = vespa.update_data(schema="schema_name", data_id="id1", fields=fields)

# or, with partial update, setting auto_assign=False
fields = {"myintfield": {"increment": 1}}
response = vespa.update_data(schema="schema_name", data_id="id1", fields=fields, auto_assign=False)

print(response.json)
```

Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `schema` | `str` | The schema that we are updating data. | *required* | | `data_id` | `str` | Unique id associated with this data point. | *required* | | `fields` | `dict` | Dict containing all the fields you want to update. | *required* | | `create` | `bool` | If true, updates to non-existent documents will create an empty document to update.
| `False` | | `auto_assign` | `bool` | Assumes fields-parameter is an assignment operation. If set to false, the fields parameter should be a dictionary including the update operation. | `True` | | `namespace` | `str` | The namespace that we are updating data. If no namespace is provided, the schema is used. | `None` | | `groupname` | `str` | The groupname that we are updating data. | `None` | | `compress` | `Union[str, bool]` | Whether to compress the request body. Defaults to "auto", which will compress if the body is larger than 1024 bytes. | `'auto'` | | `**kwargs` | `dict` | Additional arguments to be passed to the HTTP PUT request. See Vespa API documentation. | `{}` |

Returns: | Name | Type | Description | | --- | --- | --- | | `Response` | `VespaResponse` | The response of the HTTP PUT request. |

#### `get_model_from_application_package(model_name)`

Get model definition from application package, if available.

#### `predict(x, model_id, function_name='output_0')`

Obtain a stateless model evaluation.

Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `x` | `various` | Input where the format depends on the task that the model is serving. | *required* | | `model_id` | `str` | The id of the model used to serve the prediction. | *required* | | `function_name` | `str` | The name of the output function to be evaluated. | `'output_0'` |

Returns: | Name | Type | Description | | --- | --- | --- | | `var` | | Model prediction. |

#### `get_document_v1_path(id, schema=None, namespace=None, group=None, number=None)`

Convert to document v1 path.

Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `id` | `str` | The id of the document. | *required* | | `namespace` | `str` | The namespace of the document.
| `None` | | `schema` | `str` | The schema of the document. | `None` | | `group` | `str` | The group of the document. | `None` | | `number` | `int` | The number of the document. | `None` | Returns: | Name | Type | Description | | ----- | ----- | ------------------------------------- | | `str` | `str` | The path to the document v1 endpoint. | ### `VespaSync(app, pool_maxsize=10, pool_connections=10, compress='auto', session=None)` Bases: `object` Class to handle synchronous requests to Vespa. This class is intended to be used as a context manager. Example usage ```python with VespaSync(app) as sync_app: response = sync_app.query(body=body) print(response) ``` Can also be accessed directly through `Vespa.syncio`: ```python app = Vespa(url="localhost", port=8080) with app.syncio() as sync_app: response = sync_app.query(body=body) ``` **Reusing a client across multiple contexts** (avoids TLS handshake overhead): ```python # Get a reusable client client = app.get_sync_session() try: # Use it multiple times with app.syncio(session=client) as sync_app: response1 = sync_app.query(body=body1) with app.syncio(session=client) as sync_app: response2 = sync_app.query(body=body2) finally: # User is responsible for closing client.close() ``` See also `Vespa.feed_iterable` for a convenient way to feed data synchronously. Parameters: | Name | Type | Description | Default | | ------------------ | ------------------ | ----------------------------------------------------------------------------------------------------------------------------- | ---------- | | `app` | `Vespa` | Vespa app object. | *required* | | `pool_maxsize` | `int` | The maximum number of connections to save in the pool. Defaults to 10. (Note: httpr manages connection pooling automatically) | `10` | | `pool_connections` | `int` | The number of connection pools to cache. Defaults to 10. 
(Note: httpr manages connection pooling automatically) | `10` | | `compress` | `Union[str, bool]` | Whether to compress the request body. Defaults to "auto", which will compress if the body is larger than 1024 bytes. | `'auto'` | | `session` | `Client` | An externally managed httpr client to reuse. When provided, the caller is responsible for closing it. Defaults to None. | `None` | #### `get_model_endpoint(model_id=None)` Get model evaluation endpoints. #### `predict(model_id, function_name, encoded_tokens)` Obtain a stateless model evaluation. Parameters: | Name | Type | Description | Default | | ---------------- | ----- | ------------------------------------------------- | ---------- | | `model_id` | `str` | The id of the model used to serve the prediction. | *required* | | `function_name` | `str` | The name of the output function to be evaluated. | *required* | | `encoded_tokens` | `str` | URL-encoded input to the model. | *required* | Returns: | Type | Description | | ---- | --------------------- | | | The model prediction. | #### `feed_data_point(schema, data_id, fields, namespace=None, groupname=None, **kwargs)` Feed a data point to a Vespa app. Parameters: | Name | Type | Description | Default | | ----------- | ------ | ------------------------------------------------------------------------------------------------------------------------------ | ---------- | | `schema` | `str` | The schema that we are sending data to. | *required* | | `data_id` | `str` | Unique id associated with this data point. | *required* | | `fields` | `dict` | Dict containing all the fields required by the schema. | *required* | | `namespace` | `str` | The namespace that we are sending data to. If no namespace is provided, the schema is used. | `None` | | `groupname` | `str` | The group that we are sending data to. | `None` | | `**kwargs` | `dict` | Additional HTTP request parameters. See: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#request-parameters. 
| `{}` | Returns: | Type | Description | | --------------- | -------------------------------------- | | `VespaResponse` | The response of the HTTP POST request. | Raises: | Type | Description | | ----------- | ---------------- | | `HTTPError` | If one occurred. | #### `query(body=None, groupname=None, streaming=False, profile=False, **kwargs)` Send a query request to the Vespa application. Parameters: | Name | Type | Description | Default | | ----------- | ------ | ------------------------------------------------------------------------------------------------------------------- | ------- | | `body` | `dict` | Dict containing all the request parameters. | `None` | | `groupname` | `str` | The groupname used in streaming search. | `None` | | `streaming` | `bool` | Whether to use streaming mode (SSE). Defaults to False. | `False` | | `profile` | `bool` | Add profiling parameters to the query (response may be large). Defaults to False. | `False` | | `**kwargs` | `dict` | Additional valid Vespa HTTP Query API parameters. See: https://docs.vespa.ai/en/reference/query-api-reference.html. | `{}` | Returns: | Type | Description | | ------------------------------------------------------- | --------------------------------------------------------------------------------------------- | | `Union[VespaQueryResponse, Generator[str, None, None]]` | VespaQueryResponse when streaming=False, or a generator of decoded lines when streaming=True. | Raises: | Type | Description | | ----------- | ---------------- | | `HTTPError` | If one occurred. | #### `delete_data(schema, data_id, namespace=None, groupname=None, **kwargs)` Delete a data point from a Vespa app. Parameters: | Name | Type | Description | Default | | ----------- | ------ | ------------------------------------------------------------------------------------------------------------------------------ | ---------- | | `schema` | `str` | The schema that we are deleting data from. 
| *required* | | `data_id` | `str` | Unique id associated with this data point. | *required* | | `namespace` | `str` | The namespace that we are deleting data from. | `None` | | `groupname` | `str` | The groupname that we are deleting data from. | `None` | | `**kwargs` | `dict` | Additional HTTP request parameters. See: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#request-parameters. | `{}` | Returns: | Name | Type | Description | | ---------- | --------------- | ---------------------------------------- | | `Response` | `VespaResponse` | The response of the HTTP DELETE request. | Raises: | Type | Description | | ----------- | ---------------- | | `HTTPError` | If one occurred. | #### `delete_all_docs(content_cluster_name, schema, namespace=None, slices=1, **kwargs)` Delete all documents associated with the schema. Parameters: | Name | Type | Description | Default | | ---------------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------ | ---------- | | `content_cluster_name` | `str` | Name of content cluster to GET from or visit. | *required* | | `schema` | `str` | The schema that we are deleting data from. | *required* | | `namespace` | `str` | The namespace that we are deleting data from. | `None` | | `slices` | `int` | Number of slices to use for parallel delete. | `1` | | `**kwargs` | `dict` | Additional HTTP request parameters. See: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#request-parameters. | `{}` | Returns: | Name | Type | Description | | ---------- | ------ | ---------------------------------------- | | `Response` | `None` | The response of the HTTP DELETE request. | Raises: | Type | Description | | ----------- | ---------------- | | `HTTPError` | If one occurred. | #### `visit(content_cluster_name, schema=None, namespace=None, slices=1, selection='true', wanted_document_count=500, slice_id=None, **kwargs)` Visit all documents associated with the schema and matching the selection. 
This method will run each slice on a separate thread, yielding the response for each page for each slice. Parameters: | Name | Type | Description | Default | | ----------------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------ | ---------- | | `content_cluster_name` | `str` | Name of content cluster to GET from. | *required* | | `schema` | `str` | The schema that we are visiting data from. | `None` | | `namespace` | `str` | The namespace that we are visiting data from. | `None` | | `slices` | `int` | Number of slices to use for parallel GET. | `1` | | `wanted_document_count` | `int` | Best effort number of documents to retrieve for each request. May contain fewer if there are not enough documents left. | `500` | | `selection` | `str` | Selection expression to use. Defaults to "true". See: https://docs.vespa.ai/en/reference/document-select-language.html. | `'true'` | | `slice_id` | `int` | Slice id to use. Defaults to None, which means all slices. | `None` | | `**kwargs` | `dict` | Additional HTTP request parameters. See: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#request-parameters. | `{}` | Returns: | Name | Type | Description | | ----------- | ------ | ---------------------------------------------------------------- | | `generator` | `None` | A generator of slices, each containing a generator of responses. | Raises: | Type | Description | | ----------- | ---------------- | | `HTTPError` | If one occurred. | #### `get_data(schema, data_id, namespace=None, groupname=None, raise_on_not_found=False, **kwargs)` Get a data point from a Vespa app. Parameters: | Name | Type | Description | Default | | -------------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------ | ---------- | | `schema` | `str` | The schema that we are getting data from. 
| *required* | | `data_id` | `str` | Unique id associated with this data point. | *required* | | `namespace` | `str` | The namespace that we are getting data from. | `None` | | `groupname` | `str` | The groupname used to get data. | `None` | | `raise_on_not_found` | `bool` | Raise an exception if the document is not found. Default is False. | `False` | | `**kwargs` | `dict` | Additional HTTP request parameters. See: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#request-parameters. | `{}` | Returns: | Name | Type | Description | | ---------- | --------------- | ------------------------------------- | | `Response` | `VespaResponse` | The response of the HTTP GET request. | Raises: | Type | Description | | ----------- | ---------------- | | `HTTPError` | If one occurred. | #### `update_data(schema, data_id, fields, create=False, auto_assign=True, namespace=None, groupname=None, **kwargs)` Update a data point in a Vespa app. Parameters: | Name | Type | Description | Default | | ------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------ | ---------- | | `schema` | `str` | The schema that we are updating data in. | *required* | | `data_id` | `str` | Unique id associated with this data point. | *required* | | `fields` | `dict` | Dict containing all the fields you want to update. | *required* | | `create` | `bool` | If true, updates to non-existent documents will create an empty document to update. Default is False. | `False` | | `auto_assign` | `bool` | Assumes fields-parameter is an assignment operation. If set to False, the fields parameter should include the update operation. Default is True. | `True` | | `namespace` | `str` | The namespace that we are updating data in. | `None` | | `groupname` | `str` | The groupname used to update data. | `None` | | `**kwargs` | `dict` | Additional HTTP request parameters. 
See: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#request-parameters. | `None` | | `cluster` | `str` | Name of the cluster to target when retrieving endpoints. This affects which endpoints are used for initializing the `Vespa` instance in VespaCloud.get_application and VespaCloud.deploy. | `None` | | `instance` | `str` | Name of the application instance. Default is "default". | `'default'` | Raises: | Type | Description | | -------------- | -------------------- | | `RuntimeError` | If deployment fails. | Returns: | Name | Type | Description | | ------- | ------ | -------------------------------------------------------------------------- | | `Vespa` | `None` | A Vespa connection instance for interacting with the deployed application. | #### `deploy(instance='default', disk_folder=None, version=None, max_wait=1800, environment='dev', region=None)` Deploy the given application package as the given instance in the Vespa Cloud dev or perf environment. Parameters: | Name | Type | Description | Default | | ------------- | ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------- | | `instance` | `str` | Name of this instance of the application in the Vespa Cloud. | `'default'` | | `disk_folder` | `str` | Disk folder to save the required Vespa config files. Defaults to the application name folder within the user's current working directory. | `None` | | `version` | `str` | Vespa version to use for deployment. Defaults to None, meaning the latest version. Should only be set based on instructions from the Vespa team. Must be a valid Vespa version, e.g., "8.435.13". | `None` | | `max_wait` | `int` | Seconds to wait for the deployment to complete. | `1800` | | `environment` | `Literal['dev', 'perf']` | Environment to deploy to. Default is "dev". | `'dev'` | | `region` | `str` | Dev region to deploy to. 
Valid regions: "aws-us-east-1c" (default), "aws-euw1-az1", "azure-eastus-az1", "gcp-us-central1-f". Only used when environment is "dev". | `None` | Returns: | Name | Type | Description | | ------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | `Vespa` | `Vespa` | A Vespa connection instance. This instance connects to the mTLS endpoint. To connect to the token endpoint, use VespaCloud.get_application(endpoint_type="token"). | Raises: | Type | Description | | -------------- | ----------------------------------------------------------------------- | | `RuntimeError` | If deployment fails or if there are issues with the deployment process. | | `ValueError` | If an invalid dev region is provided. | #### `deploy_to_prod(instance='default', application_root=None, source_url='')` Deploy the given application package as the given instance in the Vespa Cloud prod environment. NB! This feature is experimental and may fail in unexpected ways. Expect better support in future releases. If submitting an application that is not yet packaged, tests should be located in /tests. If submitting an application packaged with maven, application_root should refer to the generated /target/application directory. Parameters: | Name | Type | Description | Default | | ------------------ | ----- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------- | | `instance` | `str` | Name of this instance of the application in the Vespa Cloud. 
| `'default'` | | `application_root` | `str` | Path to either save the required Vespa config files (if initialized with application_package) or read them from (if initialized with application_root). | `None` | | `source_url` | `str` | Optional source URL (including commit hash) for the deployment. This is a URL to the source code repository, e.g., GitHub, that is used to build the application package. Example: https://github.com/vespa-cloud/vector-search/commit/474d7771bd938d35dc5dcfd407c21c019d15df3c. The source URL will show up in the Vespa Cloud Console next to the build number. | `''` | Raises: | Type | Description | | -------------- | ----------------------------------------------------------------------- | | `RuntimeError` | If deployment fails or if there are issues with the deployment process. | #### `get_application(instance='default', environment='dev', endpoint_type='mtls', vespa_cloud_secret_token=None, region=None, max_wait=60)` Get a connection to the Vespa application instance. Will only work if the application is already deployed. Example usage ```python vespa_cloud = VespaCloud(...) app: Vespa = vespa_cloud.get_application() # Feed, query, visit, etc. ``` Parameters: | Name | Type | Description | Default | | -------------------------- | ----- | -------------------------------------------------------------------------------------------------------------------------------------- | ----------- | | `instance` | `str` | Name of this instance of the application in the Vespa Cloud. Default is "default". | `'default'` | | `environment` | `str` | Environment of the application. Default is "dev". Options are "dev", "perf", or "prod". | `'dev'` | | `endpoint_type` | `str` | Type of endpoint to connect to. Default is "mtls". Options are "mtls" or "token". | `'mtls'` | | `vespa_cloud_secret_token` | `str` | Vespa Cloud Secret Token. Only required if endpoint_type is "token". 
| `None` | | `region` | `str` | Region of the application in Vespa Cloud, e.g., "aws-us-east-1c". If not provided, the first region from the environment will be used. | `None` | | `max_wait` | `int` | Seconds to wait for the application to be up. Default is 60 seconds. | `60` | Returns: | Name | Type | Description | | ------- | ------- | --------------------------- | | `Vespa` | `Vespa` | Vespa application instance. | Raises: | Type | Description | | -------------- | ------------------------------------------------------------------------------------- | | `RuntimeError` | If the application is not yet deployed or there are issues retrieving the connection. | #### `check_production_build_status(build_no, quiet=False)` Check the status of a production build. Useful for example in CI/CD pipelines to check when a build has converged. Example usage ```python vespa_cloud = VespaCloud(...) build_no = vespa_cloud.deploy_to_prod() status = vespa_cloud.check_production_build_status(build_no) # The response contains: # - "deployed" (bool): True if the build has converged everywhere. # - "status" (str): "deploying" or "done". # - "hasFailed" (bool): True if any job for this build has ever failed. # Once true, it stays true even if the system retries with a new run. # - "skipReason" (str, optional): Why the build was skipped, e.g. "no-changes" or "cancelled". # - "jobs" (list): Per-job deployment details, each with "jobName", "runStatus", # "runId", and "instance". The list grows as jobs are triggered. # # Each job shows the most recent run's status for this build. 
# # Example: early in deployment (only tests triggered so far): # {"deployed": False, "status": "deploying", "hasFailed": False, # "jobs": [{"jobName": "system-test", "runStatus": "running"}, # {"jobName": "staging-test", "runStatus": "running"}]} # # Example: fully deployed: # {"deployed": True, "status": "done", "hasFailed": False, # "jobs": [{"jobName": "system-test", "runStatus": "success"}, # {"jobName": "staging-test", "runStatus": "success"}, # {"jobName": "production-us-east-3", "runStatus": "success"}]} # # Example: a job failed (system retries, but hasFailed stays true): # {"deployed": False, "status": "deploying", "hasFailed": True, # "jobs": [{"jobName": "system-test", "runStatus": "success"}, # {"jobName": "staging-test", "runStatus": "installationFailed"}]} # # Example: skipped before any jobs triggered (no changes): # {"deployed": False, "status": "done", "hasFailed": False, # "skipReason": "no-changes", "jobs": []} # # Example: cancelled after some jobs ran: # {"deployed": False, "status": "done", "hasFailed": True, # "skipReason": "cancelled", # "jobs": [{"jobName": "system-test", "runStatus": "success"}, # {"jobName": "staging-test", "runStatus": "running"}]} ``` Parameters: | Name | Type | Description | Default | | ---------- | ------ | ------------------------------------------------- | ---------- | | `build_no` | `int` | The build number to check. | *required* | | `quiet` | `bool` | If True, suppress status print. Default is False. | `False` | Returns: | Name | Type | Description | | ------ | ------ | --------------------------------------------------------------------------------------- | | `dict` | `dict` | The build status response from the API. See example responses above for the full shape. | Raises: | Type | Description | | -------------- | ------------------------------------------------------------ | | `RuntimeError` | If there are issues with retrieving the status of the build. 
| #### `wait_for_prod_deployment(build_no=None, max_wait=3600, poll_interval=5)` Wait for a production deployment to finish by polling build status. Prints per-job status changes as they happen (only prints when a job's status changes). Example usage ```python vespa_cloud = VespaCloud(...) build_no = vespa_cloud.deploy_to_prod() success = vespa_cloud.wait_for_prod_deployment(build_no, max_wait=3600, poll_interval=5) print(success) # Output: True ``` Parameters: | Name | Type | Description | Default | | --------------- | ----- | ----------------------------------------------------------------------------- | ------- | | `build_no` | `int` | The build number to check. | `None` | | `max_wait` | `int` | Maximum time to wait for the deployment in seconds. Default is 3600 (1 hour). | `3600` | | `poll_interval` | `int` | Polling interval in seconds. Default is 5 seconds. | `5` | Returns: | Name | Type | Description | | ------ | ------ | --------------------------------------------------------------------------------------------------------------------------------- | | `bool` | `bool` | True if the build was deployed to all production zones, False if it completed without deploying (e.g. skipped due to no changes). | Raises: | Type | Description | | -------------- | -------------------------------------------------------------------------------------------------------------------------------------- | | `RuntimeError` | If any job for this build has failed. The deployment system may continue retrying, but this method exits immediately on first failure. | | `TimeoutError` | If the deployment did not finish within max_wait seconds. | #### `deploy_from_disk(instance, application_root, max_wait=300, version=None, environment='dev', region=None)` Deploy to the development or performance environment from a directory tree. This method is used when making changes to application package files that are not supported by pyvespa. 
Note: Requires a certificate and key to be generated using 'vespa auth cert'. Example usage ```python vespa_cloud = VespaCloud(...) vespa_cloud.deploy_from_disk( instance="my-instance", application_root="/path/to/application", max_wait=3600, version="8.435.13" ) ``` Parameters: | Name | Type | Description | Default | | ------------------ | ----- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- | | `instance` | `str` | The name of the instance where the application will be run. | *required* | | `application_root` | `str` | The root directory of the application package. | *required* | | `max_wait` | `int` | The maximum number of seconds to wait for the deployment. Default is 300 (5 minutes). | `300` | | `version` | `str` | The Vespa version to use for the deployment. Default is None, which means the latest version. It must be a valid Vespa version (e.g., "8.435.13"). | `None` | | `environment` | `str` | Environment to deploy to. Default is "dev". Options are "dev" or "perf". | `'dev'` | | `region` | `str` | Dev region to deploy to. Valid regions: "aws-us-east-1c" (default), "aws-euw1-az1", "azure-eastus-az1", "gcp-us-central1-f". Only used when environment is "dev". | `None` | Returns: | Name | Type | Description | | ------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | | `Vespa` | `Vespa` | A Vespa connection instance. This connects to the mtls endpoint. To connect to the token endpoint, use VespaCloud.get_application(endpoint_type="token"). | #### `delete(instance='default', environment='dev', region=None)` Delete the specified instance from the development environment in the Vespa Cloud. To delete a production instance, you must submit a new deployment with `deployment-removal` added to the 'validation-overrides.xml'. 
See the Vespa Cloud documentation on deleting applications for more details. Example usage ```python vespa_cloud = VespaCloud(...) vespa_cloud.delete(instance="my-instance") ``` Parameters: | Name | Type | Description | Default | | ------------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------- | | `instance` | `str` | The name of the instance to delete. | `'default'` | | `environment` | `str` | The environment from which to delete the instance. Must be "dev" or "perf". | `'dev'` | | `region` | `str` | Dev region to delete from. Valid regions: "aws-us-east-1c" (default), "aws-euw1-az1", "azure-eastus-az1", "gcp-us-central1-f". Only used when environment is "dev". | `None` | Returns: | Type | Description | | ------ | ----------- | | `None` | None | #### `get_all_endpoints(instance='default', region=None, environment='dev')` Get all endpoints for the application instance. Parameters: | Name | Type | Description | Default | | ------------- | ----- | ----------------------------------- | ----------- | | `instance` | `str` | Application instance name. | `'default'` | | `region` | `str` | Region name, e.g. 'aws-us-east-1c'. | `None` | | `environment` | `str` | Environment (dev/perf/prod). | `'dev'` | Returns: | Name | Type | Description | | ------ | ---------------------- | ------------------ | | `list` | `List[Dict[str, str]]` | List of endpoints. | #### `get_private_services(instance='default', region=None, environment='dev')` Get private services for the application instance. Parameters: | Name | Type | Description | Default | | ------------- | ----- | ----------------------------------- | ----------- | | `instance` | `str` | Application instance name. | `'default'` | | `region` | `str` | Region name, e.g. 'aws-us-east-1c'. | `None` | | `environment` | `str` | Environment (dev/perf/prod). 
| `'dev'` | Returns: | Name | Type | Description | | ------ | ------ | -------------------------- | | `dict` | `dict` | Private services response. | Warning: This method is experimental and may change. #### `get_app_package_contents(instance='default', region=None, environment='dev')` Get all endpoints for the application package content in the specified region and environment. Parameters: | Name | Type | Description | Default | | ------------- | ----- | ----------------------------------------------------------------------------------------- | ----------- | | `instance` | `str` | Application instance name. | `'default'` | | `region` | `str` | Region name, e.g. 'aws-us-east-1c'. If None, uses the default region for the environment. | `None` | | `environment` | `str` | Environment (dev/perf/prod). Default is 'dev'. | `'dev'` | Returns: | Name | Type | Description | | ------ | ----------- | ----------------------------------------------- | | `list` | `List[str]` | List of endpoints for the application instance. | #### `get_schemas(instance='default', region=None, environment='dev')` Get all schemas for the application instance in the specified environment and region. Parameters: | Name | Type | Description | Default | | ------------- | ----- | ----------------------------------------------------------------------------------------- | ----------- | | `instance` | `str` | Application instance name. | `'default'` | | `region` | `str` | Region name, e.g. 'aws-us-east-1c'. If None, uses the default region for the environment. | `None` | | `environment` | `str` | Environment (dev/perf/prod). Default is 'dev'. | `'dev'` | Returns: | Name | Type | Description | | ------ | ---------------- | -------------------------------------------------------- | | `dict` | `Dict[str, str]` | Dictionary with schema name as key and content as value. 
| #### `download_app_package_content(destination_path, instance='default', region=None, environment='dev')` Download the application package content to a specified destination path. Parameters: | Name | Type | Description | Default | | ------------------ | ----- | ----------------------------------------------------------------------------------------- | ----------- | | `destination_path` | `str` | The path where the application package content will be downloaded. | *required* | | `instance` | `str` | Application instance name. | `'default'` | | `region` | `str` | Region name, e.g. 'aws-us-east-1c'. If None, uses the default region for the environment. | `None` | | `environment` | `str` | Environment (dev/perf/prod). Default is 'dev'. | `'dev'` | Returns: | Type | Description | | ------ | ----------- | | `None` | None | #### `get_endpoint_auth_method(url, instance='default', region=None, environment='dev')` Get the authentication method for the given endpoint URL. Parameters: | Name | Type | Description | Default | | ------------- | ----- | ----------------------------------- | ----------- | | `url` | `str` | The endpoint URL. | *required* | | `instance` | `str` | Application instance name. | `'default'` | | `region` | `str` | Region name, e.g. 'aws-us-east-1c'. | `None` | | `environment` | `str` | Environment (dev/perf/prod). | `'dev'` | Returns: | Name | Type | Description | | ----- | ----- | ---------------------------------------------- | | `str` | `str` | The authentication method ('mtls' or 'token'). | #### `get_endpoint(auth_method, instance='default', region=None, environment='dev', cluster=None)` Get the endpoint URL for the application. Tip: See the 'endpoint'-tab in Vespa Cloud Console for available endpoints. Parameters: | Name | Type | Description | Default | | ------------- | ----- | --------------------------------------------------------------------------------------- | ----------- | | `auth_method` | `str` | Authentication method. 
Options are 'mtls' or 'token'. | *required* | | `instance` | `str` | Application instance name. | `'default'` | | `region` | `str` | Region name, e.g. 'aws-us-east-1c'. | `None` | | `environment` | `str` | Environment (dev/perf/prod). | `'dev'` | | `cluster` | `str` | Specific cluster to get the endpoint for. If None, uses the instance's default cluster. | `None` | Returns: | Name | Type | Description | | ----- | ----- | ----------------- | | `str` | `str` | The endpoint URL. | #### `get_mtls_endpoint(instance='default', region=None, environment='dev', cluster=None)` Get the endpoint URL of an mTLS endpoint for the application. Will return the first mTLS endpoint found if multiple exist. Use `VespaCloud.get_all_endpoints` to get all endpoints. Tip: See the 'endpoint'-tab in Vespa Cloud Console for available endpoints. Parameters: | Name | Type | Description | Default | | ------------- | ----- | --------------------------------------------------------------------------------------- | ----------- | | `instance` | `str` | Application instance name. | `'default'` | | `region` | `str` | Region name. | `None` | | `environment` | `str` | Environment (dev/perf/prod). | `'dev'` | | `cluster` | `str` | Specific cluster to get the endpoint for. If None, uses the instance's default cluster. | `None` | Returns: | Name | Type | Description | | ----- | ----- | ----------------- | | `str` | `str` | The endpoint URL. | #### `get_token_endpoint(instance='default', region=None, environment='dev', cluster=None)` Get the endpoint URL of a token endpoint for the application. Will return the first token endpoint found if multiple exist. Use `VespaCloud.get_all_endpoints` to get all endpoints. Tip: See the 'endpoint'-tab in Vespa Cloud Console for available endpoints. 
Parameters: | Name | Type | Description | Default | | ------------- | ----- | --------------------------------------------------------------------------------------- | ----------- | | `instance` | `str` | Application instance name. | `'default'` | | `region` | `str` | Region name. | `None` | | `environment` | `str` | Environment (dev/perf/prod). | `'dev'` | | `cluster` | `str` | Specific cluster to get the endpoint for. If None, uses the instance's default cluster. | `None` | Returns: | Name | Type | Description | | ----- | ----- | ----------------- | | `str` | `str` | The endpoint URL. | ## `vespa.evaluation` Vespa evaluation module. This module provides tools for evaluating and benchmarking Vespa applications. ### `Vespa(url, port=None, deployment_message=None, cert=None, key=None, vespa_cloud_secret_token=None, output_file=sys.stdout, application_package=None, additional_headers=None)` Bases: `object` Establish a connection with an existing Vespa application. Parameters: | Name | Type | Description | Default | | -------------------------- | ------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- | | `url` | `str` | Vespa endpoint URL. | *required* | | `port` | `int` | Vespa endpoint port. | `None` | | `deployment_message` | `str` | Message returned by Vespa engine after deployment. Used internally by deploy methods. | `None` | | `cert` | `str` | Path to data plane certificate and key file in case the 'key' parameter is None. If 'key' is not None, this should be the path of the certificate file. Typically generated by Vespa-cli with 'vespa auth cert'. | `None` | | `key` | `str` | Path to the data plane key file. Typically generated by Vespa-cli with 'vespa auth cert'. | `None` | | `vespa_cloud_secret_token` | `str` | Vespa Cloud data plane secret token. 
| `None` | | `output_file` | `str` | Output file to write output messages. | `stdout` | | `application_package` | `str` | Application package definition used to deploy the application. | `None` | | `additional_headers` | `dict` | Additional headers to be sent to the Vespa application. | `None` | Example usage ```python Vespa(url="https://cord19.vespa.ai") # doctest: +SKIP Vespa(url="http://localhost", port=8080) Vespa("http://localhost", 8080) Vespa(url="https://token-endpoint..z.vespa-app.cloud", vespa_cloud_secret_token="your_token") # doctest: +SKIP Vespa(url="https://mtls-endpoint..z.vespa-app.cloud", cert="/path/to/cert.pem", key="/path/to/key.pem") # doctest: +SKIP Vespa(url="https://mtls-endpoint..z.vespa-app.cloud", cert="/path/to/cert.pem", key="/path/to/key.pem", additional_headers={"X-Custom-Header": "test"}) # doctest: +SKIP ``` #### `application_package` Get application package definition, if available. #### `asyncio(connections=1, total_timeout=None, timeout=30.0, client=None, **kwargs)` Access Vespa asynchronous connection layer. Should be used as a context manager. Example usage ```python async with app.asyncio() as async_app: response = await async_app.query(body=body) # passing kwargs with custom timeout async with app.asyncio(connections=5, timeout=60.0) as async_app: response = await async_app.query(body=body) ``` See `VespaAsync` for more details on the parameters. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `connections` | `int` | Number of maximum_keepalive_connections. | `1` | | `total_timeout` | `int` | Deprecated. Will be ignored. Use timeout instead. | `None` | | `timeout` | `float \| int \| Timeout` | Request timeout in seconds, or an httpx.Timeout instance. Defaults to 30.0. | `30.0` | | `client` | `AsyncClient` | Reusable httpx.AsyncClient to use instead of creating a new one.
When provided, the caller is responsible for closing the client. | `None` | | `**kwargs` | `dict` | Additional arguments to be passed to the httpx.AsyncClient. | `{}` | Returns: | Name | Type | Description | | ------------ | ------------ | ------------------------------------- | | `VespaAsync` | `VespaAsync` | Instance of Vespa asynchronous layer. | #### `get_async_session(connections=1, total_timeout=None, timeout=30.0, **kwargs)` Return a configured `httpx.AsyncClient` for reuse. The client is created with the same configuration as `VespaAsync` and is HTTP/2 enabled by default. Callers are responsible for closing the client via `await client.aclose()` when finished. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `connections` | `int` | Number of logical connections to keep alive. | `1` | | `timeout` | `float \| int \| Timeout` | Request timeout in seconds, or an httpx.Timeout instance. Defaults to 30.0. | `30.0` | | `**kwargs` | | Additional keyword arguments forwarded to httpx.AsyncClient. | `{}` | Returns: | Type | Description | | ------------- | ------------------------------------------------------- | | `AsyncClient` | httpx.AsyncClient: Configured asynchronous HTTP client. | #### `syncio(connections=8, compress='auto', session=None)` Access Vespa synchronous connection layer. Should be used as a context manager. Example usage: ```python with app.syncio() as sync_app: response = sync_app.query(body=body) ``` See `VespaSync` for more details. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `connections` | `int` | Number of allowed concurrent connections. | `8` | | `compress` | `Union[str, bool]` | Whether to compress the request body. Defaults to "auto", which will compress if the body is larger than 1024 bytes.
| `'auto'` | | `session` | `Session` | Reusable requests session to utilise for all requests made within the context manager. When provided, the caller is responsible for closing the session. | `None` | Returns: | Name | Type | Description | | --- | --- | --- | | `VespaSync` | `VespaSync` | Instance of Vespa synchronous layer. | #### `get_sync_session(connections=8, compress='auto')` Return a configured httpr.Client for reuse. The returned client is configured with the same headers, authentication, and mTLS certificates as the VespaSync context manager. Callers are responsible for closing the client when it is no longer needed. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `connections` | `int` | Kept for API compatibility (httpr manages pooling). | `8` | | `compress` | `Union[str, bool]` | Whether to compress request bodies. | `'auto'` | Returns: | Type | Description | | --- | --- | | `Client` | httpr.Client: Configured HTTP client. | #### `wait_for_application_up(max_wait=300)` Wait for the application endpoint to become ready (/ApplicationStatus). Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `max_wait` | `int` | Seconds to wait for the application endpoint. | `300` | Raises: | Type | Description | | --- | --- | | `RuntimeError` | If not able to reach the endpoint within max_wait or the client fails to authenticate. | Returns: | Type | Description | | --- | --- | | `None` | None | #### `get_application_status()` Get application status (/ApplicationStatus). Returns: | Type | Description | | --- | --- | | `Optional[Response]` | The status response, or None if unavailable. | #### `get_model_endpoint(model_id=None)` Get stateless model evaluation endpoints. #### `query(body=None, groupname=None, streaming=False, profile=False, **kwargs)` Send a query request to the Vespa application. Send 'body' containing all the request parameters.
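To sketch what such a request body can look like before passing it to `app.query(body=body)`; the schema name `doc` and rank profile `bm25` here are hypothetical placeholders:

```python
# A typical Vespa Query API request body; "doc" and "bm25" are
# hypothetical schema and rank-profile names.
body = {
    "yql": "select * from doc where userQuery()",
    "query": "what is vector search",
    "ranking": "bm25",
    "hits": 5,
}

# With a running application this would be sent as, e.g.:
# app = Vespa(url="http://localhost", port=8080)
# response = app.query(body=body)
# print(response.hits)
```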
Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `body` | `dict` | Dictionary containing request parameters. | `None` | | `groupname` | `str` | The groupname used with streaming search. | `None` | | `streaming` | `bool` | Whether to use streaming mode (SSE). Defaults to False. | `False` | | `profile` | `bool` | Add profiling parameters to the query (response may be large). Defaults to False. | `False` | | `**kwargs` | `dict` | Extra Vespa Query API parameters. | `{}` | Returns: | Type | Description | | --- | --- | | `Union[VespaQueryResponse, Generator[str, None, None]]` | VespaQueryResponse when streaming=False, or a generator of decoded lines when streaming=True. | #### `feed_data_point(schema, data_id, fields, namespace=None, groupname=None, compress='auto', **kwargs)` Feed a data point to a Vespa app. Will create a new VespaSync with connection overhead. Example usage ```python app = Vespa(url="localhost", port=8080) data_id = "1" fields = { "field1": "value1", } with VespaSync(app) as sync_app: response = sync_app.feed_data_point( schema="schema_name", data_id=data_id, fields=fields ) print(response) ``` Parameters: | Name | Type | Description | Default | | ----------- | ------------------ | -------------------------------------------------------------------------------------------------------------------- | ---------- | | `schema` | `str` | The schema that we are sending data to. | *required* | | `data_id` | `str` | Unique id associated with this data point. | *required* | | `fields` | `dict` | Dictionary containing all the fields required by the schema. | *required* | | `namespace` | `str` | The namespace that we are sending data to. | `None` | | `groupname` | `str` | The groupname that we are sending data to. | `None` | | `compress` | `Union[str, bool]` | Whether to compress the request body. Defaults to "auto", which will compress if the body is larger than 1024 bytes.
| `'auto'` | Returns: | Name | Type | Description | | --------------- | --------------- | -------------------------------------- | | `VespaResponse` | `VespaResponse` | The response of the HTTP POST request. | #### `feed_iterable(iter, schema=None, namespace=None, callback=None, operation_type='feed', max_queue_size=1000, max_workers=8, max_connections=16, compress='auto', **kwargs)` Feed data from an Iterable of Dict with the keys 'id' and 'fields' to be used in the `feed_data_point` function. Uses a queue to feed data in parallel with a thread pool. The result of each operation is forwarded to the user-provided callback function that can process the returned `VespaResponse`. Example usage ```python app = Vespa(url="localhost", port=8080) data = [ {"id": "1", "fields": {"field1": "value1"}}, {"id": "2", "fields": {"field1": "value2"}}, ] def callback(response, id): print(f"Response for id {id}: {response.status_code}") app.feed_iterable(data, schema="schema_name", callback=callback) ``` Parameters: | Name | Type | Description | Default | | ----------------- | ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- | | `iter` | `Iterable[dict]` | An iterable of Dict containing the keys 'id' and 'fields' to be used in the feed_data_point. Note that this 'id' is only the last part of the full document id, which will be generated automatically by pyvespa. | *required* | | `schema` | `str` | The Vespa schema name that we are sending data to. | `None` | | `namespace` | `str` | The Vespa document id namespace. If no namespace is provided, the schema is used. | `None` | | `callback` | `function` | A callback function to be called on each result. Signature callback(response: VespaResponse, id: str). | `None` | | `operation_type` | `str` | The operation to perform. Defaults to feed. 
Valid values are feed, update, or delete. | `'feed'` | | `max_queue_size` | `int` | The maximum size of the blocking queue and max in-flight operations. | `1000` | | `max_workers` | `int` | The maximum number of workers in the threadpool executor. | `8` | | `max_connections` | `int` | The maximum number of persisted connections to the Vespa endpoint. | `16` | | `compress` | `Union[str, bool]` | Whether to compress the request body. Defaults to "auto", which will compress if the body is larger than 1024 bytes. | `'auto'` | | `**kwargs` | `dict` | Additional parameters passed to the respective operation type specific function (\_data_point). | `{}` | Returns: | Type | Description | | ---- | ----------- | | | None | #### `feed_async_iterable(iter, schema=None, namespace=None, callback=None, operation_type='feed', max_queue_size=1000, max_workers=64, max_connections=1, **kwargs)` Feed data asynchronously using httpx.AsyncClient with HTTP/2. Feed from an Iterable of Dict with the keys 'id' and 'fields' to be used in the `feed_data_point` function. The result of each operation is forwarded to the user-provided callback function that can process the returned `VespaResponse`. Prefer using this method over `feed_iterable` when the operation is I/O bound from the client side. 
Example usage ```python app = Vespa(url="localhost", port=8080) data = [ {"id": "1", "fields": {"field1": "value1"}}, {"id": "2", "fields": {"field1": "value2"}}, ] def callback(response, id): print(f"Response for id {id}: {response.status_code}") app.feed_async_iterable(data, schema="schema_name", callback=callback) ``` Parameters: | Name | Type | Description | Default | | ----------------- | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- | | `iter` | `Iterable[dict]` | An iterable of Dict containing the keys 'id' and 'fields' to be used in the feed_data_point. Note that this 'id' is only the last part of the full document id, which will be generated automatically by pyvespa. | *required* | | `schema` | `str` | The Vespa schema name that we are sending data to. | `None` | | `namespace` | `str` | The Vespa document id namespace. If no namespace is provided, the schema is used. | `None` | | `callback` | `function` | A callback function to be called on each result. Signature callback(response: VespaResponse, id: str). | `None` | | `operation_type` | `str` | The operation to perform. Defaults to feed. Valid values are feed, update, or delete. | `'feed'` | | `max_queue_size` | `int` | The maximum number of tasks waiting to be processed. Useful to limit memory usage. Default is 1000. | `1000` | | `max_workers` | `int` | Maximum number of concurrent requests to have in-flight, bound by an asyncio.Semaphore, that needs to be acquired by a submit task. Increase if the server is scaled to handle more requests. | `64` | | `max_connections` | `int` | The maximum number of connections passed to httpx.AsyncClient to the Vespa endpoint. As HTTP/2 is used, only one connection is needed. 
| `1` | | `**kwargs` | `dict` | Additional parameters passed to the respective operation type-specific function (\_data_point). | `{}` | Returns: | Type | Description | | ---- | ----------- | | | None | #### `query_many_async(queries, num_connections=1, max_concurrent=100, adaptive=True, client_kwargs={}, **query_kwargs)` Execute many queries asynchronously using httpx.AsyncClient. The number of concurrent requests is controlled by the `max_concurrent` parameter. Each query will be retried up to 3 times using an exponential backoff strategy. When adaptive=True (default), an AdaptiveThrottler is used that starts with a conservative concurrency limit and automatically adjusts based on server responses to prevent overloading Vespa with expensive operations. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `queries` | `Iterable[dict]` | Iterable of query bodies (dictionaries) to be sent. | *required* | | `num_connections` | `int` | Number of connections to be used in the asynchronous client (uses HTTP/2). Defaults to 1. | `1` | | `max_concurrent` | `int` | Maximum concurrent requests to be sent. Defaults to 100. Be careful when increasing this value. | `100` | | `adaptive` | `bool` | Use adaptive throttling. Defaults to True. When True, starts with lower concurrency and adjusts based on error rates. | `True` | | `client_kwargs` | `dict` | Additional arguments to be passed to the httpx.AsyncClient. | `{}` | | `**query_kwargs` | `dict` | Additional arguments to be passed to the query method. | `{}` | Returns: | Type | Description | | -------------------------- | --------------------------------------------------------------- | | `List[VespaQueryResponse]` | List\[VespaQueryResponse\]: List of VespaQueryResponse objects.
| #### `query_many(queries, num_connections=1, max_concurrent=100, adaptive=True, client_kwargs={}, **query_kwargs)` Execute many queries asynchronously using httpx.AsyncClient. This method is a wrapper around the `query_many_async` method that uses the asyncio event loop to run the coroutine. The number of concurrent requests is controlled by the `max_concurrent` parameter. Each query will be retried up to 3 times using an exponential backoff strategy. When adaptive=True (default), an AdaptiveThrottler is used that starts with a conservative concurrency limit and automatically adjusts based on server responses to prevent overloading Vespa with expensive operations. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `queries` | `Iterable[dict]` | Iterable of query bodies (dictionaries) to be sent. | *required* | | `num_connections` | `int` | Number of connections to be used in the asynchronous client (uses HTTP/2). Defaults to 1. | `1` | | `max_concurrent` | `int` | Maximum concurrent requests to be sent. Defaults to 100. Be careful when increasing this value. | `100` | | `adaptive` | `bool` | Use adaptive throttling. Defaults to True. When True, starts with lower concurrency and adjusts based on error rates. | `True` | | `client_kwargs` | `dict` | Additional arguments to be passed to the httpx.AsyncClient. | `{}` | | `**query_kwargs` | `dict` | Additional arguments to be passed to the query method. | `{}` | Returns: | Type | Description | | -------------------------- | --------------------------------------------------------------- | | `List[VespaQueryResponse]` | List\[VespaQueryResponse\]: List of VespaQueryResponse objects. | #### `delete_data(schema, data_id, namespace=None, groupname=None, **kwargs)` Delete a data point from a Vespa app.
Example usage ```python app = Vespa(url="localhost", port=8080) response = app.delete_data(schema="schema_name", data_id="1") print(response) ``` Parameters: | Name | Type | Description | Default | | ----------- | ------ | ----------------------------------------------------------------------------------------------------------- | ---------- | | `schema` | `str` | The schema that we are deleting data from. | *required* | | `data_id` | `str` | Unique id associated with this data point. | *required* | | `namespace` | `str` | The namespace that we are deleting data from. If no namespace is provided, the schema is used. | `None` | | `groupname` | `str` | The groupname that we are deleting data from. | `None` | | `**kwargs` | `dict` | Additional arguments to be passed to the HTTP DELETE request. See Vespa API documentation for more details. | `{}` | Returns: | Name | Type | Description | | ---------- | --------------- | ---------------------------------------- | | `Response` | `VespaResponse` | The response of the HTTP DELETE request. | #### `delete_all_docs(content_cluster_name, schema, namespace=None, slices=1, **kwargs)` Delete all documents associated with the schema. This might block for a long time as it requires sending multiple delete requests to complete. Parameters: | Name | Type | Description | Default | | ---------------------- | ------ | ----------------------------------------------------------------------------------------------------------- | ---------- | | `content_cluster_name` | `str` | Name of content cluster to GET from, or visit. | *required* | | `schema` | `str` | The schema that we are deleting data from. | *required* | | `namespace` | `str` | The namespace that we are deleting data from. If no namespace is provided, the schema is used. | `None` | | `slices` | `int` | Number of slices to use for parallel delete requests. Defaults to 1. | `1` | | `**kwargs` | `dict` | Additional arguments to be passed to the HTTP DELETE request. 
See Vespa API documentation for more details. | `{}` | Returns: | Name | Type | Description | | ---------- | ---------- | ---------------------------------------- | | `Response` | `Response` | The response of the HTTP DELETE request. | #### `visit(content_cluster_name, schema=None, namespace=None, slices=1, selection='true', wanted_document_count=500, slice_id=None, **kwargs)` Visit all documents associated with the schema and matching the selection. Each slice runs on a separate thread, and for each slice the response for each page is yielded. Example usage ```python for slice in app.visit(content_cluster_name="content", schema="schema_name", slices=2): for response in slice: print(response.json) ``` Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `content_cluster_name` | `str` | Name of content cluster to GET from. | *required* | | `schema` | `str` | The schema that we are visiting data from. | `None` | | `namespace` | `str` | The namespace that we are visiting data from. | `None` | | `slices` | `int` | Number of slices to use for parallel GET. | `1` | | `selection` | `str` | Selection expression to filter documents. | `'true'` | | `wanted_document_count` | `int` | Best effort number of documents to retrieve for each request. May contain fewer if there are not enough documents left. | `500` | | `slice_id` | `int` | Slice id to use for the visit. If None, all slices will be used. | `None` | | `**kwargs` | `dict` | Additional HTTP request parameters. See Vespa API documentation. | `{}` | Yields: | Type | Description | | ------------------------------------------- | -------------------------------------------------------------------------------------------------- | | `Generator[VespaVisitResponse, None, None]` | Generator\[Generator[Response]\]: A generator of slices, each containing a generator of responses.
| Raises: | Type | Description | | ----------- | -------------------------- | | `HTTPError` | If an HTTP error occurred. | #### `get_data(schema, data_id, namespace=None, groupname=None, raise_on_not_found=False, **kwargs)` Get a data point from a Vespa app. Parameters: | Name | Type | Description | Default | | -------------------- | ------ | --------------------------------------------------------------------------------------------- | ---------- | | `data_id` | `str` | Unique id associated with this data point. | *required* | | `schema` | `str` | The schema that we are getting data from. Will attempt to infer schema name if not provided. | *required* | | `namespace` | `str` | The namespace that we are getting data from. If no namespace is provided, the schema is used. | `None` | | `groupname` | `str` | The groupname that we are getting data from. | `None` | | `raise_on_not_found` | `bool` | Raise an exception if the data_id is not found. Default is False. | `False` | | `**kwargs` | `dict` | Additional arguments to be passed to the HTTP GET request. See Vespa API documentation. | `{}` | Returns: | Name | Type | Description | | ---------- | --------------- | ------------------------------------- | | `Response` | `VespaResponse` | The response of the HTTP GET request. | #### `update_data(schema, data_id, fields, create=False, namespace=None, groupname=None, compress='auto', **kwargs)` Update a data point in a Vespa app. 
Example usage ```python vespa = Vespa(url="localhost", port=8080) fields = {"mystringfield": "value1", "myintfield": 42} response = vespa.update_data(schema="schema_name", data_id="id1", fields=fields) # or, with partial update, setting auto_assign=False fields = {"myintfield": {"increment": 1}} response = vespa.update_data(schema="schema_name", data_id="id1", fields=fields, auto_assign=False) print(response.json) ``` Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `schema` | `str` | The schema that we are updating data in. | *required* | | `data_id` | `str` | Unique id associated with this data point. | *required* | | `fields` | `dict` | Dict containing all the fields you want to update. | *required* | | `create` | `bool` | If true, updates to non-existent documents will create an empty document to update. | `False` | | `auto_assign` | `bool` | Assumes the fields parameter is an assignment operation. If set to false, the fields parameter should be a dictionary including the update operation. | `True` | | `namespace` | `str` | The namespace that we are updating data in. If no namespace is provided, the schema is used. | `None` | | `groupname` | `str` | The groupname that we are updating data in. | `None` | | `compress` | `Union[str, bool]` | Whether to compress the request body. Defaults to "auto", which will compress if the body is larger than 1024 bytes. | `'auto'` | | `**kwargs` | `dict` | Additional arguments to be passed to the HTTP PUT request. See Vespa API documentation. | `{}` | Returns: | Name | Type | Description | | ---------- | --------------- | ------------------------------------- | | `Response` | `VespaResponse` | The response of the HTTP PUT request.
| #### `get_model_from_application_package(model_name)` Get model definition from application package, if available. #### `predict(x, model_id, function_name='output_0')` Obtain a stateless model evaluation. Parameters: | Name | Type | Description | Default | | --------------- | --------- | --------------------------------------------------------------------- | ------------ | | `x` | `various` | Input where the format depends on the task that the model is serving. | *required* | | `model_id` | `str` | The id of the model used to serve the prediction. | *required* | | `function_name` | `str` | The name of the output function to be evaluated. | `'output_0'` | Returns: | Name | Type | Description | | ----- | ---- | ----------------- | | `var` | | Model prediction. | #### `get_document_v1_path(id, schema=None, namespace=None, group=None, number=None)` Convert to document v1 path. Parameters: | Name | Type | Description | Default | | ----------- | ----- | ------------------------------ | ---------- | | `id` | `str` | The id of the document. | *required* | | `namespace` | `str` | The namespace of the document. | `None` | | `schema` | `str` | The schema of the document. | `None` | | `group` | `str` | The group of the document. | `None` | | `number` | `int` | The number of the document. | `None` | Returns: | Name | Type | Description | | ----- | ----- | ------------------------------------- | | `str` | `str` | The path to the document v1 endpoint. | ### `VespaQueryResponse(json, status_code, url, request_body=None)` Bases: `VespaResponse` #### `get_json()` For debugging when the response does not have hits. Returns: | Type | Description | | ------ | ------------------------------ | | `Dict` | JSON object with full response | ### `RandomHitsSamplingStrategy` Bases: `Enum` Enum for different random hits sampling strategies. 
- RATIO: Sample random hits as a ratio of relevant docs (e.g., 1.0 = equal number, 2.0 = twice as many) - FIXED: Sample a fixed number of random hits per query ### `VespaEvaluatorBase(queries, relevant_docs, vespa_query_fn, app, name='', id_field='', write_csv=False, csv_dir=None)` Bases: `ABC` Abstract base class for Vespa evaluators providing initialization and interface. #### `run()` Abstract method to be implemented by subclasses. #### `__call__()` Make the evaluator callable. ### `VespaEvaluator(queries, relevant_docs, vespa_query_fn, app, name='', id_field='', accuracy_at_k=[1, 3, 5, 10], precision_recall_at_k=[1, 3, 5, 10], mrr_at_k=[10], ndcg_at_k=[10], map_at_k=[100], write_csv=False, csv_dir=None)` Bases: `VespaEvaluatorBase` Evaluate retrieval performance on a Vespa application. This class: - Iterates over queries and issues them against your Vespa application. - Retrieves top-k documents per query (with k = max of your IR metrics). - Compares the retrieved documents with a set of relevant document ids. - Computes IR metrics: Accuracy@k, Precision@k, Recall@k, MRR@k, NDCG@k, MAP@k. - Logs Vespa search times for each query. - Logs/returns these metrics. - Optionally writes out to CSV. Note: The 'id_field' needs to be marked as an attribute in your Vespa schema, so filtering can be done on it. Example usage ```python from vespa.application import Vespa from vespa.evaluation import VespaEvaluator queries = { "q1": "What is the best GPU for gaming?", "q2": "How to bake sourdough bread?", # ... } relevant_docs = { "q1": {"d12", "d99"}, "q2": {"d101"}, # ... } # relevant_docs can also be a dict of query_id => single relevant doc_id # relevant_docs = { # "q1": "d12", # "q2": "d101", # # ... # } # Or, relevant_docs can be a dict of query_id => map of doc_id => relevance # relevant_docs = { # "q1": {"d12": 1, "d99": 0.1}, # "q2": {"d101": 0.01}, # # ... # }
# Note that for non-binary relevance, the relevance values should be in [0, 1], and that # only the nDCG metric will be computed. def my_vespa_query_fn(query_text: str, top_k: int) -> dict: return { "yql": 'select * from sources * where userInput("' + query_text + '");', "hits": top_k, "ranking": "your_ranking_profile", } app = Vespa(url="http://localhost", port=8080) evaluator = VespaEvaluator( queries=queries, relevant_docs=relevant_docs, vespa_query_fn=my_vespa_query_fn, app=app, name="test-run", accuracy_at_k=[1, 3, 5], precision_recall_at_k=[1, 3, 5], mrr_at_k=[10], ndcg_at_k=[10], map_at_k=[100], write_csv=True ) results = evaluator() print("Primary metric:", evaluator.primary_metric) print("All results:", results) ``` Parameters: | Name | Type | Description | Default | | ----------------------- | --------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------- | | `queries` | `Dict[str, str]` | A dictionary where keys are query IDs and values are query strings. | *required* | | `relevant_docs` | `Union[Dict[str, Union[Set[str], Dict[str, float]]], Dict[str, str]]` | A dictionary mapping query IDs to their relevant document IDs. Can be a set of doc IDs for binary relevance, a dict of doc_id to relevance score (float between 0 and 1) for graded relevance, or a single doc_id string. | *required* | | `vespa_query_fn` | `Callable[[str, int, Optional[str]], dict]` | A function that takes a query string, the number of hits to retrieve (top_k), and an optional query_id, and returns a Vespa query body dictionary. | *required* | | `app` | `Vespa` | An instance of the Vespa application. | *required* | | `name` | `str` | A name for this evaluation run. Defaults to "". 
| `''` | | `id_field` | `str` | The field name in the Vespa hit that contains the document ID. If empty, it tries to infer the ID from the 'id' field or 'fields.id'. Defaults to "". | `''` | | `accuracy_at_k` | `List[int]` | List of k values for which to compute Accuracy@k. Defaults to [1, 3, 5, 10]. | `[1, 3, 5, 10]` | | `precision_recall_at_k` | `List[int]` | List of k values for which to compute Precision@k and Recall@k. Defaults to [1, 3, 5, 10]. | `[1, 3, 5, 10]` | | `mrr_at_k` | `List[int]` | List of k values for which to compute MRR@k. Defaults to [10]. | `[10]` | | `ndcg_at_k` | `List[int]` | List of k values for which to compute NDCG@k. Defaults to [10]. | `[10]` | | `map_at_k` | `List[int]` | List of k values for which to compute MAP@k. Defaults to [100]. | `[100]` | | `write_csv` | `bool` | Whether to write the evaluation results to a CSV file. Defaults to False. | `False` | | `csv_dir` | `Optional[str]` | Directory to save the CSV file. Defaults to None (current directory). | `None` | #### `run()` Executes the evaluation by running queries and computing IR metrics. This method: 1. Executes all configured queries against the Vespa application. 1. Collects search results and timing information. 1. Computes the configured IR metrics (Accuracy@k, Precision@k, Recall@k, MRR@k, NDCG@k, MAP@k). 1. Records search timing statistics. 1. Logs results and optionally writes them to CSV. Returns: | Name | Type | Description | | ------ | ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `dict` | `Dict[str, float]` | A dictionary containing: - IR metrics with names like "accuracy@k", "precision@k", etc. - Search time statistics ("searchtime_avg", "searchtime_q50", etc.). The values are floats between 0 and 1 for metrics and in seconds for timing. 
|

Example

```python
{
    "accuracy@1": 0.75,
    "ndcg@10": 0.68,
    "searchtime_avg": 0.0123,
    ...
}
```

### `VespaMatchEvaluator(queries, relevant_docs, vespa_query_fn, app, id_field, name='', rank_profile='unranked', write_csv=False, write_verbose=False, csv_dir=None)`

Bases: `VespaEvaluatorBase`

Evaluate recall in the match-phase over a set of queries for a Vespa application.

This class:

- Iterates over queries and issues them against your Vespa application.
- Sends one query with limit 0 to get the number of matched documents.
- Sends one query with the recall-parameter set according to the provided relevant documents.
- Compares the retrieved documents with a set of relevant document ids.
- Logs Vespa search times for each query.
- Logs/returns these metrics.
- Optionally writes out to CSV.

Note: It is recommended to use a rank profile without any first-phase (and second-phase) ranking if you care about the speed of the evaluation run. If you do so, you need to make sure that the rank profile you use has the same inputs. For example, if you want to evaluate a YQL query including the nearestNeighbor-operator, your rank profile needs to define the corresponding input tensor. You must also either provide the query tensor or define it as an input (e.g. 'input.query(embedding)=embed(@query)') in your Vespa query function. Also note that the 'id_field' needs to be marked as an attribute in your Vespa schema, so filtering can be done on it.

Example usage:

```python
from vespa.application import Vespa
from vespa.evaluation import VespaMatchEvaluator

queries = {
    "q1": "What is the best GPU for gaming?",
    "q2": "How to bake sourdough bread?",
    # ...
}
relevant_docs = {
    "q1": {"d12", "d99"},
    "q2": {"d101"},
    # ...
}
# relevant_docs can also be a dict of query_id => single relevant doc_id
# relevant_docs = {
#     "q1": "d12",
#     "q2": "d101",
#     # ...
# }
def my_vespa_query_fn(query_text: str, top_k: int) -> dict: return { "yql": 'select * from sources * where userInput("' + query_text + '");', "hits": top_k, "ranking": "your_ranking_profile", } app = Vespa(url="http://localhost", port=8080) evaluator = VespaMatchEvaluator( queries=queries, relevant_docs=relevant_docs, vespa_query_fn=my_vespa_query_fn, app=app, name="test-run", id_field="id", write_csv=True, write_verbose=True, ) results = evaluator() print("Primary metric:", evaluator.primary_metric) print("All results:", results) ``` Parameters: | Name | Type | Description | Default | | ---------------- | --------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- | | `queries` | `Dict[str, str]` | A dictionary where keys are query IDs and values are query strings. | *required* | | `relevant_docs` | `Union[Dict[str, Union[Set[str], Dict[str, float]]], Dict[str, str]]` | A dictionary mapping query IDs to their relevant document IDs. Can be a set of doc IDs for binary relevance, or a single doc_id string. Graded relevance (dict of doc_id to relevance score) is not supported for match evaluation. | *required* | | `vespa_query_fn` | `Callable[[str, int, Optional[str]], dict]` | A function that takes a query string, the number of hits to retrieve (top_k), and an optional query_id, and returns a Vespa query body dictionary. | *required* | | `app` | `Vespa` | An instance of the Vespa application. | *required* | | `name` | `str` | A name for this evaluation run. Defaults to "". | `''` | | `id_field` | `str` | The field name in the Vespa hit that contains the document ID. If empty, it tries to infer the ID from the 'id' field or 'fields.id'. Defaults to "". 
| *required* |
| `write_csv` | `bool` | Whether to write the summary evaluation results to a CSV file. Defaults to False. | `False` |
| `write_verbose` | `bool` | Whether to write detailed query-level results to a separate CSV file. Defaults to False. | `False` |
| `csv_dir` | `Optional[str]` | Directory to save the CSV files. Defaults to None (current directory). | `None` |

#### `create_grouping_filter(yql, id_field, relevant_ids)`

Create a grouping filter to append to Vespa YQL queries, limiting results to the relevant documents:

    | all( group(id_field) filter(regex("", id_field)) each(output(count())))

Parameters:

yql (str): The base YQL query string.
id_field (str): The field name in the Vespa hit that contains the document ID.
relevant_ids (list[str]): List of relevant document IDs to include in the filter.

Returns:

str: The modified YQL query string with the grouping filter applied.

#### `extract_matched_ids(resp, id_field)`

Extract matched document IDs from Vespa query response hits.

Parameters:

resp (VespaQueryResponse): The Vespa query response object.
id_field (str): The field name in the Vespa hit that contains the document ID.

Returns:

Set[str]: A set of matched document IDs.

#### `run()`

Executes the match-phase recall evaluation.

This method:

1. Sends a grouping query to see which of the relevant documents were matched, and to get the totalCount.
1. Computes recall metrics and match statistics.
1. Logs results and optionally writes them to CSV.

Returns:

| Name | Type | Description |
| ------ | ------------------ | ------------------------------------------------------------------------------------- |
| `dict` | `Dict[str, float]` | A dictionary containing recall metrics, match statistics, and search time statistics. |

Example

```python
{
    "match_recall": 0.85,
    "total_relevant_docs": 150,
    "total_matched_relevant": 128,
    "avg_matched_per_query": 45.2,
    "searchtime_avg": 0.015,
    ...
} ``` ### `VespaCollectorBase(queries, relevant_docs, vespa_query_fn, app, id_field, name='', csv_dir=None, random_hits_strategy=RandomHitsSamplingStrategy.RATIO, random_hits_value=1.0, max_random_hits_per_query=None, collect_matchfeatures=True, collect_rankfeatures=False, collect_summaryfeatures=False, write_csv=True)` Bases: `ABC` Abstract base class for Vespa training data collectors providing initialization and interface. Initialize the VespaFeatureCollector. Parameters: | Name | Type | Description | Default | | --------------------------- | --------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- | | `queries` | `Dict[str, str]` | Dictionary mapping query IDs to query strings | *required* | | `relevant_docs` | `Union[Dict[str, Union[Set[str], Dict[str, float]]], Dict[str, str]]` | Dictionary mapping query IDs to relevant document IDs | *required* | | `vespa_query_fn` | `Callable[[str, int, Optional[str]], dict]` | Function to generate Vespa query bodies | *required* | | `app` | `Vespa` | Vespa application instance | *required* | | `id_field` | `str` | Field name containing document IDs in Vespa hits (must be defined as an attribute in the schema) | *required* | | `name` | `str` | Name for this collection run | `''` | | `csv_dir` | `Optional[str]` | Directory to save CSV files | `None` | | `random_hits_strategy` | `Union[RandomHitsSamplingStrategy, str]` | Strategy for sampling random hits - either "ratio" or "fixed" - RATIO: Sample random hits as a ratio of relevant docs - FIXED: Sample a fixed number of random hits per query | `RATIO` | | `random_hits_value` | `Union[float, int]` | Value for the sampling strategy - For RATIO: Ratio value (e.g., 1.0 = equal, 2.0 = twice as many random hits) - For FIXED: Fixed number of random hits per query | `1.0` | | 
`max_random_hits_per_query` | `Optional[int]` | Optional maximum limit on random hits per query (only applies when using RATIO strategy to prevent excessive sampling) | `None` | | `collect_matchfeatures` | `bool` | Whether to collect match features | `True` | | `collect_rankfeatures` | `bool` | Whether to collect rank features | `False` | | `collect_summaryfeatures` | `bool` | Whether to collect summary features | `False` | | `write_csv` | `bool` | Whether to write results to CSV file | `True` | #### `collect()` Abstract method to be implemented by subclasses. #### `__call__()` Make the collector callable. ### `VespaFeatureCollector(queries, relevant_docs, vespa_query_fn, app, id_field, name='', csv_dir=None, random_hits_strategy=RandomHitsSamplingStrategy.RATIO, random_hits_value=1.0, max_random_hits_per_query=None, collect_matchfeatures=True, collect_rankfeatures=False, collect_summaryfeatures=False, write_csv=True)` Bases: `VespaCollectorBase` Collects training data for retrieval tasks from a Vespa application. This class: - Iterates over queries and issues them against your Vespa application. - Retrieves top-k documents per query. - Samples random hits based on the specified strategy. - Compiles a CSV file with query-document pairs and their relevance labels. Important: If you want to sample random hits, you need to make sure that the rank profile you define in your `vespa_query_fn` has a ranking expression that reflects this. See [docs](https://docs.vespa.ai/en/tutorials/text-search-ml.html#get-random-hits) for example. In this case, be aware that the `relevance_score` value in the returned results (or CSV) will be of no value. This will only have meaning if you use this to collect features for relevant documents only. Example usage ```python from vespa.application import Vespa from vespa.evaluation import VespaFeatureCollector queries = { "q1": "What is the best GPU for gaming?", "q2": "How to bake sourdough bread?", # ... 
} relevant_docs = { "q1": {"d12", "d99"}, "q2": {"d101"}, # ... } def my_vespa_query_fn(query_text: str, top_k: int) -> dict: return { "yql": 'select * from sources * where userInput("' + query_text + '");', "hits": 10, # Do not make use of top_k here. "ranking": "your_ranking_profile", # This should have `random` as ranking expression } app = Vespa(url="http://localhost", port=8080) collector = VespaFeatureCollector( queries=queries, relevant_docs=relevant_docs, vespa_query_fn=my_vespa_query_fn, app=app, id_field="id", # Field in Vespa hit that contains the document ID (must be an attribute) name="retrieval-data-collection", csv_dir="/path/to/save/csv", random_hits_strategy="ratio", # or RandomHitsSamplingStrategy.RATIO random_hits_value=1.0, # Sample equal number of random hits to relevant docs max_random_hits_per_query=100, # Optional: cap random hits per query collect_matchfeatures=True, # Collect match features from rank profile collect_rankfeatures=False, # Skip traditional rank features collect_summaryfeatures=False, # Skip summary features ) collector() ``` **Alternative Usage Examples:** ```python # Example 1: Fixed number of random hits per query collector = VespaFeatureCollector( queries=queries, relevant_docs=relevant_docs, vespa_query_fn=my_vespa_query_fn, app=app, id_field="id", # Required field name random_hits_strategy="fixed", random_hits_value=50, # Always sample 50 random hits per query ) # Example 2: Ratio-based with a cap collector = VespaFeatureCollector( queries=queries, relevant_docs=relevant_docs, vespa_query_fn=my_vespa_query_fn, app=app, id_field="id", # Required field name random_hits_strategy="ratio", random_hits_value=2.0, # Sample twice as many random hits as relevant docs max_random_hits_per_query=200, # But never more than 200 per query ) ``` Parameters: | Name | Type | Description | Default | | --------------------------- | --------------------------------------------------------------------- | 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- | | `queries` | `Dict[str, str]` | A dictionary where keys are query IDs and values are query strings. | *required* | | `relevant_docs` | `Union[Dict[str, Union[Set[str], Dict[str, float]]], Dict[str, str]]` | A dictionary mapping query IDs to their relevant document IDs. Can be a set of doc IDs for binary relevance, a dict of doc_id to relevance score (float between 0 and 1) for graded relevance, or a single doc_id string. | *required* | | `vespa_query_fn` | `Callable[[str, int, Optional[str]], dict]` | A function that takes a query string, the number of hits to retrieve (top_k), and an optional query_id, and returns a Vespa query body dictionary. | *required* | | `app` | `Vespa` | An instance of the Vespa application. | *required* | | `id_field` | `str` | The field name in the Vespa hit that contains the document ID. This field must be defined as an attribute in your Vespa schema. | *required* | | `name` | `str` | A name for this data collection run. Defaults to "". | `''` | | `csv_dir` | `Optional[str]` | Directory to save the CSV file. Defaults to None (current directory). | `None` | | `random_hits_strategy` | `Union[RandomHitsSamplingStrategy, str]` | Strategy for sampling random hits. Can be "ratio" (or RandomHitsSamplingStrategy.RATIO) to sample as a ratio of relevant docs, or "fixed" (or RandomHitsSamplingStrategy.FIXED) to sample a fixed number per query. Defaults to "ratio". | `RATIO` | | `random_hits_value` | `Union[float, int]` | Value for the sampling strategy. For RATIO strategy: ratio value (e.g., 1.0 = equal number, 2.0 = twice as many random hits). For FIXED strategy: fixed number of random hits per query. Defaults to 1.0. 
| `1.0` | | `max_random_hits_per_query` | `Optional[int]` | Maximum limit on random hits per query. Only applies to RATIO strategy to prevent excessive sampling. Defaults to None (no limit). | `None` | | `collect_matchfeatures` | `bool` | Whether to collect match features defined in rank profile's match-features section. Defaults to True. | `True` | | `collect_rankfeatures` | `bool` | Whether to collect rank features using ranking.listFeatures=true. Defaults to False. | `False` | | `collect_summaryfeatures` | `bool` | Whether to collect summary features from document summaries. Defaults to False. | `False` | | `write_csv` | `bool` | Whether to write results to CSV file. Defaults to True. | `True` | #### `get_recall_param(relevant_doc_ids, get_relevant)` Adds the recall parameter to the Vespa query body based on relevant document IDs. Parameters: | Name | Type | Description | Default | | ------------------ | ------ | --------------------------------------- | ---------- | | `relevant_doc_ids` | `set` | A set of relevant document IDs. | *required* | | `get_relevant` | `bool` | Whether to retrieve relevant documents. | *required* | Returns: | Name | Type | Description | | ------ | ------ | ------------------------------------------------------- | | `dict` | `dict` | The updated Vespa query body with the recall parameter. | #### `calculate_random_hits_count(num_relevant_docs)` Calculate the number of random hits to sample based on the configured strategy. Parameters: | Name | Type | Description | Default | | ------------------- | ----- | ------------------------------------------ | ---------- | | `num_relevant_docs` | `int` | Number of relevant documents for the query | *required* | Returns: | Type | Description | | ----- | ------------------------------- | | `int` | Number of random hits to sample | #### `collect()` Collects training data by executing queries and saving results to CSV. This method: 1. Executes all configured queries against the Vespa application. 1. 
Collects the top-k document IDs and their relevance labels. 1. Optionally writes the data to a CSV file for training purposes. 1. Returns the collected data as a single dictionary with results. Returns: | Type | Description | | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `Dict[str, List[Dict]]` | Dict containing: | | `Dict[str, List[Dict]]` | 'results': List of dictionaries, each containing all data for a query-document pair (query_id, doc_id, relevance_label, relevance_score, and all extracted features) | ### `VespaNNParameters` Collection of nearest-neighbor query parameters used in nearest-neighbor classes. ### `VespaNNUnsuccessfulQueryError` Bases: `Exception` Exception raised when trying to determine the hit ratio or compute the recall of an unsuccessful query. ### `VespaNNGlobalFilterHitratioEvaluator(queries, app, verify_target_hits=None)` Determine the hit ratio of the global filter in ANN queries. This hit ratio determines the search strategy used to perform the nearest-neighbor search and is essential to understanding and optimizing the behavior of Vespa on these queries. This class: - Takes a list of queries. - Runs the queries with tracing. - Determines the hit ratio by examining the trace. Parameters: | Name | Type | Description | Default | | --------- | ----------------------------- | ------------------------------------- | ---------- | | `queries` | `Sequence[Mapping[str, str]]` | List of ANN queries. | *required* | | `app` | `Vespa` | An instance of the Vespa application. | *required* | #### `run()` Determines the hit ratios of the global filters in the supplied ANN queries. 
Returns:

| Type | Description |
| ------------------- | -------------------------------------------------------------------------------------------------------- |
| `List[List[float]]` | List of lists of hit ratios, which are values from the interval [0.0, 1.0], corresponding to the supplied queries. |

#### `get_searchable_copies()`

Returns the number of searchable copies determined during hit-ratio computation.

Returns:

| Name | Type | Description |
| ----- | ------------- | ------------------------------------------------------------------ |
| `int` | `int \| None` | Number of searchable copies determined during hit-ratio computation. |

### `VespaNNRecallEvaluator(queries, hits, app, query_limit=20, id_field='id', **kwargs)`

Determine recall of ANN queries. The recall of an ANN query with k hits is the fraction of the returned hits that are among the true k nearest neighbors of the query vector.

This class:

- Takes a list of queries.
- First runs the queries as is (with the supplied HTTP parameters).
- Then runs the queries with the supplied HTTP parameters and an additional parameter enforcing an exact nearest neighbor search.
- Determines the recall by comparing the results.

Parameters:

| Name | Type | Description | Default |
| ------------- | ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | ---------- |
| `queries` | `Sequence[Mapping[str, Any]]` | List of ANN queries. | *required* |
| `hits` | `int` | Number of hits to use. Should match the parameter targetHits in the used ANN queries. | *required* |
| `app` | `Vespa` | An instance of the Vespa application. | *required* |
| `query_limit` | `int` | Maximum number of queries to determine the recall for. Defaults to 20. | `20` |
| `id_field` | `str` | Name of the field containing a unique id. Defaults to "id". | `'id'` |
| `**kwargs` | `dict` | Additional HTTP request parameters. See: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#request-parameters.
| `{}` |

#### `run()`

Computes the recall of the supplied queries.

Returns:

| Type | Description |
| ------------- | ----------------------------------------------------------------------------------------- |
| `List[float]` | List of recall values from the interval [0.0, 1.0] corresponding to the supplied queries. |

### `VespaQueryBenchmarker(queries, app, time_limit=2000, max_concurrent=10, **kwargs)`

Determine the searchtime of queries by running them multiple times and taking the average. Using the searchtime has the advantage of not including network latency.

This class:

- Takes a list of queries.
- Runs the queries for the given amount of time.
- Determines the average searchtime of these runs.

Parameters:

| Name | Type | Description | Default |
| ---------------- | ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | ---------- |
| `queries` | `Sequence[Mapping[str, Any]]` | List of queries. | *required* |
| `app` | `Vespa` | An instance of the Vespa application. | *required* |
| `time_limit` | `int` | Time to run the benchmark for (in milliseconds). | `2000` |
| `max_concurrent` | `int` | Number of queries to execute concurrently. Defaults to 10. | `10` |
| `**kwargs` | `dict` | Additional HTTP request parameters. See: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#request-parameters. | `{}` |

#### `run()`

Runs the benchmark (including a warm-up run not included in the result).

Returns:

| Type | Description |
| ------------- | ----------------------------------------------------------- |
| `List[float]` | List of searchtimes, corresponding to the supplied queries. |

### `BucketedMetricResults(metric_name, buckets, values, filtered_out_ratios)`

Stores aggregated statistics for a metric across query buckets. Computes the mean and various percentiles for values grouped by bucket, where each bucket contains multiple measurements (e.g., response times or recall values).
Parameters: | Name | Type | Description | Default | | --------------------- | ------------------- | ---------------------------------------------------------------- | ---------- | | `metric_name` | `str` | Name of the metric being measured (e.g., "searchtime", "recall") | *required* | | `buckets` | `List[int]` | List of bucket indices that contain data | *required* | | `values` | `List[List[float]]` | List of lists containing measurements, one list per bucket | *required* | | `filtered_out_ratios` | `List[float]` | Pre-computed filtered-out ratios for each bucket | *required* | #### `to_dict()` Convert results to dictionary format. Returns: | Type | Description | | ---------------- | ----------------------------------------------------------- | | `Dict[str, Any]` | Dictionary containing bucket information and all statistics | ### `VespaNNParameterOptimizer(app, queries, hits, buckets_per_percent=2, print_progress=False, benchmark_time_limit=5000, recall_query_limit=20, max_concurrent=10, id_field='id')` Get suggestions for configuring the nearest-neighbor parameters of a Vespa application. This class: - Sorts ANN queries into buckets based on the hit-ratio of their global filter. - For every bucket, can determine the average response time of the queries in this bucket. - For every bucket, can determine the average recall of the queries in this bucket. - Can suggest a value for postFilterThreshold. - Can suggest a value for filterFirstThreshold. - Can suggest a value for filterFirstExploration. - Can suggest a value for approximateThreshold. Parameters: | Name | Type | Description | Default | | ---------------------- | ----------------------------- | ------------------------------------------------------------------------------------------------------------ | ---------- | | `app` | `Vespa` | An instance of the Vespa application. | *required* | | `queries` | `Sequence[Mapping[str, Any]]` | Queries to optimize for. 
| *required* | | `hits` | `int` | Number of hits to use in recall computations. Has to match the parameter targetHits in the used ANN queries. | *required* | | `buckets_per_percent` | `int` | How many buckets are created for every percent point, "resolution" of the suggestions. Defaults to 2. | `2` | | `print_progress` | `bool` | Whether to print progress information while determining suggestions. Defaults to False. | `False` | | `benchmark_time_limit` | `int` | Time in milliseconds to spend per bucket benchmark. Defaults to 5000. | `5000` | | `recall_query_limit` | `int` | Number of queries per bucket to compute the recall for. Defaults to 20. | `20` | | `max_concurrent` | `int` | Number of queries to execute concurrently during benchmark/recall calculation. Defaults to 10. | `10` | | `id_field` | `str` | Name of the field containing a unique id for recall computation. Defaults to "id". | `'id'` | #### `get_bucket_interval_width()` Gets the width of the interval represented by a single bucket. Returns: | Name | Type | Description | | ------- | ------- | ----------------------------------------------------- | | `float` | `float` | Width of the interval represented by a single bucket. | #### `get_number_of_buckets()` Gets the number of buckets. Returns: | Name | Type | Description | | ----- | ----- | ------------------ | | `int` | `int` | Number of buckets. | #### `get_number_of_nonempty_buckets()` Counts the number of buckets that contain at least one query. Returns: | Name | Type | Description | | ----- | ----- | ------------------------------------------------------ | | `int` | `int` | The number of buckets that contain at least one query. | #### `get_non_empty_buckets()` Gets the indices of the non-empty buckets. Returns: | Type | Description | | ----------- | ------------------------------------------------------ | | `List[int]` | List\[int\]: List of indices of the non-empty buckets. 
| #### `get_filtered_out_ratios()` Gets the (lower interval ends of the) filtered-out ratios of the non-empty buckets. Returns: | Type | Description | | ------------- | ----------------------------------------------------------------------------------------------------- | | `List[float]` | List\[float\]: List of the (lower interval ends of the) filtered-out ratios of the non-empty buckets. | #### `get_number_of_queries()` Gets the number of queries contained in the buckets. Returns: | Name | Type | Description | | ----- | ---- | ------------------------------------------- | | `int` | | Number of queries contained in the buckets. | #### `bucket_to_hitratio(bucket)` Gets the hit ratio (upper endpoint of interval) corresponding to the given bucket index. Parameters: | Name | Type | Description | Default | | -------- | ----- | ------------------ | ---------- | | `bucket` | `int` | Index of a bucket. | *required* | Returns: | Name | Type | Description | | ------- | ------- | -------------------------------------------------- | | `float` | `float` | Hit ratio corresponding to the given bucket index. | #### `bucket_to_filtered_out(bucket)` Gets the filtered-out ratio (1 - hit ratio, lower endpoint of interval) corresponding to the given bucket index. Parameters: | Name | Type | Description | Default | | -------- | ----- | ------------------ | ---------- | | `bucket` | `int` | Index of a bucket. | *required* | Returns: | Name | Type | Description | | ------- | ------- | ----------------------------------------------------------- | | `float` | `float` | Filtered-out ratio corresponding to the given bucket index. | #### `buckets_to_filtered_out(buckets)` Applies bucket_to_filtered_out to list of bucket indices. Parameters: | Name | Type | Description | Default | | --------- | ----------- | ----------------------- | ---------- | | `buckets` | `List[int]` | List of bucket indices. 
| *required* | Returns: | Type | Description | | ------------- | ----------------------------------------------------------------------------- | | `List[float]` | List\[float\]: Filtered-out ratios corresponding to the given bucket indices. | #### `filtered_out_to_bucket(percent)` Gets the index of the bucket containing the given filtered-out ratio. Parameters: | Name | Type | Description | Default | | --------- | ------- | ------------------- | ---------- | | `percent` | `float` | Filtered-out ratio. | *required* | Returns: | Name | Type | Description | | ----- | ----- | -------------------------------------------------------- | | `int` | `int` | Index of bucket containing the given filtered-out ratio. | #### `distribute_to_buckets(queries_with_hitratios)` Distributes the given queries to buckets according to their given hit ratios. Parameters: | Name | Type | Description | Default | | ------------------------ | ----------------------------- | ------------------------ | ---------- | | `queries_with_hitratios` | `List[Dict[str, str], float]` | Queries with hit ratios. | *required* | Returns: | Type | Description | | ----------------- | ----------------------------------- | | `List[List[str]]` | List\[List[str]\]: List of buckets. | #### `determine_hit_ratios_and_distribute_to_buckets(queries)` Distributes the given queries to buckets by determining their hit ratios. Parameters: | Name | Type | Description | Default | | --------- | ----------------------------- | ----------- | ---------- | | `queries` | `Sequence[Mapping[str, Any]]` | Queries. | *required* | Returns: | Type | Description | | ----------------- | ----------------------------------- | | `List[List[str]]` | List\[List[str]\]: List of buckets. | #### `query_from_get_string(get_query)` Parses a query in GET format. Parameters: | Name | Type | Description | Default | | ----------- | ----- | ---------------------------------- | ---------- | | `get_query` | `str` | Query as a single-line GET string. 
| *required* |

Returns:

| Type | Description |
| ---------------- | ---------------- |
| `Dict[str, str]` | Query as a dict. |

#### `distribute_file_to_buckets(filename)`

Distributes the queries from the given file to buckets according to their given hit ratios.

Parameters:

| Name | Type | Description | Default |
| ---------- | ----- | --------------------------------------------- | ---------- |
| `filename` | `str` | Name of file with GET queries (one per line). | *required* |

Returns:

| Type | Description |
| ----------------- | ---------------- |
| `List[List[str]]` | List of buckets. |

#### `has_sufficient_queries()`

Checks whether the given queries are deemed sufficient to give meaningful suggestions.

Returns:

| Name | Type | Description |
| ------ | ------ | ------------------------------------------------------------------------------- |
| `bool` | `bool` | Whether the given queries are deemed sufficient to give meaningful suggestions. |

#### `buckets_sufficiently_filled()`

Checks whether all non-empty buckets have at least 10 queries.

Returns:

| Name | Type | Description |
| ------ | ------ | ------------------------------------------------------- |
| `bool` | `bool` | Whether all non-empty buckets have at least 10 queries. |

#### `get_query_distribution()`

Gets the distribution of queries across all buckets.

Returns:

| Type | Description |
| ------------- | --------------------------------------------------------------- |
| `List[float]` | List of filtered-out ratios corresponding to non-empty buckets. |
| `List[int]` | List of numbers of queries. |

#### `benchmark(**kwargs)`

For each non-empty bucket, determine the average searchtime.
Parameters: | Name | Type | Description | Default | | ---------- | ------ | ------------------------------------------------------------------------------------------------------------------------------ | ------- | | `**kwargs` | `dict` | Additional HTTP request parameters. See: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#request-parameters. | `{}` | Returns: | Name | Type | Description | | ----------------------- | ----------------------- | ---------------------- | | `BucketedMetricResults` | `BucketedMetricResults` | The benchmark results. | #### `compute_average_recalls(**kwargs)` For each non-empty bucket, determine the average recall. Parameters: | Name | Type | Description | Default | | ---------- | ------ | ------------------------------------------------------------------------------------------------------------------------------ | ------- | | `**kwargs` | `dict` | Additional HTTP request parameters. See: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#request-parameters. | `{}` | Returns: | Name | Type | Description | | ----------------------- | ----------------------- | ------------------- | | `BucketedMetricResults` | `BucketedMetricResults` | The recall results. | #### `suggest_filter_first_threshold(**kwargs)` Suggests a value for [filterFirstThreshold](https://docs.vespa.ai/en/reference/query-api-reference.html#ranking.matching) based on performed benchmarks. Parameters: | Name | Type | Description | Default | | ---------- | ------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | | `**kwargs` | `dict` | Additional HTTP request parameters. See: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#request-parameters. Should contain ranking.matching.filterFirstExploration! 
| `{}` | Returns: | Name | Type | Description | | ------- | -------------------------------------------- | ----------- | | `float` | `dict[str, float \| dict[str, List[float]]]` | | #### `suggest_approximate_threshold(**kwargs)` Suggests a value for [approximateThreshold](https://docs.vespa.ai/en/reference/query-api-reference.html#ranking.matching) based on performed benchmarks. Parameters: | Name | Type | Description | Default | | ---------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | | `**kwargs` | `dict` | Additional HTTP request parameters. See: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#request-parameters. Should contain ranking.matching.filterFirstExploration and ranking.matching.filterFirstThreshold! | `{}` | Returns: | Name | Type | Description | | ------- | -------------------------------------------- | ----------- | | `float` | `dict[str, float \| dict[str, List[float]]]` | | #### `suggest_post_filter_threshold(**kwargs)` Suggests a value for [postFilterThreshold](https://docs.vespa.ai/en/reference/query-api-reference.html#ranking.matching) based on performed benchmarks and recall measurements. Parameters: | Name | Type | Description | Default | | ---------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------- | | `**kwargs` | `dict` | Additional HTTP request parameters. See: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#request-parameters. Should contain ranking.matching.filterFirstExploration, ranking.matching.filterFirstThreshold, and ranking.matching.approximateThreshold!
| `{}` | Returns: | Name | Type | Description | | ------- | -------------------------------------------- | ----------- | | `float` | `dict[str, float \| dict[str, List[float]]]` | | #### `suggest_filter_first_exploration()` Suggests a value for [filterFirstExploration](https://docs.vespa.ai/en/reference/query-api-reference.html#ranking.matching) based on benchmarks and recall measurements performed on the supplied Vespa app. Returns: | Name | Type | Description | | ------ | -------------------------------------------- | ----------- | | `dict` | `dict[str, float \| dict[str, List[float]]]` | | #### `run()` Determines suggestions for all parameters supported by this class. This method: 1. Determines the hit-ratios of supplied ANN queries. 1. Sorts these queries into buckets based on the determined hit-ratio. 1. Determines a suggestion for filterFirstExploration. 1. Determines a suggestion for filterFirstThreshold. 1. Determines a suggestion for approximateThreshold. 1. Determines a suggestion for postFilterThreshold. 1. Reports the determined suggestions and all benchmarks and recall measurements performed. Returns: | Name | Type | Description | | ------ | ---------------- | -------------------------------------------------------------------------------------------------------------------------------------- | | `dict` | `Dict[str, Any]` | A dictionary containing the suggested values, information about the query distribution, performed benchmarks, and recall measurements.
| Example ```python { "buckets": { "buckets_per_percent": 2, "bucket_interval_width": 0.005, "non_empty_buckets": [ 2, 20, 100, 180, 190, 198 ], "filtered_out_ratios": [ 0.01, 0.1, 0.5, 0.9, 0.95, 0.99 ], "hit_ratios": [ 0.99, 0.9, 0.5, 0.09999999999999998, 0.050000000000000044, 0.010000000000000009 ], "query_distribution": [ 100, 100, 100, 100, 100, 100 ] }, "filterFirstExploration": { "suggestion": 0.39453125, "benchmarks": { "0.0": [ 4.265999999999999, 4.256000000000001, 3.9430000000000005, 3.246999999999998, 2.4610000000000003, 1.768 ], "1.0": [ 3.9259999999999984, 3.6010000000000004, 3.290999999999999, 3.78, 4.927000000000002, 8.415000000000001 ], "0.5": [ 3.6299999999999977, 3.417, 3.4490000000000007, 3.752, 4.257, 5.99 ], "0.25": [ 3.5830000000000006, 3.616, 3.3239999999999985, 3.3200000000000016, 2.654999999999999, 2.3789999999999996 ], "0.375": [ 3.465, 3.4289999999999994, 3.196999999999997, 3.228999999999999, 3.167, 3.700999999999999 ], "0.4375": [ 3.9880000000000013, 3.463000000000002, 3.4650000000000007, 3.5000000000000013, 3.7499999999999982, 4.724000000000001 ], "0.40625": [ 3.4990000000000006, 3.3680000000000003, 3.147000000000001, 3.33, 3.381, 4.083999999999998 ], "0.390625": [ 3.6060000000000008, 3.5269999999999992, 3.2820000000000005, 3.433999999999998, 3.2880000000000007, 3.8609999999999984 ], "0.3984375": [ 3.6870000000000016, 3.386000000000001, 3.336000000000001, 3.316999999999999, 3.5329999999999973, 4.719000000000002 ] }, "recall_measurements": { "0.0": [ 0.8758, 0.8768999999999997, 0.8915, 0.9489999999999994, 0.9045999999999998, 0.64 ], "1.0": [ 0.8757, 0.8768999999999997, 0.8909999999999999, 0.9675999999999998, 0.9852999999999996, 0.9957999999999998 ], "0.5": [ 0.8757, 0.8768999999999997, 0.8909999999999999, 0.9660999999999998, 0.9759999999999996, 0.9903 ], "0.25": [ 0.8757, 0.8768999999999997, 0.8909999999999999, 0.9553999999999995, 0.9323999999999996, 0.8123000000000004 ], "0.375": [ 0.8757, 0.8768999999999997, 0.8909999999999999, 
0.9615999999999997, 0.9599999999999999, 0.9626000000000002 ], "0.4375": [ 0.8757, 0.8768999999999997, 0.8909999999999999, 0.9642999999999999, 0.9697999999999999, 0.9832 ], "0.40625": [ 0.8757, 0.8768999999999997, 0.8909999999999999, 0.9632, 0.9642999999999999, 0.9763999999999997 ], "0.390625": [ 0.8757, 0.8768999999999997, 0.8909999999999999, 0.9625999999999999, 0.9617999999999999, 0.9688999999999998 ], "0.3984375": [ 0.8757, 0.8768999999999997, 0.8909999999999999, 0.963, 0.9635000000000001, 0.9738999999999999 ] } }, "filterFirstThreshold": { "suggestion": 0.47, "benchmarks": { "hnsw": [ 2.779, 2.725000000000001, 3.151999999999999, 7.138999999999998, 11.362, 32.599999999999994 ], "filter_first": [ 3.543999999999999, 3.454, 3.443999999999999, 3.4129999999999994, 3.4090000000000003, 4.602999999999998 ] }, "recall_measurements": { "hnsw": [ 0.8284999999999996, 0.8368999999999996, 0.9007999999999996, 0.9740999999999996, 0.9852999999999993, 0.9937999999999992 ], "filter_first": [ 0.8757, 0.8768999999999997, 0.8909999999999999, 0.9627999999999999, 0.9630000000000001, 0.9718999999999994 ] } }, "approximateThreshold": { "suggestion": 0.03, "benchmarks": { "exact": [ 33.072, 31.99600000000001, 23.256, 9.155, 6.069000000000001, 2.0949999999999984 ], "filter_first": [ 2.9570000000000003, 2.91, 3.165000000000001, 3.396999999999998, 3.3310000000000004, 4.046 ] }, "recall_measurements": { "exact": [ 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 ], "filter_first": [ 0.8284999999999996, 0.8368999999999996, 0.9007999999999996, 0.9627999999999999, 0.9630000000000001, 0.9718999999999994 ] } }, "postFilterThreshold": { "suggestion": 0.49, "benchmarks": { "post_filtering": [ 2.0609999999999995, 2.448, 3.097999999999999, 7.200999999999999, 11.463000000000006, 11.622999999999996 ], "filter_first": [ 3.177999999999999, 2.717000000000001, 3.177, 3.5000000000000004, 3.455, 2.1159999999999997 ] }, "recall_measurements": { "post_filtering": [ 0.8288999999999995, 0.8355, 0.8967999999999998, 0.9519999999999997, 
0.9512999999999994, 0.19180000000000003 ], "filter_first": [ 0.8284999999999996, 0.8368999999999996, 0.9007999999999996, 0.9627999999999999, 0.9630000000000001, 1.0 ] } } } ``` ### `mean(values)` Compute the mean of a list of numbers without using numpy. ### `percentile(values, p)` Compute the p-th percentile of a list of values (0 \<= p \<= 100). This approximates numpy.percentile's behavior. ### `validate_queries(queries)` Validate and normalize queries. Converts query IDs to strings if they are ints. ### `validate_qrels(qrels)` Validate and normalize qrels. Converts query IDs to strings if they are ints. ### `validate_vespa_query_fn(fn)` Validates the vespa_query_fn function. The function must be callable and accept either two or three parameters: `(query_text: str, top_k: int)` or `(query_text: str, top_k: int, query_id: Optional[str])`. It must return a dictionary when called with test inputs. Returns True if the function takes a `query_id` parameter, False otherwise. ### `filter_queries(queries, relevant_docs)` Filter out queries that have no relevant docs. ### `extract_doc_id_from_hit(hit, id_field)` Extract document ID from a Vespa hit. ### `get_id_field_from_hit(hit, id_field)` Get the ID field from a Vespa hit. ### `calculate_searchtime_stats(searchtimes)` Calculate search time statistics. ### `execute_queries(app, query_bodies, max_concurrent=10)` Execute queries and collect timing information. Returns the responses and a list of search times. ### `write_csv(metrics, searchtime_stats, csv_file, csv_dir, name)` Write metrics to CSV file. ### `log_metrics(name, metrics)` Log metrics with appropriate formatting. ### `extract_features_from_hit(hit, collect_matchfeatures, collect_rankfeatures, collect_summaryfeatures)` Extract features from a Vespa hit based on the collection configuration.
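The extraction logic can be sketched as follows. The helper name and the hit layout are illustrative assumptions (Vespa's default JSON result format nests these feature groups under `fields`); this is not the exact pyvespa implementation:

```python
from typing import Dict

def sketch_extract_features(hit: dict,
                            collect_matchfeatures: bool,
                            collect_rankfeatures: bool,
                            collect_summaryfeatures: bool) -> Dict[str, float]:
    """Merge the selected feature groups of a hit into one flat dict.

    Illustrative sketch only: assumes match/rank/summary features are
    nested under hit["fields"], as in Vespa's default JSON result format.
    """
    fields = hit.get("fields", {})
    features: Dict[str, float] = {}
    if collect_matchfeatures:
        features.update(fields.get("matchfeatures", {}))
    if collect_rankfeatures:
        features.update(fields.get("rankfeatures", {}))
    if collect_summaryfeatures:
        features.update(fields.get("summaryfeatures", {}))
    return features

hit = {"id": "id:doc::1", "relevance": 0.7,
       "fields": {"matchfeatures": {"bm25(title)": 3.2},
                  "summaryfeatures": {"attribute(year)": 2021.0}}}
print(sketch_extract_features(hit, True, False, True))
```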
Parameters: | Name | Type | Description | Default | | ------------------------- | ------ | ----------------------------------- | ---------- | | `hit` | `dict` | The Vespa hit dictionary | *required* | | `collect_matchfeatures` | `bool` | Whether to collect match features | *required* | | `collect_rankfeatures` | `bool` | Whether to collect rank features | *required* | | `collect_summaryfeatures` | `bool` | Whether to collect summary features | *required* | Returns: | Type | Description | | ------------------ | ------------------------------------ | | `Dict[str, float]` | Dict mapping feature names to values | ### `__getattr__(name)` Lazy import for optional MTEB dependencies. ## `vespa.exceptions` ### `VespaError` Bases: `Exception` Vespa returned an error response ## `vespa.io` ### `VespaResponse(json, status_code, url, operation_type)` Bases: `object` Class to represent a Vespa HTTP API response. #### `get_status_code()` Return status code of the response. #### `is_successfull()` [Deprecated] Use is_successful() instead #### `is_successful()` True if status code is 200. #### `get_json()` Return json of the response. ### `VespaQueryResponse(json, status_code, url, request_body=None)` Bases: `VespaResponse` #### `get_json()` For debugging when the response does not have hits. Returns: | Type | Description | | ------ | ------------------------------ | | `Dict` | JSON object with full response | ## `vespa.models` ### `ModelConfig(model_id, embedding_dim, tokenizer_id=None, binarized=False, embedding_field_type='float', distance_metric=None, component_id=None, model_path=None, tokenizer_path=None, model_url=None, tokenizer_url=None, max_tokens=None, transformer_input_ids=None, transformer_attention_mask=None, transformer_token_type_ids=None, transformer_output=None, pooling_strategy=None, normalize=None, query_prepend=None, document_prepend=None, validate_urls=False)` Configuration for an embedding model. 
This class encapsulates all model-specific parameters that affect the Vespa schema, component configuration, and ranking expressions. Attributes: | Name | Type | Description | | ---------------------------- | --------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `model_id` | `str` | The model identifier (e.g., 'e5-small-v2', 'snowflake-arctic-embed-xs') | | `embedding_dim` | `int` | The dimension of the embedding vectors (e.g., 384, 768). When binarized=True, specify the original model dimension - it will be automatically divided by 8 for storage (e.g., 1024 -> 128 bytes). | | `tokenizer_id` | `Optional[str]` | The tokenizer model identifier (if different from model_id) | | `binarized` | `bool` | Whether the embeddings should be binarized (packed to bits). When True, overrides embedding_field_type to int8 and embedding_dim must be divisible by 8. | | `embedding_field_type` | `EmbeddingFieldType` | Tensor cell type for embeddings (default: "float"). Note: When binarized=True, this is automatically overridden to "int8". Options: - "double": 64-bit float (highest precision, highest memory) - "float": 32-bit float (good balance) - "bfloat16": 16-bit brain float (reduced memory, good for large scale) - "int8": 8-bit integer (quantized, or used automatically when binarized=True) | | `distance_metric` | `Optional[DistanceMetric]` | Distance metric for HNSW index (default: None, auto-set based on binarized). When binarized=True, automatically set to "hamming". When binarized=False and not specified, defaults to "angular". 
Options: - "angular": Cosine similarity - "hamming": Hamming distance (required for binarized embeddings) - "euclidean", "dotproduct", "prenormalized-angular", "geodegrees" | | `component_id` | `Optional[str]` | The ID to use for the Vespa component (defaults to sanitized model_id) | | `model_path` | `Optional[str]` | Optional local path to the model file | | `tokenizer_path` | `Optional[str]` | Optional local path to the tokenizer file | | `model_url` | `Optional[str]` | Optional URL to the ONNX model file (alternative to model_id) | | `tokenizer_url` | `Optional[str]` | Optional URL to the tokenizer file (alternative to tokenizer_id) | | `max_tokens` | `Optional[int]` | Maximum number of tokens accepted by the transformer model. Optional, if not set the Vespa embedder uses its internal default (512). | | `transformer_input_ids` | `Optional[str]` | Name/identifier for transformer input IDs. Optional, if not set the Vespa embedder uses its internal default ("input_ids"). | | `transformer_attention_mask` | `Optional[str]` | Name/identifier for transformer attention mask. Optional, if not set the Vespa embedder uses its internal default ("attention_mask"). | | `transformer_token_type_ids` | `Optional[str]` | Name/identifier for transformer token type IDs. Optional, if not set the Vespa embedder uses its internal default ("token_type_ids"). Set to empty string "" to explicitly disable token_type_ids. | | `transformer_output` | `Optional[str]` | Name/identifier for transformer output. Optional, if not set the Vespa embedder uses its internal default ("last_hidden_state"). | | `pooling_strategy` | `Optional[PoolingStrategy]` | How to pool output vectors ("mean", "cls", or "none"). Optional, if not set the Vespa embedder uses its internal default ("mean"). | | `normalize` | `Optional[bool]` | Whether to normalize output to unit length. Optional, if not set the Vespa embedder uses its internal default (False). 
| | `query_prepend` | `Optional[str]` | Optional instruction to prepend to query text | | `document_prepend` | `Optional[str]` | Optional instruction to prepend to document text | | `validate_urls` | `bool` | Whether to validate URLs by checking they return HTTP 200 (default: False) | #### `__post_init__()` Set defaults and validate configuration. #### `to_dict(include_none=False)` Convert the ModelConfig to a dictionary for serialization. Parameters: | Name | Type | Description | Default | | -------------- | ------ | -------------------------------------------------------- | ------- | | `include_none` | `bool` | If True, include fields with None values. Default False. | `False` | Returns: | Type | Description | | ---------------- | --------------------------------------------- | | `Dict[str, Any]` | Dict with all model configuration attributes. | Example

```python
>>> config = ModelConfig(model_id="e5-small-v2", embedding_dim=384)
>>> d = config.to_dict()
>>> d["model_id"]
'e5-small-v2'
>>> d["embedding_dim"]
384
>>> d["binarized"]
False
```

### `ApplicationPackageWithQueryFunctions(query_functions=None, **kwargs)` Bases: `ApplicationPackage` #### `get_query_functions()` Get the query functions for this application package. Returns: | Type | Description | | --------------------------------------- | ----------------------- | | `Dict[str, Callable[[str, int], dict]]` | Dict of query functions | ### `sanitize_component_id(model_id)` Sanitize a model ID to create a valid Vespa component identifier. Vespa component IDs must match the pattern `[a-zA-Z][a-zA-Z0-9_]*` (start with a letter, followed by letters, digits, or underscores).
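The sanitization rule can be sketched in a few lines of plain Python. This is an illustrative approximation of the documented behavior (hypothetical helper name, not pyvespa's actual implementation):

```python
import re

def sketch_sanitize(model_id: str) -> str:
    """Approximate sanitize_component_id: keep letters, digits, and
    underscores; the result must start with a letter. Illustrative only."""
    sanitized = re.sub(r"[^A-Za-z0-9_]", "_", model_id)
    if sanitized and sanitized[0].isdigit():
        # Documented behavior: "123-model" becomes "model_123_model"
        sanitized = "model_" + sanitized
    return sanitized

print(sketch_sanitize("e5-small-v2"))  # e5_small_v2
print(sketch_sanitize("123-model"))    # model_123_model
```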
Parameters: | Name | Type | Description | Default | | ---------- | ----- | -------------------------------- | ---------- | | `model_id` | `str` | The model identifier to sanitize | *required* | Returns: | Type | Description | | ----- | -------------------------- | | `str` | A valid Vespa component ID | Example

```python
>>> sanitize_component_id("e5-small-v2")
'e5_small_v2'
>>> sanitize_component_id("sentence-transformers/all-MiniLM-L6-v2")
'sentence_transformers_all_MiniLM_L6_v2'
>>> sanitize_component_id("model.v1.0")
'model_v1_0'
>>> sanitize_component_id("123-model")
'model_123_model'
```

### `create_embedder_component(config)` Create a Vespa hugging-face-embedder component from a model configuration. Parameters: | Name | Type | Description | Default | | -------- | ------------- | ------------------------------------------ | ---------- | | `config` | `ModelConfig` | ModelConfig instance with model parameters | *required* | Returns: | Name | Type | Description | | ----------- | ----------- | ------------------------------------------------------- | | `Component` | `Component` | A Vespa Component configured as a hugging-face-embedder | Example

```python
>>> config = ModelConfig(model_id="e5-small-v2", embedding_dim=384)
>>> component = create_embedder_component(config)
>>> component.id
'e5_small_v2'
```

Example with URL-based model and custom parameters:

```python
>>> config = ModelConfig(
...     model_id="gte-multilingual",
...     embedding_dim=768,
...     model_url="https://huggingface.co/onnx-community/gte-multilingual-base/resolve/main/onnx/model_quantized.onnx",
...     tokenizer_url="https://huggingface.co/onnx-community/gte-multilingual-base/resolve/main/tokenizer.json",
...     transformer_output="token_embeddings",
...     max_tokens=8192,
...     query_prepend="Represent this sentence for searching relevant passages: ",
...     document_prepend="passage: ",
... )
>>> component = create_embedder_component(config)
>>> component.id
'gte_multilingual'
```

### `create_embedding_field(config, field_name='embedding', indexing=None, distance_metric=None, embedder_id=None)` Create a Vespa embedding field from a model configuration. The field type and indexing statement are automatically configured based on whether the embeddings are binarized. Parameters: | Name | Type | Description | Default | | ----------------- | -------------------------- | -------------------------------------------------------------------------------- | ------------- | | `config` | `ModelConfig` | ModelConfig instance with model parameters | *required* | | `field_name` | `str` | Name of the embedding field (default: "embedding") | `'embedding'` | | `indexing` | `Optional[List[str]]` | Custom indexing statement (default: auto-generated based on config) | `None` | | `distance_metric` | `Optional[DistanceMetric]` | Distance metric for HNSW (default: "hamming" for binarized, "angular" for float) | `None` | | `embedder_id` | `Optional[str]` | Embedder ID to use in the indexing statement (default: uses config.component_id) | `None` | Returns: | Name | Type | Description | | ------- | ------- | --------------------------------------- | | `Field` | `Field` | A Vespa Field configured for embeddings | Example

```python
>>> config = ModelConfig(model_id="e5-small-v2", embedding_dim=384)
>>> field = create_embedding_field(config)
>>> field.type
'tensor(x[384])'

>>> config_float = ModelConfig(model_id="e5-small-v2", embedding_dim=384, embedding_field_type="float")
>>> field_float = create_embedding_field(config_float)
>>> field_float.type
'tensor(x[384])'

>>> config_binary = ModelConfig(model_id="bge-m3", embedding_dim=1024, binarized=True)
>>> field_binary = create_embedding_field(config_binary)
>>> field_binary.type
'tensor(x[128])'
```

### `create_semantic_rank_profile(config, profile_name='semantic', embedding_field='embedding', query_tensor='q')` Create a semantic ranking profile based on model
configuration. The ranking expression is automatically configured to use hamming distance for binarized embeddings or cosine similarity for float embeddings. Parameters: | Name | Type | Description | Default | | ----------------- | ------------- | -------------------------------------------------- | ------------- | | `config` | `ModelConfig` | ModelConfig instance with model parameters | *required* | | `profile_name` | `str` | Name of the rank profile (default: "semantic") | `'semantic'` | | `embedding_field` | `str` | Name of the embedding field (default: "embedding") | `'embedding'` | | `query_tensor` | `str` | Name of the query tensor (default: "q") | `'q'` | Returns: | Name | Type | Description | | ------------- | ------------- | -------------------------------------------------- | | `RankProfile` | `RankProfile` | A Vespa RankProfile configured for semantic search | Example

```python
>>> config = ModelConfig(model_id="e5-small-v2", embedding_dim=384, binarized=False)
>>> profile = create_semantic_rank_profile(config)
>>> profile.name
'semantic'
```

### `create_hybrid_rank_profile(config, profile_name='fusion', base_profile='bm25', embedding_field='embedding', query_tensor='q', fusion_method='rrf', global_rerank_count=1000, first_phase_keep_rank_count=None)` Create a hybrid ranking profile combining BM25 and semantic search.
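For the default `fusion_method='rrf'`, reciprocal rank fusion combines per-profile ranks rather than raw scores. A standalone sketch of the formula, using the conventional constant k=60 (an assumption here, not necessarily the value Vespa uses):

```python
def rrf(rankings, k=60):
    """Fuse ranked doc-id lists: score(d) = sum over lists of 1 / (k + rank_d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_order = ["d1", "d2", "d3"]      # hypothetical BM25 ranking
semantic_order = ["d3", "d1", "d4"]  # hypothetical semantic ranking
print(rrf([bm25_order, semantic_order]))  # ['d1', 'd3', 'd2', 'd4']
```

Documents ranked well by both profiles (here `d1` and `d3`) rise above documents that appear in only one list.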
Parameters: | Name | Type | Description | Default | | ----------------------------- | --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------- | | `config` | `ModelConfig` | ModelConfig instance with model parameters | *required* | | `profile_name` | `str` | Name of the rank profile (default: "fusion") | `'fusion'` | | `base_profile` | `str` | Name of the BM25 profile to inherit from (default: "bm25") | `'bm25'` | | `embedding_field` | `str` | Name of the embedding field (default: "embedding") | `'embedding'` | | `query_tensor` | `str` | Name of the query tensor (default: "q") | `'q'` | | `fusion_method` | `FusionMethod` | Fusion method - "rrf" for reciprocal rank fusion, "atan_norm" for atan-normalized sum in first phase, or "norm_linear" for linear normalization in global phase. | `'rrf'` | | `global_rerank_count` | `int` | Number of hits to rerank in global phase (default: 1000) | `1000` | | `first_phase_keep_rank_count` | `Optional[int]` | How many documents to keep the first phase top rank values for (default: None, uses Vespa default of 10000) | `None` | Returns: | Name | Type | Description | | ------------- | ------------- | ------------------------------------------------ | | `RankProfile` | `RankProfile` | A Vespa RankProfile configured for hybrid search | Example

```python
>>> config = ModelConfig(model_id="e5-small-v2", embedding_dim=384)
>>> profile = create_hybrid_rank_profile(config)
>>> profile.name
'fusion'
```

### `get_model_config(model_name)` Get a predefined model configuration by name.
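Conceptually this is a dictionary lookup over a table of predefined configurations. A standalone sketch with placeholder entries (the real COMMON_MODELS registry holds full ModelConfig instances, not plain dicts):

```python
# Hypothetical subset of the predefined-model table, for illustration only.
COMMON_MODELS_SKETCH = {
    "e5-small-v2": {"embedding_dim": 384},
}

def sketch_get_model_config(model_name: str) -> dict:
    """Look up a predefined configuration; raise KeyError for unknown names."""
    if model_name not in COMMON_MODELS_SKETCH:
        raise KeyError(f"Unknown model: {model_name}")
    return COMMON_MODELS_SKETCH[model_name]

print(sketch_get_model_config("e5-small-v2")["embedding_dim"])  # 384
```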
Parameters: | Name | Type | Description | Default | | ------------ | ----- | -------------------------- | ---------- | | `model_name` | `str` | Name of a predefined model | *required* | Returns: | Name | Type | Description | | ------------- | ------------- | ----------------------- | | `ModelConfig` | `ModelConfig` | The model configuration | Raises: | Type | Description | | ---------- | ------------------------------ | | `KeyError` | If the model name is not found | Example

```python
>>> config = get_model_config("e5-small-v2")
>>> config.embedding_dim
384
```

### `list_models()` List all available predefined model configurations. Returns: | Type | Description | | ----------- | ------------------------------------------------------------ | | `List[str]` | List of model names that can be used with get_model_config() | Example

```python
>>> models = list_models()
>>> 'e5-small-v2' in models
True
>>> 'nomic-ai-modernbert' in models
True
```

### `create_hybrid_package(models, app_name='hybridapp', schema_name='doc', global_rerank_count=1000)` Create a Vespa application package configured for hybrid search evaluation. This function creates a complete Vespa application package with all necessary components, fields, and rank profiles for evaluation. It supports single or multiple embedding models, automatically handling naming conflicts by using model-specific field and component names. Parameters: | Name | Type | Description | Default | | --------------------- | --------------------------------------------------------------------------- | --------------------------------------------- | ------------- | | `models` | `Union[str, ModelConfig, List[Union[str, ModelConfig]], List[ModelConfig]]` | Single model or list of models to configure.
Each can be: - A string model name (e.g., "e5-small-v2") to use a predefined config - A ModelConfig instance for custom configuration | *required* | | `app_name` | `str` | Name of the application (default: "hybridapp") | `'hybridapp'` | | `schema_name` | `str` | Name of the schema (default: "doc") | `'doc'` | | `global_rerank_count` | `int` | Number of hits to rerank in global phase (default: 1000) | `1000` | Returns: | Name | Type | Description | | -------------------- | -------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `ApplicationPackage` | `ApplicationPackageWithQueryFunctions` | Configured Vespa application package with: - Components for each embedding model - Embedding fields for each model (named "embedding" for single model, "embedding\_{component_id}" for multiple models) - BM25 and semantic rank profiles for each model - Hybrid rank profiles (RRF, atan_norm, norm_linear) for each model - A match-only rank profile for baseline evaluation | Raises: | Type | Description | | ------------ | --------------------------------------------- | | `ValueError` | If models list is empty | | `KeyError` | If a model name is not found in COMMON_MODELS | Example

```python
>>> # Single model by name
>>> package = create_hybrid_package("e5-small-v2")
>>> len(package.components)
1
>>> package.schema.document.fields[2].name
'embedding'

>>> # Single model with custom config
>>> config = ModelConfig(model_id="my-model", embedding_dim=512)
>>> package = create_hybrid_package(config)
>>> package.schema.document.fields[2].name
'embedding'

>>> # Multiple models - creates separate fields and profiles for each
>>> package = create_hybrid_package(["e5-small-v2", "e5-base-v2"])
>>> len(package.components)
2

>>> # Fields will be named: embedding_e5_small_v2, embedding_e5_base_v2
>>> field_names = [f.name for f in package.schema.document.fields if f.name.startswith('embedding')]
>>> len(field_names)
2

>>> # Multiple models with mixed configs
>>> custom = ModelConfig(model_id="custom-model", embedding_dim=384)
>>> package = create_hybrid_package(["e5-small-v2", custom])
>>> len(package.components)
2
```

## `vespa.package` ### `VT(tag, cs, attrs=None, void_=False, replace_underscores=True, **kwargs)` A 'Vespa Tag' structure, containing `tag`, `children`, and `attrs` #### `sanitize_tag_name(tag)` Convert invalid tag names (with '-') to valid Python identifiers (with '\_') #### `restore_tag_name()` Restore sanitized tag names back to the original names for XML generation ### `Summary(name=None, type=None, fields=None, select_elements_by=None)` Bases: `object` Configures a summary field. Parameters: | Name | Type | Description | Default | | -------------------- | ------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | | `name` | `str` | The name of the summary field. Can be None if used inside a Field, which then uses the name of the Field. | `None` | | `type` | `str` | The type of the summary field. Can be None if used inside a Field, which then uses the type of the Field. | `None` | | `fields` | `list` | A list of properties used to configure the summary. These can be single properties (like "summary: dynamic", common in Field), or composite values (like "source: another_field"). | `None` | | `select_elements_by` | `str` | The name of a function that determines which elements to include in the summary.
| `None` | Example

```py
>>> Summary(None, None, ["dynamic"])
Summary(None, None, ['dynamic'])
>>> Summary("title", "string", [("source", "title")])
Summary('title', 'string', [('source', 'title')])
>>> Summary("title", "string", [("source", ["title", "abstract"])])
Summary('title', 'string', [('source', ['title', 'abstract'])])
>>> Summary(name="artist", type="string")
Summary('artist', 'string', None)
>>> Summary(None, None, None, best_chunks)
```

#### `as_lines` Returns the object as a list of strings, where each string represents a line of configuration that can be used during schema generation as shown below: Example usage

```text
{% for line in field.summary.as_lines %}
{{ line }}
{% endfor %}
```

Example

```python
Summary(None, None, ["dynamic"]).as_lines
['summary: dynamic']
```

```python
Summary("artist", "string").as_lines
['summary artist type string {}']
```

```python
Summary("artist", "string", [("bolding", "on"), ("sources", "artist")]).as_lines
['summary artist type string {', ' bolding: on', ' sources: artist', '}']
```

```python
Summary(None, None, None, "best_chunks").as_lines
['summary {', ' select-elements-by: best_chunks', '}']
```

### `HNSW(distance_metric='euclidean', max_links_per_node=16, neighbors_to_explore_at_insert=200)` Bases: `object` Configures Vespa HNSW indexes. For more information, check the [Vespa documentation](https://docs.vespa.ai/en/approximate-nn-hnsw.html). Parameters: | Name | Type | Description | Default | | -------------------------------- | ----- | ---------------------------------------------------------------------------------------------------- | ------------- | | `distance_metric` | `str` | The distance metric to use when computing distance between vectors. Default is 'euclidean'. | `'euclidean'` | | `max_links_per_node` | `int` | Specifies how many links per HNSW node to select when building the graph. Default is 16.
| `16` | | `neighbors_to_explore_at_insert` | `int` | Specifies how many neighbors to explore when inserting a document in the HNSW graph. Default is 200. | `200` | ### `StructField(name, **kwargs)` Create a Vespa struct-field. For more detailed information about struct-fields, check the [Vespa documentation](https://docs.vespa.ai/en/reference/schema-reference.html#struct-field). Parameters: | Name | Type | Description | Default | | --------------- | --------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- | | `name` | `str` | The name of the struct-field. | *required* | | `indexing` | `list, tuple, or str` | Configures how to process data of a struct-field during indexing. - Tuple: renders as indexing { value1; value2; ... } block with each item on a new line, and semicolon at the end. - List: renders as indexing: value1 \| value2 | *required* | | `attribute` | `list` | Specifies a property of an index structure attribute. | *required* | | `match` | `list` | Set properties that decide how the matching method for this field operates. | *required* | | `query_command` | `list` | Add configuration for the query-command of the field. | *required* | | `summary` | `Summary` | Add configuration for the summary of the field. | *required* | | `rank` | `str` | Specifies the property that defines ranking calculations done for a field.
| *required* | Example ```python StructField( name = "first_name", ) StructField('first_name', None, None, None, None, None, None) ``` ```python StructField( name = "first_name", indexing = ["attribute"], attribute = ["fast-search"], ) StructField('first_name', ['attribute'], ['fast-search'], None, None, None, None) ``` ```python StructField( name = "last_name", match = ["exact", ("exact-terminator", '"@%"')], query_command = ['"exact %%"'], summary = Summary(None, None, fields=["dynamic", ("bolding", "on")]) ) StructField('last_name', None, None, ['exact', ('exact-terminator', '"@%"')], ['"exact %%"'], Summary(None, None, ['dynamic', ('bolding', 'on')]), None) ``` ```python StructField( name = "first_name", indexing = ["attribute"], attribute = ["fast-search"], rank = "filter", ) StructField('first_name', ['attribute'], ['fast-search'], None, None, None, 'filter') ``` ```python StructField( name = "complex_field", indexing = ('"preprocessing"', ["attribute", "summary"]), attribute = ["fast-search"], ) StructField('complex_field', ('"preprocessing"', ['attribute', 'summary']), ['fast-search'], None, None, None, None) ``` #### `indexing_as_multiline` Generate multiline indexing statements for tuple-based indexing. ### `FieldConfiguration` Bases: `TypedDict` alias (list[str]): Add alias to the field. Use the format "component: component_alias" to add an alias to a field's component. See [Vespa documentation](https://docs.vespa.ai/en/reference/schema-reference.html#uri) for an example. ### `Field(name, type, indexing=None, index=None, attribute=None, ann=None, match=None, weight=None, bolding=None, summary=None, is_document_field=True, **kwargs)` Bases: `object` Create a Vespa field. For more detailed information about fields, check the [Vespa documentation](https://docs.vespa.ai/en/reference/schema-reference.html#field). 
Once we have an `ApplicationPackage` instance containing a `Schema` and a `Document`, we usually want to add fields so that we can store our data in a structured manner. We can accomplish that by creating `Field` instances and adding those to the `ApplicationPackage` instance via `Schema` and `Document` methods. Index Configuration Behavior - Single string configuration: uses `index: value` syntax - Single dict or multiple configurations: uses `index { ... }` block syntax - All configurations in a list are consolidated into a single index block Parameters: | Name | Type | Description | Default | | ------------------- | --------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------- | | `name` | `str` | The name of the field. | *required* | | `type` | `str` | The data type of the field. | *required* | | `indexing` | `list, tuple, or str` | Configures how to process data of a field during indexing. - Tuple: renders as indexing { value1; value2; ... } block with each item on a new line, and semicolon at the end. - List: renders as indexing: value1 | value2 | | `index` | `str, dict, or list` | Sets index parameters. - Single string (e.g., "enable-bm25"): renders as index: enable-bm25 - Single dict (e.g., {"arity": 2}): renders as index { arity: 2 } - List with multiple items: renders as single index { ... } block containing all configurations Fields with index are normalized and tokenized by default. | `None` | | `attribute` | `list` | Specifies a property of an index structure attribute. | `None` | | `ann` | `HNSW` | Add configuration for approximate nearest neighbor. | `None` | | `match` | `list` | Set properties that decide how the matching method for this field operates. 
| `None` | | `weight` | `int` | Sets the weight of the field, used when calculating rank scores. | `None` | | `bolding` | `bool` | Whether to highlight matching query terms in the summary. | `None` | | `summary` | `Summary` | Add configuration for the summary of the field. | `None` | | `is_document_field` | `bool` | Whether the field is a document field or part of the schema. Default is True. | `True` | | `stemming` | `str` | Add configuration for stemming of the field. | *required* | | `rank` | `str` | Add configuration for ranking calculations of the field. | *required* | | `query_command` | `list` | Add configuration for query-command of the field. | *required* | | `struct_fields` | `list` | Add struct-fields to the field. | *required* | | `alias` | `list` | Add alias to the field. Use the format "component: component_alias" to add an alias to a field's component. See Vespa documentation for an example. | *required* | Example ```python Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25") Field('title', 'string', ['index', 'summary'], 'enable-bm25', None, None, None, None, None, None, True, None, None, None, [], None) ``` ```python Field( name = "title", type = "array", indexing = ('"en"', ["index", "summary"]), ) Field('title', 'array', ('"en"', ['index', 'summary']), None, None, None, None, None, None, None, True, None, None, None, [], None) ``` ```python Field( name = "abstract", type = "string", indexing = ["attribute"], attribute=["fast-search", "fast-access"] ) Field('abstract', 'string', ['attribute'], None, ['fast-search', 'fast-access'], None, None, None, None, None, True, None, None, None, [], None) ``` ```python Field(name="tensor_field", type="tensor(x[128])", indexing=["attribute"], ann=HNSW( distance_metric="euclidean", max_links_per_node=16, neighbors_to_explore_at_insert=200, ), ) Field('tensor_field', 'tensor(x[128])', ['attribute'], None, None, HNSW('euclidean', 16, 200), None, None, None, None, True, 
None, None, None, [], None) ``` ```python Field( name = "abstract", type = "string", match = ["exact", ("exact-terminator", '"@%"',)], ) Field('abstract', 'string', None, None, None, None, ['exact', ('exact-terminator', '"@%"')], None, None, None, True, None, None, None, [], None) ``` ```python Field( name = "abstract", type = "string", weight = 200, ) Field('abstract', 'string', None, None, None, None, None, 200, None, None, True, None, None, None, [], None) ``` ```python Field( name = "abstract", type = "string", bolding = True, ) Field('abstract', 'string', None, None, None, None, None, None, True, None, True, None, None, None, [], None) ``` ```python Field( name = "abstract", type = "string", summary = Summary(None, None, ["dynamic", ["bolding", "on"]]), ) Field('abstract', 'string', None, None, None, None, None, None, None, Summary(None, None, ['dynamic', ['bolding', 'on']]), True, None, None, None, [], None) ``` ```python Field( name = "abstract", type = "string", stemming = "shortest", ) Field('abstract', 'string', None, None, None, None, None, None, None, None, True, 'shortest', None, None, [], None) ``` ```python Field( name = "abstract", type = "string", rank = "filter", ) Field('abstract', 'string', None, None, None, None, None, None, None, None, True, None, 'filter', None, [], None) ``` ```python Field( name = "abstract", type = "string", query_command = ['"exact %%"'], ) Field('abstract', 'string', None, None, None, None, None, None, None, None, True, None, None, ['"exact %%"'], [], None) ``` ```python Field( name = "abstract", type = "string", struct_fields = [ StructField( name = "first_name", indexing = ["attribute"], attribute = ["fast-search"], ), ], ) Field('abstract', 'string', None, None, None, None, None, None, None, None, True, None, None, None, [StructField('first_name', ['attribute'], ['fast-search'], None, None, None, None)], None) ``` ```python Field( name = "artist", type = "string", alias = ["artist_name", "component: component_alias"], 
) Field('artist', 'string', None, None, None, None, None, None, None, None, True, None, None, None, [], ['artist_name', 'component: component_alias']) ``` ```python # Single string index - uses simple syntax Field(name = "title", type = "string", index = "enable-bm25") # Renders as: index: enable-bm25 ``` ```python # Single dict index - uses block syntax Field(name = "predicate_field", type = "predicate", index = {"arity": 2}) # Renders as: index { arity: 2 } ``` ```python # Multiple string indices - uses block syntax Field(name = "multi", type = "string", index = ["enable-bm25", "another-setting"]) # Renders as: index { enable-bm25; another-setting } ``` ```python # Complex index configurations with multiple parameters Field( name = "predicate_field", type = "predicate", indexing = ["attribute"], index = { "arity": 2, "lower-bound": 3, "upper-bound": 200, "dense-posting-list-threshold": 0.25 } ) # Renders as: index { arity: 2; lower-bound: 3; upper-bound: 200; dense-posting-list-threshold: 0.25 } ``` ```python # Multiple index configurations with mixed types Field( name = "complex_field", type = "string", indexing = ["index", "summary"], index = [ "enable-bm25", # Simple index setting {"arity": 2, "lower-bound": 3}, # Complex index block "another-setting" # Another simple setting ] ) # Renders as single block: # index { # enable-bm25 # arity: 2 # lower-bound: 3 # another-setting # } ``` ```python # Parameterless index settings using None values Field( name = "taxonomy", type = "array", indexing = ["index", "summary"], match = ["text"], index = {"enable-bm25": None} ) # Renders as: index { enable-bm25 } (without ": None") ``` #### `indexing_as_multiline` Generate multiline indexing statements for tuple-based indexing. #### `index_configurations` Returns index configurations as a list, normalizing single values to lists. This allows the template to consistently iterate over index configurations. 
#### `use_simple_index_syntax` Returns True if we should use simple 'index: value' syntax. Simple syntax is used only when there's exactly one string configuration. Otherwise, we use the block syntax 'index { ... }'. #### `add_struct_fields(*struct_fields)` Add `StructField` objects to the `Field`. Parameters: | Name | Type | Description | Default | | --------------- | ------ | ------------------------------------------ | ------- | | `struct_fields` | `list` | A list of StructField objects to be added. | `()` | ### `ImportedField(name, reference_field, field_to_import)` Bases: `object` Imported field from a reference document. Useful to implement [parent/child relationships](https://docs.vespa.ai/en/parent-child.html). Parameters: | Name | Type | Description | Default | | ----------------- | ----- | --------------------------------------------------------------------------------------------- | ---------- | | `name` | `str` | Field name. | *required* | | `reference_field` | `str` | A field of type reference that points to the document that contains the field to be imported. | *required* | | `field_to_import` | `str` | Field name to be imported, as defined in the reference document. | *required* | Example ```python ImportedField( name="global_category_ctrs", reference_field="category_ctr_ref", field_to_import="ctrs", ) ImportedField('global_category_ctrs', 'category_ctr_ref', 'ctrs') ``` ### `Struct(name, fields=None)` Bases: `object` Create a Vespa struct. A struct defines a composite type. Check the [Vespa documentation](https://docs.vespa.ai/en/reference/schema-reference.html#struct) for more detailed information about structs. Parameters: | Name | Type | Description | Default | | -------- | ------ | ----------------------------------------------------- | ---------- | | `name` | `str` | Name of the struct. | *required* | | `fields` | `list` | List of Field objects to be included in the struct.
| `None` | Example ```python Struct("person") Struct('person', None) Struct( "person", [ Field("first_name", "string"), Field("last_name", "string"), ], ) Struct('person', [Field('first_name', 'string', None, None, None, None, None, None, None, None, True, None, None, None, [], None), Field('last_name', 'string', None, None, None, None, None, None, None, None, True, None, None, None, [], None)]) ``` ### `DocumentSummary(name, inherits=None, summary_fields=None, from_disk=None, omit_summary_features=None)` Bases: `object` Create a Document Summary. Check the [Vespa documentation](https://docs.vespa.ai/en/reference/schema-reference.html#document-summary) for more detailed information about document-summary. Parameters: | Name | Type | Description | Default | | ----------------------- | ------ | ----------------------------------------------------------------------------- | ---------- | | `name` | `str` | Name of the document-summary. | *required* | | `inherits` | `str` | Name of another document-summary from which this inherits. | `None` | | `summary_fields` | `list` | List of Summary objects used in this document-summary. | `None` | | `from_disk` | `bool` | Marks this document-summary as accessing fields on disk. | `None` | | `omit_summary_features` | `bool` | Specifies that summary-features should be omitted from this document summary. 
| `None` | Example ```python DocumentSummary( name="document-summary", ) DocumentSummary('document-summary', None, None, None, None) DocumentSummary( name="which-inherits", inherits="base-document-summary", ) DocumentSummary('which-inherits', 'base-document-summary', None, None, None) DocumentSummary( name="with-field", summary_fields=[Summary("title", "string", [("source", "title")])] ) DocumentSummary('with-field', None, [Summary('title', 'string', [('source', 'title')])], None, None) DocumentSummary( name="with-bools", from_disk=True, omit_summary_features=True, ) DocumentSummary('with-bools', None, None, True, True) ``` ### `Document(fields=None, inherits=None, structs=None)` Bases: `object` Create a Vespa Document. Check the [Vespa documentation](https://docs.vespa.ai/en/documents.html) for more detailed information about documents. Parameters: | Name | Type | Description | Default | | ---------- | ------ | ------------------------------------------------------------ | ------- | | `fields` | `list` | A list of Field objects to include in the document's schema. | `None` | | `inherits` | `str` | Name of another document that this document inherits from. | `None` | | `structs` | `list` | A list of Struct objects defining composite types used in the document. | `None` | Example ```python Document() Document(None, None, None) Document(fields=[Field(name="title", type="string")]) Document([Field('title', 'string', None, None, None, None, None, None, None, None, True, None, None, None, [], None)], None, None) Document(fields=[Field(name="title", type="string")], inherits="context") Document([Field('title', 'string', None, None, None, None, None, None, None, None, True, None, None, None, [], None)], 'context', None) ``` #### `add_fields(*fields)` Add `Field` objects to the document. Parameters: | Name | Type | Description | Default | | -------- | ------ | ------------------- | ------- | | `fields` | `list` | Fields to be added. | `()` | Returns: | Type | Description | | ------ | ----------- | | `None` | None | #### `add_structs(*structs)` Add `Struct` objects to the document.
Parameters: | Name | Type | Description | Default | | --------- | ------ | -------------------- | ------- | | `structs` | `list` | Structs to be added. | `()` | Returns: | Type | Description | | ------ | ----------- | | `None` | None | ### `FieldSet(name, fields)` Bases: `object` Create a Vespa field set. A fieldset groups fields together for searching. Check the [Vespa documentation](https://docs.vespa.ai/en/reference/schema-reference.html#fieldset) for more detailed information about field sets. Parameters: | Name | Type | Description | Default | | -------- | ------ | ------------------------------------------- | ---------- | | `name` | `str` | Name of the fieldset. | *required* | | `fields` | `list` | Field names to be included in the fieldset. | *required* | Returns: | Name | Type | Description | | ---------- | ------ | --------------------- | | `FieldSet` | `None` | A field set instance. | Example ```python FieldSet(name="default", fields=["title", "body"]) FieldSet('default', ['title', 'body']) ``` ### `Function(name, expression, args=None)` Bases: `object` Create a Vespa rank function. Define a named function that can be referenced as a part of the ranking expression, or (if it takes no arguments) as a feature. Check the [Vespa documentation](https://docs.vespa.ai/en/reference/schema-reference.html#function-rank) for more detailed information about rank functions. Parameters: | Name | Type | Description | Default | | ------------ | ------ | -------------------------------------------------------------------------- | ---------- | | `name` | `str` | Name of the function. | *required* | | `expression` | `str` | String representing a Vespa expression. | *required* | | `args` | `list` | List of arguments to be used in the function expression. Defaults to None. | `None` | Returns: | Name | Type | Description | | ---------- | ------ | ------------------------- | | `Function` | `None` | A rank function instance.
| Example ```python Function( name="myfeature", expression="fieldMatch(bar) + freshness(foo)", args=["foo", "bar"] ) Function('myfeature', 'fieldMatch(bar) + freshness(foo)', ['foo', 'bar']) ``` It is possible to define functions with multi-line expressions: ```python Function( name="token_type_ids", expression="tensor(d0[1],d1[128])(\n" " if (d1 < question_length,\n" " 0,\n" " if (d1 < question_length + doc_length,\n" " 1,\n" " TOKEN_NONE\n" " )))", ) Function('token_type_ids', 'tensor(d0[1],d1[128])(\n if (d1 < question_length,\n 0,\n if (d1 < question_length + doc_length,\n 1,\n TOKEN_NONE\n )))', None) ``` ### `FirstPhaseRanking(expression, keep_rank_count=None, rank_score_drop_limit=None)` Create a Vespa first phase ranking configuration. This is the initial ranking performed on all matching documents. Check the [Vespa documentation](https://docs.vespa.ai/en/reference/schema-reference.html#firstphase-rank) for more detailed information about first phase ranking configuration. Parameters: | Name | Type | Description | Default | | ----------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------- | ---------- | | `expression` | `str` | Specify the ranking expression to be used for the first phase of ranking. Check also the Vespa documentation for ranking expressions. | *required* | | `keep_rank_count` | `int` | How many documents to keep the first phase top rank values for. Default value is 10000. | `None` | | `rank_score_drop_limit` | `float` | Drop all hits with a first phase rank score less than or equal to this floating point number. | `None` | Returns: | Name | Type | Description | | ------------------- | ------ | --------------------------------------------- | | `FirstPhaseRanking` | `None` | A first phase ranking configuration instance.
| Example ```python FirstPhaseRanking("myFeature * 10") FirstPhaseRanking('myFeature * 10', None, None) FirstPhaseRanking(expression="myFeature * 10", keep_rank_count=50, rank_score_drop_limit=10) FirstPhaseRanking('myFeature * 10', 50, 10) ``` ### `SecondPhaseRanking(expression, rerank_count=100, rank_score_drop_limit=None)` Bases: `object` Create a Vespa second phase ranking configuration. This is the optional reranking performed on the best hits from the first phase. Check the [Vespa documentation](https://docs.vespa.ai/en/reference/schema-reference.html#secondphase-rank) for more detailed information about second phase ranking configuration. Parameters: | Name | Type | Description | Default | | ----------------------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------- | ---------- | | `expression` | `str` | Specify the ranking expression to be used for the second phase of ranking. Check also the Vespa documentation for ranking expressions. | *required* | | `rerank_count` | `int` | Specifies the number of hits to be reranked in the second phase. Default value is 100. | `100` | | `rank_score_drop_limit` | `float` | Drop all hits with a second phase rank score less than or equal to this floating point number. | `None` | Returns: | Name | Type | Description | | -------------------- | ------ | ---------------------------------------------- | | `SecondPhaseRanking` | `None` | A second phase ranking configuration instance.
| Example ```python SecondPhaseRanking(expression="1.25 * bm25(title) + 3.75 * bm25(body)", rerank_count=10) SecondPhaseRanking('1.25 * bm25(title) + 3.75 * bm25(body)', 10, None) SecondPhaseRanking(expression="1.25 * bm25(title) + 3.75 * bm25(body)", rerank_count=10, rank_score_drop_limit=5) SecondPhaseRanking('1.25 * bm25(title) + 3.75 * bm25(body)', 10, 5) ``` ### `GlobalPhaseRanking(expression, rerank_count=100, rank_score_drop_limit=None)` Bases: `object` Create a Vespa global phase ranking configuration. This is the optional reranking performed on the best hits from the content nodes phase(s). Check the [Vespa documentation](https://docs.vespa.ai/en/reference/schema-reference.html#globalphase-rank) for more detailed information about global phase ranking configuration. Parameters: | Name | Type | Description | Default | | ----------------------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------- | ---------- | | `expression` | `str` | Specify the ranking expression to be used for the global phase of ranking. Check also the Vespa documentation for ranking expressions. | *required* | | `rerank_count` | `int` | Specifies the number of hits to be reranked in the global phase. Default value is 100. | `100` | | `rank_score_drop_limit` | `float` | Drop all hits with a global phase rank score less than or equal to this floating point number. | `None` | Returns: | Name | Type | Description | | -------------------- | ------ | ---------------------------------------------- | | `GlobalPhaseRanking` | `None` | A global phase ranking configuration instance.
| Example ```python GlobalPhaseRanking(expression="1.25 * bm25(title) + 3.75 * bm25(body)", rerank_count=10) GlobalPhaseRanking('1.25 * bm25(title) + 3.75 * bm25(body)', 10, None) GlobalPhaseRanking(expression="1.25 * bm25(title) + 3.75 * bm25(body)", rerank_count=10, rank_score_drop_limit=5) GlobalPhaseRanking('1.25 * bm25(title) + 3.75 * bm25(body)', 10, 5) ``` ### `Mutate(on_match, on_first_phase, on_second_phase, on_summary)` Bases: `object` Enable mutating operations in rank profiles. Check the [Vespa documentation](https://docs.vespa.ai/en/reference/schema-reference.html#mutate) for more detailed information about mutable attributes. Parameters: | Name | Type | Description | Default | | ----------------- | ------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- | | `on_match` | `dict` | Dictionary for the on-match phase containing 3 mandatory keys: - attribute: name of the mutable attribute to mutate. - operation_string: operation to perform on the mutable attribute. - operation_value: number to set, add, or subtract to/from the current value of the mutable attribute. | *required* | | `on_first_phase` | `dict` | Dictionary for the on-first-phase phase containing 3 mandatory keys: - attribute: name of the mutable attribute to mutate. - operation_string: operation to perform on the mutable attribute. - operation_value: number to set, add, or subtract to/from the current value of the mutable attribute. | *required* | | `on_second_phase` | `dict` | Dictionary for the on-second-phase phase containing 3 mandatory keys: - attribute: name of the mutable attribute to mutate. - operation_string: operation to perform on the mutable attribute.
- operation_value: number to set, add, or subtract to/from the current value of the mutable attribute. | *required* | | `on_summary` | `dict` | Dictionary for the on-summary phase containing 3 mandatory keys: - attribute: name of the mutable attribute to mutate. - operation_string: operation to perform on the mutable attribute. - operation_value: number to set, add, or subtract to/from the current value of the mutable attribute. | *required* | Example ```python Mutate( on_match={ 'attribute': 'popularity', 'operation_string': 'add', 'operation_value': 5 }, on_first_phase={ 'attribute': 'score', 'operation_string': 'subtract', 'operation_value': 3 } ) Mutate({'attribute': 'popularity', 'operation_string': 'add', 'operation_value': 5}, {'attribute': 'score', 'operation_string': 'subtract', 'operation_value': 3}) ``` ### `Diversity(attribute, min_groups)` Bases: `object` Create a Vespa ranking diversity configuration. This is an optional config that is used to guarantee diversity in the different query phases. Check the [Vespa documentation](https://docs.vespa.ai/en/reference/schema-reference.html#diversity) for more detailed information about diversity configuration. Parameters: | Name | Type | Description | Default | | ------------ | ----- | ------------------------------------------------------------------------------------------------------------------------ | ---------- | | `attribute` | `str` | Which attribute to use when deciding diversity. The attribute must be a single-valued numeric, string or reference type. | *required* | | `min_groups` | `int` | Specifies the minimum number of groups returned from the phase. | *required* | Returns: | Name | Type | Description | | ----------- | ------ | ------------------------------------------- | | `Diversity` | `None` | A ranking diversity configuration instance.
| Example ```python Diversity(attribute="popularity", min_groups=5) Diversity('popularity', 5) ``` ### `MatchPhaseRanking(attribute, order, max_hits)` Bases: `object` Create a Vespa match phase ranking configuration. This is an optional phase that can be used to quickly select a subset of hits for further ranking. Check the [Vespa documentation](https://docs.vespa.ai/en/reference/schema-reference.html#match-phase) for more detailed information about match phase ranking configuration. Parameters: | Name | Type | Description | Default | | ----------- | ----- | --------------------------------------------------- | ---------- | | `attribute` | `str` | The numeric attribute to use for filtering. | *required* | | `order` | `str` | The sort order, either "ascending" or "descending". | *required* | | `max_hits` | `int` | Maximum number of hits to pass to the next phase. | *required* | Example ```python MatchPhaseRanking(attribute="popularity", order="descending", max_hits=1000) MatchPhaseRanking('popularity', 'descending', 1000) ``` ### `RankProfile(name, first_phase=None, inherits=None, constants=None, functions=None, summary_features=None, match_features=None, second_phase=None, global_phase=None, match_phase=None, num_threads_per_search=None, diversity=None, **kwargs)` Bases: `object` Create a Vespa rank profile. Rank profiles are used to specify an alternative ranking of the same data for different purposes, and to experiment with new rank settings. Check the [Vespa documentation](https://docs.vespa.ai/en/reference/schema-reference.html#rank-profile) for more detailed information about rank profiles. Parameters: | Name | Type | Description | Default | | ------------------------ | --------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- | | `name` | `str` | Rank profile name.
| *required* | | `first_phase` | `str` | The config specifying the first phase of ranking. More info about first phase ranking. | `None` | | `inherits` | `str` | The inherits attribute is optional. If defined, it contains the name of another rank profile in the same schema. Values not defined in this rank profile will then be inherited. | `None` | | `constants` | `dict` | Dict of constants available in ranking expressions, resolved and optimized at configuration time. More info about constants. | `None` | | `functions` | `list` | List of Function objects representing rank functions to be included in the rank profile. | `None` | | `summary_features` | `list` | List of rank features to be included with each hit. More info about summary features. | `None` | | `match_features` | `list` | List of rank features to be included with each hit. More info about match features. | `None` | | `second_phase` | `SecondPhaseRanking` | Config specifying the second phase of ranking. See SecondPhaseRanking. | `None` | | `global_phase` | `GlobalPhaseRanking` | Config specifying the global phase of ranking. See GlobalPhaseRanking. | `None` | | `match_phase` | `MatchPhaseRanking` | Config specifying the match phase of ranking. See MatchPhaseRanking. | `None` | | `num_threads_per_search` | `int` | Overrides the global num-threads-per-search value for this rank profile to a lower value. | `None` | | `diversity` | `Optional[Diversity]` | Optional config specifying the diversity of ranking. | `None` | | `weight` | `list` | A list of tuples containing a field and its weight. | *required* | | `rank_type` | `list` | A list of tuples containing a field and the rank-type-name. More info about rank-type. | *required* | | `rank_properties` | `list` | A list of tuples containing a field and its configuration. More info about rank-properties. | *required* | | `mutate` | `Mutate` | A Mutate object containing attributes to mutate on, mutation operation, and value. More info about mutate operation.
| *required* | Example ```python RankProfile(name = "default", first_phase = "nativeRank(title, body)") RankProfile('default', 'nativeRank(title, body)', None, None, None, None, None, None, None, None, None, None, None, None, None) RankProfile(name = "new", first_phase = "bm25(title)", inherits = "default") RankProfile('new', 'bm25(title)', 'default', None, None, None, None, None, None, None, None, None, None, None, None) RankProfile( name = "new", first_phase = "bm25(title)", inherits = "default", constants={"TOKEN_NONE": 0, "TOKEN_CLS": 101, "TOKEN_SEP": 102}, summary_features=["bm25(title)"] ) RankProfile('new', 'bm25(title)', 'default', {'TOKEN_NONE': 0, 'TOKEN_CLS': 101, 'TOKEN_SEP': 102}, None, ['bm25(title)'], None, None, None, None, None, None, None, None, None) RankProfile( name="bert", first_phase="bm25(title) + bm25(body)", second_phase=SecondPhaseRanking(expression="1.25 * bm25(title) + 3.75 * bm25(body)", rerank_count=10), inherits="default", constants={"TOKEN_NONE": 0, "TOKEN_CLS": 101, "TOKEN_SEP": 102}, functions=[ Function( name="question_length", expression="sum(map(query(query_token_ids), f(a)(a > 0)))" ), Function( name="doc_length", expression="sum(map(attribute(doc_token_ids), f(a)(a > 0)))" ) ], summary_features=["question_length", "doc_length"] ) RankProfile('bert', 'bm25(title) + bm25(body)', 'default', {'TOKEN_NONE': 0, 'TOKEN_CLS': 101, 'TOKEN_SEP': 102}, [Function('question_length', 'sum(map(query(query_token_ids), f(a)(a > 0)))', None), Function('doc_length', 'sum(map(attribute(doc_token_ids), f(a)(a > 0)))', None)], ['question_length', 'doc_length'], None, SecondPhaseRanking('1.25 * bm25(title) + 3.75 * bm25(body)', 10, None), None, None, None, None, None, None, None) RankProfile( name = "default", first_phase = "nativeRank(title, body)", weight = [("title", 200), ("body", 100)] ) RankProfile('default', 'nativeRank(title, body)', None, None, None, None, None, None, None, None, None, [('title', 200), ('body', 100)], None, None, None) 
RankProfile( name = "default", first_phase = "nativeRank(title, body)", rank_type = [("body", "about")] ) RankProfile('default', 'nativeRank(title, body)', None, None, None, None, None, None, None, None, None, None, [('body', 'about')], None, None) RankProfile( name = "default", first_phase = "nativeRank(title, body)", rank_properties = [("fieldMatch(title).maxAlternativeSegmentations", "10")] ) RankProfile('default', 'nativeRank(title, body)', None, None, None, None, None, None, None, None, None, None, None, [('fieldMatch(title).maxAlternativeSegmentations', '10')], None) RankProfile( name = "default", first_phase = FirstPhaseRanking(expression="nativeRank(title, body)", keep_rank_count=50) ) RankProfile('default', FirstPhaseRanking('nativeRank(title, body)', 50, None), None, None, None, None, None, None, None, None, None, None, None, None, None) RankProfile( name = "default", first_phase = "nativeRank(title, body)", num_threads_per_search = 2 ) RankProfile('default', 'nativeRank(title, body)', None, None, None, None, None, None, None, None, 2, None, None, None, None) ``` ### `OnnxModel(model_name, model_file_path, inputs, outputs)` Bases: `object` Create a Vespa ONNX model config. Vespa has support for advanced ranking models through its tensor API. If you have your model in the ONNX format, Vespa can import the models and use them directly. Check the [Vespa documentation](https://docs.vespa.ai/en/onnx.html) for more detailed information about ONNX models. Parameters: | Name | Type | Description | Default | | ----------------- | ------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- | | `model_name` | `str` | Unique model name to use as an ID when referencing the model. | *required* | | `model_file_path` | `str` | ONNX model file path.
| *required* | | `inputs` | `dict` | Dict mapping the ONNX input names as specified in the ONNX file to valid Vespa inputs. These can be a document field (attribute(field_name)), a query parameter (query(query_param)), a constant (constant(name)), or a user-defined function (function_name). | *required* | | `outputs` | `dict` | Dict mapping the ONNX output names as specified in the ONNX file to the name used in Vespa to specify the output. If omitted, the first output in the ONNX file will be used. | *required* | Example ```python OnnxModel( model_name="bert", model_file_path="bert.onnx", inputs={ "input_ids": "input_ids", "token_type_ids": "token_type_ids", "attention_mask": "attention_mask", }, outputs={"logits": "logits"}, ) OnnxModel('bert', 'bert.onnx', {'input_ids': 'input_ids', 'token_type_ids': 'token_type_ids', 'attention_mask': 'attention_mask'}, {'logits': 'logits'}) ``` ### `Schema(name, document, fieldsets=None, rank_profiles=None, models=None, global_document=False, imported_fields=None, document_summaries=None, mode='index', inherits=None, **kwargs)` Bases: `object` Create a Vespa Schema. Check the [Vespa documentation](https://docs.vespa.ai/en/schemas.html) for more detailed information about schemas. Parameters: | Name | Type | Description | Default | | -------------------- | ---------- | --------------------------------------------------------------------------------- | ---------- | | `name` | `str` | Schema name. | *required* | | `document` | `Document` | Vespa Document associated with the Schema. | *required* | | `fieldsets` | `list` | A list of FieldSet associated with the Schema. | `None` | | `rank_profiles` | `list` | A list of RankProfile associated with the Schema. | `None` | | `models` | `list` | A list of OnnxModel associated with the Schema. | `None` | | `global_document` | `bool` | Set to True to copy the documents to all content nodes. Defaults to False. 
| `False` | | `imported_fields` | `list` | A list of ImportedField defining fields from global documents to be imported. | `None` | | `document_summaries` | `list` | A list of DocumentSummary associated with the schema. | `None` | | `mode` | `str` | Schema mode. Defaults to 'index'. Other options are 'store-only' and 'streaming'. | `'index'` | | `inherits` | `str` | Schema to inherit from. | `None` | | `stemming` | `str` | The default stemming setting. Defaults to 'best'. | `'best'` | Example ```python Schema(name="schema_name", document=Document()) Schema('schema_name', Document(None, None, None), None, None, [], False, None, [], None) ``` #### `add_fields(*fields)` Add `Field` objects to the Schema's `Document`. Parameters: | Name | Type | Description | Default | | -------- | ------- | ------------------------------------------ | ------- | | `fields` | `Field` | Field objects to be added to the Document. | `()` | Example ```python schema.add_fields(Field(name="title", type="string"), Field(name="body", type="string")) ``` #### `add_field_set(field_set)` Add a `FieldSet` to the Schema. Parameters: | Name | Type | Description | Default | | ----------- | ---------- | ---------------------------------------- | ---------- | | `field_set` | `FieldSet` | The field set to be added to the Schema. | *required* | #### `add_rank_profile(rank_profile)` Add a `RankProfile` to the Schema. Parameters: | Name | Type | Description | Default | | -------------- | ------------- | ------------------------------------------- | ---------- | | `rank_profile` | `RankProfile` | The rank profile to be added to the Schema. | *required* | Returns: | Type | Description | | ------ | ----------- | | `None` | None | #### `add_model(model)` Add an `OnnxModel` to the Schema.
Parameters: | Name | Type | Description | Default | | ------- | ----------- | ----------------------------------------- | ---------- | | `model` | `OnnxModel` | The ONNX model to be added to the Schema. | *required* | Returns: | Type | Description | | ------ | ----------- | | `None` | None | #### `add_imported_field(imported_field)` Add an `ImportedField` to the Schema. Parameters: | Name | Type | Description | Default | | ---------------- | --------------- | --------------------------------------------- | ---------- | | `imported_field` | `ImportedField` | The imported field to be added to the Schema. | *required* | #### `add_document_summary(document_summary)` Add a `DocumentSummary` to the Schema. Parameters: | Name | Type | Description | Default | | ------------------ | ----------------- | ----------------------------------------------- | ---------- | | `document_summary` | `DocumentSummary` | The document summary to be added to the Schema. | *required* | ### `QueryTypeField(name, type)` Bases: `object` Create a field to be included in a `QueryProfileType`. Parameters: | Name | Type | Description | Default | | ------ | ----- | ----------- | ---------- | | `name` | `str` | Field name. | *required* | | `type` | `str` | Field type. | *required* | Example ```python QueryTypeField( name="ranking.features.query(title_bert)", type="tensor(x[768])" ) QueryTypeField('ranking.features.query(title_bert)', 'tensor(x[768])') ``` ### `QueryProfileType(fields=None)` Bases: `object` Create a Vespa Query Profile Type. Check the [Vespa documentation](https://docs.vespa.ai/en/query-profiles.html#query-profile-types) for more detailed information about query profile types. An `ApplicationPackage` instance comes with a default `QueryProfile` named `default` that is associated with a `QueryProfileType` named `root`, meaning that you usually do not need to create those yourself, only add fields to them when required. 
Parameters: | Name | Type | Description | Default | | -------- | ---------------------- | ------------------------- | ------- | | `fields` | `list[QueryTypeField]` | A list of QueryTypeField. | `None` | Example ```python QueryProfileType( fields=[ QueryTypeField( name="ranking.features.query(tensor_bert)", type="tensor(x[768])" ) ] ) # Output: QueryProfileType([QueryTypeField('ranking.features.query(tensor_bert)', 'tensor(x[768])')]) ``` #### `add_fields(*fields)` Add `QueryTypeField` objects to the Query Profile Type. Parameters: | Name | Type | Description | Default | | -------- | ---------------- | ------------------- | ------- | | `fields` | `QueryTypeField` | Fields to be added. | `()` | Example ```python query_profile_type = QueryProfileType() query_profile_type.add_fields( QueryTypeField( name="age", type="integer" ), QueryTypeField( name="profession", type="string" ) ) ``` ### `QueryField(name, value)` Bases: `object` Create a field to be included in a `QueryProfile`. Parameters: | Name | Type | Description | Default | | ------- | ----- | ------------ | ---------- | | `name` | `str` | Field name. | *required* | | `value` | `Any` | Field value. | *required* | Example ```python QueryField(name="maxHits", value=1000) # Output: QueryField('maxHits', 1000) ``` ### `QueryProfile(fields=None)` Bases: `object` Create a Vespa Query Profile. Check the [Vespa documentation](https://docs.vespa.ai/en/query-profiles.html) for more detailed information about query profiles. A `QueryProfile` is a named collection of query request parameters given in the configuration. The query request can specify a query profile whose parameters will be used as parameters of that request. The query profiles may optionally be type-checked. Type checking is turned on by referencing a `QueryProfileType` from the query profile. 
Parameters: | Name | Type | Description | Default | | -------- | ------------------ | --------------------- | ------- | | `fields` | `list[QueryField]` | A list of QueryField. | `None` | Example ```python QueryProfile(fields=[QueryField(name="maxHits", value=1000)]) # Output: QueryProfile([QueryField('maxHits', 1000)]) ``` #### `add_fields(*fields)` Add `QueryField` objects to the Query Profile. Parameters: | Name | Type | Description | Default | | -------- | ------------ | ------------------- | ------- | | `fields` | `QueryField` | Fields to be added. | `()` | Example ```python query_profile = QueryProfile() query_profile.add_fields(QueryField(name="maxHits", value=1000)) ``` ### `ApplicationConfiguration(name, value)` Bases: `object` Create a generic application configuration. Check the [Config documentation](https://docs.vespa.ai/en/reference/services.html#generic-config) for more detailed information about generic configuration. Parameters: | Name | Type | Description | Default | | ------- | ------------- | ---------------------------------------------------------------- | ---------- | | `name` | `str` | Configuration name. | *required* | | `value` | `str \| dict` | Either a string or a dictionary (which may be nested) of values. | *required* | Example ```python ApplicationConfiguration( name="container.handler.observability.application-userdata", value={"version": "my-version"} ) # Output: ApplicationConfiguration(name="container.handler.observability.application-userdata") ``` ### `Parameter(name, args=None, children=None)` Bases: `object` Create a Vespa Component configuration parameter. Parameters: | Name | Type | Description | Default | | ---------- | ------------------------ | -------------------- | ---------- | | `name` | `str` | Parameter name. | *required* | | `args` | `Any` | Parameter arguments. | `None` | | `children` | `str \| list[Parameter]` | Parameter children.
Can be either a string or a list of Parameter objects for nested configs. | `None` | ### `AuthClient(id, permissions, parameters=None)` Bases: `object` Create a Vespa AuthClient. Check the [Vespa documentation](https://docs.vespa.ai/en/reference/services-container.html). Parameters: | Name | Type | Description | Default | | ------------- | ----------------- | ------------------------------------------------------------------------ | ---------- | | `id` | `str` | The auth client ID. | *required* | | `permissions` | `list[str]` | List of permissions. | *required* | | `parameters` | `list[Parameter]` | List of Parameter objects defining the configuration of the auth client. | `None` | Example ```python AuthClient( id="token", permissions=["read", "write"], parameters=[Parameter("token", {"id": "my-token-id"})], ) # Output: AuthClient(id="token", permissions="['read', 'write']") ``` ### `Component(id, cls=None, bundle=None, type=None, parameters=None)` Bases: `object` Create a Vespa Component, used to configure application components such as embedders. See the `ContainerCluster` example below for typical usage. ### `Nodes(count='1', parameters=None)` Bases: `object` Specify node resources for a content or container cluster as part of a `ContainerCluster` or `ContentCluster`. Parameters: | Name | Type | Description | Default | | ------------ | ----------------- | ------------------------------------------------------------------------------ | ------- | | `count` | `str` | Number of nodes in a cluster. | `'1'` | | `parameters` | `list[Parameter]` | List of Parameter objects defining the configuration of the cluster resources.
| `None` | Example ```python ContainerCluster( id="example_container", nodes=Nodes( count="2", parameters=[ Parameter( "resources", {"vcpu": "4.0", "memory": "16Gb", "disk": "125Gb"}, children=[Parameter("gpu", {"count": "1", "memory": "16Gb"})] ), Parameter("node", {"hostalias": "node1", "distribution-key": "0"}), ] ) ) # Output: ContainerCluster(id="example_container", version="1.0", nodes="Nodes(count='2')") ``` ### `Cluster(id, version='1.0', nodes=None)` Bases: `object` Base class for a cluster configuration. Should not be instantiated directly. Use subclasses `ContainerCluster` or `ContentCluster` instead. Parameters: | Name | Type | Description | Default | | --------- | ------- | ------------------------------------ | ---------- | | `id` | `str` | Cluster ID. | *required* | | `version` | `str` | Cluster version. | `'1.0'` | | `nodes` | `Nodes` | Nodes that specifies node resources. | `None` | #### `to_xml(root)` Set up XML elements that are used in both container and content clusters. ### `ContainerCluster(id, version='1.0', nodes=None, components=None, auth_clients=None)` Bases: `Cluster` Defines the configuration of a container cluster. Parameters: | Name | Type | Description | Default | | -------------- | ------------------ | ---------------------------------------------------------------------------------------------- | ------- | | `components` | `list[Component]` | List of Component that contains configurations for application components, e.g. embedders. | `None` | | `auth_clients` | `list[AuthClient]` | List of AuthClient that contains configurations for authentication clients (e.g., mTLS/token). | `None` | | `nodes` | `Nodes` | Nodes that specifies the resources of the cluster. | `None` | If `ContainerCluster` is used, any `Component`s must be added to the `ContainerCluster`, rather than to the `ApplicationPackage`, in order to be included in the generated schema. 
Example ```python ContainerCluster( id="example_container", components=[ Component( id="e5", type="hugging-face-embedder", parameters=[ Parameter( "transformer-model", {"url": "https://github.com/vespa-engine/sample-apps/raw/master/examples/model-exporting/model/e5-small-v2-int8.onnx"} ), Parameter( "tokenizer-model", {"url": "https://raw.githubusercontent.com/vespa-engine/sample-apps/master/examples/model-exporting/model/tokenizer.json"} ) ] ) ], auth_clients=[AuthClient(id="mtls", permissions=["read", "write"])], nodes=Nodes(count="2", parameters=[Parameter("resources", {"vcpu": "4.0", "memory": "16Gb", "disk": "125Gb"})]) ) # Output: ContainerCluster(id="example_container", version="1.0", nodes="Nodes(count='2')", components="[Component(id='e5', type='hugging-face-embedder')]", auth_clients="[AuthClient(id='mtls', permissions=['read', 'write'])]") ``` ### `ContentCluster(id, document_name, version='1.0', nodes=None, min_redundancy='1')` Bases: `Cluster` Defines the configuration of a content cluster. Parameters: | Name | Type | Description | Default | | ---------------- | ----- | ----------------------------------------------------------------------------------------- | ---------- | | `document_name` | `str` | Name of document. | *required* | | `min_redundancy` | `int` | Minimum redundancy of the content cluster. Must be at least 2 for production deployments. | `'1'` | Example ```python ContentCluster(id="example_content", document_name="doc") # Output: ContentCluster(id="example_content", version="1.0", document_name="doc") ``` ### `ValidationID` Bases: `Enum` Collection of IDs that can be used in validation-overrides.xml. Taken from [ValidationId.java](https://github.com/vespa-engine/vespa/blob/master/config-model-api/src/main/java/com/yahoo/config/application/api/ValidationId.java). `clusterSizeReduction` was not added as it will be removed in Vespa 9. 
#### `indexingChange = 'indexing-change'` Changing what tokens are expected and stored in field indexes #### `indexModeChange = 'indexing-mode-change'` Changing the index mode (streaming, indexed, store-only) of documents #### `fieldTypeChange = 'field-type-change'` Field type changes #### `tensorTypeChange = 'tensor-type-change'` Tensor type change #### `resourcesReduction = 'resources-reduction'` Large reductions in node resources (> 50% of the current max total resources) #### `contentTypeRemoval = 'schema-removal'` Removal of a schema (causes deletion of all documents) #### `contentClusterRemoval = 'content-cluster-removal'` Removal (or id change) of content clusters #### `deploymentRemoval = 'deployment-removal'` Removal of production zones from deployment.xml #### `globalDocumentChange = 'global-document-change'` Changing global attribute for document types in content clusters #### `configModelVersionMismatch = 'config-model-version-mismatch'` Internal use #### `skipOldConfigModels = 'skip-old-config-models'` Internal use #### `accessControl = 'access-control'` Internal use, used in zones where there should be no access-control #### `globalEndpointChange = 'global-endpoint-change'` Changing global endpoints #### `zoneEndpointChange = 'zone-endpoint-change'` Changing zone (possibly private) endpoint settings #### `redundancyIncrease = 'redundancy-increase'` Increasing redundancy - may easily cause feed blocked #### `redundancyOne = 'redundancy-one'` redundancy=1 requires a validation override on first deployment #### `pagedSettingRemoval = 'paged-setting-removal'` May cause content nodes to run out of memory #### `certificateRemoval = 'certificate-removal'` Remove data plane certificates ### `Validation(validation_id, until, comment=None)` Bases: `object` Represents a validation to be overridden on application. Check the [Vespa documentation](https://docs.vespa.ai/en/reference/validation-overrides.html) for more detailed information about validations. 
Parameters: | Name | Type | Description | Default | | --------------- | ----- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- | | `validation_id` | `str` | ID of the validation. | *required* | | `until` | `str` | The last day this change is allowed, as an ISO-8601-format date in UTC, e.g. 2016-01-30. Dates may at most be 30 days in the future, but should be as close to now as possible for safety, while allowing time for review and propagation to all deployed zones. allow-tags with dates in the past are ignored. | *required* | | `comment` | `str` | Optional text explaining the reason for the change to humans. | `None` | ### `DeploymentConfiguration(environment, regions)` Bases: `object` Create a DeploymentConfiguration, which defines how to generate a deployment.xml file (for use in production deployments). Parameters: | Name | Type | Description | Default | | ------------- | ----------- | ------------------------------------------------------------------------------------------------------------ | ---------- | | `environment` | `str` | The environment to deploy to. Currently, only 'prod' is supported. | *required* | | `regions` | `list[str]` | List of regions to deploy to, e.g. ["us-east-1", "us-west-1"]. See Vespa documentation for more information. | *required* | Example ```python DeploymentConfiguration(environment="prod", regions=["us-east-1", "us-west-1"]) # Output: DeploymentConfiguration(environment='prod', regions=['us-east-1', 'us-west-1']) ``` ### `EmptyDeploymentConfiguration()` Bases: `DeploymentConfiguration` Create an EmptyDeploymentConfiguration, which creates an empty deployment.xml, used to delete production deployments. 
### `ServicesConfiguration(application_name, schemas=None, configurations=[], stateless_model_evaluation=False, components=[], auth_clients=[], clusters=[], services_config=None)` Bases: `object` Create a ServicesConfiguration, adopting the VespaTag (VT) approach, rather than Jinja templates. Intended to be used in ApplicationPackage, to generate services.xml, based on either: - A passed `services_config` (VT) object, or - A set of configurations, schemas, components, auth_clients, and clusters (equivalent to the old approach). The latter will be done in code by calling `build_services_vt()` to generate the VT object. Parameters: | Name | Type | Description | Default | | ---------------------------- | ------------------------------------------ | ---------------------------------------------------------------------------------- | ---------- | | `application_name` | `str` | Application name. | *required* | | `schemas` | `Optional[List[Schema]]` | List of Schemas of the application. | `None` | | `configurations` | `Optional[List[ApplicationConfiguration]]` | List of ApplicationConfiguration that contains configurations for the application. | `[]` | | `stateless_model_evaluation` | `Optional[bool]` | Enable stateless model evaluation. Default is False. | `False` | | `components` | `Optional[List[Component]]` | List of Component that contains configurations for application components. | `[]` | | `auth_clients` | `Optional[List[AuthClient]]` | List of AuthClient that contains configurations for authentication clients. | `[]` | | `clusters` | `Optional[List[Cluster]]` | List of Cluster that contains configurations for content or container clusters. | `[]` | | `services_config` | `Optional[VT]` | VT object that contains the services configuration. 
| `None` | Example ```python config = ServicesConfiguration( application_name="myapp", schemas=[Schema(name="myschema", document=Document())], configurations=[ApplicationConfiguration(name="container.handler.observability.application-userdata", value={"version": "my-version"})], components=[Component(id="hf-embedder", type="huggingface-embedder")], stateless_model_evaluation=True, ) print(str(config)) # Output: # ... services_config = ServicesConfiguration( application_name="myapp", services_config=services( container(id="myapp_default", version="1.0")( component( model(url="https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1/raw/main/tokenizer.json"), id="tokenizer", type="hugging-face-tokenizer" ), document_api(), search(), ), content(id="myapp", version="1.0")( min_redundancy("1"), documents(document(type="doc", mode="index")), engine(proton(tuning(searchnode(requestthreads(persearch("4")))))) ), version="1.0", minimum_required_vespa_version="8.311.28", ), ) print(str(services_config)) # Output: # ... ``` ### `ApplicationPackage(name, schema=None, query_profile=None, query_profile_type=None, stateless_model_evaluation=False, create_schema_by_default=True, create_query_profile_by_default=True, configurations=None, validations=None, components=None, auth_clients=None, clusters=None, deployment_config=None, services_config=None, query_profile_config=None)` Bases: `object` Create an application package. Parameters: | Name | Type | Description | Default | | --------------------------------- | ------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- | | `name` | `str` | Application name. Cannot contain '-' or '\_'. | *required* | | `schema` | `list` | List of Schema objects for the application. 
If None, a default Schema with the same name as the application will be created. Defaults to None. | `None` | | `query_profile` | `QueryProfile` | QueryProfile of the application. If None, a default QueryProfile with QueryProfileType 'root' will be created. Defaults to None. | `None` | | `query_profile_type` | `QueryProfileType` | QueryProfileType of the application. If None, a default QueryProfileType 'root' will be created. Defaults to None. | `None` | | `stateless_model_evaluation` | `bool` | Enable stateless model evaluation. Defaults to False. | `False` | | `create_schema_by_default` | `bool` | Include a default Schema if none is provided in the schema argument. Defaults to True. | `True` | | `create_query_profile_by_default` | `bool` | Include a default QueryProfile and QueryProfileType if not explicitly defined by the user. Defaults to True. | `True` | | `configurations` | `list` | List of ApplicationConfiguration for the application. Defaults to None. | `None` | | `validations` | `list` | Optional list of Validation objects to be overridden. Defaults to None. | `None` | | `components` | `list` | List of Component objects for application components. Defaults to None. | `None` | | `clusters` | `list` | List of Cluster objects for content or container clusters. If clusters is provided, any Component must be part of a cluster. Defaults to None. | `None` | | `auth_clients` | `list` | List of AuthClient objects for client authorization. If clusters is passed, pass the auth clients to the ContainerCluster instead. Defaults to None. | `None` | | `deployment_config` | `Union[DeploymentConfiguration, VT]` | Deployment configuration for the application. Must be either a DeploymentConfiguration object (legacy) or a VT (Vespa Tag) based deployment configuration whose top-level tag must be deployment. Defaults to None. | `None` | | `services_config` | `ServicesConfiguration` | (Optional) Services configuration for the application. For advanced configuration. 
See https://vespa-engine.github.io/pyvespa/advanced-configuration.md | `None` | | `query_profile_config` | `Union[VT, List[VT]]` | Configuration for query profiles. If provided, will override the query_profile and query_profile_type arguments. Defaults to None. See https://vespa-engine.github.io/pyvespa/advanced-configuration.md | `None` | Example: To create a default application package: ```python ApplicationPackage(name="testapp") ApplicationPackage('testapp', [Schema('testapp', Document(None, None, None), None, None, [], False, None, [], None)], QueryProfile(None), QueryProfileType(None)) ``` This creates a default Schema, QueryProfile, and QueryProfileType, which can be populated with your application's specifics. #### `services_to_text` The intention is to use only services_config, but this is kept until full compatibility is verified through tests. #### `add_schema(*schemas)` Add `Schema` objects to the application package. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `schemas` | `list` | Schemas to be added. | `()` | Returns: | Type | Description | | --- | --- | | `None` | None | #### `add_query_profile(query_profile_item)` Add a query profile item (query-profile or query-profile-type) to the application package. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `query_profile_item` | `VT or List[VT]` | Query profile item(s) to be added. | *required* | Returns: | Type | Description | | --- | --- | | `None` | None | Example ```python app_package = ApplicationPackage(name="testapp") qp = query_profile( field(30, name="hits"), field(3, name="trace.level"), ) app_package.add_query_profile( qp ) # Query profile item is added to the application package. # Inspect with `app_package.query_profile_config` ``` #### `to_zip()` Return the application package as zipped bytes, to be used in a subsequent deploy.
Returns: | Name | Type | Description | | --------- | --------- | --------------------------------------------------- | | `BytesIO` | `BytesIO` | A buffer containing the zipped application package. | #### `to_zipfile(zfile)` Export the application package as a deployable zipfile. See [application packages](https://docs.vespa.ai/en/application-packages.html) for deployment options. Parameters: | Name | Type | Description | Default | | ------- | ----- | ---------------------- | ---------- | | `zfile` | `str` | Filename to export to. | *required* | Returns: | Type | Description | | ------ | ----------- | | `None` | None | #### `to_files(root)` Export the application package as a directory tree. Parameters: | Name | Type | Description | Default | | ------ | ----- | ----------------------------- | ---------- | | `root` | `str` | Directory to export files to. | *required* | Returns: | Type | Description | | ------ | ----------- | | `None` | None | ### `validate_services(xml_input)` Validate an XML input against the RelaxNG schema file for services.xml Parameters: | Name | Type | Description | Default | | ----------- | ------------------------ | -------------------------- | ---------- | | `xml_input` | `Path or str or Element` | The XML input to validate. | *required* | Returns: True if the XML input is valid according to the RelaxNG schema, False otherwise. ## `vespa.querybuilder.builder.builder` ### `QueryField(name)` ### `Condition(expression)` #### `all(*conditions)` Combine multiple conditions using logical AND. #### `any(*conditions)` Combine multiple conditions using logical OR. ### `Query(select_fields, prepend_yql=False)` #### `from_(*sources)` Specify the source schema(s) to query. 
Example ```python import vespa.querybuilder as qb from vespa.package import Schema, Document query = qb.select("*").from_("schema1", "schema2") str(query) 'select * from schema1, schema2' query = qb.select("*").from_(Schema(name="schema1", document=Document()), Schema(name="schema2", document=Document())) str(query) 'select * from schema1, schema2' ``` Parameters: | Name | Type | Description | Default | | --------- | -------------------- | ------------------------------ | ------- | | `sources` | `Union[str, Schema]` | The source schema(s) to query. | `()` | Returns: | Name | Type | Description | | ------- | ------- | ----------------- | | `Query` | `Query` | The Query object. | #### `where(condition)` Adds a where clause to filter query results. For more information, see Parameters: | Name | Type | Description | Default | | ----------- | ------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------- | ---------- | | `condition` | `Union[Condition, bool]` | Filter condition that can be: - Condition object for complex queries - Boolean for simple true/false - QueryField for field-based filters | *required* | Returns: | Name | Type | Description | | ------- | ------- | ------------------------ | | `Query` | `Query` | Self for method chaining | Example ```python import vespa.querybuilder as qb # Using field conditions f1 = qb.QueryField("f1") query = qb.select("*").from_("sd1").where(f1.contains("v1")) str(query) 'select * from sd1 where f1 contains "v1"' ``` ```python # Using boolean query = qb.select("*").from_("sd1").where(True) str(query) 'select * from sd1 where true' ``` ```python # Using complex conditions condition = f1.contains("v1") & qb.QueryField("f2").contains("v2") query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where f1 contains "v1" and f2 contains "v2"' ``` #### `order_by(field, ascending=True, annotations=None)` 
Orders results by specified fields. For more information, see the [Vespa sorting reference](https://docs.vespa.ai/en/reference/sorting.html). Parameters: | Name | Type | Description | Default | | ------------- | -------------------------- | -------------------------------------------- | ---------- | | `field` | `Union[str, QueryField]` | Field name or QueryField object to order by. | *required* | | `ascending` | `bool` | Sort in ascending order. Defaults to True. | `True` | | `annotations` | `Optional[Dict[str, Any]]` | Optional annotations like "locale", "strength", etc. See https://docs.vespa.ai/en/reference/sorting.html#special-sorting-attributes for details. | `None` | Returns: | Name | Type | Description | | ------- | ------- | ------------------------ | | `Query` | `Query` | Self for method chaining | Example ```python import vespa.querybuilder as qb # Simple ordering query = qb.select("*").from_("sd1").order_by("price") str(query) 'select * from sd1 order by price asc' # Multiple fields with annotation query = qb.select("*").from_("sd1").order_by( "price", annotations={"locale": "en_US"}, ascending=False ).order_by("name", annotations={"locale": "no_NO"}, ascending=True) str(query) 'select * from sd1 order by {"locale":"en_US"}price desc, {"locale":"no_NO"}name asc' ``` #### `orderByAsc(field, annotations=None)` Convenience method for ordering results by a field in ascending order. See `order_by` for more information. #### `orderByDesc(field, annotations=None)` Convenience method for ordering results by a field in descending order. See `order_by` for more information. #### `set_limit(limit)` Sets maximum number of results to return.
For more information, see the [Vespa query API reference](https://docs.vespa.ai/en/reference/query-api-reference.html). Parameters: | Name | Type | Description | Default | | ------- | ----- | -------------------------------- | ---------- | | `limit` | `int` | Maximum number of hits to return | *required* | Returns: | Name | Type | Description | | ------- | ------- | ------------------------ | | `Query` | `Query` | Self for method chaining | Example ```python import vespa.querybuilder as qb f1 = qb.QueryField("f1") query = qb.select("*").from_("sd1").where(f1.contains("v1")).set_limit(5) str(query) 'select * from sd1 where f1 contains "v1" limit 5' ``` #### `set_offset(offset)` Sets number of initial results to skip for pagination. For more information, see the [Vespa query API reference](https://docs.vespa.ai/en/reference/query-api-reference.html). Parameters: | Name | Type | Description | Default | | -------- | ----- | ------------------------- | ---------- | | `offset` | `int` | Number of results to skip | *required* | Returns: | Name | Type | Description | | ------- | ------- | ------------------------ | | `Query` | `Query` | Self for method chaining | Example ```python import vespa.querybuilder as qb f1 = qb.QueryField("f1") query = qb.select("*").from_("sd1").where(f1.contains("v1")).set_offset(10) str(query) 'select * from sd1 where f1 contains "v1" offset 10' ``` #### `set_timeout(timeout)` Sets query timeout in milliseconds. For more information, see the [Vespa query API reference](https://docs.vespa.ai/en/reference/query-api-reference.html). Parameters: | Name | Type | Description | Default | | --------- | ----- | ----------------------- | ---------- | | `timeout` | `int` | Timeout in milliseconds | *required* | Returns: | Name | Type | Description | | ------- | ------- | ------------------------ | | `Query` | `Query` | Self for method chaining | Example ```python import vespa.querybuilder as qb f1 = qb.QueryField("f1") query = qb.select("*").from_("sd1").where(f1.contains("v1")).set_timeout(500) str(query) 'select * from sd1 where f1 contains "v1" timeout 500' ``` #### `add_parameter(key, value)` Adds a query parameter.
For more information, see the Vespa documentation. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `key` | `str` | Parameter name | *required* | | `value` | `Any` | Parameter value | *required* | Returns: | Name | Type | Description | | --- | --- | --- | | `Query` | `Query` | Self for method chaining | Example ```python import vespa.querybuilder as qb condition = qb.userInput("@myvar") query = qb.select("*").from_("sd1").where(condition).add_parameter("myvar", "test") str(query) 'select * from sd1 where userInput(@myvar)&myvar=test' ``` #### `param(key, value)` Alias for `add_parameter()`. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `key` | `str` | Parameter name | *required* | | `value` | `Any` | Parameter value | *required* | Returns: | Name | Type | Description | | --- | --- | --- | | `Query` | `Query` | Self for method chaining | Example ```python import vespa.querybuilder as qb condition = qb.userInput("@animal") query = qb.select("*").from_("sd1").where(condition).param("animal", "panda") str(query) 'select * from sd1 where userInput(@animal)&animal=panda' ``` #### `groupby(group_expression, continuations=[])` Groups results by the specified expression. For more information, see the Vespa grouping documentation; the `Grouping` class below provides methods to build group expressions.
Parameters: | Name | Type | Description | Default | | ------------------ | ------ | ----------------------------------------------------------------------------------- | ---------- | | `group_expression` | `str` | Grouping expression | *required* | | `continuations` | `List` | List of continuation tokens (see https://docs.vespa.ai/en/grouping.html#pagination) | `[]` | Returns: | Type | Description | | ------- | -------------------------- | | `Query` | : Self for method chaining | Example ```python import vespa.querybuilder as qb from vespa.querybuilder import Grouping as G # Group by customer with sum of price grouping = G.all( G.group("customer"), G.each(G.output(G.sum("price"))), ) str(grouping) 'all(group(customer) each(output(sum(price))))' query = qb.select("*").from_("sd1").groupby(grouping) str(query) 'select * from sd1 | all(group(customer) each(output(sum(price))))' # Group by year with count grouping = G.all( G.group("time.year(a)"), G.each(G.output(G.count())), ) str(grouping) 'all(group(time.year(a)) each(output(count())))' query = qb.select("*").from_("purchase").where(True).groupby(grouping) str(query) 'select * from purchase where true | all(group(time.year(a)) each(output(count())))' # With continuations query = qb.select("*").from_("purchase").where(True).groupby(grouping, continuations=["foo", "bar"]) str(query) "select * from purchase where true | { 'continuations':['foo', 'bar'] }all(group(time.year(a)) each(output(count())))" ``` ### `Q` Wrapper class for QueryBuilder static methods. Methods are exposed as module-level functions. To use: ```python import vespa.querybuilder as qb query = qb.select("*").from_("sd1") # or any of the other Q class methods ``` #### `select(fields)` Creates a new query selecting specified fields. 
For more information, see the Vespa documentation. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `fields` | `Union[str, List[str], List[QueryField]]` | Field names or QueryField objects to select | *required* | Returns: | Name | Type | Description | | --- | --- | --- | | `Query` | `Query` | New query object | Example ```python import vespa.querybuilder as qb query = qb.select("*").from_("sd1") str(query) 'select * from sd1' query = qb.select(["title", "url"]) str(query) 'select title, url from *' ``` #### `any(*conditions)` Combines multiple conditions with the OR operator. For more information, see the Vespa documentation. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `*conditions` | `Condition` | Variable number of Condition objects to combine with OR | `()` | Returns: | Name | Type | Description | | --- | --- | --- | | `Condition` | `Condition` | Combined condition using OR operators | Example ```python import vespa.querybuilder as qb f1, f2 = qb.QueryField("f1"), qb.QueryField("f2") condition = qb.any(f1 > 10, f2 == "v2") query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where f1 > 10 or f2 = "v2"' ``` #### `all(*conditions)` Combines multiple conditions with the AND operator.
For more information, see Parameters: | Name | Type | Description | Default | | ------------- | ----------- | -------------------------------------------------------- | ------- | | `*conditions` | `Condition` | Variable number of Condition objects to combine with AND | `()` | Returns: | Name | Type | Description | | ----------- | ----------- | -------------------------------------- | | `Condition` | `Condition` | Combined condition using AND operators | Example ```python import vespa.querybuilder as qb f1, f2 = qb.QueryField("f1"), qb.QueryField("f2") condition = qb.all(f1 > 10, f2 == "v2") query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where f1 > 10 and f2 = "v2"' ``` #### `userQuery(value='')` Creates a userQuery operator for text search. For more information, see Parameters: | Name | Type | Description | Default | | ------- | ----- | ----------------------------------------------- | ------- | | `value` | `str` | Optional query string. Default is empty string. | `''` | Returns: | Name | Type | Description | | ----------- | ----------- | --------------------- | | `Condition` | `Condition` | A userQuery condition | Example ```python import vespa.querybuilder as qb # Basic userQuery condition = qb.userQuery() query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where userQuery()' # UserQuery with search terms condition = qb.userQuery("search terms") query = qb.select("*").from_("documents").where(condition) str(query) 'select * from documents where userQuery("search terms")' ``` #### `dotProduct(field, weights, annotations=None)` Creates a dot product calculation condition. For more information, see . 
Parameters: | Name | Type | Description | Default | | ------------- | ------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- | ---------- | | `field` | `str` | Field containing vectors | *required* | | `weights` | `Union[List[float], Dict[str, float], str]` | Either list of numeric weights or dict mapping elements to weights or a parameter substitution string starting with '@' | *required* | | `annotations` | `Optional[Dict]` | Optional modifiers like label | `None` | Returns: | Name | Type | Description | | ----------- | ----------- | ----------------------------------- | | `Condition` | `Condition` | A dot product calculation condition | Example ```python import vespa.querybuilder as qb # Using dict weights with annotation condition = qb.dotProduct( "weightedset_field", {"feature1": 1, "feature2": 2}, annotations={"label": "myDotProduct"} ) query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where ({label:"myDotProduct"}dotProduct(weightedset_field, {"feature1": 1, "feature2": 2}))' # Using list weights condition = qb.dotProduct("weightedset_field", [0.4, 0.6]) query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where dotProduct(weightedset_field, [0.4, 0.6])' # Using parameter substitution condition = qb.dotProduct("weightedset_field", "@myweights") query = qb.select("*").from_("sd1").where(condition).add_parameter("myweights", [0.4, 0.6]) str(query) 'select * from sd1 where dotProduct(weightedset_field, "@myweights")&myweights=[0.4, 0.6]' ``` #### `weightedSet(field, weights, annotations=None)` Creates a weighted set condition. For more information, see . 
Parameters: | Name | Type | Description | Default | | ------------- | ------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- | ---------- | | `field` | `str` | Field containing weighted set data | *required* | | `weights` | `Union[List[float], Dict[str, float], str]` | Either list of numeric weights or dict mapping elements to weights or a parameter substitution string starting with '@' | *required* | | `annotations` | `Optional[Dict]` | Optional annotations like targetNumHits | `None` | Returns: | Name | Type | Description | | ----------- | ----------- | ------------------------ | | `Condition` | `Condition` | A weighted set condition | Example ```python import vespa.querybuilder as qb # using map weights condition = qb.weightedSet( "weightedset_field", {"element1": 1, "element2": 2}, annotations={"targetNumHits": 10} ) query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where ({targetNumHits:10}weightedSet(weightedset_field, {"element1": 1, "element2": 2}))' # using list weights condition = qb.weightedSet("weightedset_field", [0.4, 0.6]) query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where weightedSet(weightedset_field, [0.4, 0.6])' # using parameter substitution condition = qb.weightedSet("weightedset_field", "@myweights") query = qb.select("*").from_("sd1").where(condition).add_parameter("myweights", [0.4, 0.6]) str(query) 'select * from sd1 where weightedSet(weightedset_field, "@myweights")&myweights=[0.4, 0.6]' ``` #### `nonEmpty(condition)` Creates a nonEmpty operator to check if a field has content. For more information, see . 
Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `condition` | `Union[Condition, QueryField]` | Field or condition to check | *required* | Returns: | Name | Type | Description | | --- | --- | --- | | `Condition` | `Condition` | A nonEmpty condition | Example ```python import vespa.querybuilder as qb field = qb.QueryField("title") condition = qb.nonEmpty(field) query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where nonEmpty(title)' ``` #### `wand(field, weights, annotations=None)` Creates a Weighted AND (WAND) operator for efficient top-k retrieval. For more information, see the Vespa documentation. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `field` | `str` | Field name to search | *required* | | `weights` | `Union[List[float], Dict[str, float], str]` | Either a list of numeric weights, a dict mapping terms to weights, or a parameter substitution string starting with '@' | *required* | | `annotations` | `Optional[Dict[str, Any]]` | Optional annotations like targetHits | `None` | Returns: | Name | Type | Description | | --- | --- | --- | | `Condition` | `Condition` | A WAND condition | Example ```python import vespa.querybuilder as qb # Using list weights condition = qb.wand("description", weights=[0.4, 0.6]) query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where wand(description, [0.4, 0.6])' # Using dict weights with annotation weights = {"hello": 0.3, "world": 0.7} condition = qb.wand( "title", weights, annotations={"targetHits": 100} ) query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where ({targetHits: 100}wand(title, {"hello": 0.3, "world": 0.7}))' ``` #### `weakAnd(*conditions, annotations=None)` Creates a weakAnd operator for less strict AND matching. For more information, see the Vespa documentation. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `*conditions` | `Condition` | Variable number of conditions to combine | `()` | | `annotations` | `Optional[Dict[str, Any]]` | Optional annotations like targetHits | `None` | Returns: | Name | Type | Description | | --- | --- | --- | | `Condition` | `Condition` | A weakAnd condition | Example ```python import vespa.querybuilder as qb f1, f2 = qb.QueryField("f1"), qb.QueryField("f2") condition = qb.weakAnd(f1 == "v1", f2 == "v2") query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where weakAnd(f1 = "v1", f2 = "v2")' # With annotation condition = qb.weakAnd( f1 == "v1", f2 == "v2", annotations={"targetHits": 100} ) query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where ({"targetHits": 100}weakAnd(f1 = "v1", f2 = "v2"))' ``` #### `geoLocation(field, lat, lng, radius, annotations=None)` Creates a geolocation search condition. For more information, see the Vespa documentation. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `field` | `str` | Field containing location data | *required* | | `lat` | `float` | Latitude coordinate | *required* | | `lng` | `float` | Longitude coordinate | *required* | | `radius` | `str` | Search radius (e.g. "10km") | *required* | | `annotations` | `Optional[Dict]` | Optional settings like targetHits | `None` | Returns: | Name | Type | Description | | --- | --- | --- | | `Condition` | `Condition` | A geolocation search condition | Example ```python import vespa.querybuilder as qb condition = qb.geoLocation( "location_field", 37.7749, -122.4194, "10km", annotations={"targetHits": 100} ) query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where ({targetHits:100}geoLocation(location_field, 37.7749, -122.4194, "10km"))' ``` #### `nearestNeighbor(field, query_vector, annotations={'targetHits': 100})` Creates a nearest neighbor search condition. See the Vespa documentation for more information. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `field` | `str` | Vector field to search in | *required* | | `query_vector` | `str` | Query vector to compare against | *required* | | `annotations` | `Dict[str, Any]` | Annotations to modify the behavior. The targetHits annotation is required; this method supplies targetHits=100 by default | `{'targetHits': 100}` | Returns: | Name | Type | Description | | --- | --- | --- | | `Condition` | `Condition` | A nearest neighbor search condition | Example ```python import vespa.querybuilder as qb condition = qb.nearestNeighbor( field="dense_rep", query_vector="q_dense", ) query = qb.select(["id", "text"]).from_("m").where(condition) str(query) 'select id, text from m where ({targetHits:100}nearestNeighbor(dense_rep, q_dense))' ``` #### `rank(*queries)` Creates a rank condition for combining multiple queries.
For more information, see Parameters: | Name | Type | Description | Default | | ---------- | ---- | ------------------------------------------- | ------- | | `*queries` | | Variable number of Query objects to combine | `()` | Returns: | Name | Type | Description | | ----------- | ----------- | ---------------- | | `Condition` | `Condition` | A rank condition | Example ```python import vespa.querybuilder as qb condition = qb.rank( qb.nearestNeighbor("field", "queryVector"), qb.QueryField("a").contains("A"), qb.QueryField("b").contains("B"), qb.QueryField("c").contains("C"), ) query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where rank(({targetHits:100}nearestNeighbor(field, queryVector)), a contains "A", b contains "B", c contains "C")' ``` #### `phrase(*terms, annotations=None)` Creates a phrase search operator for exact phrase matching. For more information, see Parameters: | Name | Type | Description | Default | | ------------- | -------------------------- | ----------------------------- | ------- | | `*terms` | `str` | Terms that make up the phrase | `()` | | `annotations` | `Optional[Dict[str, Any]]` | Optional annotations | `None` | Returns: | Name | Type | Description | | ----------- | ----------- | ------------------ | | `Condition` | `Condition` | A phrase condition | Example ```python import vespa.querybuilder as qb condition = qb.phrase("new", "york", "city") query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where phrase("new", "york", "city")' ``` #### `near(*terms, distance=None, annotations=None, **kwargs)` Creates a near search operator for finding terms within a specified distance. 
For more information, see Parameters: | Name | Type | Description | Default | | ------------- | -------------------------- | ------------------------------------------------------------------------ | ------- | | `*terms` | `str` | Terms to search for | `()` | | `distance` | `Optional[int]` | Maximum word distance between terms. Will default to 2 if not specified. | `None` | | `annotations` | `Optional[Dict[str, Any]]` | Optional annotations | `None` | | `**kwargs` | | Additional annotations | `{}` | Returns: | Name | Type | Description | | ----------- | ----------- | ---------------- | | `Condition` | `Condition` | A near condition | Example ```python import vespa.querybuilder as qb condition = qb.near("machine", "learning", distance=5) query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where ({distance:5}near("machine", "learning"))' ``` #### `onear(*terms, distance=None, annotations=None, **kwargs)` Creates an ordered near operator for ordered proximity search. For more information, see Parameters: | Name | Type | Description | Default | | ------------- | -------------------------- | ------------------------------------------------------------------------ | ------- | | `*terms` | `str` | Terms to search for in order | `()` | | `distance` | `Optional[int]` | Maximum word distance between terms. Will default to 2 if not specified. | `None` | | `annotations` | `Optional[Dict[str, Any]]` | Optional annotations | `None` | Returns: | Name | Type | Description | | ----------- | ----------- | ------------------ | | `Condition` | `Condition` | An onear condition | Example ```python import vespa.querybuilder as qb condition = qb.onear("deep", "learning", distance=3) query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where ({distance:3}onear("deep", "learning"))' ``` #### `sameElement(*conditions)` Creates a sameElement operator to match conditions in same array element. 
For more information, see the Vespa documentation. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `*conditions` | `Condition` | Conditions that must match in the same element | `()` | Returns: | Name | Type | Description | | --- | --- | --- | | `Condition` | `Condition` | A sameElement condition | Example ```python import vespa.querybuilder as qb persons = qb.QueryField("persons") first_name = qb.QueryField("first_name") last_name = qb.QueryField("last_name") year_of_birth = qb.QueryField("year_of_birth") condition = persons.contains( qb.sameElement( first_name.contains("Joe"), last_name.contains("Smith"), year_of_birth < 1940, ) ) query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where persons contains sameElement(first_name contains "Joe", last_name contains "Smith", year_of_birth < 1940)' ``` #### `equiv(*terms)` Creates an equiv operator for matching equivalent terms. For more information, see the Vespa documentation. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `*terms` | `str` | Equivalent terms to match | `()` | Returns: | Name | Type | Description | | --- | --- | --- | | `Condition` | `Condition` | An equiv condition | Example ```python import vespa.querybuilder as qb fieldName = qb.QueryField("fieldName") condition = fieldName.contains(qb.equiv("Snoop Dogg", "Calvin Broadus")) query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where fieldName contains equiv("Snoop Dogg", "Calvin Broadus")' ``` #### `uri(value, annotations=None)` Creates a uri operator for matching URIs. For more information, see the Vespa documentation. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `value` | `str` | URI value to match | *required* | | `annotations` | `Optional[Dict[str, Any]]` | Optional annotations | `None` | Returns: | Name | Type | Description | | --- | --- | --- | | `Condition` | `Condition` | A uri condition | Example ```python import vespa.querybuilder as qb url = "vespa.ai/foo" condition = qb.uri(url) query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where uri("vespa.ai/foo")' ``` #### `fuzzy(value, annotations=None, **kwargs)` Creates a fuzzy operator for approximate string matching. For more information, see the Vespa documentation. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `value` | `str` | Term to fuzzy match | *required* | | `annotations` | `Optional[Dict[str, Any]]` | Optional annotations | `None` | | `**kwargs` | | Optional parameters like maxEditDistance, prefixLength, etc. | `{}` | Returns: | Name | Type | Description | | --- | --- | --- | | `Condition` | `Condition` | A fuzzy condition | Example ```python import vespa.querybuilder as qb condition = qb.fuzzy("parantesis") query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where fuzzy("parantesis")' # With annotation condition = qb.fuzzy("parantesis", annotations={"prefixLength": 1, "maxEditDistance": 2}) query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where ({prefixLength:1,maxEditDistance:2}fuzzy("parantesis"))' ``` #### `userInput(value=None, annotations=None)` Creates a userInput operator for query evaluation. For more information, see the Vespa documentation.
Parameters: | Name | Type | Description | Default | | ------------- | ---------------- | ------------------------------------------- | ------- | | `value` | `Optional[str]` | The input variable name, e.g. "@myvar" | `None` | | `annotations` | `Optional[Dict]` | Optional annotations to modify the behavior | `None` | Returns: | Name | Type | Description | | ----------- | ----------- | ----------------------------------------------- | | `Condition` | `Condition` | A condition representing the userInput operator | Example ```python import vespa.querybuilder as qb condition = qb.userInput("@myvar") query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where userInput(@myvar)' # With defaultIndex annotation condition = qb.userInput("@myvar").annotate({"defaultIndex": "text"}) query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where {defaultIndex:"text"}userInput(@myvar)' # With parameter condition = qb.userInput("@animal") query = qb.select("*").from_("sd1").where(condition).param("animal", "panda") str(query) 'select * from sd1 where userInput(@animal)&animal=panda' ``` #### `predicate(field, attributes=None, range_attributes=None)` Creates a predicate condition for filtering documents based on specific attributes. For more information, see . 
Parameters: | Name | Type | Description | Default | | ------------------ | -------------------------- | --------------------------------------------- | ---------- | | `field` | `str` | The predicate field name | *required* | | `attributes` | `Optional[Dict[str, Any]]` | Dictionary of attribute key-value pairs | `None` | | `range_attributes` | `Optional[Dict[str, Any]]` | Dictionary of range attribute key-value pairs | `None` | Returns: | Name | Type | Description | | ----------- | ----------- | ------------------------------------------------ | | `Condition` | `Condition` | A condition representing the predicate operation | Example ```python import vespa.querybuilder as qb condition = qb.predicate( "predicate_field", attributes={"gender": "Female"}, range_attributes={"age": "20L"} ) query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where predicate(predicate_field,{"gender":"Female"},{"age":20L})' ``` #### `true()` Creates a condition that is always true. Returns: | Name | Type | Description | | ----------- | ----------- | ---------------- | | `Condition` | `Condition` | A true condition | Example ```python import vespa.querybuilder as qb condition = qb.true() query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where true' ``` #### `false()` Creates a condition that is always false. Returns: | Name | Type | Description | | ----------- | ----------- | ----------------- | | `Condition` | `Condition` | A false condition | Example ```python import vespa.querybuilder as qb condition = qb.false() query = qb.select("*").from_("sd1").where(condition) str(query) 'select * from sd1 where false' ``` ## `vespa.querybuilder.grouping.grouping` ### `Expression` Bases: `str` #### `__invert__()` ~expr → 'not (expr)' for filter predicates. Always wraps in parens so Vespa's precedence (not > and > or) does not reinterpret compound expressions. #### `__and__(other)` expr & expr → 'expr and expr' for filter predicates. 
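The `__invert__` and `__and__` overloads above return new `Expression` strings, so filter predicates compose with ordinary Python operators. A minimal sketch of the mechanism, using a hypothetical stand-alone `str` subclass (not the actual pyvespa class):

```python
# Expr is a hypothetical stand-in illustrating how a str subclass can
# overload ~ and & to build Vespa-style filter predicate strings.
class Expr(str):
    def __invert__(self):
        # ~expr -> 'not (expr)'; parens guard Vespa's not > and > or precedence
        return Expr(f"not ({self})")

    def __and__(self, other):
        # expr & expr -> 'expr and expr'
        return Expr(f"{self} and {other}")

pred = Expr('regex("Bonn.*", sales_rep)') & ~Expr("range(0, 1000, price)")
print(pred)  # regex("Bonn.*", sales_rep) and not (range(0, 1000, price))
```

Because unary `~` binds tighter than `&`, the negation is applied before the conjunction, mirroring how the `filter_` example below combines `G.regex(...) & ~G.range_(...)`.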
#### `__or__(other)` expr | expr → '(expr or expr)' for filter predicates. Always wraps in parens so precedence is correct when combined with &. ### `Grouping` A Pythonic DSL for building Vespa grouping expressions programmatically. This class provides a set of static methods that build grouping syntax strings which can be combined to form a valid Vespa “select=…” grouping expression. For a guide to grouping in vespa, see . For the reference docs, see . Minimal Example ```python from vespa.querybuilder import Grouping as G # Build a simple grouping expression which groups on "my_attribute" # and outputs the count of matching documents under each group: expr = G.all( G.group("my_attribute"), G.each( G.output(G.count()) ) ) print(expr) all(group(my_attribute) each(output(count()))) ``` In the above example, the “all(...)” wraps the grouping operations at the top level. We first group on “my_attribute”, then under “each(...)” we add an output aggregator “count()”. The “print” output is the exact grouping expression string you would pass to Vespa in the “select” query parameter. For multi-level (nested) grouping, you can nest additional calls to “group(...)” or “each(...)” inside. For example: ```python # Nested grouping: # 1) Group by 'category' # 2) Within each category, group by 'sub_category' # 3) Output the count() under each sub-category nested_expr = G.all( G.group("category"), G.each( G.group("sub_category"), G.each( G.output(G.count()) ) ) ) print(nested_expr) all(group(category) each(group(sub_category) each(output(count())))) ``` You may use any of the static methods below to build more advanced groupings, aggregations, or arithmetic/string expressions for sorting, filtering, or bucket definitions. Refer to Vespa documentation for the complete details. #### `all(*args)` Corresponds to the “all(...)” grouping block in Vespa, which means “group all documents (no top-level grouping) and then do the enclosed operations”. 
Parameters: | Name | Type | Description | Default | | ------- | ----- | ----------------------------------------------------- | ------- | | `*args` | `str` | Sub-expressions to include within the all(...) block. | `()` | Returns: | Name | Type | Description | | ----- | ------------ | ----------------------------------- | | `str` | `Expression` | A Vespa grouping expression string. | Example ```python from vespa.querybuilder import Grouping as G expr = G.all(G.group("my_attribute"), G.each(G.output(G.count()))) print(expr) all(group(my_attribute) each(output(count()))) ``` #### `each(*args)` Corresponds to the “each(...)” grouping block in Vespa, which means “create a group for each unique value and then do the enclosed operations”. Parameters: | Name | Type | Description | Default | | ------- | ----- | ------------------------------------------------------ | ------- | | `*args` | `str` | Sub-expressions to include within the each(...) block. | `()` | Returns: | Name | Type | Description | | ----- | ------------ | ----------------------------------- | | `str` | `Expression` | A Vespa grouping expression string. | Example ```python from vespa.querybuilder import Grouping as G expr = G.each("output(count())", "output(avg(price))") print(expr) each(output(count()) output(avg(price))) ``` #### `group(field)` Defines a grouping step on a field or expression. Parameters: | Name | Type | Description | Default | | ------- | ----- | ------------------------------------------ | ---------- | | `field` | `str` | The field or expression on which to group. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ----------------------------------- | | `str` | `Expression` | A Vespa grouping expression string. | Example ```python from vespa.querybuilder import Grouping as G expr = G.group("my_map.key") print(expr) group(my_map.key) ``` #### `filter_(predicate)` Wraps a predicate in `filter(...)` for use inside a grouping expression. 
Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `predicate` | | A filter predicate expression (e.g. G.regex(...), G.istrue(...)). | *required* | Returns: | Name | Type | Description | | --- | --- | --- | | `Expression` | `Expression` | `filter(<predicate>)` | Example ```python from vespa.querybuilder import Grouping as G expr = G.all( G.group("customer"), G.filter_(G.regex("Bonn.*", 'attributes{"sales_rep"}') & ~G.range_(0, 1000, "price")), G.each(G.output(G.sum("price"))), ) print(expr) all(group(customer) filter(regex("Bonn.*", attributes{"sales_rep"}) and not (range(0, 1000, price))) each(output(sum(price)))) ``` #### `regex(pattern, expr)` Creates a `regex(...)` filter predicate. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `pattern` | `str` | The regular expression pattern (will be quoted). | *required* | | `expr` | `str` | The field or expression to match against. | *required* | Returns: | Name | Type | Description | | --- | --- | --- | | `Expression` | `Expression` | `regex("<pattern>", <expr>)` | Example ```python from vespa.querybuilder import Grouping as G expr = G.regex("foo.*", "my_field") print(expr) regex("foo.*", my_field) ``` #### `range_(min_val, max_val, expr, lower_inclusive=None, upper_inclusive=None)` Creates a `range(...)` filter predicate. Vespa defaults: lower bound is inclusive, upper bound is exclusive. If either `lower_inclusive` or `upper_inclusive` is provided, both are emitted, using Vespa defaults (`true`/`false`) for any omitted value. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `min_val` | | Lower bound of the range. | *required* | | `max_val` | | Upper bound of the range. | *required* | | `expr` | `str` | The field or expression to check. | *required* | | `lower_inclusive` | | Whether the lower bound is inclusive (default: True, matching Vespa). | `None` | | `upper_inclusive` | | Whether the upper bound is inclusive (default: False, matching Vespa). | `None` | Returns: | Name | Type | Description | | --- | --- | --- | | `Expression` | `Expression` | `range(<min_val>, <max_val>, <expr>[, <lower_inclusive>, <upper_inclusive>])` | Example ```python from vespa.querybuilder import Grouping as G expr = G.range_(1990, 2012, "year") print(expr) range(1990, 2012, year) expr = G.range_(1990, 2012, "year", True, True) print(expr) range(1990, 2012, year, true, true) ``` #### `istrue(expr)` Creates an `istrue(...)` filter predicate. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `expr` | `str` | The field or expression to check for truthiness. | *required* | Returns: | Name | Type | Description | | --- | --- | --- | | `Expression` | `Expression` | `istrue(<expr>)` | Example ```python from vespa.querybuilder import Grouping as G expr = G.istrue("my_bool") print(expr) istrue(my_bool) ``` #### `count()` “count()” aggregator. By default, returns the string 'count()'. Negative ordering or usage can be done by prefixing a minus, e.g. order(-count()) in Vespa syntax. Returns: | Name | Type | Description | | --- | --- | --- | | `str` | `Expression` | 'count()' or a prefixed version if used with a minus operator. | Example ```python from vespa.querybuilder import Grouping as G expr = G.count() print(expr) count() sort_expr = f"-{expr}" print(sort_expr) -count() ``` #### `sum(value)` “sum(...)” aggregator. Sums the given expression or field over all documents in the group.
Parameters: | Name | Type | Description | Default | | ------- | ------------------------ | --------------------------------------- | ---------- | | `value` | `Union[str, int, float]` | The field or numeric expression to sum. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ---------------------------------------------------------- | | `str` | `Expression` | A Vespa grouping expression string of the form 'sum(...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.sum("my_numeric_field") print(expr) sum(my_numeric_field) ``` #### `avg(value)` “avg(...)” aggregator. Computes the average of the given expression or field for all documents in the group. Parameters: | Name | Type | Description | Default | | ------- | ------------------------ | ------------------------------------------- | ---------- | | `value` | `Union[str, int, float]` | The field or numeric expression to average. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ---------------------------------------------------------- | | `str` | `Expression` | A Vespa grouping expression string of the form 'avg(...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.avg("my_numeric_field") print(expr) avg(my_numeric_field) ``` #### `min(value)` “min(...)” aggregator. Keeps the minimum value of the expression or field among all documents in the group. Parameters: | Name | Type | Description | Default | | ------- | ------------------------ | ------------------------------------------------------- | ---------- | | `value` | `Union[str, int, float]` | The field or numeric expression to find the minimum of. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ---------------------------------------------------------- | | `str` | `Expression` | A Vespa grouping expression string of the form 'min(...)'. 
| Example ```python from vespa.querybuilder import Grouping as G expr = G.min("some_field") print(expr) min(some_field) ``` #### `max(value)` “max(...)” aggregator. Keeps the maximum value of the expression or field among all documents in the group. Parameters: | Name | Type | Description | Default | | ------- | ------------------------ | ------------------------------------------------------- | ---------- | | `value` | `Union[str, int, float]` | The field or numeric expression to find the maximum of. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ---------------------------------------------------------- | | `str` | `Expression` | A Vespa grouping expression string of the form 'max(...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.max("relevance()") print(expr) max(relevance()) ``` #### `stddev(value)` “stddev(...)” aggregator. Computes the population standard deviation for the expression or field among all documents in the group. Parameters: | Name | Type | Description | Default | | ------- | ------------------------ | -------------------------------- | ---------- | | `value` | `Union[str, int, float]` | The field or numeric expression. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------------------------------------------------- | | `str` | `Expression` | A Vespa grouping expression string of the form 'stddev(...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.stddev("my_numeric_field") print(expr) stddev(my_numeric_field) ``` #### `xor(value)` “xor(...)” aggregator. XORs all values of the expression or field together over the documents in the group. Parameters: | Name | Type | Description | Default | | ------- | ------------------------ | -------------------------------- | ---------- | | `value` | `Union[str, int, float]` | The field or numeric expression. 
| *required* | Returns: | Name | Type | Description | | ----- | ------------ | ---------------------------------------------------------- | | `str` | `Expression` | A Vespa grouping expression string of the form 'xor(...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.xor("my_field") print(expr) xor(my_field) ``` #### `output(*args)` Defines output aggregators to be collected for the grouping level. Parameters: | Name | Type | Description | Default | | ------- | ------------------------ | --------------------------------------------------------------- | ------- | | `*args` | `Union[str, Expression]` | Multiple aggregator expressions, e.g., 'count()', 'sum(price)'. | `()` | Returns: | Name | Type | Description | | ------------ | ------------ | ------------------------------------------------------------- | | `Expression` | `Expression` | A Vespa grouping expression string of the form 'output(...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.output(G.count(), G.sum("price")) print(expr) output(count(),sum(price)) ``` #### `order(*args)` Defines an order(...) clause to sort groups by the given expressions or aggregators. Parameters: | Name | Type | Description | Default | | ------- | ------------------------ | ------------------------------------------------ | ------- | | `*args` | `Union[str, Expression]` | Multiple expressions or aggregators to order by. | `()` | Returns: | Name | Type | Description | | ------------ | ------------ | ------------------------------------------------------------ | | `Expression` | `Expression` | A Vespa grouping expression string of the form 'order(...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.order(G.sum(G.relevance()), -G.count()) print(expr) order(sum(relevance()),-count()) ``` #### `precision(value)` Sets the “precision(...)” for the grouping step. 
Parameters: | Name | Type | Description | Default | | ------- | ----- | ---------------- | ---------- | | `value` | `int` | Precision value. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ---------------------------------------------------------------- | | `str` | `Expression` | A Vespa grouping expression string of the form 'precision(...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.precision(1000) print(expr) precision(1000) ``` #### `add(*expressions)` “add(...)” expression. Adds all arguments together in order. Parameters: | Name | Type | Description | Default | | -------------- | ----- | ---------------------------- | ------- | | `*expressions` | `str` | The expressions to be added. | `()` | Returns: | Name | Type | Description | | ----- | ------------ | --------------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'add(expr1, expr2, ...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.add("my_field", "5", "10") print(expr) add(my_field, 5, 10) ``` #### `sub(*expressions)` “sub(...)” expression. Subtracts each subsequent argument from the first. Parameters: | Name | Type | Description | Default | | -------------- | ----- | ---------------------------------------- | ------- | | `*expressions` | `str` | The expressions involved in subtraction. | `()` | Returns: | Name | Type | Description | | ----- | ------------ | --------------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'sub(expr1, expr2, ...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.sub("my_field", "2") print(expr) sub(my_field, 2) ``` #### `mul(*expressions)` “mul(...)” expression. Multiplies all arguments in order. 
Parameters: | Name | Type | Description | Default | | -------------- | ----- | ---------------------------- | ------- | | `*expressions` | `str` | The expressions to multiply. | `()` | Returns: | Name | Type | Description | | ----- | ------------ | --------------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'mul(expr1, expr2, ...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.mul("my_field", "2", "3") print(expr) mul(my_field, 2, 3) ``` #### `div(*expressions)` “div(...)” expression. Divides the first argument by the second, then that result by the third, and so on. Parameters: | Name | Type | Description | Default | | -------------- | ----- | ----------------------------------- | ------- | | `*expressions` | `str` | The expressions to divide in order. | `()` | Returns: | Name | Type | Description | | ----- | ------------ | --------------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'div(expr1, expr2, ...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.div("my_field", "2") print(expr) div(my_field, 2) ``` #### `mod(*expressions)` “mod(...)” expression. Takes the first argument modulo the second, then that result modulo the third, and so on. Parameters: | Name | Type | Description | Default | | -------------- | ----- | -------------------------------------------- | ------- | | `*expressions` | `str` | The expressions to apply modulo on in order. | `()` | Returns: | Name | Type | Description | | ----- | ------------ | --------------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'mod(expr1, expr2, ...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.mod("my_field", "100") print(expr) mod(my_field,100) ``` #### `and_(*expressions)` “and(...)” expression. Bitwise AND of the arguments in order.
Parameters: | Name | Type | Description | Default | | -------------- | ----- | ------------------------------------- | ------- | | `*expressions` | `str` | The expressions to apply bitwise AND. | `()` | Returns: | Name | Type | Description | | ----- | ------------ | --------------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'and(expr1, expr2, ...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.and_("fieldA", "fieldB") print(expr) and(fieldA, fieldB) ``` #### `or_(*expressions)` “or(...)” expression. Bitwise OR of the arguments in order. Parameters: | Name | Type | Description | Default | | -------------- | ----- | ------------------------------------ | ------- | | `*expressions` | `str` | The expressions to apply bitwise OR. | `()` | Returns: | Name | Type | Description | | ----- | ------------ | -------------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'or(expr1, expr2, ...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.or_("fieldA", "fieldB") print(expr) or(fieldA, fieldB) ``` #### `xor_expr(*expressions)` “xor(...)” bitwise expression. (Note: For aggregator use, see xor(...) aggregator method above.) Parameters: | Name | Type | Description | Default | | -------------- | ----- | ------------------------------------- | ------- | | `*expressions` | `str` | The expressions to apply bitwise XOR. | `()` | Returns: | Name | Type | Description | | ----- | ------------ | --------------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'xor(expr1, expr2, ...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.xor_expr("fieldA", "fieldB") print(expr) xor(fieldA, fieldB) ``` #### `strlen(expr)` “strlen(...)” expression. Returns the number of bytes in the string. 
Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------------- | ---------- | | `expr` | `str` | The string field or expression. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ---------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'strlen(...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.strlen("my_string_field") print(expr) strlen(my_string_field) ``` #### `strcat(*expressions)` “strcat(...)” expression. Concatenate all string arguments in order. Parameters: | Name | Type | Description | Default | | -------------- | ----- | -------------------------------------- | ------- | | `*expressions` | `str` | The string expressions to concatenate. | `()` | Returns: | Name | Type | Description | | ----- | ------------ | ------------------------------------------------------------------ | | `str` | `Expression` | A Vespa expression string of the form 'strcat(expr1, expr2, ...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.strcat("fieldA", "_", "fieldB") print(expr) strcat(fieldA,_,fieldB) ``` #### `todouble(expr)` “todouble(...)” expression. Convert argument to double. Parameters: | Name | Type | Description | Default | | ------ | ----- | ----------------------------------- | ---------- | | `expr` | `str` | The expression or field to convert. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------------------------------------------ | | `str` | `Expression` | A Vespa expression string of the form 'todouble(...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.todouble("my_field") print(expr) todouble(my_field) ``` #### `tolong(expr)` “tolong(...)” expression. Convert argument to long. 
Parameters: | Name | Type | Description | Default | | ------ | ----- | ----------------------------------- | ---------- | | `expr` | `str` | The expression or field to convert. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ---------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'tolong(...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.tolong("my_field") print(expr) tolong(my_field) ``` #### `tostring(expr)` “tostring(...)” expression. Convert argument to string. Parameters: | Name | Type | Description | Default | | ------ | ----- | ----------------------------------- | ---------- | | `expr` | `str` | The expression or field to convert. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------------------------------------------ | | `str` | `Expression` | A Vespa expression string of the form 'tostring(...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.tostring("my_field") print(expr) tostring(my_field) ``` #### `toraw(expr)` “toraw(...)” expression. Convert argument to raw data. Parameters: | Name | Type | Description | Default | | ------ | ----- | ----------------------------------- | ---------- | | `expr` | `str` | The expression or field to convert. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | --------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'toraw(...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.toraw("my_field") print(expr) toraw(my_field) ``` #### `cat(*expressions)` “cat(...)” expression. Concatenate the binary representation of arguments. 
Parameters: | Name | Type | Description | Default | | -------------- | ----- | ------------------------------------------------ | ------- | | `*expressions` | `str` | The binary expressions or fields to concatenate. | `()` | Returns: | Name | Type | Description | | ----- | ------------ | --------------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'cat(expr1, expr2, ...)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.cat("fieldA", "fieldB") print(expr) cat(fieldA,fieldB) ``` #### `md5(expr, width)` “md5(...)” expression. Computes an MD5 hash over the binary representation of the argument and keeps the lowest 'width' bits. Parameters: | Name | Type | Description | Default | | ------- | ----- | ---------------------------------------- | ---------- | | `expr` | `str` | The expression or field to apply MD5 on. | *required* | | `width` | `int` | The number of bits to keep. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | --------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'md5(expr, width)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.md5("my_field", 16) print(expr) md5(my_field, 16) ``` #### `xorbit(expr, width)` “xorbit(...)” expression. Performs an XOR of 'width' bits over the binary representation of the argument. Width is rounded up to a multiple of 8. Parameters: | Name | Type | Description | Default | | ------- | ----- | ------------------------------------------- | ---------- | | `expr` | `str` | The expression or field to apply xorbit on. | *required* | | `width` | `int` | The number of bits for the XOR operation. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------------------------------------------------ | | `str` | `Expression` | A Vespa expression string of the form 'xorbit(expr, width)'.
| Example ```python from vespa.querybuilder import Grouping as G expr = G.xorbit("my_field", 16) print(expr) xorbit(my_field, 16) ``` #### `relevance()` “relevance()” expression. Returns the computed rank (relevance) of a document. Returns: | Name | Type | Description | | ----- | ------------ | ------------------------------------------- | | `str` | `Expression` | 'relevance()' as a Vespa expression string. | Example ```python from vespa.querybuilder import Grouping as G expr = G.relevance() print(expr) relevance() ``` #### `array_at(array_name, index_expr)` “array.at(...)” accessor expression. Returns a single element from the array at the given index. Parameters: | Name | Type | Description | Default | | ------------ | ----------------- | --------------------------------------------------- | ---------- | | `array_name` | `str` | The name of the array. | *required* | | `index_expr` | `Union[str, int]` | The index or expression that evaluates to an index. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | -------------------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'array.at(array_name, index)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.array_at("my_array", 0) print(expr) array.at(my_array, 0) ``` #### `zcurve_x(expr)` “zcurve.x(...)” expression. Returns the X component of the given zcurve-encoded 2D point. Parameters: | Name | Type | Description | Default | | ------ | ----- | --------------------------------------- | ---------- | | `expr` | `str` | The zcurve-encoded field or expression. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'zcurve.x(expr)'. 
| Example ```python from vespa.querybuilder import Grouping as G expr = G.zcurve_x("location_zcurve") print(expr) zcurve.x(location_zcurve) ``` #### `zcurve_y(expr)` “zcurve.y(...)” expression. Returns the Y component of the given zcurve-encoded 2D point. Parameters: | Name | Type | Description | Default | | ------ | ----- | --------------------------------------- | ---------- | | `expr` | `str` | The zcurve-encoded field or expression. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'zcurve.y(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.zcurve_y("location_zcurve") print(expr) zcurve.y(location_zcurve) ``` #### `time_dayofmonth(expr)` “time.dayofmonth(...)” expression. Returns the day of month (1-31). Parameters: | Name | Type | Description | Default | | ------ | ----- | ---------------------------------- | ---------- | | `expr` | `str` | The timestamp field or expression. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | -------------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'time.dayofmonth(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.time_dayofmonth("timestamp_field") print(expr) time.dayofmonth(timestamp_field) ``` #### `time_dayofweek(expr)` “time.dayofweek(...)” expression. Returns the day of week (0-6), Monday = 0. Parameters: | Name | Type | Description | Default | | ------ | ----- | ---------------------------------- | ---------- | | `expr` | `str` | The timestamp field or expression. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'time.dayofweek(expr)'. 
| Example ```python from vespa.querybuilder import Grouping as G expr = G.time_dayofweek("timestamp_field") print(expr) time.dayofweek(timestamp_field) ``` #### `time_dayofyear(expr)` “time.dayofyear(...)” expression. Returns the day of year (0-365). Parameters: | Name | Type | Description | Default | | ------ | ----- | ---------------------------------- | ---------- | | `expr` | `str` | The timestamp field or expression. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'time.dayofyear(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.time_dayofyear("timestamp_field") print(expr) time.dayofyear(timestamp_field) ``` #### `time_hourofday(expr)` “time.hourofday(...)” expression. Returns the hour of day (0-23). Parameters: | Name | Type | Description | Default | | ------ | ----- | ---------------------------------- | ---------- | | `expr` | `str` | The timestamp field or expression. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'time.hourofday(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.time_hourofday("timestamp_field") print(expr) time.hourofday(timestamp_field) ``` #### `time_minuteofhour(expr)` “time.minuteofhour(...)” expression. Returns the minute of hour (0-59). Parameters: | Name | Type | Description | Default | | ------ | ----- | ---------------------------------- | ---------- | | `expr` | `str` | The timestamp field or expression. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ---------------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'time.minuteofhour(expr)'. 
| Example ```python from vespa.querybuilder import Grouping as G expr = G.time_minuteofhour("timestamp_field") print(expr) time.minuteofhour(timestamp_field) ``` #### `time_monthofyear(expr)` “time.monthofyear(...)” expression. Returns the month of year (1-12). Parameters: | Name | Type | Description | Default | | ------ | ----- | ---------------------------------- | ---------- | | `expr` | `str` | The timestamp field or expression. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | --------------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'time.monthofyear(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.time_monthofyear("timestamp_field") print(expr) time.monthofyear(timestamp_field) ``` #### `time_secondofminute(expr)` “time.secondofminute(...)” expression. Returns the second of minute (0-59). Parameters: | Name | Type | Description | Default | | ------ | ----- | ---------------------------------- | ---------- | | `expr` | `str` | The timestamp field or expression. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------------------------------------------------------ | | `str` | `Expression` | A Vespa expression string of the form 'time.secondofminute(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.time_secondofminute("timestamp_field") print(expr) time.secondofminute(timestamp_field) ``` #### `time_year(expr)` “time.year(...)” expression. Returns the full year (e.g. 2009). Parameters: | Name | Type | Description | Default | | ------ | ----- | ---------------------------------- | ---------- | | `expr` | `str` | The timestamp field or expression. 
| *required* | Returns: | Name | Type | Description | | ----- | ------------ | -------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'time.year(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.time_year("timestamp_field") print(expr) time.year(timestamp_field) ``` #### `time_date(expr)` “time.date(...)” expression. Returns the date (e.g. 2009-01-10). Parameters: | Name | Type | Description | Default | | ------ | ----- | ---------------------------------- | ---------- | | `expr` | `str` | The timestamp field or expression. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | -------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'time.date(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.time_date("timestamp_field") print(expr) time.date(timestamp_field) ``` #### `math_exp(expr)` “math.exp(...)” expression. Returns e^expr. Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------ | ---------- | | `expr` | `str` | The expression or field. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'math.exp(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.math_exp("my_field") print(expr) math.exp(my_field) ``` #### `math_log(expr)` “math.log(...)” expression. Returns the natural logarithm of expr. Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------ | ---------- | | `expr` | `str` | The expression or field. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ----------------- | | `str` | `Expression` | 'math.log(expr)'. 
| Example ```python from vespa.querybuilder import Grouping as G expr = G.math_log("my_field") print(expr) math.log(my_field) ``` #### `math_log1p(expr)` “math.log1p(...)” expression. Returns the natural logarithm of (1 + expr). Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------ | ---------- | | `expr` | `str` | The expression or field. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------- | | `str` | `Expression` | 'math.log1p(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.math_log1p("my_field") print(expr) math.log1p(my_field) ``` #### `math_log10(expr)` “math.log10(...)” expression. Returns the base-10 logarithm of expr. Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------ | ---------- | | `expr` | `str` | The expression or field. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------- | | `str` | `Expression` | 'math.log10(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.math_log10("my_field") print(expr) math.log10(my_field) ``` #### `math_sqrt(expr)` “math.sqrt(...)” expression. Returns the square root of expr. Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------ | ---------- | | `expr` | `str` | The expression or field. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------ | | `str` | `Expression` | 'math.sqrt(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.math_sqrt("my_field") print(expr) math.sqrt(my_field) ``` #### `math_cbrt(expr)` “math.cbrt(...)” expression. Returns the cube root of expr. Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------ | ---------- | | `expr` | `str` | The expression or field. 
| *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------ | | `str` | `Expression` | 'math.cbrt(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.math_cbrt("my_field") print(expr) math.cbrt(my_field) ``` #### `math_sin(expr)` “math.sin(...)” expression. Returns the sine of expr (argument in radians). Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------ | ---------- | | `expr` | `str` | The expression or field. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ----------------- | | `str` | `Expression` | 'math.sin(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.math_sin("my_field") print(expr) math.sin(my_field) ``` #### `math_cos(expr)` “math.cos(...)” expression. Returns the cosine of expr (argument in radians). Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------ | ---------- | | `expr` | `str` | The expression or field. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ----------------- | | `str` | `Expression` | 'math.cos(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.math_cos("my_field") print(expr) math.cos(my_field) ``` #### `math_tan(expr)` “math.tan(...)” expression. Returns the tangent of expr (argument in radians). Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------ | ---------- | | `expr` | `str` | The expression or field. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ----------------- | | `str` | `Expression` | 'math.tan(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.math_tan("my_field") print(expr) math.tan(my_field) ``` #### `math_asin(expr)` “math.asin(...)” expression. Returns the arcsine of expr (in radians). 
Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------ | ---------- | | `expr` | `str` | The expression or field. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------ | | `str` | `Expression` | 'math.asin(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.math_asin("my_field") print(expr) math.asin(my_field) ``` #### `math_acos(expr)` “math.acos(...)” expression. Returns the arccosine of expr (in radians). Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------ | ---------- | | `expr` | `str` | The expression or field. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------ | | `str` | `Expression` | 'math.acos(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.math_acos("my_field") print(expr) math.acos(my_field) ``` #### `math_atan(expr)` “math.atan(...)” expression. Returns the arctangent of expr (in radians). Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------ | ---------- | | `expr` | `str` | The expression or field. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------ | | `str` | `Expression` | 'math.atan(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.math_atan("my_field") print(expr) math.atan(my_field) ``` #### `math_sinh(expr)` “math.sinh(...)” expression. Returns the hyperbolic sine of expr. Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------ | ---------- | | `expr` | `str` | The expression or field. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------ | | `str` | `Expression` | 'math.sinh(expr)'. 
| Example ```python from vespa.querybuilder import Grouping as G expr = G.math_sinh("my_field") print(expr) math.sinh(my_field) ``` #### `math_cosh(expr)` “math.cosh(...)” expression. Returns the hyperbolic cosine of expr. Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------ | ---------- | | `expr` | `str` | The expression or field. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------ | | `str` | `Expression` | 'math.cosh(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.math_cosh("my_field") print(expr) math.cosh(my_field) ``` #### `math_tanh(expr)` “math.tanh(...)” expression. Returns the hyperbolic tangent of expr. Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------ | ---------- | | `expr` | `str` | The expression or field. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------ | | `str` | `Expression` | 'math.tanh(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.math_tanh("my_field") print(expr) math.tanh(my_field) ``` #### `math_asinh(expr)` “math.asinh(...)” expression. Returns the inverse hyperbolic sine of expr. Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------ | ---------- | | `expr` | `str` | The expression or field. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------- | | `str` | `Expression` | 'math.asinh(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.math_asinh("my_field") print(expr) math.asinh(my_field) ``` #### `math_acosh(expr)` “math.acosh(...)” expression. Returns the inverse hyperbolic cosine of expr. Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------ | ---------- | | `expr` | `str` | The expression or field. 
| *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------- | | `str` | `Expression` | 'math.acosh(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.math_acosh("my_field") print(expr) math.acosh(my_field) ``` #### `math_atanh(expr)` “math.atanh(...)” expression. Returns the inverse hyperbolic tangent of expr. Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------ | ---------- | | `expr` | `str` | The expression or field. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------- | | `str` | `Expression` | 'math.atanh(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.math_atanh("my_field") print(expr) math.atanh(my_field) ``` #### `math_pow(expr_x, expr_y)` “math.pow(...)” expression. Returns expr_x^expr_y. Parameters: | Name | Type | Description | Default | | -------- | ----- | ----------------------------------------- | ---------- | | `expr_x` | `str` | The expression or field for the base. | *required* | | `expr_y` | `str` | The expression or field for the exponent. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | --------------------------- | | `str` | `Expression` | 'math.pow(expr_x, expr_y)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.math_pow("my_field", "2") print(expr) math.pow(my_field,2) ``` #### `math_hypot(expr_x, expr_y)` “math.hypot(...)” expression. Returns the length of the hypotenuse given expr_x and expr_y. Parameters: | Name | Type | Description | Default | | -------- | ----- | ------------------------------------------------------------ | ---------- | | `expr_x` | `str` | The expression or field for the first side of the triangle. | *required* | | `expr_y` | `str` | The expression or field for the second side of the triangle. 
| *required* | Returns: | Name | Type | Description | | ----- | ------------ | ----------------------------- | | `str` | `Expression` | 'math.hypot(expr_x, expr_y)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.math_hypot("my_field_x", "my_field_y") print(expr) math.hypot(my_field_x, my_field_y) ``` #### `size(expr)` “size(...)” expression. Returns the number of elements if expr is a list; otherwise returns 1. Parameters: | Name | Type | Description | Default | | ------ | ----- | ----------------------------- | ---------- | | `expr` | `str` | The list expression or field. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | --------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'size(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.size("my_array") print(expr) size(my_array) ``` #### `sort(expr)` “sort(...)” expression. Sorts the elements of the list argument in ascending order. Parameters: | Name | Type | Description | Default | | ------ | ----- | ------------------------------------- | ---------- | | `expr` | `str` | The list expression or field to sort. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | --------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'sort(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.sort("my_array") print(expr) sort(my_array) ``` #### `reverse(expr)` “reverse(...)” expression. Reverses the elements of the list argument. Parameters: | Name | Type | Description | Default | | ------ | ----- | ---------------------------------------- | ---------- | | `expr` | `str` | The list expression or field to reverse. 
| *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------------------------------------------ | | `str` | `Expression` | A Vespa expression string of the form 'reverse(expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.reverse("my_array") print(expr) reverse(my_array) ``` #### `fixedwidth(value, bucket_width)` “fixedwidth(...)” bucket expression. Maps the value of the first argument into consecutive buckets whose width is the second argument. Parameters: | Name | Type | Description | Default | | -------------- | ------------------- | ---------------------------------- | ---------- | | `value` | `str` | The field or expression to bucket. | *required* | | `bucket_width` | `Union[int, float]` | The width of each bucket. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------------------------------------------------------------ | | `str` | `Expression` | A Vespa expression string of the form 'fixedwidth(value, bucket_width)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.fixedwidth("my_field",10) print(expr) fixedwidth(my_field,10) ``` #### `predefined(value, buckets)` “predefined(...)” bucket expression. Maps the value into the provided list of buckets. Each 'bucket' must be a string representing the range, e.g. 'bucket(-inf,0)', 'bucket\[0,10)', 'bucket\[10,inf)', etc. Parameters: | Name | Type | Description | Default | | --------- | ----------- | ---------------------------------- | ---------- | | `value` | `str` | The field or expression to bucket. | *required* | | `buckets` | `List[str]` | A list of bucket definitions. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | --------------------------------------------------------------- | | `str` | `Expression` | A Vespa expression string of the form 'predefined(value, bucket1, bucket2, ...)'.
| Example ```python from vespa.querybuilder import Grouping as G expr = G.predefined("my_field", ["bucket(-inf,0)", "bucket[0,10)", "bucket[10,inf)"]) print(expr) predefined(my_field,bucket(-inf,0),bucket[0,10),bucket[10,inf)) ``` #### `interpolatedlookup(array_attr, lookup_expr)` “interpolatedlookup(...)” expression. Counts elements in a sorted array that are less than an expression, with linear interpolation if the expression is between element values. Parameters: | Name | Type | Description | Default | | ------------- | ----- | ---------------------------------- | ---------- | | `array_attr` | `str` | The sorted array field name. | *required* | | `lookup_expr` | `str` | The expression or value to lookup. | *required* | Returns: | Name | Type | Description | | ----- | ------------ | ------------------------------------------------------------------------ | | `str` | `Expression` | A Vespa expression string of the form 'interpolatedlookup(array, expr)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.interpolatedlookup("my_sorted_array", "4.2") print(expr) interpolatedlookup(my_sorted_array, 4.2) ``` #### `summary(summary_class='')` “summary(...)” hit aggregator. Produces a summary of the requested summary class. If no summary class is specified, “summary()” is used. Parameters: | Name | Type | Description | Default | | --------------- | ----- | ------------------------------------------ | ------- | | `summary_class` | `str` | Name of the summary class. Defaults to "". | `''` | Returns: | Name | Type | Description | | ----- | ------------ | -------------------------------------------------------------- | | `str` | `Expression` | A Vespa grouping expression string of the form 'summary(...)'. 
| Example ```python from vespa.querybuilder import Grouping as G expr = G.summary() print(expr) summary() ``` ```python expr = G.summary("my_summary_class") print(expr) summary(my_summary_class) ``` #### `as_(expression, label)` Appends an ' as(label)' part to a grouping block expression. Parameters: | Name | Type | Description | Default | | ------------ | ----- | ----------------------------- | ---------- | | `expression` | `str` | The expression to be labeled. | *required* | | `label` | `str` | The label to be used. | *required* | Returns: | Name | Type | Description | | ----- | ----- | --------------------------------------------------------------------- | | `str` | `str` | A Vespa grouping expression string of the form 'expression as(label)' | Example ```python from vespa.querybuilder import Grouping as G expr = G.as_(G.each(G.output(G.count())), "mylabel") print(expr) each(output(count())) as(mylabel) ``` #### `alias(alias_name, expression)` Defines an alias(...) grouping syntax. This lets you name an expression, so you can reference it later by $alias_name. Parameters: | Name | Type | Description | Default | | ------------ | ----- | ------------------------ | ---------- | | `alias_name` | `str` | The alias name. | *required* | | `expression` | `str` | The expression to alias. | *required* | Returns: | Name | Type | Description | | ----- | ----- | ------------------------------------------------------------------------------- | | `str` | `str` | A Vespa grouping expression string of the form 'alias(alias_name, expression)'. | Example ```python from vespa.querybuilder import Grouping as G expr = G.alias("my_alias", G.add("fieldA", "fieldB")) print(expr) alias(my_alias,add(fieldA, fieldB)) ``` ## `vespa.throttling` Adaptive throttling for async requests to Vespa. 
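All of the helpers above return plain expression strings, so they nest directly into larger grouping clauses. As a rough sketch of how the pieces fit together (plain string assembly rather than `Grouping` calls, so every part is visible; `my_field` is a hypothetical attribute, and the `all(group(...) each(...))` shape and the `|` pipe follow Vespa's grouping syntax):

```python
# Build a grouping clause by nesting expression strings, mirroring what
# helpers like G.each(G.output(G.count())) and G.as_(...) produce.
each_expr = "each(output(count())) as(mylabel)"  # labeled per-group output
grouping = f"all(group(my_field) {each_expr})"   # group hits by my_field

# A grouping clause is attached to a YQL select with '|'.
yql = f"select * from sources * where true | {grouping}"
print(yql)
```

The same string lands in the query whether it is assembled by hand or via the `Grouping` helpers; the helpers simply make the nesting explicit and checkable from Python.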
This module provides an AdaptiveThrottler class that dynamically adjusts concurrency based on server response patterns to prevent overloading Vespa applications with expensive operations (e.g., large embedding models). ### `AdaptiveThrottler(initial_concurrent=10, min_concurrent=1, max_concurrent=100, error_threshold=0.1, success_window=50, reduction_factor=0.5, increase_step=2, cooldown_seconds=5.0)` Adaptive throttler that adjusts concurrency based on response status codes. The throttler starts with a conservative concurrency limit and automatically adjusts based on server responses: - Reduces concurrency when the error rate exceeds the threshold (504, 503, 429 errors) - Gradually increases concurrency during healthy periods after a cooldown Attributes: | Name | Type | Description | | -------------------- | ------- | ---------------------------------------------------------- | | `initial_concurrent` | `int` | Starting concurrency limit (default: 10) | | `min_concurrent` | `int` | Minimum concurrency floor (default: 1) | | `max_concurrent` | `int` | Maximum concurrency ceiling (default: 100) | | `error_threshold` | `float` | Error rate that triggers reduction (default: 0.1 = 10%) | | `success_window` | `int` | Consecutive successes needed to increase (default: 50) | | `reduction_factor` | `float` | Factor to reduce concurrency by (default: 0.5 = 50%) | | `increase_step` | `int` | Amount to increase on success (default: 2) | | `cooldown_seconds` | `float` | Wait time before increasing after reduction (default: 5.0) | Example ```python from vespa.throttling import AdaptiveThrottler throttler = AdaptiveThrottler(initial_concurrent=10, max_concurrent=50) async def make_request(): async with throttler.semaphore: response = await do_request() await throttler.record_result(response.status_code) return response ``` #### `current_concurrent` Current concurrency limit. #### `semaphore` Async semaphore for rate limiting requests.
Note: This property lazily creates the semaphore on first access to maintain compatibility with Python 3.9. #### `__post_init__()` Initialize internal state after dataclass construction. #### `record_result(status_code)` Record a request result and adjust throttling if needed. Parameters: | Name | Type | Description | Default | | ------------- | ----- | ---------------------------------- | ---------- | | `status_code` | `int` | HTTP status code from the response | *required* | #### `reset()` Reset throttler to initial state.
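The reduction and increase behavior described by the attributes table can be sketched with the arithmetic those defaults imply. This is a standalone illustration, not the library's implementation, and the exact rounding used by `AdaptiveThrottler` is an assumption here:

```python
def reduce_limit(current: int, reduction_factor: float = 0.5,
                 min_concurrent: int = 1) -> int:
    # When the error rate exceeds error_threshold, shrink the limit by
    # reduction_factor, never dropping below min_concurrent.
    return max(min_concurrent, int(current * reduction_factor))

def increase_limit(current: int, increase_step: int = 2,
                   max_concurrent: int = 100) -> int:
    # After success_window consecutive successes (and the cooldown),
    # grow the limit by increase_step, capped at max_concurrent.
    return min(max_concurrent, current + increase_step)

print(reduce_limit(10))    # 10 * 0.5 -> 5
print(increase_limit(99))  # 99 + 2, capped at 100
```

The multiplicative decrease with additive increase mirrors the classic AIMD pattern used in congestion control: back off quickly when the server signals overload (429/503/504), and probe capacity back slowly once responses are healthy again.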