BGE-M3 - The Mother of all embedding models¶
BAAI released BGE-M3 on January 30th, 2024, a new member of the BGE model series.
M3 stands for Multi-Linguality (100+ languages), Multi-Granularity (input lengths up to 8192 tokens), and Multi-Functionality (unification of dense, sparse/lexical, and multi-vector (ColBERT) retrieval).
This notebook demonstrates how to represent all three BGE-M3 embedding representations in Vespa! Vespa is the only scalable serving engine that can handle all M3 representations.
This code is inspired by the README from the model hub BAAI/bge-m3.
Let's get started! First, install dependencies:
!pip3 install -U pyvespa FlagEmbedding vespacli
Explore the multiple representations of M3¶
When encoding text, we can ask for the representations we want:
- Sparse vectors with weights for the token IDs (from the multilingual tokenization process)
- Dense (DPR) regular text embeddings
- Multi-Dense (ColBERT) - contextualized multi-token vectors
Let us dive into it. To use this model on CPU, we set use_fp16 to False; for GPU inference, use_fp16=True is recommended for accelerated inference.
from FlagEmbedding import BGEM3FlagModel
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=False)
A demo passage¶
Let us encode a simple passage
passage = [
"BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction."
]
passage_embeddings = model.encode(
passage, return_dense=True, return_sparse=True, return_colbert_vecs=True
)
passage_embeddings.keys()
dict_keys(['dense_vecs', 'lexical_weights', 'colbert_vecs'])
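Before defining the Vespa schema, it is useful to peek at what the three representations look like. A quick sketch (the exact token count depends on the tokenizer):
# dense_vecs: one 1024-dimensional vector per input text
print(passage_embeddings["dense_vecs"][0].shape)
# lexical_weights: a sparse mapping from token id to weight
print(dict(list(passage_embeddings["lexical_weights"][0].items())[:5]))
# colbert_vecs: one 1024-dimensional vector per token in the input
print(passage_embeddings["colbert_vecs"][0].shape)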
Defining the Vespa application¶
PyVespa helps us build the Vespa application package. A Vespa application package consists of configuration files, schemas, models, and code (plugins).
First, we define a Vespa schema with the fields we want to store and their type. We use Vespa tensors to represent the three different M3 representations.
- We use a mapped tensor denoted by t{} to represent the sparse lexical representation
- We use an indexed tensor denoted by x[1024] to represent the dense single-vector representation of 1024 dimensions
- For the colbert_rep (multi-vector) representation, we use a mixed tensor that combines a mapped and an indexed dimension, which lets us store a variable number of token vectors
We use the bfloat16 tensor cell type, saving 50% storage compared to float.
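To make these tensor types concrete, here is a sketch with made-up values showing how a value of each type is expressed in the pyvespa feed format (the real conversion from the model output is shown in the feeding section below):
# Hypothetical values for illustration only
lexical_example = {"1293": 0.25, "57663": 0.41}       # mapped tensor t{}: token id -> weight
dense_example = [0.0] * 1024                          # indexed tensor x[1024]: fixed-size vector
colbert_example = {0: [0.0] * 1024, 1: [0.0] * 1024}  # mixed tensor t{},x[1024]: one vector per token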
from vespa.package import Schema, Document, Field, FieldSet
m_schema = Schema(
name="m",
document=Document(
fields=[
Field(name="id", type="string", indexing=["summary"]),
Field(
name="text",
type="string",
indexing=["summary", "index"],
index="enable-bm25",
),
Field(
name="lexical_rep",
type="tensor<bfloat16>(t{})",
indexing=["summary", "attribute"],
),
Field(
name="dense_rep",
type="tensor<bfloat16>(x[1024])",
indexing=["summary", "attribute"],
attribute=["distance-metric: angular"],
),
Field(
name="colbert_rep",
type="tensor<bfloat16>(t{}, x[1024])",
indexing=["summary", "attribute"],
),
],
),
fieldsets=[FieldSet(name="default", fields=["text"])],
)
The above defines our m schema with the original text and the three different representations.
from vespa.package import ApplicationPackage
vespa_app_name = "m"
vespa_application_package = ApplicationPackage(name=vespa_app_name, schema=[m_schema])
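If you want to inspect the generated schema and services.xml before deploying, the package can be written to disk. A minimal sketch, assuming pyvespa's ApplicationPackage.to_files, which writes the package files to a directory:
import pathlib, tempfile

# Write the generated application package to a temporary directory for inspection
package_dir = pathlib.Path(tempfile.mkdtemp())
vespa_application_package.to_files(package_dir)
print([str(p) for p in package_dir.rglob("*.sd")])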
In the last step, we configure ranking by adding rank-profiles to the schema.
We define three functions that implement the scoring for each of the three representations:
- dense (dense cosine similarity)
- lexical (sparse dot product)
- max_sim (the ColBERT MaxSim operation)
Then, we combine these three scoring functions using a linear combination with weights, as suggested by the BGE-M3 authors.
from vespa.package import RankProfile, Function, FirstPhaseRanking
semantic = RankProfile(
name="m3hybrid",
inputs=[
("query(q_dense)", "tensor<bfloat16>(x[1024])"),
("query(q_lexical)", "tensor<bfloat16>(t{})"),
("query(q_colbert)", "tensor<bfloat16>(qt{}, x[1024])"),
("query(q_len_colbert)", "float"),
],
functions=[
Function(
name="dense",
expression="cosine_similarity(query(q_dense), attribute(dense_rep),x)",
),
Function(
name="lexical", expression="sum(query(q_lexical) * attribute(lexical_rep))"
),
Function(
name="max_sim",
expression="sum(reduce(sum(query(q_colbert) * attribute(colbert_rep) , x),max, t),qt)/query(q_len_colbert)",
),
],
first_phase=FirstPhaseRanking(
expression="0.4*dense + 0.2*lexical + 0.4*max_sim", rank_score_drop_limit=0.0
),
match_features=["dense", "lexical", "max_sim", "bm25(text)"],
)
m_schema.add_rank_profile(semantic)
The m3hybrid rank-profile above defines the query input embedding types and three functions that use Vespa tensor compute expressions to calculate the M3 similarities: dense, lexical, and max_sim for the ColBERT representation.
The profile defines a single ranking phase, a linear combination of the three features using the suggested weights.
Using match-features, Vespa returns the selected features along with each hit in the SERP (result page). We also include bm25(text); BM25 can be viewed as a fourth dimension and can be especially helpful for long-context retrieval compared to the neural representations.
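If we wanted BM25 to contribute to the score itself (not only be returned as a match-feature), a sketch of an alternative first phase could look like the following; note that bm25(text) is unnormalized, so the weights here are purely illustrative:
from vespa.package import FirstPhaseRanking

# Hypothetical alternative first phase that also weighs in bm25(text)
hybrid_with_bm25 = FirstPhaseRanking(
    expression="0.3*dense + 0.2*lexical + 0.3*max_sim + 0.2*bm25(text)",
    rank_score_drop_limit=0.0,
)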
Deploy the application to Vespa Cloud¶
With the configured application, we can deploy it to Vespa Cloud.
To deploy the application to Vespa Cloud, we need a tenant in the Vespa Cloud console:
Create a tenant at console.vespa-cloud.com (unless you already have one). This step requires a Google or GitHub account, and will start your free trial.
Make note of the tenant name; it is used in the next steps.
Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.
from vespa.deployment import VespaCloud
import os
# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
# Key is only used for CI/CD. Can be removed if logging in interactively
key = os.getenv("VESPA_TEAM_API_KEY", None)
if key is not None:
key = key.replace(r"\n", "\n") # To parse key correctly
vespa_cloud = VespaCloud(
tenant=tenant_name,
application=vespa_app_name,
key_content=key, # Key is only used for CI/CD. Can be removed if logging in interactively
application_package=vespa_application_package,
)
Now deploy the app to Vespa Cloud dev zone.
The first deployment typically takes 2 minutes until the endpoint is up.
from vespa.application import Vespa
app: Vespa = vespa_cloud.deploy()
Deployment started in run 1 of dev-aws-us-east-1c for samples.m. This may take a few minutes the first time.
INFO [22:13:09] Deploying platform version 8.299.14 and application dev build 1 for dev-aws-us-east-1c of default ...
INFO [22:13:10] Using CA signed certificate version 0
INFO [22:13:10] Using 1 nodes in container cluster 'm_container'
INFO [22:13:14] Session 939 for tenant 'samples' prepared and activated.
INFO [22:13:17] ######## Details for all nodes ########
INFO [22:13:31] h88976d.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO [22:13:31] --- platform vespa/cloud-tenant-rhel8:8.299.14 <-- :
INFO [22:13:31] --- container-clustercontroller on port 19050 has not started
INFO [22:13:31] --- metricsproxy-container on port 19092 has not started
INFO [22:13:31] h89388b.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO [22:13:31] --- platform vespa/cloud-tenant-rhel8:8.299.14 <-- :
INFO [22:13:31] --- storagenode on port 19102 has not started
INFO [22:13:31] --- searchnode on port 19107 has not started
INFO [22:13:31] --- distributor on port 19111 has not started
INFO [22:13:31] --- metricsproxy-container on port 19092 has not started
INFO [22:13:31] h90001a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO [22:13:31] --- platform vespa/cloud-tenant-rhel8:8.299.14 <-- :
INFO [22:13:31] --- logserver-container on port 4080 has not started
INFO [22:13:31] --- metricsproxy-container on port 19092 has not started
INFO [22:13:31] h90550a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO [22:13:31] --- platform vespa/cloud-tenant-rhel8:8.299.14 <-- :
INFO [22:13:31] --- container on port 4080 has not started
INFO [22:13:31] --- metricsproxy-container on port 19092 has not started
INFO [22:14:31] Found endpoints:
INFO [22:14:31] - dev.aws-us-east-1c
INFO [22:14:31]  |-- https://d29bf3e7.f064e220.z.vespa-app.cloud/ (cluster 'm_container')
INFO [22:14:32] Installation succeeded!
Using mTLS (key,cert) Authentication against endpoint https://d29bf3e7.f064e220.z.vespa-app.cloud//ApplicationStatus
Application is up!
Finished deployment.
Feed the M3 representations¶
We convert the three different representations to the Vespa feed format:
vespa_fields = {
"text": passage[0],
"lexical_rep": {
key: float(value)
for key, value in passage_embeddings["lexical_weights"][0].items()
},
"dense_rep": passage_embeddings["dense_vecs"][0].tolist(),
"colbert_rep": {
index: passage_embeddings["colbert_vecs"][0][index].tolist()
for index in range(passage_embeddings["colbert_vecs"][0].shape[0])
},
}
from vespa.io import VespaResponse
response: VespaResponse = app.feed_data_point(
schema="m", data_id=0, fields=vespa_fields
)
assert response.is_successful()
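We only feed a single passage here; for larger collections, pyvespa's feed_iterable is more convenient. A minimal sketch, where docs is a hypothetical iterable of documents shaped like vespa_fields above:
def feed_callback(response: VespaResponse, id: str):
    # Report any documents that failed to feed
    if not response.is_successful():
        print(f"Failed to feed document {id}: {response.get_json()}")

docs = [{"id": "0", "fields": vespa_fields}]  # replace with your own documents
app.feed_iterable(docs, schema="m", callback=feed_callback)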
query = ["What is BGE M3?"]
query_embeddings = model.encode(
query, return_dense=True, return_sparse=True, return_colbert_vecs=True
)
The M3 colbert scoring function needs the query length to normalize the score to the range 0 to 1. This helps when combining the score with the other scoring functions.
query_length = query_embeddings["colbert_vecs"][0].shape[0]
query_fields = {
"input.query(q_lexical)": {
key: float(value)
for key, value in query_embeddings["lexical_weights"][0].items()
},
"input.query(q_dense)": query_embeddings["dense_vecs"][0].tolist(),
"input.query(q_colbert)": str(
{
index: query_embeddings["colbert_vecs"][0][index].tolist()
for index in range(query_embeddings["colbert_vecs"][0].shape[0])
}
),
"input.query(q_len_colbert)": query_length,
}
from vespa.io import VespaQueryResponse
import json
response: VespaQueryResponse = app.query(
yql="select id, text from m where userQuery() or ({targetHits:10}nearestNeighbor(dense_rep,q_dense))",
ranking="m3hybrid",
query=query[0],
body={**query_fields},
)
assert response.is_successful()
print(json.dumps(response.hits[0], indent=2))
{ "id": "index:m_content/0/cfcd2084234135f700f08abf", "relevance": 0.5993361056332731, "source": "m_content", "fields": { "matchfeatures": { "bm25(text)": 0.8630462173553426, "dense": 0.6258970723760484, "lexical": 0.1941967010498047, "max_sim": 0.7753448411822319 }, "text": "BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction." } }
Notice the matchfeatures field, which returns the configured match-features from the rank-profile. We can use these to compare the torch model scoring with the computations specified in Vespa.
Now, we can compare the Vespa-computed scores with the torch model scoring, and they line up closely (the small differences are expected since Vespa stores the representations with bfloat16 precision).
model.compute_lexical_matching_score(
passage_embeddings["lexical_weights"][0], query_embeddings["lexical_weights"][0]
)
0.19554455392062664
query_embeddings["dense_vecs"][0] @ passage_embeddings["dense_vecs"][0].T
0.6259037
model.colbert_score(
query_embeddings["colbert_vecs"][0], passage_embeddings["colbert_vecs"][0]
)
tensor(0.7797)
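As a sanity check of the max_sim tensor expression, here is a minimal numpy sketch of the same MaxSim computation: the best-matching passage token per query token, summed over query tokens and divided by the query length. It should be close to both colbert_score above and Vespa's max_sim match-feature; the Vespa value differs slightly because of the bfloat16 cell type.
q = query_embeddings["colbert_vecs"][0]    # (num_query_tokens, 1024)
p = passage_embeddings["colbert_vecs"][0]  # (num_passage_tokens, 1024)
similarities = q @ p.T                     # token-to-token dot products
print(similarities.max(axis=1).sum() / q.shape[0])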
That is it!¶
That is how easy it is to represent the brand new M3 FlagEmbedding representations in Vespa! Read more in the M3 technical report.
We can go ahead and delete the Vespa Cloud instance we deployed by running:
vespa_cloud.delete()