LightGBM: Mapping model features to Vespa features¶
The main goal of this tutorial is to show how to deploy a LightGBM model with feature names that do not match Vespa feature names.
The following tasks will be accomplished throughout the tutorial:
- Train a LightGBM classification model with generic feature names that will not be available in the Vespa application.
- Create an application package and include a mapping from Vespa feature names to LightGBM model feature names.
- Create Vespa application package files and export them to an application folder.
- Export the trained LightGBM model to the Vespa application folder.
- Deploy the Vespa application using the application folder.
- Feed data to the Vespa application.
- Assert that the LightGBM predictions from the deployed model are correct.
Setup¶
Install and load required packages.
!pip3 install numpy pandas pyvespa lightgbm
import json
import lightgbm as lgb
import numpy as np
import pandas as pd
Create data¶
Simulate data that will be used to train the LightGBM model. Note that Vespa does not automatically recognize the feature names feature_1, feature_2 and feature_3. When creating the application package, we need to map those variables to something that the Vespa application recognizes, such as a document attribute or a query value.
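If you want the simulated data to be reproducible across runs, you can seed NumPy's random number generator first. This is an optional addition to the original notebook (the numbers shown in the tables below were generated without a fixed seed):
# Optional addition: seed NumPy's global RNG so the simulated data
# is reproducible (the tables in this tutorial came from an unseeded run).
np.random.seed(42)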
# Create random training set
features = pd.DataFrame(
    {
        "feature_1": np.random.random(100),
        "feature_2": np.random.random(100),
        "feature_3": pd.Series(
            np.random.choice(["a", "b", "c"], size=100), dtype="category"
        ),
    }
)
features.head()
|   | feature_1 | feature_2 | feature_3 |
|---|-----------|-----------|-----------|
| 0 | 0.856415 | 0.550705 | a |
| 1 | 0.615107 | 0.509030 | a |
| 2 | 0.089759 | 0.667729 | c |
| 3 | 0.161664 | 0.361693 | b |
| 4 | 0.841505 | 0.967227 | b |
Create a target variable that depends on feature_1, feature_2 and feature_3:
numeric_features = pd.get_dummies(features)
targets = (
    (
        numeric_features["feature_1"]
        + numeric_features["feature_2"]
        - 0.5 * numeric_features["feature_3_a"]
        + 0.5 * numeric_features["feature_3_c"]
    )
    > 1.0
) * 1.0
targets
0     0.0
1     0.0
2     1.0
3     0.0
4     1.0
     ...
95    1.0
96    1.0
97    0.0
98    1.0
99    1.0
Length: 100, dtype: float64
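Optionally, we can confirm that the two classes are roughly balanced; the LightGBM training log below reports the same counts (48 positives, 52 negatives). This check is an addition to the original notebook:
# Optional: inspect the class balance of the simulated target
print(targets.value_counts())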
Fit the LightGBM model¶
Train the LightGBM model on the simulated data:
training_set = lgb.Dataset(features, targets)
# Train the model
params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "num_leaves": 3,
}
model = lgb.train(params, training_set, num_boost_round=5)
[LightGBM] [Info] Number of positive: 48, number of negative: 52
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000404 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 74
[LightGBM] [Info] Number of data points in the train set: 100, number of used features: 3
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.480000 -> initscore=-0.080043
[LightGBM] [Info] Start training from score -0.080043
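Before building the Vespa application, it is worth confirming which feature names the booster expects, since these are exactly the names the rank profile will have to provide. This quick check is an addition to the original notebook:
# The rank profile defined below must provide functions with exactly these names.
print(model.feature_name())  # ['feature_1', 'feature_2', 'feature_3']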
Vespa application package¶
Create the application package and map the LightGBM feature names to the related Vespa names.
In this example we are going to assume that feature_1 represents the document field numeric, and we map feature_1 to attribute(numeric) through the use of a Vespa Function in the corresponding RankProfile. feature_2 maps to a value that will be sent along with the query, and this is represented in Vespa by mapping query(value) to feature_2. Lastly, the categorical feature is mapped from attribute(categorical) to feature_3.
from vespa.package import ApplicationPackage, Field, RankProfile, Function
app_package = ApplicationPackage(name="lightgbm")
app_package.schema.add_fields(
    Field(name="id", type="string", indexing=["summary", "attribute"]),
    Field(name="numeric", type="double", indexing=["summary", "attribute"]),
    Field(name="categorical", type="string", indexing=["summary", "attribute"]),
)
app_package.schema.add_rank_profile(
    RankProfile(
        name="classify",
        functions=[
            Function(name="feature_1", expression="attribute(numeric)"),
            Function(name="feature_2", expression="query(value)"),
            Function(name="feature_3", expression="attribute(categorical)"),
        ],
        first_phase="lightgbm('lightgbm_model.json')",
    )
)
We can inspect what the Vespa schema definition file will look like. Note that feature_1, feature_2 and feature_3 required by the LightGBM model are now defined in the schema definition:
print(app_package.schema.schema_to_text)
schema lightgbm {
    document lightgbm {
        field id type string {
            indexing: summary | attribute
        }
        field numeric type double {
            indexing: summary | attribute
        }
        field categorical type string {
            indexing: summary | attribute
        }
    }
    rank-profile classify {
        function feature_1() {
            expression {
                attribute(numeric)
            }
        }
        function feature_2() {
            expression {
                query(value)
            }
        }
        function feature_3() {
            expression {
                attribute(categorical)
            }
        }
        first-phase {
            expression {
                lightgbm('lightgbm_model.json')
            }
        }
    }
}
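A note on query(value): Vespa must know the type of query features used in ranking. The query profile types that pyvespa generates by default (the types/root.xml seen in the file tree below) handle this here, but if your setup needs an explicit declaration, an optional sketch using pyvespa's QueryTypeField could look like this:
from vespa.package import QueryTypeField

# Optional, shown for illustration: explicitly declare the type of the
# query(value) rank feature. The default query profile types generated by
# pyvespa already cover this tutorial's deployment.
app_package.query_profile_type.add_fields(
    QueryTypeField(name="ranking.features.query(value)", type="double")
)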
We can export the application package files to disk:
from pathlib import Path
Path("lightgbm").mkdir(parents=True, exist_ok=True)
app_package.to_files("lightgbm")
Note that we don't yet have the trained model under the models folder. We need to export the LightGBM model that we trained earlier to models/lightgbm_model.json.
!tree lightgbm
lightgbm
├── files
├── models
│   └── lightgbm_model.json
├── schemas
│   └── lightgbm.sd
├── search
│   └── query-profiles
│       ├── default.xml
│       └── types
│           └── root.xml
└── services.xml

7 directories, 5 files
Export the model¶
with open("lightgbm/models/lightgbm_model.json", "w") as f:
    json.dump(model.dump_model(), f, indent=2)
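As an optional sanity check (an addition to the original notebook), the exported JSON carries the feature names that Vespa will resolve against the rank profile functions at ranking time:
# Optional sanity check: the exported model file lists the feature names
# that Vespa will look up as rank profile functions.
with open("lightgbm/models/lightgbm_model.json") as f:
    exported = json.load(f)
print(exported["feature_names"])  # ['feature_1', 'feature_2', 'feature_3']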
Now we can see that the model is where Vespa expects it to be:
!tree lightgbm
lightgbm
├── files
├── models
│   └── lightgbm_model.json
├── schemas
│   └── lightgbm.sd
├── search
│   └── query-profiles
│       ├── default.xml
│       └── types
│           └── root.xml
└── services.xml

7 directories, 5 files
Deploy the application¶
Deploy the application package from disk with Docker:
from vespa.deployment import VespaDocker
vespa_docker = VespaDocker()
app = vespa_docker.deploy_from_disk(
application_name="lightgbm", application_root="lightgbm"
)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 10/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 15/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 20/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 25/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Application is up!
Finished deployment.
Feed the data¶
Feed the simulated data. To feed data in batch, we create a list of dictionaries containing the id and fields keys:
feed_batch = [
    {
        "id": idx,
        "fields": {
            "id": idx,
            "numeric": row["feature_1"],
            "categorical": row["feature_3"],
        },
    }
    for idx, row in features.iterrows()
]
from vespa.io import VespaResponse

def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(f"Document {id} was not fed to Vespa due to error: {response.get_json()}")

app.feed_iterable(feed_batch, callback=callback)
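To verify the feed, we can fetch a single document back with pyvespa's get_data; this spot check is an addition to the original notebook:
# Spot check: fetch one document back to confirm it was fed successfully
response = app.get_data(schema="lightgbm", data_id="0")
print(response.get_json())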
Model predictions¶
Predict with the trained LightGBM model so that we can later compare with the predictions returned by Vespa.
features["model_prediction"] = model.predict(features)
features
|   | feature_1 | feature_2 | feature_3 | model_prediction |
|---|-----------|-----------|-----------|------------------|
| 0 | 0.856415 | 0.550705 | a | 0.402572 |
| 1 | 0.615107 | 0.509030 | a | 0.356262 |
| 2 | 0.089759 | 0.667729 | c | 0.641578 |
| 3 | 0.161664 | 0.361693 | b | 0.388184 |
| 4 | 0.841505 | 0.967227 | b | 0.632525 |
| ... | ... | ... | ... | ... |
| 95 | 0.087768 | 0.451850 | c | 0.641578 |
| 96 | 0.839063 | 0.644387 | b | 0.632525 |
| 97 | 0.725573 | 0.327668 | a | 0.376350 |
| 98 | 0.937481 | 0.199995 | b | 0.376350 |
| 99 | 0.918530 | 0.734004 | a | 0.402572 |

100 rows × 4 columns
Query¶
Create a compute_vespa_relevance function that takes a document id and a query value and returns the prediction of the deployed LightGBM model for that data point.
def compute_vespa_relevance(id_value: int):
    hits = app.query(
        body={
            "yql": "select * from sources * where id = {}".format(str(id_value)),
            "ranking": "classify",
            "ranking.features.query(value)": features.loc[id_value, "feature_2"],
            "hits": 1,
        }
    ).hits
    return hits[0]["relevance"]
compute_vespa_relevance(id_value=0)
0.4025720849980601
Loop through the features to compute a Vespa prediction for all the data points, so that we can compare them to the predictions made by the model outside Vespa.
vespa_relevance = []
for idx, row in features.iterrows():
    vespa_relevance.append(compute_vespa_relevance(id_value=idx))
features["vespa_relevance"] = vespa_relevance
features
|   | feature_1 | feature_2 | feature_3 | model_prediction | vespa_relevance |
|---|-----------|-----------|-----------|------------------|-----------------|
| 0 | 0.856415 | 0.550705 | a | 0.402572 | 0.402572 |
| 1 | 0.615107 | 0.509030 | a | 0.356262 | 0.356262 |
| 2 | 0.089759 | 0.667729 | c | 0.641578 | 0.641578 |
| 3 | 0.161664 | 0.361693 | b | 0.388184 | 0.388184 |
| 4 | 0.841505 | 0.967227 | b | 0.632525 | 0.632525 |
| ... | ... | ... | ... | ... | ... |
| 95 | 0.087768 | 0.451850 | c | 0.641578 | 0.641578 |
| 96 | 0.839063 | 0.644387 | b | 0.632525 | 0.632525 |
| 97 | 0.725573 | 0.327668 | a | 0.376350 | 0.376350 |
| 98 | 0.937481 | 0.199995 | b | 0.376350 | 0.376350 |
| 99 | 0.918530 | 0.734004 | a | 0.402572 | 0.402572 |

100 rows × 5 columns
Compare model and Vespa predictions¶
Predictions from the model should be equal to predictions from Vespa, showing the model was correctly deployed to Vespa.
assert features["model_prediction"].tolist() == features["vespa_relevance"].tolist()
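The exact equality check works here because Vespa evaluates the same model dump deterministically. If you prefer a comparison that tolerates tiny floating-point serialization differences, an alternative (an addition to the original notebook) is:
# Tolerance-based alternative to the exact equality check above
assert np.allclose(features["model_prediction"], features["vespa_relevance"])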
Clean environment¶
!rm -fr lightgbm
vespa_docker.container.stop()
vespa_docker.container.remove()