LightGBM: Training the model with Vespa features¶
The main goal of this tutorial is to deploy and use a LightGBM model in a Vespa application. The following tasks will be accomplished throughout the tutorial:
- Train a LightGBM classification model with variable names supported by Vespa.
- Create Vespa application package files and export them to an application folder.
- Export the trained LightGBM model to the Vespa application folder.
- Deploy the Vespa application using the application folder.
- Feed data to the Vespa application.
- Assert that the LightGBM predictions from the deployed model are correct.
Setup¶
Install and load required packages.
!pip3 install numpy pandas pyvespa lightgbm
import json
import lightgbm as lgb
import numpy as np
import pandas as pd
Create data¶
Generate a toy dataset to follow along. Note that we set the column names in a format that Vespa understands: `query(value)` means that the user will send a parameter named `value` along with the query, and `attribute(field)` means that `field` is a document attribute defined in a schema. In the example below we have a query parameter named `value` and two document attributes, `numeric` and `categorical`. If we want `lightgbm` to handle categorical variables, we should use `dtype="category"` when creating the dataframe, as shown below.
# Create random training set
features = pd.DataFrame(
    {
        "query(value)": np.random.random(100),
        "attribute(numeric)": np.random.random(100),
        "attribute(categorical)": pd.Series(
            np.random.choice(["a", "b", "c"], size=100), dtype="category"
        ),
    }
)
features.head()
| | query(value) | attribute(numeric) | attribute(categorical) |
|---|---|---|---|
| 0 | 0.437748 | 0.442222 | c |
| 1 | 0.957135 | 0.323047 | b |
| 2 | 0.514168 | 0.426117 | a |
| 3 | 0.713511 | 0.886630 | b |
| 4 | 0.626918 | 0.663179 | c |
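The naming convention described above can be checked mechanically. The following is a minimal sketch, not part of pyvespa: the helper `split_feature_names` and the regular expression are assumptions made for illustration, and they only recognise the two forms used in this tutorial, `query(...)` and `attribute(...)`.

```python
import re

# Matches names such as "query(value)" or "attribute(numeric)".
# Hypothetical pattern: it covers only the two forms used in this tutorial.
FEATURE_NAME = re.compile(r"^(query|attribute)\(([A-Za-z_][A-Za-z0-9_]*)\)$")


def split_feature_names(columns):
    """Group column names by where Vespa would read each value from."""
    groups = {"query": [], "attribute": []}
    for column in columns:
        match = FEATURE_NAME.match(column)
        if match is None:
            raise ValueError(f"not a query(...) or attribute(...) name: {column!r}")
        source, name = match.groups()
        groups[source].append(name)
    return groups


columns = ["query(value)", "attribute(numeric)", "attribute(categorical)"]
groups = split_feature_names(columns)
print(groups)
```

The `attribute` names returned here are the fields that the schema needs to define, and the `query` names are the parameters that have to be sent with each request.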
We generate the target variable as a function of the three features defined above:
numeric_features = pd.get_dummies(features)
targets = (
    (
        numeric_features["query(value)"]
        + numeric_features["attribute(numeric)"]
        - 0.5 * numeric_features["attribute(categorical)_a"]
        + 0.5 * numeric_features["attribute(categorical)_c"]
    )
    > 1.0
) * 1.0
targets
0     1.0
1     1.0
2     0.0
3     1.0
4     1.0
     ...
95    0.0
96    1.0
97    0.0
98    0.0
99    1.0
Length: 100, dtype: float64
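As a worked check of the rule above, the same expression can be written for a single row in plain Python. This is only a sketch: the function name `target` is introduced here for illustration, and the three rows are copied from the `features.head()` output shown earlier.

```python
def target(value, numeric, categorical):
    """Return 1.0 when the weighted sum of the features exceeds 1.0, else 0.0."""
    score = (
        value
        + numeric
        - 0.5 * (categorical == "a")  # category "a" lowers the score
        + 0.5 * (categorical == "c")  # category "c" raises the score
    )
    return 1.0 if score > 1.0 else 0.0


# First three rows of the features.head() output shown earlier.
rows = [
    (0.437748, 0.442222, "c"),
    (0.957135, 0.323047, "b"),
    (0.514168, 0.426117, "a"),
]
labels = [target(*row) for row in rows]
print(labels)  # [1.0, 1.0, 0.0]
```

These labels agree with the first three entries of `targets`.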
Fit lightgbm model¶
Train a LightGBM model with a binary log-loss objective:
training_set = lgb.Dataset(features, targets)

# Train the model
params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "num_leaves": 3,
}
model = lgb.train(params, training_set, num_boost_round=5)
[LightGBM] [Info] Number of positive: 48, number of negative: 52
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000484 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 74
[LightGBM] [Info] Number of data points in the train set: 100, number of used features: 3
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.480000 -> initscore=-0.080043
[LightGBM] [Info] Start training from score -0.080043
Vespa application package¶
Create a Vespa application package. The model expects two document attributes, `numeric` and `categorical`. We can use the model in the first-phase ranking by using the `lightgbm` rank feature.
from vespa.package import ApplicationPackage, Field, RankProfile

app_package = ApplicationPackage(name="lightgbm")
app_package.schema.add_fields(
    Field(name="id", type="string", indexing=["summary", "attribute"]),
    Field(name="numeric", type="double", indexing=["summary", "attribute"]),
    Field(name="categorical", type="string", indexing=["summary", "attribute"]),
)
app_package.schema.add_rank_profile(
    RankProfile(name="classify", first_phase="lightgbm('lightgbm_model.json')")
)
We can check what the Vespa schema definition file will look like:
print(app_package.schema.schema_to_text)
schema lightgbm {
    document lightgbm {
        field id type string {
            indexing: summary | attribute
        }
        field numeric type double {
            indexing: summary | attribute
        }
        field categorical type string {
            indexing: summary | attribute
        }
    }
    rank-profile classify {
        first-phase {
            expression {
                lightgbm('lightgbm_model.json')
            }
        }
    }
}
We can export the application package files to disk:
from pathlib import Path
Path("lightgbm").mkdir(parents=True, exist_ok=True)
app_package.to_files("lightgbm")
Note that we don't have any models under the `models` folder yet. We need to export the LightGBM model that we trained earlier to `models/lightgbm_model.json`, the file name referenced by the rank profile.
!tree lightgbm
lightgbm
├── files
├── models
├── schemas
│   └── lightgbm.sd
├── search
│   └── query-profiles
│       ├── default.xml
│       └── types
│           └── root.xml
└── services.xml

7 directories, 4 files
Export the model¶
with open("lightgbm/models/lightgbm_model.json", "w") as f:
    json.dump(model.dump_model(), f, indent=2)
Now we can see that the model is where Vespa expects it to be:
!tree lightgbm
lightgbm
├── files
├── models
│   └── lightgbm_model.json
├── schemas
│   └── lightgbm.sd
├── search
│   └── query-profiles
│       ├── default.xml
│       └── types
│           └── root.xml
└── services.xml

7 directories, 5 files
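The file name passed to `lightgbm(...)` in the rank profile has to match a file under `models`. A small pre-deployment check along these lines can catch a mismatch early. This is a sketch under stated assumptions: `missing_models` is a hypothetical helper, not part of pyvespa, and it runs against a temporary directory and a shortened schema snippet rather than the real application folder.

```python
import re
import tempfile
from pathlib import Path


def missing_models(schema_text, models_dir):
    """Return model file names referenced by lightgbm(...) but absent from models_dir."""
    referenced = re.findall(r"lightgbm\(\s*['\"]([^'\"]+)['\"]\s*\)", schema_text)
    return [name for name in referenced if not (Path(models_dir) / name).is_file()]


# Shortened stand-in for the schema text printed earlier.
schema_text = "first-phase { expression { lightgbm('lightgbm_model.json') } }"

with tempfile.TemporaryDirectory() as tmp:
    models_dir = Path(tmp) / "models"
    models_dir.mkdir()
    before = missing_models(schema_text, models_dir)  # model not exported yet
    (models_dir / "lightgbm_model.json").write_text("{}")
    after = missing_models(schema_text, models_dir)  # model file now present

print(before, after)
```

In this tutorial the equivalent call would pass the schema text printed earlier and the `lightgbm/models` directory.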
Deploy the application¶
Deploy the application package from disk with Docker:
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy_from_disk(
    application_name="lightgbm", application_root="lightgbm"
)
Waiting for configuration server, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 10/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Application is up!
Finished deployment.
Feed the data¶
Feed the simulated data. To feed data in batch we need to create a list of dictionaries containing `id` and `fields` keys:
feed_batch = [
    {
        "id": idx,
        "fields": {
            "id": idx,
            "numeric": row["attribute(numeric)"],
            "categorical": row["attribute(categorical)"],
        },
    }
    for idx, row in features.iterrows()
]
Feed the batch of data:
from vespa.io import VespaResponse


def callback(response: VespaResponse, id: str):
    # Report every document that Vespa rejected
    if not response.is_successful():
        print(f"Document {id} was not fed to Vespa due to error: {response.get_json()}")


app.feed_iterable(feed_batch, callback=callback)
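Printing failures is convenient in a notebook, but collecting them makes it possible to assert that nothing was rejected. The sketch below keeps the `(response, id)` callback signature used above. `FakeResponse` is a hypothetical stand-in that only mimics the two methods the callback calls, so that the example runs without a Vespa instance.

```python
failed = {}


def collecting_callback(response, id):
    """Record the error payload of every document that was not fed successfully."""
    if not response.is_successful():
        failed[id] = response.get_json()


class FakeResponse:
    """Hypothetical stand-in that mimics the two methods used by the callback."""

    def __init__(self, ok, payload):
        self._ok = ok
        self._payload = payload

    def is_successful(self):
        return self._ok

    def get_json(self):
        return self._payload


# Simulate one successful and one failed feed operation.
collecting_callback(FakeResponse(True, {}), "0")
collecting_callback(FakeResponse(False, {"message": "example error"}), "1")
print(failed)
```

Passing `collecting_callback` as the `callback` argument in place of `callback` would leave `failed` empty when every document is accepted.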
Model predictions¶
Predict with the trained LightGBM model so that we can later compare with the predictions returned by Vespa.
features["model_prediction"] = model.predict(features)
features
| | query(value) | attribute(numeric) | attribute(categorical) | model_prediction |
|---|---|---|---|---|
| 0 | 0.437748 | 0.442222 | c | 0.645663 |
| 1 | 0.957135 | 0.323047 | b | 0.645663 |
| 2 | 0.514168 | 0.426117 | a | 0.354024 |
| 3 | 0.713511 | 0.886630 | b | 0.645663 |
| 4 | 0.626918 | 0.663179 | c | 0.645663 |
| ... | ... | ... | ... | ... |
| 95 | 0.208583 | 0.103319 | c | 0.352136 |
| 96 | 0.882902 | 0.224213 | c | 0.645663 |
| 97 | 0.604831 | 0.675583 | a | 0.354024 |
| 98 | 0.278674 | 0.008019 | b | 0.352136 |
| 99 | 0.417318 | 0.616241 | b | 0.645663 |

100 rows × 4 columns
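To make the probabilities above less opaque, here is a rough sketch of how a score can be derived from a dumped model with a binary objective: walk each tree to a leaf, sum the leaf values, and apply a sigmoid. The `dumped` dictionary is a hypothetical two-tree model written by hand in the general shape of `model.dump_model()` output, not the model trained above. The sketch assumes numeric `<=` splits only and ignores categorical splits and missing-value handling.

```python
import math

# Hypothetical two-tree model in the general shape of model.dump_model() output.
# Feature indices refer to positions in "feature_names".
dumped = {
    "feature_names": ["query(value)", "attribute(numeric)"],
    "tree_info": [
        {"tree_structure": {
            "split_feature": 0, "decision_type": "<=", "threshold": 0.5,
            "left_child": {"leaf_value": -0.4},
            "right_child": {"leaf_value": 0.4},
        }},
        {"tree_structure": {
            "split_feature": 1, "decision_type": "<=", "threshold": 0.5,
            "left_child": {"leaf_value": -0.2},
            "right_child": {"leaf_value": 0.2},
        }},
    ],
}


def leaf_value(node, row):
    """Walk one tree to a leaf; only numeric "<=" splits are handled in this sketch."""
    while "leaf_value" not in node:
        if node["decision_type"] != "<=":
            raise NotImplementedError("categorical splits are not covered here")
        go_left = row[node["split_feature"]] <= node["threshold"]
        node = node["left_child"] if go_left else node["right_child"]
    return node["leaf_value"]


def predict(dumped, features):
    """Sum the leaf values of all trees and map the raw score to a probability."""
    row = [features[name] for name in dumped["feature_names"]]
    raw = sum(leaf_value(t["tree_structure"], row) for t in dumped["tree_info"])
    return 1.0 / (1.0 + math.exp(-raw))


score = predict(dumped, {"query(value)": 0.9, "attribute(numeric)": 0.7})
print(score)
```

Because the trained model uses only 5 small trees with 3 leaves each, few distinct leaf combinations are possible, which is consistent with the handful of repeated values in the `model_prediction` column.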
Query¶
Create a `compute_vespa_relevance` function that takes a document `id`, sends the matching query `value` along with the request, and returns the relevance score computed by the LightGBM model deployed in Vespa.
def compute_vespa_relevance(id_value: int):
    # Retrieve the document by id and rank it with the "classify" profile
    hits = app.query(
        body={
            "yql": "select * from sources * where id = {}".format(str(id_value)),
            "ranking": "classify",
            "ranking.features.query(value)": features.loc[id_value, "query(value)"],
            "hits": 1,
        }
    ).hits
    return hits[0]["relevance"]


compute_vespa_relevance(id_value=0)
0.645662636917761
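When a query returns an unexpected score, it helps to inspect the request body separately from the network call. The sketch below factors the body out of the function above. `build_query_body` is a hypothetical helper name, and the query value is copied from row 0 of the table shown earlier.

```python
def build_query_body(id_value, query_value, rank_profile="classify"):
    """Assemble the request body used to score one document with one query value."""
    return {
        "yql": "select * from sources * where id = {}".format(str(id_value)),
        "ranking": rank_profile,
        # The key must match the feature name the model was trained with.
        "ranking.features.query(value)": query_value,
        "hits": 1,
    }


body = build_query_body(id_value=0, query_value=0.437748)
print(body["yql"])
```

The resulting dictionary can be passed as the `body` argument of `app.query`, as in `compute_vespa_relevance`.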
Loop through `features` to compute a Vespa prediction for every data point, so that we can compare it with the predictions made by the model outside Vespa.
vespa_relevance = []
for idx, row in features.iterrows():
    vespa_relevance.append(compute_vespa_relevance(id_value=idx))
features["vespa_relevance"] = vespa_relevance
features
| | query(value) | attribute(numeric) | attribute(categorical) | model_prediction | vespa_relevance |
|---|---|---|---|---|---|
| 0 | 0.437748 | 0.442222 | c | 0.645663 | 0.645663 |
| 1 | 0.957135 | 0.323047 | b | 0.645663 | 0.645663 |
| 2 | 0.514168 | 0.426117 | a | 0.354024 | 0.354024 |
| 3 | 0.713511 | 0.886630 | b | 0.645663 | 0.645663 |
| 4 | 0.626918 | 0.663179 | c | 0.645663 | 0.645663 |
| ... | ... | ... | ... | ... | ... |
| 95 | 0.208583 | 0.103319 | c | 0.352136 | 0.352136 |
| 96 | 0.882902 | 0.224213 | c | 0.645663 | 0.645663 |
| 97 | 0.604831 | 0.675583 | a | 0.354024 | 0.354024 |
| 98 | 0.278674 | 0.008019 | b | 0.352136 | 0.352136 |
| 99 | 0.417318 | 0.616241 | b | 0.645663 | 0.645663 |

100 rows × 5 columns
Compare model and Vespa predictions¶
Predictions from the model should be equal to predictions from Vespa, showing the model was correctly deployed to Vespa.
assert features["model_prediction"].tolist() == features["vespa_relevance"].tolist()
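Exact equality holds in the run shown here, but floating-point results computed by two different runtimes can differ in the last digits. A tolerance-based comparison is a more forgiving alternative. The sketch below uses only the standard library. The helper `mismatches`, the tolerances, and the sample values are assumptions chosen for illustration.

```python
import math


def mismatches(expected, actual, rel_tol=1e-9, abs_tol=1e-12):
    """Return (index, expected, actual) for every pair that differs beyond tolerance."""
    if len(expected) != len(actual):
        raise ValueError("the two sequences must have the same length")
    return [
        (i, e, a)
        for i, (e, a) in enumerate(zip(expected, actual))
        if not math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# Illustrative values: the first pair differs only in the last digit.
model_prediction = [0.645662636917761, 0.354024]
vespa_relevance = [0.6456626369177611, 0.354024]
print(mismatches(model_prediction, vespa_relevance))  # []
```

With the dataframe from this tutorial, the two `tolist()` results from the assertion above could be passed in directly, and an empty list would indicate agreement within tolerance.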
Clean environment¶
!rm -fr lightgbm
vespa_docker.container.stop()
vespa_docker.container.remove()