Sequence Classification task

Accelerated model evaluation using ONNX Runtime in the stateless cluster

Vespa has implemented accelerated model evaluation using ONNX Runtime in the stateless cluster. This opens up new usage areas for Vespa, such as serving model predictions.

Define the model server

The SequenceClassification task takes a text input and returns an array of floats whose size depends on the model used to solve the task. The model argument can be the id of the model as defined by the Hugging Face model hub.

from learntorank.ml import SequenceClassification

task = SequenceClassification(
    model_id="bert_tiny",
    model="google/bert_uncased_L-2_H-128_A-2"
)
A ModelServer is a simplified application package focused on stateless model evaluation. It can take as many tasks as we want.
from learntorank.ml import ModelServer

model_server = ModelServer(
    name="bertModelServer",
    tasks=[task],
)
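To illustrate that a server can take multiple tasks, here is a minimal sketch; the second task, its model_id, and the Hugging Face model it points to are illustrative assumptions, not part of the example above:

# Sketch only: serve a second, hypothetical task alongside the first.
sentiment_task = SequenceClassification(
    model_id="sentiment",  # assumed id used to reference this model in Vespa
    model="distilbert-base-uncased-finetuned-sst-2-english"  # assumed HF model
)

multi_task_server = ModelServer(
    name="bertModelServer",
    tasks=[task, sentiment_task],
)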
Deploy the model server
We can either host our model server on Vespa Cloud or deploy it locally using a Docker container.
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=model_server)
Using framework PyTorch: 1.12.1
Found input input_ids with shape: {0: 'batch', 1: 'sequence'}
Found input token_type_ids with shape: {0: 'batch', 1: 'sequence'}
Found input attention_mask with shape: {0: 'batch', 1: 'sequence'}
Found output output_0 with shape: {0: 'batch'}
Ensuring inputs are in correct order
position_ids is not present in the generated input list.
Generated inputs order: ['input_ids', 'attention_mask', 'token_type_ids']
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Finished deployment.
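For the Vespa Cloud alternative mentioned above, a minimal sketch follows; the tenant, application name, and key location are placeholders you must replace, and the exact VespaCloud parameters can differ between pyvespa versions:

from vespa.deployment import VespaCloud

vespa_cloud = VespaCloud(
    tenant="my-tenant",               # placeholder: your Vespa Cloud tenant
    application="bertmodelserver",    # placeholder: your application name
    key_location="/path/to/key.pem",  # placeholder: your API key file
    application_package=model_server,
)
app = vespa_cloud.deploy(instance="default")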
Get model information
Get the models available:
app.get_model_endpoint()
{'bert_tiny': 'http://localhost:8080/model-evaluation/v1/bert_tiny'}
Get information about a specific model:
="bert_tiny") app.get_model_endpoint(model_id
{'model': 'bert_tiny',
'functions': [{'function': 'output_0',
'info': 'http://localhost:8080/model-evaluation/v1/bert_tiny/output_0',
'eval': 'http://localhost:8080/model-evaluation/v1/bert_tiny/output_0/eval',
'arguments': [{'name': 'input_ids', 'type': 'tensor(d0[],d1[])'},
{'name': 'attention_mask', 'type': 'tensor(d0[],d1[])'},
{'name': 'token_type_ids', 'type': 'tensor(d0[],d1[])'}]}]}
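Since these are plain REST endpoints, the same information can be fetched over HTTP; a small sketch using the requests library, assuming the local deployment above and that the raw response matches the dictionary shown:

import requests

# URL taken from the get_model_endpoint() output above.
info = requests.get("http://localhost:8080/model-evaluation/v1/bert_tiny").json()
print([f["function"] for f in info["functions"]])  # expect ['output_0']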
Get predictions
Get a prediction:
="this is a test", model_id="bert_tiny") app.predict(x
[-0.00954509899020195, 0.2504960000514984]
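The two floats returned are raw model outputs (logits); to turn them into class probabilities, a softmax is the usual post-processing step. A sketch, assuming a two-class head:

import math

logits = app.predict(x="this is a test", model_id="bert_tiny")
exps = [math.exp(v) for v in logits]
probs = [v / sum(exps) for v in exps]  # probabilities summing to 1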
Cleanup
vespa_docker.container.stop(timeout=600)
vespa_docker.container.remove()