Sequence Classification task

Accelerated model evaluation using ONNX Runtime in the stateless cluster

Vespa has implemented accelerated model evaluation using ONNX Runtime in the stateless cluster. This opens up new usage areas for Vespa, such as serving model predictions.

Define the model server

The SequenceClassification task takes a text input and returns an array of floats whose size and meaning depend on the model used to solve the task. The model argument takes the id of a model hosted on the Hugging Face model hub, while model_id sets the name used to reference the model within Vespa.

from learntorank.ml import SequenceClassification

task = SequenceClassification(
    model_id="bert_tiny", 
    model="google/bert_uncased_L-2_H-128_A-2"
)

A ModelServer is a simplified application package focused on stateless model evaluation. It can host as many tasks as needed; a multi-task sketch follows the example below.

from learntorank.ml import ModelServer

model_server = ModelServer(
    name="bertModelServer",
    tasks=[task],
)
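Since ModelServer accepts a list of tasks, several models can be served side by side. The sketch below is illustrative only: the second model id is an assumption, and any Hugging Face sequence classification model would work in its place.

from learntorank.ml import ModelServer, SequenceClassification

# Hypothetical second task: any Hugging Face sequence
# classification model id could be used here.
sentiment_task = SequenceClassification(
    model_id="sst2_sentiment",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

multi_model_server = ModelServer(
    name="multiModelServer",
    tasks=[task, sentiment_task],
)

Each task is then exposed under its own model_id in the model-evaluation API.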

Deploy the model server

We can either host our model server on Vespa Cloud (a sketch follows the deployment output below) or deploy it locally using a Docker container.

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=model_server)
Using framework PyTorch: 1.12.1
Found input input_ids with shape: {0: 'batch', 1: 'sequence'}
Found input token_type_ids with shape: {0: 'batch', 1: 'sequence'}
Found input attention_mask with shape: {0: 'batch', 1: 'sequence'}
Found output output_0 with shape: {0: 'batch'}
Ensuring inputs are in correct order
position_ids is not present in the generated input list.
Generated inputs order: ['input_ids', 'attention_mask', 'token_type_ids']
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Finished deployment.
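For reference, here is a minimal sketch of the Vespa Cloud alternative. It assumes an existing Vespa Cloud tenant and API key; the tenant name, application name, and key path below are placeholders, and the exact VespaCloud arguments may vary across pyvespa versions.

from vespa.deployment import VespaCloud

# All values below are placeholders for your own
# Vespa Cloud tenant, application name, and API key.
vespa_cloud = VespaCloud(
    tenant="my-tenant",
    application="bert-model-server",
    key_location="/path/to/private-key.pem",
    application_package=model_server,
)
app = vespa_cloud.deploy()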

Get model information

List the available models:

app.get_model_endpoint()
{'bert_tiny': 'http://localhost:8080/model-evaluation/v1/bert_tiny'}

Get information about a specific model:

app.get_model_endpoint(model_id="bert_tiny")
{'model': 'bert_tiny',
 'functions': [{'function': 'output_0',
   'info': 'http://localhost:8080/model-evaluation/v1/bert_tiny/output_0',
   'eval': 'http://localhost:8080/model-evaluation/v1/bert_tiny/output_0/eval',
   'arguments': [{'name': 'input_ids', 'type': 'tensor(d0[],d1[])'},
    {'name': 'attention_mask', 'type': 'tensor(d0[],d1[])'},
    {'name': 'token_type_ids', 'type': 'tensor(d0[],d1[])'}]}]}
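The same information can be fetched directly over HTTP from Vespa's model-evaluation REST API, using the endpoints shown above. A sketch with the requests library, assuming the application is reachable on localhost:8080:

import requests

# List all models served by the application.
print(requests.get("http://localhost:8080/model-evaluation/v1/").json())

# Inspect the bert_tiny model specifically.
print(requests.get("http://localhost:8080/model-evaluation/v1/bert_tiny").json())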

Get predictions

Get a prediction:

app.predict(x="this is a test", model_id="bert_tiny")
[-0.00954509899020195, 0.2504960000514984]
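The two floats are the raw scores from the model's classification head (typically logits); their interpretation depends on the model. As a sketch, a softmax converts them into a probability distribution:

import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return [e / sum(exps) for e in exps]

scores = app.predict(x="this is a test", model_id="bert_tiny")
print(softmax(scores))  # two probabilities summing to 1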

Cleanup

# Stop and remove the Docker container.
vespa_docker.container.stop(timeout=600)
vespa_docker.container.remove()