vllm¶
The vllm module is a generator module that uses vLLM.
Why use the vllm module?¶
vLLM can generate new text really fast. It is more than 10x faster than the Hugging Face Transformers library.
You can use a vLLM model with the llama_index_llm module, but it is really slow because LlamaIndex is not optimized for processing many prompts at once.
So we decided to make a standalone module for vLLM for faster generation speed.
Module Parameters¶
- llm: You can type your ‘model name’ here. For example, facebook/opt-125m or mistralai/Mistral-7B-Instruct-v0.2.
- max_tokens: The maximum number of tokens to generate.
- temperature: The temperature of the sampling. A higher temperature means more randomness.
- top_p: Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set it to 1 to consider all tokens.
- You can also use all parameters from vLLM's LLM initialization and SamplingParams (see the sketch after the example config below).
Example config.yaml¶
modules:
  - module_type: vllm
    llm: mistralai/Mistral-7B-Instruct-v0.2
    temperature: [ 0.1, 1.0 ]
    max_tokens: 512
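Because extra options are passed through to vLLM, a config can also include, for example, the top_p sampling parameter or the gpu_memory_utilization engine argument (both standard vLLM options). The following is only a sketch of that idea:

modules:
  - module_type: vllm
    llm: mistralai/Mistral-7B-Instruct-v0.2
    temperature: [ 0.1, 1.0 ]
    top_p: 0.9                      # vLLM SamplingParams option
    max_tokens: 512
    gpu_memory_utilization: 0.8     # vLLM LLM/EngineArgs option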
Support chat prompt¶
From v0.3.18, you can use a chat prompt with the vllm module.
To use a chat prompt, you have to use the chat_fstring module as the prompt maker, as in the sketch below.
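Here is a minimal sketch of how the two modules might be paired in a full config. The node_line_name, strategy metrics, and prompt placeholder are illustrative assumptions; the exact chat prompt format is defined by the chat_fstring module documentation.

node_lines:
  - node_line_name: generate_node_line    # arbitrary example name
    nodes:
      - node_type: prompt_maker
        strategy:
          metrics: [ bleu ]               # placeholder metric
        modules:
          - module_type: chat_fstring
            prompt: ...                   # chat-style prompt; see the chat_fstring module docs for the format
      - node_type: generator
        strategy:
          metrics: [ bleu ]               # placeholder metric
        modules:
          - module_type: vllm
            llm: mistralai/Mistral-7B-Instruct-v0.2
            max_tokens: 512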
Using reasoning¶
From v0.3.18, you can use reasoning with the vllm module.
All you need to do is set the thinking parameter to True in the YAML file.
modules:
  - module_type: vllm
    llm: mistralai/Mistral-7B-Instruct-v0.2
    temperature: [ 0.1, 1.0 ]
    max_tokens: 512
    thinking: True
You have to use a reasoning model to use reasoning; otherwise, you will get an error.
Use in Multi-GPU¶
First, for more details, check out the vLLM docs about parallel processing.
When you use multiple GPUs, you can set the tensor_parallel_size parameter in the YAML file.
modules:
  - module_type: vllm
    llm: mistralai/Mistral-7B-Instruct-v0.2
    tensor_parallel_size: 2  # If you have two GPUs.
    temperature: [ 0.1, 1.0 ]
    max_tokens: 512
Also, you can use any parameter from vllm.LLM, SamplingParams, and EngineArgs.
Plus, this is supported from v0.2.16 onward, so you must upgrade to the latest version.
Warning
We are still developing multi-GPU compatibility for AutoRAG, so please wait for full compatibility with multi-GPU environments.
Warning
When using the vllm module, errors may occur depending on the configuration of PyTorch. In such cases, please follow the instructions below:
1. Define the vllm module to operate in a single-case mode.
2. Set the skip_validation parameter to True when using the start_trial function in the evaluator.
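For the second workaround, a minimal sketch with AutoRAG's Python entry point might look like the following. The file paths are placeholders and the Evaluator constructor arguments are assumptions based on the usual AutoRAG setup; only the skip_validation parameter of start_trial comes from the instruction above.

from autorag.evaluator import Evaluator

# Placeholder paths; point these at your own project files.
evaluator = Evaluator(
    qa_data_path="path/to/qa.parquet",
    corpus_data_path="path/to/corpus.parquet",
)

# Skip the validation run before the trial, as suggested in the warning above.
evaluator.start_trial("path/to/config.yaml", skip_validation=True)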