Text Generation - Qwen3-VL

model.generate()

Generate text completions from image/video inputs and text prompts.

generated_ids = model.generate(**inputs, max_new_tokens=128)

Parameters

inputs

dict

required

Model inputs dictionary containing tokenized text and preprocessed images/videos. Typically obtained from processor.apply_chat_template() and moved to the model device:

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)

max_new_tokens

int

default:"128"

Maximum number of new tokens to generate.

# Short response
generated_ids = model.generate(**inputs, max_new_tokens=128)

# Long response
generated_ids = model.generate(**inputs, max_new_tokens=2048)

temperature

float

default:"0.7"

Sampling temperature. Higher values increase randomness.

0.1-0.5: More focused, deterministic outputs
0.7: Balanced creativity (default for Instruct models)
0.6: Recommended for Thinking models
1.0+: More creative, diverse outputs

# More deterministic
generated_ids = model.generate(**inputs, temperature=0.3, max_new_tokens=128)
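
To make the effect of temperature concrete, here is a minimal pure-Python sketch of a temperature-scaled softmax. The function name softmax_with_temperature and the toy logits are illustrative assumptions, not part of the Qwen3-VL or transformers API.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exponentiating for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three tokens
sharp = softmax_with_temperature(toy_logits, 0.3)  # low temperature: peaked
flat = softmax_with_temperature(toy_logits, 1.5)   # high temperature: flatter
print(sharp)
print(flat)
```

At the lower temperature most of the probability mass moves to the highest-scoring token; at the higher temperature the less likely tokens gain probability.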

top_p

float

default:"0.8"

Nucleus sampling threshold. Sampling is restricted to the smallest set of most likely tokens whose cumulative probability reaches top_p.

Instruct models: 0.8 (default)
Thinking models: 0.95 (recommended)

generated_ids = model.generate(
    **inputs,
    temperature=0.7,
    top_p=0.8,
    max_new_tokens=128
)
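
The nucleus rule can be sketched in a few lines of plain Python. This is an illustration under simple assumptions (one probability vector, a hypothetical helper named top_p_filter), not the library's implementation.

```python
def top_p_filter(probs, top_p):
    """Return the token indices kept by nucleus sampling for one probability vector."""
    # Rank token indices from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in ranked:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:  # stop once the nucleus covers top_p of the mass
            break
    return kept

toy_probs = [0.5, 0.25, 0.125, 0.125]  # hypothetical next-token distribution
kept_instruct = top_p_filter(toy_probs, 0.8)   # Instruct-style threshold
kept_thinking = top_p_filter(toy_probs, 0.95)  # Thinking-style threshold
print(kept_instruct, kept_thinking)
```

A higher top_p keeps more of the tail, which is why the Thinking setting admits more candidate tokens than the Instruct setting in this toy case.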

top_k

int

default:"20"

Top-k sampling. Only the top k most likely tokens are considered.

20: Recommended value
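0 or None: Disabled (consider all tokens). The -1 convention belongs to vLLM; Hugging Face generate() expects a positive integer when top-k is enabled.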

generated_ids = model.generate(**inputs, top_k=20, max_new_tokens=128)
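
As a rough sketch of what top-k filtering does to a single logit vector, the hypothetical helper below masks everything outside the k largest scores. Treating a non-positive k as "disabled" is a choice made for this example only.

```python
def top_k_filter(logits, k):
    """Mask every logit below the k-th largest with -inf (ties at the cutoff are kept)."""
    if k <= 0 or k >= len(logits):
        return list(logits)  # this sketch treats non-positive k as "no filtering"
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]

toy_logits = [1.2, 3.4, 0.1, 2.2, -0.5]  # hypothetical scores for five tokens
filtered = top_k_filter(toy_logits, 2)
print(filtered)
```

Tokens masked with -inf receive zero probability after the softmax, so they can never be sampled.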

repetition_penalty

float

default:"1.0"

Penalty for repeating tokens. Values > 1.0 discourage repetition.

1.0: No penalty (default)
1.1-1.5: Light to moderate penalty

generated_ids = model.generate(
    **inputs,
    repetition_penalty=1.0,
    max_new_tokens=128
)
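
One common way to implement a multiplicative repetition penalty is sketched below: positive logits of already-seen tokens are divided by the penalty and negative ones are multiplied by it, so both move toward "less likely". The helper name and the toy values are assumptions for illustration.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Lower the score of every token that already appeared in the sequence."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both reduce it.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

toy_logits = [2.0, -1.0, 0.5]  # hypothetical scores for token IDs 0, 1, 2
penalized = apply_repetition_penalty(toy_logits, seen_token_ids=[0, 1, 1], penalty=2.0)
unchanged = apply_repetition_penalty(toy_logits, seen_token_ids=[0, 1, 1], penalty=1.0)
print(penalized, unchanged)
```

With a penalty of 1.0 the logits are returned unchanged, which matches the "no penalty" default.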

presence_penalty

float

default:"1.5"

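Flat penalty applied to every token that has already appeared in the generated text, regardless of how often it appeared. This is a vLLM / OpenAI-compatible sampling parameter; Hugging Face model.generate() does not accept it and raises an error on unknown keyword arguments.

Instruct models: 1.5 (recommended)
Thinking models: 0.0 (recommended)

# With vLLM (generate() in transformers has no presence_penalty argument)
from vllm import SamplingParams

sampling_params = SamplingParams(presence_penalty=1.5, max_tokens=128)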
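
For intuition, a presence penalty can be sketched as a flat subtraction from the logit of each token that has appeared at least once in the output so far. The helper below is a hypothetical illustration, not code from any serving backend.

```python
def apply_presence_penalty(logits, generated_token_ids, penalty):
    """Subtract a flat penalty from every token present in the output, whatever its count."""
    adjusted = list(logits)
    for token_id in set(generated_token_ids):  # set(): count does not matter
        adjusted[token_id] -= penalty
    return adjusted

toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for token IDs 0, 1, 2
penalized = apply_presence_penalty(
    toy_logits, generated_token_ids=[0, 0, 0, 2], penalty=1.5
)
print(penalized)
```

Token 0 appeared three times and token 2 once, yet both lose the same 1.5, which is what distinguishes a presence penalty from a frequency-based one.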

do_sample

bool

default:"True"

Whether to sample from the token distribution. If False, generate() uses greedy decoding, and temperature, top_p, and top_k have no effect.

# Greedy decoding (deterministic)
generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=128)

# Sampling (with randomness)
generated_ids = model.generate(**inputs, do_sample=True, max_new_tokens=128)

seed

int

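Random seed for reproducible sampling. Hugging Face model.generate() has no seed argument; seed the random number generators with transformers.set_seed() before calling it.

Instruct models: 3407 (official eval)
Thinking models: 1234 (official eval)

from transformers import set_seed

set_seed(42)  # seeds the Python, NumPy, and PyTorch random number generators
generated_ids = model.generate(**inputs, do_sample=True, max_new_tokens=128)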
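
The difference between greedy decoding and seeded sampling can be sketched without a model. The helpers pick_next_token and sample_sequence are hypothetical names used only in this example, and the standard-library random module stands in for the framework's random number generator.

```python
import random

def pick_next_token(probs, do_sample, rng=None):
    """Greedy: take the argmax. Sampling: draw a token index in proportion to probs."""
    if not do_sample:
        return max(range(len(probs)), key=lambda i: probs[i])
    rng = rng or random.Random()
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_probs = [0.1, 0.6, 0.3]  # hypothetical next-token distribution

def sample_sequence(seed, n=5):
    """Draw n tokens from toy_probs with a generator seeded once up front."""
    rng = random.Random(seed)
    return [pick_next_token(toy_probs, do_sample=True, rng=rng) for _ in range(n)]

greedy_choice = pick_next_token(toy_probs, do_sample=False)
run_a = sample_sequence(42)
run_b = sample_sequence(42)
print(greedy_choice, run_a, run_b)
```

Greedy decoding always returns the most likely token, while two sampling runs agree only because they start from the same seed.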

Returns

generated_ids

torch.Tensor

Tensor of token IDs including both input and generated tokens.

Shape: (batch_size, total_sequence_length)

To extract only the newly generated tokens:

generated_ids_trimmed = [
    out_ids[len(in_ids):] 
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
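
The same slicing can be checked on plain Python lists. The token IDs below are made up for illustration; in real use they come from inputs.input_ids and generated_ids.

```python
# Hypothetical batch of two prompts and the corresponding generate() outputs,
# where each output row is the prompt followed by the newly generated tokens.
prompt_ids = [[101, 7, 8], [101, 9, 10]]
full_outputs = [[101, 7, 8, 55, 56], [101, 9, 10, 77, 78]]

# Drop the first len(prompt) tokens from each output row.
trimmed = [out[len(inp):] for inp, out in zip(prompt_ids, full_outputs)]
print(trimmed)
```

Only the tokens that follow each prompt remain, which is what should be passed to the decoder.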

Decoding Output

Convert generated token IDs to text using the processor:

output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print(output_text)

Complete Example

from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct",
    dtype="auto",
    device_map="auto"
)

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct"
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://example.com/image.jpg"
            },
            {"type": "text", "text": "Describe this image."}
        ]
    }
]

# Prepare inputs
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)

# Decode
generated_ids_trimmed = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print(output_text)

Recommended Hyperparameters

Instruct Models

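from transformers import set_seed

set_seed(3407)  # generate() has no seed argument; seed the RNGs first
generated_ids = model.generate(
    **inputs,
    max_new_tokens=32768,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    repetition_penalty=1.0,
    do_sample=True
)
# presence_penalty=1.5 is also recommended, but it is a vLLM / OpenAI-compatible
# sampling parameter and is not accepted by generate() in transformers.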

Thinking Models

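from transformers import set_seed

set_seed(1234)  # generate() has no seed argument; seed the RNGs first
generated_ids = model.generate(
    **inputs,
    max_new_tokens=40960,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    repetition_penalty=1.0,
    do_sample=True
)
# presence_penalty=0.0 (no penalty) is the recommended value on vLLM /
# OpenAI-compatible backends; generate() in transformers does not accept it.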

Notes

For batch generation, always set processor.tokenizer.padding_side = 'left' and include padding=True in apply_chat_template().
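
Why left padding matters for generation can be seen in a small sketch: padding on the left keeps the last real prompt token directly next to the first generated token in every row. The helper left_pad and the pad ID 0 are assumptions for illustration; in practice the processor handles this when padding=True is set.

```python
def left_pad(sequences, pad_id):
    """Left-pad token ID lists to a common length and build matching attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        gap = width - len(seq)
        input_ids.append([pad_id] * gap + list(seq))       # pads go in front
        attention_mask.append([0] * gap + [1] * len(seq))  # 0 marks padding
    return input_ids, attention_mask

# Hypothetical batch with prompts of different lengths.
batch_ids, batch_mask = left_pad([[5, 6, 7], [8]], pad_id=0)
print(batch_ids)
print(batch_mask)
```

With right padding the shorter prompt would end in pad tokens, and the model would have to continue from padding instead of from the prompt.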

The recommended hyperparameters above are the ones used for the official benchmark evaluations. They are a reasonable starting point for most use cases; lower max_new_tokens when shorter responses are sufficient.
