model.generate()

Generate text completions from image/video inputs and text prompts.
generated_ids = model.generate(**inputs, max_new_tokens=128)

Parameters

inputs
dict
required
Model inputs dictionary containing tokenized text and preprocessed images/videos. Typically obtained from processor.apply_chat_template() and moved to the model's device:
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
max_new_tokens
int
default:"128"
Maximum number of new tokens to generate.
# Short response
generated_ids = model.generate(**inputs, max_new_tokens=128)

# Long response
generated_ids = model.generate(**inputs, max_new_tokens=2048)
temperature
float
default:"0.7"
Sampling temperature. Higher values increase randomness.
  • 0.1-0.5: More focused, more predictable outputs
  • 0.6: Recommended for Thinking models
  • 0.7: Balanced creativity (default for Instruct models)
  • 1.0+: More creative, diverse outputs
# More deterministic
generated_ids = model.generate(**inputs, temperature=0.3, max_new_tokens=128)
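To illustrate why lower temperatures give more focused outputs, here is a minimal pure-Python sketch of temperature scaling. The function name and the three logits are hypothetical and chosen only for illustration; this is not the library's implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.3)  # low temperature
flat = softmax_with_temperature(logits, 1.5)   # high temperature

# The top token gets a larger share of the probability at low temperature.
print(round(sharp[0], 3), round(flat[0], 3))
```

At temperature 0.3 most of the probability mass sits on the highest-scoring token, while at 1.5 the distribution is flatter, so sampling picks the other tokens more often.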
top_p
float
default:"0.8"
Nucleus sampling threshold. Only the smallest set of most likely tokens whose cumulative probability reaches top_p is considered.
  • Instruct models: 0.8 (default)
  • Thinking models: 0.95 (recommended)
generated_ids = model.generate(
    **inputs,
    temperature=0.7,
    top_p=0.8,
    max_new_tokens=128
)
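The nucleus selection described for top_p can be sketched in plain Python as follows. This is a conceptual sketch under simplified assumptions (a hypothetical four-token distribution, a hypothetical helper name); the exact boundary handling inside the library may differ.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

probs = [0.5, 0.25, 0.15, 0.1]  # hypothetical next-token distribution
nucleus = top_p_filter(probs, 0.8)
print(sorted(nucleus))  # indices of the tokens that remain candidates
```

With top_p=0.8, the first two tokens cover only 0.75 of the mass, so the third token is also kept and the least likely token is dropped.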
top_k
int
default:"20"
Top-k sampling. Only the top k most likely tokens are considered.
  • 20: Recommended value
  • 0: Disabled in Transformers (consider all tokens); -1 is the equivalent setting in vLLM
generated_ids = model.generate(**inputs, top_k=20, max_new_tokens=128)
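As a rough illustration of top-k filtering, the sketch below masks every logit outside the k highest to negative infinity, so those tokens can never be sampled. The helper name and the logits are hypothetical.

```python
def top_k_filter(logits, k):
    """Keep the k highest logits and mask the rest with -inf."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(ranked[:k])
    return [x if i in keep else float("-inf") for i, x in enumerate(logits)]

filtered = top_k_filter([1.2, 3.4, 0.5, 2.8], k=2)
print(filtered)  # only the two highest-scoring tokens keep a finite logit
```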
repetition_penalty
float
default:"1.0"
Penalty for repeating tokens. Values > 1.0 discourage repetition.
  • 1.0: No penalty (default)
  • 1.1-1.5: Light to moderate penalty
generated_ids = model.generate(
    **inputs,
    repetition_penalty=1.0,
    max_new_tokens=128
)
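A minimal sketch of how a multiplicative repetition penalty is commonly applied: logits of tokens already present in the sequence are divided by the penalty when positive and multiplied by it when negative, which lowers their probability in both cases. The function name and values are hypothetical, and the penalty of 2.0 is exaggerated to keep the arithmetic easy to follow.

```python
def apply_repetition_penalty(logits, previous_token_ids, penalty):
    """Rescale the logits of tokens that already appeared in the sequence."""
    adjusted = list(logits)
    for token_id in set(previous_token_ids):
        score = adjusted[token_id]
        # Positive scores shrink; negative scores become more negative.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = [2.0, -1.0, 0.5]  # hypothetical scores for token ids 0, 1, 2
adjusted = apply_repetition_penalty(logits, previous_token_ids=[0, 1, 1], penalty=2.0)
print(adjusted)
```

Token 2 never appeared, so its logit is unchanged. With penalty=1.0 every logit is left as it was.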
presence_penalty
float
default:"1.5"
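Penalty applied to every token that has already appeared, regardless of how often it appeared. This is a sampling parameter of serving engines such as vLLM; Hugging Face Transformers' generate() has no presence_penalty argument, so set it in the serving engine's sampling parameters instead.
  • Instruct models: 1.5 (recommended)
  • Thinking models: 0.0 (recommended)
# vLLM sampling parameters (not model.generate())
from vllm import SamplingParams

sampling_params = SamplingParams(presence_penalty=1.5, max_tokens=128)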
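For intuition, a presence penalty is usually defined as a flat subtraction from the logit of any token that has already appeared, no matter how many times. The sketch below assumes that definition; the helper name and values are hypothetical, and which tokens count as "previous" (prompt, output, or both) varies by engine.

```python
def apply_presence_penalty(logits, previous_token_ids, penalty):
    """Subtract a fixed penalty from every token that has appeared at least once."""
    seen = set(previous_token_ids)
    return [x - penalty if i in seen else x for i, x in enumerate(logits)]

# Token 0 appeared three times and token 2 once; both receive the same penalty.
adjusted = apply_presence_penalty([2.0, 1.0, 0.5], [0, 0, 0, 2], penalty=1.5)
print(adjusted)
```

This differs from the multiplicative repetition penalty in that the adjustment is additive and does not depend on the sign of the logit.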
do_sample
bool
default:"True"
Whether to use sampling. If False, uses greedy decoding.
# Greedy decoding (deterministic)
generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=128)

# Sampling (with randomness)
generated_ids = model.generate(**inputs, do_sample=True, max_new_tokens=128)
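The difference between the two modes can be sketched on a single decoding step: greedy decoding always takes the most likely token, while sampling draws a token in proportion to its probability. The helper below is a hypothetical illustration using Python's standard random module, not the library's decoding loop.

```python
import random

def pick_token(probs, do_sample, rng=None):
    """Choose the next token id greedily or by sampling from probs."""
    if not do_sample:
        # Greedy: index of the highest probability.
        return max(range(len(probs)), key=lambda i: probs[i])
    rng = rng or random.Random()
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.1, 0.6, 0.3]  # hypothetical next-token distribution
greedy = pick_token(probs, do_sample=False)
sampled = pick_token(probs, do_sample=True, rng=random.Random(0))
print(greedy, sampled)
```

Passing a seeded generator makes the sampled choice repeatable, which is the same idea as fixing a random seed before generation.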
seed
int
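Random seed for reproducible generation. In Hugging Face Transformers, generate() does not take a seed argument; seed the random number generators with transformers.set_seed() before calling it.
  • Instruct models: 3407 (official eval)
  • Thinking models: 1234 (official eval)
set_seed(42)  # from transformers import set_seed
generated_ids = model.generate(**inputs, max_new_tokens=128)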

Returns

generated_ids
torch.Tensor
Tensor of token IDs including both input and generated tokens.
Shape: (batch_size, total_sequence_length)
To extract only the newly generated tokens:
generated_ids_trimmed = [
    out_ids[len(in_ids):] 
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
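The trimming step works because each output row starts with a copy of its input row. A small sketch with plain Python lists standing in for tensors (the token ids are made up) shows the effect:

```python
# Hypothetical token ids: each output row begins with its prompt tokens.
input_ids = [[101, 7, 8], [101, 9, 10]]
generated_ids = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]

# Drop the first len(prompt) tokens from each output row.
trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]
print(trimmed)
```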

Decoding Output

Convert generated token IDs to text using the processor:
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print(output_text)

Complete Example

from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct",
    dtype="auto",
    device_map="auto"
)

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct"
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://example.com/image.jpg"
            },
            {"type": "text", "text": "Describe this image."}
        ]
    }
]

# Prepare inputs
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)

# Decode
generated_ids_trimmed = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print(output_text)

Instruct Models

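set_seed(3407)  # from transformers import set_seed; generate() takes no seed argument

generated_ids = model.generate(
    **inputs,
    max_new_tokens=32768,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    repetition_penalty=1.0,
    do_sample=True
)
# presence_penalty=1.5 is set in the serving engine (for example vLLM), not in generate()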

Thinking Models

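set_seed(1234)  # from transformers import set_seed; generate() takes no seed argument

generated_ids = model.generate(
    **inputs,
    max_new_tokens=40960,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    repetition_penalty=1.0,
    do_sample=True
)
# presence_penalty=0.0 is set in the serving engine (for example vLLM), not in generate()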

Notes

For batch generation, always set processor.tokenizer.padding_side = 'left' and include padding=True in apply_chat_template().
The recommended hyperparameters above are the ones used for official benchmark evaluations and are a sensible starting point for most use cases.
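To see why left padding matters for batch generation, consider what padding does to prompts of different lengths. With left padding every prompt ends at the last position, so new tokens are appended directly after real content instead of after pad tokens. The sketch below uses plain lists, a hypothetical helper, and a made-up pad id.

```python
def left_pad(sequences, pad_id):
    """Pad sequences on the left to a common width and build the attention mask."""
    width = max(len(s) for s in sequences)
    padded = [[pad_id] * (width - len(s)) + s for s in sequences]
    mask = [[0] * (width - len(s)) + [1] * len(s) for s in sequences]
    return padded, mask

padded, mask = left_pad([[5, 6, 7], [8]], pad_id=0)
print(padded)  # every row ends with a real token
print(mask)    # 0 marks padding, 1 marks real tokens
```

The trimming shown earlier still works in this setting because each padded input row has the same length as the prompt portion of its output row.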