Inference and Merging

This guide covers how to use your trained models for inference, including model loading, interactive testing, merging adapters, and common troubleshooting steps.

1 Quick Start

Tip

Use the same config for inference and merging that you used for training.

1.1 Basic Inference

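For LoRA adapters, point to the adapter output directory:

axolotl inference your_config.yml --lora-model-dir="./lora-output-dir"

For a fully fine-tuned model, point to the completed model:

axolotl inference your_config.yml --base-model="./completed-model"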

1.2 Interactive Chat

For multi-turn testing of conversational models, use chat mode. The chat template is resolved exactly as it was during training and re-applied to the full conversation each turn:

axolotl inference your_config.yml --chat

Type a message to chat. End a line with \ to continue typing on the next line. Slash commands control the session:

Command	Aliases	Description
`/help`	`/?`	Show all commands
`/new`	`/clear`, `/reset`	Clear the conversation (keeps system prompt and parameters)
`/system [text|clear]`		Show, set, or clear the system prompt
`/set <param> <value>`		Set a generation parameter
`/status`	`/params`	Show model info and current settings
`/history`		Show the conversation so far
`/retry`	`/regen`	Regenerate the last assistant reply
`/undo`		Remove the last exchange
`/save [path]`		Append the conversation as a `chat_template`-format JSONL sample
`/quit`	`/exit`, `/q`	Exit

Generation parameters can also be set directly, e.g. /temperature 0.7 (or /temp 0.7), /top_p 0.9, /top_k 50, /max_tokens 512, /rep 1.05, /seed 42. Setting temperature to 0 switches to greedy decoding.
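The effect of setting temperature to 0 can be sketched as follows. This is a minimal illustration, not the chat mode's actual code: the function name `build_generation_kwargs` is hypothetical, and the mapping of /max_tokens and /rep onto `max_new_tokens` and `repetition_penalty` (the keyword names used by transformers' `generate`) is an assumption.

```python
def build_generation_kwargs(temperature=0.7, top_p=0.9, top_k=50,
                            max_tokens=512, repetition_penalty=1.05):
    """Map chat-style parameters onto generate()-style keyword arguments.

    Hypothetical helper for illustration only.
    """
    kwargs = {
        "max_new_tokens": max_tokens,
        "repetition_penalty": repetition_penalty,
    }
    if temperature <= 0:
        # Greedy decoding: sampling is switched off, so the sampling
        # knobs (temperature, top_p, top_k) are left out entirely.
        kwargs["do_sample"] = False
    else:
        kwargs.update(
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
        )
    return kwargs


if __name__ == "__main__":
    print(build_generation_kwargs(temperature=0))
    print(build_generation_kwargs(temperature=0.7))
```

Leaving the sampling parameters out in the greedy case avoids passing values that have no effect when sampling is disabled.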

Press Ctrl+C during generation to stop the current reply. The partial response is kept in the conversation, except for diffusion models: their replies denoise in one piece, so an interrupted diffusion turn is discarded instead.

1.2.1 Thinking Models

Thinking blocks (e.g. <think>...</think>) stream live in a small dim window, then collapse to a one-line summary. /expand shows the full reasoning of the last reply, and /collapse off switches to raw verbatim output. The per-turn stats report thinking tokens and reply tokens separately. If the chat template supports a render-time thinking toggle (e.g. Qwen’s enable_thinking), /think off disables thinking entirely from the next turn, and /think default restores the template default.
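To make the separation of thinking from reply text concrete, here is a minimal sketch that splits a generated message on <think>...</think> markers. The helper `split_thinking` is hypothetical and assumes the markers are literal, well-formed, and not nested; other models may use different markers.

```python
import re

# Assumption: thinking is wrapped in literal <think>...</think> markers.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (reasoning, reply) for one generated assistant message."""
    # Collect the contents of every thinking block, in order.
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(text))
    # Whatever remains after removing the blocks is the visible reply.
    reply = THINK_RE.sub("", text).strip()
    return reasoning, reply


if __name__ == "__main__":
    reasoning, reply = split_thinking("<think>2+2 is 4</think>The answer is 4.")
    print(repr(reasoning), repr(reply))
```

Counting tokens in each part separately would then give the kind of thinking/reply split described above.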

Note

Assistant turns are stored the way transformers recommends: special tokens are stripped and thinking is kept on a separate reasoning_content key (via the tokenizer’s parse_response schema when it ships one, marker-splitting otherwise), so the chat template decides how prior-turn reasoning is re-rendered — matching what the model saw during training. The KV cache is re-used across turns whenever the rendered conversation extends the previous one, so long chats stay responsive.
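The cache-reuse condition can be illustrated with a small sketch over token IDs. It is not the actual implementation: the function names are hypothetical, and it assumes reuse happens only when the newly rendered conversation starts with the entire previously cached token sequence.

```python
def shared_prefix_len(cached_ids, new_ids):
    """Count how many leading token IDs the two sequences have in common."""
    n = 0
    for a, b in zip(cached_ids, new_ids):
        if a != b:
            break
        n += 1
    return n


def plan_cache_reuse(cached_ids, new_ids):
    """Decide whether the cache can be kept and which tokens still need a forward pass."""
    shared = shared_prefix_len(cached_ids, new_ids)
    if shared == len(cached_ids):
        # The new render extends the old one: only the tail is processed.
        return {"reuse": True, "tokens_to_process": list(new_ids[shared:])}
    # The render diverged (e.g. after an undo or a system prompt change):
    # the whole conversation is processed again.
    return {"reuse": False, "tokens_to_process": list(new_ids)}


if __name__ == "__main__":
    print(plan_cache_reuse([1, 2, 3], [1, 2, 3, 4, 5]))
    print(plan_cache_reuse([1, 2, 3], [1, 9, 3, 4]))
```

Under this assumption, each turn costs a forward pass over the newly added tokens only, which is why long chats stay responsive.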

/save writes conversations in the messages format accepted by type: chat_template datasets, so a good interactive session can be turned directly into training data.
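As a rough picture of what a messages-format JSONL sample looks like, the sketch below appends one conversation per line to a file. The helper `append_sample` is hypothetical, and the `role`/`content` field names are the commonly used ones, assumed here; the exact fields that /save emits (for example for reasoning content) may differ.

```python
import json


def append_sample(path, messages):
    """Append one conversation as a single JSON line (one sample per line)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"messages": messages}, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    append_sample(
        "chat.jsonl",  # hypothetical output path
        [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is a LoRA adapter?"},
            {"role": "assistant", "content": "A small set of trainable weights added to a frozen base model."},
        ],
    )
```

Because each call appends a line rather than overwriting the file, several sessions can accumulate into one dataset file.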

1.2.2 Diffusion Models

With the diffusion plugin enabled, chat mode generates each reply by appending a masked block to the conversation and denoising it. Replies arrive in one piece (no token streaming), and the parameter set changes accordingly: /tokens N sets the completion block size, /steps N the number of denoising steps, and /temperature the denoising temperature. Defaults come from the diffusion: section of your config.

Chat mode is not supported with --prompter; use the default inference mode for legacy prompters.

2 Advanced Usage

2.1 Gradio Interface

Launch an interactive web interface:

axolotl inference your_config.yml --gradio

2.2 File-based Prompts

Process prompts from a text file:

cat /tmp/prompt.txt | axolotl inference your_config.yml \
  --base-model="./completed-model" --prompter=None

2.3 Memory Optimization

For large models or limited memory:

axolotl inference your_config.yml --load-in-8bit=True

3 Merging LoRA Weights

Merge LoRA adapters with the base model:

axolotl merge-lora your_config.yml --lora-model-dir="./completed-model"

3.1 Memory Management for Merging

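If merging runs out of GPU memory, set these options in your config:

gpu_memory_limit: 20GiB  # Adjust based on your GPU
lora_on_cpu: true        # Process on CPU if needed

To force the merge onto the CPU, hide all GPUs from the process:

CUDA_VISIBLE_DEVICES="" axolotl merge-lora ...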

4 Tokenization

4.1 Common Issues

Warning

Tokenization mismatches between training and inference are a common source of problems.

To debug:

Check training tokenization:

axolotl preprocess your_config.yml --debug

Verify inference tokenization by decoding tokens before model input
Compare token IDs between training and inference
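The comparison in the last step can be sketched as a helper that reports the first position where two token ID sequences differ. The function name `first_mismatch` is hypothetical, and obtaining the two ID lists (from the preprocessing debug output and from your inference path) is left to you.

```python
def first_mismatch(train_ids, infer_ids):
    """Return the index of the first differing token ID, or None if identical."""
    for i, (a, b) in enumerate(zip(train_ids, infer_ids)):
        if a != b:
            return i
    if len(train_ids) != len(infer_ids):
        # One sequence is a strict prefix of the other; they diverge
        # where the shorter one ends (e.g. a missing EOS token).
        return min(len(train_ids), len(infer_ids))
    return None


if __name__ == "__main__":
    # Illustrative IDs only, not taken from a real tokenizer.
    print(first_mismatch([1, 15, 27, 2], [1, 15, 27, 2]))
    print(first_mismatch([1, 15, 27, 2], [15, 27, 2]))
```

A mismatch at index 0 often points to a missing or duplicated BOS token, while a mismatch at the very end suggests an EOS or trailing-whitespace difference.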

4.2 Special Tokens

Configure special tokens in your YAML:

special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"

5 Troubleshooting

5.1 Common Problems

Memory issues:

Use 8-bit loading
Reduce batch sizes
Try CPU offloading

Token issues:

Verify special tokens
Check tokenizer settings
Compare training and inference preprocessing

Performance issues:

Verify model loading
Check prompt formatting
Check temperature/sampling settings

For more details, see our debugging guide.