Inference and Merging

This guide covers how to use your trained models for inference, including model loading, interactive testing, merging adapters, and common troubleshooting steps.

1 Quick Start

Tip

Use the same config file for inference and merging that you used for training.

1.1 Basic Inference

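For a LoRA adapter, point to the adapter output directory:

axolotl inference your_config.yml --lora-model-dir="./lora-output-dir"

For a fully fine-tuned model, point to the saved model instead:

axolotl inference your_config.yml --base-model="./completed-model"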

1.2 Interactive Chat

For multi-turn testing of conversational models, use chat mode. The chat template is resolved exactly as it was during training and re-applied to the full conversation each turn:

axolotl inference your_config.yml --chat

Type a message to chat. End a line with \ to continue typing on the next line. Slash commands control the session:

Command                 Aliases          Description
/help                   /?               Show all commands
/new                    /clear, /reset   Clear the conversation (keeps system prompt and parameters)
/system [text|clear]    -                Show, set, or clear the system prompt
/set <param> <value>    -                Set a generation parameter
/status                 /params          Show model info and current settings
/history                -                Show the conversation so far
/retry                  /regen           Regenerate the last assistant reply
/undo                   -                Remove the last exchange
/save [path]            -                Append the conversation as a chat_template-format JSONL sample
/quit                   /exit, /q        Exit

Generation parameters can also be set directly, e.g. /temperature 0.7 (or /temp 0.7), /top_p 0.9, /top_k 50, /max_tokens 512, /rep 1.05, /seed 42. Setting temperature to 0 switches to greedy decoding.
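The temperature-0 rule can be illustrated with a small sketch that maps chat-style parameters onto keyword arguments in the shape accepted by Hugging Face generate(). The helper name to_generate_kwargs is hypothetical, and the mapping of /max_tokens to max_new_tokens and /rep to repetition_penalty is an assumption for illustration, not a description of Axolotl's internals.

```python
def to_generate_kwargs(params: dict) -> dict:
    """Hypothetical helper: translate chat parameters into generate()-style kwargs."""
    kwargs = {
        # Assumed mapping: /max_tokens -> max_new_tokens, /rep -> repetition_penalty.
        "max_new_tokens": params.get("max_tokens", 512),
        "repetition_penalty": params.get("rep", 1.0),
    }
    temperature = params.get("temperature", 1.0)
    if temperature == 0:
        # Temperature 0 means greedy decoding: sampling is turned off and the
        # sampling-only settings (top_p, top_k) are left out.
        kwargs["do_sample"] = False
    else:
        kwargs["do_sample"] = True
        kwargs["temperature"] = temperature
        for key in ("top_p", "top_k"):
            if key in params:
                kwargs[key] = params[key]
    return kwargs


greedy = to_generate_kwargs({"temperature": 0, "top_p": 0.9})
sampled = to_generate_kwargs({"temperature": 0.7, "top_k": 50})
```

Dropping top_p and top_k in the greedy branch is a design choice in this sketch: those settings only affect sampling, so passing them alongside do_sample=False would have no effect.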

Press Ctrl+C during generation to stop the current reply; the partial response is kept in the conversation (diffusion replies denoise in one piece, so an interrupted diffusion turn is discarded instead).

1.2.1 Thinking Models

Thinking blocks (e.g. <think>...</think>) stream live in a small dim window, then collapse to a one-line summary. /expand shows the full reasoning of the last reply, and /collapse off switches to raw verbatim output. The per-turn stats report thinking tokens and reply tokens separately. If the chat template supports a render-time thinking toggle (e.g. Qwen’s enable_thinking), /think off disables thinking entirely from the next turn; /think default restores the template default.

Note

Assistant turns are stored the way transformers recommends: special tokens are stripped and thinking is kept on a separate reasoning_content key (via the tokenizer’s parse_response schema when it ships one, marker-splitting otherwise). The chat template therefore decides how prior-turn reasoning is re-rendered, matching what the model saw during training. The KV cache is reused across turns whenever the rendered conversation extends the previous one, so long chats stay responsive.
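As a rough illustration of the two ideas in this note, the sketch below shows a marker-splitting fallback and a prefix check for cache reuse. Both function names are hypothetical, the <think> marker is only the example used earlier in this section, and neither function is taken from Axolotl's code.

```python
import re

# Assumes reasoning is wrapped in <think>...</think>; other models use other markers.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> dict:
    """Hypothetical fallback: move <think> text onto a reasoning_content key."""
    match = THINK_RE.search(raw)
    if match is None:
        return {"role": "assistant", "content": raw.strip()}
    reasoning = match.group(1).strip()
    # The reply is whatever remains once the first thinking block is cut out.
    reply = (raw[:match.start()] + raw[match.end():]).strip()
    return {"role": "assistant", "content": reply, "reasoning_content": reasoning}


def can_reuse_cache(prev_ids: list, new_ids: list) -> bool:
    """True when the newly rendered token IDs start with the previous ones."""
    return len(new_ids) >= len(prev_ids) and new_ids[: len(prev_ids)] == prev_ids


message = split_reasoning("<think>2+2 is 4</think>\nThe answer is 4.")
```

The prefix check is why keeping reasoning on its own key matters: if the template re-renders an earlier turn differently (for example by dropping its reasoning), the new token sequence no longer extends the old one and the cached prefix cannot be reused in full.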

/save writes conversations in the messages format accepted by type: chat_template datasets, so a good interactive session can be turned directly into training data.
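For orientation, a saved sample in this style might look like the output of the sketch below: one JSON object per line with a messages list of role/content entries. The helper name to_jsonl_line and the file name chat_samples.jsonl are hypothetical, and the exact keys that /save emits are an assumption here; check a real saved file before relying on them.

```python
import json


def to_jsonl_line(messages: list) -> str:
    """Hypothetical helper: serialize one conversation as a single JSONL line."""
    # ensure_ascii=False keeps non-ASCII text readable; json.dumps never emits
    # raw newlines, so each sample stays on one line.
    return json.dumps({"messages": messages}, ensure_ascii=False)


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]
line = to_jsonl_line(conversation)

# Appending (mode "a") mirrors the described behavior of adding one sample per save:
# with open("chat_samples.jsonl", "a", encoding="utf-8") as fh:
#     fh.write(line + "\n")
```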

1.2.2 Diffusion Models

With the diffusion plugin enabled, chat mode generates each reply by appending a masked block to the conversation and denoising it. Replies arrive in one piece (no token streaming), and the parameter set changes accordingly: /tokens N sets the completion block size, /steps N the number of denoising steps, and /temperature the denoising temperature. Defaults come from the diffusion: section of your config.

Chat mode is not supported with --prompter; use the default inference mode for legacy prompters.

2 Advanced Usage

2.1 Gradio Interface

Launch an interactive web interface:

axolotl inference your_config.yml --gradio

2.2 File-based Prompts

Process prompts from a text file:

cat /tmp/prompt.txt | axolotl inference your_config.yml \
  --base-model="./completed-model" --prompter=None

2.3 Memory Optimization

For large models or limited GPU memory, load the model in 8-bit:

axolotl inference your_config.yml --load-in-8bit=True

3 Merging LoRA Weights

Merge LoRA adapters with the base model:

axolotl merge-lora your_config.yml --lora-model-dir="./completed-model"
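Conceptually, merging folds the low-rank update into the base weight so that no adapter is needed at inference time. The sketch below shows the standard LoRA arithmetic, W + (alpha / r) * B A, on plain Python lists. It is a toy illustration of the formula only, not Axolotl's merge code, and the function names are hypothetical.

```python
def matmul(x: list, y: list) -> list:
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(weight: list, lora_a: list, lora_b: list, alpha: float, rank: int) -> list:
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


base = [[1.0, 0.0], [0.0, 1.0]]   # 2 x 2 base weight
lora_a = [[1.0, 2.0]]             # rank 1, input dim 2
lora_b = [[0.5], [1.0]]           # output dim 2, rank 1
merged = merge_lora(base, lora_a, lora_b, alpha=2, rank=1)
```

Because the merged matrix has the same shape as the base weight, the result can be saved and loaded like an ordinary model.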

3.1 Memory Management for Merging

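Set these options in your config to limit memory use during the merge:

gpu_memory_limit: 20GiB  # Adjust based on your GPU
lora_on_cpu: true        # Process on CPU if needed

To force the merge to run on CPU, hide all GPUs from the process:

CUDA_VISIBLE_DEVICES="" axolotl merge-lora ...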

4 Tokenization

4.1 Common Issues

Warning

Tokenization mismatches between training and inference are a common source of problems.

To debug:

  1. Check training tokenization:

axolotl preprocess your_config.yml --debug

  2. Verify inference tokenization by decoding tokens before model input

  3. Compare token IDs between training and inference
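The final comparison above can be done with a few lines of Python once you have the two ID sequences in hand. This is a minimal sketch: the function name first_mismatch is hypothetical, and the ID lists are made-up values standing in for what your tokenizer actually produces.

```python
from itertools import zip_longest


def first_mismatch(train_ids: list, infer_ids: list):
    """Return (index, train_id, infer_id) at the first difference, or None if equal."""
    # zip_longest pads the shorter list with None, so a length difference
    # shows up as a mismatch at the first missing position.
    for index, (train_id, infer_id) in enumerate(zip_longest(train_ids, infer_ids)):
        if train_id != infer_id:
            return index, train_id, infer_id
    return None


# Illustrative IDs only: here the inference prompt is missing a leading token.
train_ids = [1, 15, 27, 2]
infer_ids = [15, 27, 2]
result = first_mismatch(train_ids, infer_ids)
```

A mismatch at index 0 or at the very end often points to a missing or duplicated BOS/EOS token, which is worth checking against the special token settings in your config.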

4.2 Special Tokens

Configure special tokens in your YAML:

special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"

5 Troubleshooting

5.1 Common Problems

Memory issues:

  • Use 8-bit loading
  • Reduce batch sizes
  • Try CPU offloading

Token issues:

  • Verify special tokens
  • Check tokenizer settings
  • Compare training and inference preprocessing

Generation issues:

  • Verify model loading
  • Check prompt formatting
  • Check temperature/sampling settings

For more details, see our debugging guide.