💬 Build a Local Qwen3 Chatbot (LoRA Fine-Tuned Model Inference Guide)

Tech3Space · 15 Jun 2026

🧠 Description

In this tutorial, we build a fully working local chatbot using a fine-tuned Qwen3 model (LoRA merged version) with Hugging Face Transformers.

This script demonstrates how to:

Load a merged Qwen3 model
Use the tokenizer's chat template to format the conversation
Generate responses with sampling strategies
Maintain multi-turn chat history
Build a CLI-based AI assistant

The same inference pipeline can serve as a starting point for:

Local AI assistants
API backend integration
Chatbot prototypes
Research experiments

🚀 Full Tutorial: Running Qwen3 Chatbot Locally

⚙️ 1. Import Required Libraries

python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

🧠 Why these?

transformers → loads model + tokenizer
torch → tensor backend that runs inference on GPU or CPU

📦 2. Load Fine-Tuned Model

python
MODEL_PATH = "./qwen3_lora_sft_pro_merged"

💡 Important:

This is the merged LoRA model, meaning:

No adapter needed
Fully standalone inference model
Loads like any regular Transformers checkpoint
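
Before loading, it can be worth confirming that the directory really holds a merged model rather than an adapter-only export. The sketch below assumes the usual Hugging Face layout (a full model has `config.json` plus weight files, a PEFT adapter folder has `adapter_config.json`). The helper name `looks_like_merged_checkpoint` is hypothetical, and the check is only a heuristic.

```python
from pathlib import Path


def looks_like_merged_checkpoint(model_dir: str) -> bool:
    """Heuristic: full model files are present and no PEFT adapter config.

    Assumes the standard Hugging Face layout; adjust for custom exports.
    """
    path = Path(model_dir)
    if not path.is_dir():
        return False

    has_config = (path / "config.json").is_file()
    # Full weights are saved as *.safetensors or *.bin (possibly sharded).
    has_weights = any(path.glob("*.safetensors")) or any(path.glob("*.bin"))
    # An adapter-only directory carries adapter_config.json instead.
    is_adapter = (path / "adapter_config.json").is_file()

    return has_config and has_weights and not is_adapter
```

Calling `looks_like_merged_checkpoint(MODEL_PATH)` before `from_pretrained` gives an early, readable failure if the wrong folder was exported.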

🔤 3. Load Tokenizer and Model

python
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map="auto",
)

🚀 Key points:

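trust_remote_code=True → allows custom modeling code shipped with a checkpoint to run; Transformers 4.51 and later support Qwen3 natively, so it is usually optional here
device_map="auto" → places the weights across the available GPU(s) and CPU; requires the accelerate package
No manual model.to(device) call is needed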

🧠 4. Initialize Chat Memory

python
messages = [
    {
        "role": "system",
        "content": "You are a helpful AI assistant."
    }
]

💡 Why system prompt matters:

It defines:

Personality of assistant
Response style
Safety + instruction behavior

💬 5. Start CLI Chat Interface

python
print("=" * 60)
print("Qwen3 Chat Agent")
print("Type 'exit' or 'quit' to stop.")
print("=" * 60)

This creates a simple terminal chatbot UI.

🔁 6. Infinite Chat Loop

python
while True:
    user_input = input("\nYou: ").strip()

    if user_input.lower() in {"exit", "quit"}:
        break

🧠 Behavior:

Accepts user input continuously
Stops when user types exit or quit

🧩 7. Store Conversation History

python
messages.append({
    "role": "user",
    "content": user_input,
})

💡 Why history is important:

Enables multi-turn conversation
Maintains context awareness
Improves response quality
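
Because every turn is appended, the prompt grows with each exchange, and the model's context window is finite. One simple way to keep it bounded is to retain the system prompt plus only the most recent messages. This is a minimal sketch, not part of the original script: `trim_history` and `max_messages` are hypothetical names, and counting messages instead of tokens is a simplification.

```python
def trim_history(messages: list[dict], max_messages: int = 8) -> list[dict]:
    """Keep system messages plus the most recent user/assistant messages."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]

    recent = dialogue[-max_messages:] if max_messages > 0 else []

    # Drop leading assistant messages so the window starts with a user turn.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]

    return system + recent
```

It would be applied just before building the prompt, for example `prompt_messages = trim_history(messages)`, while the full `messages` list stays untouched for logging.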

🧾 8. Convert Chat History to Prompt

python
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

🔥 Key idea:

This converts structured chat → model-readable format.

✔ Ensures Qwen-style formatting
✔ Maintains role structure (system/user/assistant)
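
To make the "model-readable format" concrete, here is a rough pure-Python approximation of the ChatML-style layout that Qwen models use. It is only an illustration: the real template ships with the tokenizer and also handles details such as tool calls and thinking blocks, and `render_chatml` is a hypothetical helper, not a Transformers function.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate ChatML rendering of a list of role/content messages."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Hello!"},
]
print(render_chatml(example))
```

Printing the string returned by `tokenizer.apply_chat_template(...)` for the same messages is a quick way to see the exact format the fine-tuned model expects.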

🔢 9. Tokenize Input

python
inputs = tokenizer(
    prompt,
    return_tensors="pt",
).to(model.device)

💡 What happens here:

Converts text → token IDs (plus an attention mask)
Moves the tensors to the model's device via .to(model.device)

🧠 10. Generate Response (Core AI Step)

python
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )

⚡ Parameter Explanation:

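Parameter	Meaning
`max_new_tokens=1024`	Upper bound on the number of tokens generated per reply
`temperature=0.7`	Scales the logits before sampling; lower values give more deterministic output
`top_p=0.9`	Nucleus sampling: samples only from the smallest token set covering 90% of the probability mass
`repetition_penalty=1.1`	Down-weights tokens that already appeared, which discourages loops
`do_sample=True`	Samples from the distribution instead of always picking the most likely token
`pad_token_id=tokenizer.eos_token_id`	Sets an explicit padding token so generate() does not have to fall back to a default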
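
The effect of these parameters is easier to see on a toy example. The sketch below re-implements the three ideas in plain Python over a small list of logits. It is an illustration of the concepts, not the Transformers implementation, and all function names are hypothetical.

```python
import math
import random


def apply_repetition_penalty(logits, seen_ids, penalty=1.1):
    """Make already-generated tokens less likely (penalty > 1)."""
    adjusted = list(logits)
    for token_id in set(seen_ids):
        score = adjusted[token_id]
        # Positive logits shrink, negative logits become more negative.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


def softmax_with_temperature(logits, temperature=0.7):
    """Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalised over kept tokens


def sample_next_token(logits, seen_ids, rng=random):
    """Penalise repeats, apply temperature, filter by top-p, then sample."""
    logits = apply_repetition_penalty(logits, seen_ids)
    probs = softmax_with_temperature(logits)
    candidates = top_p_filter(probs)
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]
```

With `do_sample=False` none of the sampling steps apply and the most likely token is chosen every time, which is useful when reproducible output matters more than variety.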

✂️ 11. Extract Only New Tokens

python
new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
assistant_text = tokenizer.decode(
    new_tokens,
    skip_special_tokens=True,
).strip()

🧠 Why this is needed:

Removes input prompt
Keeps only generated response

🧼 12. Clean Output (Optional)

python
import re  # used below to drop whole <think>...</think> blocks, not just the tag strings
assistant_text = re.sub(r"<think>.*?</think>", "", assistant_text, flags=re.DOTALL).strip()

💡 Purpose:

Qwen3 can emit its reasoning inside <think>...</think> tags; stripping them keeps the terminal output clean.
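
If the reasoning should be hidden from the user but kept for debugging, it can help to split the reply into two parts instead of discarding anything. A minimal sketch, assuming the model wraps its reasoning in `<think>...</think>`; `split_reasoning` is a hypothetical helper.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", flags=re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a reply that may contain <think> blocks."""
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text)

    # If generation stopped mid-thought, the closing tag is missing:
    # treat everything after the opening tag as reasoning.
    if "<think>" in answer:
        answer, _, unfinished = answer.partition("<think>")
        reasoning = "\n".join(filter(None, [reasoning, unfinished.strip()]))

    return reasoning, answer.strip()
```

Only the `answer` part would then be printed and appended to the history, while `reasoning` can be logged separately.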

🖨️ 13. Display Response

python
print(f"\nAssistant: {assistant_text}")

💾 14. Save Conversation History

python
messages.append({
    "role": "assistant",
    "content": assistant_text,
})

📄 Complete Code

python
import re

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# MODEL_PATH = "./Qwen/Qwen3-0.6B"
MODEL_PATH = "./qwen3_lora_sft_pro_merged"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map="auto",
)

# Conversation history
messages = [
    {
        "role": "system",
        "content": "You are a helpful AI assistant."
    }
]

print("=" * 60)
print("Qwen3 Chat Agent")
print("Type 'exit' or 'quit' to stop.")
print("=" * 60)

while True:
    user_input = input("\nYou: ").strip()

    if user_input.lower() in {"exit", "quit"}:
        break

    messages.append(
        {
            "role": "user",
            "content": user_input,
        }
    )

    # Build prompt from full conversation
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    inputs = tokenizer(
        prompt,
        return_tensors="pt",
    ).to(model.device)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=1024,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only newly generated tokens
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    assistant_text = tokenizer.decode(
        new_tokens,
        skip_special_tokens=True,
    ).strip()

    # Optionally remove whole <think>...</think> blocks (tags and reasoning text)
    assistant_text = re.sub(r"<think>.*?</think>", "", assistant_text, flags=re.DOTALL).strip()

    print(f"\nAssistant: {assistant_text}")

    messages.append(
        {
            "role": "assistant",
            "content": assistant_text,
        }
    )

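🧠 Benefit of saving each assistant reply:

The next prompt includes the full exchange, so context carries across turns
Follow-up questions can refer back to earlier answers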

🧠 How This System Works (Architecture View)

User Input
   ↓
Chat History (messages[])
   ↓
Chat Template (Qwen format)
   ↓
Tokenizer → Tokens
   ↓
Model.generate()
   ↓
Decoded Output
   ↓
Assistant Response
   ↓
Saved Back into Memory
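
The flow above can also be expressed as one function per turn, with the model call passed in as a parameter. This makes the history handling testable without loading any weights. It is a sketch under that assumption: `chat_turn`, `generate_fn`, and `echo_model` are hypothetical names, and `echo_model` only stands in for the template, tokenize, generate, and decode steps.

```python
from typing import Callable

Message = dict[str, str]


def chat_turn(
    messages: list[Message],
    user_input: str,
    generate_fn: Callable[[list[Message]], str],
) -> str:
    """Run one pass through the pipeline and update the history in place."""
    messages.append({"role": "user", "content": user_input})
    reply = generate_fn(messages)  # template -> tokenize -> generate -> decode
    messages.append({"role": "assistant", "content": reply})
    return reply


def echo_model(messages: list[Message]) -> str:
    """Stand-in generator used only to exercise the flow without a model."""
    return f"You said: {messages[-1]['content']}"


history = [{"role": "system", "content": "You are a helpful AI assistant."}]
print(chat_turn(history, "Hello", echo_model))
print(len(history))
```

In the real chatbot, `generate_fn` would wrap the template, tokenizer, `model.generate()`, and decoding steps from the sections above.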

🚀 Key Features of This Chatbot

✅ Fully local inference (no API needed)
✅ Supports multi-turn conversation
✅ Uses a fine-tuned model with the LoRA adapter merged in
✅ Runs on GPU when one is available (device_map="auto")
✅ Full conversation history is resent on every turn
✅ Simple CLI that is easy to extend

🔥 Advanced Improvements You Can Add

Streamed token generation (like ChatGPT typing effect)
FastAPI backend wrapper
Web UI using Gradio / Streamlit
RAG (Retrieval-Augmented Generation)
Function calling / tools integration
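
As an example of the first item, Transformers provides `TextStreamer`, which prints tokens to the terminal as they are generated. The sketch below reuses the generation settings from this tutorial; `stream_reply` is a hypothetical wrapper, and it assumes `model` and `tokenizer` were loaded as shown earlier.

```python
def stream_reply(model, tokenizer, messages, max_new_tokens=1024):
    """Print the reply as it is generated and return the full decoded text."""
    # Imported inside the function so the sketch can be read in isolation.
    import torch
    from transformers import TextStreamer

    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # skip_prompt=True hides the echoed prompt; only new text is printed.
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            streamer=streamer,
        )

    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

Inside the chat loop, the returned text would be appended to `messages` as the assistant reply, and the separate `print` of the response would no longer be needed.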