Build a Local Qwen3 Chatbot (LoRA Fine-Tuned Model Inference Guide)
Tech3Space, 15 Jun 2026
Description
In this tutorial, we build a fully working local chatbot using a fine-tuned Qwen3 model (LoRA merged version) with Hugging Face Transformers.
This script demonstrates how to:
- Load a merged Qwen3 model
- Use chat templates to format the conversation for the model
- Generate responses with sampling strategies
- Maintain multi-turn chat history
- Build a CLI-based AI assistant
This inference pipeline is a practical starting point for:
- Local AI assistants
- API backend integration
- Chatbot prototypes
- Research experiments
Full Tutorial: Running Qwen3 Chatbot Locally
1. Import Required Libraries

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
```

Why these?
- `transformers`: loads the model and tokenizer
- `torch`: provides tensors and runs inference on GPU or CPU
2. Load Fine-Tuned Model

```python
MODEL_PATH = "./qwen3_lora_sft_pro_merged"
```

Important:
This is the merged LoRA model, meaning:
- No adapter needed
- Fully standalone inference model
- Production-ready checkpoint
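3. Load Tokenizer and Model

```python
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map="auto",
)
```

Key points:
- `trust_remote_code=True`: allows custom Python code bundled with a checkpoint to run. Qwen3 is supported natively in recent Transformers releases (4.51 and later), so the flag is usually unnecessary for it; enable it only for checkpoints you trust.
- `device_map="auto"`: places the model on the available GPU(s) and falls back to CPU when needed (this requires the `accelerate` package)
- No manual `.to(device)` call is needed for the model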
4. Initialize Chat Memory

```python
messages = [
    {
        "role": "system",
        "content": "You are a helpful AI assistant."
    }
]
```

Why the system prompt matters:
It defines:
- Personality of assistant
- Response style
- Safety + instruction behavior
5. Start CLI Chat Interface

```python
print("=" * 60)
print("Qwen3 Chat Agent")
print("Type 'exit' or 'quit' to stop.")
print("=" * 60)
```

This creates a simple terminal chatbot UI.
6. Infinite Chat Loop

```python
while True:
    user_input = input("\nYou: ").strip()

    if user_input.lower() in {"exit", "quit"}:
        break
```

Behavior:
- Accepts user input continuously
- Stops when the user types `exit` or `quit`
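
The exit check can be factored into a small helper, which also makes it easy to skip blank lines (the loop as written would send an empty message to the model). This is a minimal sketch: the names `EXIT_COMMANDS` and `classify_input` are hypothetical and not part of the original script.

```python
EXIT_COMMANDS = {"exit", "quit"}  # the same commands the chat loop checks for


def classify_input(raw: str) -> str:
    """Return "exit", "skip", or "send" for one line of user input."""
    text = raw.strip()
    if text.lower() in EXIT_COMMANDS:
        return "exit"
    if not text:
        return "skip"  # blank line: prompt again instead of calling the model
    return "send"
```

Inside the loop, one might `break` on `"exit"` and `continue` on `"skip"`.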
7. Store Conversation History

```python
messages.append({
    "role": "user",
    "content": user_input,
})
```

Why history is important:
- Enables multi-turn conversation
- Maintains context awareness
- Improves response quality
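
Because the full history is re-sent on every turn, the prompt grows with each exchange. One common way to bound it is to keep the system message plus only the most recent messages. The sketch below is an illustration under assumptions, not part of the original script: `trim_history` and `max_messages` are hypothetical names, and counting messages is only a rough stand-in for counting tokens.

```python
def trim_history(messages, max_messages=8):
    """Keep all system messages plus the last `max_messages` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # A plain rest[-0:] would return everything, so handle zero explicitly.
    recent = rest[-max_messages:] if max_messages > 0 else []
    return system + recent
```

If this is applied after each assistant reply is stored, an even `max_messages` keeps user/assistant pairs together.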
8. Convert Chat History to Prompt

```python
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
```

Key idea:
This converts the structured chat history into the prompt format the model expects.
- Ensures Qwen-style formatting
- Maintains role structure (system/user/assistant)
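
To make the conversion concrete, here is a simplified approximation of the ChatML-style layout that Qwen chat templates produce. The function `render_chatml` is hypothetical and written only for illustration; the real template shipped with the tokenizer handles more cases (such as tools and reasoning content), so `apply_chat_template` should remain the call used in practice.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render role/content messages into a ChatML-style prompt string."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|> markers.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Opens the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```

With `add_generation_prompt=True` the prompt ends on an open assistant turn, which is why the model replies as the assistant rather than continuing the user's text.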
9. Tokenize Input

```python
inputs = tokenizer(
    prompt,
    return_tensors="pt",
).to(model.device)
```

What happens here:
- Converts the prompt text into token IDs
- Moves the tensors to the same device as the model (`model.device`)
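10. Generate Response (Core AI Step)

```python
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )
```

Parameter Explanation:
| Parameter | Meaning |
|---|---|
| `max_new_tokens=1024` | Upper limit on the number of tokens generated per reply |
| `temperature=0.7` | Scales the logits before sampling; lower values make output more deterministic |
| `top_p=0.9` | Nucleus sampling: samples only from the smallest set of tokens whose cumulative probability reaches 0.9 |
| `repetition_penalty=1.1` | Penalizes tokens that have already appeared, which discourages loops |
| `do_sample=True` | Samples from the distribution instead of always picking the most likely token |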
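
The interaction between `temperature` and `top_p` is easier to see on a toy example. The sketch below is a conceptual illustration in plain Python, not the code Transformers runs internally; `sample_top_p` is a hypothetical helper and the logits are made up.

```python
import math
import random


def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token index using temperature scaling and top-p filtering."""
    # Temperature: divide logits before softmax; values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus: keep the most likely tokens until their cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Draw one index from the kept tokens, weighted by their probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

For logits such as `[5.0, 1.0, 0.0, -2.0]`, the top token alone already exceeds a cumulative probability of 0.9 after temperature scaling, so the call always returns index 0. Flatter logits leave more candidates in the nucleus and give more varied output.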
11. Extract Only New Tokens

```python
new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
assistant_text = tokenizer.decode(
    new_tokens,
    skip_special_tokens=True,
).strip()
```

Why this is needed:
- Removes input prompt
- Keeps only generated response
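
For a decoder-only model, `generate()` returns the prompt tokens followed by the continuation, which is why slicing at the prompt length isolates the reply. A toy version with plain lists shows the idea; the token IDs are made up and `split_generated` is a hypothetical helper.

```python
def split_generated(output_ids, prompt_length):
    """Return only the token IDs produced after the prompt."""
    return output_ids[prompt_length:]


prompt_ids = [101, 7592, 2088]               # made-up IDs standing in for the prompt
output_ids = prompt_ids + [2023, 2003, 102]  # prompt followed by the continuation
new_ids = split_generated(output_ids, len(prompt_ids))
```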
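12. Clean Output (Optional)

```python
assistant_text = assistant_text.replace("<think>", "").replace("</think>", "").strip()
```

Purpose:
Some models wrap their reasoning in `<think>...</think>` tags. This line strips only the tag markers for a cleaner UI; any reasoning text between the tags stays in the output.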
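
If the goal is to hide the reasoning itself and not just the tags, a regular expression can drop the whole block. This is a sketch of an alternative, not the original script's behavior; `strip_thinking` is a hypothetical helper name.

```python
import re

# Non-greedy match across newlines, so each <think>...</think> block is removed separately.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks, including the reasoning inside them."""
    without_blocks = THINK_BLOCK.sub("", text)
    # Remove any unmatched tag marker left behind (its surrounding text is kept).
    return without_blocks.replace("<think>", "").replace("</think>", "").strip()
```

An unmatched tag can appear when generation stops at `max_new_tokens` before the closing tag is produced.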
13. Display Response

```python
print(f"\nAssistant: {assistant_text}")
```

14. Save Conversation History

```python
messages.append({
    "role": "assistant",
    "content": assistant_text,
})
```

Complete Code

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# MODEL_PATH = "./Qwen/Qwen3-0.6B"
MODEL_PATH = "./qwen3_lora_sft_pro_merged"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map="auto",
)

# Conversation history
messages = [
    {
        "role": "system",
        "content": "You are a helpful AI assistant."
    }
]

print("=" * 60)
print("Qwen3 Chat Agent")
print("Type 'exit' or 'quit' to stop.")
print("=" * 60)

while True:
    user_input = input("\nYou: ").strip()

    if user_input.lower() in {"exit", "quit"}:
        break

    messages.append(
        {
            "role": "user",
            "content": user_input,
        }
    )

    # Build prompt from full conversation
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    inputs = tokenizer(
        prompt,
        return_tensors="pt",
    ).to(model.device)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=1024,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only newly generated tokens
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    assistant_text = tokenizer.decode(
        new_tokens,
        skip_special_tokens=True,
    ).strip()

    # Optionally strip the <think> tag markers (text between them is kept)
    assistant_text = assistant_text.replace("<think>", "").replace("</think>", "").strip()

    print(f"\nAssistant: {assistant_text}")

    # Store the reply so the next turn sees the full conversation
    messages.append(
        {
            "role": "assistant",
            "content": assistant_text,
        }
    )
```

Benefit of saving each reply to the history:
- Maintains memory across turns
- Improves contextual responses
How This System Works (Architecture View)
User Input
↓
Chat History (messages[])
↓
Chat Template (Qwen format)
↓
Tokenizer → Tokens
↓
Model.generate()
↓
Decoded Output
↓
Assistant Response
↓
Saved Back into Memory
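
The flow above can be sketched as a single function with the model call passed in, which keeps the history handling testable without loading any weights. The names `chat_turn` and `echo_model` are hypothetical; in the real script, the `generate_reply` role is played by the template, tokenize, generate, and decode steps.

```python
def chat_turn(messages, user_input, generate_reply):
    """Run one turn: record the user message, generate, record the reply."""
    messages.append({"role": "user", "content": user_input})
    reply = generate_reply(messages)  # sees the full history, including this turn
    messages.append({"role": "assistant", "content": reply})
    return reply


def echo_model(messages):
    """Stand-in for the real model: echoes the latest user message."""
    return f"You said: {messages[-1]['content']}"


history = [{"role": "system", "content": "You are a helpful AI assistant."}]
first = chat_turn(history, "Hello", echo_model)
```

Separating the turn logic from the model call in this way also makes it simpler to reuse the same loop behind an API or a web UI later.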
Key Features of This Chatbot
- Fully local inference (no API needed)
- Supports multi-turn conversation
- Uses a merged LoRA fine-tuned model
- Runs on the GPU automatically when one is available (`device_map="auto"`)
- Keeps a running message history, ChatGPT-style
- Simple CLI chatbot that can serve as a base for production work
Advanced Improvements You Can Add
- Streamed token generation (like ChatGPT typing effect)
- FastAPI backend wrapper
- Web UI using Gradio / Streamlit
- RAG (Retrieval-Augmented Generation)
- Function calling / tools integration