Fine-Tuning Qwen3-1.7B with LoRA + SFTTrainer (Production-Level Guide)

Tech3Space · 15 Jun 2026

🧠 Description

In this tutorial, we build a complete parameter-efficient fine-tuning pipeline for the Qwen3-1.7B model using LoRA (Low-Rank Adaptation) and Hugging Face’s SFTTrainer. This setup is optimized for real-world usage: low VRAM, stable training, fast convergence, and deployment-ready model merging.

You’ll learn how to:

Load the Qwen3 model efficiently
Apply LoRA for memory-efficient training
Format chat datasets properly
Train using SFTTrainer
Save + merge LoRA adapters for deployment

📌 Full Tutorial: LoRA Fine-Tuning Qwen3-1.7B

⚙️ 1. Project Setup

We start by importing required libraries:

python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig

🔥 Why these libraries?

transformers → Model + tokenizer loading
datasets → Efficient dataset pipeline
peft → LoRA implementation
trl → Supervised fine-tuning (SFTTrainer)
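
Argument names such as `max_length` and `processing_class` have changed across TRL releases, so it can help to record which versions are installed before running the rest of the tutorial. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names of the libraries imported above.
REQUIRED = ("torch", "datasets", "transformers", "peft", "trl")


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name:<14}{ver or 'NOT INSTALLED'}")
```

Pasting this output into bug reports or experiment logs makes it easier to reproduce a run later.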

📦 2. Configuration Setup

python
MODEL_NAME = "./Qwen/Qwen3-1.7B"
DATASET_PATH = "./dataset/train.jsonl"
OUTPUT_DIR = "./qwen3_lora_sft_pro"

MAX_LENGTH = 512

🧠 Key Idea:

Local Qwen model path
JSONL dataset format
Controlled sequence length for GPU efficiency

🔤 3. Tokenizer Setup

python
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

💡 Why this matters:

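trust_remote_code=True lets custom tokenizer code shipped with a checkpoint run; recent transformers releases support Qwen3 natively, so the flag is usually optional
A padding token is required whenever sequences in a batch are padded to the same length
Falling back to the EOS token avoids errors if the tokenizer defines no pad token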

🧠 4. Load Model (Optimized for GPU Training)

python
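# Prefer bfloat16 when the GPU supports it; otherwise fall back to float16.
dtype = (
    torch.bfloat16
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    else torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    torch_dtype=dtype,  # recent transformers releases also accept `dtype=`
)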

⚡ Optimization choices:

bfloat16 → float32-like dynamic range, so fewer overflow problems (NVIDIA Ampere-class GPUs such as the A100, and newer)
float16 → fallback for GPUs without bfloat16 support
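
The stability difference comes from the exponent width: bfloat16 keeps the 8 exponent bits of float32, while float16 has only 5, so float16 overflows much earlier. The small standard-library sketch below derives the largest finite value of each format from its bit layout; the helper `max_finite` is illustrative only.

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-754-style binary float format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the largest usable
    # unbiased exponent equals the bias.
    largest_mantissa = 2 - 2 ** (-mantissa_bits)
    return largest_mantissa * 2.0 ** bias


FP16_MAX = max_finite(exponent_bits=5, mantissa_bits=10)
BF16_MAX = max_finite(exponent_bits=8, mantissa_bits=7)

print(f"float16  max: {FP16_MAX:.0f}")
print(f"bfloat16 max: {BF16_MAX:.3e}")
```

float16 training therefore relies on loss scaling to stay inside its narrow range, whereas bfloat16 trades mantissa precision for a float32-like range.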

⚙️ 5. Memory Optimization Tricks

python
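model.config.use_cache = False  # the KV cache only helps generation, not training
model.gradient_checkpointing_enable()
# With a frozen base model, inputs must require grad so checkpointed layers can backpropagate into the adapters.
model.enable_input_require_grads()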

🚀 Why this is important:

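use_cache=False → the key/value cache only helps generation and is incompatible with gradient checkpointing
Gradient checkpointing recomputes activations during the backward pass instead of storing them, trading extra compute for lower memory
Together they make it feasible to train on GPUs with limited VRAM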

🔧 6. LoRA Configuration (Core Idea)

python
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj",
        "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

🧠 Explanation:

LoRA injects trainable low-rank matrices into transformer layers:

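r=16 → rank of the update matrices (adaptation capacity)
lora_alpha=32 → scaling factor; the update is scaled by lora_alpha / r = 2
lora_dropout=0.05 → dropout on the adapter input (regularization)
target_modules → all attention and MLP projection layers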
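
In the standard LoRA formulation, which these settings map onto assuming PEFT's default scaling (not the rank-stabilized variant), each targeted frozen weight matrix $W_0$ is augmented by a scaled low-rank product:

```latex
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad
W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}},\quad
B \in \mathbb{R}^{d_{\text{out}} \times r},\quad
A \in \mathbb{R}^{r \times d_{\text{in}}}
```

With r=16 and lora_alpha=32 the scale factor is 32 / 16 = 2. Only A and B receive gradients, and B starts at zero, so training begins from the unmodified base model.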

👉 This makes training:

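Substantially cheaper than full fine-tuning (the often-quoted 10–50x depends heavily on model, rank, and hardware)
Fast to iterate on, since far fewer weights are updated
Much lighter on GPU memory, because gradients and optimizer state exist only for the adapter weights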

🧩 7. Apply LoRA to Model

python
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

🔍 Result:

Only a small fraction of the parameters (roughly 1–5%, depending on rank and target modules) is trainable; the rest of the model stays frozen.
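
The trainable-parameter count can be estimated by hand: every adapted linear layer adds an `r x in_features` matrix A and an `out_features x r` matrix B. The sketch below does that arithmetic. The layer shapes are assumptions about the Qwen3-1.7B architecture, so verify them against the `config.json` of your checkpoint.

```python
# Layer shapes ASSUMED for Qwen3-1.7B; verify against the checkpoint's config.json.
HIDDEN = 2048          # hidden_size
INTERMEDIATE = 6144    # intermediate_size
Q_OUT = 16 * 128       # num_attention_heads * head_dim
KV_OUT = 8 * 128       # num_key_value_heads * head_dim
NUM_LAYERS = 28        # num_hidden_layers
RANK = 16              # r from the LoraConfig above

# (in_features, out_features) of every targeted projection in one layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(TARGET_SHAPES, RANK, NUM_LAYERS)
print(f"LoRA trainable parameters: {trainable:,}")
```

Under these assumptions the adapters hold about 17.4 million parameters, roughly 1% of a 1.7-billion-parameter base model. `model.print_trainable_parameters()` reports the exact figure for your checkpoint.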

📚 8. Load Dataset

python
dataset = load_dataset(
    "json",
    data_files=DATASET_PATH,
    split="train",
)

📌 Expected format (one JSON object per line in the file; pretty-printed here for readability):

json
{
  "messages": [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi! How can I help?"}
  ]
}
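
Malformed records tend to surface only once preprocessing or training is under way, so a quick pass over the file beforehand can save time. The following is a minimal validation sketch using only the standard library. The function name `validate_jsonl` and the allowed role set are assumptions made for illustration; extend the roles if your data uses others.

```python
import json

# Roles accepted by this check (an assumption; adjust to your data).
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_jsonl(path):
    """Return a list of (line_number, problem) tuples; empty means no problems found."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                problems.append((line_no, f"invalid JSON: {err.msg}"))
                continue
            messages = record.get("messages") if isinstance(record, dict) else None
            if not isinstance(messages, list) or not messages:
                problems.append((line_no, "missing or empty 'messages' list"))
                continue
            for msg in messages:
                if not isinstance(msg, dict) or msg.get("role") not in ALLOWED_ROLES:
                    problems.append((line_no, "message with unknown or missing role"))
                elif not isinstance(msg.get("content"), str):
                    problems.append((line_no, "message content is not a string"))
    return problems
```

Calling `validate_jsonl("./dataset/train.jsonl")` before `load_dataset` and checking for an empty list is usually enough for a first sanity check.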

🧹 9. Dataset Preprocessing

python
def preprocess(example):
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

dataset = dataset.map(preprocess, remove_columns=dataset.column_names)

💡 Why chat template matters:

Converts structured conversation → training text
Ensures Qwen-style formatting consistency
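
To make the transformation concrete, the sketch below renders a conversation in ChatML-style markup, the general format Qwen chat models use. It is a simplified stand-in written for illustration: the function `render_chatml` is hypothetical, and the real template shipped with the tokenizer may add further elements (for example a system prompt or reasoning tags). Always inspect the output of `tokenizer.apply_chat_template` on your own data.

```python
def render_chatml(messages):
    """Simplified ChatML rendering; NOT the exact Qwen3 template."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    return "".join(parts)


example = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi! How can I help?"},
]
print(render_chatml(example))
```

Printing one or two rendered examples from the mapped dataset is a cheap way to catch missing turns or doubled special tokens before training starts.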

🏋️ 10. Training Configuration (SFTConfig)

python
training_args = SFTConfig(
    output_dir=OUTPUT_DIR,
    max_length=MAX_LENGTH,
    packing=True,
    dataset_text_field="text",  # column produced by preprocess()

    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,

    num_train_epochs=3,

    learning_rate=2e-4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",

    weight_decay=0.05,

    logging_steps=10,
    save_steps=500,
    save_total_limit=3,

    # Match the mixed-precision mode to what the GPU supports.
    bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),
    fp16=not (torch.cuda.is_available() and torch.cuda.is_bf16_supported()),

    report_to="none",
)

🔥 Key Training Insights:

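packing=True → concatenates short examples into sequences of up to max_length tokens, so less compute is wasted on padding
gradient_accumulation_steps=8 → effective batch size of 8 sequences per device while only one sequence is in memory at a time
cosine scheduler → decays the learning rate smoothly toward zero
warmup_ratio=0.03 → ramps the learning rate up over the first 3% of steps to stabilize early training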
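
The interaction between batch size, accumulation, and warmup is easiest to see with numbers. The sketch below uses the values from the config above together with a hypothetical dataset size and GPU count, which are assumptions chosen for illustration. The trainer's own rounding may differ slightly.

```python
import math

# Values from the SFTConfig above.
PER_DEVICE_BATCH = 1
GRAD_ACCUM_STEPS = 8
NUM_EPOCHS = 3
WARMUP_RATIO = 0.03

# HYPOTHETICAL numbers for illustration: one GPU and 4,000 packed sequences.
NUM_DEVICES = 1
NUM_PACKED_SEQUENCES = 4000

# Sequences that contribute to each optimizer update.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES
steps_per_epoch = math.ceil(NUM_PACKED_SEQUENCES / effective_batch)
total_steps = steps_per_epoch * NUM_EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps:      {total_steps}")
print(f"warmup steps:         {warmup_steps}")
```

With `save_steps=500`, this hypothetical run of 1,500 optimizer steps would write three checkpoints, all of which fit within `save_total_limit=3`.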

🤖 11. Initialize Trainer

python
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)

🚀 12. Start Training

python
trainer.train()

At this point:

LoRA layers start learning task-specific behavior
Base model remains frozen

💾 13. Save LoRA Adapter

python
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
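
After saving, it is worth confirming that the adapter actually landed on disk. The sketch below checks for the files PEFT is expected to write for a LoRA adapter. The file names are an assumption based on safetensors serialization, and the helper `missing_adapter_files` is hypothetical.

```python
from pathlib import Path

# Files PEFT is expected to write for a LoRA adapter with safetensors
# serialization (an assumption; older setups may write adapter_model.bin).
EXPECTED_ADAPTER_FILES = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(output_dir, expected=EXPECTED_ADAPTER_FILES):
    """Return the expected adapter files that are absent from output_dir."""
    directory = Path(output_dir)
    return [name for name in expected if not (directory / name).is_file()]


if __name__ == "__main__":
    missing = missing_adapter_files("./qwen3_lora_sft_pro")
    print("adapter looks complete" if not missing else f"missing: {missing}")
```

The adapter directory contains only the LoRA matrices and their config, so it is small compared with the base model, which is still required to use it.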

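🔗 14. Merge Model for Deployment

The complete script below repeats steps 1–13 and finishes by merging the LoRA weights into the base model: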

python
#!/usr/bin/env python3

import os
import torch

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig


# ============================================================
# CONFIG
# ============================================================

MODEL_NAME = "./Qwen/Qwen3-1.7B"  # local model path, as in step 2
DATASET_PATH = "./dataset/train.jsonl"

OUTPUT_DIR = "./qwen3_lora_sft_pro"
MERGED_DIR = OUTPUT_DIR + "_merged"

MAX_LENGTH = 512

os.makedirs(OUTPUT_DIR, exist_ok=True)


# ============================================================
# TOKENIZER
# ============================================================

print("Loading tokenizer...")

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

tokenizer.padding_side = "right"


# ============================================================
# MODEL
# ============================================================

print("Loading model...")

# Prefer bfloat16 when the GPU supports it; otherwise fall back to float16.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
dtype = torch.bfloat16 if use_bf16 else torch.float16

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    torch_dtype=dtype,  # recent transformers releases also accept `dtype=`
    device_map="auto",
)

# The KV cache only helps generation and conflicts with gradient checkpointing.
model.config.use_cache = False
model.gradient_checkpointing_enable()
# With a frozen base model, inputs must require grad so checkpointed
# layers can backpropagate into the adapters.
model.enable_input_require_grads()


# ============================================================
# LORA CONFIG
# ============================================================

print("Applying LoRA...")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)

model = get_peft_model(model, lora_config)

model.print_trainable_parameters()


# ============================================================
# DATASET
# ============================================================

print("Loading dataset...")

dataset = load_dataset(
    "json",
    data_files=DATASET_PATH,
    split="train",
)


def preprocess(example):
    # Render the structured conversation into one training string.
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}


dataset = dataset.map(
    preprocess,
    remove_columns=dataset.column_names,
)

print(dataset)


# ============================================================
# TRAINING CONFIG
# ============================================================

training_args = SFTConfig(
    output_dir=OUTPUT_DIR,

    max_length=MAX_LENGTH,
    packing=True,
    dataset_text_field="text",

    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,

    num_train_epochs=3,

    learning_rate=2e-4,
    warmup_ratio=0.03,

    lr_scheduler_type="cosine",
    weight_decay=0.05,

    logging_steps=10,

    save_steps=500,
    save_total_limit=3,

    # Match the mixed-precision mode to the dtype chosen above.
    bf16=use_bf16,
    fp16=not use_bf16,

    report_to="none",
)


# ============================================================
# TRAINER
# ============================================================

print("Initializing trainer...")

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)


# ============================================================
# TRAIN
# ============================================================

print("Starting training...")

trainer.train()

print("Training completed.")


# ============================================================
# SAVE LORA ADAPTER
# ============================================================

print("Saving LoRA adapter...")

trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)


# ============================================================
# MERGE LORA
# ============================================================

print("Merging LoRA into base model...")

# Fold the adapter weights into the base weights and drop the PEFT wrappers.
merged_model = model.merge_and_unload()

# Re-enable the KV cache so the saved config is suited to inference.
merged_model.config.use_cache = True

merged_model.save_pretrained(
    MERGED_DIR,
    safe_serialization=True,
)

tokenizer.save_pretrained(MERGED_DIR)

print(f"Merged model saved to: {MERGED_DIR}")

print("Done.")

⚡ Why merging matters:

Removes the runtime dependency on PEFT and the separate adapter files
Produces a standalone model
Easier deployment with vLLM / Transformers / APIs
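
Merging loses nothing because the low-rank update can be folded into the base weight once, after which a single matrix multiplication gives the same output as the adapter path. The toy sketch below demonstrates that equivalence in pure Python. It mirrors conceptually what happens for each adapted linear layer and is not the PEFT implementation; all values and helper names are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element-wise."""
    return [[wi + scale * di for wi, di in zip(wr, dr)] for wr, dr in zip(w, delta)]


# Toy dimensions: d_out = 2, d_in = 3, rank r = 1, lora_alpha = 2.
W0 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen base weight (2 x 3)
B = [[0.5], [-1.0]]                        # LoRA B (2 x 1)
A = [[1.0, 2.0, 0.0]]                      # LoRA A (1 x 3)
SCALE = 2 / 1                              # lora_alpha / r
x = [[1.0], [1.0], [1.0]]                  # input column vector (3 x 1)

# Adapter path: base output plus the scaled low-rank correction.
base_out = matmul(W0, x)
lora_out = matmul(B, matmul(A, x))
adapter_result = add_scaled(base_out, lora_out, SCALE)

# Merged path: fold the correction into the weight once, then do one matmul.
W_merged = add_scaled(W0, matmul(B, A), SCALE)
merged_result = matmul(W_merged, x)

print("adapter path:", adapter_result)
print("merged path: ", merged_result)
```

Since the merged weight has the same shape as the original, the merged checkpoint loads like any ordinary model and adds no adapter overhead at inference time.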

🎯 Final Output

After training, you get:

📁 LoRA adapter model
📁 Full merged model
📁 Tokenizer files
🚀 Deployment-ready checkpoint