Build a High-Quality Synthetic Dataset for LLM Fine-Tuning (JSONL Generator Guide)
Description
In this tutorial, we build a synthetic instruction dataset generator for fine-tuning large language models (LLMs) like Qwen, LLaMA, or Mistral.
This script automatically generates:
- Multi-style QA pairs
- Multi-turn conversations
- Fact-based training samples
- Structured JSONL dataset for SFT (Supervised Fine-Tuning)
It is especially useful for:
- Domain-specific LLM training
- Persona modeling (e.g., Tech3Space, individuals, brands)
- Knowledge injection grounded in curated facts, which helps limit hallucination
- Instruction tuning pipelines
Full Tutorial: Synthetic Dataset Generator for LLM Training
1. Import Required Libraries

    import json
    from itertools import product

Why these?
- json → export the dataset in JSONL format
- product → generate combinations of prompts efficiently
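To make the role of the two imports concrete, here is a minimal sketch. The alias and template values are placeholders chosen for illustration, not the tutorial's full data.

```python
import json
from itertools import product

# Placeholder values for illustration only.
aliases = ["Tech3Space", "tech3space.com"]
templates = ["What is {}?", "Tell me about {}."]

# product() yields every (alias, template) pair: 2 x 2 = 4 combinations.
questions = [template.format(alias) for alias, template in product(aliases, templates)]

# json.dumps() serializes one record onto a single line, which is what JSONL needs.
line = json.dumps({"question": questions[0]}, ensure_ascii=False)

print(len(questions))
print(line)
```

Because product() multiplies the list lengths, adding one alias or one template grows the dataset by a whole row or column of combinations.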
2. System Prompt Definition

    SYSTEM_PROMPT = (
        "You are a helpful, accurate, and concise AI assistant. "
        "Answer using the provided knowledge without inventing unsupported facts."
    )

Purpose:
Including the same system prompt in every sample encourages:
- Fewer hallucinated answers
- Grounding in the dataset's facts
- Consistent assistant behavior after fine-tuning
3. Knowledge Base Definition
We define structured knowledge for training.

    KNOWLEDGE = {
        "AnkitKushwaha90": { ... },
        "Tech3Space": { ... },
        "HuggingFace_Ankit": { ... }
    }

Each entity contains:
- aliases → different ways users may ask
- summary → short explanation
- facts → bullet-point knowledge
- links → optional references
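Since every later step reads these fields, it can help to check an entry before generating anything. The following is a minimal sketch under assumptions: the helper name validate_entry, the REQUIRED_KEYS tuple, and the sample entry are hypothetical and not part of the original script.

```python
# Hypothetical helper: "links" is treated as optional, the other keys as required.
REQUIRED_KEYS = ("aliases", "summary", "facts")


def validate_entry(name, entry):
    """Return a list of problems found in one knowledge entry (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if not entry.get(key):
            problems.append(f"{name}: missing or empty '{key}'")
    for key in ("aliases", "facts"):
        if key in entry and not isinstance(entry[key], list):
            problems.append(f"{name}: '{key}' should be a list")
    return problems


# Placeholder entry for illustration, not taken from the tutorial's data.
sample = {
    "aliases": ["ExampleProject", "the ExampleProject site"],
    "summary": "ExampleProject is a placeholder entity used for illustration.",
    "facts": ["It exists only in this example."],
    "links": [],
}

print(validate_entry("ExampleProject", sample))
print(validate_entry("Broken", {"aliases": "not-a-list"}))
```

A check like this catches an empty facts list or a string passed where a list is expected before those mistakes turn into malformed training samples.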
Why this structure is powerful:
It simulates real-world user queries like:
- “Who is Ankit?”
- “Tell me about Tech3Space”
- “What does this person do?”
4. Question Templates (Data Augmentation)
Short Questions

    QUESTION_TEMPLATES = [
        "Who is {}?",
        "What is {}?",
        "Tell me about {}.",
        "Explain {}.",
    ]

Detailed Questions

    DETAIL_REQUESTS = [
        "What are the main interests of {}?",
        "Provide a detailed explanation of {}.",
    ]

Why templates matter:
They simulate:
- Natural user queries
- Different reasoning depths
- Instruction diversity
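One side effect of pairing every alias with every template is that some combinations read oddly, for example "Who is tech3space.com?". A possible refinement, which is not part of the original script, is to tag each template with the kinds of entity it suits. The names TAGGED_TEMPLATES and questions_for and the "person" / "organization" labels below are assumptions made for this sketch.

```python
# Hypothetical: each template carries the entity kinds it reads naturally with.
TAGGED_TEMPLATES = [
    ("Who is {}?", {"person"}),
    ("What is {}?", {"organization"}),
    ("Tell me about {}.", {"person", "organization"}),
    ("Explain {}.", {"organization"}),
]


def questions_for(alias, kind):
    """Format only the templates whose tags include this entity kind."""
    return [template.format(alias) for template, kinds in TAGGED_TEMPLATES if kind in kinds]


print(questions_for("Ankit Kushwaha", "person"))
print(questions_for("Tech3Space", "organization"))
```

This trades some volume for more natural prompts; whether that is worthwhile depends on how much the awkward phrasings matter for the target use case.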
5. Multi-Turn Conversation Templates

    FOLLOW_UPS = [
        ("Tell me about {}.", "Can you summarize it in a few sentences?"),
    ]

Why multi-turn is important:
It teaches the model:
- Conversation memory
- Context retention
- Follow-up reasoning ability
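To show what a multi-turn sample looks like once a template pair is filled in, here is a small sketch. The function name build_follow_up is introduced for illustration, and the bracketed answers are placeholders standing in for the real summary and bullet text.

```python
def build_follow_up(alias, q1, q2, summary, detailed):
    """Return a four-message conversation: question, summary, follow-up, detailed answer."""
    return [
        {"role": "user", "content": q1.format(alias)},
        {"role": "assistant", "content": summary},
        {"role": "user", "content": q2},  # the follow-up has no placeholder: it relies on context
        {"role": "assistant", "content": detailed},
    ]


turns = build_follow_up(
    "Tech3Space",
    "Tell me about {}.",
    "Can you summarize it in a few sentences?",
    summary="[summary]",
    detailed="[bullet explanation]",
)

print([message["role"] for message in turns])
```

The second user message never names the entity, so the sample is only answerable by reading the first turn. That dependence is what makes it a context-retention example.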
6. Dataset Wrapper Function

    def example(messages):
        return {
            "messages": [{"role": "system", "content": SYSTEM_PROMPT}] + messages
        }

Purpose:
Ensures every sample follows:
- System → User → Assistant format
- A chat-style "messages" structure that the chat templates of models such as Qwen and LLaMA can consume
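The format guarantee above can be checked mechanically. The sketch below wraps one sample and verifies the role order; the checker roles_are_valid is a hypothetical helper, and the system prompt is shortened for the demo.

```python
import json

SYSTEM_PROMPT = "You are a helpful, accurate, and concise AI assistant."  # shortened for the demo


def example(messages):
    return {"messages": [{"role": "system", "content": SYSTEM_PROMPT}] + messages}


def roles_are_valid(sample):
    """Check for one system message followed by alternating user/assistant turns."""
    roles = [message["role"] for message in sample["messages"]]
    if not roles or roles[0] != "system":
        return False
    turns = roles[1:]
    expected = ["user", "assistant"] * (len(turns) // 2)
    # An odd number of turns makes `expected` shorter than `turns`, so it fails too.
    return len(turns) > 0 and turns == expected


sample = example([
    {"role": "user", "content": "What is Tech3Space?"},
    {"role": "assistant", "content": "[summary]"},
])

print(json.dumps(sample, indent=2))
print(roles_are_valid(sample))
```

Running a check like this over the whole dataset before training is a cheap way to catch samples that end on a user turn or lack a system message.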
7. Dataset Generation Pipeline

    dataset = []

We loop through knowledge entries:

    for _, info in KNOWLEDGE.items():

Step 1: Generate Summary QA Pairs

    for alias, template in product(aliases, QUESTION_TEMPLATES):

Example:
User: Who is AnkitKushwaha90?
Assistant: [summary]
Step 2: Detailed Responses

    for alias, template in product(aliases, DETAIL_REQUESTS):

Output:
- Long structured explanation
- Bullet-point facts included
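The long answer is assembled by plain string concatenation, using the same expression as the complete script. The summary and facts below are shortened placeholders rather than the real entries.

```python
# Shortened placeholder data for illustration.
summary = "Tech3Space is a technology platform and research studio."
facts = [
    "Focus on Transformers and Cybersecurity.",
    "Hosts a GitHub organization.",
]

# Summary, a blank line, a "Key points:" header, then one "- " line per fact.
bullet_answer = summary + "\n\nKey points:\n- " + "\n- ".join(facts)

print(bullet_answer)
```

With an empty facts list this expression leaves a dangling "- " at the end, so a guard may be worth adding if some entities have no facts.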
Step 3: Multi-Turn Conversations

    for alias, (q1, q2) in product(aliases, FOLLOW_UPS):

Example:
User:
Tell me about Tech3Space
Assistant:
[summary]
User:
Can you summarize it in a few sentences?
Assistant:
[bullet explanation]
Step 4: Fact-Level Training

    for fact in facts:

Example:
User: What is one important fact about Tech3Space?
Assistant: A platform for sharing research papers and code.
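In the complete script, this step pairs every alias with every fact, and the user question is identical for all facts of a given alias. The sketch below, using placeholder facts, makes that one-to-many mapping visible. It is an observation about the generated data rather than a recommendation from the original author.

```python
from collections import defaultdict

alias = "Tech3Space"
facts = ["Fact A.", "Fact B.", "Fact C."]  # placeholders

# Same question string for every fact, as in the fact-level loop.
pairs = [(f"What is one important fact about {alias}?", fact) for fact in facts]

answers_by_prompt = defaultdict(list)
for prompt, answer in pairs:
    answers_by_prompt[prompt].append(answer)

for prompt, answers in answers_by_prompt.items():
    print(prompt, "->", len(answers), "different answers")
```

Each of these samples is a valid answer to the question, so the model sees several acceptable completions for one prompt. If a single deterministic answer is preferred, varying the question wording per fact is one option to consider.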
8. Remove Duplicate Entries

    seen = set()
    unique_dataset = []

We serialize each item to JSON with sorted keys and use that string as the deduplication key:

    key = json.dumps(item, sort_keys=True)

Why this matters:
- Prevents redundant training
- Improves dataset quality
- Reduces overfitting
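The two fragments above fit together as follows. This sketch wraps the same logic in a function named deduplicate, a name chosen here for illustration, and runs it on three tiny placeholder records.

```python
import json


def deduplicate(items):
    """Keep the first occurrence of each item, comparing by canonical JSON."""
    seen = set()
    unique = []
    for item in items:
        # sort_keys=True makes the key independent of dict insertion order,
        # including inside nested dicts.
        key = json.dumps(item, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


a = {"messages": [{"role": "user", "content": "Hi"}]}
b = {"messages": [{"content": "Hi", "role": "user"}]}  # same content, different key order
c = {"messages": [{"role": "user", "content": "Hello"}]}

print(len(deduplicate([a, b, c])))
```

Dicts are not hashable, which is why the items are turned into strings before going into the set. The order of the surviving items matches their first appearance.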
9. Save Dataset as JSONL

    with open("train.jsonl", "w", encoding="utf-8") as f:
        for item in unique_dataset:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

Output format:

    {"messages":[...]}
    {"messages":[...]}
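A quick way to confirm the file is valid JSONL is to read it back line by line. The sketch below writes two placeholder records to a temporary file, so it does not touch a real train.jsonl, and then reloads them.

```python
import json
import os
import tempfile

# Placeholder records for illustration.
records = [
    {"messages": [{"role": "user", "content": "Who is Ankit?"}]},
    {"messages": [{"role": "user", "content": "What is Tech3Space?"}]},
]

# Write one JSON object per line to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read it back: one json.loads() call per non-empty line.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded == records)
```

json.dumps() escapes newlines inside strings, so multi-line answers such as the bullet lists still occupy exactly one line each in the file.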
10. Final Output

    print(f"Generated {len(unique_dataset)} training examples.")
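As a sanity check on the printed count, the expected number of examples can be worked out by hand: each alias yields one sample per question template, per detail request, per follow-up pair, and per fact. The sketch below assumes the alias, fact, and template counts of the complete script in this guide; the helper name expected_count is introduced for illustration.

```python
# (number of aliases, number of facts) for each entity in the complete script
ENTITIES = {
    "AnkitKushwaha90": (5, 7),
    "Tech3Space": (4, 6),
    "HuggingFace_Ankit": (2, 4),
}
N_QUESTION_TEMPLATES = 8
N_DETAIL_REQUESTS = 4
N_FOLLOW_UPS = 2


def expected_count(n_aliases, n_facts):
    """Samples generated for one entity, before deduplication."""
    per_alias = N_QUESTION_TEMPLATES + N_DETAIL_REQUESTS + N_FOLLOW_UPS + n_facts
    return n_aliases * per_alias


total = sum(expected_count(a, f) for a, f in ENTITIES.values())
print(total)
```

This works out to 105 + 80 + 36 = 221 examples before deduplication, so reaching thousands of samples requires more entities, aliases, or templates than the three-entity knowledge base shown here.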
Complete Code

    import json
    from itertools import product

    SYSTEM_PROMPT = (
        "You are a helpful, accurate, and concise AI assistant. "
        "Answer using the provided knowledge without inventing unsupported facts."
    )

    KNOWLEDGE = {
        "AnkitKushwaha90": {
            "aliases": [
                "AnkitKushwaha90",
                "Ankit Kushwaha",
                "Ankit Kushwaha90",
                "the AnkitKushwaha90 profile",
                "0xAnkit",
            ],
            "summary": (
                "Ankit Kushwaha (AnkitKushwaha90) is a Cybersecurity and AI Research Enthusiast "
                "with a strong focus on Python, transformer architectures, backend development, "
                "and practical software engineering. He is the founder/owner of Tech3Space and "
                "maintains active presence across GitHub, Hugging Face, LinkedIn, and his own platform."
            ),
            "facts": [
                "Cybersecurity and AI Research Enthusiast.",
                "Works at Apple (as per LinkedIn).",
                "Graduated from Dr. A.P.J. Abdul Kalam Technical University.",
                "Deep interest in transformer architectures, kernel-level systems, and high-performance computing.",
                "Active on GitHub under tech3space and Hugging Face as ankitkushwaha90.",
                "Focuses on practical projects involving FastAPI, Streamlit, Docker, and machine learning.",
                "Explores interdisciplinary topics: Military Aviation, Radar Systems, LIDAR, Sensor Fusion, and Pattern-Based Learning.",
            ],
            "links": [
                "LinkedIn: https://www.linkedin.com/in/ankitkushwaha90/",
                "Hugging Face: https://huggingface.co/ankitkushwaha90",
                "GitHub (Tech3Space): https://github.com/tech3space",
            ],
        },
        "Tech3Space": {
            "aliases": [
                "Tech3Space",
                "the Tech3Space initiative",
                "Tech3Space | Systems Research Studio",
                "tech3space.com",
            ],
            "summary": (
                "Tech3Space is a technology platform and research studio founded by Ankit Kushwaha. "
                "It serves as a hub for researchers, developers, and tech enthusiasts to share knowledge, "
                "upload resources, build communities, and collaborate on cutting-edge topics like "
                "Transformers, Cybersecurity, AI, and high-performance systems. The platform emphasizes "
                "practical learning, research collaboration, and open knowledge sharing."
            ),
            "facts": [
                "A comprehensive platform for sharing research papers, code, notes, and building communities.",
                "Features include AI-Powered Research Assistant, post/note sharing, resume/PDF uploads, and real-time chat.",
                "Strong focus on Transformers, Cybersecurity, Low-Latency Networking, and Systems Research.",
                "GitHub organization (github.com/tech3space) contains repositories on transformer deep-dives, kernel drivers, cybersecurity tools, and assembly.",
                "Website: https://tech3space.com/ - described as 'The ultimate platform for researchers, developers, and tech enthusiasts.'",
                "Community stats (as advertised): 50,000+ active members, 1,200+ communities.",
            ],
            "links": [
                "Official Website: https://tech3space.com/",
                "GitHub: https://github.com/tech3space",
                "Alternative domains: https://www.tech3space.online, https://www.tech3space.in",
            ],
        },
        "HuggingFace_Ankit": {
            "aliases": [
                "ankitkushwaha90 on Hugging Face",
                "Tech3Space on HF",
            ],
            "summary": (
                "AnkitKushwaha90's Hugging Face profile (Tech3Space) where he shares models, "
                "datasets, and spaces focused on AI/ML, cybersecurity, and technical notes."
            ),
            "facts": [
                "Shares models like safetensor-related projects and fine-tuning experiments.",
                "Datasets include Sanskrit dataset, vulnerabilities collection, and Linux/CMD command collections.",
                "Focus areas: Cybersecurity & AI, Radar Systems, LIDAR, Military Aviation, and Sensor Fusion.",
                "Maintains an 'Anonymous Researcher' approach focused purely on knowledge sharing.",
            ],
            "links": [
                "Profile: https://huggingface.co/ankitkushwaha90"
            ],
        }
    }


    QUESTION_TEMPLATES = [
        "Who is {}?",
        "What is {}?",
        "Tell me about {}.",
        "Explain {}.",
        "Describe {}.",
        "Give an overview of {}.",
        "Can you introduce {}?",
        "What do you know about {}?",
    ]

    DETAIL_REQUESTS = [
        "What are the main interests of {}?",
        "Summarize {}.",
        "Provide a detailed explanation of {}.",
        "List the key focus areas of {}.",
    ]

    FOLLOW_UPS = [
        (
            "Tell me about {}.",
            "Can you summarize it in a few sentences?",
        ),
        (
            "What is {}?",
            "What are its main focus areas?",
        ),
    ]


    def example(messages):
        # Prepend the shared system prompt to every conversation.
        return {
            "messages": [{"role": "system", "content": SYSTEM_PROMPT}] + messages
        }


    dataset = []

    for _, info in KNOWLEDGE.items():
        aliases = info["aliases"]
        summary = info["summary"]
        facts = info["facts"]

        # Long-form answer: summary followed by one bullet per fact.
        bullet_answer = (
            summary
            + "\n\nKey points:\n- "
            + "\n- ".join(facts)
        )

        # Summary-style questions
        for alias, template in product(aliases, QUESTION_TEMPLATES):
            dataset.append(
                example(
                    [
                        {
                            "role": "user",
                            "content": template.format(alias),
                        },
                        {
                            "role": "assistant",
                            "content": summary,
                        },
                    ]
                )
            )

        # Detailed questions
        for alias, template in product(aliases, DETAIL_REQUESTS):
            dataset.append(
                example(
                    [
                        {
                            "role": "user",
                            "content": template.format(alias),
                        },
                        {
                            "role": "assistant",
                            "content": bullet_answer,
                        },
                    ]
                )
            )

        # Multi-turn conversations
        for alias, (q1, q2) in product(aliases, FOLLOW_UPS):
            dataset.append(
                example(
                    [
                        {
                            "role": "user",
                            "content": q1.format(alias),
                        },
                        {
                            "role": "assistant",
                            "content": summary,
                        },
                        {
                            "role": "user",
                            "content": q2,
                        },
                        {
                            "role": "assistant",
                            "content": bullet_answer,
                        },
                    ]
                )
            )

        # Fact-specific questions
        for alias in aliases:
            for fact in facts:
                dataset.append(
                    example(
                        [
                            {
                                "role": "user",
                                "content": f"What is one important fact about {alias}?",
                            },
                            {
                                "role": "assistant",
                                "content": fact,
                            },
                        ]
                    )
                )

    # Remove duplicates
    seen = set()
    unique_dataset = []

    for item in dataset:
        key = json.dumps(item, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique_dataset.append(item)

    # Write one JSON object per line (JSONL).
    with open("train.jsonl", "w", encoding="utf-8") as f:
        for item in unique_dataset:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

    print("=" * 60)
    print(f"Generated {len(unique_dataset)} training examples.")
    print("Saved as train.jsonl")
    print("=" * 60)

How This System Works

    Knowledge Base
          ↓
    Alias Expansion
          ↓
    Template Generation
          ↓
    Multi-turn Construction
          ↓
    Fact Injection
          ↓
    Deduplication
          ↓
    JSONL Dataset
          ↓
    LLM Fine-Tuning

Why This Approach is Powerful
- Generates hundreds of training samples in seconds from a small knowledge base, and scales to thousands as entities and templates are added
- Greatly reduces manual dataset creation
- Encourages grounded answers (can reduce hallucination)
- Supports multi-turn reasoning
- Works for any domain (AI, cybersecurity, finance, etc.)
Use Cases
- Fine-tuning personal AI assistants
- Building domain-specific chatbots
- Creating company knowledge bots
- Training RAG fallback datasets
- Synthetic data generation for LLMs