Build a High-Quality Synthetic Dataset for LLM Fine-Tuning (JSONL Generator Guide)

Tech3Space | 15 Jun 2026

📌 Description

In this tutorial, we build a synthetic instruction dataset generator for fine-tuning large language models (LLMs) like Qwen, LLaMA, or Mistral.

This script automatically generates:

Multi-style QA pairs
Multi-turn conversations
Fact-based training samples
Structured JSONL dataset for SFT (Supervised Fine-Tuning)

It is especially useful for:

Domain-specific LLM training
Persona modeling (e.g., Tech3Space, individuals, brands)
Knowledge injection with a lower risk of hallucination
Instruction tuning pipelines

🚀 Full Tutorial: Synthetic Dataset Generator for LLM Training

⚙️ 1. Import Required Libraries

python
import json
from itertools import product

🧠 Why these?

json → export dataset in JSONL format
product → generate combinations of prompts efficiently

🧩 2. System Prompt Definition

python
SYSTEM_PROMPT = (
    "You are a helpful, accurate, and concise AI assistant. "
    "Answer using the provided knowledge without inventing unsupported facts."
)

💡 Purpose:

This encourages:

Fewer hallucinated answers
Grounding in the dataset facts
Stable assistant behavior during fine-tuning

📚 3. Knowledge Base Definition

We define structured knowledge for training.

python
KNOWLEDGE = {
    "AnkitKushwaha90": { ... },
    "Tech3Space": { ... },
    "HuggingFace_Ankit": { ... }
}

🧠 Each entity contains:

aliases → different ways users may ask
summary → short explanation
facts → bullet-point knowledge
links → optional references
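
As a rough illustration of the four fields above, a single entry might look like the following. The entity name ExampleProject, its contents, and the validate_entry helper are hypothetical and not part of the original script. The sketch treats links as optional, as the list above suggests.

```python
# Hypothetical entry: the name and text are placeholders, not real data.
EXAMPLE_KNOWLEDGE = {
    "ExampleProject": {
        "aliases": ["ExampleProject", "the ExampleProject tool"],
        "summary": "ExampleProject is a placeholder entity used for illustration.",
        "facts": ["It is written in Python.", "It exports data as JSONL."],
        "links": ["Docs: https://example.com/docs"],
    }
}

REQUIRED_FIELDS = ("aliases", "summary", "facts")  # "links" is optional


def validate_entry(name, info):
    """Return a list of problems found in one knowledge entry (empty if fine)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not info.get(field):
            problems.append(f"{name}: missing or empty '{field}'")
    if not isinstance(info.get("summary", ""), str):
        problems.append(f"{name}: 'summary' must be a string")
    for field in ("aliases", "facts", "links"):
        if not isinstance(info.get(field, []), list):
            problems.append(f"{name}: '{field}' must be a list")
    return problems


for name, info in EXAMPLE_KNOWLEDGE.items():
    print(name, validate_entry(name, info) or "OK")
```

Running a check like this before generation catches an entry with no aliases or facts, which would otherwise silently produce zero samples for that entity.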

🔥 Why this structure is powerful:

It simulates real-world user queries like:

“Who is Ankit?”
“Tell me about Tech3Space”
“What does this person do?”

✍️ 4. Question Templates (Data Augmentation)

📌 Short Questions

python
QUESTION_TEMPLATES = [
    "Who is {}?",
    "What is {}?",
    "Tell me about {}.",
    "Explain {}.",
]

📌 Detailed Questions

python
DETAIL_REQUESTS = [
    "What are the main interests of {}?",
    "Provide a detailed explanation of {}.",
]

🧠 Why templates matter:

They simulate:

Natural user queries
Different reasoning depths
Instruction diversity
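
To make the augmentation concrete, here is a minimal sketch that expands a shortened template list over two aliases with itertools.product. The aliases are placeholders chosen for the example.

```python
from itertools import product

# Placeholder aliases and a shortened template list, for illustration only.
aliases = ["ExampleProject", "the ExampleProject tool"]
templates = ["What is {}?", "Tell me about {}."]

# Every alias is paired with every template:
# len(aliases) * len(templates) prompts in total.
prompts = [template.format(alias) for alias, template in product(aliases, templates)]

for prompt in prompts:
    print(prompt)
```

Because the expansion is a Cartesian product, each extra alias or template multiplies the number of prompts rather than adding just one.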

🔁 5. Multi-Turn Conversation Templates

python
FOLLOW_UPS = [
    ("Tell me about {}.", "Can you summarize it in a few sentences?"),
]

💡 Why multi-turn is important:

It teaches the model:

Conversation memory
Context retention
Follow-up reasoning ability

🧱 6. Dataset Wrapper Function

python
def example(messages):
    return {
        "messages": [{"role": "system", "content": SYSTEM_PROMPT}] + messages
    }

🧠 Purpose:

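Ensures every sample follows:

System → User → Assistant message order
A role/content messages list, the structure that chat templates render into prompt text (ChatML for Qwen; LLaMA and Mistral apply their own templates)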
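
A quick way to see what the wrapper produces is to call it on one user/assistant pair. In this sketch the system prompt is shortened and the question and answer are placeholders.

```python
import json

SYSTEM_PROMPT = "You are a helpful, accurate, and concise AI assistant."  # shortened for the demo


def example(messages):
    # Prepend the shared system message to the user/assistant turns.
    return {"messages": [{"role": "system", "content": SYSTEM_PROMPT}] + messages}


sample = example(
    [
        {"role": "user", "content": "What is ExampleProject?"},
        {"role": "assistant", "content": "ExampleProject is a placeholder entity."},
    ]
)

roles = [m["role"] for m in sample["messages"]]
print(roles)
print(json.dumps(sample, indent=2))
```

Keeping the system message in one helper means a later change to the prompt only has to be made in one place.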

🏗️ 7. Dataset Generation Pipeline

python
dataset = []

We loop through knowledge entries:

python
for _, info in KNOWLEDGE.items():

🧾 Step 1: Generate Summary QA Pairs

python
for alias, template in product(aliases, QUESTION_TEMPLATES):

Example:

User: Who is AnkitKushwaha90?
Assistant: [summary]
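
The loop header above is only a fragment. A self-contained sketch of this step follows, using a placeholder person (Jane Doe) and a shortened template list. Neither comes from the real knowledge base.

```python
from itertools import product

SYSTEM_PROMPT = "Answer using the provided knowledge."  # shortened placeholder
QUESTION_TEMPLATES = ["Who is {}?", "Tell me about {}."]

# Placeholder entity data (hypothetical).
aliases = ["Jane Doe", "J. Doe"]
summary = "Jane Doe is a placeholder person used for illustration."


def example(messages):
    return {"messages": [{"role": "system", "content": SYSTEM_PROMPT}] + messages}


dataset = []
for alias, template in product(aliases, QUESTION_TEMPLATES):
    dataset.append(
        example(
            [
                {"role": "user", "content": template.format(alias)},
                # Every phrasing of the question maps to the same summary.
                {"role": "assistant", "content": summary},
            ]
        )
    )

print(len(dataset))
print(dataset[0]["messages"][1]["content"])
```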

📊 Step 2: Detailed Responses

python
for alias, template in product(aliases, DETAIL_REQUESTS):

Output:

Long structured explanation
Bullet-point facts included
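
The detailed answer is built by joining the summary and the facts into one string. The sketch below uses the same string-building approach as the complete code, with a placeholder summary and placeholder facts.

```python
summary = "ExampleProject is a placeholder entity used for illustration."
facts = ["It is written in Python.", "It exports data as JSONL."]

# Summary, a blank line, a "Key points:" header, then one "- " bullet per fact.
bullet_answer = summary + "\n\nKey points:\n- " + "\n- ".join(facts)

print(bullet_answer)
```

If facts could ever be empty, this expression would leave a dangling "- " at the end, so a guard may be worth adding in that case.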

🔁 Step 3: Multi-Turn Conversations

python
for alias, (q1, q2) in product(aliases, FOLLOW_UPS):

Example:

User:

Tell me about Tech3Space

Assistant:

[summary]

User:

Can you summarize it in a few sentences?

Assistant:

[bullet explanation]
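
Put together, one multi-turn sample could be assembled as in this sketch. The entity text is a placeholder, and roles_alternate is a hypothetical sanity check that is not part of the original script.

```python
SYSTEM_PROMPT = "Answer using the provided knowledge."  # shortened placeholder


def example(messages):
    return {"messages": [{"role": "system", "content": SYSTEM_PROMPT}] + messages}


def roles_alternate(messages):
    """Hypothetical check: after the system message, turns go user, assistant, user, ..."""
    expected = ["user", "assistant"]
    turns = messages[1:]
    return all(m["role"] == expected[i % 2] for i, m in enumerate(turns))


alias = "ExampleProject"
summary = "ExampleProject is a placeholder entity used for illustration."
bullet_answer = summary + "\n\nKey points:\n- It is written in Python."
q1, q2 = "Tell me about {}.", "Can you summarize it in a few sentences?"

conversation = example(
    [
        {"role": "user", "content": q1.format(alias)},
        {"role": "assistant", "content": summary},
        # The follow-up has no placeholder: "it" only makes sense given the first turn.
        {"role": "user", "content": q2},
        {"role": "assistant", "content": bullet_answer},
    ]
)

print(len(conversation["messages"]))
print(roles_alternate(conversation["messages"]))
```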

🧠 Step 4: Fact-Level Training

python
for fact in facts:

Example:

User: What is one important fact about Tech3Space?
Assistant: A comprehensive platform for sharing research papers, code, notes, and building communities.
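
A minimal sketch of the fact-level step, with placeholder aliases and facts and the system message left out for brevity:

```python
# Placeholder data (hypothetical).
aliases = ["ExampleProject", "the ExampleProject tool"]
facts = ["It is written in Python.", "It exports data as JSONL.", "It is open source."]

fact_samples = []
for alias in aliases:
    for fact in facts:
        fact_samples.append(
            [
                {"role": "user", "content": f"What is one important fact about {alias}?"},
                {"role": "assistant", "content": fact},
            ]
        )

# One prompt per alias, but one sample per (alias, fact) pair.
distinct_prompts = {sample[0]["content"] for sample in fact_samples}

print(len(fact_samples))
print(len(distinct_prompts))
```

Each alias therefore yields the same question paired with several different answers. That suits a question asking for "one important fact", but it is worth keeping in mind when reading training loss.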

🧹 8. Remove Duplicate Entries

python
seen = set()
unique_dataset = []

We serialize each example to a canonical JSON string (sorted keys) and use that string as a set key to detect duplicates:

python
key = json.dumps(item, sort_keys=True)

💡 Why this matters:

Prevents redundant training
Improves dataset quality
Lowers the risk of overfitting to repeated samples
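
The two fragments above can be combined into one small function. This is a sketch: the name deduplicate is hypothetical, and the sample records are placeholders.

```python
import json


def deduplicate(items):
    """Keep the first occurrence of each item, comparing by canonical JSON."""
    seen = set()
    unique = []
    for item in items:
        # sort_keys=True makes the string independent of dict key order.
        key = json.dumps(item, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


a = {"messages": [{"role": "user", "content": "Hi"}]}
b = {"messages": [{"content": "Hi", "role": "user"}]}  # same data, different key order
c = {"messages": [{"role": "user", "content": "Hello"}]}

unique = deduplicate([a, b, c])
print(len(unique))
```

Dicts cannot be placed in a set directly because they are unhashable, which is why the serialized string is used as the key.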

💾 9. Save Dataset as JSONL

python
with open("train.jsonl", "w", encoding="utf-8") as f:
    for item in unique_dataset:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

📦 Output format:

json
{"messages":[...]}
{"messages":[...]}
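
Before training, it can help to confirm that the file reads back as one valid JSON object per line. The sketch below writes two placeholder samples to a temporary directory, so it does not touch a real train.jsonl, and loads them again.

```python
import json
import os
import tempfile

# Placeholder samples, including non-ASCII text and an embedded newline.
samples = [
    {"messages": [{"role": "user", "content": "What is JSONL?"},
                  {"role": "assistant", "content": "One JSON object\nper line."}]},
    {"messages": [{"role": "user", "content": "Is non-ASCII text kept?"},
                  {"role": "assistant", "content": "Yes: café, नमस्ते."}]},
]

path = os.path.join(tempfile.mkdtemp(), "train.jsonl")

with open(path, "w", encoding="utf-8") as f:
    for item in samples:
        # json.dumps escapes newlines inside strings, so each record stays on one line.
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded == samples)
```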

📊 10. Final Output

python
print(f"Generated {len(unique_dataset)} training examples.")

💻 Complete Code

python
import json
from itertools import product

SYSTEM_PROMPT = (
    "You are a helpful, accurate, and concise AI assistant. "
    "Answer using the provided knowledge without inventing unsupported facts."
)

KNOWLEDGE = {
    "AnkitKushwaha90": {
        "aliases": [
            "AnkitKushwaha90",
            "Ankit Kushwaha",
            "Ankit Kushwaha90",
            "the AnkitKushwaha90 profile",
            "0xAnkit",
        ],
        "summary": (
            "Ankit Kushwaha (AnkitKushwaha90) is a Cybersecurity and AI Research Enthusiast "
            "with a strong focus on Python, transformer architectures, backend development, "
            "and practical software engineering. He is the founder/owner of Tech3Space and "
            "maintains active presence across GitHub, Hugging Face, LinkedIn, and his own platform."
        ),
        "facts": [
            "Cybersecurity and AI Research Enthusiast.",
            "Works at Apple (as per LinkedIn).",
            "Graduated from Dr. A.P.J. Abdul Kalam Technical University.",
            "Deep interest in transformer architectures, kernel-level systems, and high-performance computing.",
            "Active on GitHub under tech3space and Hugging Face as ankitkushwaha90.",
            "Focuses on practical projects involving FastAPI, Streamlit, Docker, and machine learning.",
            "Explores interdisciplinary topics: Military Aviation, Radar Systems, LIDAR, Sensor Fusion, and Pattern-Based Learning.",
        ],
        "links": [
            "LinkedIn: https://www.linkedin.com/in/ankitkushwaha90/",
            "Hugging Face: https://huggingface.co/ankitkushwaha90",
            "GitHub (Tech3Space): https://github.com/tech3space",
        ],
    },
    "Tech3Space": {
        "aliases": [
            "Tech3Space",
            "the Tech3Space initiative",
            "Tech3Space | Systems Research Studio",
            "tech3space.com",
        ],
        "summary": (
            "Tech3Space is a technology platform and research studio founded by Ankit Kushwaha. "
            "It serves as a hub for researchers, developers, and tech enthusiasts to share knowledge, "
            "upload resources, build communities, and collaborate on cutting-edge topics like "
            "Transformers, Cybersecurity, AI, and high-performance systems. The platform emphasizes "
            "practical learning, research collaboration, and open knowledge sharing."
        ),
        "facts": [
            "A comprehensive platform for sharing research papers, code, notes, and building communities.",
            "Features include AI-Powered Research Assistant, post/note sharing, resume/PDF uploads, and real-time chat.",
            "Strong focus on Transformers, Cybersecurity, Low-Latency Networking, and Systems Research.",
            "GitHub organization (github.com/tech3space) contains repositories on transformer deep-dives, kernel drivers, cybersecurity tools, and assembly.",
            "Website: https://tech3space.com/ – described as 'The ultimate platform for researchers, developers, and tech enthusiasts.'",
            "Community stats (as advertised): 50,000+ active members, 1,200+ communities.",
        ],
        "links": [
            "Official Website: https://tech3space.com/",
            "GitHub: https://github.com/tech3space",
            "Alternative domains: https://www.tech3space.online, https://www.tech3space.in",
        ],
    },
    "HuggingFace_Ankit": {
        "aliases": [
            "ankitkushwaha90 on Hugging Face",
            "Tech3Space on HF",
        ],
        "summary": (
            "AnkitKushwaha90's Hugging Face profile (Tech3Space) where he shares models, "
            "datasets, and spaces focused on AI/ML, cybersecurity, and technical notes."
        ),
        "facts": [
            "Shares models like safetensor-related projects and fine-tuning experiments.",
            "Datasets include Sanskrit dataset, vulnerabilities collection, and Linux/CMD command collections.",
            "Focus areas: Cybersecurity & AI, Radar Systems, LIDAR, Military Aviation, and Sensor Fusion.",
            "Maintains an 'Anonymous Researcher' approach focused purely on knowledge sharing.",
        ],
        "links": [
            "Profile: https://huggingface.co/ankitkushwaha90"
        ],
    }
}


QUESTION_TEMPLATES = [
    "Who is {}?",
    "What is {}?",
    "Tell me about {}.",
    "Explain {}.",
    "Describe {}.",
    "Give an overview of {}.",
    "Can you introduce {}?",
    "What do you know about {}?",
]

DETAIL_REQUESTS = [
    "What are the main interests of {}?",
    "Summarize {}.",
    "Provide a detailed explanation of {}.",
    "List the key focus areas of {}.",
]

FOLLOW_UPS = [
    (
        "Tell me about {}.",
        "Can you summarize it in a few sentences?",
    ),
    (
        "What is {}?",
        "What are its main focus areas?",
    ),
]


def example(messages):
    return {
        "messages": [{"role": "system", "content": SYSTEM_PROMPT}] + messages
    }


dataset = []

for _, info in KNOWLEDGE.items():
    aliases = info["aliases"]
    summary = info["summary"]
    facts = info["facts"]

    bullet_answer = (
        summary
        + "\n\nKey points:\n- "
        + "\n- ".join(facts)
    )

    # Summary-style questions
    for alias, template in product(aliases, QUESTION_TEMPLATES):
        dataset.append(
            example(
                [
                    {
                        "role": "user",
                        "content": template.format(alias),
                    },
                    {
                        "role": "assistant",
                        "content": summary,
                    },
                ]
            )
        )

    # Detailed questions
    for alias, template in product(aliases, DETAIL_REQUESTS):
        dataset.append(
            example(
                [
                    {
                        "role": "user",
                        "content": template.format(alias),
                    },
                    {
                        "role": "assistant",
                        "content": bullet_answer,
                    },
                ]
            )
        )

    # Multi-turn conversations
    for alias, (q1, q2) in product(aliases, FOLLOW_UPS):
        dataset.append(
            example(
                [
                    {
                        "role": "user",
                        "content": q1.format(alias),
                    },
                    {
                        "role": "assistant",
                        "content": summary,
                    },
                    {
                        "role": "user",
                        "content": q2,
                    },
                    {
                        "role": "assistant",
                        "content": bullet_answer,
                    },
                ]
            )
        )

    # Fact-specific questions
    for alias in aliases:
        for fact in facts:
            dataset.append(
                example(
                    [
                        {
                            "role": "user",
                            "content": f"What is one important fact about {alias}?",
                        },
                        {
                            "role": "assistant",
                            "content": fact,
                        },
                    ]
                )
            )

# Remove duplicates
seen = set()
unique_dataset = []

for item in dataset:
    key = json.dumps(item, sort_keys=True)
    if key not in seen:
        seen.add(key)
        unique_dataset.append(item)

with open("train.jsonl", "w", encoding="utf-8") as f:
    for item in unique_dataset:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

print("=" * 60)
print(f"Generated {len(unique_dataset)} training examples.")
print("Saved as train.jsonl")
print("=" * 60)

🧠 How This System Works

text
Knowledge Base
     ↓
Alias Expansion
     ↓
Template Generation
     ↓
Multi-turn Construction
     ↓
Fact Injection
     ↓
Deduplication
     ↓
JSONL Dataset
     ↓
LLM Fine-Tuning
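
The pipeline's output size can be estimated before running it. In the sketch below, the alias and fact counts were tallied by hand from the KNOWLEDGE dictionary in the complete code, and the helper name expected_examples is hypothetical. The estimate assumes no duplicates are removed.

```python
# (number of aliases, number of facts) per entity, counted from KNOWLEDGE.
ENTITY_SIZES = {
    "AnkitKushwaha90": (5, 7),
    "Tech3Space": (4, 6),
    "HuggingFace_Ankit": (2, 4),
}

N_QUESTION_TEMPLATES = 8
N_DETAIL_REQUESTS = 4
N_FOLLOW_UPS = 2


def expected_examples(n_aliases, n_facts):
    # Each alias gets one sample per template, per follow-up pair, and per fact.
    per_alias = N_QUESTION_TEMPLATES + N_DETAIL_REQUESTS + N_FOLLOW_UPS + n_facts
    return n_aliases * per_alias


total = sum(expected_examples(a, f) for a, f in ENTITY_SIZES.values())
print(total)  # 105 + 80 + 36 = 221
```

The count is linear in the number of aliases per entity, so adding aliases is the cheapest way to grow the dataset, although the answers stay identical across them.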

🚀 Why This Approach is Powerful

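✔ Generates hundreds of training samples from a small knowledge base (221 for the three entities above), and the count grows with every alias or template added
✔ Removes most manual dataset writing
✔ Can improve model grounding and reduce hallucination
✔ Supports multi-turn conversations
✔ Works for any domain (AI, cybersecurity, finance, etc.)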

🔥 Use Cases

Fine-tuning personal AI assistants
Building domain-specific chatbots
Creating company knowledge bots
Training RAG fallback datasets
Synthetic data generation for LLMs