SynthData

A synthetic dataset generation tool powered by Large Language Models (LLMs).

Overview

SynthData generates synthetic datasets from description files using an LLM. It supports JSON and CSV output, large-scale generation with batching, and an interactive CLI wizard for guided data generation.

Features

LLM-Powered Generation: Generate synthetic data using OpenAI-compatible APIs
Multiple Output Formats: Support for JSON and CSV output
Large-Scale Generation: Batch processing with concurrency control for generating large datasets
Interactive Mode: CLI wizard that guides you through the generation process
Schema Validation: Validate description files before generation

Installation

git clone https://github.com/dynamder/synthdata.git
cd synthdata
go build -o synthdata ./cmd/synthdata

Configuration

Create a configuration file (e.g., configs/default.toml):

[llm]
api_key = "your-api-key"
base_url = "https://api.openai.com/v1"
model = "gpt-4o-mini"
max_retries = 3

Supported LLM providers: OpenAI, SiliconFlow, Azure OpenAI, and any OpenAI-compatible API.
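
To make the role of each setting concrete, here is a minimal sketch of how the [llm] values could map onto an OpenAI-style chat completions request. The names LLMConfig and newChatRequest are hypothetical and are not taken from the SynthData source. The /chat/completions path and the Bearer authorization header follow the OpenAI API convention. Some providers, such as Azure OpenAI, use a different URL layout and auth header, so this sketch would not apply to them unchanged.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// LLMConfig mirrors the [llm] table above. The type name is hypothetical.
type LLMConfig struct {
	APIKey     string
	BaseURL    string
	Model      string
	MaxRetries int
}

// chatMessage and chatRequest follow the OpenAI chat completions request body.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

// newChatRequest builds, but does not send, a chat completions request.
func newChatRequest(cfg LLMConfig, prompt string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:    cfg.Model,
		Messages: []chatMessage{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return nil, err
	}
	// Tolerate a trailing slash in base_url.
	url := strings.TrimRight(cfg.BaseURL, "/") + "/chat/completions"
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+cfg.APIKey)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	cfg := LLMConfig{
		APIKey:     "your-api-key",
		BaseURL:    "https://api.openai.com/v1",
		Model:      "gpt-4o-mini",
		MaxRetries: 3,
	}
	req, err := newChatRequest(cfg, "Generate 10 user records as JSON.")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```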

Quick Start

Generate 100 records from a description file and write them to output.json:

synthdata generate -d description.md -o output.json -s 100

Usage

Command Options

Flag	Short	Description	Default
`--description`	`-d`	Path to description file	(required)
`--output`	`-o`	Output file path	(required)
`--format`	`-f`	Output format (json, csv)	json
`--scale`	`-s`	Number of records to generate	10
`--config`	`-c`	Config file path	configs/default.toml
`--batch-size`		Records per batch	10
`--concurrency`		Max parallel LLM calls	5
`--max-retries`		Max retry attempts	3
`--force`		Overwrite existing output	false
`--verbose`	`-v`	Enable verbose logging	false
`--interactive`	`-i`	Enable interactive wizard	false

Interactive Mode

Launch the interactive wizard to guide you through the process:

synthdata generate --interactive

Large-Scale Generation

To generate a large dataset, raise the batch size and the concurrency limit:

synthdata generate -d description.md -o output.json -s 10000 --batch-size 100 --concurrency 10
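
The interaction between --scale, --batch-size, and --concurrency can be pictured with the sketch below. It is an illustration under assumptions, not SynthData's actual implementation: planBatches and runBatches are hypothetical names, and the generate callback stands in for a real LLM call. A buffered channel acts as a semaphore so that at most `concurrency` batches are in flight at once.

```go
package main

import (
	"fmt"
	"sync"
)

// planBatches splits total records into batches of at most batchSize each.
func planBatches(total, batchSize int) []int {
	if total <= 0 || batchSize <= 0 {
		return nil
	}
	batches := make([]int, 0, (total+batchSize-1)/batchSize)
	for remaining := total; remaining > 0; remaining -= batchSize {
		n := batchSize
		if remaining < batchSize {
			n = remaining // the last batch may be smaller
		}
		batches = append(batches, n)
	}
	return batches
}

// runBatches calls generate once per batch with at most `concurrency`
// calls running at the same time, and returns the total records produced.
func runBatches(batches []int, concurrency int, generate func(n int) int) int {
	if concurrency < 1 {
		concurrency = 1
	}
	sem := make(chan struct{}, concurrency)
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)
	for _, n := range batches {
		wg.Add(1)
		sem <- struct{}{} // blocks while `concurrency` calls are running
		go func(n int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			got := generate(n)
			mu.Lock()
			total += got
			mu.Unlock()
		}(n)
	}
	wg.Wait()
	return total
}

func main() {
	batches := planBatches(10000, 100)
	// Stand-in for an LLM call that returns the number of records generated.
	total := runBatches(batches, 10, func(n int) int { return n })
	fmt.Println(len(batches), total) // prints "100 10000"
}
```

With the flags from the command above, 10000 records at 100 per batch gives 100 LLM calls, of which no more than 10 run in parallel.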

Description File Format

The description file defines the data structure and generation rules:

{
  "name": "Dataset Name",
  "description_file": "description.md",
  "format": "json",
  "count": 100,
  "schema": {
    "name": "table_name",
    "type": "nested",
    "children": [
      { "name": "id", "type": "integer" },
      { "name": "username", "type": "string" },
      { "name": "email", "type": "string" }
    ]
  }
}

For more examples, see examples/bilibili_chat_description/bilibili_chat.json.
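
As a reading aid, the JSON layout above can be modeled with two Go structs. This is a minimal sketch based only on the fields shown in the example. The type and function names (Description, Field, parseDescription, leafPaths) are hypothetical, and the real SynthData types may carry additional fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Field is one node of the schema tree. A "nested" field has children.
type Field struct {
	Name     string  `json:"name"`
	Type     string  `json:"type"`
	Children []Field `json:"children,omitempty"`
}

// Description mirrors the top-level keys of the example description file.
type Description struct {
	Name            string `json:"name"`
	DescriptionFile string `json:"description_file"`
	Format          string `json:"format"`
	Count           int    `json:"count"`
	Schema          Field  `json:"schema"`
}

// parseDescription decodes a JSON description file.
func parseDescription(data []byte) (Description, error) {
	var d Description
	if err := json.Unmarshal(data, &d); err != nil {
		return Description{}, err
	}
	return d, nil
}

// leafPaths returns the dotted path of every field that has no children.
func leafPaths(f Field, prefix string) []string {
	path := f.Name
	if prefix != "" {
		path = prefix + "." + f.Name
	}
	if len(f.Children) == 0 {
		return []string{path}
	}
	var out []string
	for _, c := range f.Children {
		out = append(out, leafPaths(c, path)...)
	}
	return out
}

func main() {
	raw := []byte(`{
	  "name": "Dataset Name",
	  "description_file": "description.md",
	  "format": "json",
	  "count": 100,
	  "schema": {
	    "name": "table_name",
	    "type": "nested",
	    "children": [
	      { "name": "id", "type": "integer" },
	      { "name": "username", "type": "string" },
	      { "name": "email", "type": "string" }
	    ]
	  }
	}`)
	d, err := parseDescription(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(d.Count, leafPaths(d.Schema, ""))
}
```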

Examples

Generate JSON output:

synthdata generate -d examples/bilibili_chat_description/bilibili_chat.json -o output.json -s 50

Generate CSV output:

synthdata generate -d description.md -o output.csv -f csv -s 100

Use custom config:

synthdata generate -d description.md -o output.json -c my_config.toml
