NVIDIA just open-sourced NeMo Data Designer, the synthetic data framework used internally to build both pre-training and post-training datasets for Nemotron.

It lets you define an entire synthetic data pipeline directly in Python: structured outputs, statistical samplers, LLM-generated columns, dependency-aware field relationships, Python/SQL/remote validators, and optional LLM-as-judge scoring. It also supports a quick preview mode for fast iteration before scaling up.

Install:

```
pip install data-designer
```

A minimal example:

```
from data_designer.essentials import *

data_designer = DataDesigner()
config = DataDesignerConfigBuilder()

# Sampler column: picks one category per record.
config.add_column(
    SamplerColumnConfig(
        name="product_category",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books"]
        ),
    )
)

# LLM column: the prompt references the sampled category by name.
config.add_column(
    LLMTextColumnConfig(
        name="review",
        model_alias="nvidia-text",
        prompt="Write a short product review for a {{ product_category }} item."
    )
)

# Generate a small preview before scaling up.
preview = data_designer.preview(config_builder=config)
preview.display_sample_record()
```

This release also incorporates the synthetic data tech my team originally built at Gretel (now part of NVIDIA), now generally available for anyone to use or extend.

Repo: https://github.com/NVIDIA-NeMo/DataDesigner
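For anyone wondering what "dependency-aware field relationships" can mean in practice, here is a rough, standalone sketch of the idea. It is not Data Designer's implementation and uses none of its API. It assumes that a `{{ column }}` reference in a prompt marks a dependency, and that columns are generated in topological order. The `summary` column and the helper names `referenced_columns` and `generation_order` are made up for illustration.

```python
import re
from graphlib import TopologicalSorter

# Matches Jinja-style references such as {{ product_category }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def referenced_columns(prompt: str) -> set[str]:
    """Return the column names a prompt template refers to."""
    return set(PLACEHOLDER.findall(prompt))


def generation_order(prompts: dict[str, str]) -> list[str]:
    """Order columns so each one comes after the columns its prompt references.

    `prompts` maps a column name to its template; sampler columns, which
    depend on nothing, use an empty string.
    """
    graph = {name: referenced_columns(prompt) for name, prompt in prompts.items()}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical pipeline: "review" mirrors the minimal example, "summary" is made up.
columns = {
    "summary": "Summarize this review in one line: {{ review }}",
    "review": "Write a short product review for a {{ product_category }} item.",
    "product_category": "",
}
order = generation_order(columns)
print(order)
```

The declaration order does not matter here: the sampler column comes first, then the review, then the summary that depends on it. A circular reference would make `TopologicalSorter` raise `graphlib.CycleError`.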
NeMo Data Designer is our core product from Gretel, and it is now the internal framework we use heavily to build both pre- and post-training data for Nemotron across a variety of use cases.
The OSS version is fully general-purpose: Python-first, modular, and designed so you can mix statistical samplers, LLM columns, and seed datasets in a single pipeline.
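To make the "mix in a single pipeline" point concrete, here is a toy, offline sketch in plain Python. It does not use the Data Designer API. The seed rows, the `fake_llm` stand-in, and the `generate` helper are all hypothetical and only show how a seed-dataset column, a sampled column, and a generated text column could end up in the same record.

```python
import random

# Hypothetical seed dataset: rows that already exist and are reused as-is.
SEED_ROWS = [
    {"product_name": "Trail Lamp"},
    {"product_name": "Desk Mat"},
]
CATEGORIES = ["Electronics", "Clothing", "Home & Kitchen", "Books"]


def fake_llm(prompt: str) -> str:
    """Stand-in for a model call so the sketch runs offline."""
    return f"[generated text for: {prompt}]"


def generate(num_records: int, seed: int = 0) -> list[dict[str, str]]:
    """Build records that combine seed data, a sampler, and generated text."""
    rng = random.Random(seed)  # seeded so repeated previews are reproducible
    records = []
    for _ in range(num_records):
        record = dict(rng.choice(SEED_ROWS))                 # seed-dataset column
        record["product_category"] = rng.choice(CATEGORIES)  # sampled column
        prompt = (
            f"Write a short review of {record['product_name']}, "
            f"a {record['product_category']} item."
        )
        record["review"] = fake_llm(prompt)                  # LLM-style column
        records.append(record)
    return records


preview = generate(3)
for row in preview:
    print(row)
```

In the real framework, the model call and the sampling logic sit behind column configs like the ones in the post; the sketch only illustrates how the three sources compose per record.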
Happy to answer questions or hear feedback on missing features.