Copyright (c) 2026 DataLake Solutions. All rights reserved.
This source code and all related materials are proprietary to DataLake Solutions.
You may not, without prior written permission from DataLake Solutions:
- copy, modify, distribute, sublicense, sell, publish, or otherwise disclose this code;
- share this code with third parties or post it to public repositories, forums, or websites;
- use this code to create derivative works for external distribution or commercial exploitation.
Unauthorized use, disclosure, or distribution is strictly prohibited.
A scalable engine that generates realistic and structured data from database schemas, enabling automated seeding, testing, and environment setup.
- Local CSV output at
schemas/<schema_name>/Tables_generated/<table_name>.csv
- Schema metadata JSON at
schemas/<schema_name>/instructions/
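The two output locations above can be derived from a schema name and a table name. The following is a minimal sketch, assuming the working directory is the project root that contains schemas/; the helper names are hypothetical and not part of the project.

```python
from pathlib import Path

# Assumption: the project root is the directory that contains schemas/.
PROJECT_ROOT = Path(".")


def table_csv_path(schema_name: str, table_name: str, root: Path = PROJECT_ROOT) -> Path:
    """Return schemas/<schema_name>/Tables_generated/<table_name>.csv."""
    return root / "schemas" / schema_name / "Tables_generated" / f"{table_name}.csv"


def instructions_dir(schema_name: str, root: Path = PROJECT_ROOT) -> Path:
    """Return schemas/<schema_name>/instructions/."""
    return root / "schemas" / schema_name / "instructions"
```

For example, table_csv_path("Banking", "CUSTOMERS") points at the CSV that a Banking run would write for the CUSTOMERS table.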
- Create venv:
python -m venv .venv
- Install:
python -m pip install -r requirements.txt
- Fill .env (OPENAI_MODEL can be any supported model):
OPENAI_API_KEY=<your_key>
OPENAI_MODEL=gpt-4o-mini
- Run Streamlit UI:
python -m streamlit run app.py
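Before launching, it can help to confirm that .env defines both variables listed in the setup steps. The sketch below is a standard-library check only; how the application itself loads .env is not shown here, and the function names are hypothetical.

```python
# The two variable names come from the setup steps above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "OPENAI_MODEL")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


# Usage (reads the real file, so it is left commented out):
# with open(".env", encoding="utf-8") as fh:
#     print(missing_keys(parse_env_text(fh.read())))
```

An empty list means both values are present; it does not confirm that the key is valid or that the model name is supported.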
The project includes a headless API that accepts one JSON payload with the schema prompt and table definitions, then runs the same backend generation flow used by the UI.
You do not need to open the Streamlit UI.
Instead, you:
- Start the API server.
- Send one JSON file to the API.
- The backend generates the data.
- The backend validates the data.
- The API returns the output folder and file paths.
Use the API server without --reload when generating data.
Why:
- this project creates and updates files while it runs
- --reload can restart the API in the middle of a run
- that can interrupt generation or validation
Run:
cd ..\datagenx
Run:
.\.venv\Scripts\python.exe -m uvicorn headless_api:app --host 127.0.0.1 --port 8000
After startup:
- Swagger UI:
http://127.0.0.1:8000/docs
- Health check:
http://127.0.0.1:8000/health
Leave this terminal open.
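If you prefer to start the server from a small Python launcher rather than the command line, one option is to call uvicorn.run with reload switched off explicitly. This is a sketch, assuming uvicorn is installed in the virtual environment and that headless_api:app is the import path used in the command above; the function names are hypothetical.

```python
def server_options() -> dict:
    """Options mirroring the uvicorn command above; reload stays off on purpose."""
    return {"host": "127.0.0.1", "port": 8000, "reload": False}


def serve() -> None:
    """Start the headless API and block until the process is stopped."""
    # Imported lazily so this module can be imported without uvicorn installed.
    import uvicorn

    uvicorn.run("headless_api:app", **server_options())
```

Calling serve() blocks the terminal in the same way the command-line form does.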
Run:
cd ..\datagenx
Run:
Invoke-RestMethod -Method Get -Uri "http://127.0.0.1:8000/health"
Expected result:
{
  "status": "ok"
}
Main endpoint:
POST /generate
Payload shape:
{
  "schema_name": "Banking",
  "schema_prompt": "Generate realistic synthetic US banking data with referential integrity.",
  "replace_existing": true,
  "tables": [
    {
      "table_name": "CUSTOMERS",
      "num_entries": 1000,
      "ddl": "CREATE TABLE CUSTOMERS (\n CUSTOMER_ID BIGINT PRIMARY KEY,\n FULL_NAME VARCHAR(100) NOT NULL,\n EMAIL VARCHAR(120) UNIQUE\n);",
      "instructions": "Generate realistic customer names and unique emails."
    },
    {
      "table_name": "ACCOUNTS",
      "num_entries": 1200,
      "ddl": "CREATE TABLE ACCOUNTS (\n ACCOUNT_ID BIGINT PRIMARY KEY,\n CUSTOMER_ID BIGINT NOT NULL,\n ACCOUNT_NUMBER VARCHAR(20) UNIQUE,\n FOREIGN KEY (CUSTOMER_ID) REFERENCES CUSTOMERS(CUSTOMER_ID)\n);",
      "instructions": "Every CUSTOMER_ID must exist in CUSTOMERS."
    }
  ]
}
Example payload files are available in the examples/ folder.
- Retail two-table example:
examples/retail_two_table_payload.json
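A payload can be checked locally before it is sent. The sketch below treats every field shown in the payload shape above as required, except replace_existing. That is an assumption, since the API may accept fewer fields or apply stricter rules, and check_payload is a hypothetical helper rather than part of the project.

```python
# Assumption: every per-table field shown in the payload shape is required.
REQUIRED_TABLE_KEYS = {"table_name", "num_entries", "ddl", "instructions"}


def check_payload(payload: dict) -> list[str]:
    """Return a list of problems found in a /generate payload (empty if none)."""
    problems: list[str] = []
    for key in ("schema_name", "schema_prompt", "tables"):
        if key not in payload:
            problems.append(f"missing top-level key: {key}")
    for index, table in enumerate(payload.get("tables", [])):
        missing = REQUIRED_TABLE_KEYS - set(table)
        if missing:
            problems.append(f"tables[{index}] missing: {sorted(missing)}")
        rows = table.get("num_entries")
        if "num_entries" in table and (type(rows) is not int or rows <= 0):
            problems.append(f"tables[{index}] num_entries must be a positive integer")
    return problems
```

Load an example file with json.load and pass the resulting dictionary to check_payload; an empty list means the structure matches the shape shown above.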
PowerShell:
Invoke-RestMethod -Method Post `
-Uri "http://127.0.0.1:8000/generate" `
-ContentType "application/json" `
-InFile "examples\retail_two_table_payload.json"
What this command does:
- reads the JSON file from the examples folder
- sends it to the API
- runs schema creation, data generation, and validation
- returns the output location and status
cURL:
curl -X POST "http://127.0.0.1:8000/generate" \
-H "Content-Type: application/json" \
--data-binary "@examples/retail_two_table_payload.json"
Other endpoints:
- GET /schemas to list saved schemas
- GET /schemas/{org_id} to inspect one saved schema and its statuses
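The same requests can be made from Python using only the standard library. This is a minimal sketch, assuming the server is running at the address used above; the function names are hypothetical, and the request objects are built separately from sending so they can be inspected first.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://127.0.0.1:8000"


def build_generate_request(payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /generate request with a JSON body."""
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_schemas_request(org_id: Optional[str] = None, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build GET /schemas, or GET /schemas/{org_id} when an id is given."""
    suffix = f"/{org_id}" if org_id else ""
    return urllib.request.Request(url=f"{base_url}/schemas{suffix}", method="GET")


def send(request: urllib.request.Request, timeout: Optional[float] = None) -> dict:
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A typical call loads examples/retail_two_table_payload.json with json.load and passes the result to send(build_generate_request(payload)). Generation may take a while, so the default of no timeout is deliberate.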
The POST /generate response includes:
- org_id
- schema_name
- output_dir
- generated_files
- validation_report_path
- log_path
- status
If status is DONE, the run completed successfully.
For the retail example, generated files will be under:
- schemas\Retail\Tables_generated\
- schemas\Retail\validations\validation_report.json
- schemas\Retail\logs\latest.log
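Once the response has been decoded into a dictionary, a short check on status tells you whether to open the output or the log. The sketch below assumes generated_files is a list of paths; the only status value documented here is DONE, so any other value is treated as not complete.

```python
def summarize_run(response: dict) -> str:
    """Return a one-line summary of a POST /generate response."""
    status = response.get("status")
    if status != "DONE":
        # Only DONE is documented as success; anything else points to the log.
        return f"Run not complete (status={status!r}); see log at {response.get('log_path')}"
    files = response.get("generated_files", [])
    return f"{len(files)} file(s) written under {response.get('output_dir')}"
```

The summary uses only the field names listed for the response; validation_report_path can be reported in the same way if needed.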
For a non-technical user, the headless flow is:
- Start the API.
- Check /health.
- Send the JSON file.
- Wait for the response.
- Open the generated files.
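The first two steps of this flow can be automated by polling /health until the server answers. This is a sketch, assuming the health response is the {"status": "ok"} document shown earlier; the retry count and delay are arbitrary defaults, and the function names are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable

HEALTH_URL = "http://127.0.0.1:8000/health"


def fetch_health(url: str = HEALTH_URL) -> dict:
    """Call GET /health once and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def wait_until_healthy(
    fetch: Callable[[], dict] = fetch_health,
    attempts: int = 10,
    delay: float = 1.0,
) -> bool:
    """Poll until the health check reports ok, or give up after `attempts` tries."""
    for _ in range(attempts):
        try:
            if fetch().get("status") == "ok":
                return True
        except OSError:
            # urllib.error.URLError is a subclass of OSError: server not up yet.
            pass
        time.sleep(delay)
    return False
```

The fetch argument exists so the polling logic can be exercised without a running server; in normal use, wait_until_healthy() with no arguments is enough.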
- Generated files are saved in the project's schemas/ folder.
- Per schema, the instructions folder contains:
  - ddl.json
  - instructions.json
  - schema_prompt.json
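The saved metadata can be read back for inspection. The sketch below loads whichever of the three files exist under schemas/<schema_name>/instructions/. It makes no assumption about their internal structure beyond their being valid JSON, and load_schema_metadata is a hypothetical helper.

```python
import json
from pathlib import Path

METADATA_FILES = ("ddl.json", "instructions.json", "schema_prompt.json")


def load_schema_metadata(schema_name: str, root: Path = Path(".")) -> dict:
    """Load the metadata JSON files that exist for one schema, keyed by file name."""
    folder = root / "schemas" / schema_name / "instructions"
    loaded = {}
    for name in METADATA_FILES:
        path = folder / name
        if path.exists():
            loaded[name] = json.loads(path.read_text(encoding="utf-8"))
    return loaded
```

Missing files are skipped rather than treated as errors, so a schema that has not been generated yet returns an empty dictionary.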
- The UI keeps the same core generate/schema-management flow as your existing app.
- The query module and login are intentionally removed, per requirement.