Understanding CSV → JSON dataset
Tabular rows into nested objects — for ML training and API responses.
What "nested" means in this context, the type-inference rules that mostly work, and the size threshold where JSONL beats one-big-array.
The simple translation.
A CSV with a header row plus N data rows becomes a JSON array of N objects. Each object's keys are the column headers; each value is the corresponding cell. The trivial conversion takes ~5 lines of code in any language. The interesting decisions are downstream: type inference, nested-object reconstruction, and how to chunk large files.
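As a rough illustration of how small the trivial conversion is, here is a minimal Python sketch using only the standard library. The function name csv_to_records and the sample data are made up for this example; note that every value comes out as a string, which is why type inference matters next.

```python
import csv
import io
import json

def csv_to_records(csv_text: str) -> list[dict[str, str]]:
    """Turn CSV text with a header row into a list of dicts, one per data row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Each row maps header -> cell; all cells are still plain strings here.
    return [dict(row) for row in reader]

# Hypothetical sample input: one header row plus two data rows.
sample = "id,name\n1,Ada\n2,Grace\n"
records = csv_to_records(sample)
print(json.dumps(records, indent=2))
```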
Type inference.
CSV is text-only. The converter has to guess types: "42" → integer, "3.14" → float, "true" → boolean, "2026-05-14" → ISO date string. Edge cases bite. "001" — a ZIP code or an integer? Usually you want the string. "True" — capitalised, still a boolean? Most parsers say yes; some say no. "1.0" vs "1" — same value, different type. The safe default is "infer from the column as a whole, not per-cell", and let the user override with a type hint.
Nested keys.
When column names contain dots (address.city, address.zip, address.country), they often encode nested structure. The conversion can re-nest them: each row becomes { address: { city, zip, country } } instead of carrying flat keys. This is useful for feeding REST APIs that expect nested JSON. It is optional; some downstream tools prefer the flat form for tabular processing.
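The re-nesting step can be sketched in a few lines of Python. The function name nest_row is hypothetical, and raising an error when a column such as address collides with address.city is one possible policy, assumed here for illustration.

```python
def nest_row(flat: dict) -> dict:
    """Rebuild nested objects from dotted keys, e.g. 'address.city'."""
    nested: dict = {}
    for key, value in flat.items():
        parts = key.split(".")
        node = nested
        # Walk (and create) the intermediate objects for all but the last part.
        for part in parts[:-1]:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                raise ValueError(f"column {key!r} collides with scalar column {part!r}")
            node = child
        leaf = parts[-1]
        if isinstance(node.get(leaf), dict):
            raise ValueError(f"column {key!r} collides with a nested column group")
        node[leaf] = value
    return nested

# Hypothetical flat row as it would come out of a CSV reader.
row = {"id": "7", "address.city": "Oslo", "address.zip": "0150", "address.country": "NO"}
nested_row = nest_row(row)
```

Columns without dots pass through unchanged, so the same function works on fully flat files.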
JSONL vs JSON-array.
A small dataset (< 100k rows): output one JSON array. It fits in memory and loads with JSON.parse. A large dataset (millions of rows, training data): output JSONL, one JSON object per line with no surrounding array. It streams line by line, works with readline + JSON.parse, and doesn't blow memory. ML pipelines, BigQuery, and OpenAI fine-tuning APIs all consume JSONL. Knowing which format your downstream wants is the first question.
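To make the streaming property concrete, here is a minimal Python sketch of writing and reading JSONL; json.loads plays the role that JSON.parse plays in JavaScript. The helper names write_jsonl and read_jsonl are made up for this example, and an in-memory buffer stands in for a real file.

```python
import io
import json
from typing import Iterable, Iterator, TextIO

def write_jsonl(records: Iterable[dict], out: TextIO) -> int:
    """Write one JSON object per line; only one record is in memory at a time."""
    count = 0
    for record in records:
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count

def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield records lazily, parsing each non-empty line on its own."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Stand-in for a file opened with open(path, "w") / open(path).
buffer = io.StringIO()
written = write_jsonl(({"id": i} for i in range(3)), buffer)
buffer.seek(0)
rows = list(read_jsonl(buffer))
```

Because both functions work on iterables, the same code handles three rows or three million; only the final list(...) call, used here for demonstration, pulls everything into memory.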
A worked conversion.
Take a 50k-row CSV of e-commerce orders with 12 columns, including shipping.address and shipping.country. Convert with type inference + nesting: the output is 50k JSON objects, each with a shipping sub-object. File size grows ~20% vs CSV (JSON syntax overhead), but the structure is now consumable by the frontend's table component without extra parsing. For ML training it would have been better as JSONL; for the dashboard it's fine as one array.
Figure: 50k-order CSV, flat → nested. Header dots become sub-objects: shipping.address + shipping.country → { shipping: {...} }. Result: one 50k-element array, about 20% larger.
When to keep CSV.
Keep CSV for Excel and Sheets users, who can't open JSON directly; for reporting pipelines that consume CSV as-is; for tools like AWS Athena and Snowflake that bulk-load CSV faster than JSON; and anywhere the consumer is tabular by nature. Convert to JSON only when the consumer needs nested structure, when the data is heading to a JSON-only API, or when the rows will be streamed one at a time. The conversion is cheap; the choice should be need-driven.