CSV parsing is harder than you think (and why `split(',')` isn't enough)
Quoted fields, escaped quotes, embedded newlines, UTF-8 BOMs, Excel's opinions. A tour of every CSV footgun and the libraries that handle them.
# The naive version
```python
row.split(",")
```
This works until it doesn't. Here's the data that breaks it:
```
"Smith, John",42,"Software engineer at ""Acme, Inc."""
```
A naive split produces five fields. It should produce three. Let's unpack.
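A minimal demonstration with Python's stdlib `csv` module makes the difference concrete:

```python
import csv
import io

line = '"Smith, John",42,"Software engineer at ""Acme, Inc."""'

naive = line.split(",")                       # blind split: quotes are ignored
proper = next(csv.reader(io.StringIO(line)))  # quote-aware parse

print(len(naive))   # 5
print(len(proper))  # 3
print(proper)       # ['Smith, John', '42', 'Software engineer at "Acme, Inc."']
```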
# The five things CSVs can do
# 1. Quoted fields with commas inside
```
"Smith, John",42
```
Two fields. Split on comma produces three. You need a parser that respects quotes.
# 2. Escaped quotes inside quoted fields
The standard is "double the quote":
```
"She said ""hi""",answered
```
Two fields: `She said "hi"` and `answered`. Some older dialects use backslash escapes (`"She said \"hi\""`) instead; always respect what your tool produces, not what you wish it produced.
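You can watch the doubled-quote convention being applied by round-tripping through Python's `csv.writer`, which escapes automatically:

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(['She said "hi"', "answered"])

# The writer doubled the inner quotes and quoted the field
print(repr(buf.getvalue()))  # '"She said ""hi""",answered\r\n'

# Reading it back undoes the escaping
row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row)                   # ['She said "hi"', 'answered']
```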
# 3. Embedded newlines inside quoted fields
```
"line one
line two",next-field
```
This is a single row with two fields. The first field contains a newline. If your CSV parser reads line-by-line, you'll read half a row and break.
<div class="callout callout-warning" role="note"><div class="callout-title">Warning</div><div class="callout-body"><p>You cannot parse CSV line-by-line unless you have a guarantee that no field contains a newline. Spreadsheet exports routinely produce multi-line quoted fields.</p></div></div>
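Python's `csv.reader` tracks quote state across line boundaries, so it handles this correctly when you feed it the file object (or any iterable of lines) rather than parsing line by line yourself:

```python
import csv
import io

data = '"line one\nline two",next-field\n'
rows = list(csv.reader(io.StringIO(data)))

print(len(rows))  # 1 row, not 2
print(rows[0])    # ['line one\nline two', 'next-field']
```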
# 4. UTF-8 BOM at the start
Excel saves CSVs with a UTF-8 BOM (bytes `EF BB BF`). Your parser sees the first column header as `\ufeffname` instead of `name`, and no field lookups match. Strip the BOM before parsing, or use a parser that does.
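A byte-level sketch of what the BOM does to your header, using Python's built-in codecs:

```python
header = b"\xef\xbb\xbfname,age\nAda,36\n"

# Plain utf-8 keeps the BOM as a zero-width character in the first field
print(repr(header.decode("utf-8").split(",")[0]))      # '\ufeffname'

# utf-8-sig strips it
print(repr(header.decode("utf-8-sig").split(",")[0]))  # 'name'
```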
# 5. Delimiter ambiguity
CSV stands for Comma-Separated Values, but:
- European locales often use `;` because `,` is the decimal separator
- Tab-separated (`\t`) exists for the same reason
- Pipe-separated (`|`) shows up in data exports
- Excel looks at locale settings and picks a delimiter accordingly
Your parser needs to accept or detect the delimiter. Auto-detection works ~95% of the time — sniff the header line for the most common delimiter character.
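Python's stdlib ships a sniffer for exactly this. A sketch of delimiter detection on a semicolon-separated sample (restricting the candidate set makes the guess more reliable):

```python
import csv

sample = "name;age;city\nAda;36;London\nAlan;41;Wilmslow\n"

# Restrict candidates to the usual suspects
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
print(dialect.delimiter)  # ';'

rows = list(csv.reader(sample.splitlines(), dialect))
print(rows[1])  # ['Ada', '36', 'London']
```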
# The quick rule
Do not write a CSV parser. Use one of these:
| Language   | Package / stdlib        |
|------------|-------------------------|
| JavaScript | papaparse or csv-parse  |
| Python     | csv (stdlib)            |
| Go         | encoding/csv (stdlib)   |
| Rust       | csv crate               |
| Java       | opencsv, commons-csv    |
| PostgreSQL | COPY command            |
Every one of these handles the five cases above correctly by default.
# Common bugs even with a library
# 1. Not disabling newline translation
```python
import csv

# Wrong on Windows: text mode converts \r\n to \n silently,
# then quoted fields containing newlines break
with open("data.csv", "r") as f:
    reader = csv.reader(f)

# Right
with open("data.csv", "r", newline="") as f:
    reader = csv.reader(f)
```
The `newline=""` argument prevents Python from normalising line endings before the csv module sees them.
# 2. Forgetting to handle the BOM
```python
import csv

# Right: 'utf-8-sig' strips the BOM if present
with open("data.csv", "r", encoding="utf-8-sig", newline="") as f:
    reader = csv.reader(f)
```
# 3. Excel's special leading characters
Excel treats these as formulas, not text, and will execute them:
- `=SUM(...)` and anything else starting with `=`
- `+`, `-`, `@` at the start of a field
If your CSV contains user input that starts with these characters, Excel will try to evaluate it. This is CVE-worthy when combined with `=HYPERLINK(...)`. Escape a leading `=`, `+`, `-`, or `@` with an apostrophe or a tab.
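A minimal sanitiser along those lines; the helper name `excel_safe` is made up for illustration, and note that it will also prefix legitimate negative numbers, so apply it only to text columns:

```python
def excel_safe(value: str) -> str:
    """Prefix formula-triggering characters so Excel treats the cell as text."""
    # '=', '+', '-', '@' at the start of a cell can be interpreted as a formula
    if value and value[0] in ("=", "+", "-", "@"):
        return "'" + value
    return value

print(excel_safe("=HYPERLINK(...)"))  # '=HYPERLINK(...)
print(excel_safe("plain text"))       # plain text
```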
# 4. Trailing commas
```
col1,col2,col3,
value,value,value,
```
Some exporters include a trailing comma. Good parsers interpret this as a fourth, empty field. Bad parsers error. Test both.
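Python's `csv` module takes the "fourth empty field" interpretation:

```python
import csv
import io

row = next(csv.reader(io.StringIO("value,value,value,\n")))
print(row)  # ['value', 'value', 'value', '']
```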
# Excel's opinions
Excel is the biggest consumer of CSV files in the world and has strong opinions:
- It auto-detects data types: `123e5` becomes scientific notation, and product SKUs get mangled.
- Zip codes starting with 0 lose the leading zero.
- Dates get reformatted based on the user's locale.
- UTF-8 without BOM is interpreted as Windows-1252 — non-ASCII characters become gibberish.
<div class="callout callout-tip" role="note"><div class="callout-title">Tip</div><div class="callout-body"><p>If your CSV is meant to be opened in Excel: include the UTF-8 BOM, quote every field that contains special characters, and prefix number-looking strings with tab (<code>\t</code>) to prevent auto-conversion. Or generate .xlsx directly — our <a href="/json-to-excel">JSON to Excel</a> converter does this in-browser.</p></div></div>
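A sketch of an Excel-friendly export in Python: writing with `utf-8-sig` emits the BOM for you, and `QUOTE_ALL` quotes every field. The filename is illustrative.

```python
import csv

with open("report.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(["name", "city"])
    writer.writerow(["José", "São Paulo"])

# The file now starts with EF BB BF, so Excel decodes it as UTF-8
with open("report.csv", "rb") as f:
    print(f.read(3))  # b'\xef\xbb\xbf'
```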
# Convert with confidence
Our CSV to JSON tool handles all five edge cases — quoted fields, escaped quotes, embedded newlines, BOM, custom delimiters. Drop a messy CSV and get clean JSON. Everything client-side; your data doesn't upload anywhere.
# Frequently asked questions
# Is there actually a CSV spec?
RFC 4180 exists, published in 2005, but every implementation treats it as advisory. Excel, Google Sheets, Postgres COPY, Python's `csv` module all interpret edge cases differently. The spec is the closest thing to a lingua franca, not a strict law.
# What's a BOM and why does it matter?
UTF-8 BOM is three bytes (EF BB BF) that Excel puts at the start of CSV files it saves. Most parsers treat the BOM as part of the first field value — so your first column header becomes `\ufeffname` instead of `name`. Strip it before parsing.
# Should I just use TSV instead?
It avoids comma-in-field collision but not the other issues (quoting, newlines, escaping). And Excel won't open a `.tsv` as nicely. CSV with a good parser is still usually the right answer.