Understanding CSV statistics
Summary stats — the first pass on every new dataset.
The five-number summary, why mean lies about skewed data, and the column-by-column ritual that catches data-quality problems early.
The five-number summary.
For any numeric column: minimum, first quartile (Q1, the 25th percentile), median (Q2, the 50th), third quartile (Q3, the 75th), maximum. These five describe the distribution without assuming shape — equally useful for normal data and for long-tailed counts. The boxplot is a graphical summary of these five. Most "what does this column look like?" questions are answered by the five numbers.
IQR = Q3 − Q1 ; outliers if x < Q1 − 1.5·IQR or x > Q3 + 1.5·IQR
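The five numbers and the 1.5·IQR fences above can be sketched with the Python standard library alone. The function names here are illustrative, not from any particular tool. The quartile convention is an assumption: the "inclusive" method is used, and other conventions give slightly different Q1 and Q3 on small samples.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of at least two numbers."""
    data = sorted(values)
    # "inclusive" interpolates linearly between data points and treats the
    # sample min and max as the 0th and 100th percentiles.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


def iqr_outliers(values):
    """Return the values outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]
```

For example, on the values 1 through 8 plus a single 100, the quartiles are 3 and 7, the fences sit at −3 and 13, and only the 100 is flagged.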
Mean vs median.
For symmetric distributions, mean and median roughly agree. For skewed distributions (income, sales, web traffic, anything with a long right tail), the mean is pulled toward the tail and overestimates the typical value. The median is robust to outliers. Always look at both: when they disagree, the median is usually the more honest description of a typical row. Reporting "average customer spend" using the mean for a distribution with a few whale customers is a classic misleading statistic.
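A small illustration of the whale effect, using made-up spend values (the numbers and the `mean_vs_median` helper are hypothetical, chosen only to show the mechanism):

```python
import statistics


def mean_vs_median(values):
    """Return (mean, median) so the two can be compared side by side."""
    return statistics.fmean(values), statistics.median(values)


# Seven ordinary customers and one whale (invented data).
spend = [20, 25, 30, 35, 40, 45, 50, 2000]
mean, median = mean_vs_median(spend)
print(f"mean={mean:.1f} median={median:.1f}")  # mean=280.6 median=37.5
```

One extreme value moves the mean far above every ordinary customer, while the median stays in the middle of the ordinary range.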
Spread — std-dev vs IQR.
Standard deviation measures spread around the mean, which makes it useful for roughly normal data. For skewed or heavy-tailed data, a few extreme values inflate it, so it overstates the spread of the typical values. The IQR (Q3 − Q1) is a robust alternative: it captures the middle 50% of values regardless of tails. For normal data the std-dev is only about 0.74 × IQR, so a std-dev much larger than 1.5 × IQR is a strong sign that the distribution isn't normal and that analysis should use the median + IQR pair.
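To see the difference in robustness, the following sketch computes both measures for the same made-up column with and without one extreme value. The `spread` helper and the data are hypothetical, and the "inclusive" quartile method is an assumption.

```python
import statistics


def spread(values):
    """Return (sample std-dev, IQR) for a list of at least two numbers."""
    q1, _, q3 = statistics.quantiles(sorted(values), n=4, method="inclusive")
    return statistics.stdev(values), q3 - q1


base = [10, 12, 14, 16, 18, 20, 22, 24, 26]
tailed = [10, 12, 14, 16, 18, 20, 22, 24, 260]  # last value replaced by an outlier

std_base, iqr_base = spread(base)
std_tail, iqr_tail = spread(tailed)
# The IQR is identical in both cases; the std-dev grows more than tenfold.
print(f"base:   std={std_base:.1f} iqr={iqr_base:.1f}")
print(f"tailed: std={std_tail:.1f} iqr={iqr_tail:.1f}")
```

Comparing the std-dev to the IQR in this way is a cheap check for heavy tails before choosing which pair of statistics to report.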
What to report for every column.
Numeric columns: count, min, Q1, median, mean, Q3, max, std-dev, count of NaN/empty. Categorical columns: unique count, top 5 values + frequencies, count of empty. Datetime columns: min, max, count of empty, frequency by month/year. The data summary should fit on one screen per column; reading it is the first sanity check before any analysis.
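A minimal sketch of the per-type report described above, assuming each column arrives as a list of raw strings (for example from `csv.DictReader`). The set of tokens treated as empty, the ISO date format, and all function names are assumptions made for illustration.

```python
import statistics
from collections import Counter
from datetime import date

# Assumption: these strings mean "missing" in the CSV being profiled.
EMPTY_TOKENS = {"", "nan", "na", "null"}


def is_empty(cell):
    return cell is None or cell.strip().lower() in EMPTY_TOKENS


def profile_numeric(cells):
    """Count, five-number summary, mean, std-dev and empties (needs >= 2 values)."""
    present = [float(c) for c in cells if not is_empty(c)]
    q1, median, q3 = statistics.quantiles(sorted(present), n=4, method="inclusive")
    return {
        "count": len(present),
        "min": min(present),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(present),
        "q3": q3,
        "max": max(present),
        "std": statistics.stdev(present),
        "empty": len(cells) - len(present),
    }


def profile_categorical(cells, top_n=5):
    """Unique count, the top_n most frequent values, and empties."""
    present = [c.strip() for c in cells if not is_empty(c)]
    counts = Counter(present)
    return {
        "unique": len(counts),
        "top": counts.most_common(top_n),
        "empty": len(cells) - len(present),
    }


def profile_datetime(cells):
    """Min, max, empties and row counts per month; assumes YYYY-MM-DD strings."""
    present = [date.fromisoformat(c.strip()) for c in cells if not is_empty(c)]
    by_month = Counter(d.strftime("%Y-%m") for d in present)
    return {
        "min": min(present),
        "max": max(present),
        "empty": len(cells) - len(present),
        "by_month": dict(sorted(by_month.items())),
    }
```

Each function returns a small dictionary, which keeps the one-screen-per-column goal easy to meet when the result is printed.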
A worked profile.
A 100k-row sales CSV with 12 columns. Profile run: order_total has mean $87 and median $42, so it is strongly right-skewed (some whale orders). customer_id has 8 NaN values (data-entry bug). order_date ranges from 2024-01 to 2026-05, with a suspicious gap in Feb 2025 (system outage?). region has 5 unique values, one misspelled ("Eurpoe" appears 12 times). Three data-quality issues found in 30 seconds; without the profile, they'd surface as wrong numbers in the dashboard a week later.
100k-row sales profile
12 columns × 8 stats
Run profile, scan for outliers/NaN/typos.
mean $87 vs median $42 ; 8 NaN ; 1 typo
= 3 issues caught early
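Two of the checks in this worked profile, the date gap and the misspelled category, can be automated roughly as follows. This is a sketch under stated assumptions: a "gap" is taken to mean a calendar month with no rows at all, a "typo" is a rare value that closely resembles a frequent one, and the thresholds `rare_share` and `cutoff` are arbitrary illustrative defaults.

```python
import difflib
from collections import Counter
from datetime import date


def missing_months(dates):
    """Return 'YYYY-MM' strings for months between min and max with no rows."""
    seen = {(d.year, d.month) for d in dates}
    year, month = min(seen)
    last = max(seen)
    gaps = []
    while (year, month) <= last:
        if (year, month) not in seen:
            gaps.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return gaps


def likely_typos(values, rare_share=0.05, cutoff=0.8):
    """Map each rare value to the frequent value it most resembles, if any."""
    counts = Counter(values)
    total = sum(counts.values())
    frequent = [v for v, n in counts.items() if n / total >= rare_share]
    suspects = {}
    for value, n in counts.items():
        if n / total >= rare_share:
            continue
        # difflib scores string similarity between 0 and 1.
        match = difflib.get_close_matches(value, frequent, n=1, cutoff=cutoff)
        if match:
            suspects[value] = match[0]
    return suspects
```

Applied to a region column where "Europe" is common and "Eurpoe" appears a handful of times, `likely_typos` maps the misspelling to its probable correction; a human should still confirm before rewriting any data.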
Summary stats vs full data exploration.
The summary is the first 5 minutes. Real data exploration adds: pairwise correlations, distributions plotted as histograms or density curves, a scatter matrix for relationships, time-series plots for temporal columns, and group-by aggregations for categorical interactions. Tools like ydata-profiling (the renamed successor to pandas-profiling) auto-generate most of that as an HTML report. The CSV summary stats are the cheap first pass; the report is the second.
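Two of the second-pass steps named above, a pairwise correlation and a group-by aggregation, are small enough to sketch by hand. The function names and the row layout (a list of dictionaries) are assumptions for illustration; a profiling library would normally do this for every column pair at once.

```python
import math
import statistics
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.

    Raises ZeroDivisionError if either list is constant.
    """
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def median_by_group(rows, group_key, value_key):
    """Group rows by one column and take the median of another."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: statistics.median(vals) for group, vals in groups.items()}
```

The median is used for the group-by on purpose: it stays consistent with the robust statistics chosen in the first pass for skewed columns.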