
How PZIP Compresses CSV Files Up to 69% Smaller Than LZMA

PZIP Team, February 4, 2026

CSV files are everywhere. Data lakes, ETL pipelines, analytics exports, research datasets — they're the lingua franca of structured data. And they compress terribly with general-purpose algorithms.

Why General Compressors Struggle with CSV

LZMA and gzip look for repeated byte patterns. But a CSV file's structure isn't at the byte level — it's at the column level. Column 1 might be sequential IDs. Column 3 might be timestamps with millisecond precision. Column 7 might be categorical values from a small dictionary.

General compressors see only a stream of bytes, so they can exploit very little of this. PZIP can.
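
To see how much the layout matters on your own data, here is a minimal sketch that compresses the same cells twice with Python's standard lzma module: once as the original rows, once regrouped by column. The helper names lzma_size and to_column_major are made up for this post, and the regrouping is deliberately lossy (quoting and line endings are dropped), so treat it as a measuring stick rather than a storage format. It is not PZIP's implementation.

```python
import csv
import io
import lzma


def lzma_size(text: str) -> int:
    """Size in bytes of the text after LZMA compression at preset 9."""
    return len(lzma.compress(text.encode("utf-8"), preset=9))


def to_column_major(csv_text: str) -> str:
    """Regroup cells so that each output line holds one whole column.

    Lossy on purpose: quoting and line endings are not preserved, so this
    is only suitable for comparing sizes, not for storing data.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    width = len(rows[0]) if rows else 0
    if any(len(row) != width for row in rows):
        raise ValueError("ragged CSV: rows have different field counts")
    columns = zip(*rows)  # transpose: rows of cells -> columns of cells
    # Cells are joined with the ASCII unit separator, columns with newlines.
    return "\n".join("\x1f".join(column) for column in columns)
```

Comparing lzma_size(text) with lzma_size(to_column_major(text)) on a real export gives a rough feel for the effect of layout alone, before any type-specific encoding. The result depends heavily on the data.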

PZIP's CSV Strategy

  1. Schema detection: Parse headers, detect column types (integer, float, timestamp, categorical, text)
  2. Column separation: Transpose row-major to column-major for better compression
  3. Type-specific encoding:
    • Sequential integers → start + step + count (3 numbers instead of N)
    • Timestamps → base + deltas (small integers)
    • Categoricals → dictionary + indices
    • Floats → fixed-point or Gorilla encoding
  4. Residual compression: Whatever remains after the type-specific encodings is passed to LZMA, which compresses it better than it would the raw rows
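
The three simplest encodings from step 3 can be sketched in a few lines of Python. This is an illustration of the general techniques under simple assumptions (columns already parsed into integers or strings, timestamps as integer milliseconds), not PZIP's actual code, and every function name here is hypothetical. Float encoding is left out because fixed-point and Gorilla both need more care to stay byte-exact.

```python
from typing import Dict, List, Optional, Tuple


def encode_sequence(values: List[int]) -> Optional[Tuple[int, int, int]]:
    """Return (start, step, count) if values form an arithmetic sequence, else None."""
    if len(values) < 2:
        return None
    step = values[1] - values[0]
    for prev, cur in zip(values, values[1:]):
        if cur - prev != step:
            return None  # not perfectly regular; fall back to another encoding
    return values[0], step, len(values)


def decode_sequence(start: int, step: int, count: int) -> List[int]:
    return [start + i * step for i in range(count)]


def encode_deltas(timestamps: List[int]) -> Tuple[int, List[int]]:
    """Return (base, deltas), where each delta is the gap to the previous value."""
    if not timestamps:
        raise ValueError("need at least one timestamp")
    deltas = [cur - prev for prev, cur in zip(timestamps, timestamps[1:])]
    return timestamps[0], deltas


def decode_deltas(base: int, deltas: List[int]) -> List[int]:
    out = [base]
    for delta in deltas:
        out.append(out[-1] + delta)
    return out


def encode_dictionary(values: List[str]) -> Tuple[List[str], List[int]]:
    """Return (dictionary, indices), with entries in order of first appearance."""
    dictionary: List[str] = []
    index_of: Dict[str, int] = {}
    indices: List[int] = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def decode_dictionary(dictionary: List[str], indices: List[int]) -> List[str]:
    return [dictionary[i] for i in indices]
```

Each encoder has a matching decoder so the original column can be rebuilt exactly. A real implementation would also have to preserve the original text of each cell (leading zeros, quoting, timestamp format) to stay byte-exact.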

Results

On 153 real-world CSV files, PZIP achieves up to 68.8% smaller output than LZMA-9. The median improvement is 14.3%.
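
For readers who want to reproduce this kind of figure on their own files: the sketch below assumes "X% smaller than LZMA" means the reduction in output size relative to the LZMA output, computed per file. That definition is our reading for this example, and the function names are hypothetical.

```python
from statistics import median
from typing import Iterable, Tuple


def percent_smaller(candidate_bytes: int, baseline_bytes: int) -> float:
    """How much smaller the candidate output is than the baseline, in percent."""
    if baseline_bytes <= 0:
        raise ValueError("baseline size must be positive")
    return 100.0 * (baseline_bytes - candidate_bytes) / baseline_bytes


def summarize(sizes: Iterable[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (best, median) improvement over (candidate, baseline) size pairs."""
    improvements = [percent_smaller(c, b) for c, b in sizes]
    return max(improvements), median(improvements)
```

A large gap between the best case and the median is what you would expect when a few files are dominated by highly regular columns.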

All results are byte-exact round-trip verified. decompress(compress(file)) == file. Always.
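
A round-trip check like this is easy to run against any compressor. The sketch below takes the compress and decompress functions as parameters, since PZIP's programmatic interface is not covered in this post. The usage note uses Python's standard lzma module as a stand-in.

```python
import lzma
from typing import Callable


def verify_round_trip(
    data: bytes,
    compress: Callable[[bytes], bytes],
    decompress: Callable[[bytes], bytes],
) -> bool:
    """True if decompressing the compressed data yields the exact original bytes."""
    return decompress(compress(data)) == data
```

For example, verify_round_trip(raw, lzma.compress, lzma.decompress), where raw is the file read in binary mode. Reading in binary mode matters: text mode can silently normalize line endings, which would hide exactly the kind of difference a byte-exact check is meant to catch.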

See CSV benchmarks or try it on your own CSV files.
