Tabular data is everywhere. I support reading and writing tabular data in various formats in all three of my software applications. It is an important part of my data transformation software. But all the tabular data formats suck. There doesn’t seem to be anything that is reasonably space efficient, simple and quick to parse, and text based (not binary), so you can view and edit it with a standard editor.
Most tabular data currently gets exchanged as: CSV, Tab separated, XML, JSON or Excel. And they are all highly sub-optimal for the job.
CSV is a mess. One quote in the wrong place and the file is invalid. It is difficult to parse efficiently using multiple cores, due to the quoting (you can’t start parsing from part way through a file). Different quoting schemes are in use. You don’t know what encoding it is in. Use of separators and line endings is inconsistent (sometimes comma, sometimes semicolon). Writing a parser to handle all the different dialects is not at all trivial. Microsoft Excel and Apple Numbers don’t even agree on how to interpret some edge cases for CSV.
Tab separated is a bit better than CSV. But it can’t store tabs and still has issues with line endings, encodings, etc.
XML and JSON are tree structures and not suitable for efficiently storing tabular data (plus other issues).
There is Parquet. It is very efficient with its columnar storage and compression. But it is binary, so it can’t be viewed or edited with standard tools, which is a pain.
Don’t even get me started on Excel’s proprietary, ghastly binary format.
Why can’t we have a format where:
- Encoding is always UTF-8
- Values stored in row major order (row 1, row 2, etc.)
- Columns are separated by \u001F (ASCII unit separator)
- Rows are separated by \u001E (ASCII record separator)
- Er, that’s the entire specification.
No escaping. If you want to put \u001F or \u001E in your data – tough, you can’t. Use a different format.
It would be reasonably compact, efficient to parse and easy to manually edit (Notepad++ shows the unit separator as a ‘US’ symbol). You could write a fast parser for it in minutes. Typing \u001F or \u001E in some editors might be a faff, but it is hardly a showstopper.
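To illustrate the "parser in minutes" claim, here is a minimal sketch of a reader and writer in Python. The names (`dumps`, `loads`, the `.usv` framing) are mine, not part of any spec, and it assumes rows are joined by a bare record separator with no trailing delimiter:

```python
# Sketch of a reader/writer for the proposed format (hypothetical helpers).
US = "\u001f"  # unit separator: between values in a row
RS = "\u001e"  # record separator: between rows

def dumps(rows):
    # No escaping exists, so refuse data containing either separator.
    for row in rows:
        for value in row:
            if US in value or RS in value:
                raise ValueError("value contains a separator character")
    return RS.join(US.join(row) for row in rows)

def loads(text):
    # Because there is no quoting, two splits are the whole parser.
    return [row.split(US) for row in text.split(RS)]

table = [["name", "notes"], ["Alice", "commas, and\nnewlines are fine"]]
assert loads(dumps(table)) == table
```

Note there is nothing like CSV’s quoting state to track, so a chunk of the file can be parsed independently on another core by splitting at any \u001E.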
It could be called something like “unicode separated value” (hat tip to @fakeunicode on Twitter for the name) or “unit separated value” with file extension .usv. Maybe a different extension could be used when values are stored in column major order (column 1, column 2, etc.).
Is there nothing like this already? Maybe there is and I just haven’t heard of it. If not, shouldn’t there be?
And yes I am aware of the relevant XKCD cartoon ( https://xkcd.com/927/ ).
** Edit 4-May-2022 **
It has been pointed out that the above will give you a single line of text in an editor, which is not great for human readability. A quick fix for this would be to make the record delimiter a \u001E character followed by an LF character. Any LF that comes immediately after a \u001E would be ignored when parsing. Any LF not immediately after a \u001E is part of the data. I don’t know about other editors, but it is easy to view and edit in Notepad++.
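The amended rule only adds one step to the parser sketch: after splitting on \u001E, drop a single leading LF from each record except the first, since that LF was only there for readability. A quick illustration (function name is mine):

```python
# Sketch of parsing with the amended rule: LF immediately after \u001e
# is cosmetic and ignored; any other LF is part of the data.
US, RS = "\u001f", "\u001e"

def loads(text):
    records = text.split(RS)
    # Strip one LF from the start of every record after the first.
    records = [records[0]] + [
        r[1:] if r.startswith("\n") else r for r in records[1:]
    ]
    return [record.split(US) for record in records]

text = "a\u001fb\u001e\nc\u001ffirst line\nsecond line"
assert loads(text) == [["a", "b"], ["c", "first line\nsecond line"]]
```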