Why isn’t there a decent file format for tabular data?

Tabular data is everywhere. I support reading and writing tabular data in various formats in all 3 of my software application. It is an important part of my data transformation software. But all the tabular data formats suck. There doesn’t seem to be anything that is reasonably space efficient, simple and quick to parse and text based (not binary) so you can view and edit it with a standard editor.

Most tabular data currently gets exchanged as: CSV, Tab separated, XML, JSON or Excel. And they are all highly sub-optimal for the job.

CSV is a mess. One quote in the wrong place and the file is invalid. It is difficult to parse efficiently using multiple cores, due to the quoting (you can’t start parsing from part way through a file). Different quoting schemes are in use. You don’t know what encoding it is in. Use of separators and line endings are inconsistent (sometimes comma, sometimes semicolon). Writing a parser to handle all the different dialects is not at all trivial. Microsoft Excel and Apple Numbers don’t even agree on how to interpret some edge cases for CSV.

Tab separated is a bit better than CSV. But can’t store tabs and still has issues with line endings, encodings etc.

XML and JSON are tree structures and not suitable for efficiently storing tabular data (plus other issues).

There is Parquet. It is very efficient with it’s columnar storage and compression. But it is binary, so can’t be viewed or edited with standard tools, which is a pain.

Don’t even get me started on Excel’s proprietary, ghastly binary format.

Why can’t we have a format where:

  • Encoding is always UTF-8
  • Values stored in row major order (row 1, row2 etc)
  • Columns are separated by \u001F (ASCII unit separator)
  • Rows are separated by \u001E (ASCII record separator)
  • Er, that’s the entire specification.

No escaping. If you want to put \u001F or \u001E in your data – tough you can’t. Use a different format.

It would be reasonably compact, efficient to parse and easy to manually edit (Notepad++ shows the unit separator as a ‘US’ symbol). You could write a fast parser for it in minutes. Typing \u001F or \u001E in some editors might be a faff, but it is hardly a showstopper.

It could be called something like “unicode separated value” (hat tip to @fakeunicode on Twitter for the name) or “unit separated value” with file extension .usv. Maybe a different extension could used when values are stored in column major order (column1, column 2 etc).

Is there nothing like this already? Maybe there is and I just haven’t heard of it. If not, shouldn’t there be?

And yes I am aware of the relevant XKCD cartoon ( https://xkcd.com/927/ ).

** Edit 4-May-2022 **

“Javascript” -> “JSON” in para 5.

It has been pointed at the above will give you a single line of text in an editor, which is not great for human readability. A quick fix for this would be to make the record delimiter a \u001E character followed by an LF character. Any LF that comes immediately after an \u001E would be ignored when parsing. Any LF not immediately after an \u001E is part of the data. I don’t know about other editors, but it is easy to view and edit in Notepad++.

11 thoughts on “Why isn’t there a decent file format for tabular data?

  1. Andy Brice Post author

    People have pointed out that all the data in a US/RS delimited file would be shown on 1 line in a text editor. This obviously isn’t ideal. But you could get around it by saying that any LF that appears immediately after an RS is ignored (i.e. not considered to be data).

    Reply
  2. rhettbull

    I like it! But what about using SQLite as the format? It’s not text but it’s basically universal (runs on every platform), fast, lots of tools available for using it, and you get indexes.

    Reply
  3. Chris

    “Javascript” is not the file format you had in mind… Sure meant JSON.
    This gives impression you have no idea what you’re talking about.

    CSV may not be optimal. But it is used successfully in so many production cases… Like exchanging huge amounts of data between systems. Just saying…

    Reply
  4. toolleeo

    IMHO the problem is not to have a brand new format with wonderful features, but having EXISTING programs supporting available formats in a consistent way.
    After all, even if you introduce a new format, existing ones will stay around. And existing programs will have to support the new format or you miss the portability of data.
    In the end, you need existing programs to be updated for supporting the new format. And therefore, why not just updating the program to support an existing format consistently?

    E.g., TSV is a very good format, rather robust, if you manage it consistently (in my experience, libreoffice does a very good job in that, as well as python). The problem comes when you want it managed by unsupportive tools, such as excel… Which confirms that the problem is mainly related to the tools, not solely to the format.

    Reply
  5. Igor Novgorodov

    There are various binary formats used in BigData world like Parquet – that’s probably what you want. It’s typed, terse etc.

    Reply
  6. mftb

    usv/adt is actually a really good idea! Related to line-endings, if such a format caught on it would not be difficult for any reasonable text editor to handle gracefully. We’re talking about ASCII characters after all.

    Reply

What do you think?

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s