Tag Archives: excel

Why isn’t there a decent file format for tabular data?

Tabular data is everywhere. I support reading and writing tabular data in various formats in all 3 of my software application. It is an important part of my data transformation software. But all the tabular data formats suck. There doesn’t seem to be anything that is reasonably space efficient, simple and quick to parse and text based (not binary) so you can view and edit it with a standard editor.

Most tabular data currently gets exchanged as: CSV, Tab separated, XML, JSON or Excel. And they are all highly sub-optimal for the job.

CSV is a mess. One quote in the wrong place and the file is invalid. It is difficult to parse efficiently using multiple cores, due to the quoting (you can’t start parsing from part way through a file). Different quoting schemes are in use. You don’t know what encoding it is in. Use of separators and line endings are inconsistent (sometimes comma, sometimes semicolon). Writing a parser to handle all the different dialects is not at all trivial. Microsoft Excel and Apple Numbers don’t even agree on how to interpret some edge cases for CSV.

Tab separated is a bit better than CSV. But can’t store tabs and still has issues with line endings, encodings etc.

XML and JSON are tree structures and not suitable for efficiently storing tabular data (plus other issues).

There is Parquet. It is very efficient with it’s columnar storage and compression. But it is binary, so can’t be viewed or edited with standard tools, which is a pain.

Don’t even get me started on Excel’s proprietary, ghastly binary format.

Why can’t we have a format where:

  • Encoding is always UTF-8
  • Values stored in row major order (row 1, row2 etc)
  • Columns are separated by \u001F (ASCII unit separator)
  • Rows are separated by \u001E (ASCII record separator)
  • Er, that’s the entire specification.

No escaping. If you want to put \u001F or \u001E in your data – tough you can’t. Use a different format.

It would be reasonably compact, efficient to parse and easy to manually edit (Notepad++ shows the unit separator as a ‘US’ symbol). You could write a fast parser for it in minutes. Typing \u001F or \u001E in some editors might be a faff, but it is hardly a showstopper.

It could be called something like “unicode separated value” (hat tip to @fakeunicode on Twitter for the name) or “unit separated value” with file extension .usv. Maybe a different extension could used when values are stored in column major order (column1, column 2 etc).

Is there nothing like this already? Maybe there is and I just haven’t heard of it. If not, shouldn’t there be?

And yes I am aware of the relevant XKCD cartoon ( https://xkcd.com/927/ ).

** Edit 4-May-2022 **

“Javascript” -> “JSON” in para 5.

It has been pointed at the above will give you a single line of text in an editor, which is not great for human readability. A quick fix for this would be to make the record delimiter a \u001E character followed by an LF character. Any LF that comes immediately after an \u001E would be ignored when parsing. Any LF not immediately after an \u001E is part of the data. I don’t know about other editors, but it is easy to view and edit in Notepad++.

Easy Data Transform

I have been furiously coding a new product. Easy Data Transform. It is a Windows and Mac tool for transforming table and list data from one form to another. Joining, splitting, reformatting, filtering, sorting etc.

easydatatransform

I have been thinking about this product idea for years. In fact I threw together a janky prototype back in 2008. It allows you to perform various operations on a pair of lists.

list-weaver

I used this prototype for jobs such as creating a list of emails of people who had bought Perfect Table Plan v5, but hadn’t upgraded to v6 yet. It worked. But it wasn’t very good. The biggest annoyance was that each operation obliterated everything that came before. Which made it very easy to lose track of where you had got to. And there was no repeatability. It was also limited to lists and it became clear that I really needed something that could also handle tabular data. I never released it.

But the idea has been running as a background process in my brain for 11 years since. And I think I have come up with a much better design in that time. Finally I had mature, stable versions of my Perfect Table Plan and Hyper Plan products out, so I decided to go for it. I am really pleased with how it has turned out so far.

If you aren’t embarrassed by v1.0 you didn’t release it early enough. And so I have cut lots of corners to get this first public version out. The documentation is only part written. I created the application icon myself  in 10 minutes. There is no licensing. The GUI is lacking polish. The website would make a designer cry. But the software seems fairly robust. My 13 year old son wasn’t able to crash it after 10 minutes of trying, despite financial incentives to do so.

I did some market research and spoke to some people who knew a bit about this market. But I deliberately didn’t look closely at any competing products, as I didn’t want to be mentally restricted by what others have done. For better or worse, I want to blaze my own trail. Copying other people’s stuff is a zero-sum game with no net benefit to society.

Most of the things that Easy Data Transform you can do, you can also do in Excel or SQL. My claim is that it is much quicker, easier and less error prone to do in Easy Data Transform. No programming or scripting required. I am hoping that people will be able to start using it within a couple of minutes of downloading it (I plan to do lots of usability testing). Will people pay for that? I hope so. I’m not aiming it at programmers. Perish the thought.

Naming is hard. I came up with some 70 names. Things like ‘Data Hero’, ‘Transform Flow’, ‘Transmogrify’ and ‘Data Rapture’. But the domains were taken, people I asked hated them or there was an existing service or product with that name. So I ended up with Easy Data Transform. It does what it says on the tin.

Why desktop? Surely no-one is writing new desktop apps in 2019? I believe a desktop solution has some real advantages in this market. The biggest ones are:

  • You don’t need to load your (potentially highly sensitive) data on to a third party server.
  • Not having to upload and download (potentially very large) data sets makes it much more responsive.

Easy Data Transform is currently free for anyone to use. You can get it from the super-minimalist easydatatransform.com website. The current 0.9.0 version expires on the 4th August 2019. You will then be able to get another free version. Once the product is mature enough, and if I am convinced there is enough demand, I will release a paid version. The free beta will probably last several months. Please try it and let me know how you get on. I am particularly interested to get feedback from anyone using it for real day-to-day tasks.

Of course the real challenge is always marketing. How to get noticed amongst many competing products. As well as helping to improve the product I am hoping that this extended beta will also help me to get some traction and better understand the market. For example, what price to charge and what trial model to use. Watch this space.

Pivot Tables demystified

pivot tablesOnline businesses generate a lot of data. Sales data, marketing data, traffic data etc. Pivot tables can be a useful tool for analysing and extracting useful information from this sea of data. A lot of people seem to have heard of Pivot Tables without understanding what they are or how to use them. Despite the slightly cryptic name they are really just a way of summarizing tables of data. Nothing to be intimidated by. I’m going to try to demystify them here with a simple example.

Let’s imagine you have got an Excel spreadsheet with a month of (fictitious) sales transaction. It looks like this:

sales spreadsheet

You want to find out:

  1. The total value of sales of each product
  2. The number of sales of each product
  3. The total value of sales of each product by country
  4. The total value of sales of each product by day of the week

How would you do it? You can crunch the numbers using a calculator, but that is very tedious and error prone. If you are a programmer you can export the data to a .csv file and write a small program in your favourite language to read it in, crunch the data and write the results out. You probably think you can do it in 10 minutes, but I bet it will take you at least an hour to get it working and debugged. I did all the above in 1 minute and 20 seconds using Excel pivot tables.

Here is how you can do it (screenshots from Excel 2007). You can download the spreadsheet if you want to try it yourself.

Select all the data, including the header (A1 to D222).

Click on Insert in the Excel Ribbon bar.

Click on PivotTable.

pivot table 1

The Create Pivot Table window will appear. Click OK.

pivot table 2

The PivotTable Field List will appear. We want to find the number of sales for each product. Drag the product field to Column labels and the value field to Values. This gives us the total value of sales by product.

pivot table 3

To change this to the number of sales pull down the Sum of value drop-down list and select Value Field Settings….

pivot table 4

Change Summarize value field by from Sum to Count and click OK.

pivot table 5

The number of sales of each product are now shown.

pivot table 6

Now drag the country field to Row labels. Then click on Count of value, select Value Field Settings…, change it back to Sum and click OK. The total value of sales of each product by country is now displayed.

pivot table 7

We now need to add a day of the week column. Click on Sheet 1 and add a new column as shown.

pivot table 8

Now select all the data from A1 to E222 and insert a new pivot table, as before.

pivot table 9

Drag the fields as shown to get the total value of sales of each product by day of the week.

pivot table 10

(We’ll quietly ignore the fact that the number of days aren’t divisible by 7 in the data.)

We can even display by product, day of the week and country with one more drag.

pivot table 11

Excel also supports pivot charts if you prefer your results in graphical form.

pivot chart

Obviously this is a very simple example, but pivot tables can be used to quickly analyse much larger and more complex data sets. Next time you have got some data to analyse consider whether you would be better using pivot tables or pivot charts.

*** UPDATE ***

You can also use my Hyper Plan software to create pivot tables. It is a lot simpler than the above! There is a free trial.