Category Archives: Easy Data Transform

What I learned spending $851 on Reddit Ads

I am always on the lookout for cost- and time-effective ways that I can market my software products. Previously, I have had quite a lot of success with Google Adwords Pay Per Click ads. However, the law of shitty clickthroughs means that advertising platforms generally get less and less profitable (for the advertisers) over time. And Google Adwords is a case study of that law in action. As Reddit is a less mature advertising platform, I thought it might still offer opportunities for a decent return. So I decided to experiment with advertising my data munging software, Easy Data Transform, on Reddit.

[By the way, I understand that nobody goes to Reddit because they want to see ads. But commercial products need to market themselves to survive, and Reddit probably wouldn’t exist without ads. Yay capitalism.]

Setup

The basic process to get started with Reddit Ads is:

  • Sign up for a Reddit Ads account.
  • Enter your details and credit card number.
  • Create a campaign.
  • Create one or more ad groups for your campaign. Choose a bid for each ad group, which countries you want it shown in and who you want it shown to.
  • Create one or more ads for each group.
  • Add the Reddit tracking pixel to every page of your website.
  • Set up conversion goals.

All pretty standard stuff for anyone who has used Google Adwords. The twist with Reddit is that you can advertise to communities (sub-Reddits), rather than based on search keywords. For example, Easy Data Transform is a much better tool for most data wrangling tasks than Excel, so I can bid to show ads targeted at Excel users in communities such as: reddit.com/r/excel/ and reddit.com/r/ExcelTips/.

Like Adwords, there are various ways to bid. I don’t want the advertising platform to set the bid prices for me (because I’m not insane), so I opted for fixed price bids of between $0.20 and $0.40 per click. Some of the ad groups suggested much higher bids than that. For example, the suggested bid for my Excel ad group is $0.79 to $4.79 per click!

However, Easy Data Transform is only a one time payment of $99. Paying more than $0.40 per click is unlikely to be profitable for me, especially when you factor in support costs. So that is the maximum I was prepared to bid. Also, the suggested bids are just the ad platform trying to push up the bid price. Something that anyone who has used Google Adwords will be all too familiar with. I was still able to get clicks, bidding significantly less than the recommended minimum.

I also set a daily maximum for each ad group, just in case I had messed up and added a zero in a bid somewhere.

I created multiple ads for each ad group, with a range of different text and images specific to the communities targeted. Here are some of the ones I ran in the Excel ad group:

Clicking ‘Learn more’ takes you to https://www.easydatatransform.com/.

I didn’t try to use edgy images or memes, because that isn’t really my style. There is an option to turn comments on below ads. As Reddit users are generally not well-disposed to ads, I didn’t try turning this on.

Based on hard-won experience with Google Adwords, I only set my ads to run in wealthy countries. I also restricted my ads to people on desktop devices as Easy Data Transform only runs on the desktop.

When Easy Data Transform is installed, it opens a page on my website with some instructions. So I used this to set up the Reddit conversion tracking to count the number of times a click ended up with a successful install of either the Windows or Mac version of Easy Data Transform.

I monitored the performance of the ads, disabled those that had poor click-through or conversion rates, and made variants of the more successful ones. Darwinian evolution for ads. I ended up creating 70 ads across 15 ad groups, targeting 50 communities.

I wasted an hour trying to get Reddit to recognize that I had installed their tracking pixel. But, overall, I found the Reddit Ads relatively simple to set up and monitor. Especially compared to the byzantine monstrosity that Google Adwords has become.

Reddit advertises a deal where you can get $500 of free ads.

But the link was broken when I clicked on it. Someone else I spoke to said they had tried to find out more, but gave up when they found out you had to have a phone call with a sales person at Reddit.

Results

I ran my experiment from 08-Jul-2025 to 31-Jul-2025. These are the stats, according to Reddit:

Spend                              $851.04
Impressions                        490,478
Clicks                             3,585
Windows installs                   177
Mac installs                       63
Total installs                     240
Click through rate                 0.73%
Cost per click                     $0.24
Click to install conversion rate   6.59%
Cost per install                   $3.55

I generally reckon that somewhere around 10% of people who install are going on to buy. So $3.55 per install would mean around $35.50 cost per sale, which is reasonable for a $99 sale. So that all looks quite encouraging.

But, comparing the Reddit number to the numbers I get from Google Analytics and my web logs, I think the Reddit numbers are dubious. At best. In a week when Reddit says it sent me 1174 clicks, Google Analytics says I received 590 referrals from Reddit and my web log says I received 639 referrals from Reddit. Some of the difference may be due to comparing sessions with clicks, time zones etc. But it looks fishy.

The discrepancy is even greater if you look at conversions. The total installs per week reported by Google Analytics and my web logs didn’t go up anything like you would expect from looking at the Reddit conversion numbers. If you dig a bit further, you find that Reddit uses ‘modeled conversions’ to:

“Gain a more complete view of your ads performance with modeled conversions, which leverages machine learning to bridge attribution gaps caused by signal loss.”

Uh huh. Sounds suspiciously like ‘making shit up’.

And then there are the sales. Or lack of. I don’t have detailed tracking of exactly where every sale comes from. But I estimate that my $851 outlay on ads resulted in between $0 and $400 in additional sales. Which is not good, given that I don’t have VC money to burn. Especially when you factor in the time taken to run this experiment.

The top 5 countries for spend were:

  1. Italy
  2. Spain
  3. France
  4. Germany
  5. Singapore

The US only accounted for 0.28% of impressions, 13 clicks and $3.81 in spend. Presumably because the US market is more competitive, and I wasn’t bidding enough to get my ads shown.

You can look at various breakdowns by country, community, device etc. This is helpful. But some of the breakdowns make no sense. For example, it says that 41% of the click throughs from people reading Mac-related communities were from Windows PCs. That sounds very unlikely!

But the worst is still to come. Feast your eyes on this Google Analytics data from my website:

Source                              Avg. engaged time per active user (s)   Engaged sessions per active user
Google / organic                    33                                      0.75
Successfulsoftware.net / referral   31                                      0.74
Youtube.com / referral              27                                      0.86
Chatgpt.com / referral              24                                      0.69
Google / CPC                        16                                      0.65
Reddit / referral                   8                                       0.25

8 seconds! That is the mean, not the median. Yikes. And 75% of the sessions didn’t result in any meaningful engagement. This makes me wonder if the majority of the Reddit clicks are accidental.

I had intended to spend $1000 on this experiment, but the results were sufficiently horrible that I stopped before then.

If I had spent a lot of time tweaking the ad images and text, landing pages, communities and countries, then I could probably have improved things a bit. But I doubt I could ever get a worthwhile return on my time and money.

If the lifetime value of a sale is a lot more than $99 for you, or your product is a good fit for Reddit, then Reddit Ads might be worth trying. But be sure not to take any Reddit numbers at face value.

20 years working on the same software product

I released version 1 of my table seating planning software, PerfectTablePlan, in February 2005. 20 years ago this month. It was a different world. A world of Windows, shareware and CDs. A lot has changed since then, but PerfectTablePlan is now at version 7 and still going strong.

PerfectTablePlan v1

PerfectTablePlan v7

I have released several other products since then, and done some training and consulting, but PerfectTablePlan remains my most successful product. Its success is due to a lot of hard work, and a certain amount of dumb luck.

I was getting married and I volunteered to do the seating plan for our wedding reception. It sounded like a relatively straightforward optimization problem, as we only had 60 guests and no family feuds to worry about. But it was surprisingly difficult to get right. I looked around for some software to help me. There were a couple of software packages, but I wasn’t impressed. I could do better myself! So I wrote a (very rough) first version, which I used for our wedding.

Things weren’t going great at my day job, at a small software startup. Maybe I could commercialize my table planner? I was a bit wary, as my potential competitors all seemed rather moribund and I didn’t think I would be able to make a living off it. But I thought I could do everything worth doing in 6-12 months and then start on the next product. Wrong on both counts!

Web-based software was still in its infancy in 2005. So I decided to write it as desktop software using C++ and cross-platform framework Qt, which I had plenty of experience in. Initially, I just released a Windows version. But I later added a Mac version as well. Qt has had its commercial ups and downs in the last 20 years, but it has grown with me and is now very robust, comprehensive and well documented. I think I made a good choice.

I financed PerfectTablePlan out of my own savings and it has been profitable every year since version 1 was launched. I could have taken on employees and grown the business, but I preferred to keep it as a lifestyle business. My wife does the accounts and proofreading and I do nearly everything else, with a bit of help from my accountant, web designers and a few other contractors. I don’t regret that decision. 20 years without meetings, ties or alarm clocks. My son was born 18 months after PerfectTablePlan was launched and it has been great to have the flexibility to be fully present as a Dad.

CDs, remember them? I sent out around 5,000 CDs (with some help from my father), before I stopped shipping CDs in 2016.

During the lifetime of PerfectTablePlan it became clear that things were increasingly moving to the web. But I couldn’t face rewriting PerfectTablePlan from scratch for the web. Javascript. Ugh. Also PerfectTablePlan is quite compute intensive, using a genetic algorithm to generate an automated seating plan, and I felt it was better running this on the customers’ local computers than my server. And some of my customers consider their seating plans to be confidential and don’t want to store them on third party servers. So I decided to stick with desktop. But, if I was starting PerfectTablePlan from scratch now, I might make a different decision.

Plenty of strange and wonderful things have happened over the last 20 years, including:

  • PerfectTablePlan has been used by some very famous organizations for some very famous events (which we mostly don’t have permission to mention). It has seated royalty, celebrities and heads of state.
  • PerfectTablePlan was used as part of a demonstration of the (controversial) first commercial quantum computer by D-Wave.
  • A mock-up of PerfectTablePlan, including icons I did myself, was used without our permission by Sony in their ‘Big day’ TV comedy series. I threatened them with legal action. Years later, I am still awaiting a reply.
  • I got to grapple with some interesting problems, including the mathematics of large combinatorial problems and elliptical tables. Some customers have seated 4,000 guests, and 4000! (4000 × 3999 × 3998 × … × 1) is a mind-bogglingly huge number.
  • A well known wedding magazine ran a promotion with a valid licence key clearly visible in a photograph of a PerfectTablePlan CD. I worked through the night to release a new version of PerfectTablePlan that didn’t work with this key.
  • I found out that CDs are edible.
  • I sponsored the building of a kindergarten in Nepal.
  • I once had to stay up late, in a state of some inebriation, to fix an issue so that a world famous event wasn’t a disaster (no I can’t tell you the event).
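To get a sense of just how huge 4000! is, you can count its decimal digits without ever computing the factorial itself, using the standard log-gamma function (the function name below is mine, just for illustration):

```cpp
#include <cmath>

// Count the decimal digits of n! without computing n! itself:
// log10(n!) = lgamma(n + 1) / ln(10), and digits = floor(log10(n!)) + 1.
long digitsInFactorial(long n) {
    double log10Factorial = std::lgamma(static_cast<double>(n) + 1.0) / std::log(10.0);
    return static_cast<long>(std::floor(log10Factorial)) + 1;
}
```

4000! works out at 12,674 decimal digits, so exhaustively trying every seating arrangement is hopeless; hence the genetic algorithm.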

The lowest point was the pandemic, when sales pretty much dropped to zero.

Competitors and operating systems have come and gone and the ecosystem for software has changed a lot, but PerfectTablePlan is still here and still paying the bills. It is about 145,000 lines of C++. Some of the code is a bit ugly and not how I would write it now. But the product is very solid, with very few bugs. The website and user documentation are also substantial pieces of work. The PDF version of the documentation is nearly 500 pages.

I now divide my time between PerfectTablePlan and my 2 other products: data wrangling software Easy Data Transform and visual planner Hyper Plan. Having multiple products keeps things varied and avoids having all my eggs in one basket. In May 2024 I released PerfectTablePlan v7 with a load of improvements and new features. And I have plenty of ideas for future improvements. I fully expect to keep working on PerfectTablePlan until I retire (I’m 59 now).

Easy Data Transform v2

I released Easy Data Transform v2 today. After no fewer than 80 (!) v1 production releases since 2019, this is the first paid upgrade.

Major improvements include:

  • Schema versioning, so you can automatically handle changes to the column structure of an input (e.g. additional or missing columns).
  • A new Verify transform so you can check a dataset has the expected values.

Currently there are 48 different verification checks you can make:

  • At least 1 non-empty value
  • Contains
  • Don’t allow listed values
  • Ends with
  • Integer except listed special value(s)
  • Is local file
  • Is local folder
  • Is lower case
  • Is sentence case
  • Is title case
  • Is upper case
  • Is valid EAN13
  • Is valid email
  • Is valid telephone number
  • Is valid UPC-A
  • Match column name
  • Matches regular expression
  • Maximum characters
  • Maximum number of columns
  • Maximum number of rows
  • Maximum value
  • Minimum characters
  • Minimum number of columns
  • Minimum number of rows
  • Minimum value
  • No blank values
  • No carriage returns
  • No currency
  • No digits
  • No double spaces
  • No duplicate column names
  • No duplicate values
  • No empty rows
  • No empty values
  • No gaps in values
  • No leading or trailing whitespace
  • No line feeds
  • No non-ASCII
  • No non-printable
  • No punctuation
  • No symbols
  • No Tab characters
  • No whitespace
  • Numeric except listed special value(s)
  • Only allow listed values
  • Require listed values
  • Starts with
  • Valid date in format
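Conceptually, each of these checks boils down to a predicate applied to every value in a column. A minimal sketch of two of them (the function names are my own invention, not Easy Data Transform’s internals):

```cpp
#include <cctype>
#include <string>

// 'No leading or trailing whitespace' as a per-value predicate.
bool noLeadingOrTrailingWhitespace(const std::string& v) {
    if (v.empty()) return true;  // nothing to trim
    return !std::isspace(static_cast<unsigned char>(v.front())) &&
           !std::isspace(static_cast<unsigned char>(v.back()));
}

// 'Maximum characters' as a per-value predicate.
bool withinMaximumCharacters(const std::string& v, std::size_t maxChars) {
    return v.size() <= maxChars;
}
```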

You can see any fails visually, with colour coding by severity.

Other major improvements include:

  • Side-by-side comparison of dataset headers.
  • Side-by-side comparison of dataset data values.
  • Lots of extra matching options for the Lookup transform, allowing you to do exotic lookups.

Plus lots of other changes.

In v1 there were issues related to how column-related changes cascaded through the system. This was the hardest thing to get right, and it took a fairly big redesign to fix all the issues. As a bonus, you can now disconnect and reconnect nodes, and it remembers all the column-based options (within certain limits). These changes make Easy Data Transform feel much more robust to use, as you can now make lots of changes without worrying too much about breaking things further downstream.

Easy Data Transform now supports:

  • 9 input formats (including various CSV variants, Excel, XML and JSON)
  • 66 different data transforms (such as Join, Filter, Pivot, Sample and Lookup)
  • 11 output formats (including various CSV variants, Excel, XML and JSON)
  • 56 text encodings

This allows you to snap together a sequence of nodes like Lego, to very quickly transform or analyse your data. Unlike a code-based approach (such as R or Python) or a command line tool, it is extremely visual, with pretty much instant feedback every time you make a change. Plus, no pesky syntax to remember.


Eating my own dogfood, using Easy Data Transform to create an email marketing campaign from various disparate data sources (mailing lists, licence key databases etc).

Easy Data Transform is all written in C++ with memory compression and reference counting, so it is fast and memory efficient and can handle multi-million row datasets with no problem.
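The reference-counting idea can be illustrated with a simple value pool that stores each distinct cell value once and hands out shared pointers to it. This is a sketch of the general technique, not the actual Easy Data Transform implementation:

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Deduplicate repeated cell values: each distinct string is stored once and
// shared (reference counted) between all the cells that contain it.
class ValuePool {
public:
    std::shared_ptr<const std::string> intern(const std::string& s) {
        auto it = pool_.find(s);
        if (it != pool_.end()) {
            if (auto existing = it->second.lock())
                return existing;  // reuse the already-stored value
        }
        auto fresh = std::make_shared<const std::string>(s);
        pool_[s] = fresh;  // weak_ptr, so the pool doesn't keep dead values alive
        return fresh;
    }
private:
    std::unordered_map<std::string, std::weak_ptr<const std::string>> pool_;
};
```

With a million-row column that contains only a handful of distinct values, a scheme like this stores each distinct string just once.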

While many of my competitors are transitioning to the web, Easy Data Transform remains a local tool for Windows and Mac. This has several major advantages:

  • Your sensitive data stays on your computer.
  • Less latency.
  • I don’t have to pay your compute and bandwidth costs, which means I can charge an affordable one-time fee for a perpetual licence.

I think privacy is only going to become ever more of a concern as rampaging AIs try to scrape every single piece of data they can find.

Usage-based fees for online data tools are no small matter. For a range of usage fee horror stories, such as enabling debug logging in a large production ETL pipeline resulting in $100k of extra costs in a week, see this Reddit post. Some of my customers have processed more than a billion rows in Easy Data Transform. Not bad for $99!

It has been a lot of hard work, but I am pleased with how far Easy Data Transform has come. I think Easy Data Transform is now a comprehensive, fast and robust tool for file-based data wrangling. If you have some data to wrangle, give it a try! It is only $99+tax ($40+tax if you are upgrading from v1) and there is a fully functional, 7 day free trial here:

Download Easy Data Transform v2

I am very grateful to my customers, who have been a big help in providing feedback. This has improved the product no end. Many heads are better than one!

The next big step is going to be adding the ability to talk directly to databases, REST APIs and other data sources. I also hope at some point to add the ability to visualize data using graphs and charts. Watch this space!

Visual vs text based programming, which is better?

Visual programming tools (also called ‘no-code’ or ‘low-code’) have been getting a lot of press recently. This, in turn, has generated a lot of discussion about whether visual or text based programming (coding) is ‘best’. As someone who uses text programming (C++) to create a visual programming data wrangling tool (Easy Data Transform) I have some skin in this game and have thought about it quite a bit.

At some level, everything is visual. Text is still visual (glyphs). By visual programming here I specifically mean software that allows you to program using nodes (boxes) and edges (arrows), laid out on a virtual canvas using drag and drop.

A famous example of this sort of drag and drop visual programming is Yahoo Pipes:

Yahoo Pipes
Credit: Tony Hirst

But there are many others, including my own Easy Data Transform:

Note that I’m not talking about Excel, Scratch or drag and drop GUI designers. Although some of the discussion might apply to them.

By text programming, I mean mainstream programming languages such as Python, Javascript or C++, and associated tools. Here is the Qt Creator Integrated Development Environment (IDE) that I use to write C++ in, to create Easy Data Transform:

The advantages of visual programming are:

  • Intuitive. Humans are very visual creatures. A lot of our brain is given over to visual processing and our visual processing bandwidth is high. Look at pretty much any whiteboard, at any company, and there is a good chance you will see boxes and arrows. Even in non-techie companies.
  • Quicker to get started. Drag and drop tools can allow you to start solving problems in minutes.
  • Higher level abstractions. Which means you can work faster (assuming they are the right abstractions).
  • Less hidden state. The connections between nodes are shown on screen, rather than you having to build an internal model in your own memory.
  • Less configuration. The system components work together without modification.
  • No syntax to remember. Which means it is less arcane for people who aren’t experienced programmers.
  • Fewer run-time errors, because the system generally won’t let you do anything invalid. You don’t have to worry about getting function names or parameter ordering and types right.
  • Immediate feedback on every action. No need to compile and run.

The advantages of text programming are:

  • Denser representation of information.
  • Greater flexibility. Easier to do things like looping and recursion.
  • Better tooling. There is a vast ecosystem of tools for manipulating text, such as editors and version control systems.
  • Less lock-in. You can generally move your C++ or Python code from one IDE to another without much problem.
  • More opportunities for optimization. Because you have lower-level access there is more scope to optimize speed and/or memory as required.

The advantages and disadvantages of each are two sides of the same coin. A higher level of abstraction makes things simpler, but also reduces the expressiveness and flexibility. The explicit showing of connections can make things clearer, but can also increase on-screen clutter.

The typical complaints you hear online about visual programming systems are:


It makes 95% of things easy and 5% of things impossible

Visual programming systems are not as flexible. However, many visual programming systems will let you drop down into text programming, when required, to implement that additional 5%. I think this hybrid approach does a lot to combine the strengths of both approaches.

It doesn’t scale to complex systems

Managing complex systems has been much improved over the years in text programming, using techniques such as hierarchy and encapsulation. But there is no reason these same techniques can’t also be applied to visual programming.

It isn’t high enough performance

The creators of a visual programming system are making a lot of design decisions for you. If you need to tune a system for high performance on a particular problem, then you probably need the low level control that text based programming allows. But with most problems you probably don’t care if it takes a few extra seconds to run, if you can do the programming in a fraction of the time. Also, a lot of visual programming systems are pretty fast. Easy Data Transform can join 2 one million row datasets on a laptop in ~5 seconds, which is faster than base R.
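For some intuition on why such joins can be fast: an in-memory join is typically a hash join, which is linear in the sizes of both inputs. A minimal sketch of the technique (illustrative only, not Easy Data Transform’s actual code):

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Row { std::string key; std::string value; };

// Hash join: build a hash index over one table, then probe it with the other.
// Roughly O(|left| + |right|), rather than O(|left| * |right|) for nested loops.
std::vector<std::pair<Row, Row>> hashJoin(const std::vector<Row>& left,
                                          const std::vector<Row>& right) {
    std::unordered_multimap<std::string, const Row*> index;
    for (const auto& r : right)
        index.emplace(r.key, &r);                 // build phase
    std::vector<std::pair<Row, Row>> out;
    for (const auto& l : left) {                  // probe phase
        auto [lo, hi] = index.equal_range(l.key);
        for (auto it = lo; it != hi; ++it)
            out.push_back({l, *it->second});
    }
    return out;
}
```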

It ends up as spaghetti

Labview spaghetti from DailyWTF
Unreal Blueprint spaghetti from reddit.com/r/ProgrammerHumor/

I’m sure we’ve all seen examples of spaghetti diagrams. But you can also create horrible spaghetti code with text programming. Also, being able to immediately see that a visual program has been sloppily constructed might serve as a useful cue.

If you are careful to lay out your nodes, you can keep things manageable (ravioli, rather than spaghetti). But it starts to become tricky when you have 50+ nodes with a moderate to high degree of connectivity, especially if there is no support for hierarchy (nodes within nodes).

Automatic layout of graphs for easier comprehension (e.g. to minimize line crossings) is hard: crossing minimization is NP-hard, in the same league as the ‘travelling salesman’ problem.

No support for versioning

It is possible to version visual programming tools if they store the information in a text based file (e.g. XML). Trying to diff raw XML isn’t ideal, but some visual programming tools do have built-in diff and merge tools.

It isn’t searchable

There is no reason why visual programming tools should not be searchable.

Too much mousing

Professional programmers love their keyboard shortcuts. But there is no reason why visual programming tools can’t also make good use of keyboard shortcuts.

Vendor lock-in

Many visual programming tools are proprietary, which means the cost can be high for switching from one to another. So, if you are going to invest time and/or money heavily in a visual programming tool, take time to make a good choice and consider how you could move away from it if you need to. If you are doing quick and dirty one-offs to solve a particular problem that you don’t need to solve again, then this doesn’t really matter.

‘No code’ just means ‘someone else’s code’

If you are using Python+Pandas or R instead of Easy Data Transform, then you are also balancing on top of an enormous pile of someone else’s code.

We are experts, we don’t need no stinkin drag and drop

If you are an experienced text programmer, then you aren’t really the target market for these tools. Easy Data Transform is aimed at the analyst or business guy trying to wrangle a motley collection of Excel and CSV files, not the professional data scientist who dreams in R or Pandas. However even a professional code jockey might find visual tools faster for some jobs.


Both visual and text programming have their places. Visual programming is excellent for exploratory work and prototyping. Text based programming is almost always a better choice for experts creating production systems where performance is important. When I want to analyse some sales data, I use Easy Data Transform. But when I work on Easy Data Transform itself, I use C++.

Text programming is more mature than visual programming. FORTRAN appeared in the 1950s. Applications with graphical user interfaces only started becoming mainstream in the 1980s. Some of the shortcomings with visual programming reflect its relative lack of maturity and I think we can expect to see continued improvements in the tooling associated with visual programming.

Visual programming works best in specific domains, such as:

  • 3d graphics and animations
  • image processing
  • audio processing
  • game design
  • data wrangling

These domains tend to have:

  • A single, well defined data type. Such as a table of data (dataframe) for data wrangling.
  • Well defined abstractions. Such as join, to merge 2 tables of data using a common key column.
  • A relatively straightforward control flow. Typically a step-by-step pipeline, without loops, recursion or complex control flow.
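That combination — one well-defined data type, well-defined transforms, and linear control flow — is why the node model maps so cleanly onto these domains. A toy sketch of the idea (the types and names here are mine, purely for illustration):

```cpp
#include <functional>
#include <string>
#include <vector>

// A single, well-defined data type: a table as rows of cells.
using Table = std::vector<std::vector<std::string>>;
// A node is one well-defined transform over that data type.
using Node = std::function<Table(const Table&)>;

// A straightforward control flow: each node feeds the next,
// with no loops or recursion, as on a visual canvas.
Table runPipeline(Table table, const std::vector<Node>& nodes) {
    for (const auto& node : nodes)
        table = node(table);
    return table;
}
```

A ‘filter’ node is then just a function that returns the rows it keeps, and chaining nodes is a simple fold over the pipeline.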

My teenage son has been able to do some (I think) pretty impressive 3D modelling and animations just with Blender’s visual tools.

Visual programming has been much less successful when applied to generic programming, where you need lots of different data types, a wide range of abstractions and potentially complex control flow.

I’ve been a professional software developer since 1987. People (mostly in marketing) have talked about replacing code and programmers with point and click tools for much of that time. That is clearly not going to happen. Text programming is the best approach for some kinds of problems and will remain so for the foreseeable future. But domain-specific visual programming can be very powerful and has a much lower barrier to entry. Visual programming empowers people to do things that might be out of their reach with text programming and might never get done if they have to wait for the IT department to do it.

So, unsurprisingly, the answer to ‘which is better?’ is very much ‘it depends’. Both have their place and neither is going away.

Further reading:

Hacker News folk wisdom on visual programming

Visual Programming Codex

The life and times of Yahoo Pipes

The ‘No Code’ Delusion and HN discussion

‘Visual programming doesnt suck’ HN discussion (original article seems to have disappeared)

Visual Programming Languages – Snapshots

A Personal History of Visual Programming Environments

Is the future of data science drag and drop?

Rethinking Visual Programming with Go

Responses to this post on Reddit:

reddit.com/r/Programminglanguages

reddit.com/r/nocode

reddit.com/r/datascience

Winterfest 2023

Winterfest 2023 is on. Loads of quality software for Mac and Windows from independent vendors, at a discount. This includes my own Easy Data Transform and Hyper Plan, which are on sale with a 25% discount.

Find out more at artisanalsoftwarefestival.com.

What is the index of an empty string in an empty string?

This sounds like a question a programmer might ask after one medicinal cigarette too many. The computer science equivalent of “what is the sound of one hand clapping?”. But it is a question I have to decide the answer to.

I am adding indexOf() and lastIndexOf() operations to the Calculate transform of my data wrangling (ETL) software (Easy Data Transform). This will allow users to find the offset of one string inside another, counting from the start or the end of the string. Easy Data Transform is written in C++ and uses the Qt QString class for strings. There are indexOf() and lastIndexOf() methods for QString, so I thought this would be an easy job to wrap that functionality. Maybe 15 minutes to program it, write a test case and document it.

Obviously it wasn’t that easy, otherwise I couldn’t be writing this blog post.

First of all, what is the index of “a” in “abc”? 0, obviously. QString("abc").indexOf("a") returns 0. Duh. Well only if you are a (non-Fortran) programmer. Ask a non-programmer (such as my wife) and they will say: 1, obviously. It is the first character. Duh. Excel FIND("a", "abc") returns 1.

OK, most of my customers aren’t programmers. I can use 1-based indexing.

But then things get more tricky.

What is the index of an empty string in “abc”? 1 maybe, using 1-based indexing or maybe empty is not a valid value to pass.

What is the index of an empty string in an empty string? Hmm. I guess the empty string does contain an empty string, but at what index? 1 maybe, using 1-based indexing, except there isn’t a first position in the string. Again, maybe empty is not a valid value to pass.

I looked at the Qt C++ QString, Javascript string and Excel FIND() function for answers. But they each give different answers and some of them aren’t even internally consistent. This is a simple comparison of the first index or last index of text v1 in text v2 in each (Excel doesn’t have an equivalent of lastIndexOf() that I am aware of):

Changing these to make all the valid results 1-based and setting invalid results to -1, for easy comparison:

So:

  • Javascript disagrees with C++ QString and Excel on whether the first index of an empty string in an empty string is valid.
  • Javascript disagrees with C++ QString on whether the last index of an empty string in a non-empty string is the index of the last character or 1 after the last character.
  • C++ QString thinks the first index of an empty string in an empty string is the first character, but the last index of an empty string in an empty string is invalid.

It seems surprisingly difficult to come up with something intuitive and consistent! I think I am probably going to return an error message if either or both values are empty. This seems to me to be the only unambiguous and consistent approach.

I could return a 0 for a non-match or when one or both values are empty, but I think it is important to return different results in these 2 different cases. Also, not found and invalid feel qualitatively different to a calculated index to me, so shouldn’t be just another number. What do you think?

*** Update 14-Dec-2023 ***

I’ve been around the houses a bit more following feedback on this blog, the Easy Data Transform forum and Hacker News, and this is what I have decided:

IndexOf() v1 in v2:

v1        v2             IndexOf(v1, v2)
""        ""             1
"ab"      "a"            (blank)
""        "aba"          1
"a"       "a"            1
"a"       "aba"          1
"x"       "y"            (blank)
"world"   "hello world"  7
This is the same as Excel FIND() and differs from Javascript indexOf() (ignoring the difference in 0 or 1 based indexing) only for "".indexOf(""), which returns -1 in Javascript.

LastIndexOf() v1 in v2:

v1        v2             LastIndexOf(v1,v2)
""        ""             1
"aba"     ""             (blank)
""        "aba"          4
"a"       "a"            1
"a"       "aba"          3
"x"       "y"            (blank)
"world"   "hello world"  7

Similarly, allowing for the difference between 0 and 1 based indexing, this matches Javascript lastIndexOf(), including for “”.lastIndexOf(“”), which returns 0 in Javascript.

Conceptually, the index is the 1-based position at which, if V1 were removed from V2, it would have to be re-inserted to restore V2: the first such position for IndexOf() and the last for LastIndexOf(). Thanks to layer8 on Hacker News for clarifying this.

Javascript and C++ QString return an integer, and both use -1 as a placeholder value. But Easy Data Transform returns a string (which can be interpreted as a number, depending on the transform), so we aren’t bound to a numeric value. So I have left the result blank where there is no valid result.
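As a sketch (this is not Easy Data Transform’s actual implementation), the chosen semantics can be expressed on top of Javascript’s built-ins, handling the empty-in-empty case explicitly and returning blank where there is no valid result:

```javascript
// Sketch of the chosen semantics, returning a 1-based index as a string,
// or '' (blank) where there is no valid result.
function indexOf1(v1, v2) {
  if (v1 === '' && v2 === '') return '1'; // defined as valid: position 1
  const i = v2.indexOf(v1);               // -1 when v1 is not found in v2
  return i === -1 ? '' : String(i + 1);   // convert 0-based to 1-based
}

function lastIndexOf1(v1, v2) {
  if (v1 === '' && v2 === '') return '1';
  const i = v2.lastIndexOf(v1);
  return i === -1 ? '' : String(i + 1);
}
```

These reproduce the two tables above, e.g. lastIndexOf1('', 'aba') gives '4' and indexOf1('x', 'y') gives blank.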

Now I’ve spent enough time down this rabbit hole and need to get on with something else! If you don’t like it you can always add an If with Calculate or use a Javascript transform to get the result you prefer.

*** Update 15-Dec-2023 ***

Quite a bit of debate on this topic on Hacker News.

Summerfest 2023

Summerfest 2023 is on. Loads of quality software for Mac and Windows from independent vendors, at a discount, including my own Easy Data Transform and Hyper Plan, which are both 25% off.

Find out more at artisanalsoftwarefestival.com.

Winterfest 2022

Easy Data Transform and Hyper Plan Professional edition are both on sale for 25% off at Winterfest 2022. So now might be a good time to give them a try (both have free trials). There are also some other great products from other small vendors on sale, including Tinderbox, Scrivener and DEVONthink. Some of the software is Mac only, but Easy Data Transform and Hyper Plan are available for both Mac and Windows (one license covers both OSs).

Easy Data Transform progress

I have been gradually improving my data wrangling tool, Easy Data Transform, putting out 70 public releases since 2019. While the product’s emphasis is on ease of use, rather than pure performance, I have been trying to make it fast as well, so it can cope with the multi-million row datasets customers like to throw at it. To see how I was doing, I did a simple benchmark of the most recent version of Easy Data Transform (v1.37.0) against several other desktop data wrangling tools. The benchmark did a read, sort, join and write of a 1 million row CSV file. I did the benchmarking on my Windows development PC and my Mac M1 laptop.

Easy Data Transform screenshot

Here is an overview of the results:

Time by task (seconds), on Windows without Power Query (smaller is better):

data wrangling/ETL benchmark Windows

I have left Excel Power Query off this graph, as it is so slow you can hardly see the other bars when it is included!

Time by task (seconds) on Mac (smaller is better):

data wrangling/ETL benchmark M1 Mac

Memory usage (MB), Windows vs Mac (smaller is better):

data wrangling/ETL benchmark memory Windows vs Mac

So Easy Data Transform is nearly as fast as its nearest competitor, Knime, on Windows and a fair bit faster on an M1 Mac. It also uses a lot less memory than Knime. However, we have some way to go to catch up with the Pandas library for Python and the data.table package for R when it comes to raw performance. Hopefully I can get nearer to their performance in time. I was forbidden from including benchmarks for Tableau Prep and Alteryx by their licensing terms, which seems unnecessarily restrictive.

Looking at just the Easy Data Transform results, it is interesting to notice that a newish MacBook Air M1 laptop is significantly faster than an AMD Ryzen 7 desktop PC from a few years ago.

Windows vs Mac M1 benchmark

See the full comparison:

Comparison of data wrangling/ETL tools : R, Pandas, Knime, Power Query, Tableau Prep, Alteryx and Easy Data Transform, with benchmarks

Got some data to clean, merge, reshape or analyze? Why not download a free trial of Easy Data Transform? No sign up required.

Creating a Mac Universal binary for Intel and ARM M1/M2 with Qt

Apple has transitioned Macs from Intel to ARM (M1/M2) chips. In the process it has provided an emulation layer (Rosetta2) to ensure that the new ARM Macs can still run applications created for Intel Macs. The emulation works very well, but is quoted to be some 20% slower than running native ARM binaries. That may not seem like a lot, but it is significant on processor intensive applications such as my own data wrangling software, which often processes datasets with millions of rows through complex sequences of merging, splitting, reformatting, filtering and reshaping. Also people who have just spent a small fortune on a shiny new ARM Mac can get grumpy about not having a native ARM binary to run on it. So I have been investigating moving Easy Data Transform from an Intel binary to a Universal (‘fat'[1]) binary containing both Intel and ARM binaries. This is a process familiar from moving my seating planner software for Mac from PowerPC to Intel chips some years ago. Hopefully I will have retired before the next chip change on the Mac.

My software is built on top of the excellent Qt cross-platform framework. Qt announced support for Mac Universal binaries in Qt 6.2 and Qt 5.15.9. I am sticking with Qt 5 for now, because it better supports multiple text encodings and because I don’t see any particular advantage to switching to Qt 6 yet. But there is a wrinkle: Qt 5.15.3 and later are only available to Qt customers with commercial licenses. I want to use the QtCharts component in Easy Data Transform v2, and QtCharts requires a commercial license (or GPL, which is a no-go for me). I also want access to all the latest bug fixes for Qt 5. So I decided to switch from the free LGPL license and buy a commercial Qt license. Thankfully I was eligible for the Qt small business license, which is currently $499 per year. The push towards commercial licensing is controversial with Qt developers, but I really appreciate Qt and all the work that goes into it, so I am happy to support the business (not enough to pay the eye-watering fee for a full enterprise license though!).

Moving from producing an Intel binary using LGPL Qt to producing a Universal binary using commercial Qt involved several major stumbling points that took me hours and a lot of googling to sort out. I’m going to spell them out here to save you that pain. You’re welcome.

  • The latest Qt 5 LTS releases are not available via the Qt maintenance tool if you have open source Qt installed. After you buy your commercial licence you need to delete your open source installation and all the associated license files. Here is the information I got from Qt support:
I assume that you were previously using open source version, is that correct?

Qt 5.15.10 should be available through the maintenance tool but it is required to remove the old open source installation completely and also remove the open source license files from your system.

So, first step is to remove the old Qt installation completely. Then remove the old open source licenses which might exist. Instructions for removing the license files:

****************************
Unified installer/maintenancetool/qtcreator will save all licenses (downloaded from the used Qt Account) inside the new qtlicenses.ini file. You need to remove the following files to fully reset the license information.

Windows
"C:/Users/%USERNAME%/AppData/Roaming/Qt/qtlicenses.ini"
"C:/Users/%USERNAME%/AppData/Roaming/Qt/qtaccount.ini"

Linux
"/home/$USERNAME/.local/share/Qt/qtlicenses.ini"
"/home/$USERNAME/.local/share/Qt/qtaccount.ini"

OS X
"/Users/$USERNAME/Library/Application Support/Qt/qtlicenses.ini"
"/Users/$USERNAME/Library/Application Support/Qt/qtaccount.ini"

As a side note: If the files above cannot be found $HOME/.qt-license(Linux/macOS) or %USERPROFILE%\.qt-license(Windows) file is used as a fallback. .qt-license file can be downloaded from Qt Account. https://account.qt.io/licenses
Be sure to name the Qt license file as ".qt-license" and not for example ".qt-license.txt".

***********************************************************************

After removing the old installation and the license files, please download the new online installer via your commercial Qt Account.
You can login there at:
https://login.qt.io/login

After installing Qt with commercial license, it should be able to find the Qt 5.15.10 also through the maintenance tool in addition to online installer.
  • Then you need to download the commercial installer from your online Qt account and reinstall all the Qt versions you need. Gigabytes of it. Time to drink some coffee. A lot of coffee.
  • In your .pro file you need to add:
macx {
    QMAKE_APPLE_DEVICE_ARCHS = x86_64 arm64
}
  • Note that the above doubles the build time of your application, so you probably don’t want it set for day to day development.
  • You can use macdeployqt to create your deployable Universal .app but, and this is the critical step that took me hours to work out, you need to use <QtDir>/macos/bin/macdeployqt not <QtDir>/clang_64/bin/macdeployqt . Doh!
  • You can check the .app is Universal using the lipo command, e.g.:
lipo -detailed_info EasyDataTransform.app/Contents/MacOS/EasyDataTransform
  • I was able to use my existing practice of copying extra files (third party libraries, help etc) into the .app file and then digitally signing everything using codesign --deep [2]. Thankfully the only third party library I use apart from Qt (the excellent libXL library for Excel) is available as a Universal framework.
  • I notarize the application, as before.
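On the build-time point above: one way to avoid building both architectures during day to day development is to scope the setting to release builds only. A sketch (the release-only scoping is my own suggestion, not from the Qt documentation):

```qmake
# Only build a Universal binary for release builds, so day to day
# debug builds stay single-architecture and build at normal speed.
macx {
    CONFIG(release, debug|release) {
        QMAKE_APPLE_DEVICE_ARCHS = x86_64 arm64
    }
}
```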

I did all the above on an Intel iMac using the latest Qt 5 LTS release (Qt 5.15.10) and Xcode 13.4 on macOS 12. I then tested it on an ARM MacBook Air. No doubt you can also build Universal binaries on an ARM Mac.

Unsurprisingly the Universal app is substantially larger than the Intel-only version. My Easy Data Transform .dmg file (which also includes a lot of help documentation) went from ~56 MB to ~69 MB. However that is still positively anorexic compared to many bloated modern apps (looking at you Electron).

A couple of tests I did on an ARM MacBook Air showed ~16% improvement in performance. For example joining two 500,000 row x 10 column tables went from 4.5 seconds to 3.8 seconds. Obviously the performance improvement depends on the task and the system. One customer reported batch processing 3,541 JSON Files and writing the results to CSV went from 12.8 to 8.1 seconds, a 37% improvement.

[1] I’m not judging.

[2] Apparently the use of --deep is frowned on by Apple. But it works (for now anyway). Bite me, Apple.