Tag Archives: state machine

Why you can’t parse CSV with a regular expression

Regular expressions are a very useful tool in a programmer’s toolbox. But they can’t do everything. And one of the things they can’t do is to reliably parse CSV (comma separated value) files. This is because a regular expression doesn’t store state. You need a state machine (or something equivalent) to parse a CSV file.

For example, consider this (very short) CSV file (3 double quotes + 1 comma + 3 double quotes):


This is correctly interpreted as:

quote to start the data value + escaped quote + comma + escaped quote + quote to end the data value

E.g. a single value of:


How each character is interpreteted depends on what characters come before and after it. E.g. the first quote puts you into an ‘inside data’ state. The second quote puts you into a ‘might be an escaped for the following character or might be end of data’ state. The third quote puts you back into a ‘inside data’ state.

No matter how complicated a regex you come up with, it will always be possible to create a CSV file that your regex can’t correctly parse. And once the parsing goes wrong, everything after that point is probably garbage.

You can write a regex that can handle CSV file where you are guaranteed there are no commas, quotes or carriage returns in the data values. But commas, quotes or carriage returns in the data values are perfectly valid in CSV files. So it is only ever going to handle a subset of all the possible well-formed CSV files.

Note that you can parse a TSV (tab separated value) file with a regex, as TSV files are (generally!) not allowed to contain tabs or carriage returns in data and therefore don’t need escaping.

See also on Stackoverflow:

Using regular expressions to parse HTML: why not?