Why you can’t parse CSV with a regular expression

Regular expressions are a very useful tool in a programmer’s toolbox. But they can’t do everything. And one of the things they can’t do is to reliably parse CSV (comma separated value) files. This is because a regular expression doesn’t store state. You need a state machine (or something equivalent) to parse a CSV file.

For example, consider this (very short) CSV file (3 double quotes + 1 comma + 3 double quotes):

“””,”””

This is correctly interpreted as:

quote to start the data value + escaped quote + comma + escaped quote + quote to end the data value

E.g. a single value of:

“,”

How each character is interpreteted depends on what characters come before and after it. E.g. the first quote puts you into an ‘inside data’ state. The second quote puts you into a ‘might be an escaped for the following character or might be end of data’ state. The third quote puts you back into a ‘inside data’ state.

No matter how complicated a regex you come up with, it will always be possible to create a CSV file that your regex can’t correctly parse. And once the parsing goes wrong, everything after that point is probably garbage.

You can write a regex that can handle CSV file where you are guaranteed there are no commas, quotes or carriage returns in the data values. But commas, quotes or carriage returns in the data values are perfectly valid in CSV files. So it is only ever going to handle a subset of all the possible well-formed CSV files.

Note that you can parse a TSV (tab separated value) file with a regex, as TSV files are (generally!) not allowed to contain tabs or carriage returns in data and therefore don’t need escaping.

See also on Stackoverflow:

Using regular expressions to parse HTML: why not?

14 thoughts on “Why you can’t parse CSV with a regular expression

  1. Charly

    I sell a parsing app myself, and I often receive emails asking how to parse an attached CSV.

    And then I wonder… do I reply with the quick and dirty way (a regular expression) or do I go with that C# code that most users don’t understand?

    I usually reply with the RegExp, it works 99.99% of the times. If the user comes back with a problem, I give them the script. Call me pragmatic.

    However, you make a very good point, Andy. Regular expressions do not hold state and therefore there are some tasks that they cannot perform, like proper CSV parsing. Actually, if they were able to save the state I guess they would be “Turing complete” and therefore you could even write any program just with RegExps. That would be amazing and terrific at the same time.

    Reply
    1. Tele

      No, this is wrong. Regular expressions hold state and are implemented using state machines (NFA / DFA). You don’t need to be Turing complete in order to parse CSV. Please learn the theory first (see the links I posted below)

      Reply
    2. Tele

      No, this is wrong. Regular expressions ARE IMPLEMENTED using a state machine. You don’t need to be Turing-complete in order to parse CSV. See the links I posted below.

      Reply
  2. Andy Brice Post author

    >and therefore you could even write any program just with RegExps. That would be amazing and terrific at the same time.

    And completely impossible to understand by anyone (including yourself 5 minutes after you had written it)?

    Reply
  3. Tele

    In fact, regular expressions ARE implemented using a state machine, so you are fundamentally wrong. It’s possible to parse a CSV using a regex, but the regex will be quite complex.

    Reply
    1. Tele

      Please google for “finite state machine”. Here is the correct regex to extract one field according to your rules. Note that there is no formal definition for CSV, e.g. some software may use ; instead of ,

      (?<=,|^)("(([^"]|"")*?)"|[^,"]+)(?:,|^)

      Reply
        1. Tele

          Please try (note the last $):

          (?<=,|^)("(([^"]|"")*?)"|[^,"]+)(?:,|$)

          And yes, [^"]+ matches CR+LF without any problem

          Reply
            1. Tele

              const regexp = /(?<=,|^)("((?:[^"]|"")*?)"|[^,"]+)(?:,|$)/g;
              const csv = '"a""b",cd,"def""\r\n"""';
              let m;
              while (m = regexp.exec(csv)) {
              console.log(m[2] ? m[2].replace(/""/g, '"') : m[1]);
              }

              Reply
              1. Andy Brice Post author

                So I tried your regex and it does indeed seem to handle escaping correctly to capture the first token. It uses some a technique that I am not really familiar with (positive look behind?). So I learnt something. But I’m not sure whether that is supported in all regex implementations (e.g. QRegExp in Qt 5, which I am most familiar with). Also the code above uses 2 regexes and a loop and still doesn’t handle breaking a CSV into rows. So I think I am still technically correct to say that “you can’t parse CSV with *A* regular expression”! ;0)

                Reply
                1. Tele

                  You can certainly think what you want, but you are WRONG. Please at least read something about parsing and compilers. The CSV grammar is regular, not context-free, so you can read it with a regular expression.

                  Breaking into lines is split(“\n”) or you can just read the file line-by-line. The positive lookbehind is not strictly necessary here, but the code is shorter with it.

What do you think?

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s