Regular expressions are a very useful tool in a programmer’s toolbox. But they can’t do everything. And one of the things they can’t do is to reliably parse CSV (comma separated value) files. This is because a regular expression doesn’t store state. You need a state machine (or something equivalent) to parse a CSV file.
For example, consider this (very short) CSV file (3 double quotes + 1 comma + 3 double quotes):
“””,”””
This is correctly interpreted as:
quote to start the data value + escaped quote + comma + escaped quote + quote to end the data value
E.g. a single value of:
“,”
How each character is interpreteted depends on what characters come before and after it. E.g. the first quote puts you into an ‘inside data’ state. The second quote puts you into a ‘might be an escaped for the following character or might be end of data’ state. The third quote puts you back into a ‘inside data’ state.
No matter how complicated a regex you come up with, it will always be possible to create a CSV file that your regex can’t correctly parse. And once the parsing goes wrong, everything after that point is probably garbage.
You can write a regex that can handle CSV file where you are guaranteed there are no commas, quotes or carriage returns in the data values. But commas, quotes or carriage returns in the data values are perfectly valid in CSV files. So it is only ever going to handle a subset of all the possible well-formed CSV files.
Note that you can parse a TSV (tab separated value) file with a regex, as TSV files are (generally!) not allowed to contain tabs or carriage returns in data and therefore don’t need escaping.
See also on Stackoverflow:
I sell a parsing app myself, and I often receive emails asking how to parse an attached CSV.
And then I wonder… do I reply with the quick and dirty way (a regular expression) or do I go with that C# code that most users don’t understand?
I usually reply with the RegExp, it works 99.99% of the times. If the user comes back with a problem, I give them the script. Call me pragmatic.
However, you make a very good point, Andy. Regular expressions do not hold state and therefore there are some tasks that they cannot perform, like proper CSV parsing. Actually, if they were able to save the state I guess they would be “Turing complete” and therefore you could even write any program just with RegExps. That would be amazing and terrific at the same time.
No, this is wrong. Regular expressions hold state and are implemented using state machines (NFA / DFA). You don’t need to be Turing complete in order to parse CSV. Please learn the theory first (see the links I posted below)
No, this is wrong. Regular expressions ARE IMPLEMENTED using a state machine. You don’t need to be Turing-complete in order to parse CSV. See the links I posted below.
>and therefore you could even write any program just with RegExps. That would be amazing and terrific at the same time.
And completely impossible to understand by anyone (including yourself 5 minutes after you had written it)?
Sure! I am sometimes barely able to understand really complex Regexps. A good example is parsing email addresses. Someone came up with a crazy complex one is stackoverflow: https://stackoverflow.com/questions/201323/how-can-i-validate-an-email-address-using-a-regular-expression
In fact, regular expressions ARE implemented using a state machine, so you are fundamentally wrong. It’s possible to parse a CSV using a regex, but the regex will be quite complex.
Please google for “finite state machine”. Here is the correct regex to extract one field according to your rules. Note that there is no formal definition for CSV, e.g. some software may use ; instead of ,
(?<=,|^)("(([^"]|"")*?)"|[^,"]+)(?:,|^)
I tried your regular expression at https://regex101.com/ on “””,”””. It didn’t match “,”. Also I don’t think it handles carriage returns in data.
Please try (note the last $):
(?<=,|^)("(([^"]|"")*?)"|[^,"]+)(?:,|$)
And yes, [^"]+ matches CR+LF without any problem
And can you make it so it doesn’t capture the escaping quotes?
const regexp = /(?<=,|^)("((?:[^"]|"")*?)"|[^,"]+)(?:,|$)/g;
const csv = '"a""b",cd,"def""\r\n"""';
let m;
while (m = regexp.exec(csv)) {
console.log(m[2] ? m[2].replace(/""/g, '"') : m[1]);
}
So I tried your regex and it does indeed seem to handle escaping correctly to capture the first token. It uses some a technique that I am not really familiar with (positive look behind?). So I learnt something. But I’m not sure whether that is supported in all regex implementations (e.g. QRegExp in Qt 5, which I am most familiar with). Also the code above uses 2 regexes and a loop and still doesn’t handle breaking a CSV into rows. So I think I am still technically correct to say that “you can’t parse CSV with *A* regular expression”! ;0)
You can certainly think what you want, but you are WRONG. Please at least read something about parsing and compilers. The CSV grammar is regular, not context-free, so you can read it with a regular expression.
Breaking into lines is split(“\n”) or you can just read the file line-by-line. The positive lookbehind is not strictly necessary here, but the code is shorter with it.
Some theory for you:
https://swtch.com/~rsc/regexp/
Regular expressions are implemented using state machines (NFA / DFA). Parsing HTML is a different story because HTML is a context-free language (requiring a push-down automata to recognize), but CSV is a regular language (requiring a state machine). See https://en.wikipedia.org/wiki/Chomsky_hierarchy#The_hierarchy