Regex fun with CSV
I recently needed to do some escaping of a semicolon separated file. This was because a lot of the fields contained data that contained semicolons, and they had forgot to escape those fields within quotes.
I was lucky that the bad behaving fields contained structured data, SOAP requests in this case. So I came up with an expression to match the this. Because multiple SOAP datafields existed, I needed to do a non-greedy match to find the first occurence of the closing tag.
I also needed to do something with different parts of the match, so I created some namend groups for that
This resulted in the expression below:
(?<START><?xml)(?<DATA>.*?)(?<END></soap:Envelope>;)
Then I noticed some of the fields were incorreclty matched. The reason was that in the file, a field had a maximum size of 4000 characters. So in that case I was’nt guaranteed to find a match on the end tag when a field was overflowed.
I needed an expression to match either on the semicolon after 4000 characters after start or on the end tag in case the match was less than 400 characters.
I came up with the following solution:
(?<START><?xml)
(?:
(?<DATA>.{3995})(?<END>;)
|
(?<DATA>.{0,3994}?)(?<END></soap:Envelope>;)
)
The 3995 is of course 5 less than 4000, because 5 characters have already been matched for the start part.
The next trick was to find an expression that I could use to find out if my row was properly escaped, i.e. contained the correct number of columns. Luckily, I found this expression on the Internet, as it is not so simple to also deal with fields escaped by quotes that also contain quotes and semicolons. For example the following string:
1 ; “2;3” ; “4;””;““5”
Here only the semicolons in red should be treated as column separators. This proved to be a difficult task for most expressions I found. However, I did find a working one (after a tiny tweak, the blue part was’nt there) that I’d like to share with you:
;(?=(?:[^”]*”[^”]*”)*(?![^”]*”))
This expression matches all field separators, so simply perform a split using the regex and you end up with an array of all fields in your row. Checking the number of columns is a piece of cake then :-).
So why was the blue part needed? It turns out that performing a Regex.Split in .NET wil also return captured groups, als well as the splitted parts of the strings. So making the group non-capturing solved the problem.
To conclude, I think regex is fun, but sometimes the drawback is that you need a lot of comment lines to explain what that regex actually does…:-)