Hacker News new | ask | show | jobs
by DanielBryars 798 days ago
What's the utility of defining the "Error" exception. Why not use an existing one, say InvalidOperationException, or a plain Exception. Is making your own better practice?
3 comments

There is no utility. It's perhaps written for JavaScript developers who are used to Error.. but it's not idiomatic C#. Might be indicative of a copilot too.

The use of a class-scoped `StringBuilder` that only one method uses, and `ReadQuotedColumn`/`ReadNonQuotedColumn` yielding one character at a time, rather than accepting a the builder isn't a good sign either (for efficiency). Or casting everything to a `char` (this won't support UTF8), or assuming an end quote followed by anything (:71) is valid way to end a field.

C# `char` is a UTF-16 code unit. It does not indicate a byte which is just `byte`.

Having StringBuilder be a private field on the parser instance is not an issue either - it is simply reused.

Iterating over the `char`s does not support the full range of what can be stored in a C# string (for instance, UTF-8 graphemes that are serialized as surrogate pairs are usually two `char`s in a C# string.

.Net provides a TextElementEnumerator that will iterate over graphemes instead: https://learn.microsoft.com/en-us/dotnet/api/system.globaliz...

There's a fairly comprehensive guide to working with .net character encodings at https://learn.microsoft.com/en-us/dotnet/standard/base-types... .

The return value of StreamReader.Read() will always be within bounds of -1 and char.MaxValue.

All surrogate pairs will be drained into the StringBuilder, working correctly. Most implementations usually agree that torn UTF-16 surrogate pairs (which are strictly the code points outside of basic multilingual plane) may exist in the input and will be passed as is, which is different to what UTF-8 implementations choose (Rust is strict with this, Go lets you tear code points arbitrarily).

We, as a community, can do better than to jump to immediate criticism of this type.

If you (a consuming dev) want the world's smallest (in your code) - use the .net built in parser[0]. Bonus, it's RFC4180 compliant.

If you (competing/learning) want to write the world's smallest (code golf style)... this isn't it, and has some weird superfluous lines (if that's your measure - per the original question).

If you (learning) want to write an efficient parser.. this isn't it. You don't need a StringBuilder, you can seek the Stream to collect the (already formed) strings directly from source vs char-by-char memory copy and rebuild. Yes; that limits your stream choices, but since the example/tests only use FileStreams (which are seekable) you might not come across other kinds. If you need to use un-seekable streams, then you'll need to use a large enough buffer.

[0]: https://learn.microsoft.com/en-us/dotnet/api/microsoft.visua...

This is not a correct link (it refers to VB.NET). There are better parsers out there (Sep).

I'm not sure what is your point but it certainly misses the idea behind this HN submission and makes me sad as it would be nice to see words of encouragement in .NET submissions here instead.

The unit tests have an emoji test (which uses a surrogate pair). I thought I would have to use Runes, but it's not necessary. https://github.com/kjpgit/SmallestCSVParser/blob/master/Smal...
You imply that a string, reversed, would have the same length as the original.

This is not true.

Where are they implying this and why would the strings not have the same length? Is there normalization implied somewhere?
If they weren't reversing it, what other operation would separate grapheme clusters?
> Having StringBuilder be a private field on the parser instance is not an issue either - it is simply reused.

It doesn’t matter for this API, but it is a code smell. It makes the class not reentrant.

Talking of the API, I would make it simpler to use and more idiomatic by making the entire public API

   static IEnumerable<List<String>> parse(StreamReader sr)
That call would store the parser state (currently just the StreamReader and that reused StringBuilder) in a private inner class. There would not be a constructor of the publicly visible class, removing that code smell.
I will add a micro benchmark to see if the `yield return` is slowing things down, compared to just calling _sb.Add() inside Read*(). I will also see if it looks cleaner that way. To be honest, the `yield return` is currently in there just because I thought it's "cool".
30% performance improvement after removing the `yield return`, and readability is probably better too.
It's good practice to throw an exception from your own namespace if you're writing a library.

You don't want to expose an implementation detail like some specific exception as part of your public API and have to worry about breaking that later.

You could overload some built in exception but IMO that's not the best practice. You muddy your API and a caller has to wrap your exception if they want to bubble it up and catch it specifically, anyway.

> Why not use ... a plain Exception.

It is forbidden.

https://learn.microsoft.com/en-us/dotnet/standard/exceptions...

> Exception ... None (use a derived class of this exception).

https://learn.microsoft.com/en-us/dotnet/standard/design-gui...

> DO NOT throw System.Exception or System.SystemException.

It’s a guideline not to. It’s not a hard rule or forbidden.
I'm not seeing where it says not to throw InvalidOperationException.
InvalidOperationException means "the object is in an inappropriate state". That does not describe a parse error.

C# conventions for exceptions are admittedly a bit confusing. There are a handful of very specific scenarios where you're supposed to use a built-in exception (most commonly ArgumentException). For everything else, you want to define your own type.

The state of the stream, such that it's pointing to an illegal character, actually does seem to be invalid though. Maybe this is an overly pedantic argument. I've been writing quite a bit of c# for over a decade and have basically been doing this the whole time. I thought that I knew how to use exceptions, but it seems I do not.
If you want to recover from CSV errors you should ideally throw InvalidCSVException.