Hacker News new | ask | show | jobs
by alphanumeric0 1430 days ago
> I find single file dense leetcode style code easier to understand and follow the flow. Algorithmic code I can reason around. A large mature codebase is far harder to get to know.

I genuinely can't tell if you're being serious or not. If you are, do you also like to read books written as one giant chapter? Or entire chapters as one giant paragraph?

5 comments

I have trouble with you equating "leetcode style" with "one giant chapter" and also with you equating "enterprise code" with one chapter following another, because when I read enterprise code it's

1. read one line of first chapter,

2. then skip to the last sentence of the middle chapter,

3. then realize the first chapter was actually the penultimate,

4. then read the forth sentence of the first paragraph of the second chapter, bearing in mind what you have learned,

5. then throw your hands up in the air in dispair

Jumping around is the nature of code in general, whether its in a 1000-line file or split up amongst multiple files.

For a maintenance programmer, they may already understand how everything works. They aren't following a particular code path, necessarily. Maybe they're working on a new feature and they need to re-familiarize themselves with previous chapters. It that case, it's nice to jump to a file that concerns itself with things grouped together.

>> things grouped nicely together

fixed it for ya :)

I'm a maintenance programmer. Even working layers of layers of layers above the actual shit does not make it not stink.

>> jumping around

...should be intuitive and joyful, not a disaster to your brain.

EDIT: I am a fan, though, of SOC. I guess enterprisey code tries to be that (but fails hard at it).

Okay, great. You are a maintenance programmer. You have a 1000-line program all in one file. You need to update the e-mail functionality of this program. You aren't following a stack trace or following a particular code path. There are 15 or 16 different e-mail related functions. How do you find the e-mail function you need to update? Do you memorize line numbers? Use regex search? Do you have vim marks setup?
>> here are 15 or 16 different e-mail related functions

Is this you showboating the greats of layered code?

Depth is not width and width is not depth, but surely you see the difference in the two?

If you have 15 or 15 different e-mail related functions spread across a whole bunch of files, how do you find the one you need to update?
At one point in time I used OpenGrok to try understand large projects.

Without documentation I find large projects difficult to understand. There's literally too many global symbols and I cannot see the forest for the trees.

What's the model of this program? What are the core principles that the author is using? Do I really need to read every file to understand what is going on?

With Leetcode style programs there's one file with everything in it and I can usually find the entry point. The problem is well defined.

I can see the moving parts in a Leetcode style problem. The looping, the data structure creation and control flow, arrays and recursion.

Large mature codebases such as Java projects have thousands to millions of small files and packages it can be difficult to see how things fit together, every file seems to be 10-20 lines long.

I like C projects as they have lots of code in one file. Everything I need to understand a module is in one file and I can use vim folding.

I’ve been programming for nearly 30 years and I feel the same way. 1000loc+ files are increasingly common in my code bases. I find it really hard to read code with lots of tiny files that individually don’t do anything. It’s like the programmer is embarrassed by their code so they’re making me search for the core logic.

Splitting code into multiple files really only makes sense to me when there’s a clear division of responsibilities. That can mean a lot of things - like client / server, utility methods / core algorithm or class A / class B. But plenty of complex data structures are much easier to read and understand all at once. For example, I have a rust rope library which implements a skip list of gap buffers. The skip list is one (big) file. The gap buffer is another file. Easy.

Same. Well, 1000 is an outlier but 300-600 feels right. I sometimes feel bad for doing it, because it's not what some other people might consider good code to look like. I occasionally have to do code review or help fix a bug in a the other kind of codebase and even the devs often don't seem to understand how those thousand 15-line files all fit together.
> Without documentation I find large projects difficult to understand. There's literally too many global symbols and I cannot see the forest for the trees.

Good abstractions let you see the shape if the forest.

Most "good code" or "simple code" is an inedible potluck of "simplest solution for the problem at the time" with some documentation.

Actual good code has intention revealing abstractions that communicate the essence of the problem and problem domain.

back when i inherited a legacy C codebase i found https://www.jgrasp.org/ pretty useful to explore large files.
I could see this working with vim folding. That's an interesting approach. I never really used folding in IDEs.
The book analogy you use isn't very accurate. Even if you merge chapters and paragraphs like that, you still read it sequentially. Just in a less comfortable way.

Which is not at all like a modern codebase that is modular, abstracted, etc. If you're new to a codebase, and want to understand one particular feature, you'd likely need to jump back and forth across 10 files.

It's not far-fetched to say that makes it difficult to understand.

The thing is that if you just need to understand a specific part of something you will need to jump as well even if everything you needed for that one thing is written sequentially in one file. You will want to skip over implentation details of certain things to get the general picture first on a more abstract level.

Let's say you have a simple endpoint that takes a list of comma separated inputs, parses them as numbers and spits back a sorted version of that.

I don't want to see a version of that, which a compiler might have inlined. Including the implementation of the sorting algorithm. I only want to see a high level of abstraction version of it. Basically just something like (pseudo code in a non existent language) :

    fun endpoint(input):
      inputs[] = split(input, ',')
      numbers[] = parseAsIntegers(inputs)
      return quicksort(numbers)
I can easily understand what this does and what the idea is behind this "algorithm" in 3 lines. If I had the "inlined" version of this I would have to manually identify each of these parts and potentially skip over tens to hundreds of lines.

I think this is really a bit about trust. Do you trust that these named functions I am calling do what their name does? Does quicksort actually do a quicksort or has someone implemented bubble sort in there? Of course this is a minimal example and especially quicksort would probavly just be a library but imagine all of these were large complex pieces of our code base.

Personally I am an advocate for using functions (methods or whatever your language calls them etc) and naming them properly and then trusting those names by default. I want to spend time making this nice and understandable and abstracted once when writing. Not every time someone reads it. As soon as something does not seem to behave in the way the name suggests I will then and only then go check the actual implementation and for example find out that parseAsIntegers actually also supports floats and quicksort is not actually quicksort but bubblesort and that is why this endpoint was slow etc.

This would be the “simple” code. The “abstract” code would be more like this:

  public class EndpointManager {
    private EndpointInputManager eim;
    private StringSplitter splitter;
    private NumberParser parser;
    private Sorter sorter;
    public EndpointManager(EndpointInput input) {
      eim = new EndpointInputManagerFactory().setInput(input).build();
      splitter = new StringSplitterFactory().setDelimiter(new Delimiter(",")).build();
      parser = new NumberParserFactory().setFormat(NumberParserFormat.INTEGER).setMode(NumberParserMode.LIST).build();
      sorter = new SorterFactory().setSortOrder(SortOrder.ASCENDING).setAlgorithm(SortingAlgorithm.QUICK_SORT).build();
    }
    public EndpointOutput endpoint() throws ParseException {
      splitter.split(eim.getInput());
      parser.parse(splitter.getList());
      sorter.sort(parser.getOutput());
      return new EndpointOutputFactory().setOutput(sorter.getSortedList()).build();
    }
  }
My guess is this is Java or something close? We can make that much more readable. We may have to do away with bad libraries. A lot more could actually be magicked away, which can also be a problem sometimes. In a "real" application this resource's interface would probably not just take a comma separated string in a body but accept a proper JSON object or somesuch and not just be a "sort endpoint" but it's not going to be much different from this if written properly. I happen to like the few annotations you'll see me use here. I also think that something as simple as a line break can unclog things. Also the choice of having each of the stream operations in a separate line is deliberate for readability. There are linter/auto formatting rules to enforce this (we do this at my current place for example).

    @Path("/sort")
    public class SortResource {

        @GET
        public List sort(@Body String input) {
            validate(input);
            return Arrays.stream(input.split(","))
                .map(Integer::valueOf)
                .sort()
                .collect(Collectors.toList());
        }
    }
I do recognize the kind of code you pasted. Had to work in code bases like that for way too long. Never want to work in one of those again. There's probably lots of EJBs and other such nonsense around that?
I was going to make the predictable joke that you need a factory. Disappointing that you took care of it already.
You can easily understand what it does because you picked an example that is easy to understand :)

This discussion often ends up in the extremes: inline everything versus abstract everything. I don't think anybody reasonable would opt for either of those extremes, we should focus on the very large middle ground where there's a lot of subjectivity.

Trust me on this, I've been raised on the DRY dogma and all related architectural patterns in favor of abstraction. I've lived the life, for 2 decades. But I cannot ignore the outcomes. Most codebases are extremely difficult to understand and it's very painful to change things. As 90% of all software development is maintenance, that's a planetary-sized problem.

This doesn't mean you should inline everything, it means sane choices. As a simple example, say you're using a literal in your code:

(if orderAmount > 100000)

This code is incredibly easy to read. Common convention says to put this in a constant, at the top of the file. Old me agrees, new me does not. For as long as it's the only occurrence of the value, it doesn't need abstraction. The only thing it would do is make the code more difficult to understand. The very eager abstracter might even put that constant in a separate file.

The point of this example is to abstract based on real reuse, not imagined reuse. I'm not against abstraction only against unnecessary abstraction.

A second example. Say you have a reusable UI component. A change request comes in that's pretty large and specific for one niche need. It kind of goes against the spirit of the initial purpose of the component but is still related enough to consider it in scope of the component.

Old me might add a "toggle" to the component, after which it can render in two modes. Sometimes called a "god component". This approach sucks. It makes the component much more complicated and changing and testing it becomes a nightmare.

Even older me would break down the component into smaller components and then "compose" them based on its mode. This is even worse, now you have to jump around many places whilst the sub components are never actually reused (pointless abstraction).

New me says fuck it and splits the component in two. Allowing significant code duplication between both components. It's not as radical as it sounds, it's in fact incredibly comforting. Each component can easily be understood (less complexity) and making changes becomes far less stressful as your blast radius is tiny.

Developers spent the vast majority of their time not coding, instead figuring out how something works and how to make a change that doesn't break anything.

It may not have been evident from my simple example but I do agree with your "middle" approach. That's where I try to end up in our code base. Endless interfaces, methods that are only one line long and such are counter productive. But nobody can tell me that inlining quicksort will ever be useful outside of a place where your compiler can't do it for you and you need to favour execution speed over everything else. I don't believe such places really exist much if at all any longer.

What I do have to have to question is the strict non-use of constants. It can be very very useful to use constants for such things, e.g. if you are calling libraries that do not make it apparent what is what. Say you have something that takes a timeout value.

    send(data, 10)
What is this? I have to know what send is, what parameters it takes etc. I might have to look that up. I can easily work around that with a constant.

    const timeoutInMillis = 10
    send(data, timeoutInMillis)
The same principle can easily apply for other similar situations. I really like it for things like

    doSomethingThatCouldTakeLongButAlsoShouldHaveATimeout(data, 64800000)
What is that and what does that value even mean in human readable? Of course some of these values you will recognize if used enough but so far my domains have been sparse enough that I don't recognize all of them and have to compute. Much better (with shorter, real names anyway but ya know, we're dealing in simple examples here :) ):

    REALLY_LONG_RUNNING_PROCESS_TIMEOUT_IN_MILLIS = 1000 * 60 * 60 * 18
    doSomethingThatCouldTakeLongButAlsoShouldHaveATimeout(data, REALLY_LONG_RUNNING_PROCESS_TIMEOUT_IN_MILLIS)
So far most people I've talked to find that it's much easier to recognize that this has a timeout of 18 hours but the method happens to want milliseconds.

Oh and don't get me started on people that use the constants from the real code in their tests, completely defeating the testing. Especially if they then do math with the constants and simply copy the math - or worse, put the math into a method and call it from the tests too - to their tests. Test expectations have to be computed once, when writing the test and just hardcoded into them, otherwise they serve no purpose as changing the code itself will always result in green tests even if you've just made a major mistake by changing the values without thinking.

I think he is, can second that. Not having to jump around 10 files with multiple classes and remembering where goes what usually means I can understand the code faster.

Not sure about literature but for almost any technical matter I prefer learning from the smaller details instead of the big picture - it's often too vague and just doesn't stick in my memory.

Reading enterprise code is like reading a book where 99% of the pages don't contain any meaningful information.