Hacker News
by bohemian99 1816 days ago
My question is: would Copilot be useful if you could choose the codebase it draws from? Almost as an internal company tool?
4 comments

It would certainly alleviate the license concerns. If it were possible to train it to a level that produces effective output, then sure.

As a thought experiment, I thought "what would happen if we trained it on our 15 million lines of product code + my language-ext project". It would almost certainly produce something that looks like 'us'.

But:

* It would also trip over a million or so lines of generated code

* And the legacy OO code

* It will 'see' some of the extreme optimisations I've had to build into language-ext to make it performant. Something like the internals of the CHAMP hash-map data-structure [1]. That code is hideously ugly, but it's done for a good reason. I wouldn't want to see optimised code parroted out upfront. Maybe it wouldn't pick up on it, because it hasn't got a consistent shape like the majority of the code? Who knows.

Still, I'd be more willing to allow my team to use it if I could train it myself.

[1] https://github.com/louthy/language-ext/blob/main/LanguageExt...

> legacy OO code

Aside from OO vs FP, a concern I'd have is that it would encourage and enforce idiosyncrasies in large corporate codebases.

If you've ever worked for a large corporation on their legacy code, you know you don't want any of that to be suggested to colleagues.

This would enforce bad behaviors and make it even harder for fresh developers to argue against it.

> This would enforce bad behaviors and make it even harder for fresh developers to argue against it.

I think this is a significant point. It maintains the status quo. We change our guidance to devs every other year or so. New language features become available, old ones die, etc. But we're not rewriting the entire code-base every time. If we hit old code, we refactor it with the new guidance, but we don't do it for the sake of it, so there's plenty of code that I wouldn't want in a training set (even if I wrote it myself!)

That would actually be potentially useful: it could combine autocompletion of internal libraries, automatic templates for common patterns, and internal style/linting tasks all in one. It would certainly augment those other things.

It would be interesting to know how much code you would need before it was useful (and how good does it have to be to be useful? Does even a small error rate cost so much that it erases other gains, because so many of the potential errors in usage of this type of tool are very subtle?)

That sounds interesting, though it still feels like it would need work. Like a way to annotate suggestions with comments, or flag them. Definitive licensing shown for each snippet. A way to mark deprecated code as deprecated to the training algorithm, etc.

If you find yourself copying code someone else in your organization wrote rather than abstracting it to a function in a shared library or building a more declarative framework to manage the problem, something horrible has happened.

Sometimes boilerplate is unavoidable. As an example, how do you send a GET request with libcurl in C with an authorization header? I can't tell you offhand, but I can tell you the file in my codebase that does have it, because I've duplicated the logic for two separate systems.

So you are saying you would rather every project in the world have at least one--if not, thanks to making it easier via Copilot, many--copies of this code rather than one shared library that provides a high-level abstraction for libcurl?... At least for your own code, how did you end up with two copies of duplicated logic rather than a shared library of functionality?

> So you are saying you would rather every project in the world have at least one--if not, thanks to making it easier via Copilot, many--copies of this code.

Absolutely not, not at all. I'm suggesting that copying and pasting happens, particularly in the context of a single project.

> At least for your own code, how did you end up with two copies of duplicated logic rather than a shared library of functionality?

At what point is it worth introducing an abstraction rather than copying? Using my libcurl example, you can create an abstraction over the ~10 lines of initialization, but if you need to change it to a POST, then you're just implementing an abstraction over libcurl, which is just silly.

If you have 10 lines of repeated code with one line changed to make it GET vs POST, introducing an abstraction isn't "silly": it is both ergonomic and advantageous. Not only is libcurl's API extremely verbose (as it is a low-level primitive), but if you ever need to add another line of code to that initialization--which totally happens over the years, due to various security extensions you might need to either enable or disable with respect to acceptable TLS settings, or to tune performance parameters related to connection caching, or to add a header to every request (for any number of reasons from debugging to authentication)--you can do it in one place instead of umpteen places.

The libcurl API is itself a leaky abstraction of the underlying TLS libraries in places, so if you ever realize you need to switch SSL libraries (a space in which there has been absolute upheaval in recent years) you are going to reach for shared abstractions. To take this to its ultimate conclusion: I use libcurl as a fallback for Linux, but if you want to correctly support the user's settings for proxy servers--which are sometimes needed for your requests to work at all--my code is abstracted so I can plug in entirely different HTTP backends instead of libcurl, such as Apple's CFNetwork (which you absolutely should be using if at all possible on iOS).

You act like abstraction is somehow a bad thing or some inherent cost you want to avoid, when it should absolutely take you less time to wrap duplicated code into a function than to duplicate it in the first place. If IDE features (including Copilot) are somehow making you think it is easier to throw a ton of duplicated code everywhere, that is part of the argument for why those features are dangerous... they are apparently undermining all the work people put into refactoring code browsers that are designed to help users locally manage abstraction instead of mitigating poor architecture :/.
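To make the shared-abstraction argument above a bit more concrete, here is a minimal sketch in C. Everything in it is hypothetical: `http_client`, `http_request`, `http_get`, `http_post`, and the recording backend are illustrative names, not libcurl's API and not anyone's actual code from this thread. A real backend would presumably wrap libcurl or CFNetwork behind the same function pointer; the stand-in here only records what it was asked to do, so the sketch runs without a network.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical request description; none of these names come from libcurl. */
enum http_method { HTTP_GET, HTTP_POST };

struct http_request {
    enum http_method method;
    const char *url;
    const char *auth_token; /* NULL means "send no authorization header" */
    const char *body;       /* only meaningful for POST; may be NULL */
};

/* A backend performs one request and returns an HTTP status code. */
typedef int (*http_backend_fn)(const struct http_request *req, void *ctx);

/* Shared setup (here just an auth token) lives in one place. */
struct http_client {
    http_backend_fn perform; /* pluggable: libcurl, CFNetwork, a test double... */
    void *ctx;               /* backend-specific state */
    const char *auth_token;  /* applied to every request */
};

/* The single code path that GET and POST both go through. */
int http_send(const struct http_client *c, enum http_method method,
              const char *url, const char *body)
{
    struct http_request req;

    assert(c != NULL && c->perform != NULL && url != NULL);
    req.method = method;
    req.url = url;
    req.auth_token = c->auth_token;
    req.body = (method == HTTP_POST) ? body : NULL;
    return c->perform(&req, c->ctx);
}

int http_get(const struct http_client *c, const char *url)
{
    return http_send(c, HTTP_GET, url, NULL);
}

int http_post(const struct http_client *c, const char *url, const char *body)
{
    return http_send(c, HTTP_POST, url, body);
}

/* Stand-in backend: records the request as text instead of using the network. */
struct recording_ctx {
    char last[256];
    int calls;
};

int recording_backend(const struct http_request *req, void *ctx)
{
    struct recording_ctx *rec = (struct recording_ctx *)ctx;

    snprintf(rec->last, sizeof rec->last, "%s %s auth=%s body=%s",
             req->method == HTTP_POST ? "POST" : "GET",
             req->url,
             req->auth_token ? req->auth_token : "-",
             req->body ? req->body : "-");
    rec->calls++;
    return 200; /* pretend every request succeeds */
}
```

A caller would fill in an `http_client` once with a backend and a token, then call `http_get` or `http_post`. Adding a header to every request, or swapping the backend, would then be a change to `http_send` or to the `perform` pointer rather than to each call site. Whether that is worth it for two call sites is exactly the judgment call being argued about above.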