Before GitHub became the default, there were many competitors on equal-ish footing.
If we zoom in on SourceForge, we see a flexible code hosting platform that works with multiple version control systems. In 2015, SourceForge injected adware directly into the installer binaries of packages hosted on its platform. This directly harmed the open source community, both those creating the code and those consuming it. Trust in SourceForge, and in projects hosted on SourceForge, dropped significantly as a result.
Since the web is built on the ideals of open code, freely shared, a rogue host for open source codebases is harmful to the entire industry. If Python started being distributed with adware built into the string module without the permission of the Python maintainers, it would be a pretty bad thing.
GitHub has done nothing as directly hostile as SourceForge did. In fact, despite GitHub currently being owned by Microsoft, I would say they have been a reasonable bastion of the open source community at large. Many features have been provided for free, and those that cost money have been charged at a reasonable rate. There have been some controversies in the scope of code hosting, particularly around region bans and DMCA takedowns. But I think that's just a natural occurrence, destined to happen to any large provider operating out of the USA.
There is one thing GitHub has done which is increasingly leaving a bad taste in my mouth, though. They trained AI models, and allowed others to train AI models, on open source codebases, without asking those who contributed the code.
Was it inevitable?
Simply, yes. Generative AI has already scraped the web. Large generative AI providers are known to ignore robots.txt and other scrape-limiting mechanisms. There is clearly as much demand for using generative AI for coding as there is for anything else. Additionally, Bitbucket and GitLab have both released their own AI code assistants.
GitHub is not to blame for jumping on board. If they hadn't, others would have.
Licensing
The licenses1 used in open source handle three things: litigation, copyright, and modifications.
Generative AI is not human intelligence. While it may have similarities to how human brains work, the similarity is superficial. GitHub themselves say:
When thinking about intellectual property and open source issues, it is critical to understand how GitHub Copilot really works. The AI models that create Copilot’s suggestions may be trained on public code, but do not contain any code. When they generate a suggestion, they are not “copying and pasting” from any codebase.
GitHub Copilot generates suggestions using probabilistic determination.
At a simplified level, probabilistic determination can be thought of as "how likely is X to occur after Y?". While the model does not contain source code directly, it is a mathematical model that has been trained using source code. Arguing over whether a mathematical model representing a large collection of codebases is the same as copy/pasting is a semantics debate. Regardless of the debate, the outcome is roughly the same: licensed code is being used to produce more code, without attribution.
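To make "how likely is X to occur after Y?" concrete, here is a minimal sketch using a toy bigram model. This is an illustration of the underlying idea only, not how Copilot actually works: real models are neural networks trained on vastly more code, but the principle of predicting what follows what is the same.

```python
from collections import Counter, defaultdict

# Toy next-token predictor: count which token follows which in a
# tiny training corpus, then predict the most probable successor.
# The model stores only counts derived from the corpus, not the
# corpus text itself -- which is exactly the semantics debate above.

corpus = "for i in range ( n ) : total = total + i".split()

follows: defaultdict[str, Counter] = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def most_likely_next(token: str) -> str | None:
    """Return the most probable token to follow `token`, if known."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]

print(most_likely_next("range"))  # -> "(" (learned from the corpus)
```

The counts "contain" no code verbatim, yet the corpus is fully recoverable from them at this scale; whether that counts as copying is the semantics debate in miniature.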
We would not expect a human software engineer to attribute all the code they have ever read to populate their brain, only the code they directly use, with or without modification. Does the same apply to generative AI models? The AI providers have decided, on behalf of the community, that yes, only direct code usage needs attribution.
Altruism
Why do people contribute to open source? If we look past what the licenses say, what about the intent? One big reason is altruism - to improve the world.
In the short term, AI coding assistants help new developers, and help people who don't code produce working code. Whether that code is of high quality is debatable, but particularly for those who don't know how to code, some functioning code is an amazing leap from no code.
So does that help the world? Yes, probably in some way.
However:
Generating code via an AI "compiler" uses more power and resources than traditional compilers. Any self-respecting compiler author would be horrified at the bytes-per-second output LLMs achieve when producing code.
The code produced is of a lower quality than that created by knowledgeable developers.
GenAI has already given companies the idea that they can reduce the human workforce in favour of AI. If the world had UBI, maybe this wouldn't be a problem. However, no society has wide-scale UBI. Replacing the human workforce means fewer jobs, and less stability for real people.2
Street cred & reputation
Open source contributions give contributors a reputation. During interview processes, open source contributions let hiring managers see a candidate's coding style and knowledge more directly than the technologies listed on a CV. If a project is released as a package, the package infrastructure often provides usage statistics. GitHub themselves provide stars as a metric for interest in a project.
LLMs may be trained on this code, but they inform neither the user of the LLM nor the project creator of the sources used to produce a given output. When choosing dependencies, LLMs prefer well-known libraries rather than selecting libraries based on quality. As a result, while a developer's knowledge and code may be widely used, it won't be known. This harms the reputation of open source developers: rather than "project X is great", people say "Copilot is great". Copilot is only great because of project X, and all the other projects hosted on GitHub.
A possible solution: provide statistics on how public repository code is used in generated LLM output. I suspect this would be difficult to do well without revealing GitHub's business secrets. However, GitHub's focus until now has been social coding. It is not very social to move code creation to brainless AI without giving the training material any credit.
Reporting bugs, requesting features, and fixing bugs
Contributions to open source also take the form of submitting bug reports, requesting features, and opening pull requests. If a user of Copilot runs into a problem with a library, they are most likely to assume either 1) Copilot generated some bad code, or 2) the library does not do what they need. LLM code generators also tend to recreate functions, even when the function already exists in the codebase, as the sketch below illustrates. As a result, those bug reports and pull requests never get made, especially if the Copilot user is not a software engineer.
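A hypothetical illustration of that duplication, with invented names (`slugify`, `make_url_slug`): the project already ships a helper, but the generated code reimplements it with subtly different behaviour, and any fix or bug report that should flow upstream never does.

```python
# utils.py -- helper that already exists in the project
def slugify(title: str) -> str:
    """Convert a title into a URL-safe slug, collapsing whitespace."""
    return "-".join(title.lower().split())

# What an assistant often generates elsewhere in the same project,
# instead of importing the existing helper:
def make_url_slug(text: str) -> str:
    return text.lower().strip().replace(" ", "-")

# The duplicate diverges on edge cases: repeated spaces become
# repeated hyphens instead of being collapsed.
print(slugify("Hello  World"))        # hello-world
print(make_url_slug("Hello  World"))  # hello--world
```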
AI does not understand security or performance. Neither do most engineers. However, engineers are expected to care about these things; AI is not. Even when prompted to care, AI is not a security engineer, nor is it a performance engineer.
LLMs do not produce consistent, type-aware output. In compilers, this would be called undefined behaviour. Undefined behaviour should not and cannot be relied on. Yet things built with LLMs, without an engineer involved, produce undefined behaviour.
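A contrived sketch of what that inconsistency looks like in practice, with invented outputs: the same prompt can yield different shapes and types across runs, so any code consuming the output has to defend against all of them, which is exactly the kind of guard an engineer adds and a non-engineer doesn't.

```python
import json

# Hypothetical outputs from the same prompt on different runs.
# Nothing guarantees the shape or the types stay consistent.
run_1 = '{"count": 42}'
run_2 = '{"count": "42"}'  # same field, now a string
run_3 = '{"total": 42}'    # same meaning, different key

def read_count(payload: str) -> int:
    """Defensively extract an integer count from an LLM response."""
    data = json.loads(payload)
    value = data.get("count", data.get("total"))
    if isinstance(value, str) and value.isdigit():
        value = int(value)
    if not isinstance(value, int):
        raise TypeError(f"unusable response: {payload!r}")
    return value

for run in (run_1, run_2, run_3):
    print(read_count(run))  # 42 each time, but only thanks to the guards
```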
Funding
Open source maintainers struggle with funding. Big, well-known projects get funding; highly used but relatively unknown projects don't (e.g. xz). Code generation built on top of open source furthers the separation between consumers and creators, making funding even more unlikely. Before, developers would at least be aware of who created the package or tutorial they were using. Now credit is given to Copilot, ChatGPT, or Claude. Editor tooling, not original thought.
Why does it bother me?
In my various roles across different parts of the tech world, I see the industry struggling. Junior developers are struggling to find jobs, while senior developers are the least likely to be affected by job replacement by AI.
Those training, implementing, or creating LLMs have a duty to society. These tools now exist at scale, and there must be pushback to help define how we as an industry want to work with AI. LLMs have enabled some amazing things, for sure. I love seeing more people gain access to coding tooling, especially those who don't know how to program. But at the same time, the tech industry is going through hardships: layoffs, less hiring, less investment. It would be foolish to ignore the role AI has played in this.
I will continue to use GitHub. The features they provide for free for open source are better than the alternatives. But I do hope that their AI experiments feed into creating a better world for those in tech, rather than damaging it.
Media preview image credit: Carreg Cennen, Ken Day, via Wikipedia.
MIT:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
BSD 3-Clause:
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Apache 2.0:
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
You must give any other recipients of the Work or Derivative Works a copy of this License; and
You must cause any modified files to carry prominent notices stating that You changed the files; and
You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
Tech jobs, particularly in the USA, have been overpaid in proportion to their positive impact on the world. Lowering salaries while hiring the same number of people would avoid economic collapse for workers in the tech industry. CEOs and shareholders still make a large amount of money compared to the work put in by ordinary developers. Reducing the workforce in favour of AI is intended only to reduce costs and increase profits, not to increase product quality or improve employee conditions. Give me a passionate human over AI any day.