Originally posted 2022-07-02. Last updated 2023-05-24.
I am not a lawyer. This post is satirical commentary on:
In the process, I intentionally misrepresent how the judicial system works: I portray the system the way people like to imagine it works. Please don’t make any important legal decisions based on anything I say.
The only section you should take seriously is “Context”.
GitHub is enabling copyleft violation ✨at scale✨ with Copilot. GitHub Copilot encourages people to make derivative works of source code without complying with the original code’s license. This facilitates the creation of permissively-licensed or proprietary derivatives of copyleft code.
Unfortunately, challenging Microsoft (GitHub’s parent company) in court is a bad idea: their legal budget probably ensures their victory, and they likely already have a comprehensive defense planned. How can we determine Copilot’s legality on a level playing field? We can create legal precedent that they haven’t had a chance to study yet!
A chat with Matt Campbell about a speech synthesizer gave me a horrible idea. I think I know a way to find out if GitHub Copilot is legal: we could use its legal justification against another software project with a smaller legal budget. Specifically, against a speech synthesizer. The outcome of our actions could set a legal precedent to determine the legality of Copilot.
Let’s cover the technologies and actors at play before I start my evil monologue.
GitHub Copilot is a predictive autocompletion service for writing software. It’s powered by OpenAI Codex, a language model based on GPT-3. It was trained using the source code of public repositories hosted on GitHub, regardless of their licensing.
=> OpenAI Codex announcement | Wikipedia on GPT-3
In response to a Request for Comments from the US Patent and Trademark Office, OpenAI claimed that “Artificial Intelligence Innovation”, such as code written by GitHub Copilot, should be considered “fair use”:
Many of the code snippets it suggests are exact copies of source code from various GitHub repositories. For an example, see this tweet:
=> "I don't want to say anything but that's not the right license Mr Copilot." by Armin Ronacher | archive link that doesn’t require JavaScript, captured on 2022-07-01
It contains a screen recording of Copilot suggesting this Quake code:
=> Quake III source code snippet
When prompted to do so, it obediently fills in a permissive license. Applying that permissive license to the snippet violates the Quake code’s GPL-2.0 license. Copilot gives no indication that a license violation is taking place.
GitHub performed its own research into the matter.[1] You can read about it on their blog:
=> GitHub Copilot research recitation, by Albert Ziegler
I’m not convinced that this research accounts for the fact that suggested code might be mechanically altered to match the surrounding text while still remaining close enough to the training data to be a license violation.
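To illustrate the concern about mechanical alterations, here is a minimal sketch (my own illustration, not the method used in GitHub's research) that compares two Python snippets after replacing every identifier with a placeholder and dropping comments. The names normalized_tokens and similarity and both sample snippets are hypothetical, made up for this example.

```python
import difflib
import io
import keyword
import tokenize

# Token types that carry layout or commentary rather than program logic.
SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def normalized_tokens(source: str) -> list[str]:
    """Tokenize Python source, hiding identifier names, comments and layout."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")  # every non-keyword name looks the same
        else:
            tokens.append(tok.string)
    return tokens


def similarity(a: str, b: str) -> float:
    """Similarity ratio (0.0 to 1.0) between two normalized token sequences."""
    matcher = difflib.SequenceMatcher(
        None, normalized_tokens(a), normalized_tokens(b), autojunk=False
    )
    return matcher.ratio()


# Hypothetical snippets: the second only renames things and adds a comment.
original = "def inv_sqrt(number):\n    half = number * 0.5\n    return half\n"
renamed = (
    "def fast_rsqrt(x):\n"
    "    # halve the input\n"
    "    h = x * 0.5\n"
    "    return h\n"
)

raw_score = difflib.SequenceMatcher(None, original, renamed).ratio()
normalized_score = similarity(original, renamed)
print(f"raw text similarity:   {raw_score:.2f}")
print(f"normalized similarity: {normalized_score:.2f}")
```

The raw character comparison treats the two snippets as different, while the normalized comparison treats them as identical. A check that only looks for verbatim matches could therefore miss a suggestion that differs from its source only in names and comments. Whether such a match amounts to a license violation is a legal question that this sketch does not answer.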
I recently had a chat with Matt on IRC about screen readers and different types of speech synthesizers. I mentioned that while I do like some variety, I always find myself returning to the underrated robotic voice of eSpeak NG. He shared some of my fondness, and also shared his preference for a similar speech synthesizer called Eloquence.
Downloads of Eloquence are easy to find (it’s even included with the JAWS screen reader), but I struggle to find any “official” pages about the original Eloquence. Nuance acquired Eloquent Technology, the developer of Eloquence. Microsoft later acquired Nuance.
I like the Eloquence speech synthesizer. It sounds similar to the robotic yet predictable voice of my beloved eSpeak NG, but with improved overall quality. Unfortunately, Eloquence is proprietary.
Eloquence sample audio
=> Sample audio of Eloquence (audio/ogg, opus codec)
Matt recorded this sample audio clip of Eloquence reading some text. The text is from the introduction of another post of mine:
=> Best practices for inclusive textual websites
Deep learning speech synthesis is a recent approach to speech synthesizer creation. It involves training a deep neural network on voice samples, and using the trained model to generate speech similar to a real human voice. One synthesizer using deep learning speech synthesis is Mozilla’s TTS.
=> Wikipedia on deep learning speech synthesis | Mozilla TTS is an example of deep speech synthesis
Zero-shot approaches could allow a pre-trained model to generate multiple different voices. This could allow us to synthetically re-create a person’s voice more easily.
=> YourTTS is one such example.
My horrible plan revolves around going through two different lawsuits to set some judicial precedents; these precedents could improve the odds of succeeding in a lawsuit against Microsoft for Copilot’s licensing violations.
If this succeeds, we gain new legal support for the claim that GitHub Copilot is illegal; if it fails, we still gain a means to legally re-create proprietary software. It’s a win-win situation.
Our goal here is to get the same legal outcome as the low-stakes “trial run” of Part One.
Microsoft owns Nuance. Nuance previously bought Eloquent Technology, the developers of the Eloquence speech synthesizer.
If we win both cases: Microsoft has the legal high ground. Making a derivative of a copyrighted work using a machine-learning algorithm allows us to bypass copyright licenses.
If we lose both cases: Microsoft does not have the legal high ground. We have good judicial precedent against Microsoft to use when filing suit for Copilot’s behavior.
Either way, it’s an absolute win for free software. Taking down Copilot protects copyleft code from being turned into proprietary derivatives (and by extension, protects software freedom). But if we accidentally win these two low-stakes “test” cases, we still gain something else: we can liberate huge swaths of proprietary software, starting with speech synthesizers.
This post isn't "satire through-and-through" like something from The Onion. Rather, my intent was to make some clear points, but extrapolate them to absurdity to highlight other problems. I don't think I was clear enough when doing this. I'm sorry.
Copilot has been found to suggest significant amounts of code that is dangerously similar to existing works. It does this without disclosing obligations that come with those works' licenses. Training a model on copyrighted works may not be wrong in and of itself; however, using that model to generate new works that are not sufficiently distinct from original works is where things get problematic. Copilot's users could apply proprietary licenses to the generated works, defeating the point of copyleft.
When a tool almost exclusively encourages problematic behavior, the makers of that tool should have put thought into its implications. GitHub and OpenAI have not demonstrated a sufficiently careful approach.
I don't think that "going after" a smaller player just to manipulate our legal system is a good thing to do. The fact that this idea seems plausible to some of my readers shows how warped our perception of the judicial system is. Even if that perception is accurate (I doubt it is, but I'm not certain), it's sad. Judicial systems incentivize too much predatory behavior.
Updated on 2022-07-02: It's come to my attention that Eloquence may or may not still belong to Nuance. Further research is needed.
Eloquent Technology was acquired by SpeechWorks in 2000.
=> Article changelog | Homepage | View “An experiment to test GitHub Copilot's legality” on the WWW | Gemini capsule source code
Copyright © 2023 Rohan “Seirdy” Kumar