drewdevault.com

[mirror] blog and personal website of Drew DeVault git clone https://hacktivis.me/git/mirror/drewdevault.com.git
commit: f7762ea27353d8a7d1645bde48bb9f18d5d4d825
parent 2db03526a0c8f90a08b475122c136b1197aecddf
Author: Drew DeVault <sir@cmpwn.com>
Date:   Thu, 23 Jun 2022 14:53:04 +0200

GPL laundering

Diffstat:

A content/blog/Copilot-GPL-washing.md | 237 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 237 insertions(+), 0 deletions(-)

diff --git a/content/blog/Copilot-GPL-washing.md b/content/blog/Copilot-GPL-washing.md
@@ -0,0 +1,237 @@
+---
+title: GitHub Copilot and open source laundering
+date: 2022-06-23
+---
+
+*Disclaimer: I am the founder of a company which competes with GitHub. I am also
+a long-time advocate for and developer of free and open source software, with a
+broad understanding of free and open source software licensing and philosophy. I
+will not name my company in this post to reduce the scope of my conflict of
+interest.*
+
+We have seen an explosion in machine learning in the past decade, alongside an
+explosion in the popularity of free software. At the same time as FOSS has come
+to dominate software and found its place in almost all new software products,
+machine learning has increased dramatically in sophistication, facilitating more
+natural interactions between humans and computers. However, despite their
+parallel rise in computing, these two domains remain philosophically distant.
+
+Though some audaciously-named companies might suggest otherwise, the machine
+learning space has enjoyed almost none of the freedoms forwarded by the free and
+open source software movement. Much of the actual code related to machine
+learning is publicly available, and there are many public access research
+papers available for anyone to read. However, the key to machine learning is
+access to a high-quality dataset and heaps of computing power to process that
+data, and these two resources are still kept under lock and key by almost all
+participants in the space.[^1]
+
+[^1]: Shout-out to Mozilla Common Voice, one of the few exceptions to this rule,
+    which is an excellent project that has produced a high-quality, freely
+    available dataset of voice samples, and used it to develop free models and
+    software for text-to-speech and speech recognition.
+
+The essential barrier to entry for machine learning projects is overcoming these
+two problems, which are often very costly to secure. A high-quality, well tagged
+data set generally requires thousands of hours of labor to produce,[^2] a task
+which can potentially cost millions of dollars. Any approach which lowers this
+figure is thus very desirable, even if the cost is making ethical compromises.
+With Amazon, it takes the form of gig economy exploitation. With GitHub, it
+takes the form of disregarding the terms of free software licenses. In the
+process, they built a tool which facilitates the large-scale laundering of free
+software into non-free software by their customers, who GitHub offers plausible
+deniability through an inscrutable algorithm.
+
+[^2]: Typically exploitative labor from low-development countries which the tech
+    industry often pretends isn't a hair's breadth away from slavery.
+
+Free software is not an unqualified gift. There are terms for its use and
+re-use. Even so-called "liberal" software licenses impose requirements on
+re-use, such as attribution. To quote the MIT license:
+
+> Permission is hereby granted [...] subject to the following conditions:
+>
+> The above copyright notice and this permission notice shall be included in all
+> copies or substantial portions of the Software.
+
+Or the equally "liberal" BSD license:
+
+> Redistribution and use in source and binary forms, with or without
+> modification, are permitted provided that the following conditions are met:
+>
+> Redistributions of source code must retain the above copyright notice, this
+> list of conditions and the following disclaimer.
+
+On the other end of the spectrum, copyleft licenses such as GNU General Public
+License or Mozilla Public License go further, demanding not only attribution for
+derivative works, but that such derived works are *also released* with the same
+license. Quoting GPL:
+
+> You may convey a work based on the Program, or the modifications to produce it
+> from the Program, in the form of source code under the terms of section 4,
+> provided that you also meet all of these conditions:
+>
+> [...]
+>
+> You must license the entire work, as a whole, under this License to anyone who
+> comes into possession of a copy.
+
+And MPL:
+
+> All distribution of Covered Software in Source Code Form, including any
+> Modifications that You create or to which You contribute, must be under the
+> terms of this License. You must inform recipients that the Source Code Form of
+> the Covered Software is governed by the terms of this License, and how they
+> can obtain a copy of this License. You may not attempt to alter or restrict
+> the recipients' rights in the Source Code Form.
+
+Free software licenses impose obligations on the user through terms governing
+attribution, sublicensing, distribution, patents, trademarks, and relationships
+with laws like the Digital Millennium Copyright Act. The free software community
+is no stranger to the difficulties in enforcing compliance with these
+obligations, which some groups view as too onerous. But as onerous as one may
+view these obligations to be, one is nevertheless required to comply with them.
+If you believe that the force of copyright should protect your proprietary
+software, then you must agree that it equally protects open source works,
+despite the inconvenience or cost associated with this truth.
+
+GitHub's Copilot is trained on software governed by these terms, and it fails to
+uphold them, and enables customers to accidentally fail to uphold these terms
+themselves. Some argue about the risks of a "copyleft surprise", wherein someone
+incorporates a GPL licensed work into their product and is surprised to find
+that they are obligated to release their product under the terms of the GPL as
+well. Copilot institutionalizes this risk and any user who wishes to use it to
+develop non-free software would be well-advised not to do so, else they may find
+themselves legally liable to uphold these terms, perhaps ultimately being
+required to release their works under the terms of a license which is
+undesirable for their goals.
+
+Essentially, the argument comes down to whether or not the model constitutes a
+derivative work of its inputs. Microsoft argues that it does not. However, these
+licenses are not specific regarding the means of derivation; the classic
+approach of copying and pasting from one project to another need not be the only
+means for these terms to apply. The model exists as the result of applying an
+algorithm to these inputs, and thus the model itself is a derivative work of its
+inputs. The model, then used to create new programs, forwards its obligations to
+those works.
+
+All of this assumes the best interpretation of Microsoft's argument, with a
+heavy reliance on the fact that the model becomes a general purpose programmer,
+having meaningfully learned from its inputs and applying this knowledge to
+produce original work. Should a human programmer take the same approach,
+studying free software and applying those lessons, but not the code itself, to
+original projects, I would agree that their applied knowledge is not creating
+derivative works. However, that is not how machine learning works. Machine
+learning is essentially a glorified pattern recognition and reproduction engine,
+and does not represent a genuine generalization of the learning process. It is
+perhaps capable of a limited amount of originality, but is also capable of
+degrading to the simple case of copy and paste. Here is an example of Copilot
+reproducing, verbatim, a function which is governed by the GPL, and would thus
+be governed by its terms:
+
+<video
+  src="https://mirror.drewdevault.com/copilot-squareroot.mp4"
+  muted autoplay controls></video>
+
+<div class="text-center">
+  <small>
+    Source: <a href="https://twitter.com/mitsuhiko/status/1410886329924194309">
+    Armin Ronacher</a> via Twitter
+  </small>
+</div>
+
+The license reproduced by Copilot is not correct, neither in form nor function.
+This code was not written by V. Petkov and the GPL imposes much stronger
+obligations than those suggested by the comment. This small example was
+deliberately provoked with a suggestive prompt (this famous function is known as
+the "[fast inverse square root]") and the "float Q_", but it's not a stretch to
+assume someone can accidentally do something similar with any particularly
+unlucky English-language description of their goal.
+
+[fast inverse square root]: https://en.wikipedia.org/wiki/Fast_inverse_square_root
+
+Of course, the use of a suggestive prompt to convince Copilot to print GPL
+licensed code suggests another use: deliberately laundering FOSS source code. If
+Microsoft's argument holds, then indeed the only thing which is necessary to
+legally circumvent a free software license is to teach a machine learning
+algorithm to regurgitate a function you want to use.
+
+This is a problem. I have two suggestions to offer to two audiences: one for
+GitHub, and another for free software developers who are worried about Copilot.
+
+To GitHub: this is your Oracle v Google moment. You've invested in building a
+platform on top of which the open source revolution was built, and leveraging
+this platform for this move is a deep betrayal of the community's trust. The law
+applies to you, and banking on the fact that the decentralized open source
+community will not be able to mount an effective legal challenge to your $7.5B
+Microsoft war chest does not change this. The open source community is
+astonished, and the astonishment is slowly but surely boiling over into rage as
+our concerns fall on deaf ears and you push forward with the Copilot release. I
+expect that if the situation does not change, you will find a group motivated
+enough to challenge this. The legitimacy of the free software ecosystem may rest
+on this problem, and there are many companies who are financially incentivized
+to see to it that this legitimacy stands. I am certainly prepared to join a
+class action lawsuit as a maintainer, or alongside other companies with
+interests in free software making use of our financial resources to facilitate a
+lawsuit.
+
+The tool can be improved, probably still in time to avoid the most harmful
+effects (harmful to your business, that is) of Copilot. I offer the following
+specific suggestions:
+
+1. Allow GitHub users and repositories to opt-out of being incorporated into the
+   model. Better, allow them to opt-in. Do not tie this flag into unrelated
+   projects like Software Heritage and the Internet Archive.
+2. Track the software licenses which are incorporated into the model and inform
+   users of their obligations with respect to those licenses.
+3. Remove copyleft code from the model entirely, unless you want to make the
+   model and its support code free software as well.
+
+Your current model probably needs to be thrown out. The GPL code incorporated
+into it entitles anyone who uses it to receive a GPL'd copy of the model for
+their own use. It entitles these people to commercial use, to build a competing
+product with it. But, it presumably also includes works under incompatible
+licenses, such as the CDDL, which is... problematic. The whole thing is a legal
+mess.
+
+I cannot speak for the rest of the community that has been hurt by this
+project, but for my part, I would be okay with not pursuing the answers to any
+of these questions with you in court if you agreed to resolve these problems
+now.
+
+And, my advice to free software maintainers who are pissed that their licenses
+are being ignored. First, don't use GitHub and your code will not make it into
+the model (for now). [I've written before] about why it's generally important
+for free software projects to use free software infrastructure, and this only
+reinforces that fact. Furthermore, the old "vote with your wallet" approach is
+a good way to show your disfavor. That said, if it occurs to you that you
+*don't* actually pay for GitHub, then you may want to take a moment to consider
+if the incentives created by that relationship explain this development and may
+lead to more unfavorable outcomes for you in the future.
+
+[I've written before]: https://drewdevault.com/2022/03/29/free-software-free-infrastructure.html
+
+You may also be tempted to solve this problem by changing your software licenses
+to prohibit this behavior. I'll say upfront that according to Microsoft's
+interpretation of the situation (invoking fair use), it doesn't matter to them
+which license you use: they'll use your code regardless. In fact, [some
+proprietary code] was found to have been incorporated into the model. However, I
+still support your efforts to address this in your software licenses, as it
+provides an even stronger legal foundation upon which we can reject Copilot.
+
+[some proprietary code]: https://twitt.re/ChrisGr93091552/status/1539731632931803137
+
+I will caution you that the way you approach that clause of your license is
+important. Whenever writing or changing a free and open source software license,
+you should consider whether or not it will still qualify as free or open source
+after your changes. To be specific, a clause which outright forbids the use of
+your code for training a machine learning model will make your software
+*non-free*, and I do not recommend this approach. Instead, I would update your
+licenses to clarify that incorporating the code into a machine learning model is
+considered a form of derived work, and that your license terms apply to the
+model and any works produced with that model.
+
+To summarize, I think that GitHub Copilot is a bad idea as designed. It
+represents a flagrant disregard of FOSS licensing in and of itself, and it enables
+similar disregard &mdash; deliberate or otherwise &mdash; among its users. I
+hope they will heed my suggestions, and I hope that my words to the free
+software community offer some concrete ways to move forward with this problem.