drewdevault.com

[mirror] blog and personal website of Drew DeVault git clone https://hacktivis.me/git/mirror/drewdevault.com.git

---
title: GitHub Copilot and open source laundering
date: 2022-06-23
---

*Disclaimer: I am the founder of a company which competes with GitHub. I am also
a long-time advocate for and developer of free and open source software, with a
broad understanding of free and open source software licensing and philosophy. I
will not name my company in this post to reduce the scope of my conflict of
interest.*

We have seen an explosion in machine learning in the past decade, alongside an
explosion in the popularity of free software. At the same time as FOSS has come
to dominate software and found its place in almost all new software products,
machine learning has increased dramatically in sophistication, facilitating more
natural interactions between humans and computers. However, despite their
parallel rise in computing, these two domains remain philosophically distant.

Though some audaciously named companies might suggest otherwise, the machine
learning space has enjoyed almost none of the freedoms advanced by the free and
open source software movement. Much of the actual code related to machine
learning is publicly available, and there are many public-access research
papers available for anyone to read. However, the key to machine learning is
access to a high-quality dataset and heaps of computing power to process that
data, and these two resources are still kept under lock and key by almost all
participants in the space.[^1]

[^1]: Shout-out to Mozilla Common Voice, one of the few exceptions to this rule,
  which is an excellent project that has produced a high-quality, freely
  available dataset of voice samples, and used it to develop free models and
  software for text-to-speech and speech recognition.

The essential barrier to entry for machine learning projects is securing these
two resources, which is often very costly. A high-quality, well-tagged
dataset generally requires thousands of hours of labor to produce,[^2] a task
which can potentially cost millions of dollars. Any approach which lowers this
figure is thus very desirable, even if the cost is making ethical compromises.
With Amazon, it takes the form of gig economy exploitation. With GitHub, it
takes the form of disregarding the terms of free software licenses. In the
process, they built a tool which facilitates the large-scale laundering of free
software into non-free software by their customers, to whom GitHub offers
plausible deniability through an inscrutable algorithm.

[^2]: Typically exploitative labor from low-development countries which the tech
  industry often pretends isn't a hair's breadth away from slavery.

Free software is not an unqualified gift. There are terms for its use and
re-use. Even so-called "liberal" software licenses impose requirements on
re-use, such as attribution. To quote the MIT license:

> Permission is hereby granted [...] subject to the following conditions:
>
> The above copyright notice and this permission notice shall be included in all
> copies or substantial portions of the Software.

Or the equally "liberal" BSD license:

> Redistribution and use in source and binary forms, with or without
> modification, are permitted provided that the following conditions are met:
>
> Redistributions of source code must retain the above copyright notice, this
> list of conditions and the following disclaimer.

On the other end of the spectrum, copyleft licenses such as the GNU General
Public License or the Mozilla Public License go further, demanding not only
attribution for derivative works, but that such derived works are *also
released* with the same license. Quoting the GPL:

> You may convey a work based on the Program, or the modifications to produce it
> from the Program, in the form of source code under the terms of section 4,
> provided that you also meet all of these conditions:
>
> [...]
>
> You must license the entire work, as a whole, under this License to anyone who
> comes into possession of a copy.

And the MPL:

> All distribution of Covered Software in Source Code Form, including any
> Modifications that You create or to which You contribute, must be under the
> terms of this License. You must inform recipients that the Source Code Form of
> the Covered Software is governed by the terms of this License, and how they
> can obtain a copy of this License. You may not attempt to alter or restrict
> the recipients' rights in the Source Code Form.

Free software licenses impose obligations on the user through terms governing
attribution, sublicensing, distribution, patents, trademarks, and relationships
with laws like the Digital Millennium Copyright Act. The free software community
is no stranger to the difficulties in enforcing compliance with these
obligations, which some groups view as too onerous. But as onerous as one may
view these obligations to be, one is nevertheless required to comply with them.
If you believe that the force of copyright should protect your proprietary
software, then you must agree that it equally protects open source works,
despite the inconvenience or cost associated with this truth.

GitHub's Copilot is trained on software governed by these terms. It fails to
uphold them, and it enables customers to accidentally fail to uphold these terms
themselves. Some argue about the risks of a "copyleft surprise", wherein someone
incorporates a GPL-licensed work into their product and is surprised to find
that they are obligated to release their product under the terms of the GPL as
well. Copilot institutionalizes this risk, and any user who wishes to use it to
develop non-free software would be well-advised not to do so, else they may find
themselves legally liable to uphold these terms, perhaps ultimately being
required to release their works under the terms of a license which is
undesirable for their goals.

Essentially, the argument comes down to whether or not the model constitutes a
derivative work of its inputs. Microsoft argues that it does not. However, these
licenses are not specific regarding the means of derivation; the classic
approach of copying and pasting from one project to another need not be the only
means for these terms to apply. The model exists as the result of applying an
algorithm to these inputs, and thus the model itself is a derivative work of its
inputs. The model, then used to create new programs, forwards its obligations to
those works.

All of this assumes the best interpretation of Microsoft's argument, with a
heavy reliance on the fact that the model becomes a general purpose programmer,
having meaningfully learned from its inputs and applying this knowledge to
produce original work. Should a human programmer take the same approach,
studying free software and applying those lessons, but not the code itself, to
original projects, I would agree that their applied knowledge is not creating
derivative works. However, that is not how machine learning works. Machine
learning is essentially a glorified pattern recognition and reproduction engine,
and does not represent a genuine generalization of the learning process. It is
perhaps capable of a limited amount of originality, but is also capable of
degrading to the simple case of copy and paste. Here is an example of Copilot
reproducing, verbatim, a function which is governed by the GPL, and would thus
be governed by its terms:

<video
src="https://mirror.drewdevault.com/copilot-squareroot.mp4"
muted autoplay controls></video>

<div class="text-center">
<small>
Source: <a href="https://twitter.com/mitsuhiko/status/1410886329924194309">
Armin Ronacher</a> via Twitter
</small>
</div>

The license reproduced by Copilot is not correct in either form or function.
This code was not written by V. Petkov and the GPL imposes much stronger
obligations than those suggested by the comment. This small example was
deliberately provoked with a suggestive prompt (this famous function is known as
the "[fast inverse square root]") and the prefix "float Q_", but it's not a
stretch to assume someone can accidentally do something similar with any
particularly unlucky English-language description of their goal.

[fast inverse square root]: https://en.wikipedia.org/wiki/Fast_inverse_square_root

Of course, the use of a suggestive prompt to convince Copilot to print
GPL-licensed code suggests another use: deliberately laundering FOSS source
code. If Microsoft's argument holds, then indeed the only thing which is
necessary to legally circumvent a free software license is to teach a machine
learning algorithm to regurgitate a function you want to use.

This is a problem. I have two suggestions to offer to two audiences: one for
GitHub, and another for free software developers who are worried about Copilot.

To GitHub: this is your Oracle v. Google moment. You've invested in building a
platform on top of which the open source revolution was built, and leveraging
this platform for this move is a deep betrayal of the community's trust. The law
applies to you, and banking on the fact that the decentralized open source
community will not be able to mount an effective legal challenge to your $7.5B
Microsoft war chest does not change this. The open source community is
astonished, and the astonishment is slowly but surely boiling over into rage as
our concerns fall on deaf ears and you push forward with the Copilot release. I
expect that if the situation does not change, you will find a group motivated
enough to challenge this. The legitimacy of the free software ecosystem may rest
on this problem, and there are many companies who are financially incentivized
to see to it that this legitimacy stands. I am certainly prepared to join a
class action lawsuit as a maintainer, or alongside other companies with
interests in free software, making use of our financial resources to facilitate
a lawsuit.

The tool can be improved, probably still in time to avoid the most harmful
effects (harmful to your business, that is) of Copilot. I offer the following
specific suggestions:

1. Allow GitHub users and repositories to opt out of being incorporated into the
   model. Better, allow them to opt in. Do not tie this flag into unrelated
   projects like Software Heritage and the Internet Archive.
2. Track the software licenses which are incorporated into the model and inform
   users of their obligations with respect to those licenses.
3. Remove copyleft code from the model entirely, unless you want to make the
   model and its support code free software as well.
4. Consider compensating the copyright owners of free software projects
   incorporated into the model with a margin from the Copilot usage fees, in
   exchange for a license permitting this use.

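
To make suggestions 1 through 3 a little more concrete, here is a minimal
Python sketch of what such a filter might look like. Everything in it is an
assumption made for illustration: the `Repo` record, the `opted_in` flag, the
function names, and the obligations table (simplified summaries paraphrased
from the license excerpts quoted above) are hypothetical, and none of it
reflects how GitHub actually assembles its training data.

```python
from dataclasses import dataclass

# Hypothetical table: SPDX license identifier -> simplified obligations,
# paraphrased from the license excerpts quoted earlier in this post.
OBLIGATIONS = {
    "MIT": ["include the copyright and permission notice in copies"],
    "BSD-2-Clause": ["retain the copyright notice, conditions and disclaimer"],
    "GPL-3.0-only": ["license the entire work, as a whole, under the GPL"],
    "MPL-2.0": ["distribute covered source under the MPL and say so"],
}

# Hypothetical set of licenses treated as copyleft for suggestion 3.
COPYLEFT = {"GPL-3.0-only", "MPL-2.0"}


@dataclass(frozen=True)
class Repo:
    """Hypothetical record for one repository considered for training."""
    name: str
    spdx_license: str
    opted_in: bool


def select_training_corpus(repos, model_is_free_software=False):
    """Apply suggestions 1 and 3: opt-in only, no copyleft in a non-free model."""
    selected = []
    for repo in repos:
        if not repo.opted_in:
            continue  # suggestion 1: only repositories which opted in
        if repo.spdx_license not in OBLIGATIONS:
            continue  # unknown terms: we cannot tell users their obligations
        if repo.spdx_license in COPYLEFT and not model_is_free_software:
            continue  # suggestion 3: drop copyleft unless the model is free
        selected.append(repo)
    return selected


def obligations_for(corpus):
    """Apply suggestion 2: collect the obligations users must be told about."""
    return {repo.spdx_license: OBLIGATIONS[repo.spdx_license] for repo in corpus}


if __name__ == "__main__":
    repos = [
        Repo("a/mit-lib", "MIT", True),
        Repo("b/gpl-tool", "GPL-3.0-only", True),
        Repo("c/not-asked", "MIT", False),
    ]
    corpus = select_training_corpus(repos)
    print([repo.name for repo in corpus])
    print(obligations_for(corpus))
```

The sketch deliberately errs on the side of exclusion: a repository with a
license it does not recognize is left out rather than guessed at.
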
Your current model probably needs to be thrown out. The GPL code incorporated
into it entitles anyone who uses it to receive a GPL'd copy of the model for
their own use. It entitles these people to commercial use, to build a competing
product with it. But it presumably also includes works under incompatible
licenses, such as the CDDL, which is... problematic. The whole thing is a legal
mess.

I cannot speak for the rest of the community who have been hurt by this
project, but for my part, I would be okay with not pursuing the answers to any
of these questions with you in court if you agreed to resolve these problems
now.

And now my advice to free software maintainers who are pissed that their
licenses are being ignored. First, don't use GitHub and your code will not make
it into the model (for now). [I've written before] about why it's generally
important for free software projects to use free software infrastructure, and
this only reinforces that fact. Furthermore, the old "vote with your wallet"
approach is a good way to show your disfavor. That said, if it occurs to you
that you *don't* actually pay for GitHub, then you may want to take a moment to
consider if the incentives created by that relationship explain this development
and may lead to more unfavorable outcomes for you in the future.

[I've written before]: https://drewdevault.com/2022/03/29/free-software-free-infrastructure.html

You may also be tempted to solve this problem by changing your software licenses
to prohibit this behavior. I'll say upfront that according to Microsoft's
interpretation of the situation (invoking fair use), it doesn't matter to them
which license you use: they'll use your code regardless. In fact, [some
proprietary code] was found to have been incorporated into the model. However, I
still support your efforts to address this in your software licenses, as it
provides an even stronger legal foundation upon which we can reject Copilot.

[some proprietary code]: https://twitter.com/ChrisGr93091552/status/1539731632931803137

I will caution you that the way you approach that clause of your license is
important. Whenever writing or changing a free and open source software license,
you should consider whether or not it will still qualify as free or open source
after your changes. To be specific, a clause which outright forbids the use of
your code for training a machine learning model will make your software
*non-free*, and I do not recommend this approach. Instead, I would update your
licenses to clarify that incorporating the code into a machine learning model is
considered a form of derived work, and that your license terms apply to the
model and any works produced with that model.

To summarize, I think that GitHub Copilot is a bad idea as designed. It
represents a flagrant disregard of FOSS licensing in and of itself, and it
enables similar disregard, deliberate or otherwise, among its users. I
hope GitHub will heed my suggestions, and I hope that my words to the free
software community offer some concrete ways to move forward with this problem.