drewdevault.com

[mirror] blog and personal website of Drew DeVault git clone https://hacktivis.me/git/mirror/drewdevault.com.git

---
title: Is GitHub a derivative work of GPL'd software?
date: 2021-07-04
outputs: [html, gemtext]
---

GitHub recently announced [Copilot][0], a tool which uses machine learning to
provide code suggestions, inciting no small degree of controversy. One
particular facet of the ensuing discussion piques my curiosity: what happens if
the model was trained using software licensed with the GNU General Public
License?

*Disclaimer: I am the founder of a company which competes with GitHub.*

[0]: https://copilot.github.com

The GPL is among a family of licenses considered "copyleft", which are
characterized by their "viral" nature. In particular, the trait common to
copyleft licenses is the requirement that "derivative works" be published under
the same terms as the original copyleft license. Some weak copyleft licenses,
like the Mozilla Public License, apply only to changes to specific files from
the original code. Stronger licenses like the GPL family affect the broader
work that any GPL'd code has been incorporated into.

[A recent tweet by @mitsuhiko][tweet 1] notes that Copilot can be caused to
produce, verbatim, the famous fast inverse square root function from Quake III
Arena: a codebase distributed under the GNU GPL 2.0 license. This raises an
interesting legal question: is the work produced by a machine learning system,
or even the machine learning system itself, a derivative work of the inputs to
the model? [Another tweet][tweet 2] suggests that, if the answer is "no",
GitHub Copilot can be used as a means of washing the GPL off of code you want to
use without obeying its license. But what if the answer is "yes"?

[tweet 1]: https://twitter.com/mitsuhiko/status/1410886329924194309
[tweet 2]: https://twitter.com/eevee/status/1410037309848752128

I won't take a position on this question[^1], but I will point out something
interesting: if the answer is "*yes*, machine learning models create derivative
works of their inputs", then GitHub may itself now be considered a derivative
work of copyleft software. Consider this statement from GitHub's blog post on
the subject:

[^1]: Though I definitely have one 😉

> During GitHub Copilot’s early development, nearly 300 employees used it in
> their daily work as part of an internal trial.

— [Albert Ziegler: A first look at rote learning in GitHub Copilot suggestions](https://docs.github.com/en/github/copilot/research-recitation)

If 300 GitHub employees used Copilot as part of their daily workflow, they are
likely to have incorporated the output of Copilot into nearly every software
property of GitHub, which provides network services to users. If the model was
trained on software using the GNU Affero General Public License (AGPL), and the
use of this model created a derivative work, this may entitle all GitHub users
to receive a copy of GitHub's source code under the terms of the AGPL,
effectively forcing GitHub to become an open source project. I'm normally
against GPL enforcement by means of pulling the rug out from underneath someone
who made an honest mistake[^2], but in this case it would certainly be a
fascinating case of comeuppance.

[^2]: I support GPL enforcement, but I think we would be wise to equip users with a clear understanding of what our license entails, so that those mistakes are less likely to happen in the first place.

Following the Copilot announcement, many of the ensuing discussions hinted to me
at a broader divide in the technology community with respect to machine
learning. I've seen many discussions wrestle with philosophical differences
between participants, who give different answers to more fundamental questions
regarding the ethics of machine learning: what rights should be, and are,
afforded to the owners of the content which is incorporated into training data
for machine learning? If I want to publish a work which I *don't* want to be
incorporated into a model, or which, if used for a model, would entitle the
public to access to that model, could I? Ought I be allowed to? What if the work
being used is my personal information, collected without my knowledge or
consent? What if the information is used against me, for example in making
lending decisions? What if it's used against society's interests at large?

The differences of opinion I've seen in the discussions born from this
announcement seem to suggest a substantial divide over machine learning, which
the tech community may have yet to address, or even understand the depth of. I
predict that GitHub Copilot will mark one of several inciting events which start
to rub some of the glamour off of machine learning technology and get us
thinking about the ethical questions it presents.[^3]

[^3]: I also predict that capitalism will do that thing it normally does and sweep all of the ethics under the rug in any scenario in which addressing the problem would call their line of business into doubt, ultimately leaving the dilemma uncomfortably unresolved as most of us realize it's a dodgy ethical situation while simultaneously being paid to not think about it too hard.