drewdevault.com

[mirror] blog and personal website of Drew DeVault git clone https://hacktivis.me/git/mirror/drewdevault.com.git

---
date: 2017-01-13
# vim: tw=80
title: The only problem with Python 3's str is that you don't grok it
layout: post
tags: [python]
---

I've found myself explaining Python 3's str to people online more and more often
lately. There's this ridiculous claim going around that Python 3's string
handling is broken or somehow worse than Python 2's, and today I intend to put
that myth to rest. Python 2 strings are broken, and Python 3 strings are sane.
The only problem is that you don't grok strings.

The basic problem many people seem to have with Python 3's strings arises when
they write code that treats bytes like a string, because that's how it was in
Python 2. Let me make this as clear as possible:

<div class="loud">a bytes is not a string</div>
<style>
.loud {
  font-size: 14pt;
  font-weight: bold;
  text-align: center;
  margin-bottom: 1rem;
}
</style>

I want you to read that, over and over again, until it sinks in. A string is
basically an array of characters (characters being Unicode code points), whereas
bytes is an array of bytes, aka octets, aka unsigned 8-bit integers. That's
right - bytes is an array of unsigned 8-bit integers, or as the name would
imply, bytes. If you *ever* do string operations against bytes, you are Doing
It Wrong because bytes are not strings.

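To make the difference concrete, here is a minimal sketch in Python 3. The variable names `text` and `data` are just illustrative; the point is only what indexing and `len()` return for each type.

```python
# A str is a sequence of Unicode code points; bytes is a sequence of ints 0..255.
text = 'おはよう'
data = text.encode('utf-8')   # the same text, encoded as UTF-8 bytes

print(len(text))        # 4 code points
print(len(data))        # 12 bytes: each of these kana takes 3 bytes in UTF-8
print(text[0])          # indexing a str yields a one-character str
print(data[0])          # indexing bytes yields an int (227 == 0xe3)
print(list(data[:3]))   # the three ints that together encode the first character
```

Same text, two different types, and only one of them knows what a character is.
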
<div class="loud">a bytes is not a string</div>

It's entirely possible that your bytes contains an *encoded representation* of a
string. That encoding could be ASCII, UTF-8, UTF-32, etc. These encodings are
means of representing strings as bytes, aka unsigned 8-bit integers. In order to
treat it like a string, you first must *decode* it. Luckily Python 3 makes this
painless: `bytes.decode()`. This defaults to UTF-8, but you can specify any
encoding you want: `bytes.decode('latin-1')`. If you want bytes again, use
`str.encode()`, which again defaults to UTF-8 but accepts any encoding. If you
have a bytes that contains an encoded string, your first order of business is
decoding it.

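A short sketch of that round trip, assuming some input that happens to be latin-1 encoded (the `raw` value here is made up for illustration):

```python
raw = b'caf\xe9'               # "café" encoded as latin-1 (hypothetical input)

text = raw.decode('latin-1')   # bytes -> str, using the encoding the data is really in
assert text == 'café'

utf8 = text.encode()           # str -> bytes; encode() defaults to UTF-8
assert utf8 == b'caf\xc3\xa9'
assert utf8.decode() == text   # decode() defaults to UTF-8 as well

# These particular bytes are not valid UTF-8, so decoding them as UTF-8 raises
try:
    raw.decode('utf-8')
    decoded_as_utf8 = True
except UnicodeDecodeError as err:
    decoded_as_utf8 = False
    print('not valid UTF-8:', err)
```

Decode once at the boundary where the bytes come in, work with str in between, and encode once on the way out.
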
<div class="loud">a bytes is not a string</div>

Let's look at some examples of why this matters in practice:

```python
Python 3.6.0 (default, Dec 24 2016, 08:03:08)
[GCC 6.2.1 20160830] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 'おはようございます'
'おはようございます'
>>> 'おはようございます'[::-1]
'すまいざごうよはお'
>>> 'おはようございます'[0]
'お'
>>> 'おはようございます'[1]
'は'
```

Or in Python 2:

```python
Python 2.7.13 (default, Dec 21 2016, 07:16:46)
[GCC 6.2.1 20160830] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 'おはようございます'
'\xe3\x81\x8a\xe3\x81\xaf\xe3\x82\x88\xe3\x81\x86\xe3\x81\x94\xe3\x81\x96\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99'
>>> 'おはようございます'[::-1]
'\x99\x81\xe3\xbe\x81\xe3\x84\x81\xe3\x96\x81\xe3\x94\x81\xe3\x86\x81\xe3\x88\x82\xe3\xaf\x81\xe3\x8a\x81\xe3'
>>> print('おはようございます'[::-1])
㾁㄁㖁㔁ㆁ㈂㯁㊁ã
>>> 'おはようございます'[0]
'\xe3'
>>> 'おはようございます'[1]
'\x81'
```

For anything other than ASCII, Python 2 "strings" are broken. Python 3's string
handling is superb. The problem with it has only ever been that you don't
actually know how strings work. Instead of starting ignorant flamewars about it,
learn how it works.

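The Python 2 behaviour in that session can be reproduced in Python 3 by doing the slicing on the encoded bytes instead of on the str. This is only a sketch, and it assumes the Python 2 literal was UTF-8 encoded (which is what the escaped output in the session suggests); the variable names are illustrative.

```python
text = 'おはようございます'
data = text.encode('utf-8')   # roughly what the Python 2 str literal held

# Reversing the str reverses code points, so the result is still valid text
reversed_text = text[::-1]
print(reversed_text)

# Reversing the bytes scrambles every 3-byte sequence, so the result is no
# longer valid UTF-8 and cannot be decoded back into text
reversed_data = data[::-1]
try:
    reversed_data.decode('utf-8')
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
print(still_valid)
```

Python 3 didn't break anything here. It just refuses to pretend the reversed bytes are text.
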
## Actual examples people have given me

**"Python 3 can't handle bytes as file names"**

Yes it can. Just stop treating them like strings:

```python
>>> open(b'test-\xd8\x01.txt', 'w').close()
```

Note the use of bytes as the file name, not str. `\xd8\x01` is not valid UTF-8.

```python
>>> import os
>>> [open(f, 'r').close() for f in os.listdir(b'.')]
[None]
```

Note the use of bytes as the path to os.listdir (the documentation says that if
you want bytes back as file names, pass bytes as the path. The docs are helpful
like that). Also note the lack of crashes or broken behavior.

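Here is the same idea as a self-contained script rather than a REPL session. It is a sketch that assumes a Linux filesystem that accepts arbitrary bytes in file names (the sessions above were run on Linux); other platforms may reject this name. The temporary directory is only there so the example cleans up after itself.

```python
import os
import tempfile

name = b'test-\xd8\x01.txt'

# The name is not valid UTF-8, so a strict decode raises
try:
    name.decode('utf-8')
    decodable = True
except UnicodeDecodeError:
    decodable = False

# bytes path in, bytes file names out (assumes a Linux filesystem)
with tempfile.TemporaryDirectory() as tmp:
    tmp_b = os.fsencode(tmp)                       # directory path as bytes
    open(os.path.join(tmp_b, name), 'w').close()   # create the file via a bytes path
    listing = os.listdir(tmp_b)                    # bytes in -> list of bytes out

print(decodable)
print(listing)
```

At no point does the file name need to become a str, so there's nothing to decode and nothing to crash.
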
**"Python 3's csv module writes b'Hello',b'World' into CSV files"**

CSV files are "comma separated values". Is each value an array of unsigned 8-bit
integers? No, of course not. They're strings. So why would you pass an array of
unsigned 8-bit integers to it?

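A small sketch of both the complaint and the fix, writing to an in-memory buffer instead of a real file. The `row` values are hypothetical, and the sketch assumes they arrived as UTF-8 encoded bytes.

```python
import csv
import io

row = [b'Hello', b'World']   # hypothetical values that arrived as UTF-8 bytes

buf = io.StringIO()
writer = csv.writer(buf)

# Passing bytes: csv falls back to str(), so the repr of each bytes is written
writer.writerow(row)

# Decoding first: csv gets the strings it expects
writer.writerow([value.decode('utf-8') for value in row])

print(buf.getvalue())
```

The first row comes out as `b'Hello',b'World'` and the second as `Hello,World`. The csv module did exactly what it was asked to do both times.
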
**"Python 3 doesn't support writing files as latin-1"**

Sure it does.

```python
# Read raw bytes and decode them as latin-1 to get a str
with open('some latin-1 file', 'rb') as f:
    text = f.read().decode('latin-1')

# Encode the str and write raw bytes; use .encode('latin-1') to write latin-1
with open('some utf8 file', 'wb') as f:
    f.write(text.encode('utf-8'))
```

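If you'd rather not call `encode()` and `decode()` yourself, `open()` in text mode takes an `encoding` argument and does the conversion for you. A minimal sketch, using a temporary directory and a made-up file name so it can be run as-is:

```python
import os
import tempfile

text = 'déjà vu'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'latin1.txt')   # hypothetical file name

    # Text mode with an explicit encoding: open() encodes on write
    with open(path, 'w', encoding='latin-1') as f:
        f.write(text)

    # Binary mode shows the bytes that actually hit the disk
    with open(path, 'rb') as f:
        raw = f.read()

    # Text mode with the same encoding decodes on read
    with open(path, 'r', encoding='latin-1') as f:
        round_trip = f.read()

print(raw)
print(round_trip)
```

Either way, the file holds latin-1 bytes and your program holds a str.
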
<div class="loud">a bytes is not a string</div>
<div class="loud">a bytes is not a string</div>
<div class="loud">a bytes is not a string</div>

Python 2's shitty design has broken your mindset. Unlearn it.

## Python 2 is dead, long live Python 3

Listen. It's time you moved to Python 3. You're missing out on a lot of really
great improvements to the language and are stuck with a lot of problems. Python
2 is really being EoL'd, and closing your eyes and covering your ears singing
"la la la" doesn't change that. The transition is really not that difficult or
time consuming, and well worth it. Some people say only new projects should be
written in Python 3. I say that's bollocks - all projects should be written in
Python 3 and you need to migrate, *now*.

Python 3 is better. Much, much better. For every legitimate criticism of Python
3 I've seen, I've seen 10 that are bullshit. Come join us in the wonderful world
of sane string handling, type annotations, async/await, and more awesome
features. Every library supports it now. Let go of your biases and evaluate the
language honestly.