drewdevault.com

[mirror] blog and personal website of Drew DeVault git clone https://hacktivis.me/git/mirror/drewdevault.com.git

BARE-message-encoding.md (8111B)


---
date: 2020-06-21
layout: post
title: Introducing the BARE message encoding
---

I like stateless tokens. We started with state*ful* tokens, where a generated
string acts as a unique identifier for a resource, and the resource itself is
looked up separately. For example, your sr.ht OAuth token is a stateful token:
we just generate a random number and hand it to you, something like
"a97c4aeeec705f81539aa". To find the information associated with this token, we
query the database, our local *state*.

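As a rough illustration of the stateful approach (not sr.ht's actual code), the sketch below issues a random token and keeps the associated state in a map standing in for the database. The `session` type, `store`, `issueToken`, and `lookup` are all hypothetical names.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// session is the state kept server-side; the token carries none of it.
type session struct {
	UserID int
	Scopes []string
}

// store stands in for the database: token -> state.
var store = map[string]session{}

// issueToken generates 16 random bytes, hex-encodes them, and records the
// associated state under that key.
func issueToken(s session) (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	token := hex.EncodeToString(buf)
	store[token] = s
	return token, nil
}

// lookup resolves a token by querying the store.
func lookup(token string) (session, bool) {
	s, ok := store[token]
	return s, ok
}

func main() {
	token, err := issueToken(session{UserID: 42, Scopes: []string{"profile:read"}})
	if err != nil {
		panic(err)
	}
	s, ok := lookup(token)
	fmt.Println(len(token), ok, s.UserID) // 32 true 42
}
```

The token stays short no matter how much state is attached to it, but every use costs a lookup.
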
<a href="#announcement">
Click here to skip the context and read the actual announcement -&gt;
</a>

But, increasingly, we've been using stateless tokens, which are a bloody good
idea. The idea is that, instead of using random numbers, you encode the actual
state you need into the token. For example, your sr.ht login session cookie is a
JSON blob which is encrypted and base64 encoded. Rather than associating your
session with a record in the database, we just decrypt the cookie when your
browser sends it to us, and the session information is right there. This
improves performance and simplicity in a single stroke, which is a huge win in
my book.

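Here's a minimal sketch of the idea: encode the state as JSON, encrypt it, and base64 the result. Note the swap: sr.ht uses Fernet, while this example uses AES-GCM so it can stay inside the Go standard library. The `session` fields and the `seal`/`open` helpers are made up for illustration.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

// session is the state carried inside the token itself.
type session struct {
	UserID  int   `json:"user_id"`
	Expires int64 `json:"expires"`
}

// newGCM builds an AES-GCM AEAD from a 16, 24, or 32 byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encodes the session as JSON, encrypts it, and base64url-encodes
// nonce || ciphertext || tag.
func seal(key []byte, s session) (string, error) {
	plaintext, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	gcm, err := newGCM(key)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return "", err
	}
	sealed := gcm.Seal(nonce, nonce, plaintext, nil)
	return base64.URLEncoding.EncodeToString(sealed), nil
}

// open reverses seal; it fails if the token was modified or the key differs.
func open(key []byte, token string) (session, error) {
	var s session
	raw, err := base64.URLEncoding.DecodeString(token)
	if err != nil {
		return s, err
	}
	gcm, err := newGCM(key)
	if err != nil {
		return s, err
	}
	if len(raw) < gcm.NonceSize() {
		return s, errors.New("token too short")
	}
	nonce, ciphertext := raw[:gcm.NonceSize()], raw[gcm.NonceSize():]
	plaintext, err := gcm.Open(nil, nonce, ciphertext, nil)
	if err != nil {
		return s, err
	}
	err = json.Unmarshal(plaintext, &s)
	return s, err
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	token, err := seal(key, session{UserID: 42, Expires: 1600000000})
	if err != nil {
		panic(err)
	}
	s, err := open(key, token)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(token), s.UserID)
}
```

No database is involved in `open`: everything needed to reconstruct the session is in the token, and the only server-side secret is the key.
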
There is one big problem, though: stateless tokens tend to be a lot larger than
their stateful counterparts. For a stateful token, we just need to generate
enough random numbers to be both unique and unpredictable, and then store the
rest of the data elsewhere. Not so for a stateless token, whose length is a
function of the amount of state which has been sequestered into it. Here's an
example: the cursor fields on the new GraphQL APIs are stateless. This is one of
them:

gAAAAABe7-ysKcvmyavwKIT9k1uVLx_GXI6OunjFIHa3OJmK3eBC9NT6507PBr1WbuGtjlZSTYLYvicH2EvJXI1eAejR4kuNExpwoQsogkE9Ua6JhN10KKYzF9kJKW0hA_-737NurotB

A whopping 140 characters long! It's hardly convenient to lug this monster
around. Most of the time it'll be programs doing the carrying, but it's still
annoying when you're messing with the API and debugging your programs. This
isn't an isolated example, either: these stateless tokens tend to be large
throughout sr.ht.

In general, JSON messages are pretty bulky. They represent everything as text,
which can take twice the space of a binary encoding for certain kinds of data
right off the bat. They're also self-describing: the schema of the message is
encoded into the message itself; that is, the names of fields, hierarchy of
objects, and data types.

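To put rough numbers on this, the sketch below encodes the same two integers as JSON and as fixed-width binary using only the Go standard library. The `cursor` struct is hypothetical, and the binary layout here is plain little-endian, not BARE.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/json"
	"fmt"
)

// cursor is a made-up message with two fixed-width integer fields.
type cursor struct {
	Count  uint32 `json:"count"`
	Offset uint64 `json:"offset"`
}

// jsonSize returns the length of the JSON encoding, field names included.
func jsonSize(c cursor) int {
	b, err := json.Marshal(c)
	if err != nil {
		panic(err)
	}
	return len(b)
}

// binarySize returns the length of a packed little-endian encoding:
// 4 bytes for Count plus 8 bytes for Offset, with no field names.
func binarySize(c cursor) int {
	var buf bytes.Buffer
	if err := binary.Write(&buf, binary.LittleEndian, c); err != nil {
		panic(err)
	}
	return buf.Len()
}

func main() {
	c := cursor{Count: 4294967295, Offset: 18446744073709551615}
	fmt.Println(jsonSize(c), binarySize(c)) // 50 12
}
```

Part of the gap is decimal digits versus raw bytes, and the rest is the field names and punctuation that a self-describing format carries in every message.
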
There are many alternatives that attempt to address this problem, and I
considered many of them. Here are a select few of my conclusions:

- [protobuf](https://developers.google.com/protocol-buffers/): too
  complicated and too fragile, and I've never been fond of the generated code
  for protobufs in any language. Writing a third-party protobuf implementation
  would be a gargantuan task, and there's no standard. RPC support is also
  undesirable for this use-case.
- [Cap'n Proto](https://capnproto.org/): fixed width, alignment, and so on
  &mdash; good for performance, bad for message size. Too complex. RPC support
  is also undesirable for this use-case. I also passionately hate C++ and I
  cannot in good faith consider something which makes it its primary target.
- [BSON](http://bsonspec.org/): MongoDB implementation details have leaked into
  the specification, and it's extensible in the worst way. I appreciate that
  JSON is a closed spec and no one is making vendor extensions for it &mdash;
  and, similarly, a diverse extension ecosystem is not something I want to see
  for this technology. Additionally, encoding schema into the message is wasting
  space.
- [MessagePack](https://msgpack.org/): ruled out for similar reasons: too much
  extensibility, and the schema is encoded into the message, wasting space.
- [CBOR](https://cbor.io/): ruled out for similar reasons: too much
  extensibility, and the schema is encoded into the message. Has the advantage
  of a specification, but the disadvantage of that spec being 54 pages long.

There were others, but hopefully this should give you an idea of what I was
thinking about when evaluating my options.

There doesn't seem to be anything which meets my criteria just right:

- Optimized for small messages
- Standardized
- Easy to implement
- Universal &mdash; little to no support for extensions
- Simple &mdash; no extra junk that isn't contributing to the core mission

The solution is evident.

[![xkcd comic 927, "Standards"](https://imgs.xkcd.com/comics/standards.png)](https://xkcd.com/927)

<a id="announcement"></a>

## BARE: Binary Application Record Encoding

[BARE](https://baremessages.org) meets all of the criteria:

- **Optimized for small messages**: messages are binary, not self-describing,
  and have no alignment or padding.
- **Standardized & simple**: the specification is just over 1,000 words &mdash;
  shorter than this blog post.
- **Easy to implement**: the first implementation (for Go) was done in a single
  weekend (this weekend, in fact).
- **Universal**: there is room for user extensibility, but it's done in a manner
  which does not require expanding the implementation nor making messages which
  are incompatible with other implementations.

Stateless tokens aren't the only messages that I've wanted a simple binary
encoding for. On many occasions I've evaluated and re-evaluated the same set of
existing solutions, and found none of them quite right. I hope that BARE will
help me solve many of these problems in the future, and I hope you find it
useful, too!

The cursor token I shared earlier in the article looks like this when encoded
with BARE:

gAAAAABe7_K9PeskT6xtLDh_a3JGQa_DV5bkXzKm81gCYqNRV4FLJlVvG3puusCGAwQUrKFLO-4LJc39GBFPZomJhkyqrowsUw==

100 characters (40 fewer than JSON), which happens to be the minimum size of a
padded [Fernet](https://github.com/fernet/spec/) message. If we compare only the
cleartext:

JSON: eyJjb3VudCI6MjUsIm5leHQiOiIxMjM0NSIsInNlYXJjaCI6bnVsbH0=
BARE: EAUxMjM0NQA=

Much improved!

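The 100-character floor follows from the Fernet token layout. The sketch below works through the arithmetic, assuming the layout given in the Fernet spec: a 1-byte version, an 8-byte timestamp, a 16-byte IV, AES-CBC ciphertext with PKCS#7 padding, and a 32-byte HMAC, all base64url-encoded with padding. `fernetTokenLen` and `decodedLen` are names made up for this example.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// fernetTokenLen returns the length in characters of a Fernet token that
// carries n bytes of cleartext.
func fernetTokenLen(n int) int {
	padded := (n/16 + 1) * 16       // PKCS#7 always adds 1 to 16 bytes
	raw := 1 + 8 + 16 + padded + 32 // version, timestamp, IV, ciphertext, HMAC
	return base64.URLEncoding.EncodedLen(raw)
}

// decodedLen returns the number of bytes behind a standard base64 string.
func decodedLen(s string) int {
	b, err := base64.StdEncoding.DecodeString(s)
	if err != nil {
		panic(err)
	}
	return len(b)
}

func main() {
	jsonLen := decodedLen("eyJjb3VudCI6MjUsIm5leHQiOiIxMjM0NSIsInNlYXJjaCI6bnVsbH0=")
	bareLen := decodedLen("EAUxMjM0NQA=")
	fmt.Println(jsonLen, fernetTokenLen(jsonLen)) // 41 140
	fmt.Println(bareLen, fernetTokenLen(bareLen)) // 8 100
}
```

Any cleartext shorter than 16 bytes fits in a single cipher block, so it produces the same 100-character token; at 16 bytes the token grows to 120 characters.
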
BARE also has an optional schema language for defining your message structure.
Here's a sample:

```
type PublicKey data<128>
type Time string # ISO 8601

enum Department {
  ACCOUNTING
  ADMINISTRATION
  CUSTOMER_SERVICE
  DEVELOPMENT

  # Reserved for the CEO
  JSMITH = 99
}

type Customer {
  name: string
  email: string
  address: Address
  orders: []{
    orderId: i64
    quantity: i32
  }
  metadata: map[string]data
}

type Employee {
  name: string
  email: string
  address: Address
  department: Department
  hireDate: Time
  publicKey: optional<PublicKey>
  metadata: map[string]data
}

type Person (Customer | Employee)

type Address {
  address: [4]string
  city: string
  state: string
  country: string
}
```

You can feed this into a code generator and get types which can encode & decode
these messages. But, you can also describe your schema just using your
language's existing type system, like this:

```go
package main

import (
	"fmt"

	"git.sr.ht/~sircmpwn/go-bare"
)

type Coordinates struct {
	X uint  // uint
	Y uint  // uint
	Z uint  // uint
	Q *uint // optional<uint>
}

func main() {
	var coords Coordinates
	payload := []byte{0x01, 0x02, 0x03, 0x01, 0x04}
	err := bare.Unmarshal(payload, &coords)
	if err != nil {
		panic(err)
	}
	fmt.Printf("coords: %d, %d, %d (%d)\n", /* coords: 1, 2, 3 (4) */
		coords.X, coords.Y, coords.Z, *coords.Q)
}
```

Bonus: you can get the schema language definition for this struct with
`schema.SchemaFor(coords)`.

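To make the wire format concrete, here is a hand-rolled decoder for the same five-byte payload using only the Go standard library. It's a sketch that assumes the encoding rules of the draft spec as understood at the time of writing: a `uint` is a variable-length LEB128 integer (the format Go's `encoding/binary` calls a uvarint), and `optional<T>` is a single byte, 0 or 1, followed by the value when set. `decodeCoordinates` is a hypothetical helper and not part of go-bare, and the fields are `uint64` here to match `binary.ReadUvarint`.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// Coordinates mirrors the struct above, with uint64 fields.
type Coordinates struct {
	X, Y, Z uint64
	Q       *uint64 // nil when the optional is not set
}

// decodeCoordinates reads three uints and one optional<uint> in field order.
func decodeCoordinates(payload []byte) (Coordinates, error) {
	var c Coordinates
	var err error
	r := bytes.NewReader(payload)
	if c.X, err = binary.ReadUvarint(r); err != nil {
		return c, err
	}
	if c.Y, err = binary.ReadUvarint(r); err != nil {
		return c, err
	}
	if c.Z, err = binary.ReadUvarint(r); err != nil {
		return c, err
	}
	// The optional starts with a presence byte: 0 for unset, 1 for set.
	flag, err := r.ReadByte()
	if err != nil {
		return c, err
	}
	switch flag {
	case 0:
		// Unset: nothing follows.
	case 1:
		q, err := binary.ReadUvarint(r)
		if err != nil {
			return c, err
		}
		c.Q = &q
	default:
		return c, errors.New("invalid optional flag")
	}
	return c, nil
}

func main() {
	c, err := decodeCoordinates([]byte{0x01, 0x02, 0x03, 0x01, 0x04})
	if err != nil {
		panic(err)
	}
	fmt.Printf("coords: %d, %d, %d (%d)\n", c.X, c.Y, c.Z, *c.Q) // coords: 1, 2, 3 (4)
}
```

There are no field names, type tags, or padding in the payload: the decoder only works because both sides agree on the schema ahead of time.
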
## BARE is under development

There are some possible changes that could come to BARE before the
specification is finalized. Here are some questions I'm thinking about:

- Should the schema language include support for arbitrary annotations to
  inform code generators? I'm inclined to think "no", but if you use BARE and
  find yourself wishing for this, tell me about it.
- Should BARE have first-class support for bitfield enums?
- Should maps be ordered?

[Feedback welcome](mailto:~sircmpwn/public-inbox@lists.sr.ht)!

**Errata**

- This article was originally based on an older version of the draft
  specification, and was updated accordingly.