logo

drewdevault.com

[mirror] blog and personal website of Drew DeVault git clone https://hacktivis.me/git/mirror/drewdevault.com.git

Understanding-pointers.md (10272B)


  1. ---
  2. date: 2016-05-28
  3. # vim: tw=80
  4. layout: post
  5. title: Understanding pointers
  6. tags: [C, instructive]
  7. ---
  8. <style>
  9. table {
  10. border-spacing: 0;
  11. width: 100%;
  12. }
  13. th, td {
  14. border-bottom: 1px solid black;
  15. }
  16. </style>
  17. I was recently chatting with a new contributor to Sway who is using the project
  18. as a means of learning C, and he had some questions about what `void**` meant
  19. when he found some in the code. It became apparent that this guy only has a
  20. basic grasp on pointers at this point in his learning curve, and I figured it
  21. was time for another blog post - so today, I'll explain pointers.
  22. To understand pointers, you must first understand how memory works. Your RAM is
  23. basically a flat array of
  24. [octets](https://en.wikipedia.org/wiki/Octet_(computing)). Your compiler
  25. describes every data structure you use as a series of octets. For the context of
  26. this article, let's consider the following memory:
  27. | 0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 |
  28. |:-------|:-------|:-------|:-------|:-------|:-------|:-------|:-------|
  29. | 0x00 | 0x00 | 0x00 | 0x00 | 0x08 | 0x42 | 0x00 | 0x00 |
  30. We can refer to each element of this array by its index, or address. For
  31. example, the value at address 0x0004 is 0x08. On this system, we're using 16-bit
  32. addresses to refer to 8-bit values. On an i686 (32-bit) system, we use 32-bit
  33. addresses to refer to 8-bit values. On an amd64 (64-bit) system, we use 64-bit
  34. addresses to refer to 8-bit values. On Notch's imaginary DCPU-16 system, we use
  35. 16-bit addresses to refer to 16-bit values.
  36. To refer to the value at 0x0004, we can use a pointer. Let's declare it like so:
  37. ```c
  38. uint8_t *value = (uint8_t *)0x0004;
  39. ```
  40. Here we're declaring a variable named value, whose type is `uint8_t*`. The *
  41. indicates that it's a pointer. Now, because this is a 16-bit system, the size of
  42. a pointer is 16 bits. If we do this:
  43. ```c
  44. printf("%d\n", sizeof(value));
  45. ```
  46. It will print 2, because it takes 16-bits (or 2 bytes) to refer to an address on
  47. this system, even though the value there is 8 bits. On your system it would
  48. probably print 8, or maybe 4 if you're on a 32-bit system. We could also do this:
  49. ```c
  50. uint16_t address = 0x0004;
  51. uint8_t *ptr = (uint8_t *)address;
  52. ```
  53. In this case we're not casting the `uint16_t` value 0x0004 to a `uint8_t`, which
  54. would truncate the integer. No, instead, we're casting it to a `uint8_t*`, which
  55. is the size required to represent a pointer on this system. All pointers are the
  56. same size.
  57. ## Dereferencing pointers
  58. We can refer to the value at the other end of this pointer by *dereferencing* it.
  59. The pointer is said to contain a *reference* to a value in memory. By
  60. *dereferencing* it, we can obtain that value. For example:
  61. ```c
  62. uint8_t *value = (uint8_t *)0x0004;
  63. printf("%d\n", *value); // prints 8
  64. ```
  65. ## Working with multi-byte values
  66. Even though memory is basically a big array of `uint8_t`, thankfully we can work
  67. with other kinds of data structures inside of it. For example, say we wanted to
  68. store the value 0x1234 in memory. This doesn't fit in 8 bits, so we need to
  69. store it at two different addresses. For example, we could store it at 0x0006
  70. and 0x0007:
  71. | 0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 |
  72. |:-------|:-------|:-------|:-------|:-------|:-------|:-------|:-------|
  73. | 0x00 | 0x00 | 0x00 | 0x00 | 0x08 | 0x42 | 0x34 | 0x12 |
  74. *0x0007 makes up the first byte of the value, and *0x0006 makes up the second
  75. byte of the value.
  76. <div class="well">
  77. Why not the other way around? Well, most systems these days use the "little
  78. endian" notation for storing multi-byte integers in memory, which stores the
  79. least significant byte first. The least significant byte is the one with the
  80. smallest order of magnitude (in base sixteen). To get the final number, we
  81. use (0x12 * 0x100) + (0x34 * 0x1), which gives us 0x1234. Read more about
  82. endianness <a href="https://en.wikipedia.org/wiki/Endianness">here</a>.
  83. </div>
  84. C allows us to use pointers that refer to these sorts of composite values, like
  85. so:
  86. ```c
  87. uint16_t *value = (uint16_t *)0x0006;
  88. printf("0x%X\n", *value); // Prints 0x1234
  89. ```
  90. Here, we've declared a pointer to a value whose type is `uint16_t`. Note that the
  91. size of this pointer is the same size of the `uint8_t*` pointer - 16 bits, or
  92. two bytes. The value it *references*, though, is a different type than
  93. `uint8_t*` references.
  94. ## Indirect pointers
  95. Here comes the crazy part - you can work with pointers to pointers. The address
  96. of the `uint16_t` pointer we've been talking about is 0x0006, right? Well, we
  97. can store that number in memory as well. If we store it at 0x0002, our memory
  98. looks like this:
  99. | 0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 |
  100. |:-------|:-------|:-------|:-------|:-------|:-------|:-------|:-------|
  101. | 0x00 | 0x00 | 0x06 | 0x00 | 0x08 | 0x42 | 0x34 | 0x12 |
  102. The question might then become, how do we get it out again? Well, we can use a
  103. pointer *to that pointer*! Check out this code:
  104. ```c
  105. uint16_t **pointer_to_a_pointer = (uint16_t**)0x0002;
  106. ```
  107. This code just declared a variable whose type is `uint16_t**`, which a pointer
  108. whose value is a `uint16_t*`, which itself points to a value that is a
  109. `uint16_t`. Pretty cool, huh? We can dereference this too:
  110. ```c
  111. uint16_t **pointer_to_a_pointer = (uint16_t**)0x0002;
  112. uint16_t *pointer = *pointer_to_a_pointer;
  113. printf("0x%X\n", *pointer); // Prints 0x1234
  114. ```
  115. We don't actually even need the intermediate variable. This works too:
  116. ```c
  117. uint16_t **pointer_to_a_pointer = (uint16_t**)0x0002;
  118. printf("0x%X\n", **pointer_to_a_pointer); // Prints 0x1234
  119. ```
  120. ## Void pointers
  121. The next question that would come up to your average C programmer would be,
  122. "well, what is a `void*`?" Well, remember earlier when I said that all pointers,
  123. regardless of the type of value they reference, are just fixed size integers?
  124. In the imaginary system we've been talking about, pointers are 16-bit addresses,
  125. or indexes, that refer to places in RAM. On the system you're reading this
  126. article on, it's probably a 64-bit integer. Well, we don't actually need to
  127. specify the type to be able to manipulate pointers if they're just a fixed size
  128. integer - so we don't have to. A `void*` stores an arbitrary address without
  129. bringing along any type information. You can later *cast* this variable to a
  130. specific kind of pointer to dereference it. For example:
  131. ```c
  132. void *pointer = (void*)0x0006;
  133. uint8_t *uintptr = (uint8_t*)pointer;
  134. printf("0x%X", *uintptr); // prints 0x34
  135. ```
  136. Take a closer look at this code, and recall that 0x0006 refers to a 16-bit value
  137. from the previous section. Here, though, we're treating it as an 8-bit value -
  138. the `void*` contains no assumptions about what kind of data is there. The result
  139. is that we end up treating it like an 8-bit integer, which ends up being the
  140. least significant byte of 0x1234;
  141. ## Dereferencing structures
  142. In C, we often work with structs. Let's describe one to play with:
  143. ```c
  144. struct coordinates {
  145. uint16_t x, y;
  146. struct coordinates *next;
  147. };
  148. ```
  149. Our structure describes a linked list of coordinates. X and Y are the
  150. coordinates, and next is a pointer to the next set of coordinates in our list.
  151. I'm going to drop two of these in memory:
  152. | 0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 |
  153. |:-------|:-------|:-------|:-------|:-------|:-------|:-------|:-------|
  154. | 0xAD | 0xDE | 0xEF | 0xBE | 0x06 | 0x00 | 0x34 | 0x12 |
  155. Let's write some C code to reason about this memory with:
  156. ```c
  157. struct coordinates *coords;
  158. coords = (struct coordinates*)0x0000;
  159. ```
  160. If we look at this structure in memory, you might already be able to pick out
  161. the values. C is going to store the fields of this struct in order. So, we can
  162. expect the following:
  163. ```c
  164. printf("0x%X, 0x%X", coords->x, coords->y);
  165. ```
  166. To print out "0xDEAD, 0xBEEF". Note that we're using the structure dereferencing
  167. operator here, `->`. This allows us to dereference values *inside* of a
  168. structure we have a pointer to. The other case is this:
  169. ```c
  170. printf("0x%X, 0x-X", coords.x, coords.y);
  171. ```
  172. Which only works if `coords` is not a pointer. We also have a pointer within
  173. this structure named next. You can see in the memory I included above that its
  174. address is 0x0004 and its value is 0x0006 - meaning that there's another `struct
  175. coordinates` that lives at 0x0006 in memory. If you look there, you can see the
  176. first part of it. It's X coordinate is 0x1234.
  177. ## Pointer arithmetic
  178. In C, we can use math on pointers. For example, we can do this:
  179. ```c
  180. uint8_t *addr = (uint8_t*)0x1000;
  181. addr++;
  182. ```
  183. Which would make the value of `addr` 0x1001. But this is only true for pointers
  184. whose type is 1 byte in size. Consider this:
  185. ```c
  186. uint16_t *addr = (uint16_t*)0x1000;
  187. addr++;
  188. ```
  189. Here, `addr` becomes 0x1002! This is because ++ on a pointer actually adds
  190. `sizeof(type)` to the actual address stored. The idea is that if we only added
  191. one, we'd be referring to an address that is *in the middle* of a uint16_t,
  192. rather than the next uint16_t in memory that we meant to refer to. This is also
  193. how arrays work. The following two code snippets are equivalent:
  194. ```c
  195. uint16_t *addr = (uint16_t*)0x1000;
  196. printf("%d\n", *(addr + 1));
  197. ```
  198. ```c
  199. uint16_t *addr = (uint16_t*)0x1000;
  200. printf("%d\n", addr[1]);
  201. ```
  202. ## NULL pointers
  203. Sometimes you need to work with a pointer that points to something that may not
  204. exist yet, or a resource that has been freed. In this case, we use a NULL
  205. pointer. In the examples you've seen so far, 0x0000 is a valid address. This is
  206. just for simplicity's sake. In practice, pretty much no modern computer has
  207. any reason to refer to the value at address 0. For that reason, we use NULL to
  208. refer to an uninitialized pointer. Dereferencing a NULL pointer is generally a
  209. Bad Thing and will lead to segfaults. As a fun side effect, since NULL is 0, we
  210. can use it in an if statement:
  211. ```c
  212. void *ptr = ...;
  213. if (ptr) {
  214. // ptr is valid
  215. } else {
  216. // ptr is not valid
  217. }
  218. ```
  219. I hope you found this article useful! If you'd
  220. like something fun to read next, read about ["three star
  221. programmers"](http://c2.com/cgi/wiki?ThreeStarProgrammer), or programmers who
  222. have variables like `void***`.