commit: dca7a4c59fcf1263266db22dc3f1e2c4828d6110
parent cc6ab4d761c8d3b79038991d81470fe13ae79884
Author: Haelwenn (lanodan) Monnier <contact@hacktivis.me>
Date: Mon, 29 Apr 2024 18:33:32 +0200
cmd/wc: Add caveat about codepoint vs. character
Diffstat:
1 file changed, 23 insertions(+), 3 deletions(-)
diff --git a/cmd/wc.1 b/cmd/wc.1
@@ -27,7 +27,7 @@ A word is defined as a non-empty string delimited by whitespace,
some other implementations choose to additionally exclude
non-printable characters.
.Sh OPTIONS
-.Bl -tag -width Ds
+.Bl -tag -width __
.It Fl c
Explicitly use single-byte mode, and write the number of bytes in each
.Ar file .
@@ -35,9 +35,29 @@ Explicitly use single-byte mode, and write the number of bytes in each
Write the number of newlines in each
.Ar file .
.It Fl m
-Switch to multibyte-characters mode, and write the number of characters in each
+Switch to multi-byte mode, and write the number of codepoints in each
.Ar file .
-The particular encoding is dependent on your locale.
+The encoding is dependent on the
+.Xr locale 1
+environment variables.
+.Pp
+Note that while codepoints are often close enough to characters,
+some characters are made of multiple codepoints.
+Also, by design,
+.Nm
+cannot count glyphs, as it does not render text.
+.Pp
+For example, with a decomposed é (e followed by a combining acute accent) in a
+.Ql C.UTF-8
+locale:
+.Bd -literal -compact
+$ printf '\\145\\314\\201\\n'
+é
+$ printf '\\145\\314\\201' | wc -c
+3
+$ printf '\\145\\314\\201' | wc -m
+2
+.Ed
.It Fl w
Write the number of words in each
.Ar file .
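
The byte versus codepoint distinction described for -m can be reproduced outside of wc. The following is a minimal Python sketch, not the implementation used by wc; the helper names count_bytes and count_codepoints are hypothetical, and UTF-8 is assumed as the encoding, matching the C.UTF-8 locale of the example.

```python
import unicodedata

# Same bytes as printf '\145\314\201' in the manual page example:
# 'e' (octal 145) followed by U+0301 COMBINING ACUTE ACCENT (octal 314 201).
DECOMPOSED_E_ACUTE = b"\x65\xcc\x81"


def count_bytes(data: bytes) -> int:
    """Rough analogue of `wc -c`: the number of bytes."""
    return len(data)


def count_codepoints(data: bytes, encoding: str = "utf-8") -> int:
    """Rough analogue of `wc -m`: the number of decoded codepoints.

    Assumes the input is valid in the given encoding (hypothetical default: UTF-8).
    """
    return len(data.decode(encoding))


if __name__ == "__main__":
    print(count_bytes(DECOMPOSED_E_ACUTE))       # 3
    print(count_codepoints(DECOMPOSED_E_ACUTE))  # 2

    # NFC normalization composes the pair into the single codepoint U+00E9,
    # so the same visible character can yield different codepoint counts.
    composed = unicodedata.normalize("NFC", DECOMPOSED_E_ACUTE.decode("utf-8"))
    print(len(composed))  # 1
```

The last step shows why a codepoint count is only an approximation of a character count: the decomposed and composed forms render identically but contain two codepoints and one codepoint respectively.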