utils-std

Collection of commonly available Unix tools
commit: dca7a4c59fcf1263266db22dc3f1e2c4828d6110
parent cc6ab4d761c8d3b79038991d81470fe13ae79884
Author: Haelwenn (lanodan) Monnier <contact@hacktivis.me>
Date:   Mon, 29 Apr 2024 18:33:32 +0200

cmd/wc: Add caveat about codepoint vs. character

Diffstat:

M cmd/wc.1 | 26 +++++++++++++++++++++++---
1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/cmd/wc.1 b/cmd/wc.1
@@ -27,7 +27,7 @@ A word is defined as a non-empty string delimited by whitespace,
 some other implementation choose to additionally exclude
 non-printable characters.
 .Sh OPTIONS
-.Bl -tag -width Ds
+.Bl -tag -width __
 .It Fl c
 Explicitly use single-byte mode, and write the number of bytes in each
 .Ar file .
@@ -35,9 +35,29 @@ Explicitly use single-byte mode, and write the number of bytes in each
 Write the number of newlines in each
 .Ar file .
 .It Fl m
-Switch to multibyte-characters mode, and write the number of characters in each
+Switch to multi-byte mode, and write the number of codepoints in each
 .Ar file .
-The particular encoding is dependent on your locale.
+The encoding is dependent on the
+.Xr locale 1
+environment variables.
+.Pp
+Note that while codepoints are often close enough to characters,
+some characters use multiple codepoints,
+plus by design
+.Nm
+cannot consider glyphs due to lacking rendering.
+.Pp
+For example with decomposed é (e with acute diacritic) in a
+.Ql C.UTF-8
+locale:
+.Bd -literal -compact
+$ printf '\\145\\314\\201\\n'
+$ printf '\\145\\314\\201' | wc -c
+3
+$ printf '\\145\\314\\201' | wc -m
+2
+.Ed
 .It Fl w
 Write the number of words in each
 .Ar file .
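
The byte versus codepoint distinction documented in this patch can be reproduced outside of wc. The following is a minimal Python sketch, not the utils-std implementation: it takes the same three bytes as the printf example in the man page (octal 145 314 201), assumes they are UTF-8, and counts bytes and codepoints. The NFC normalization step is an addition for illustration only; it shows that the same visible character can also be a single codepoint, which is why a codepoint count is only an approximation of a character count.

```python
import unicodedata

# Same bytes as printf '\145\314\201':
# "e" (0x65) followed by U+0301 COMBINING ACUTE ACCENT (0xCC 0x81 in UTF-8).
data = b"\x65\xcc\x81"

# Number of bytes, comparable to `wc -c`.
byte_count = len(data)

# Number of codepoints after decoding as UTF-8 (assumed encoding),
# comparable to `wc -m` in a UTF-8 locale.
text = data.decode("utf-8")
codepoint_count = len(text)

# Illustration only: NFC composes the pair into the single codepoint U+00E9.
composed = unicodedata.normalize("NFC", text)
composed_count = len(composed)

print(byte_count, codepoint_count, composed_count)  # prints: 3 2 1
```

The decomposed and composed forms render as the same character, yet they give codepoint counts of 2 and 1 respectively.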