Output of file(1)
from pmk@lemmy.sdf.org to openbsd@lemmy.sdf.org on 23 Sep 2023 20:35
https://lemmy.sdf.org/post/4445351

Hello, I’ve tried to find someone else using OpenBSD in various places for a while now, but with no success, so I’m hoping someone will read this.

I’m wondering what your output is from file(1) on a file you know has text encoded as UTF-8.

On my system (7.3-stable) the output is “Non-ISO extended-ASCII text”, and I’m trying to figure out if this is how it should be, or if I did something wrong setting up the system.

So, if you have a computer with OpenBSD and a minute to spare, could you try running file(1) on a UTF-8 file and see if it identifies it as UTF-8 or “Non-ISO extended-ASCII text”?

Thanks in advance

#openbsd


Rand0mA@lemmy.world on 23 Sep 2023 21:30

I don’t have OpenBSD, but to check if a file is UTF-8, try this:

file -i filename.txt

The command should tell you the charset information, and if it’s UTF-8, it should say something like “charset=utf-8.”

The file command might still label a file as ASCII text if it contains only ASCII characters, because such a file is also valid UTF-8 (UTF-8 is an extension of the ASCII character set).

That result doesn’t necessarily mean there’s something wrong with your system setup.

pmk@lemmy.sdf.org on 23 Sep 2023 21:55

If I run file(1) on a file containing only characters in the ASCII set, the output is “ASCII text”. So far so good. If I add an “å”, the output of file(1) is “ISO-8859 text”. This is not correct: when I look closer at what’s there, the “å” is encoded as \xc3\xa5, and this same file is reported as UTF-8 on Debian and other operating systems. If I add more Unicode characters like “· ß ð ŋ” to the file, then file(1) says it is “Non-ISO extended-ASCII text” on OpenBSD. file -i testfile gives “text/plain”. Something is not right here.

edit: the file does not contain a BOM, but that is discouraged in UTF-8 files anyway. I have tried manually adding the correct BOM and it didn’t help.
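One way to confirm, independently of file(1), that the bytes really are well-formed UTF-8 is to validate them by hand. The following is only a minimal sketch: is_valid_utf8 and has_utf8_bom are hypothetical helper names invented for this illustration, not functions from file(1) or from any library.

```c
#include <stddef.h>

/* Return 1 if the buffer starts with the UTF-8 BOM (ef bb bf), else 0. */
int
has_utf8_bom(const unsigned char *buf, size_t len)
{
	return len >= 3 && buf[0] == 0xef && buf[1] == 0xbb && buf[2] == 0xbf;
}

/*
 * Return 1 if buf[0..len) is well-formed UTF-8, 0 otherwise.
 * Rejects overlong forms, surrogates (U+D800..U+DFFF) and
 * code points above U+10FFFF.
 */
int
is_valid_utf8(const unsigned char *buf, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char c = buf[i];
		unsigned long cp, min;
		size_t n, k;

		if (c < 0x80) {				/* single byte (ASCII) */
			i++;
			continue;
		} else if ((c & 0xe0) == 0xc0) {	/* lead byte of 2 */
			n = 1; cp = c & 0x1f; min = 0x80;
		} else if ((c & 0xf0) == 0xe0) {	/* lead byte of 3 */
			n = 2; cp = c & 0x0f; min = 0x800;
		} else if ((c & 0xf8) == 0xf0) {	/* lead byte of 4 */
			n = 3; cp = c & 0x07; min = 0x10000;
		} else
			return 0;	/* stray continuation or invalid lead */

		if (len - i <= n)	/* sequence truncated at end of buffer */
			return 0;
		for (k = 1; k <= n; k++) {
			if ((buf[i + k] & 0xc0) != 0x80)
				return 0;	/* not a continuation byte */
			cp = (cp << 6) | (buf[i + k] & 0x3f);
		}
		if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
			return 0;
		i += n + 1;
	}
	return 1;
}
```

Called on the two bytes \xc3\xa5 this returns 1, while a lone Latin-1 \xe5 is rejected because it looks like the start of a three-byte sequence with nothing after it.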

Rand0mA@lemmy.world on 23 Sep 2023 22:15

Make sure your test file contains a decent amount of UTF-8 text, not just a few characters. The file command may rely on heuristics, so having more text might help it make a more accurate determination.

What does the locale command return? To set your locale you can use the export command (e.g. export LC_CTYPE="en_US.UTF-8", using whatever code is relevant).

pmk@lemmy.sdf.org on 23 Sep 2023 22:36

I have all of this page: www.w3.org/2001/06/utf-8-test/UTF-8-demo.html as a test file. It renders fine and displays all the languages and special characters in vim.

LC_CTYPE is “en_US.UTF-8”; I export it in .xsession (and in .profile).

XTERM_LOCALE is also “en_US.UTF-8”
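To double-check what encoding the C library actually derives from those environment variables, a small program can ask it directly. This is a rough sketch using the POSIX calls setlocale(3) and nl_langinfo(3); the function names current_codeset and locale_is_utf8 are made up for this example. It only reports the locale of the calling process and says nothing about how file(1) classifies bytes.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Adopt the locale from the environment (LC_ALL, LC_CTYPE, LANG) and
 * return the character encoding name the C library reports for it.
 */
const char *
current_codeset(void)
{
	setlocale(LC_CTYPE, "");
	return nl_langinfo(CODESET);
}

/* Return 1 if the active LC_CTYPE encoding is reported as "UTF-8". */
int
locale_is_utf8(void)
{
	return strcmp(current_codeset(), "UTF-8") == 0;
}
```

With LC_CTYPE exported as above, locale_is_utf8() would be expected to return 1; in the default "C" locale it returns 0.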

tycho@lemmy.sdf.org on 25 Sep 2023 18:58

Yep, I get the same result, so most likely you didn’t do anything wrong. My VPS on openbsd.amsterdam shows this, and my laptop does too.

pmk@lemmy.sdf.org on 25 Sep 2023 19:01

Aha, I understand, thank you! So file(1) might not be UTF-8 aware yet.

tycho@lemmy.sdf.org on 26 Sep 2023 10:37

I explored the source of file(1), and the part that determines the type of a text file seems to be in text.c: cvsweb.openbsd.org/cgi-bin/cvsweb/…/text.c?rev=1.…

And especially this part:

static int
text_try_test(const void *base, size_t size, int (*f)(u_char))
{
	const u_char	*data = base;
	size_t		 offset;

	for (offset = 0; offset < size; offset++) {
		if (!f(data[offset]))
			return (0);
	}
	return (1);
}

const char *
text_get_type(const void *base, size_t size)
{
	if (text_try_test(base, size, text_is_ascii))
		return ("ASCII");
	if (text_try_test(base, size, text_is_latin1))
		return ("ISO-8859");
	if (text_try_test(base, size, text_is_extended))
		return ("Non-ISO extended-ASCII");
	return (NULL);
}

So file(1) is not capable of telling whether a file is UTF-8 right now: it only tries ASCII, then ISO-8859, then “extended ASCII”, one byte at a time. There is another file (/etc/magic) that can help determine whether a text file is UTF-7 or UTF-8-EBCDIC, because those can be recognized by their BOM, but as you said, UTF-8 does not need a BOM. So it looks like we are stuck here :)
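To see why an “å” lands in “ISO-8859” while “ß” or “ŋ” pushes the file to “Non-ISO extended-ASCII”, here is a simplified sketch of the same three-step test. The byte classes below are an assumption made for illustration (the real text_is_ascii, text_is_latin1 and text_is_extended predicates are not quoted above and may differ in detail), and all sketch_* names are hypothetical.

```c
#include <stddef.h>

/* ASSUMED class: printable ASCII plus common whitespace. */
static int
sketch_is_ascii(unsigned char c)
{
	return (c >= 0x20 && c < 0x7f) || c == '\t' || c == '\n' || c == '\r';
}

/* ASSUMED class: ASCII plus the ISO-8859 printable range 0xa0..0xff. */
static int
sketch_is_latin1(unsigned char c)
{
	return sketch_is_ascii(c) || c >= 0xa0;
}

/* ASSUMED class: additionally allow 0x80..0x9f (C1 controls in ISO-8859). */
static int
sketch_is_extended(unsigned char c)
{
	return sketch_is_latin1(c) || c >= 0x80;
}

/* Return 1 only if every byte satisfies the predicate f. */
static int
sketch_all(const unsigned char *data, size_t size, int (*f)(unsigned char))
{
	size_t i;

	for (i = 0; i < size; i++)
		if (!f(data[i]))
			return 0;
	return 1;
}

/* Same cascade as text_get_type: the first class that fits wins. */
const char *
sketch_get_type(const void *base, size_t size)
{
	if (sketch_all(base, size, sketch_is_ascii))
		return "ASCII";
	if (sketch_all(base, size, sketch_is_latin1))
		return "ISO-8859";
	if (sketch_all(base, size, sketch_is_extended))
		return "Non-ISO extended-ASCII";
	return NULL;
}
```

Under these assumptions, “å” (c3 a5) consists of two bytes that are both at or above 0xa0, so every byte passes the Latin-1 test. “ß” (c3 9f) and “ŋ” (c5 8b) each contain a continuation byte in 0x80..0x9f, which fails the Latin-1 test and only passes the “extended” one. No step ever looks at how bytes combine into multi-byte sequences, so UTF-8 can never be reported.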

pmk@lemmy.sdf.org on 26 Sep 2023 17:32

Thank you. At least I know now that this is the expected output for UTF-8 files, which is good to know. Thank you again.

wgs@lemmy.sdf.org on 30 Oct 2023 12:42

Which is ironic, given that OpenBSD only supports the UTF-8 encoding :)

tycho@lemmy.sdf.org on 31 Oct 2023 08:26

Yes, it looks like UTF-8 is a first-class citizen, but really it is ASCII that is 100% supported. From the FAQ:

The OpenBSD base system fully supports the ASCII character set and encoding, and partially supports the UTF-8 encoding of the Unicode character set.