Character encoding of text files not present

Issue description

While Topincs files may have an encoding specified, this information is in fact never present. Upload is the most common way of archiving a file in Topincs. In this process, the encoding is neither inspected nor persisted; it is always null. It should be set for text files.

Developer comments

All files are binary files. Some files are text files: they predominantly contain bytes or byte sequences which represent characters in one or more human scripts. Their content may be organized in lines, in which case each line ending is encoded by a byte or byte sequence commonly used for this purpose.
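Whether the bytes of an uploaded file represent characters in a given encoding can be tested by a trial decode. The following is only a sketch of how the missing encoding might be guessed at upload time, not how Topincs does or will do it. The function name `guess_encoding` and the returned labels are assumptions made for illustration. It only recognizes byte order marks, US-ASCII and UTF-8, and returns None for everything else, including legacy 8-bit encodings such as ISO-8859-1.

```python
import codecs
from typing import Optional

# Byte order marks, longest first, so that UTF-32LE is not mistaken for UTF-16LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def guess_encoding(data: bytes) -> Optional[str]:
    """Hypothetical helper: return an encoding label, or None if unknown."""
    # A byte order mark is the strongest hint available.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Without a BOM, check whether the content is valid UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None  # not UTF-8: binary or a legacy encoding, left undecided
    # Pure 7-bit content is valid UTF-8 as well; report the narrower label.
    return "US-ASCII" if data.isascii() else "UTF-8"
```

A None result corresponds to the current state, where the encoding is null. A fuller solution would need a statistical guess for legacy encodings, which this sketch deliberately leaves out.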

The best way to distinguish text and binary (non-text) files is most likely a (partial) distribution analysis, since text files use only a limited subset of the byte domain, whereas binary files use the whole domain. In any case, very short files might be hard to classify.
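A partial distribution analysis of this kind could look roughly like the sketch below. It is an illustration under stated assumptions, not a tested classifier: the names `text_byte_ratio` and `looks_like_text`, the 4096-byte sample and the 0.95 threshold are all chosen arbitrarily here. It also assumes an ASCII-compatible encoding (UTF-8, ISO-8859-x and similar), so UTF-16 or UTF-32 text would be misclassified as binary.

```python
# Byte values expected in text under an ASCII-compatible encoding (assumption):
# tab, line feed, carriage return, printable ASCII, and high bytes 0x80-0xFF.
TEXT_BYTES = (
    frozenset({0x09, 0x0A, 0x0D})
    | frozenset(range(0x20, 0x7F))
    | frozenset(range(0x80, 0x100))
)


def text_byte_ratio(data: bytes) -> float:
    """Return the fraction of bytes that fall into the text subset."""
    if not data:
        return 1.0  # empty input: nothing speaks against text
    hits = sum(1 for byte in data if byte in TEXT_BYTES)
    return hits / len(data)


def looks_like_text(data: bytes, sample_size: int = 4096,
                    threshold: float = 0.95) -> bool:
    """Classify by inspecting only the first sample_size bytes."""
    sample = data[:sample_size]
    return text_byte_ratio(sample) >= threshold
```

Data that uses the whole byte domain evenly has about 12 % of its bytes outside the text subset and falls below the threshold, while ordinary text stays close to 1.0. For very short files a single stray byte moves the ratio considerably, which is the classification difficulty noted above.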

Also: why has this never been a problem?

In text files, byte sequence frequencies are determined by an external (non-digital) rule system. In binary files, the software that creates and parses the data determines the byte sequences.

Reporting date

2024-01-01

Reported by

Robert Cerny

Planned for version

Topincs Psi Ψ

Blocks

Characters not legible in table viewer when csv file not UTF-8 (Bug)

Helpful webpages

www.dpconline.org/…ile-type-of-a-text-file

stackoverflow.com/…d-make-everything-utf-8