Unicode support

This article contains details about Unicode support in Vertec and its effects on Python scripts.

Product line: Standard | Expert

Operating mode: Cloud Abo | On-Premises

Modules: Services & CRM, Budget & Phases, Purchases, Resource Planning, Business Intelligence

Created: 08.09.2021
Machine translated
Updated: 19.10.2021 | Added Custom Extensions section.

Up to and including Vertec 6.4, texts stored in Vertec (e.g. addresses, service texts, remarks) were limited to a single-byte encoding, namely Windows-1252 or ANSI (the two terms mean the same thing; we use ANSI below). Vertec could therefore process exactly the symbols defined in ANSI, see e.g. Wikipedia. All other symbols were replaced by ?.

This ANSI symbol set of roughly 250 characters covers English and all Western European languages, which is sufficient for most uses of Vertec. However, more and more customers with business relationships in Eastern Europe, for example, wanted to store those addresses correctly in Vertec. For this reason, Vertec 6.5 switched to Unicode.

From Vertec 6.5 onwards, texts in Vertec are stored in Unicode. Vertec thus supports all of the approximately 65,000 symbols of the BMP ("Basic Multilingual Plane") and therefore the symbols of all relevant languages, including all Asian languages. Symbols outside the BMP that are entered in Vertec are replaced by a ?, as unsupported symbols were before. This is likely to affect only a few emojis.

When the database is converted from 6.4 to 6.5 (DB Convert), all texts in Vertec are converted from ANSI to Unicode. This change is not visible in the user interface. The Vertec database usually grows by 20%-30% during the update, which is unlikely to have any negative effect.

Backward compatibility

The Unicode enhancement is fully backward-compatible, but not necessarily forward-compatible. This means that an existing Vertec installation behaves the same as before when it is used in the same way. As soon as Unicode symbols are processed in Vertec (deliberately or inadvertently, e.g. by pasting in a text), this is no longer guaranteed.

Below are a few points that may require attention in customer installations. The risks mainly concern customer-specific scripts (especially Python) and extensions.

Python and Unicode

Up to Vertec 6.4, a Vertec string member (such as the text of a service) was of type str; from 6.5 onwards it is of type unicode. Python code that copies, concatenates, or otherwise combines such strings therefore continues to work the same way. We have also "bent" str() to Unicode, because an analysis of existing customer Python code showed that str() is widely used, often even to convert values that are already strings. Without this correction, such calls would raise errors.

The default encoding for Python in Vertec is ANSI. This ensures the greatest possible backward compatibility with existing code from versions prior to 6.5. As a consequence, non-ANSI symbols cannot be written directly in string literals (Python offers alternatives, however, such as Unicode escape sequences).
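One such alternative can be sketched as follows. This is a minimal illustration in plain Python, not Vertec-specific code: the variable names and the sample city name are assumptions made for the example. It uses \uXXXX escape sequences in a Unicode literal, so the source text itself contains only ASCII characters.

```python
# Unicode escape sequences keep the source code pure ASCII while still
# producing non-ANSI characters at runtime.
# Sample value (assumption): the Polish city name Lodz with its diacritics.
city = u"\u0141\u00f3d\u017a"

# Each \uXXXX escape yields exactly one character.
length = len(city)

# Code points of the individual characters, formatted as U+XXXX.
code_points = ["U+%04X" % ord(ch) for ch in city]
```

Here length is 4, and code_points lists U+0141, U+00F3, U+0064 and U+017A.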

The conversion from a Unicode string to the default encoding, i.e. from Unicode to ANSI, is fault-tolerant: symbols that cannot be converted do not raise an error, but are replaced by ?.
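The effect of this replacement can be reproduced in plain Python. The sketch below is an approximation under stated assumptions: it uses the standard cp1252 codec with the "replace" error handler to mimic the behavior described above, and does not show how Vertec performs the conversion internally.

```python
# Sample text (assumption): two of its four characters are not part of ANSI.
text = u"\u0141\u00f3d\u017a"

# "replace" substitutes "?" for every character cp1252 cannot represent.
ansi_bytes = text.encode("cp1252", "replace")

# Decoding the bytes again shows what survived the round trip.
restored = ansi_bytes.decode("cp1252")

# For comparison: standard Python without an error handler is strict
# and raises an exception instead of replacing characters.
try:
    text.encode("cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Only the characters that exist in ANSI survive: restored contains a ? in place of the first and last character, and the original symbols cannot be recovered from it.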

Encoding

Because the default encoding is ANSI, existing Python code from versions before 6.5 may need to be adapted after the update, for example if it processes strings and saves them to a file or sends them to a web service. For non-ANSI symbols, which can occur in Vertec from version 6.5 onwards, to arrive correctly, a Unicode encoding (such as UTF-8) must be chosen, and the receiving side must be able to read it. Without an explicit encoding such as

string.encode("UTF-8")

the implicit conversion to ANSI takes effect, with the loss of non-ANSI symbols described above.

Encoding also needs attention when reading data into Vertec, whether by opening a text file, via the Vertec XML Server, or with vtcapp.requestfilefromclient(). With Unicode support in place, losing non-ANSI symbols on import is unnecessary. A text file encoded in UTF-8, for example, must be decoded accordingly:

unicodestring = filecontent.decode("UTF-8")

As a rule of thumb, you can remember that in Python

  • encode always converts from a Unicode string to a byte stream (ä occupies e.g. 2 bytes in UTF-8)
  • decode always converts from a byte stream (e.g. encoded in UTF-8) to Unicode.
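The rule of thumb can be illustrated with a round trip through a file. This is a minimal sketch in plain Python: the sample text, the temporary directory, and the file name export.txt are assumptions for the example, and the file is opened in binary mode so that encoding and decoding remain explicit.

```python
import io
import os
import tempfile

# Hypothetical address text containing non-ANSI symbols.
text = u"Gda\u0144sk, \u0141\u00f3d\u017a"

# encode: Unicode string -> byte stream
payload = text.encode("utf-8")

# Write the bytes unchanged (binary mode, no implicit conversion).
path = os.path.join(tempfile.mkdtemp(), "export.txt")
with io.open(path, "wb") as f:
    f.write(payload)

# Read the raw bytes back.
with io.open(path, "rb") as f:
    filecontent = f.read()

# decode: byte stream -> Unicode string
unicodestring = filecontent.decode("utf-8")

# A single non-ASCII character such as \u00e4 takes 2 bytes in UTF-8.
a_umlaut_bytes = len(u"\u00e4".encode("utf-8"))
```

After the round trip, unicodestring equals text, while payload is longer than text because each non-ASCII character occupies more than one byte.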

Custom Extensions

The "bending" of str() to Unicode does not apply to custom extensions that are stored as Python files in the Extensions folder of the Vertec directory. This also applies to modified extension code originally supplied by Vertec. In these extensions, str() converts a Unicode string to ANSI, with the expected data loss for non-ANSI symbols. Customers with such extensions should check their Python code for possible problems when updating to version 6.5.
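When reviewing such extension code, one possible approach is to replace str() calls on text values with an explicit conversion. The helper below is a hypothetical sketch, not part of Vertec: the name to_utf8 is invented for illustration, and it assumes that byte strings in legacy code are ANSI (cp1252) encoded.

```python
# Text type of the running interpreter (unicode on Python 2, str on Python 3).
text_type = type(u"")


def to_utf8(value):
    """Hypothetical helper: return value as UTF-8 bytes without str()."""
    if isinstance(value, text_type):
        # Unicode text: encode explicitly, so no symbol is lost.
        return value.encode("utf-8")
    if isinstance(value, bytes):
        # Assumption: legacy byte strings are ANSI (cp1252) encoded.
        return value.decode("cp1252").encode("utf-8")
    # Other values (numbers etc.): convert to text first, then encode.
    return text_type(value).encode("utf-8")
```

For example, to_utf8(u"\u0141") returns the two bytes 0xC5 0x81, whereas a conversion to ANSI would have produced a single ?.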

Unicode support for custom fields

In Vertec 6.4, text could be stored in custom fields of the types Zeichen (character) and Text. A Zeichen field is limited to a maximum of 255 characters, while a Text field is unlimited. Custom fields of type Zeichen are automatically converted to Unicode with 6.5; those of type Text are not.

Custom fields of type Text are stored in the database as BLOB (Binary Large Object) fields. This means that the data is stored as pure binary data and is only interpreted as text in the Vertec business logic. Since custom fields of type Bild (image) are saved in the same database field, custom fields of type Text remain unchanged in Vertec 6.5, as 8-bit texts in ANSI format, for reasons of backward compatibility.

Instead, Vertec 6.5 introduces the new custom field type Unicode Text.

If you are using custom fields of type Text, we strongly recommend changing them to Unicode Text immediately after the DB Convert to 6.5. The change automatically copies the texts from the Text field into the Unicode Text field in the correct format.

Attention: All accesses to the custom field (OCL, SQL) must also be changed:

  • in SQL, from WERTBLOB to WERTTEXT
  • in OCL, from zusatzfeldblob('zusatzfeldname') to zusatzfeld('zusatzfeldname').