Using Umlauts in US7ASCII – doesn’t work, does it?

Creating an Oracle database with the character set US7ASCII should be an exception nowadays. If you use dbca (the Database Configuration Assistant) in Oracle 11g, US7ASCII is not even offered by default. But it is still possible – in all current Oracle versions. Whether that is what you really want is another question!

Migrating from US7ASCII to Unicode

In one of my current projects I was asked to migrate an Oracle 10g database to Unicode (and to Oracle 11g, of course) because the application has to work with Cyrillic and Arabic characters. Since it is a German company, my assumption was that the current database character set would be something like WE8, a Western European character set covering umlauts and some other symbols we use very often. So I asked them to run one of the check scripts I use to get an overview of a database.

But, surprise: the database had been set up with the character set US7ASCII! I remembered that there is a command “ALTER DATABASE CHARACTER SET …”, and because US7ASCII is a strict subset of all other character sets it should be possible to easily change the character set to AL32UTF8. Unfortunately this command has no longer been available since Oracle 9i, because it does not work for LOB data types and Oracle uses LOBs in the data dictionary. So instead of this easy route I had to use the character set scanner and the script csalter.plb (directory $ORACLE_HOME/rdbms/admin).
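The subset argument can be sanity-checked outside Oracle. A small Python sketch (illustrative only, using Python codec names instead of the Oracle character set names) shows why a database containing only 7-bit data could in principle be re-tagged without touching a single byte:

```python
# Why a pure US7ASCII database could in theory be re-labelled in place:
# every 7-bit ASCII byte has exactly the same meaning in WE8ISO8859P15
# (iso-8859-15), WE8MSWIN1252 (cp1252) and AL32UTF8 (utf-8).
ascii_bytes = bytes(range(128))
text = ascii_bytes.decode("ascii")

for charset in ("iso-8859-15", "cp1252", "utf-8"):
    # Decoding the same bytes with any of these charsets yields the same text,
    # and re-encoding the text reproduces the identical bytes.
    assert ascii_bytes.decode(charset) == text
    assert text.encode(charset) == ascii_bytes
```

The trouble starts, of course, as soon as the stored data is not pure 7-bit ASCII – which is exactly the situation described below.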

To migrate an Oracle database to a new character set you first have to install an additional schema (csmig) with some tables, using the script csminst.sql (directory $ORACLE_HOME/rdbms/admin). After the tables are created you can run the character set scanner (csscan), which needs some additional parameters, for example the target character set. If the scan completes without generating any error messages (and without corresponding rows in the tables of schema csmig), the database character set can be changed using the script csalter.plb. But, as I had already expected, tons of error messages were generated because the applications had been using umlauts anyway.

How does that work?

Normally Oracle takes care of converting characters between the application and the Oracle database. If an application runs on a Microsoft Windows PC, for example, the client probably uses WE8MSWIN1252. If the database runs with the Unicode character set (AL32UTF8) and the user types a “€”, an automatic conversion takes place from the byte 0x80 (MSWIN) to the code point U+20AC (three bytes in UTF-8), and vice versa. If the database uses the US7ASCII character set, it cannot represent the “€” and stores a “?” as a replacement for the invalid symbol. In most cases the client character set is set by the operating system or the application. A nice tool for testing is PuTTY, because it allows you to switch between various character sets on the fly.
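These conversions can be reproduced in plain Python (an illustration only, with Python codec names standing in for the Oracle ones: cp1252 for WE8MSWIN1252, utf-8 for AL32UTF8, ascii for US7ASCII):

```python
euro = "€"

# In WE8MSWIN1252 (cp1252) the euro sign is the single byte 0x80.
assert euro.encode("cp1252") == b"\x80"

# Its Unicode code point is U+20AC; in UTF-8 (AL32UTF8) it takes three bytes.
assert ord(euro) == 0x20AC
assert euro.encode("utf-8") == b"\xe2\x82\xac"

# US7ASCII cannot represent it; a lossy conversion stores "?" instead.
assert euro.encode("ascii", errors="replace") == b"?"
```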

But the conversion is only necessary if client and database use different character sets (which is usually the case). If client and database use the same character set (e.g. because the database was created with WE8MSWIN1252 and the client is an MS-Windows PC), no conversion is done at all. The same behavior can be forced using the environment variable NLS_LANG. This variable consists of the following three parts:

  1. Language (for error messages etc.)
  2. Territory (number formatting, i.e. decimal and group separators, and sorting)
  3. Character Set

With the setting NLS_LANG=GERMAN_GERMANY.WE8ISO8859P15 you change the default behavior so that the client claims the WE8ISO8859P15 character set regardless of its real one. If the database uses the same character set, no conversion takes place. For our database this meant that NLS_LANG was set to US7ASCII for all applications and operating systems, so the clients’ bytes were stored unchanged. But keep in mind that only the “right” client is able to display the “correct” characters. If another tool such as Toad for Oracle or SQL Developer is used to view the data, a “?” or something similar is shown.
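This pass-through effect can be sketched in Python (again with Python codec names; this only models the byte handling, not Oracle itself):

```python
# When client and database claim the same character set, Oracle stores the
# client's bytes unchanged – so only a matching client reads them back.
stored = "Müller".encode("iso-8859-15")   # bytes written by a WE8ISO8859P15 client
assert stored == b"M\xfcller"

# A client that also uses WE8ISO8859P15 sees the original text.
assert stored.decode("iso-8859-15") == "Müller"

# A UTF-8 client (e.g. SQL Developer) cannot interpret the 0xFC byte and
# shows a replacement character instead of the "ü".
assert stored.decode("utf-8", errors="replace") == "M\ufffdller"
```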

To make things more difficult, various clients were used in this project: some MS-Windows PCs with WE8MSWIN1252, some Unix boxes with WE8ISO8859P15, and some Java applications with the UTF8 character set.
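Why this mix of clients is so troublesome can be illustrated with a few lines of Python (codec names are the Python equivalents of the Oracle character sets):

```python
# The same word as stored (pass-through) by three different clients:
rows = [
    "Größe".encode("cp1252"),        # MS-Windows client (WE8MSWIN1252)
    "Größe".encode("iso-8859-15"),   # Unix client (WE8ISO8859P15)
    "Größe".encode("utf-8"),         # Java client (UTF8)
]

# For these characters the two single-byte character sets happen to agree ...
assert rows[0] == rows[1]

# ... but the UTF-8 client stores different bytes, so no single character
# set setting can correctly describe all rows of such a column at once.
assert rows[2] != rows[0]
```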

How to migrate this database?

By default csscan uses the current character set of the database, but an individual setting can be specified as well. A command like the following ignores the declared character set and scans the database as if it had the character set WE8ISO8859P15:

csscan system/<pwd> FULL=Y FROMCHAR=WE8ISO8859P15 TOCHAR=WE8ISO8859P15 

But in this project even this scan generated tons of error messages because, as mentioned, various different clients were in use. If you can live with losing a few characters, you can simply ignore the error messages and delete the corresponding rows in the tables of schema csmig (especially csm$columns). After the error messages have been cleaned up, you can execute the script csalter.plb. This is a one-way approach, so it is difficult to use the scanner more than once.
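Conceptually, what such a scan checks per value can be sketched in plain Python (an illustration of the idea, not the csscan implementation – it also shows the kind of mojibake that a wrong source character set cannot detect):

```python
# Conceptual model of a character set scan: decode each stored value with
# the assumed source charset (FROMCHAR) and re-encode it with the target
# charset (TOCHAR); values failing either step are reported as exceptions.
def scan_value(stored: bytes, fromchar: str, tochar: str) -> bool:
    """Return True if the value survives the conversion losslessly."""
    try:
        text = stored.decode(fromchar)
        return text.encode(tochar).decode(tochar) == text
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False

# Umlaut bytes from an ISO-8859-15 client pass an ISO-8859-15 scan ...
assert scan_value("Müller".encode("iso-8859-15"), "iso-8859-15", "iso-8859-15")

# ... while broken multi-byte data is flagged as an error:
assert not scan_value(b"\xff\xfe", "utf-8", "utf-8")

# Worse: UTF-8 bytes from a Java client decode under cp1252 without any
# error at all, just as mojibake – which a scan cannot detect automatically.
assert "Müller".encode("utf-8").decode("cp1252") == "MÃ¼ller"
```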

ALTER DATABASE CHARACTER SET …

As already mentioned, the command ALTER DATABASE CHARACTER SET has been deprecated since version 9i. But after a closer look I found out that the command can still be used with a very specific syntax:

SQL> ALTER DATABASE CHARACTER SET INTERNAL_USE <character set>; 

With this command only the data dictionary entry for the database character set is changed; no data is converted. It should not be used on a production database, because the “normal” character set conversion will no longer work correctly. But for migration purposes (exporting the data and importing it into another database) it is the best approach I currently see. For the project we could simply set the character set of the database to the one we needed, export the corresponding schemas, and then switch to the next character set, as follows:

SQL> ALTER DATABASE CHARACTER SET INTERNAL_USE WE8ISO8859P15;

Now we export all schemas that were used by WE8ISO8859P15 clients.

SQL> ALTER DATABASE CHARACTER SET INTERNAL_USE WE8MSWIN1252;

Then we export the schemas used by WE8MSWIN1252 clients, and finally:

SQL> ALTER DATABASE CHARACTER SET INTERNAL_USE UTF8;

… and again the export.
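Why re-tagging before each export works can again be modeled in Python (an illustrative byte-level sketch, not the export tool itself): the export interprets the stored bytes according to the declared database character set, so a correct tag makes the conversion to Unicode lossless.

```python
# Pass-through bytes written by an MS-Windows client into the US7ASCII DB:
stored = "Straße".encode("cp1252")

# Re-tagged as WE8MSWIN1252, the export/import into AL32UTF8 is lossless:
assert stored.decode("cp1252").encode("utf-8").decode("utf-8") == "Straße"

# Left tagged as US7ASCII, the non-ASCII byte would be lost as "?" or U+FFFD:
assert stored.decode("ascii", errors="replace") != "Straße"
```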

Now we can set up the new Unicode database accordingly, import the dumps, and hopefully have no more problems with character set conversion.
