The Linux Page

PostgreSQL and UNICODE

Trying to create a PostgreSQL Database with UNICODE does not always work. I ran in a problem with a RedHat installation...

There can be two problems:

  1. The default database setup is ASCII (more or less considered ISO-8859-1)
  2. The database cluster was compiled without UNICODE support

Now a day, the second case is quite unlikely and it means that you won't  be able to use a UNICODE encoding unless you recompile the database system. Make sure you use the ./configure --multi-byte encoding for that purpose.

In the first case, however, it could be that your cluster was initialized in SQL_ASCII. This is annoying but it does not prevent you from creating database in UNICODE (or UTF-8).

I found the solution on an Italian website. Although I don't really read Italian that well, the solution was in regular SQL language.

The problematic command line is this one:


When I try to run that statement, I get the following error:

  ERROR:  encoding UTF8 does not match locale en_US
  DETAIL:  The chosen LC_CTYPE setting requires encoding LATIN1.

Which I understood, at first, as "The database does not support UNICODE"... Hmmm...

Later I found the Italian solution but tried from the wrong template—I always create new databases from template1!

    LC_CTYPE = 'en_US.utf8' LC_COLLATE = 'en_US.utf8';

Here I got it! The LC_CTYPE and LC_COLLATE are PostgreSQL variables, not shell environment variables. I was wondering since the shell did not show those variables...

But I then got the following error:

  ERROR:  new encoding (UTF8) is incompatible with the encoding of the template database (LATIN1)
  HINT:  Use the same encoding as in the template database, or use template0 as template.

That last error was followed by a great hint though! Use template0 instead of your current template (which was template1). Oh! Yeah! I did not read the Italian fix from top to bottom... Surprised?! 

    LC_CTYPE = 'en_US.utf8' LC_COLLATE = 'en_US.utf8'
    TEMPLATE template0;

Voilà! That worked.

And yes, to my point of view, all databases should just use UTF-8, although that is not always 100% welcome when some binary data is to be saved in the database and no conversion should occur on that data... But any text should by default be viewed as UTF-8 so all text works.