UNICODE from OSM to PGSQL


This week I’ve been presented with a problem importing OpenStreetMap data of Africa from GeoFabrik’s shapefile export into a PostgreSQL / PostGIS database.

The problem was a loss of information in transit, resulting in wrongly encoded strings (road names) ending up in the db. This was part of a feasibility study. So, is it feasible? Let’s take a look.

I downloaded the shapefiles and tried to import the roads one using shp2pgsql with no options, and here’s the result:

Why is shp2pgsql trying to convert, and from which encoding? When I left it, the default was to perform no conversion unless -W was given…

Well, it turns out the default is now to convert from the WINDOWS-1252 encoding (why?) and there’s no way to request no conversion at all (why?!).

So I patched the loader to give more information about the encoding process and specified UTF-8 as the source encoding. Here’s the result:

So it’s official: the .dbf file contains invalid data. The confusing error message (Invalid argument) means the multibyte sequence is incomplete rather than invalid (that’s the EINVAL errno, as opposed to EILSEQ).
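The loader goes through iconv, which signals an incomplete trailing sequence with EINVAL and a genuinely invalid byte with EILSEQ. Python’s UTF-8 decoder makes the same distinction in the exception’s reason field, which makes for a quick way to see the difference (a Python illustration, not the loader’s actual C code):

```python
def decode_reason(data: bytes) -> str:
    """Return the decoder's complaint, or 'ok' if data is valid UTF-8."""
    try:
        data.decode("utf-8")
        return "ok"
    except UnicodeDecodeError as e:
        return e.reason

# 0xD8 is the lead byte of a two-byte UTF-8 sequence: alone at the end
# of a buffer it is *incomplete* (iconv: EINVAL), not invalid.
print(decode_reason(b"\xd8"))   # unexpected end of data
# 0xFF can never appear in UTF-8 at all (iconv: EILSEQ).
print(decode_reason(b"\xff"))   # invalid start byte
```

The “Invalid argument” message is just strerror(EINVAL); it says nothing about *which* kind of bad input was seen, hence the confusion.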

Adding more debugging code, I can see that many, many rows have values that look truncated, all ending with a single byte, either <D8> or <D9>.
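Those two bytes are suggestive: 0xD8 and 0xD9 are precisely the lead bytes of two-byte UTF-8 sequences in the Arabic range (U+0600–U+067F), which is consistent with Arabic road names being cut off mid-character. A quick check (illustrative word only, not taken from the data):

```python
# "street" in Arabic; every character here encodes to two UTF-8 bytes
# whose first byte is 0xD8 -- exactly the lone byte left behind when a
# value is truncated mid-character.
for ch in "شارع":
    first, second = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {first:#04x} {second:#04x}")
```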

OpenOffice confirms the malformation (I wanted an independent opinion on that, just in case it was shapelib doing the truncation):

Querying openstreetmap for way 4005333 shows the full string, which is also present in the .osm file downloaded from GeoFabrik:

So the problem is only in the shapefile, not in the OSM data itself, nor in PostGIS.

Surely the PostGIS loader could be tweaked to allow some tolerance, in case anyone wants to import the truncated data anyway. In this specific case, discarding the final partial multibyte sequence might be the best you can do: it’s a truncation like any other, only a multibyte one causes more trouble than truncation in a single-byte encoding.

Timely enough, someone submitted a patch aimed at exactly this kind of tolerance handling. I’m going to see how well it copes with this case.

But the bottom line is that we want the good data, so this problem won’t be solved until the data is in the database, whether by way of shapefiles (if possible) or directly.

Next stop: osm2pgsql -> go there