Photo Organizer

  • Status Closed
  • Percent Complete
  • Task Type Bug Report
  • Category Backend / Core
  • Assigned To
  • Operating System All
  • Severity Low
  • Priority Very Low
  • Reported Version Devel
  • Due in Version 2.34
  • Due Date Undecided
  • Votes
  • Private
Attached to Project: Photo Organizer
Opened by pizza - 2007-03-19
Last edited by pizza - 2007-08-09

FS#181 - Handle UTF-8 encoded IPTC+XMP data.

IPTC and XMP data that's encoded via UTF-8 doesn't properly import, with the database barfing on the actual import due to illegal characters.

Theoretically XMP and IPTC data can have encoding tags, but in practice nobody uses them. We need to detect the charset and convert to UTF-8 before we can shove anything in the database.

Closed by  pizza
2007-08-09 15:55
Reason for closing:  Implemented
Additional comments about closing:  Finally implemented and committed; see r1534. (will be in 2.34-rc3)

Is there any way to disable/remove the IPTC and XMP data that's encoded in UTF-8?

Also, is it possible to just set the database to UTF-8 as is done with the language database?

pizza commented on 2007-06-24 17:28

The database language is set when the database is created; AFAIK the only way to switch is to dump it, convert the dump as needed, then re-create the database with the correct type.

As for disabling UTF-8 text in IPTC/XMP fields, there's probably a simple way to filter the data against legal ASCII.

It looks like the way to detect UTF-8 is to look at the first bit of each byte. ASCII (non-extended) characters all have a zero as the first bit, so every byte of UTF-8 text either has a one as the first bit or is a plain ASCII character. The leading bits (up to five) of the first byte of a UTF-8 character describe how many bytes (1-4) the character occupies; the remaining bytes of a multi-byte character all start with the bits 10.
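As a rough illustration of the bit patterns described above, here is a minimal Python sketch. The function name utf8_sequence_length is made up for this example, and it only inspects the bit pattern of a single byte; it does not reject overlong or otherwise illegal sequences.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the sequence length implied by a UTF-8 lead byte.

    Returns 0 if the byte cannot start a character (it is a
    continuation byte or has an unused bit pattern). Hypothetical
    helper: it checks the bit pattern only, not full UTF-8 validity.
    """
    if lead_byte & 0b10000000 == 0b00000000:  # 0xxxxxxx: plain ASCII
        return 1
    if lead_byte & 0b11100000 == 0b11000000:  # 110xxxxx: 2-byte character
        return 2
    if lead_byte & 0b11110000 == 0b11100000:  # 1110xxxx: 3-byte character
        return 3
    if lead_byte & 0b11111000 == 0b11110000:  # 11110xxx: 4-byte character
        return 4
    return 0  # 10xxxxxx (continuation byte) or 11111xxx (unused)
```

For instance, the byte 0xC3 that starts an accented Latin letter in UTF-8 matches the 110xxxxx pattern, so the character is two bytes long.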

This should be simple to do. I just need a couple of questions answered first:
1) Do you want to drop the non-ASCII characters?
-Very simple
2) What is the target encoding scheme? E.g. HTML, ASCII
-It looks like utf8_decode() or one of the functions included in the comments will work.
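For option 1, dropping the non-ASCII characters could look roughly like the Python sketch below. This is only an illustration of the idea, not Photo Organizer code, and the function name strip_non_ascii is hypothetical.

```python
def strip_non_ascii(raw: bytes) -> str:
    """Drop every byte outside the 7-bit ASCII range and return the rest.

    Hypothetical helper: it silently discards data, so accented
    characters simply vanish from the result.
    """
    # errors="ignore" skips each byte that is not valid ASCII.
    return raw.decode("ascii", errors="ignore")
```

Note that every byte of a multi-byte UTF-8 character has its first bit set, so the whole character is removed rather than leaving stray bytes behind.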

pizza commented on 2007-06-24 19:07

I'd much rather everything we get (even ASCII) get converted to UTF-8 text.

The danger here isn't that we'll mis-detect UTF-8, but rather that we'll treat something as UTF-8 that is actually something else (e.g. ISO-8859-2).

XML makes the encoding scheme detectable by starting every document with a constant declaration, "<?xml". Is there anything like this in the IPTC and XMP data?

If we can find out the encoding scheme, then converting won't be a problem, but if there isn't a constant message header, then all we can do is check for ASCII compliance, then for UTF-8 compliance and then take a guess or ask the user. I do not think there is a practical way to determine the character sets of any of the 8-bit ISO encodings. The only possible way that I can think of is to decode chunks of it by each ISO character set, one by one and then run them through dictionaries (language determined by the character set) to see which has more correctly spelled words. I don't think we have enough text for that.
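The check order described above (ASCII first, then UTF-8, then a guess) might be sketched in Python as follows. The function name to_utf8_text and the choice of ISO-8859-1 as the default guess are assumptions made for this example; the fallback really stands in for "take a guess or ask the user".

```python
from typing import Tuple


def to_utf8_text(raw: bytes, fallback: str = "iso-8859-1") -> Tuple[str, str]:
    """Decode raw metadata bytes and report which charset was assumed.

    Hypothetical helper. Tries strict ASCII, then strict UTF-8, and
    finally falls back to a guessed 8-bit charset (an assumption; the
    real charset cannot be determined from the bytes alone).
    """
    try:
        return raw.decode("ascii"), "ascii"
    except UnicodeDecodeError:
        pass
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Last resort: every byte is legal in an 8-bit ISO charset, so this
    # never fails, but the result may be wrong if the guess is wrong.
    return raw.decode(fallback, errors="replace"), fallback
```

The decoded string can then be stored as UTF-8 in the database. The weak spot is the last step: text in ISO-8859-2 would decode without error as ISO-8859-1 and come out with the wrong characters.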

Is there a list of character sets that most cameras use?

pizza commented on 2007-06-25 14:09

XMP, I believe, has standard charset encoding representation above and beyond what XML specifies. (Actually, I take that back. I just checked in a fix for this, and now it supposedly works properly.)

IPTC data also has a standard tag to represent character encoding, but unfortunately none of the IPTC-equipped files (even the ones Photoshop generates) actually use that tag. That is the basic problem with the image attached to this bug ticket.

pizza commented on 2007-06-25 14:22

Basically, IPTC data is supposed to be straight ASCII unless the encoding tag (1:90) is present.
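In code, that rule might look like the Python sketch below. It assumes the raw value of tag 1:90 has already been extracted by some other means, and that UTF-8 is announced there by the ISO 2022 escape sequence ESC % G; that detail should be checked against the IPTC IIM specification. The function name iptc_text_encoding is hypothetical.

```python
from typing import Optional

# Assumption: the escape sequence ESC % G in tag 1:90 designates UTF-8.
UTF8_ESCAPE = b"\x1b%G"


def iptc_text_encoding(coded_charset: Optional[bytes]) -> Optional[str]:
    """Map the raw value of IPTC tag 1:90 to a codec name.

    Hypothetical helper. Returns "ascii" when the tag is absent,
    "utf-8" for the UTF-8 escape sequence, and None for any other
    designation, which this sketch does not try to interpret.
    """
    if coded_charset is None:
        return "ascii"  # no 1:90 tag: data is supposed to be straight ASCII
    if coded_charset == UTF8_ESCAPE:
        return "utf-8"
    return None  # some other character set; caller has to decide
```

Since real-world files rarely carry the tag, a None or "ascii" answer here still has to be double-checked against the actual bytes.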

PO's current IPTC code is rather convoluted (as is the XMP decoding) and my general feeling at the moment is that both should be ripped out altogether in favor of just using ExifTool's superior abilities.

pizza commented on 2007-06-26 18:46

As an FYI -- I have a few high priority bugs to hunt down first (affecting photo navigation), but once those are taken care of I'll take care of the IPTC/XMP parsing problems one way or another.

pizza commented on 2007-06-29 00:32

I've committed the framework to make this happen. IPTC/XMP importing will be handled entirely via ExifTool using the same mechanisms as the EXIF importer.

pizza commented on 2007-08-07 03:40

r1515 now stores the ExifTool-extracted data into the database, and it appears to deal with UTF-8 data just fine.

Next up is to migrate the parsing data to deal with this new information.

