Photo Organizer

  • Status Closed
  • Percent Complete 100%
  • Task Type Bug Report
  • Category Backend / Core
  • Assigned To Solomon Peachy (pizza)
  • Operating System All
  • Severity Low
  • Priority Normal
  • Reported Version Devel
  • Due in Version 2.34
  • Due Date Undecided
  • Votes 0
  • Private No
Attached to Project: Photo Organizer
Opened by Solomon Peachy (pizza) - 2007-03-19
Last edited by Solomon Peachy (pizza) - 2007-08-09

FS#181 - Handle UTF-8 encoded IPTC+XMP data.

IPTC and XMP data that's encoded via UTF-8 doesn't properly import, with the database barfing on the actual import due to illegal characters.

Theoretically XMP and IPTC data can have encoding tags, but in practice nobody uses them. We need to detect the charset and convert to UTF-8 before we can shove anything in the database.

This task does not depend on any other tasks.

Closed by  Solomon Peachy (pizza)
Thursday, 09 August 2007, 15:55 GMT
Reason for closing:  Implemented
Additional comments about closing:  Finally implemented and committed; see r1534. (will be in 2.34-rc3)
Jeff Robins (jeffrobins)
Sunday, 24 June 2007, 17:20 GMT
Is there any way to disable/remove the IPTC and XMP data that's encoded in UTF-8?

Also, is it possible to just set the database to UTF-8 as is done with the language database?
Solomon Peachy (pizza)
Sunday, 24 June 2007, 17:28 GMT
The database encoding is set when the database is created; AFAIK the only way to switch is to dump it, convert the dump as needed, then re-create the database with the correct encoding.

As for disabling UTF-8 text in IPTC/XMP fields, there's probably a simple way to filter the data against legal ASCII.
Jeff Robins (jeffrobins)
Sunday, 24 June 2007, 17:41 GMT
It looks like the way to detect UTF-8 is to look at the leading bits of each byte. Plain (non-extended) ASCII characters all have a zero as the first bit, so every byte in UTF-8 text either is ASCII-compatible or has a one as the first bit. The first five bits of the lead byte of a UTF-8 character describe how many bytes (1-4) the character occupies.
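
A minimal sketch of that bit test, in Python rather than PO's own code. The function names are made up for illustration, and the check is structural only: it does not reject overlong encodings or surrogate code points, which a full validator would.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length implied by a lead byte, or 0 if it cannot start a character."""
    if (lead & 0b1000_0000) == 0b0000_0000:
        return 1  # 0xxxxxxx: plain ASCII
    if (lead & 0b1110_0000) == 0b1100_0000:
        return 2  # 110xxxxx
    if (lead & 0b1111_0000) == 0b1110_0000:
        return 3  # 1110xxxx
    if (lead & 0b1111_1000) == 0b1111_0000:
        return 4  # 11110xxx
    return 0      # 10xxxxxx (continuation byte) or an invalid pattern


def looks_like_utf8(data: bytes) -> bool:
    """Structural check: every lead byte is followed by the right number of 10xxxxxx bytes."""
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        if n == 0 or i + n > len(data):
            return False
        # All bytes after the lead byte must be continuation bytes.
        if any((b & 0b1100_0000) != 0b1000_0000 for b in data[i + 1:i + n]):
            return False
        i += n
    return True
```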

This should be simple to do. I just need a couple of questions answered first:
1) Do you want to drop the non-ASCII characters?
-Very simple
2) What is the end encoding scheme? Ex: HTML, ASCII
-It looks like utf8_decode() or one of the functions included in the comments will work.

Solomon Peachy (pizza)
Sunday, 24 June 2007, 19:07 GMT
I'd much rather everything we get (even ASCII) get converted to UTF-8 text.

The danger here isn't that we'll mis-detect UTF-8, but rather that we'll treat something as UTF-8 that is actually something else (e.g. ISO-8859-2).
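
To illustrate the ambiguity, here is a small Python sketch (not PO code; to_utf8 is a hypothetical helper). The same two bytes decode without error under both UTF-8 and ISO-8859-2, but to different text, so passing a decoder is not proof of the charset.

```python
def to_utf8(data: bytes, source_charset: str) -> bytes:
    """Re-encode bytes from a known source charset to UTF-8 (hypothetical helper)."""
    return data.decode(source_charset).encode("utf-8")


raw = "é".encode("utf-8")             # the two bytes 0xC3 0xA9

as_utf8 = raw.decode("utf-8")         # one character
as_latin2 = raw.decode("iso-8859-2")  # also succeeds, but yields two different characters

# Both decodes succeed, so the bytes alone cannot say which reading is right.
ambiguous = as_utf8 != as_latin2
```

Converting is only safe once the source charset is actually known, which is why to_utf8 takes it as an explicit argument instead of guessing.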
Jeff Robins (jeffrobins)
Sunday, 24 June 2007, 20:34 GMT
XML attempts to determine the encoding scheme by embedding a constant message in the beginning of every document "<?xml". Is there anything like this in the IPTC and XMP data?

If we can find out the encoding scheme, then converting won't be a problem, but if there isn't a constant message header, then all we can do is check for ASCII compliance, then for UTF-8 compliance and then take a guess or ask the user. I do not think there is a practical way to determine the character sets of any of the 8-bit ISO encodings. The only possible way that I can think of is to decode chunks of it by each ISO character set, one by one and then run them through dictionaries (language determined by the character set) to see which has more correctly spelled words. I don't think we have enough text for that.
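
The check order described above (ASCII first, then UTF-8, then give up and guess or ask) might look like the following Python sketch. The function name and the returned labels are made up for illustration.

```python
def classify_text_bytes(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'unknown-8bit' for a chunk of metadata text."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Some 8-bit charset; which one cannot be determined from the bytes alone,
        # so the caller has to guess a default or ask the user.
        return "unknown-8bit"
```

Note that a result of 'utf-8' only means the bytes are valid UTF-8, not that the author intended them as such.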

Is there a list of character sets that most cameras use?
Solomon Peachy (pizza)
Monday, 25 June 2007, 14:09 GMT
XMP, I believe, has standard charset encoding representation above and beyond what XML specifies. (Actually, I take that back. I just checked in a fix for this, and now it supposedly works properly.)

IPTC data also has a standard tag to represent character encoding, but unfortunately none of the IPTC-equipped files (even the ones Photoshop generates) actually use that tag. That is the basic problem with the image attached to this bug ticket.

Solomon Peachy (pizza)
Monday, 25 June 2007, 14:22 GMT
Basically, IPTC data is supposed to be straight ASCII unless the encoding tag (1:90) is present.
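
As a rough Python sketch of that rule (not PO's actual code): the 1:90 value is an ISO 2022 escape sequence, and ESC % G is the one that announces UTF-8. The function names are hypothetical, and any other escape sequence is simply reported as unsupported here.

```python
from typing import Optional

UTF8_ESCAPE = b"\x1b%G"  # ESC % G: the 1:90 value that announces UTF-8


def iptc_text_codec(coded_character_set: Optional[bytes]) -> str:
    """Pick a codec name from the raw 1:90 value (None means the tag is absent)."""
    if coded_character_set is None:
        return "ascii"  # no tag: the text is supposed to be straight ASCII
    if coded_character_set == UTF8_ESCAPE:
        return "utf-8"
    raise ValueError("unsupported 1:90 value: %r" % coded_character_set)


def decode_iptc_field(value: bytes, coded_character_set: Optional[bytes]) -> str:
    """Decode one IPTC text field according to the 1:90 rule."""
    return value.decode(iptc_text_codec(coded_character_set))
```

With the tag missing, any non-ASCII byte makes decode_iptc_field raise UnicodeDecodeError, which is the situation this ticket describes.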

PO's current IPTC code is rather convoluted (as is the XMP decoding) and my general feeling at the moment is that both should be ripped out altogether in favor of just using ExifTool's superior abilities.

Solomon Peachy (pizza)
Tuesday, 26 June 2007, 18:46 GMT
As an FYI -- I have a few high priority bugs to hunt down first (affecting photo navigation), but once those are taken care of I'll take care of the IPTC/XMP parsing problems one way or another.

Solomon Peachy (pizza)
Friday, 29 June 2007, 00:32 GMT
I've committed the framework to make this happen. IPTC/XMP importing will be handled entirely via ExifTool using the same mechanisms as the EXIF importer.
Solomon Peachy (pizza)
Tuesday, 07 August 2007, 03:40 GMT
r1515 now stores the ExifTool-extracted data into the database, and it appears to deal with UTF-8 data just fine.
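
For readers unfamiliar with ExifTool, a hedged Python sketch of pulling IPTC/XMP fields as JSON follows. This is not what r1515 does internally; the function names are made up, and it assumes the exiftool binary is on the PATH. The -j (JSON output), -G (group names in keys) and -GROUP:all options are standard ExifTool command-line options.

```python
import json
import subprocess


def exiftool_command(path: str) -> list:
    """Build an ExifTool command line that emits IPTC and XMP tags as JSON."""
    return ["exiftool", "-j", "-G", "-IPTC:all", "-XMP:all", path]


def parse_exiftool_json(output: str) -> dict:
    """ExifTool's -j output is a JSON array with one object per file; return the first."""
    records = json.loads(output)
    return records[0] if records else {}


def read_metadata(path: str) -> dict:
    """Run ExifTool on one file and return its tags (requires exiftool to be installed)."""
    result = subprocess.run(exiftool_command(path), capture_output=True, check=True)
    return parse_exiftool_json(result.stdout.decode("utf-8"))
```

Because the JSON output is already UTF-8, the values can be handed to a UTF-8 database without a separate charset-detection step.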

Next up is to migrate the parsing data to deal with this new information.
