This is the bug tracker for Photo Organizer.
FS#181 - Handle UTF-8 encoded IPTC+XMP data.
Attached to Project:
Photo Organizer
Opened by Solomon Peachy (pizza) - Monday, 19 March 2007, 14:13 GMT-4
Last edited by Solomon Peachy (pizza) - Thursday, 09 August 2007, 11:55 GMT-4
Details
IPTC and XMP data that's encoded in UTF-8 doesn't import properly; the database barfs on the actual import due to illegal characters.
Theoretically, XMP and IPTC data can carry encoding tags, but in practice nobody uses them. We need to detect the charset and convert to UTF-8 before we can shove anything into the database.
Closed by Solomon Peachy (pizza)
Thursday, 09 August 2007, 11:55 GMT-4
Reason for closing: Implemented
Additional comments about closing: Finally implemented and committed; see r1534. (will be in 2.34-rc3)
Test-1.jpg
Also, is it possible to just set the database to UTF-8 as is done with the language database?
As for disabling UTF-8 text in IPTC/XMP fields, there's probably a simple way to filter the data against legal ASCII.
This should be simple to do. I just need a couple of questions answered first:
1) Do you want to drop the non-ASCII characters?
-Very simple
2) What is the target encoding scheme? E.g. HTML, ASCII
-It looks like utf8_decode() or one of the functions included in the comments will work.
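
For option 1, here is a minimal sketch of what "drop the non-ASCII characters" could look like. It is written in Python purely for illustration, it is not PO's code, and the function name strip_non_ascii is made up for this example.

```python
def strip_non_ascii(raw: bytes) -> str:
    """Drop every byte outside the 7-bit ASCII range (hypothetical helper)."""
    # errors="ignore" silently discards bytes >= 0x80, so a multi-byte
    # UTF-8 sequence disappears as a whole rather than leaving debris.
    return raw.decode("ascii", errors="ignore")


# The "ü" is lost whichever encoding the caption was written in.
print(strip_non_ascii("Zürich".encode("utf-8")))       # Zrich
print(strip_non_ascii("Zürich".encode("iso-8859-1")))  # Zrich
```

This is lossy by design: it makes the import safe but throws away every accented character, which is the trade-off behind question 1.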
The danger here isn't that we'll fail to recognize genuine UTF-8, but rather that we'll treat something as UTF-8 that is actually something else (e.g. ISO-8859-2).
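
To illustrate that ambiguity, here is a small Python sketch (not from PO, and the sample text is made up) showing one byte string that decodes without error under both encodings and gives different text each time.

```python
# Two bytes that form one accented letter when read as UTF-8.
raw = "é".encode("utf-8")

as_utf8 = raw.decode("utf-8")         # one character
as_latin2 = raw.decode("iso-8859-2")  # two unrelated characters, no error

# Both decodes succeed, so validity alone cannot tell us which one
# the author of the caption actually meant.
print(len(as_utf8), len(as_latin2))
print(as_utf8 == as_latin2)
```

Short non-ASCII runs like this are the risky ones; the longer the text, the less likely that legacy 8-bit data happens to be well-formed UTF-8.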
If we can find out the encoding scheme, then converting won't be a problem. But if there isn't a consistent header declaring it, all we can do is check for ASCII compliance, then for UTF-8 compliance, and then take a guess or ask the user. I do not think there is a practical way to tell the 8-bit ISO character sets apart. The only way I can think of is to decode chunks of the text with each ISO character set, one by one, and then run the results through dictionaries (language determined by the character set) to see which yields the most correctly spelled words. I don't think we have enough text for that.
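
The ASCII, then UTF-8, then guess sequence described above can be sketched as follows. This is a Python illustration under stated assumptions, not the code that ended up in PO: the function name to_utf8_text and the ISO-8859-1 default for the guess are both made up for the example.

```python
from typing import Tuple


def to_utf8_text(raw: bytes, fallback: str = "iso-8859-1") -> Tuple[str, str]:
    """Return (text, encoding_used) for one raw metadata field.

    Hypothetical helper; the fallback charset is only a guess.
    """
    # Step 1: pure ASCII is safe as-is.
    if raw.isascii():
        return raw.decode("ascii"), "ascii"
    # Step 2: strict UTF-8 decoding fails on most legacy 8-bit text,
    # so a successful decode is reasonable (not conclusive) evidence.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Step 3: nothing left to test; fall back to the assumed charset.
    return raw.decode(fallback, errors="replace"), fallback


text, used = to_utf8_text("Zürich".encode("iso-8859-1"))
print(text, used)
# The returned str can then be written out as UTF-8 for the database.
utf8_bytes = text.encode("utf-8")
```

Returning the encoding that was used lets the caller flag the guessed cases, for example to ask the user as suggested above.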
Is there a list of character sets that most cameras use?
IPTC data also has a standard tag to represent character encoding, but unfortunately none of the IPTC-equipped files (even the ones Photoshop generates) actually uses that tag. That is the basic problem with the image attached to this bug ticket.
PO's current IPTC code is rather convoluted (as is the XMP decoding) and my general feeling at the moment is that both should be ripped out altogether in favor of just using ExifTool's superior abilities.
Next up is to migrate the parsing data to deal with this new information.