jleader 
02-25-2003
01:11 PM ET (US)
|
Someday, Unicode will solve all our problems. Then, we'll have to invent new problems.
|
Chris Smith 
02-25-2003
11:32 AM ET (US)
|
I had a long thing written here about why curly quotes lead to broken feeds. Then I figured out what ACTUALLY happened - it's almost Cory's fault (grin). Here goes...
This is Blogger Pro's xml declaration (sans angle brackets)
?xml version="1.0"?
This is bb's headers from http://boingboing.net/rss.xml
HTTP/1.1 200 OK
Date: Tue, 25 Feb 2003 16:46:42 GMT
Server: Apache
Last-Modified: Tue, 25 Feb 2003 16:22:28 GMT
ETag: "7d57-49f5-3e5b9844"
Accept-Ranges: bytes
Content-Length: 18933
Connection: close
Content-Type: text/xml
Notice that neither of these specifies a text encoding.
Now for the most critical bit from 4.3.3 in the XML spec.
It is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.
...and that's where it was going sour. Curly quotes (and foreign currencies and simple fractions) would cause invalid utf-8 sequences to appear, breaking any utf-8 parsers. Given the combination of RSS and headers, it was incumbent on Cory et al to ensure that all postings were in utf-8.
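A minimal Python sketch of that failure, assuming a made-up one-element fragment rather than the real bb feed. The same text is serialized twice behind an XML declaration with no encoding: once as windows-1252 bytes and once as utf-8. The standard-library parser stands in for "any utf-8 parser"; other parsers may report the error differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the element name and text are made up for illustration.
declaration = '<?xml version="1.0"?>'  # no encoding declared, as in the feed
text = "<title>\u201cHello\u201d</title>"  # curly quotes around Hello


def parses(raw: bytes) -> bool:
    """Return True if the bytes parse as XML, False on a parse error."""
    try:
        ET.fromstring(raw)
        return True
    except ET.ParseError:
        return False


# Same characters, two different byte encodings.
as_cp1252 = (declaration + text).encode("cp1252")  # quotes become 0x93 / 0x94
as_utf8 = (declaration + text).encode("utf-8")     # quotes become 3-byte sequences

print("windows-1252 bytes parse:", parses(as_cp1252))
print("utf-8 bytes parse:", parses(as_utf8))
```

With no encoding declared, the parser treats the bytes as utf-8, so the lone 0x93 and 0x94 bytes are what trip it up.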
This would be difficult though, given that many sites (including Blogger) include no HTML charset declaration. In such cases, the usual fallback is to assume ISO-8859-1. This would *almost* be ok, except that there are no curly quotes in 8859-1, so many systems cheat, and just use the windows-1252 codes.
The resulting conflict of assumptions meant that postings to the site went in in ISO-8859-1, but came out in the RSS feed in utf-8. Unfortunately, no actual conversion of data took place at the server to deal with this change.
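The missing conversion step could look something like the sketch below. It assumes the posted bytes really are windows-1252 (which, as noted above, is itself a guess the server has to make), and `cp1252_to_utf8` is a hypothetical helper name, not anything Blogger actually runs.

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes that are assumed to be windows-1252 as utf-8."""
    return raw.decode("cp1252").encode("utf-8")


# Sample posting: curly quotes, an en dash and a one-half fraction,
# written as the single bytes windows-1252 uses for them.
posted = b"\x93curly\x94 \x96 \xbd"
fixed = cp1252_to_utf8(posted)

print(posted)
print(fixed)
print(fixed.decode("utf-8"))
```

One caveat: Python's cp1252 codec raises an error on the few byte values windows-1252 leaves undefined (0x81, for example), so real input would need an explicit error policy.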
Aaron's instructions have replaced assumptions with solid references. bb now comes out in utf-8 (because of the meta tag in the top), and Mozilla can be told to override the submission in 8859-1 and use utf-8. Mozilla appears to have the smarts to coerce many foreign types into utf-8 correctly, so that cutting and pasting invokes a conversion to the appropriate encodings. Cutting and pasting (particularly the pasting) appears to be a magical process - it changes depending on the various settings of all the various components. This is the clue to the occasional nature of the problem - there are just too many hidden settings to keep track of.
The one thing that would make this job easier would be having the server side automatically declare utf-8 for both the bb pages and for the posting pages. Five years ago it would have broken too many browsers. Maybe the time for that change has finally come.
|
Eli the Bearded 
02-24-2003
06:58 PM ET (US)
|
Cory, about /m54, quite possibly the problem is that the clipboard does not carry information about the charset, or that the information is never checked when pasting.
It would match, e.g., the epoch problem when cutting and pasting dates between Mac- and Windows-originated Excel documents.
So much is affected by something as seemingly simple as character sets.
|
Cory Doctorow 
02-24-2003
05:50 PM ET (US)
|
Well, hush my mouth! It *does* work -- yer a genius, Aaron.
|
Aaron Swartz 
02-24-2003
05:43 PM ET (US)
|
Cory, as I said, you need to put <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> in the <head> of the HTML template
|
Cory Doctorow 
02-24-2003
05:37 PM ET (US)
|
Aaron, that doesn't work either. The curly-quotes show up in the RSS, but when published to the blog, they show up as diphthongs.
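One plausible reading of the "diphthongs" (an assumption; the thread doesn't show the actual output): the quote is stored as utf-8, but the page is displayed as if it were windows-1252, so each of the three bytes turns into its own character. A small Python sketch of that misreading:

```python
left_quote = "\u201c"  # left curly double quote

# Stored correctly as utf-8: three bytes.
utf8_bytes = left_quote.encode("utf-8")

# Displayed as if each byte were a windows-1252 character.
misread = utf8_bytes.decode("cp1252")

print([hex(b) for b in utf8_bytes])
print(misread)
print([f"U+{ord(ch):04X}" for ch in misread])
```

The last of the three misread characters is the oe ligature (U+0153), which would fit the description.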
|
Cory Doctorow 
02-24-2003
05:36 PM ET (US)
|
Chris, look at the original post: the tools aren't there yet. Every tool I've used can and does break in some instances of curly-quotes -- try pasting a word doc with curly-quotes into mutt and mailing it to someone who reads it with Mailsmith -- at least one of those apps is going to render incorrectly, reducing the readability of your system.
The fact is that despite "standardization," curly-quotes are still mostly a proprietary affair.
|
Chris Smith 
02-24-2003
05:15 PM ET (US)
|
I'd love to see how any system thinks it can send or receive paired quotes in ISO-8859-1, since that character set doesn't include the relevant characters.
It's lying to you - the problem is knowing exactly HOW it's lying. That was always the problem I ran into when I hit character problems - there was never any clear documentation about which system bits did which conversion bits, and what spec they used to do it.
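A quick Python check of that claim, using only the standard codecs. It shows one way the lie can happen (a sketch, not a statement about what any particular tool does): the quote has no ISO-8859-1 encoding at all, but its windows-1252 byte is still a valid ISO-8859-1 byte that means something else.

```python
import unicodedata

quote = "\u201c"  # left curly double quote

# ISO-8859-1 has no code for this character.
try:
    quote.encode("iso-8859-1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False

# windows-1252 does: a single byte in the 0x80-0x9F range.
cp1252_byte = quote.encode("cp1252")

# Read honestly as ISO-8859-1, that byte is a control character, not a quote.
as_latin1 = cp1252_byte.decode("iso-8859-1")

print("encodable in ISO-8859-1:", in_latin1)
print("windows-1252 byte:", cp1252_byte)
print("same byte as ISO-8859-1:", repr(as_latin1), unicodedata.category(as_latin1))
```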
|
Chris Smith 
02-24-2003
04:23 PM ET (US)
|
Cory? Hang in there....
1) This stuff works ok, if not perfectly, when I try it. In fact, some of the not perfect is local - things like my browser which lies to me.
2) There are even tools out there to automatically convert straight quotes (the feet-inches type) to paired quotes ... including some common ones such as MS Word.
3) These are legal constructs in XML and thus in XHTML and RDF.
Given that it is both legal and working, how do you explain why people should stop? Fair enough, your tools break. But if they are breaking on what is defined as legitimate content, then is this the originators' fault?
Further - a validation run on bb's RSS suggests that the ONLY out-of-valid item is a single date field. Simply adding paired quotes to an otherwise valid RSS feed will not make it invalid. Even the windows-1252 characters are legit XML, they just shouldn't be displayed as quotes.
The one-sender / multi-recipient model DEPENDS on standards. You can't be driven by every tool author out there who complains that your feed breaks his reader if your feed is valid. It is NOT 'hiding behind the spec' to point them to a validation check, and tell them that they have to accept valid feeds.
This is why validators sometimes show you how to add a 'validate my feed' link to your work. If the first place someone can go is a validator that shows your feed is fine, then that is less likely to be an email to you. Maybe that's a start - something simple to keep the email load down.
There are a couple RSS validators, but here's a start. I think it's likely that you've heard of these before - time to see if you can use one to save yourself some workload?
http://feeds.archive.org/validator/check?u...boing.net%2Frss.xml
|
Aaron Swartz 
02-24-2003
04:20 PM ET (US)
|
As for the other editors, what browsers do they use?
In Safari: Safari, Preferences, Appearance, Default Encoding, Unicode (UTF-8).
|
Aaron Swartz 
02-24-2003
04:18 PM ET (US)
|
OK, so I downloaded Mozilla and set up a Blogspot account. For some reason beyond puny human comprehension, Mozilla persists in seeing Blogger.com as being in ISO-8859-1 despite all indications and instructions to the contrary.
Luckily, if you visit the blogger posting page and select View menu, Character Encoding, Unicode (UTF-8) it does things right. This seems to stick across quitting the application and new posts.
Tech details: Mozilla can be totally wacky.
|
Eli the Bearded 
02-24-2003
04:16 PM ET (US)
|
Overall, I'd say lack of unicode support is an application problem, not a problem with people using it. Just because you only see it for smart quotes does not make those characters the real problem.
Assuming you've got a unicode system, I like to spell my surname with a "ffi" glyph (U+FB03). Not sure if QT will get it right here.
|
Cory Doctorow 
02-24-2003
03:48 PM ET (US)
|
Doesn't work.
|
Cory Doctorow 
02-24-2003
03:44 PM ET (US)
|
Let's see if that works. Of course, I have three co-editors who don't use Moz.
|
Aaron Swartz 
02-24-2003
03:40 PM ET (US)
|
OK, I think I figured out how to fix Mozilla so your RSS feed won't break again, Cory:
From the Mozilla menu, select Preferences. Click Languages in the Navigator category. Under Default Character Coding select "Unicode (UTF-8)". Click OK.
Tech details: Mozilla stupidly assumes pages are in the legacy ISO-8859-1 format, and so it sends the smart quotes in ISO-8859-1. Web browsers have code to check if the page is using ISO-8859-1 and handle it appropriately but many RSS readers don't, because XML feeds are supposed to do the right thing and use UTF-8. By telling Mozilla to do the right thing and assume UTF-8 also, Blogger gets the right characters, and puts them in the RSS feed correctly.
You should also put <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> in the <head> of the HTML template for other browsers that don't assume UTF-8.
Edited 02-24-2003 03:43 PM
|
Cory Doctorow 
02-24-2003
02:30 PM ET (US)
|
Glad it's not a problem for you. Maybe you can answer some of the 75+ emails I've received this weekend about the various curly-quotes that have snuck into BB posts by being pasted in from other blogs, breaking our RSS feed in a variety of readers...
God, the "get better tools" answer is a cop-out. How about "stop breaking the tools that people use to communicate in order to hew to some doctrinaire notion of 'correct' type?"
|