

Why I hate curly-quotes

Lee-Anne Phillips
02:49 PM ET (US)

The problem is that the tools were, by and large, quite deliberately designed to break other vendors’ browsers, which was (mostly) done by Microsoft to encourage people to use Microsoft tools in an all-Microsoft environment.

A similar strategy by automobile manufacturers would have the brake and gas pedals randomly switched between vehicles, and perhaps the windshield wiper controls and the shift lever. Shifting from Ford to Honda, in this scenario, might entail a real risk of killing someone. Brilliant marketing move, though.

We have a long history of depending on standards, so that a pound of coffee one purchases in New York is pretty much the same as a pound of coffee from Seattle.

We also have a long history of problems associated with conflicting standards: The "Phillips Head" screw, for example, was invented to make life easier on early assembly lines by allowing the driver bit to "cam out," that is, either to pop right out of the screw, or to strip the cross-points of the screw when over-torqued, resulting in entire generations of home mechanics cursing the damned things, since they were deliberately designed to accommodate professional power tools. The square-drive screw, on the other hand, was designed before the Phillips-head, and is a pure pleasure to work with. They’re easy to find in Canada (that’s where they were invented) but difficult to come by in the USA, since Detroit invented the Phillips-head.

We’ll pass over the "metric" versus "English" measurement controversy, since it’s pretty much America versus the entire rest of the world by now, but the same US-centrism holds true for the several lame Microsoft "code-pages" versus Unicode. Heck, if we’re that conservative, let’s all just go back to shillings and ha'pennies, and please don’t let’s forget the gold standard!

Life moves on.

I personally think that Microsoft should have supplied handy tools to fix the problems they created gratis, but then I’ve had plenty of practice in believing six impossible things before breakfast.

jleader
01:11 PM ET (US)
Someday, Unicode will solve all our problems. Then, we'll have to invent new problems.
Chris Smith
11:32 AM ET (US)
I had a long thing written here about why curly quotes lead to broken feeds. Then I figured out what ACTUALLY happened - it's almost Cory's fault (grin).
Here goes...

This is Blogger Pro's xml declaration (sans angle brackets)

  ?xml version="1.0"?

This is bb's headers from http://boingboing.net/rss.xml

  HTTP/1.1 200 OK
  Date: Tue, 25 Feb 2003 16:46:42 GMT
  Server: Apache
  Last-Modified: Tue, 25 Feb 2003 16:22:28 GMT
  ETag: "7d57-49f5-3e5b9844"
  Accept-Ranges: bytes
  Content-Length: 18933
  Connection: close
  Content-Type: text/xml

Notice that neither of these specifies a text encoding.
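A minimal sketch of how a consumer might check both places for a declared encoding. The helper names `declared_charset` and `xml_declared_encoding` are hypothetical, and the regular expression is a rough approximation of the XML declaration syntax, not a full parser.

```python
import re
from email.message import Message


def declared_charset(content_type: str):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def xml_declared_encoding(prolog: str):
    """Return the encoding named in an XML declaration, or None if absent."""
    match = re.match(
        r'<\?xml[^>]*\bencoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']', prolog
    )
    return match.group(1) if match else None


# The two values quoted above: neither names an encoding.
print(declared_charset("text/xml"))
print(xml_declared_encoding('<?xml version="1.0"?>'))

# What a fully declared feed could look like instead.
print(declared_charset("text/xml; charset=utf-8"))
print(xml_declared_encoding('<?xml version="1.0" encoding="utf-8"?>'))
```

When both checks come back empty, an XML parser falls back to the rule from the spec quoted next.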

Now for the most critical bit from 4.3.3 in the XML spec.

  It is also a fatal error if an XML entity contains
  no encoding declaration and its content is not
  legal UTF-8 or UTF-16.

...and that's where it was going sour. Curly quotes (and foreign currencies and simple fractions) would cause invalid utf-8 sequences to appear, breaking any utf-8 parsers. Given the combination of RSS and headers, it was incumbent on Cory et al to ensure that all postings were in utf-8.
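To make the failure concrete, here is a small sketch, assuming the posting arrived as windows-1252 bytes (the sample text is invented for illustration). Each of these characters is a single byte above 0x7F in that encoding, and a lone byte in that range is not a legal UTF-8 sequence.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Curly quotes, a pound sign and a one-half fraction, as windows-1252 bytes.
raw = "\u201cHello\u201d \u00a35 \u00bd".encode("cp1252")

print(raw)                            # the raw single-byte values
print(is_valid_utf8(raw))             # False: a strict parser must reject this
print(is_valid_utf8(b"plain ASCII"))  # True: ASCII is also valid UTF-8
```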

This would be difficult though, given that many sites (including Blogger) include no HTML charset declaration. In such cases, the usual fallback is to assume ISO-8859-1. This would *almost* be ok, except that there are no curly quotes in 8859-1, so many systems cheat, and just use the windows-1252 codes.

The resulting conflict of assumptions meant that postings to the site went in as ISO-8859-1, but came out in an RSS feed that parsers read as utf-8. Unfortunately, no actual conversion of data took place at the server to deal with this change.
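The missing conversion step could look something like the sketch below. It assumes the submitted bytes are really windows-1252 (the "cheat" described above); `to_utf8` is a hypothetical helper, not anything Blogger is known to run.

```python
def to_utf8(submitted: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode the bytes using the encoding they were typed in, then re-encode as UTF-8."""
    return submitted.decode(source_encoding).encode("utf-8")


posting = b"\x93curly\x94"      # curly quotes as windows-1252 bytes
converted = to_utf8(posting)

print(converted)                  # multi-byte UTF-8 sequences for each quote
print(converted.decode("utf-8"))  # round-trips to the intended text
```

Python's `cp1252` codec raises an error on the handful of byte values that windows-1252 leaves undefined, so a real server would need a policy for those.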

Aaron's instructions have replaced assumptions with solid references. bb now comes out in utf-8 (because of the meta tag in the top), and Mozilla can be told to override the submission in 8859-1 and use utf-8. Mozilla appears to have the smarts to coerce many foreign types into utf-8 correctly, so that cutting and pasting invokes a conversion to the appropriate encodings. Cutting and pasting (particularly the pasting) appears to be a magical process - it changes depending on the various settings of all the various components. This is the clue to the occasional nature of the problem - there are just too many hidden settings to keep track of.

The one thing that would make this job easier would be having the server side automatically declare utf-8 for both the bb pages and for the posting pages. Five years ago it would have broken too many browsers. Maybe the time for that change has finally come.
Eli the Bearded
06:58 PM ET (US)
Cory, about /m54, quite possibly the problem is that the
clipboard does not contain information about the charset
or when pasted that information is not checked.

It would match, eg, the epoch problem when cutting and pasting
dates between Mac and Windows originated Excel documents.

So much is affected by something as seemingly
simple as character sets.
Cory Doctorow
05:50 PM ET (US)
Well, hush my mouth! It *does* work -- yer a genius, Aaron.
Aaron Swartz
05:43 PM ET (US)
Cory, as I said, you need to put <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> in the <head> of the HTML template
Cory Doctorow
05:37 PM ET (US)
Aaron, that doesn't work either. The curly-quotes show up in the RSS, but when published to the blog, they show up as diphthongs.
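One plausible explanation for that symptom, sketched under the assumption that the blog page was being displayed as windows-1252 while the stored text was UTF-8: the three UTF-8 bytes of an opening curly quote are each shown as a separate character, and the last of them is the "œ" ligature.

```python
opening_quote = "\u201c"                  # left double quotation mark
utf8_bytes = opening_quote.encode("utf-8")

# A page without a utf-8 declaration may be interpreted as windows-1252.
misread = utf8_bytes.decode("cp1252")

print(utf8_bytes)  # b'\xe2\x80\x9c'
print(misread)     # three characters instead of one quote mark
```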
Cory Doctorow
05:36 PM ET (US)
Chris, look at the original post: the tools aren't there yet. Every tool I've used can and does break in some instances of curly-quotes -- try pasting a word doc with curly-quotes into mutt and mailing it to someone who reads it with Mailsmith -- at least one of those apps is going to render incorrectly, reducing the readability of your system.

The fact is that despite "standardization," curly-quotes are still mostly a proprietary affair.
Chris Smith
05:15 PM ET (US)
I'd love to see how any system thinks it can send or receive paired quotes in ISO-8859-1, since that character set doesn't include the relevant characters.
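A quick way to check that claim, as a sketch using Python's built-in codecs (`can_encode` is a hypothetical helper): a strict ISO-8859-1 encoder refuses curly quotes, while windows-1252 maps them to single bytes in the 0x80-0x9F range.

```python
def can_encode(text: str, codec: str) -> bool:
    """Report whether every character in text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


quotes = "\u201c\u201d"  # paired double curly quotes

print(can_encode(quotes, "iso-8859-1"))  # False: not in the character set
print(can_encode(quotes, "cp1252"))      # True
print(quotes.encode("cp1252"))           # b'\x93\x94'
```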

It's lying to you - the problem is knowing exactly HOW it's lying. That was always the problem I ran into when I hit character problems - there was never any clear documentation about which system bits did which conversion bits, and what spec they used to do it.
Chris Smith
04:23 PM ET (US)
Cory? Hang in there....

1) This stuff works ok, if not perfectly, when I try it. In fact, some of the not perfect is local - things like my browser which lies to me.
2) There are even tools out there to automatically convert straight quotes (the feet-inches type) to paired quotes ... including some common ones such as MS Word.
3) These are legal constructs in XML and thus in XHTML and RDF.

Given that it is both legal and working, how do you explain why people should stop? Fair enough, your tools break. But if they are breaking on what is defined as legitimate content, then is this the originators' fault?

Further - a validation run on bb's RSS suggests that the ONLY out-of-valid item is a single date field. Simply adding paired quotes to an otherwise valid RSS feed will not make it invalid. Even the windows-1252 characters are legit XML, they just shouldn't be displayed as quotes.

The one-sender / multi-recipient model DEPENDS on standards. You can't be driven by every tool author out there who complains that your feed breaks his reader if your feed is valid. It is NOT 'hiding behind the spec' to point them to a validation check, and tell them that they have to accept valid feeds.

This is why validators sometimes show you how to add a 'validate my feed' link to your work. If the first place someone can go is a validator that shows your feed is fine, then that is less likely to be an email to you. Maybe that's a start - something simple to keep the email load down.

There are a couple RSS validators, but here's a start. I think it's likely that you've heard of these before - time to see if you can use one to save yourself some workload?

Aaron Swartz
04:20 PM ET (US)
As for the other editors, what browsers do they use?

In Safari: Safari, Preferences, Appearance, Default Encoding, Unicode (UTF-8).
Aaron Swartz
04:18 PM ET (US)
OK, so I downloaded Mozilla and set up a Blogspot account. For some reason beyond puny human comprehension, Mozilla persists in seeing Blogger.com as being in ISO-8859-1 despite all indications and instructions to the contrary.

Luckily, if you visit the blogger posting page and select View menu, Character Encoding, Unicode (UTF-8) it does things right. This seems to stick across quitting the application and new posts.

Tech details: Mozilla can be totally wacky.
Eli the Bearded
04:16 PM ET (US)
Overall, I'd say lack of unicode support is an application
problem, not a problem with people using it. Just because
you only see it for smart quotes does not make those characters
the real problem.

Assuming you've got a unicode system, I like to spell my surname
with a "ffi" glyph (U+FB03). Not sure if QT will get it
right here.
Cory Doctorow
03:48 PM ET (US)
Doesn't work.
Cory Doctorow
03:44 PM ET (US)
Let's see if that works. Of course, I have three co-editors who don't use Moz.
Aaron Swartz
03:40 PM ET (US)
OK, I think I figured out how to fix Mozilla so your RSS feed won't break again, Cory:

From the Mozilla menu, select Preferences. Click Languages in the Navigator category. Under Default Character Coding select "Unicode (UTF-8)". Click OK.

Tech details: Mozilla stupidly assumes pages are in the legacy ISO-8859-1 format, and so it sends the smart quotes in ISO-8859-1. Web browsers have code to check if the page is using ISO-8859-1 and handle it appropriately but many RSS readers don't, because XML feeds are supposed to do the right thing and use UTF-8. By telling Mozilla to do the right thing and assume UTF-8 also, Blogger gets the right characters, and puts them in the RSS feed correctly.

You should also put <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> in the <head> of the HTML template for other browsers that don't assume UTF-8.
Edited 02-24-2003 03:43 PM
