Threading emails

Wednesday 2 September 2009

This is over 15 years old. Be careful.

Trent Mick wrote to me some time ago asking for a feature on this blog: could I make it so that email notifications of blog comments would thread together nicely?

The email subject lines from my notifications look like this:

A comment on “Weird URL data encoding” from Richard Schwartz

I use Thunderbird for email, and don’t thread my inbox, so I never considered threading. Trent sent along information from a friend which said that “References:” headers were the key that would make a set of emails into a single thread.

I hacked for a little while, and could not get them to thread. I created a fake message id from the blog post and had all comment notifications have a References header with the id in it. No threading. I added unique Message-ID headers to each comment, then made subsequent comments have all previous message ids in a References header. No threading.
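A minimal sketch of the scheme described above, assuming Python's standard `email` package. The domain, the id formats, and the function names are hypothetical and are not the blog's actual code. Note that headers like these did not produce threading in the clients tested here.

```python
from email.message import EmailMessage

DOMAIN = "blog.example.com"  # hypothetical domain used only to build ids


def thread_root_id(post_slug):
    """Build a stable, fake Message-ID that stands for the blog post itself."""
    return f"<post.{post_slug}@{DOMAIN}>"


def comment_notification(post_slug, comment_no, earlier_ids, body):
    """Build one notification email for a comment.

    References lists the fake post id followed by the Message-IDs of all
    earlier comment notifications on the same post.
    """
    msg = EmailMessage()
    msg["Message-ID"] = f"<comment.{post_slug}.{comment_no}@{DOMAIN}>"
    msg["References"] = " ".join([thread_root_id(post_slug), *earlier_ids])
    msg.set_content(body)
    return msg
```

Each new notification would be passed the ids of the notifications sent before it, so the References list grows by one entry per comment.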

I tried the same in Gmail, and nothing seemed to thread the messages together. Googling around, it seemed others had come to the conclusion that only the subject line matters. Apparently if two messages have the same subject (plus or minus some “Re:” prefixes), then they are in the same thread.
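To make the "same subject, plus or minus Re:" idea concrete, here is a rough sketch in Python. It is only a guess at the kind of comparison a client might make, not the algorithm of Gmail, Thunderbird, or any other real client. The prefix list and the function names are assumptions.

```python
import re

# Assumed set of reply/forward prefixes; real clients likely recognize more.
_PREFIX = re.compile(r"^\s*(?:(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Strip leading Re:/Fwd: prefixes, collapse whitespace, and lowercase."""
    stripped = _PREFIX.sub("", subject)
    return " ".join(stripped.split()).lower()


def same_thread_by_subject(a, b):
    """Treat two messages as one thread if their normalized subjects match."""
    return normalize_subject(a) == normalize_subject(b)
```

Under a rule like this, "Re: Weird URL data encoding" and "Weird URL data encoding" would match, but two notification subjects that end in different commenter names would not.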

But what is the actual algorithm? I know that there can be differences in the subject lines (“Re:” and all). What are these mail clients doing to decide that two messages are in a thread?

I like having the author name in the subject line; it makes the Inbox listing richer. But it’s also what’s keeping these messages from threading. Is there a way to get the best of both worlds?

I know I’ve seen threads in Thunderbird where the subject line changes completely mid-thread. Is that because they have In-Reply-To headers? Comment notifications aren’t replies to each other, but maybe that’s a way to force threading?

Comments

[gravatar]
Have you considered taking sender out of the subject and changing the From header of the emails? Then your inbox is still "rich" because you can see who wrote the comment, and plus threading.

Since it's only for display, you don't even need to change the routing part of the email address (foo@bar) if that matters, just the display part: "Blog Daemon" <blogdaemon@nedbatchelder.com> becomes "Richard Schwartz" <blogdaemon@nedbatchelder.com>.
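A small sketch of that display-name swap, using `parseaddr` and `formataddr` from Python's standard `email.utils`. The helper name `with_display_name` is made up for illustration.

```python
from email.utils import formataddr, parseaddr


def with_display_name(from_header, commenter_name):
    """Keep the address part of a From value but swap in the commenter's name."""
    _old_name, address = parseaddr(from_header)
    return formataddr((commenter_name, address))


new_from = with_display_name(
    '"Blog Daemon" <blogdaemon@nedbatchelder.com>', "Richard Schwartz"
)
# new_from is 'Richard Schwartz <blogdaemon@nedbatchelder.com>'
```

`formataddr` only adds quotes around the name when it contains special characters, so the result may be unquoted even though the original was quoted.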
[gravatar]
There are two headers that determine threading -- References: and In-Reply-To:. The latter simply references the Message-ID(s) of the message(s) that it's directly replying to, while the former contains a (possibly sparse) list of the Message-IDs of all the "parent" messages in the thread (i.e., the References: header from the message that it's In-Reply-To as well as the contents of its own In-Reply-To:). Not all MUAs, I think, use References, so perhaps adding In-Reply-To is the key. Or maybe you're just testing with a crappy MUA. :)
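As a sketch of how those two headers relate, the following Python function derives both values for a reply from its parent's headers. The function name and the dict return shape are illustrative choices, and it assumes a single parent message.

```python
def reply_headers(parent_message_id, parent_references=""):
    """Return In-Reply-To and References values for a reply to one parent.

    References is the parent's own References list (if any) followed by
    the parent's Message-ID; In-Reply-To is just the parent's Message-ID.
    """
    chain = parent_references.split() + [parent_message_id]
    return {
        "In-Reply-To": parent_message_id,
        "References": " ".join(chain),
    }
```

For a reply to the first message in a thread, the parent has no References, so both headers end up holding the same single id.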

Also, some MUAs can change their threading heuristics. Mutt, for example, uses the standard threading headers, but can also try to gather threads together by looking at Subject: lines. That's only used as a last resort, though. Not sure about other MUAs.
[gravatar]
Malcolm Tredinnick 12:52 AM on 3 Sep 2009
Ned, there's a fairly good summary of the relevant header history at http://www.jwz.org/doc/threading.html, along with the algorithm jwz used in some of his tools and that has subsequently been adopted in a few other applications. Because those headers have been historically so poorly set, particularly by retransmit services (e.g. server -> Blackberry), it's resulted in the need for the heuristic approaches, which is why a lot of programs do a good approximation even when you're only close to correct.
[gravatar]
A description of Thunderbird's threading algorithms and preferences affecting it can be found here:
https://wiki.mozilla.org/MailNews:Message_Threading

The bottom line is that Thunderbird 3.0 (currently nearing beta 4 release) honors the References/In-Reply-To headers a lot. This includes both single-folder views and cross-folder views. (Previously, threading was not possible in cross-folder saved searches.)

Thunderbird 2.0 was crazy for subject threading, although it was capable of doing references threading acceptably if so configured. (But not as well as 3.0 is/will be.)
[gravatar]
I've poked around this issue a couple times, and the whole References/In-Reply-To scheme has always struck me as rather problematic.

For example, I can't count the number of friends/family I know who use their Inbox as an address book... ask them to invite "Fred" to lunch and the first thing they do is find an email Fred sent to them - any one, it doesn't matter - and hit "Reply". They then delete the subject and body, and type their "new" message. Never mind that it likely includes headers that link it to Fred's original message.

If you're writing an email client like, oh, Gmail, what do you do in this case? Should you honor the headers in the email, even though the subject and body have *nothing* to do with the original Fred message? Talk about confusing! You'll have millions of users bitching at you for burying messages in threads that, to the user, are completely unrelated.

Given this imperfect user behavior, wouldn't you be better off ignoring the mail headers or, at most, using them as "hints" and instead look to the Subject line as the definitive measure of threading?

Anyhow... Cory's suggestion is the one that first came to mind for me. It's also how other forum products, like Google Groups and Google Code, maintain threading with their email comment notifications.
[gravatar]
I don't like the idea of your putting the poster's name in the From header. And that's not just because you used my name in your example subject! ;-) It's because after many months have gone by, I've probably forgotten the title of the thread I replied to, so if I get a notification message about a late comment I want the first thing I see when the message arrives to indicate that the message was sent by your blog.

But if you do spoof the From header, be sure to also set the Sender header to your own address or to a stand-in address for your blog. See RFC 5322, section 3.6.2: http://tools.ietf.org/html/rfc5322
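A hedged sketch of that From/Sender split with Python's standard `email` package. It assumes the notification carries the commenter's own address in From, which the comment does not spell out. The function and parameter names are hypothetical, and the blog address is the one used in the earlier example.

```python
from email.message import EmailMessage
from email.utils import formataddr

BLOG_ADDRESS = "blogdaemon@nedbatchelder.com"  # address from the earlier example


def notification_from_commenter(commenter_name, commenter_address, subject, body):
    """Build a notification whose From names the commenter and whose Sender is the blog."""
    msg = EmailMessage()
    # From identifies the author of the content (the commenter).
    msg["From"] = formataddr((commenter_name, commenter_address))
    # Sender identifies the mailbox that actually transmitted the message.
    msg["Sender"] = formataddr(("Blog Daemon", BLOG_ADDRESS))
    msg["Subject"] = subject
    msg.set_content(body)
    return msg
```

If only the display name is changed and the From address is still the blog's own mailbox, the author and the transmitter are the same mailbox and a separate Sender header adds little.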
