Plain-Text vs HTML E-Mail

Sizes and Errors: 2015 Analysis

Copyright © 2015 by David E. Ross

Definitions

bloat
the increase in the size of an HTML-formatted message to convey the same textual content as a plain-text message
bloat factor
a measure of bloat, computed by dividing the size of the HTML-formatted message by the size of the plain-text message that has the same content. If a 3,000 byte HTML-formatted message has the same content as a 600 byte plain-text message, the bloat factor is 3,000 ÷ 600 = 5.0.
KB
kilobyte, 1,000 bytes

History

In the fall of 2008, I examined 20 HTML-formatted E-mail messages. From this study, I concluded that the average bloat factor was 3.4 and that HTML-formatted messages contained an average of 9.1 HTML errors per KB of file size.

Early in 2010, I decided to repeat this study for several reasons.

I again collected 20 HTML-formatted E-mail messages. This time, the average bloat factor was 4.6 (worse than in 2008), and HTML-formatted messages contained an average of 4.6 HTML errors per KB of file size (only half the ratio in 2008).

After two more years during which E-mail clients might have further evolved, I repeated the 2010 study with a fresh set of 20 HTML-formatted messages collected during January 2012. Unlike the 2010 study, however, I did not collect data on which E-mail clients were used. In 2010, half the messages failed to indicate any client; in 2012, that grew to three-fourths of the messages, which prevented any use of client identifications.

E-mail clients and the HTML specifications have both evolved since the 2012 study. Therefore, I decided in 2015 to again repeat the study.

Findings

Conclusions


Methodology

I stored each message twice. I excluded any attachments (which, for HTML-formatted messages, means excluding any images or background), links to attachments, the section for marking a message as spam (added by my ISP's mail server), and the header section of each message. First, I stored the readable content from the first line to the end of the message into a plain-text file. Then, I stored the raw message from the <x-html> tag to the </x-html> tag (excluding those tags) into an HTML file. Finally, I saved the source file, which in some cases reflected a 2-part message — a message that included both plain-text and HTML formatting.
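A minimal sketch of this kind of measurement, assuming the message is available as raw MIME text. It reads the text/plain and text/html parts with Python's standard email package instead of cutting the message at the <x-html> tags, so it only approximates the procedure described above. The function name part_sizes and the sample message are illustrative assumptions, not part of the study.

```python
from email import message_from_string

# Hypothetical 2-part message (plain text plus HTML), used only for illustration.
SAMPLE_MESSAGE = """\
From: sender@example.com
To: reader@example.com
Subject: Hello
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="BOUND"

--BOUND
Content-Type: text/plain; charset="us-ascii"

Hello, world.
--BOUND
Content-Type: text/html; charset="us-ascii"

<html><body><p>Hello, <b>world</b>.</p></body></html>
--BOUND--
"""


def part_sizes(raw_message: str) -> dict:
    """Return the sizes in bytes of the plain-text and HTML bodies.

    Headers and attachments are ignored, mirroring the exclusions
    listed in the methodology.
    """
    msg = message_from_string(raw_message)
    sizes = {"plain": 0, "html": 0}
    for part in msg.walk():
        if part.is_multipart():
            continue  # container part; its children are visited separately
        if part.get_content_disposition() == "attachment":
            continue  # attachments are excluded from both sizes
        content_type = part.get_content_type()
        if content_type in ("text/plain", "text/html"):
            body = part.get_payload(decode=True) or b""
            sizes[content_type.split("/")[1]] += len(body)
    return sizes


if __name__ == "__main__":
    print(part_sizes(SAMPLE_MESSAGE))
```

For a message that has only an HTML part, the "plain" size comes back as 0, and the readable text would have to be saved by hand as described above.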

I recorded the size of each file in a spreadsheet, the first file as Plain and the second as HTML. Dividing the total of the HTML sizes by the total of the Plain sizes gave me the average bloat factor. Because bloat is meaningful in the gross context of many messages (in terms of bandwidth impacts, disk space occupied, etc.), this average is based on the total size of all messages compared with the total size of the equivalent plain-text content. If I had averaged the individual bloat factors of each message, the result would have been 10.9, greater than the 10.0 reported under "Findings".
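The difference between the two averaging methods can be sketched in Python. The three size pairs below are messages 5, 18, and 19 from the "Raw Data" table, chosen only to keep the example short; the function names are illustrative assumptions.

```python
# (plain size, HTML size) in bytes for messages 5, 18, and 19 of "Raw Data".
SAMPLE_SIZES = [(1767, 47171), (3754, 6343), (6272, 50234)]


def bloat_of_totals(sizes):
    """Total HTML bytes divided by total plain-text bytes."""
    total_plain = sum(plain for plain, _ in sizes)
    total_html = sum(html for _, html in sizes)
    return total_html / total_plain


def mean_of_bloats(sizes):
    """Arithmetic mean of the individual bloat factors."""
    return sum(html / plain for plain, html in sizes) / len(sizes)


if __name__ == "__main__":
    print(f"bloat of totals: {bloat_of_totals(SAMPLE_SIZES):.1f}")
    print(f"mean of bloats:  {mean_of_bloats(SAMPLE_SIZES):.1f}")
```

For this three-message subset the totals-based figure is about 8.8 and the mean of the individual factors is about 12.1. A small message with a very high bloat factor counts as much as a large one in the mean, but contributes little to the totals.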

Note that the Plain files might not contain all the content intended for the message. This would be caused by placing text within images, which would make the message incomplete for a blind person using an audio browser. For commercial messages, this would be a violation of the Americans with Disabilities Act.

All but two of the messages were 2-part (both plain-text and HTML-formatted). The overall sizes of the 2-part messages were not considered when computing bloat, which would have been even greater if they had been considered.

For HTML errors, I used the "Validate by Direct Input" capability of the W3C Markup Validation Service, into which I input the content of an HTML file. If the message's HTML included a <!DOCTYPE> declaration (which is required for Web pages), I selected the "Validate Full Document" option. Otherwise, I selected the "Validate HTML fragment" option, first specifying "HTML 4.01 Transitional" (which is the least restrictive HTML 4.01 syntax) and then repeating with "HTML5 (experimental)"; I took the smaller of the two error counts. I recorded the number of errors on the same spreadsheet. Dividing the number of errors by the HTML size and multiplying by 1,000 gave me the number of HTML errors per KB of HTML for each message. Because the impact of HTML errors falls on individual messages, I then took the average of the individual values. If I had instead used the total number of errors versus the total HTML sizes, the result would have been 13.4 HTML errors per KB of message size, significantly greater than the 6.0 reported under "Findings".
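The error-rate arithmetic can be sketched the same way. The rows below are messages 1, 17, and 18 from the "Raw Data" table. The two counts passed to fewest_errors in the usage lines are hypothetical, since the table records only the final count for each message; all function names are illustrative assumptions.

```python
# (HTML errors, HTML size in bytes) for messages 1, 17, and 18 of "Raw Data".
SAMPLE_ERRORS = [(443, 50299), (8, 8479), (2, 6343)]


def fewest_errors(html401_errors, html5_errors):
    """For a fragment validated twice, keep the smaller error count."""
    return min(html401_errors, html5_errors)


def errors_per_kb(errors, html_bytes):
    """Errors per KB of HTML, with 1 KB taken as 1,000 bytes."""
    return errors / html_bytes * 1000


def mean_of_rates(rows):
    """Average of the per-message error rates."""
    return sum(errors_per_kb(e, size) for e, size in rows) / len(rows)


def rate_of_totals(rows):
    """Total errors divided by total HTML size, per KB."""
    total_errors = sum(e for e, _ in rows)
    total_bytes = sum(size for _, size in rows)
    return errors_per_kb(total_errors, total_bytes)


if __name__ == "__main__":
    # Hypothetical counts from the two fragment validations of one message.
    print(fewest_errors(12, 7))
    print(f"mean of rates:  {mean_of_rates(SAMPLE_ERRORS):.1f}")
    print(f"rate of totals: {rate_of_totals(SAMPLE_ERRORS):.1f}")
```

For this subset the mean of the per-message rates is about 3.4 errors per KB, while the totals-based rate is about 7.0, because the largest message also carries most of the errors.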

Spam messages were those identified by my E-mail client (Thunderbird) as "junk".

Raw Data
(sizes in bytes)
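| Msg # | Plain Size | HTML Size | Bloat Factor | HTML Errors | Errors per KB | 2-Part Message | Spam |
|---|---|---|---|---|---|---|---|
| 1 | 3,949 | 50,299 | 12.7 | 443 | 8.8 | X | X |
| 2 | 1,661 | 24,450 | 14.7 | 54 | 2.2 | X | X |
| 3 | 12,648 | 84,507 | 6.7 | 523 | 6.2 | X | |
| 4 | 104 | 599 | 5.8 | 10 | 16.7 | X | |
| 5 | 1,767 | 47,171 | 26.7 | 605 | 12.8 | X | |
| 6 | 1,396 | 28,758 | 20.6 | 213 | 7.4 | | |
| 7 | 460 | 983 | 2.1 | 15 | 15.3 | X | |
| 8 | 1,125 | 1,536 | 1.4 | 30 | 19.5 | X | |
| 9 | 1,455 | 4,392 | 3.0 | 12 | 2.7 | X | X |
| 10 | 3,178 | 28,458 | 9.0 | 15 | 0.5 | X | |
| 11 | 1,353 | 60,106 | 44.4 | 52 | 0.9 | X | |
| 12 | 2,006 | 43,981 | 21.9 | 40 | 0.9 | | |
| 13 | 446 | 2,338 | 5.2 | 21 | 9.0 | X | |
| 14 | 363 | 2,009 | 5.5 | 17 | 8.5 | X | |
| 15 | 5,345 | 49,191 | 9.2 | 9.2 | <0.1 | X | |
| 16 | 2,749 | 16,750 | 6.1 | 7 | 0.4 | X | X |
| 17 | 1,887 | 8,479 | 4.5 | 8 | 0.9 | X | X |
| 18 | 3,754 | 6,343 | 1.7 | 2 | 0.3 | X | X |
| 19 | 6,272 | 50,234 | 8.0 | 429 | 8.5 | X | X |
| 20 | 19,183 | 201,735 | 10.5 | 213 | 1.1 | X | |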

Updated 11 November 2015

