
Plain-Text vs HTML E-Mail

Sizes and Errors: 2025 Analysis

Copyright © 2025 by David E. Ross

Definitions

bloat: the increase in the size of an HTML-formatted message to convey the same textual content as a plain-text message
bloat factor: a measure of bloat, computed by dividing the size of the HTML-formatted message by the size of the plain-text message that has the same content. If a 3,000 byte HTML-formatted message has the same content as a 600 byte plain-text message, the bloat factor is 3,000 ÷ 600 = 5.0.
KB: kilobyte, 1,000 bytes

History

In the fall of 2008, I examined 20 HTML-formatted E-mail messages. From this study, I concluded that the average bloat factor was 3.4 and that HTML-formatted messages contained an average of 7.4 HTML errors per KB.
Early in 2010, I decided to repeat this study for several reasons:
- I had not saved the messages from the earlier study, so I could not re-examine them.
- I could not recall whether I had made sure that no two messages came from the same sender.
- I had computed the average bloat factor by dividing the total size of all HTML-formatted messages by the total size of the equivalent plain-text messages. I was curious what the result would have been if I had instead averaged the individual bloat factors of the messages.
- I suspected that, since the last study, improvements in E-mail clients might have resulted in less bloat and fewer HTML errors in HTML-formatted messages.
I again collected 20 HTML-formatted E-mail messages. This time, the average bloat factor was 4.6 (worse than in 2008), and HTML-formatted messages contained an average of 4.6 HTML errors per KB (fewer than in 2008).
After two more years during which E-mail clients might have further evolved, I repeated the study with a fresh set of 20 HTML-formatted messages collected during January 2012. Unlike the 2010 study, however, I did not collect data on which E-mail clients were used. In 2010, half the messages failed to indicate any client; in 2012, that grew to three-fourths of the messages, preventing any use of client identifications. The average bloat for HTML-formatted messages was 4.0 times the size of equivalent plain-text messages, with 5.0 HTML errors per KB.
I repeated the study with another 20 E-mail messages collected in June of 2015 to see what changes might have occurred in HTML-formatted E-mail in the prior three years. The average bloat factor for HTML-formatted messages increased to 10.0 times the size of the equivalent plain-text content, with 6.1 HTML errors per KB.
I repeated the study with another 20 E-mail messages collected in June of 2019 to see what changes might have occurred in HTML-formatted E-mail in the prior four years. The average bloat factor for HTML-formatted messages increased to 16.0 times the size of the equivalent plain-text content, with 7.3 HTML errors per KB.
In September 2021, I repeated the study, again with 20 HTML-formatted and multi-part (plain-text and HTML-formatted combined) messages, each from a different sender. The average bloat factor for HTML-formatted messages was 15.5 times the size of the equivalent plain-text content (slightly better than in 2019), with 10.0 HTML errors per KB (significantly worse than in 2019).

In 2025, I noticed a number of newsgroup messages about HTML-formatting of E-mail. I decided to see whether there had been any improvement by collecting 20 HTML-formatted E-mail messages in late September and early October 2025, many of which were spam.

Findings

The average bloat factor for HTML-formatted messages was 18.4 times the size of the equivalent plain-text content. The range was from 2.7 to 137.5. The largest HTML-formatted message had a bloat factor of 18.9. The smallest had 6.9, which was not the smallest bloat factor. The bloat factors of the messages were not correlated with their sizes.
HTML-formatted messages contained an average of 7.3 HTML syntax errors per KB of HTML markup. The range was from 0.1 to 35.9 errors per KB. The largest HTML-formatted message had a total of 11 HTML errors (0.1 errors per KB). The smallest had 17 (5.0 errors per KB). As with bloat, there was no correlation between message size and the number of HTML errors.
None of the 20 sampled messages were free of HTML errors. I attribute the errors to the applications that generated the HTML, not to their users.
There were no significant correlations between bloat factors and the proportion of HTML errors.
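One way to check a claim like this is to compute a Pearson correlation coefficient from the per-message values in the Raw Data table. The Python sketch below does so for bloat factor versus errors per KB. It is an illustration only: the study does not say which correlation measure was used, and `pearson_r` is a hypothetical helper name introduced here.

```python
import math

# Per-message values copied from the Raw Data table (messages 1 to 20).
bloat = [137.5, 48.9, 10.9, 13.0, 14.8, 19.4, 16.2, 2.9, 16.9, 18.5,
         24.7, 52.9, 11.3, 14.7, 6.9, 3.3, 18.9, 20.0, 2.7, 32.5]
errors_per_kb = [4.1, 1.0, 3.3, 35.9, 10.6, 1.6, 1.8, 19.8, 5.8, 2.7,
                 4.6, 0.9, 4.2, 0.6, 5.0, 20.9, 0.1, 2.5, 16.3, 4.4]


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(bloat, errors_per_kb)
# For n = 20, the commonly tabulated 5% two-tailed critical value of |r|
# is about 0.444; a smaller |r| is usually read as "not significant".
print(f"Pearson r = {r:.2f}, significant at 5%: {abs(r) > 0.444}")
```

The same function could be applied to HTML size versus bloat factor, or to HTML size versus error count, by swapping in the corresponding columns.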
While doing this study, I encountered messages that were entirely images (e.g., PNG, JPEG). Generally, these messages cannot be "read" by audio browsers used by persons with visual handicaps. If such messages are sent by businesses or other organizations, they might thus be considered violations of the Americans with Disabilities Act (ADA).
Slightly over half the studied messages were two-part. This indicates an acknowledgement that many recipients prefer plain-text E-mail messages or are not equipped to view HTML-formatted messages.

Conclusions

HTML-formatted messages are far larger than plain-text messages conveying the same information. Improvements in E-mail clients have not resulted in reductions in bloat. Indeed, bloat is now worse than in 2021. Bloat means longer download time for those few who still use dial-up connections to the Internet and slower anti-virus checking for everyone. Bloat also requires more disc space to store messages for both users and mail servers; this can impact businesses that are required by law to archive E-mail messages. Mass-mailings of HTML-formatted messages may cause Internet congestion by consuming excessive bandwidth.
Improvements in E-mail clients since 2021 resulted in a reduction in HTML errors (from 10.0 to 7.3 errors per KB), but errors remain common. Such invalid HTML may cause garbled quoting of a message when replying to it or forwarding it using an E-mail client different from the client that originally sent the message. It can also prevent processing a message through an audio application for a blind user; for a business or government agency, this might be a violation of the Americans with Disabilities Act (ADA).

Methodology

I collected 20 messages that were either HTML-formatted or were two-part (containing both plain text and HTML), making sure that no two messages were from the same source. I stored each message in the four forms listed below, excluding any attachments (which, for HTML-formatted messages, means excluding any images or background). The first two of these were used in my analysis. The last two were saved in case I needed further information.

- I stored the readable content from the first line to the end of the message into a plain-text file.
- I stored the raw message (the source file), excluding any <x-html> ... </x-html> tags and any <!Document> tags, into an HTML file. I excluded those tags because not all HTML-formatted E-mail messages contain them, and I did not want comparison of the sizes of HTML markup to be skewed by such tags. This is especially important with <!Document> tags, which often have very extensive sets of attributes.
- I saved the source file, which in some cases reflected a 2-part message (a message that included both plain-text and HTML formatting). This included the entire header section.
- I stored the source file without the header section.
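For two-part messages, a rough automated counterpart to the first two stored files can be sketched with Python's standard `email` package. This is only an approximation under stated assumptions: the study's files appear to have been prepared by hand, the sketch does not strip <x-html> or <!Document> tags, and it treats the text/plain part as the plain-text equivalent. `SAMPLE` and `part_sizes` are hypothetical names, and the message itself is made up for illustration.

```python
from email import message_from_string

# A made-up two-part (multipart/alternative) message for illustration.
SAMPLE = """\
From: sender@example.com
To: recipient@example.com
Subject: Hypothetical two-part message
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset="us-ascii"

Hello, world.
--BOUNDARY
Content-Type: text/html; charset="us-ascii"

<html><body><p style="font-family: Arial">Hello, <b>world</b>.</p></body></html>
--BOUNDARY--
"""


def part_sizes(raw):
    """Return the decoded size in bytes of the text/plain and text/html parts."""
    msg = message_from_string(raw)
    sizes = {}
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype in ("text/plain", "text/html"):
            # decode=True undoes any base64 or quoted-printable encoding.
            payload = part.get_payload(decode=True) or b""
            sizes[ctype] = sizes.get(ctype, 0) + len(payload)
    return sizes


sizes = part_sizes(SAMPLE)
bloat_factor = sizes["text/html"] / sizes["text/plain"]
print(sizes, f"bloat factor = {bloat_factor:.1f}")
```

Attachments and the header section are ignored here, which matches the exclusions described above. A message that is HTML-only has no text/plain part, so its readable content would still need to be extracted some other way.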

I recorded the sizes of the first two files in a spreadsheet, the first as Plain and the second as HTML. Dividing the total of the HTML sizes by the total of the Plain sizes gave me the average bloat factor. Because bloat is meaningful in the gross context over many messages (in terms of bandwidth impacts, disc space occupied, etc.), this average is based on the total size of all HTML-formatted messages compared with the total size of the equivalent plain-text content. If I had averaged the individual bloat factors of each message, the result would have been 24.4, greater than the 18.4 reported under "Findings".
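The two averaging methods described above can be compared directly from the Plain Size and HTML Size columns of the Raw Data table. The following Python sketch is a recomputation for illustration, not the spreadsheet used in the study. The variable names are chosen here.

```python
# Sizes in bytes, copied from the Raw Data table (messages 1 to 20).
plain = [759, 1271, 2535, 1745, 4096, 2028, 2236, 3373, 1254, 3317,
         3671, 2032, 2793, 6226, 496, 4228, 9821, 3670, 2486, 1212]
html = [104395, 62190, 27610, 22651, 60805, 39396, 36328, 9739, 21252, 61405,
        90697, 107591, 31423, 91218, 3430, 13897, 185343, 73496, 6827, 39417]

# Method 1: total HTML size divided by total plain-text size.
# Large messages weigh more heavily in this figure.
aggregate = sum(html) / sum(plain)

# Method 2: average of the per-message bloat factors.
# Every message counts equally, so one extreme message can dominate.
individual = [h / p for h, p in zip(html, plain)]
mean_individual = sum(individual) / len(individual)

print(f"aggregate bloat factor:           {aggregate:.1f}")
print(f"mean of individual bloat factors: {mean_individual:.1f}")
print(f"range: {min(individual):.1f} to {max(individual):.1f}")
```

The gap between the two figures comes mostly from message 1, whose very small plain-text size gives it a bloat factor far above the rest.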

Note that the Plain files might not contain all the content intended for the message. This would be caused by placing text within images, which would make the message incomplete for a blind person using an audio browser. For commercial messages, this might be a violation of the Americans with Disabilities Act.

For HTML errors, I used the W3C Markup Validation Service. I submitted the content of each HTML file and recorded the number of errors on the same spreadsheet. Dividing the number of errors by the HTML size and multiplying by 1,000 gave me the proportion of HTML errors per KB of HTML for each message. Because the impact of HTML errors falls on individual messages, I then took the average of the individual proportions. If I had instead used the total number of errors versus the total HTML sizes, the result would have been 3.8 HTML errors per KB of message size, fewer than the 7.3 reported under "Findings".
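The error-rate arithmetic can be reproduced from the HTML Size and HTML Errors columns of the Raw Data table. The Python sketch below takes the error counts as given and does not call the validator. It is a recomputation for illustration, and the variable names are chosen here.

```python
# Copied from the Raw Data table (messages 1 to 20).
html = [104395, 62190, 27610, 22651, 60805, 39396, 36328, 9739, 21252, 61405,
        90697, 107591, 31423, 91218, 3430, 13897, 185343, 73496, 6827, 39417]
errors = [431, 63, 90, 814, 647, 62, 65, 193, 123, 164,
          415, 94, 131, 54, 17, 290, 11, 185, 111, 172]

# Errors per KB for each message, with 1 KB = 1,000 bytes as defined earlier.
per_message = [e / h * 1000 for e, h in zip(errors, html)]

# Average of the individual rates: each message counts equally.
mean_of_rates = sum(per_message) / len(per_message)

# Pooled rate: total errors over total HTML size; large messages dominate.
pooled = sum(errors) / sum(html) * 1000

print(f"mean of per-message rates: {mean_of_rates:.1f} errors per KB")
print(f"pooled rate:               {pooled:.1f} errors per KB")
```

The pooled rate is lower because the largest messages in this sample happen to have few errors relative to their size.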

NOTE WELL: In the table below, two-part messages are indicated. The total size of a two-part message is equal to the sum of the plain-text and HTML-formatted sizes plus about 4 to 6 KB for the header section. Thus, two-part messages might appear to increase bloat. In this latest study, the bloat calculation considered only the size of the HTML-formatted part; the size of the plain-text portion of two-part messages was excluded. Nevertheless, two-part messages require more bandwidth to send and receive and more disc space to store than messages that are only HTML-formatted.

Raw Data
(sizes in bytes)
Msg #	Plain Size	HTML Size	Bloat Factor	HTML Errors	Errors per KB	2-Part Message
1	759	104,395	137.5	431	4.1	X
2	1,271	62,190	48.9	63	1.0
3	2,535	27,610	10.9	90	3.3
4	1,745	22,651	13.0	814	35.9
5	4,096	60,805	14.8	647	10.6	X
6	2,028	39,396	19.4	62	1.6
7	2,236	36,328	16.2	65	1.8	X
8	3,373	9,739	2.9	193	19.8	X
9	1,254	21,252	16.9	123	5.8	X
10	3,317	61,405	18.5	164	2.7
11	3,671	90,697	24.7	415	4.6	X
12	2,032	107,591	52.9	94	0.9
13	2,793	31,423	11.3	131	4.2	X
14	6,226	91,218	14.7	54	0.6	X
15	496	3,430	6.9	17	5.0
16	4,228	13,897	3.3	290	20.9
17	9,821	185,343	18.9	11	0.1
18	3,670	73,496	20.0	185	2.5	X
19	2,486	6,827	2.7	111	16.3	X
20	1,212	39,417	32.5	172	4.4	X

New study 7 October 2025