About this document

Goals

The following description should give you a high-level impression of what the mozITXTToHTMLConv routines do. For the sake of simplicity, it is not 100% accurate.

Audience

This document is aimed¹ at programmers who use the routines in their applications, and at power users who use these applications.

Goals

The code escapes "<", ">", "&" and optionally tries to transform some of the structure and formatting of the text into HTML tags.
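The escaping step can be illustrated with a minimal Python sketch. This is not the converter's actual code; the function name `escape_text` is hypothetical and only shows the idea, including why "&" has to be handled first.

```python
def escape_text(text: str) -> str:
    """Escape the three characters named above so they are safe in HTML."""
    # "&" must be replaced first; otherwise the "&" of a freshly
    # inserted "&lt;" or "&gt;" would be escaped a second time.
    return (
        text.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
    )
```

For example, `escape_text("a < b & c > d")` returns `a &lt; b &amp; c &gt; d`.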

Failures should be minimized. A wrong recognition counts as a failure; not recognizing a structure or formatting does not.

Currently, the only users of the routines are Mailnews and AIM¹, but in the future they will probably be used by the browser as well (for displaying text documents).

Description

Apart from the escaping function, there are 3 parts, which can be turned on individually by the caller.

URL recognition

This part does not alter the text other than inserting HTML tags.

The converter determines only the (guessed) bounds of the URL, and it does so in a generic way. After that, the validity is tested with functions of the network library. That way, (nearly) no special-casing for particular URL types was needed, and exactly those URL types that Mozilla can process are recognized.

If all checks pass, an HTML a-tag with a class is inserted. The class is mode-specific and described below. For the modes RFC1738 and RFC2396E, the delimiters are included in the a-tag to give some visual feedback about which mode was recognized.

The code knows 4 modes. Subsets of them are tested in a predefined order. The first successful mode wins. In other words: there are several fallbacks.
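The "first successful mode wins" behaviour can be sketched as follows. This is a simplified illustration, not the real implementation: the recognizer functions, their regular expressions and the `recognize` helper are assumptions, only two of the four modes are shown, and the validity check against the network library is left out.

```python
import re
from typing import Callable, List, Optional, Tuple

def try_rfc1738(text: str) -> Optional[str]:
    # Hypothetical recognizer for "<URL:url>"; surrounding spaces are stripped.
    m = re.search(r"<URL:([^<>]*)>", text)
    return m.group(1).strip() if m else None

def try_rfc2396e(text: str) -> Optional[str]:
    # Hypothetical recognizer for "<scheme:schemeSpecificPart>".
    m = re.search(r"<([A-Za-z][A-Za-z0-9+.\-]*:[^<>]*)>", text)
    return m.group(1).strip() if m else None

# The modes are tried in this fixed order.
MODES: List[Tuple[str, Callable[[str], Optional[str]]]] = [
    ("moz-txt-link-rfc1738", try_rfc1738),
    ("moz-txt-link-rfc2396E", try_rfc2396e),
]

def recognize(text: str) -> Optional[Tuple[str, str]]:
    """Return (class, url) for the first mode that succeeds, else None."""
    for css_class, recognizer in MODES:
        url = recognizer(text)
        if url is not None:
            return css_class, url
    return None
```

The order matters in this sketch: "<URL:http://www.mozilla.org>" would also satisfy the second recognizer (with "URL" read as the scheme), but because the RFC1738 recognizer runs first, the result is the bare URL with the class moz-txt-link-rfc1738.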

Note for users: When creating text, usage of the angle bracket version of RFC2396 is recommended, i.e. <user@host> for email addresses and <scheme:schemeSpecificPart> for all other URI types, e.g. <ben@example.net> and <http://www.mozilla.org>.

The modes are (in this order):

1. RFC1738

Compliant with the Appendix of RFC1738, i.e. "<URL:url>".

Spaces are stripped. Punctuation characters stay intact.

class=moz-txt-link-rfc1738

Examples:

<URL:http://www.mozilla.org>

2. RFC2396E

RFC2396, Appendix E, allows angle brackets (without "URL:") or quotation marks around URIs.

Spaces are stripped. Punctuation characters stay intact.

Email addresses without a scheme, i.e. without a leading "mailto:", are also allowed.

class=moz-txt-link-rfc2396E

Examples:

3. Freetext

The scheme (at the start) must match the regexp "[A-Za-z][A-Za-z0-9\-\.\+]*:" (RFC2396, Section 3.1).

Spaces, parentheses and closing brackets end the URL. Punctuation characters and "-" at the end are stripped off.

class=moz-txt-link-freetext

Examples:

http://www.mozilla.org
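The freetext rules can be sketched like this. It is a rough illustration under stated assumptions, not the converter's code: the helper name `freetext_url` is hypothetical, the set of "closing brackets" is assumed to be ")", "]", ">" and "}", and the validity check with the network library is omitted.

```python
import re

# Scheme pattern as quoted above (RFC2396, Section 3.1).
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9\-.+]*:")
# Assumed URL terminators: space, parentheses and closing brackets.
URL_END = re.compile(r"[\s()\]>}]")
# Punctuation characters (see the Glossary) plus "-".
TRAILING = ",;.!?-"

def freetext_url(text: str, start: int):
    """Guess the bounds of a URL whose scheme begins at index `start`."""
    scheme = SCHEME.match(text, start)
    if scheme is None:
        return None
    end_match = URL_END.search(text, scheme.end())
    end = end_match.start() if end_match else len(text)
    url = text[start:end].rstrip(TRAILING)
    # A scheme with nothing after it is not treated as a URL here.
    return url if len(url) > len(scheme.group(0)) else None
```

With this sketch, the URL guessed in "Visit http://www.mozilla.org." (starting at index 6) is "http://www.mozilla.org", because the trailing full stop is stripped.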

4. Abbreviated

Similar to freetext mode, but without scheme.

Only (abbreviated) URLs starting with "www." or "ftp." or including an "@" are recognized. They are turned into "http", "ftp" and "mailto" type URLs.

class=moz-txt-link-abbreviated
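The mapping from abbreviated forms to full URLs might look like the following sketch. The function name `expand_abbreviated` and the order of the checks (prefixes before "@") are assumptions made for illustration.

```python
from typing import Optional

def expand_abbreviated(candidate: str) -> Optional[str]:
    """Turn an abbreviated URL into a full one, or return None."""
    if candidate.startswith("www."):
        return "http://" + candidate
    if candidate.startswith("ftp."):
        return "ftp://" + candidate
    if "@" in candidate:
        return "mailto:" + candidate
    # Anything else is not recognized in this mode.
    return None
```

Under these assumptions, "www.mozilla.org" becomes "http://www.mozilla.org" and "ben@example.net" becomes "mailto:ben@example.net", while "mozilla.org" is left alone.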

Examples:

Glyph conversion

This part alters the text by replacing parts of it.

Currently, the substitutions are hard-coded, but the plan is to make them user-configurable at some point.

Original	Replacement (HTML source)	Meaning	Example
`:-)`²	`<img src="chrome://messenger/skin/smile.gif" alt=":-)" class=moz-txt-smily height=17 width=17 align=middle>`	Smiling smiley
`:)`²	`<img src="chrome://messenger/skin/smile.gif" alt=":)" class=moz-txt-smily height=17 width=17 align=middle>`	Smiling smiley
`:-(`²	`<img src="chrome://messenger/skin/frown.gif" alt=":-(" class=moz-txt-smily height=17 width=17 align=middle>`	Frowning smiley
`:(`²	`<img src="chrome://messenger/skin/frown.gif" alt=":(" class=moz-txt-smily height=17 width=17 align=middle>`	Frowning smiley
`;-)`²	`<img src="chrome://messenger/skin/wink.gif" alt=";-)" class=moz-txt-smily height=17 width=17 align=middle>`	Winking smiley
`;)`²	`<img src="chrome://messenger/skin/wink.gif" alt=";)" class=moz-txt-smily height=17 width=17 align=middle>`	Winking smiley
`;-P`²	`<img src="chrome://messenger/skin/sick.gif" alt=";-P" class=moz-txt-smily height=17 width=17 align=middle>`	Sick smiley
`space+/-`	`±`	Plus/minus sign	±
`alphanumeric^digits delimiter`	`<sup class=moz-txt-sup>digits</sup>`	Superscript of powers	10⁶, a²
`)^digits delimiter`	`<sup class=moz-txt-sup>digits</sup>`	Superscript of powers	(a+b)²

² For smileys, the following criterion has to be met: "space smiley [ punctuation ] space", where "["..."]" stands for "optional" here.

Note for users: For best results, put a space between smileys and closing brackets. Otherwise, the code can't distinguish a smiley plus a closing bracket from a smiley with a large smile, and no substitution is done. Don't omit the closing bracket after a smiley (e.g. that way :-), or your parentheses will look unbalanced (after all, they are) (e.g. that way :-).
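The smiley criterion can be approximated with a regular expression. The sketch below is an illustration only; the names `SMILEYS`, `SMILEY_RE` and `find_smileys` are hypothetical, and it assumes that only explicit whitespace counts as "space" (the start or end of the text does not).

```python
import re

# Smiley text mapped to the image name used in the table above.
SMILEYS = {
    ":-)": "smile", ":)": "smile",
    ":-(": "frown", ":(": "frown",
    ";-)": "wink", ";)": "wink",
    ";-P": "sick",
}

# Longest alternatives first, so ":-)" is tried before ":)".
_alternatives = "|".join(
    re.escape(s) for s in sorted(SMILEYS, key=len, reverse=True)
)
# "space smiley [ punctuation ] space"
SMILEY_RE = re.compile(r"(?<=\s)(" + _alternatives + r")(?=[,;.!?]?\s)")

def find_smileys(text: str):
    """Return the smileys that satisfy the criterion, in order."""
    return [m.group(1) for m in SMILEY_RE.finditer(text)]
```

In this sketch, a smiley directly followed by a closing bracket, as in "(like this :-))", is not matched, which mirrors the note above. The C fragment "for (int i=0; i<10 ;)" followed by a line break is matched, since ";)" there is surrounded by spaces.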

Structured phrases recognition

This part does not alter the text other than inserting HTML tags.

It looks for "plain text tags", i.e. symbols that surround words or phrases to format them or to show their meaning. Their semantics are much like those of HTML tags, except that the interpretation is usually left to the reader (this code tries to change that). For more information about structured phrases in plain text, please read Jargon File, Hacker Writing Style.

In general, if "delimiter plainTextTag alpha anyContent alpha plainTextTag delimiter" is found, htmlTag is inserted as follows: "delimiter htmlTag.opening plainTextTag alpha anyContent alpha plainTextTag htmlTag.closing delimiter".

E.g. "This is *important*!" is transformed to "This is <strong class=moz-txt-star>*important*</strong>!" and would be displayed like the following:

This is *important*!
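The general rule can be sketched with one regular expression. This is a simplified illustration, not the actual implementation: the names `TAGS`, `PHRASE_RE` and `mark_structured` are hypothetical, "alpha" is taken to mean an ASCII letter, and "delimiter" is approximated as any non-alphanumeric character or the start or end of the text.

```python
import re

# plainTextTag mapped to (HTML element, class).
TAGS = {
    "*": ("strong", "moz-txt-star"),
    "/": ("em", "moz-txt-slash"),
    "_": ("span", "moz-txt-underscore"),
    "|": ("code", "moz-txt-verticalline"),
}

# delimiter plainTextTag alpha anyContent alpha plainTextTag delimiter
PHRASE_RE = re.compile(
    r"(?<![A-Za-z0-9])([*/_|])([A-Za-z](?:.*?[A-Za-z])?)\1(?![A-Za-z0-9])"
)

def _wrap(match):
    element, css_class = TAGS[match.group(1)]
    # The plain text tags are kept; only HTML tags are added around them.
    return f"<{element} class={css_class}>{match.group(0)}</{element}>"

def mark_structured(text: str) -> str:
    """Wrap recognized structured phrases in HTML tags."""
    return PHRASE_RE.sub(_wrap, text)
```

Applied to "This is *important*!", the sketch yields "This is <strong class=moz-txt-star>*important*</strong>!". It also wraps a path such as "/usr/bin/" in an em element, which shows how easily shell paths are misread as emphasis.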

plainTextTag	htmlTag tag	htmlTag attributes	Example
`*`	`strong`	`class=moz-txt-star`	*strong*
`/`	`em`	`class=moz-txt-slash`	/emphasis/
`_`	`span`²	`class=moz-txt-underscore`	_underline_
`|`	`code`	`class=moz-txt-verticalline`	|code|

² The u element is deprecated. Mozilla's stylesheet defines underlining for the span element with this class.

Known failures

URL recognition

Bug #19313
Msg-IDs

Glyph conversion

for (int i=0; i<10 ;)

Structured phrases recognition

Some shell commands
- rm *foo*
- /usr/bin/
- /home/user/.bashrc
C-comments /*foo*/

Glossary

space: Space, linebreaks or tabs.
delimiter: Currently, mostly any non-alphanumeric character.
punctuation: One of the following chars: ",", ";", ".", "!", "?"
Scheme: The part before the first ":" in a URI, e.g. "http".
Structured phrases: This term is derived from the corresponding HTML 4.0 definition captions

¹ AIM is a trademark of AOL. The concept of human-human communication is patented by AOL.