About this document

Goals

The following description should give you a high-level impression of what the mozITXTToHTMLConv routines do. For simplicity, it is not 100% accurate.

Audience

This document is aimed[1] at programmers who use the routines in their applications, and at power users who use these applications.

Goals

The code escapes "<", ">", "&" and optionally tries to transform some of the structure and formatting of the text into HTML tags.
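The escaping step can be sketched as follows. This is only an illustration in Python, not the converter's actual code; the function name escape_text is hypothetical, and the use of the standard entities &amp;lt;, &amp;gt; and &amp;amp; as replacements is an assumption.

```python
def escape_text(text: str) -> str:
    """Escape the three characters that are special in HTML text."""
    # "&" must be handled first; otherwise the "&" introduced by
    # "&lt;" and "&gt;" would be escaped a second time.
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;"))


print(escape_text("if a < b && b > c"))
# prints: if a &lt; b &amp;&amp; b &gt; c
```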

Failures should be minimized. A wrong recognition counts as a failure; not recognizing a structure or formatting does not.

Currently, the only users of the routines are Mailnews and AIM[1], but in the future, they will probably be used by the browser (for displaying text documents) as well.

Description

Apart from the escaping function, there are 3 parts, which can be turned on individually by the caller.

URL recognition

This part does not alter the text other than inserting HTML tags.

The converter itself only guesses the bounds of the URL, in a generic way. The validity is then tested with functions of the network library. That way, (nearly) no special-casing for certain URL types was needed, and exactly those URL types that Mozilla can process are recognized.

If all checks pass, an HTML a tag with a class is inserted. The class is mode-specific and described below. For the modes RFC1738 and RFC2396E, the delimiters are included in the a tag to give some visual feedback about the mode.

The code knows 4 modes. Subsets of them are tested in a predefined order. The first successful mode wins. In other words: there are several fallbacks.
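The fallback order can be pictured with the following Python sketch. It is a rough illustration under stated assumptions, not the converter's implementation: the regular expressions are much-simplified stand-ins for the four modes described below (the quotation-mark form of RFC2396E is left out), the name first_matching_mode is hypothetical, and the validity check against the network library is omitted entirely.

```python
import re

# Hypothetical, much-simplified recognizers, listed in the order
# in which the modes are tried.
MODES = [
    ("moz-txt-link-rfc1738",
     re.compile(r"<URL:\s*(\S.*?)\s*>")),
    ("moz-txt-link-rfc2396E",
     re.compile(r"<\s*([^<>\s][^<>]*?)\s*>")),
    ("moz-txt-link-freetext",
     re.compile(r"([A-Za-z][A-Za-z0-9+.\-]*:[^\s()]+)")),
    ("moz-txt-link-abbreviated",
     re.compile(r"((?:www\.|ftp\.)[^\s()]+|[^\s()@]+@[^\s()@]+)")),
]


def first_matching_mode(text):
    """Return (css_class, url_text) for the first mode that matches, else None."""
    for css_class, pattern in MODES:
        match = pattern.search(text)
        if match:
            return css_class, match.group(1)
    return None


print(first_matching_mode("<URL:http://www.mozilla.org>"))
print(first_matching_mode("visit www.mozilla.org"))
```

Note that "<URL:http://www.mozilla.org>" would also satisfy the simplified RFC2396E pattern; it is reported as RFC1738 only because that mode is tried first.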

Note for users: When creating text, usage of the angle bracket version of RFC2396 is recommended, i.e. <user@host> for email addresses and <scheme:schemeSpecificPart> for all other URI types, e.g. <ben@example.net> and <http://www.mozilla.org>.

The modes are (in this order):

1. RFC1738

RFC1738, APPENDIX compliant, i.e. "<URL:url>".

Spaces are stripped. Punctuation characters stay intact.

class=moz-txt-link-rfc1738

Examples:

2. RFC2396E

RFC2396, APPENDIX E allows angle brackets (without "URL:") or quotation marks around URIs.

Spaces are stripped. Punctuation characters stay intact.

Also allow email addresses without scheme, i.e. without heading "mailto:".

class=moz-txt-link-rfc2396E

Examples:

3. Freetext

The scheme (at the start) must match the regexp "[A-Za-z][A-Za-z0-9\-\.\+]*:" (RFC2396, Section 3.1).

Spaces, parentheses and closing brackets end the URL. Punctuation characters and "-" at the end are stripped off.

class=moz-txt-link-freetext

Examples:
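As a rough illustration of the freetext rules (not the converter's code), the following Python sketch guesses the bounds of a URL starting at a given position. The function name freetext_url is hypothetical, and which characters count as "closing brackets" (">", "]" and "}" here) is an assumption of this sketch.

```python
import re

SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9\-.+]*:")  # RFC 2396, Section 3.1
PUNCTUATION = ",;.!?"  # as listed in the glossary


def freetext_url(text, start):
    """Guess the bounds of a URL beginning at text[start]; return it or None."""
    if not SCHEME.match(text, start):
        return None
    end = start
    # Whitespace, parentheses and closing brackets end the URL
    # (the exact set of closing brackets is assumed here).
    while end < len(text) and not text[end].isspace() and text[end] not in "()>]}":
        end += 1
    # Trailing punctuation characters and "-" are stripped off.
    return text[start:end].rstrip(PUNCTUATION + "-")


sentence = "Go to http://www.mozilla.org/, please."
print(freetext_url(sentence, sentence.index("http")))
# prints: http://www.mozilla.org/
```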

4. Abbreviated

Similar to freetext mode, but without scheme.

Only (abbreviated) URLs starting with "www." or "ftp." or including an "@" are recognized. They are turned into "http", "ftp" and "mailto" type URLs.

class=moz-txt-link-abbreviated

Examples:
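The mapping from abbreviated forms to full URLs might look like the Python sketch below. It is only an illustration: the function name expand_abbreviated is hypothetical, and the exact prefixes ("http://", "ftp://", "mailto:") are assumed from the URL types named above.

```python
def expand_abbreviated(candidate):
    """Turn an abbreviated URL into a full one, or return None."""
    if candidate.startswith("www."):
        return "http://" + candidate
    if candidate.startswith("ftp."):
        return "ftp://" + candidate
    if "@" in candidate:
        return "mailto:" + candidate
    # Anything else is not recognized in this mode.
    return None


print(expand_abbreviated("www.mozilla.org"))   # prints: http://www.mozilla.org
print(expand_abbreviated("ben@example.net"))   # prints: mailto:ben@example.net
```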

Glyph conversion

This part alters the text by replacing parts of it.

Currently, the substitutions are hard-coded, but the plan is to make them user-configurable at some point.

Original                       Replacement
                               HTML source                                                                                                  Meaning                Example
:-) [2]                        <img src="chrome://messenger/skin/smile.gif" alt=":-)" class=moz-txt-smily height=17 width=17 align=middle>  Smiling smily          :-)
:) [2]                         <img src="chrome://messenger/skin/smile.gif" alt=":)" class=moz-txt-smily height=17 width=17 align=middle>   Smiling smily          :)
:-( [2]                        <img src="chrome://messenger/skin/frown.gif" alt=":-(" class=moz-txt-smily height=17 width=17 align=middle>  Frowning smily         :-(
:( [2]                         <img src="chrome://messenger/skin/frown.gif" alt=":(" class=moz-txt-smily height=17 width=17 align=middle>   Frowning smily         :(
;-) [2]                        <img src="chrome://messenger/skin/wink.gif" alt=";-)" class=moz-txt-smily height=17 width=17 align=middle>   Winking smily          ;-)
;) [2]                         <img src="chrome://messenger/skin/wink.gif" alt=";)" class=moz-txt-smily height=17 width=17 align=middle>    Winking smily          ;)
;-P [2]                        <img src="chrome://messenger/skin/sick.gif" alt=";-P" class=moz-txt-smily height=17 width=17 align=middle>   Sick smily             ;-P
space+/-                       &plusmn;                                                                                                     Plus/minus sign        ±
alphanumeric^digits delimiter  <sup class=moz-txt-sup>digits</sup>                                                                          Superscript of powers  10⁶, a²
)^digits delimiter             (as above)                                                                                                   (as above)             (a+b)²

[2] For smilies, the following criterion has to be met: "space smily [ punctuation ] space", where "[" ... "]" here means "optional".

Note for users: For best results, put a space between smilies and closing brackets. Otherwise, the code can't distinguish a smily plus a closing bracket from a smily with a large smile, and no substitution is done. Don't omit the closing bracket after a smily (e.g. that way :-), or your parentheses will look unbalanced (after all, they are) (e.g. that way :-).
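The smily criterion can be sketched in Python as below. This is an illustration under stated assumptions, not the converter's code: the names SMILIES and replace_smilies are hypothetical, the start and end of the text are treated like spaces, and the image markup simply follows the pattern of the replacement table.

```python
import re

# Smily text -> image name, following the replacement table.
SMILIES = {":-)": "smile", ":)": "smile", ":-(": "frown", ":(": "frown",
           ";-)": "wink", ";)": "wink", ";-P": "sick"}

# "space smily [ punctuation ] space"; start and end of the text are
# treated like spaces here, which is an assumption of this sketch.
SMILY_RE = re.compile(
    r"(?<!\S)("
    + "|".join(re.escape(s) for s in SMILIES)
    + r")(?=[,;.!?]?(?:\s|$))"
)


def replace_smilies(text):
    """Replace every smily that meets the criterion with an img tag."""
    def to_img(match):
        smily = match.group(1)
        return ('<img src="chrome://messenger/skin/%s.gif" alt="%s" '
                'class=moz-txt-smily height=17 width=17 align=middle>'
                % (SMILIES[smily], smily))
    return SMILY_RE.sub(to_img, text)


print(replace_smilies("Nice :-) !"))
print(replace_smilies("(like this :-))"))  # left unchanged
```

In the second call the smily is directly followed by a closing bracket, so the criterion is not met and no substitution is done, which matches the note above.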

Structured phrases recognition

This part does not alter the text other than inserting HTML tags.

It looks for "plain text tags": symbols that surround words or phrases to format them or to show their meaning. Their semantics are much like those of HTML, except that the interpretation is usually up to the reader (this code tries to change that). For more information about structured phrases in plain text, please read the Jargon File, Hacker Writing Style.

In general, if "delimiter plainTextTag alpha anyContent alpha plainTextTag delimiter" is found, htmlTag is inserted as follows: "delimiter htmlTag.opening plainTextTag alpha anyContent alpha plainTextTag htmlTag.closing delimiter".

E.g. "This is *important*!" is transformed to "This is <strong class=moz-txt-star>*important*</strong>!" and would be displayed like the following:

This is *important*!

plainTextTag  htmlTag
              Tag       Attributes                  Example
*             strong    class=moz-txt-star          *strong*
/             em        class=moz-txt-slash         /emphasis/
_             span [2]  class=moz-txt-underscore    _underline_
|             code      class=moz-txt-verticalline  |code|

[2] The u element is deprecated, so span is used instead. Mozilla's stylesheet defines underlining for that element and class.
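The general rule can be approximated with a single regular expression, as in the Python sketch below. It is a simplified illustration, not the converter's implementation: the names TAGS and mark_structured_phrases are hypothetical, "delimiter" is taken to mean any non-alphanumeric character or the start or end of the text, and "alpha" is limited to ASCII letters.

```python
import re

# plainTextTag -> (htmlTag, class), taken from the table above.
TAGS = {"*": ("strong", "moz-txt-star"),
        "/": ("em", "moz-txt-slash"),
        "_": ("span", "moz-txt-underscore"),
        "|": ("code", "moz-txt-verticalline")}

# delimiter plainTextTag alpha anyContent alpha plainTextTag delimiter
# The backreference \1 requires the same tag on both sides.
PHRASE_RE = re.compile(
    r"(?<![A-Za-z0-9])([*/_|])([A-Za-z].*?[A-Za-z])\1(?![A-Za-z0-9])"
)


def mark_structured_phrases(text):
    """Wrap each recognized phrase in its HTML tag, keeping the plain text tags."""
    def wrap(match):
        html_tag, css_class = TAGS[match.group(1)]
        return "<%s class=%s>%s</%s>" % (html_tag, css_class,
                                         match.group(0), html_tag)
    return PHRASE_RE.sub(wrap, text)


print(mark_structured_phrases("This is *important*!"))
# prints: This is <strong class=moz-txt-star>*important*</strong>!
```

The plain text tags stay in the output, so the text is not altered apart from the inserted HTML tags.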

Known failures

URL recognition

Glyph conversion

Structured phrases recognition

Glossary

space
Spaces, line breaks or tabs.
delimiter
Currently, mostly a non-alphanumeric character.
punctuation
One of the following characters: ",", ";", ".", "!", "?"
Scheme
The part before the first ":" in a URI, e.g. "http".
Structured phrases
This term is derived from the corresponding HTML 4.0 definition captions

[1] AIM is a trademark of AOL. The concept of human-human communication is patented by AOL.