[twg-tds] font related files: sfd / CMap

Thomas Esser te at dbs.uni-hannover.de
Thu Jul 31 08:55:23 CEST 2003

Well, during the TUG 2003 conference, Jin-Hwan Cho told me about SFD and
CMap files, which are already used by some tools (and the plan is to add
more tools) but do not yet have a proper place in the TDS tree.

From what I have read, fonts/sfd/<package> and fonts/cmap/<package>
might be the right thing to do (where <package> is an identifier
indicating which package the files belong to).
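
To make the proposed layout concrete, here is a minimal Python sketch that
builds such a path. The function name tds_font_path, the default root
"texmf", and the package and file names in the usage line are only
illustrative assumptions, not part of any existing tool or of the TDS.

```python
from pathlib import PurePosixPath


def tds_font_path(kind, package, filename, root="texmf"):
    """Build fonts/<kind>/<package>/<filename> below a TDS root.

    Hypothetical helper: only the two directory kinds proposed in
    this mail ("sfd" and "cmap") are accepted.
    """
    if kind not in ("sfd", "cmap"):
        raise ValueError("kind must be 'sfd' or 'cmap'")
    return PurePosixPath(root) / "fonts" / kind / package / filename


# Illustrative package and file names, chosen only for the example.
print(tds_font_path("sfd", "ttf2pk", "Big5.sfd"))
```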

For your reference, I append a mail that Jin-Hwan just wrote to me.
Comments welcome!


Here are a few statements about what CMap and SFD files are. The current
KPATHSEA library does not support these file formats. I hope I can use
kpse_cmap_format and kpse_sfd_format in the next or a future KPATHSEA
library. No TeX system supports the CMap file format yet, but it
is used by Adobe Reader, GNU Ghostscript, and DVIPDFMx. In contrast, the
SFD file format is supported in some TeX systems; its location is usually
texmf/ttf2pk// because ttf2tfm and ttf2pk use this file format.

1. SFD (SubFont Definition files)

Usually .sfd files have been used with the ttf2tfm and ttf2pk packages,
which are distributed with the FreeType 1 library as a contribution. The
role of this file format is simple: it describes how to divide a large
character set into subfonts of at most 256 characters each.

Subfont definition files (from ttf2pk.doc)

CJKV (Chinese/Japanese/Korean/old Vietnamese) fonts usually contain
several thousand glyphs; to use them with TeX it is necessary to split
such large fonts into subfonts.  Subfont definition files (usually having
the extension '.sfd') are a simple means to do this smoothly.

A subfont file name usually consists of a prefix, a subfont infix,
and a postfix (which is empty in most cases), e.g.

    ntukai23 -> prefix: ntukai, infix: 23, postfix: (empty)

Here is the syntax of a line in an SFD file, describing one subfont:

  <whitespace> <infix> <whitespace> <ranges> <whitespace> `\n'

A line can be continued on the next line with a backslash ending the line.
The ranges must not overlap; offsets have to be in the range 0-255.


  The line

    03   10: 0x2349 0x2345_0x2347

  assigns to  the code  positions 10,  11, 12, and  13 of  the subfont
  having the  infix `03' the  character codes 0x2349,  0x2345, 0x2346,
  and 0x2347, respectively.
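
The example line above can be read mechanically. The following Python
sketch is not the ttf2pk implementation; it only covers the three token
forms visible in the example (an offset such as 10:, a single code, and a
range written first_last). The function name parse_sfd_line is made up,
numbers are read with Python's int(text, 0), and the start position 0
used when no offset is given is an assumption. Backslash continuation
lines are not handled.

```python
def parse_sfd_line(line):
    """Parse one SFD line into (infix, {position: character_code}).

    Sketch only: handles `<offset>:`, `<code>` and `<first>_<last>`
    tokens as seen in the example line.
    """
    tokens = line.split()
    infix, tokens = tokens[0], tokens[1:]
    mapping = {}
    position = 0  # assumption: start at 0 if no offset is given
    for token in tokens:
        if token.endswith(":"):
            # An offset sets the next subfont code position.
            position = int(token[:-1], 0)
            continue
        if "_" in token:
            first, last = (int(part, 0) for part in token.split("_"))
        else:
            first = last = int(token, 0)
        for code in range(first, last + 1):
            if not 0 <= position <= 255:
                raise ValueError("position %d outside 0-255" % position)
            if position in mapping:
                raise ValueError("position %d assigned twice" % position)
            mapping[position] = code
            position += 1
    return infix, mapping


infix, mapping = parse_sfd_line("03   10: 0x2349 0x2345_0x2347")
for pos in sorted(mapping):
    print(infix, pos, hex(mapping[pos]))
```

For the example line this yields the infix 03 and the positions 10 to 13
mapped to 0x2349, 0x2345, 0x2346, and 0x2347, matching the description
above.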

The SFD files in the distribution are customized for the CJK package
for LaTeX.

2. CMap

The following short excerpt from the PDF Reference shows exactly
what CMap files are.  Put simply, a CMap is a 16-bit extension of an .enc
file with some additional features.  It is a text file written in
PostScript syntax, though its paired begin/end operators make it look a
little like XML. Do not confuse it with the 'cmap' table in the TrueType
font spec.

CID-Keyed Fonts Overview (from PDF Reference 1.4, p.335-336)

CID-keyed fonts provide a convenient and efficient method for defining
multiple-byte character encodings, fonts with a large number of glyphs,
and fonts that incorporate glyphs obtained from other fonts. These
capabilities provide great flexibility for representing text in writing
systems for languages with large character sets, such as Chinese,
Japanese, and Korean (CJK).

The CID-keyed font architecture specifies the external representation of
certain font programs, called CMap and CIDFont files, along with some
conventions for combining and using those files.  This architecture is
independent of PDF; CID-keyed fonts can be used in other environments.
For complete documentation on the architecture and the file formats,
see Adobe Technical Notes #5092, CID-Keyed Font Technology Overview,
and #5014, Adobe CMap and CIDFont Files Specification.

The term CID-keyed font reflects the fact that CID (character identifier)
numbers are used to index and access the glyph descriptions in the
font. This method is more efficient for large fonts than the method of
accessing by character name, as is used for some simple fonts.  CIDs range
from 0 to a maximum value that is subject to an implementation limit.

A character collection is an ordered set of all characters needed to
support one or more popular character sets for a particular language. The
order of the characters in the character collection determines the CID
number for each character. Each CID-keyed font must explicitly reference
the character collection on which its CID numbers are based.

A CMap (character map) file specifies the correspondence between character
codes and the CID numbers used to identify characters. It is equivalent
to the concept of an encoding in simple fonts.  Whereas a simple font
allows a maximum of 256 characters to be encoded and accessible at
one time, a CMap can describe a mapping from multiple-byte codes to
thousands of characters in a large CID-keyed font. For example, it can
describe Shift-JIS, one of several widely used encodings for Japanese,
or Unicode, an international standard encoding that covers many languages.

A CMap can reference an entire character collection, a subset, or multiple
character collections.  It can also reference characters in other fonts
by character code or character name. The CMap mapping yields a font
number and a character selector that can be a CID, a character code,
or a character name. Furthermore, a CMap can incorporate another CMap
by reference, without having to duplicate it. These features enable
character collections to be combined or supplemented, and make all the
constituent characters accessible to text-showing operations through a
single encoding.
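
The correspondence between character codes and CID numbers described in
the excerpt can be pictured as a table of code ranges. The Python sketch
below illustrates only that idea and is not a CMap parser: the range
values are made up and do not come from any real CMap or character
collection, and returning None for an unmapped code is an assumption of
this sketch.

```python
# Hypothetical ranges: (first_code, last_code, first_cid).
# These numbers are invented for illustration only.
EXAMPLE_RANGES = [
    (0x20, 0x7E, 1),
    (0x8140, 0x817E, 633),
]


def code_to_cid(code, ranges=EXAMPLE_RANGES):
    """Return the CID for a character code, or None if unmapped."""
    for first_code, last_code, first_cid in ranges:
        if first_code <= code <= last_code:
            # Codes inside a range map to consecutive CIDs.
            return first_cid + (code - first_code)
    return None


print(code_to_cid(0x21))
print(code_to_cid(0x8141))
print(code_to_cid(0x7F))
```

A one-byte code and a two-byte code are looked up in the same table,
which is the sense in which a CMap goes beyond the 256 slots of a simple
font encoding.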

Any questions and comments are welcome!

Best, ChoF.
~~~~~~~~~~~~~~~~~~~~~~~~~     ***
| Cho, Jin-Hwan == ChoF |     ^ ^
~~~~~~~~~~~~~~~~~~~~~~~~~      o
| Research Fellow       |     ~~~
| School of Mathematics ~~~~~~~~~~~~~~
| Korea Institute for Advanced Study |
| chofchof at ktug.or.kr                |
| http://free.kaist.ac.kr/ChoF/      |
