319 lines
15 KiB
HTML
319 lines
15 KiB
HTML
<?xml version="1.0" encoding="us-ascii"?>
|
|
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
|
|
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
|
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
|
|
<head>
|
|
<link type="text/CSS" rel="stylesheet" href="style.css" />
|
|
<link type="image/x-icon" rel="shortcut icon" href="favicon.png" />
|
|
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii" />
|
|
<title>Thalassa CMS official documentation</title>
|
|
</head><body>
|
|
<div class="theheader">
|
|
<a href="index.html"><img src="logo.png"
|
|
alt="thalassa cms logo" class="logo" /></a>
|
|
<h1><a href="index.html">Thalassa CMS official documentation</a></h1>
|
|
</div>
|
|
<div class="clear_both"></div>
|
|
<div class="navbar" id="uppernavbar"> <a href="common_macros.html#uppernavbar" title="previous" class="navlnk">⇐</a> <a href="userdoc.html#headed_text" title="up" class="navlnk">⇑</a> <a href="generation_objects.html#uppernavbar" title="next" class="navlnk">⇒</a> </div>
|
|
|
|
<div class="page_content">
|
|
|
|
<h1 class="page_title"><a href="">Headed text format</a></h1>
|
|
<div class="page_body">
|
|
<p>Contents:</p>
|
|
<ul>
|
|
<li><a href="#motivation">Why headed text files</a></li>
|
|
<li><a href="#structure">Structure of the file</a></li>
|
|
<li><a href="#encodings">Character encodings</a></li>
|
|
<li><a href="#formats">Content formats</a></li>
|
|
</ul>
|
|
|
|
|
|
<h2 id="motivation">Why headed text files</h2>
|
|
|
|
<p>Text files with a clearly separated header are used by Thalassa CMS as a
|
|
source for the following types of content:</p>
|
|
<ul>
|
|
<li>pages of <a href="generation_objects.html#pagesets">page sets</a>;</li>
|
|
<li>user comments.</li>
|
|
</ul>
|
|
|
|
<p>Every file of this type serves as a source for exactly one page, or
|
|
exactly one comment. In contrast with ini files, a headed file can
|
|
have its own <em>format</em> and may specify its own character encoding.
|
|
<strong>Macroprocessing is NOT performed on the
|
|
content of these files</strong>, and sometimes this may be deadly
|
|
important.
|
|
</p>
|
|
<p>The idea behind these files is that they are intended to be created
|
|
programmatically, rather than written manually, despite that it is easy to
|
|
write them by hands as well. In particular, this documentation was
|
|
prepared as a set of headed text files, written in a text editor (what
|
|
editor? well, <a href="http://www.vim.org">vim</a>) and processed as a
|
|
<a href="page_sets.html">page set</a>. On the other hand,
|
|
<em>comments</em> are received with the Thalassa CGI program
|
|
(<code>thalcgi.cgi</code>) and stored as headed text files, so this is
|
|
an obvious case when they are created by a program.
|
|
</p>
|
|
<p>Furthermore, Thalassa CMS was initially written keeping in mind a task of
|
|
saving a particular site, which was initially created in 2009 with Drupal
|
|
(5.* version of it, which was declared end-of-life just a year later).
|
|
After several years, the version of MySQL used for that site was declared
|
|
unsupported as well, and the ancient version of Drupal refused to work with
|
|
later versions of MySQL or MariaDB. Even the PHP interpreters of newer
|
|
versions started to output warnings trying to run that old Drupal. After
|
|
working for more than a decade in a virtual box containing older versions
|
|
of all involved software, it became clear for the site owner that the only
|
|
way to keep the site running is to break out of this software slavery,
|
|
built by people who don't care about backward compatibility.
|
|
</p>
|
|
<p>This is how Thalassa was born; the site from this story is
|
|
<a href="http://stolyarov.info">http://stolyarov.info</a>, it is in
|
|
Russian, so unfortunately most of people reading this text will not be able
|
|
to rate it for themselves, but as of the moment of migration to Thalassa, it
|
|
had 200+ pages and 6000+ user comments, and to give up all the content
|
|
would be a catastrophe. What was actually done is all pages and comments
|
|
were fetched from the MySQL database and stored in files; all these blocks,
|
|
menus and the like, together with page layout and color scheme, were all
|
|
recreated manually.
|
|
</p>
|
|
<p>This makes another example when a set of headed text files was created
|
|
programmatically, this time <em>including</em> the source files for pages,
|
|
not only for comments. And here's why Thalassa allows these files to be in
|
|
a character encoding different from the one used for ini files (and for the
|
|
generated HTML files, too): the author prefers koi8-r as THE cyrillic
|
|
encoding, while Drupal since ancient times is only capable of utf8, and
|
|
author preferred to store the source information in files exactly how it
|
|
was fetched from the database (in order not to loose anything
|
|
accidentally).
|
|
</p>
|
|
<p>The real objective of these files is to work instead of another damn
|
|
database table, holding both the content (that is, the text of a page or a
|
|
comment, possibly not in the final form) and additional attributes such as
|
|
title, date, tags and the like. And yes, they perform well. Actually,
|
|
they work far better than the author expected.
|
|
</p>
|
|
|
|
|
|
<h2 id="structure">Structure of the file</h2>
|
|
|
|
<p>The format is much like the well-known since rfc822 <em>email message</em>
|
|
format: headers come first, and every header consists of a name, a colon
|
|
char, a space and a value. Placing a line starting with whitespace right
|
|
after a header line, one can <em>continue</em> the header line, thus making
|
|
it longer. The body is separated from the header by an empty line
|
|
(strictly speaking, a line either of a zero lenght, or consisting of
|
|
whitespace chars only). The most notable difference with rfc822 is that
|
|
header names are case-sensitive, and for Thalassa CMS they are usually
|
|
written all-lowercase.
|
|
</p>
|
|
<p>To get the idea, here's an example:
|
|
</p>
|
|
<pre>
|
|
id: testpg
|
|
title: Test page
|
|
unixdate: 1681890000
|
|
encoding: us-ascii
|
|
format: texbreaks
|
|
|
|
This is an example of an HTML page created
|
|
from a source stored as a headed text file.
|
|
|
|
BTW, thanks to the format, this phrase will
|
|
go into a separate paragraph.
|
|
</pre>
|
|
|
|
|
|
<p>The header fields <code>id</code>, <code>encoding</code> and
|
|
<code>format</code> are used internally and are not accessible directly
|
|
with macros. Actually, the <code>id</code> (both for a page and for a
|
|
comment) is accessible, but it must be set to the same value as determined
|
|
by the file name (or, for set items represented with a directory, by the
|
|
name of the directory). It is unspecified what happens in case they don't
|
|
match.
|
|
</p>
|
|
<p class="remark">Actually, it is very possible nothing will happen at
|
|
all. Please don't rely on that. In future versions, checks are likely to
|
|
be added.</p>
|
|
|
|
<p>The <code>encoding</code> and <code>format</code> fields determine what
|
|
filters must be applied to the body and <em>some</em> of the header files
|
|
before the content is made available as macro values. Both can be omitted
|
|
or left blank. If <code>encoding</code> is not specified explicitly, it is
|
|
assumed to be the same as for the ini files (and for the site), so no
|
|
recoding is needed. <a href="#encodings">Character encodings</a> are
|
|
discussed later.
|
|
</p>
|
|
<p>If <code>format</code> is not specified, it is assumed to be
|
|
“<code>verbatim</code>”, so, again, no conversion is needed. It is
|
|
notably inconvenient to write raw HTML manually; see the
|
|
<a href="#formats">Formats</a> section below for information on available
|
|
formats.
|
|
</p>
|
|
<p>Some of the fields are recognized by Thalassa, but their values are
|
|
available via macros, directly or indirectly. All information from
|
|
unrecognized fields is extracted and is available directly by the names of
|
|
the fields (via the <code>%[li: ]</code> macro for pages, via the
|
|
<code>%[cmt: ]</code> macro for comments), so the user can add any
|
|
fields as needed and use their values for site generation.
|
|
</p>
|
|
<p>Some of the recognized fields are not passed through filters at all (e.g.,
|
|
the <code>unixdate:</code> field, as it is just a number), some are passed
|
|
through the same set of filters as the body (as determined by
|
|
<code>format:</code> and <code>encoding:</code>; actually, only the
|
|
<code>descr:</code> fields for pages is converted this way in the present
|
|
version), and for the rest of the fields, including any unrecognized
|
|
fields, only the encoding conversion is applied as appropriate, but no
|
|
format conversion is done.
|
|
</p>
|
|
<p>The exact list of recognized fields and their roles differ for pages
|
|
and comments, so it is discussed along with
|
|
<a href="page_sets.html">page sets</a> and
|
|
<a href="comment_sections.html">comment sections</a>, respectively.
|
|
</p>
|
|
|
|
|
|
|
|
<h2 id="encodings">Character encodings</h2>
|
|
|
|
<p>Thalassa mostly acts in encoding-agnostic manner. In particular, it
|
|
assumes ini files are written in the same encoding as the site pages should
|
|
be generated, so it never performs any encoding transformations for all
|
|
content that comes from ini files.
|
|
</p>
|
|
<p>Unlike that, headed text files are <a href="#motivation">intended</a>
|
|
to be supplied by other programs, and the author of Thalassa had at least
|
|
one task at hands for which encoding conversion was desired. Hence
|
|
Thalassa supports encoding conversions for headed text files' content. To
|
|
activate this feature, the following conditions must be met:</p>
|
|
<ul>
|
|
|
|
<li>base encoding must be set with the <code>encoding</code> parameter of the
|
|
<a href="general_configuration.html#format_section"><code>[format]</code>
|
|
configuration section;</a></li>
|
|
|
|
<li>encoding for the particular headed text file must be explicitly
|
|
specified with the <code>encoding:</code> header field;</li>
|
|
|
|
<li>the two encodings must differ and both must be supported by
|
|
Thalassa.</li>
|
|
|
|
</ul>
|
|
|
|
<p>If any item in this list is not met, Thalassa will silently decide to do no
|
|
encoding converion.
|
|
</p>
|
|
<p>Currently, only the following encodings are supported:
|
|
<code>utf8</code>,
|
|
<code>ascii</code>,
|
|
<code>koi8-r</code>,
|
|
<code>cp1251</code>.
|
|
Thalassa recognizes the following synonims for the supported encodings:
|
|
<code>utf-8</code>,
|
|
<code>us-ascii</code>,
|
|
<code>koi8r</code>,
|
|
<code>koi8</code>,
|
|
<code>1251</code>,
|
|
<code>win1251</code>,
|
|
<code>win-1251</code>,
|
|
<code>windows-1251</code>.
|
|
Names are case-insensitive.
|
|
</p>
|
|
<p class="remark">
|
|
It is relatively easy to add more encodings; this involves making tables to
|
|
convert from that encoding to unicode and back, and adding them to the
|
|
source code of the <code>stfilter</code> library used by Thalassa. If you
|
|
need support for an encoding not listed above, please contact the author.
|
|
</p>
|
|
|
|
<p>When a character encountered in the source being converted is not
|
|
representable in the target encoding, Thalassa replaces it with the
|
|
appropriate HTML4 entity, such as <code>&laquo;</code> or
|
|
<code>&mdash;</code>, if there's such entity for the particular
|
|
character in HTML4 (<strong>not HTML5</strong>). If no appropriate entity
|
|
exists, the character is replaced with HTML hexadecimal unicode entity,
|
|
like <code>&#x42B;</code>.
|
|
</p>
|
|
|
|
|
|
<h2 id="formats">Content formats</h2>
|
|
|
|
<p>Thalassa is designed to pass the body of a headed text file through filters
|
|
that somehow adjust the markup, that is, convert the content from one
|
|
format to another. Format of the body (and possibly some of the header
|
|
fields) is indicated by the <code>format:</code> header field; the value of
|
|
this field is expected to be a comma-separated list of tokens designating
|
|
centain aspects of the format, or, strictly speaking, of the desired
|
|
conversion.
|
|
</p>
|
|
<p>The default format is <code>verbatim</code>, which means the body is
|
|
already in HTML and needs no conversion at all. No conversion will be
|
|
performed as well in case the <code>format:</code> header is missing,
|
|
empty, or contains no recognized tokens; it is, however, strongly
|
|
recommended to specify the <code>verbatim</code> format explicitly as the
|
|
defaults may change in the future.
|
|
</p>
|
|
<p>The recognized format tokens are <code>verbatim</code>,
|
|
<code>breaks</code>, <code>texbreaks</code>, <code>tags</code> and
|
|
<code>web</code>.
|
|
</p>
|
|
<p>The <code>web</code> token was initially intended to indicate that URLs and
|
|
email addresses encountered in the text should be automatically turned into
|
|
clickable web links. As of present, <strong>this conversion is not
|
|
implemented</strong>, and the token is silently ignored by Thalassa.
|
|
</p>
|
|
<p>The <code>tags</code> token means that HTML tags must be stripped off the
|
|
text being converted, with exception for the tags explicitly allowed for
|
|
user-supplied content. The list of allowed tags is set by
|
|
<code>tags</code> parameter in the
|
|
<a href="general_configuration.html#format_section"><code>[format]</code>
|
|
configuration section</a>.
|
|
</p>
|
|
<p>Both <code>breaks</code> and <code>texbreaks</code> turn on conversion of
|
|
newline characters into paragraph breaks, but in different manner. The
|
|
filter enabled by <code>texbreaks</code> works in TeX/LaTeX style: it takes
|
|
empty lines (or, strictly speaking, lines that contain nothing but possibly
|
|
whitespace) as paragraph breaks, and wraps continuous text fragments (those
|
|
not containing empty lines) with <code><p></code> and
|
|
<code></p></code>. This mode is suitable if you use a text editor
|
|
that wraps long lines, such as vim. In particular, in all source files of
|
|
this documentation the <code>texbreaks</code> token is used.
|
|
</p>
|
|
<p>The <code>breaks</code> token does essentially the same, but in addition it
|
|
replaces every lonely newline char (that is, a newline char which is not
|
|
followed by an empty line) with <code><br /></code>. This mode is
|
|
inconvenient for files prepared manually in a text editor, but is useful
|
|
for texts received through web-forms, such as user comments.
|
|
</p>
|
|
<p>In both modes the filter is aware of some HTML/XHTML basics (but in a
|
|
<strong>very basic</strong> level): it knows that some tags are not
|
|
suitable inside paragraphs, and that the paragraph conversion shouldn't be
|
|
done within some tags. However, the filter is not a real (X)HTML parser
|
|
and has no idea about all these DTD schemes. It simply uses a list of
|
|
“special” tags: <code>pre</code>, <code>ul</code>, <code>ol</code>,
|
|
<code>table</code>, <code>p</code>, <code>blockquote</code>, <code>h1</code>,
|
|
<code>h2</code>,
|
|
<code>h3</code>, <code>h4</code>, <code>h5</code>, <code>h6</code>. When
|
|
any of these tags starts, the filter closes the current paragraph, if it is
|
|
open, and doesn't start any new paragraphs until the tag closes. The list
|
|
is hard-coded. It is very possible this will change in future versions, as
|
|
all this is, well, just a hack which solves certain problems arised here
|
|
and now.
|
|
</p>
|
|
<p>Anyway, tags that are “special” anyhow in respect to paragraphs, and are
|
|
not on the list above, shouldn't perhaps be allowed for user-supplied
|
|
content. As of content created by the site's owner or administrator, it
|
|
can always be checked for results of conversion.
|
|
</p>
|
|
</div>
|
|
|
|
</div>
|
|
<div class="navbar" id="bottomnavbar"> <a href="common_macros.html#bottomnavbar" title="previous" class="navlnk">⇐</a> <a href="userdoc.html#headed_text" title="up" class="navlnk">⇑</a> <a href="generation_objects.html#bottomnavbar" title="next" class="navlnk">⇒</a> </div>
|
|
|
|
<div class="bottomref"><a href="map.html">site map</a></div>
|
|
<div class="clear_both"></div>
|
|
<div class="thefooter">
|
|
<p>© Andrey Vikt. Stolyarov, 2023-2026</p>
|
|
</div>
|
|
</body></html>
|