One of the more interesting parts of a wiki engine is the parser: the code that takes the raw text of a page, as entered by the users into the text area of their browsers, and turns it into actual HTML of the web page, as displayed in the “normal” view of the page. Personally I think that parsing is one of the more intriguing topics in computer science, but what makes parsing wiki text especially worth exploring is the nature of the markup: wiki markup is usually not described in a formal way. It is supposed to be simple for humans to use, not for the machines to parse. In the end it appears that most wiki markups in the wild are in fact regular. There might be a few exceptions that are context-free, but that’s just because the designer didn’t bother to limit the nesting. But the infinite nesting of list elements or quotations is never actually used: in fact its usefulness is constrained by the width of the browser window (you can only do so much indenting). So, in practice, most wiki markups in the wild are regular languages.
This is no mistake. The very first wiki engines were written in languages like Perl, and the parsing was indeed performed using regular expressions. Usually multiple regular expressions, and multiple passes, because it was simpler to write this way. Later, as there were more and more attempts at writing wiki engines in various languages, programmers sometimes attempted to build their own ad-hoc state machines – sometimes more, sometimes less formal. Some of them even have a stack to keep track of generated HTML elements – but that’s because HTML is context-free, not because the language parsed was. I think that to this day the prevailing technique is to use regular expressions in two or more passes.
The first wiki engine I ever explored in depth was PikiPiki. Written in python, using simple files for its database, and using regular expressions to parse the text:
class PageFormatter:
"""Object that turns Wiki markup into HTML.
All formatting commands can be parsed one line at a time, though
some state is carried over between lines.
"""
def __init__(self, raw):
self.raw = raw
self.is_em = self.is_b = 0
self.list_indents = []
self.in_pre = 0
def _emph_repl(self, word):
if len(word) == 3:
self.is_b = not self.is_b
return ['</b>', '<b>'][self.is_b]
else:
self.is_em = not self.is_em
return ['</em>', '<em>'][self.is_em]
def _rule_repl(self, word):
s = self._undent()
if len(word) <= 4:
s = s + "\n<hr>\n"
else:
s = s + "\n<hr size=%d>\n" % (len(word) - 2 )
return s
def _word_repl(self, word):
return Page(word).link_to()
def _url_repl(self, word):
return '<a href="%s">%s</a>' % (word, word)
def _email_repl(self, word):
return '<a href="mailto:%s">%s</a>' % (word, word)
def _ent_repl(self, s):
return {'&': '&',
'<': '<',
'>': '>'}[s]
def _li_repl(self, match):
return '<li>'
def _pre_repl(self, word):
if word == '{{{' and not self.in_pre:
self.in_pre = 1
return '<pre>'
elif self.in_pre:
self.in_pre = 0
return '</pre>'
else:
return ''
def _macro_repl(self, word):
macro_name = word[2:-2]
# TODO: Somehow get the default value into the search field
return apply(globals()['_macro_' + macro_name], ())
def _indent_level(self):
return len(self.list_indents) and self.list_indents[-1]
def _indent_to(self, new_level):
s = ''
while self._indent_level() > new_level:
del(self.list_indents[-1])
s = s + '</ul>\n'
while self._indent_level() < new_level:
self.list_indents.append(new_level)
s = s + '<ul>\n'
return s
def _undent(self):
res = '</ul>' * len(self.list_indents)
self.list_indents = []
return res
def replace(self, match):
for type, hit in match.groupdict().items():
if hit:
return apply(getattr(self, '_' + type + '_repl'), (hit,))
else:
raise "Can't handle match " + `match`
def print_html(self):
# For each line, we scan through looking for magic
# strings, outputting verbatim any intervening text
scan_re = re.compile(
r"(?:(?P<emph>'{2,3})"
+ r"|(?P<ent>[<>&])"
+ r"|(?P<word>\b(?:[A-Z][a-z]+){2,}\b)"
+ r"|(?P<rule>-{4,})"
+ r"|(?P<url>(http|ftp|nntp|news|mailto)\:[^\s'\"]+\S)"
+ r"|(?P<email>[-\w._+]+\@[\w.-]+)"
+ r"|(?P<li>^\s+\*)"
+ r"|(?P<pre>(\{\{\{|\}\}\}))"
+ r"|(?P<macro>\[\[(TitleSearch|FullSearch|WordIndex"
+ r"|TitleIndex|RecentChanges|GoTo)\]\])"
+ r")")
blank_re = re.compile("^\s*$")
bullet_re = re.compile("^\s+\*")
indent_re = re.compile("^\s*")
eol_re = re.compile(r'\r?\n')
raw = string.expandtabs(self.raw)
for line in eol_re.split(raw):
if not self.in_pre:
# XXX: Should we check these conditions in this order?
if blank_re.match(line):
print '<p>'
continue
indent = indent_re.match(line)
print self._indent_to(len(indent.group(0)))
print re.sub(scan_re, self.replace, line)
if self.in_pre: print '</pre>'
print self._undent()
The parser is line-based. It has some external state, like the level of indentation and the flags in_pre, in_em and in_b telling it whether it’s inside these kinds of blocks – they are needed to carry the state between the lines – but most of the work is done using the scan_re regular expression. So, we have a kind of mixed approach here. On the line level (we will call it “block-level”), we have an ad-hoc state machine, but on the character level (or “text-level”), there is a regular expression.
The use of re.sub() in this parser is particularly interesting. Normally, that method will replace all matches of th pattern provided with its second argument. But in this case the second argument is a function – so that function will be called for every match, and the match will be replaced with whatever the function returns. The function gets the match object as a parameter, so it can see what was actually matched. The match object contains, among other things, all the groups from the regular expression: those are the bits enclosed in (?P<name> ...), where “name” is the group’s name. The replace function will check those groups and call appropriate _*_repl() functions, that are responsible for generating actual HTML code for various elements of the wiki language. Simple, eh?
This parser doesn’t always generate valid HTML – mostly due to sloppiness of the block-level code: it won’t close <p> tags or close any inline tags before starting a new block tag. It also will not escape the “&”, “>” and “<” characters in URLs – probably because browsers are forgiving to this kind of error.
What I would like to do is removing the “ad-hoc parser” from this design. I can see two ways of doing it: replacing it with another regular expression, using roughly the same technique that was used for text-level parsing; and removing the block-level/text-level separation entirely, modifying the text-level regular expression to also handle separation into lines and blocks. The latter approach has one large advantage: the text is only ever scanned once, whereas with the two other approaches (the above one and the “two regular expressions” one), the text is scanned at least twice: once to split it into lines or blocks, and second time to apply the text-level regular expression. This approach has also one huge disadvantage: the regular expression becomes ridiculously complicated, and the whole thing becomes pretty hard to understand and debug.
Two-level parsing with regular expressions seem to be much easier and straightforward. That’s what I used in my Oink:WikiCreole parser in python for MoinMoin, only I was forced to introduce an additional third level in two places: inside lists and tables. Not that it couldn’t be avoided – I just decided that keeping it on two levels wasn’t worth complicating the regular expressions. It’s still O(n), after all. Oh, and I use the generated DOM tree to keep some of the parser’s state. This also could be avoided at the expense of complicating the regular expression. Oh well.