I found another way of writing a wiki parser. You might remember the technique I described previously, which used one regular expression to divide the text into blocks and then several others to parse those blocks separately. It has the advantage of being relatively fast, as it relies on built-in features, but it also has several disadvantages: you need to hold the whole text in memory as a string, because Python’s regular expressions can’t work on arbitrary iterables; you cannot tell which line of the generated HTML corresponds to a given line of the input text; and debugging can require a lot of experience with regular expressions.
Today I attempted to write a similar parser, but make it line-based, so that you can give it any iterable of lines as input, and it will produce as much output as possible right away.
First, we need a function that, given a line of text, would determine what kind of block that line belongs to. There are several possibilities: normal paragraph, empty line that separates paragraphs, bullet list, indented text, heading, block of code, etc. Except for the code block (and later block-level macros), all lines can be recognized on their own, without any need to keep state. We can use something like this:
    heading_re = re.compile(ur"^\s*=+.*=+$", re.U)
    bullets_re = re.compile(ur"^\s*\*\s+", re.U)
    empty_re = re.compile(ur"^\s*$", re.U)
    code_re = re.compile(ur"^\{\{\{+\s*$", re.U)
    indent_re = re.compile(ur"^\s+", re.U)

    def get_line_kind(self, line):
        if self.heading_re.match(line):
            return "heading"
        elif self.bullets_re.match(line):
            return "bullets"
        elif self.empty_re.match(line):
            return "empty"
        elif self.code_re.match(line):
            return "code"
        elif self.indent_re.match(line):
            return "indent"
        else:
            return "paragraph"

where the *_re attributes are regular expressions for recognizing various kinds of lines. This code is pretty clumsy, we can do better than that:
    block = {
        "bullets": ur"^\s*[*]\s+",
        "code": ur"^[{][{][{]+\s*$",
        "empty": ur"^\s*$",
        "heading": ur"^\s*=+.*=+$",
        "indent": ur"^[ \t]+",
    } # note that the priority is alphabetical
    block_re = re.compile(ur"|".join("(?P<%s>%s)" % kv
                          for kv in sorted(block.iteritems())))

    def get_line_kind(self, line):
        match = self.block_re.match(line)
        if match:
            return match.lastgroup
        else:
            return "paragraph"

This code does roughly the same thing, only more efficiently: it scans the line at most once. I guess you can tell by now that I have a certain weakness for making huge regular expressions from dicts.
Anyway, the function is still a bit clumsy. We can make it a one-liner, using the fact that None doesn’t have a lastgroup attribute:
    def get_line_kind(self, line):
        return getattr(self.block_re.match(line), "lastgroup", "paragraph")
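For readers on Python 3, here is a minimal standalone sketch of the same classifier. It assumes module-level names (BLOCK, BLOCK_RE, both made up for this example) instead of class attributes, and it drops the ur prefix and iteritems, which Python 3 does not accept. The patterns themselves are unchanged.

```python
import re

# Patterns copied from the dict above; only the string syntax is updated.
# BLOCK and BLOCK_RE are hypothetical module-level names for this sketch.
BLOCK = {
    "bullets": r"^\s*[*]\s+",
    "code": r"^[{][{][{]+\s*$",
    "empty": r"^\s*$",
    "heading": r"^\s*=+.*=+$",
    "indent": r"^[ \t]+",
}
# Sorting fixes the order of the alternatives, so the priority is alphabetical.
BLOCK_RE = re.compile(
    "|".join("(?P<%s>%s)" % kv for kv in sorted(BLOCK.items()))
)


def get_line_kind(line):
    """Return the kind of block a single line belongs to."""
    # match() returns None when no alternative matches; None has no
    # lastgroup attribute, so getattr falls back to "paragraph".
    return getattr(BLOCK_RE.match(line), "lastgroup", "paragraph")


if __name__ == "__main__":
    for sample in ["= Title =", " * item", "", "{{{", "    indented", "plain text"]:
        print(repr(sample), "->", get_line_kind(sample))
```

Because "empty" sorts before "indent", a line containing only whitespace is classified as empty rather than indented, which is why the alphabetical priority happens to work here.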
Note that the “code” block is not really complete: we only recognize its first line. That’s because we will be eating up the rest of its lines independently. Once we have the lines categorized, we can try a first crude parser:
    def block_paragraph(self, line):
        return "<p>%s</p>" % line

    ...

    def parse(self):
        for line in self.lines:
            kind = self.get_line_kind(line)
            func = getattr(self, "block_%s" % kind)
            yield func(line)
This will call various self.block_* methods depending on the kind of the line, and yield their results. Looks good, but there is a problem. Let’s test it with simple text:
    Lorem ipsum dolor sit amet, consectetuer
    adipiscing elit, sed diam nonummy nibh
    euismod tincidunt ut laoreet dolore magna
    aliquam erat volutpat.

    Ut wisi enim ad minim veniam, quis nostrud
    exerci tation ullamcorper suscipit lobortis
    nisl ut aliquip ex ea commodo consequat.
the result is something like this:
    <p>Lorem ipsum dolor sit amet, consectetuer </p>
    <p>adipiscing elit, sed diam nonummy nibh </p>
    <p>euismod tincidunt ut laoreet dolore magna </p>
    <p>aliquam erat volutpat. </p>

    <p>Ut wisi enim ad minim veniam, quis nostrud </p>
    <p>exerci tation ullamcorper suscipit lobortis </p>
    <p>nisl ut aliquip ex ea commodo consequat. </p>
What happened? Every line is a separate paragraph, and we would like to have them grouped: as long as the consecutive lines are of the same kind, they should be in the same block. We can achieve it with the itertools.groupby iterator:
    def block_paragraph(self, block):
        yield u'<p>'
        for line in block:
            yield line
        yield u'</p>'

    ...

    def parse(self):
        for kind, block in itertools.groupby(self.lines, self.get_line_kind):
            func = getattr(self, "block_%s" % kind)
            for part in func(block):
                yield part
This solves the problem in a pretty elegant way, except for the code block. But we can do a hack for it, and read lines for the code block from the lines iterator directly, just after encountering the code opening line:
    def block_code(self, block):
        for part in block:
            yield u'<pre>'
            line = self.lines.next()
            while not line.startswith("}}}"):
                yield line
                line = self.lines.next()
            yield u'</pre>'
The outer loop takes care of the case where there are several consecutive code blocks. You probably still need to cache the lines, strip them and add your own newlines, so that you don’t get empty lines at the end, but these are details. You also still need to parse the contents of paragraphs for inline markup like bold text and links. I think that’s still best done with one huge regexp, just like I described previously.
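To see the pieces working together, here is a rough end-to-end sketch ported to Python 3. The class name LineParser, the empty-line handler, and the fallback to block_paragraph for kinds without a handler are my assumptions for the sake of a runnable example; they are not part of the parser described above. The sketch also wraps the input in iter(), since the code-block trick only works when groupby and block_code share a single iterator.

```python
import itertools
import re

BLOCK = {
    "bullets": r"^\s*[*]\s+",
    "code": r"^[{][{][{]+\s*$",
    "empty": r"^\s*$",
    "heading": r"^\s*=+.*=+$",
    "indent": r"^[ \t]+",
}
BLOCK_RE = re.compile(
    "|".join("(?P<%s>%s)" % kv for kv in sorted(BLOCK.items()))
)


class LineParser:
    """Hypothetical minimal line-based parser (names are assumptions)."""

    def __init__(self, lines):
        # One shared iterator: block_code pulls lines from it that
        # groupby has not looked at yet.
        self.lines = iter(lines)

    def get_line_kind(self, line):
        return getattr(BLOCK_RE.match(line), "lastgroup", "paragraph")

    def block_paragraph(self, block):
        yield "<p>"
        for line in block:
            yield line
        yield "</p>"

    def block_empty(self, block):
        # Assumption: empty lines only separate blocks and produce no output.
        return []

    def block_code(self, block):
        # Each item in the group is one "{{{" opening line.
        for _opening in block:
            yield "<pre>"
            # next(..., None) instead of a bare next(): since Python 3.7 a
            # StopIteration escaping a generator becomes a RuntimeError,
            # so an unclosed code block is simply ended at end of input.
            line = next(self.lines, None)
            while line is not None and not line.startswith("}}}"):
                yield line
                line = next(self.lines, None)
            yield "</pre>"

    def parse(self):
        for kind, block in itertools.groupby(self.lines, self.get_line_kind):
            # Assumption: kinds without a handler are treated as paragraphs.
            func = getattr(self, "block_%s" % kind, self.block_paragraph)
            yield from func(block)


if __name__ == "__main__":
    text = ["first line", "second line", "", "{{{", "x = 1", "", "}}}", "last"]
    for part in LineParser(text).parse():
        print(part)
```

Note that the blank line inside the code block ends up inside the pre element instead of splitting it, because block_code consumes it before groupby ever classifies it.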
I hope this helps any wiki developers out there.