Some time ago a bold gentleman by the name of “anon” edited the “What sucks” page on Hatta’s main wiki and added a strange line there:
This is a very silly way of making a feature request – I have no idea what that is supposed to mean. I just can’t imagine a way that “encryption” of any sort could be added to an open wiki like Hatta and be useful in any way. So I made it a link to a feature request on the bug tracker with a question about what it is supposed to mean.
As could be expected, this was a typical “ask and run” feature request. The guy just added his first vague thought to a list and never ever came back. I will give him some more time and close the ticket as “wontfix”, removing his comment from the page. No problem, if you don’t care about your feature requests, why should I?
But the silliness doesn’t end here. Now Bitbucket is getting pestered about introducing “encryption”. First there was a question on the IRC channel, but nobody managed to answer within the 2-minute attention span of the asker. Now it continues on the mailing list. The guy got a detailed answer about how the Bitbucket infrastructure is secured against code theft, but he keeps insisting on adding encryption somehow, somewhere, not understanding that it is not going to change anything. As usually happens, his ignorance is coupled with stubbornness and hostility. Sigh.
So I’m writing this post mostly to vent. I don’t think that it will help any of the people I mentioned (or maybe it’s a single person?), because it’s too long for a 2-minute attention span, but perhaps it will clear some matters about web applications.
In short, a web application runs (mostly) on the server, under complete control of the system administrators who have access to that server. If that application needs access to any data (for example to display the contents of your wiki, or contents of your repository), it needs to read that data on the server. That in turn means that if the data is encrypted, the application needs to decrypt it on the server. Anybody who has root privileges on that server also has access to the decrypted data through many means. It doesn’t help if you provide the decryption key every time you use the application, and it doesn’t save it anywhere. It can be modified by the attacker to save it, for example. Or to save the decrypted data. Or the decrypted data can be taken from the application’s memory. Or from different levels of caches. Whatever the application does to obtain the key and decrypt the data, the attacker can do exactly the same thing, if he is determined enough. Once you have a malicious person with root privileges or physical access to the machine, it’s a lost fight. That’s why they do so much to not let that happen.
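To make this concrete, here is a toy sketch in Python 3. Everything in it is made up for illustration: the WikiApp class, its methods, and the XOR “cipher”, which is not real encryption. It only shows that an application which never stores the key can still be wrapped by whoever controls the server, so that both the key and the plaintext leak.

```python
# Toy illustration only: XOR is NOT real encryption, and every name here
# (WikiApp, save_page, show_page) is hypothetical.
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Symmetric toy cipher: applying it twice with the same key restores the data."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


class WikiApp:
    """Hypothetical app that stores pages encrypted and never saves the key."""

    def __init__(self):
        self.storage = {}

    def save_page(self, title, text, key):
        self.storage[title] = xor_bytes(text.encode("utf-8"), key)

    def show_page(self, title, key):
        return xor_bytes(self.storage[title], key).decode("utf-8")


app = WikiApp()
app.save_page("Secret", "launch codes", b"hunter2")

# Somebody with root on the server can wrap the very same method...
stolen = []
original_show_page = app.show_page


def tapped_show_page(title, key):
    text = original_show_page(title, key)
    stolen.append((title, key, text))  # key and plaintext leak here
    return text


app.show_page = tapped_show_page

# ...and the user notices nothing the next time the page is viewed.
print(app.show_page("Secret", b"hunter2"))
print(stolen)
```

The data at rest really is unreadable without the key, but that does not matter: the key has to pass through code that the attacker can change.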
Then again, if you believe that the source code that you have written in the last month or so is so innovative, revolutionary and precious that there are people willing to go through all the trouble of breaking into that server just to get it – and at the same time you don’t understand the basics of how encryption works – then you need professional help. But don’t seek it on the Internet, you are not going to get it there.
The Dandelion wiki engine is slowly taking the right shape. Introducing two dependencies, Werkzeug WSGI tools and Genshi templates, helped me greatly reduce the amount of code and bugs in it. I also simplified the structure of the program, got rid of the plugin system for now and rewrote the parser completely. Oh, and you probably also noticed the new theme, but that’s just a small detail.
I think that this toy wiki engine is slowly transforming from a dirty hack into something more useful. I will be porting some of my projects into it soon, and once I add a journal macro, I will probably also port this very site to it. There is no better way of finding bugs and polishing rough edges than by using something extensively.
I found another way of writing a wiki parser. You might remember the technique I described previously, which used a regular expression to divide the text into blocks, then several other regular expressions to parse these blocks separately. It has the advantage of being relatively fast, as it uses built-in features, but it also has several disadvantages: you need to put the whole text into memory as a string, because Python’s regular expressions can’t work on arbitrary iterables; you cannot tell which line of the generated HTML corresponds to a given line of input text; and debugging might require a lot of experience with regular expressions.
Today I attempted to write a similar parser, but make it line-based, so that you can give it any iterable of lines as input, and it will produce as much output as possible right away.
First, we need a function that, given a line of text, would determine what kind of block that line belongs to. There are several possibilities: normal paragraph, empty line that separates paragraphs, bullet list, indented text, heading, block of code, etc. Except for the code block (and later block-level macros), all lines can be recognized on their own, without any need to keep state. We can use something like this:
    heading_re = re.compile(ur"^\s*=+.*=+$", re.U)
    bullets_re = re.compile(ur"^\s*\*\s+", re.U)
    empty_re = re.compile(ur"^\s*$", re.U)
    code_re = re.compile(ur"^\{\{\{+\s*$", re.U)
    indent_re = re.compile(ur"^\s+", re.U)

    def get_line_kind(self, line):
        if self.heading_re.match(line):
            return "heading"
        elif self.bullets_re.match(line):
            return "bullets"
        elif self.empty_re.match(line):
            return "empty"
        elif self.code_re.match(line):
            return "code"
        elif self.indent_re.match(line):
            return "indent"
        else:
            return "paragraph"

where the *_re attributes are regular expressions for recognizing various kinds of lines. This code is pretty clumsy; we can do better than that:
    block = {
        "bullets": ur"^\s*[*]\s+",
        "code": ur"^[{][{][{]+\s*$",
        "empty": ur"^\s*$",
        "heading": ur"^\s*=+.*=+$",
        "indent": ur"^[ \t]+",
    } # note that the priority is alphabetical
    block_re = re.compile(ur"|".join("(?P<%s>%s)" % kv
                          for kv in sorted(block.iteritems())))

    def get_line_kind(self, line):
        match = self.block_re.match(line)
        if match:
            return match.lastgroup
        else:
            return "paragraph"

This code does roughly the same thing, only more efficiently: it makes a single match call per line instead of up to five. I guess you can tell by now that I have a certain weakness for making huge regular expressions from dicts.
Anyway, the function is still a bit clumsy. We can make it a one-liner, using the fact that None doesn’t have a lastgroup attribute:

    def get_line_kind(self, line): return getattr(self.block_re.match(line), "lastgroup", "paragraph")
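For anyone who wants to try the classifier on its own, here is a standalone sketch in Python 3 syntax (plain r"" strings instead of ur"", items() instead of iteritems()). The patterns are the ones from the dict in this post; the sample lines are made up for the demonstration, and the function is a plain function rather than a method.

```python
import re

# Patterns copied from the post; only the syntax is adapted to Python 3.
block = {
    "bullets": r"^\s*[*]\s+",
    "code": r"^[{][{][{]+\s*$",
    "empty": r"^\s*$",
    "heading": r"^\s*=+.*=+$",
    "indent": r"^[ \t]+",
}
# Alternatives are tried left to right, so sorting the names fixes the priority.
block_re = re.compile("|".join("(?P<%s>%s)" % kv for kv in sorted(block.items())))


def get_line_kind(line):
    # match() returns None when nothing matches; getattr() then falls back
    # to the default, because None has no "lastgroup" attribute.
    return getattr(block_re.match(line), "lastgroup", "paragraph")


# Hypothetical sample input, one line per kind.
samples = ["= Title =\n", "* item\n", "\n", "{{{\n", "    indented\n", "plain text\n"]
kinds = [get_line_kind(line) for line in samples]
print(kinds)
```

Note that “empty” has to come before “indent” in the alternation, otherwise a line consisting only of spaces would be classified as indented text. The alphabetical order happens to take care of that.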
Note that the “code” block is not really complete – we only recognize the first line of it. That’s because we will be eating up the rest of its lines separately. Once we have the lines categorized, we can try the first crude parser:
    def block_paragraph(self, line):
        return "<p>%s</p>" % line

    ...

    def parse(self):
        for line in self.lines:
            kind = self.get_line_kind(line)
            func = getattr(self, "block_%s" % kind)
            yield func(line)
This will call various self.block_* methods depending on the kind of the line, and yield their results. Looks good, but there is a problem. Let’s test it with simple text:
    Lorem ipsum dolor sit amet, consectetuer
    adipiscing elit, sed diam nonummy nibh
    euismod tincidunt ut laoreet dolore magna
    aliquam erat volutpat.

    Ut wisi enim ad minim veniam, quis nostrud
    exerci tation ullamcorper suscipit lobortis
    nisl ut aliquip ex ea commodo consequat.

The result is something like this:

    <p>Lorem ipsum dolor sit amet, consectetuer </p>
    <p>adipiscing elit, sed diam nonummy nibh </p>
    <p>euismod tincidunt ut laoreet dolore magna </p>
    <p>aliquam erat volutpat. </p>

    <p>Ut wisi enim ad minim veniam, quis nostrud </p>
    <p>exerci tation ullamcorper suscipit lobortis </p>
    <p>nisl ut aliquip ex ea commodo consequat. </p>
What happened? Every line is a separate paragraph, and we would like to have them grouped: as long as the consecutive lines are of the same kind, they should be in the same block. We can achieve it with the itertools.groupby iterator:
    def block_paragraph(self, block):
        yield u'<p>'
        for line in block:
            yield line
        yield u'</p>'

    ...

    def parse(self):
        for kind, block in itertools.groupby(self.lines, self.get_line_kind):
            func = getattr(self, "block_%s" % kind)
            for part in func(block):
                yield part
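If you have not used itertools.groupby before, the following small Python 3 sketch shows its behaviour in isolation. The (kind, text) pairs are made-up stand-ins for already classified lines; the point is that only consecutive items with the same key end up in one group, so two paragraphs separated by an empty line stay separate.

```python
import itertools

# Made-up (kind, text) pairs standing in for classified lines.
lines = [
    ("paragraph", "Lorem ipsum"),
    ("paragraph", "dolor sit amet"),
    ("empty", ""),
    ("paragraph", "Ut wisi enim"),
    ("bullets", "* one"),
    ("bullets", "* two"),
]

# groupby() starts a new group every time the key changes.
groups = [
    (kind, [text for _, text in group])
    for kind, group in itertools.groupby(lines, key=lambda pair: pair[0])
]
for kind, texts in groups:
    print(kind, texts)
```

Each group is itself a lazy iterator over the underlying input, which is why the parser can still produce output as soon as lines arrive.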
This solves the problem in a pretty elegant way, except for the code block. But we can do a hack for it, and read lines for the code block from the lines iterator directly, just after encountering the code opening line:
    def block_code(self, block):
        for part in block:
            yield u'<pre>'
            line = self.lines.next()
            while not line.startswith("}}}"):
                yield line
                line = self.lines.next()
            yield u'</pre>'
The outer loop takes care of the case when there are several consecutive code blocks. You probably still need to cache the lines, strip them and add your own newlines, so that you don’t get empty lines at the end, but these are details. You also still need to parse the contents of paragraphs for inline markup like bold text and links. I think that it’s still best done with one huge regexp, just like I described previously.
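Putting the pieces together, here is a minimal runnable sketch in Python 3. It is not the actual wiki code: the LineParser class name is made up, only paragraphs, empty lines and code blocks are handled, and the HTML escaping is my addition. The code lines are collected in a list and joined with explicit newlines, which is one way of doing the “cache and strip” step mentioned above.

```python
import html
import itertools
import re


class LineParser:
    """Hypothetical, cut-down line-based parser (paragraph, empty, code only)."""

    block_re = re.compile(r"(?P<code>^[{][{][{]+\s*$)|(?P<empty>^\s*$)")

    def __init__(self, lines):
        # Must be a real iterator: block_code() pulls lines from it while
        # groupby() is reading from the very same object.
        self.lines = iter(lines)

    def get_line_kind(self, line):
        return getattr(self.block_re.match(line), "lastgroup", "paragraph")

    def block_paragraph(self, block):
        yield "<p>"
        for line in block:
            yield html.escape(line)
        yield "</p>"

    def block_empty(self, block):
        # Empty lines only separate blocks; groupby() skips the unread group.
        return []

    def block_code(self, block):
        for _ in block:  # one iteration per opening "{{{" line
            body = []
            for line in self.lines:  # read straight from the shared iterator
                if line.startswith("}}}"):
                    break
                body.append(html.escape(line.rstrip("\n")))
            yield "<pre>%s</pre>" % "\n".join(body)

    def parse(self):
        for kind, block in itertools.groupby(self.lines, self.get_line_kind):
            yield from getattr(self, "block_%s" % kind)(block)


text = "Lorem ipsum\ndolor sit amet\n\n{{{\nx = 1 < 2\n}}}\nUt wisi enim\n"
result = "".join(LineParser(text.splitlines(keepends=True)).parse())
print(result)
```

Because the inner loop is a plain for loop over the iterator, an unterminated code block simply runs to the end of the input instead of raising an exception.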
I hope this helps any wiki developers out there.