Wednesday 17 December 2008

Helper for monkey patching in tests

Following on from my post on TempDirTestCase, here is another Python test case helper which I introduced at work. Our codebase has quite a few test cases which use monkey patching. That is, they temporarily modify a module or a class to replace a function or method for the duration of the test.

For example, you might want to monkey patch time.time so that it returns repeatable timestamps during the test. We have quite a lot of test cases that do something like this:

import time
import unittest

class TestFoo(unittest.TestCase):

    def setUp(self):
        self._old_time = time.time
        def monkey_time():
            return 0
        time.time = monkey_time

    def tearDown(self):
        time.time = self._old_time

    def test_foo(self):
        # body of test case
        pass
Having to save and restore the old values gets tedious, particularly if you have to monkey patch several objects (and, unfortunately, there are a few tests that monkey patch a lot). So I introduced a monkey_patch() method so that the code above can be simplified to:
class TestFoo(TestCase):

    def test_foo(self):
        self.monkey_patch(time, "time", lambda: 0)
        # body of test case
(OK, I'm cheating by using a lambda the second time around to make the code look shorter!)

Now, monkey patching is not ideal, and I would prefer not to have to use it. When I write new code I try to make sure that it can be tested without resorting to monkey patching. So, for example, I would parameterize the software under test to take time.time as an argument instead of getting it directly from the time module. (here's an example).
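
For example, here is a quick sketch of that style; the Stopwatch class and its clock argument are made up for illustration:

import time

class Stopwatch(object):

    # Take the time source as a constructor argument instead of calling
    # time.time directly, so tests can pass in a fake clock.
    def __init__(self, clock=time.time):
        self._clock = clock

    def measure(self, func):
        start = self._clock()
        func()
        return self._clock() - start

# In a test, no monkey patching is needed:
fake_times = iter([0, 5])
watch = Stopwatch(clock=lambda: next(fake_times))
assert watch.measure(lambda: None) == 5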

But sometimes you have to work with a codebase where most of the code is not covered by tests and is structured in such a way that adding tests is difficult. You could refactor the code to be more testable, but that risks changing its behaviour and breaking it. In that situation, monkey patching can be very useful. Once you have some tests, refactoring can become easier and less risky. It is then easier to refactor to remove the need for monkey patching -- although in practice it can be hard to justify doing that, because it is relatively invasive and might not be a big improvement, and so the monkey patching stays in.

Here's the code, an extended version of the base class from the earlier post:

import os
import shutil
import tempfile
import unittest

class TestCase(unittest.TestCase):

    def setUp(self):
        self._on_teardown = []

    def make_temp_dir(self):
        temp_dir = tempfile.mkdtemp(prefix="tmp-%s-" % self.__class__.__name__)
        def tear_down():
            shutil.rmtree(temp_dir)
        self._on_teardown.append(tear_down)
        return temp_dir

    def monkey_patch(self, obj, attr, new_value):
        old_value = getattr(obj, attr)
        def tear_down():
            setattr(obj, attr, old_value)
        self._on_teardown.append(tear_down)
        setattr(obj, attr, new_value)

    def monkey_patch_environ(self, key, value):
        old_value = os.environ.get(key)
        def tear_down():
            if old_value is None:
                del os.environ[key]
            else:
                os.environ[key] = old_value
        self._on_teardown.append(tear_down)
        os.environ[key] = value

    def tearDown(self):
        for func in reversed(self._on_teardown):
            func()
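
For example, a test using these helpers might look like this (the test itself is made up for illustration):

import os

class TestExample(TestCase):

    def test_using_helpers(self):
        temp_dir = self.make_temp_dir()
        self.monkey_patch_environ("HOME", temp_dir)
        # Anything the code under test writes to $HOME now lands in
        # temp_dir; the variable and the directory are both cleaned up
        # automatically in tearDown().
        self.assertEqual(os.environ["HOME"], temp_dir)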

Wednesday 26 November 2008

Shell features

Here are two features I would like to see in a Unix shell:
  • Timing: The shell should record how long each command takes. It should be able to show the start and stop times and durations of commands I have run in the past.
  • Finish notifications: When a long-running command finishes, the task bar icon for the shell's terminal window should flash, just as instant messaging programs flash their task bar icon when you receive a message. If the terminal window is tabbed, the tab should be highlighted too.
You can achieve the first with the time command, of course, but sometimes I start a command without knowing in advance that it will be long-running, and it's hard to add the timer after the fact. Also, Bash's time builtin stops working if you suspend a job with Ctrl-Z. It would be simpler if the shell collected this information by default.

The second feature requires some integration between the shell and the terminal. This could be done via some new terminal escape sequence or perhaps using the WINDOWID environment variable that gnome-terminal appears to pass to its subprocesses. But actually, I would prefer if the shell provided its own terminal window. There would be more scope for combining GUI and CLI features that way, such as displaying filename completions or (more usefully) command history in a pop-up window.

I have seen a couple of attempts to do that. Hotwire is one, but it is too different from Bash for my tastes. I would like a GUI shell that initially looks and can be used just like gnome-terminal + Bash. Gsh is closer to what I have in mind, but it is quite old, written in Tcl/Tk and C, and not complete.

Saturday 22 November 2008

TempDirTestCase, a Python unittest helper

I have seen a lot of unittest-based test cases written in Python that create temporary directories in setUp() and delete them in tearDown().

Creating temporary directories is such a common thing to do that I have a base class that provides a make_temp_dir() helper method. As a result, you often don't have to define setUp() and tearDown() in your test cases at all.

I have ended up copying this into different codebases because it's easier than making the codebases depend on an external library for such a trivial piece of code. This seems to be quite common: lots of Python projects provide their own test runners based on unittest.

Here's the code:

import shutil
import tempfile
import unittest

class TempDirTestCase(unittest.TestCase):

    def setUp(self):
        self._on_teardown = []

    def make_temp_dir(self):
        temp_dir = tempfile.mkdtemp(prefix="tmp-%s-" % self.__class__.__name__)
        def tear_down():
            shutil.rmtree(temp_dir)
        self._on_teardown.append(tear_down)
        return temp_dir

    def tearDown(self):
        for func in reversed(self._on_teardown):
            func()
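
A test case that uses it does not need to define setUp() or tearDown() at all; for example (the test itself is made up for illustration):

import os

class TestWritesFiles(TempDirTestCase):

    def test_creates_file(self):
        temp_dir = self.make_temp_dir()
        path = os.path.join(temp_dir, "example.txt")
        open(path, "w").write("hello")
        self.assertTrue(os.path.exists(path))
        # The temporary directory is removed automatically in tearDown().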

Monday 27 October 2008

Making relocatable packages with JHBuild

I have been revisiting an experiment I began back in March with building GNOME with JHBuild. I wanted to see if it would be practical to use JHBuild to package all of GNOME with Zero-Install. The main issue with packaging anything with Zero-Install is relocatability.

If you build and install an autotools-based package with

./configure --prefix=/FOO && make install
the resulting files installed under /FOO will often have the pathname /FOO embedded in them, sometimes in text files, other times compiled into libraries and executables. This is a problem for Zero-Install because it runs under a normal user account and wants to install files into a user's home directory under ~/.cache. Currently if a program is to be packaged with Zero-Install it must be relocatable via environment variables. Compiling pathnames in is no good (at least without an environment variable override) because you don't know in advance where the program will be installed.

I found a few cases where pathnames get compiled in:

  • text: pkg-config .pc files
  • text: libtool .la files
  • text: shell scripts generated from .in files, such as gtkdocize, intltoolize and gettextize
  • binary: rpaths added by libtool

It is possible to handle these individual cases. Zero-Install's make-headers tool will fix up pkg-config .pc files. libtool .la files can apparently just be removed on Linux without any adverse effects. libtool could be modified to not use rpaths (unfortunately --disable-rpath doesn't seem to work), which are overridden by LD_LIBRARY_PATH anyway. gtkdocize et al could be modified. But that sounds like a lot of work. I'd like to get something working first.

In revisiting this I hoped that the only cases that would matter would be text files. It would be easy to do a search and replace inside text files to relocate packages. The idea would be to build with

./configure --prefix=/FAKEPREFIX
make install DESTDIR=/tempdest
and then rewrite /FAKEPREFIX to (say) /home/fred/realprefix. In a text file, changing the length of a pathname (and hence the size of the file) usually doesn't matter, but doing this to an ELF executable would completely screw the executable up. This search-and-replace trick would be a hack, but it would be worth trying.

It turned out that Evince (which I was using as a test case) embeds the pathname /FAKEPREFIX/share/applications/evince.desktop in its own executable, and if this file doesn't exist, it segfaults on startup.

Then it occurred to me that I could rewrite filenames inside binary files without changing the length of the filename: just pad the filenames out to a fixed length at the start.

So the idea now is to build with something like:

./configure --prefix=/home/bob/builddir/gtk-XXXXXXXXXXXXXXXXXXX
make install
and, when installing the files on another machine, rewrite
/home/bob/builddir/gtk-XXXXXXXXXXXXXXXXXXX
to
/home/fred/.cache/0install.net/gtk-XXXXXXX

Just make sure you start off with enough padding to allow the package to be relocated to any path a user is likely to use in practice.

This is even hackier than rewriting filenames inside text files, but it's very simple!
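
As a rough sketch of the rewriting step, using the made-up paths from above (a real tool would want to be more careful, but the core of it is just a same-length byte substitution):

# Same-length substitution keeps all offsets in binary files valid.
OLD = b"/home/bob/builddir/gtk-XXXXXXXXXXXXXXXXXXX"
NEW = b"/home/fred/.cache/0install.net/gtk-XXXXXXX"
assert len(OLD) == len(NEW)

def relocate(filename):
    with open(filename, "rb") as fh:
        data = fh.read()
    with open(filename, "wb") as fh:
        fh.write(data.replace(OLD, NEW))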

This is partly inspired by Nix, which does something similar, but with a bit more complexity. Nix will install a package under (something like) /nix/store/<hash>, where <hash> is (roughly) a cryptographic hash of the binary package's contents. But packages like Evince contain embedded filenames referring to their own contents, so Nix will build it with:

./configure --prefix=/nix/store/<made-up-hash>
make install
where <made-up-hash> is chosen randomly. Afterwards, the output is rewritten to replace <made-up-hash> with the real hash of the output, but there is some cleverness to prevent <made-up-hash> from affecting the real hash.

(Earlier versions of Nix used the hash of the build input to identify packages rather than the hash of the build output. This avoided the need to do rewriting but didn't allow a package's contents to be verified based on its hash name.)

The fact that Nix uses this scheme successfully indicates that filename rewriting in binaries works, and filenames are not being checksummed or compressed or encoded in weird ways, which is good.

My plan now is:

  • Extend JHBuild to build packages into fixed-length prefixes and produce Zero-Install feeds. My work so far is on this Bazaar branch.
  • Extend Zero-Install to do filename rewriting inside files in order to relocate packages.

Thursday 18 September 2008

Attribute access in format strings in Python 3.0

Here is another problem with securing Python 3.0: PEP 3101 has extended format strings so that they contain an attribute access syntax. This makes the format() method on strings too powerful. It exposes unrestricted use of getattr, so it can be used to read private attributes.

For example, the expression "{0._foo}".format(x) is equivalent to str(x._foo).
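
To make that concrete (the class here is invented for illustration):

class Account(object):
    def __init__(self):
        self._secret = "hunter2"  # intended to be private

x = Account()
print("{0._secret}".format(x))  # prints "hunter2"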

CapPython could work around this, but it would not be simple. It could block the "format" attribute (that is, treat it as a private attribute), although that is jarring because this word does not even look special. Lots of code will want to use format(), so we would have to rewrite this to a function call that interprets format strings safely. Having to do rewriting increases complexity. And if our safe format string interpreter is written in Python, it will be slower than the built-in version.

My recommendation would be to take out the getattr-in-format-strings feature before Python 3.0 is released. Once the release has happened it would be much easier to add such a feature than to take it out.

It is a real shame that Python 3 has not adopted E's quasi-literals feature, which has been around for a while. Not only are quasi-literals more secure than PEP 3101's format strings, they are more general (because they allow any kind of subexpression) and could be faster (because more can be done at compile time).

Sunday 14 September 2008

CapPython, unbound methods and Python 3.0

CapPython needs to block access to method functions. Luckily, one case is already covered by Python's unbound methods, but they are going away in Python 3.0.

Consider the following piece of Python code:

class C(object):

    def f(self):
        return self._field

x = C()

In Python, methods are built out of functions. "def" always defines a function. In this case, the function f is defined in a class scope, so the function gets wrapped up inside a class object, making it available as a method.

There are three ways in which we might use function f:

  • Via instances of class C as a normal method, e.g. x.f(). This is the common case. The expression x.f returns a bound method, which wraps the instance x and the function f.
  • Via class C, e.g. C.f(x). The expression C.f returns an unbound method. If you call this unbound method with C.f(y), it first checks that y is an instance of C. If that is the case, it calls f(y). Otherwise, it raises a TypeError.
  • Directly, as a function, assuming you can get hold of the unwrapped function. There are several ways to get hold of the function:
    • x.f.im_func or C.f.im_func. Bound and unbound methods make the function they wrap available via an attribute called "im_func".
    • In class scope, "f" is visible directly as a variable.
    • C.__dict__["f"]

CapPython allows the first two but aims to block direct use of method functions.

In CapPython, attribute access is restricted so that you can only access private attributes (those starting with an underscore) via a "self" variable inside a method function. For this to work, access to method functions must be restricted. Function f should only ever be used on instances of C and its subclasses.

Suppose that constraint was violated. If you could get hold of the unwrapped function f, you could apply it to an object y of any type, and f(y) would return the value of the private attribute, y._field. That would violate encapsulation.
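
Here is a sketch of the kind of breach that has to be prevented (the second class is invented for illustration):

class C(object):
    def f(self):
        return self._field

class Unrelated(object):
    def __init__(self):
        self._field = "secret"

f = C.__dict__["f"]    # the raw function, with no instance check
print(f(Unrelated()))  # prints "secret", breaking encapsulation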

To enforce encapsulation, CapPython blocks the paths for getting hold of f that are listed above, as well as some others:

  • "im_func" is treated as a private attribute, even though it doesn't start with an underscore.
  • In class scope, reading the variable f is forbidden. Or, to be more precise, if variable f is read, f is no longer treated as a method function, and its access to self._field is forbidden.
  • __dict__ is a private attribute, so the expression C.__dict__ is rejected.
  • Use of __metaclass__ is blocked, because it provides another way of getting hold of a class's __dict__.
  • Use of decorators is restricted.

Bound methods and unbound methods both wrap up function f so that it can be used safely.

However, this is changing in Python 3.0. Unbound methods are being removed. This means that C.f simply returns function f. If CapPython is going to work on Python 3.0, I am afraid it will have to become a lot more complicated. CapPython would have to apply rewriting to class definitions so that class objects do not expose unwrapped method functions. Pre-3.0 CapPython has been able to get away without doing source rewriting.
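
Concretely, with the class C from above and an unrelated class invented for illustration:

class D(object):
    def __init__(self):
        self._field = "private"

# Python 2: C.f is an unbound method, so C.f(D()) raises a TypeError.
# Python 3.0: C.f is just the plain function f, so this prints "private".
print(C.f(D()))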

Pre-3.0 CapPython has a very nice property: It is possible for non-CapPython-verified code to pass classes and objects into verified CapPython code without allowing the latter to break encapsulation. The non-verified code has to be careful not to grant the CapPython code unsafe objects such as "type" or "globals" or "getattr", but the chances of doing that are fairly low, and this is something we could easily lint for. However, if almost every class in Python 3.0 provides access to objects that break CapPython's encapsulation (that is, method functions), the non-CapPython code must wrap every class before sharing it, and the risks of combining code in this way are significantly increased.

Ideally, I'd like to see this change in Python 3.0 reverted. Unbound methods were scheduled for removal in a one-liner in PEP 3100. This originated in a comment on Guido van Rossum's blog and a follow-on thread. The motivation seems to be to simplify the language, which is often good, but not in this case. However, I'm about 3 years too late, and Python 3.0 is scheduled to be released in the next few weeks.

Dealing with modules and builtins in CapPython

In my previous post, Introducing CapPython, I wrote that CapPython "does not yet block access to Python's builtin functions such as open, and it does not yet deal with Python's module system".

Dealing with modules and builtins has turned out to be easier than I thought.

At the time, I had in mind that CapPython could work like the Joe-E verifier. Joe-E is a static verifier for an object-capability subset of Java. If your Java code passes Joe-E, you can compile it, put it in your CLASSPATH, and load it with the normal Java class loader. If your Joe-E code uses any external classes, Joe-E statically checks that these classes (and the methods used on them) have been whitelisted as safe.

I had envisaged that CapPython would work in a similar way. Module imports would be checked against a whitelisted set. Uses of builtin functions would be checked: len would be allowed, but open would be rejected. CapPython code would be loaded through Python's normal module loader; code would be installed by adding it to PYTHONPATH as usual. (One difference from Joe-E is that CapPython would not be able to statically block the use of a method from a particular class or interface, because Python is not statically typed. CapPython can block methods only by their name.)

But there are some problems with this approach:

  • This provides a way to block builtins such as getattr and open, but not to replace them. We would want to change getattr to reject (at runtime) attribute names starting with "_". We could introduce a safe_getattr function, but we'd have to change code to import and use a differently named function.
  • It makes it hard to modify or replace modules.
  • In turn, that makes it hard to subset modules. Suppose you want to allow os.path.join, but not the rest of os. Doing this via a static check is awkward; it would have to block use of os as a first class value.
  • It makes it harder to apply rewriting to code, if it turns out that CapPython needs to be a rewriter and not just a verifier.
  • It would require reasoning about Python's module loader.
  • It's not clear who does the checking and who initiates the loading, so there could be a risk that the two are not operating on the same code.
  • It relies on global state like PYTHONPATH, sys.modules and the filesystem.
  • It doesn't let us instantiate modules multiple times. Sometimes you want to instantiate a module multiple times because it contains mutable state, or because you want the instantiations to use different imports. Some Python libraries have global state that would be awkward to remove, such as Twisted's default reactor (an event dispatcher). Joe-E rejects global state by rejecting static variables, but this is much harder to do in Python.

There is a simpler, more flexible approach, which is closer to what E and Caja/Cajita do: implement a custom module loader, and load all CapPython code through that. All imports that a CapPython module does would also go through the custom module loader.

It just so happens that Python provides enough machinery for that to work. Python's exec interface makes it possible to execute source code in a custom top-level scope, and that scope does not have to include Python's builtins if you set the special __builtins__ variable in the scope. Behind the scenes, Python's "import" statement is implemented via a function called __import__ which the interpreter looks up from a module's top-level scope as __builtins__["__import__"], so it can be replaced on a per-module basis.
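
Here is a rough sketch of that machinery; the whitelist is just for illustration, and the real module loader does rather more:

def restricted_import(name, *args, **kwargs):
    if name not in ("math",):  # a made-up whitelist
        raise ImportError("import of %r is blocked" % name)
    return __import__(name, *args, **kwargs)

safe_builtins = {"len": len, "__import__": restricted_import}
module_scope = {"__builtins__": safe_builtins}

exec("import math\nresult = math.pi * len('abc')", module_scope)
print(module_scope["result"])  # about 9.42
# exec("import os", module_scope) would raise ImportError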

Rather than trying to statically verify that Python's default global scope (which includes its default unsafe module importer and unsafe builtins) is used safely, we just substitute a safe scope - "scope substitution", as Mark Miller referred to it.

This is now implemented in CapPython. It has a "safe eval" function and a module loader. The aim is that the module loader will be able to run as untrusted CapPython code. Currently the module loader uses open to read files, so it has the run of the filesystem, but eventually that will be tamed.

I am surprised that custom module loaders are not used more often in Python.

I imagine Joe-E could work the same way, because Java allows you to create new class loaders.

Thursday 7 August 2008

Introducing CapPython

Python is not a language that provides encapsulation. That is, it does not enforce any difference between the private and public parts of an object. All attributes of an object are public from the language's point of view. Even functions are not encapsulated: you can access the internals of a function through the attributes func_closure, func_globals, etc.

However, Python has a convention for private attributes of objects which is widely used. It's written down in PEP 0008 (from 2001). Attributes that start with an underscore are private. (Actually PEP 0008 uses the term "non-public" but let's put that aside for now.)

CapPython proposes that we enforce this convention by defining a subset of Python to enforce it. The hope is that this subset could be an object-capability language. Hopefully we can do this in such a way that you can get encapsulation by default and still have fairly idiomatic Python code.

The core idea is that private attributes may only be accessed through "self" variables. (We have to expand the definition of "private attribute" to include attributes starting with "func_" and some other prefixes that are used for Python built-in objects.)

As an example, suppose we want to implement a read-only wrapper around dictionary objects:

class FrozenDict(object):
    def __init__(self, dictionary):
        self._dict = dictionary
    def get(self, key):
        return self._dict.get(key)
    # This is incomplete: there are other methods in the dict interface.
You can do this:
>>> d = FrozenDict({"a": 1})
>>> d.get("a")
1
>>> d.set("a", 2)
AttributeError
but the following code is statically rejected:
>>> d._dict
because _dict is a private attribute and d is not a "self" variable.

A self variable is a variable that is the first argument of a method function. A method function is a function defined on a class (with some restrictions to prevent method functions from escaping and being used in ways that would break encapsulation).

We also have to disallow all assignments to attributes (both public and private) except through "self". This is a harsher restriction. Otherwise a recipient of a FrozenDict could modify the object:

def my_function(key):
    return "Not the dictionary item you expected"
d.get = my_function
and the FrozenDict instance would no longer be frozen.

This scheme has some nice properties. As with lambda-style object definitions in E, encapsulation is enforced statically. No type checking is required; it's just a syntactic check. No run-time checks need to be added.

Furthermore, instance objects do not need to take any special steps to defend themselves; they are encapsulated by default. We don't need to wrap all objects to hide their private attributes (which is the approach that some attempts at a safer Python have taken). Class definitions do not need to inherit from some special base class. This means that TCB objects can be written in normal Python and passed into CapPython safely; they are defended by default from CapPython code.

However, class objects are not encapsulated by default. A class object has at least two roles: it acts as a constructor function, and it can be used to derive new classes. The new classes can access their instance objects' private attributes (which are really "protected" attributes in Java terminology - one reason why PEP 0008 does not use the word "private"). So you might want to make a class "final", as in not inheritable. One way to do that is to wrap the class so that the constructor is available, but the class itself is not:

class FrozenDict(object):
    ...
def make_frozen_dict(*args):
    return FrozenDict(*args)
The function make_frozen_dict is what you would export to other modules, while FrozenDict would be closely-held.
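
Code that holds only make_frozen_dict can still create instances, but it cannot subclass to get at the private attribute:

d = make_frozen_dict({"a": 1})
print(d.get("a"))  # 1
# Subclassing is no longer possible:
# class Evil(make_frozen_dict) raises a TypeError, because a function
# cannot be used as a base class.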

Maybe this wrapping should be done by default so that the class is encapsulated by default, but it's not yet clear how best to do so, or how the default would be overridden.

I have started writing a static verifier for CapPython. The code is on Launchpad. It is not yet complete. It does not yet block access to Python's builtin functions such as open, and it does not yet deal with Python's module system.

Tuesday 5 August 2008

Four Python variable binding oddities

Python has some strange variable binding semantics. Here are some examples.

Oddity 1: If Python were a normal lambda language, you would expect the expression x to be equivalent to (lambda: x)(). I mean x to be a variable name here, but you would expect the equivalence to hold if x were any expression. However, there is one context in which the two are not equivalent: class scope.

x = 1
class C:
    x = 2
    print x
    print (lambda: x)()
Expected output:
2
2
Actual output:
2
1
There is a fairly good reason for this.

Oddity 2: This is also about class scope. If you're familiar with Python's list comprehensions and generator expressions, you might expect list comprehensions to be just a special case of generators that evaluates the sequence up-front.

x = 1
class C:
    x = 2
    print [x for y in (1,2)]
    print list(x for y in (1,2))
Expected output:
[2, 2]
[2, 2]
Actual output:
[2, 2]
[1, 1]
This happens for a mixture of good reasons and bad reasons. List comprehensions and generators have different variable binding rules. Class scopes are somewhat odd, but they are at least consistent in their oddness. If list comprehensions and generators are brought into line with each other, you would actually expect to get this output:
[1, 1]
[1, 1]
Otherwise class scopes would not behave as consistently.

Oddity 3:

x = "top"
print (lambda: (["a" for x in (1,2)], x))()
print (lambda: (list("a" for x in (1,2)), x))()
Expected output might be:
(['a', 'a'], 'top')
(['a', 'a'], 'top')
Or if you're aware of list comprehension oddness, you might expect it to be:
(['a', 'a'], 2)
(['a', 'a'], 2)
(assuming this particular ordering of the "print" statements). But it's actually:
(['a', 'a'], 2)
(['a', 'a'], 'top')
If you thought that you can't assign to a variable from within an expression in Python, you'd be wrong. As far as the variable x is concerned, this expression:
[1 for x in [100]]
has the same effect as this statement:
x = 100
Oddity 4: Back to class scopes again.
x = "xtop"
y = "ytop"
def func():
    x = "xlocal"
    y = "ylocal"
    class C:
        print x
        print y
        y = 1
func()
Naively you might expect it to print this:
xlocal
ylocal
If you know a bit more you might expect it to print something like this:
xlocal
Traceback ... UnboundLocalError: local variable 'y' referenced before assignment
(or a NameError instead of an UnboundLocalError)
Actually it prints this:
xlocal
ytop
I think this is the worst oddity, because I can't see a good use for it. For comparison, if you replace "class C" with a function scope, as follows:
x = "xtop"
y = "ytop"
def func():
    x = "xlocal"
    y = "ylocal"
    def g():
        print x
        print y
        y = 1
    g()
func()
then you get:
xlocal
Traceback ... UnboundLocalError: local variable 'y' referenced before assignment
I find that more reasonable.

Why bother? These issues become important if you want to write a verifier for an object-capability subset of Python. Consider an expression like this:

(lambda: ([open for open in (1,2)], open))()
It could be completely harmless, or it might be so dangerous that it could give the program that contains it the ability to read or write any of your files. You'd like to be able to tell. This particular expression is harmless. Or at least it is harmless until a new release of Python changes the semantics of list comprehensions...