Corran Webster

Do Not Call Up That Which You Cannot Put Down

2023-02-18

This admonition, which I first ran across in Charles Stross' excellent Laundry Files series (but which seems to have originated in the much more problematic work of H.P. Lovecraft), is all about making sure that when you summon an Eldritch Abomination to do your bidding, you can also send it back whence it came.

This is also true of resources that you use in software.

In the world of C programming this tends to be something that programmers care about because C forces you to: you have to worry about deallocating the memory that you have allocated; you have to ensure that you close the files and sockets you open; you have to make sure that you close GUI windows when the application shuts down; and so forth. If you get it wrong, things go bad quickly, and frequently terminally.

But high-level languages like Python are much more forgiving: they will deallocate objects for you automatically, and they will close files and sockets for you when the objects go out of scope. Not having to worry about those things is part of what makes Python so approachable and powerful - developers can spend their mental cycles on the actual domain problem they want to solve rather than on memory management.

But there are edge cases - thankfully rare - where the automation isn't quite right. For example, let's say I want to count the words in a bunch of file-like objects:

from collections import Counter
from typing import Iterable, TextIO

def count_words(files: Iterable[TextIO]) -> Counter:
    word_count = Counter()
    for file in files:
        word_count.update(file.read().split())
    return word_count

This, on the face of it, seems reasonable, particularly when written like this with the generic expectation of a typing.TextIO rather than a concrete file. But the first time you try to run it across the contents of a large directory, it will blow up:

>>> import glob
>>> count_words([open(path) for path in glob.glob('*.txt')])

Your operating system has a limit on how many files can be open at once, and if there are more files than that in your directory, your program will crash. Fortunately Python gives you ways to overcome this problem: you can control the lifetime of the file objects using a generator, explicitly call close() when you are done with a file, or use a with statement to control the clean-up of the file.
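
For example, a generator can ensure that only one file is ever open at a time, by controlling exactly when each file is opened and closed. A minimal sketch (open_files is a hypothetical helper):

import glob

def open_files(pattern):
    # each file is opened lazily, and closed again as soon as the
    # consumer advances to the next one
    for path in glob.glob(pattern):
        with open(path) as f:
            yield f

count_words(open_files('*.txt'))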

In other words, in Python, file objects let you put down what you've called up when you need to.

You run into these sorts of corner cases in particular when:

- the resource is scarce or limited by the operating system, such as file handles or sockets;
- the resource is some sort of shared or global state, such as a database connection; or
- object lifetimes are hard to predict, such as in an interactive shell or notebook.

But the Python standard library is pretty good about giving you the tools you need to control resource lifetimes in those cases where you do need to worry, and the Python community has generally adopted patterns which minimize this sort of problem: people will do things like

with open(path) as f:
    ...

even when there is no danger of exhausting all available file handles.

Where things can get difficult is when you start using APIs from third-party libraries, or writing your own. Because Python garbage collection is generally good, and the corner cases uncommon, many APIs will allocate a resource when an object is initialized but then rely on garbage collection to handle clean-up, trusting that everything will just Go Away on its own. This is particularly problematic when the resource is some sort of shared or global state.

Consider a simple CRUD service for characters in an RPG game, backed by a DBAPI2-style database API. A simple implementation might look something like this:

from sqlite3 import Connection, connect  # or any other DBAPI2 implementation

class CharacterStore:

    _connection: Connection

    def __init__(self, location):
        self._connection = connect(location)
        self._connection.row_factory = self._row_factory
        self._connection.execute(TABLE_QUERY)

    def create(self, character: Character):
        cursor = self._connection.cursor()
        cursor.execute(CREATE_QUERY, character.asdict())
        self._connection.commit()
        return character

    def read(self, name: str) -> Character:
        cursor = self._connection.cursor()
        result = cursor.execute(READ_QUERY, {'name': name})
        character = result.fetchone()
        if character is None:
            raise KeyError(name)
        return character

    def update(self, character: Character):
        cursor = self._connection.cursor()
        cursor.execute(UPDATE_QUERY, character.asdict())
        self._connection.commit()

    def delete(self, character: Character):
        cursor = self._connection.cursor()
        cursor.execute(DELETE_QUERY, character.asdict())
        self._connection.commit()

    def _row_factory(self, cursor, row) -> Character:
        return Character(*row)

assuming an appropriate Character dataclass. This will work fine in 99% of cases, but there is a problem: the database connection is only disposed of implicitly, when the CharacterStore instance gets garbage collected.
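
For concreteness, a minimal sketch of such a dataclass might look like the following (any fields beyond name are hypothetical):

from dataclasses import dataclass, asdict

@dataclass
class Character:
    name: str
    level: int = 1

    def asdict(self):
        # matches the character.asdict() calls in the store above
        return asdict(self)

When working in an IPython shell, however, garbage collection can be unpredictable: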

In [1]: c = CharacterStore('store.sqlite')
In [2]: c
Out[2]: <character_store.CharacterStore object at 0xdeadbeef>
In [3]: del c  # note: IPython keeps a reference!
In [4]: c2 = CharacterStore('store.sqlite')

and now we unexpectedly have two open connections to the same database! If you were in a Jupyter notebook, you could even end up in a situation where each time you re-run a cell you open yet another connection.

This can easily be fixed in code that you control by adding a close() method that performs any needed clean-up. For example:

class CharacterStore:
    ...
    def close(self):
        if self._connection is not None:
            self._connection.close()
        self._connection = None

which lets you do things like:

from contextlib import closing
with closing(CharacterStore('store.sqlite')) as c:
    for character in characters:
        c.update(character)

and guarantee that everything gets cleaned up.
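
If you control the class, you can go one step further and support the with statement directly by implementing the context manager protocol - a sketch, building on the close() method above:

class CharacterStore:
    ...
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # make sure the connection is released even if an error occurred
        self.close()

which lets you drop the contextlib.closing wrapper:

with CharacterStore('store.sqlite') as c:
    ...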

But what do you do when you don't control the API? In these cases you really want to make sure that you control the lifetime of the problematic objects. One way to do this is to wrap them in a proxy object which holds the only reference and which has a close() method (or equivalent functionality) that drops the reference to the object:

class SafeCharacterStore:
    _character_store: CharacterStore

    def __init__(self, *args, **kwargs):
        # bypass our own __setattr__ so the reference is stored on
        # the proxy itself rather than forwarded
        object.__setattr__(self, '_character_store', CharacterStore(*args, **kwargs))

    def __getattr__(self, name):
        # only called when normal lookup fails, so this forwards
        # everything that isn't defined on the proxy itself
        if self._character_store is not None:
            return getattr(self._character_store, name)
        raise AttributeError(name)

    def __setattr__(self, name, value):
        if self._character_store is not None:
            setattr(self._character_store, name, value)

    def __delattr__(self, name):
        if self._character_store is not None:
            delattr(self._character_store, name)

    def close(self):
        # again bypass __setattr__, so that we drop our reference
        # instead of setting an attribute on the wrapped store
        object.__setattr__(self, '_character_store', None)

This isn't perfect because someone can still access the private attribute, but at least they know what they're letting themselves in for! And of course it doesn't have to be a pure proxy for the other class - it could be your own service, with its own API, which just ensures that everything gets cleaned up nicely.
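
For example, used with the closing helper from earlier, the proxy fails loudly rather than quietly leaking a connection (the character name here is hypothetical):

from contextlib import closing

with closing(SafeCharacterStore('store.sqlite')) as store:
    conan = store.read('Conan')

store.read('Conan')  # raises AttributeError: the reference has been dropped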

So, to sum up:

- if your code calls up a resource, make sure there is a way to put it down again;
- give your own APIs a close() method, or better yet context manager support, rather than relying on garbage collection - especially for shared or global state;
- and when a third-party API doesn't give you that control, wrap the problematic objects in a proxy that does.

Do not call up that which you cannot put down.