Region-based Memory Management

The Conseptizer programming language was originally called Polliwog, which stood for “POlliwog’s Like LIsp WithOut Gc”. Instead of a GC, it was supposed to use explicit region management. It quickly drifted away from Lisp a bit by becoming stack-oriented, and finally I decided that there are some things which are just so much more convenient with fully automatic memory management – especially when it comes to mutable state, and I certainly didn’t want to go the pure way.

I wrote an implementation that used a mark-and-sweep GC. But that hinders multi-threading and real-time applications (like games). Unless it’s a conservative GC, it also requires constant attention throughout the implementation.

I wrote another implementation, this time using reference counting. But it still requires constant attention everywhere in the C code. Since extending the implementation (e.g. by adding library bindings or performance-critical application parts) is a common thing to do, and messing it up happens really easily and results in really ugly bugs, I was not satisfied.

Now I have decided to go back to the roots, for the sake of simplicity and for the sake of trying something that, as far as I know, nobody has really tried so far (because we all seem to not care that much about simplicity?). Region-based management has its advantages and disadvantages. In this article, I want to talk about both.

But first of all: What exactly are those regions I am using now? A region is a place where we can allocate memory. An object allocated in a region cannot be deallocated on its own. Instead, we deallocate the region as a whole as soon as we know that we don’t need any of those objects anymore. So region-based memory management is a form of semi-automatic memory management.

The advantages. First of all, allocation is cheap: We basically just have to increment a pointer (and check whether we’re at the end of a block, in which case we add a new one to the region). Deallocation is ultra-cheap: We just have to add the blocks of a region to a list of free blocks.
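To make the mechanics concrete, here is a minimal C sketch of such an allocator. The names and the block size are mine, not necessarily how Conseptizer implements it, and it ignores alignment and objects larger than a block:

#include <stdlib.h>

/* Hypothetical block size; a real implementation may choose differently. */
#define BLOCK_SIZE (64 * 1024)

typedef struct Block {
    struct Block *next;
    char data[BLOCK_SIZE];
} Block;

typedef struct Region {
    Block *blocks;  /* blocks currently owned by this region */
    char *next;     /* bump pointer into the newest block */
    char *end;      /* end of the newest block */
} Region;

static Block *free_blocks = NULL;  /* blocks handed back by dead regions */

static Block *block_get(void) {
    if (free_blocks != NULL) {
        Block *b = free_blocks;
        free_blocks = b->next;
        return b;
    }
    return malloc(sizeof(Block));
}

void region_init(Region *r) {
    r->blocks = NULL;
    r->next = NULL;
    r->end = NULL;
}

/* Allocation: just bump a pointer; grab a fresh block when the
   current one is full. */
void *region_alloc(Region *r, size_t size) {
    if (r->next == NULL || (size_t)(r->end - r->next) < size) {
        Block *b = block_get();
        if (b == NULL) return NULL;
        b->next = r->blocks;
        r->blocks = b;
        r->next = b->data;
        r->end = b->data + BLOCK_SIZE;
    }
    void *p = r->next;
    r->next += size;
    return p;
}

/* Deallocation: hand all blocks of the region back to the free list. */
void region_free(Region *r) {
    while (r->blocks != NULL) {
        Block *b = r->blocks;
        r->blocks = b->next;
        b->next = free_blocks;
        free_blocks = b;
    }
    r->next = NULL;
    r->end = NULL;
}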

The implementation of this is trivial; every programmer can understand it (think about how it compares to the GC in a modern JVM in terms of complexity). There’s no problem with supporting soft real-time or multithreading. For global state, it’s like normal manual memory management (so global state is discouraged by the language), but most of the time, you won’t have to think about memory management, since it is semi-automatic: As long as your code is purely functional or only has some local side effects – which should be most of the time – it’s technically simple. All you need to do is insert a call to a certain subroutine in strategic places; this subroutine creates a new region, uses it while running a user-supplied piece of code, and deallocates the region afterwards. Oh, and it may also copy the return value of the user-supplied code back to the previous region (but that will also compact it like a copying GC, so that’s okay).
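In C, the shape of that strategic subroutine might look roughly like this, reusing the hypothetical region API from the sketch above; copy_to_region is a made-up stand-in for whatever deep-copies a value into another region:

/* Hypothetical: deep-copies a result value into dst, compacting it. */
void *copy_to_region(Region *dst, void *value);

typedef void *(*UserCode)(Region *r, void *arg);

void *with_region(Region *outer, UserCode code, void *arg) {
    Region r;
    region_init(&r);
    void *result = code(&r, arg);            /* run user code in the fresh region */
    result = copy_to_region(outer, result);  /* copy the result back out */
    region_free(&r);                         /* discard everything else at once */
    return result;
}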

Regions increase locality of reference: When we allocate two new objects, they are guaranteed to be next to each other in memory (except when we’re at the end of a block, but that’s unavoidable). In this case, “next to each other” is to be taken literally – there just is no management overhead involved, while malloc() usually puts a rather large header before each object (which is mostly unavoidable if you want to be able to deallocate objects of different sizes individually).
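With the hypothetical API sketched above, this is easy to see – two consecutive allocations return directly adjacent addresses:

Region r;
region_init(&r);
char *a = region_alloc(&r, 16);
char *b = region_alloc(&r, 16);
/* Now b == a + 16 (unless a block boundary was crossed):
   no per-object header sits between the two objects. */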

If regions didn’t have any disadvantages, they would certainly be used more widely (though I think they deserve more attention than they usually get). Since this approach is only semi-automatic, it requires some manual work, and thus it’s possible to get it wrong. It can create both leaks and dangling pointers, though the latter can only happen when mutable state crosses the boundaries of regions. In principle, there’s region inference to avoid such issues, which can be done by compilers, but that has its own problems I won’t dive into here, as it’s clearly at odds with our philosophy of extreme simplicity anyway.

Regions can also increase memory consumption, as we have to pre-allocate a full block for each region in use. This will waste space when using lots of small regions, so you have to find the right places to insert the region management in your code.

So we can see that regions are not the ultimate solution to our memory management headaches, but they are very simple and very efficient (actually, they tend to be the most efficient way of all). I think it is worthwhile to try using explicit regions in a language supporting higher-order programming; this way, we might get a good compromise of convenience on the outside and simplicity on the inside.

‘hello, world!’ – done right

I just wrote this C-code:

#include <stdio.h>

int main() {
    printf("hello, world!\n");
    return 0;
}

I compiled this with extra warnings enabled:

$ gcc -Wall -Wextra -o hello hello.c

According to the compiler, there is nothing wrong in this code. And it seems to work:

$ ./hello
hello, world!
$ echo $?
0

Now that I tell you that there is something wrong with it, it will probably be easy for you to figure it out. Yes, I’m talking about the missing check of the return value of printf. Here is a demonstration of the bug:

$ ./hello >/dev/full
$ echo $?
0

Now ask yourself how often you have been guilty of not checking the return value of I/O operations. I have to admit that I tend to ignore those by habit – which is bad.
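For comparison, here is one way the C program could actually catch the failure. Note that stdout is typically buffered, so the error from writing to /dev/full may only surface when the buffer is flushed:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (printf("hello, world!\n") < 0) {
        perror("printf");
        return EXIT_FAILURE;
    }
    /* The output may still sit in the stdio buffer; flush it to
       force the write out and catch errors such as a full device. */
    if (fflush(stdout) == EOF) {
        perror("fflush");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}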

The interesting thing about stack-based languages is that you cannot just ignore return values. So if you write this in Conseptizer:

say "hello, world!%<endl>"

There will be a boolean value left over on the stack – and you will be told about it. Well, you could explicitly ignore it by dropping it:

drop say "hello, world!%<endl>"

But we provide something much better. If you expect an operation to succeed, you can use a kind of assertion, which we call “do”:

do say "hello, world!%<endl>"

This will abort the program if the output statement fails. It is shorter than “drop”, so lazy programmers will use it and therefore do error checking. Yes, actually checking return values is less work than ignoring them – that’s probably true in hardly any other programming language. Also, it’s quite readable: The “do” emphasizes that the operation is something that has to be done successfully. (DO SAY HELLO WORLD, DAMMIT!!!)

So if you just want to write a script quickly and don’t want to write code for handling all the possible error conditions, this is a very convenient feature that still ensures correct behaviour: when something goes wrong, it will tell you exactly where (since it displays a stack trace).
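In C, one could approximate this “assert success” idea with a small macro – purely my own sketch, nothing Conseptizer provides:

#include <stdio.h>
#include <stdlib.h>

/* Abort with a source location if an I/O call reports failure. */
#define DO(expr) \
    do { \
        if ((expr) < 0) { \
            fprintf(stderr, "FAILED at %s:%d: %s\n", \
                    __FILE__, __LINE__, #expr); \
            abort(); \
        } \
    } while (0)

int main(void) {
    DO(printf("hello, world!\n"));
    DO(fflush(stdout));  /* fflush returns EOF (negative) on error */
    return 0;
}

Run with output redirected to /dev/full, this version aborts loudly instead of pretending everything went fine.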

Infix syntax

For reasons which I will probably never understand, many people want to use this clumsy thing called infix notation for their arithmetic expressions. I find prefix notation with line-breaks and indentation (and without parens) more readable, but in case somebody really is addicted to infix notation, here is how to simulate it in Conseptizer: We will encode the expression

 (10 - 5) * (6 / 3)

First, we define _ for convenience:

_: $call

Now we can write an infix expression that is just slightly more clumsy than usual. 😉

_(10; - 5); * _(6; / 3)

The Forth philosophy (maybe :P)

While I have never really programmed in Forth, I put some energy into understanding what it’s all about.  Here is a thing that took me a while to get:

I was reading somewhere that when asked how to write a “Hello world” program in Forth, you might get a reply like: “It’s a bad idea to do this. If you need to see the text ‘Hello world’ on your screen, it’s simpler to just type it in instead of writing a program to display it.” Um, this seemed to miss the point of a “Hello world” program completely, as its purpose is to show some aspects of a programming language.

But you know what? I think it does not miss the point. Thinking that it does means missing the point of the answer. That reply *is* the “Hello world” of Forth, because Forth does not start with code. It starts with simplifying things by questioning assumptions.

How to create a memory leak

Many people believe that you cannot have a memory leak in garbage collected programming languages. This is false. Look at this piece of Conseptizer code:

foo: | L
((len L))

This corresponds to the following Scheme code:

(define (foo lst)
  (lambda ()
    (length lst)))

The closure keeps a reference to the list, even though it doesn’t need to (unless we want to track mutations performed on the list). To avoid recalculating the length of the list each time, we can refactor this – but that will not fix the leak:

foo: | L
(N: len L;
 (N))

This corresponds to the following Scheme code:

(define (foo lst)
  (let ((n (length lst)))
    (lambda ()
      n)))

I do not know how Scheme implementations handle this, but I guess some of them will also allow you to implement a leak this way.

So what’s the leak? Well, the closure that gets returned by ‘foo’ still has the list that was passed to ‘foo’ in its environment, even though we don’t need that binding anymore. At least the list will get reclaimed as soon as the closure gets reclaimed (given that the list isn’t used anywhere else).

Since this just makes an existing object (the closure) “larger”, in the sense that it contains the list, but we get rid of the list when the closure gets collected, this kind of leak in itself isn’t the worst: the memory will eventually be reclaimed. It’s more problematic when we leak more and more memory over time, so that our program requires continuously more and more space when we would actually expect it to be of constant size. Is this possible in a garbage-collected language as well?

Well, yes, of course – if the lists passed to our ‘foo’ subroutine contain closures returned by ‘foo’. (And also if we do something dumb like consing alist entries onto the front of a list instead of mutating existing items when they already exist.)

When reading about this topic on the web, you’ll find that there are more and better techniques for leaking memory in object-oriented languages like Java, but I haven’t used OOP for years now, so I can’t comment on that. (It’s not that I have anything against OOP, I’m just more interested in things which are less mainstream; that’s true for many areas, not only with respect to programming languages.)

But how do we fix the memory leak?

This can be fixed in two ways: The first one is to set the variable referring to the list to nil (we use ^ to update variables, while : introduces them):

foo: | L
(N: len L;
 L^ nil;
 (N))

This corresponds to the following Scheme code:

(define (foo lst)
  (let ((n (length lst)))
    (set! lst '())
    (lambda ()
      n)))

That’s how you usually fix memory leaks in garbage-collected languages. But we can do better in this case by taking advantage of the tacit nature of Conseptizer:

foo:
(N: len; (N))

This is another reason why tacit programming is a good idea: By avoiding local variables, it avoids bindings in closures that might cause memory leaks!