Made of Bugs

Write Yourself an Strace in 70 Lines of Code

Basically anyone who's used Linux for any amount of time eventually comes to know and love the strace command. strace is the system-call tracer, which traces the calls that a program makes into the kernel in order to interact with the outside world. If you're not already familiar with this incredibly versatile tool, I suggest you go check out my friend and coworker Greg Price's excellent blog post on the subject, and then come back here.

We all love strace, but have you ever wondered how it works? How does it interject itself between the kernel and the userspace program? This post will walk through a minimal implementation of strace in about 70 lines of C. It won't be nearly as functional as the real thing, but in the process you'll learn most of what you need to know about the core interfaces it uses.

On Linux (and probably some other UNIXes) strace uses a somewhat arcane interface known as ptrace, the process-tracing interface. ptrace allows one process to monitor the status of another process, and to inspect (or even manipulate) its internal state.

ptrace is a complex system call, taking a magic "request" first parameter, and doing completely different things depending on its value. Its general prototype looks like:

long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);

However, because different values of request use anywhere from zero to three of the remaining parameters, glibc prototypes it as a varargs function, allowing a developer to list only as many parameters as a given call needs.

In order for one process to trace another, it attaches to that process, and temporarily becomes that process's parent. When a process is ptraced, the tracer can ask for the child to stop whenever various events happen, such as the child making a system call. When this happens, the kernel will stop the child with SIGTRAP. Since the tracer is now the child's parent, it can thus watch for this using the standard UNIX waitpid system call.

Our miniature strace will only support the strace COMMAND form of strace (as opposed to strace -p), and we'll only print syscall numbers and return values – no decoding of names or arguments or anything. So a sample run might look like:

$ ./ministrace ls
…
syscall(6) = 0
syscall(54) = 0
syscall(54) = 0
syscall(5) = 3
syscall(221) = 1
syscall(220) = 272
syscall(220) = 0
syscall(6) = 0
syscall(197) = 0
syscall(192) = -1219706880
…

Not the most useful thing in the world, but it shows off the core tracing tools. So, let's see the code:

#include <sys/ptrace.h>
#include <sys/reg.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

We start with the necessary headers. sys/ptrace.h defines ptrace and the __ptrace_request constants, and we'll need sys/reg.h to help decode system calls. More about that later. Everything else you should probably recognize.

int do_child(int argc, char **argv);
int do_trace(pid_t child);

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "Usage: %s prog args\n", argv[0]);
        exit(1);
    }

    pid_t child = fork();
    if (child == 0) {
        return do_child(argc-1, argv+1);
    } else {
        return do_trace(child);
    }
}

We'll start with the entry point. We check that we were passed a command, and then we fork() to create two processes – one to execute the program to be traced, and the other to trace it.

int do_child(int argc, char **argv) {
    char *args [argc+1];
    memcpy(args, argv, argc * sizeof(char*));
    args[argc] = NULL;

The child starts with some trivial marshalling of arguments, since execvp wants a NULL-terminated argument array.

    ptrace(PTRACE_TRACEME);
    kill(getpid(), SIGSTOP);
    return execvp(args[0], args);
}

Next, we just execute the provided argument list, but first, we need to start the tracing process, so that the parent can start tracing the newly-executed program from the very start.

If a child knows that it wants to be traced, it can make the PTRACE_TRACEME ptrace request, which starts tracing. In addition, it means that the next signal sent to this process will stop it and notify the parent (via wait), so that the parent knows to start tracing. So, after doing a TRACEME, we SIGSTOP ourselves, so that the parent can continue our execution with the exec call.

(You might have noticed that strace COMMAND output always starts with an execve call. Now you should understand why – we're actually going to start tracing immediately after the kill returns, so we see the execve call that starts the new program).
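To see that handshake in isolation, here's a minimal sketch that skips the exec entirely. The helper name observe_traceme_stop is made up for illustration, and it assumes the environment lets a process trace its own children. The child asks to be traced and stops itself, and the parent reports which signal it saw the child stop with.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that requests tracing and then stops
 * itself, and return the signal the parent sees it stopped with.
 * Returns -1 if the handshake did not happen. */
int observe_traceme_stop(void) {
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Child: ask to be traced by our parent... */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);            /* tracing is not permitted here */
        /* ...then stop, so the parent gets a chance to take over. */
        raise(SIGSTOP);
        _exit(0);                  /* reached only if the parent resumes us */
    }

    /* Parent: the first wait returns once the child has stopped itself. */
    int status;
    if (waitpid(child, &status, 0) == -1)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;                 /* the child exited instead of stopping */

    int sig = WSTOPSIG(status);

    /* Clean up: kill the stopped child and reap it. */
    kill(child, SIGKILL);
    waitpid(child, &status, 0);
    return sig;
}
```

If everything works, calling observe_traceme_stop() should return SIGSTOP. In the real program the parent would resume the child at this point rather than kill it.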

int wait_for_syscall(pid_t child);

int do_trace(pid_t child) {
    int status, syscall, retval;
    waitpid(child, &status, 0);

In the parent, meanwhile, we prototype a function we'll need later, and start tracing. We immediately waitpid on the child, which will return once the child has sent itself the SIGSTOP above, and is ready to be traced.

    ptrace(PTRACE_SETOPTIONS, child, 0, PTRACE_O_TRACESYSGOOD);

I mentioned earlier that ptrace turns basically all events into a SIGTRAP on the child. This is inconvenient because it means that when you see the child has stopped due to SIGTRAP, there's no good way to know which of several possible reasons it stopped for.

PTRACE_SETOPTIONS lets us set a number of options for how we want to trace the child. We use it here to set PTRACE_O_TRACESYSGOOD, which means that when the child stops for a syscall-related reason, we'll actually see it stopped with signal number SIGTRAP | 0x80, so we can easily distinguish syscall stops from other stops. Since we only care about syscalls for the purposes of this demo, this is very convenient.

    while(1) {
        if (wait_for_syscall(child) != 0) break;

Now we enter the tracing loop. wait_for_syscall, defined below, will run the child until either entry to or exit from a system call. If it returns non-zero, the child has exited and we end the loop.

        syscall = ptrace(PTRACE_PEEKUSER, child, sizeof(long)*ORIG_EAX);
        fprintf(stderr, "syscall(%d) = ", syscall);

Otherwise, though, we know that the child is entering a system call, and so we need to decode the system call number (and potentially arguments, if this were a less toy example). The PTRACE_PEEKUSER ptrace request reads a word of data from the child's "user area", which is a logical area that holds all of its registers and other internal non-memory state. On i386, the syscall number lives in %eax. For various technical reasons, however, the kernel has already clobbered the child's %eax at this point, but it saves the original value at a different offset, ORIG_EAX, which comes from sys/reg.h.

        if (wait_for_syscall(child) != 0) break;

Once we have the syscall number, we wait_for_syscall again, which should leave us stopped at the syscall return.

        retval = ptrace(PTRACE_PEEKUSER, child, sizeof(long)*EAX);
        fprintf(stderr, "%d\n", retval);

Return values on i386 are also passed in %eax, so this time we can read it directly and print the return value, and then return to the top of the loop to wait for the next syscall.

    }
    return 0;
}

And once the child exits, we just return.
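One caveat if you try this on a 64-bit machine: ORIG_EAX and EAX are the i386 names. On x86-64, sys/reg.h calls the corresponding slots ORIG_RAX and RAX. Here's a rough sketch of hiding that difference behind two helpers. The function names are my own invention, and architectures other than these two are simply reported as unsupported.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <sys/reg.h>
#endif

/* Byte offset in the user area of the saved syscall number,
 * or -1 if this sketch does not know the architecture. */
long syscall_nr_offset(void) {
#if defined(__x86_64__)
    return (long)sizeof(long) * ORIG_RAX;
#elif defined(__i386__)
    return (long)sizeof(long) * ORIG_EAX;
#else
    return -1;
#endif
}

/* Byte offset in the user area of the register holding the return value,
 * or -1 if this sketch does not know the architecture. */
long retval_offset(void) {
#if defined(__x86_64__)
    return (long)sizeof(long) * RAX;
#elif defined(__i386__)
    return (long)sizeof(long) * EAX;
#else
    return -1;
#endif
}
```

With these, the two reads in the loop would become ptrace(PTRACE_PEEKUSER, child, syscall_nr_offset()) and ptrace(PTRACE_PEEKUSER, child, retval_offset()). On a 64-bit build you'd also want to keep the results in a long rather than an int, so return values like pointers aren't truncated.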

int wait_for_syscall(pid_t child) {
    int status;
    while (1) {
        ptrace(PTRACE_SYSCALL, child, 0, 0);

wait_for_syscall is a simple helper function. We continue the child using PTRACE_SYSCALL, which allows a stopped child to continue executing until the next entry to or exit from a system call.

        waitpid(child, &status, 0);

We then waitpid to wait for something interesting to happen in the child.

        if (WIFSTOPPED(status) && WSTOPSIG(status) & 0x80)
            return 0;

Because of the PTRACE_O_TRACESYSGOOD we set above, we can detect a syscall stop by checking if the child was stopped by a signal with the high bit set. If so, we return.

        if (WIFEXITED(status) || WIFSIGNALED(status))
            return 1;
    }
}

If the child exited or was killed by a signal, we're done here; otherwise, it stopped for some reason we don't care about (like an execve, for instance), and so we loop to start it again until it hits a syscall.
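If it helps to see all the cases a wait status can fall into side by side, here's a small sketch that sorts a status into buckets. The enum, classify_status and status_of_child are names I made up for illustration. The last helper just forks a throwaway child so there's a real status to look at, without involving ptrace.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum stop_kind {
    CHILD_EXITED,    /* normal exit */
    CHILD_KILLED,    /* terminated by a signal */
    SYSCALL_STOP,    /* stopped with SIGTRAP | 0x80 (needs PTRACE_O_TRACESYSGOOD) */
    OTHER_STOP,      /* stopped for any other reason */
    UNKNOWN_STATUS
};

/* Sort a status returned by waitpid into one of the buckets above. */
enum stop_kind classify_status(int status) {
    if (WIFEXITED(status))
        return CHILD_EXITED;
    if (WIFSIGNALED(status))
        return CHILD_KILLED;
    if (WIFSTOPPED(status)) {
        if (WSTOPSIG(status) == (SIGTRAP | 0x80))
            return SYSCALL_STOP;
        return OTHER_STOP;
    }
    return UNKNOWN_STATUS;
}

/* Hypothetical demo helper: fork a child that either kills itself with
 * `sig` (if non-zero) or exits with `code`, and return its wait status.
 * Returns -1 if fork or waitpid fails. */
int status_of_child(int code, int sig) {
    int status = 0;
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        if (sig != 0)
            raise(sig);
        _exit(code);
    }
    if (waitpid(child, &status, 0) == -1)
        return -1;
    return status;
}
```

For example, classify_status(status_of_child(3, 0)) lands in CHILD_EXITED, while a child that raises SIGKILL on itself lands in CHILD_KILLED.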

And that's all there is to it. You can find the version I just posted on github if you want to download and try it out.

Making it more useful

While it works, the previous version isn't exactly what I'd call particularly helpful. You have to decode the syscall numbers by hand, and you don't get any syscall arguments.

It's a little long to include in the post, but I've pushed a slightly more functional version to master in the same github repository. It includes a Python script to scan the Linux source to pick up syscall numbers and argument counts and types, and it knows how to decode string arguments, so that you can see filenames and read and write data.
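To give a flavor of what decoding looks like, here's a tiny hand-written lookup. To be clear, this is not the generated table from the repository, just a sketch covering a handful of syscalls, and syscall_name and format_syscall are names I've made up. It leans on the SYS_* constants from sys/syscall.h, which describe the architecture the tracer itself was compiled for, so it assumes the tracer and the traced program use the same syscall numbering.

```c
#include <stdio.h>
#include <sys/syscall.h>

/* Return a name for a few well-known syscall numbers, or NULL if the
 * number isn't in this (deliberately tiny) table. */
const char *syscall_name(long nr) {
    switch (nr) {
    case SYS_read:       return "read";
    case SYS_write:      return "write";
    case SYS_close:      return "close";
    case SYS_execve:     return "execve";
    case SYS_exit_group: return "exit_group";
    default:             return NULL;
    }
}

/* Write either the syscall's name or a "syscall(N)" fallback into buf. */
void format_syscall(char *buf, size_t len, long nr) {
    const char *name = syscall_name(nr);
    if (name != NULL)
        snprintf(buf, len, "%s", name);
    else
        snprintf(buf, len, "syscall(%ld)", nr);
}
```

The tracing loop could then print the formatted buffer instead of the bare number, falling back to the old syscall(N) form for anything the table doesn't know about.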

Reading arguments is easy – on i386, they're passed in registers, so it's just another PTRACE_PEEKUSER for each argument. Perhaps the most interesting piece is the read_string function, which is used to read a NULL-terminated string from the child process. (Of course, NULL-terminated isn't quite right – the real strace knows about the count arguments to read() and write(), for instance. But it's close enough for a demo):

char *read_string(pid_t child, unsigned long addr) {

read_string takes a child to read from, and the address of the string it's going to read.

    char *val = malloc(4096);
    int allocated = 4096, read = 0;
    unsigned long tmp;

We need some variables. A buffer to copy the string into, counters of how much data we've copied and allocated, and a temporary variable for reading memory.

    while (1) {
        if (read + sizeof tmp > allocated) {
            allocated *= 2;
            val = realloc(val, allocated);
        }

We grow the buffer if necessary. We read data one word at a time.

        errno = 0;
        tmp = ptrace(PTRACE_PEEKDATA, child, addr + read);
        if(errno != 0) {
            val[read] = 0;
            break;
        }

PTRACE_PEEKDATA returns a word of data from the child at the specified offset. Because it uses its return for the value, we need to clear errno before the call and check it afterwards to tell if it failed. If it did (perhaps because the child passed an invalid pointer), we just return the string we've got so far, making sure to add our own NULL at the end.

        memcpy(val + read, &tmp, sizeof tmp);
        if (memchr(&tmp, 0, sizeof tmp) != NULL)
            break;
        read += sizeof tmp;

Then it's a simple matter of appending the data we read, and breaking out if we found a terminating NULL, or else looping to read another word.

    }
    return val;
}
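If you want to poke at the word-at-a-time logic without a traced child handy, one option is to pull the memory access out behind a callback. The sketch below is a variation on read_string, not the version in the repository. The word_reader type, read_string_with, peek_word, struct region and read_local_word are all names invented here. peek_word wraps PTRACE_PEEKDATA with the errno check, while read_local_word reads from our own memory and fails outside a given range, which stands in for a bad pointer.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Hypothetical callback type: store the word at `addr` in *out and
 * return 0, or return -1 if the read failed. */
typedef int (*word_reader)(void *ctx, unsigned long addr, unsigned long *out);

/* Same loop as read_string, but reading through the callback. */
char *read_string_with(word_reader read_word, void *ctx, unsigned long addr) {
    size_t allocated = 64, used = 0;
    unsigned long tmp;
    char *val = malloc(allocated);
    if (val == NULL)
        return NULL;

    for (;;) {
        /* Keep room for one more word plus a terminator. */
        if (used + sizeof tmp + 1 > allocated) {
            char *bigger = realloc(val, allocated * 2);
            if (bigger == NULL) {
                free(val);
                return NULL;
            }
            val = bigger;
            allocated *= 2;
        }
        if (read_word(ctx, addr + used, &tmp) != 0) {
            val[used] = '\0';      /* return what we have so far */
            break;
        }
        memcpy(val + used, &tmp, sizeof tmp);
        if (memchr(&tmp, 0, sizeof tmp) != NULL)
            break;                 /* the word contained the terminator */
        used += sizeof tmp;
    }
    return val;
}

/* Reader backed by ptrace: ctx points at the traced child's pid. */
int peek_word(void *ctx, unsigned long addr, unsigned long *out) {
    pid_t child = *(pid_t *)ctx;
    errno = 0;                     /* -1 is also a valid word, so rely on errno */
    long word = ptrace(PTRACE_PEEKDATA, child, (void *)addr, NULL);
    if (word == -1 && errno != 0)
        return -1;
    *out = (unsigned long)word;
    return 0;
}

/* Reader backed by our own memory, limited to [start, end). */
struct region { unsigned long start, end; };

int read_local_word(void *ctx, unsigned long addr, unsigned long *out) {
    const struct region *r = ctx;
    if (addr < r->start || addr + sizeof *out > r->end)
        return -1;                 /* simulate an invalid pointer */
    memcpy(out, (const void *)addr, sizeof *out);
    return 0;
}
```

In the tracer you'd call read_string_with(peek_word, &child, addr). With read_local_word and a struct region describing a local buffer, the same function can be exercised in an ordinary test, including the case where the string runs off the end of readable memory.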