swap_8_and_9: How a simple import can modify the Python interpreter

Published August 7th, 2023

Summary: I present a simple Python module that, once imported, swaps the values of two integers. The purpose is to (a) show that modules can do harmful and unexpected things, (b) give an example of how to write a simple module in C, and (c) introduce a few parts of the CPython source code.

Try the following:

$ git clone https://github.com/kts/swap_8_and_9.git
$ cd swap_8_and_9
$ pip install .
$ python
>>> print(8)
8
>>> import swap_8_and_9
>>> print(8)
9
>>> print(9)
8
>>> print(list(range(10)))
[0, 1, 2, 3, 4, 5, 6, 7, 9, 8]
    

As the name implies and the example demonstrates, simply importing the module swap_8_and_9 swaps the values of the two integers 8 and 9.

This can come as a surprise if you view Python’s import as simply a way to introduce some name(s) into the local namespace. In fact, import can execute arbitrary code and some code can cause unexpected (or unwanted) behavior in the Python interpreter.
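
To make the point about import concrete without any C, here is a minimal pure-Python sketch. The module name side_effect_demo and the attribute touched_by_import are hypothetical and exist only for this illustration: the module is written to a temporary directory, imported, and its top-level code quietly modifies builtins.

```python
import builtins
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical names, used only for this demonstration.
MODULE_NAME = "side_effect_demo"
FLAG = "touched_by_import"

# Top-level statements in a module run as soon as it is imported.
SOURCE = f"import builtins\nbuiltins.{FLAG} = True\n"


def import_demo_module():
    """Write the demo module to a temporary directory and import it."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, MODULE_NAME + ".py").write_text(SOURCE)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            importlib.import_module(MODULE_NAME)
        finally:
            # Undo the path and module-cache changes made for the demo.
            sys.path.remove(tmp)
            sys.modules.pop(MODULE_NAME, None)


before = hasattr(builtins, FLAG)
import_demo_module()
after = hasattr(builtins, FLAG)
print("flag set before import:", before)
print("flag set after import: ", after)

# Clean up so the demo leaves builtins as it found it.
delattr(builtins, FLAG)
```

The import statement itself asked for nothing but a name, yet global interpreter state changed as a side effect.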

The rest of this page describes why this works and how it is written: how to write a simple Python module in C, and the parts of the CPython source code that make the swap possible.


1. Writing a C Module

The swap_8_and_9 module is an example of a very simple Python module written using only C. You only need two files: setup.py and the C source, which I’ve put in a subdirectory, src/swap_8_and_9.c.

1.1. setup.py

# setuptools replaces distutils, which was removed in Python 3.12
from setuptools import setup, Extension

name = 'swap_8_and_9'
setup(name = name,
      version = '0.1.0',
      description = 'Proof of concept that swaps ints 8 and 9 on import',
      ext_modules = [
          Extension(name, sources = ['src/' + name + '.c']),
      ])

1.2. C source file

Our main code file, swap_8_and_9.c, is pretty simple because this module has no functions, no classes, no constants, etc. We only need to write the definition of the module and write the function that will be called when it is imported.

The file needs three things ({name} is our module name):

  1. List of methods (used in the module definition): static PyMethodDef {name}_methods[] = {...}; (this is an empty list in our case).

  2. Module definition: static struct PyModuleDef {name}_definition = {...};

  3. Init function: PyMODINIT_FUNC PyInit_{name}(void) {...}

#include <Python.h>

/*
  methods in Module
 */
static PyMethodDef swap_8_and_9_methods[] = {
  //marks end:
  {NULL, NULL, 0, NULL}
};


/*
  define Module
*/ 
static struct PyModuleDef swap_8_and_9_definition = { 
    PyModuleDef_HEAD_INIT,
    "swap_8_and_9", /* module name */

    /* __doc__ string: */
    "C Module that...",

    /* size of per-interpreter state of the module,
       or -1 if the module keeps state in global variables. */
    -1,

    swap_8_and_9_methods
};


/*
  initialize Module
  - called on import
  - PyInit_{name} ({name} must match the extension name in setup.py)
*/ 
PyMODINIT_FUNC PyInit_swap_8_and_9(void) {
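    /* The interpreter is already initialized when a module is imported,
       so this call is a no-op here. */
    Py_Initialize();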

    //CODE HERE
        
    return PyModule_Create(&swap_8_and_9_definition);
}
        

We will write something in the //CODE HERE part after we learn a little about some CPython internals.

1.3. Install / Build

With these two files in place, you should be able to run pip install . to install the module from the current directory.

Another useful command you can run is python setup.py build. This will show the compilation commands being run. (Note that if it is already built and the files haven’t changed, it won’t do anything).

On my MacBook, the output looks like this after reformatting (I replaced many arguments with ..., mainly -I PATH, -L PATH, and some other flags):

# 1. compile
clang ... \
    -c src/swap_8_and_9.c \
    -o build/temp.macosx-12.6-arm64-cpython-311/src/swap_8_and_9.o
    
# 2. link
clang ... \
    build/temp.macosx-12.6-arm64-cpython-311/src/swap_8_and_9.o \
    -o build/lib.macosx-12.6-arm64-cpython-311/swap_8_and_9.cpython-311-darwin.so
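
The platform-specific tail of that .so filename is not arbitrary. As a small sketch (the variable names are mine), the interpreter can report which extension-module suffix it expects for a build like the one above:

```python
import importlib.machinery
import sysconfig

# Suffix that this interpreter's build configuration uses for C extensions,
# e.g. ".cpython-311-darwin.so" for the build shown above.
ext_suffix = sysconfig.get_config_var("EXT_SUFFIX")

# File name the import system would look for when importing swap_8_and_9.
expected_filename = "swap_8_and_9" + ext_suffix

print("extension suffix: ", ext_suffix)
print("expected filename:", expected_filename)

# All suffixes the import machinery accepts for extension modules.
print("accepted suffixes:", importlib.machinery.EXTENSION_SUFFIXES)
```

The exact strings depend on the platform and Python version, so the output will differ from machine to machine.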
             

2. Some CPython internals

CPython refers to the standard Python implementation. It is written in C and can be browsed on GitHub.

On this page, I am using CPython version 3.11.4 and the code descriptions below will link to the github source code for this version.

For the CPython code that I’ll describe, I have noticed significant reorganization between versions 3.8.6 and 3.11.4. However, the concepts remain the same.

2.1. Cached integers

In CPython, everything (integers, strings, dictionaries, functions, etc.) is an "object", represented by a C struct named PyObject. Common to all objects is the storage of the object’s type (along with the data needed for that type) and a reference count used for garbage collection.
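
Both pieces of per-object bookkeeping are visible from Python. The following is a small sketch, not a description of the C layout: it uses type() to read an object’s type and sys.getrefcount() to watch the reference count change when a second name is bound to the same object. The absolute counts are an implementation detail, so only the difference is meaningful here.

```python
import sys

value = [1, 2, 3]

# Every object records its type.
print(type(value))

# sys.getrefcount() reports the object's reference count. The absolute
# number includes temporary references, so compare before and after.
before = sys.getrefcount(value)
alias = value                      # bind a second name to the same object
after = sys.getrefcount(value)

print("extra references created by the alias:", after - before)
```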

We are interested in integers — type int within Python and PyLongObject within CPython.

Because there is some overhead in creating and destroying objects, CPython maintains a cache for the set of small integers (values -5, …, 256), since these are commonly used.

(At this point, you can probably guess that our module will modify these cached objects).
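
The cache can be observed from Python with the is operator. This is a sketch that relies on CPython behavior, not on anything the language guarantees: integers are built at run time from strings so the compiler cannot merge them into one constant, and same_object is a helper name I made up for the demonstration.

```python
def same_object(value):
    """Build the same integer twice and report whether both are one object."""
    # Going through str -> int forces the objects to be created at run time.
    a = int(str(value))
    b = int(str(value))
    return a is b


# Values inside the cached range -5..256 should come back as shared objects,
# values just outside it as freshly allocated ones (CPython-specific).
for value in (-6, -5, 0, 8, 9, 256, 257):
    print(value, same_object(value))
```

On CPython I would expect True for the values from -5 to 256 and False for -6 and 257; other implementations are free to behave differently.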

pycore_global_objects.h defines the range of integers that will be cached,

#define _PY_NSMALLPOSINTS           257
#define _PY_NSMALLNEGINTS           5

In the same file, a few lines later, we see the array that will hold the cached integers. It is an array of PyLongObject named small_ints, nested inside a struct called _Py_global_objects:

struct _Py_global_objects {
    struct {
        /* Small integers are preallocated in this array so that they
         * can be shared.
         * The integers that are preallocated are those in the range
         * -_PY_NSMALLNEGINTS (inclusive) to _PY_NSMALLPOSINTS (exclusive).
         */    
        PyLongObject small_ints[_PY_NSMALLNEGINTS + _PY_NSMALLPOSINTS];
        ...
    } singletons;
};

When and where is this array initialized? In pycore_runtime_init.h, we see a macro (itself generated by a Python script) that initializes the elements in the small_ints[] array,

        .small_ints = { \
            _PyLong_DIGIT_INIT(-5), \
            _PyLong_DIGIT_INIT(-4), \
            _PyLong_DIGIT_INIT(-3), \
            ...
            _PyLong_DIGIT_INIT(255), \
            _PyLong_DIGIT_INIT(256), \
        }, \

It takes some digging to see how this macro (_Py_global_objects_INIT) gets used, but pystate.c stores it as static const _PyRuntimeState initial = _PyRuntimeState_INIT; and this initial variable is used when the interpreter is initialized.

2.2. Module code

Back to the code for our module — we want to modify a couple of the elements of small_ints[], but we do not have direct access to it. We can use the API declared in longobject.h — namely, the following function that gives the PyObject for a given long (plain C) value: PyAPI_FUNC(PyObject *) PyLong_FromLong(long).

Its definition is in longobject.c. We can see that it checks for small integers (IS_SMALL_INT(ival)) and then calls get_small_int((sdigit)ival), which returns _PyLong_SMALL_INTS[_PY_NSMALLNEGINTS + ival].

Finally, in pycore_long.h, you can see that this refers to the small_ints[] array: #define _PyLong_SMALL_INTS _Py_SINGLETON(small_ints).
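
The lookup described above amounts to a range check followed by an offset into the array. Here is a Python model of that arithmetic; the two constants are copied from pycore_global_objects.h, while the function names are my own stand-ins for IS_SMALL_INT and get_small_int, not CPython APIs.

```python
# Constants copied from pycore_global_objects.h (CPython 3.11).
NSMALLPOSINTS = 257
NSMALLNEGINTS = 5


def is_small_int(ival):
    """Model of the IS_SMALL_INT range check: -5 <= ival < 257."""
    return -NSMALLNEGINTS <= ival < NSMALLPOSINTS


def small_int_index(ival):
    """Model of the index used by get_small_int() into small_ints[]."""
    if not is_small_int(ival):
        raise ValueError(f"{ival} is outside the cached range")
    return NSMALLNEGINTS + ival


for value in (-5, 0, 8, 9, 256):
    print(value, "-> small_ints[%d]" % small_int_index(value))
```

So the two objects this module cares about sit next to each other, at indices 13 and 14 of the array.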

Now, we have the first part of our code. It will get the references to the two integers we care about:

PyLongObject* obj8 = (PyLongObject*)PyLong_FromLong(8);
PyLongObject* obj9 = (PyLongObject*)PyLong_FromLong(9);

How do we change the value?

Python’s int (CPython’s PyLongObject) represents arbitrarily large integers, positive or negative. longintrepr.h defines how integers are stored. I’ll include the whole comment below because it is so informative, followed by the struct definition:

/* Long integer representation.
   The absolute value of a number is equal to
        SUM(for i=0 through abs(ob_size)-1) ob_digit[i] * 2**(SHIFT*i)
   Negative numbers are represented with ob_size < 0;
   zero is represented by ob_size == 0.
   In a normalized number, ob_digit[abs(ob_size)-1] (the most significant
   digit) is never zero.  Also, in all cases, for all valid i,
        0 <= ob_digit[i] <= MASK.
   The allocation function takes care of allocating extra memory
   so that ob_digit[0] ... ob_digit[abs(ob_size)-1] are actually available.
   We always allocate memory for at least one digit, so accessing ob_digit[0]
   is always safe. However, in the case ob_size == 0, the contents of
   ob_digit[0] may be undefined.

   CAUTION:  Generic code manipulating subtypes of PyVarObject has to
   aware that ints abuse  ob_size's sign bit.
*/

struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};             
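
The formula in that comment can be modeled in a few lines of Python. This is a sketch of the representation as the comment describes it, not CPython’s own code: to_digits and from_digits are names I chose, and SHIFT is read from sys.int_info (30 on typical 64-bit builds).

```python
import sys

SHIFT = sys.int_info.bits_per_digit   # bits stored in each ob_digit entry
MASK = (1 << SHIFT) - 1


def to_digits(n):
    """Return (ob_size, ob_digit) for n, following the comment above."""
    magnitude = abs(n)
    digits = []
    while magnitude:
        digits.append(magnitude & MASK)   # least significant digit first
        magnitude >>= SHIFT
    # The sign of the number is carried by the sign of ob_size.
    size = len(digits) if n >= 0 else -len(digits)
    return size, digits


def from_digits(size, digits):
    """Rebuild the integer: SUM ob_digit[i] * 2**(SHIFT*i), signed by ob_size."""
    value = sum(d << (SHIFT * i) for i, d in enumerate(digits[:abs(size)]))
    return -value if size < 0 else value


for n in (0, 8, -9, 2**SHIFT, 10**30):
    size, digits = to_digits(n)
    print(n, "-> ob_size =", size, "ob_digit =", digits)
```

Any integer whose absolute value is below 2**SHIFT needs just one digit, which is what makes the small cached integers so easy to overwrite.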
             

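A couple of other typedefs relate the types shown above: PyLongObject is a typedef for struct _longobject, and digit is an unsigned integer type (uint32_t on builds that use 30-bit digits).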

So, for small integers (nonzero values whose absolute value fits in a single digit), the "raw value" is simply .ob_digit[0].
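
It is possible to peek at ob_digit[0] from Python without writing anything, using ctypes. Treat this as a hedged, CPython-only sketch: it assumes id() returns the object’s memory address, and that int.__basicsize__ equals the byte offset of ob_digit within the struct (which is how I understand the int type to be declared). The helper name first_digit is mine, and the code only reads memory.

```python
import ctypes
import sys

# int.__itemsize__ is the size of one digit: 4 bytes for 30-bit digits,
# 2 bytes for 15-bit digits.
DIGIT_TYPE = ctypes.c_uint32 if int.__itemsize__ == 4 else ctypes.c_uint16


def first_digit(n):
    """Read ob_digit[0] of the int object n (read-only, CPython-specific)."""
    # Assumption: the digits start int.__basicsize__ bytes into the object.
    address = id(n) + int.__basicsize__
    return DIGIT_TYPE.from_address(address).value


print("ob_digit[0] of 8:", first_digit(8))
print("ob_digit[0] of 9:", first_digit(9))

# For a two-digit number, only the low SHIFT bits live in ob_digit[0].
shift = sys.int_info.bits_per_digit
big = (1 << shift) + 5
print("ob_digit[0] of 2**%d + 5:" % shift, first_digit(big))
```

The C module does the equivalent of this read, followed by a write to the same location.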

Now we have all the code we need: get a reference to each cached integer object and change its value. These four lines can go in the //CODE HERE part above:

PyLongObject* obj8 = (PyLongObject*)PyLong_FromLong(8);
PyLongObject* obj9 = (PyLongObject*)PyLong_FromLong(9);

obj8->ob_digit[0] = 9;
obj9->ob_digit[0] = 8;

3. Notes

If you import this module and continue working, it will likely eventually lead to a complete crash of the interpreter (i.e. suddenly exit and print something like Segmentation fault (core dumped) — as opposed to simply raising a Python Exception).

I found this out in a few ways:

  • My original test was to redefine all small integers by shifting each of them, n => n+1. I found this crashed almost immediately.

  • I tried swapping lower integers (3 and 4) and this seemed to lead to crashes (in subsequent print calls) in Python 3.8 but not in Python 3.11.

In conclusion, there is no real reason for this module to exist, but it serves as a good learning exercise and a reminder to install and import only modules you trust.


https://github.com/kts/swap_8_and_9