Published August 7th, 2023
Summary: I present a simple Python module that, once imported, swaps the values of two integers. The purpose is to (a) show that modules can do harmful and unexpected things, (b) give an example of how to write a simple module in C, and (c) introduce a few parts of the CPython source code.
Try the following:
$ git clone https://github.com/kts/swap_8_and_9.git
$ cd swap_8_and_9
$ pip install .
$ python
>>> print(8)
8
>>> import swap_8_and_9
>>> print(8)
9
>>> print(9)
8
>>> print(list(range(10)))
[0, 1, 2, 3, 4, 5, 6, 7, 9, 8]
As the name implies and the example demonstrates, simply importing the module swap_8_and_9 has swapped the values of the two integers 8 and 9.
This can come as a surprise if you view Python’s import as simply a way to introduce some name(s) into the local namespace. In fact, import can execute arbitrary code, and some code can cause unexpected (or unwanted) behavior in the Python interpreter.
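To make the point about import concrete without any C, here is a minimal sketch in pure Python. The module name noisy_module_demo and the attribute IMPORT_SIDE_EFFECT are hypothetical and exist only for this illustration. The module body contains no functions or classes, yet importing it changes global state.

```python
import builtins
import importlib
import os
import sys
import tempfile

# Hypothetical module name and body, used only for this illustration.
MODULE_NAME = "noisy_module_demo"
MODULE_SOURCE = (
    "import builtins\n"
    "# Top-level statements run as soon as the module is imported.\n"
    "builtins.IMPORT_SIDE_EFFECT = 'set during import'\n"
)


def import_with_side_effect():
    """Write the module to a temporary directory, import it, and report the side effect."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, MODULE_NAME + ".py")
        with open(path, "w") as f:
            f.write(MODULE_SOURCE)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            importlib.import_module(MODULE_NAME)
        finally:
            sys.path.remove(tmp)
    # The attribute was never set by this script, only by the imported module.
    return getattr(builtins, "IMPORT_SIDE_EFFECT", None)


print(import_with_side_effect())
```

The caller only asked for a name to be imported, but the module's top-level code ran and modified builtins along the way.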
The rest of this page describes why this works and how it is written. I’ll describe how to write a simple Python module in C, and I’ll describe some parts of the Python source code.
The swap_8_and_9 module is an example of a very simple Python module written using only C. You only need two files: setup.py and the C source, which I’ve put in a subdirectory, src/swap_8_and_9.c.
# distutils was removed in Python 3.12; setuptools provides the same names.
from setuptools import setup, Extension
name = 'swap_8_and_9'
setup(name = name,
      version = '0.1.0',
      description = 'Proof of concept that swaps ints 8 and 9 on import',
      ext_modules = [
          Extension(name, sources = ['src/' + name + '.c']),
      ])
Our main code file, swap_8_and_9.c, is pretty simple because this module has no functions, no classes, no constants, etc. We only need to write the definition of the module and the function that will be called when it is imported.
The file needs three things ({name} is our module name):
List of methods (used in the module definition): static PyMethodDef {name}_methods[] = {...}; (this is an empty list in our case).
Module definition: static struct PyModuleDef {name}_definition = {...};
Init function: PyMODINIT_FUNC PyInit_{name}(void) {...}
#include <Python.h>
/*
  methods in Module
*/
static PyMethodDef swap_8_and_9_methods[] = {
    //marks end:
    {NULL, NULL, 0, NULL}
};
/*
  define Module
*/
static struct PyModuleDef swap_8_and_9_definition = {
    PyModuleDef_HEAD_INIT,
    "swap_8_and_9", /* module name */
    /* __doc__ string: */
    "C Module that...",
    /* size of per-interpreter state of the module,
       or -1 if the module keeps state in global variables. */
    -1,
    swap_8_and_9_methods
};
/*
  initialize Module
  - called on import
  - PyInit_{name} (same name from setup.py name)
*/
PyMODINIT_FUNC PyInit_swap_8_and_9(void) {
    Py_Initialize();
    //CODE HERE
    return PyModule_Create(&swap_8_and_9_definition);
}
We will write something in the //CODE HERE part after we learn a little bit about some CPython internals.
With these two files, you should be able to run pip install . to install from the current directory.
Another useful command is python setup.py build. This shows the compilation commands being run. (Note that if the module is already built and the files haven’t changed, it won’t do anything.)
On my MacBook, the re-formatted output looks like this (I replaced many arguments with ..., mainly -I PATH, -L PATH, and some other flags):
# 1. compile
clang ... \
-c src/swap_8_and_9.c \
-o build/temp.macosx-12.6-arm64-cpython-311/src/swap_8_and_9.o
# 2. link
clang ... \
build/temp.macosx-12.6-arm64-cpython-311/src/swap_8_and_9.o \
-o build/lib.macosx-12.6-arm64-cpython-311/swap_8_and_9.cpython-311-darwin.so
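The suffix of the linked file (.cpython-311-darwin.so above) depends on the interpreter and platform. As a small sketch, the snippet below asks the running interpreter which extension-module suffixes it accepts and builds the file name you would expect for this module. The helper name expected_extension_filename is my own, and the assumption that the first list entry is the most specific suffix is mine as well.

```python
import importlib.machinery


def expected_extension_filename(module_name):
    """Return module_name plus the first extension suffix this interpreter accepts."""
    # EXTENSION_SUFFIXES lists the file suffixes the import system recognizes
    # for compiled modules; assumption: the first one is the most specific.
    suffix = importlib.machinery.EXTENSION_SUFFIXES[0]
    return module_name + suffix


print(importlib.machinery.EXTENSION_SUFFIXES)
print(expected_extension_filename("swap_8_and_9"))
```

If the build output uses a suffix that is not in this list, the interpreter will not find the module on import.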
CPython refers to the standard Python implementation. It is written in C and can be browsed on GitHub.
On this page, I am using CPython version 3.11.4, and the code descriptions below link to the GitHub source code for this version.
For the CPython code that I’ll describe, I have noticed significant reorganization between versions 3.8.6 and 3.11.4. However, the concepts remain the same.
In CPython, everything (integers, strings, dictionaries, functions, etc.) is an "object", represented by a C struct named PyObject.
Common to all objects is storage for the object’s type (and the data needed for that type) and a reference count used for garbage collection.
We are interested in integers: type int within Python and PyLongObject within CPython.
Because there is some overhead in creating and destroying objects, CPython maintains a cache for the set of small integers (values -5, …, 256), since these are commonly used.
(At this point, you can probably guess that our module will modify these cached objects.)
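The cache can be observed from Python by checking object identity. The sketch below is only an illustration (the helper name shares_cached_object is my own), and it relies on CPython behavior rather than anything the language guarantees: two ints built separately at run time are the same object only if they come from the cache.

```python
def shares_cached_object(n):
    """Return True if two separately constructed ints equal to n are the same object."""
    # int(str(n)) builds the value at run time, which avoids the compiler
    # merging two equal literals into a single constant.
    a = int(str(n))
    b = int(str(n))
    return a is b


for n in (-6, -5, 0, 256, 257):
    print(n, shares_cached_object(n))
```

On the CPython versions discussed here, I would expect True for values inside the cached range and False just outside it.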
pycore_global_objects.h defines the range of integers that will be cached:
#define _PY_NSMALLPOSINTS 257
#define _PY_NSMALLNEGINTS 5
In the same file, a few lines later, we see the array that will hold the cached integers. It is an array of PyLongObject named small_ints, inside a struct called _Py_global_objects:
struct _Py_global_objects {
    struct {
        /* Small integers are preallocated in this array so that they
         * can be shared.
         * The integers that are preallocated are those in the range
         * -_PY_NSMALLNEGINTS (inclusive) to _PY_NSMALLPOSINTS (exclusive).
         */
        PyLongObject small_ints[_PY_NSMALLNEGINTS + _PY_NSMALLPOSINTS];
        ...
    } singletons;
};
When and where is this array initialized? In pycore_runtime_init.h, we see a macro (itself generated by a Python script) that initializes the elements of the small_ints[] array:
.small_ints = { \
_PyLong_DIGIT_INIT(-5), \
_PyLong_DIGIT_INIT(-4), \
_PyLong_DIGIT_INIT(-3), \
...
_PyLong_DIGIT_INIT(255), \
_PyLong_DIGIT_INIT(256), \
}, \
It is slightly convoluted to see how this macro (_Py_global_objects_INIT) gets used, but pystate.c stores it as static const _PyRuntimeState initial = _PyRuntimeState_INIT; and this initial variable is used when the interpreter is initialized.
Back to the code for our module: we want to modify a couple of the elements of small_ints[], but we do not have direct access to it.
We can use the API declared in longobject.h, namely the following function, which returns the PyObject for a given long (plain C) value: PyAPI_FUNC(PyObject *) PyLong_FromLong(long).
Its definition is in longobject.c. We can see that it checks for small integers (IS_SMALL_INT(ival)) and then calls get_small_int((sdigit)ival), which returns a pointer to the element _PyLong_SMALL_INTS[_PY_NSMALLNEGINTS + ival].
Finally, in pycore_long.h, you can see that this refers to the small_ints[] array: #define _PyLong_SMALL_INTS _Py_SINGLETON(small_ints).
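The index arithmetic is easy to check by hand. The sketch below mirrors it in Python, with the two constants copied from pycore_global_objects.h; the function name small_int_index is my own and is not part of CPython.

```python
# Constants copied from pycore_global_objects.h (CPython 3.11).
_PY_NSMALLPOSINTS = 257
_PY_NSMALLNEGINTS = 5


def small_int_index(ival):
    """Mirror the index expression _PY_NSMALLNEGINTS + ival used for the cache lookup."""
    if not -_PY_NSMALLNEGINTS <= ival < _PY_NSMALLPOSINTS:
        raise ValueError(f"{ival} is outside the cached range")
    return _PY_NSMALLNEGINTS + ival


# -5 is the first element, 256 the last, and 8 and 9 sit next to each other.
print(small_int_index(-5), small_int_index(8), small_int_index(9), small_int_index(256))
# prints: 0 13 14 261
```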
Now, we have the first part of our code. It will get the references to the two integers we care about:
PyLongObject* obj8 = (PyLongObject*)PyLong_FromLong(8);
PyLongObject* obj9 = (PyLongObject*)PyLong_FromLong(9);
How do we change the value?
Python int (and CPython PyLongObject) represents arbitrarily large integers (positive or negative). longintrepr.h defines how integers are stored. I’ll include the whole comment below because it is so informative, followed by the struct definition:
/* Long integer representation.
The absolute value of a number is equal to
SUM(for i=0 through abs(ob_size)-1) ob_digit[i] * 2**(SHIFT*i)
Negative numbers are represented with ob_size < 0;
zero is represented by ob_size == 0.
In a normalized number, ob_digit[abs(ob_size)-1] (the most significant
digit) is never zero. Also, in all cases, for all valid i,
0 <= ob_digit[i] <= MASK.
The allocation function takes care of allocating extra memory
so that ob_digit[0] ... ob_digit[abs(ob_size)-1] are actually available.
We always allocate memory for at least one digit, so accessing ob_digit[0]
is always safe. However, in the case ob_size == 0, the contents of
ob_digit[0] may be undefined.
CAUTION: Generic code manipulating subtypes of PyVarObject has to
aware that ints abuse ob_size's sign bit.
*/
struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};
A couple of other typedefs relate a few of the types shown above:
pytypedefs.h: typedef struct _longobject PyLongObject;
longintrepr.h: typedef uint32_t digit;
So, for positive integers small enough to fit in a single digit (such as 8 and 9), the "raw value" is simply .ob_digit[0].
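The formula in the comment can be tried out in pure Python. This is only a sketch of the representation as described for CPython 3.11, not the C implementation; the helper names to_digits and from_digits are my own, and SHIFT is read from sys.int_info so it matches the running interpreter.

```python
import sys

# Bits stored per digit: 30 on typical 64-bit builds, 15 on some others.
SHIFT = sys.int_info.bits_per_digit
MASK = (1 << SHIFT) - 1


def to_digits(value):
    """Split value into (ob_size, digits) as described in the longintrepr.h comment."""
    magnitude = abs(value)
    digits = []
    while magnitude:
        digits.append(magnitude & MASK)  # least significant digit first
        magnitude >>= SHIFT
    ob_size = len(digits) if value >= 0 else -len(digits)
    return ob_size, digits


def from_digits(ob_size, digits):
    """Rebuild the integer: SUM ob_digit[i] * 2**(SHIFT*i), negated if ob_size < 0."""
    total = sum(d << (SHIFT * i) for i, d in enumerate(digits[:abs(ob_size)]))
    return -total if ob_size < 0 else total


print(to_digits(8))            # (1, [8]): one digit holding the value itself
print(to_digits(0))            # (0, []): zero has ob_size == 0
print(to_digits(-(1 << SHIFT)))  # (-2, [0, 1]): two digits, negative sign in ob_size
```

For 8 and 9 there is exactly one digit, which is why overwriting that one field is enough to change what the object means.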
Now we have all the code we need: get a reference to each cached integer object and change its value. These four lines can go in the //CODE HERE part above:
PyLongObject* obj8 = (PyLongObject*)PyLong_FromLong(8);
PyLongObject* obj9 = (PyLongObject*)PyLong_FromLong(9);
obj8->ob_digit[0] = 9;
obj9->ob_digit[0] = 8;
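The field that these lines overwrite can also be inspected from Python, read-only, with ctypes. This is a sketch under two assumptions that hold for CPython but are implementation details: id() returns the object's memory address, and int.__basicsize__ is the byte offset of ob_digit within the object. The helper name read_first_digit is my own.

```python
import ctypes
import sys


def read_first_digit(value):
    """Read ob_digit[0] of the int object for value, without modifying it."""
    # Pick a C type matching the digit size of this build (4 bytes for uint32_t).
    digit_type = {2: ctypes.c_uint16, 4: ctypes.c_uint32}[sys.int_info.sizeof_digit]
    # Assumption: id() is the object's address and int.__basicsize__ is the
    # offset of ob_digit (both are CPython implementation details).
    address = id(value) + int.__basicsize__
    return digit_type.from_address(address).value


print(read_first_digit(8), read_first_digit(9))
```

On an unmodified interpreter this should print 8 9. The snippet only reads memory; writing through the same address is what the C module does, with the consequences described below.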
If you import this module and continue working, it will likely eventually lead to a complete crash of the interpreter (i.e. it suddenly exits and prints something like Segmentation fault (core dumped), as opposed to simply raising a Python Exception).
I found this out in a few ways:
My original test was to redefine all small integers by shifting each of them, n => n+1. I found this crashed almost immediately.
I tried swapping lower integers (3 and 4), and this seemed to lead to crashes (in subsequent print calls) in Python 3.8 but not in Python 3.11.
In conclusion, there is no real reason for this module, but it serves as a good learning exercise and a reminder to install and import only modules you trust.