Skip to content

Latest commit

 

History

History
160 lines (109 loc) · 5.14 KB

File metadata and controls

160 lines (109 loc) · 5.14 KB

set

contents

related file

  • cpython/Objects/setobject.c
  • cpython/Include/setobject.h

memory layout

memory layout

method

  • new

    • call stack
      • static PyObject * set_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
        • static PyObject * make_new_set(PyTypeObject *type, PyObject *iterable)

The following picture shows the value of each field in an empty set

make new set

  • add

    • call stack
      • static PyObject *set_add(PySetObject *so, PyObject *key)
        • static int set_add_key(PySetObject *so, PyObject *key)
          • static int set_add_entry(PySetObject *so, PyObject *key, Py_hash_t hash)

Initialize a new empty set. Whenever an empty set is created, the setentry field points to the smalltable field inside the same PySetObject. The default value of PySet_MINSIZE is 8, which means if there are fewer than 8 objects in PySetObject, they are stored in the smalltable. If there are more than 8 objects, the resize operation will make setentry point to a new malloced block. (Actually, the first time a resize operation takes place, the smalltable won't be filled. There is a factor that triggers the resize operation; please keep reading)

  s = set()

set_empty

The value in the mask field is the size of malloced entries minus 1; currently it's 7

s.add(0) # hash(0) & mask == 0

set_add_0

Add a value 5

s.add(5) # hash(5) & mask == 0

set_add_5

  • hash collision

Add a value 16. Because the mask is 7, hash(16) & 7 ===> 0 CPython uses LINEAR_PROBES to solve hash collisions instead of CHAIN (linked list)

// pseudocode
#define LINEAR_PROBES 9
while (1)
{
    // i is the hash result
    // find the position according to the hash value
    if (entry in i is empty)
        return entry[i]
    // reach here, means entry[i] already taken
    if (i + LINEAR_PROBES <= mask) {
        for (j = 0 ; j < LINEAR_PROBES ; j++) {
            if (entry in j is empty)
                return j
        }
    }
    // reach here, means entry[i] - entry[i + LINEAR_PROBES] all are taken
    // now, rehash take place
    perturb >>= PERTURB_SHIFT;
    i = (i * 5 + 1 + perturb) & mask;
}

The 0th position has been taken, so LINEAR_PROBES takes the next empty position, which is the 1st position

s.add(16) # hash(16) & mask == 0

set_add_16

The 0th position has been taken. The first time, LINEAR_PROBES finds the 1st position, which has also been taken. The second time, LINEAR_PROBES finds the 2nd position, which is empty.

s.add(32) # hash(32) & mask == 0

set_add_32

  • resize

Let's insert one more item with value 64, still repeating the LINEAR_PROBES process, inserting at index 3

s.add(64) # hash(64) & mask == 0

set_add_64

Now the value in field fill is 5 and mask is 7, which will trigger the resize operation. The trigger condition is fill*5 !< mask * 3

// from cpython/Objects/setobject.c
if ((size_t)so->fill*5 < mask*3)
    return 0;
return set_table_resize(so, so->used>50000 ? so->used*2 : so->used*4);

After resize, values 32 and 64 still repeat the LINEAR_PROBES process, inserted at index 1 and index 2. Because the value in the mask field becomes 31 and hash(16) is less than the mask, 16 can be inserted at index 16. The field table no longer points to smalltable; instead, it points to a new malloced block

set_add_64_resize

  • why LINEAR_PROBES

    • improve cache locality
    • reduces the cost of hash collisions
  • clear

    • call stack
      • static PyObject *set_clear(PySetObject *so, PyObject *Py_UNUSED(ignored))
        • static int set_clear_internal(PySetObject *so)
          • static void set_empty_to_minsize(PySetObject *so)

Call clear to restore the initial state

  s.clear()

set_clear