Categories
python

Python Performance: Checking a collection contains a value (Membership)

A common task in programming and development is finding items that exist already or recognising duplicates. This is done by checking membership of an element in a collection.

Often a list is used as it is mutable (can be changed and items added to it) and it is suited for single items. However there are other data structures that could help – a hash – which in python is a dict and a set.

Which data structure is best suited for this action for performance?

The Test Code

The full test code is not posted but the method for checking each type is here:

def check_dict_item(item):
    if my_dict.get(item):
        return True
    return False

def check_set_item(item):
    if item in my_set:
        return True
    return False

def check_list_item(item):
    if item in my_list:
        return True
    return False

Note: it was slightly slowed added about 0.3x more latency when using if item in my_dict.keys() to check for membership with a dict.

The results

  • µs: micro seconds (10^-6)
  • ns: nano seconds (10^-9)

Using timeit

Set (nano seconds 10^-9):

In [2]: %timeit membership.check_set()
575 ns ± 3.37 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [5]: %timeit  -r 10 -n 1000 membership.check_set()
1.2 µs ± 127 ns per loop (mean ± std. dev. of 10 runs, 1,000 loops each)

List:

In [3]: %timeit membership.check_list()
396 µs ± 1.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [4]: %timeit  -r 10 -n 1000 membership.check_list()
398 µs ± 1.31 µs per loop (mean ± std. dev. of 10 runs, 1,000 loops each)

Dict:

In [4]: %timeit membership.check_dict()
787 ns ± 6.19 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [3]: %timeit  -r 10 -n 1000 membership.check_dict()
1.6 µs ± 224 ns per loop (mean ± std. dev. of 10 runs, 1,000 loops each)

Using time

Main method:

$ python3 membership.py 
List:  0.0004038810729980469 (400 micro seconds)
Set:  1.9073486328125e-06 (1.9 micro seconds)
Dict:  2.1457672119140625e-06 (2.1 micro seconds)

Conclusion

It is very quick in all cases however the optimal choice is using a set().
The slowest option between the 3 is a list.