memoized: fix for bug where wrong cached results were returned

memoized() used id() to get "unique" representations for arguments passed to a function, in order to ensure that results for the same function called with different arguments are treated differently. However, Python object identities are only guaranteed to be unique at a particular point in time. It is possible than a particular ID gets reused for a different object if the previous object associated with that ID no longer exists. In particular, in CPython, the IDs are just addresses of the corresponding PyObject's in memory. If a PyObject gets garbage collected, another object may get allocated in its place, and the new object will "inherit" the ID by virtue of being in the same memory location. If the new object is then used to call a memoized function that was previously called with the old object, the old cached result will be incorrectly returned. To get around this issue, the cache is now indexed not only by the ID of an object but also but the first few bytes of its value. NOTE: there is still a potential issue it two relatively large objects gets allocated in the same place and happen to share the first few (currently 32) bytes, and are then both used as parameters to the same memoized function. The only way around that is to hash the entire value of the object. However, given the performance penalty that would incur for larger object, and the unlikeliness of this situation actually arising, it is currently deemed not worth it.
2025-10-23 11:03:20 +01:00 · 2016-09-06 09:57:58 +01:00
parent 390a544a92
commit 6d854fd4dc
1 changed files with 25 additions and 1 deletions
--- a/devlib/utils/misc.py
+++ b/devlib/utils/misc.py
@@ -29,6 +29,7 @@ import subprocess
 import pkgutil
 import logging
 import random
+import ctypes
 from operator import itemgetter
 from itertools import groupby
 from functools import partial
@@ -541,13 +542,36 @@ def reset_memo_cache():
    __memo_cache.clear()


+def __get_memo_id(obj):
+    """
+    An object's id() may be re-used after an object is freed, so it's not
+    sufficiently unique to identify params for the memo cache (two different
+    params may end up with the same id). this attempts to generate a more unique
+    ID string.
+    """
+    obj_id = id(obj)
+    obj_pyobj = ctypes.cast(obj_id, ctypes.py_object)
+    # TODO: Note: there is still a possibility of a clash here. If Two
+    # different objects get assigned the same ID, an are large and are
+    # identical in the first thirty two bytes. This shouldn't be much of an
+    # issue in the current application of memoizing Target calls, as it's very
+    # unlikely that a target will get passed large params; but may cause
+    # problems in other applications, e.g. when memoizing results of operations
+    # on large arrays. I can't really think of a good way around that apart
+    # form, e.g., md5 hashing the entire raw object, which will have an
+    # undesirable impact on performance.
+    num_bytes = min(ctypes.sizeof(obj_pyobj), 32)
+    obj_bytes = ctypes.string_at(ctypes.addressof(obj_pyobj), num_bytes)
+    return '{}/{}'.format(obj_id, obj_bytes)
+
+
@wrapt.decorator
 def memoized(wrapped, instance, args, kwargs):
    """A decorator for memoizing functions and methods."""
    func_id = repr(wrapped)

    def memoize_wrapper(*args, **kwargs):
-        id_string = func_id + ','.join([str(id(a)) for a in  args])
+        id_string = func_id + ','.join([__get_memo_id(a) for a in  args])
        id_string += ','.join('{}={}'.format(k, v)
                              for k, v in kwargs.iteritems())
        if id_string not in __memo_cache: