# development
p
What code is responsible for hashing things in the rule graph? We have all of these dataclasses - do we just use the hash of the dataclass? Or is there something else in rust to determine the hash key for a given request?
b
IIRC, yes the classes hash is used
h
yep, `hash(my_obj)`
p
def predicate(arg: str):
    def inner():
        return arg
    return inner

a1 = predicate("a")
a2 = predicate("a")
b = predicate("b")

hash(a1)
hash(a2)
hash(b)
272502245
272502218
272502236
so functions are hashable, but … not really
b
No, that makes a lot of sense to me
Each invocation creates a new function and returns it
If you wanted stability, use an object with a `__call__`?
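A rough sketch of that suggestion, for comparison with the closure snippet above. `make_closure_predicate` and `ValuePredicate` are made-up names for illustration, not anything from the Pants codebase. The idea is that a frozen dataclass derives `__eq__` and `__hash__` from its fields, so two instances built from the same argument compare equal, while two closures built from the same argument do not:

```python
from dataclasses import dataclass


def make_closure_predicate(arg: str):
    # Each call builds a brand-new function object.
    def inner() -> str:
        return arg

    return inner


@dataclass(frozen=True)
class ValuePredicate:
    # Hypothetical callable object; equality and hash come from `arg`.
    arg: str

    def __call__(self) -> str:
        return self.arg


c1, c2 = make_closure_predicate("a"), make_closure_predicate("a")
d1, d2 = ValuePredicate("a"), ValuePredicate("a")

# Functions compare by identity, so these are two distinct keys.
closures_equal = c1 == c2

# Frozen dataclasses compare (and hash) by field values.
dataclasses_equal = d1 == d2 and hash(d1) == hash(d2)
```

Within one process, `d1` and `d2` collapse to a single dict or set key, which is what a memoizing cache needs.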
p
So, my recent work to add a predicate to `DependenciesRequest` can have a different hash per request simply because building a predicate on the fly will produce a different object every time. 😕
b
Ah yeah, try out a dataclass with a `__call__`?
p
I suppose for the default cases, we just use the same function every time, so those won’t have that problem. But for the package traversal one, I’ll have to try a dataclass with a `__call__`.
If that works, should all of the predicates be subclasses of a new class, and then the type hint would require instances of that object instead of merely a `Callable`?
b
You might look for `__closure__` and warn?
Re the subclass: meh. Don't make it hard for people to use a nice free function?
p
What is `__closure__`?
Is that an attribute of a function object? edit: apparently it is. hmm. I did not know about that.
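For reference, a minimal sketch of the "look for `__closure__` and warn" idea. `warn_if_closure` is a hypothetical helper, not an existing Pants function. It relies on the fact that a plain function has `__closure__` set to `None`, while an inner function that captures variables from its enclosing scope has a tuple of cells there:

```python
import warnings
from typing import Any, Callable


def warn_if_closure(predicate: Callable[..., Any]) -> bool:
    """Warn (and return True) if `predicate` captures enclosing variables."""
    # Objects without the attribute (e.g. callable dataclass instances)
    # fall back to None and are treated as fine.
    closure = getattr(predicate, "__closure__", None)
    if closure:
        warnings.warn(
            f"{predicate!r} is a closure; a new one is built on every call, "
            "so it will not hash consistently and memoization will be lost.",
            stacklevel=2,
        )
        return True
    return False


def always(_: str) -> bool:
    return True


def starts_with(prefix: str) -> Callable[[str], bool]:
    def inner(name: str) -> bool:
        return name.startswith(prefix)

    return inner
```

Here `warn_if_closure(always)` stays quiet and `warn_if_closure(starts_with("src/"))` warns. One caveat: a nested function that captures nothing also has `__closure__` set to `None`, so this check would not flag it even though it is still a new object each time.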
@bitter-ability-32190 could you chime in here please: https://github.com/pantsbuild/pants/pull/19272#discussion_r1226929628
c
p
doh. Probably not
Tested. They are not stable across calls. So functions are not safely hashable across python sessions
b
That's ok. I don't think we expect them to be? The daemon is the persistent python process
p
from pants.engine.target import DepsTraversalPredicates
hash(DepsTraversalPredicates.always)
273207884
and in a different session:
hash(DepsTraversalPredicates.always)
287048744
which means that the cache keys related to dep traversal will be invalidated for each session of pantsd.
b
Yeah. Python hashes aren't stable across multiple processes IIRC. So no new news?
p
Then, how are we hashing all of our dataclasses and caching that?
b
The daemon. That's why when you restart the daemon rules are re-run
The persistent cache is for `Digests` and `Process`es (and a few other odds and ends)
That's why the daemon is so important. The memoization means we don't re-run python rule code.
p
🤯
b
The important thing is hashes are stable between invocations in the same process. Otherwise, you'll lose the memoization
p
I thought different rules could indicate which level of caching was safe (per session being only one of the options)
b
That is also true
But that's only when you wanna opt out of the free caching
p
Sorry. Struggling to wrap my mind around all of this.
c
this gets me every time… 🤯
b
No worries. Although I gtg
p
Thanks for this bit of your time 🙂
b
So functions should be safe. Closures (inner functions) not safe. Frozen dataclasses with a call method, also safe
It's not incorrect though, so a warn should be sufficient. You just lose memoization
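To make the "you just lose memoization" point concrete, here is a toy illustration. It uses a plain dict as a stand-in for the in-process memo and is not how the engine actually stores results. `fake_rule`, `runs_for`, `make_closure` and `StartsWith` are all invented for this sketch:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

_memo: Dict[Callable[[str], bool], bool] = {}
rule_runs: List[Callable[[str], bool]] = []


def fake_rule(predicate: Callable[[str], bool]) -> bool:
    # Only "run the rule" when the predicate is not already a known key.
    if predicate not in _memo:
        rule_runs.append(predicate)
        _memo[predicate] = predicate("src/app.py")
    return _memo[predicate]


def always(name: str) -> bool:
    return True


def make_closure(prefix: str) -> Callable[[str], bool]:
    def inner(name: str) -> bool:
        return name.startswith(prefix)

    return inner


@dataclass(frozen=True)
class StartsWith:
    prefix: str

    def __call__(self, name: str) -> bool:
        return name.startswith(self.prefix)


def runs_for(make_predicate: Callable[[], Callable[[str], bool]]) -> int:
    # Issue the same logical request twice; count actual rule executions.
    before = len(rule_runs)
    fake_rule(make_predicate())
    fake_rule(make_predicate())
    return len(rule_runs) - before


free_function_runs = runs_for(lambda: always)
closure_runs = runs_for(lambda: make_closure("src/"))
dataclass_runs = runs_for(lambda: StartsWith("src/"))
```

The free function and the frozen dataclass each trigger one execution for two requests. The closure triggers two, because each freshly built `inner` is a different key. The results are still correct in every case; only the cache hit is lost.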
p
dataclasses (at least simple frozen dataclasses) are hashable across sessions. I ran this in three python consoles and got the same hash:
>>> import dataclasses
>>> @dataclasses.dataclass(frozen=True)
... class MyFrozenThing:
...     a: int
...     def __call__(self) -> int:
...         return self.a
>>> a = MyFrozenThing(1)
>>> a
MyFrozenThing(a=1)
>>> hash(a)
-6644214454873602895
>>> a()
1
👍 1
b
It depends on the data inside the dataclass
And maybe your version of Python? Anywho the hashing that matters is in the same session 👍
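A small sketch of the "depends on the data inside" caveat. It assumes the difference comes from CPython's per-process string hash randomization (controlled by `PYTHONHASHSEED`). `Thing` and `hashes_with_seed` are illustrative names only. The script hashes the same two dataclass instances in two child interpreters started with different seeds:

```python
import os
import subprocess
import sys

# Code run in each child interpreter: one int field, one str field.
SNIPPET = """
from dataclasses import dataclass

@dataclass(frozen=True)
class Thing:
    value: object

print(hash(Thing(1)), hash(Thing("a")))
"""


def hashes_with_seed(seed: str) -> tuple:
    # Start a fresh Python process with an explicit hash seed.
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    int_hash, str_hash = result.stdout.split()
    return int(int_hash), int(str_hash)


first = hashes_with_seed("1")
second = hashes_with_seed("2")

# Integer hashes do not depend on the seed; string hashes do.
int_field_stable = first[0] == second[0]
str_field_stable = first[1] == second[1]
```

That would explain why the `a: int` example above matched across three consoles. A dataclass holding strings would generally not, unless `PYTHONHASHSEED` is pinned.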
h
I would recommend dataclasses with `__call__` as it makes it very explicit that we expect a stable hash
👍 3