# development
h
Do you know why we use `RefCell<Option<String>>` for the `thread_local!` static `THREAD_PARENT_ID`, but we only use `Option<String>` for the `task_local!` static? I'd expect it to use `RefCell` as well? This goes back to https://github.com/pantsbuild/pants/pull/9071/files#diff-230c46e3890a73dee860295c27128ef3aadc239d2db0f6bcd1ac67db038385ef. It used to be a `Mutex<Option<String>>`.
cc @witty-crayon-22786
Also, it might make sense to pair with someone on this. I'm pretty deep in the weeds on why the metrics aren't showing up and have made lots of progress narrowing it down, but I'm a little unsure where to go now.
w
yes… the TaskLocal isn’t “mutable”… rather than “setting” it for the task, it’s scoped for a particular future
the threadlocal on the other hand is set for a task
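To make that distinction concrete, here's a minimal std-only sketch (names simplified; the real static holds a span id rather than a plain string). A `thread_local!` slot is imperatively overwritten while a thread runs, so it needs `RefCell` for interior mutability; a `task_local!` value (tokio's macro, not shown here) is instead supplied once when scoping a future, so a plain `Option` suffices:

```rust
use std::cell::RefCell;

thread_local! {
    // Per-thread slot. RefCell gives interior mutability so the value
    // can be overwritten after the thread-local is initialized.
    static THREAD_PARENT_ID: RefCell<Option<String>> = RefCell::new(None);
}

// "Setting" the value mutates the current thread's slot in place.
fn set_parent_id(id: &str) {
    THREAD_PARENT_ID.with(|slot| *slot.borrow_mut() = Some(id.to_owned()));
}

fn get_parent_id() -> Option<String> {
    THREAD_PARENT_ID.with(|slot| slot.borrow().clone())
}

fn main() {
    set_parent_id("span-1");
    println!("{:?}", get_parent_id()); // prints Some("span-1")
}
```

With `task_local!` there is no `set_*` step at all: the value is attached to a future via something like `TASK_PARENT_ID.scope(value, fut)`, which is why no `RefCell` is needed there.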
h
Got it. That makes sense. I'm rereading https://doc.rust-lang.org/std/cell/index.html
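As a quick refresher from those docs: `RefCell` moves Rust's borrow rules from compile time to runtime, so overlapping borrows panic instead of failing to compile. A minimal illustration:

```rust
use std::cell::RefCell;

fn main() {
    let cell = RefCell::new(5);
    {
        let shared = cell.borrow(); // runtime-checked shared borrow
        assert_eq!(*shared, 5);
        // Calling cell.borrow_mut() here would panic at runtime:
        // the borrow rules are enforced dynamically, not statically.
    }
    *cell.borrow_mut() += 1; // fine: the shared borrow was dropped
    println!("{}", cell.borrow()); // prints 6
}
```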
w
@hundreds-father-404: i’m surprised that the workunit stuff could be related here… the place where the metric is recorded asserts that a store is present, right?
there are a few parts to this problem:
1) making sure that metrics are written into the Store
2) making sure that something reads them out of the store
3) making sure that 1 and 2 are happening in order
we already know that 3 isn’t the case, because of https://github.com/pantsbuild/pants/issues/11433
i can pair, but before we do, let’s discuss here for a bit
h
Yep, that's what I'm confused about. I've confirmed:
* This is not an issue with the cache write not happening. Via log statements, I found that 29 writes started and 28 "finished". Yet no counters were recorded. I'm not worried about that one missing write's race condition here.
* The code in `WorkunitStore::increment_counter()` is working correctly. The store handle is present, and the counters hashmap is being incremented correctly.

I'm trying to figure out why things aren't getting wired up correctly. For example, are we writing to the correct `WorkunitStore`?

Wording nit: it'll help me if we can consistently call it a `WorkunitStore`. `Store` is wired in my head to be the caching store.
w
there is only one workunitstore per run
h
> there is only one workunitstore per run

Great, I thought that was the case but was confused. Something I'm still confused by: why aren't we using something like `Arc<Mutex<...>>` for `WorkunitStore`?

```rust
#[derive(Clone)]
pub struct Context {
  workunit_store: WorkunitStore,
  build_id: String,
}
```
w
ok, so you’ve confirmed 1… if you add some logging near where 2 happens, is it happening in time to observe 1?
because the `WorkunitStore` is internally reference counted… it’s something like `Arc<Mutex<InnerWorkunitStore>>`
👍 1
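That internal sharing is why a plain `#[derive(Clone)]` field works in `Context`. A hedged sketch of the pattern (the field and method names here are illustrative, not the actual pants source): cloning the wrapper only bumps the `Arc` refcount, so every clone locks and mutates the same inner state.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Illustrative stand-in for the real WorkunitStore: a cheap-to-Clone
// handle around shared, mutex-guarded state.
#[derive(Clone)]
pub struct WorkunitStore {
    inner: Arc<Mutex<Inner>>,
}

#[derive(Default)]
struct Inner {
    counters: HashMap<String, u64>,
}

impl WorkunitStore {
    pub fn new() -> Self {
        WorkunitStore { inner: Arc::new(Mutex::new(Inner::default())) }
    }

    pub fn increment_counter(&self, name: &str, by: u64) {
        let mut inner = self.inner.lock().unwrap();
        *inner.counters.entry(name.to_owned()).or_insert(0) += by;
    }

    pub fn get(&self, name: &str) -> u64 {
        self.inner.lock().unwrap().counters.get(name).copied().unwrap_or(0)
    }
}

fn main() {
    let store = WorkunitStore::new();
    let clone = store.clone(); // shares the same Inner via the Arc
    clone.increment_counter("cache_writes", 1);
    println!("{}", store.get("cache_writes")); // prints 1
}
```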
h
wait, with your #1, did you mean by `Store` the `WorkunitStore` or the caching `Store`?
w
`WorkunitStore`
👍 1
h
Cool. Sounds good on figuring out #2 now: whether we are actually reading from the `WorkunitStore`. That's where I'm a little confused about how to go about logging. Where is the call for that read? Streaming workunits?
w
yep
and as i said, we already know that 3 is not happening. that’s a known bug.
(c.f. that ticket and the linked “tail tasks” thing)
but i would view what you’re currently doing as confirming that 1 and 2 are happening out of order
if “finished=True”, it’s going to be the last poll before the run completes… anything that happens after that final poll will not be captured by the StreamingWorkunitHandler
the Session and WorkunitStore continue to exist after that call… so stuff can continue to happen after the client has gone away. that’s where tail tasks / background completion of the StreamingWorkunitHandler would come in
👍 1
grabbing lunch, but if you want to pair afterward, lemme know
❤️ 1
🥪 1
h
I think that first link is more relevant. The second one is only histograms, it seems. But okay, trying to figure this out. In particular, I'm wondering if it'd help to somehow add a blocking call before the read to ensure the write has finished.
w
that is Tom’s tail tasks PR
👍 1
h
Hm, neither Tom's PR nor adding `time.sleep(4)` before `self.run_tracker.end_run(engine_result)` in `LocalPantsRunner` made a difference. I'm suspecting the write somehow isn't going to the correct place? It's not an issue with lack of time, but location?
w
Possibly. But again: only one WorkunitStore, afaik. Some aggregation happens at read time, so that’s also a possibility.
I'd lean in to adding log statements to confirm that things are happening in the right order, and only then dive into whether data is being read/written different places
but it’s a “per-workunit” thing, which requires proper parent links, i think. not exactly sure how that works, but Tom knows more.
…given that there must be proper parent links for the workunits, if a task isn’t being created the right way, it wouldn’t have a parent workunit. i.e.,
```rust
pub struct WorkunitStoreHandle {
  pub store: WorkunitStore,
  pub parent_id: Option<SpanId>,
}
```
…the parent_id would be none
h
whereas I've confirmed the `parent_id` is `Some`
w
if it were none anywhere in the chain above where the metric was recorded
(except at the root)
but even then, i don’t think that matters actually… the workunit would still end up with the metric, so i think it would still get recorded.
so yea, sorry. can ignore the parent_id angle i think.
👍 1
h
More confirmation: the counter still doesn't work if I increment it immediately at the start of the `self.executor.spawn(async move { ... })` block. That is, I don't think this has to do with race conditions.
@fast-nail-55400 Stu and I figured it out! And we got it working, but there's a more general problem of dangling pointers with workunits that we can fix by leveraging Rust's borrow checker. Thoughts on https://github.com/pantsbuild/pants/issues/11548?
f
👀
added some questions to the issue