# development
h
Do you know why we use `RefCell<Option<String>>` for the `thread_local!` static `THREAD_PARENT_ID`, but we only use `Option<String>` for the `task_local!` static? I'd expect it to use `RefCell` as well? This goes back to https://github.com/pantsbuild/pants/pull/9071/files#diff-230c46e3890a73dee860295c27128ef3aadc239d2db0f6bcd1ac67db038385ef. It used to be a `Mutex<Option<String>>`.
cc @witty-crayon-22786
Also, it might make sense to pair with someone on this. I'm pretty deep in the weeds on why the metrics aren't showing up and have made lots of progress narrowing it down, but I'm a little unsure where to go now.
w
yes… the TaskLocal isn’t “mutable”… rather than “setting” it for the task, it’s scoped for a particular future
the threadlocal on the other hand is set for a task
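To make that distinction concrete, here's a minimal std-only sketch (names simplified; the real static holds a span id rather than a plain string). A `thread_local!` slot is imperatively overwritten while a thread runs, so it needs `RefCell` for interior mutability; a `task_local!` value (tokio's macro, not shown here) is instead supplied once when scoping a future, so a plain `Option` suffices:

```rust
use std::cell::RefCell;

thread_local! {
    // Per-thread slot. RefCell gives interior mutability so the value
    // can be overwritten after the thread-local is initialized.
    static THREAD_PARENT_ID: RefCell<Option<String>> = RefCell::new(None);
}

// "Setting" the value mutates the current thread's slot in place.
fn set_parent_id(id: &str) {
    THREAD_PARENT_ID.with(|slot| *slot.borrow_mut() = Some(id.to_owned()));
}

fn get_parent_id() -> Option<String> {
    THREAD_PARENT_ID.with(|slot| slot.borrow().clone())
}

fn main() {
    set_parent_id("span-1");
    println!("{:?}", get_parent_id()); // prints Some("span-1")
}
```

With `task_local!` there is no `set_*` step at all: the value is attached to a future via something like `TASK_PARENT_ID.scope(value, fut)`, which is why no `RefCell` is needed there.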
h
Got it. That makes sense. I'm rereading https://doc.rust-lang.org/std/cell/index.html
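As a quick refresher from those docs: `RefCell` moves Rust's borrow rules from compile time to runtime, so overlapping borrows panic instead of failing to compile. A minimal illustration:

```rust
use std::cell::RefCell;

fn main() {
    let cell = RefCell::new(5);
    {
        let shared = cell.borrow(); // runtime-checked shared borrow
        assert_eq!(*shared, 5);
        // Calling cell.borrow_mut() here would panic at runtime:
        // the borrow rules are enforced dynamically, not statically.
    }
    *cell.borrow_mut() += 1; // fine: the shared borrow was dropped
    println!("{}", cell.borrow()); // prints 6
}
```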
w
@hundreds-father-404: i’m surprised that the workunit stuff could be related here… the place where the metric is recorded asserts that a store is present, right?
there are a few parts to this problem:
1) making sure that metrics are written into the Store
2) making sure that something reads them out of the store
3) making sure that 1 and 2 are happening in order
we already know that 3 isn’t the case, because of https://github.com/pantsbuild/pants/issues/11433
i can pair, but before we do, let’s discuss here for a bit
h
Yep, that's what I'm confused about. I've confirmed:
* This is not an issue with the cache write not happening. Via log statements, I found that 29 writes started and 28 "finished". Yet no counters were recorded. I'm not worried about that one missing write's race condition here.
* The code in `WorkunitStore::increment_counter()` is working correctly. The store handle is present, and the counters hashmap is being incremented correctly.

I'm trying to figure out why things aren't getting wired up correctly. For example, are we writing to the correct `WorkunitStore`?

Wording nit: it'll help me if we can consistently call it a `WorkunitStore`. `Store` is wired in my head to be the caching store.
w
there is only one workunitstore per run
h
> there is only one workunitstore per run

Great, I thought that was the case but was confused. Something I'm still confused by: why aren't we using something like `Arc<Mutex<...>>` for `WorkunitStore`?

```rust
#[derive(Clone)]
pub struct Context {
  workunit_store: WorkunitStore,
  build_id: String,
}
```
w
ok, so you’ve confirmed 1… if you add some logging near where 2 happens, is it happening in time to observe 1?
because the `WorkunitStore` is internally reference counted… it’s something like `Arc<Mutex<InnerWorkunitStore>>`
👍 1
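That internal sharing is why a plain `#[derive(Clone)]` field works in `Context`. A hedged sketch of the pattern (the field and method names here are illustrative, not the actual pants source): cloning the wrapper only bumps the `Arc` refcount, so every clone locks and mutates the same inner state.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Illustrative stand-in for the real WorkunitStore: a cheap-to-Clone
// handle around shared, mutex-guarded state.
#[derive(Clone)]
pub struct WorkunitStore {
    inner: Arc<Mutex<Inner>>,
}

#[derive(Default)]
struct Inner {
    counters: HashMap<String, u64>,
}

impl WorkunitStore {
    pub fn new() -> Self {
        WorkunitStore { inner: Arc::new(Mutex::new(Inner::default())) }
    }

    pub fn increment_counter(&self, name: &str, by: u64) {
        let mut inner = self.inner.lock().unwrap();
        *inner.counters.entry(name.to_owned()).or_insert(0) += by;
    }

    pub fn get(&self, name: &str) -> u64 {
        self.inner.lock().unwrap().counters.get(name).copied().unwrap_or(0)
    }
}

fn main() {
    let store = WorkunitStore::new();
    let clone = store.clone(); // shares the same Inner via the Arc
    clone.increment_counter("cache_writes", 1);
    println!("{}", store.get("cache_writes")); // prints 1
}
```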
h
wait, with your #1, did you mean by `Store` the `WorkunitStore` or the caching `Store`?
w
`WorkunitStore`
👍 1
h
Cool. Sounds good on figuring out #2 now: whether we are actually reading from the `WorkunitStore`. That's where I'm a little confused about how to go about logging. Where is the call for that read? Streaming workunits?
w
yep
and as i said, we already know that 3 is not happening. that’s a known bug.
(c.f. that ticket and the linked “tail tasks” thing)
but i would view what you’re currently doing as confirming that 1 and 2 are happening out of order
if “finished=True”, it’s going to be the last poll before the run completes… anything that happens after that final poll will not be captured by the StreamingWorkunitHandler
the Session and WorkunitStore continue to exist after that call… so stuff can continue to happen after the client has gone away. that’s where tail tasks / background completion of the StreamingWorkunitHandler would come in
👍 1
grabbing lunch, but if you want to pair afterward, lemme know
❤️ 1
🥪 1
h
I think that first link is more relevant. The second one is only histograms, it seems. But okay, trying to figure this out. In particular, I'm wondering if it'd help to somehow add a blocking call before the read to ensure the write has finished.
w
that is Tom’s tail tasks PR
👍 1
h
Hm, neither Tom's PR nor adding `time.sleep(4)` before `self.run_tracker.end_run(engine_result)` in `LocalPantsRunner` made a difference. I'm suspecting the write somehow isn't going to the correct place? It's not an issue with lack of time, but location?
w
Possibly. But again: only one WorkunitStore, afaik. Some aggregation happens at read time, so that’s also a possibility.
I'd lean in to adding log statements to confirm that things are happening in the right order, and only then dive into whether data is being read/written different places
but it’s a “per-workunit” thing, which requires proper parent links, i think. not exactly sure how that works, but Tom knows more.
…given that there must be proper parent links for the workunits, if a task isn’t being created the right way, it wouldn’t have a parent workunit. i.e.,
```rust
pub struct WorkunitStoreHandle {
  pub store: WorkunitStore,
  pub parent_id: Option<SpanId>,
}
```
…the parent_id would be none
h
whereas I've confirmed the `parent_id` is `Some`
w
if it were none anywhere in the chain above where the metric was recorded
(except at the root)
but even then, i don’t think that matters actually… the workunit would still end up with the metric, so i think it would still get recorded.
so yea, sorry. can ignore the parent_id angle i think.
👍 1
h
More confirmation: the counter still doesn't work if I increment it immediately at the start of the `self.executor.spawn(async move { ... })` block. That is, I don't think this has to do with race conditions.
@fast-nail-55400 Stu and I figured it out! And we got it working, but there's a more general problem of dangling pointers with workunits that we can fix by leveraging Rust's borrow checker. Thoughts on https://github.com/pantsbuild/pants/issues/11548?
f
👀
added some questions to the issue