# development
h
I think I've identified an issue with `async_semaphore`
https://github.com/pantsbuild/pants/blob/master/src/rust/engine/async_semaphore/src/lib.rs#L103 <- whenever we drop a `Permit`, we increment `available_permits`, which is an integer, but we only look at the front item in the `waiters` queue
so if we have a case where we spawn a bunch of processes on a `BoundedCommandRunner` all at once, more than the limit
when the first `n` processes finish, they'll increment the available count but call `wake_by_ref` on the same waiting process
so if a batch of processes finishes at roughly the same time, they'll only wake up one waiting process, not `n` waiting processes
this may or may not be what's going on with the slot ID bug being discussed in #getting-started, but it seems incorrect to me
@witty-crayon-22786 do you have any thoughts?
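For illustration, a minimal self-contained sketch of the pattern being described (simplified, hypothetical names, not the actual `lib.rs` code): each dropped `Permit` bumps the counter but only ever pokes the waiter at the front of the queue.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::task::Waker;

struct Inner {
    available_permits: usize,
    waiters: VecDeque<Waker>,
}

struct Permit {
    inner: Arc<Mutex<Inner>>,
}

impl Drop for Permit {
    fn drop(&mut self) {
        let mut inner = self.inner.lock().unwrap();
        inner.available_permits += 1;
        // If several Permits are dropped before the front waiter has re-polled
        // and removed itself from the queue, every drop wakes that same front
        // waiter; the other waiters never get woken even though permits are free.
        if let Some(waker) = inner.waiters.front() {
            waker.wake_by_ref();
        }
    }
}
```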
w
… we should switch to tokio semaphore if possible.
and then implement the “slot” feature some other way.
h
that's probably a good long-term idea
w
i need to work on other things, but if the gist of the above is that the semaphore is buggy, then i would strongly endorse switching. because it isn’t the first time
tokio’s semaphore didn’t exist when I wrote that.
h
definitely makes sense (long-term) to use a common library functionality instead of bespoke code
w
re: slots: could literally do it with integers in a Stack/Vec under a mutex.
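A minimal sketch of that idea, with illustrative names (not code from the repo): pre-fill a `Vec<usize>` with the slot ids and hand them out under a `Mutex`.

```rust
use std::sync::{Arc, Mutex};

#[derive(Clone)]
struct SlotPool {
    // Free slot ids; a finished process pushes its id back for reuse.
    free: Arc<Mutex<Vec<usize>>>,
}

impl SlotPool {
    fn new(size: usize) -> Self {
        SlotPool {
            free: Arc::new(Mutex::new((1..=size).collect())),
        }
    }

    // The surrounding semaphore is what guarantees a slot is available here.
    fn acquire(&self) -> usize {
        self.free.lock().unwrap().pop().expect("bounded by the semaphore")
    }

    fn release(&self, slot: usize) {
        self.free.lock().unwrap().push(slot);
    }
}
```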
h
https://github.com/pantsbuild/pants/pull/10436 might fix things more quickly for the person who's having issues today
w
i mean, as i said: if the semaphore is buggy now, let’s switch. but yea, thanks for investigating.
@hundreds-breakfast-49010: we’re probably not going to do a release today, so maybe keep that fix green, and then if you’re still happy with it as we ramp up to a release, we can land it.
or whatever
just lemme know.
h
the above draft PR added a new, more comprehensive test for sane concurrency ID numbers in `async_semaphore`, which only passes if I make that adjustment to the call to `wake_by_ref` in the drop impl
so I'm hoping that this will solve that person's problem, and if so we should merge it soon-ish. I do think it makes sense to stop using custom semaphore code, but unless it's really easy to swap it out I think that's better off as a post-2.0 thing
w
we need to independently discuss when we’re going to do the next release. i’ll be able to discuss that starting in a few hours
so i’m going to snooze this conversation for a few hours, and then we land that and cut another release, or land any other fix.
h
kk
w
@fast-nail-55400 would also be a good reviewer for that change, probably.
👍 1
h
Looking at it now, I don't know enough Rust to review the code in any depth, but I can reason about concurrency.
h
experimenting with this code more on my dev laptop, I don't think this actually solves the problem
I think we actually need to explicitly keep a set of used concurrency IDs in the `Inner` data structure, and I have some changes locally that do this and are passing tests
so I'll update that PR in a bit
(basically, if we use the value of `available_permits` as an effective id, once enough tasks are scheduled to exhaust the ids, we end up assigning an id of `1` whenever a particular task finishes. so we count down from whatever the value of local concurrency is to 1, and then only assign 1 to processes thereafter. instead we should give a newly-scheduled process the id of whatever process just finished, which means we need to store those ids in a separate data structure, which is easy enough to do)
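Roughly the shape of that idea (an illustrative sketch, not the actual PR): `Inner` tracks the free ids explicitly, so the id returned by a finishing process is the one handed to the next waiter.

```rust
use std::collections::VecDeque;
use std::task::Waker;

struct Inner {
    free_ids: Vec<usize>,
    waiters: VecDeque<Waker>,
}

impl Inner {
    fn new(permits: usize) -> Inner {
        Inner {
            free_ids: (1..=permits).collect(),
            waiters: VecDeque::new(),
        }
    }

    // Called when a task acquires the semaphore: hand out a concrete free id
    // rather than deriving one from `available_permits`.
    fn take_id(&mut self) -> Option<usize> {
        self.free_ids.pop()
    }

    // Called from the Permit's Drop impl: return the id and wake one waiter,
    // which can then pick this id (or another free one) back up.
    fn return_id(&mut self, id: usize) {
        self.free_ids.push(id);
        if let Some(waker) = self.waiters.pop_front() {
            waker.wake();
        }
    }
}
```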
h
Yeah, looking at this I wasn't able to prove to my own satisfaction that this solves the problem.
h
with the changes I have locally right now, I can log the concurrency id number given to processes in our own tests, and now I'm seeing a wide variety of different numbers, instead of just 1 all the time after counting down from 16 (I have an 8 core CPU)
w
i still haven’t been able to look at this deeply, but if the fix is not trivial, i would continue to recommend switching to the tokio semaphore, and implementing the slot separately with a `Mutex<Vec<usize>>` full of the number of slots you want.
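A sketch of that combination, assuming tokio 1.x's `Semaphore` API (type and method names here are illustrative): the tokio semaphore bounds concurrency, and a `Mutex<Vec<usize>>` hands out the slot ids.

```rust
use std::future::Future;
use std::sync::Mutex;
use tokio::sync::Semaphore;

struct Slots {
    semaphore: Semaphore,
    free: Mutex<Vec<usize>>,
}

impl Slots {
    fn new(size: usize) -> Slots {
        Slots {
            semaphore: Semaphore::new(size),
            free: Mutex::new((1..=size).collect()),
        }
    }

    // Run `f` with an exclusive slot id, holding a semaphore permit for the duration.
    async fn with_slot<F, Fut, T>(&self, f: F) -> T
    where
        F: FnOnce(usize) -> Fut,
        Fut: Future<Output = T>,
    {
        // `acquire` only errors if the semaphore has been closed, which this
        // sketch never does.
        let _permit = self.semaphore.acquire().await.expect("semaphore closed");
        let slot = self.free.lock().unwrap().pop().expect("a permit implies a free slot");
        let result = f(slot).await;
        self.free.lock().unwrap().push(slot);
        result
    }
}
```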
f
I’ve looked at the PR, but I’m not well versed enough in synchronization stuff to opine on it in any way I can be sure of
w
ok, sorry… out of my hole now
will take a look.
@hundreds-breakfast-49010: commented: thanks