https://pantsbuild.org/ logo
#development
Title
# development
b

bitter-ability-32190

11/09/2022, 8:27 PM
Working on https://github.com/pantsbuild/pants/issues/17282 and getting a very simply change in where we symlink if the file is above some N bytes. Easy-peasy, however it's incorrect if the file needs to be writable (most notable formatters/fixers). So we could also plumb through the code files/dirs to not make symlinks and oh hey look Process has that info. Is that sufficient though? Presumably the user could want to mutate a file without it being captured. CC @witty-crayon-22786 @ancient-vegetable-10556
w

witty-crayon-22786

11/09/2022, 8:35 PM
That's mentioned on the ticket as a Process argument
force_mutable
To replace the immutable_inputs dict
a

ancient-vegetable-10556

11/09/2022, 8:39 PM
@bitter-ability-32190 there’s a difference between “write” and “replace with a new file” though, right? Writing a file shouldn’t follow the symlink back and overwrite that?
(i.e. it’s only issue if the “writing” is random-access writing into the original file, which… I don’t think is that common)
b

bitter-ability-32190

11/09/2022, 8:40 PM
@witty-crayon-22786 that's certainly a knob/lever at the
Process
level, but I'm thinking at the user-level. E.g. what if they wanna edit a huge-ass file in
experimental_shell_command
?
I'm cool if the answer is just "yeah sorry, don't do that"
w

witty-crayon-22786

11/09/2022, 8:41 PM
Same thing there... Would have to expose it as a knob to override the heuristic
a

ancient-vegetable-10556

11/09/2022, 8:41 PM
@bitter-ability-32190 1) Yeah, don’t do that. 2)
experimental_shell_command
needs an overhaul anyway 😉
b

bitter-ability-32190

11/09/2022, 8:42 PM
OK, for now I think it's likely sufficient to punt and we can circle back immediately and decide how to handle this for the user (if at all).
👍 1
@ancient-vegetable-10556 FWIW
experimental_shell_command
is only an example. Any user-defined code running in the sandbox is at risk. Which is why I'm prone to punt
a

ancient-vegetable-10556

11/09/2022, 8:44 PM
definitely
b

bitter-ability-32190

11/09/2022, 8:44 PM
So a
python_test
(for whatever reason) could also be a problem
w

witty-crayon-22786

11/09/2022, 8:44 PM
@ancient-vegetable-10556: any overhaul of
e_s_c
will still need a way to declare inputs
a

ancient-vegetable-10556

11/09/2022, 8:44 PM
@witty-crayon-22786 yes.
w

witty-crayon-22786

11/09/2022, 8:46 PM
@bitter-ability-32190: mm. Good point. Our guarantees are much more nebulous in that case... I think that your change would be a good time to add a note somewhere on the test goal to "not assume that inputs are mutable"
b

bitter-ability-32190

11/09/2022, 8:47 PM
lint
and
check
has the same affliction, but we aren't so much running "user code", so yeah I agree maybe isolate this to
test
and
run
?
w

witty-crayon-22786

11/09/2022, 8:47 PM
But I don't think it's a deprecation violation.
b

bitter-ability-32190

11/09/2022, 8:48 PM
Besides,
cp
is a valid solution
😉 1
w

witty-crayon-22786

11/09/2022, 8:48 PM
Yea.
b

bitter-ability-32190

11/10/2022, 9:55 AM
Actually this begs the question, since we have mutable paths, sounds everything else be readonly?
(I might try it for the pants repo and see what goes kaboom)
w

witty-crayon-22786

11/10/2022, 5:40 PM
symlinking has a cost, because first you materialize the files in a particular spot, and then you symlink them somewhere else. if you don’t use them multiple times, you will have wasted that effort. the idea behind defaulting to using it for large inputs is that the overhead is a smaller fraction of the total when an input is large: i.e., single digit percentage points. but for tiny inputs, it would be more like a 50% overhead. so it’s less of a problem to guess incorrectly that something will be used multiple times.
additionally, you have to chose the “scope” to symlink: a huge file is an easy scope. but you (probably, there is a threshold) don’t want to materialize thousands of files, and then symlink them each individually
one thing that we can do with the heuristic is use the recursive size of a directory and its children to decide to symlink, since we have it in memory as a DigestTrie
b

bitter-ability-32190

11/10/2022, 5:55 PM
if you don’t use them multiple times, you will have wasted that effort.
The only effort here being creating the symlink right? This is onyl important for sandbox creation, which otherwise would be materializing the file. So for only-ever-one-sandbox we'd pay
Cost(materialize-in-tempdir) + Cost(symlink)
vs
Cost(materialize)
which honestly should be small, right? Unless what you're alluding to is that we materialize file-by-file which could maybe be more expensive than materializing the whole digest in certain situations?
w

witty-crayon-22786

11/10/2022, 6:00 PM
The only effort here being creating the symlink right?
that, and marking them readonly, which we only do for immutable inputs
for tiny files, it means 3x more syscalls
b

bitter-ability-32190

11/10/2022, 6:01 PM
Is it really an effort to mark? I assumed that was just the same instruction as the materialization. But I'm also pretty OS-unaware for a lot of operations 😛
w

witty-crayon-22786

11/10/2022, 6:01 PM
and for tiny files inodes/syscalls are the cost rather than bytes
(filesystems suck)
b

bitter-ability-32190

11/10/2022, 6:02 PM
Gotta have them at some point
w

witty-crayon-22786

11/10/2022, 6:02 PM
*POSIX filesystems suck
b

bitter-ability-32190

11/10/2022, 6:02 PM
Ah yeah, ok
w

witty-crayon-22786

11/10/2022, 6:03 PM
(but i can almost say it unqualified, because no other filesystems APIs have every truly gained steam)
b

bitter-ability-32190

11/10/2022, 6:18 PM
I missed which thread this was in. So in this case I'm saying we truly just mark them readonly, but still only symlinking large files
It'd make the code slightly cleaner, since we can internally assume permissions, instead of them being passed in
w

witty-crayon-22786

11/10/2022, 6:21 PM
mm, maybe. as mentioned, that’s not cheap.
but it would be cheaper than symlinking. hm.
b

bitter-ability-32190

11/10/2022, 6:21 PM
When we materialize a file, is it an extra syscall to mark readonly or writable?
w

witty-crayon-22786

11/10/2022, 6:21 PM
i don’t know… it’s easy to check by toggling that bit on `materialize_directory`… so yea. if it’s not as expensive as i’m thinking.
When we materialize a file, is it an extra syscall to mark readonly or writable?
yes
b

bitter-ability-32190

11/10/2022, 6:22 PM
If its yes for either than we're no more expensive. It's just a conditional
w

witty-crayon-22786

11/10/2022, 6:22 PM
er, sorry. it’s only an extra syscall to mark readonly.
b

bitter-ability-32190

11/10/2022, 6:22 PM
lol inclusive or
w

witty-crayon-22786

11/10/2022, 6:23 PM
the default perms of a file are defined by the “umask”… 99.99% of the time, the default umask is to create files as writable for the user
to the point where you don’t even check that usually when you create a file.
b

bitter-ability-32190

11/10/2022, 6:23 PM
ah I see
I'll hold off for now then
w

witty-crayon-22786

11/10/2022, 6:24 PM
Copy code
materialize_directory/materialize_directory(ReadOnly, 10000, 100)
                        time:   [6.8972 s 6.9628 s 7.0245 s]
materialize_directory/materialize_directory(Writable, 10000, 100)
                        time:   [5.1541 s 5.2723 s 5.3758 s]
that’s 10k files of 100 bytes each
b

bitter-ability-32190

11/10/2022, 6:26 PM
That seems significant, although 100 bytes seems small, considering
w

witty-crayon-22786

11/10/2022, 6:26 PM
the file size isn’t the relevant part… it’s about total number of syscalls
…oh. but yea, i see what you’re saying.
b

bitter-ability-32190

11/10/2022, 6:26 PM
Well would a larger file not take a larger amount of time to... oh right the difference between the numbers is the same no matter what
w

witty-crayon-22786

11/10/2022, 6:27 PM
if you look at the other benchmarks there, you can see that the overhead is lower.
b

bitter-ability-32190

11/10/2022, 6:27 PM
Yeah larger files means the syscalls are more in the noise, but the overhead is linear for # of files
w

witty-crayon-22786

11/10/2022, 6:27 PM
yea
👍 1