Working on <https://github.com/pantsbuild/pants/is...
# development
b
Working on https://github.com/pantsbuild/pants/issues/17282 and getting a very simply change in where we symlink if the file is above some N bytes. Easy-peasy, however it's incorrect if the file needs to be writable (most notable formatters/fixers). So we could also plumb through the code files/dirs to not make symlinks and oh hey look Process has that info. Is that sufficient though? Presumably the user could want to mutate a file without it being captured. CC @witty-crayon-22786 @ancient-vegetable-10556
w
That's mentioned on the ticket as a Process argument
force_mutable
To replace the immutable_inputs dict
a
@bitter-ability-32190 there’s a difference between “write” and “replace with a new file” though, right? Writing a file shouldn’t follow the symlink back and overwrite that?
(i.e. it’s only issue if the “writing” is random-access writing into the original file, which… I don’t think is that common)
b
@witty-crayon-22786 that's certainly a knob/lever at the
Process
level, but I'm thinking at the user-level. E.g. what if they wanna edit a huge-ass file in
experimental_shell_command
?
I'm cool if the answer is just "yeah sorry, don't do that"
w
Same thing there... Would have to expose it as a knob to override the heuristic
a
@bitter-ability-32190 1) Yeah, don’t do that. 2)
experimental_shell_command
needs an overhaul anyway 😉
b
OK, for now I think it's likely sufficient to punt and we can circle back immediately and decide how to handle this for the user (if at all).
👍 1
@ancient-vegetable-10556 FWIW
experimental_shell_command
is only an example. Any user-defined code running in the sandbox is at risk. Which is why I'm prone to punt
a
definitely
b
So a
python_test
(for whatever reason) could also be a problem
w
@ancient-vegetable-10556: any overhaul of
e_s_c
will still need a way to declare inputs
a
@witty-crayon-22786 yes.
w
@bitter-ability-32190: mm. Good point. Our guarantees are much more nebulous in that case... I think that your change would be a good time to add a note somewhere on the test goal to "not assume that inputs are mutable"
b
lint
and
check
has the same affliction, but we aren't so much running "user code", so yeah I agree maybe isolate this to
test
and
run
?
w
But I don't think it's a deprecation violation.
b
Besides,
cp
is a valid solution
😉 1
w
Yea.
b
Actually this begs the question, since we have mutable paths, sounds everything else be readonly?
(I might try it for the pants repo and see what goes kaboom)
w
symlinking has a cost, because first you materialize the files in a particular spot, and then you symlink them somewhere else. if you don’t use them multiple times, you will have wasted that effort. the idea behind defaulting to using it for large inputs is that the overhead is a smaller fraction of the total when an input is large: i.e., single digit percentage points. but for tiny inputs, it would be more like a 50% overhead. so it’s less of a problem to guess incorrectly that something will be used multiple times.
additionally, you have to chose the “scope” to symlink: a huge file is an easy scope. but you (probably, there is a threshold) don’t want to materialize thousands of files, and then symlink them each individually
one thing that we can do with the heuristic is use the recursive size of a directory and its children to decide to symlink, since we have it in memory as a DigestTrie
b
if you don’t use them multiple times, you will have wasted that effort.
The only effort here being creating the symlink right? This is onyl important for sandbox creation, which otherwise would be materializing the file. So for only-ever-one-sandbox we'd pay
Cost(materialize-in-tempdir) + Cost(symlink)
vs
Cost(materialize)
which honestly should be small, right? Unless what you're alluding to is that we materialize file-by-file which could maybe be more expensive than materializing the whole digest in certain situations?
w
The only effort here being creating the symlink right?
that, and marking them readonly, which we only do for immutable inputs
for tiny files, it means 3x more syscalls
b
Is it really an effort to mark? I assumed that was just the same instruction as the materialization. But I'm also pretty OS-unaware for a lot of operations 😛
w
and for tiny files inodes/syscalls are the cost rather than bytes
(filesystems suck)
b
Gotta have them at some point
w
*POSIX filesystems suck
b
Ah yeah, ok
w
(but i can almost say it unqualified, because no other filesystems APIs have every truly gained steam)
b
I missed which thread this was in. So in this case I'm saying we truly just mark them readonly, but still only symlinking large files
It'd make the code slightly cleaner, since we can internally assume permissions, instead of them being passed in
w
mm, maybe. as mentioned, that’s not cheap.
but it would be cheaper than symlinking. hm.
b
When we materialize a file, is it an extra syscall to mark readonly or writable?
w
i don’t know… it’s easy to check by toggling that bit on `materialize_directory`… so yea. if it’s not as expensive as i’m thinking.
When we materialize a file, is it an extra syscall to mark readonly or writable?
yes
b
If its yes for either than we're no more expensive. It's just a conditional
w
er, sorry. it’s only an extra syscall to mark readonly.
b
lol inclusive or
w
the default perms of a file are defined by the “umask”… 99.99% of the time, the default umask is to create files as writable for the user
to the point where you don’t even check that usually when you create a file.
b
ah I see
I'll hold off for now then
w
Copy code
materialize_directory/materialize_directory(ReadOnly, 10000, 100)
                        time:   [6.8972 s 6.9628 s 7.0245 s]
materialize_directory/materialize_directory(Writable, 10000, 100)
                        time:   [5.1541 s 5.2723 s 5.3758 s]
that’s 10k files of 100 bytes each
b
That seems significant, although 100 bytes seems small, considering
w
the file size isn’t the relevant part… it’s about total number of syscalls
…oh. but yea, i see what you’re saying.
b
Well would a larger file not take a larger amount of time to... oh right the difference between the numbers is the same no matter what
w
if you look at the other benchmarks there, you can see that the overhead is lower.
b
Yeah larger files means the syscalls are more in the noise, but the overhead is linear for # of files
w
yea
👍 1