Working on <https github com pantsbuild pants issues 17282> Pants #development

Working on <https://github.com/pantsbuild/pants/is...

bitter-ability-32190

11/09/2022, 8:27 PM

Working on https://github.com/pantsbuild/pants/issues/17282 and getting a very simply change in where we symlink if the file is above some N bytes. Easy-peasy, however it's incorrect if the file needs to be writable (most notable formatters/fixers). So we could also plumb through the code files/dirs to not make symlinks and oh hey look Process has that info. Is that sufficient though? Presumably the user could want to mutate a file without it being captured. CC @witty-crayon-22786 @ancient-vegetable-10556

witty-crayon-22786

11/09/2022, 8:35 PM

That's mentioned on the ticket as a Process argument

force_mutable

witty-crayon-22786

11/09/2022, 8:36 PM

To replace the immutable_inputs dict

ancient-vegetable-10556

11/09/2022, 8:39 PM

@bitter-ability-32190 there’s a difference between “write” and “replace with a new file” though, right? Writing a file shouldn’t follow the symlink back and overwrite that?

ancient-vegetable-10556

11/09/2022, 8:40 PM

(i.e. it’s only issue if the “writing” is random-access writing into the original file, which… I don’t think is that common)

bitter-ability-32190

11/09/2022, 8:40 PM

@witty-crayon-22786 that's certainly a knob/lever at the

Process

level, but I'm thinking at the user-level. E.g. what if they wanna edit a huge-ass file in

experimental_shell_command

bitter-ability-32190

11/09/2022, 8:41 PM

I'm cool if the answer is just "yeah sorry, don't do that"

witty-crayon-22786

11/09/2022, 8:41 PM

Same thing there... Would have to expose it as a knob to override the heuristic

ancient-vegetable-10556

11/09/2022, 8:41 PM

@bitter-ability-32190 1) Yeah, don’t do that. 2)

experimental_shell_command

needs an overhaul anyway 😉

bitter-ability-32190

11/09/2022, 8:42 PM

OK, for now I think it's likely sufficient to punt and we can circle back immediately and decide how to handle this for the user (if at all).

👍 1

bitter-ability-32190

11/09/2022, 8:44 PM

@ancient-vegetable-10556 FWIW

experimental_shell_command

is only an example. Any user-defined code running in the sandbox is at risk. Which is why I'm prone to punt

ancient-vegetable-10556

11/09/2022, 8:44 PM

definitely

bitter-ability-32190

11/09/2022, 8:44 PM

So a

python_test

(for whatever reason) could also be a problem

witty-crayon-22786

11/09/2022, 8:44 PM

@ancient-vegetable-10556: any overhaul of

e_s_c

will still need a way to declare inputs

ancient-vegetable-10556

11/09/2022, 8:44 PM

@witty-crayon-22786 yes.

witty-crayon-22786

11/09/2022, 8:46 PM

@bitter-ability-32190: mm. Good point. Our guarantees are much more nebulous in that case... I think that your change would be a good time to add a note somewhere on the test goal to "not assume that inputs are mutable"

bitter-ability-32190

11/09/2022, 8:47 PM

lint

and

check

has the same affliction, but we aren't so much running "user code", so yeah I agree maybe isolate this to

test

and

run

witty-crayon-22786

11/09/2022, 8:47 PM

But I don't think it's a deprecation violation.

bitter-ability-32190

11/09/2022, 8:48 PM

Besides,

cp

is a valid solution

😉 1

witty-crayon-22786

11/09/2022, 8:48 PM

Yea.

bitter-ability-32190

11/10/2022, 9:55 AM

Actually this begs the question, since we have mutable paths, sounds everything else be readonly?

bitter-ability-32190

11/10/2022, 1:52 PM

(I might try it for the pants repo and see what goes kaboom)

witty-crayon-22786

11/10/2022, 5:40 PM

symlinking has a cost, because first you materialize the files in a particular spot, and then you symlink them somewhere else. if you don’t use them multiple times, you will have wasted that effort. the idea behind defaulting to using it for large inputs is that the overhead is a smaller fraction of the total when an input is large: i.e., single digit percentage points. but for tiny inputs, it would be more like a 50% overhead. so it’s less of a problem to guess incorrectly that something will be used multiple times.

witty-crayon-22786

11/10/2022, 5:41 PM

additionally, you have to chose the “scope” to symlink: a huge file is an easy scope. but you (probably, there is a threshold) don’t want to materialize thousands of files, and then symlink them each individually

witty-crayon-22786

11/10/2022, 5:46 PM

one thing that we can do with the heuristic is use the recursive size of a directory and its children to decide to symlink, since we have it in memory as a DigestTrie

bitter-ability-32190

11/10/2022, 5:55 PM

if you don’t use them multiple times, you will have wasted that effort.

The only effort here being creating the symlink right? This is onyl important for sandbox creation, which otherwise would be materializing the file. So for only-ever-one-sandbox we'd pay

Cost(materialize-in-tempdir) + Cost(symlink)

Cost(materialize)

which honestly should be small, right? Unless what you're alluding to is that we materialize file-by-file which could maybe be more expensive than materializing the whole digest in certain situations?

witty-crayon-22786

11/10/2022, 6:00 PM

The only effort here being creating the symlink right?

that, and marking them readonly, which we only do for immutable inputs

witty-crayon-22786

11/10/2022, 6:01 PM

for tiny files, it means 3x more syscalls

bitter-ability-32190

11/10/2022, 6:01 PM

Is it really an effort to mark? I assumed that was just the same instruction as the materialization. But I'm also pretty OS-unaware for a lot of operations 😛

witty-crayon-22786

11/10/2022, 6:01 PM

and for tiny files inodes/syscalls are the cost rather than bytes

witty-crayon-22786

11/10/2022, 6:01 PM

(filesystems suck)

bitter-ability-32190

11/10/2022, 6:02 PM

Gotta have them at some point

witty-crayon-22786

11/10/2022, 6:02 PM

*POSIX filesystems suck

bitter-ability-32190

11/10/2022, 6:02 PM

Ah yeah, ok

witty-crayon-22786

11/10/2022, 6:03 PM

(but i can almost say it unqualified, because no other filesystems APIs have every truly gained steam)

bitter-ability-32190

11/10/2022, 6:18 PM

I missed which thread this was in. So in this case I'm saying we truly just mark them readonly, but still only symlinking large files

bitter-ability-32190

11/10/2022, 6:19 PM

It'd make the code slightly cleaner, since we can internally assume permissions, instead of them being passed in

witty-crayon-22786

11/10/2022, 6:21 PM

mm, maybe. as mentioned, that’s not cheap.

witty-crayon-22786

11/10/2022, 6:21 PM

but it would be cheaper than symlinking. hm.

bitter-ability-32190

11/10/2022, 6:21 PM

When we materialize a file, is it an extra syscall to mark readonly or writable?

witty-crayon-22786

11/10/2022, 6:21 PM

i don’t know… it’s easy to check by toggling that bit on `materialize_directory`… so yea. if it’s not as expensive as i’m thinking.

witty-crayon-22786

11/10/2022, 6:21 PM

When we materialize a file, is it an extra syscall to mark readonly or writable?

yes

bitter-ability-32190

11/10/2022, 6:22 PM

If its yes for either than we're no more expensive. It's just a conditional

witty-crayon-22786

11/10/2022, 6:22 PM

er, sorry. it’s only an extra syscall to mark readonly.

bitter-ability-32190

11/10/2022, 6:22 PM

lol inclusive or

witty-crayon-22786

11/10/2022, 6:23 PM

the default perms of a file are defined by the “umask”… 99.99% of the time, the default umask is to create files as writable for the user

witty-crayon-22786

11/10/2022, 6:23 PM

to the point where you don’t even check that usually when you create a file.

bitter-ability-32190

11/10/2022, 6:23 PM

ah I see

bitter-ability-32190

11/10/2022, 6:24 PM

I'll hold off for now then

witty-crayon-22786

11/10/2022, 6:24 PM

aha: this question is already answered: https://github.com/pantsbuild/pants/pull/13857#issuecomment-991272238

witty-crayon-22786

11/10/2022, 6:25 PM

Copy code

materialize_directory/materialize_directory(ReadOnly, 10000, 100)
                        time:   [6.8972 s 6.9628 s 7.0245 s]
materialize_directory/materialize_directory(Writable, 10000, 100)
                        time:   [5.1541 s 5.2723 s 5.3758 s]

witty-crayon-22786

11/10/2022, 6:25 PM

that’s 10k files of 100 bytes each

bitter-ability-32190

11/10/2022, 6:26 PM

That seems significant, although 100 bytes seems small, considering

witty-crayon-22786

11/10/2022, 6:26 PM

the file size isn’t the relevant part… it’s about total number of syscalls

witty-crayon-22786

11/10/2022, 6:26 PM

…oh. but yea, i see what you’re saying.

bitter-ability-32190

11/10/2022, 6:26 PM

Well would a larger file not take a larger amount of time to... oh right the difference between the numbers is the same no matter what

witty-crayon-22786

11/10/2022, 6:27 PM

if you look at the other benchmarks there, you can see that the overhead is lower.

bitter-ability-32190

11/10/2022, 6:27 PM

Yeah larger files means the syscalls are more in the noise, but the overhead is linear for # of files

witty-crayon-22786

11/10/2022, 6:27 PM

yea

👍 1

2 Views

Open in Slack

Previous Next