# general
q
Hi Folks - I'm currently hitting the MDB_MAP_FULL error on v2.16.0. We've been experimenting with using a "shared" cache by having multiple people point to the same location on a shared dev machine, and lmdb_store is getting > 200GB. Previous answers I've seen are "pantsd should garbage-collect", and it seems to be not doing that? Of course I can either configure the cache to be bigger, or wipe lmdb_store periodically, but thought it would be good to chase down why the cache size isn't being bounded over time. Thanks!
e
I've just re-checked the code and gc only happens when using pantsd (the default) and when pantsd has been up for at least 1 hour (it's hourly).
> We've been experimenting with using a "shared" cache by having multiple people point to the same location on a shared dev machine
Which leads me to question this. Are users all ssh'd in with the same account or is the dev machine lmdb store somehow shared via NFS or sshfs or something like that?
q
users are ssh'd in with their own accounts, and we just set `[GLOBAL] local_store_dir=/opt/some/place` in pants.toml. So everyone has their own pantsd running, but a common lmdb_store
e
And the 1 hour thing presumably isn't what's getting in the way here - folks certainly have pantsd up for >1 hour?
q
yeah for sure
i guess it might be possible that something huge got written to the cache and pantsd hadn't been invoked yet
e
Ok then. Yeah, I don't know much about LMDB or our detailed gc code, but this has at the very least occupied FUD status in my mind for a long time. As you say, lots of reports of GC not working.
q
nah don't think so
is there any way of poking GC to run, at least to see what it does?
e
So, the one thing you can check for is ... just a sec while I find the log line. The line actually shows the GC attempt kicks off every hour.
And you check for that in `.pants.d/pants.log` (by default) in any repo where you run `pants`.
In terms of poking - it's just Python, so you could whip up some code to manually call the same method the code I linked does, but there's nothing pre-existing.
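If you'd rather script that check than eyeball the log, a rough sketch (untested; the path and the log text are just what's described above):

```python
# Rough sketch (untested): count the hourly GC attempts recorded in a repo's
# pants.log. Uses the default log location mentioned above; run from the repo root.
from pathlib import Path

log = Path(".pants.d/pants.log")
gc_lines = [line for line in log.read_text().splitlines()
            if "Garbage collecting store" in line]
print(f"{len(gc_lines)} GC attempts logged; most recent:")
print(gc_lines[-1] if gc_lines else "  (none found)")
```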
q
yep, looks like it's running:
`.pants.d/pants.log:08:54:11.24 [INFO] Garbage collecting store. target_size=28,800,000,000`
e
Ok, yeah. I have a hard time believing there is 1TB (for example) of live artifacts. I think the default lease is 2 hours.
You have 2TB LMDB right?
Oh, 200GB. Even then, I have a hard time believing there are 100GB of live artifacts. But ... in the ML world I have very recently seen PEXes >4GB due to PyTorch, etc.; so it's not completely out of the realm of possibility.
Modern software is ridiculous!
q
right. yeah we're not doing any ML. We are building hardware-related artifacts from verilog, which can get big. Before we tried a unified cache, folks' LMDBs were around 40GB. My hope was that there would be some de-dup, so maybe 200GB is a reasonable size given our experience
and maybe I should just increase the size and see if it tapers off
e
Hrm, yeah. I'd keep a bit of a hawk eye for sure. Like I said, FUD, but I'm not fully confident GC works in practice. It's just hard to reason about since the blob store is pretty opaque - you can't really see what is in there with what timestamps and relate that to file paths, etc. in any simple way right now.
q
ok I'll monitor and see how it goes. we currently clean up tmp with cron anyway, can always blast away the cache and maybe rebuild from master every night. Thanks for your help!
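something like this in the nightly cron is roughly what I have in mind (sketch only - the path is our local_store_dir from pants.toml, the threshold is arbitrary, and we'd run it when nothing is building):

```python
#!/usr/bin/env python3
# Sketch for a nightly cron job: report the shared store's size and blow it
# away once it crosses a threshold. No error handling; Pants should recreate
# the directory on the next run.
import shutil
from pathlib import Path

STORE = Path("/opt/some/place")   # our [GLOBAL] local_store_dir
THRESHOLD = 200 * 10**9           # 200 GB, picked arbitrarily

def dir_size(path: Path) -> int:
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

size = dir_size(STORE)
print(f"lmdb store size: {size / 10**9:.1f} GB")
if size > THRESHOLD:
    shutil.rmtree(STORE)
    print("over threshold - wiped the store")
```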
e
So one thing that will potentially limit de-dup is the Processes run by Pants to achieve goals. Those are each keyed by the hash of a protobuf message including the process's env + args + required input files digest. Of those, the env can vary from user to user if PATH is leaked and PATH has user homedir entries, or if something like USER is in the leaked env. I'm not abreast of how well we've cleaned up PATH leaks in the default case, but there are knobs that allow extra leaking. The quick way to check this, though, is to just run `pants` with `--keep-sandboxes=always` and check a few of those sandboxes out. The `__run.sh` script will show the env vars used.
That said, the leak there is literally just the protobuf message. As long as 2 processes with a different PATH produce the same outputs, the output blobs are de-duped and stored once based on their content hash. So - presumably - the heavy-hitters on space are, in fact, file blobs, and those definitely de-dup. It's just the Process descriptors that might blow up storage a bit.
FactCheck.org rating: 15% hand wave - but basically on point.
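To make the keying point concrete, a toy sketch - this is not our actual protobuf message or hashing, just the shape of the problem:

```python
# Toy illustration only -- not Pants' real cache key format. The point: if PATH
# (or USER, etc.) leaks into the process description, two users running the
# identical command get two cache entries, even though the output file blobs
# themselves still de-dup by content hash.
import hashlib
import json

def toy_process_key(argv: list[str], env: dict[str, str], input_digest: str) -> str:
    canonical = json.dumps(
        {"argv": argv, "env": env, "input_digest": input_digest}, sort_keys=True
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

common = dict(argv=["cc", "-c", "foo.c"], input_digest="abc123")
key_a = toy_process_key(env={"PATH": "/home/alice/bin:/usr/bin"}, **common)
key_b = toy_process_key(env={"PATH": "/home/bob/bin:/usr/bin"}, **common)
print(key_a == key_b)  # False: same work, two Process descriptors stored
```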
@quick-caravan-31864 what language stack(s) do you use in Pants? I'm the PEX guy, so I know about Python blobs: those should all be reproducible, with PEX zip files using 1/1/1980 timestamps internally, consistent zip entry sorting, etc., but that's another thing to look into.
Ok, last from me. It looks like pantsd, while alive, bumps the lease on all objects it has loaded / stored in LMDB every ~80 seconds. So if you start pantsd today and it's still up 30 days from now, it will run GC every hour but never GC anything it has known about in the 30 day span it has been up (if I read that correctly). It will only consider GCing objects it has not used in that span. Those log lines look like:
```
09:53:39.53 [INFO] Extending leases
09:53:39.55 [INFO] Done extending leases
09:54:59.55 [INFO] Extending leases
09:54:59.56 [INFO] Done extending leases
```
So, if you do have long lived pantsds on the shared machine, you could have users bounce them every so often or cron that or something maybe.
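If you do go the cron route, a rough sketch of the bounce (assumes psutil is installed, matching "pantsd" in the command line is a heuristic, and proc.kill() is abrupt - so test it first):

```python
# Sketch: kill pantsd processes that have been alive longer than a day.
# Run it as a user with permission to signal the processes (e.g. root, via cron).
import time
import psutil

MAX_AGE_SECONDS = 24 * 3600  # bounce anything older than a day

for proc in psutil.process_iter(["pid", "cmdline", "create_time"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    age = time.time() - proc.info["create_time"]
    if "pantsd" in cmdline and age > MAX_AGE_SECONDS:
        print(f"killing pantsd pid={proc.info['pid']} (up {age / 3600:.0f}h)")
        proc.kill()
```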
q
ooh i see, yeah killing pantsd in a cron might be a lightweight way to go
we're using mostly C++ (compiling via shell_command), some python, and verilog
e
Ok, yeah - so I don't know about `.a`, `.so`, etc., but it would be interesting to know if those run afoul of non-reproducible builds in terms of internal file format ordering and embedded timestamps, etc.
And I know 0 about verilog! But, you know your tools so you can figure this out if you don't already know the answers.
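A crude first probe, whatever the toolchain: produce the same artifact twice outside the cache and hash the results (paths here are made up):

```python
# Sketch: compare two builds of the same artifact by content hash. If they
# differ, something non-reproducible (timestamps, ordering, ...) is leaking in.
import hashlib
from pathlib import Path

def digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

a = Path("/tmp/build1/libfoo.a")  # hypothetical first build output
b = Path("/tmp/build2/libfoo.a")  # hypothetical second build output
print("reproducible" if digest(a) == digest(b)
      else "differs - likely embedded timestamps or ordering")
```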
q
it could be. as for `__run.sh` and env vars, `shell_command` makes a hermetic PATH with a `_binary_shims` directory, so there doesn't seem to be anything user-specific there
e
OK, good. Your C++ and verilog toolchains may support https://reproducible-builds.org/docs/source-date-epoch/, which would allow you to backdoor a fixed timestamp.
If that's even an issue at all.
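If it is, and if the `shell_command` in your Pants version accepts `extra_env_vars` (worth confirming with `pants help shell_command` - I'm going from memory), pinning it per target would look roughly like:

```python
# BUILD file sketch. Assumes shell_command supports extra_env_vars; the epoch
# value is arbitrary and only helps if the toolchain honors SOURCE_DATE_EPOCH.
shell_command(
    name="libfoo",
    command="make libfoo.a",
    extra_env_vars=["SOURCE_DATE_EPOCH=315532800"],
    # ... plus your existing tools/outputs fields
)
```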
Pants does have a `brfs` tool (build result file system) that does allow you to poke around in an LMDB store fwiw. It's just not very friendly / pretty low level.
I don't think we ship this anywhere but could be wrong. You might have to build it if you get to the point of wanting to really dig into what is held in your LMDB.