Does anyone have any experience using `mold` while...
# development
w
Does anyone have any experience using `mold` while building Pants? I'm testing on an old/slow computer, and it doesn't appear to improve link time at all - I was hoping there would be some per-core improvement.
g
Can confirm 🙂
```
MOLD:    Finished `release` profile [optimized + debuginfo] target(s) in 3m 37s
 LLD:    Finished `release` profile [optimized + debuginfo] target(s) in 3m 28s
```
w
.... huh... that's surprising, given everything I've read about mold and their own first-party numbers. I still have some messing around to do, but I confirmed mold was actually used to link.
g
Just running `RUSTFLAGS="-Ztime-passes" cargo +nightly build --release`: by far the biggest item is LLVM, and link time is negligible.
```
time:   0.000; rss:   39MB ->   40MB (   +1MB)  parse_crate
time:   0.000; rss:   44MB ->   45MB (   +0MB)  crate_injection
time:   0.396; rss:   45MB ->  219MB ( +174MB)  expand_crate
time:   0.396; rss:   45MB ->  219MB ( +174MB)  macro_expand_crate
time:   0.004; rss:  219MB ->  219MB (   +0MB)  AST_validation
time:   0.002; rss:  219MB ->  219MB (   +0MB)  finalize_imports
time:   0.001; rss:  219MB ->  220MB (   +0MB)  finalize_macro_resolutions
time:   0.084; rss:  220MB ->  254MB (  +35MB)  late_resolve_crate
time:   0.005; rss:  254MB ->  254MB (   +0MB)  resolve_check_unused
time:   0.007; rss:  254MB ->  254MB (   +0MB)  resolve_postprocess
time:   0.100; rss:  219MB ->  254MB (  +35MB)  resolve_crate
time:   0.009; rss:  268MB ->  268MB (   +0MB)  drop_ast
time:   0.021; rss:  260MB ->  260MB (   +1MB)  misc_checking_1
time:   0.535; rss:  260MB ->  404MB ( +144MB)  coherence_checking
time:   2.494; rss:  260MB ->  471MB ( +211MB)  type_check_crate
time:   2.416; rss:  471MB ->  569MB (  +97MB)  MIR_borrow_checking
time:   0.031; rss:  569MB ->  570MB (   +1MB)  module_lints
time:   0.031; rss:  569MB ->  570MB (   +1MB)  lint_checking
time:   0.029; rss:  570MB ->  570MB (   +0MB)  privacy_checking_modules
time:   0.078; rss:  569MB ->  570MB (   +2MB)  misc_checking_3
time:   0.994; rss:  570MB ->  593MB (  +23MB)  monomorphization_collector_root_collections
time:   2.789; rss:  593MB ->  880MB ( +287MB)  monomorphization_collector_graph_walk
time:   0.416; rss:  886MB ->  912MB (  +26MB)  partition_and_assert_distinct_symbols
time:   0.000; rss:  904MB ->  905MB (   +1MB)  write_allocator_module
time:   4.823; rss:  905MB -> 1660MB ( +755MB)  codegen_to_LLVM_IR
time:   9.058; rss:  570MB -> 1660MB (+1090MB)  codegen_crate
time:  77.479; rss: 1660MB -> 1559MB ( -101MB)  LLVM_passes
time:  77.378; rss: 1296MB -> 1559MB ( +264MB)  finish_ongoing_codegen
time:   0.863; rss: 1515MB ->  779MB ( -736MB)  run_linker
time:   0.019; rss:  779MB ->  779MB (   +0MB)  link_binary_remove_temps
time:   0.895; rss: 1559MB ->  779MB ( -781MB)  link_binary
time:   0.896; rss: 1559MB ->  779MB ( -781MB)  link_crate
time:   0.896; rss: 1559MB ->  779MB ( -781MB)  link
time:  93.179; rss:   28MB ->  147MB ( +119MB)  total
    Finished `release` profile [optimized + debuginfo] target(s) in 2m 30s
```
This is just for the final `engine` step, but out of ~93 seconds, we spent about 1 second in the linker. The rest was essentially codegen.
I think one issue for Pants' native code is that it is very funnel-shaped, with a lot of heavy compilation also happening in `engine`, bottlenecking the compilation overall.
(i.e., looking at `cargo build --timings`, half the build time is just `engine`, of which the majority is codegen/optimization.)
w
I need to dig into this more - but when I was testing some of this earlier in the year, a one-character change triggered a several-minute build, and I thought both LLVM and link times were responsible. If that's not true, this is interesting.
g
So we set `codegen-units = 1` for release builds, and I get similarly terrible performance on a work project if I do that. Just changing it to `codegen-units = 16` and enabling thin LTO instead shaves a minute off the compile time. I'd imagine splitting `engine` into 2-3 crates would have a similar effect.
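Roughly this shape of change, as a sketch - the values are illustrative, not necessarily what Pants should ship:
```toml
# Sketch of the trade-off described above: give up codegen-units = 1
# in exchange for more parallel codegen plus thin LTO.
[profile.release]
codegen-units = 16   # more units compile in parallel, at some optimization cost
lto = "thin"         # thin LTO recovers much of the cross-unit optimization cheaply
```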
w
So, `codegen-units = 1` is one of the classic "speeds up runtime performance" settings, if I'm not mistaken
g
Oh, and Pants always builds in release, and release doesn't do incremental builds by default, so that's likely a cause as well in the example you gave
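For what it's worth, incremental compilation can be switched on for release explicitly - a sketch of the knob, not a recommendation:
```toml
# Sketch: release builds use incremental = false by default; this opts back in.
[profile.release]
incremental = true
```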
w
Yeah, I'm specifically talking about release builds here; debug is much less of an issue
This is really interesting stuff, and I wonder how true this still is after all the call-by-name work. As in, was startup time the big problem, or was it "everything" being substantially slower?
```
We default to compiling with Rust's release mode, instead of its debug mode, because this makes Pants substantially faster. However, this results in the compile taking 5-10x longer.

If you are okay with Pants running much slower when iterating, set the environment variable MODE=debug and rerun pants to compile in debug mode.
```
Spot on, as usual, Tom. Swapped the codegen units to default, and my incremental release build is like 4-5x faster. Link time was negligible
There's also a world, soon, where we can remove that too - a trivial incremental debug build without it is 2 seconds
```toml
[profile.dev]
# Increase the optimization level of the dev profile slightly, as otherwise `rule_graph`
# solving takes prohibitively long.
opt-level = 1
```
g
if it isn't already, we could move the rule graph to a separate crate that always compiles at max optimization, either way
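That, or a per-package profile override might get a similar effect without restructuring anything - a sketch, assuming the crate really is named `rule_graph`:
```toml
# Sketch: keep the dev profile fast overall, but optimize just the rule-graph
# crate. The crate name `rule_graph` is an assumption here.
[profile.dev.package.rule_graph]
opt-level = 3
```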
w
True - but I think with call-by-name, the goal is to remove it entirely (or mostly, or slightly, or whatever)