# random
w
For anyone more into Rust than me, what's the fastest way (lib or no-lib) to calculate the SHA256 of a file in Rust? As a POC, in Python I was using `hashlib.file_digest`, and then in Rust, using the `sha2` library, it took 2x as long. I swapped over to `ring` as per https://rust-lang-nursery.github.io/rust-cookbook/cryptography/hashing.html and that ended up being roughly on par with hashlib, as I would have expected. Ran this over 230k files and took the total run time as my comparison point
This is one of those rare cases where "as fast as possible" actually matters, as I'll be running it a few million times
I've played around with feature flags a bit, and it changes some of the timings, but not jaw droppingly. Haven't dug into the source code of these libs to see whether the intrinsics on my machines are fully supported or not
r
I'm not a Rust expert, but based on my experience with ripgrep, maybe you need to set some SIMD feature flags?
a
Can you post your actual code? It could be something simple - like, would wrapping the file in a buffered reader help?
w
Yeah, so for the `sha2` lib, I enabled some of the feature flags that brought it roughly in line with `ring` (like, close enough that I didn't care) - otherwise, the code I used for the hashing was pulled straight from https://rust-lang-nursery.github.io/rust-cookbook/cryptography/hashing.html#calculate-the-sha-256-digest-of-a-file - and I just played with the buffer size
Now, once I used rayon, it was a different story - the absolute time across the parallel runs was super fast
a
Bear in mind that a default build doesn't enable CPU-specific flags... Try throwing in `RUSTFLAGS="-C target-cpu=native"` (env var)
w
Yeah, I had tried a run with native enabled - small difference I think, nothing crazy
I mean, this might just be the lower-bound short of writing some crazy code, and that's totally fine. Was just wondering if I had obviously missed something attempting to use those two libs. And the python-equivalent is pretty streamlined, as it's basically C-calls, other than the act of passing data between C and Python 🤷
a
`--release` + native CPU + buffered IO are the only things that jump out to me 🙂
w
Yep! Thanks! All 3 covered 🙂
p
Late to the party, but you probably want to be able to do I/O and hashing in parallel and figure out if you're I/O bound or CPU bound. SHA256 is also probably not the best algorithm for these sorts of things unless you have to conform to some existing protocol.
w
Yeah, so after some more work - it’s pretty clear that the time due to reading in the data is substantially more than hashing or anything else. This is also intentionally not wholly optimized, as I was doing a 1:1 between a Python and Rust equivalent to see which language to write the tool in. e.g. once I ran it through rayon, the time dropped substantially, but again, not really the point of the exercise
I’ll be running other experiments looking at different hashing methodologies, partial file hashes, and all that fun parametric testing later
p
I am not an expert at I/O performance, but you will probably get performance benefits from parallelizing your reads, especially if you have a lot of small files
though it certainly depends on your hardware/filesystem too
w
Yep, thanks! The "actual" way I do this will be more clever. In the test I'm running, I'm intentionally hitting multi-gig files, small K files, symlinks, hardlinks, etc. Basically a worst-case folder design vs what I will actually run into in the future