How I Improved My Rust Compile Times by 75%

2023-11-09 01:00:00 UTC
A woman with a pixie haircut looking at a computer in an office environment, waiting for her code to compile.A woman with a pixie haircut looking at a computer in an office environment, waiting for her code to compile.

If you're looking for a talented Rust developer, or a good senior software engineer, please reach out. I'm looking for a full time role doing something neat, and am available for contract gigs. Contact info is in the footer. Thank you!

There's now a Part 2, where I cover a couple more options. Check it out here

One of Rust's often mentioned pain points is slow compile times. In order to have nice things like the borrow checker, safety, and zero cost abstractions, we pay in time spent compiling. Web developers, especially those that come from one of the Javascript frameworks, really, really like fast reload times. I was fairly active in the Remix discord early on, and the number of people who asked about HMR(Hot Module Reloading) was insane. Hot Module Reloading is when the web framework will replaces javascript, html, and css modules in an existing page as they are changed without losing page state. The alternative, Live Page Reloading, will trigger a page reload when a change is detected. For Remix, the time difference between the two is signifigantly less than it would be in Rust, where we'd have to recompile.

You can imagine, as a Leptos team member, we hear a lot of the same discussion. Leptos, as a frontend Rust web framework, sits at the intersection of these two worlds We see it on the Leptos Discord all the time.

"Now, if only rebuild was fasterā€¦ šŸ«£"

"Anyone have some tips to improve compile times? Leptos is amazing but having to wait 5-10 seconds to render an extra div is very annoying"

So Leptos basically has two approaches to solve this, improve Rust's compile time itself, and/or come up with a way to output changes to parts of the page without needing to recompile first. In this piece, we'll explore both.

Improving Rust Compile Times

In order to test different changes, we need to establish a baseline. In default Rust, on my machine, how long will it take to compile my web app?

The Test Machines

I have a quite beefy test machine on which I usually compile. On my machine I'll be compiling a simpler project, and my friend Alex kindly agreed to run test his startup Houski's site,

My Machine

  • AMD 5950x processor,
  • 72GB RAM
  • SATA SSD system drive.
  • 7200RPM spinning disk storage drives
  • NVME drives
  • NixOS linux distro
  • Rust 1.75 nightly

Houski's Machine

  • AMD 7900x processor
  • 128GB RAM
  • NVME drive
  • Ubuntu linux distro
  • Rust 1.75 nightly

Leptos Build Process

A Leptos web app typicallys undergoes a two stage build process.

  1. Build Webassembly frontend module using cargo. Optimize and package it with warm-bindgen.
  2. Build the server binary with cargo.

I chose to benchmark the whole process for most of these demos, but since it's essentially two cargo steps, any relative changes would affect any cargo command.

Note: Discord User @PaulH(and a couple others) let me know about this bug that prevents the server build from using the dependencies built for the previous compilation of the server/webassembly frontend build due to changes in the RUSTFLAGS. You can fix this by having cargo-leptos > 0.2.1 and adding this to you Cargo.toml under your Leptos settings block. I did not test with this enabled.

TOML markup
[package.metadata.leptos]
separate-front-target-dir = true

Optimizations

Optimization Level

This advice comes from Bevy, which recommends setting the opt-level for development higher in order to possibly reduce dev compile times and improve performance. By default, the Rust compiler sets an opt-level of 0 for development builds. We're going to give it an opt-level of 1 for our code, and an opt-level of 3 for all the dependencies of our code.

This does come with the drawback of much less useful error messages if they come from our dependencies. So you might have to adjust the levels if you run into tricky bugs.

Since optimization takes additional time, I expect the clean compile times to increase, but we may see the increased optimizations make the code easier to compile in incremental builds. We also had some positive anecdotes on our Discord.

"fwiw @gbj @Alex Wilkinson i dropped my compile time from between 5-30 minutes to 6 seconds by putting this in my workspace Cargo.toml

We can enable adding these blocks to our Cargo.toml file. Nice and easy.

TOML markup
[profile.dev]
opt-level = 1
[profile.dev.package."*"]
opt-level = 3

Mold

For those unfamiliar with how what goes on inside the Rust compiler, it typically follows a few basic steps. It reads in source code, which is converted to a variety of types of IR(Intermediate Representation), and optimizations are performed during that conversion. Then that IR is fed to a code generator, provided by LLVM, which converts the IR into object files, and then the linker links together those object files and other system libs into one executable binary. Way more details about how that works can be found here. Fascinating reading, but a bit too deep for our discussion here.

Mold is a new linker developed by Rui Ueyama, with the goal of increasing linker performance by parallelizing the load as much as possible, and benchmarks have shown it to be signifigantly faster than Rust's default linker.

For Linux and Mac, the default linker is ld, run by cc. Windows is a different story, using Microsoft's MVC link.exe. If you're running Linux, you can use mold directly. If you're on Mac, a paid version of mold called Sold is available for Mac. If Mold generates benefits for you, I encourage you to buy Sold(very affordable) or sponsor Rui on his Github Sponsors page. Windows users, unforunately, are not supported at this time. Support for Windows in Sold is in development.

A fairly signifigant amount of time is spent linking your Rust binary, especially during incremental builds, so this could provide some real benefits.

On Linux, it's actually really easy to use, just install Mold and then prepend you cargo commands with mold -run. For example, mold -run cargo build. It can also be enabled in .cargo/config.toml, like this:

TOML markup
[target.x86_64-unknown-linux-gnu]
linker = "clang"
rustflags = ["-C", "link-arg=-fuse-ld=/path/to/mold"]

where /path/to/mold is an absolute path to the mold executable. This is also how you enable Sold, just replace the mold path with the sold path and the target to the one for your Mac.

Cranelift

In the previous optimization, we replaced the linker the Rust compiler uses. Let's try replacing the code generator, Cranelift is an alternative code generator, used instead of LLVM in the build step. While it's not good at doing as many optimizations as LLVM, it is good at spitting out code fast. It was recently integrated as an option for code generation in Rust 1.73 nightly for x86_64 linux targets. Other platforms will need to setup cranelift seperately, see their README.

bash
rustup update nightly #install nightly if you haven't already
rustup component add rustc-codegen-cranelift-preview --toolchain nightly

To use it with Cargo, it can be enabled by enabling the unstable codegen-backend feature, and then setting the codegen-backend= "cranelift" value for a profile. That can be done in .cargo/config.toml like so:

TOML markup
[unstable]
codegen-backend = true
[profile.server-dev]
codegen-backend = "cranelift"

If you don't need to worry about multi-target builds, you can also enable it below with a target. This does conflict with the desire to not use cranelift during production builds, so I wouldn't recommend it. Rust doesn't support per target profile builds, so creating your own profile is a lot more flexible.

TOML markup
[target.x86_64-unknown-linux-gnu]
rustflags = ["-Zcodegen-backend=cranelift"]

Keep in mind that Cranelift is still in development, and may have some issues with missing intrinsics. So your crate may or may not work in it natively. If you do find missing intrinsics, I encourage you to create an issue in their repo here, there may be a workaround available.

Methodology

I'm primarily interested in optimizing build times for Rust projects in development. This means I'm not concerned about filesize, optimizations, or run speed, as long as they are not signifigantly affected. We're interested in the time it takes to build our Rust Leptos apps, both from a clean state, and with incremental compilation.

To do that we're going to use hyperfine a command line benchmarking tool, to run 'cargo leptos build', which as described earlier does Leptos's two step compilation. For clean builds, we will run cargo clean before each build, in order to delete any files cached by Rust. Incremental builds are a bit harder.

For incremental builds, we'll simulate a developer modifying a single HTML tag in a Leptos component. To do that, we'll run our handy unix friend sed to insert the current date and time into an HTML tag. For this purpose, I've chosen the <dfn> tag, because I've never seen it used, and thus I probably don't have to worry about multiple replacements tags. After some finagling, I ended up with this sed command:

bash
sed -i -e "s|<dfn>[^<]*</dfn>|<dfn>$(date +%m%s)</dfn>|g" src/routes/index.rs

Each configuration is tested six times, with two warmup runs to fill any caches and reduce variability in the output. Without the warmup runs variance was a bit wider, but the results were still consistent. Six runs is probably overkill for this test, but what can I say, I got CPU time for days, or more accurately nights.

We'll measure the time it takes to complete six runs, with two warmup runs to setup filesystem caches, and hopefully provide consistent results.

I will be compiling my site, and Alex will be compiling Houski's site. Houski's is a lot more complicated than this blog, with heavy usage of Polars, Serialization/Deserialization with Serde, and plenty more routes. Neither of us has tried to reduce these times much or done any special kinds of optimization.

It's Benchmarking Time

Baseline

For both of us, these will be the times to beat.

Clean Compilation
SiteMean(s)Standard Deviation(s)
benw.is 7200RPM88.3871.817
benw.is NVME80.0570.268
houski.ca124.9930.483
Incremental Compilation
SiteMean(s)Standard Deviation(s)
benw.is 7200RPM20.4610.801
benw.is NVME19.0970.078
houski.ca40.8180.252

Not a bad baseline. Interesting to me was the variation in the 7200RPM build time for both clean and incremental builds, which might suggest that the drive was struggling to provide the data at the same rate consistently, or perhaps something else was contending with the benchmark for IO access.

Twenty or forty seconds is a pretty substantial time to rebuild the site in the dev profile, hopefully we can improve that.

Mold Enabled

Let's enable the Mold linker as described earlier, and see how that changes things!

Clean Compilation
SiteMean(s)Standard Deviation(s)Delta from Baseline(s)% Delta from Baseline
benw.is 7200RPM74.350.9614.03715.88129476
benw.is NVME67.2060.38612.85116.05231273
houski.ca102.8920.41622.10117.68179018
Incremental Compilation
SiteMean(s)Standard Deviation(s)Delta from Baseline(s)% Delta from Baseline
benw.is 7200RPM5.8420.26814.61971.44812082
benw.is NVME5.6150.05113.48270.59747604
houski.ca19.60.06721.21851.98196874

Woah! That's quite a bit faster. For my site, clean compilation time decreased by 12.04s(13.6%) on spinning disk and 12.80s(16.0%) for NVME. Houski dropped 22.10s(17.7s). Nice boost there.

Incremental compilation times changed signifigantly as well. My site on spinning disk dropped 14.62s(71.4%) and 13.5s(70.6%). That's a radical reduction, which makes some sense. Incremental compilation spends a majority of it's time in the linking step, so any upgrades there will disproportionally affect it.

As far as Leptos users go, mold only supports compiling for x84_64_linux, so the webassembly compile time remained unchanged. Most of these optimizations only affect the server build, but I will be careful to note which ones don't.

With few downsides, and such a huge upside, Mold/Sold seems like a no-brainer, especially if you are doing a lot of incremental builds!

Optimization Levels

What will adding more optimization do to compile times?

Clean Compilation
SiteMean(s)Standard Deviation(s)Delta from Baseline(s)% Delta from Baseline
benw.is 7200RPM174.3990.488-86.012-97.31295326
benw.is NVME166.8470.41-86.79-108.4102577
houski.ca288.5620.674-163.569-130.8625283
Incremental Compilation
SiteMean(s)Standard Deviation(s)Delta from Baseline(s)% Delta from Baseline
benw.is 7200RPM16.2011.3034.2620.82009677
benw.is NVME14.3340.1474.76324.94109022
houski.ca32.4890.3488.32920.40521339

I did expect a clean compile to be slower, but this is quite a bit slower. Compared to baseline, it takes twice as long for clean builds. Incremental builds do net a twenty percent improvement, which is nice to see. Contained within the listed build time is a doubling of both the webassembly and server builds, which makes sense but should be noted. For me, I don't think the benefits of setting this are worth the delays in clean compile times.

Cranelift

Clean Compilation
SiteMean(s)Standard Deviation(s)Delta from Baseline(s)% Delta from Baseline
benw.is 7200RPM67.9890.76320.39823.07805447
benw.is NVME63.6450.24716.41220.50039347
houski.ca
Incremental Compilation
SiteMean(s)Standard Deviation(s)Delta from Baseline(s)% Delta from Baseline
benw.is 7200RPM9.2010.09211.2655.03152339
benw.is NVME9.3450.2129.75251.0656124
houski.ca

Cranelift, I had high hopes for you, and you delivered! It improved both clean and incremental compilations compared to the baseline. You may notice that the houski app lines here are blank, which are due to a failure to compile due to missing llvm intrinsics.

After my testing completed, they merged this PR which added the missing llvm.x86.avx.ldu.dq.256 instrinsic, and @bjorn3 mentioned that I can avoid the missing aes intrinsics with a rust flag RUSTFLAGS="--cfg aes_force_soft". Either that, or this PR gets merged into rustup nightly. I hope to update this section soon.

If you are missing intrinsics, you can hook up a debugger to see what is causing the trap, or you can use cargo vendor and do a search for the the intrinsic name or part of the intrinsic name. I'd be sure to mention anything you're missing in the rustc cranelift repo and occasionally check in to see if there's a fix coming. @bjorn3 is on a roll there.

O3 + Mold

Clean Compilation
SiteMean(s)Standard Deviation(s)Delta from Baseline(s)% Delta from Baseline
benw.is 7200RPM170.2411.016-81.854-92.60864154
benw.is NVME168.0471.018-87.99-109.9091897
houski.ca271.9490.735-146.956-117.571384
Incremental Compilation
SiteMean(s)Standard Deviation(s)Delta from Baseline(s)% Delta from Baseline
benw.is 7200RPM5.7890.05114.67271.70715019
benw.is NVME5.7270.04113.3770.01099649
houski.ca18.150.20222.66855.53432309

Our first combination of features. Unforunately, this seems to have the slow clean compile times of O3, without decreasing compile times more than mold alone.

O3 + Mold + Cranelift

Clean Compilation
SiteMean(s)Standard Deviation(s)Delta from Baseline(s)% Delta from Baseline
benw.is 7200RPM107.4011.05-19.014-21.51221333
benw.is NVME103.5440.88-23.487-29.33784678
houski.ca
Incremental Compilation
SiteMean(s)Standard Deviation(s)Delta from Baseline(s)% Delta from Baseline
benw.is 7200RPM4.9920.32215.46975.60236548
benw.is NVME4.7670.07314.3375.03796408
houski.ca

Let's enable all the things! The incremental compiles are faster than any of the previously tested options, but the clean compiles are still suffering a penalty from O3. Cranelift seems to reduce the penalty from O3, but time will tell whether it is better than Mold + Cranelift alone. I suspect this is because Cranelift optimizes less than llvm, which is in line with their documentation.

Cranelift + O3

Clean Compilation
SiteMean(s)Standard Deviation(s)Delta from Baseline(s)% Delta from Baseline
benw.is 7200RPM110.2510.687-21.864-24.737
benw.is NVME106.7320.526-26.675-33.323
houski.ca
Incremental Compilation
SiteMean(s)Standard Deviation(s)Delta from Baseline(s)% Delta from Baseline
benw.is 7200RPM9.0960.35111.36555.54469479
benw.is NVME8.730.07610.36754.28601351
houski.ca

Adding further support to the Cranelift optimization hypothesis, we see similiar performance in the clean compilation, but slower performance in incremental without Mold enabled. It is interesting that this ran faster than the previous test with Mold enabled, although I am not sure the difference is signifigant. Perhaps Mold and Cranelift or Mold and O3 have some negative interaction I don't recognize.

Cranelift + Mold

Clean Compilation
SiteMean(s)Standard Deviation(s)Delta from Baseline(s)% Delta from Baseline
benw.is 7200RPM65.3431.24323.04426.07170738
benw.is NVME59.7770.19120.2825.33195098
houski.ca
Incremental Compilation
SiteMean(s)Standard Deviation(s)Delta from Baseline(s)% Delta from Baseline
benw.is 7200RPM4.8310.25115.6376.38922829
benw.is NVME4.6540.06714.44375.62968005
houski.ca

Here's the big money, Cranelift and Mold together. This offered the fastest clean compilation times, with a roughly 25% reduction, and incremental, sitting pretty at around 75% reduction. Interestingly enough, there seemed to be little difference between the NVME and the 7200RPM drives here, suggesting that these two might be better at using disk I/O than some of the other configurations.

Hot Reloading

For Rust web frameworks, we can try to simulate the performance of HMR, but we won't be able to substitute JS modules. Our attempt to do this is in cargo-leptos, and can be used with the --hot-reload flag. It will attempt to send html and css patches to the browser over a web socket connection when a change inside a view macro is detected that does not involve a change to a rust code block, and compile in the background. For me it works fairly well.

I didn't try to instrument and test it, as hot reload times are nearly instant. Any changes to the rust code will require a incremental recompilation though, which is why it's so useful to improve that time. JS frameworks will probably have an edge there for the foreseeable future.

Conclusion

Enabling Mold and Cranelift netted me a 75% incremental and 25% clean compilation time reduction, which is quite signifigant. Using Cranelift requires nightly, or setting it up outside rustup, which some projects might find undesirable. Since Mold requires Linux, and Sold requires Mac(and paying), for me this further increases the argument that Rust web development should be done on some for of Linux or Mac, and not on Windows. WSL2 may help improve things, but may be slower. Since I have a Mac laptop and a Linux workstation, I will be buying Sold for my Mac and using Cranelift for my projects that can.

If you feel like running some of these tests on your hardware or with your project, I have automated the process somewhat and put that repo on Github here.

Let me know what your experience was, and whether this helped you with your project by reaching out on mastodon at @benwis@hachyderm.io.