Inline assembly in Rust, via the asm! macro, lets developers place assembly instructions directly in Rust code. This gives finer control over the hardware, which can be critical in systems programming, performance-critical code, or code that depends on specific CPU instructions.
In this article, we’ll cover the basics of using the asm! macro in Rust, highlight its syntax, and showcase a working example.
Understanding Inline Assembly with asm!
The asm! macro allows Rust programs to embed assembly instructions inline with Rust code. It first appeared as a nightly-only feature, replacing the older llvm_asm! syntax with a more robust and flexible approach to inline assembly, and it has been stable since Rust 1.59.
Note: On Rust 1.59 or later, asm! works with the stable compiler. Only older toolchains need the nightly compiler and the #![feature(asm)] attribute.
Enabling Inline Assembly in Rust
Import the asm! macro at the beginning of your Rust file:
use std::arch::asm;
Basic Syntax of asm!
The basic structure of asm! in Rust is:
asm!(
    "<assembly code>",
    <operands>,
    options(<options>)
);
- Assembly Code: One or more string literals containing the assembly instructions. Placeholders such as {0} or {name} in the string refer to the operands that follow.
- Operands: Inputs and outputs such as in(reg) x, out(reg) y, or inout(reg) z, which bind Rust variables to registers.
- Options: Optional flags that tell the compiler how the asm! block behaves, such as nomem, preserves_flags, or nostack.
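As a minimal sketch of how the template string, operands, and options fit together, the helper below copies a value through registers chosen by the compiler. The name copy_via_asm is hypothetical, and the example assumes an x86-64 target, with a plain Rust fallback on other architectures.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::asm;

/// Hypothetical helper: copies `x` into `y` with a single `mov`.
#[cfg(target_arch = "x86_64")]
fn copy_via_asm(x: u64) -> u64 {
    let y: u64;
    unsafe {
        asm!(
            // Template string: {dst} and {src} refer to the named operands.
            "mov {dst}, {src}",
            // Operands: one output register and one input register.
            dst = out(reg) y,
            src = in(reg) x,
            // Options: no side effects, no memory access, no stack use,
            // and `mov` leaves the flags untouched.
            options(pure, nomem, nostack, preserves_flags),
        );
    }
    y
}

/// Fallback so the sketch still builds on other architectures.
#[cfg(not(target_arch = "x86_64"))]
fn copy_via_asm(x: u64) -> u64 {
    x
}

fn main() {
    println!("{}", copy_via_asm(42)); // prints 42
}
```

Named operands are used here instead of positional ones purely for readability; both forms are accepted.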
A Simple asm! Example
Let’s create a simple example where we use asm! to add two numbers. The goal here is to use the add assembly instruction directly to demonstrate how inline assembly can interact with Rust variables. The examples in this article target x86-64, where asm! uses Intel syntax by default.
use std::arch::asm;
fn main() {
    let x: u32 = 10;
    let y: u32 = 20;
    let result: u32;
    unsafe {
        // x86 add is two-operand: the first register is both source and destination.
        asm!(
            "add {0:e}, {1:e}",
            inout(reg) x => result,
            in(reg) y,
        );
    }
    println!("The result of {} + {} is {}", x, y, result);
}
inout(reg) x => result: Loads x into a register chosen by the compiler as an input, and writes that register's final value to result as the output.
in(reg) y: Specifies y as an input register for the add operation.
add {0:e}, {1:e}: This line uses the x86 add instruction, which takes two operands and stores the sum in the first one. The :e modifier selects the 32-bit name of each register (for example eax rather than rax), matching the u32 operands.
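To make the instruction's behavior easier to check, the same addition can be wrapped in a function and compared against Rust's own arithmetic. This is a sketch for x86-64, where add is a two-operand instruction that overwrites its first operand; the name add_u32 is hypothetical, and other targets fall back to wrapping_add.

```rust
/// Hypothetical wrapper around the x86 `add` instruction.
#[cfg(target_arch = "x86_64")]
fn add_u32(a: u32, b: u32) -> u32 {
    use std::arch::asm;
    let sum: u32;
    unsafe {
        asm!(
            // The 32-bit add wraps around on overflow, like `wrapping_add`.
            "add {acc:e}, {rhs:e}",
            acc = inout(reg) a => sum,
            rhs = in(reg) b,
            // `add` changes the flags, so `preserves_flags` is not set.
            options(pure, nomem, nostack),
        );
    }
    sum
}

/// Fallback for targets without this sketch's assembly.
#[cfg(not(target_arch = "x86_64"))]
fn add_u32(a: u32, b: u32) -> u32 {
    a.wrapping_add(b)
}

fn main() {
    println!("{}", add_u32(10, 20)); // prints 30
    println!("{}", add_u32(u32::MAX, 1)); // prints 0
}
```

Unlike the + operator in a debug build, the instruction does not panic on overflow; it silently wraps, which is why the second call prints 0.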
Compiling and Running the Code
To compile and run this code, follow these steps:
Check Your Toolchain: Ensure you’re using Rust 1.59 or later; the nightly compiler is no longer required.
rustup update stable
Run the Program:
cargo run
If everything works correctly, the output should be:
The result of 10 + 20 is 30
A More Advanced Example: Bitwise Operation with asm!
For a more complex example, let’s implement a bitwise XOR operation using inline assembly:
use std::arch::asm;
fn main() {
    let a: u32 = 0b1100;
    let b: u32 = 0b1010;
    let result: u32;
    unsafe {
        // x86 xor is two-operand: the first register is both source and destination.
        asm!(
            "xor {0:e}, {1:e}",
            inout(reg) a => result,
            in(reg) b,
        );
    }
    println!("The result of {} XOR {} is {:b}", a, b, result);
}
This example demonstrates the xor instruction, which performs a bitwise XOR operation. The output should be:
The result of 12 XOR 10 is 110
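The output reads 110 rather than 0110 because {:b} drops leading zeros. A small sketch, using the ^ operator in place of the assembly and a hypothetical padded_binary helper, shows how a width specifier restores them:

```rust
/// Hypothetical helper: formats `value` in binary, zero-padded to `width` digits.
fn padded_binary(value: u32, width: usize) -> String {
    format!("{:0width$b}", value, width = width)
}

fn main() {
    let a: u32 = 0b1100;
    let b: u32 = 0b1010;
    let result = a ^ b;

    println!("{:b}", result); // prints 110
    println!("{}", padded_binary(result, 4)); // prints 0110
}
```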
Let’s explore a real-world example that showcases both the utility and performance benefits of inline assembly. In this case, we’ll use inline assembly to implement a CPU cycle counter to measure the execution time of specific code segments in Rust. Measuring CPU cycles is crucial for profiling performance in embedded systems, high-frequency trading, cryptography, and other performance-critical applications.
Real-World Use Case: CPU Cycle Counter
Counting CPU cycles is essential for precise performance profiling, as it provides a direct measurement of how long code takes to execute at the CPU level. This approach is particularly useful in real-time systems where nanosecond precision is required, or in embedded systems where power and processing resources are limited.
Benefits of Using Inline Assembly
- Precision: Inline assembly allows us to read the CPU’s time-stamp counter directly, which gives finer-grained readings than standard functions like std::time::Instant.
- Efficiency: A single instruction such as RDTSC (Read Time-Stamp Counter) avoids the function-call and possible system-call overhead of higher-level timing APIs.
- Platform-Specific Optimization: With inline assembly, we can leverage platform-specific instructions optimized for certain CPU architectures.
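For comparison, the standard library already exposes this particular instruction as the core::arch::x86_64::_rdtsc intrinsic, so inline assembly is not the only route to it. The sketch below wraps the intrinsic in a hypothetical read_tsc function that returns None on targets other than x86-64.

```rust
/// Hypothetical wrapper: reads the time-stamp counter via the intrinsic.
#[cfg(target_arch = "x86_64")]
#[allow(unused_unsafe)]
fn read_tsc() -> Option<u64> {
    // The intrinsic compiles down to a single `rdtsc` instruction.
    let ticks = unsafe { core::arch::x86_64::_rdtsc() };
    Some(ticks)
}

/// No time-stamp counter is read on other architectures in this sketch.
#[cfg(not(target_arch = "x86_64"))]
fn read_tsc() -> Option<u64> {
    None
}

fn main() {
    match read_tsc() {
        Some(ticks) => println!("time-stamp counter: {}", ticks),
        None => println!("no time-stamp counter on this target"),
    }
}
```

Inline assembly becomes the better fit when the read has to be combined with other instructions in one block, as the serialized examples later in the article do.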
Implementing a Cycle Counter with Inline Assembly
In x86 and x86-64 architectures, we can use the RDTSC instruction to access the CPU’s time-stamp counter directly. Let’s see how to implement this using asm! in Rust.
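use std::arch::asm;
use std::hint::black_box;
fn get_cpu_cycles() -> u64 {
    let high: u32;
    let low: u32;
    unsafe {
        // rdtsc writes the counter to edx:eax and leaves the flags untouched.
        asm!(
            "rdtsc",
            out("eax") low,
            out("edx") high,
            options(nostack, preserves_flags),
        );
    }
    ((high as u64) << 32) | (low as u64)
}
fn main() {
    let start = get_cpu_cycles();
    // u64 avoids overflow: the sum (499_999_500_000) does not fit in an i32.
    let mut sum: u64 = 0;
    for i in 0..1_000_000u64 {
        // black_box keeps the optimizer from folding the loop into a constant.
        sum += black_box(i);
    }
    let end = get_cpu_cycles();
    println!("The sum is: {}", sum);
    println!("CPU cycles taken: {}", end - start);
}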
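get_cpu_cycles(): This function uses the RDTSC (Read Time-Stamp Counter) instruction to read the CPU's time-stamp counter, which has been counting since the last reset. On most modern x86 CPUs the counter ticks at a constant rate regardless of the current core frequency, so it measures elapsed reference ticks rather than actual core clock cycles.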
out("eax") low and out("edx") high specify output registers. In x86 assembly, RDTSC places the low 32 bits of the cycle count in EAX and the high 32 bits in EDX.
The high and low parts are combined into a single 64-bit value to represent the full cycle count.
Performance Measurement: The main function demonstrates a simple way to measure CPU cycles for a block of code. We capture the start cycle count before a loop and the end count after, allowing us to calculate the cycles taken for the loop.
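A single pair of readings is noisy, because interrupts and cache misses inflate individual runs. One common refinement, sketched here under the assumption of an x86-64 target, repeats the measurement and keeps the smallest tick count. The names read_tsc and min_ticks are hypothetical, and on other targets the fallback simply returns 0.

```rust
use std::hint::black_box;

/// Hypothetical helper: reads the time-stamp counter with `rdtsc`.
#[cfg(target_arch = "x86_64")]
fn read_tsc() -> u64 {
    use std::arch::asm;
    let low: u32;
    let high: u32;
    unsafe {
        asm!(
            "rdtsc",
            out("eax") low,
            out("edx") high,
            options(nostack, preserves_flags),
        );
    }
    ((high as u64) << 32) | (low as u64)
}

/// Placeholder for targets without `rdtsc`; always returns 0 in this sketch.
#[cfg(not(target_arch = "x86_64"))]
fn read_tsc() -> u64 {
    0
}

/// Runs `work` `runs` times and returns the smallest tick count observed.
/// Returns u64::MAX if `runs` is 0.
fn min_ticks<F: FnMut()>(runs: usize, mut work: F) -> u64 {
    let mut best = u64::MAX;
    for _ in 0..runs {
        let start = read_tsc();
        work();
        let end = read_tsc();
        // wrapping_sub avoids a panic if the two reads are ever out of order.
        best = best.min(end.wrapping_sub(start));
    }
    best
}

fn main() {
    let ticks = min_ticks(100, || {
        let mut sum: u64 = 0;
        for i in 0..1_000u64 {
            sum += black_box(i);
        }
        black_box(sum);
    });
    println!("fastest run: {} ticks", ticks);
}
```

The minimum is used rather than the mean because disturbances only ever add time, so the fastest run is usually the closest to the undisturbed cost.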
Benefits of This Approach
- High Precision: Using RDTSC provides a high-resolution, low-overhead way to measure elapsed ticks, as it avoids the typical delays of OS-level timing functions.
- Minimal Overhead: Reading the time-stamp counter directly is a single instruction with no system call, making it well suited to profiling short code blocks where every cycle counts.
- Direct Hardware Reading: RDTSC reads directly from the CPU. The measured interval still includes any time the thread spends preempted by the OS, so pinning the thread to one core and repeating the measurement gives more consistent benchmarks.
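On CPUs where the counter ticks at a fixed rate, tick counts can be converted to wall-clock time once that rate is known. A rough sketch, assuming x86-64 and hypothetical helpers read_tsc and estimate_tsc_hz, calibrates the counter against std::time::Instant:

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: reads the time-stamp counter with `rdtsc`.
#[cfg(target_arch = "x86_64")]
fn read_tsc() -> u64 {
    use std::arch::asm;
    let low: u32;
    let high: u32;
    unsafe {
        asm!(
            "rdtsc",
            out("eax") low,
            out("edx") high,
            options(nostack, preserves_flags),
        );
    }
    ((high as u64) << 32) | (low as u64)
}

/// Placeholder for targets without `rdtsc`; always returns 0 in this sketch.
#[cfg(not(target_arch = "x86_64"))]
fn read_tsc() -> u64 {
    0
}

/// Estimates the counter rate in ticks per second by spinning for `window`.
fn estimate_tsc_hz(window: Duration) -> f64 {
    let t0 = Instant::now();
    let c0 = read_tsc();
    while t0.elapsed() < window {
        std::hint::spin_loop();
    }
    let c1 = read_tsc();
    let secs = t0.elapsed().as_secs_f64();
    c1.wrapping_sub(c0) as f64 / secs
}

fn main() {
    let hz = estimate_tsc_hz(Duration::from_millis(50));
    println!("estimated counter rate: {:.2} GHz", hz / 1e9);
}
```

The estimate is only as good as the calibration window; a longer window reduces the relative error from the two Instant reads.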
Enhanced Profiling: Using Inline Assembly for More Robust Performance Timing
In the following example, we’ll use both the RDTSC and RDTSCP instructions to count CPU cycles. RDTSC alone can be unreliable on modern out-of-order processors since it doesn’t serialize execution: the CPU may run it before earlier instructions have finished. RDTSCP addresses part of this by waiting until all previous instructions have executed before it reads the counter, although later instructions may still begin before the read completes.
Improved CPU Cycle Counter Example
The example below shows a cycle counter that uses both RDTSC at the beginning and RDTSCP at the end, ensuring a precise and isolated cycle count of a critical code block.
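use std::arch::asm;
use std::hint::black_box;
fn get_cpu_cycles_pair() -> (u64, u64) {
    let start_high: u32;
    let start_low: u32;
    let end_high: u32;
    let end_low: u32;
    unsafe {
        // cpuid overwrites eax, ebx, ecx, and edx. LLVM reserves rbx on x86-64,
        // so it is saved and restored by hand.
        asm!(
            "mov {tmp}, rbx",
            "xor eax, eax",
            "cpuid",
            "rdtsc",
            "mov rbx, {tmp}",
            tmp = out(reg) _,
            out("eax") start_low,
            out("edx") start_high,
            out("ecx") _,
            options(nostack),
        );
        // u64 avoids overflow; black_box keeps the loop from being optimized away.
        let mut sum: u64 = 0;
        for i in 0..1_000_000u64 {
            sum += black_box(i);
        }
        black_box(sum);
        // Copy the rdtscp result out before cpuid overwrites eax and edx.
        asm!(
            "rdtscp",
            "mov {lo:e}, eax",
            "mov {hi:e}, edx",
            "mov {tmp}, rbx",
            "xor eax, eax",
            "cpuid",
            "mov rbx, {tmp}",
            lo = out(reg) end_low,
            hi = out(reg) end_high,
            tmp = out(reg) _,
            out("eax") _,
            out("ecx") _,
            out("edx") _,
            options(nostack),
        );
    }
    let start_cycles = ((start_high as u64) << 32) | (start_low as u64);
    let end_cycles = ((end_high as u64) << 32) | (end_low as u64);
    (start_cycles, end_cycles)
}
fn main() {
    let (start, end) = get_cpu_cycles_pair();
    println!("CPU cycles taken: {}", end - start);
}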
- Serializing with cpuid: The cpuid instruction is a serializing instruction: it forces all earlier instructions to complete before any later ones start, which keeps out-of-order execution from moving work across the RDTSC or RDTSCP read. This is crucial on out-of-order CPUs to maintain accuracy. Because cpuid overwrites eax, ebx, ecx, and edx, those registers must be treated as clobbered.
- Start and End Counters:
  - Start Counter (RDTSC): We call RDTSC at the beginning of the code block to capture the start cycle count.
  - End Counter (RDTSCP): RDTSCP at the end waits for all earlier instructions to execute before reading the counter, providing a more accurate end cycle count.
- Critical Code Section: The code you want to measure (e.g., a loop) is placed between the two cycle counter instructions. In practice, this might be a cryptographic function, a data processing loop, or another performance-critical task.
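Serialization is not free: cpuid itself takes many cycles, so the harness adds a fixed cost to every measurement. One way to estimate that cost, sketched here for x86-64, is to time an empty region repeatedly and keep the minimum. The helper names serialized_tsc and harness_overhead are hypothetical.

```rust
/// Hypothetical helper: `cpuid` followed by `rdtsc`, as a serialized read.
#[cfg(target_arch = "x86_64")]
fn serialized_tsc() -> u64 {
    use std::arch::asm;
    let low: u32;
    let high: u32;
    unsafe {
        asm!(
            // rbx is reserved by LLVM, so save it before cpuid overwrites it.
            "mov {tmp}, rbx",
            "xor eax, eax",
            "cpuid",
            "rdtsc",
            "mov rbx, {tmp}",
            tmp = out(reg) _,
            out("eax") low,
            out("edx") high,
            out("ecx") _,
            options(nostack),
        );
    }
    ((high as u64) << 32) | (low as u64)
}

/// Placeholder for targets without these instructions; always returns 0.
#[cfg(not(target_arch = "x86_64"))]
fn serialized_tsc() -> u64 {
    0
}

/// Times an empty region `runs` times and returns the smallest tick count.
fn harness_overhead(runs: usize) -> u64 {
    (0..runs)
        .map(|_| {
            let start = serialized_tsc();
            let end = serialized_tsc();
            end.wrapping_sub(start)
        })
        .min()
        .unwrap_or(0)
}

fn main() {
    println!("harness overhead: {} ticks", harness_overhead(1_000));
}
```

Subtracting this figure from later measurements gives a closer estimate of the measured code alone. Inside a virtual machine, cpuid usually traps to the hypervisor, so the overhead can be much larger there.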
Real-World Scenarios and Benefits
This method is highly beneficial in specific contexts:
- Embedded Systems and Real-Time Applications: Precise cycle counting helps developers ensure that code execution times meet strict timing requirements, especially in systems where every microsecond counts, like automotive control units or medical devices.
- Cryptographic Algorithms: Cycle-accurate profiling is essential in cryptography, where timing leaks can potentially expose information about secret data. Precise measurement helps reveal data-dependent timing differences as well as unexpected performance bottlenecks.
- High-Performance Trading: In financial systems, even minor delays can affect profitability. Cycle counting helps optimize latency-sensitive functions, like order matching or risk calculations.
- Performance Optimization: For any CPU-intensive application, cycle-level measurement can reveal exactly which parts of the code consume the most resources, guiding targeted optimizations.
Extending Inline Assembly Usage with RDTSC and RDTSCP in Rust
Measuring the Impact of Different Code Blocks
Let’s extend our example by measuring two different code blocks to see how they compare in terms of CPU cycles. This technique is common in performance engineering, where you might want to assess the relative cost of different implementations or functions.
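use std::arch::asm;
use std::hint::black_box;
fn measure_code_cycles<F>(func: F) -> u64
where
    F: FnOnce(),
{
    let start_high: u32;
    let start_low: u32;
    let end_high: u32;
    let end_low: u32;
    unsafe {
        // cpuid overwrites eax, ebx, ecx, and edx. LLVM reserves rbx on x86-64,
        // so it is saved and restored by hand.
        asm!(
            "mov {tmp}, rbx",
            "xor eax, eax",
            "cpuid",
            "rdtsc",
            "mov rbx, {tmp}",
            tmp = out(reg) _,
            out("eax") start_low,
            out("edx") start_high,
            out("ecx") _,
            options(nostack),
        );
    }
    func();
    unsafe {
        // Copy the rdtscp result out before cpuid overwrites eax and edx.
        asm!(
            "rdtscp",
            "mov {lo:e}, eax",
            "mov {hi:e}, edx",
            "mov {tmp}, rbx",
            "xor eax, eax",
            "cpuid",
            "mov rbx, {tmp}",
            lo = out(reg) end_low,
            hi = out(reg) end_high,
            tmp = out(reg) _,
            out("eax") _,
            out("ecx") _,
            out("edx") _,
            options(nostack),
        );
    }
    let start_cycles = ((start_high as u64) << 32) | (start_low as u64);
    let end_cycles = ((end_high as u64) << 32) | (end_low as u64);
    end_cycles - start_cycles
}
fn main() {
    let cycles_block1 = measure_code_cycles(|| {
        // u64 avoids overflow; black_box keeps the loop from being optimized away.
        let mut sum: u64 = 0;
        for i in 0..1_000_000u64 {
            sum += black_box(i);
        }
        black_box(sum);
    });
    let cycles_block2 = measure_code_cycles(|| {
        // 999! overflows any integer type, so wrap on overflow instead of panicking.
        let mut product: u64 = 1;
        for i in 1..1000u64 {
            product = product.wrapping_mul(black_box(i));
        }
        black_box(product);
    });
    println!("Cycles for Block 1: {}", cycles_block1);
    println!("Cycles for Block 2: {}", cycles_block2);
}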
- Function as a Parameter: We use a generic function measure_code_cycles that takes a closure, func, allowing any code block to be passed for profiling.
- Reusability: This setup allows you to measure any function or block of code, making it easy to compare different algorithms, implementations, or optimizations in a structured and repeatable manner.
- Precision and Comparisons: By measuring different blocks, you can directly compare cycle costs and make informed decisions on optimizations.
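One refinement worth considering: letting the closure return a value, and handing that value back to the caller, makes it harder for the optimizer to discard the measured work. The sketch below assumes x86-64 and uses a plain rdtsc read for brevity; measure_with_result and read_tsc are hypothetical names.

```rust
use std::hint::black_box;

/// Hypothetical helper: reads the time-stamp counter with `rdtsc`.
#[cfg(target_arch = "x86_64")]
fn read_tsc() -> u64 {
    use std::arch::asm;
    let low: u32;
    let high: u32;
    unsafe {
        asm!(
            "rdtsc",
            out("eax") low,
            out("edx") high,
            options(nostack, preserves_flags),
        );
    }
    ((high as u64) << 32) | (low as u64)
}

/// Placeholder for targets without `rdtsc`; always returns 0 in this sketch.
#[cfg(not(target_arch = "x86_64"))]
fn read_tsc() -> u64 {
    0
}

/// Runs `func` once and returns its result together with the elapsed ticks.
fn measure_with_result<F, R>(func: F) -> (R, u64)
where
    F: FnOnce() -> R,
{
    let start = read_tsc();
    // black_box marks the result as used, so the work cannot be skipped.
    let value = black_box(func());
    let end = read_tsc();
    (value, end.wrapping_sub(start))
}

fn main() {
    let (sum, ticks) = measure_with_result(|| (0..1_000u64).sum::<u64>());
    println!("sum = {}, ticks = {}", sum, ticks); // sum is 499500
}
```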
Output
Running this code will display the number of CPU cycles taken for each code block, allowing you to compare their performance. The exact numbers vary by machine and from run to run; the values below are only illustrative.
Cycles for Block 1: 52345678
Cycles for Block 2: 12567890
Limitations and Considerations
- Multi-Core and Hyper-Threaded CPUs: If the OS migrates the thread to another core between the two reads, or the cores' counters are not synchronized, RDTSC and RDTSCP might show inconsistent results. Affinity settings or single-threaded execution can help mitigate this.
- Dynamic Frequency Scaling (DFS): Modern CPUs often adjust their frequency dynamically, while the time-stamp counter on most of them ticks at a fixed rate, so tick counts do not map one-to-one onto core clock cycles. Running on a high-performance setting or disabling frequency scaling (if possible) can improve accuracy.
- Platform-Specific: This approach is currently limited to x86 and x86-64 platforms, though similar mechanisms exist for other architectures like ARM (e.g., the PMCCNTR cycle counter on ARM CPUs).
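RDTSCP also returns the contents of the IA32_TSC_AUX register, which operating systems such as Linux load with a CPU identifier. A hedged sketch, assuming x86-64 and using the core::arch intrinsics __cpuid and __rdtscp rather than hand-written assembly, compares that value before and after a measurement to flag a possible core migration. The helper names rdtscp_with_aux and measure_on_one_core are hypothetical.

```rust
/// Hypothetical helper: returns (ticks, aux) from `rdtscp`, if supported.
#[cfg(target_arch = "x86_64")]
#[allow(unused_unsafe)]
fn rdtscp_with_aux() -> Option<(u64, u32)> {
    use core::arch::x86_64::{__cpuid, __rdtscp};
    // CPUID leaf 0x8000_0001, EDX bit 27 reports RDTSCP support.
    let edx = unsafe { __cpuid(0x8000_0001) }.edx;
    if (edx & (1 << 27)) == 0 {
        return None;
    }
    let mut aux: u32 = 0;
    let ticks = unsafe { __rdtscp(&mut aux) };
    Some((ticks, aux))
}

/// `rdtscp` is not available on other architectures.
#[cfg(not(target_arch = "x86_64"))]
fn rdtscp_with_aux() -> Option<(u64, u32)> {
    None
}

/// Returns the elapsed ticks and whether both reads saw the same aux value.
fn measure_on_one_core<F: FnOnce()>(work: F) -> Option<(u64, bool)> {
    let (start, cpu_before) = rdtscp_with_aux()?;
    work();
    let (end, cpu_after) = rdtscp_with_aux()?;
    Some((end.wrapping_sub(start), cpu_before == cpu_after))
}

fn main() {
    match measure_on_one_core(|| { std::hint::black_box(2u64.pow(10)); }) {
        Some((ticks, true)) => println!("{} ticks on one core", ticks),
        Some((_, false)) => println!("thread migrated; discard this sample"),
        None => println!("rdtscp not available"),
    }
}
```

Matching aux values do not rule out a migration away and back, so this check is a filter for obviously bad samples, not a guarantee.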
Tips for Using asm!
- Safety: Inline assembly is inherently unsafe. Wrapping asm! in unsafe blocks is required.
- Registers: Use reg to let the compiler choose the best available general-purpose register. Use a const operand instead of in to embed a compile-time constant in the assembly text.
- Options: The options argument can specify flags such as pure, nomem, preserves_flags, or nostack, giving the compiler more information about what the assembly does. There is no volatile option: asm! blocks are treated as volatile unless marked pure.
When to Use asm!
Inline assembly is powerful, but it’s essential to consider when it’s appropriate:
- Low-Level Hardware Interaction: Directly interface with hardware where specific CPU instructions are needed.
- Performance-Critical Code: Optimize particular code paths by controlling CPU instructions.
- Operating Systems and Embedded Programming: When interacting with the OS or low-level hardware, inline assembly provides precise control.