Fast Algorithms for Big-Integer Multiplication

Optimizing Big-Integer Multiplication for Performance and Memory

Multiplying big integers — numbers much larger than a machine word — is a core operation in cryptography, arbitrary-precision arithmetic libraries, scientific computing, and number theory. When inputs reach thousands or millions of digits, naive multiplication becomes impractical. This article surveys algorithms, implementation strategies, and memory-performance trade-offs to help you build or choose an efficient big-integer multiplication routine.


Why ordinary multiplication fails at scale

The grade-school (long) multiplication algorithm runs in O(n^2) time for n-digit numbers; each digit of one operand multiplies every digit of the other. For small n this is simple and perfectly adequate, but at n = 10^4–10^7 digits the quadratic work becomes prohibitively slow, and naive implementations also churn through large amounts of temporary storage for intermediate partial products.

Key bottlenecks at large sizes:

  • Asymptotic inefficiency: computation grows quadratically.
  • Memory churn: many temporary arrays and carries lead to heavy allocation and cache misses.
  • Cache behavior and locality: algorithms that don’t process contiguous memory or reuse cache lines suffer large slowdowns.
  • Constant factors: low-level implementation, word size, and instruction-level optimizations matter.

Algorithmic choices

Choosing the right algorithm depends on operand size, hardware, and implementation complexity.

1) Schoolbook (O(n^2))

  • Best for small sizes (e.g., up to a few hundred or a few thousand machine words depending on implementation).
  • Simple, low overhead, easy to implement with in-place accumulation.
  • Optimization opportunities: loop unrolling, blocking, use of 64-bit limbs with 128-bit intermediate products, and reducing memory allocations (a minimal sketch follows this list).
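
The kernel below is a minimal sketch of the schoolbook approach: 32-bit limbs stored least-significant first, with 64-bit arithmetic absorbing each partial product and carry. Limb width, layout, and function names are illustrative choices for this article, not any particular library's API.

```c
#include <stdint.h>
#include <stdio.h>

/* Schoolbook multiply: r[0..na+nb-1] = a[0..na-1] * b[0..nb-1].
 * Limbs are base-2^32, least significant first; r must hold na+nb limbs. */
static void mul_schoolbook(uint32_t *r, const uint32_t *a, size_t na,
                           const uint32_t *b, size_t nb) {
    for (size_t i = 0; i < na + nb; i++) r[i] = 0;
    for (size_t i = 0; i < na; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < nb; j++) {
            /* 32x32 -> 64-bit product plus previous limb plus carry cannot
             * overflow 64 bits: (2^32-1)^2 + 2*(2^32-1) < 2^64. */
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
        r[i + nb] = (uint32_t)carry;
    }
}

int main(void) {
    /* (2^64 - 1) * (2^32 + 1), limbs least significant first */
    uint32_t a[2] = { 0xFFFFFFFFu, 0xFFFFFFFFu };
    uint32_t b[2] = { 0x00000001u, 0x00000001u };
    uint32_t r[4];
    mul_schoolbook(r, a, 2, b, 2);
    for (int i = 3; i >= 0; i--) printf("%08x", (unsigned)r[i]);
    printf("\n");   /* prints 0000000100000000fffffffeffffffff */
    return 0;
}
```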

2) Karatsuba (O(n^{log_2 3}) ≈ O(n^{1.585}))

  • Recursive divide-and-conquer: splits numbers into halves and reduces multiplies from 4 to 3 per recursion.
  • Break-even point typically in the low hundreds to a few thousands of limbs.
  • Uses extra temporary space for the additions/subtractions; each recursion level needs scratch proportional to its operand size (O(n) in total), and practical implementations preallocate and reuse these buffers (see the word-level sketch below).
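
To make the recursion concrete, here is the Karatsuba identity at the smallest possible scale: one 64-bit word split into two 32-bit halves, with the 128-bit product assembled from three multiplications instead of four. A real implementation applies the same identity recursively to limb arrays; this sketch assumes a compiler that provides unsigned __int128 (GCC/Clang).

```c
#include <stdint.h>
#include <stdio.h>

/* Karatsuba at word scale: x = x1*2^32 + x0, y = y1*2^32 + y0.
 * x*y = z2*2^64 + z1*2^32 + z0 with only three multiplies:
 *   z0 = x0*y0, z2 = x1*y1, z1 = (x0+x1)*(y0+y1) - z0 - z2. */
static unsigned __int128 karatsuba64(uint64_t x, uint64_t y) {
    uint64_t x0 = (uint32_t)x, x1 = x >> 32;
    uint64_t y0 = (uint32_t)y, y1 = y >> 32;
    uint64_t z0 = x0 * y0;
    uint64_t z2 = x1 * y1;
    /* (x0+x1) and (y0+y1) fit in 33 bits, so their product needs up to
     * 66 bits; 128-bit arithmetic absorbs it here, while limb-array
     * implementations keep an extra carry limb instead. */
    unsigned __int128 z1 = (unsigned __int128)(x0 + x1) * (y0 + y1) - z0 - z2;
    return ((unsigned __int128)z2 << 64) + (z1 << 32) + z0;
}

int main(void) {
    uint64_t x = 0xFEDCBA9876543210ull, y = 0x0F1E2D3C4B5A6978ull;
    unsigned __int128 k = karatsuba64(x, y);
    unsigned __int128 d = (unsigned __int128)x * y;   /* reference product */
    printf("match: %d\n", k == d);
    return 0;
}
```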

3) Toom–Cook (Toom-3, Toom-k) (O(n^{1+ε}))

  • Generalizes Karatsuba by splitting into more parts; Toom-3 is common (≈ O(n^{1.465})).
  • Higher-order Toom reduces the asymptotic exponent further but increases bookkeeping: evaluation and interpolation require more additions, shifts, and exact divisions by small constants, which erode the gains at moderate sizes.
  • Break-even points increase with k; use Toom variants when Karatsuba’s gains plateau (an evaluation/interpolation sketch follows this list).
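
The sketch below shows the Toom-3 mechanics on bare polynomial coefficients (small integers, so machine arithmetic suffices): evaluate both operands at the five points 0, 1, -1, 2, and infinity, multiply pointwise, then interpolate with exact divisions by 2 and 3. The point set and interpolation sequence shown are one common choice, not the only one; real implementations work on limb arrays and fold carries afterwards.

```c
#include <stdint.h>
#include <stdio.h>

/* Toom-3 demo on polynomial coefficients (small integers for clarity).
 * a(x) = a0 + a1*x + a2*x^2, b(x) likewise; the product c(x) has five
 * coefficients, recovered from five evaluations. */
static void toom3(const int64_t a[3], const int64_t b[3], int64_t c[5]) {
    /* Evaluation: 5 pointwise multiplications replace 9 schoolbook ones. */
    int64_t v0   = a[0] * b[0];                                         /* x = 0   */
    int64_t v1   = (a[0] + a[1] + a[2]) * (b[0] + b[1] + b[2]);         /* x = 1   */
    int64_t vm1  = (a[0] - a[1] + a[2]) * (b[0] - b[1] + b[2]);         /* x = -1  */
    int64_t v2   = (a[0] + 2*a[1] + 4*a[2]) * (b[0] + 2*b[1] + 4*b[2]); /* x = 2   */
    int64_t vinf = a[2] * b[2];                                         /* x = inf */

    /* Interpolation: every division below is exact. */
    c[0] = v0;
    c[4] = vinf;
    c[2] = (v1 + vm1) / 2 - c[0] - c[4];
    int64_t t = (v1 - vm1) / 2;                         /* = c1 + c3 */
    c[3] = ((v2 - c[0] - 4*c[2] - 16*c[4]) / 2 - t) / 3;
    c[1] = t - c[3];
}

int main(void) {
    int64_t a[3] = {5, 7, 11}, b[3] = {3, 2, 9}, c[5], ref[5] = {0};
    toom3(a, b, c);
    for (int i = 0; i < 3; i++)             /* reference: direct convolution */
        for (int j = 0; j < 3; j++) ref[i + j] += a[i] * b[j];
    for (int i = 0; i < 5; i++)
        printf("c%d = %lld (ref %lld)\n", i, (long long)c[i], (long long)ref[i]);
    return 0;
}
```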

4) FFT-based multiplication (Schönhage–Strassen, Fürer, and later improvements)

  • Use convolution via the Fast Fourier Transform to reduce multiplication to O(n log n log log n) (Schönhage–Strassen); later work (Fürer, Harvey–van der Hoeven) lowers this to O(n log n).
  • Essential for extremely large sizes (tens of thousands to millions of digits).
  • Requires careful choice of convolution modulus or floating FFT with error control.
  • Memory and implementation complexity are significant; bit/word packing and careful carry handling are also required (see the NTT sketch after this list).
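
As an illustration of the FFT route, the following self-contained sketch multiplies two small decimal numbers with a number-theoretic transform modulo 998244353 (a well-known NTT-friendly prime with primitive root 3). Production code would use a much larger base than 10, several primes combined via CRT (or a floating-point FFT with error control), and cache-aware transforms; the structure, however, is the same: forward transforms, pointwise multiplication, inverse transform, carry propagation.

```c
#include <stdint.h>
#include <stdio.h>

#define MOD 998244353ULL   /* 119 * 2^23 + 1, primitive root 3 */

static uint64_t powmod(uint64_t b, uint64_t e) {
    uint64_t r = 1;
    for (b %= MOD; e; e >>= 1, b = b * b % MOD)
        if (e & 1) r = r * b % MOD;
    return r;
}

/* In-place iterative radix-2 number-theoretic transform (inverse if inv != 0). */
static void ntt(uint64_t *a, size_t n, int inv) {
    for (size_t i = 1, j = 0; i < n; i++) {        /* bit-reversal permutation */
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) { uint64_t t = a[i]; a[i] = a[j]; a[j] = t; }
    }
    for (size_t len = 2; len <= n; len <<= 1) {
        uint64_t wlen = powmod(3, (MOD - 1) / len);
        if (inv) wlen = powmod(wlen, MOD - 2);      /* inverse root of unity */
        for (size_t i = 0; i < n; i += len) {
            uint64_t w = 1;
            for (size_t j = 0; j < len / 2; j++) {
                uint64_t u = a[i + j];
                uint64_t v = a[i + j + len / 2] * w % MOD;
                a[i + j] = (u + v) % MOD;
                a[i + j + len / 2] = (u + MOD - v) % MOD;
                w = w * wlen % MOD;
            }
        }
    }
    if (inv) {
        uint64_t ninv = powmod(n, MOD - 2);
        for (size_t i = 0; i < n; i++) a[i] = a[i] * ninv % MOD;
    }
}

int main(void) {
    /* 1234 * 5678 with base-10 "limbs", least significant digit first. */
    uint64_t a[8] = {4, 3, 2, 1}, b[8] = {8, 7, 6, 5};
    size_t n = 8;                        /* power of two >= result length */
    ntt(a, n, 0);
    ntt(b, n, 0);
    for (size_t i = 0; i < n; i++) a[i] = a[i] * b[i] % MOD;   /* pointwise */
    ntt(a, n, 1);                        /* a now holds the exact convolution */
    uint64_t carry = 0;                  /* propagate carries into base-10 digits */
    for (size_t i = 0; i < n; i++) {
        uint64_t t = a[i] + carry;
        a[i] = t % 10;
        carry = t / 10;
    }
    for (size_t i = n; i-- > 0;) printf("%llu", (unsigned long long)a[i]);
    printf("\n");                        /* prints 07006652 */
    return 0;
}
```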

Practical hybrid strategies

No single algorithm is optimal for all sizes. Most production libraries use a multi-tiered approach:

  • For tiny sizes: schoolbook (optimized inner loops).
  • For medium sizes: Karatsuba.
  • For larger sizes: Toom variants.
  • For very large sizes: FFT-based methods.

This hybrid approach switches algorithms at empirically tuned thresholds on operand length to minimize runtime and memory use, as sketched below.
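
A dispatcher along the following lines captures the idea; the threshold values are placeholders and must come from benchmarking on the target machine (GMP, for instance, ships a tuning program that measures its own crossover points).

```c
#include <stdio.h>
#include <stddef.h>

typedef enum { ALGO_SCHOOLBOOK, ALGO_KARATSUBA, ALGO_TOOM3, ALGO_FFT } algo_t;

/* Illustrative thresholds only (limb counts of the smaller operand);
 * real values must be tuned empirically on the target CPU. */
enum { KARATSUBA_THRESHOLD = 32, TOOM3_THRESHOLD = 400, FFT_THRESHOLD = 50000 };

static algo_t choose_algorithm(size_t na, size_t nb) {
    size_t n = na < nb ? na : nb;       /* the smaller operand drives the choice */
    if (n < KARATSUBA_THRESHOLD) return ALGO_SCHOOLBOOK;
    if (n < TOOM3_THRESHOLD)     return ALGO_KARATSUBA;
    if (n < FFT_THRESHOLD)       return ALGO_TOOM3;
    return ALGO_FFT;
}

int main(void) {
    static const char *names[] = { "schoolbook", "karatsuba", "toom-3", "fft" };
    size_t sizes[] = { 8, 100, 5000, 1000000 };
    for (size_t i = 0; i < 4; i++)
        printf("%zu limbs -> %s\n", sizes[i],
               names[choose_algorithm(sizes[i], sizes[i])]);
    return 0;
}
```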


Memory optimization techniques

Memory often limits the maximum multiplier size and influences speed because of cache effects.

  • In-place and semi-in-place multiplications: avoid allocating new big buffers when possible; accumulate results directly into destination limbs.
  • Reuse temporary buffers: maintain a pool of scratch space to reduce allocations and fragmentation.
  • Memory layout: use contiguous arrays of machine words (limbs) aligned to cache lines (e.g., 64 bytes) to maximize throughput.
  • Blocking / tiling: divide operands into blocks sized to fit L1/L2 cache and compute partial products that stay in cache.
  • Lazy carry propagation: delay carries to reduce memory writes; handle carries in batches.
  • Minimizing allocations for recursion: pass preallocated scratch buffers down the recursion to avoid repeated malloc/free (a scratch-arena sketch follows this list).
  • Use smaller limb sizes when it reduces memory and speeds up certain convolution kernels; e.g., 32-bit vs 64-bit depending on FFT base and target CPU.
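
One simple way to implement the scratch-reuse points above is a bump-allocated arena created once per multiplication (or per thread) and passed down the recursion. The sketch below is illustrative; the exact scratch requirement depends on the splitting scheme, so compute it from your recursion or over-allocate conservatively.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint64_t limb_t;

/* A trivial bump allocator for temporary limbs: one malloc up front, then
 * constant-time "allocation" by advancing an offset.  Recursive multiply
 * routines take a scratch pointer instead of calling malloc/free per level. */
typedef struct {
    limb_t *base;
    size_t  used, cap;
} scratch_t;

static limb_t *scratch_alloc(scratch_t *s, size_t n) {
    if (s->used + n > s->cap) return NULL;   /* caller sized the arena too small */
    limb_t *p = s->base + s->used;
    s->used += n;
    return p;
}

int main(void) {
    size_t n = 1u << 20;                           /* operand size in limbs */
    scratch_t s = { malloc(2 * n * sizeof(limb_t)), 0, 2 * n };
    if (!s.base) return 1;

    limb_t *tmp = scratch_alloc(&s, n);            /* e.g. holds (a_hi + a_lo) */
    printf("arena: %zu limbs, used %zu\n", s.cap, s.used);
    (void)tmp;

    s.used = 0;                                    /* reuse for the next multiply */
    free(s.base);
    return 0;
}
```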

CPU-level and micro-optimizations

  • Choose limb width to match CPU register size and available fast integer multiply instructions (e.g., 64-bit on modern x86_64).
  • Use compiler intrinsics for widening multiplies and carry chains (MULX/ADCX/ADOX on x86_64) where available; a portable fallback is sketched after this list.
  • Leverage SIMD: some multiplication kernels can use vectorized multiply-add to compute many partial products in parallel (often helpful for schoolbook blocking).
  • Loop unrolling and software pipelining: reduce branch overhead and increase instruction-level parallelism.
  • Prefetching and memory alignment: explicit prefetch instructions help when reading large operands sequentially.
  • Use assembly for hot inner loops where portability cost is acceptable.
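
The following is a portable sketch of an addmul-style inner loop (r += a * m with carry out), relying on unsigned __int128 (GCC/Clang) so the compiler can emit a widening 64x64 -> 128-bit multiply. Hand-written assembly or MULX/ADCX/ADOX-based kernels typically beat it in hot paths, but the structure is the same.

```c
#include <stdint.h>
#include <stdio.h>

/* r[0..n-1] += a[0..n-1] * m; returns the carry-out limb.
 * The widening product plus limb plus carry cannot overflow 128 bits. */
static uint64_t addmul_1(uint64_t *r, const uint64_t *a, size_t n, uint64_t m) {
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 t = (unsigned __int128)a[i] * m + r[i] + carry;
        r[i]  = (uint64_t)t;
        carry = (uint64_t)(t >> 64);
    }
    return carry;
}

int main(void) {
    uint64_t r[3] = {0, 0, 0}, a[2] = {~0ull, ~0ull};   /* a = 2^128 - 1 */
    r[2] = addmul_1(r, a, 2, 3);                        /* r = 3 * (2^128 - 1) */
    printf("%016llx%016llx%016llx\n",                   /* most significant limb first */
           (unsigned long long)r[2], (unsigned long long)r[1], (unsigned long long)r[0]);
    return 0;
}
```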

FFT implementation considerations

FFT-based multiplication brings its own set of performance and memory concerns:

  • Choice of transform: Number Theoretic Transform (NTT) avoids floating error by working modulo special primes; floating-point FFTs are faster in some environments but require error analysis and sometimes multiple transforms with Chinese Remainder Theorem (CRT).
  • Truncation and zero-padding: choose transform size (power-of-two or other radices) to minimize padding and memory while maintaining exactness.
  • Windowing/Convolution scheduling: implement iterative or recursive FFTs that reuse allocated twiddle factors and buffers.
  • Precision and rounding: when using double/long-double FFTs, ensure coefficient growth and rounding errors are controlled (split coefficients, use multiple FFTs, or use higher-precision FFT libraries).
  • Cache-friendly FFT: perform transforms in blocks to reduce cache misses; use in-place FFT algorithms where possible.
  • Precompute twiddle factors once and reuse them across multiplications (see the sketch after this list).
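
A small example of the last point: compute the twiddle (root-of-unity) table for a given transform size once and share it across every multiplication of that size. The modulus and generator below are common NTT-friendly choices; a matching table of inverse roots would normally be kept alongside for the inverse transform.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

#define MOD 998244353ULL   /* NTT-friendly prime, primitive root 3 */

static uint64_t powmod(uint64_t b, uint64_t e) {
    uint64_t r = 1;
    for (b %= MOD; e; e >>= 1, b = b * b % MOD)
        if (e & 1) r = r * b % MOD;
    return r;
}

/* Twiddle table for one transform size, computed once and reused by every
 * forward NTT of that size. */
typedef struct {
    size_t    n;
    uint64_t *w;            /* w[k] = root^k mod MOD, k = 0..n/2-1 */
} twiddles_t;

static twiddles_t twiddles_init(size_t n) {
    twiddles_t t = { n, malloc(n / 2 * sizeof(uint64_t)) };
    if (!t.w) { t.n = 0; return t; }
    uint64_t root = powmod(3, (MOD - 1) / n);   /* primitive n-th root of unity */
    t.w[0] = 1;
    for (size_t k = 1; k < n / 2; k++) t.w[k] = t.w[k - 1] * root % MOD;
    return t;
}

int main(void) {
    twiddles_t t = twiddles_init(1u << 16);
    if (!t.w) return 1;
    printf("w[1] = %llu\n", (unsigned long long)t.w[1]);
    /* ... reuse t.w across many multiplications of this size ... */
    free(t.w);
    return 0;
}
```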

Reducing memory in FFT-based multiplies

  • Use half-sized complex arrays with real FFT optimizations when multiplying two real sequences.
  • Use convolution-by-NTT modulo carefully chosen primes to keep coefficients small and reconstruct via CRT, reducing the need for high-precision floating arrays.
  • Perform convolution on packed limbs (radix-B representation) to reduce the number of FFT points.
  • Stream or chunk large multiplies if full operands cannot fit into memory; compute convolution pieces and combine (trade extra computation for reduced peak memory).

Accuracy, side effects, and portability

  • Implement robust carry handling after convolution; off-by-one and carry-propagation bugs are a common source of incorrect results (a normalization sketch follows this list).
  • Side-channel considerations (constant-time algorithms) are critical for cryptographic uses; some asymptotically fast algorithms leak more timing information through data-dependent branching or variable-time transforms, so protect them with algorithmic choices or masking.
  • Portability vs. performance: architecture-specific intrinsics/assembly accelerate inner loops but reduce portability. Provide fallback C implementations.
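
For the carry-handling point above, a normalization pass like the one sketched here runs right after the inverse transform: each raw convolution coefficient can exceed one limb, so the excess is pushed upward in a single left-to-right sweep. Base 10 is used only to keep the example readable.

```c
#include <stdint.h>
#include <stdio.h>

/* Normalize raw convolution coefficients into proper base-`base` limbs.
 * After an exact convolution, coefficient i can be as large as
 * min(na, nb) * (base-1)^2, so it no longer fits in one limb. */
static void propagate_carries(uint64_t *c, size_t n, uint64_t base) {
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = c[i] + carry;      /* assumes t does not overflow 64 bits */
        c[i]  = t % base;
        carry = t / base;
    }
    /* For a product buffer sized na+nb the final carry must be zero. */
    if (carry != 0) fprintf(stderr, "result buffer too small\n");
}

int main(void) {
    /* Convolution of the digits of 1234 and 5678 (least significant first). */
    uint64_t c[8] = {32, 52, 61, 60, 34, 16, 5, 0};
    propagate_carries(c, 8, 10);
    for (size_t i = 8; i-- > 0;) printf("%llu", (unsigned long long)c[i]);
    printf("\n");   /* prints 07006652 == 1234 * 5678 */
    return 0;
}
```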

Tuning and benchmarking

  • Profile with representative inputs (sizes, distributions, and patterns relevant to real use).
  • Measure both wall-clock time and memory/peak allocation (a minimal timing harness is sketched after this list).
  • Tune thresholds between algorithms: empirical testing on target hardware is essential.
  • Test scalability with multiple cores: some FFT libraries and multiplication routines benefit from multithreading, but synchronization and memory bandwidth can limit gains.
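
A minimal timing harness, assuming POSIX clock_gettime, might look like the sketch below; the multiply routine here is only a placeholder for the code under test. Peak-memory measurement needs a separate mechanism (getrusage, allocator instrumentation, or a tool such as Valgrind's massif).

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Placeholder for the multiplication routine under test. */
static void multiply_under_test(volatile uint64_t *sink, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) acc += i * 2654435761u;   /* dummy work */
    *sink = acc;
}

int main(void) {
    volatile uint64_t sink;
    size_t sizes[] = { 1u << 10, 1u << 14, 1u << 18 };
    for (size_t s = 0; s < 3; s++) {
        int reps = 100;                 /* enough repeats to beat timer resolution */
        double t0 = now_sec();
        for (int r = 0; r < reps; r++) multiply_under_test(&sink, sizes[s]);
        double t1 = now_sec();
        printf("n = %7zu limbs: %.3f us/call\n", sizes[s], (t1 - t0) / reps * 1e6);
    }
    return 0;
}
```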

Example hybrid decision table

  Operand size (limbs)   Recommended algorithm   Memory notes
  < 64                   Schoolbook              Low memory, in-place
  64–1024                Karatsuba               Moderate temporaries; reuse buffers
  1k–100k                Toom variants           Increased scratch; balance complexity
  > 100k                 FFT/NTT-based           High peak memory; use streaming/chunking if limited

Implementation checklist

  • Choose limb size matching architecture.
  • Implement optimized schoolbook with multiply-with-carry primitives.
  • Add Karatsuba and one or two Toom variants with shared scratch buffers.
  • Integrate FFT/NTT routines or link a tested FFT library.
  • Design buffer pool and in-place accumulation strategies to minimize allocations.
  • Add thorough unit tests (small/medium/huge) and randomized tests with checksums or independent big-int arithmetic.
  • Benchmark and tune switching thresholds per target hardware.
  • Review for constant-time requirements if used in crypto.

Closing notes

Optimizing big-integer multiplication is a layered engineering problem: algorithm selection, memory management, CPU-level tuning, and practical engineering (buffer reuse, testing, and benchmarking). The right mix depends on the operand sizes and constraints of your environment. Well-tuned hybrids used in modern libraries (GMP, OpenSSL Bignum, MPIR, etc.) demonstrate how combining techniques yields fast, memory-efficient multiplication across all sizes.
