# Bench-First Rust Development: Make Performance a Feature

*How TTAPI achieves a 23x speedup through systematic benchmarking and optimization*
## The Philosophy

> “If you don’t measure it, you don’t own it.”
Performance isn’t an afterthought—it’s a feature that requires:
- Continuous measurement (benchmarks in CI/CD)
- Regression prevention (performance gates)
- Data-driven optimization (flamegraphs, profiling)
- Clear communication (throughput, latency, memory)
## The TTAPI Approach

### 1. Benchmark Harness First

Before writing production code, set up benchmarking infrastructure:
```rust
// benches/symbol_collection.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use ttapi_platform::collect_symbols;

fn bench_symbol_collection(c: &mut Criterion) {
    let rt = tokio::runtime::Runtime::new().unwrap();

    c.bench_function("collect_symbols_sp500", |b| {
        b.to_async(&rt).iter(|| async {
            let symbols = collect_symbols(black_box("sp500")).await.unwrap();
            black_box(symbols)
        });
    });
}

criterion_group!(benches, bench_symbol_collection);
criterion_main!(benches);
```
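For Criterion to drive this file, the bench target must be registered in Cargo.toml with the default harness disabled, and `to_async` needs Criterion's tokio support (versions here are illustrative):

```toml
[dev-dependencies]
criterion = { version = "0.5", features = ["async_tokio"] }

[[bench]]
name = "symbol_collection"
harness = false
```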
Run benchmarks:

```bash
cargo bench --bench symbol_collection
```

Output:

```
collect_symbols_sp500 time:   [52.234 s 52.456 s 52.678 s]
                      thrpt:  [9.63 symbols/s 9.67 symbols/s 9.71 symbols/s]
```
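Criterion only prints a `thrpt` line when the benchmark declares its throughput; a minimal sketch, assuming roughly 507 S&P 500 symbols per iteration:

```rust
use criterion::{Criterion, Throughput};

fn bench_symbol_throughput(c: &mut Criterion) {
    let mut group = c.benchmark_group("symbol_collection");
    // Tell Criterion how many elements one iteration processes
    // so it can report symbols/s alongside wall-clock time.
    group.throughput(Throughput::Elements(507));
    group.bench_function("collect_symbols_sp500", |b| {
        b.iter(|| {
            // ... benchmark body as above ...
        });
    });
    group.finish();
}
```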
### 2. Real-World Performance Data

TTAPI benchmarks from production runs:
| Operation | Cold Run | Cached Run | Speedup |
|---|---|---|---|
| Total Pipeline | 3m 2s | 7.8s | 23x |
| Symbol Collection | 52s | 88ms | 590x |
| EOD Processing | 2m 26s | 4s | 36x |
| Core Data | 16s | 91ms | 176x |
Throughput benchmarks:
- Symbol collection: 8,608 symbols/second
- MetaStock import: 438,000 rows/second
- Parquet load: 690 MB/second
- Statistical analysis: 7.3M rows/second
### 3. Optimization Process

#### Step 1: Measure Baseline

Initial implementation (naive):
```rust
async fn collect_symbols_sequential(symbols: Vec<String>) -> Result<Vec<Quote>> {
    let mut quotes = Vec::new();
    // One request at a time: each await blocks the loop until the
    // quote returns, so total time scales linearly with symbol count.
    for symbol in symbols {
        let quote = fetch_quote(&symbol).await?;
        quotes.push(quote);
    }
    Ok(quotes)
}
```
Benchmark:

```
collect_symbols_sequential time:   [2535 s 2540 s 2545 s]
                           thrpt:  [0.20 symbols/s]
```
Analysis: 507 symbols × 5s each = 2,535s (42 minutes) — too slow!
#### Step 2: Parallelize with Tokio

Optimized implementation:
```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

async fn collect_symbols_parallel(symbols: Vec<String>) -> Result<Vec<Quote>> {
    let semaphore = Arc::new(Semaphore::new(100)); // Max 100 concurrent requests

    let tasks: Vec<_> = symbols
        .iter()
        .map(|symbol| {
            let sem = semaphore.clone();
            let symbol = symbol.clone();
            tokio::spawn(async move {
                // Each task waits for a permit, capping concurrency at 100.
                let _permit = sem.acquire().await.unwrap();
                fetch_quote(&symbol).await
            })
        })
        .collect();

    let results = futures::future::join_all(tasks).await;

    // Keep quotes whose task neither panicked nor returned an error.
    Ok(results
        .into_iter()
        .filter_map(|r| r.ok().and_then(|q| q.ok()))
        .collect())
}
```
Benchmark:

```
collect_symbols_parallel time:   [52.234 s 52.456 s 52.678 s]
                         thrpt:  [9.67 symbols/s]
```
Improvement: 2,535s → 52s = 48x faster!
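The same cap on concurrency can be expressed without spawning a task per symbol; a sketch using `buffer_unordered` (illustrative, not TTAPI's actual code):

```rust
use futures::stream::{self, StreamExt};

async fn collect_symbols_stream(symbols: Vec<String>) -> Vec<Quote> {
    // buffer_unordered drives up to 100 fetches at once and yields
    // results as they finish, so no Semaphore or JoinHandle is needed.
    stream::iter(symbols)
        .map(|symbol| async move { fetch_quote(&symbol).await })
        .buffer_unordered(100)
        .filter_map(|result| async move { result.ok() })
        .collect()
        .await
}
```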
#### Step 3: Add Caching

With TTL-based caching:
```rust
async fn collect_symbols_cached(symbols: Vec<String>) -> Result<Vec<Quote>> {
    // Check cache first
    if let Some(cached) = CacheManager::load("symbols").await? {
        if !cached.is_expired() {
            return Ok(cached.data);
        }
    }

    // Collect fresh data
    let quotes = collect_symbols_parallel(symbols).await?;

    // Persist in background with a 24-hour TTL
    CacheManager::persist_in_background("symbols", &quotes, Duration::from_secs(86400));

    Ok(quotes)
}
```
Benchmark:

```
collect_symbols_cached time:   [88.234 ms 88.456 ms 88.678 ms]
                       thrpt:  [5,720 symbols/s]
```
Improvement: 52s → 88ms = 590x faster!
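CacheManager's internals aren't shown in this article; the TTL check it relies on can be as small as this illustrative sketch (type and field names assumed):

```rust
use std::time::{Duration, SystemTime};

// Illustrative cache entry; the real CacheManager types are not shown here.
struct CachedEntry<T> {
    data: T,
    stored_at: SystemTime,
    ttl: Duration,
}

impl<T> CachedEntry<T> {
    fn is_expired(&self) -> bool {
        // Treat clock errors as expiry so a suspect entry gets refreshed.
        self.stored_at
            .elapsed()
            .map(|age| age > self.ttl)
            .unwrap_or(true)
    }
}
```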
### 4. Flamegraph Analysis

Generate a flamegraph:

```bash
cargo flamegraph --bench symbol_collection
```
Findings from the flamegraph:

- 30% of time in JSON parsing → switch to `simd-json` (10% faster)
- 20% of time in HTTP connection setup → reuse a connection pool (15% faster)
- 15% of time in allocations → use `SmallVec` for small collections (5% faster)

Total improvement: 30% faster after flamegraph-driven optimizations.
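The connection-pool win typically amounts to sharing one HTTP client instead of constructing one per request; a sketch assuming reqwest is the HTTP layer:

```rust
use std::sync::OnceLock;

// reqwest::Client maintains an internal connection pool, so sharing
// one client across tasks reuses TCP/TLS sessions instead of paying
// the handshake cost on every request.
fn http_client() -> &'static reqwest::Client {
    static CLIENT: OnceLock<reqwest::Client> = OnceLock::new();
    CLIENT.get_or_init(|| {
        reqwest::Client::builder()
            .pool_max_idle_per_host(100)
            .build()
            .expect("client config is valid")
    })
}
```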
### 5. Memory Profiling

Before optimization:

```
Peak memory:  20.4 GB (Python/Pandas equivalent)
Allocations:  2.4M allocations/second
```

After optimization (Polars columnar format):

```
Peak memory:  2.4 GB (8x more efficient)
Allocations:  180k allocations/second (13x fewer)
```
Key changes:
- Row-based → Columnar format (Polars)
- Eager evaluation → Lazy evaluation (LazyFrame; see the sketch after this list)
- Intermediate buffers → Zero-copy operations
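The eager-to-lazy switch is the structural one; a minimal sketch of the shape, assuming a recent Polars with the lazy Parquet scan and a hypothetical schema with `symbol` and `close` columns:

```rust
use polars::prelude::*;

fn avg_close_per_symbol(path: &str) -> PolarsResult<DataFrame> {
    // scan_parquet only builds a query plan; nothing is read until
    // collect(), so Polars can push the filter and the two-column
    // projection down into the Parquet scan itself.
    LazyFrame::scan_parquet(path, ScanArgsParquet::default())?
        .filter(col("close").gt(lit(0.0)))
        .group_by([col("symbol")])
        .agg([col("close").mean().alias("avg_close")])
        .collect()
}
```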
## Reproducibility

### Environment Recording

Benchmark metadata:
```toml
[benchmark.environment]
os = "macOS 14.5"
cpu = "Apple M2 Pro (8 cores)"
ram = "16 GB"
rust_version = "1.75.0"
cargo_version = "1.75.0"

[benchmark.config]
warmup_time = "3s"
measurement_time = "10s"
sample_size = 100
```
Record it in the benchmark output:

```rust
fn bench_with_metadata(c: &mut Criterion) {
    // Print the host environment so saved results stay interpretable.
    println!("Environment:");
    println!("  OS: {}", std::env::consts::OS);
    println!("  Arch: {}", std::env::consts::ARCH);
    println!("  Cores: {}", num_cpus::get());

    c.bench_function("my_benchmark", |b| {
        b.iter(|| {
            // ... benchmark body ...
        });
    });
}
```
### Versioned Results

Store benchmark results in the repo:
```
benchmarks/
├── 2025-11-01_baseline.json
├── 2025-11-05_parallel.json
├── 2025-11-10_cached.json
└── 2025-11-14_optimized.json
```
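One way to produce these snapshots is to copy Criterion's own statistics, which it writes under `target/criterion/` after each run (the file naming below is illustrative):

```bash
# Snapshot Criterion's per-benchmark statistics into the repo
# at each optimization milestone.
cp target/criterion/collect_symbols_sp500/new/estimates.json \
   benchmarks/2025-11-14_optimized.json
```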
Compare over time (note the `--` separator: the baseline flags belong to Criterion, not cargo):

```bash
cargo bench -- --save-baseline main
git checkout feature-branch
cargo bench -- --baseline main
```
## CI Performance Gates

### GitHub Actions Workflow
```yaml
name: Performance Regression Check

on: [pull_request]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0  # full history, so both main and the PR branch can be checked out

      - name: Run benchmarks (main)
        run: |
          git checkout main
          cargo bench --bench symbol_collection -- --save-baseline main

      - name: Run benchmarks (PR)
        run: |
          git checkout ${{ github.head_ref }}
          cargo bench --bench symbol_collection -- --baseline main

      - name: Check for regressions
        run: |
          # Fail if p95 latency increased by >10%
          python scripts/check_regression.py --threshold 0.10
```
Regression detection (the result-file flags and default paths are illustrative):

```python
# scripts/check_regression.py
import argparse
import json
import sys

def check_regression(baseline, current, threshold=0.10):
    baseline_p95 = baseline['p95']
    current_p95 = current['p95']
    regression = (current_p95 - baseline_p95) / baseline_p95

    if regression > threshold:
        print(f"❌ Performance regression: {regression*100:.1f}%")
        print(f"   Baseline p95: {baseline_p95:.2f}ms")
        print(f"   Current p95:  {current_p95:.2f}ms")
        sys.exit(1)
    else:
        print(f"✅ No regression: {regression*100:.1f}%")
        sys.exit(0)

if __name__ == "__main__":
    # File locations are project-specific; these flags are illustrative.
    parser = argparse.ArgumentParser()
    parser.add_argument("--baseline", default="benchmarks/main.json")
    parser.add_argument("--current", default="benchmarks/pr.json")
    parser.add_argument("--threshold", type=float, default=0.10)
    args = parser.parse_args()

    with open(args.baseline) as b, open(args.current) as c:
        check_regression(json.load(b), json.load(c), args.threshold)
```
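Invoked with the two result files (paths illustrative, matching the flags above):

```bash
python scripts/check_regression.py \
  --baseline benchmarks/main.json \
  --current benchmarks/pr.json \
  --threshold 0.10
```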
## Communication: Benchmark Reports

### Internal Report (Engineering Team)
```markdown
## Performance Benchmark Report — Symbol Collection Optimization

**Date:** 2025-11-14
**Branch:** feature/parallel-collection
**Baseline:** main (commit abc123)

### Summary

- ✅ 48x faster than baseline (2,535s → 52s)
- ✅ 8x more memory efficient (20.4 GB → 2.4 GB)
- ✅ No regressions in other benchmarks

### Detailed Results

| Metric | Baseline | Optimized | Change |
|--------|----------|-----------|--------|
| **Latency (p50)** | 2,535s | 52s | **-98%** |
| **Latency (p95)** | 2,580s | 54s | **-98%** |
| **Throughput** | 0.20 symbols/s | 9.67 symbols/s | **+4,735%** |
| **Memory (peak)** | 20.4 GB | 2.4 GB | **-88%** |
| **Allocations** | 2.4M/s | 180k/s | **-93%** |

### Flamegraph Analysis

- 30% time in JSON parsing → optimized with simd-json
- 20% time in HTTP setup → connection pool reuse
- 15% time in allocations → SmallVec for small collections

### Next Steps

- [ ] Add caching layer (estimated 590x speedup)
- [ ] Optimize Parquet I/O (estimated 2x speedup)
- [ ] Profile dxFeed WebSocket (unknown potential)
```
### External Report (Portfolio/Resume)

```markdown
## TTAPI Performance Highlights

**23x faster** end-to-end pipeline with intelligent caching
- Cold run: 3m 2s
- Cached run: 7.8s

**438,000 rows/second** data import throughput
- 27.6M rows imported in 63 seconds
- Parallel CSV parsing with rayon

**8x more memory efficient** than Python/Pandas
- 2.4 GB vs. 20+ GB for equivalent workload
- Polars columnar format with zero-copy operations
```
## Tools & Techniques

### Benchmarking Tools
- Criterion — Statistical benchmarking with regression detection
- Divan — Fast benchmarking with minimal overhead
- cargo-flamegraph — CPU profiling with flamegraphs
- perf — Linux performance analysis
- Instruments — macOS profiling (Time Profiler, Allocations)
### Profiling Commands

```bash
# CPU profiling (flamegraph)
cargo flamegraph --bench symbol_collection

# Memory profiling (valgrind/massif)
valgrind --tool=massif target/release/ttapi

# System call summary (strace)
strace -c target/release/ttapi

# Tokio console (async runtime profiling)
RUSTFLAGS="--cfg tokio_unstable" cargo run --features tokio-console
```
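The tokio-console route also needs the binary to start the instrumentation subscriber; a minimal sketch, assuming `console-subscriber` as a feature-gated dependency:

```rust
#[tokio::main]
async fn main() {
    // Publishes tokio runtime instrumentation that the `tokio-console`
    // TUI attaches to; only compiled in with the tokio-console feature,
    // and requires building with RUSTFLAGS="--cfg tokio_unstable".
    #[cfg(feature = "tokio-console")]
    console_subscriber::init();

    // ... application code ...
}
```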
## Results in TTAPI
Performance as a feature:
- ✅ 23x faster end-to-end pipeline
- ✅ 438,000 rows/second throughput
- ✅ 8x more memory efficient
- ✅ Zero performance regressions in 3+ months
Development velocity:
- ✅ Benchmarks catch regressions in CI
- ✅ Flamegraphs guide optimization efforts
- ✅ Clear metrics for stakeholder communication
- ✅ Confidence in production deployments
## Summary
Bench-first development requires:
- Harness first — Set up benchmarking before optimization
- Measure baseline — Know where you’re starting from
- Optimize systematically — Parallelize, cache, profile
- Prevent regressions — CI gates with threshold checks
- Communicate clearly — Throughput, latency, memory
The result: Performance is a measurable, improvable feature—not a hope.
## Additional Resources
- Performance Benchmarks — Detailed TTAPI performance data
- Architecture Deep Dive — How architecture enables performance
- Why Rust? — Performance comparison with Python/C++/Go
Copyright © 2025 SKY, LLC. All rights reserved.