Memory allocation made right
Read Time: 18 minutes
Memory allocation is often a silent performance killer in C++ applications. A single malloc() call can take anywhere from nanoseconds to milliseconds, depending on the allocator state, fragmentation, and system load. In this comprehensive post, I’ll share techniques for optimizing memory allocation that we’ve successfully applied in high-performance systems.
The Problem with malloc()
Standard allocators like glibc’s malloc are general-purpose. They work well on average but can be suboptimal for specific workloads:
// Innocent-looking code
for (int i = 0; i < 1000000; i++) {
    auto* obj = new MyObject();  // Hidden malloc() call
    process(obj);
    delete obj;                  // Hidden free() call
}
Problems:
- Thread contention: Global heap locks cause serialization
- Fragmentation: Mixed-size allocations lead to wasted memory
- Cache misses: Scattered allocations destroy locality
- System calls: Large allocations trigger expensive mmap() calls
Profiling Memory Allocation
Before optimizing, measure:
Using perf
perf record -e syscalls:sys_enter_mmap ./program
perf report
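If perf isn’t available, strace gives a quick syscall-level summary (counts and time spent in allocation-related calls):
strace -c -e trace=mmap,brk,munmap ./program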
Using tcmalloc’s profiler
#include <gperftools/heap-profiler.h>  // link with -ltcmalloc

HeapProfilerStart("profile");
// ... your code ...
HeapProfilerDump("checkpoint");
HeapProfilerStop();
Custom allocation tracking
std::atomic<size_t> allocation_count{0};
std::atomic<size_t> total_bytes{0};

void* operator new(size_t size) {
    allocation_count++;
    total_bytes += size;
    void* ptr = malloc(size);
    if (!ptr) throw std::bad_alloc();
    return ptr;  // the matching operator delete must still call free()
}
Optimization Strategies
1. Object Pools
Pre-allocate objects and reuse them:
#include <deque>
#include <stack>

template<typename T>
class ObjectPool {
    // std::deque keeps element addresses stable when the pool grows;
    // std::vector would invalidate outstanding pointers on reallocation.
    std::deque<T> pool;
    std::stack<T*> available;
public:
    explicit ObjectPool(size_t size = 0) : pool(size) {
        for (auto& obj : pool) {
            available.push(&obj);
        }
    }

    T* acquire() {
        if (available.empty()) {
            pool.emplace_back();  // grow on demand
            return &pool.back();
        }
        T* obj = available.top();
        available.pop();
        return obj;
    }

    void release(T* obj) {
        obj->reset();  // clear object state before reuse; assumes T provides reset()
        available.push(obj);
    }
};
Result: roughly a 10x speedup for allocation-heavy loops in our workloads.
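Rewiring the earlier loop onto the pool looks like this (a sketch; it assumes MyObject exposes the reset() method that release() calls):
ObjectPool<MyObject> pool(1024);

for (int i = 0; i < 1000000; i++) {
    auto* obj = pool.acquire();  // no malloc() on the hot path
    process(obj);
    pool.release(obj);           // recycled, never freed
}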
2. Arena Allocators
Allocate from a large chunk:
#include <cstddef>
#include <memory>
#include <vector>

class Arena {
    std::vector<std::unique_ptr<char[]>> blocks;
    char* current = nullptr;
    size_t remaining = 0;
public:
    void* allocate(size_t size) {
        // Round up so every returned pointer stays suitably aligned.
        size = (size + alignof(std::max_align_t) - 1)
               & ~(alignof(std::max_align_t) - 1);
        if (size > remaining) {
            // Carve a fresh block: at least 64 KiB to amortize allocation cost
            size_t block_size = std::max(size, size_t{64} * 1024);
            blocks.push_back(std::make_unique<char[]>(block_size));
            current = blocks.back().get();
            remaining = block_size;
        }
        void* result = current;
        current += size;
        remaining -= size;
        return result;
    }

    void reset() {
        // Bulk deallocation: drop every block in one sweep
        blocks.clear();
        current = nullptr;
        remaining = 0;
    }
};
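A common pattern is per-batch scratch space: allocate freely while processing, then reclaim everything with one reset(). A minimal sketch, where batches and process_item are illustrative names:
Arena arena;

for (const auto& batch : batches) {           // 'batches' is illustrative
    for (const auto& item : batch) {
        void* scratch = arena.allocate(256);  // cheap pointer bump
        process_item(item, scratch);          // hypothetical work function
    }
    arena.reset();  // one bulk sweep frees every allocation in the batch
}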
Benefits:
- No fragmentation
- Bulk deallocation
- Excellent cache locality
3. Thread-Local Allocation
Eliminate contention with thread-local heaps:
#include <new>  // placement new

thread_local Arena tl_arena;

template<typename T>
T* tl_new() {
    void* mem = tl_arena.allocate(sizeof(T));
    return new (mem) T();  // placement new: construct in arena memory
}
// Note: the arena never runs destructors, so call obj->~T() yourself
// before reset() if T is not trivially destructible.
4. Size-Class Segregation
Different pools for different sizes:
// Fixed-size slots for each class (illustrative sizes)
struct SmallObject  { alignas(std::max_align_t) char buf[64];  };
struct MediumObject { alignas(std::max_align_t) char buf[512]; };

class SizeClassAllocator {
    ObjectPool<SmallObject> small_pool;    // ≤ 64 bytes
    ObjectPool<MediumObject> medium_pool;  // ≤ 512 bytes
    Arena large_arena;                     // > 512 bytes
public:
    void* allocate(size_t size) {
        if (size <= 64)  return small_pool.acquire();
        if (size <= 512) return medium_pool.acquire();
        return large_arena.allocate(size);
    }
    // deallocate() would route each pointer back to its pool;
    // omitted here for brevity.
};
5. Custom STL Allocators
Make STL containers use your allocator:
#include <new>  // std::bad_alloc

template<typename T>
class PoolAllocator {
    ObjectPool<T>* pool;
public:
    using value_type = T;
    explicit PoolAllocator(ObjectPool<T>* p) : pool(p) {}

    T* allocate(size_t n) {
        // The pool hands out single objects, so this allocator suits
        // node-based containers, not contiguous ones like std::vector.
        if (n != 1) throw std::bad_alloc();
        return pool->acquire();
    }
    void deallocate(T* p, size_t) { pool->release(p); }

    bool operator==(const PoolAllocator& o) const { return pool == o.pool; }
    bool operator!=(const PoolAllocator& o) const { return pool != o.pool; }
};

// Usage: a production version also needs a rebinding constructor so
// containers like std::list can rebind to their node types.
struct Node { int value = 0; void reset() { value = 0; } };

ObjectPool<Node> node_pool(1000);
PoolAllocator<Node> alloc(&node_pool);
Node* n = alloc.allocate(1);  // drawn from the pool
alloc.deallocate(n, 1);       // recycled via Node::reset()
Advanced Techniques
Memory Prefetching
void process_array(int* data, size_t size) {
    for (size_t i = 0; i < size; i++) {
        if (i + 8 < size) {
            // Hint the cache 8 elements ahead (read, low temporal locality)
            __builtin_prefetch(&data[i + 8], 0, 1);
        }
        compute(data[i]);
    }
}
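The right prefetch distance is workload-dependent, so verify with hardware counters rather than guessing:
perf stat -e cache-misses,cache-references ./program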
NUMA-Aware Allocation
#include <numa.h>  // link with -lnuma

void* numa_allocate(size_t size, int node) {
    return numa_alloc_onnode(size, node);
}
// Memory from numa_alloc_onnode() must be returned with
// numa_free(ptr, size), not free().
Huge Pages
#include <sys/mman.h>

void* huge_malloc(size_t size) {
    // MAP_HUGETLB requires size to be a multiple of the huge page size
    // (typically 2 MiB) and huge pages pre-reserved on the system.
    void* ptr = mmap(nullptr, size,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                     -1, 0);
    if (ptr == MAP_FAILED) {
        // Fall back to regular pages
        ptr = mmap(nullptr, size,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS,
                   -1, 0);
        if (ptr == MAP_FAILED) return nullptr;
    }
    return ptr;
}
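When explicit huge pages aren’t reserved on the system, a softer option is to hint transparent huge pages. A minimal sketch; the call is advisory, so the kernel may ignore it:
#include <sys/mman.h>

// Ask the kernel to back an existing mapping with transparent huge pages.
// The range should be page-aligned, which mmap() results already are.
void hint_huge_pages(void* ptr, size_t size) {
    madvise(ptr, size, MADV_HUGEPAGE);  // Linux-specific, advisory only
}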
Real-World Case Study
Before Optimization
// Naive implementation
class DataProcessor {
    void process() {
        for (auto& item : input) {
            auto* result = new Result();
            compute(item, result);
            output.push_back(result);
        }
    }
};

// Performance: 1000 items/sec
// Memory usage: 500MB
After Optimization
class OptimizedProcessor {
    ObjectPool<Result> result_pool;
    Arena temp_arena;

    void process() {
        for (auto& item : input) {
            auto* result = result_pool.acquire();
            compute_with_arena(item, result, temp_arena);
            output.push_back(result);
            temp_arena.reset();  // Bulk free temporary allocations
        }
    }
};

// Performance: 15000 items/sec (15x speedup)
// Memory usage: 50MB (10x reduction)
Memory Allocation Debugging
Detecting Leaks
#include <cstdlib>
#include <iostream>
#include <unordered_map>

class LeakDetector {
    std::unordered_map<void*, size_t> allocations;
public:
    void* allocate(size_t size) {
        void* ptr = malloc(size);
        allocations[ptr] = size;
        return ptr;
    }
    void deallocate(void* ptr) {
        allocations.erase(ptr);
        free(ptr);
    }
    ~LeakDetector() {
        if (!allocations.empty()) {
            std::cerr << "Memory leaks detected: "
                      << allocations.size() << " blocks\n";
        }
    }
};
Detecting Use-After-Free
#include <cstring>

class SafeAllocator {
public:
    void deallocate(void* ptr, size_t size) {
        memset(ptr, 0xDE, size);  // poison freed memory so stale reads
        free(ptr);                // show up as 0xDEDEDEDE patterns
    }
};
Best Practices
- Profile first: Don’t guess where allocations hurt
- Batch allocations: Allocate multiple objects at once (see the sketch after this list)
- Avoid allocations in hot paths: Pre-allocate or use stack
- Consider memory layout: Group related data for cache efficiency
- Use appropriate allocators: One size doesn’t fit all
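To make the batching point concrete: reserving capacity once replaces many incremental heap growths with a single allocation. A minimal sketch:
#include <vector>

void fill(std::vector<int>& out, size_t n) {
    out.reserve(n);  // one allocation up front instead of repeated regrowth
    for (size_t i = 0; i < n; i++) {
        out.push_back(static_cast<int>(i));
    }
}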
Modern C++ Alternatives
std::pmr (C++17)
#include <memory_resource>
std::pmr::monotonic_buffer_resource pool_resource;
std::pmr::vector<int> vec(&pool_resource);
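monotonic_buffer_resource can also draw from a caller-provided buffer, so short-lived containers never touch the heap at all. A small sketch:
#include <memory_resource>
#include <vector>

void process_small_batch() {
    char buffer[4096];  // stack storage backs the vector below
    std::pmr::monotonic_buffer_resource resource(buffer, sizeof(buffer));
    std::pmr::vector<int> vec(&resource);
    vec.reserve(64);  // served from the stack buffer, not the heap
    for (int i = 0; i < 64; i++) vec.push_back(i);
}   // all memory released at once when 'resource' goes out of scope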
std::make_unique/make_shared
// make_shared fuses the object and its control block into one allocation;
// make_unique has no control block, so it is a single allocation by nature
auto ptr = std::make_shared<MyClass>();
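These compose with custom allocators too: std::allocate_shared places both the object and its control block in your resource. A sketch, reusing the pool_resource from above:
#include <memory>

std::pmr::polymorphic_allocator<MyClass> alloc(&pool_resource);
auto obj = std::allocate_shared<MyClass>(alloc);  // object and control block
                                                  // both come from the resource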
Performance Comparison
Typical per-allocation latency (approximate):
| Allocator | Small Objects | Large Objects | Multithreaded |
|---|---|---|---|
| malloc | 100 ns | 1000 ns | 500 ns |
| tcmalloc | 50 ns | 800 ns | 100 ns |
| jemalloc | 60 ns | 750 ns | 120 ns |
| Pool | 5 ns | N/A | 10 ns |
| Arena | 3 ns | 10 ns | 20 ns |
Conclusion
Memory allocation optimization can yield dramatic performance improvements. The key is understanding your allocation patterns and choosing the right strategy:
- Many small objects: Use object pools
- Temporary allocations: Use arena allocators
- Multithreaded: Use thread-local allocation
- Mixed sizes: Use size-class segregation
Remember: premature optimization is the root of all evil, but informed optimization based on profiling data is the path to performance.
Next time: We’ll explore how these techniques apply to GPU memory management in CUDA/ROCm applications.