Memory allocation made right
Read Time: 18 minutes
Memory allocation is often a silent performance killer in C++ applications. A single malloc() call can take anywhere from nanoseconds to milliseconds, depending on the allocator state, fragmentation, and system load. In this comprehensive post, I’ll share techniques for optimizing memory allocation that we’ve successfully applied in high-performance systems.
The Problem with malloc()
Standard allocators like glibc’s malloc are general-purpose. They work well on average but can be suboptimal for specific workloads:
// Innocent-looking code
for (int i = 0; i < 1000000; i++) {
    auto* obj = new MyObject();  // Hidden malloc() call
    process(obj);
    delete obj;                  // Hidden free() call
}
Problems:
- Thread contention: Global heap locks cause serialization
- Fragmentation: Mixed-size allocations lead to wasted memory
- Cache misses: Scattered allocations destroy locality
- System calls: Large allocations trigger expensive mmap() calls
Profiling Memory Allocation
Before optimizing, measure:
Using perf
perf record -e syscalls:sys_enter_mmap ./program
perf report
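If perf isn’t available, strace gives a quick syscall-level summary (counts and time spent in allocation-related calls):
strace -c -e trace=mmap,brk,munmap ./program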
Using tcmalloc’s profiler
#include <gperftools/heap-profiler.h>  // link with -ltcmalloc

HeapProfilerStart("profile");
// ... your code ...
HeapProfilerDump("checkpoint");
HeapProfilerStop();
Custom allocation tracking
std::atomic<size_t> allocation_count{0};
std::atomic<size_t> total_bytes{0};

void* operator new(size_t size) {
    allocation_count++;
    total_bytes += size;
    void* ptr = malloc(size);
    if (!ptr) throw std::bad_alloc();
    return ptr;  // the matching operator delete must still call free()
}
Optimization Strategies
1. Object Pools
Pre-allocate objects and reuse them:
#include <deque>
#include <stack>

template<typename T>
class ObjectPool {
    // std::deque keeps element addresses stable when the pool grows;
    // std::vector would invalidate outstanding pointers on reallocation.
    std::deque<T> pool;
    std::stack<T*> available;
public:
    explicit ObjectPool(size_t size = 0) : pool(size) {
        for (auto& obj : pool) {
            available.push(&obj);
        }
    }

    T* acquire() {
        if (available.empty()) {
            pool.emplace_back();  // grow on demand
            return &pool.back();
        }
        T* obj = available.top();
        available.pop();
        return obj;
    }

    void release(T* obj) {
        obj->reset();  // clear object state before reuse; assumes T provides reset()
        available.push(obj);
    }
};
Result: roughly a 10x speedup for allocation-heavy loops in our workloads.
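Rewiring the earlier loop onto the pool looks like this (a sketch; it assumes MyObject exposes the reset() method that release() calls):
ObjectPool<MyObject> pool(1024);

for (int i = 0; i < 1000000; i++) {
    auto* obj = pool.acquire();  // no malloc() on the hot path
    process(obj);
    pool.release(obj);           // recycled, never freed
}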
2. Arena Allocators
Allocate from a large chunk:
#include <cstddef>
#include <memory>
#include <vector>

class Arena {
    std::vector<std::unique_ptr<char[]>> blocks;
    char* current = nullptr;
    size_t remaining = 0;
public:
    void* allocate(size_t size) {
        // Round up so every returned pointer stays suitably aligned.
        size = (size + alignof(std::max_align_t) - 1)
               & ~(alignof(std::max_align_t) - 1);
        if (size > remaining) {
            // Carve a fresh block: at least 64 KiB to amortize allocation cost
            size_t block_size = std::max(size, size_t{64} * 1024);
            blocks.push_back(std::make_unique<char[]>(block_size));
            current = blocks.back().get();
            remaining = block_size;
        }
        void* result = current;
        current += size;
        remaining -= size;
        return result;
    }

    void reset() {
        // Bulk deallocation: drop every block in one sweep
        blocks.clear();
        current = nullptr;
        remaining = 0;
    }
};
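A common pattern is per-batch scratch space: allocate freely while processing, then reclaim everything with one reset(). A minimal sketch, where batches and process_item are illustrative names:
Arena arena;

for (const auto& batch : batches) {           // 'batches' is illustrative
    for (const auto& item : batch) {
        void* scratch = arena.allocate(256);  // cheap pointer bump
        process_item(item, scratch);          // hypothetical work function
    }
    arena.reset();  // one bulk sweep frees every allocation in the batch
}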
Benefits:
- No fragmentation
- Bulk deallocation
- Excellent cache locality
3. Thread-Local Allocation
Eliminate contention with thread-local heaps:
#include <new>  // placement new

thread_local Arena tl_arena;

template<typename T>
T* tl_new() {
    void* mem = tl_arena.allocate(sizeof(T));
    return new (mem) T();  // placement new: construct in arena memory
}
// Note: the arena never runs destructors, so call obj->~T() yourself
// before reset() if T is not trivially destructible.
4. Size-Class Segregation
Different pools for different sizes:
// Fixed-size slots for each class (illustrative sizes)
struct SmallObject  { alignas(std::max_align_t) char buf[64];  };
struct MediumObject { alignas(std::max_align_t) char buf[512]; };

class SizeClassAllocator {
    ObjectPool<SmallObject> small_pool;    // ≤ 64 bytes
    ObjectPool<MediumObject> medium_pool;  // ≤ 512 bytes
    Arena large_arena;                     // > 512 bytes
public:
    void* allocate(size_t size) {
        if (size <= 64)  return small_pool.acquire();
        if (size <= 512) return medium_pool.acquire();
        return large_arena.allocate(size);
    }
    // deallocate() would route each pointer back to its pool;
    // omitted here for brevity.
};
5. Custom STL Allocators
Make STL containers use your allocator:
#include <new>  // std::bad_alloc

template<typename T>
class PoolAllocator {
    ObjectPool<T>* pool;
public:
    using value_type = T;
    explicit PoolAllocator(ObjectPool<T>* p) : pool(p) {}

    T* allocate(size_t n) {
        // The pool hands out single objects, so this allocator suits
        // node-based containers, not contiguous ones like std::vector.
        if (n != 1) throw std::bad_alloc();
        return pool->acquire();
    }
    void deallocate(T* p, size_t) { pool->release(p); }

    bool operator==(const PoolAllocator& o) const { return pool == o.pool; }
    bool operator!=(const PoolAllocator& o) const { return pool != o.pool; }
};

// Usage: a production version also needs a rebinding constructor so
// containers like std::list can rebind to their node types.
struct Node { int value = 0; void reset() { value = 0; } };

ObjectPool<Node> node_pool(1000);
PoolAllocator<Node> alloc(&node_pool);
Node* n = alloc.allocate(1);  // drawn from the pool
alloc.deallocate(n, 1);       // recycled via Node::reset()
Advanced Techniques
Memory Prefetching
void process_array(int* data, size_t size) {
    for (size_t i = 0; i < size; i++) {
        if (i + 8 < size) {
            // Hint the cache 8 elements ahead (read, low temporal locality)
            __builtin_prefetch(&data[i + 8], 0, 1);
        }
        compute(data[i]);
    }
}
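The right prefetch distance is workload-dependent, so verify with hardware counters rather than guessing:
perf stat -e cache-misses,cache-references ./program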
NUMA-Aware Allocation
#include <numa.h>  // link with -lnuma

void* numa_allocate(size_t size, int node) {
    return numa_alloc_onnode(size, node);
}
// Memory from numa_alloc_onnode() must be returned with
// numa_free(ptr, size), not free().
Huge Pages
#include <sys/mman.h>

void* huge_malloc(size_t size) {
    // MAP_HUGETLB requires size to be a multiple of the huge page size
    // (typically 2 MiB) and huge pages pre-reserved on the system.
    void* ptr = mmap(nullptr, size,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                     -1, 0);
    if (ptr == MAP_FAILED) {
        // Fall back to regular pages
        ptr = mmap(nullptr, size,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS,
                   -1, 0);
        if (ptr == MAP_FAILED) return nullptr;
    }
    return ptr;
}
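When explicit huge pages aren’t reserved on the system, a softer option is to hint transparent huge pages. A minimal sketch; the call is advisory, so the kernel may ignore it:
#include <sys/mman.h>

// Ask the kernel to back an existing mapping with transparent huge pages.
// The range should be page-aligned, which mmap() results already are.
void hint_huge_pages(void* ptr, size_t size) {
    madvise(ptr, size, MADV_HUGEPAGE);  // Linux-specific, advisory only
}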
Real-World Case Study
Before Optimization
// Naive implementation
class DataProcessor {
    void process() {
        for (auto& item : input) {
            auto* result = new Result();
            compute(item, result);
            output.push_back(result);
        }
    }
};

// Performance: 1000 items/sec
// Memory usage: 500MB
After Optimization
class OptimizedProcessor {
    ObjectPool<Result> result_pool;
    Arena temp_arena;

    void process() {
        for (auto& item : input) {
            auto* result = result_pool.acquire();
            compute_with_arena(item, result, temp_arena);
            output.push_back(result);
            temp_arena.reset();  // Bulk free temporary allocations
        }
    }
};

// Performance: 15000 items/sec (15x speedup)
// Memory usage: 50MB (10x reduction)
Memory Allocation Debugging
Detecting Leaks
#include <cstdlib>
#include <iostream>
#include <unordered_map>

class LeakDetector {
    std::unordered_map<void*, size_t> allocations;
public:
    void* allocate(size_t size) {
        void* ptr = malloc(size);
        allocations[ptr] = size;
        return ptr;
    }
    void deallocate(void* ptr) {
        allocations.erase(ptr);
        free(ptr);
    }
    ~LeakDetector() {
        if (!allocations.empty()) {
            std::cerr << "Memory leaks detected: "
                      << allocations.size() << " blocks\n";
        }
    }
};
Detecting Use-After-Free
#include <cstring>

class SafeAllocator {
public:
    void deallocate(void* ptr, size_t size) {
        memset(ptr, 0xDE, size);  // poison freed memory so stale reads
        free(ptr);                // show up as 0xDEDEDEDE patterns
    }
};
Best Practices
- Profile first: Don’t guess where allocations hurt
- Batch allocations: Allocate multiple objects at once (see the sketch after this list)
- Avoid allocations in hot paths: Pre-allocate or use stack
- Consider memory layout: Group related data for cache efficiency
- Use appropriate allocators: One size doesn’t fit all
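To make the batching point concrete: reserving capacity once replaces many incremental heap growths with a single allocation. A minimal sketch:
#include <vector>

void fill(std::vector<int>& out, size_t n) {
    out.reserve(n);  // one allocation up front instead of repeated regrowth
    for (size_t i = 0; i < n; i++) {
        out.push_back(static_cast<int>(i));
    }
}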
Modern C++ Alternatives
std::pmr (C++17)
#include <memory_resource>
std::pmr::monotonic_buffer_resource pool_resource;
std::pmr::vector<int> vec(&pool_resource);
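monotonic_buffer_resource can also draw from a caller-provided buffer, so short-lived containers never touch the heap at all. A small sketch:
#include <memory_resource>
#include <vector>

void process_small_batch() {
    char buffer[4096];  // stack storage backs the vector below
    std::pmr::monotonic_buffer_resource resource(buffer, sizeof(buffer));
    std::pmr::vector<int> vec(&resource);
    vec.reserve(64);  // served from the stack buffer, not the heap
    for (int i = 0; i < 64; i++) vec.push_back(i);
}   // all memory released at once when 'resource' goes out of scope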
std::make_unique/make_shared
// make_shared fuses the object and its control block into one allocation;
// make_unique has no control block, so it is a single allocation by nature
auto ptr = std::make_shared<MyClass>();
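These compose with custom allocators too: std::allocate_shared places both the object and its control block in your resource. A sketch, reusing the pool_resource from above:
#include <memory>

std::pmr::polymorphic_allocator<MyClass> alloc(&pool_resource);
auto obj = std::allocate_shared<MyClass>(alloc);  // object and control block
                                                  // both come from the resource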
Performance Comparison
Typical per-allocation latency (approximate):
| Allocator | Small Objects | Large Objects | Multithreaded |
|---|---|---|---|
| malloc | 100 ns | 1000 ns | 500 ns |
| tcmalloc | 50 ns | 800 ns | 100 ns |
| jemalloc | 60 ns | 750 ns | 120 ns |
| Pool | 5 ns | N/A | 10 ns |
| Arena | 3 ns | 10 ns | 20 ns |
Conclusion
Memory allocation optimization can yield dramatic performance improvements. The key is understanding your allocation patterns and choosing the right strategy:
- Many small objects: Use object pools
- Temporary allocations: Use arena allocators
- Multithreaded: Use thread-local allocation
- Mixed sizes: Use size-class segregation
Remember: premature optimization is the root of all evil, but informed optimization based on profiling data is the path to performance.
Next time: We’ll explore how these techniques apply to GPU memory management in CUDA/ROCm applications.