Memory allocation made right

Read Time: 18 minutes

Memory allocation is often a silent performance killer in C++ applications. A single malloc() call can take anywhere from nanoseconds to milliseconds, depending on the allocator state, fragmentation, and system load. In this comprehensive post, I’ll share techniques for optimizing memory allocation that we’ve successfully applied in high-performance systems.

The Problem with malloc()

Standard allocators like glibc’s malloc are general-purpose: they work well on average but can be suboptimal for specific workloads:

// Innocent-looking code
for (int i = 0; i < 1000000; i++) {
    auto* obj = new MyObject();  // Hidden malloc() call
    process(obj);
    delete obj;  // Hidden free() call
}
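
Before diving into the failure modes below, it is worth putting a number on that loop. Here is a minimal timing sketch, reusing MyObject and process() from above:

#include <chrono>
#include <cstdio>

void benchmark() {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000000; i++) {
        auto* obj = new MyObject();
        process(obj);
        delete obj;
    }
    std::chrono::duration<double, std::nano> elapsed =
        std::chrono::steady_clock::now() - start;
    printf("%.1f ns per iteration\n", elapsed.count() / 1e6);
}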

Problems:

  1. Thread contention: Global heap locks cause serialization
  2. Fragmentation: Mixed-size allocations lead to wasted memory
  3. Cache misses: Scattered allocations destroy locality
  4. System calls: Large allocations trigger expensive mmap() calls

Profiling Memory Allocation

Before optimizing, measure:

Using perf

# Count mmap() syscalls, which large allocations fall back to
perf record -e syscalls:sys_enter_mmap ./program
perf report

Using tcmalloc’s profiler

// Link against gperftools: -ltcmalloc
#include <gperftools/heap-profiler.h>

HeapProfilerStart("profile");    // prefix for the dump files
// ... your code ...
HeapProfilerDump("checkpoint");  // write a snapshot now
HeapProfilerStop();

Custom allocation tracking

std::atomic<size_t> allocation_count{0};  // atomic: hooks run on every thread
std::atomic<size_t> total_bytes{0};

void* operator new(size_t size) {
    allocation_count++;
    total_bytes += size;
    void* ptr = malloc(size);
    if (!ptr) throw std::bad_alloc();  // operator new must not return null
    return ptr;
}
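
To see live-object counts as well, pair it with a replacement operator delete:

std::atomic<size_t> free_count{0};

void operator delete(void* ptr) noexcept {
    free_count++;   // allocation_count - free_count = blocks still live
    free(ptr);
}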

Optimization Strategies

1. Object Pools

Pre-allocate objects and reuse them:

template<typename T>
class ObjectPool {
    // std::deque keeps element addresses stable as the pool grows;
    // a std::vector would invalidate every outstanding pointer on resize
    std::deque<T> pool;
    std::stack<T*> available;

public:
    explicit ObjectPool(size_t size) : pool(size) {
        for (auto& obj : pool) {
            available.push(&obj);
        }
    }

    T* acquire() {
        if (available.empty()) {
            pool.emplace_back();  // grow on demand
            return &pool.back();
        }
        T* obj = available.top();
        available.pop();
        return obj;
    }

    void release(T* obj) {
        obj->reset();  // Clear object state before reuse
        available.push(obj);
    }
};

Result: roughly a 10x speedup for allocation-heavy loops in our benchmarks.
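For comparison with the loop from the introduction, a minimal usage sketch (assuming MyObject provides the reset() method the pool expects):

ObjectPool<MyObject> pool(1024);      // pre-allocate 1024 objects
for (int i = 0; i < 1000000; i++) {
    MyObject* obj = pool.acquire();   // no malloc in the steady state
    process(obj);
    pool.release(obj);                // back onto the free list
}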

2. Arena Allocators

Allocate from a large chunk:

class Arena {
    std::vector<std::unique_ptr<char[]>> blocks;
    char* current = nullptr;
    size_t remaining = 0;

public:
    void* allocate(size_t size) {
        // Round up so every returned pointer stays suitably aligned
        size = (size + alignof(std::max_align_t) - 1)
               & ~(alignof(std::max_align_t) - 1);

        if (size > remaining) {
            // Allocate a new block (leftover space in the old one is wasted)
            size_t block_size = std::max(size, size_t{64} * 1024);
            blocks.push_back(std::make_unique<char[]>(block_size));
            current = blocks.back().get();
            remaining = block_size;
        }

        void* result = current;
        current += size;
        remaining -= size;
        return result;
    }

    void reset() {
        // Bulk deallocation: every pointer handed out is freed at once
        blocks.clear();
        current = nullptr;
        remaining = 0;
    }
};
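
A typical per-request pattern, sketched with hypothetical next_request() and handle() functions:

Arena arena;
while (Request* req = next_request()) {
    // Scratch space that lives exactly as long as this request
    char* scratch = static_cast<char*>(arena.allocate(req->payload_size));
    handle(req, scratch);
    arena.reset();  // one cheap call frees every allocation above
}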

Benefits:

  • No fragmentation
  • Bulk deallocation
  • Excellent cache locality

3. Thread-Local Allocation

Eliminate contention with thread-local heaps:

thread_local Arena tl_arena;

template<typename T, typename... Args>
T* tl_new(Args&&... args) {
    void* mem = tl_arena.allocate(sizeof(T));
    return new (mem) T(std::forward<Args>(args)...);  // placement new
}
// Destructors never run: reserve this for trivially destructible
// types, or call ~T() manually before resetting the arena.
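
A sketch of the payoff: each thread allocates from its own arena, so no locks are taken (Temp and use() are hypothetical):

void worker() {
    for (int i = 0; i < 100000; i++) {
        Temp* tmp = tl_new<Temp>();  // hits only this thread's arena
        use(tmp);
    }
    tl_arena.reset();  // bulk-free everything this thread allocated
}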

4. Size-Class Segregation

Different pools for different sizes:

// SmallObject and MediumObject stand in for fixed-size slabs, e.g.
// struct SmallObject { char bytes[64]; }; with a trivial reset()
class SizeClassAllocator {
    ObjectPool<SmallObject> small_pool;   // ≤ 64 bytes
    ObjectPool<MediumObject> medium_pool; // ≤ 512 bytes
    Arena large_arena;                    // > 512 bytes

public:
    void* allocate(size_t size) {
        if (size <= 64) return small_pool.acquire();
        if (size <= 512) return medium_pool.acquire();
        return large_arena.allocate(size);
    }
    // A full version also needs deallocate() to route each pointer
    // back to the pool it came from
};

5. Custom STL Allocators

Make STL containers use your allocator:

template<typename T>
class PoolAllocator {
    ObjectPool<T>* pool;

public:
    using value_type = T;

    explicit PoolAllocator(ObjectPool<T>* p) : pool(p) {}

    T* allocate(size_t n) {
        // The pool hands out single objects; fall back to the global
        // heap for array allocations such as vector regrowth
        if (n == 1) return pool->acquire();
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }

    void deallocate(T* p, size_t n) {
        if (n == 1) pool->release(p);
        else ::operator delete(p);
    }

    bool operator==(const PoolAllocator& other) const { return pool == other.pool; }
    bool operator!=(const PoolAllocator& other) const { return !(*this == other); }
};

// Usage: MyObject must provide reset(), as ObjectPool requires.
// A production version also needs a converting constructor so that
// node-based containers can rebind the allocator to their node type.
ObjectPool<MyObject> pool(1000);
std::vector<MyObject, PoolAllocator<MyObject>> vec(PoolAllocator<MyObject>(&pool));

Advanced Techniques

Memory Prefetching

void process_array(int* data, size_t size) {
    for (size_t i = 0; i < size; i++) {
        // Prefetch 8 elements ahead (0 = read, 1 = low temporal locality);
        // the hint never faults, so running past the end is harmless
        __builtin_prefetch(&data[i + 8], 0, 1);
        compute(data[i]);
    }
}

NUMA-Aware Allocation

#include <numa.h>  // link with -lnuma; check numa_available() >= 0 first

void* numa_allocate(size_t size, int node) {
    return numa_alloc_onnode(size, node);
}
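
A usage sketch that keeps computation and data on the same node (numa_run_on_node() and numa_free() are also part of libnuma):

numa_run_on_node(0);                 // pin this thread to node 0
size_t bytes = 1 << 20;
double* data = static_cast<double*>(numa_allocate(bytes, 0));
// ... compute on data with node-local latency ...
numa_free(data, bytes);              // NUMA allocations need numa_free()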

Huge Pages

#include <sys/mman.h>

// size should be a multiple of the huge page size (typically 2 MB on x86-64)
void* huge_malloc(size_t size) {
    void* ptr = mmap(nullptr, size,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                     -1, 0);
    if (ptr == MAP_FAILED) {
        // Fall back to regular pages
        ptr = mmap(nullptr, size,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS,
                   -1, 0);
    }
    return ptr == MAP_FAILED ? nullptr : ptr;
}
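
When MAP_HUGETLB fails because no huge pages are pre-reserved, Linux can often still back a region with transparent huge pages on request:

// Hint the kernel to use transparent huge pages for an existing mapping
madvise(ptr, size, MADV_HUGEPAGE);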

Real-World Case Study

Before Optimization

// Naive implementation
class DataProcessor {
    void process() {
        for (auto& item : input) {
            auto* result = new Result();
            compute(item, result);
            output.push_back(result);
        }
    }
};
// Performance: 1000 items/sec
// Memory usage: 500MB

After Optimization

class OptimizedProcessor {
    ObjectPool<Result> result_pool;
    Arena temp_arena;

    void process() {
        for (auto& item : input) {
            auto* result = result_pool.acquire();
            compute_with_arena(item, result, temp_arena);
            output.push_back(result);
            temp_arena.reset();  // Bulk free temporary allocations
        }
    }
};
// Performance: 15000 items/sec (15x speedup)
// Memory usage: 50MB (10x reduction)

Memory Allocation Debugging

Detecting Leaks

class LeakDetector {
    std::unordered_map<void*, size_t> allocations;

public:
    void* allocate(size_t size) {
        void* ptr = malloc(size);
        if (ptr) allocations[ptr] = size;  // don't record failed allocations
        return ptr;
    }

    void deallocate(void* ptr) {
        allocations.erase(ptr);
        free(ptr);
    }

    ~LeakDetector() {
        // Anything still tracked at shutdown was never freed
        if (!allocations.empty()) {
            std::cerr << "Memory leaks detected: "
                      << allocations.size() << " blocks\n";
        }
    }
};
// Note: guard the map with a mutex before using this across threads.

Detecting Use-After-Free

class SafeAllocator {
public:
    void deallocate(void* ptr, size_t size) {
        memset(ptr, 0xDE, size);  // Poison freed memory so stale reads stand out
        free(ptr);
    }
};
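
For heavier-duty checking, AddressSanitizer (compile with -fsanitize=address) detects use-after-free, buffer overflows, and leaks automatically, at roughly 2x runtime overhead.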

Best Practices

  1. Profile first: Don’t guess where allocations hurt
  2. Batch allocations: Allocate multiple objects at once
  3. Avoid allocations in hot paths: Pre-allocate or use the stack (sketched below)
  4. Consider memory layout: Group related data for cache efficiency
  5. Use appropriate allocators: One size doesn’t fit all
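
A small sketch of points 2 and 3 together; Result and compute() here are placeholders:

std::vector<Result> results;
results.reserve(input.size());   // one up-front allocation instead of repeated regrowth

char scratch[4096];              // stack scratch buffer: zero heap traffic per item
for (const auto& item : input) {
    results.push_back(compute(item, scratch, sizeof(scratch)));
}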

Modern C++ Alternatives

std::pmr (C++17)

#include <memory_resource>

std::pmr::monotonic_buffer_resource pool_resource;
std::pmr::vector<int> vec(&pool_resource);
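
The same resource can also be pinned to a stack buffer, so small workloads never touch the heap at all:

#include <memory_resource>

char buffer[64 * 1024];
std::pmr::monotonic_buffer_resource stack_resource(buffer, sizeof(buffer));
std::pmr::vector<int> fast_vec(&stack_resource);  // grows inside buffer first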

std::make_unique/make_shared

// Single allocation for control block + object
auto ptr = std::make_shared<MyClass>();

Performance Comparison

Approximate per-allocation latencies:

Allocator   Small Objects   Large Objects   Multithreaded
malloc      100 ns          1000 ns         500 ns
tcmalloc    50 ns           800 ns          100 ns
jemalloc    60 ns           750 ns          120 ns
Pool        5 ns            N/A             10 ns
Arena       3 ns            10 ns           20 ns

Conclusion

Memory allocation optimization can yield dramatic performance improvements. The key is understanding your allocation patterns and choosing the right strategy:

  • Many small objects: Use object pools
  • Temporary allocations: Use arena allocators
  • Multithreaded: Use thread-local allocation
  • Mixed sizes: Use size-class segregation

Remember: premature optimization is the root of all evil, but informed optimization based on profiling data is the path to performance.

Next time: We’ll explore how these techniques apply to GPU memory management in CUDA/ROCm applications.



