Compile time huffman trees in C++20

December 15, 2024

On how to build huffman trees from huffman tables at compile-time.

Intro

Here’s a neat trick I invented when working on the MP3 decoder for SerenityOS.

Say you have a hard-coded table of Huffman codes:

0b0 -> c
0b1000 -> z
0b1001 -> b
0b101 -> a
0b110 -> d
0b111 -> e

Your task is to write a decoder. So in a naive way you keep the flattened list of codes and do a simple search over it for every bit read:

template<typename Symbol>
struct HuffmanEntry {
    using symbol_type = Symbol;
    unsigned int code;
    size_t code_len;
    Symbol symbol;
};

constexpr std::array<HuffmanEntry<char>, 6> code_list {
    HuffmanEntry<char> {0b0, 1, 'c'},
    HuffmanEntry<char> {0b1000, 4, 'z'},
    HuffmanEntry<char> {0b1001, 4, 'b'},
    HuffmanEntry<char> {0b101, 3, 'a'},
    HuffmanEntry<char> {0b110, 3, 'd'},
    HuffmanEntry<char> {0b111, 3, 'e'}
};

auto decode_huffman(auto const& entries, BitStream& stream) {
    uint32_t code = 0;
    uint32_t code_len = 0;

    do {
        code = (code << 1) | stream.read_bit();
        code_len++;

        for (auto const& entry : entries) {
            if (code == entry.code && code_len == entry.code_len) {
                return entry.symbol;
            }
        }
    } while (stream.bits_remaining() > 0);

    throw "Illegal huffman code";
}

int main() {
    BitStream stream;
    stream.append_bits(0b1011100, 7);

    // output: adc
    while (stream.bits_remaining() > 0) {
        std::cout << decode_huffman(code_list, stream);
    }
}

This iterates over all Huffman codes for every new bit that is read again and again, and that’s just very inefficient. Huffman trees are called trees for a reason. If we make use of them, we cut the complexity down from O(n) to O(log n).

But how to structure the tree? We could create a new tree structure in which a node has pointers to its children. This would mean allocating memory during runtime. But we already have all the information available during compile time, so in theory it should be possible to create those trees then.

Instead of pointers, we can use one big array of nodes and reference the nodes using their indices.

First, we need to know the number of nodes so that we can allocate an array big enough. We only really know if we build the tree, but we also need an array to store the nodes in while building the tree. What we can do however is get the worst case size, which is 1 << (bit_count_of_longest_code + 1). Then we definitely have enough space to store all nodes.

#include <iostream>
#include <array>

template<typename Symbol>
struct HuffmanEntry {
    using symbol_type = Symbol;
    unsigned int code;
    size_t code_len;
    Symbol symbol;
};

template<typename Symbol>
struct HuffmanNode {
    int left;
    int right;
    unsigned int code;
    size_t code_length;
    Symbol symbol;
    constexpr bool is_leaf() const { return left == -1 && right == -1; }
};

template<typename Symbol, size_t Size>
using HuffmanNodes = std::array<HuffmanNode<Symbol>, Size>;

constexpr std::array<HuffmanEntry<char>, 6> code_list {
    HuffmanEntry<char> {0b0, 1, 'c'},
    HuffmanEntry<char> {0b1000, 4, 'z'},
    HuffmanEntry<char> {0b1001, 4, 'b'},
    HuffmanEntry<char> {0b101, 3, 'a'},
    HuffmanEntry<char> {0b110, 3, 'd'},
    HuffmanEntry<char> {0b111, 3, 'e'}
};

template<auto table>
constexpr size_t length_of_longest_huffman_code() {
    return std::max_element(table.begin(), table.end(), [](auto const& a, auto const& b) {
        return a.code_len < b.code_len;
    })->code_len;
}

template<auto table>
consteval auto make_huffman_tree() {
    using Symbol = decltype(table)::value_type::symbol_type;
    constexpr size_t size_worst_case = 1 << (length_of_longest_huffman_code<table>() + 1);

    HuffmanNodes<Symbol, size_worst_case> nodes;
    make_huffman_tree_internal(code_list, nodes);

    return nodes;
}

constexpr auto code_tree = make_huffman_tree<code_list>();

Of course make_huffman_tree_internal is still missing. It’s really cool how this can be simply written as a regular function with a stright forward implementation, just in a consteval context:

consteval size_t make_huffman_tree_internal(auto const& table, auto& nodes) {
    size_t allocation_count = 1;

    nodes = {};
    nodes[0].left = nodes[0].right = -1;

    for (auto const& entry : table) {
        int tree_pointer = 0;

        for (size_t j = 0; j < entry.code_len; j++) {
            bool const bit = (entry.code >> (entry.code_len - j - 1)) & 1;
            bool const end_of_code = j == entry.code_len - 1;

            int& target_index = bit ? nodes[tree_pointer].left
                                    : nodes[tree_pointer].right;

            if (target_index != -1) {
                if (end_of_code)
                    throw "Node is already filled in.";
                tree_pointer = target_index;
            } else {
                tree_pointer = target_index = allocation_count++;
                nodes[target_index].left = nodes[target_index].right = -1;
                nodes[target_index].code_length = j + 1;
                if (end_of_code) {
                    nodes[target_index].symbol = entry.symbol;
                    nodes[target_index].code = entry.code;
                }
            }
        }
    }

    return allocation_count;
}

And just for the fun of it, here’s a consteval decoder. Of course it’s not practical. Just decode the data once before compiling. But nevertheless, it’s cool. And the compiler output only includes the decoded chars “a”, “d” and “e”.

template<typename Symbol, size_t Size>
consteval std::array<Symbol, 256> decode_huffman(
    HuffmanNodes<Symbol, Size> const& tree,
    std::string_view const& bit_sequence) {

    std::array<Symbol, 256> output;
    size_t symbol_count = 0;
    size_t tree_pointer = 0;

    for (char bit_char : bit_sequence) {
        const bool bit = (bit_char == '1');

        tree_pointer = bit ? tree[tree_pointer].left
                           : tree[tree_pointer].right;

        if (tree_pointer == -1) throw "Illegal bit sequence detected";

        if (tree[tree_pointer].is_leaf()) {
            output[symbol_count++] = tree[tree_pointer].symbol;
            tree_pointer = 0;
        }
    }

    return output;
}

int main() {
    constexpr std::string_view encoded_sequence = "101110111";

    std::cout << "Decoding the sequence: " << encoded_sequence << std::endl;

    auto output = decode_huffman(code_tree, encoded_sequence);

    for (size_t i = 0; i < 3; i++) {
        std::cout << output[i];
    }

    return 0;
}
Compile time huffman trees in C++20 - December 15, 2024 - Arne Elster