Handling endianness universally rather than instantially

I'm reading from files whose format is consistent across platforms, but may be big or little endian depending on the platform the file has been built for. Said platform is defined by a value in the file.

Currently, the way I'm handling endianness is with if statements, one reading the file normally, the other using byteswap intrinsics:

// source.h
class File {
public:
    enum class Endian {
        Little = 1,
        Big = 2
    };
};
// ...removed...

// source.cpp
#include "source.h"
#include <fstream>
std::ifstream file;
File::Endian endianness;

// ...removed...

bool GetPlatform() {
    uint32_t platform;
    file.read(reinterpret_cast<char*>(&platform), sizeof(platform));
    if (platform == 1) {
        endianness = File::Endian::Little;
    }
    else if (platform == 2 << 24) {
        endianness = File::Endian::Big;
    }
    // ...removed...
}

void ReadData() {
    uint32_t data;
    uint32_t dataLittle;

    if (endianness == File::Endian::Little) {
        file.read(reinterpret_cast<char*>(&data), sizeof(data));
    }
    else if (endianness == File::Endian::Big) {
        file.read(reinterpret_cast<char*>(&data), sizeof(data));
        dataLittle = _byteswap_ulong(data);
    }
}

My question is, would it be possible to forgo swapping each value when the file is big endian and instead set the endianness universally? Below is a potential example of what I mean:

bool GetPlatform() {
    uint32_t platform;
    file.read(reinterpret_cast<char*>(&platform), sizeof(platform));
    if (platform == 1) {
        // Universally set the endianness to little endian
    }
    else if (platform == 2 << 24) {
        // Universally set the endianness to big endian
    }
    // ...removed...
}

void ReadData() {
    uint32_t data;
    file.read(reinterpret_cast<char*>(&data), sizeof(data)); // Data is now read correctly regardless of endianness
}

My main reason for asking is that it would essentially halve the amount of code per function, as it would no longer require if statements for endianness.

Additionally, could std::endian be of use for this task? Its examples only indicate use in detecting host endianness, but I'm unsure as to whether it has any further uses or not.

If I understand your situation, your basic issue is that you lack a level of abstraction. You have a bunch of functions that read various data structures from your file. Since these functions directly call std::ifstream::read, they all need to know both the structure they are reading and the layout of the file. That is two tasks, which is one more than ideal. You would be better off splitting this logic into two levels of abstraction. Let's call the functions for the new level ReadBytes since they focus on getting bytes from the file. Since Microsoft provides three byteswap intrinsics, there would be three of these functions. Here's a first stab at the one for 4-byte values.

void ReadBytes(std::ifstream & file, File::Endian endianness, uint32_t & data) {
    file.read(reinterpret_cast<char*>(&data), sizeof(data));
    if (endianness == File::Endian::Big) {
        data = _byteswap_ulong(data);
    }
}

Note that I've returned the data via a parameter. This is to allow all three functions to have the same name; the type of this parameter tells the compiler which overload to use. (There are other approaches. Coding styles differ.)

There are other improvements to be made, but this is enough to create the new level of abstraction. Your various functions that read data from the file would change to look like the following.

void ReadData() {
    uint32_t data;

    ReadBytes(file, endianness, data);
    // More processing here, maybe more reads.
}

With this small sample code, the savings are not readily apparent. However, you indicated that there could be numerous functions filling the role of ReadData. This approach shifts the burden of correcting endianness from those functions down to the new ReadBytes functions. The number of if statements gets reduced from "hundreds, if not thousands" to three.

This change is motivated by a programming principle often called "don't repeat yourself". The same principle can motivate questions like "why is there more than one function that needs this code?"


Another issue complicating things for you is that you seem to have taken a procedural approach to the problem, rather than an object-oriented one. Symptoms of a procedural approach can include excessive function parameters (e.g. endianness as a parameter) and global variables. The interface would be easier to use if it were wrapped in a class. Here is a start toward declaring such a class (i.e. a start to the header file). Note that the endianness is private, and that this header has no indication of how endianness is determined. If you have good encapsulation, code outside this class will not care which platform created the file.

// Designed as a drop-in replacement for an ifstream.
// (Non-public inheritance *might* be appropriate if you want to restrict the interface.)
class IFile : public std::ifstream {
private:
    File::Endian endianness;

public:
    // Mimic the constructors of std::ifstream that you need.
    explicit IFile(const std::string & filename);

    // It should be possible to use some template magic to simplify the
    // definition of these three functions, but since there are only three:
    void ReadBytes(uint16_t & data) {
        read(reinterpret_cast<char*>(&data), sizeof(data)); // inherited from ifstream
        if (endianness == File::Endian::Big) {
            data = _byteswap_ushort(data);
        }
    }
    void ReadBytes(uint32_t & data) {
        read(reinterpret_cast<char*>(&data), sizeof(data));
        if (endianness == File::Endian::Big) {
            data = _byteswap_ulong(data);
        }
    }
    void ReadBytes(uint64_t & data) {
        read(reinterpret_cast<char*>(&data), sizeof(data));
        if (endianness == File::Endian::Big) {
            data = _byteswap_uint64(data);
        }
    }
};

This is just a start. The interface needs more work, for one thing. Furthermore, the ReadBytes functions could be written a bit more portably, perhaps using std::endian instead of assuming little-endian. (Boost has an endian library that could help you make truly portable code. It even defaults to using intrinsics when they are available.)

The determination of endianness is done in the implementation (source) file. It seems like this should be done as part of opening the file. I've put that as part of the constructor for this example, but you may want more flexibility (use the ifstream interface as a guideline). In any event, the logic for detecting the platform should not need to be accessible outside the implementation of this class. Here is a start for the implementation.

// Helper function, not needed outside this class.
// This should be either static or put into an anonymous namespace.
static File::Endian ReadEndian(std::ifstream & file) {
    uint32_t platform;
    file.read(reinterpret_cast<char*>(&platform), sizeof(platform));
    if (platform == 1) {
        return File::Endian::Little;
    }
    else if (platform == 2 << 24) {
        return File::Endian::Big;
    }
    // Handle unrecognized platform here
}

IFile::IFile(const std::string & filename) : std::ifstream(filename),
    endianness(ReadEndian(*this)) // base class is initialized first, so this is safe
{}

At this point, your various ReadData functions could look like the following (without using global variables).

void ReadData(IFile & file) {
    uint32_t data;

    file.ReadBytes(data);
}

This is even simpler than what you were looking for since there is even less repeated code. (Casting to char* and getting the size no longer need to be repeated everywhere.)


In summary, there are two major areas for improvement.

  1. Don't repeat yourself. Code that is repeated often should be moved to a separate function.
  2. Object-oriented. Rely on objects handling routine tasks, rather than delegating them to the people using the objects.

Both of these contribute to making it easier to safely make sweeping changes like supporting a new endianness. There is no pre-built switch to set the endianness, but it's not that hard to build one when your code is better organized.


I think the usual answer is to #ifdef the definition of a function like read64:

int64_t read64(char *pos) {
#ifdef IS_BIG_ENDIAN
  ...
#elif IS_LITTLE_ENDIAN
  ...
#else
  // probably # error
#endif
}


The only way to "automatically" read in the correct endian-ness is to have the CPU's native-endianness match the endian-ness of the bytes in the file. If they don't match, then something in your code needs to know to do the necessary byte-swaps from the file's-endianness to the CPU's endian-ness (when reading from the file) and vice-versa (when writing to the file).

Check out how ntohl() and htonl() are implemented to deal with network-order (aka big-endian) data -- on big-endian platforms (like PowerPC) they are simple no-ops that return their argument verbatim. On little-endian platforms (like Intel) they return their argument byte-swapped. That way the code that calls them doesn't have to do any conditional tests at run-time to figure out whether a byte-swap is appropriate or not, it just unconditionally runs all the data it is reading through ntohl() or ntohs(), and trusts that they will do the right thing with the data on all platforms. Similarly when writing data it unconditionally runs all data values through htonl() or htons() before sending the data out to the file/network/whatever.

Your program can do something similar, either by calling those actual functions, or (if you need to read more data-types than just 16-bit and/or 32-bit integers) by finding or writing your own functions that are similar to those in spirit, e.g. something like:

inline uint32_t NativeToLittleEndianUint32(uint32_t val) {...}
inline uint32_t LittleEndianToNativeUint32(uint32_t val) {...}

inline uint32_t NativeToBigEndianUint32(uint32_t val) {...}
inline uint32_t BigEndianToNativeUint32(uint32_t val) {...}
[...]

inline uint64_t NativeToLittleEndianUint64(uint64_t val) {...}
inline uint64_t NativeToBigEndianUint64(uint64_t val) {...}

inline uint64_t LittleEndianToNativeUint64(uint64_t val) {...}
inline uint64_t BigEndianToNativeUint64(uint64_t val) {...}

[...]

... and so on. Then all of the thousands of if/then clauses in your code disappear, replaced with compile-time conditional logic instead. That makes the code more efficient, easier to test, and much less error-prone. If you like templated functions, you can use them to reduce the number of function names the calling code needs to remember (e.g. you could have template <typename T> inline T NativeToLittleEndian(T val) {...} with overloads or specializations to do the right thing for all the types you need to support).

If you want to go a bit further, you can combine the reading/writing and byte-swapping functions together into a single larger function, and thus avoid having to make two function calls per data value.

Note: be careful when implementing these functions for floating-point types; some CPU architectures (such as Intel) will implicitly modify unexpected floating-point bit-patterns, which means that e.g. when endian-swapping a 32-bit floating point value you need to store the non-native/external/byte-swapped representation of that value as a uint32_t rather than as a "float". If you want to see an example of how I dealt with this issue in my code, check out e.g. the definitions of B_HOST_TO_BENDIAN_IFLOAT and B_BENDIAN_TO_HOST_IFLOAT macros in this file.





Comments
  • "halve the amount of code per function" -- why is there more than one function that needs this code?
  • @JaMiT The code above is only an example of a much larger program. The full program reads hundreds, if not thousands, of different data structures, and all of them have this same issue. Setting the endianness in this way would reduce the total size of the program immensely.
  • Sorry, I still don't get it. In fact, having to repeat this thousands of times is even more reason to have it all in just one function (template?), and have those thousands of locations call that function. That is the traditional first approach, no?
  • @JaMiT My bad, I worded it confusingly. The functions referred to previously are unrelated. I'm asking whether it's possible to read data using the correct endianness in the first place, rather than byteswap each instance as above.
  • No, I think you worded that part fine. However, you don't yet understand that you are asking an XY question. An answer to the real question requires more context, though; please provide a minimal reproducible example. (The example can still be short, showing just a single read, maybe two if you want to demonstrate that they are unrelated. The point is to get those reads in the context of functions rather than as isolated code.)
  • I've implemented a variant of the object-oriented approach provided here - it works brilliantly. Using std::endian was also a good suggestion, as it eliminates the need for File::Endian completely, though I'm still considering using the Boost endian library in the future. Regardless, thanks for taking time to help; this solved the issue entirely, and then some.