Why might a System.String object not cache its hash code?

A glance at the source code for string.GetHashCode using Reflector reveals the following (for mscorlib.dll version 4.0):

public override unsafe int GetHashCode()
{
    fixed (char* str = ((char*) this))
    {
        char* chPtr = str;
        int num = 0x15051505;
        int num2 = num;
        int* numPtr = (int*) chPtr;
        for (int i = this.Length; i > 0; i -= 4)
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
            if (i <= 2)
            {
                break;
            }
            num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
            numPtr += 2;
        }
        return (num + (num2 * 0x5d588b65));
    }
}

Now, I realize that the implementation of GetHashCode is not specified and is implementation-dependent, so the question "is GetHashCode implemented in the form of X or Y?" is not really answerable. I'm just curious about a few things:

  1. If Reflector has disassembled the DLL correctly and this is the implementation of GetHashCode (in my environment), am I correct in interpreting this code to indicate that a string object, based on this particular implementation, would not cache its hash code?
  2. Assuming the answer is yes, why would this be? It seems to me that the memory cost would be minimal (one more 32-bit integer, a drop in the pond compared to the size of the string itself) whereas the savings would be significant, especially in cases where, e.g., strings are used as keys in a hashtable-based collection like a Dictionary<string, [...]>. And since the string class is immutable, it isn't like the value returned by GetHashCode will ever even change.

What could I be missing?


UPDATE: In response to Andras Zoltan's closing remark:

There's also the point made in Tim's answer(+1 there). If he's right, and I think he is, then there's no guarantee that a string is actually immutable after construction, therefore to cache the result would be wrong.

Whoa, whoa there! This is an interesting point to make (and yes it's very true), but I really doubt that this was taken into consideration in the implementation of GetHashCode. The statement "therefore to cache the result would be wrong" implies to me that the framework's attitude regarding strings is "Well, they're supposed to be immutable, but really if developers want to get sneaky they're mutable so we'll treat them as such." This is definitely not how the framework views strings. It fully relies on their immutability in so many ways (interning of string literals, assignment of all zero-length strings to string.Empty, etc.) that, basically, if you mutate a string, you're writing code whose behavior is entirely undefined and unpredictable.

I guess my point is that for the author(s) of this implementation to worry, "What if this string instance is modified between calls, even though the class as it is publicly exposed is immutable?" would be like for someone planning a casual outdoor BBQ to think to him-/herself, "What if someone brings an atomic bomb to the party?" Look, if someone brings an atom bomb, party's over.


Obvious potential answer: because that will cost memory.

There's a cost/benefit analysis here:

Cost: 4 bytes for every string (and a quick test on each call to GetHashCode). It would also make the string object mutable, which would obviously mean you'd need to be careful about the implementation - unless you always compute the hash code up-front, which means paying to compute it once for every string, regardless of whether you ever hash it at all.

Benefit: Avoid recomputing the hash for string values that are hashed more than once.

I would suggest that in many cases, there are many, many string objects and very few of them are hashed more than once - leading to a net cost. For some cases, obviously that won't be the case.
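
To make the trade-off concrete, here is a minimal sketch of what lazy caching could look like if you wanted it for your own hot keys. CachedHashString is a hypothetical wrapper invented for this illustration, not a framework type, and treating 0 as "not computed yet" is an assumption of the sketch rather than anything the framework does.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper (not part of the framework): caches the hash code
// of the string it wraps the first time it is requested.
public sealed class CachedHashString : IEquatable<CachedHashString>
{
    private readonly string _value;

    // The extra 4 bytes per instance. 0 doubles as "not computed yet", so a
    // string whose real hash is 0 is simply rehashed on every call.
    private int _hash;

    public CachedHashString(string value)
    {
        _value = value ?? throw new ArgumentNullException(nameof(value));
    }

    public string Value => _value;

    public override int GetHashCode()
    {
        int h = _hash;                  // the "quick test" paid on every call
        if (h == 0)
        {
            h = _value.GetHashCode();   // O(N) in the length of the string
            _hash = h;                  // racing threads would store the same value
        }
        return h;
    }

    public bool Equals(CachedHashString other) =>
        !(other is null) && string.Equals(_value, other._value, StringComparison.Ordinal);

    public override bool Equals(object obj) => Equals(obj as CachedHashString);
}
```

Used as a key type, e.g. Dictionary<CachedHashString, int>, this pays the field on every instance but hashes each key's characters only once. That is the opt-in version of the trade-off: only the strings you know will be hashed repeatedly carry the extra memory.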

I don't think I'm in a good position to judge which comes up more often... I would hope that MS has instrumented various real apps. (I'd also hope that Sun did the same for Java, which does cache the hash...)

EDIT: I've just spoken to Eric Lippert about this (NDC is awesome :) and basically it is about the extra memory hit vs the limited benefits.


Firstly - there's no knowing whether caching this result would actually improve Dictionary<string, ...> et al., because they don't necessarily use String.GetHashCode: the dictionary uses an IEqualityComparer<string> to get the hash code for a string.

And if you follow the likely call chain for the StringComparer class, it ends up going through to the System.Globalization.CompareInfo class, which finally terminates at this method:

[SecurityCritical, SuppressUnmanagedCodeSecurity, DllImport("QCall",
   CharSet=CharSet.Unicode)]
private static extern int InternalGetGlobalizedHashCode(IntPtr handle, string
   localeName, string source, int length, int dwFlags);

There's no knowing if that library - which appears to be a native method - doesn't use some form of internal caching based on the underlying .Net object data structure that we can't get at once inside the .Net runtime.

However, the important thing to note with this is that one string can have many different hash codes based on how you choose to interpret the characters. Granted, this implementation is culture-insensitive - which is why it's unsuitable for these comparers.

So, whilst the additional memory storage could be a factor, I actually think it's because to store a hash code along with an instance of the string misleads the caller, and indeed the .Net internal dev team(!), into thinking that the string only has one hash code, when in fact it entirely depends on how you're going to interpret it - as a series of bytes (which most of us do not), or as a series of printable characters.
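
As a small illustration of "one string, many hash codes", the sketch below asks a comparer for the hash instead of asking the string. ComparerHashDemo and its two methods are names made up for this example; StringComparer.OrdinalIgnoreCase and the Dictionary constructor that takes an IEqualityComparer<string> are the standard framework APIs.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, for illustration only.
public static class ComparerHashDemo
{
    // True when the two strings get the same hash code under the given comparer.
    public static bool SameHash(IEqualityComparer<string> comparer, string a, string b) =>
        comparer.GetHashCode(a) == comparer.GetHashCode(b);

    // A dictionary that hashes its keys through the comparer passed to the
    // constructor, not through String.GetHashCode directly.
    public static Dictionary<string, int> BuildCaseInsensitive()
    {
        var map = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        map["KEY"] = 1;
        return map;
    }
}
```

SameHash(StringComparer.OrdinalIgnoreCase, "KEY", "key") has to be true, because that comparer treats the two strings as equal and equal keys must hash equally. For the same reason BuildCaseInsensitive().ContainsKey("key") is true. A single hash cached on the string instance could not serve both this comparer and the default one, which treats "KEY" and "key" as different keys.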

From a performance point of view, then, if we also accept that the comparers used by Dictionary<,> etc. can't be using the internal implementation, not caching this result probably doesn't have much of an impact. Frankly, how often will this method actually get called in the real world, given that most of the time the hash code of a string is calculated via some other mechanism?

EDIT

There's also the point made in Tim's answer(+1 there). If he's right, and I think he is, then there's no guarantee that a string is actually immutable after construction, therefore to cache the result would be wrong.

AN ADDITIONAL EDIT(!)

Dan makes the point that strings are meant to be immutable within the .Net sphere and therefore that string should be free to cache its own hashcode based on this. The problem here is that the .Net framework also provides a legitimate way to change the supposedly immutable string that does not involve privileged reflection or anything else. It's a fundamental problem with strings: a string is a pointer to a buffer that you cannot control. Never mind the C# world, what about C++, where vectoring over and modifying memory buffers is commonplace? Just because you ideally shouldn't do it doesn't mean that the framework should expect you not to.

.Net happens to provide this functionality, and therefore if this was a design decision by the .Net team in response to the kind of binary thuggery suggested by Tim, then they were very wise to have taken it into account. Whether they did, or whether it is by fluke, is another matter entirely! :)


I may have made a wrong conclusion here, but isn't it true that while the string is immutable in the context of a .NET String object, it's still possible to change the value?

For instance, if you were so inclined to do this...

String example = "Hello World";

unsafe
{
    fixed (char* strPointer = example) {
        strPointer[1] = 'a';
    }
} 

...wouldn't example still represent the same String object, but now with a value that would compute a different value for GetHashCode()? I may be off-base here, but since you could easily (if not pointlessly) do this, that would cause some issues as well.


One more potential reason for this is that interned strings (specifically those that are added as shared readonly data by the compiler) can have exactly the same format as any other string. The fact that these strings are loaded into readonly memory means that those data pages can be shared easily across processes, but it also means that it would not be possible to have them cache a hashcode.

But as others have mentioned, the primary reason for not caching the value is that the additional memory usage is likely to far outweigh the potential savings of hashcode caching. The execution time of GetHashCode is O(N) on the length of the string so the worst case scenario of repeated hashing is well bounded.
