Consistent String#hash based only on the string's content

GOAL: Map every URL handled by a server to 0, 1, 2, or 3, distributing as uniformly as possible.

While the documentation for Ruby's String#hash method says it will "return a hash based on the string's length and content," this clearly isn't the whole story. A given string's hash is not consistent across invocations of the interpreter:

$ irb
ruby-1.9.2-p180 :001 > "foo".hash
 => 360517580588231756 
ruby-1.9.2-p180 :002 > ^D

$ irb
ruby-1.9.2-p180 :001 > "foo".hash
 => -2716152678666510148 

This means a particular string's hash value may differ across, say, servers. Rails uses String#hash internally to map a URL path to one of four asset hosts (if the app's asset_host is so configured), but this feature is a lot less efficient than it could be because of the cross-machine inconsistencies; different servers may map the same URL to different asset hosts, reducing the effectiveness of caches, clouding skies, cooling cups of tea prematurely, besmirching the reputations of otherwise fine programmers.

Can you suggest an alternate hash function that could effectively and speedily distribute hashes across a typical app's URL space, preferably one that produces a Fixnum since, in the end, I'll want to map it into one of four asset hosts?
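
For reference, the asset-host configuration in question looks roughly like this (host names are hypothetical). The "%d" form is the one that relies on String#hash; the Proc form is where a consistent replacement would be plugged in:

# config/environments/production.rb -- hypothetical host names
# %d form: Rails fills in 0-3 itself, using String#hash on the asset path
config.action_controller.asset_host = "assets%d.example.com"

# Proc form: plug in any machine-consistent, content-based hash instead
# (String#sum is just one option; see the answers below)
config.action_controller.asset_host = Proc.new do |source|
  "assets#{source.sum % 4}.example.com"
end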

There is a lot of this kind of functionality in Ruby's Digest module: http://ruby-doc.org/stdlib/libdoc/digest/rdoc/index.html

A simple example:

require 'digest/sha1'
Digest::SHA1.hexdigest("some string")
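
If the end goal is the 0-3 mapping from the question, one approach (also suggested in the comments below) is to convert the hex digest to an integer with to_i(16) and take it mod 4. A rough sketch:

require 'digest/sha1'

url = "https://www.example.com/abc/def/123?hij=345"
Digest::SHA1.hexdigest(url).to_i(16) % 4   # 0, 1, 2 or 3, the same on every machine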

There is a tiny library called xxHash:

require 'xxhash'

XXhash.xxh32('qwe') #=> 2396643526
XXhash.xxh64('qwe') #=> 9343136760830690622

It may have more collisions, but it is about 10x faster than SHA1:

require 'benchmark'
require 'digest/sha1'
require 'xxhash'

Benchmark.bm do |x|
  n = 100_000
  str = 'qweqweqwe'
  x.report('xxhash32')  { n.times { XXhash.xxh32(str) } }
  x.report('xxhash64')  { n.times { XXhash.xxh64(str) } }
  x.report('hexdigest') { n.times { Digest::SHA1.hexdigest(str) } }
end

#                user     system      total        real
# xxhash32   0.020000   0.000000   0.020000 (  0.021948)
# xxhash64   0.040000   0.000000   0.040000 (  0.036340)
# hexdigest  0.240000   0.030000   0.270000 (  0.276443)
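
Applied to the question's goal, the 32-bit value maps onto the four buckets the same way. A small sketch, assuming the xxhash gem is installed:

require 'xxhash'

url = "https://www.example.com/abc/def/123?hij=345"
XXhash.xxh32(url) % 4   # consistent across machines; 0, 1, 2 or 3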

You can try to_i(36).

"Hash me please :(".to_i(36)
=> 807137
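
Note that to_i(36) parses only up to the first character that is not a valid base-36 digit, so for a URL everything from the ":" after the scheme onwards is ignored (see the comments below). For example:

"https://www.example.com/abc".to_i(36)   #=> 29945008 (only "https" is parsed)
"https://www.example.com/xyz".to_i(36)   #=> 29945008 (same value: parsing stops at the ":")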

The easiest (and consistent) way may be this (and it's fast):

"https://www.example.com/abc/def/123?hij=345".sum % 4

That will always produce an integer from 0 to 3, is quite fast, and should be fairly well distributed (though I haven't actually tested the distribution).
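
Tied back to the asset-host goal, with hypothetical host names:

hosts = %w[assets0.example.com assets1.example.com assets2.example.com assets3.example.com]
hosts["https://www.example.com/abc/def/123?hij=345".sum % 4]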

Comments
  • Just wondering, did you ever find a good solution to this?
  • CRC may be a good solution - see stackoverflow.com/questions/4452161/…
  • True, but are SHA1 and MD5 overkill? Too slow?
  • Fast enough, I'd say. You can check Google's CityHash, which targets performance and minimal collisions on general text strings, but give SHA1/MD5 a try and test the performance.
  • @RobDavis the "hex" in #hexdigest means the output is a hexadecimal number; to convert it to an integer, all you have to do is call to_i(16).
  • I don't know the constraints of your system, but SHA is a real solution to your problem, just as MD5, MurmurHash, and CRC32 are (see the CRC32 sketch after these comments). The difference is the space you're mapping URLs onto: only 4bn numbers in 32 bits, which is not good enough for almost any problem I can think of. Instead of restricting yourself to short 32-bit integers, try representing large integers in a higher base; Base64/62 is a great way to represent huge integers with few characters.
  • Just a benchmark note: on my 8th-gen i7 Lenovo laptop, XXhash.xxh32(str) took 0.016127, XXhash.xxh64(str) took 0.020158, and Digest::SHA1.hexdigest(str) took 0.212843, so I would definitely go with xxh32.
  • Although that only seems to look at the first four characters: "Hash something else".to_i(36) also produces 807137
  • It works only up to the first space, so it is usable for URLs. There is a similar method on Fixnum called to_s(36).
  • Actually, it seems to stop at the first slash.
  • It's going to stop at the first non-alphanumeric character, because you are specifying a number base. You could strip such characters, but then you might have to worry about overflow (since you are basically asking it to parse a huge number).
  • This also doesn't distribute evenly, but rather is weighted according to the frequency of the last letter. E.g. "e" is more common in English, so you'd have more requests going to the server associated with e.
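
For the CRC32 suggestion in the comments above, Ruby's standard zlib library works out of the box. A minimal sketch:

require 'zlib'

url = "https://www.example.com/abc/def/123?hij=345"
Zlib.crc32(url) % 4   # 32-bit CRC, consistent across machines; 0, 1, 2 or 3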