Really simple short string compression


Is there a really simple compression technique for strings up to about 255 characters in length (yes, I'm compressing URLs)?

I am not concerned with the strength of compression - I am looking for something that performs very well and is quick to implement. I would like something simpler than SharpZipLib: something that can be implemented with a couple of short methods.

I think the key question here is "Why do you want to compress URLs?"

Trying to shorten long URLs for the address bar?

You're better off storing the original URL somewhere (database, text file ...) alongside a hash code of the non-domain part (MD5 is fine). You can then have a simple page (or some HTTPModule if you're feeling flashy) that reads the MD5 and looks up the real URL. This is how TinyURL and others work.

For example:

http://mydomain.com/folder1/folder2/page1.aspx

Could be shortened to:

http://mydomain.com/2d4f1c8a
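
A minimal sketch of that approach in C# (the in-memory dictionary and the 8-character key length are illustrative choices to match the example above; a real shortener would persist the table in a database):

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical in-memory table; a real shortener would use a database.
static readonly Dictionary<string, string> UrlTable = new Dictionary<string, string>();

static string Shorten(string nonDomainPart)
{
    using (var md5 = MD5.Create())
    {
        byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(nonDomainPart));
        // First 4 bytes rendered as hex gives an 8-char key like "2d4f1c8a".
        string key = BitConverter.ToString(hash, 0, 4).Replace("-", "").ToLowerInvariant();
        UrlTable[key] = nonDomainPart;  // remember the mapping for lookup later
        return key;
    }
}

static string Expand(string key)
{
    return UrlTable.TryGetValue(key, out var path) ? path : null;
}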

Using a compression library for this will not work. The string will be compressed into a shorter binary representation, but converting this back to a string which needs to be valid as part of a URL (e.g. Base64) will negate any benefit you gained from the compression.

Storing lots of URLs in memory or on disk?

Use the built-in compression library in System.IO.Compression, or the ZLib library, which is simple and incredibly good. Since you will be storing binary data, the compressed output will be fine as-is. You'll need to decompress it to use it as a URL.
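
For example, a round trip with System.IO.Compression.DeflateStream might look like this minimal sketch (error handling omitted; ZLib usage would be analogous):

using System.IO;
using System.IO.Compression;
using System.Text;

static byte[] CompressToBytes(string text)
{
    byte[] input = Encoding.UTF8.GetBytes(text);
    using (var output = new MemoryStream())
    {
        using (var deflate = new DeflateStream(output, CompressionMode.Compress))
        {
            deflate.Write(input, 0, input.Length);
        } // disposing the DeflateStream flushes the last compressed bits
        return output.ToArray();
    }
}

static string DecompressToString(byte[] data)
{
    using (var input = new MemoryStream(data))
    using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
    using (var reader = new StreamReader(deflate, Encoding.UTF8))
    {
        return reader.ReadToEnd();
    }
}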

shoco: shoco is a C library to compress and decompress short strings. It is very fast and easy to use. The default compression model is optimized for English words, but you can generate your own compression model from sample data.

As suggested in the accepted answer, using data compression does not work to shorten URL paths that are already fairly short.

DotNetZip has a DeflateStream class that exposes a static (Shared in VB) CompressString method. It's a one-line way to compress a string using DEFLATE (RFC 1951). The DEFLATE implementation is fully compatible with System.IO.Compression.DeflateStream, but DotNetZip compresses better. Here's how you might use it:

using System;
using Ionic.Zlib;  // DotNetZip's DeflateStream lives here

string[] orig = {
    "folder1/folder2/page1.aspx",
    "folderBB/folderAA/page2.aspx",
};

public void Run()
{
    foreach (string s in orig)
    {
        Console.WriteLine("original    : {0}", s);
        // One-line DEFLATE (RFC 1951) compression of a string
        byte[] compressed = DeflateStream.CompressString(s);
        Console.WriteLine("compressed  : {0}", ByteArrayToHexString(compressed));
        string uncompressed = DeflateStream.UncompressString(compressed);
        Console.WriteLine("uncompressed: {0}\n", uncompressed);
    }
}

// Render each byte as two lowercase hex characters
static string ByteArrayToHexString(byte[] bytes)
{
    var sb = new System.Text.StringBuilder(bytes.Length * 2);
    foreach (byte b in bytes)
        sb.Append(b.ToString("x2"));
    return sb.ToString();
}

Using that code, here are my test results:

original    : folder1/folder2/page1.aspx
compressed  : 4bcbcf49492d32d44f03d346fa0589e9a9867a89c5051500
uncompressed: folder1/folder2/page1.aspx

original    : folderBB/folderAA/page2.aspx
compressed  : 4bcbcf49492d7272d24f03331c1df50b12d3538df4128b0b2a00
uncompressed: folderBB/folderAA/page2.aspx

So you can see the "compressed" byte array, when represented in hex, is longer than the original, about 2x as long. The reason is that each byte takes two ASCII characters to represent in hex.

You could compensate somewhat for that by using base-62 instead of base-16 (hex) to represent the number. In that case a-z and A-Z are also digits, giving you 0-9 (10) + a-z (+26) + A-Z (+26) = 62 total digits. That would shorten the output significantly. I haven't tried that yet.


EDIT: OK, I tested a base-62 encoder. It shortens the hex string by about half. I figured it would cut it to 25% (62/16 =~ 4), but that math was off: the gain from a bigger base is logarithmic, not linear, so the best a radix change from 16 to 62 can do is about log(16)/log(62) ≈ 67% of the hex length. In my tests, the resulting base-62 encoded string is about the same length as the original URL. So, no, using compression and then base-62 encoding is still not a good approach; you really want a hash value.
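
One way to do the base-62 step, sketched in C# (an illustrative encoder, not necessarily the one tested above: it treats the compressed bytes as one big unsigned integer, and a matching decoder would just reverse the steps):

using System;
using System.Numerics;
using System.Text;

static class Base62
{
    const string Alphabet =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    public static string Encode(byte[] bytes)
    {
        // Append a zero byte so BigInteger reads the value as unsigned
        // (BigInteger's byte[] constructor is little-endian, two's complement).
        // Note: leading zero bytes are lost; a production encoder must
        // also record the original length.
        var buf = new byte[bytes.Length + 1];
        Array.Copy(bytes, buf, bytes.Length);
        var n = new BigInteger(buf);

        if (n == 0) return "0";
        var sb = new StringBuilder();
        while (n > 0)
        {
            // Remainders give the least-significant digit first;
            // Insert(0, ...) puts the digits back in the usual order.
            sb.Insert(0, Alphabet[(int)(n % 62)]);
            n /= 62;
        }
        return sb.ToString();
    }
}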

antirez/smaz: Small strings compression library. smaz is a simple compression library suitable for compressing very short strings. General-purpose compression libraries build the state needed for compressing data dynamically, so they can handle every kind of data, but that start-up state is pure overhead on a short string; smaz instead ships with a fixed codebook tuned for English text and URLs.

I'd suggest looking in the System.IO.Compression Namespace. There's an article on CodeProject that may help.

garysharp/SmazSharp: Small strings compression library. SmazSharp is a C# port of smaz, a simple compression library suitable for compressing very short strings. General-purpose compression libraries build the state needed for compressing data dynamically, in order to be able to compress every kind of data. An optimal compression for short strings could probably be achieved using an arithmetic coder (a type of entropy encoder), but most likely one could not achieve the same kind of performance that shoco delivers.

That's really the "guts" of any compression algorithm, no matter how simple or how sophisticated. As a practical matter, I can honestly see applications where a method this simple would be preferable over the myriad of other compression options out there.

I have just created a compression scheme that targets URLs and achieves around 50% compression (compared to base64 representation of the original URL text).

see http://blog.alivate.com.au/packed-url/

smaz - Rust: this crate implements the smaz algorithm for compressing very short strings (its examples show a string like "Smaz is a simple compression library" compressed by 39%).

There is also an article discussing string compression with optional decent encryption in pure VB.NET code, with no external tools required, so it can easily be integrated into existing projects. As the code is kept simple, it's suitable for beginners, and a conversion to C# can be done easily.

Compression algorithms for small strings: a simple variant of Huffman is due, I believe, to David Wheeler; the idea is that whatever you expect to be really common in the modified string should get a really short code.

(PDF) Unishox: smaz is a simple compression library suitable for compressing very short strings [10]. It was developed by Salvatore Sanfilippo and is released under the BSD license.

Run-Length Encoding: given an input string, write a function that returns the run-length encoded string for the input. For example, if the input string is "wwwwaaadexxxxxx", the function should return "w4a3d1e1x6". a) Pick the first character from the source string. b) Append the picked character to the destination string. c) Count the subsequent occurrences of the picked character and append that count. d) Repeat from the next distinct character.
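
A straightforward C# version of those steps (note that RLE will usually expand a URL rather than shrink it, since URLs rarely contain long runs of one character):

using System.Text;

// Run-length encode: "wwwwaaadexxxxxx" -> "w4a3d1e1x6"
static string RunLengthEncode(string src)
{
    var dest = new StringBuilder();
    int i = 0;
    while (i < src.Length)
    {
        char c = src[i];  // pick the current character
        int count = 0;
        while (i < src.Length && src[i] == c)  // count the run
        {
            i++;
            count++;
        }
        dest.Append(c).Append(count);  // append the character, then its count
    }
    return dest.ToString();
}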

Packed URL, https://stackoverflow.com/questions/1192732/really-simple-short-string-compression. The ~50% figure quoted above is measured against no packing at all, i.e. just converting the URL text to base64 encoding.

Comments
  • Why? There is probably a better way to do what you're asking.
  • "Why" is certainly a good answer. However, as a side note, Huffman coding works great for simple text compression without having to resort to external libraries and LZW compression.
  • possible duplicate of Best compression algorithm for short text strings
  • That's not an answer to the question. What if you have nowhere to store the hashtable?
  • @endolith - The point is that string compression will not help you here, only relating the URL to a hash or similar will. See Cheeso's answer for real-world examples where the compressed result, once converted back to a valid URL, is longer than or about as long as the original. You always have "somewhere" to store a hash. Hard-code it into your URL redirection code if you really do have "nowhere" to store it!
  • You don't always have somewhere to store a hashtable, and it doesn't always make the URL longer. en.wikipedia.org/wiki/Data_URI_scheme, for instance
  • Data URI is not any sort of compression, and has nothing to do with shortening URLs. In fact, data URI is for embedding data in web pages, and it uses base64, which, if you read Cheeso's answer, you will see is much longer. In which case would you not have somewhere to store URL/hash references? If you have a form of compression which will shorten a URL and still leave it a valid URL, please post it as an answer; I'm sure the community will benefit.
  • Using hex is pretty stupid; it's not a dense format at all. Using base64, or even base85, and replacing the invalid characters with correct ones (escaping again takes space) will certainly reduce the output, though not as much as you are claiming; your math is off. Of course, the shorter the URIs, the less compression you can expect, and it also matters what the context is.
  • The conclusion of this answer ("using compression then .... is still not a good approach") is no longer valid - see my answer - stackoverflow.com/a/50751602/887092
  • Not concerned with strength of compression - I am looking for something that performs very well and is quick to implement. Can you point me to base64?
  • Base64 is not going to compress anything :)
  • @Jon Grant: Correct. Base64 was a stupid suggestion. Would only work after actually compressing to get something that (perhaps) is smaller, but still ascii. Have removed all trace of the suggestion.
  • I think this is exactly what I'm looking for. Do you have any example code or a project you could share? I couldn't find anything on the site you linked.
  • I do have some code I can dig up. Please leave a comment on my blog and we can connect that way.