base64 encoding alternative to underscore

We're using a file system/url safe variation of base64 encoding such that:

"=" replaced with ""  
"+" replaced with "-"  
"/" replaced with "_"  

We are now using Azure blob storage, which does not allow "_" in container names.

We are base64 encoding a Guid. If I were to replace the underscore with, say, a "0", would I be at risk of collisions?

Update

Not sure why the downvote. But to clarify.

Why not just use a Guid?

  1. The Guid is the id of an entity within my application. Since the paths are public, I don't really like exposing the Id, hence why I'm encoding it.
  2. I want shorter and more friendly looking paths. Contrary to one of the comments below, the base 64 encoding is NOT longer:

    Guid: 5b263cdd-2bc2-485d-83d4-81b96930dc5a
    Base64 Encoded: 3TwmW8IrXUiD1IG5aTDcWg== (even shorter after removing ==)

(Another) Update

Seems there is some confusion about what it is I'm trying to achieve (sorry about that). Here's the short version.

  • I have a Guid that represents an entity in my application.
  • I need to create a publicly accessible directory for the entity (via a Url).
  • I don't want to use the Guid as the directory name, for the reasons above.
  • I asked previously on SO about how I could generate a friendlier looking Url that guaranteed uniqueness and did not expose the original Guid. The suggestion was Base64 encoding.
  • This has worked fine until recently when we needed to use Azure blob storage, which does not allow underscores "_" in its directory (Container) names.

This is where I'm at.

Just "encode" the GUID in base16. With the "N" format the only characters it uses are 0123456789abcdef (lowercase, no hyphens or braces), which should be safe for most purposes.

var encoded = guid.ToString("N");
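For readers not on .NET, here is a rough Python sketch of the same idea. The GUID is the sample from the question; the container-name pattern is my assumption about Azure's rule (3 to 63 characters, lowercase letters, digits and single hyphens, starting and ending with a letter or digit), and is_valid_container_name is a hypothetical helper, not part of any SDK.

```python
import re
import uuid

# Sample GUID taken from the question.
guid = uuid.UUID("5b263cdd-2bc2-485d-83d4-81b96930dc5a")

# UUID.hex is the 32-digit lowercase form, comparable to Guid.ToString("N").
encoded = guid.hex

# Assumed naming rule (see the note above); not taken from an SDK.
CONTAINER_NAME = re.compile(r"^(?!.*--)[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")

def is_valid_container_name(name: str) -> bool:
    """Hypothetical helper: check a name against the assumed rule."""
    return CONTAINER_NAME.fullmatch(name) is not None

print(encoded)                            # 5b263cdd2bc2485d83d481b96930dc5a
print(is_valid_container_name(encoded))   # True
print(uuid.UUID(hex=encoded) == guid)     # True: the mapping is reversible
```

Under that assumed rule, the mixed-case Base64 form from the question would be rejected regardless of the underscore, because of its uppercase letters.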

Hexadecimal encoding is another popular way. However, base64 is more space-efficient, as it represents 3 bytes of data with 4 encoded characters, while hex represents 1 byte of data with 2 encoded characters. Basically you're free to encode your data any way you want.
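To put numbers on that for a 16-byte GUID, here is a small Python sketch using the sample GUID from the question. It assumes bytes_le is the right counterpart of .NET's Guid.ToByteArray() byte order, which is what reproduces the Base64 string quoted in the question.

```python
import base64
import uuid

guid = uuid.UUID("5b263cdd-2bc2-485d-83d4-81b96930dc5a")

# bytes_le: first three fields little-endian, as .NET lays out a Guid.
raw = guid.bytes_le

b64 = base64.b64encode(raw).decode("ascii")
hexed = guid.hex

print(b64)  # 3TwmW8IrXUiD1IG5aTDcWg==
# raw bytes, Base64 with padding, Base64 without padding, hex
print(len(raw), len(b64), len(b64.rstrip("=")), len(hexed))  # 16 24 22 32
```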

The base 64 character set (with "=" used only as padding) is

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=

So you can't use 0 since it is already in use.
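A quick way to see the collision is to construct two different 16-byte values whose encodings differ only in a position where one has "_" and the other has "0". The sketch below is Python; encode and encode_lossy are illustrative names, and the two byte strings are hand-picked rather than real GUIDs.

```python
import base64

def encode(raw: bytes) -> str:
    """URL-safe Base64 without padding, as described in the question."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

def encode_lossy(raw: bytes) -> str:
    """The proposed variant: additionally replace "_" with "0"."""
    return encode(raw).replace("_", "0")

# 0xFC starts with the 6 bits 111111 (value 63, "_");
# 0xD0 starts with the 6 bits 110100 (value 52, "0").
a = bytes([0xFC]) + bytes(15)
b = bytes([0xD0]) + bytes(15)

print(encode(a))   # _AAAAAAAAAAAAAAAAAAAAA
print(encode(b))   # 0AAAAAAAAAAAAAAAAAAAAA
print(encode_lossy(a) == encode_lossy(b))  # True: two inputs, one name
```

Any replacement character that is already in the alphabet has the same problem; only a character outside the alphabet keeps the mapping one-to-one.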

Unfortunately, if you ever need to use Base64 data in a filename or URL, these characters are problematic: they already have a meaning in some file systems and in URLs. There is an alternative alphabet, the Filename Safe Alphabet, which uses "-" (minus) instead of "+" for value 62 and "_" (underscore) instead of "/" for value 63.

Instead of taking base64 and changing 4 characters, you could encode your data in base60.

Your base60 char list doesn't contain the 4 chars you don't like and so there's no need to replace anything.
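A minimal sketch of what that could look like, in Python. The 60-character alphabet here is hypothetical (the 62 ASCII alphanumerics minus "O" and "l"); the answer does not say which characters to use, and any 60 distinct allowed characters would do. The GUID is treated as a 128-bit integer and written in base 60 with a fixed width of 22 digits, since 60**22 exceeds 2**128.

```python
import string
import uuid

# Hypothetical alphabet: 62 alphanumerics minus two look-alike characters.
ALPHABET = "".join(c for c in string.digits + string.ascii_letters if c not in "Ol")
BASE = len(ALPHABET)   # 60
WIDTH = 22             # 60**22 > 2**128, so 22 digits always suffice

def encode(guid: uuid.UUID) -> str:
    """Write the GUID's 128-bit integer value as 22 base-60 digits."""
    n = guid.int
    digits = []
    for _ in range(WIDTH):
        n, remainder = divmod(n, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))

def decode(text: str) -> uuid.UUID:
    """Inverse of encode; raises ValueError on characters outside ALPHABET."""
    n = 0
    for ch in text:
        n = n * BASE + ALPHABET.index(ch)
    return uuid.UUID(int=n)

sample = uuid.UUID("5b263cdd-2bc2-485d-83d4-81b96930dc5a")
print(encode(sample), decode(encode(sample)) == sample)
```

The same two functions work with a smaller alphabet, for example lowercase letters and digits only, at the cost of a larger WIDTH.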

Base64 encoding schemes are commonly used when binary data needs to be stored and transferred over media that are designed to deal with textual data. This is to ensure that the data remains intact without modification during transport.

Encoding your identifiers does not encrypt them. Any technically savvy observer can base64-decode an identifier. If you want to make your paths opaque, then either encrypt them or hash them with a salt. If you do want to keep your paths transparent, just use hex without any hyphens or braces. That way, your UUID is serialized to 32 code points, whereas Azure container names can be up to 63 characters long.
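As a sketch of the "hash with a salt" route, the Python below uses HMAC-SHA256 with a secret key, which is one common way to do a keyed hash (my substitution, not something the question mentions). SECRET_KEY and opaque_name are hypothetical names. The digest is truncated to 32 hex digits to stay short, so uniqueness is no longer guaranteed, only overwhelmingly likely, and the mapping cannot be reversed without a lookup table.

```python
import hashlib
import hmac
import uuid

# Hypothetical secret; in practice it would come from configuration.
SECRET_KEY = b"replace-with-a-real-secret"

def opaque_name(guid: uuid.UUID, key: bytes = SECRET_KEY) -> str:
    """Keyed hash of the GUID, truncated to 32 lowercase hex digits."""
    digest = hmac.new(key, guid.bytes, hashlib.sha256).hexdigest()
    return digest[:32]

guid = uuid.UUID("5b263cdd-2bc2-485d-83d4-81b96930dc5a")
name = opaque_name(guid)
print(name, len(name))
```

The same GUID and key always yield the same name, so the application can recompute it on demand instead of storing it.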


If you really want shorter and funnier container names, and if Azure supports internationalized domain names, Braille encoding fits the bill as the least typable option. Here's a Haskell one-liner for generating a UUIDv4, mapping each octet of the UUID to a braille letter and encoding the resulting string in UTF-16BE (for a total of 32 octets).

import Data.Binary (encode)
import Data.ByteString.Lazy (ByteString, intersperse, cons)
import Data.Functor ((<&>))
import Data.UUID.V4 (nextRandom)

-- Each UUID octet b becomes the octet pair 0x28 b (40 == 0x28),
-- which is the braille pattern U+28xx in UTF-16BE.
braille :: IO ByteString
braille = nextRandom <&> encode <&> intersperse 40 <&> cons 40

(In F#, |> would be used instead of <&>.)
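For comparison, roughly the same mapping in Python, as a sketch: each octet is offset into the braille patterns block (U+2800 to U+28FF). The function name braille is just carried over from the Haskell version.

```python
import uuid

def braille(guid: uuid.UUID) -> str:
    """Map each of the 16 octets to the braille pattern U+2800 + octet."""
    return "".join(chr(0x2800 + octet) for octet in guid.bytes)

name = braille(uuid.uuid4())
# 16 code points, 32 octets once encoded as UTF-16BE
print(name, len(name), len(name.encode("utf-16-be")))
```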

For your amusement, see the following gist for how to convert an octet-stream into UTF-16LE or UTF-8 encoded braille strings which makes each bit literally stand out.

https://gist.github.com/bjartur/ea5db281f0b88128455ed79621abbd1d

Python's base64.b64encode(s[, altchars]): Encode a string using Base64. s is the string to encode. Optional altchars must be a string of at least length 2 (additional characters are ignored) which specifies an alternative alphabet for the + and / characters. This allows an application to e.g. generate URL or filesystem safe Base64 strings.
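A short illustration of altchars, assuming Python 3, where both the data and altchars are bytes and altchars is expected to be exactly two bytes long. The three input bytes are chosen so that the standard encoding consists only of "+" and "/"; the "-." pair is an arbitrary example, not a recommendation.

```python
import base64

# 0xFB 0xFF 0xBF splits into the 6-bit values 62, 63, 62, 63.
raw = bytes([0xFB, 0xFF, 0xBF])

standard = base64.b64encode(raw)                 # b'+/+/'
url_safe = base64.b64encode(raw, altchars=b"-_") # b'-_-_'
custom = base64.b64encode(raw, altchars=b"-.")   # b'-.-.'

print(standard, url_safe, custom)

# Decoding needs the same altchars to restore the original bytes.
print(base64.b64decode(custom, altchars=b"-.") == raw)  # True
```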

modp_base64_decode expects aligned strings as input. On Intel it doesn't matter, but on ARM chips it can segfault. The modp_numtoa.c functions may produce different rounding than whatever printf does on your system. On MinGW (Windows), compiling with -D__USE_MINGW_ANSI_STDIO gives the standard printf behavior. Alpine Linux / musl-libc

The default b64encode() function uses the standard Base64 alphabet, which contains the characters A-Z, a-z, 0-9, +, and /. Since + and / are not URL and filename safe, RFC 3548 defines another variant of Base64 encoding whose output is URL and filename safe. This variant replaces + with minus (-) and / with underscore (_).
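Putting the URL-safe variant together with the padding removal from the question, a round trip might look like the Python sketch below. to_token and from_token are hypothetical helper names, and bytes_le is assumed to match the byte order .NET uses for a Guid.

```python
import base64
import uuid

def to_token(guid: uuid.UUID) -> str:
    """URL-safe Base64 of the GUID bytes, with "=" padding stripped."""
    return base64.urlsafe_b64encode(guid.bytes_le).rstrip(b"=").decode("ascii")

def from_token(token: str) -> uuid.UUID:
    """Restore the padding to a multiple of 4 characters, then decode."""
    padded = token + "=" * (-len(token) % 4)
    return uuid.UUID(bytes_le=base64.urlsafe_b64decode(padded))

guid = uuid.UUID("5b263cdd-2bc2-485d-83d4-81b96930dc5a")
token = to_token(guid)
print(token)                      # 3TwmW8IrXUiD1IG5aTDcWg
print(from_token(token) == guid)  # True
```

Stripping the padding loses nothing here, because the input length is always 16 bytes and the padding can be recomputed from the token length.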

Comments
  • Why do you need to use base-64 encoding to encode a GUID? The only characters valid in a GUID are '{', '}', '0'-'9', 'A'-'F' and '-'.
  • What's the thinking process behind "let's encode a GUID with base64 because GUID has invalid chars and base64 has even more"?
  • @Ben: base64-encoded data is always longer than the original data.
  • @Ben Tell us what are you really trying to do.
  • It sounds like you want encryption rather than encoding.
  • Using Base16 results in a 33% longer string than using Base64. Having said that, if the OP finds a 24-character random-ish string "short and friendly" then I'm sure they wouldn't have too much trouble with a 32-character string either.
  • @LukeH but it's 400% friendlier because it uses less distinct characters! :)
  • @Martinho, agree. I've also updated my question as to why I was encoding. Do I have any risk of collisions with this?
  • @Ben: It's a 1-to-1 map, so you only have collisions if you have colliding GUIDs.
  • @Ben: I thought you wanted some way of encoding a GUID that didn't have any invalid characters. This one fits that purpose. I won't suggest an alternative if you don't tell what other requirements you have.