Timehop just reminded me of something I shared a few years back, and I wanted to share it again and talk about it in a little more depth. It’s an interesting subject if you know the ideas and problems addressed by cryptography; this one is about base64. The observation is that if we encode something in base64, then encode the result using the same method, the more we repeat this, the closer we come to a fixed point of the method. That is to say, base64 has a point where the output starts with the same text as the input you passed in.
This was looked into by Fran Mota, who is a computer science student at Cornell University (according to his website at the time of writing). Fran published the post The Base64 Encoder Has A Fixed Point back in 2013, so all credit to him for his research. I only want to chat about it.
The issue is, in base64 if we encode:
If we decode this string we get
So, Fran takes the string ‘lol’, runs it through Python’s base64 encoder, repeats the process, and gets something like this:
>>> base64.b64encode('lol')
'bG9s'
>>> base64.b64encode(_)
'Ykc5cw=='
>>> base64.b64encode(_)
'WWtjNWN3PT0='
>>> base64.b64encode(_)
'V1d0ak5XTjNQVDA9'
>>> base64.b64encode(_)
'VjFkMGFrNVhUak5RVkRBOQ=='
>>> base64.b64encode(_)
'VmpGa01HRnJOVmhVYWs1UlZrUkJPUT09'
>>> base64.b64encode(_)
'Vm1wR2EwMUhSbkpPVm1oVllXczFVbFpyVWtKUFVUMDk='
>>> base64.b64encode(_)
'Vm0xd1IyRXdNVWhTYmtwUFZtMW9WbGxYY3pGVmJGcHlWV3RLVUZWVU1Eaz0='
>>> base64.b64encode(_)
'Vm0weGQxSXlSWGROVldoVFltdHdVRlp0TVc5V2JHeFlZM3BHVm1KR2NIbFdWM1JMVlVaV1ZVMUVhejA9'
>>> base64.b64encode(_)
'Vm0wd2VHUXhTWGxTV0dST1ZsZG9WRmx0ZEhkVlJscDBUVmM1VjJKSGVGbFpNM0JIVm0xS1IyTkliRmRXTTFKTVZsVmFWMVpWTVVWaGVqQTk='
>>> base64.b64encode(_)
'Vm0wd2QyVkhVWGhUV0d4VFYwZFNUMVpzWkc5V1JteDBaRWhrVmxKc2NEQlVWbU0xVmpKS1NHVkdiRnBOTTBKSVZtMHhTMUl5VGtsaVJtUlhUVEZLVFZac1ZtRldNVnBXVFZWV2FHVnFRVGs9'
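The session above is Python 2, where plain strings are bytes. As a rough Python 3 sketch of the same experiment (the helper names `iterate_b64` and `common_prefix_len` are my own, not from Fran's post), this repeats the encoding and reports how many leading characters each result shares with the previous one:

```python
import base64


def common_prefix_len(a, b):
    """Count how many leading bytes two byte strings have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def iterate_b64(seed, rounds):
    """Return [seed, b64(seed), b64(b64(seed)), ...] with `rounds` encodings."""
    results = [seed]
    for _ in range(rounds):
        results.append(base64.b64encode(results[-1]))
    return results


steps = iterate_b64(b"lol", 11)

# Compare each encoding with the one before it.
for prev, cur in zip(steps[1:], steps[2:]):
    shared = common_prefix_len(prev, cur)
    print(shared, cur[:24].decode("ascii"))
```

With the seed `b"lol"`, the last two results share their first six characters, `Vm0wd2`, which matches the final two outputs in the session above.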
You can see that after four or five runs of this, some features start to stick. Once we get to runs nine, ten and eleven, we really start to see the beginning of the string staying the same. This is because, as he states on the page, base64 has two phases when considered from a high level:
- It takes a sequence of bytes (that is, digits in base 256), and interprets them as a sequence of digits in base 64, using four digits for every three bytes.
- It encodes the base 64 digits as a sequence of bytes, using one byte for every digit.
To explain the first phase, you need 8 bits to represent a byte, but a digit in base 64 only represents 6 bits. So in phase 1, base64 looks at 3 bytes at a time, and maps them to 4 corresponding base 64 digits. 3 bytes = 24 bits = 4 digits.
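To make the first phase concrete, here is a minimal sketch that regroups three bytes into four 6-bit digits. The function name `bytes_to_sextets` is something I made up for illustration, and it sidesteps padding by only accepting exactly three bytes.

```python
def bytes_to_sextets(chunk):
    """Split exactly three bytes (24 bits) into four 6-bit digits (0..63)."""
    if len(chunk) != 3:
        raise ValueError("this sketch only handles exactly three bytes")
    # Treat the three bytes as one 24-bit big-endian integer.
    n = int.from_bytes(chunk, "big")
    # Peel off four groups of 6 bits, most significant group first.
    return [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]


digits = bytes_to_sextets(b"lol")
print(digits)  # [27, 6, 61, 44]
```

The bytes of 'lol' are 0x6C, 0x6F, 0x6C, which as 24 bits read 011011 000110 111101 101100, hence the four digits 27, 6, 61 and 44.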
Then in the second phase, base64 makes these digits human readable. In doing so, it represents the 6 bit digits as 8 bit bytes, which is fine, if a little wasteful. So what was originally three bytes in the input becomes four bytes in the output.
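A minimal sketch of the second phase, assuming the standard base64 alphabet (A-Z, a-z, 0-9, +, /). The helper name `sextets_to_text` is hypothetical, and the digits 27, 6, 61, 44 are the four 6-bit digits that the three bytes of 'lol' regroup into.

```python
import base64
import string

# Standard base64 alphabet: digit 0 is 'A', digit 63 is '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def sextets_to_text(digits):
    """Map each 6-bit digit to one readable character, stored as one byte."""
    return "".join(ALPHABET[d] for d in digits).encode("ascii")


encoded = sextets_to_text([27, 6, 61, 44])
print(encoded)  # b'bG9s'
print(encoded == base64.b64encode(b"lol"))  # True

# Three input bytes become four output bytes, so 300 bytes become 400.
print(len(base64.b64encode(bytes(300))))  # 400
```

This is where the "little wasteful" part shows up: every output byte carries only 6 bits of information, so the encoded text is a third longer than the input.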
As explained on his blog, which I feel bad quoting any further: in a nutshell, the more we encode, and the more we transition between base 256 and base 64 digits, the more restricted the set of strings we can be encoding from becomes. Therefore, repeated base64 encoding will always converge to a point, a fixed point, a string known as