While looking at my old drafts I found this one I wrote a few years back while working on a project for the DPS915 class at Seneca. I had a great time in that class and re-reading the post makes me want to get back on track with the HashDecrypt project.

Post:


Overview

OK, last semester I was working on a brute-force MD5 cracker that leverages the power of the GPU through the CUDA SDK.

Up to a point I managed to implement an optimized version of the MD5 algorithm on the GPU. However, when comparing it with other MD5 crackers, I was not getting the results I wanted.

I had pushed the optimizations about as far as my knowledge allowed, so I started looking for other ways to optimize the algorithm itself.

One thing to keep in mind is that the performance of the program does not lie solely in the implementation of the MD5 algorithm.
The logic used to communicate between the CPU and the GPU, and how you choose to split the problem into tasks, matter as much as, if not more than, the algorithm itself.

The approach I’m using is very straightforward: I split the execution of the program into several Streams. Each Stream performs a brute-force attack on a predefined range. If a Stream finds a match for the hash being broken, it sets a flag in global memory so the host can identify which candidate produced the hash.

In posts to follow I will go into more detail about the other approaches I tried and show some benchmarks comparing their performance.

Anyway, what I’m currently focused on is optimizing the MD5 algorithm itself, for both the CPU and the GPU.

A quick background of what I’ve done with the algorithm so far:

I took an open implementation of MD5 written in C and started stripping it down to just the pieces needed for my specific use case: brute-forcing passwords of up to 128 bits over a base64 charset (upper- and lower-case letters, numbers, and the minus and plus signs).

MD5 Rounds

So, knowing that I would only use the MD5 algorithm in a very narrow scenario, there was no need to keep support for inputs longer than 512 bits. Remember that the MD5 algorithm iterates over 512-bit blocks.

So I removed a lot of the logic that enables the algorithm to hash longer inputs (512+ bits).

In the end I ended up with something like this:

 #define ROUND_1  
 t = a + (c) + x_0 + T1; a = ((t << 7) | (t >> (25))) + b;  
 t = d + ((a & b) | (~a & c)) + x_1 + T2; d = ((t << 12) | (t >> (20))) + a;  
 t = c + ((d & a) | (~d & b)) + T3; c = ((t << 17) | (t >> (15))) + d;  
 t = b + ((c & d) | (~c & a)) + T4; b = ((t << 22) | (t >> (10))) + c;  
 t = a + ((b & c) | (~b & d)) + T5; a = ((t << 7) | (t >> (25))) + b;  
 t = d + ((a & b) | (~a & c)) + T6; d = ((t << 12) | (t >> (20))) + a;  
 t = c + ((d & a) | (~d & b)) + T7; c = ((t << 17) | (t >> (15))) + d;  
 t = b + ((c & d) | (~c & a)) + T8; b = ((t << 22) | (t >> (10))) + c;  
 t = a + ((b & c) | (~b & d)) + T9; a = ((t << 7) | (t >> (25))) + b;  
 t = d + ((a & b) | (~a & c)) + T10; d = ((t << 12) | (t >> (20))) + a;  
 t = c + ((d & a) | (~d & b)) + T11; c = ((t << 17) | (t >> (15))) + d;  
 t = b + ((c & d) | (~c & a)) + T12; b = ((t << 22) | (t >> (10))) + c;  
 t = a + ((b & c) | (~b & d)) + T13; a = ((t << 7) | (t >> (25))) + b;  
 t = d + ((a & b) | (~a & c)) + T14; d = ((t << 12) | (t >> (20))) + a;  
 t = c + ((d & a) | (~d & b)) + X_14 + T15; c = ((t << 17) | (t >> (15))) + d;  
 t = b + ((c & d) | (~c & a)) + T16; b = ((t << 22) | (t >> (10))) + c;

/* Round 2. */  
 /* Let [abcd k s i] denote the operation  
 a = b + ((a + G(b,c,d) + X[k] + T[i]) <<< s).  
 */  
 /* Do the following 16 operations. */  
 #define ROUND_2  
 t = a + ((b & d) | (c & ~d)) + x_1 + T17; a = ((t << 5) | (t >> (27))) + b;  
 t = d + ((a & c) | (b & ~c)) + T18; d = ((t << 9) | (t >> (23))) + a;  
 t = c + ((d & b) | (a & ~b)) + T19; c = ((t << 14) | (t >> (18))) + d;  
 t = b + ((c & a) | (d & ~a)) + x_0 + T20; b = ((t << 20) | (t >> (12))) + c;  
 t = a + ((b & d) | (c & ~d)) + T21; a = ((t << 5) | (t >> (27))) + b;  
 t = d + ((a & c) | (b & ~c)) + T22; d = ((t << 9) | (t >> (23))) + a;  
 t = c + ((d & b) | (a & ~b)) + T23; c = ((t << 14) | (t >> (18))) + d;  
 t = b + ((c & a) | (d & ~a)) + T24; b = ((t << 20) | (t >> (12))) + c;  
 t = a + ((b & d) | (c & ~d)) + T25; a = ((t << 5) | (t >> (27))) + b;  
 t = d + ((a & c) | (b & ~c)) + X_14 + T26; d = ((t << 9) | (t >> (23))) + a;  
 t = c + ((d & b) | (a & ~b)) + T27; c = ((t << 14) | (t >> (18))) + d;  
 t = b + ((c & a) | (d & ~a)) + T28; b = ((t << 20) | (t >> (12))) + c;  
 t = a + ((b & d) | (c & ~d)) + T29; a = ((t << 5) | (t >> (27))) + b;  
 t = d + ((a & c) | (b & ~c)) + T30; d = ((t << 9) | (t >> (23))) + a;  
 t = c + ((d & b) | (a & ~b)) + T31; c = ((t << 14) | (t >> (18))) + d;  
 t = b + ((c & a) | (d & ~a)) + T32; b = ((t << 20) | (t >> (12))) + c;

/* Round 3. */  
 /* Let [abcd k s t] denote the operation  
 a = b + ((a + H(b,c,d) + X[k] + T[i]) <<< s).  
 */  
 /* Do the following 16 operations. */  
 #define ROUND_3  
 t = a + (b ^ c ^ d) + T33; a = ((t << 4) | (t >> (28))) + b;  
 t = d + (a ^ b ^ c) + T34; d = ((t << 11) | (t >> (21))) + a;  
 t = c + (d ^ a ^ b) + T35; c = ((t << 16) | (t >> (16))) + d;  
 t = b + (c ^ d ^ a) + X_14 + T36; b = ((t << 23) | (t >> (9))) + c;  
 t = a + (b ^ c ^ d) + x_1 + T37; a = ((t << 4) | (t >> (28))) + b;  
 t = d + (a ^ b ^ c) + T38; d = ((t << 11) | (t >> (21))) + a;  
 t = c + (d ^ a ^ b) + T39; c = ((t << 16) | (t >> (16))) + d;  
 t = b + (c ^ d ^ a) + T40; b = ((t << 23) | (t >> (9))) + c;  
 t = a + (b ^ c ^ d) + T41; a = ((t << 4) | (t >> (28))) + b;  
 t = d + (a ^ b ^ c) + x_0 + T42; d = ((t << 11) | (t >> (21))) + a;  
 t = c + (d ^ a ^ b) + T43; c = ((t << 16) | (t >> (16))) + d;  
 t = b + (c ^ d ^ a) + T44; b = ((t << 23) | (t >> (9))) + c;  
 t = a + (b ^ c ^ d) + T45; a = ((t << 4) | (t >> (28))) + b;  
 t = d + (a ^ b ^ c) + T46; d = ((t << 11) | (t >> (21))) + a;  
 t = c + (d ^ a ^ b) + T47; c = ((t << 16) | (t >> (16))) + d;  
 t = b + (c ^ d ^ a) + T48; b = ((t << 23) | (t >> (9))) + c;

/* Round 4. */  
 /* Let [abcd k s t] denote the operation  
 a = b + ((a + I(b,c,d) + X[k] + T[i]) <<< s).  
 */  
 /* Do the following 16 operations. */  
 #define ROUND_4  
 t = a + (c ^ (b | ~d)) + x_0 + T49; a = ((t << 6) | (t >> (26))) + b;  
 t = d + (b ^ (a | ~c)) + T50; d = ((t << 10) | (t >> (22))) + a;  
 t = c + (a ^ (d | ~b)) + X_14 + T51; c = ((t << 15) | (t >> (17))) + d;  
 t = b + (d ^ (c | ~a)) + T52; b = ((t << 21) | (t >> (11))) + c;  
 t = a + (c ^ (b | ~d)) + T53; a = ((t << 6) | (t >> (26))) + b;  
 t = d + (b ^ (a | ~c)) + T54; d = ((t << 10) | (t >> (22))) + a;  
 t = c + (a ^ (d | ~b)) + T55; c = ((t << 15) | (t >> (17))) + d;  
 t = b + (d ^ (c | ~a)) + x_1 + T56; b = ((t << 21) | (t >> (11))) + c;  
 t = a + (c ^ (b | ~d)) + T57; a = ((t << 6) | (t >> (26))) + b;  
 t = d + (b ^ (a | ~c)) + T58; d = ((t << 10) | (t >> (22))) + a;  
 t = c + (a ^ (d | ~b)) + T59; c = ((t << 15) | (t >> (17))) + d;  
 t = b + (d ^ (c | ~a)) + T60; b = ((t << 21) | (t >> (11))) + c;  
 t = a + (c ^ (b | ~d)) + T61; a = ((t << 6) | (t >> (26))) + b;  
 t = d + (b ^ (a | ~c)) + T62; d = ((t << 10) | (t >> (22))) + a;  
 t = c + (a ^ (d | ~b)) + T63; c = ((t << 15) | (t >> (17))) + d;  
 t = b + (d ^ (c | ~a)) + T64; b = ((t << 21) | (t >> (11))) + c;  

Even though I was already getting a big bump in performance, I knew I could improve it.
After thinking about different ways to optimize the algorithm, one idea that stuck in my mind was that I should somehow be able to reverse some, though not all, steps of the algorithm.

For example:
The algorithm has 64 steps grouped into 4 rounds:

 Round 1  
 F(X,Y,Z) = (X & Y) | (~X & Z)

Round 2  
 G(X,Y,Z) = (X & Z) | (Y & ~Z)

Round 3  
 H(X,Y,Z) = X ^ Y ^ Z

Round 4  
 I(X,Y,Z) = Y ^ (X | ~Z)  

Optimizing the Rounds

Somehow I should be able to perform the same operations from Round 4 on the hash being brute-forced, so that all the new hashes being computed would only need to go through Rounds 1-3, right?
It's not that simple, but after reading a few articles and forum threads I found it was doable.

I’ll try to walk through my logic:

First of all, the 64th step doesn't need to be reversed, since its result is the final hash value.
So we can start from step 63.

In most implementations you will find something like this:

 /* Round 4. */  
 /* Let [abcd k s t] denote the operation  
 a = b + ((a + I(b,c,d) + X[k] + T[i]) <<< s). */  
 #define I(x, y, z) ((y) ^ ((x) | ~(z)))  
 #define SET(a, b, c, d, k, s, Ti) \
 t = a + I(b,c,d) + X[k] + Ti; \
 a = ROTATE_LEFT(t, s) + b  
 /* Do the following 16 operations. */  
 SET(a, b, c, d, 0, 6, T49);  
 SET(d, a, b, c, 7, 10, T50);  
 SET(c, d, a, b, 14, 15, T51);  
 SET(b, c, d, a, 5, 21, T52);  
 SET(a, b, c, d, 12, 6, T53);  
 SET(d, a, b, c, 3, 10, T54);  
 SET(c, d, a, b, 10, 15, T55);  
 SET(b, c, d, a, 1, 21, T56);  
 SET(a, b, c, d, 8, 6, T57);  
 SET(d, a, b, c, 15, 10, T58);  
 SET(c, d, a, b, 6, 15, T59);  
 SET(b, c, d, a, 13, 21, T60);  
 SET(a, b, c, d, 4, 6, T61);  
 SET(d, a, b, c, 11, 10, T62);  
 SET(c, d, a, b, 2, 15, T63);  
 SET(b, c, d, a, 9, 21, T64);  
 #undef SET  

Those are all the operations performed in Round 4.

So breaking down step 63 looks something like this:

 c = (((c + (a ^ (d | ~b)) + X[2] + T63) << 15)  
 |  
 ((c + (a ^ (d | ~b)) + X[2] + T63) >> (17))) + d  
  • a, b, c and d are the 32-bit integers on which all the operations are performed.
  • X is the array where the original data (the message being hashed) is stored.
  • X is an array of unsigned 32-bit integers, so it advances in 4-byte chunks.
  • That means X[2] holds chars 8-11 of the original input.
  • Remember that a char is 8 bits wide.
  • T63 is a constant value defined in the spec.
  • 15 is the rotation amount defined by the spec (17 is just 32 - 15).

So with that in mind, to start the reverse engineering process we can break down the problem into smaller portions.

The first part is:

 ((c + (a ^ (d | ~b)) + X[2] + T63) << 15)  

The result of the operation above is then ORed with:

 ((c + (a ^ (d | ~b)) + X[2] + T63) >> 17)  

The result of the OR operation is then added to d.
So the first step of the reversal is to subtract d from c:

 c -= d  

Now we are left with only

 ((c + (a ^ (d | ~b)) + X[2] + T63) << 15)  
 |  
 ((c + (a ^ (d | ~b)) + X[2] + T63) >> (17))  

To help me understand how to reverse those operations, I built a simple example using basic algebra.

  • First, replace c with 2 and the sum (a ^ (d | ~b)) + X[2] + T63 with (3 + 3).
  • Then, since a left shift is the same as multiplying by a power of two, replace the shift by 15 with a multiplication by 2: (2 + (3 + 3)) * 2

The operation above yields 16.

So how can I reverse 16 back to 2? At first I thought this:

  • Substitute 16 for c and divide instead of multiplying: (16 - (3 + 3)) / 2

However, that operation yields 5, not 2.

When reversing, I needed to divide 16 by 2 before subtracting (3 + 3): 16 / 2 - (3 + 3) = 2

Having figured this out, I just applied the same concept to the MD5 algorithm.

Does the order of the operands in the OR matter for the result? No.

 c -= d;  
 c = (c >> 15) | (c << 17);  
 c -= (a ^ (d | ~b)) + X[2] + T63;  

I just found out that both & (AND) and | (OR) are, in general, destructive operations; only ^ (XOR) and ~ (NOT) can be reversed.

The arithmetic example:

c = 4  
c = (c + 3 + 3) * 2  
c = (4 + 3 + 3) * 2  
c = 10 * 2  
c = 20  

To reverse, meaning to find the value of c before that operation,
all the steps need to be done backwards. That means I need to divide before subtracting:

 c = (c / 2) – 3 – 3  
 c = (20 / 2) – 3 – 3  
 c = 10 – 3 – 3  
 c = 4  

When first looking at this problem, it seems the subtractions should be performed before the division.
However, the additions were applied to the original value of c before the multiplication,
so the division has to come first to bring the value back to the point where the subtractions can be made.

  • The last operation done on c was adding d, so to reverse it we start by subtracting d from c.
  • The second-to-last operation was the OR, so that is the second thing we undo.
  • Before the OR came the left and right shifts.
  • And finally the additions of the other values, which in this case means subtracting their total.

This part is still a little bit confusing to me.
I don’t have a background in math, so I’d appreciate it if anybody could explain why:

 c =  
 (((c + (a ^ (d | ~b)) + X[2] + T63) << 15)  
 |  
 ((c + (a ^ (d | ~b)) + X[2] + T63) >> 17))  
 + d  

is reversed like this:

 c -= d;  
 c = (c >> 15) | (c << 17);  
 c -= (a ^ (d | ~b)) + X[2] + T63;  

and not like this:

 c -= d;  
 c = ((c >> 15) - ((a ^ (d | ~b)) + X[2] + T63))  
 |  
 ((c << 17) - ((a ^ (d | ~b)) + X[2] + T63));  

Or would that even make a difference?

My point is: since OR is a lossy operation, how can the bits be recovered?
And why does the subtraction need to be performed after the shifts and the OR?

I have a gist with a working example