The `hash()` algorithm in this article has been slightly altered so that the code below doesn’t work. This is intentional: this code should not be used for secure hashing, as it is merely a demonstration of why the same password can generate a different hash.

Take a simple password like “mypassword1”. We can make a simple hash from it using just `hash()`.

`echo hash("mypassword1");`

Output:

`0d28e4080dc8f64fc9603639bb7aa1b9`

This creates a hash, which is usually a string that summarizes the data. Hashing can mix up the data so that it doesn’t make sense to humans, like “4b492f”. It always mixes the data up in the same way, so the output won’t change unless the input changes.

Notice how these two hashes produce the same output:

```
echo hash("mypassword1");
echo "<br>";
echo hash("mypassword1");
```

Output:

```
0d28e4080dc8f64fc9603639bb7aa1b9
0d28e4080dc8f64fc9603639bb7aa1b9
```
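Since the `hash()` used in this article is deliberately altered, here is a minimal runnable sketch of the same determinism using PHP's real `hash()`, which expects an algorithm name as its first argument. The choice of `"sha256"` is an assumption made purely for illustration; it is not the algorithm behind the outputs shown above.

```php
<?php
// PHP's built-in hash() needs the algorithm name as its first argument.
// "sha256" is an illustrative choice, not the article's altered algorithm.
$first  = hash("sha256", "mypassword1");
$second = hash("sha256", "mypassword1");

// Identical input always produces an identical digest.
var_dump($first === $second); // bool(true)

// Changing a single character produces a different digest.
var_dump($first === hash("sha256", "mypassword2")); // bool(false)
```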

The problem with having the same result for the same input is that once a hacker finds the password that matches your hash, they not only know your password, but can share it, sell it, or even log into your account. Passwords, and entire databases, can also get stolen.

The hash above has already been cracked and added to a table of cracked passwords, as have many others. We need to find one that hasn’t been cracked. Using a complicated password helps, but another solution is to use a more complicated hashing function. If we add words to the input, we will get a different result.

```
echo hash("mypassword1");
echo "<br>";
echo hash("mickey"."mypassword1");
```

Output:

```
0d28e4080dc8f64fc9603639bb7aa1b9
b70e1b3957c924300658efa08d73cf55
```

This is a good start to securing a password, but not nearly good enough. The prepended characters are known as a pepper. Since the hash has changed, it might no longer be part of a pre-hacked table that hackers send around, but there’s still a problem: the same pepper is being used for every password.

Here’s an example of two different passwords using the same pepper:

```
echo hash("mickey"."mypassword1");
echo "<br>";
echo hash("mickey"."christmas2007");
```

Output:

```
b70e1b3957c924300658efa08d73cf55
c36654fa9526abfdc6c1befdb435e895
```

While it’s great that their hashes are different, the pepper is the same. If a hacker finds out the pepper, it makes cracking hashes easier. If they get access to a database that uses the same pepper on millions of hashes, they’ll all be easier to crack, because cracking usually involves hashing millions of possible password combinations until a match is found in the database. If they know the pepper, they need fewer guesses. If they try “mickeyhelpme” then they won’t get a match, but if they try “mickeymypassword1” they’ll get a match and they’ll know your password.
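The guessing attack described above can be sketched as follows. This is a minimal illustration, assuming the attacker already knows the pepper; `crack_with_pepper()` is a hypothetical helper, and `"sha256"` stands in for the article's altered `hash()`.

```php
<?php
// Hypothetical helper: tries each candidate with the known pepper prepended.
function crack_with_pepper(string $stolenHash, string $pepper, array $candidates): ?string
{
    foreach ($candidates as $candidate) {
        if (hash_equals($stolenHash, hash("sha256", $pepper . $candidate))) {
            return $candidate; // match found: this is the password
        }
    }
    return null; // no candidate matched
}

// Simulate a leaked database entry that was hashed with the pepper "mickey".
$stolenHash = hash("sha256", "mickey" . "mypassword1");

$guesses = ["helpme", "christmas2007", "mypassword1"];
var_dump(crack_with_pepper($stolenHash, "mickey", $guesses)); // string(11) "mypassword1"
```

With the wrong pepper, none of the guesses match, which is why a leaked pepper matters so much.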

Unlike a pepper, a salt is a string of characters that is usually generated randomly for each password. This solves the pepper problem just mentioned, because different salts result in different hashes of the same data.

First, let’s make a simple salt. We’ll just use integers from 1000 to 9999 so that the salt is always four digits long. This simplifies things later on.

```
$salt = random_int(1000, 9999);
echo $salt;
```

Output:

`1802`

We can prepend this to the password as our random salt.

```
$salt = random_int(1000, 9999);
echo hash($salt."mypassword1");
```

Output:

`0e0a0545cfb80d7cb300147536b556f7`
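A four-digit salt keeps the demonstration simple. As a hedged sketch of a more realistic salt, the snippet below draws 16 random bytes with `random_bytes()` and hex-encodes them. The use of `"sha256"` is an assumption for illustration, since the article's `hash()` is intentionally altered.

```php
<?php
// 16 random bytes from the CSPRNG, hex-encoded so the salt is printable.
$saltA = bin2hex(random_bytes(16));
$saltB = bin2hex(random_bytes(16));

$hashA = hash("sha256", $saltA . "mypassword1");
$hashB = hash("sha256", $saltB . "mypassword1");

// Same password, different salts: the digests differ
// (two identical 16-byte salts are astronomically unlikely).
var_dump($hashA === $hashB); // bool(false)
```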

Notice that the hash has changed yet again now that a random salt has been applied. There’s still a problem, however. A powerful computer can still crack this password by hashing many different passwords and seeing if each hash matches your hash. If one does, they know your password. We can exploit the fact that computers must make many attempts at guessing a password by making each attempt costly. When there are millions of attempts, just adding a few microseconds adds up to hours of extra time to crack just one password! A simple and effective method to add work, also called “difficulty” or “cost,” is to simply rehash the hash many times.

We’ll rehash the password 50 times:

```
$salt = random_int(1000, 9999);
$hash = hash($salt."mypassword1");
for($i = 0; $i < 50; $i++){
$hash = hash($salt.$hash);
}
echo $hash;
```

Output:

`27f8e572bf73b5d5dfa21441550e7928`

Notice that we repeatedly pass the output hash back into the hash loop, changing it every time. Cracking this hash forces a hacker to do more work, but we still have the same problem of consistency: we’re using the same number of rehashes. We can randomize this too.

We can force any hacker to use between 100 and 999 iterations of the hashing. Limiting the range to 100 to 999 results in a value that’s three digits long.

```
$salt = random_int(1000, 9999);
$iterations = random_int(100, 999);
$hash = hash($salt."mypassword1");
for($i = 0; $i < $iterations; $i++){
$hash = hash($salt.$hash);
}
echo $hash;
```

Output:

`2249c804ae8ccbbcb6f6322ba34fce45`
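The rehashing loop above has a standardized relative, PBKDF2, which PHP exposes as `hash_pbkdf2()`. The sketch below is not the article's method. It only illustrates the same idea: the iteration count changes the result, and the same parameters reproduce it. The iteration counts and the `"sha256"` algorithm are assumptions chosen for the example.

```php
<?php
$salt = random_bytes(16);

// hash_pbkdf2(algorithm, password, salt, iterations, length) repeats the
// underlying hash internally; more iterations means more work per guess.
$fast = hash_pbkdf2("sha256", "mypassword1", $salt, 1000, 64);
$slow = hash_pbkdf2("sha256", "mypassword1", $salt, 100000, 64);

// Same password and salt, but a different iteration count changes the result.
var_dump($fast === $slow); // bool(false)

// Re-running with the same parameters reproduces the same digest,
// which is what makes later verification possible.
var_dump($fast === hash_pbkdf2("sha256", "mypassword1", $salt, 1000, 64)); // bool(true)
```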

The password is far better hashed now. But there is one more problem: the purpose of storing a password hash is to check whether a password you enter is valid. This is done by entering a password when you log in, hashing it, and checking whether it matches the stored hash. If they match, then your password is verified. But how does the hashing routine know how many iterations were originally used, or which random salt was used?

You can actually make these public. The goal isn’t to hide this information, but to make the hash itself as random as possible to avoid a match during cracking. We can attach the salt and iterations right to the hash. If the database gets lost or stolen by hackers, it doesn’t make any difference whether the iterations and salt are public or not. In fact, other security methods make some of their parameters public, usually in the form of a key.

I’ll refer to the iterations and salt parts of the hash as “segments.” I’ll delimit each segment with a hyphen. This allows for the following code to create a hash:

```
$salt = random_int(1000, 9999);
$iterations = random_int(100, 999);
$hash = hash($salt."mypassword1");
for($i = 0; $i < $iterations; $i++){
$hash = hash($salt.$hash);
}
echo $iterations."-".$salt."-".$hash;
```

Output:

`170-6382-91967eb06ee5bc1b2f7b12ee4df90177`

The `hash()` method used above contains an algorithm that creates the hashed output. But what if one day it’s cracked? We need a backup plan. We can simply use other algorithms for hashing. There are plenty of hashing algorithms available. We could make “1” represent the current `hash()` algorithm, “2” represent the `sha256()` algorithm, and “3” represent the `bcrypt()` algorithm.

We can prepend this code to the hash above as another segment:

`1-170-6382-91967eb06ee5bc1b2f7b12ee4df90177`

Lastly, we can make two functions: one to create the hash from the password and another to compare hashes. Again, the first segment is the hashing algorithm used, the second is the iterations, and the third is the salt. Combined with the three hyphens, the output will always be 43 characters.

```
$passwordinput = "mypassword1";
$algorithm = 1;
// Stored in the algorithm-iterations-salt-hash format described above.
$storedhash = "1-132-4980-ee39de659444e7fc10b5cb816b7341ba";

// Create a new stored hash string from a password.
function get_hash($passwordinput, $algorithm) {
    $salt = random_int(1000, 9999);
    $iterations = random_int(100, 999);
    $hash = hash($salt.$passwordinput);
    for ($i = 0; $i < $iterations; $i++) {
        $hash = hash($salt.$hash);
    }
    return $algorithm."-".$iterations."-".$salt."-".$hash;
}

// Rehash the entered password with the stored parameters and compare.
function compare_hashes($passwordinput, $storedhash) {
    // The algorithm segment is read but not checked in this demonstration.
    list($storedalgorithm, $storediterations, $storedsalt, $realstoredhash) = explode("-", $storedhash);
    $hashedpasswordinput = hash($storedsalt.$passwordinput);
    for ($i = 0; $i < $storediterations; $i++) {
        $hashedpasswordinput = hash($storedsalt.$hashedpasswordinput);
    }
    return $hashedpasswordinput === $realstoredhash;
}

echo compare_hashes($passwordinput, $storedhash);
```

Output:

`1`
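Because the article's `hash()` is intentionally broken, here is a hedged, runnable sketch of the same create-and-compare round trip. It reuses the article's function names, but the `ALGORITHMS` table, the `stretch()` helper, and the choice of SHA-2 algorithms are assumptions added for illustration. It remains a teaching example, not something to deploy.

```php
<?php
// Hypothetical mapping from the first segment to a real algorithm name.
const ALGORITHMS = [1 => "sha256", 2 => "sha512"];

// Hypothetical helper: hash once, then rehash $iterations more times.
function stretch(string $algo, string $salt, string $password, int $iterations): string
{
    $hash = hash($algo, $salt . $password);
    for ($i = 0; $i < $iterations; $i++) {
        $hash = hash($algo, $salt . $hash);
    }
    return $hash;
}

function get_hash(string $password, int $algorithm = 1): string
{
    $salt = (string) random_int(1000, 9999);
    $iterations = random_int(100, 999);
    $hash = stretch(ALGORITHMS[$algorithm], $salt, $password, $iterations);
    return $algorithm . "-" . $iterations . "-" . $salt . "-" . $hash;
}

function compare_hashes(string $password, string $stored): bool
{
    $segments = explode("-", $stored);
    if (count($segments) !== 4 || !array_key_exists((int) $segments[0], ALGORITHMS)) {
        return false; // malformed record or unknown algorithm
    }
    [$algorithm, $iterations, $salt, $expected] = $segments;
    $actual = stretch(ALGORITHMS[(int) $algorithm], $salt, $password, (int) $iterations);
    return hash_equals($expected, $actual); // constant-time comparison
}

$stored = get_hash("mypassword1");
var_dump(compare_hashes("mypassword1", $stored));   // bool(true)
var_dump(compare_hashes("christmas2007", $stored)); // bool(false)
```

Splitting on the hyphen rather than on fixed offsets keeps the comparison working even when a different algorithm produces a hash of a different length.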

In reality, a salt would have 16 or more random bytes, there would be far more iterations, and different hashing algorithms would create hashes of different lengths. I also cheated in the code and ignored checking the hashing algorithm used to hash. Fortunately, PHP takes care of password creation and verification (matching) with just two simple functions.

PHP’s `password_hash()` is similar to the method above. Because it uses random parameters, the hash will change each time it’s generated, even for the same password. `password_hash()` is currently the favored method for hashing passwords, hopefully sent over a secure connection. You don’t need to generate your own salt or cost, and, as with the method above, the algorithm can change at any time. `password_verify()` tries to match a password against its hash. Since the parameters are obtained from the hash, the only inputs you need are the password and the stored hash.
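A minimal usage sketch of these two functions follows. `PASSWORD_DEFAULT` lets PHP choose its current default algorithm. The call to `password_needs_rehash()` is an optional extra, included only to show how stored hashes can be upgraded when that default changes.

```php
<?php
// PASSWORD_DEFAULT lets PHP pick its current default algorithm.
$stored = password_hash("mypassword1", PASSWORD_DEFAULT);

// A second call embeds a fresh random salt, so the strings differ.
var_dump($stored === password_hash("mypassword1", PASSWORD_DEFAULT)); // bool(false)

// Verification reads the algorithm, cost, and salt back out of $stored.
var_dump(password_verify("mypassword1", $stored));   // bool(true)
var_dump(password_verify("christmas2007", $stored)); // bool(false)

// password_get_info() exposes the parameters embedded in the hash string.
print_r(password_get_info($stored));

// After a successful login, upgrade hashes made with outdated settings.
if (password_needs_rehash($stored, PASSWORD_DEFAULT)) {
    $stored = password_hash("mypassword1", PASSWORD_DEFAULT);
}
```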

Continued from Part 1.

Let’s start with some definitions.

**Tree:** A hierarchical structure of nodes and connections between those nodes (branches) with parent-child relationships. Child nodes have parent nodes, which in turn may have their own parent nodes. The highest node is the root node.

**Decision tree:** a flow-chart-like structure where each internal (non-leaf) node denotes a test on an attribute, and each branch represents the outcome of a test. Each leaf (or terminal) node holds a class label.

**Computer model:** a computer program that is designed to simulate what did or what could happen in a situation.

We made a model in Part 1, but it wasn’t a computer model. You can think of a model as a simple or smaller example of something larger or more complex. Models may be based on real objects or on theories.

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. Tree models where the target variable can take a discrete set of values are called classification trees.

**Information gain:** a metric used to measure the increase in homogeneity produced by a split. It is used to choose the attribute on which the data is split at a node in a decision tree, namely the split that has the highest purity among the daughters.

**Information entropy:** the average rate at which information is produced by a randomly determined source of data; a measure of heterogeneity in a collection. The lower the entropy, the higher the purity (homogeneity).

Constructing a decision tree involves finding the attribute that returns the highest information gain (i.e., the most homogeneous branches) and repeating this until the tree is complete. The topmost node in a tree is the root node. The first split is done at the root, and it’s the split with the highest information gain.

An example of a randomly determined (stochastic) source of data is a die roll. Entropy measures the average information content you can expect to get when an event occurs, so rolling a die has more entropy than tossing a coin, because each outcome of the die has a smaller probability than each outcome of the coin.
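The coin-versus-die comparison can be checked numerically. The sketch below assumes a fair coin and a fair six-sided die, and measures entropy in bits (logarithm base 2). The `entropy()` helper is a name introduced here for illustration.

```php
<?php
// Shannon entropy in bits of a list of probabilities that sum to 1.
function entropy(array $probabilities): float
{
    $bits = 0.0;
    foreach ($probabilities as $p) {
        if ($p > 0) {            // treat 0 * log2(0) as 0
            $bits -= $p * log($p, 2);
        }
    }
    return $bits;
}

$coin = entropy([1/2, 1/2]);
$die  = entropy(array_fill(0, 6, 1/6));

printf("coin: %.3f bits, die: %.3f bits\n", $coin, $die);
// coin: 1.000 bits, die: 2.585 bits
```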

Let’s look at our distinctive feature graph. Notice that the /ɑ/ (“father a”) has three distinctive features: [+syllabic +back +low].

Given a phoneme, we can use a decision tree to figure out if a phoneme is one of two classes: /ɑ/ or not /ɑ/ (¬/ɑ/) using our three given features.

You can traverse the tree downward, answering each hypothesis with a yes or no until you arrive at the appropriate class of /ɑ/ or ¬/ɑ/.

Imagine you knew only the features and not the phoneme (class). Given the feature matrix [+syllabic +back +high], what is the outcome? If you traverse the tree, you see that the test +syllabic? is true, or “yes”, so you proceed down the “yes” branch. You then see that +back? is true, so you proceed down the “yes” branch. However, you see that +low? is false, so you proceed down the “no” branch and find that the feature matrix represents the ¬/ɑ/ class. We’ve just used a decision tree to classify a phoneme!
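The traversal just described can be sketched in a few lines. The nested-array encoding of the tree and the `classify()` helper are assumptions made for this illustration. Features are passed as the list of properties that carry a plus sign.

```php
<?php
// Hypothetical encoding of the tree: each internal node tests one feature;
// "no" and "yes" hold either another node or a class label (a string).
$tree = [
    "test" => "syllabic",
    "no"   => "¬/ɑ/",
    "yes"  => [
        "test" => "back",
        "no"   => "¬/ɑ/",
        "yes"  => [
            "test" => "low",
            "no"   => "¬/ɑ/",
            "yes"  => "/ɑ/",
        ],
    ],
];

// Walk down from the root until a leaf (class label) is reached.
function classify(array $node, array $features): string
{
    $branch = in_array($node["test"], $features, true) ? "yes" : "no";
    $next = $node[$branch];
    return is_array($next) ? classify($next, $features) : $next;
}

echo classify($tree, ["syllabic", "back", "high"]), "\n"; // ¬/ɑ/
echo classify($tree, ["syllabic", "back", "low"]), "\n";  // /ɑ/
```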

Side note: we’re just using two classes, so this tree is a binary classifier.

Let’s use the equations above to calculate the splits. Information gain is used to decide which feature to split on at each step in building the tree. At each step we choose the split that results in the purest daughter nodes. Every available split must be tested to find which gives the purest result.

Entropy is defined below:

entropy = -p_{1} log2(p_{1}) – p_{2} log2(p_{2}) – . . .

where p_{1}, p_{2}, . . . are the fractions of examples in each class.

Let’s use the following data to calculate entropy:

There are actually 3 features (we’re not counting “class” as a feature, since it’s the answer), so J = 3.

To get the entropy of the “height” node, get the counts of the +high and the counts of +low.

P_{+high} = 2

P_{+low} = 2

Divide both by the count of all observations. Since there are 4 observations (rows), we divide each by 4:

P_{+high} = 2/4 = 0.5

P_{+low} = 2/4 = 0.5

We now have to use the logarithm. A calculator will tell you that -(0.5)log2(0.5) – (0.5)log2(0.5) = 1.

Notice we have two class labels, so we just have to get the log of two values. Also notice that with 2 classes, an even split (0.5 each) gives us the maximal entropy of 1.

We now know that the entropy of the “height” node is 1.

For each example in the “height” column, we’ll look at the corresponding syllabic value in that row. Our split has just two branches because there are just two values for the “height” feature: +high and +low.

*When calculating entropy, 1/1 and 0/0 are actually 0 because every example has the same value, making it totally pure, so there is no entropy.

We don’t need to calculate the entropy of the -syllabic branch because we already know it’s 0. We just need to calculate the entropy for P_{+high} and P_{+low} for the +syllabic branch.

entropy = -(2/3)log2(2/3) – (1/3)log2(1/3) = 0.9183

The entropy of the children is the weighted average of the two branches.

entropy(children) = (3/4)(0.9183) + (1/4)(0) = 0.6887

We’ll split our data using information gain. It’s the entropy of the parent minus the weighted sum of the entropy of its children:

information gain = entropy(parent) – entropy(children)

The split will try to maximize the information gain. Let’s continue by using our data above to get the information gain from the “syllabic” feature. Let’s assume that the parent is “height”, which has an entropy of 1.

syllabic information gain = 1 – [(3/4)(0.9183) + (1/4)(0)] = 0.3113

We subtracted the entropy of the children from the entropy of the parent to get the information gain for the syllabic split. We have to compare this to the backness split, so we have to use the same method above to calculate the backness information gain.
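The arithmetic for the syllabic split can be reproduced with a short sketch. The helper names are introduced here for illustration. The counts come from the worked example (2 and 2 in the parent, a 2/1 split in the +syllabic branch, one pure example in the -syllabic branch), and the order of the counts within each pair is an assumption that does not affect the result.

```php
<?php
// Entropy in bits computed from raw class counts, e.g. [2, 1].
function entropy_from_counts(array $counts): float
{
    $total = array_sum($counts);
    $bits = 0.0;
    foreach ($counts as $count) {
        if ($count > 0) {                    // a count of 0 contributes nothing
            $p = $count / $total;
            $bits -= $p * log($p, 2);
        }
    }
    return $bits;
}

// Parent entropy minus the size-weighted average entropy of the children.
function information_gain(array $parentCounts, array $childrenCounts): float
{
    $total = array_sum($parentCounts);
    $weighted = 0.0;
    foreach ($childrenCounts as $counts) {
        $weighted += (array_sum($counts) / $total) * entropy_from_counts($counts);
    }
    return entropy_from_counts($parentCounts) - $weighted;
}

// Parent "height" node: 2 and 2. Children: the +syllabic branch (3 examples,
// split 2/1) and the -syllabic branch (1 example, pure).
$gain = information_gain([2, 2], [[2, 1], [0, 1]]);
printf("%.3f\n", $gain); // 0.311
```

Running the same function on the counts for the backness split gives the number to compare against.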

We repeat this process, recursively comparing information gain. A loss function determines which split is better. Once these features are used for a split, they are no longer candidates for further splits.

Rather than repeating lots of notation and explanations, I’ll use a JavaScript example in the next part.
