Having read various resources about password strength I'm trying to create an algorithm that will provide a rough estimation of how much entropy a password has.
I'm trying to create an algorithm that's as comprehensive as possible. At this point I only have pseudocode, but the algorithm covers the following:
password length
repeated characters
patterns (logical)
different character spaces (LC, UC, Numeric, Special, Extended)
dictionary attacks
It does NOT cover the following, and SHOULD cover it WELL (though not perfectly):
ordering (passwords can be strictly ordered by output of this algorithm)
patterns (spatial)
Can anyone provide some insight on what this algorithm might be weak to? Specifically, can anyone think of situations where feeding a password to the algorithm would OVERESTIMATE its strength? Underestimations are less of an issue.
The algorithm:
// the password to test
password = ?
length = length(password)
// unique character counts from password (duplicates discarded)
uqlca = number of unique lowercase alphabetic characters in password
uquca = number of uppercase alphabetic characters
uqd = number of unique digits
uqsp = number of unique special characters (anything with a key on the keyboard)
uqxc = number of unique special special characters (alt codes, extended-ascii stuff)
// algorithm parameters, total sizes of alphabet spaces
Nlca = total possible number of lowercase letters (26)
Nuca = total uppercase letters (26)
Nd = total digits (10)
Nsp = total special characters (32 or something)
Nxc = total extended ascii characters that dont fit into other categorys (idk, 50?)
// algorithm parameters, pw strength growth rates as percentages (per character)
flca = entropy growth factor for lowercase letters (.25 is probably a good value)
fuca = EGF for uppercase letters (.4 is probably good)
fd = EGF for digits (.4 is probably good)
fsp = EGF for special chars (.5 is probably good)
fxc = EGF for extended ascii chars (.75 is probably good)
// repetition factors. few unique letters == low factor, many unique == high
rflca = (1 - (1 - flca) ^ uqlca)
rfuca = (1 - (1 - fuca) ^ uquca)
rfd = (1 - (1 - fd ) ^ uqd )
rfsp = (1 - (1 - fsp ) ^ uqsp )
rfxc = (1 - (1 - fxc ) ^ uqxc )
// digit strengths
strength =
( rflca * Nlca +
rfuca * Nuca +
rfd * Nd +
rfsp * Nsp +
rfxc * Nxc ) ^ length
entropybits = log_base_2(strength)
A few inputs and their desired and actual entropy_bits outputs:
INPUT DESIRED ACTUAL
aaa very pathetic 8.1
aaaaaaaaa pathetic 24.7
abcdefghi weak 31.2
H0ley$Mol3y_ strong 72.2
s^fU¬5ü;y34G< wtf 88.9
[a^36]* pathetic 97.2
[a^20]A[a^15]* strong 146.8
xkcd1** medium 79.3
xkcd2** wtf 160.5
* these 2 passwords use shortened notation, where [a^N] expands to N a's.
** xkcd1 = "Tr0ub4dor&3", xkcd2 = "correct horse battery staple"
The algorithm does realize (correctly) that increasing the alphabet size (even by one digit) vastly strengthens long passwords, as shown by the difference in entropy_bits for the 6th and 7th passwords, which both consist of 36 a's, but the second's 21st a is capitalized. However, they do not account for the fact that having a password of 36 a's is not a good idea, it's easily broken with a weak password cracker (and anyone who watches you type it will see it) and the algorithm doesn't reflect that.
It does, however, reflect the fact that xkcd1 is a weak password compared to xkcd2, despite having greater complexity density (is this even a thing?).
How can I improve this algorithm?
Addendum 1
Dictionary attacks and pattern based attacks seem to be the big thing, so I'll take a stab at addressing those.
I could perform a comprehensive search through the password for words from a word list and replace words with tokens unique to the words they represent. Word-tokens would then be treated as characters and have their own weight system, and would add their own weights to the password. I'd need a few new algorithm parameters (I'll call them lw, Nw ~= 2^11, fw ~= .5, and rfw) and I'd factor the weight into the password as I would any of the other weights.
This word search could be specially modified to match both lowercase and uppercase letters as well as common character substitutions, like that of E with 3. If I didn't add extra weight to such matched words, the algorithm would underestimate their strength by a bit or two per word, which is OK. Otherwise, a general rule would be, for each non-perfect character match, give the word a bonus bit.
I could then perform simple pattern checks, such as searches for runs of repeated characters and derivative tests (take the difference between each character), which would identify patterns such as 'aaaaa' and '12345', and replace each detected pattern with a pattern token, unique to the pattern and length. The algorithmic parameters (specifically, entropy per pattern) could be generated on the fly based on the pattern.
At this point, I'd take the length of the password. Each word token and pattern token would count as one character; each token would replace the characters they symbolically represented.
I made up some sort of pattern notation, but it includes the pattern length l, the pattern order o, and the base element b. This information could be used to compute some arbitrary weight for each pattern. I'd do something better in actual code.
Modified Example:
Password: 1234kitty$$$$$herpderp
Tokenized: 1 2 3 4 k i t t y $ $ $ $ $ h e r p d e r p
Words Filtered: 1 2 3 4 @W5783 $ $ $ $ $ @W9001 @W9002
Patterns Filtered: @P[l=4,o=1,b='1'] @W5783 @P[l=5,o=0,b='$'] @W9001 @W9002
Breakdown: 3 small, unique words and 2 patterns
Entropy: about 45 bits, as per modified algorithm
Password: correcthorsebatterystaple
Tokenized: c o r r e c t h o r s e b a t t e r y s t a p l e
Words Filtered: @W6783 @W7923 @W1535 @W2285
Breakdown: 4 small, unique words and no patterns
Entropy: 43 bits, as per modified algorithm
The exact semantics of how entropy is calculated from patterns is up for discussion. I was thinking something like:
entropy(b) * l * (o + 1) // o will be either zero or one
The modified algorithm would find flaws with and reduce the strength of each password in the original table, with the exception of s^fU¬5ü;y34G<, which contains no words or patterns.