Wordle uses two different sets of words:
- Solutions: The words you have to guess, mostly everyday words.
- Guesses: Words you can enter as a guess but that are never the answer; some of them are rather obscure.
To find the solution and guess words, I turned to the game’s JavaScript source code file:
Solutions: ["cigar","rebut","sissy", … ,"artsy","rural","shave"]
Guesses: ["aahed","aalii","aargh", … ,"zygon","zymes","zymic"]
Extracting them and saving them to a CSV file is easy:
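As a rough illustration of that extraction step, here is a minimal Python sketch. It is not the original script: the `js_source` string and the variable names `solutions` and `guesses` inside it are made-up stand-ins for the real JavaScript file, and only the six words quoted above are included.

```python
import csv
import io
import re

# Hypothetical stand-in for the relevant part of the game's JavaScript file.
# The real file holds the full arrays; the variable names here are made up.
js_source = 'var solutions=["cigar","rebut","sissy"],guesses=["aahed","aalii","aargh"];'


def extract_words(source: str, name: str) -> list[str]:
    """Return the five-letter words in the array assigned to `name`."""
    match = re.search(re.escape(name) + r'=\[([^\]]*)\]', source)
    if match is None:
        return []
    return re.findall(r'"([a-z]{5})"', match.group(1))


def to_csv(source: str) -> str:
    """Build CSV text with one (word, type) row per word, all fields quoted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for name, kind in (("solutions", "solution"), ("guesses", "guess")):
        for word in extract_words(source, name):
            writer.writerow([word, kind])
    return buffer.getvalue()


csv_text = to_csv(js_source)
print(csv_text, end="")
```

Writing `csv_text` to a file gives a two-column CSV (word, type) that can be loaded into a table later on.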
The output is:
First three solution words:
"cigar","solution"
"rebut","solution"
"sissy","solution"
Last three guess words:
"zygon","guess"
"zymes","guess"
"zymic","guess"
Number of words per type:
2315 "solution"
10657 "guess"
The Wordle dictionary consists of 12,972 words that you can use to play the game: 2,315 solutions and 10,657 guesses.
I will create a new Google BigQuery dataset called Word to keep things tidy. Then I will load these words into a table (raw_words) to start the exploration.
Let me quickly check that all the data has been loaded correctly into Google BigQuery:
Since Wordle displays everything in uppercase and the extracted words are in lowercase, I will quickly create a new table (normalized_words) with all words uppercased to make things look nice. I’ll keep the original table because storage is cheap in Google BigQuery:
The next step is to create a lookup table (letters) with all the different letters in all the words (solutions and guesses).
I will pre-compute a number (letter_bitmask) for each letter. This will make my life easier down the road.
This table will look like this:
As you can see in the image above, the original Wordle uses 26 letters, so think of a letter’s bitmask as a string of 25 zeros and exactly one 1, whose position moves from letter to letter.
If I read these zeros and ones as a binary number and convert it to decimal, I get the bitmask for the letter:
A = 00000000000000000000000001 = 1
B = 00000000000000000000000010 = 2
C = 00000000000000000000000100 = 4
D = 00000000000000000000001000 = 8
E = 00000000000000000000010000 = 16
F = 00000000000000000000100000 = 32
...
X = 00100000000000000000000000 = 8,388,608
Y = 01000000000000000000000000 = 16,777,216
Z = 10000000000000000000000000 = 33,554,432
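The same per-letter values can be reproduced with a bit shift. The following is a small Python sketch, not the BigQuery query used for the table; the function name `letter_bitmask` is borrowed from the column name, and the dictionary `letters` only mimics the lookup table.

```python
import string


def letter_bitmask(letter: str) -> int:
    """Return 2**position, where A is position 0 and Z is position 25."""
    return 1 << (ord(letter.upper()) - ord("A"))


# A small in-memory lookup, similar in spirit to the `letters` table.
letters = {letter: letter_bitmask(letter) for letter in string.ascii_uppercase}

# Show each mask as a 26-digit binary string next to its decimal value.
for letter in "ABCXYZ":
    print(letter, format(letters[letter], "026b"), letters[letter])
```

Each letter gets its own bit, so the masks of different letters can be combined without ever colliding.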
At this point, I can start analyzing the Wordle dictionary. I’ll calculate a few things first:
- Length: In the original Wordle, this would always be 5. No surprises here.
- Word bitmask: The number that encodes the unique letters used by the word.
- Unique characters: How many different letters the word uses.
Let me briefly explain how the word bitmask is calculated.
I start by splitting the word into its five letters, then remove any duplicates, and finally add up the single-letter bitmasks of the remaining letters. Some examples:
BITMASK(ADDED) → {A,D,D,E,D} → {A,D,E} → {1,8,16} → 25
BITMASK(FADED) → {F,A,D,E,D} → {A,D,E,F} → {1,8,16,32} → 57
BITMASK(FACED) → {F,A,C,E,D} → {A,C,D,E,F} → {1,4,8,16,32} → 61
BITMASK(DECAF) → {D,E,C,A,F} → {A,C,D,E,F} → {1,4,8,16,32} → 61
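The steps above can be sketched in a few lines of Python. This is an illustration rather than the query used in the article: `word_bitmask` is a hypothetical helper, and it uses a bitwise OR, which gives the same result as removing duplicates and then adding, because OR-ing a bit that is already set changes nothing.

```python
def word_bitmask(word: str) -> int:
    """Combine the single-letter bitmasks of a word's letters."""
    mask = 0
    for letter in word.upper():
        # Setting an already-set bit is a no-op, so duplicates are ignored.
        mask |= 1 << (ord(letter) - ord("A"))
    return mask


for word in ("ADDED", "FADED", "FACED", "DECAF"):
    print(word, word_bitmask(word))
```

Running it prints 25, 57, 61 and 61, matching the worked examples.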
Note that FACED and DECAF have the same word bitmask because they use the same letters, albeit in a different order. These are the only two words in the Wordle dictionary made up of A, C, D, E, and F:
Fun fact: Wordle lets you play 14 words that can be formed by combining the letters A, D, E, and R, repeating them as needed. How many do you know? (Full list at the end.)
Counting the number of unique characters a word uses is very easy: I just need to count how many 1 bits are in the word’s bitmask.
Google BigQuery has the BIT_COUNT(expression) function, which does exactly what its name says: it counts the number of 1 bits in expression.
Let’s take some bit masks for words and their binary equivalents and count the ones:
BIT_COUNT(BITMASK(ADDED)) = BIT_COUNT(25) = BIT_COUNT(0b011001) = 3
BIT_COUNT(BITMASK(FADED)) = BIT_COUNT(57) = BIT_COUNT(0b111001) = 4
BIT_COUNT(BITMASK(FACED)) = BIT_COUNT(61) = BIT_COUNT(0b111101) = 5
BIT_COUNT(BITMASK(DECAF)) = BIT_COUNT(61) = BIT_COUNT(0b111101) = 5
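For readers who want to check these numbers outside BigQuery, here is a minimal Python sketch of the same idea. The helper names `bit_count` and `unique_characters` are hypothetical and only mirror what BIT_COUNT does on a word bitmask.

```python
def bit_count(value: int) -> int:
    """Count the 1 bits in the binary representation of a non-negative int."""
    return bin(value).count("1")


def unique_characters(word: str) -> int:
    """Build the word's bitmask, then count its 1 bits."""
    mask = 0
    for letter in word.upper():
        mask |= 1 << (ord(letter) - ord("A"))
    return bit_count(mask)


for word in ("ADDED", "FADED", "FACED", "DECAF"):
    print(word, unique_characters(word))
```

On Python 3.10 or later, the built-in `int.bit_count()` method could replace the `bin(...).count("1")` trick.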
Let’s see how many solution words there are given the number of unique characters:
This information may come in handy when you only have one turn left and can’t decide which word to play as your last guess.
Out of every 3 solution words, roughly 2 have 5 unique letters and 1 has 4. Therefore, when in doubt, the more likely answer is the word with 5 unique letters.
Maybe it’s just me, but the fewer unique letters a word has, the harder I find it to solve.
Fun fact: There is only one solution that uses just two different letters. Don’t worry, I won’t spoil the surprise.
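The tally behind these proportions is a simple group-and-count. The sketch below shows the mechanics in Python on the six solution words quoted at the start of the article; that sample is far too small to reproduce the real distribution, which needs all 2,315 solutions, and the names `sample_solutions` and `unique_characters` are assumptions for illustration.

```python
from collections import Counter

# Only the six solution words quoted earlier; the real tally would run over
# the full list of 2,315 solutions.
sample_solutions = ["cigar", "rebut", "sissy", "artsy", "rural", "shave"]


def unique_characters(word: str) -> int:
    """Number of different letters in the word."""
    return len(set(word.upper()))


# Map "number of unique letters" to "how many words have that many".
distribution = Counter(unique_characters(word) for word in sample_solutions)

for count in sorted(distribution):
    print(count, distribution[count])
```

Swapping the sample for the complete solution list would produce the full breakdown by number of unique letters.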