Using Data Studio’s revamped data blending feature to explore how Wordle has changed after NYT’s acquisition
The current internet sensation Wordle was recently acquired by The New York Times for an amount in the low seven figures. The first noticeable change is that Wordle has found a new home in NYT’s game hub.
However, people are wondering if there are more subtle changes under the hood:
- Have any words been removed? ➖
- Have any new words been added? ➕
- Which words remain the same?
- Has the game become more difficult to play now? 😓
In a totally unrelated manner, and almost at the same time, Google’s Data Studio launched new data blending improvements.
Immediately something popped into my head: is it possible to blend the two to answer those pressing Wordle questions?
Surprisingly… the answer is yes!
As I explained in my previous story that discussed how to improve your winning odds just by using three specific words, Wordle uses two different lists of curated words:
- Solutions: the words that solve the daily game, the most common ones. For example: fresh, crust, stool…
- Guesses: words that you can use to solve the game but are never the solution itself, many of them (quite) obscure. For example: kauru, sekts, zazen…
Wordle is coded to pick a word from the solution list based on today’s date: that’s what makes it possible for everybody to play the same word each day. Let’s suppose the solution list looks something like this:
word_1, word_2, word_3, word_4, word_5, word_6, word_7, ... word_n
Say Wordle picked
word_3 as the 1st February’s solution, then
word_4 will be 2nd February’s solution,
word_5 will be 3rd February’s solution and so on and so forth. This fact will come in handy later on to determine if Wordle’s difficulty has changed.
Data blending has been possible in Data Studio since 2018. However, it was limited to left outer joins. With the latest improvements, the following join operators are now available:
- Left outer join: Returns matching rows from the right table, plus non-matching rows from the left table.
- Right outer join: Returns matching rows from the left tables, plus non-matching rows from the right table.
- Inner join: Returns only matching rows from the left and right tables.
- Full outer join: Returns all matching rows from the left tables or the right table.
- Cross join: Returns every possible combination of rows from the left and right tables.
If we consider the original Wordle list of words as a blue set, the NYT list of words as a yellow set and we intersect them, we get this diagram:
The different colors can be interpreted as:
- blue: words removed from the NYT version of the game
- green: words common to both versions
- yellow: words added by the NYT version of the game
There is a small problem though: if you check the image above, you will notice there is no join that resembles the blue or yellow shape. Fortunately, there is a simple workaround: use a full outer join combined with a filter.
Let’s see how a full outer join would work in a simple example first. Consider the following two tables:
The full outer join of these two tables looks like this:
The different colors can be interpreted as:
- blue: words only found in Table A
- green: words common to both Table A and Table B
- yellow: words only found in Table B
Notice how all the rows in blue have Word B equal to NULL. Likewise, all rows in yellow have to Word A equal to NULL. And this is our cue to create a filter that will exclude rows to compose the blue or yellow shape.
Let’s replicate those steps in Data Studio to compose the blue shape:
- Create a new blend called Classic vs NYT that combines the Original and NYT tables and select all fields from both tables:
2. Select the full outer join operator and the join condition on the word field from each table:
3. Create a chart (for example a table) that uses the blended data source and select the words from Wordle original:
4. Create a filter on the blend created earlier that just includes rows where the word from the NYT table is Null:
5. Finally, apply the filter to a chart that uses the blend:
Notice that the chart was displaying words from Wordle original and the filter was removing rows where there was no NYT match: words that are not present in Wordle NYT (removed)
Conversely, if I need to recreate the yellow shape, I just need to follow the same steps except that in step 4 I need to create a filter that just includes rows where the word from the Original table is Null:
Now that we understand how Wordle works, the different kinds of joins that we can use, and the way to configure Data Studio the way we need, the next order of business is extracting the word lists from both games.
Since the original Wordle web page no longer exists and instead redirects to The New York Times games hub, I need to resort to a cached version of it. I will also use a cached version of the new Wordle so that the code I post here will also work for you if NYT decides to change the game code.
Whenever I need a cached version of a web page I immediately think of the Internet Archive. They are a non-profit building a digital library of Internet sites and other cultural artifacts in digital form. Today they have 25+ years of web history accessible through the Wayback Machine. Let’s put it to good use:
Searching the original URL (https://powerlanguage.co.uk/wordle/), Wayback Machine shows that the page was cached several times, even several times a day.
Fun fact: According to Wayback Machine snapshots, Wordle was moved to The New York Times games hub on Feb 10, 2022 sometime between 18:56:02 GMT and 20:33:54 GMT.
As usual, I’ll be using Google Cloud Shell as my serverless environment to extract words:
When the script is run, the output looks like this:
Fetching game file...
Number of words extracted from Wordle original:
Running the script produces this output:
Fetching game file...
Number of words extracted from Wordle NYT:
Aha! The numbers don’t match. Let’s find answers!
Data Studio is able to connect to multiple data sources: Google Analytics, Google Ads, BigQuery, Google Sheet, MySQL, Microsoft SQL Server and many more.
For this exercise, I will use the File Upload connector which takes CSV files like the ones generated when extracting words from Wordle:
So let’s create a couple of data sources by uploading
Pro tip: The File Upload connector is able to combine several CSV files provided they have the same schema.
Let’s try this feature by creating a new data source that combines both CSV files:
Now that the data sources are in place, let’s build a quick chat to check all data has been correctly uploaded:
Perfect, everything looks good. And just by looking at the data that was loaded, we can see the NYT’s version of the game has 25 less words: 6 solutions and 19 guesses.
We have everything we need, let’s start answering Wordle questions!
Using the full outer join operator plus an NYT word is a null filter, the following table shows any removed words:
The NYT has removed 25 obscure, insensitive and offensive words from the original Wordle game. I have redacted them but you can easily fill in the missing letters if you are really curious…
Again, using the full outer join operator plus an original word is a null filter, the following table shows any added words:
Turns out, NYT has not added any new words. At least not in the cached page I fetched from the Internet Archive.
Now let’s take a look at the words that remain the same in NYT. Have they been changed in any way? Have any guess words been promoted to solution words? Conversely, have any solution words that have been demoted to guess words?
Comparing the “word type” from both original and NYT yields these answers. Word’s type has not been changed either:
OK, let’s a recap: so far we know NYT has removed some words, has not added any and no types have been changed.
So… if NYT has just removed 25 words, is there any way they could make the game more difficult to play?
From time to time, Wordle players will complain that today’s word was hard. Maybe it is a rare word not used often. Whatever the reason, people are unanimous: today was difficult!
The only way NYT could possibly alter the game at this point is by reordering the words to put the most difficult first. How do we check this theory?
When I extracted the words from both games I took the time to add a number to each word that indicated their sequence in the complete list.
The solution words from the original Wordle (
wordle_original.csv) look like this:
Similarly, the solution words from NYT (
wordle_nyt.csv) look like this:
I have redacted words that have not been played yet to prevent spoilers…
Comparing the sequence number from the original list to NYT list it is easy to see which words, if any, have been reordered. Furthermore, calculating the delta (difference) between the two sequence numbers really gives it away.