Scrabble CD-ROM Word Lists

Screenshot of Scrabble 1.0 showing an easter egg: searching the dictionary for "FEINBERG" which returns the text "FEINBERG JIM: Karting fool. Or is that fool karting? Special thanks to Allen "The Fudge Man", Matt, Marilyn and Ceaser Feinberg. Go Ernie! RIP Friskie."

Hasbro’s Em@il Games Scrabble does not come with a dictionary. Instead, checking words for validity is done by the server, and only when a player challenges a play. This was probably for a few reasons – reducing the client download size, allowing Hasbro to update it as needed, and as a safeguard to keep the word list out of public circulation.

This means I need to provide my own dictionary for Hasbro PBEM Proxy to support the game. It seemed a simple enough problem: just get a word list from somewhere, and hook it up. But which one? I started by using CSW21, the latest (at the time) list of competitively accepted words. Yet this bothered me, as it seemed anachronistic: the game was released on Feb. 5, 1999, so shouldn’t the word list be equally contemporary?

Thus begins another great (and probably foolish) journey into software archaeology. “What dictionary were people using in 1999?” turns out to be a complicated question: different lists were used for casual vs. tournament play, US vs. UK, short vs. long words, and so on, and unifying them was not really a priority at the time. Plus, some of the word lists remain unavailable unless you are a paying member of NASPA or another professional Scrabble organization. Besides, even if I had the word lists, is there any certainty that those were the ones backing the Em@il Games service?

I decided to try a different approach. Presumably, Hasbro had provided the dictionary to the developers that they wanted to be used. It makes sense that they would use the same dictionary in other branded Scrabble products at the time. In 1996 Random Games and Hasbro released a version of Scrabble on CD-ROM, helmed by lead developer Brian Sheppard (who had independently created a top performing shareware Scrabble AI called “Maven” two years before, and been hired by Hasbro to run this project — after discontinuing his own version, of course!) Three years later, “Scrabble Version 2.0” came out, with a complete redesign of the UI and a higher resolution gameplay. This version did not run well on Windows XP, so in 2002 Hasbro released “Scrabble Complete”, a slight rebuild of 2.0 with XP support and a few new background images.

As these products neatly bracket the release of Em@il Games Scrabble, I sought to get the word lists from each, and see how they stacked up to dictionaries in use.

This presents a new problem: after installing the games to a virtual machine, I’m left scratching my head… where the heck are the words? I expected a “WORDS.TXT” simple listing, but no such luck. There are a couple of likely places – “wordlist.daw”, which is a binary file that looks like a bunch of offsets, and “strings.dat” (or later .fil) containing a formatted bunch of strings. Perhaps wordlist.daw indexes into strings.dat?

Screenshot of hex editor with a record for "ABACUS" highlighted, showing its definition and associated words.

The format of STRINGS.DAT from the 1996 game was straightforward: a definition (with trailing space), followed by one (or more) words it applies to, all null terminated. Pulling these out with a bit of Perl, I find 100,477 words mapping to 45,495 definitions. Not bad… but something’s not right here. For one thing, Scrabble dictionaries usually have somewhere around 200,000 words in them, so I’m only getting about half of what I expected to find. Second… there are easter eggs in here! The game developers snuck in some entries for themselves that will show a dedication or small message when searched in the dictionary. At the same time, trying to actually play one of these on the board gets challenged, so it doesn’t look like the in-game dictionary is the true source of valid words.
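
As a rough illustration of that extraction, here is a minimal sketch. It assumes the layout is exactly as described (a NUL-terminated definition ending in a space, followed by NUL-terminated words) and ignores any header or extra fields the real file may have. The sub name parse_strings and the sample definitions are made up for the example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Assumed layout (inferred from the hex dump, not from any spec):
#   "definition text \0WORD\0WORD\0definition text \0WORD\0..."
# A string ending in a space starts a new record; the strings that
# follow, up to the next definition, are the words it applies to.
sub parse_strings {
  my ($data) = @_;
  my (%def_for, $current);

  for my $s (split /\0/, $data) {
    next if $s eq '';
    if ($s =~ / \z/) {
      # a trailing space marks a definition; strip it and remember it
      ($current = $s) =~ s/ \z//;
    } elsif (defined $current) {
      $def_for{lc $s} = $current;
    }
  }
  return \%def_for;
}

# Tiny in-memory sample standing in for the real file contents
# (the definitions here are invented placeholders).
my $sample = "a calculating device \0ABACUS\0ABACI\0ABACUSES\0"
           . "toward the stern \0ABAFT\0";
my $defs = parse_strings($sample);

printf "%d words\n", scalar keys %$defs;
print "$_: $defs->{$_}\n" for sort keys %$defs;
```

To try it on a real file, the whole file would be slurped in raw mode and passed to parse_strings in place of $sample.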

Scrabble Complete’s STRINGS.FIL is in a slightly different format and even worse shape: there are duplicate words (e.g. “spurred” appears both under the definition for “spur” and as its own entry) AND words with multiple definition entries (e.g. “vacuum” as a noun and a verb). Most of the easter eggs are gone, except Brian Sheppard’s, which appears on a search for either his first or last name. Still, there are only 100,497 distinct word entries.

It’s time to revisit WORDLIST.DAW. The file differs between versions, but the format looks similar. My first instinct, that it’s offsets into STRINGS.DAT or similar, doesn’t lead anywhere, so I start wondering whether it’s some sort of baked data structure that is fast for doing word lookups (a binary tree, maybe?). It looks like 4-byte entries, and after a bit of squinting I notice that the last byte of every dword is a letter “a” through “z”. Well, this idea looks more promising.

Screenshot of hex editor showing repeated 4-byte records, of which every 4th byte is an ASCII letter

At this point I am wondering, “is this a solved problem already?” There may be an existing format or structure designed for compact word storage, fast search on possible endings, and so on. Searching for terms like “DAW Scrabble” and “Scrabble word data structure” eventually leads me to some really useful information on the “Directed Acyclic Word Graph”, or DAWG for short. This is a graph where nodes are individual letters, with transitions to the next letters, and a bit indicating whether a valid word can end at this point. This way “CAT” and “CATS” can be stored in a single graph, C -> A -> T -> S, with the T and S nodes marked as “terminal”. Walking the graph and printing the string at every terminal point unpacks the entire thing to a full word list.

Diagram of a DAWG, showing the packing of 12 words into a directed graph of 8 nodes.
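
To make the idea concrete, here is a small hand-built sketch in Perl. It is not the on-disk format, just nested hashes where each letter maps to a next node and a terminal flag. The variable names and the words_from sub are made up for the example. Note how BAT/BATS and CAT/CATS share the same tail nodes, which is what separates a DAWG from a plain trie.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Each node maps a letter to [ next node (or undef), terminal flag ].
my $after_t  = { s => [ undef, 1 ] };       # ...TS ends a word
my $after_a  = { t => [ $after_t, 1 ] };    # ...AT ends a word, may continue
my $after_bc = { a => [ $after_a, 0 ] };    # shared by both B and C
my $root     = { b => [ $after_bc, 0 ], c => [ $after_bc, 0 ] };

# Depth-first walk, collecting the string at every terminal point.
sub words_from {
  my ($node, $prefix) = @_;
  my @found;
  for my $letter (sort keys %$node) {
    my ($next, $terminal) = @{ $node->{$letter} };
    my $str = $prefix . $letter;
    push @found, $str if $terminal;
    push @found, words_from($next, $str) if $next;
  }
  return @found;
}

my @words = words_from($root, '');
print "$_\n" for @words;
```

Four nodes are enough to hold four words here; the savings grow quickly on a real word list, where many words share common endings.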

The data structure was proposed in a 1988 paper called “The World’s Fastest Scrabble Program”, by Andrew W. Appel and Guy J. Jacobson. It later found its way into Maven and from there into official Hasbro Scrabble games. As a graph can be stored on disk / in an array using only a few pieces of information (an index to the list of possible next letters, a bit indicating when the sub-list terminates, a terminal bit, and a letter), I’m now armed with enough info to knock together an unpacker for the Scrabble .daw files.

#!/usr/bin/env perl
use strict;
use warnings;

my @dawg = ();

# read the .daw file and unpack its bits
@ARGV == 1 or die "Usage: $0 WORDLIST.DAW\n";
open my $fp, '<:raw', $ARGV[0] or die "Failed to open $ARGV[0]: $!";
while (!eof ($fp)) {
  # read one 4-byte big-endian record
  die "Short read on $ARGV[0]: $!" unless 4 == read($fp, my $buf, 4);

  my $in = unpack('N', $buf);
  my $letter = ($in & 0xFF) ? chr($in & 0xFF) : '';
  my $bit0 = ($in >> 8) & 1;
  my $bit1 = ($in >> 9) & 1;
  my $index = ($in >> 10);

  push @dawg, { idx => $index, term => $bit0, sib => ! $bit1, ltr => $letter };
}

#use Data::Dumper;
#print Dumper(\@dawg);

# recursive function to walk the graph
sub walk
 {
  my ($idx, $prefix) = @_;
  my $v;

  # visit each candidate letter in list,
  #  until the sibling flag is false
  do {
    $v = $dawg[$idx];

    my $str = $prefix . $v->{ltr};
    # print terminals, as these indicate a full word
    if ($v->{term}) { print $str . "\n" };
    # recurse if the index points to another list
    if ($v->{idx}) { walk($v->{idx}, $str) }

    $idx ++;
  } while ($v->{sib});
}

# Scrabble .daw is built "backwards": the full a-z root list occupies
#  the last 26 entries (4 bytes each) of the file
walk(scalar @dawg - 26, '');

Run this on one of the wordlist .daw files and we have listoff!

$ ./dawg2txt.pl WORDLIST.DAW
aa
aah
aahed
aahing
aahs
aal
aalii
aaliis
aals
aardvark
aardwolf
aardwolves
aargh
aarrgh
aarrghh
aas
...
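
Unpacking the whole list is not the only way to use the structure: a single word can be checked by following one path through the node array. The sketch below assumes the same array-of-hashes layout the unpacker builds (ltr, term, sib, idx). The is_word sub and the tiny hand-built @dawg are illustrative only, and the dummy entry at index 0 is an assumption made so that idx 0 can mean "no children".

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hand-built stand-in for the unpacked node array, holding only
# "aa", "aah" and "aas". Entry 0 is a dummy (an assumption of this
# sketch) because idx 0 means "no child list".
my @dawg = (
  { ltr => '',  term => 0, sib => 0, idx => 0 },
  { ltr => 'h', term => 1, sib => 1, idx => 0 },  # children of "aa"
  { ltr => 's', term => 1, sib => 0, idx => 0 },
  { ltr => 'a', term => 1, sib => 0, idx => 1 },  # children of "a"
  { ltr => 'a', term => 0, sib => 0, idx => 3 },  # root list
);
my $root = 4;

# Follow one letter at a time; return 1 only if the path exists
# and its last node carries the terminal bit.
sub is_word {
  my ($dawg, $idx, $word) = @_;
  my @letters = split //, lc $word;
  return 0 unless @letters;

  while (1) {
    my $want = shift @letters;
    my $v;
    # scan the sibling list for the wanted letter
    while (1) {
      $v = $dawg->[$idx];
      last if $v->{ltr} eq $want;
      return 0 unless $v->{sib};
      $idx++;
    }
    return $v->{term} ? 1 : 0 unless @letters;
    return 0 unless $v->{idx};
    $idx = $v->{idx};
  }
}

for my $w (qw(aa aah aas a ab aahs)) {
  printf "%-5s %s\n", $w, is_word(\@dawg, $root, $w) ? 'valid' : 'invalid';
}
```

With a real file, the starting index would be the same one the unpacker uses for the root list, 26 entries before the end of the array.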

Now that it’s shown to work, let’s look at the word-length distributions to get a better idea of what’s inside.

Letters per word  Scrabble (1996)  Scrabble 2.0 (1999)  Difference
2                              97                   96          -1
3                             969                  964          -5
4                            3878                 3876          -2
5                            8597                 8598           1
6                           15183                15190           7
7                           23049                23061          12
8                           28376                28387          11
9                           24748                24785          37
10                          20199                20188         -11
11                          15418                15403         -15
12                          11258                11269          11
13                           7768                 7779          11
14                           5096                 5100           4
15                           3177                 3179           2
(subtotal)                 167813               167875          62
16                           1940                    0       -1940
17                           1127                    0       -1127
18                            596                    0        -596
19                            327                    0        -327
20                            160                    0        -160
21                             62                    0         -62
22                             31                    0         -31
23                             13                    0         -13
24                              9                    0          -9
25                              2                    0          -2
26                              0                    0           0
27                              2                    0          -2
28                              1                    0          -1
TOTAL                      172083               167875       -4208
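
Counts like these can be produced with a few lines of Perl. The following is only a sketch of one way to do it, not the script actually used for the table. The sub names are made up, and the two in-memory lists are stand-ins (the first words printed by the unpacker, and the same list with one word arbitrarily removed). Real use would read the two unpacked text files instead.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Count how many words there are of each length.
sub length_histogram {
  my %count;
  $count{ length $_ }++ for @_;
  return \%count;
}

# Print one row per length: count in each list, then the difference.
sub print_comparison {
  my ($old, $new) = @_;
  my %lengths = map { ($_ => 1) } keys %$old, keys %$new;
  for my $len (sort { $a <=> $b } keys %lengths) {
    my $o = $old->{$len} // 0;
    my $n = $new->{$len} // 0;
    printf "%2d %7d %7d %+6d\n", $len, $o, $n, $n - $o;
  }
}

# Stand-in data, not the real word lists.
my @list_a = qw(aa aah aahed aahing aahs aal aalii aaliis aals
                aardvark aardwolf aardwolves aargh aarrgh aarrghh aas);
my @list_b = grep { $_ ne 'aarrghh' } @list_a;

my $hist_a = length_histogram(@list_a);
my $hist_b = length_histogram(@list_b);
print_comparison($hist_a, $hist_b);
```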

Immediately we see a major source of differences: while Scrabble 2.0 has no words over 15 letters, the original game does recognize 4,270 longer words, up to the 28-letter word “ethylenediaminetetraacetates”! Keep in mind, a Scrabble board is only 15 rows by 15 columns in size, meaning it is literally impossible to play a word over 15 letters long. So why does the game include these? The answer may lie in the way that word lists were sourced in 1996. At the time, the standard dictionary used would have been something like the Official Scrabble Players Dictionary (3rd Edition, 1995), which covers words of up to 8 letters with brief definitions, while the longer words come from Merriam-Webster’s Collegiate Dictionary (10th Edition). Since the latter dictionary is not Scrabble specific, it does not limit itself to 15 letters, and so the impossibly long ones make it into the game.

Two screenshots of Scrabble 2.0, the first showing "no definition available" for "ZOOLOGIST", the second showing the definition of "zoology" for "ZOOLOGIES"

In fact, parsing the dictionary entries was a hint after all: the “missing” definitions all correspond to words at least 9 letters in length. For example, “ZOOLOGIST” is a valid word that does not have a definition in the lookup. (But not all 9+ letter words are missing a definition: if their root word is 8 letters or fewer, it may be matched this way. For example, “ZOOLOGIES” has a definition, because its root “ZOOLOGY” is only 7 letters.)

Another clue pointing to early OSPD3 is that there are 97 2-letter words, an unusual number: generally there were 96 for years, until later expansion in OSPD4. The additional word is “DA”, which was cut from later OSPD3 revisions in errata as a “foreign” word, and not re-introduced until OSPD5.

The Scrabble 2.0 game appears to mirror finalized OSPD3 much more closely. Errata from the NASPA Wiki list the deletion of “DA”, and other differences between the two are also mentioned (e.g. removal of “SWIMWEARS” / “NECKWEARS” / “RAINWEARS” / “SKIWEARS”, addition of “LATTE”, “CHEMO”, and “DECLAW”, etc). Some other changes are not accounted for anywhere: for example, 1.0 accepts the word “NIGHTLIVES”, but 2.0 changed the spelling to only accept “NIGHTLIFES” instead. “MICROBREWS” exists in 2.0 but not in 1.0. All told there are 162 words only found in 1.0, and 224 words from 2.0 alone.
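
Finding the words unique to each version is a plain set difference on the two unpacked lists. Here is a minimal sketch, with a hypothetical only_in sub and a handful of the words mentioned above standing in for the full lists:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return the sorted words present in the first list but not the second.
sub only_in {
  my ($first, $second) = @_;
  my %seen = map { (lc($_) => 1) } @$second;
  my @unique = grep { !$seen{ lc($_) } } @$first;
  return sort @unique;
}

# Stand-in samples; real use would load the two unpacked word lists.
my @v1 = qw(da nightlives swimwears zoology);
my @v2 = qw(latte microbrews nightlifes zoology);

my @only_v1 = only_in(\@v1, \@v2);
my @only_v2 = only_in(\@v2, \@v1);

print "1.0 only: @only_v1\n";
print "2.0 only: @only_v2\n";
```

On sorted text files, the standard comm utility would give the same answer; the hash approach just avoids needing the input sorted.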

So, back to the original question: is Scrabble 1.0 really using “first edition” OSPD3, while 2.0 uses “final” OSPD3? And if so, where did the long words come from? There are meta-lists that were used for tournaments at the time (like TWL98) which were supersets of OSPD3 or variations of it, and often included the “long words”… but the raw counts produced from either list don’t match up, and certainly TWL98 did not exist in 1996. I considered pulling more dictionaries from additional Scrabble games to build a chronology, but I’m already a couple days into this project, and a friend wisely advised me against getting into an even bigger mess than I’m already in.

Two screenshots of Scrabble 2.0, showing no words matching the pattern "NOTAWORD", but showing a match for the pattern "XXXBRIANXSHEPPA"

In the end, I plan to use the 2.0 list in Hasbro PBEM Proxy, as it appears to be the most contemporary list used by Random Games / Hasbro in Scrabble computer games, and hope that an expert can fill me in on any of the remaining questions.

One last thing: while looking through the new 2.0 word list, I stumbled upon a word that exists in the new revision but nowhere else. If you somehow manage to acquire four “x” tiles, you should be able to spell “xxxbrianxsheppa” and get away with it!

2 thoughts on “Scrabble CD-ROM Word Lists”

  1. admin Post author

    After a bit more research I’ve discovered the rest of the differences.

    Scrabble 2.0 is very nearly using final OSPD3; however, it appears to be an earlier printing in which “GOOK(S)” is omitted and “SQUAW(S)” allowed. Later printings reversed these two, as the first word found an alternate meaning (noun, meaning “goo”) and the second was deemed offensive.

    The rest of the changes are documented here: http://web.archive.org/web/20190203125700/http://home.teleport.com/~stevena/scrabble/expurg.html

    though even that page is missing some of the 9+ letter words, which would not be in OSPD3 to begin with. Cross reference to here, as well: http://www.seattlescrabble.org/expurgOWL2.php

    It should be possible now to build a complete, final OSPD3.txt, which contrasts with TWL98.txt by having exactly 218 fewer words. (OSPD3: 167874 words, TWL98: 168092 words)
