Particularly while testing software, we often find ourselves generating
large numbers of “garbage” records simply to emulate real data.
For example, one might need to create a number of user accounts, each
of which needs some kind of name.
Given the choice, there are several ways one might go about this.
1. Using a number (or a prefix followed by a number). The Lisp built-in
GENSYM works like this; eg, (gensym "counter-") ⇒ #:counter-2145.
2. Concatenating a number of random characters together; eg,
3. Creating a sequence of consonant-vowel pairs; eg, “tesutohece”.
4. Creating a sequence of word-like syllables that follow some
unambiguous clustering rules in English. One pattern I like to re-use is
that each syllable begins with one of ‘p’, ‘y’, ‘f’, ‘g’, ‘r’, ‘l’, ‘d’,
‘t’, ‘n’, ‘s’, ‘j’, ‘k’, ‘b’, ‘m’, ‘w’, ‘v’, followed by a vowel, and
ends with one of ‘p’, ‘f’, ‘g’, ‘r’, ‘l’, ‘t’, ‘n’, ‘s’, ‘j’, ‘k’, ‘x’,
‘b’, ‘m’, ‘w’, ‘v’. This leads to strings like “sanmelfar”.
5. Using actual words, adjoined using some kind of pattern; eg,
adjective + noun. Docker.io, for example, creates container names like
this; eg, “exuberant-curie”.
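The simpler options on this list (1, 2, 3, and 5) each take only a line or two. Here is a minimal Python sketch of them. The function names, the consonant set, and the tiny word lists are all my own placeholders for illustration, not anything prescribed above; substitute whatever suits your project.

```python
import itertools
import random
import string

# Assumed letter sets and word lists; purely illustrative.
CONSONANTS = "bcdfghjklmnprstvwyz"
VOWELS = "aeiou"
ADJECTIVES = ("brisk", "calm", "eager", "gentle", "jolly")
NOUNS = ("badger", "comet", "harbor", "lantern", "willow")


def numbered_name(counter, prefix="user-"):
    """Option 1: a prefix followed by the next number from a counter."""
    return f"{prefix}{next(counter)}"


def random_chars(rng, length=10):
    """Option 2: random lowercase letters concatenated together."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def cv_pairs(rng, pairs=5):
    """Option 3: a sequence of consonant-vowel pairs."""
    return "".join(
        rng.choice(CONSONANTS) + rng.choice(VOWELS) for _ in range(pairs)
    )


def adjective_noun(rng):
    """Option 5: real words joined in an adjective + noun pattern."""
    return f"{rng.choice(ADJECTIVES)}-{rng.choice(NOUNS)}"


# Print one sample of each style.
demo_rng = random.Random()
demo_counter = itertools.count(1)
print(numbered_name(demo_counter))
print(random_chars(demo_rng))
print(cv_pairs(demo_rng))
print(adjective_noun(demo_rng))
```

Passing the random source and the counter in explicitly is just one design choice; it keeps each helper easy to seed or reset inside a test.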
These are in order for a very good reason: Debugging.
Suppose that you’re entering some new records into a database.
They’re garbage records, only intended to live until the end of
your test. You’re not testing the “name” field (for which, naturally,
a full fuzz test with out-of-conformance UTF-8 values, UTF-16
substitution ranges, non-printable code points, de-normalized combining
characters, and so forth will all be necessary) — but you do need to
supply some kind of name so that the records will be valid on the face
of them, in order to get to your actual test.
Now, it’s technically equally valid to choose from any of the five
options listed above. Why on Earth, then, might you care to move lower
on the list than something like GENSYM?
There’s the rub. Sooner or later, a test will fail. If it didn’t ever
happen, we wouldn’t need the tests at all, would we? When that day
comes, almost inevitably, you’ll find yourself slogging through
a post-mortem of your database, your files, your logs, trying to
discover what went wrong.
At that point, all five ways are still technically valid. However, as
you move lower down the list (toward #5), they decrease your
cognitive burden. In other words, the more the strings resemble real
words, the easier it is for your brain to recognize, internalize, and
match against them. Numbers are tough. Garbage strings are just about
as bad. But “wa-wa” or “CV” strings are at least somewhat word-like, and
can be parsed by your brain just a bit more easily. The word-like
syllable patterns are even better, and real words, particularly when
combined in reasonable or meaningful-seeming ways, are what this whole
“reading” thing is all about.
Making things just a little easier for yourself (or whoever it is that
has to deal with the melt-down when it comes) is always a great option,
so move lower on the list whenever you have the opportunity. Pulling in
a full dictionary might not always be an option, but you can at least
throw together a little “random pronounceable string” function for your
toolkit.
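As a rough idea of how little work that is, here is a minimal Python sketch of the syllable pattern from option 4 in the list. The onset and coda letters are the ones given there; the vowel set (a, e, i, o, u), the function name, and the three-syllable default are my assumptions.

```python
import random

# Syllable-initial and syllable-final consonants from option 4 of the list.
ONSETS = "pyfgrldtnsjkbmwv"
CODAS = "pfgrltnsjkxbmwv"
# Assumed vowel set; the list only says "a vowel".
VOWELS = "aeiou"


def pseudoword(rng, syllables=3):
    """Build a word-like string from onset + vowel + coda syllables."""
    return "".join(
        rng.choice(ONSETS) + rng.choice(VOWELS) + rng.choice(CODAS)
        for _ in range(syllables)
    )


# A seeded generator makes the same names come out on every run,
# which may help when comparing logs from two runs of the same test.
rng = random.Random(2024)
for _ in range(3):
    print(pseudoword(rng))
```

With three syllables this produces nine-letter strings shaped like “sanmelfar”. Nothing here guarantees uniqueness, so if your records need distinct names you would have to track the ones already handed out.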
Taken conversely: if you do not spend the five minutes now to put
together a pseudoword generator, you are intentionally handicapping
your brain's ability to recognize the data patterns that you have
created. That increases the time you will need to spend debugging, and
it pulls your concentration away from the task at hand in order
to focus on, and repeatedly re-verify the matching of, sequences which
would otherwise be completely transparent to you.
By the way, the effects are not only measurable, but have been measured.
Take a look at, eg, “Better the DVL You Know: Acronyms Reveal the
Contribution of Familiarity to Single-Word Reading” by Laszlo and
Federmeier (Psychol Sci, Feb 2007;
NIHMS109307) or “The acronym superiority effect” (same authors).
The electrical activation levels recorded at the middle
parietal site (graph on p. 8) show similar brain activity for words,
familiar acronyms, and pseudowords, but the activity for random garbage
strings corresponds very poorly with any of them. The graph on the
following page shows
striking differences in recognition of repeated presentations of the
same patterns; illegal strings and unfamiliar acronyms are much more
poorly recognized than words or pseudowords.
Take a look at this:
On the right, we see that the first time an illegal string is presented,
you spend a lot of mental work trying to make sense of it. When it’s
repeated, though, you have nearly no hope of actually recognizing
Don’t handicap yourself needlessly. Masochism does not make for good