Like many millenials, I grew up playing/watching/collecting Pokémon. Back then, there were only 151, but that was still enough variety to make naming all of them quite difficult - even if you were fan. For parents and the uninterested, the names were quite difficult to keep in one’s head. Now there’s almost 1000 and some of them are quite strange. Looking at you “klink, klang, and klinklang”.
Ah yes, pokechu! The electric mouse Pokémon!
With so many Pokémon now, I thought it would be fun to see if I could generate names that sounded like real Pokémon using ruby.
Naïve Attempt
My first thought was to simply create names of roughly the right length with roughly the right letters. Simply gather these stats from the corpus of actual Pokémon names and then generate names that match these stats.
statistics = {
letter_count_distribution: names.map(&:size)
letter_distribution: names.split("").flatten
}
name = statistics[:letter_count_distribution].sample.times.map do
statistics[:letter_distribution].sample
end.join
puts name
That didn’t work. It produced unpronouncable nonsense.
eareredneemoo
gedgfa
lbphoobng
ureerbeier
airmr
The problem was that this naïve algorithm doesn’t take into account letter to
letter changes. The letters lbph
can’t all appear next to each other.
That sounds like what a Markov Chain is for!
Markov Chains
I’m not an AI or language processing expert, but I vaguely remember covering Markov Chains at uni. Their main use is predication. For instance, predicting which word you mean, given the first few letters you’ve typed.
I wasn’t trying to predict which name the user was typing as they were typing it - I was trying to create new names. So I had to use the concepts of a Markov Chain a little loosely.
At the core of Markov Chains is the endcoding of all the valid ways to get from one letter (or phoneme or word or w/e) to the next, including which letters can start the name and which names can end it. That’s exactly what was missing from the naïve algorithm.
So instead of simply gathering stats on length and letter frequency, I built
up map of valid letter transitions and their relative frequencies. From this
map all I had to do was choose a random letter to start, and then random
subsequent letters until the name ended when it selected nil
.
# create a statistics map of all the valid transitions
markov_statistics = {}
pokemon_names.each do |name|
last_letter = nil
name.split("").each do |letter|
if markov_statistics[last_letter]
markov_statistics[last_letter] << letter
else
markov_statistics[last_letter] = [letter]
end
last_letter = letter
end
if markov_statistics[last_letter]
markov_statistics[last_letter] << nil
else
markov_statistics[last_letter] = [nil]
end
end
# create the name by picking valid letters
pokemon_name = ""
current_letter = statistics[nil].sample
loop do
break if current_letter.nil?
pokemon_name << current_letter
current_letter = statistics[current_letter].sample
end
puts pokemon_name
The markov_statistics
is a hash that encodes all the valid transitions in the
corpus. The key of the hash is the start letter and the values of the hash are
all the letters that might follow the start letter. For letters that ended the
name nil
is included in the value array.
For example:
corpus = ["bulbasaur", "charmander", "squirtle"]
statistics = {
nil => ["b", "c", "s"],
"a" => ["r", "s", "u"],
"b" => ["a", "b"],
...
}
I allowed letters to be included in the value array multiple times so that when
I choose a letter from the value array with .sample
it was weighted to the
frequency of that pair in the corpus. That way more common pairs would be
chosen more often.
This produced better names, but some of them only had one letter in them. How is that valid?
s
munegogunchezatoy
pitulor
kiterlving
higraponeo
Context
The problem was that nil->s
was valid, and so was s->nil
. So, the algorithm
could choose nil->s->nil
- it thinks s
is a perfectly valid name!
This fix for this was to introduce some context. In markov chains, we consider all the letters typed so far. I didn’t want to consider all the letters in the name before choosing the next letter - then it would only produce actual Pokémon names! But, I could consider more that one letter at a time.
# create statistics of valid transitions,
# given a configurable length of context
markov_statistics = {}
pokemon_names.each do |name|
context = []
name.split("").each do |letter|
if markov_statistics[context]
markov_statistics[context] << letter
else
markov_statistics[context] = [letter]
end
context = [context, letter].flatten
context = context.drop(1) if context.size > context_length
end
if markov_statistics[context]
markov_statistics[context] << nil
else
markov_statistics[context] = [nil]
end
end
# create the name by picking valid letters,
# keeping a configurable length of context
pokemon_name = ""
context = []
current_letter = statistics[context].sample
loop do
break if current_letter.nil?
pokemon_name << current_letter
context << current_letter
context = context.drop(1) if context.size > context_length
current_letter = statistics[context].sample
end
puts pokemon_name
In this version markov_statistics
contains a map where the keys are a list of
letters that form some context and the values are the possible next letters e.g.
[b, u, l]->[b, g, ...]
where bul
can be followed by b
or g
etc.
After this, and after playing around with a sensible value for the length of the context, I got some better sounding names.
turne
vespichu
ursarill
As the size of the context increased, it produced actual Pokémon names with greater and greater frequency. Above 5 all the names produced were actual Pokémon, in machine learning this is called over-training.
Comparing
With testing real AI/ML applications, the system is given half of the training or classification data, and the other half is used to test the classification/prediction accuracy. I could do that with my generator.
If I gave the generator half the available Pokémon names, could it generate the other half?
actual_pokemon = corpus.pokemon.shuffle
mid_point = (actual_pokemon.size + 1) / 2
training_data = actual_pokemon[..mid_point]
test_data = actual_pokemon[mid_point..]
algorithm = Markov.new(training_data, context_length: 3)
generated_names = 5000.times.map { algorithm.generate_name }.uniq
test_data_names_generated = test_data
.map { |datum| datum.join("") }
.intersection(generated_names)
puts "Test Data Names Generated: #{test_data_names_generated}}"
Well, the answer was no. It produced roughly 3 of the test data names for every 5000 names produced. As a classification/prediction engine that would be pretty poor performance. But for a fake name generator, that’s not bad. It actually produces names like “chespin” and “porygon” without ever even seeing them before. I call that a win.
Is it a Pokémon?
Can it trick a person? That was the whole point of this after all. To answer that question, I built a quiz app to do just that.
I produced 10,000 names from the generator. This includes some real and some not real names. The app randomly picks one and which you think it is.
Try it now!
Phonemes
Another extension I wanted to try was to consider phonemes as the base unit of the names, rather than just letters. For this I needed to collect a list of phonemes in the English languge and then break the names into groups of letters based on those phonemes.
pokemon_phonemes = pokemon_names.map do |this_pokemon_name|
remaining_pokemon_name = this_pokemon_name
this_pokemon_phonemes = []
loop do
break if remaining_pokemon_name.empty?
phoneme = phonemes.find { |phoneme| remaining_pokemon_name.start_with?(phoneme) }
this_pokemon_phonemes << phoneme
remaining_pokemon_name = remaining_pokemon_name.sub(phoneme, "")
end
[this_pokemon_name, this_pokemon_phonemes]
end.to_h
The resulting hash had entries that looked like this.
{ "charmander" => ["ch", "ar", "m", "a", "n", "d", "er"] }
The hope I had with this extension was to be able to treat certain groups of letters that make a sound (like “ch”) as if it was a single letter in the Markov Chain. I hoped this, in turn, would restrict the possibilities for each phoneme, creating more realistic suggestions.
It didn’t really make too much of a difference, though.
Code
You can find the code, both for the name generator and the quiz on Github.