12 July 2022

Weekend Project: Pokémon Name Generator

ruby

Like many millenials, I grew up playing/watching/collecting Pokémon. Back then, there were only 151, but that was still enough variety to make naming all of them quite difficult - even if you were fan. For parents and the uninterested, the names were quite difficult to keep in one’s head. Now there’s almost 1000 and some of them are quite strange. Looking at you “klink, klang, and klinklang”.

Pikachu

Ah yes, pokechu! The electric mouse Pokémon!

With so many Pokémon now, I thought it would be fun to see if I could generate names that sounded like real Pokémon using ruby.

Naïve Attempt

My first thought was to simply create names of roughly the right length with roughly the right letters. Simply gather these stats from the corpus of actual Pokémon names and then generate names that match these stats.

statistics = {
  letter_count_distribution: names.map(&:size)
  letter_distribution: names.split("").flatten
}

name = statistics[:letter_count_distribution].sample.times.map do
  statistics[:letter_distribution].sample
end.join

puts name

That didn’t work. It produced unpronouncable nonsense.

eareredneemoo
gedgfa
lbphoobng
ureerbeier
airmr

The problem was that this naïve algorithm doesn’t take into account letter to letter changes. The letters lbph can’t all appear next to each other.

That sounds like what a Markov Chain is for!

Markov Chains

I’m not an AI or language processing expert, but I vaguely remember covering Markov Chains at uni. Their main use is predication. For instance, predicting which word you mean, given the first few letters you’ve typed.

I wasn’t trying to predict which name the user was typing as they were typing it - I was trying to create new names. So I had to use the concepts of a Markov Chain a little loosely.

At the core of Markov Chains is the endcoding of all the valid ways to get from one letter (or phoneme or word or w/e) to the next, including which letters can start the name and which names can end it. That’s exactly what was missing from the naïve algorithm.

So instead of simply gathering stats on length and letter frequency, I built up map of valid letter transitions and their relative frequencies. From this map all I had to do was choose a random letter to start, and then random subsequent letters until the name ended when it selected nil.

# create a statistics map of all the valid transitions
markov_statistics = {}

pokemon_names.each do |name|
  last_letter = nil

  name.split("").each do |letter|
    if markov_statistics[last_letter]
      markov_statistics[last_letter] << letter
    else
      markov_statistics[last_letter] = [letter]
    end

    last_letter = letter
  end

  if markov_statistics[last_letter]
    markov_statistics[last_letter] << nil
  else
    markov_statistics[last_letter] = [nil]
  end
end

# create the name by picking valid letters
pokemon_name = ""
current_letter = statistics[nil].sample

loop do
  break if current_letter.nil?

  pokemon_name << current_letter
  current_letter = statistics[current_letter].sample
end

puts pokemon_name

The markov_statistics is a hash that encodes all the valid transitions in the corpus. The key of the hash is the start letter and the values of the hash are all the letters that might follow the start letter. For letters that ended the name nil is included in the value array.

For example:

corpus = ["bulbasaur", "charmander", "squirtle"]

statistics = {
  nil => ["b", "c", "s"],
  "a" => ["r", "s", "u"],
  "b" => ["a", "b"],
  ...
}

I allowed letters to be included in the value array multiple times so that when I choose a letter from the value array with .sample it was weighted to the frequency of that pair in the corpus. That way more common pairs would be chosen more often.

This produced better names, but some of them only had one letter in them. How is that valid?

s
munegogunchezatoy
pitulor
kiterlving
higraponeo

Context

The problem was that nil->s was valid, and so was s->nil. So, the algorithm could choose nil->s->nil - it thinks s is a perfectly valid name!

This fix for this was to introduce some context. In markov chains, we consider all the letters typed so far. I didn’t want to consider all the letters in the name before choosing the next letter - then it would only produce actual Pokémon names! But, I could consider more that one letter at a time.

# create statistics of valid transitions,
# given a configurable length of context
markov_statistics = {}

pokemon_names.each do |name|
  context = []

  name.split("").each do |letter|
    if markov_statistics[context]
      markov_statistics[context] << letter
    else
      markov_statistics[context] = [letter]
    end

    context = [context, letter].flatten
    context = context.drop(1) if context.size > context_length
  end

  if markov_statistics[context]
    markov_statistics[context] << nil
  else
    markov_statistics[context] = [nil]
  end
end

# create the name by picking valid letters,
# keeping a configurable length of context
pokemon_name = ""
context = []
current_letter = statistics[context].sample

loop do
  break if current_letter.nil?

  pokemon_name << current_letter
  context << current_letter
  context = context.drop(1) if context.size > context_length
  current_letter = statistics[context].sample
end

puts pokemon_name

In this version markov_statistics contains a map where the keys are a list of letters that form some context and the values are the possible next letters e.g. [b, u, l]->[b, g, ...] where bul can be followed by b or g etc.

After this, and after playing around with a sensible value for the length of the context, I got some better sounding names.

turne
vespichu
ursarill

As the size of the context increased, it produced actual Pokémon names with greater and greater frequency. Above 5 all the names produced were actual Pokémon, in machine learning this is called over-training.

Comparing

With testing real AI/ML applications, the system is given half of the training or classification data, and the other half is used to test the classification/prediction accuracy. I could do that with my generator.

If I gave the generator half the available Pokémon names, could it generate the other half?

actual_pokemon = corpus.pokemon.shuffle

mid_point = (actual_pokemon.size + 1) / 2
training_data = actual_pokemon[..mid_point]
test_data = actual_pokemon[mid_point..]

algorithm = Markov.new(training_data, context_length: 3)
generated_names = 5000.times.map { algorithm.generate_name }.uniq
test_data_names_generated = test_data
  .map { |datum| datum.join("") }
  .intersection(generated_names)

puts "Test Data Names Generated: #{test_data_names_generated}}"

Well, the answer was no. It produced roughly 3 of the test data names for every 5000 names produced. As a classification/prediction engine that would be pretty poor performance. But for a fake name generator, that’s not bad. It actually produces names like “chespin” and “porygon” without ever even seeing them before. I call that a win.

Is it a Pokémon?

Can it trick a person? That was the whole point of this after all. To answer that question, I built a quiz app to do just that.

I produced 10,000 names from the generator. This includes some real and some not real names. The app randomly picks one and which you think it is.

Try it now!

Phonemes

Another extension I wanted to try was to consider phonemes as the base unit of the names, rather than just letters. For this I needed to collect a list of phonemes in the English languge and then break the names into groups of letters based on those phonemes.

pokemon_phonemes = pokemon_names.map do |this_pokemon_name|
  remaining_pokemon_name = this_pokemon_name
  this_pokemon_phonemes = []

  loop do
    break if remaining_pokemon_name.empty?

    phoneme = phonemes.find { |phoneme| remaining_pokemon_name.start_with?(phoneme) }

    this_pokemon_phonemes << phoneme
    remaining_pokemon_name = remaining_pokemon_name.sub(phoneme, "")
  end

  [this_pokemon_name, this_pokemon_phonemes]
end.to_h

The resulting hash had entries that looked like this.

{ "charmander" => ["ch", "ar", "m", "a", "n", "d", "er"] }

The hope I had with this extension was to be able to treat certain groups of letters that make a sound (like “ch”) as if it was a single letter in the Markov Chain. I hoped this, in turn, would restrict the possibilities for each phoneme, creating more realistic suggestions.

It didn’t really make too much of a difference, though.

Code

You can find the code, both for the name generator and the quiz on Github.