Building up the model

First idea of a music composition model with recurrent NN (abandoned)

I have been thinking a lot about the control model (i.e. the behavior) of the object. The model should meet the following objectives:

  • Work within the constraints of the system
  • Create a good experience of interaction
  • Represent the loneliness/company-seeking agenda of the object
  • Produce a variety of sounds/interactions

Different ideas

I started by setting a constraint on the type of sound the device could produce. I have chosen to work with simple combinations of wave signals. I have read the paper Creating Melodies with Evolving Recurrent Neural Networks, which uses a neural network to synthesize music scores that "feel like" Bartok's. However, the method generates "fixed" notes (i.e. it generates half-tones), so the result would be really musical, and I'm afraid it would make the device lose a bit of its "personality".

I preferred an approach where I would generate utterance-like sounds. I thought about having the object generate "triplets". The triplets would be defined by 4 components: an amplitude and 3 frequencies. When the action is taken, the object plays all 3 frequencies in order, according to the amplitude.
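To make the triplet idea concrete, here is a minimal sketch of how one could be rendered into samples. The tone duration, the sample rate and the function name render_triplet are my own assumptions for illustration; the only things taken from the description above are the single amplitude and the 3 frequencies played in order.

```python
import math

def render_triplet(amplitude, freqs, tone_duration=0.25, sample_rate=8000):
    """Return samples for three sine tones played one after another.

    tone_duration (seconds) and sample_rate (Hz) are assumed values,
    not settings of the actual device.
    """
    if len(freqs) != 3:
        raise ValueError("a triplet needs exactly 3 frequencies")
    n = round(tone_duration * sample_rate)  # samples per tone
    samples = []
    for f in freqs:
        for k in range(n):
            # Same amplitude for all three tones, as in the description
            samples.append(amplitude * math.sin(2 * math.pi * f * k / sample_rate))
    return samples

# Hypothetical triplet: amplitude 0.5, then 440 Hz, 550 Hz and 660 Hz
triplet = render_triplet(0.5, (440.0, 550.0, 660.0))
```

Restarting the phase at zero for each tone keeps the sketch simple; on real hardware this would probably produce small clicks between tones.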

I then thought about an adaptation of the recurrent network model presented in the paper. The idea was to use a recurrent neural network to learn sequences of these triplets, according to the mean amplitude/mean frequency heard from a speaker talking into the mic. I abandoned the idea, however, because it would have taken a lot of effort to implement, taken up half of the RAM, and yielded unpredictable results (imho probably random stuff), since there is no really strong and stable reinforcement attached. The model is shown in the image.

Proposed model

I finally came up with a model that I believe is a good fit and should yield good results. The idea came up after reading a paper about the use of reinforcement learning for Aibo training. I started thinking about using genetic algorithms to generate the triplets and making choices according to a reinforcement learning algorithm called Sarsa.

The idea is the following. At each time t, the object is in a state s_t and can take an action a_t. The action is taken according to a function Q(s,a) in an epsilon-greedy fashion, which means that in state s_t the object will take the action a_t = a for which Q(s_t,a) is maximal with probability (1-epsilon), and take a random action with probability epsilon. The parameter epsilon thus controls the desire of the object for exploration (over exploitation). Once an action is taken, a reward r is issued.
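The epsilon-greedy choice described above can be sketched in a few lines. This assumes the values Q(s_t, a) for the current state are already available as a plain list indexed by action; the function name epsilon_greedy and that representation are assumptions made for the example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index given the list of Q(s_t, a) values."""
    if rng.random() < epsilon:
        # Explore: uniform random action (may also pick the greedy one)
        return rng.randrange(len(q_values))
    # Exploit: action with the highest Q value
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon = 0 the object never explores and always takes the best action
choice = epsilon_greedy([0.1, 0.7, -0.3], epsilon=0.0)
```

A larger epsilon makes the object more "curious"; a smaller one makes it stick to what has worked so far.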

The possible actions, at each time, are:

  • Keep silent / passive listening
  • Active listening (after emitting tune)
  • Play tune: single tone #i (A_i, f_i, T_i)
  • Play tune: triplet #j (A_j, f1_j, f2_j, f3_j, T_j)
  • Change tune #i or #j with a genetic algorithm

The state is composed of the following features:

  • Current goal: loneliness / company
  • Current status: alone / not sure / not alone
  • Tune #i or #j was played x seconds ago
  • Tune #i or #j was changed with GA

The rewards are set as follows:

  1. Active actions (like playing tune) yield a reward of -1
  2. Hearing (above threshold) yields a reward of -1 if seeking loneliness and +0 if seeking company
  3. Hearing a responsive sound (i.e. when actively listening) yields a reward of -2 if seeking loneliness and +2 if seeking company

This set of rewards should encourage the following behaviors:

  1. Try to economize your batteries by keeping silent if possible
  2. When you want to be alone, prefer quietness
  3. When you want to be alone, avoid interaction; when you want company, dig it
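As a sanity check, the three reward rules can be written as a small function. Two points are assumptions on my part, since the rules do not spell them out: the rewards of the rules that apply are summed, and a sound heard while actively listening counts only under rule 3, not under rule 2 as well. The function and argument names are hypothetical.

```python
def reward(goal, active_action, heard, actively_listening):
    """Reward for one time step.

    goal: "loneliness" or "company"; the other arguments are booleans.
    Summing the rules is an assumption, not part of the original rules.
    """
    r = 0
    if active_action:
        r -= 1  # rule 1: active actions (like playing a tune) cost energy
    if heard and actively_listening:
        r += 2 if goal == "company" else -2  # rule 3: responsive sound
    elif heard:
        r += 0 if goal == "company" else -1  # rule 2: sound above threshold
    return r

# Seeking company and getting an answer while actively listening
example = reward("company", False, True, True)
```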

Since the state space is quite large, I propose to use a linear function approximation to estimate Q(s,a). This is a standard procedure and is described thoroughly on this page.
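Here is a minimal sketch of what a Sarsa update with a linear approximation of Q could look like. It assumes one weight vector per action and a feature vector phi(s) built from the state features listed earlier; the learning rate alpha, the discount gamma, the function names and the example numbers are all assumptions for illustration, not values chosen for the object.

```python
def q_value(weights, features, action):
    """Q(s, a) as the dot product of the action's weights and phi(s)."""
    return sum(w * x for w, x in zip(weights[action], features))

def sarsa_update(weights, features, action, reward, next_features, next_action,
                 alpha=0.1, gamma=0.9):
    """One Sarsa step: move Q(s, a) toward r + gamma * Q(s', a')."""
    td_error = (reward
                + gamma * q_value(weights, next_features, next_action)
                - q_value(weights, features, action))
    # Gradient of a linear Q with respect to the weights is phi(s) itself
    for i, x in enumerate(features):
        weights[action][i] += alpha * td_error * x
    return td_error

# Hypothetical example: 2 actions, 3 state features, weights start at zero
weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
phi_s = [1.0, 0.0, 1.0]
phi_next = [1.0, 1.0, 0.0]
delta = sarsa_update(weights, phi_s, 0, -1.0, phi_next, 1)
```

Only the weights of the action that was actually taken are changed, and memory use grows with (number of actions) x (number of features) rather than with the number of distinct states.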

See also this paper on reinforcement learning for game fighting.