Crowdsourcing Desirabilities

Magnus Kalkuhl sent me an email suggesting that I crowdsource the hard subjective decisions in Siboot. In other words, instead of writing scripts that make tough dramatic decisions, I would have a select group of people play the game and make all the subjective decisions, and then I would use the database that they create to guide the decisions in the published version of the game.

This idea has come to me in a number of variations over the years, and it entails a number of complexities that I need to explain. I confess that, in my intense concentration on the problems of the moment, I had forgotten that possibility. That’s one reason why I write these essays; every now and then I go back and re-read them just to refresh my memory. In this case, though, I have never written down my thoughts on the crowdsourcing option because I found fatal flaws that caused me to abandon it.

How it would work
My technology organizes the storytelling process into Verbs, Roles, Options, Events, WordSockets, and Acceptability and Desirability scripts. Every Verb has one or more Roles, each of which contains one or more Options, each of which contains the WordSockets defined for that Option. Each WordSocket has an Acceptability script and a Desirability script. The Acceptability script specifies which words are logically possible for the WordSocket; the Desirability script specifies how strongly the situation inclines toward that word.

Example: Joe punches Fred. The Verb ‘punch’ has a Role for the DirObject — in this case, Fred. That Role contains three Options: punch, run, or shoot. Each of those Options has its own Desirability script specifying how likely Fred is to exercise that Option. If Fred has a bad temper and he really hates Joe, then the Desirability for the Option ‘shoot’ will be high. If Fred is a coward, the Desirability for ‘run’ will be high. 
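To make the example concrete, here is a minimal sketch of a Role holding three Options, each with its own Desirability script. Everything in it is an assumption made for illustration: the class names simply mirror the terms above, the trait names and formulas are invented, and picking the highest-scoring Option is a simplification, not a description of how the real engine resolves a choice.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical trait record for the reacting actor, values in [0, 1].
Traits = Dict[str, float]

@dataclass
class Option:
    name: str
    # Stand-in for a Desirability script: maps the actor's traits to a score.
    desirability: Callable[[Traits], float]

@dataclass
class Role:
    name: str
    options: List[Option] = field(default_factory=list)

    def choose(self, traits: Traits) -> str:
        # Simplifying assumption: take the Option with the highest score.
        return max(self.options, key=lambda o: o.desirability(traits)).name

# The DirObject Role of the Verb 'punch', with invented scripts.
dir_object = Role("DirObject", [
    Option("punch", lambda t: 0.5),
    Option("run", lambda t: t["cowardice"]),
    Option("shoot", lambda t: t["temper"] * t["hatred_of_joe"]),
])

angry_fred = {"temper": 0.9, "cowardice": 0.1, "hatred_of_joe": 0.9}
timid_fred = {"temper": 0.2, "cowardice": 0.8, "hatred_of_joe": 0.3}

print(dir_object.choose(angry_fred))
print(dir_object.choose(timid_fred))
```

The hard part is not the structure but the lambdas: each one hides a judgment about human behavior that somebody has to write down.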

Hence I have to go through every single decision and write a script that expresses how likely people are to take various actions. That’s a lot of work and has proven to be very difficult, slowing things down greatly. For example, just now I am stumped over the Desirability of an actor lying to another. How does the possible liar take into account the chances that he might get caught? That’s a horribly complicated decision, and writing a script for it has got me tearing my hair out.

The crowdsourcing approach would take place in three stages. In the first two stages, I gather data for a database that will make the decisions for the actors in stage three.

In the first stage, the Desirability scripts are all set to random numbers and I play the game as each of the actors. My decisions are recorded and entered into a database. Once I have seeded the game with, say, a thousand playings, then I bring in a larger group of trusted people to play the game; their decisions are added to the database. Once I have a big enough database, I can use it to provide the decisions otherwise handled with Desirability scripts. 

Problems
Sounds simple enough, doesn’t it? Well, there are some serious problems with this technique. Consider, for example, what happens if we use the database in its simplest form. For each Role, we have relative percentages for how often each of the Options was chosen by a player. Thus, using the above example, suppose our database shows that Fred punches Joe 50% of the time, runs away 30% of the time, and shoots Joe 20% of the time. Then we just flip a random number and pick the option based on the random number. Sounds simple, no?
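In its simplest form, the lookup is just a weighted random draw over the recorded tallies. A sketch, assuming the tallies for one Role are stored as plain counts (the variable and function names are hypothetical):

```python
import random
from collections import Counter

# Hypothetical tallies of what players chose for Fred in this Role.
fred_counts = Counter({"punch": 50, "run": 30, "shoot": 20})

def pick_option(counts, rng):
    """Draw one Option with probability proportional to its recorded count."""
    options = list(counts)
    weights = [counts[o] for o in options]
    return rng.choices(options, weights=weights, k=1)[0]

# A fixed seed keeps the demonstration repeatable.
rng = random.Random(0)
sample = Counter(pick_option(fred_counts, rng) for _ in range(10_000))
print(sample)
```

Over many draws the sample reproduces the 50/30/20 split, and nothing else: the draw knows nothing about who Fred is or whom he is facing.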

But wait! This doesn’t take into account any of the circumstances. For example, what if Joe has a hot temper but Tom does not? Both Joe and Tom will have the same probability of shooting, but that’s not right: Joe should be more strongly inclined to shoot than Tom. We can’t have all the actors behaving exactly the same way; that denies the concept of personality. So we have to somehow sort the data by personality traits. 

But how? We could sort all the statistics by actor; Joe has one set of probabilities for making the choice, while Tom has another set of probabilities based on his recorded behavior. However, this shrinks our applicable statistical database; we have fewer cases on which to base our decision. We’ll need a larger crowd to source this.

But wait! There’s another problem: Joe likes Tom but hates Fred. If Joe were punched by Tom, he’d react differently from how he’d react to Fred doing the same thing. The relationship between two actors will strongly influence their decisions. We’ll have to break the database down even further — but do we break it down by actor or by affection? That is, do we record the number of times that Joe shot Tom as compared to the number of times that Joe shot Fred? Or do we record the number of times that Joe shot somebody he likes versus the number of times he shot somebody he didn’t like? Do we figure out a linear relationship between amount of affection and probability of shooting? 
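The trade-off can be seen by tallying the same recorded decisions under different keys. This is only a sketch: the six records are invented, and the boolean "likes" flag is a crude stand-in for affection.

```python
from collections import Counter, defaultdict

# Hypothetical records: (actor, other actor, actor likes other?, option chosen)
records = [
    ("Joe", "Tom", True, "run"),
    ("Joe", "Tom", True, "punch"),
    ("Joe", "Fred", False, "shoot"),
    ("Joe", "Fred", False, "punch"),
    ("Tom", "Fred", False, "punch"),
    ("Tom", "Joe", True, "run"),
]

def tally(records, key):
    """Group option counts by whatever key(record) returns."""
    table = defaultdict(Counter)
    for rec in records:
        table[key(rec)][rec[3]] += 1
    return table

by_actor = tally(records, lambda r: r[0])              # one cell per actor
by_pair = tally(records, lambda r: (r[0], r[1]))       # one cell per actor pair
by_affection = tally(records, lambda r: (r[0], r[2]))  # actor x likes/dislikes

for name, table in [("actor", by_actor), ("pair", by_pair), ("affection", by_affection)]:
    sizes = [sum(c.values()) for c in table.values()]
    print(name, "cells:", len(table), "smallest cell:", min(sizes))
```

Every refinement of the key multiplies the number of cells and thins out the cases in each one, which is exactly the shrinking-database problem.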

Behaviorism and its costs
We can answer this question by using a purely behaviorist approach. We don’t postulate anything like ‘affection’; that’s a hidden variable and should simply be dismissed. We look at behavior only, not some imaginary concept like affection. 

But a behaviorist approach requires us to consider the entirety of an actor’s experiences. That is, the entire history of an actor’s life must constitute the basis for the statistic we look up. The question we ask is this: for all recorded cases in which our actor experienced EXACTLY the same events that he has experienced up to this point, what were the percentages for his choices? 

Let’s work some numbers into this. Let’s say that a single actor’s experience in a typical story comprises only 100 events. Let’s say that, for each of these hundred events, the actor had only two options. This simple tree has 2^100 branches; that’s about 10^30 branches. That’s how large our database would have to be to provide us with reliable statistical data. To acquire that much data, we’d need a crowd of a billion people playing the game; if each one played one game per second, it would take them only about 40 trillion years to provide us with the data we need.
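The estimate is easy to check in a few lines. This is only a sanity check of the figures above, assuming a 365.25-day year and exactly one recorded game per player per second:

```python
# 100 binary decisions give 2**100 distinct histories.
branches = 2 ** 100

players = 10 ** 9                     # a crowd of a billion
seconds_per_year = 365.25 * 24 * 3600

# One game per player per second, one game per branch.
seconds_needed = branches / players
years = seconds_needed / seconds_per_year

print(f"{branches:.3e} branches")
print(f"{years:.3e} years")
```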

You have a suggestion: for the early portions of the game, we don’t need so large a database. This raises an even nastier problem. If we’re going to dispense with hidden variables and base everything on behavior, then we need an initial behavioral database before the game begins. If we don’t know anything at all about the actors on the first turn, how are they to make appropriate decisions? They need to start the game with a history of behavior that we can use to guide their decisions.

Looser approaches
OK, so what if we loosen our requirements for statistical applicability? What if we decide that an actor bases his decision on the closest set of conditions for which we have data? This can be done, but it gets hairy. We have to examine every single story in the database and come up with a number representing how closely that story matches the current story; then we use that ‘closeness of fit’ value to weight the decision that the actor made in the historical story. We accumulate all the results and that gives us our decision.
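A rough sketch of that weighting scheme follows. The closeness measure here (the share of aligned events that match) is purely an assumption chosen to keep the example short; choosing a good measure is the real problem.

```python
from collections import defaultdict

def closeness(current, past):
    """Hypothetical fit score in [0, 1]: share of aligned events that match."""
    n = max(len(current), len(past))
    if n == 0:
        return 1.0
    matches = sum(a == b for a, b in zip(current, past))
    return matches / n

def weighted_decision(current, database):
    """database: list of (history, option chosen next) pairs."""
    scores = defaultdict(float)
    for history, choice in database:
        # Each historical decision votes with a weight equal to its fit.
        scores[choice] += closeness(current, history)
    return max(scores, key=scores.get)

# Invented historical stories and the choice made at the end of each.
database = [
    (["insult", "punch"], "shoot"),
    (["insult", "punch"], "punch"),
    (["greet", "punch"], "run"),
    (["insult", "punch"], "shoot"),
]
current = ["insult", "punch"]
print(weighted_decision(current, database))
```

Note that every decision walks the whole database, which is where the cost comes from.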

This will be computationally expensive. Suppose that we build a crowd-sourced database with 10,000 games, and suppose that each of those games has 100 events. That makes a million events that we have to search, and we would probably have to retrieve that database from a hard drive, meaning that this operation will probably take on the order of one second to carry out. We cannot possibly afford to devote that much CPU time to making a single decision. 

OK, perhaps we could come up with some reduction statistic that characterizes each story by a small number of quickly-searched values that we could use for purposes of comparison. Is that possible? Sure, I could come up with something, but would it really work? For example, I could evaluate each story by, say, a few dozen variables, such as the sum of the Verb import values for the story, the number of turns played, how close each player came to winning, how many lies were told, how many deals were made, and so on. When an actor needs to make a decision, we compile the “state variable set” for the story, and carry out a least-squares fit to the state variable set of each of the 10,000 stories in our database, which values can be retained in memory for faster performance. We then make a decision based on the decisions made in the best-fitting stories. 
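Here is a minimal sketch of that matching step. The three state variables, the stored stories, and the choice of taking a majority vote among the k best fits are all assumptions for illustration; a real state variable set would have a few dozen entries and would need its variables scaled to comparable ranges.

```python
from collections import Counter

def squared_distance(a, b):
    """Sum of squared differences between two state variable sets."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def decide(current_state, stories, k=3):
    """Vote among the k stored stories whose state fits the current one best."""
    nearest = sorted(stories, key=lambda s: squared_distance(current_state, s[0]))[:k]
    votes = Counter(choice for _, choice in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical state variable set: (turns played, lies told, deals made),
# paired with the decision the actor made in that story.
stories = [
    ((10, 0, 2), "tell_truth"),
    ((12, 1, 3), "tell_truth"),
    ((40, 6, 0), "lie"),
    ((38, 5, 1), "lie"),
    ((11, 0, 1), "tell_truth"),
]

print(decide((39, 6, 0), stories))
print(decide((11, 1, 2), stories))
```

With 10,000 stories and a few dozen variables each, the whole table fits comfortably in memory, so the scan itself is cheap. Whether the nearest stories are actually the relevant ones is another matter.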

This can surely be built, but how reliable would the results be? How do I know that my state variable set provides matches that are appropriate for the conditions? The only way to find out will be to go through the entire exercise — and if I’m wrong, then the whole effort is for naught.

Problems with the crowd
How can I trust the results coming from the crowd? Each player must accurately represent the behavior of a defined actor. That is, if Joe Schmoe will play as Skordokott, I have to tell him all about Skordokott’s personality and trust him to behave according to that personality. How can I trust Joe Schmoe to do so? What if Joe Schmoe wants to change Skordokott’s personality? Shouldn’t he have the freedom to do so? But if he does, won’t he screw up the database?

Perhaps the database should be set up once by a small, focused crowd. In this approach, I would recruit a small crowd of volunteers who agree to play as specific characters. In other words, I would recruit one group of people to play as Skordokott, another to play as Caroonycorck, and so on. They would follow the lead that I set up, but their actions would behaviorally define the actor. But is there a tradeoff between the size of this defining crowd and the clarity of the personality of the actors? Will not more testers bring different perceptions to the role, resulting in less clearly expressed personalities?

Problems with evolution
What if I want to change something in the game? Suppose, for example, that I add a new Verb to the storyworld. Will that not instantly obsolete the database, which has zero entries for that Verb? How can I improve the storyworld if it is chained to the database? 

Problems with art
If I crowdsource this, am I not abdicating my responsibility as an artist? Is it not my responsibility to define the behavior of the actors? Does not the artistic content of my creation lie in the algorithms I create?

I think you can see why I have, so far, decided against this approach. But it’s worth occasional reconsideration.

Later the same day…
After much thinking, I continue to oscillate on the question:

Against:
The idea of crowdsourcing is fundamentally inimical to the very concept of computing. The whole idea is that the human should provide an abstraction of truth and the computer should work out all the niggling instances. In physics, for example, the human should say “F = ma” and the computer should work out all the specifics from that. Crowdsourcing the algorithms reverses that relationship, making the computer the abstracter and the human the instantiator. That’s backwards.

For:
What if we delegated the artistic expression in a more rigorous fashion? What if I recruited seven experienced actors or role-players and asked each one to act as one actor? This would require a big investment on each one’s part, but they would get artistic credit for their performances. Those performances would consist of many playings of the game, trying to act in the way they imagine their character would act. 

Details: I would first have to create a version of the system that operates over the web. I recruit the actors and spend some weeks or months going over the scenario with them, developing the characters’ personalities in greater detail. Once everybody is happy with the preliminaries, we begin playing, but each player plays alone. At first, all actors except the player’s actor are set to respond randomly. Once we have a little data from the players, we make that the source of behavior. From that point on, it’s just a matter of players fleshing out the databases. This would require many playings, but I think we’d get better results from a small set of dedicated artists/players than from a large set of random schnooks/players.

This idea has me excited. This really could work. But I need time to develop it further.


Even Later…
Can I really be sure that each of the seven artists/players will do the job appropriately? What if somebody fails to enter enough data for the database?

How do I know that I can find an appropriate search algorithm that will reveal the best choice in each case? One improvement on the search system I mentioned earlier is to work backwards from the existing situation. That is, I start with the event that has just taken place, and then search the database for the same event. I should find a number of cases of the same event. I then look at the option chosen by the actor in response to that event. If I get the same response in every case, then the answer is easy: use the same response. But if I get different responses, then I resolve the issue by searching backwards, compiling a record of differences in previous history. That should give me enough to select the best option.
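The backward search might be sketched like this. It is a simplification under stated assumptions: each record is a pair of (history, response), an event is just a string, and "compiling a record of differences" is reduced to requiring one more matching event at each step back. If the ambiguity never resolves, the sketch falls back on the commonest response, which is my own stopgap.

```python
from collections import Counter

def backward_search(current, records):
    """current: events so far, newest last; records: non-empty list of (history, response)."""
    candidates = records
    for depth in range(1, len(current) + 1):
        tail = current[-depth:]
        # Keep only the records whose last `depth` events match ours.
        narrowed = [(h, r) for h, r in candidates if h[-depth:] == tail]
        if not narrowed:
            break  # nothing in the database matches this far back
        candidates = narrowed
        responses = {r for _, r in candidates}
        if len(responses) == 1:
            return responses.pop()  # every matching case agrees
    # Still ambiguous: fall back to the most common response among the survivors.
    return Counter(r for _, r in candidates).most_common(1)[0][0]

# Invented records for illustration.
records = [
    (["greet", "punch"], "run"),
    (["insult", "punch"], "shoot"),
    (["insult", "punch"], "shoot"),
    (["greet", "hug"], "hug"),
]

print(backward_search(["insult", "punch"], records))
print(backward_search(["greet", "hug"], records))
```

In the first call the latest event alone is ambiguous (run or shoot), and one step back settles it.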

However, this will require a full database search, and I’ll definitely want to organize the database for fast searches. In truth, the main clause of an event is the defining characteristic, and that can be squeezed down to 16 bits: 3 for the Subject, 3 for the DirObject, and 10 for the Verb (and that’s assuming a thousand Verbs in the set!). If I assume that each artist/player plays the game for a hundred hours, requiring 10 seconds per decision, that means that each actor’s database will consist of 36,000 events, for a total of 252,000 events: that’s just half a megabyte of storage. I can do that in memory. However, that’s only a partial database: it does not include all the extra WordSockets, which might be important. I will probably have to keep a full version of the database on the hard drive. That in turn means that my searching database will be much bigger, because it will have to store indices as well as key values. Perhaps I should study database design more carefully.
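The 16-bit key and the storage estimate can be sketched as follows. The bit layout (Subject in the top three bits, then DirObject, then Verb) and the function names are assumptions; any fixed ordering of the three fields would do.

```python
def pack_event(subject, dir_object, verb):
    """Pack a main clause into 16 bits: 3 Subject, 3 DirObject, 10 Verb."""
    if not (0 <= subject < 8 and 0 <= dir_object < 8 and 0 <= verb < 1024):
        raise ValueError("field out of range")
    return (subject << 13) | (dir_object << 10) | verb

def unpack_event(code):
    """Recover (subject, dir_object, verb) from a 16-bit key."""
    return (code >> 13) & 0b111, (code >> 10) & 0b111, code & 0x3FF

# Storage estimate from the text: 7 players, 100 hours each, 10 seconds per decision.
events_per_actor = 100 * 3600 // 10
total_events = 7 * events_per_actor
total_bytes = total_events * 2  # two bytes per 16-bit key

print(unpack_event(pack_event(5, 2, 999)))
print(events_per_actor, total_events, total_bytes)
```

Three bits allow eight actors and ten bits allow 1,024 Verbs, so the key has a little headroom on both counts.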

So I’m still not certain as to whether I should proceed with the “mini-crowdsourced” strategy.