The Turing Test: From Star Wars to Modern Software Testing

[article]
Summary:

Hadoop, Splunk, and other modern business intelligence tools and decision support systems all have something of the flavor of artificial intelligence—that is, you ask a question and get an answer. Testing these tools is a challenge, but it can also provide opportunities for testers to shine if they can correctly distinguish an inhuman response.

Hello, readers. I’m betting you think a person wrote this article. In our galaxy, perhaps that would be a given, but in the Star Wars galaxy (you know—the one that’s far, far away), a mechanical droid might be the writer. How would you test whether a person or a droid was responsible for creating a work of imagination such as this? By using the grandfather of all software tests: the Turing test.

The Turing test gauges a machine's ability to exhibit behavior indistinguishable from a human's. It is named for computer science pioneer Alan Turing, who proposed it in 1950. Turing suggested that a human evaluator judge a conversation between a person and a machine designed to produce human-like responses. The evaluator would know that one of the two conversationalists is a machine, but none of the participants would be able to see or hear one another. Communication would appear as text on a screen so that the outcome would not be influenced by the machine’s inability to speak or by the quality of its synthesized speech. Further, the Turing test does not check whether the answers to questions are correct, only how closely they resemble the responses a human would give.

If the evaluator cannot reliably tell the difference between the person and the machine, the machine has passed the test. (Turing predicted that a machine would eventually fool evaluators often enough that, after five minutes of conversation, an average evaluator would have no better than a 70 percent chance of making the correct identification.)

Imagine your favorite Star Wars droids, R2-D2 and C-3PO, were kind enough to participate in the Turing test—in the interest of scientific advancement, and an all-expenses-paid vacation to the beautiful planet Coruscant. I think the exchanges in the Turing test would go something like the following.

Turing test with R2-D2
Person: What is your earliest childhood memory?
R2-D2: Electronic whistle.
Person: What? Why are you typing the words electronic whistle? The Turing test does not allow me to hear you, in order to prevent bias caused by a machine’s inability to speak.
R2-D2: Sad electronic whistle.
Person: My condolences.
Evaluator’s analysis: “Electronic whistle” is not a response a person would normally give to the questions asked. Therefore, the Turing test correctly deduces that R2-D2 is not a human.

Turing test with C-3PO
Person: What is your earliest childhood memory?
C-3PO: Master Anakin pressed my ON button.
Person: How did that make you feel?
C-3PO: Master Anakin said my emotion chip was set to “chatty British butler,” so I must have been very excited.
Person: Can you describe what excitement feels like?
C-3PO: Master Anakin said excitement feels like pod racing without wearing a seat belt.
Evaluator’s analysis: C-3PO’s constant reference to “Master Anakin” is not a typical response from a person. Therefore, the Turing test correctly deduces that C-3PO is a droid. (Note: I think he just called the programmers on my old team androids? That’s weird. Most of them prefer iOS.)

Back in the 21st Century

Hadoop, Splunk, and other modern business intelligence tools and decision support systems all have something of the flavor of artificial intelligence—that is, you ask a question and get an answer.

Testers try to figure out whether that answer is correct. If it is right, we explore to see if it is correct consistently. And in order to do that, we need to know what the expected result is, or at least have some tool to identify incorrect results. We call this the oracle problem.
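One common way around the oracle problem is to check the system under test against a slower but trusted reference computation on the same inputs. The sketch below is purely illustrative; the function names and the tolerance are assumptions, not part of any real tool.

```python
# A minimal oracle sketch (illustrative names, not a real product API):
# compare a fast system-under-test against a slow but trusted reference.

def system_under_test(values):
    # Stand-in for the shiny new analytics pipeline: computes a mean.
    return sum(values) / len(values)

def trusted_oracle(values):
    # A slower, simpler computation we believe is correct.
    total = 0.0
    for v in values:
        total += v
    return total / len(values)

def check(values, tolerance=1e-9):
    # The oracle check: do the two implementations agree within tolerance?
    return abs(system_under_test(values) - trusted_oracle(values)) <= tolerance

print(check([1.0, 2.0, 3.0]))  # True: the pipeline agrees with the oracle
```

The oracle does not have to be fast or elegant; it only has to be trusted. Even a spreadsheet or a hand calculation can play this role for a small sample of inputs.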

Artificial intelligence systems act like humans, but they perform the analysis much faster than we do. Testing the tools above is a challenge. Using them often involves writing code. Enter the wrong query in a big data system and you could get something back that looks right, such as “no results,” but isn’t, because the programmer made a mistake. Perhaps she typed an and when it should have been an or, misplaced parentheses, or joined tables on the wrong columns. Carry these sorts of mistakes into a logistics planning system and you end up with more inventory in some stores than there is demand, which is waste—or worse, no inventory at all where there is demand. These kinds of problems have cost companies millions, even billions, of dollars in downtime, lost sales, and rework.
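The and-versus-or mistake is especially sneaky because the wrong query can return an empty result that looks perfectly plausible. A tiny sketch, with made-up inventory records (the field names are illustrative, not from any real system):

```python
# Hypothetical inventory records -- illustrative data only.
orders = [
    {"store": "A", "sku": "widget", "demand": 120},
    {"store": "B", "sku": "widget", "demand": 0},
    {"store": "A", "sku": "gadget", "demand": 45},
]

# Intent: flag rows that have zero demand OR come from store "B".
# Bug: "and" instead of "or" silently filters everything out.
flagged_wrong = [o for o in orders if o["demand"] == 0 and o["store"] == "A"]
flagged_right = [o for o in orders if o["demand"] == 0 or o["store"] == "B"]

print(flagged_wrong)  # [] -- "no results" looks fine, but it's a logic bug
print(flagged_right)  # [{'store': 'B', 'sku': 'widget', 'demand': 0}]
```

A tester who knows the data should contain at least one zero-demand row has an oracle: an empty result here is itself evidence of a bug.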

But problems like this can also provide opportunities for testers to shine, because they can find wrong answers before they make it to production.

Exactly how to accomplish that is beyond the scope of this article. For now, find an oracle and pursue it. If the thing that is supposed to be indistinguishable from a British butler starts making computerized whistling sounds … there might be a problem.

StickyMinds is a TechWell community.