#paper was coauthored by a PhD student (who should know better) and his thesis adviser (who should REALLY know better)
thosearentcrimes · 1 month
Read an AI hype preprint because apparently I am not very discerning with my time. There's a lot that's kinda funny about it.
So the title of this paper is "People cannot distinguish GPT-4 from a human in a Turing test". The purpose of the title is evidently not to provide a clear reference for other researchers, but rather to produce AI hype and be reproduced in credulous headlines, as we'll see shortly.
The paper opens with a brief history and presentation of the Turing test. The important features are that the interrogator questions both an AI and a human, has five minutes to do so, and is tasked with identifying which is the AI. Then it presents the testing setup they used: 500 people are recruited and split into five groups of 100 (which makes the percentage signs redundant). 400 of them are interrogators, who converse with GPT-4, GPT-3.5, ELIZA, or a human (drawn from the remaining 100) for five minutes and are then asked to determine whether they were talking to a human or not. It is a two-player setup. Why, though? You just explained that the Turing test was formulated on the premise of three participants; what is the reasoning for departing from that? Oh, ok, here it is:
We used a two-player formulation of the game, where a single human interrogator conversed with a single witness who was either a human or a machine. While this differs from Turing’s original three-player formulation, it has become a standard operationalisation of the test because it eliminates the confound of the third player’s humanlikeness and is easier to implement
WHAT? The "confound" of the third player's humanlikeness is the point! That is part of the premise of the test: it's trying to test and compare humanlikeness. Margarine marketers are more honest than this. By the end of the Introduction of this paper titled "People cannot distinguish GPT-4 from a human in a Turing test", the authors have explained that they did not administer a Turing test, because they were pretty sure that if they did, GPT-4 would fail, and they wanted a positive result, so they designed their own test that it could pass instead. This is just outright fraud!
It's also kind of weird to talk about "GPT-4" succeeding at the test, because what succeeded was in fact a specific, elaborate prompt of GPT-4, selected on the basis of prior research that had found it to be the most effective strategy. I mean, it needs to be prompted with something, and it's not entirely clear to me why I think something more neutral, without specific listed strategies, would be a more honest implementation, but I do. I guess it's because the finding is presented as a claim that the AI is good at deception, and it just really isn't. The AI didn't come up with the idea of running an exploratory study with a wide variety of prompts (or, for that matter, come up with the prompts themselves) and then using the ones that worked best; that was the researchers. The researchers who have pretty obviously rigged the whole study, in fact.
There is also some arguable dishonesty in the way the situation was presented to the human participants. They say the participants were told they would be put in conversation with either a human or a machine. Now, unless they were told more than is mentioned, I contend those people would have been entitled to treat this as an implication that the two possibilities were equally likely, especially given the analogy with the actual Turing test, in which the numbers of human and machine witnesses are necessarily identical. Note also that, all in all, the interrogators handed down just a little under 50% human verdicts. In fact, the probability that they would be assigned to speak to a machine was 75%. If the interrogators had been told that there was only a 25% probability they would be assigned to speak to a human, would they have been more critical? Would they have assigned closer to 25% human verdicts? Even if they hadn't moved that much closer, I suspect it would bring the GPT-4 success rate below 50%, which was treated as an important benchmark for god knows what reason. It might also bring the human success rate below 50%, of course. All of this could have been avoided by simply running the actual Turing test, but again, GPT-4 would have failed, so they couldn't do that.
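To put very rough numbers on that argument, here's a back-of-the-envelope sketch. The group sizes come from the setup described earlier; the "calibrated verdicts" scaling model and the example pass rate are purely hypothetical placeholders of mine, not anything computed from the paper's data.

```python
# Back-of-the-envelope sketch of the base-rate point above.
# Group sizes come from the setup described earlier; the "calibrated
# verdicts" scaling model and the example pass rate are purely
# hypothetical illustrations, not numbers from the paper.

interrogators = 400
witness_groups = {"human": 100, "GPT-4": 100, "GPT-3.5": 100, "ELIZA": 100}

p_machine = (interrogators - witness_groups["human"]) / interrogators
print(f"P(assigned to a machine) = {p_machine:.2f}")  # 0.75, not 0.50

# Toy model: interrogators hand out "human" verdicts at roughly the rate
# they believe humans occur. Told nothing, they act as if it's 50/50
# (and they did give just under 50% human verdicts overall). Had they
# internalized the true 25% base rate, every per-witness "judged human"
# rate would shrink by a factor of 0.25 / 0.50 under this model.
believed_prior, true_prior = 0.50, 0.25
scale = true_prior / believed_prior  # 0.5

hypothetical_pass_rate = 0.55  # placeholder only, NOT the paper's figure
print(f"{hypothetical_pass_rate:.0%} -> {hypothetical_pass_rate * scale:.0%}")
# Any pass rate below 100% lands below 50% under this (admittedly crude) model.
```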
I wouldn't be too surprised if this paper got accepted without major revisions; science is as vulnerable to hype cycles as anything else. But if it does, what an indictment of the field (hm, actually, which field? I don't think "dicking around with GPT" has really been standardized yet).