We Ran AI-Generated Text Through 7 AI Detectors to See How Accurate They Are

AI chatbots like ChatGPT and Bard have been all over the news. From Google racing to catch up to ChatGPT’s progress to growing concern about ChatGPT in the education sector, AI chatbots have dominated 2023.

As a result, detection tools are starting to creep into the spotlight. We’ve been reading a lot lately about AI text detectors and classifiers that claim to spot AI-generated text. If these tools are as accurate as they claim, they would be a huge step in efforts to crack down on plagiarism, misinformation, and cheating in schools.

So it got us thinking… why don’t we test some of these AI detector tools to see how accurate they really are? We would have AI generate text and see whether different detection tools could tell if it was written by a human or by a machine. The blog staff here at SBS did some quick research, compiled a list of the more popular tools we’ve seen others test, and conducted our experiment.

To begin, we used ChatGPT to obtain our sample of AI-generated content. We asked the chatbot to “Explain the benefits of egaming in schools in two paragraphs”. Two paragraphs gave us enough content, since some of the detection tools have a minimum character count to satisfy. In a few short seconds, ChatGPT gave us this:

“eGaming in schools offers numerous benefits to students. Firstly, it promotes teamwork and collaboration among students who play video games in a team setting. This helps them learn how to communicate effectively, build relationships, and work together to achieve a common goal. Additionally, playing video games requires cognitive skills such as problem-solving, critical thinking, decision-making, and strategizing. eGaming, therefore, can improve these cognitive skills and help students develop a better understanding of how to approach complex situations.

Secondly, eGaming provides a safe and controlled environment for students to experiment and make mistakes without experiencing significant consequences. In addition, it helps reduce the temptation to engage in unsafe or illegal activities outside the classroom. Moreover, it provides an excellent platform for students to connect with peers who share similar interests, fostering a sense of belonging and community. Finally, eGaming helps develop perseverance and resilience, as players must have the ability to adapt to different challenges and overcome setbacks, which are valuable skills that can be applied on other areas of life.”
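For anyone who wants to reproduce this step programmatically rather than through the chat interface, a minimal sketch using OpenAI’s official Python client is below. The model name is our assumption – we used the ChatGPT web interface, so our sample isn’t tied to any specific API model, and your output will differ from ours.

```python
# Minimal sketch: generating a test sample with OpenAI's Python client (v1+).
# Assumes an OPENAI_API_KEY environment variable is set.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumption; we used the ChatGPT web UI, not the API
    messages=[{
        "role": "user",
        "content": "Explain the benefits of egaming in schools in two paragraphs",
    }],
)

print(response.choices[0].message.content)
```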

Now that we had our sample, we were ready to begin testing. In our research we came across several sites claiming to spot AI-generated text. After narrowing the field to tools that were free to use, we ended up with seven sites (including one from OpenAI, whose chatbot generated the text for us in the first place), and we began our experiment.

The seven tools we used were from GPTZero, Writer.com, AI Writing Check, Content at Scale, GPT Radar, Copyleaks, and the OpenAI Text Classifier. In each case, we submitted our AI-generated text to the tool, and these were the results:

GPTZero

Result: Written by AI

GPTZero called out several sentences in the sample where AI-generated content was detected. It gave the content a very low perplexity score of 82.444. Perplexity measures how “random” or unpredictable the text looks to a language model; lower scores mean more predictable text, which is a hallmark of machine generation. In this case, GPTZero was spot on.
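GPTZero doesn’t publish exactly how it computes that number, but the general idea behind perplexity is simple: run the text through a language model and measure how “surprised” the model is by each token. Here’s a rough sketch using GPT-2 via Hugging Face’s transformers library – our choice for illustration only, since GPTZero may well use a different model and scale:

```python
# Rough perplexity sketch using GPT-2 (illustrative; not GPTZero's actual method).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "eGaming in schools offers numerous benefits to students. ..."  # paste the full sample here
encodings = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the average cross-entropy loss.
    outputs = model(encodings.input_ids, labels=encodings.input_ids)

perplexity = torch.exp(outputs.loss).item()
print(f"Perplexity: {perplexity:.3f}")  # lower = more predictable = more AI-like
```

Human writing tends to score higher on a measure like this because it’s harder for the model to predict.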

Writer.com AI Content Detector

Result: Written by AI

The Writer.com tool didn’t give a perplexity score, but it estimated that only 16% of the sample was written by humans. That figure is still off, since none of the sample was human-written, but the tool correctly found that the text was largely AI-generated. It even offered some constructive feedback: “You should edit your text until there’s less detectable AI content.”

AI Writing Check

Result: Written by AI

So far, we’re 3 for 3. AI Writing Check required a sample of at least 100 words, which ours met. Its analysis correctly determined that our sample was written by AI, though it supplied no information beyond that verdict.

Content at Scale AI Detector

Result: Written by Human

On test number 4, things took a turn. We submitted our sample of text, and the results came back wildly inaccurate. This tool uses “Three P’s” to make its prediction – Predictability, Probability, and Pattern – and our scores were 48%, 100%, and 100%, respectively. From this analysis it determined that our sample was 83% likely to have been written by a human. Oh, how wrong it was.

Copyleaks AI Content Detector

Result: Written by AI

The Copyleaks tool was another that gave us little information after we submitted our sample, beyond correctly identifying the text as AI-generated.

GPTRadar

Result: Written by Human

The post-submission analysis on GPTRadar was a little overwhelming to read. It gave every word in our sample its own verdict on whether it was written by a human or by AI. GPTRadar also gave us a perplexity score of 13 – which seemed incredibly low next to the 82.444 from GPTZero – but it wasn’t clear whether the two sites measure perplexity on the same scale. Either way, GPTRadar incorrectly concluded, with 79% certainty, that our text was written by a human.
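GPTRadar doesn’t document its method either, but word-by-word verdicts like these generally come from checking how much probability a language model assigns to each token that actually appears. A hypothetical sketch, again using GPT-2 purely for illustration:

```python
# Hypothetical per-token scoring sketch (illustrative; not GPTRadar's actual method).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "eGaming provides a safe and controlled environment for students."
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits

# Probability the model assigned to each token, given the tokens before it.
probs = torch.softmax(logits[0, :-1], dim=-1)
next_ids = ids[0, 1:]
token_probs = probs[torch.arange(len(next_ids)), next_ids]

for tok, p in zip(tokenizer.convert_ids_to_tokens(next_ids.tolist()), token_probs):
    print(f"{tok!r}: {p.item():.3f}")  # consistently high probabilities look machine-written
```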

OpenAI Text Classifier

Result: Written by AI

Phew, this could’ve been really embarrassing. Thankfully, the tool from OpenAI recognized text generated by its own chatbot. Aside from “The classifier considers the text to be possibly AI-generated,” it offered no explanation of how it reached that verdict.

From our research, we appear to have gotten relatively lucky with the success rate of the tools we used. More often than not, the tools that successfully flagged our AI-generated text failed in tests conducted by others.

In this experiment conducted by TechCrunch, the OpenAI Classifier failed to detect AI-generated content 6 out of 7 times. That’s pretty incredible: a detection tool from the very organization that started this craze had an 86% failure rate in that testing.

Experimentation like this simply confirms that a lot of work still needs to go into the responsible implementation of AI chatbots. The same care needs to go into the tools meant to detect AI in your child’s school essay or in the email you received last week from a tech guru.

For the record, this post was human-generated (with the exception of our sample). So if you decide you want to do a little testing for yourself…
