Given the popularity of our recent studies into Google Bard content and Mixtral AI content, we have decided to conduct another small study into whether Grok AI content can be detected by some of the most popular AI content detection tools, including Originality.ai.
Here, we will look at the Grok AI model in greater detail and test each AI content detection tool’s ability to determine whether the generated content is AI-written.
Read on to learn about the top-level results from this study, alongside a short analysis of how each tool performed.
You can also access the dataset here, as well as the open-source AI detection testing and statistical analysis tool used for this study.
We have also built and open-sourced an AI detector efficacy research tool that makes it even easier for researchers to test detector performance against a dataset.
Note: this is ONLY a 200-sample test, which is far too small to support conclusive answers. For a more complete AI detection accuracy study, see this study.
To see how well each of these tools worked, we followed machine learning best practices and tested a wide variety of AI-generated content.
If you want to learn a little bit more about that process, check out this detailed guide. You can also learn more about the accuracy of AI detectors as a whole.
When trying to determine how effective an AI detector is, the easiest and most consistent method is to focus on the confusion matrix (which you will see outlined for each detector in this article) and the F1 score, a metric frequently used to distill the full confusion matrix into a single figure. One note on reading the numbers: because this test set consists entirely of AI-generated samples, each detector’s precision here is effectively 1.0, which is why recall and accuracy are identical for every tool and the F1 scores come out higher (a short worked example follows the Originality.ai results below).
Originality.ai
F1 score: 0.95
Recall: 0.90
Accuracy: 0.90
From our sample of 200 AI-generated articles, Originality.ai correctly identified 90% of the content as AI-written while incorrectly attributing the remaining 10% to a human writer.
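To make these metrics concrete, here is a minimal sketch of how the confusion matrix, recall, accuracy, and F1 score can be computed for a result like this one. Python with scikit-learn is our assumption for illustration purposes only; this is not the actual pipeline behind our open-source testing tool.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score

# Labels: 1 = AI-generated, 0 = human-written.
# This test set is entirely AI-generated, so every true label is 1.
y_true = [1] * 200

# Illustrative predictions mirroring the 90% result above:
# 180 samples correctly flagged as AI, 20 missed (called human).
y_pred = [1] * 180 + [0] * 20

# Rows = actual class, columns = predicted class.
print(confusion_matrix(y_true, y_pred, labels=[0, 1]))
# [[  0   0]    <- no human-written samples in this test
#  [ 20 180]]   <- 20 false negatives, 180 true positives

print(round(recall_score(y_true, y_pred), 2))    # 0.9
print(round(accuracy_score(y_true, y_pred), 2))  # 0.9

# With no human samples, precision is 1.0, so F1 = 2 * 0.9 / 1.9, about 0.95.
print(round(f1_score(y_true, y_pred), 2))        # 0.95
```

The same arithmetic reproduces the scores reported below: with precision at 1.0, F1 = 2 × recall ÷ (1 + recall), so GPTZero’s 68.6% detection rate, for example, works out to an F1 of roughly 0.81.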
GPTZero
F1 score: 0.81
Recall: 0.69
Accuracy: 0.69
GPTZero performed considerably worse in this test, detecting 68.6% of the content as AI-generated and incorrectly attributing 31.4% of the content to human writers.
CopyLeaks
F1 score: 0.81
Recall: 0.68
Accuracy: 0.68
CopyLeaks performed similarly to GPTZero, also significantly underperforming compared to Originality.ai: it identified 67.5% of the content as AI-generated and incorrectly classified 32.5% of the content as human-written.
Sapling
F1 score: 0.83
Recall: 0.71
Accuracy: 0.71
Sapling fared a little better than both CopyLeaks and GPTZero, but worse than Originality.ai, detecting 71% of the content as AI-generated and incorrectly classifying 29% as human-written.
As you can see from the results of this test, Grok AI content is detected at rates very similar to those in our comparable tests of Google Bard, ChatGPT, and Mixtral content.
Originality.ai performed the strongest, with a 90% success rate compared to Sapling’s 71%, GPTZero’s 68.6%, and CopyLeaks’ 67.5%.
As you can see from this small dataset, even the best AI detectors have flaws, and that must be taken into account. However, it is clear from this study that the Originality.ai tool continues to lead the way with the most accurate AI content detection software.
It also highlights why these types of studies are so important as we continue to push the conversation around AI transparency forward, allowing all of us to keep learning, growing, and improving together.
With that in mind, if you are interested in running your own study, please reach out, as we are happy to offer research credits.
We are also still looking for a participant in our challenge for charity (Do AI Detectors Work?). If you'd like to get involved, please get in touch.
Content creators, editors, and writers are entering uncharted territory with regard to AI. In the past, it was easy to identify AI-generated content at a glance. However, with new developments in machine learning and natural language processing, the waters have become even murkier.