We're back with another study, this time focusing on whether content from the Mixtral AI model can be detected by market-leading AI content detection tools, including Originality.ai.
Below are the top-level findings from this study, including a breakdown of how we executed the testing and a short analysis of the results.
You can also access the dataset here or the open-source AI detection testing/statistical analysis tool used for this study.
We have made and open-sourced an AI detector efficacy research tool that is even easier for researchers to use to test AI detector efficacy against a dataset.
Note: this is ONLY a 200-sample test, which is too small for any conclusive answers. For a more complete AI detection accuracy study, see this study.
We use machine learning best practices to evaluate how well a classifier works. Here's a guide that covers the process in greater detail, including more on the accuracy of AI detectors as a whole.
When assessing the performance of an AI detector, the best approach is to look at the confusion matrix (outlined for each detector in this article) and the F1 score. The F1 score is a widely used metric that condenses the confusion matrix into a single figure.
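As a minimal sketch of how these metrics relate, the snippet below derives precision, recall, accuracy, and F1 from the four cells of a binary confusion matrix. The counts are purely illustrative, not the study's actual data.

```python
# Sketch: deriving classifier metrics from a binary confusion matrix.
# tp = AI articles correctly flagged as AI, fn = AI articles missed,
# fp = human articles wrongly flagged, tn = human articles passed.

def metrics(tp: int, fp: int, fn: int, tn: int):
    """Return (precision, recall, accuracy, f1) for a binary classifier."""
    precision = tp / (tp + fp)                 # share of AI flags that were right
    recall = tp / (tp + fn)                    # share of AI articles caught
    accuracy = (tp + tn) / (tp + fp + fn + tn) # share of all calls that were right
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
    return precision, recall, accuracy, f1

# Illustrative example: 94 of 100 AI articles flagged, 6 missed,
# and all 100 human articles correctly passed.
p, r, a, f = metrics(tp=94, fp=0, fn=6, tn=100)
print(f"precision={p:.2f} recall={r:.2f} accuracy={a:.2f} f1={f:.2f}")
```

Because F1 is the harmonic mean of precision and recall, a detector that misses some AI content (lower recall) can still post a high F1 if it rarely mislabels human writing.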
F1 score: 0.97
Recall: 0.94
Accuracy: 0.94
Originality.ai correctly identified 94.3% of the AI-written content as AI-generated, mistakenly classifying it as human-written 5.7% of the time.
F1 score: 0.76
Recall: 0.62
Accuracy: 0.62
GPTZero did not perform as well as Originality.ai in this test, correctly identifying only 61.7% of the content as AI-generated and mistakenly labeling it human-written 38.3% of the time.
F1 score: 0.78
Recall: 0.63
Accuracy: 0.63
CopyLeaks fared slightly better than GPTZero (+1.5% correct detections), but it was still successful less than two-thirds of the time, with 36.8% of articles incorrectly identified as human-written.
F1 score: 0.86
Recall: 0.68
Accuracy: 0.68
Like GPTZero and CopyLeaks, Sapling was only able to correctly identify about two-thirds of the AI-generated articles in this test (67.5%).
Based on the results of this 200-article test, it appears that Mixtral AI detectability is in keeping with other LLMs, such as Google Bard and ChatGPT.
Overall, Originality.ai significantly outperformed GPTZero, CopyLeaks, and Sapling at detecting AI-generated content. While those three tools all sat between 62% and 68% accuracy, Originality.ai achieved a 94% accuracy rate.
It's important to bear in mind that all AI detectors have flaws, and there are instances of false positives and missed AI content. However, this study further demonstrates the reliability of the Originality.ai tool.
It also highlights the importance of studies like this in maintaining confidence in the content we consume, furthering the value of transparency. If you are interested in running your own study, please reach out, as we are happy to offer research credits.
We are also still looking for a participant in our challenge for charity (Do AI Detectors Work?). If you'd like to get involved, please get in touch.
Content creators, editors, and writers are entering uncharted territory with regard to AI. In the past, it was easy to identify AI-generated content at a glance. However, with new developments in machine learning and natural language processing, the waters have become even murkier.