Anthropic has released Claude 3. We wanted to test how well the Originality AI detector and other leading alternatives could detect its content as AI-generated.
Below, you can find a summary of the key findings, how we arrived at them through our testing, and a short analysis of the results.
You can also access the dataset and the open-source AI detection testing/statistical analysis tool used for this study.
To get the most accurate results, we used machine learning best practices to evaluate a classifier's efficacy. If you want to learn more about the measures used and AI detector accuracy, check out this detailed guide.
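To make these measures concrete, here is a minimal sketch of how they can be computed with scikit-learn. The label arrays are hypothetical placeholders (1 = AI-generated, 0 = human-written), not the actual study data.

```python
# Minimal sketch: computing the metrics reported below with scikit-learn.
# y_true / y_pred are hypothetical placeholders, not the study data.
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # ground truth: 1 = AI-generated
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # detector verdicts

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("F1 score:   ", f1_score(y_true, y_pred))
print("Precision:  ", precision_score(y_true, y_pred))
print("Recall:     ", recall_score(y_true, y_pred))  # true positive rate
print("Specificity:", tn / (tn + fp))                # true negative rate
print("FPR:        ", fp / (fp + tn))                # false positive rate
print("Accuracy:   ", accuracy_score(y_true, y_pred))
```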
Originality AI Standard 2.0
F1 score: 0.971
Precision: 1.0
Recall (True Positive Rate): 0.943
Specificity (True Negative Rate): 1.0
False Positive Rate: 0.0
Accuracy: 0.943
Originality AI Standard 2.0 classified the content correctly 94.3% of the time, with a 0% false positive rate.
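As a quick sanity check, the F1 score follows directly from the reported precision and recall (F1 is their harmonic mean); the snippet below reproduces the 0.971 figure.

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 1.0, 0.943
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.971, matching the reported F1 score
```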
Originality AI Turbo 3.0
F1 score: 0.99
Precision: 1.0
Recall (True Positive Rate): 0.981
Specificity (True Negative Rate): 1.0
False Positive Rate: 0.0
Accuracy: 0.981
Originality AI Turbo 3.0 delivered the best results in this test, correctly identifying 98.1% of the content as AI-generated and misclassifying only 1.9%.
GPTZero
F1 score: 0.903
Precision: 1.0
Recall (True Positive Rate): 0.824
Specificity (True Negative Rate): 1.0
False Positive Rate: 0.0
Accuracy: 0.824
GPTZero also performed solidly on this test, correctly identifying 82.4% of the AI-generated articles as AI-written while incorrectly classifying the remaining 17.6% as human-written.
Sapling
F1 score: 0.90
Precision: 1.0
Recall (True Positive Rate): 0.813
Specificity (True Negative Rate): 1.0
False Positive Rate: 0.0
Accuracy: 0.813
Sapling performed the worst of the tools tested, but it still identified 81.3% of the content as AI-generated, incorrectly labeling the remaining 18.7% as human-written.
Copyleaks
F1 score: 0.958
Precision: 1.0
Recall (True Positive Rate): 0.919
Specificity (True Negative Rate): 1.0
False Positive Rate: 0.0
Accuracy: 0.919
Copyleaks also did well on Claude 3, identifying 91.9% of the AI content as AI-generated while falsely identifying 8.1% as human-written.
From the results of this study, the detectability of Anthropic's Claude 3 appears to align with that of other LLMs, such as ChatGPT (GPT-3.5 and GPT-4).
Both Originality Standard 2.0 and Turbo 3.0 outperformed GPTZero, Sapling, and Copyleaks on AI detection, with our Turbo 3.0 model performing particularly well.
Always remember that AI detectors are not perfect and do produce false positives. However, they do work, and we are still looking for a participant in our "Do AI Detectors Work?" challenge for charity.
If you are interested in running your own study, please reach out, as we are happy to offer research credits.