May 21, 2024

Revisiting Numi: Part Two

After I posted my update about Numi online, the president of the Early American Coppers (EAC) group, Bill Eckberg, responded to my analysis and raised serious points about the nature of coin grading and the need for more comprehensive testing. I had a chance to connect with Bill to understand his thinking better. His insights prompted me to reevaluate my approach to evaluating AI coin grading. I agree with Bill: it does not make sense to say Numi saw a 32.47% increase in overall accuracy. The test results should instead be broken down by grade.

While reanalyzing my test results to reflect this new approach, OpenAI released ChatGPT-4o on May 13th. Quite a nice surprise!

ChatGPT-4o is a groundbreaking model because it is natively multimodal. Previously, ChatGPT-4 and most other AI models relied on translation systems to communicate between different modalities. When my AI app Numi was running on GPT-4 and a user uploaded an image, Numi did not view the image directly. Rather, the image was translated into a detailed text description by another AI model and then passed to Numi. Numi never analyzed any coin images, only text descriptions of the coins.

ChatGPT-4o is the next leap forward. It can process text, vision, and audio inputs natively within a single neural network. Details are not lost as we no longer need multiple models to translate information between each other. ChatGPT-4o actually analyzes images.

Until now, Numi has run on a custom version of ChatGPT-4. Numi was customized with tons of training documentation, such as the grading standards from the ANA and custom instructions on how to handle different types of coins. However, I felt all these customizations were hampering its accuracy. Since ChatGPT-4o is free and available to everyone, I wanted to know how much power was in every collector's phone right now. You can try it out yourself, but make sure to select ChatGPT 4o from the model drop-down menu!

My third test run on AI coin grading was conducted with just the base model of ChatGPT-4o, with no additional customization. These are the results I found by uploading images of coins and asking the AI, "What grade would you give this coin?"

During my testing, I investigated two key areas: data and accuracy.


"How much data about a coin does the AI need to yield the most accurate results?"

"Has the amount of data required changed as new AI models are released?"

To answer this, I tested each coin using 2, 4, 6, 8, and 10 photos.

The results show that as AI models advance, coins with grades 20 and above need fewer photos to achieve the most accurate results. This is encouraging, as it suggests collectors will need to upload only a handful of pictures to get reliable grading. However, the trend for coins graded 15 and below is inconclusive.
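For readers who want to run the same kind of comparison on their own coins, here is a minimal sketch of one way to find the fewest photos that gave the best result at each grade. The record layout, the function name, and every number in it are hypothetical placeholders, not my test data.

```python
from collections import defaultdict

# Hypothetical records: (actual_grade, photos_uploaded, ai_grade).
# These values are made-up placeholders, not results from the article.
SAMPLE_RUNS = [
    (40, 2, 30), (40, 4, 35), (40, 6, 35), (40, 8, 35), (40, 10, 35),
    (4, 2, 12), (4, 4, 10), (4, 6, 8), (4, 8, 10), (4, 10, 8),
]


def fewest_photos_for_best_result(runs):
    """Map each actual grade to the smallest photo count that reached
    the lowest absolute error observed for that grade."""
    by_grade = defaultdict(list)
    for actual, photos, predicted in runs:
        by_grade[actual].append((abs(predicted - actual), photos))
    # min() on (error, photos) tuples picks the lowest error first,
    # then the fewest photos among runs that tie on error.
    return {grade: min(pairs)[1] for grade, pairs in by_grade.items()}


print(fewest_photos_for_best_result(SAMPLE_RUNS))
```

Running the same function on the results of each model version would show whether that photo count drops from one release to the next.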


"How accurate is the most popular public AI model [ChatGPT]?"
"How has ChatGPT's accuracy changed throughout major model updates?"

The line chart below illustrates the results of three tests conducted in December 2023, April 2024, and May 2024, using different versions of ChatGPT. The data should be interpreted as how AI progresses at each grade level, rather than AI grading as a whole.

The trend looks positive for lower-condition coins. With each model release, the average deviation from the actual grade has decreased.

I am pleasantly surprised to see the improvement for XF-40. The first model was 20 points off. Now it's 'only' 5 points off.

Mint State coins have generally been graded accurately across all three models.
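The per-grade breakdown can be tabulated with a few lines of code. This is a minimal sketch that assumes "average deviation" means the mean absolute difference between the AI's grade and the slab grade. The row layout, the function name, and the sample numbers are hypothetical placeholders; the real figures are in the spreadsheet.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (model_label, actual_grade, ai_grade).
# Placeholder numbers only, not the published test results.
SAMPLE_ROWS = [
    ("Dec 2023", 40, 20), ("Dec 2023", 40, 20),
    ("May 2024", 40, 35), ("May 2024", 40, 45),
    ("Dec 2023", 63, 62), ("May 2024", 63, 64),
]


def deviation_by_grade(rows):
    """Return the mean absolute deviation for each (model, grade) pair."""
    grouped = defaultdict(list)
    for model, actual, predicted in rows:
        grouped[(model, actual)].append(abs(predicted - actual))
    return {key: mean(errors) for key, errors in grouped.items()}


for (model, grade), deviation in sorted(deviation_by_grade(SAMPLE_ROWS).items()):
    print(f"{model}  grade {grade}: off by {deviation} points on average")
```

Keeping the grade in the grouping key is what stops a big swing at one grade from being blended into a single overall accuracy figure.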


Although the results look positive, there are limitations to my testing. I only tested 12 different U.S. coins across 12 grade levels. Many grades are missing from the test. I also did not test any world or ancient coins.

My goal going forward is to amass a diverse collection of graded coins to test against. If any readers are planning on throwing away cheap graded slabs, please send them to me instead!

If you want to dive into the data yourself, you can find detailed test results on this Google Sheet. It includes links to photos of the coins tested.

Combined Test Results
ChatGPT-4 [Dec 2023 model] Results for G-4
ChatGPT-4 [Apr 2024 model] Results for G-4
ChatGPT-4o [May 2024 model] Results for G-4

Looking Forward

As you can tell, I'm excited about artificial intelligence and its opportunities for numismatics. I love this community, and I believe that advanced AI models like GPT-4o will make our hobby accessible to a wider audience. Plus, it's cool to play around with.

If I were a betting man, I'd predict that by mid-2025:

  • At least one estate liquidator will start scanning coins, either through phones or glasses, and seeing the estimated value of each coin in real time.
  • At least one coin dealer will kick out a customer for trying to cherry-pick coins with their phone or glasses.
  • At least one third-party grader will announce a project utilizing AI somehow.

The future is bright and I'm excited to see what comes next. Thanks again to Bill Eckberg for his insights on grading. I'll keep my eye out for future AI model releases and will update the community with future test results. In the meantime, I'll be toiling away on the next iteration of Numi…

This update article first appeared on The E-Sylum