Revisiting Numi: Part Two

After I posted my update about Numi online, the president of the Early American Coppers (EAC) group, Bill Eckberg, responded to my analysis and raised severe points about the nature of coin grading and the need for more comprehensive testing. I had a chance to connect with Bill to understand his thinking better. His insights prompted me to reevaluate my approach to evaluating AI coin grading. I agree with Bill; it does not make sense to say Numi saw a 32.47% increase in overall accuracy. Each grade should break down the test results.

While reanalyzing my test results to reflect this new approach, OpenAI released ChatGPT-4o on May 13th. Quite a pleasant surprise!

ChatGPT-4o is a groundbreaking model because it is natively multimodal. Previously, ChatGPT-4 and most other AI models relied on translation systems to communicate between different modalities. When my AI app Numi was using GPT-4, when a user uploaded an image, Numi did not view the image directly. Instead, another AI model translated the image into a detailed text description and returned it to Numi. Numi never analyzed any coin images, only text descriptions of the coins.

ChatGPT-4o is the next leap forward. It can natively process text, vision, and audio inputs within a single neural network. Details are not lost as we no longer need multiple models to translate information between each other. ChatGPT-4o analyzes images.

To date, Numi has run a custom version of ChatGPT-4. Numi was customized with tons of training documentation, such as the grading standards from the ANA and custom instructions on handling different types of coins. However, all these customizations were hampering its accuracy. With ChatGPT-4o's release being accessible and available to everyone, I wanted to know how much power was in every collector's phone. You can try it out yourself, but select ChatGPT-4o from the model drop-down menu!

My third test run on AI coin grading was conducted with just the base model of ChatGPT-4o, with no additional customization. These are the results I found uploading images of coins and asking the AI, "What grade would you give this coin?"

During my testing, I investigated two key areas: data & accuracy.

Data‍

"How much data about a coin does the AI need to yield the most accurate results?"

"Has the amount of data required changed as new AI models are released?."

I tested each coin using 2, 4, 6, 8, and 10 photos to answer this.

The results show that coins with grades 20 and above need fewer photos as AI models advance to achieve the most accurate results. This is encouraging, as it suggests collectors only need to upload so many pictures to achieve reliable grading. However, the trend for coins graded 15 and below needs to be more conclusive.

Accuracy‍

"How accurate is the most popular public AI model [ChatGPT]?"
"How has ChatGPT's accuracy changed throughout major model updates?"

The line chart below illustrates the results of three tests conducted in December 2023, April 2024, and May 2024, using different versions of ChatGPT. The data should be interpreted as how AI progresses at each grade level rather than AI grading.

The trend looks positive for lower-condition coins. With each model release, the average deviation from the actual grade has decreased.

I am pleasantly surprised to see the improvement in XF-40. The first model was 20 points off, but now it's ‘only' 5 points off.

Mint State coins have generally been accurate between all three models.

Limitation‍

Although the results look positive, my testing has limitations. I only tested 12 different U.S. coins across 12 grade levels, and many grades are missing from the test. I also did not test any world or ancient coins.

My goal going forward is to amass a diverse collection of graded coins to test against. If any readers plan to throw away cheap graded slabs, please send them to me instead!

If you want to dive into the data, you can find detailed test results on this Google Sheet. It includes links to photos of the coins tested.

ChatGPT-4 [Dec 2023 model] Results for G-4

ChatGPT-4 [Apr 2024 model] Results for G-4

ChatGPT-4o [May 2024 model] Results For G-4

Looking Forward‍

As you can tell, I'm excited about artificial intelligence and its opportunities for Numismatics. I love this community, and I believe that advanced AI models like GPT-4o will make our hobby accessible to a broader audience. Plus, it's cool to play around with.

If I were a betting man, I'd predict that by mid-2025:

At least one estate liquidator will start scanning coins, either through phones or glasses, and see the estimated value of each coin in real-time.
At least one coin dealer will kick out a customer for trying to cherry-pick coins with their phone or glasses.
At least one third-party grader will announce a project utilizing AI somehow.

The future is bright, and I'm excited to see what comes next. Thanks again to Bill Eckberg for his insights on grading. I'll keep my eye out for future AI model releases and update the community with test results. In the meantime, I'll be toiling away on the next iteration of Numi…

‍

Update: Since this post, I have begun working on Numi v2.0! Check out how I pivoted Numi to become an AI-powered coin-sorting robot here.

‍

^{This update article first appeared on}^{The E-Sylum}