January 9, 2024
3 Minutes

Numi: Stepping Back & Lessons Learned

If you want to read a complete retrospective of Numi, please check out Numi's project page.

--------------------------------------

I revamped Numi to v1.33 to use the American Numismatic Association's official grading standards. My goal was to have Numi focus solely on technical grading, which is much more objective than market grading.

Below are the results for coins ranging from G-4 to PR-70.

G-4 Test
VG-6 Test
F-12 Test
F-15 Test
VF-20 Test
XF-40 Test
MS-63 Test
MS-65 Test
MS-66 Test
PR-70 Test

As you can see, the results were quite off. Numi became increasingly accurate as the condition of the coin improved.

Numi struggles quite a bit with recognizing wear on a coin. Interestingly enough, this was the exact issue Compugrade experienced back in the '90s, even though its algorithms were completely different from today's AI large language models.

Given Numi's failure to achieve consistent results, I am putting the project on ice until the next OpenAI GPT model is released. I'm also interested in testing Numi with Google's Gemini Ultra model once it is released sometime in 2024. Once both are available, I will revisit these tests and see how much Numi has improved.

Observations & Lessons Learned

  • As many people online are reporting, GPT-4's performance has degraded substantially over the past few months. I found myself constantly arguing with the AI to get it to follow most of my custom instructions.
  • The model can follow custom instructions well for the first 8-10k tokens, but perception and logic fall off a cliff after that. Even with highly tailored instructions and clear steps to follow, Numi would veer off course when it had to process too much info. This was incredibly frustrating, as GPT-4 Turbo is advertised with a 128k-token context window [i.e. room for roughly 96k words of English text, at the common rule of thumb of about 0.75 words per token].
  • The vision model needs a major boost. Its optical character recognition (OCR) was great at picking up text on coins, and I rarely had it misread the text. But it would often fail to pick up the mint mark on more worn coins. The vision model could generally tell when a coin was more worn, but it failed to apply a correct grade.
  • Numi was excellent at identifying non-ancient coins and was quite often great at identifying tokens. But it struggled immensely with ancient coins. The only ancient coins it could identify were coins in very high condition.
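
To put the context-window observation above in numbers, here is a minimal sketch of the token-to-word arithmetic. It assumes the common rule of thumb of about 0.75 English words per token; the real ratio depends on the tokenizer and the text, and the function and constant names are my own, not part of Numi.

```python
# Rough token/word arithmetic for the context-window observation.
# ASSUMPTION: ~0.75 English words per token (a rule of thumb, not an exact figure).
WORDS_PER_TOKEN = 0.75

ADVERTISED_LIMIT_TOKENS = 128_000            # GPT-4 Turbo's advertised context window
OBSERVED_RELIABLE_TOKENS = (8_000, 10_000)   # range where instructions were followed well


def tokens_to_words(tokens: int) -> int:
    """Estimate how many English words fit in a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def reliable_fraction(reliable_tokens: int, limit_tokens: int) -> float:
    """Share of the advertised window that behaved reliably."""
    return reliable_tokens / limit_tokens


if __name__ == "__main__":
    print("Advertised window in words:", tokens_to_words(ADVERTISED_LIMIT_TOKENS))
    for tokens in OBSERVED_RELIABLE_TOKENS:
        share = reliable_fraction(tokens, ADVERTISED_LIMIT_TOKENS)
        print(f"{tokens} tokens is about {tokens_to_words(tokens)} words, "
              f"or {share:.1%} of the advertised window")
```

Under that assumption, the range where Numi stayed on course works out to well under a tenth of the advertised window, which matches how frustrating the drop-off felt in practice.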

Numi was one heck of a fun project to work on. I got to apply my AI knowledge to my hobby, and I'm excited to see how future models perform. I got to talk with so many members of the coin community and hear a ton of diverse views on coin grading. Somehow I even ended up on the Coin World Podcast talking about Numi!

AI is a force to be reckoned with and it's going to be an invaluable resource for researchers and those trying to learn more about their coins. I left this project feeling that AI technical grading would eventually work. Time will tell.