Testing AI Models: Roman Game Benchmark

Grok3 Think mode is impressive. Since ChatGPT came out 27 months ago, I have run my own test on every new model release: a Roman game that 1) is hard to draw and 2) has multiple states.

Both challenges have been intractable for models so far. Grok3 did it in 3 prompts, one of which was to fix an esoteric bug that I found by chance. When I asked for a version that plays against the human user, it not only delivered a good one in a single prompt, but also beautified it in a creative manner.

Feel free to ask me for the code.