Debugging the GPU with telnet

February 16, 2019 (411 words)

This is a series of stories about debugging that involves more than starting up your debugger and stepping through the code. The level beyond debugging with a debugger involves more complicated situations: often you face a class of bugs that are elusive, seemingly random, or very expensive to reproduce.

In these cases you have to prepare either to debug in production, or to have enough extra context and information on hand when a report reaches you that you can start answering questions.

Let me tell you about such a time…

The elusive GPU

On one of the games I worked on, towards the end of production, we had QA play through the whole game every day. They used the most recent discs, which we started burning in the evening. QA came in and played through the game starting early in the morning, while the rest of the staff slept after burning the midnight oil.

During this time, we had a very occasional infinite GPU hang. An infinite GPU hang simply showed up as the game freezing, but if you connected the debugger to the frozen console you found the CPU still running, waiting for the GPU to finish a frame. Essentially you got no information.

Telnet

We of course had a lot of debugging infrastructure in the game: numerous menus and inputs for visual debugging and for toggling systems on and off. But remember, at this point the game is frozen with the CPU stuck waiting on the GPU, and we don’t really know what the GPU is doing.

So after the first time we saw this, I started thinking about what systems we could put in place to find out more the next time we hit the same hang.

In the end I wrote a small telnet server that ran in a separate thread and started as soon as the game booted. The telnet server had just one command: it called a debug function in the SDK to ask the GPU what it was doing and, optionally, to make the GPU produce a core dump.
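To make the shape of this concrete, here is a minimal sketch in Python. It is not the original code, which ran on a console against a proprietary SDK. The function `query_gpu_status`, the command names `gpu` and `gpu dump`, and the reply strings are all hypothetical placeholders. What the sketch does show is the structure: a listener on its own thread, started at boot, that keeps answering even when the main loop is stuck.

```python
import socket
import threading


def query_gpu_status(dump: bool = False) -> str:
    """Hypothetical stand-in for the SDK debug call; a real SDK call would differ."""
    status = "gpu: waiting on frame (placeholder status)"
    if dump:
        status += "; core dump requested"
    return status


def handle_client(conn: socket.socket) -> None:
    """Serve one client: read line-based commands until it disconnects or quits."""
    with conn, conn.makefile("rwb") as stream:
        for raw in stream:
            # Dropping non-ASCII bytes also discards telnet option negotiation.
            command = raw.decode("ascii", errors="ignore").strip().lower()
            if not command:
                continue
            if command == "quit":
                break
            if command == "gpu":
                reply = query_gpu_status()
            elif command == "gpu dump":
                reply = query_gpu_status(dump=True)
            else:
                reply = "unknown command"
            stream.write(reply.encode("ascii") + b"\r\n")
            stream.flush()


def start_debug_server(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Start the listener on a daemon thread, separate from the main loop."""
    listener = socket.create_server((host, port))  # port 0 picks a free port

    def accept_loop() -> None:
        while True:
            try:
                conn, _addr = listener.accept()
            except OSError:
                return  # listener socket was closed
            threading.Thread(target=handle_client, args=(conn,), daemon=True).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return listener
```

Calling `start_debug_server(port=2323)` once at startup (the port number is arbitrary) and later running `telnet 127.0.0.1 2323` and typing `gpu` would print the placeholder status. The point of the design is that the server shares nothing with the frame loop, so a main thread blocked on the GPU cannot take the debug channel down with it.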

A little while later we had the bug again and I got called over. I logged into the console with telnet and asked the GPU what was wrong. I have forgotten the exact details of the actual bug (this was many years ago), but we did find it.

Without some forethought and scaffolding put in place beforehand, we could not have found it.

Resources