Muddling through SDL GPU Part 1 - Getting Started

Jumping in with the absolute basics here, let’s get a window on screen, the GPU API initialized, and clear the render target to a known color. The source code for this post is under the Example1-GettingStarted directory in the project repo. Here’s the end goal:

An empty macOS window with a green background

We’ll start with a single Example1.c file, using SDL3’s new main functions. We’ll need to implement four functions: SDL_AppInit, SDL_AppEvent, SDL_AppIterate, and SDL_AppQuit. And hey, that gives me a convenient structure for this post, so let’s go.

Initialization

We’ll need to:

  • Initialize SDL itself, at least the Video and Event subsystems.
  • Create both a GPU device and a Window, and attach them together.
  • And to practice good hygiene, we’ll keep references to these in a context structure instead of just making them globals.

// Before including SDL_main.h, define this to enable the new
// application lifecycle stuff.
#define SDL_MAIN_USE_CALLBACKS

// Pull in SDL3, obviously.
#include <SDL3/SDL.h>

// Include SDL_main.h in the file where you define your main
// function(s).
#include <SDL3/SDL_main.h>

// We'll have some things we want to keep track of as we move
// through the lifecycle functions. Globals would be fine for
// this example, but SDL gives you a way to pipe a data structure
// through the functions too, so we'll use that.
//
// For now, we need to keep track of the window we're creating
// and the GPU driver device.
typedef struct AppContext {
  SDL_Window* window;
  SDL_GPUDevice* device;
} AppContext;

// SDL_AppInit is the first function that will be called. This is
// where you initialize SDL, load resources that your game will
// need from the start, etc.
SDL_AppResult SDL_AppInit(
    // Allows you to return a data structure to pass through
    void** appState,

    // Normal main argc & argv
    int argc, char** argv)
{
  // This isn't strictly necessary, but if you provide a little
  // bit of metadata here SDL will use it in things like the
  // About window on macOS.
  SDL_SetAppMetadata("GPU by Example - Getting Started", "0.0.1",
      "net.jonathanfischer.GpuByExample1");

  // Initialize the video and event subsystems
  if (!SDL_Init(SDL_INIT_VIDEO | SDL_INIT_EVENTS)) {
    SDL_LogError(SDL_LOG_CATEGORY_APPLICATION,
        "Couldn't initialize SDL: %s", SDL_GetError());
    return SDL_APP_FAILURE;
  }

  // Create a window. I'm creating a high pixel density window
  // because without that, I was getting blurry text on macOS.
  // (text comes in a later post, promise.)
  SDL_WindowFlags windowFlags =
      SDL_WINDOW_HIGH_PIXEL_DENSITY | SDL_WINDOW_RESIZABLE;
      
  SDL_Window* window = SDL_CreateWindow(
      "GPU by Example - Getting Started", 800, 600, windowFlags);

  if (window == NULL) {
    SDL_LogError(SDL_LOG_CATEGORY_APPLICATION,
        "Couldn't create window: %s", SDL_GetError());
    return SDL_APP_FAILURE;
  }

  // Next up, let's create a GPU device. You'll need to tell the
  // API up front what shader languages you plan on supporting.
  // SDL looks through its list of drivers in "a reasonable
  // order" to pick which one to use. Fun surprise here: on
  // Windows, it's going to prefer Vulkan over Direct3D 12 if
  // it's available. Here, we're enabling Vulkan (SPIRV),
  // Direct3D 12 (DXIL), and Metal (MSL).
  SDL_GPUShaderFormat shaderFormats =
      SDL_GPU_SHADERFORMAT_SPIRV | 
      SDL_GPU_SHADERFORMAT_DXIL |
      SDL_GPU_SHADERFORMAT_MSL;

  SDL_GPUDevice* device = SDL_CreateGPUDevice(shaderFormats,
      false, NULL);
  if (device == NULL) {
    SDL_LogError(SDL_LOG_CATEGORY_APPLICATION,
        "Couldn't not create GPU device: %s", SDL_GetError());
    return SDL_APP_FAILURE;
  }

  // Just so we know what we're working with, log the driver that
  // SDL picked for us.
  SDL_Log("Using %s GPU implementation.",
      SDL_GetGPUDeviceDriver(device));

  // Then bind the window and GPU device together
  if (!SDL_ClaimWindowForGPUDevice(device, window)) {
    SDL_Log("SDL_ClaimWindowForGPUDevice failed: %s",
        SDL_GetError());
    return SDL_APP_FAILURE;
  }

  // By default, SDL GPU enables VSYNC, which is generally what I
  // want. If you want to change it, now is the time to do that;
  // look at SDL_SetGPUSwapchainParameters in the documentation.
  // https://wiki.libsdl.org/SDL3/SDL_SetGPUSwapchainParameters

  // Last up, let's create our context object and store pointers
  // to our window and GPU device. We stick it in the appState
  // argument passed to this function and SDL will provide it in
  // later calls.
  AppContext* context = SDL_malloc(sizeof(AppContext));
  context->window = window;
  context->device = device;
  *appState = context;

  // And that's it for initialization.
  return SDL_APP_CONTINUE;
}

The Main Loop

In earlier versions of SDL, once everything was initialized we’d start our main loop: poll for input and windowing events, update the game, draw, and so on. With SDL3’s main callbacks, we instead implement a couple of functions and let SDL drive the loop.

Once a frame, SDL will call SDL_AppIterate. There are no guarantees about exactly how often it gets called, but it’s intended to run as fast as possible, or in step with the display refresh rate. You’ll have to track how much time has passed yourself; I’ll start doing that in the next post.

The basic per-frame work we need to do with SDL GPU is:

  • Acquire a command buffer, which is what we use to submit drawing commands to the device.
  • Wait for the primary render target to be available; SDL calls this the Swapchain Texture. I believe this is where the vsync wait actually happens.
  • Begin a render pass.
  • Submit any drawing commands for the pass.
  • End the pass.
  • Possibly repeat if you have more passes.
  • Finally submit the command buffer to the device.

At this point we’re just clearing the framebuffer to a known color, so we’ll begin and end a single render pass with no drawing commands in it.

SDL_AppResult SDL_AppIterate(void* appState)
{
  // Our AppContext instance is passed in through the appState
  // pointer.
  AppContext* context = (AppContext*)appState;

  // Generally speaking, this is where you'd track frame times,
  // update your game state, etc. I'll be doing that in later
  // posts.

  // Once you're ready to start drawing, begin by grabbing a
  // command buffer and a reference to the swapchain texture.
  SDL_GPUCommandBuffer* cmdBuf;
  cmdBuf = SDL_AcquireGPUCommandBuffer(context->device);
  if (cmdBuf == NULL) {
    SDL_Log("SDL_AcquireGPUCommandBuffer failed: %s",
        SDL_GetError());
    return SDL_APP_FAILURE;
  }

  // As I understand it, _this_ is where it's going to wait for
  // Vsync, not in the loop that calls SDL_AppIterate.
  SDL_GPUTexture* swapchainTexture;
  if (!SDL_WaitAndAcquireGPUSwapchainTexture(cmdBuf,
          context->window, &swapchainTexture, NULL, NULL)) {
    SDL_Log("SDL_WaitAndAcquireGPUSwapchainTexture: %s",
        SDL_GetError());
    return SDL_APP_FAILURE;
  }

  // With the command buffer and swapchain texture in hand, we
  // can begin and end our render pass
  if (swapchainTexture != NULL) {
    // There are a lot more options you can set for a render
    // pass, see SDL_GPUColorTargetInfo in the SDL documentation
    // for more.
    // https://wiki.libsdl.org/SDL3/SDL_GPUColorTargetInfo
    SDL_GPUColorTargetInfo targetInfo = {
        // The texture that we're drawing in to
        .texture = swapchainTexture,

        // Whether to cycle that texture. See
        // https://moonside.games/posts/sdl-gpu-concepts-cycling/
        // for more info
        .cycle = true,

        // Clear the texture to a known color before drawing
        .load_op = SDL_GPU_LOADOP_CLEAR,

        // Keep the rendered output
        .store_op = SDL_GPU_STOREOP_STORE,

        // And here's the clear color, a nice green.
        .clear_color = {0.16f, 0.47f, 0.34f, 1.0f}};

    // Begin and end the render pass. With no drawing commands,
    // this will clear the swapchain texture to the color
    // provided above and nothing else.
    SDL_GPURenderPass* renderPass;
    renderPass = SDL_BeginGPURenderPass(cmdBuf, &targetInfo,
        1, NULL);
    SDL_EndGPURenderPass(renderPass);
  }

  // And finally, submit the command buffer for drawing. The
  // driver will take over at this point and do all the rendering
  // we've asked it to.
  SDL_SubmitGPUCommandBuffer(cmdBuf);

  // That's it for this frame.
  return SDL_APP_CONTINUE;
}

Handling events

SDL will handle polling for input and windowing events for us, and when one shows up, it’ll call SDL_AppEvent. This isn’t terribly interesting yet; all we care about is whether it’s time to close the application.

SDL_AppResult SDL_AppEvent(void* appState, SDL_Event* event)
{
  // SDL_EVENT_QUIT is sent when the main (last?) application
  // window closes.
  if (event->type == SDL_EVENT_QUIT) {
    // SDL_APP_SUCCESS means we're making a clean exit.
    // SDL_APP_FAILURE would mean something went wrong.
    return SDL_APP_SUCCESS;
  }

  // For convenience, I'm also quitting when the user presses the
  // escape key. It makes life easier when I'm testing on a Steam
  // Deck.
  if (event->type == SDL_EVENT_KEY_DOWN &&
      event->key.key == SDLK_ESCAPE) {
    return SDL_APP_SUCCESS;
  }

  // Nothing else to do, so just continue on with the next frame
  // or event.
  return SDL_APP_CONTINUE;
}

Cleaning up

Finally, shutting down. This’ll be called once the app is exiting, after SDL_APP_SUCCESS or SDL_APP_FAILURE is returned from SDL_AppInit, SDL_AppEvent, or SDL_AppIterate. It gives you a chance to gracefully shut things down.

void SDL_AppQuit(void* appState, SDL_AppResult result)
{
  AppContext* context = (AppContext*)appState;

  // Just cleaning things up, making sure we're working with
  // valid pointers as we go.
  if (context != NULL) {
    if (context->device != NULL) {
      if (context->window != NULL) {
        SDL_ReleaseWindowFromGPUDevice(context->device,
            context->window);
        SDL_DestroyWindow(context->window);
      }

      SDL_DestroyGPUDevice(context->device);
    }

    SDL_free(context);
  }

  SDL_Quit();
}

That feels like a lot of work just to clear the screen, but it also sets up all the scaffolding for issuing actual drawing commands. Next post: let’s get a triangle on the screen, yeah? To do that I’ll need to get shaders in there. 😱


Muddling through SDL GPU - The Plan

Ok, so. I want to learn how to use a modern GPU API, and since SDL3 was just released with a new GPU API abstraction over the 3 major GPU APIs (Direct3D 12, Vulkan, and Metal), it seems like a good time! Except, it’s maybe a little too new: the documentation on it is great, but it feels like it assumes you already know how to work with one of the big 3 APIs. I don’t. I never really moved beyond OpenGL 1.2. Maybe 1.1? Either way, my graphics programming knowledge is more than 20 years out of date.

The tutorials I’ve been able to find are good, if you already understand the concepts. Moonside Games in particular has some good information, and there’s an example repository, but I need something a little more basic. I’ve never written a shader or assembled a pipeline; I don’t even think I’ve used vertex buffers.

So I started trying to learn how to do all of this. I was mostly interested in Metal (I use macOS 99% of the time) so I started by translating Metal by Example by Warren Moore to SDL’s GPU API and trying to get it working on macOS, Windows, and Linux. I started writing down some notes as I went, and that ballooned into “I should blog this”, and here we are.

Anyway, I have a general outline in mind, which is:

  1. Part 1 - Getting Started. How to initialize SDL, get a window on screen, hook it up to the GPU subsystem, and clear the window to a solid color.
  2. Part 2 - Drawing Primitives. Set up a basic GPU pipeline with shaders that do almost nothing, draw a single triangle to the window.
  3. Part 2.5 - Compiling Shaders. What shader languages and formats we need and how to compile them.
  4. Part 3 - Uniforms. Pass extra parameters to your shaders, use them to animate and change things on the fly.
  5. Part 4 - Texturing. Load a texture, paint your geometry with it.
  6. Part 5 - Text rendering with SDL_ttf. Take the previous parts and put them together to draw something meaningful. Maybe add in an extra render target?
  7. Part 6 - Lighting
  8. Part 7 - Load and render a model, with some sort of animation done in the vertex shader.

There’s a repository to go along with this series available at mohiji/gpu-by-example on GitHub. I’m going to organize it into one example project per part (except the Compiling Shaders one) and provide projects to run on Windows, macOS, and Linux.

I do want to stress again: I’m learning as I go. Let’s have fun with it!


Summer adventure games

It’s the first weekday of summer vacation, and we don’t quite have childcare worked out yet, so my poor kids are stuck in the office with me this afternoon.

Kayla trying not to crack a smile

Kayla failing to not crack a smile

A few things come to mind. First, thank goodness for the Switch. Kayla’s playing Breath of the Wild quietly. Second, yay, there’s a new Monument Valley out today! That helps a ton.

Mostly though, I think I need to load up that computer in the background with old Sierra games. One of my elementary school summers (I think it was between 4th and 5th grade?) I remember spending a bunch of time at my friend’s dad’s dental office playing King’s Quest 3 and the Colonel’s Bequest on their computer.

I need to find some good adventure games for the kids. Good ones that take hours and hours to play though. :D


Silly benchmarks

It started with me being curious: if I use an Integer in Kotlin, am I going to be paying a penalty for using a boxed primitive? So I wrote some silly benchmarks to confirm. First, in Java:

public class SumTestPrimitive {
  public static void main(String[] args) {
    long sum = 0;
    for (long i = 1; i <= Integer.MAX_VALUE; i++) {
      sum += i;
    }
    
    System.out.println(sum);
  }
}

public class SumTestBoxed {
  public static void main(String[] args) {
    Long sum = 0L;
    for (Long i = 1L; i <= Integer.MAX_VALUE; i++) {
      sum += i;
    }
    
    System.out.println(sum);
  }
}

On my system (a mid-2011 MacBook Pro, 2.4 GHz i5) the primitive version takes 1.77 seconds, and the boxed version takes 20.8 seconds. Then in Kotlin:

fun main(args: Array<String>) {
	var sum = 0L
	var i = 0L
	while (i <= Integer.MAX_VALUE) {
		sum += i
		i += 1L
	}

	println(sum)
}

The Kotlin one takes 1.81 seconds. Tiny bit slower than the Java primitive one, but that’s probably just due to needing a little more time for Kotlin’s runtime to load. Kotlin does unboxed primitives properly, yay!

Now I’m curious though: how do the other languages I use on the regular perform? Let’s try Clojure first, both a straightforward implementation and one tailored to match the Java one better:

(defn sum-test-straightforward []
  (loop [sum 0
         i 0]
    (if (<= i Integer/MAX_VALUE)
      (recur (+ sum i) (+ i 1))
      sum)))

(defn ^long sum-test-gofast []
  (loop [sum 0
         i 0]
    (if (<= i Integer/MAX_VALUE)
      (recur (unchecked-add sum i) (unchecked-add i 1))
      sum)))

sum-test-straightforward took 5.1 seconds, and sum-test-gofast 1.69 seconds. The gofast version looks comparable to the Java one, but is probably a little slower in practice: I ran these at a REPL, so the Clojure numbers don’t include any startup time, while the Java ones do.

Ok, how about Common Lisp? I can think of 3 approaches to take off the top of my head.

;; 2147483647 is the same value as Java's Integer/MAX_VALUE.

(defun sum-test-iterative ()
  (let ((sum 0))
    (dotimes (i 2147483647)      
      (setf sum (+ sum i)))
    sum))

(defun sum-test-recursive ()
  (labels ((sum-fn (sum i)
             (if (<= i 2147483647)
                 (sum-fn (+ sum i) (+ 1 i))
                 sum)))
    (sum-fn 0 0)))

(defun sum-test-loop ()
  (loop for i from 1 to 2147483647 sum i))

Using ClozureCL, all 3 of these perform abysmally:

  • sum-test-iterative: 62.98 seconds, 2.82 of which were GC time. 20 GiB allocated.
  • sum-test-recursive: 74.11 seconds, 3.52 of which were GC. 20 GiB allocated.
  • sum-test-loop: 50.7 seconds, 2.58 of which were GC. 20 GiB allocated.

SBCL does much better off the bat, but still not great:

  • sum-test-iterative: 7.7 seconds, no allocation
  • sum-test-recursive: 17.1 seconds, no allocation
  • sum-test-loop: 7.63 seconds, no allocation

Adding some type annotations and optimize flags helped SBCL, but ClozureCL’s times stayed the same:

(defun sum-test-iterative ()
  (declare (optimize speed (safety 0)))
  (let ((sum 0))
    (declare ((signed-byte 64) sum))
    (dotimes (i 2147483647)
      (setf sum (the (signed-byte 64) (+ sum i))))
    sum))

SBCL’s sum-test-iterative drops down to 3.13 seconds, still no allocation. No change on Clozure. I’m probably doing something wrong here, but it’s not clear to me what. The disassembly of sum-test-iterative on SBCL shows that there’s still a bignum allocation path in there: maybe the problem is that SBCL’s fixnums are only 62 bits wide thanks to pointer tagging, so a full 64-bit integer can’t stay unboxed?

* (disassemble 'sum-test-iterative)

; disassembly for SUM-TEST-ITERATIVE
; Size: 69 bytes. Origin: #x1002AF97D7
; 7D7:       31C9             XOR ECX, ECX                    ; no-arg-parsing entry point
; 7D9:       31C0             XOR EAX, EAX
; 7DB:       EB2D             JMP L3
; 7DD:       0F1F00           NOP
; 7E0: L0:   488BD1           MOV RDX, RCX
; 7E3:       48D1F9           SAR RCX, 1
; 7E6:       7304             JNB L1
; 7E8:       488B4AF9         MOV RCX, [RDX-7]
; 7EC: L1:   488BD0           MOV RDX, RAX
; 7EF:       48D1FA           SAR RDX, 1
; 7F2:       4801D1           ADD RCX, RDX
; 7F5:       48D1E1           SHL RCX, 1
; 7F8:       710C             JNO L2
; 7FA:       48D1D9           RCR RCX, 1
; 7FD:       41BB00070020     MOV R11D, #x20000700            ; ALLOC-SIGNED-BIGNUM-IN-RCX
; 803:       41FFD3           CALL R11
; 806: L2:   4883C002         ADD RAX, 2
; 80A: L3:   483B057FFFFFFF   CMP RAX, [RIP-129]              ; [#x1002AF9790] = FFFFFFFE
; 811:       7CCD             JL L0
; 813:       488BD1           MOV RDX, RCX
; 816:       488BE5           MOV RSP, RBP
; 819:       F8               CLC
; 81A:       5D               POP RBP
; 81B:       C3               RET
NIL

Next up, Swift:

var sum: UInt64 = 0
var count: UInt64 = 2147483647

for i: UInt64 in 0 ..< count {
    sum = sum + i
}

print(sum)

Without optimizations, 16 minutes 50 seconds. Holy shit.

With optimizations, 1.11 seconds.

Ok, last one, C:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
	long sum = 0;
	const long count = strtol(argv[1], NULL, 10);
	for (long i = 0; i <= count; i++) {
		sum += i;
	}
	printf("%ld", sum);
	return 0;
}

Why take the count parameter from the command line? Because Clang cheats. If I use the constant in there, it’s smart enough to precalculate the whole thing at compile time and just return the final result.

Without optimizations:

solace:sum-tests jfischer$ clang -o sum-test SumTest.c 
solace:sum-tests jfischer$ time ./sum-test 2147483647
2305843008139952128
real	0m8.247s
user	0m8.190s
sys	0m0.035s

With optimizations:

solace:sum-tests jfischer$ clang -Os -o sum-test SumTest.c
solace:sum-tests jfischer$ time ./sum-test 2147483647
2305843008139952128
real	0m0.006s
user	0m0.002s
sys	0m0.002s

It turns out Clang still cheats even when the loop count comes from outside. I’m pretty sure it recognizes what the loop is doing and turns it into Gauss’ closed-form formula for an arithmetic series. It doesn’t matter what loop count I give it; with optimizations it always takes the same amount of time.

I can’t read/write assembly, but playing around on godbolt.org makes it look like that’s the case: https://godbolt.org/g/FmL66q. (There’s no loop in the disassembly.) And I can’t figure out how to trick it into not doing that, so I’ll call it quits for now.


Sneaking Clojure in - Part 2

I ended up turning to a Clojure REPL to solve an issue in that project I totally didn’t sneak Clojure into before, and realized I’d done some things the hard way last time.

First up: you don’t need to create and compile a Java class from Clojure to call into Clojure code from Java. If I had actually read the Java Interop reference guide on Clojure.org, I would have noticed that there’s a section on calling Clojure from Java. It’s much, much easier.

If I define this namespace/function:

(ns project.util)

(defn get-my-thing []
  {:my :thing})

I can call it like so:

// In Java code:

// First, find the require function. Then use it to load the project.util namespace
IFn require = Clojure.var("clojure.core", "require");
require.invoke(Clojure.read("project.util"));

// After project.util is loaded, we can look up the function and call it directly.
IFn getMyThing = Clojure.var("project.util", "get-my-thing");
getMyThing.invoke();

Easy peasy. I don’t have to jump through the gen-class hoops, and bonus! I don’t have to compile my Clojure code ahead of time. I just need to make sure the source files are on the class path.

You should of course compile your Clojure code if you’re distributing an application built on it. It’ll load faster, plus you might not want it readable.

What I specifically did want to hook into that project that I totally wasn’t sneaking Clojure into is a REPL: I want to be able to poke directly at the application’s state while it’s running. To do that, I’ll need to make sure that tools.nrepl is available on the classpath, and require/launch it from within the application.

I could probably use Clojure 1.8’s socket server repl instead, but I plan on using Cider to talk to it, so nrepl’s a better choice.

In Java code:

public static void launchNrepl(int port) {
  try {
    IFn require = Clojure.var("clojure.core", "require");
    require.invoke(Clojure.read("clojure.tools.nrepl.server"));

    // Note: passing "::" as the :bind parameter makes this listen on all interfaces.
    // You might not want that.
    IFn startServer = Clojure.var("clojure.tools.nrepl.server", "start-server");
    startServer.invoke(Clojure.read(":bind"), "::", Clojure.read(":port"), port);
  }
  catch (Exception e) {
    // log the error
  }
}

In my theoretical project where I totally didn’t do this I also load in a namespace of helper code I’ve written to wrap around the Java objects we already have written.