Pushing Polygons on the Mega Drive
This is a write-up of the polygon renderer used for the Mega Drive demo “Overdrive 2” by Titan, released at the Revision 2017 Demoparty. As the Mega Drive can only display tilemaps, not bitmaps, and does not have the video memory mapped into the CPU address space, this turned out to be an interesting problem. If you have not seen the demo yet, I recommend watching it before continuing. You can find a hardware capture on YouTube:
Ironically, the 3D shown in YouTube’s preview screenshot is not done using the renderer described here; the Mega Drive and ship scenes are, though.
3D Renderer or Video Player?
After our demo was shown at Revision, people naturally started speculating about how we realized several of the effects. Some quickly concluded that a complete 3D renderer in fullscreen, with shading and at that framerate seems implausible and I would not disagree. An alternative theory was that it is just streaming the frames (as nametable and patterns) from the cartridge. That would certainly work and would actually be faster. It has the huge downside of taking way too much ROM space though. A back-of-the-envelope calculation, assuming deduplication of solid-colored tiles and a slightly higher framerate, ends up at roughly half of our 8MB ROM being used for 3D frames. That is about ten times as much as the 3D scenes are using now. Together with the large amount of PCM samples used by Strobe’s awesome soundtrack that would not leave much for the other effects or Alien’s beautiful graphics. While there are some ways to compress frames a bit further, we decided on something more suitable for the 3D scenes.
If a complete 3D renderer is not possible, it might be a good idea to pre-compute parts of it and only do the final steps on the Mega Drive. That would be rasterization of projected, shaded and clipped polygons. The data needed to describe a frame at that stage is quite small. It consists of the polygon vertices in fixed-point screen coordinates and a palette index for the color. Fixed-point is needed as rounding to integer coordinates looks wobbly when animated. Choosing a suitable palette for each frame can be done during the preprocessing. Storing it takes just a few bytes. This is a good starting point, but now we need to figure out how to make drawing polygons fast.
There are three optimizations used to speed up polygon rendering:
- Avoiding overdraw,
- Drawing from left to right,
- Quickly drawing tiles.
I will go through them in order and explain the details.
There are two problems with just a list of projected polygons. First, the order in which they are drawn matters. Assuming no intersecting polygons, sorting them from back to front gives the correct result. But this still leaves us with a second problem: overdraw. This is not a problem in the sense that the rendering breaks, but rather in that a pixel is wastefully drawn multiple times, discarding the previous value. Having to draw into a tilemap amplifies this problem.
The solution is to split all polygons that intersect or overlap, throwing away any resulting polygon that is occluded by another polygon. This leaves us with a partition or, more specifically, a tessellation of the view plane. As a further optimization adjacent polygons that have the same color can be joined, as their common edge(s) are not visible. This can result in quite complex, even non-convex, polygons. Apart from a small exception described in the next section, this is not a problem though.
Drawing from Left to Right
As we now have a tessellation of the view plane, drawing all polygons would compute and draw each edge twice; once for each adjacent polygon. This can be avoided. If we draw polygons strictly from left to right, we can use a single table to store all edges that have just one adjacent polygon drawn. The table is just the x-position of the rightmost drawn edge for each scanline. I am calling that table the “fringe table”. It is initialized with all zeros, i.e. the left edge of the view.
There is a small problem though with some non-convex polygons. If a polygon is U-shaped it is impossible to draw strictly from left to right; the enclosed polygon would have to go in between. Rotated by 90 degree, i.e. a C-shaped polygon, it is not a problem though. This is solved by breaking U-shaped polygons into multiple polygons that do not have gaps in any scanline. As a small optimization those breaks are preferably inserted on a tile boundary. Why this is an advantage will become clear later.
When drawing polygons like this, there is no need to even store the left side of a polygon. Whenever a polygon is drawn, the left edge is already stored in the fringe table. So apart from the computation time saved, we also save storage space.
So far we only avoided re-computing the left edges of a polygon; we still need to draw them. In fact there is no way to avoid drawing the left edges, but we can save time drawing the right edges. As we know anything beyond the right edges will be overdrawn, we do not need to be exact while drawing. An easy way to save some time while drawing into a tile map is to completely fill any tile containing a right edge, leaving it to the adjacent polygon to draw the exact edge. This brings us to the next optimization.
Quickly Drawing Tiles
The final challenge to overcome is efficiently drawing to a tilemap. As a first step the line drawing is decoupled from the handling of the tilemap. This is done by introducing a copy of the fringe table, called the “outline table”. In between drawing polygons those two tables are always the same. The line drawing routine updates the outline table to contain the right side of the polygon. This is done for all right side edges of the polygon before any actual drawing to the tilemap happens. Afterwards the polygon to draw is exactly the area delimited to the left by the fringe table and to the right by the outline table.
The line drawing routine also outputs the topmost and bottommost y-coordinates of the polygon. Those are rounded outwards to the next tile boundary, i.e. a multiple of 8 pixels. This is safe to do, as the fringe and outline table are identical for scanlines outside of the polygon area, indicating that nothing should be drawn there. This allows us to process the polygon tile-row by tile-row without a special case for partial tile-rows.
The tile-rows are processed from top to bottom. First we compute three x-coordinates for each tile-row. The leftmost and rightmost value in the fringe as well as the rightmost value in the outline of the tile-row. Those span the area where the polygon needs to be drawn and also divide the tile-row into two segments. The left segment contains edges of already drawn polygons while the right segment does not. To avoid special cases, those values are rounded to tile boundaries too.
Both segments are then processed tile by tile from left to right. The left segment is drawn before the right segment, but we will look at the right one which does not contain left edges or any already drawn polygons first. All tiles in the right segment can be completely filled with the color of the current polygon. This is legal as it will not draw over existing polygons and will only overshoot to the right. The overshoot will be fixed by drawing subsequent polygons. To speed this up even more, we precompute a solid-colored pattern for each of the 16 colors. This means we can draw an 8x8 tile in the right segment by updating a single word in the nametable to point to the precomputed pattern.
Before this happens, though, the left segment is drawn. Although we draw it first, we can be sure that every tile of the left segment was a tile of a right segment of a previous polygon. This might sound counterintuitive, but it is possible as the left segment can consist of zero tiles, which it will for the leftmost polygons.
For each tile of the segment there are two cases we need to consider. One is that the tile was only drawn as part of a right segment so far, the other is that it was also part of one or more polygon’s left segments.
In the first case, the nametable entry for the tile points to one of the precomputed solid patterns. In the second case, the nametable entry points to an individual pattern just for this tile.
If the tile was solid so far, we need change the nametable to point to an individual pattern. This is done by using a simple bump allocator that allocates a continuous range of tiles. Having a fixed pattern address for each tile would probably be faster here, but it would also mean that the used patterns are scattered throughout memory. This is a huge downside on the mega drive as the VRAM is not memory mapped. In fact, while drawing a frame, we are not updating the pattern data or nametable at all, but a shadow copy in work memory. After we are done, a DMA transfer is used to quickly copy it over to VRAM. At that point having a compact consecutive memory area containing all patterns saves a lot more cycles than using fixed pattern addresses here.
After allocating a pattern, we need to draw it. A newly allocated pattern always needs to be drawn in two colors. The color of the previously solid tile and the color of the current polygon. Each line of the pattern will have the old color on the left and the new color on the right, potentially consisting of just one of those colors. The fringe table tells us where the polygon edge is, i.e. where the color change must be.
Conveniently, a line of a pattern, consisting of 8 pixels, 4 bit each, perfectly fits into a 32 bit register of the 68000 CPU. This allows us to apply a mask telling us where each color is supposed to go using a single
and.l (a0, d0), d1 instruction. The register
d0 contains the value coming from the fringe table (multiplied by 4 to do long-word-wise addressing) and
d1 contains the data to mask. The register
a0 points into a special table. The table looks like this, where each line are the bits of a 32-bit long word.
11111111111111111111111111111111 11111111111111111111111111111111 ... many repetitions ... 11111111111111111111111111111111 11111111111111111111111111111111 00001111111111111111111111111111 00000000111111111111111111111111 00000000000011111111111111111111 00000000000000001111111111111111 00000000000000000000111111111111 00000000000000000000000011111111 00000000000000000000000000001111 00000000000000000000000000000000 00000000000000000000000000000000 ... many repetitions ... 00000000000000000000000000000000 00000000000000000000000000000000
Depending on the x-coordinate of the tile
a0 points to a different position in that table. By padding the table with enough all-one or all-zero words, there is no need to do any clipping to the tile boundary, which greatly speeds up drawing. Together with a smaller table containing solid colored lines and some bit twiddeling this completes the newly allocated pattern drawing routine.
In the case of a pattern that was already allocated, the same approach is used. The only difference is that the mask isn’t used to mask between two colors, but between the old data of a line and the new color.
After completing a row the corresponding entries of the outline table are copied into the fringe table and the next tile row is processed. When all tile rows are drawn the fringe table is the same as the outline table again, ready for the next polygon to be drawn.
Putting it All Together
This concludes the description of our polygon renderer routine. You can see the Mega Drive implementation in action below. This animation was captured from an emulator running a patched version of the routine. The patched version copies the nametable and pattern data into VRAM after every polygon and waits for the next frame. The palette is updated in the end, resulting in the false colors while drawing is in progress. The garbage tiles that appear sometimes are nametable entries that were not yet touched in the current frame. Those entries might point to already allocated and redrawn patterns.
As an overview for anyone who wants to implement this or a similar routine I’ve summarized it using pseudocode:
routine draw_polygon(right_edges, color): foreach edge in right_edges: update outline, min_y and max_y using line drawing routine round min_y and max_y to tiles for each row within min_y ... max_y: min_fringe = min(fringe[y] for y in current row) max_fringe = max(fringe[y] for y in current row) max_outline = max(outline[y] for y in current row) round min_fringe, max_fringe and max_outline to tiles for each column within min_fringe ... max_fringe: column_line_table = line_table + column * 8 entries if nametable[column, row] is solid: old_color = color of nametable[column, row] pattern = nametable[column, row] = alloc_pointer increment alloc_pointer old_pixels = color_table[old_color] new_pixels = color_table[color] for y in current row: mask = column_line_table[fringe[y]] pattern[y] = (new_pixels & mask) | (old_pixels & ~mask) else: pattern = nametable[column, row] new_pixels = color_table[color] for y in current row: old_pixels = pattern[y] mask = column_line_table[fringe[y]] pattern[y] = (new_pixels & mask) | (old_pixels & ~mask) for each column within max_fringe ... max_outline: nametable[column, row] = solid pattern for color for y in current row: fringe[y] = outline[y]
The actual implementation consists of around 600 lines of 68000 assembly, making some use of macros and repeat statements. The preprocessor was implemented in Rust, which I can highly recommend. The implementation and fine-tuning took somewhere around 3 weeks plus some evening coding. Most of it was done during the last summer. Coming up with and improving the concept behind this was done on and off over many years, not targeting the mega drive in particular.
Working on and releasing Overdrive 2 was an awesome experience. I want to thank everyone involved for making this possible.