NB: The build will fail until #338 lands, as it links to that chapter.
With special thanks to @spencer-lunarg
Co-authored-by: Spencer Fricke <115671160+spencer-lunarg@users.noreply.github.com>
SaschaWillems
left a comment
I don't have that much experience with embedded programming outside some experiments with Raspberry PIs and Android TVs, but this looks good to me. Only a few minor remarks.
Having a few more links inside the text (e.g. to spec chapters) would make this a bit easier to follow.
| * **Control Lists (CL)**: The VideoCore GPU doesn't use standard command buffers the way a desktop GPU does. Instead, the driver generates "Control Lists" that the hardware's V3D unit executes. |
| * **Contiguous Memory Allocator (CMA)**: On Linux, the GPU requires physically contiguous memory, which is managed by the kernel's CMA pool. If your application crashes or fails to allocate memory despite plenty of RAM being available, you may need to increase the CMA size in `/boot/config.txt`. |
| ** Example: `dtoverlay=vc4-kms-v3d,cma-512` reserves a 512MB CMA pool for the GPU. |
| * **Performance Tipping Points**: The `v3dv` driver is very efficient, but it has specific "tipping points" where it must flush the tile buffer to RAM (a "resolve"). To avoid this, ensure your render passes are structured to fit within the tile buffer limits (which vary based on the number of samples and the format of the attachments). |
Is there a way to query this limit? If so, can this be added?
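To make the CMA remark above concrete, here is a hedged sketch of the relevant `/boot/config.txt` entries (the 512MB figure is purely illustrative; the right pool size depends on your workload and total RAM):

```
# /boot/config.txt -- enable the KMS/V3D stack and reserve a 512MB CMA pool
# (512 is an example value, not a recommendation)
dtoverlay=vc4-kms-v3d,cma-512
```

After a reboot, on kernels built with CMA support, `grep Cma /proc/meminfo` (the `CmaTotal`/`CmaFree` lines) can confirm the pool that was actually reserved.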
| * **Memory Alignment**: Alignment requirements for certain resources (like `minStorageBufferOffsetAlignment`) can be much larger on embedded GPUs than on desktop counterparts. Always check the limits in `VkPhysicalDeviceProperties`. |
| * **Fragmented Memory**: In systems with long uptimes (like industrial controllers), memory fragmentation can lead to allocation failures even when "free" memory appears available. Reusing allocations or using a robust allocator like the Vulkan Memory Allocator (VMA) is highly recommended. |
| == The Direct-to-Display Workflow (VK_KHR_display) |
This should link to https://docs.vulkan.org/refpages/latest/refpages/source/VK_KHR_display.html somewhere
| * **Subgroup Operations**: Use subgroup operations (core in Vulkan 1.1, exposed in GLSL via `GL_KHR_shader_subgroup`) to share data between shader invocations. For example, if you need to calculate an average of pixels in a neighborhood, use subgroup arithmetic instead of writing to and reading from shared memory (`shared` variables). This keeps the data within the GPU's register file, saving significant power. |
| * **Reduced Precision**: Many embedded GPUs are roughly twice as fast at 16-bit arithmetic as at 32-bit. Use `VK_KHR_shader_float16_int8` to access half-precision types. On such hardware this not only doubles throughput but also reduces the number of registers the shader uses, allowing more workgroups to run in parallel. |
| * **Circular Display Optimization**: Since many smartwatches use circular displays within square memory buffers, the corners represent approximately 21.5% of the total area (the geometric difference between a square and its inscribed circle). While Vulkan renders to rectangular surfaces, you can use `discard` or `VK_EXT_discard_rectangles` (if supported) to avoid fragment processing in these non-visible regions, significantly reducing GPU ALU load and power consumption. |
Isn't discard considered expensive? Would using scissors instead be a more viable/faster option?
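As a sketch of the subgroup-arithmetic idea quoted above (GLSL; assumes a Vulkan 1.1 device, and the mapping of one neighborhood to one subgroup-sized workgroup is purely illustrative):

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_arithmetic : enable

// Illustrative: workgroup size chosen to match an assumed subgroup size of 32.
layout(local_size_x = 32) in;

layout(std430, set = 0, binding = 0) buffer Pixels { float luma[]; };
layout(std430, set = 0, binding = 1) buffer Result { float avg[]; };

void main() {
    float v = luma[gl_GlobalInvocationID.x];
    // Sum across the subgroup entirely in registers -- no round trip
    // through shared memory.
    float sum = subgroupAdd(v);
    if (subgroupElect()) {
        avg[gl_WorkGroupID.x] = sum / float(gl_SubgroupSize);
    }
}
```

Note that real code should query `VkPhysicalDeviceSubgroupProperties::subgroupSize` rather than hard-coding 32.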
It looked like we were debating where to put generic embedded information in the TBR chapter. That made me realize we have a gap, so I'm attempting to fill it in.
NB: This should be accepted AFTER the TBR chapter in PR #338 lands.