Add embedded programming chapter #367

Open
gpx1000 wants to merge 2 commits into KhronosGroup:main from gpx1000:embedded_programming
gpx1000 (Contributor) commented Mar 16, 2026

It looked like we were debating where to put generic embedded information in the TBR chapter. That made me realize we have a gap, so this PR attempts to fill it.

NB: This should be accepted AFTER the TBR chapter in PR #338 lands.

gpx1000 mentioned this pull request Mar 16, 2026

gpx1000 (Contributor, Author) commented Mar 16, 2026

NB: The build will fail until #338 lands as it links to that chapter.

spencer-lunarg previously approved these changes Mar 17, 2026
With special thanks to @spencer-lunarg

Co-authored-by: Spencer Fricke <115671160+spencer-lunarg@users.noreply.github.com>
SaschaWillems (Collaborator) left a comment

I don't have that much experience with embedded programming outside some experiments with Raspberry Pis and Android TVs, but this looks good to me. Only a few minor remarks.

Having a few more links inside the text (e.g. to spec chapters) would make this a bit easier to follow.

* **Control Lists (CL)**: The VideoCore GPU doesn't use standard command buffers in the way a desktop GPU does. Instead, the driver generates "Control Lists" that the hardware's V3D unit executes.
* **Contiguous Memory Allocator (CMA)**: On Linux, the GPU requires physically contiguous memory. This is managed by the kernel's CMA pool. If your application crashes or fails to allocate memory despite plenty of RAM being available, you may need to increase the CMA size in `/boot/config.txt` (`/boot/firmware/config.txt` on recent Raspberry Pi OS releases).
** Example: `dtoverlay=vc4-kms-v3d,cma-512` reserves a 512 MB CMA pool for the GPU to allocate from.
* **Performance Tipping Points**: The `v3dv` driver is very efficient, but it has specific "tipping points" where it must flush the tile buffer to RAM (a "resolve"). To avoid this, ensure your render passes are structured to fit within the tile buffer limits (which vary based on the number of samples and the format of the attachments).

Is there a way to query this limit? If so, can this be added?

* **Memory Alignment**: Alignment requirements for certain resources (like `minStorageBufferOffsetAlignment`) can be much larger on embedded GPUs than on desktop counterparts. Always check the limits reported in `VkPhysicalDeviceProperties::limits` (a `VkPhysicalDeviceLimits` structure).
* **Fragmented Memory**: In systems with long uptimes (like industrial controllers), memory fragmentation can lead to allocation failures even when "free" memory appears available. Reusing allocations or using a robust allocator like the Vulkan Memory Allocator (VMA) is highly recommended.
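
A minimal sketch of how the alignment limit from the bullet above might be applied when sub-allocating ranges inside one storage buffer. It assumes the value has already been read from `VkPhysicalDeviceLimits::minStorageBufferOffsetAlignment`; the helper name `align_up` is hypothetical and is not part of Vulkan or of this PR.

```c
#include <assert.h>
#include <stdint.h>

/* Round `offset` up to the next multiple of `alignment`.
 * The Vulkan specification requires minStorageBufferOffsetAlignment
 * to be a power of two, which is what makes the bit mask below valid.
 * `align_up` is an illustrative helper name, not a Vulkan function. */
static uint64_t align_up(uint64_t offset, uint64_t alignment)
{
    /* Reject zero and non-power-of-two alignments. */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

For example, with a reported alignment of 256 bytes, a range that would otherwise start at byte 1 has to start at byte 256, so small sub-allocations can waste noticeably more space than on a device reporting 16 or 64.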

== The Direct-to-Display Workflow (VK_KHR_display)

* **Subgroup Operations**: Use subgroup operations (core since Vulkan 1.1 and exposed in GLSL through the `GL_KHR_shader_subgroup_*` extensions) to share data between shader invocations. Check `VkPhysicalDeviceSubgroupProperties` for the supported operations and shader stages. For example, if you need to calculate an average of pixels in a neighborhood, use subgroup arithmetic instead of writing to and reading from shared memory (`shared` variables). This keeps the data within the GPU's register file, which can save significant power.
* **Reduced Precision**: Many embedded GPUs can execute 16-bit arithmetic at up to twice the rate of 32-bit arithmetic. Use `VK_KHR_shader_float16_int8` to use half-precision types. This can raise throughput and also reduces the number of registers used by the shader, which allows more workgroups to run in parallel.
* **Circular Display Optimization**: Since many smartwatches use circular displays within square memory buffers, the corners represent approximately 21.5% of the total area (the geometric difference between a square and its inscribed circle). While Vulkan renders to rectangular surfaces, you can use `discard` or `VK_EXT_discard_rectangles` (if supported) to avoid fragment processing in these non-visible regions, significantly reducing GPU ALU load and power consumption.

Isn't discard considered expensive? Would using scissors instead be a more viable/faster option?
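
As a sanity check on the 21.5% figure quoted in the bullet above, the sketch below counts the pixels of a square buffer whose centers fall outside the inscribed circle. The function name `corner_fraction` and the 400-pixel size in the note that follows are illustrative assumptions and do not come from the PR.

```c
#include <assert.h>
#include <stdint.h>

/* Fraction of pixels in a size x size buffer whose centers lie outside
 * the circle inscribed in that square. Coordinates are doubled so the
 * pixel centers (x + 0.5, y + 0.5) can be handled with integers only.
 * `corner_fraction` is an illustrative name, not a Vulkan function. */
static double corner_fraction(uint32_t size)
{
    assert(size > 0);
    const int64_t n = (int64_t)size;
    uint64_t outside = 0;

    for (int64_t y = 0; y < n; ++y) {
        for (int64_t x = 0; x < n; ++x) {
            /* Twice the distance from the buffer center, per axis. */
            const int64_t dx = 2 * x + 1 - n;
            const int64_t dy = 2 * y + 1 - n;
            /* Outside when (dx/2)^2 + (dy/2)^2 > (n/2)^2. */
            if (dx * dx + dy * dy > n * n) {
                ++outside;
            }
        }
    }
    return (double)outside / ((double)n * (double)n);
}
```

Calling `corner_fraction(400)` should give a value close to 1 - π/4 ≈ 0.2146, which matches the figure in the text. A single scissor rectangle cannot exclude those corners because the circle's bounding box is the whole square, so the comparison raised in the comment above is between per-fragment `discard`, `VK_EXT_discard_rectangles`, and splitting the area into several rectangles.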
