Open
Labels: NumPy 2.x Compliance, core, performance
Description
Problem
UnmanagedMemoryBlock.Casting.cs contains 2,228 lines of repetitive type-dispatch code with 144 nested switch cases (12 input types × 12 output types), each containing nearly identical for-loops:

```csharp
case NPTypeCode.Boolean:
{
    var src = (bool*)source.Address;
    switch (InfoOf<TOut>.NPTypeCode)
    {
        case NPTypeCode.Int32:
            var dst = (int*)ret.Address;
            for (int i = 0; i < len; i++)
                *(dst + i) = Converts.ToInt32(*(src + i));
            break;
        // ... 11 more output types
    }
    break;
}
// ... 11 more input types (144 total combinations)
```

Issues
| Problem | Impact |
|---|---|
| Code bloat | 2,228 lines for a simple operation |
| Maintenance burden | Changes must be replicated across 144 branches |
| Regen dependency | Uses #if _REGEN template generation |
| No SIMD | Scalar loops where vectorization is possible |
| Cache pollution | 144 code paths = poor instruction cache utilization |
Proposed Solution
Replace with IL-generated kernels using the established ILKernelGenerator pattern:
New API (~20 lines)

```csharp
public static partial class UnmanagedMemoryBlock
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static IMemoryBlock CastTo(this IMemoryBlock source, NPTypeCode to)
    {
        // Same type: no conversion needed, return a copy.
        if (source.TypeCode == to)
            return source.Clone();

        return CastKernelGenerator.Execute(source, to);
    }
}
```

Kernel Generator (~300 lines)

```csharp
public static class CastKernelGenerator
{
    private delegate void CastKernel(IntPtr src, IntPtr dst, int count);

    // One compiled kernel per (source type, destination type) pair.
    private static readonly ConcurrentDictionary<(NPTypeCode, NPTypeCode), CastKernel> _cache = new();

    public static IMemoryBlock Execute(IMemoryBlock source, NPTypeCode dstType)
    {
        var kernel = _cache.GetOrAdd(
            (source.TypeCode, dstType),
            key => GenerateKernel(key.Item1, key.Item2));

        var dst = AllocateBlock(dstType, source.Count);
        kernel((IntPtr)source.Address, (IntPtr)dst.Address, source.Count);
        return dst;
    }

    private static CastKernel GenerateKernel(NPTypeCode srcType, NPTypeCode dstType)
    {
        var method = new DynamicMethod($"Cast_{srcType}_{dstType}", ...);
        var il = method.GetILGenerator();

        // Try SIMD for compatible types (widening, float<->double)
        if (TryEmitSimdCast(il, srcType, dstType))
            return (CastKernel)method.CreateDelegate(typeof(CastKernel));

        // Fallback: scalar loop with IL conversion opcodes
        EmitScalarCast(il, srcType, dstType);
        return (CastKernel)method.CreateDelegate(typeof(CastKernel));
    }
}
```

IL Emission (uses native conversion opcodes)
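
```csharp
private static void EmitConversion(ILGenerator il, NPTypeCode srcType, NPTypeCode dstType)
{
    // Note: Conv_R4/Conv_R8 treat an integer on the stack as signed, so unsigned
    // sources converting to floating point need Conv_R_Un first (hence srcType).
    // Boolean and Decimal targets have no Conv_* opcode and need separate handling.
    switch (dstType)
    {
        case NPTypeCode.Byte:   il.Emit(OpCodes.Conv_U1); break;
        case NPTypeCode.Int16:  il.Emit(OpCodes.Conv_I2); break;
        case NPTypeCode.Int32:  il.Emit(OpCodes.Conv_I4); break;
        case NPTypeCode.Int64:  il.Emit(OpCodes.Conv_I8); break;
        case NPTypeCode.Single: il.Emit(OpCodes.Conv_R4); break;
        case NPTypeCode.Double: il.Emit(OpCodes.Conv_R8); break;
        // ... etc
    }
}
```

Expected Outcome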
| Metric | Before | After | Change |
|---|---|---|---|
| Lines of code | 2,228 | ~320 | -86% |
| Type switches | 144 | 2 | -99% |
| For-loops in source | 291 | 0 | -100% |
| SIMD support | None | Yes | New |
| Regen dependency | Yes | No | Removed |
SIMD Opportunities
| Conversion | SIMD Method |
|---|---|
| int32 → int64 | `Avx2.ConvertToVector256Int64(Vector128<int>)` |
| float → double | `Avx.ConvertToVector256Double(Vector128<float>)` |
| byte → int32 | `Avx2.ConvertToVector256Int32(Vector128<byte>)` (widens the low 8 bytes) |
| Same-size reinterpret | `Buffer.MemoryCopy` |
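
To illustrate the int32 → int64 row, here is a minimal hand-written sketch of the loop a SIMD kernel would be equivalent to. It is not the proposed IL-emitted code. `WideningSketch` and `Int32ToInt64` are hypothetical names, and the sketch assumes a runtime that exposes `System.Runtime.Intrinsics` (.NET Core 3.0 or later) and a project compiled with unsafe code enabled.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static unsafe class WideningSketch
{
    // Hypothetical helper (not part of NumSharp): widens `count` int32 values to int64.
    public static void Int32ToInt64(int* src, long* dst, int count)
    {
        int i = 0;

        if (Avx2.IsSupported)
        {
            // Each iteration sign-extends 4 x int32 (128 bits) into 4 x int64 (256 bits).
            for (; i <= count - 4; i += 4)
            {
                Vector128<int> narrow = Sse2.LoadVector128(src + i);
                Vector256<long> wide = Avx2.ConvertToVector256Int64(narrow);
                Avx.Store(dst + i, wide);
            }
        }

        // Scalar tail, which is also the full fallback when AVX2 is unavailable.
        for (; i < count; i++)
            dst[i] = src[i];
    }
}
```

The `Avx2.IsSupported` guard and the scalar tail matter in practice: element counts are rarely a multiple of the vector width, and the same kernel has to stay correct on hardware without AVX2.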
Implementation Plan
- Create `ILKernelGenerator.Cast.cs` with scalar conversion loop
- Add kernel caching with `(srcType, dstType)` key
- Implement SIMD paths for widening conversions
- Implement SIMD paths for float↔double
- Update `UnmanagedMemoryBlock.CastTo` to use new generator
- Add unit tests for all 144 type pairs
- Remove old `UnmanagedMemoryBlock.Casting.cs`
- Update `ArrayConvert.cs` to reuse cast kernels
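
One thing the unit tests in this plan may need to pin down is rounding behavior for float-to-integer casts. The sketch below uses `System.Convert.ToInt32`, which rounds to nearest with ties to even, and an unchecked C# cast, which truncates toward zero like `conv.i4` does for in-range values. Whether NumSharp's `Converts.ToInt32` matches `System.Convert` here is an assumption, not something confirmed by this issue. `CastSemanticsSketch` is a hypothetical name.

```csharp
using System;

public static class CastSemanticsSketch
{
    // Returns the result of both conversion styles for the same input.
    public static (int Rounded, int Truncated) Compare(double value)
    {
        int rounded = Convert.ToInt32(value);    // round to nearest, ties to even
        int truncated = unchecked((int)value);   // truncate toward zero, as conv.i4 does
        return (rounded, truncated);
    }
}
```

For example, `Compare(3.5)` yields `(4, 3)`. Tests that compare the new kernels against the old code path would surface any such difference across the 144 pairs.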
Complexity Assessment
| Aspect | Difficulty | Notes |
|---|---|---|
| IL emission basics | Easy | Copy patterns from ILKernelGenerator.Binary.cs |
| Conversion opcodes | Easy | IL has native Conv_* opcodes |
| Decimal handling | Medium | Requires Convert.ToDecimal() call |
| SIMD widening | Medium | Well-documented intrinsics |
| Testing 144 pairs | Tedious | Straightforward but time-consuming |
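
To make the "IL emission basics" and "Conversion opcodes" rows concrete, here is a minimal, self-contained sketch of a scalar cast kernel for one fixed pair (int32 → double) built with `DynamicMethod`. It is an illustration under assumptions, not the proposed `ILKernelGenerator.Cast.cs`: `CastKernelSketch` and `BuildInt32ToDouble` are hypothetical names, and a real generator would parameterize the element sizes, load/store opcodes, and conversion opcode by type.

```csharp
using System;
using System.Reflection.Emit;

public static class CastKernelSketch
{
    public delegate void CastKernel(IntPtr src, IntPtr dst, int count);

    // Hypothetical builder: emits `for (i = 0; i < count; i++) dst[i] = (double)src[i];`
    public static CastKernel BuildInt32ToDouble()
    {
        var method = new DynamicMethod(
            "Cast_Int32_Double",
            typeof(void),
            new[] { typeof(IntPtr), typeof(IntPtr), typeof(int) },
            typeof(CastKernelSketch).Module,
            skipVisibility: true);

        ILGenerator il = method.GetILGenerator();
        LocalBuilder i = il.DeclareLocal(typeof(int));
        Label check = il.DefineLabel();
        Label body = il.DefineLabel();

        // i = 0; jump to the loop condition
        il.Emit(OpCodes.Ldc_I4_0);
        il.Emit(OpCodes.Stloc, i);
        il.Emit(OpCodes.Br, check);

        il.MarkLabel(body);

        // Push destination address: dst + i * sizeof(double)
        il.Emit(OpCodes.Ldarg_1);
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Conv_I);
        il.Emit(OpCodes.Ldc_I4_8);
        il.Emit(OpCodes.Conv_I);
        il.Emit(OpCodes.Mul);
        il.Emit(OpCodes.Add);

        // Load source element: *(int*)(src + i * sizeof(int))
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Conv_I);
        il.Emit(OpCodes.Ldc_I4_4);
        il.Emit(OpCodes.Conv_I);
        il.Emit(OpCodes.Mul);
        il.Emit(OpCodes.Add);
        il.Emit(OpCodes.Ldind_I4);

        // Convert and store
        il.Emit(OpCodes.Conv_R8);
        il.Emit(OpCodes.Stind_R8);

        // i++
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Ldc_I4_1);
        il.Emit(OpCodes.Add);
        il.Emit(OpCodes.Stloc, i);

        // Loop while i < count
        il.MarkLabel(check);
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Ldarg_2);
        il.Emit(OpCodes.Blt, body);
        il.Emit(OpCodes.Ret);

        return (CastKernel)method.CreateDelegate(typeof(CastKernel));
    }
}
```

A caller would pin or otherwise obtain raw pointers for both buffers and invoke the delegate as `kernel((IntPtr)srcPtr, (IntPtr)dstPtr, count)`. Only the `Ldind_*`, `Conv_*`, and `Stind_*` opcodes and the two element sizes change per type pair, which is what lets one emitter cover most of the 144 combinations.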
Related Files
Will be deleted:
- `src/NumSharp.Core/Backends/Unmanaged/UnmanagedMemoryBlock.Casting.cs` (2,228 lines)

Will be simplified:
- `src/NumSharp.Core/Utilities/ArrayConvert.cs` (can reuse cast kernels)

New file:
- `src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Cast.cs` (~300 lines)
References
- Existing pattern: `ILKernelGenerator.Binary.cs`, `ILKernelGenerator.Unary.cs`
- Design doc: `docs/examples/CastKernel_Proposal.cs`
- Parent tracking issue: `docs/ISSUE_IL_MIGRATION.md`