[Refactor] IL-generated kernels for UnmanagedMemoryBlock #585

## Problem

`UnmanagedMemoryBlock.Casting.cs` contains **2,228 lines** of repetitive type-dispatch code: 144 nested switch cases (12 input types × 12 output types), each wrapping a nearly identical for-loop:

```csharp
case NPTypeCode.Boolean:
{
    var src = (bool*)source.Address;
    switch (InfoOf<TOut>.NPTypeCode)
    {
        case NPTypeCode.Int32:
            var dst = (int*)ret.Address;
            for (int i = 0; i < len; i++)
                *(dst + i) = Converts.ToInt32(*(src + i));
            break;
        // ... 11 more output types
    }
    break;
}
// ... 11 more input types (144 total combinations)
```

### Issues

| Problem | Impact |
|---------|--------|
| **Code bloat** | 2,228 lines for a simple operation |
| **Maintenance burden** | Changes must be replicated across 144 branches |
| **Regen dependency** | Uses `#if _REGEN` template generation |
| **No SIMD** | Scalar loops where vectorization is possible |
| **Cache pollution** | 144 near-identical loop bodies in one method likely hurt instruction-cache locality |

## Proposed Solution

Replace the hand-expanded switch with IL-generated kernels, following the established `ILKernelGenerator` pattern:

### New API (~20 lines)

```csharp
public static partial class UnmanagedMemoryBlock
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static IMemoryBlock CastTo(this IMemoryBlock source, NPTypeCode to)
    {
        if (source.TypeCode == to)
            return source.Clone();
        return CastKernelGenerator.Execute(source, to);
    }
}
```

### Kernel Generator (~300 lines)

```csharp
public static class CastKernelGenerator
{
    private delegate void CastKernel(IntPtr src, IntPtr dst, int count);
    private static readonly ConcurrentDictionary<(NPTypeCode, NPTypeCode), CastKernel> _cache = new();

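    // unsafe context: needed if Address is a raw pointer, as the
    // `(bool*)source.Address` cast in the current code suggests.
    public static unsafe IMemoryBlock Execute(IMemoryBlock source, NPTypeCode dstType)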
    {
        var kernel = _cache.GetOrAdd(
            (source.TypeCode, dstType), 
            key => GenerateKernel(key.Item1, key.Item2));
        
        var dst = AllocateBlock(dstType, source.Count);
        kernel((IntPtr)source.Address, (IntPtr)dst.Address, source.Count);
        return dst;
    }

    private static CastKernel GenerateKernel(NPTypeCode srcType, NPTypeCode dstType)
    {
        var method = new DynamicMethod($"Cast_{srcType}_{dstType}", ...);
        var il = method.GetILGenerator();
        
        // Prefer SIMD for compatible types (widening, float<->double);
        // otherwise fall back to a scalar loop built from IL conversion opcodes.
        if (!TryEmitSimdCast(il, srcType, dstType))
            EmitScalarCast(il, srcType, dstType);

        return (CastKernel)method.CreateDelegate(typeof(CastKernel));
    }
}
```
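To make the scalar fallback concrete, here is a minimal sketch of what an emitted kernel could look like for a single pair, `int32 → double`. The names `ScalarCastSketch`, `BuildInt32ToDouble`, and `Cast` are hypothetical and exist only for this illustration; a real `EmitScalarCast` would presumably parameterize the element sizes and the load/convert/store opcodes by type instead of hard-coding them.

```csharp
using System;
using System.Reflection.Emit;
using System.Runtime.InteropServices;

public static class ScalarCastSketch
{
    // Same shape as the CastKernel delegate in the proposal.
    public delegate void CastKernel(IntPtr src, IntPtr dst, int count);

    public static readonly CastKernel Int32ToDouble = BuildInt32ToDouble();

    // Emits the IL equivalent of:
    //   for (int i = 0; i < count; i++) ((double*)dst)[i] = ((int*)src)[i];
    private static CastKernel BuildInt32ToDouble()
    {
        var method = new DynamicMethod(
            "Cast_Int32_Double", typeof(void),
            new[] { typeof(IntPtr), typeof(IntPtr), typeof(int) },
            typeof(ScalarCastSketch).Module, skipVisibility: true);
        ILGenerator il = method.GetILGenerator();
        LocalBuilder i = il.DeclareLocal(typeof(int));
        Label check = il.DefineLabel();
        Label body = il.DefineLabel();

        // i = 0; jump to the loop condition
        il.Emit(OpCodes.Ldc_I4_0);
        il.Emit(OpCodes.Stloc, i);
        il.Emit(OpCodes.Br, check);

        il.MarkLabel(body);
        // push &dst[i] = dst + (native int)i * sizeof(double)
        il.Emit(OpCodes.Ldarg_1);
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Conv_I);
        il.Emit(OpCodes.Ldc_I4_8);
        il.Emit(OpCodes.Conv_I);
        il.Emit(OpCodes.Mul);
        il.Emit(OpCodes.Add);
        // push (double)src[i], where src[i] sits at src + (native int)i * sizeof(int)
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Conv_I);
        il.Emit(OpCodes.Ldc_I4_4);
        il.Emit(OpCodes.Conv_I);
        il.Emit(OpCodes.Mul);
        il.Emit(OpCodes.Add);
        il.Emit(OpCodes.Ldind_I4);
        il.Emit(OpCodes.Conv_R8);
        // store the converted value into dst[i]
        il.Emit(OpCodes.Stind_R8);
        // i++
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Ldc_I4_1);
        il.Emit(OpCodes.Add);
        il.Emit(OpCodes.Stloc, i);

        // loop while i < count
        il.MarkLabel(check);
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Ldarg_2);
        il.Emit(OpCodes.Blt, body);
        il.Emit(OpCodes.Ret);

        return (CastKernel)method.CreateDelegate(typeof(CastKernel));
    }

    // Convenience wrapper for trying the kernel on managed arrays.
    public static double[] Cast(int[] input)
    {
        var output = new double[input.Length];
        GCHandle src = GCHandle.Alloc(input, GCHandleType.Pinned);
        GCHandle dst = GCHandle.Alloc(output, GCHandleType.Pinned);
        try
        {
            Int32ToDouble(src.AddrOfPinnedObject(), dst.AddrOfPinnedObject(), input.Length);
        }
        finally
        {
            src.Free();
            dst.Free();
        }
        return output;
    }
}
```

The wrapper pins managed arrays only so the sketch can run without NumSharp; in the proposal the addresses would come from the unmanaged blocks directly. One caveat on the cache shown above: `ConcurrentDictionary.GetOrAdd` may invoke its factory more than once under contention, so two threads could each generate a kernel for the same pair, which is harmless but redundant.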

### IL Emission (uses native conversion opcodes)

```csharp
private static void EmitConversion(ILGenerator il, NPTypeCode srcType, NPTypeCode dstType)
{
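    // Conv_* opcodes only see the evaluation-stack type, so unsigned sources
    // need the unsigned variants; otherwise the value is treated as signed.
    bool unsignedSrc = srcType is NPTypeCode.Byte or NPTypeCode.UInt16
                               or NPTypeCode.UInt32 or NPTypeCode.UInt64;
    switch (dstType)
    {
        case NPTypeCode.Byte:    il.Emit(OpCodes.Conv_U1); break;
        case NPTypeCode.Int16:   il.Emit(OpCodes.Conv_I2); break;
        case NPTypeCode.Int32:   il.Emit(OpCodes.Conv_I4); break;
        case NPTypeCode.Int64:   il.Emit(unsignedSrc ? OpCodes.Conv_U8 : OpCodes.Conv_I8); break;
        case NPTypeCode.Single:
            if (unsignedSrc) il.Emit(OpCodes.Conv_R_Un);
            il.Emit(OpCodes.Conv_R4);
            break;
        case NPTypeCode.Double:
            if (unsignedSrc) il.Emit(OpCodes.Conv_R_Un);
            il.Emit(OpCodes.Conv_R8);
            break;
        // Boolean and Decimal have no Conv_* opcode and need dedicated paths.
        // ... etc
    }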
}
```
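One detail worth checking when emitting `Conv_*` opcodes: they operate on the evaluation-stack type, where a `uint32` is indistinguishable from an `int32`, so the source type's signedness has to be encoded in the opcode choice. The sketch below demonstrates the difference for `uint32 → double`. It uses `System.TypeCode` as a stand-in for `NPTypeCode`, and `ConversionSketch` and its members are hypothetical names for illustration only.

```csharp
using System;
using System.Reflection.Emit;

public static class ConversionSketch
{
    private static bool IsUnsigned(TypeCode t) =>
        t == TypeCode.Byte || t == TypeCode.UInt16 ||
        t == TypeCode.UInt32 || t == TypeCode.UInt64;

    // Chooses conversion opcodes from both the source and destination type.
    public static void EmitConversion(ILGenerator il, TypeCode src, TypeCode dst)
    {
        bool unsignedSrc = IsUnsigned(src);
        switch (dst)
        {
            case TypeCode.Int32:
                il.Emit(OpCodes.Conv_I4);
                break;
            case TypeCode.Int64:
                // conv.u8 zero-extends, conv.i8 sign-extends.
                il.Emit(unsignedSrc ? OpCodes.Conv_U8 : OpCodes.Conv_I8);
                break;
            case TypeCode.Double:
                // conv.r.un reinterprets the integer on the stack as unsigned first.
                if (unsignedSrc) il.Emit(OpCodes.Conv_R_Un);
                il.Emit(OpCodes.Conv_R8);
                break;
            default:
                throw new NotSupportedException($"{src} -> {dst} is not covered by this sketch");
        }
    }

    // Builds uint -> double; when sourceAware is false the source is
    // (incorrectly) described as Int32 to show what a dst-only switch produces.
    public static Func<uint, double> BuildUInt32ToDouble(bool sourceAware)
    {
        var method = new DynamicMethod(
            "UInt32ToDouble", typeof(double), new[] { typeof(uint) },
            typeof(ConversionSketch).Module, skipVisibility: true);
        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);
        EmitConversion(il, sourceAware ? TypeCode.UInt32 : TypeCode.Int32, TypeCode.Double);
        il.Emit(OpCodes.Ret);
        return (Func<uint, double>)method.CreateDelegate(typeof(Func<uint, double>));
    }
}
```

With `uint.MaxValue` as input, the source-aware delegate returns `4294967295.0`, while the variant that ignores signedness returns `-1.0`. Separately, the `conv.*` opcodes truncate floating-point values toward zero and do not check for overflow; whether that matches the behavior of the existing `Converts.To*` helpers is something the all-pairs tests would need to confirm.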

## Expected Outcome

| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Lines of code | 2,228 | ~320 | **-86%** |
| Type switches | 144 | 2 | **-99%** |
| For-loops in source | 291 | 0 | **-100%** |
| SIMD support | None | Yes | **New** |
| Regen dependency | Yes | No | **Removed** |

## SIMD Opportunities

| Conversion | SIMD Method |
|------------|-------------|
| `int32 → int64` | `Avx2.ConvertToVector256Int64(Vector128<int>)` |
| `float → double` | `Avx.ConvertToVector256Double(Vector128<float>)` |
| `byte → int32` | `Avx2.ConvertToVector256Int32(Vector128<byte>)` (consumes the low 8 bytes) |
| Same-size integer reinterpret (e.g. `int32 ↔ uint32`) | `Buffer.MemoryCopy` |
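
As a rough illustration of the first row, the following sketch widens `int32` to `int64` four elements at a time, with a scalar tail. It is written in plain C# rather than emitted IL to keep the logic readable; an IL-generated kernel would presumably emit `Call` instructions to the same intrinsic methods. `SimdWidenSketch` and its members are hypothetical names, and the code assumes .NET Core 3.0 or later with `AllowUnsafeBlocks` enabled.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class SimdWidenSketch
{
    public static unsafe void Int32ToInt64(int* src, long* dst, int count)
    {
        int i = 0;
        if (Avx2.IsSupported)
        {
            // Four int32 lanes (128 bits) sign-extend into four int64 lanes (256 bits).
            for (; i <= count - 4; i += 4)
            {
                Vector128<int> narrow = Sse2.LoadVector128(src + i);
                Vector256<long> wide = Avx2.ConvertToVector256Int64(narrow);
                Avx.Store(dst + i, wide);
            }
        }

        // Scalar tail; this is also the entire loop when AVX2 is unavailable.
        for (; i < count; i++)
            dst[i] = src[i];
    }

    // Convenience wrapper for trying the kernel on managed arrays.
    public static unsafe long[] Widen(int[] input)
    {
        var output = new long[input.Length];
        fixed (int* src = input)
        fixed (long* dst = output)
            Int32ToInt64(src, dst, input.Length);
        return output;
    }
}
```

The `Avx2.IsSupported` guard keeps the sketch correct on hardware without AVX2, which matches the proposal's idea of a SIMD attempt followed by a scalar fallback.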

## Implementation Plan

- [ ] Create `ILKernelGenerator.Cast.cs` with scalar conversion loop
- [ ] Add kernel caching with `(srcType, dstType)` key
- [ ] Implement SIMD paths for widening conversions
- [ ] Implement SIMD paths for float↔double
- [ ] Update `UnmanagedMemoryBlock.CastTo` to use new generator
- [ ] Add unit tests for all 144 type pairs
- [ ] Remove old `UnmanagedMemoryBlock.Casting.cs`
- [ ] Update `ArrayConvert.cs` to reuse cast kernels

## Complexity Assessment

| Aspect | Difficulty | Notes |
|--------|------------|-------|
| IL emission basics | Easy | Copy patterns from `ILKernelGenerator.Binary.cs` |
| Conversion opcodes | Easy | IL has native `Conv_*` opcodes |
| Decimal handling | Medium | Requires `Convert.ToDecimal()` call |
| SIMD widening | Medium | Well-documented intrinsics |
| Testing 144 pairs | Tedious | Straightforward but time-consuming |
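
For the Decimal row, a conversion has no `Conv_*` opcode and would be emitted as a method call instead. The sketch below shows the idea for a single value, `double → decimal`, by emitting a `Call` to `Convert.ToDecimal(double)`. `DecimalCastSketch` is a hypothetical name; inside a real kernel loop the result would additionally have to be stored with `Stobj` using a 16-byte element stride.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

public static class DecimalCastSketch
{
    public static readonly Func<double, decimal> DoubleToDecimal = Build();

    private static Func<double, decimal> Build()
    {
        // Resolve the Convert.ToDecimal(double) overload once, at generation time.
        MethodInfo toDecimal = typeof(Convert).GetMethod(
            nameof(Convert.ToDecimal), new[] { typeof(double) });

        var method = new DynamicMethod(
            "DoubleToDecimal", typeof(decimal), new[] { typeof(double) },
            typeof(DecimalCastSketch).Module, skipVisibility: true);
        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);         // push the double argument
        il.Emit(OpCodes.Call, toDecimal); // replace it with the decimal result
        il.Emit(OpCodes.Ret);
        return (Func<double, decimal>)method.CreateDelegate(typeof(Func<double, decimal>));
    }
}
```

Note that `Convert.ToDecimal(double)` throws `OverflowException` for values outside the `decimal` range, including `NaN` and infinities, so a Decimal kernel cannot be assumed to be exception-free the way the `Conv_*` paths are.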

## Related Files

**Will be deleted:**
- `src/NumSharp.Core/Backends/Unmanaged/UnmanagedMemoryBlock.Casting.cs` (2,228 lines)

**Will be simplified:**
- `src/NumSharp.Core/Utilities/ArrayConvert.cs` (can reuse cast kernels)

**New file:**
- `src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Cast.cs` (~300 lines)

## References

- Existing pattern: `ILKernelGenerator.Binary.cs`, `ILKernelGenerator.Unary.cs`
- Design doc: `docs/examples/CastKernel_Proposal.cs`
- Parent tracking issue: `docs/ISSUE_IL_MIGRATION.md`
