Create a user-friendly abstraction layer over low-level JIT kernels so that developers without CUDA expertise can apply high-performance FP8 quantization to their custom model architectures.
Suggested repo: NanoKernels
"High-performance inference kernels without the CUDA headache."
Estimated effort: 150h
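
To make the goal concrete, here is a minimal pure-Python sketch of the per-tensor FP8 quantization that such a layer would hide behind a single call. It assumes the E4M3 format (3 mantissa bits, maximum finite value 448) and simulates the rounding on the CPU. The names `round_to_e4m3`, `quantize_fp8`, and `dequantize_fp8` are hypothetical and are not part of any existing NanoKernels API; a real implementation would presumably dispatch to JIT-compiled GPU kernels instead.

```python
import math

E4M3_MAX = 448.0       # largest finite E4M3 value (assumed format)
E4M3_MIN_EXP = -6      # exponent of the smallest normal E4M3 value
E4M3_MANT_BITS = 3     # explicit mantissa bits


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on the E4M3 grid, saturating at +/-448."""
    if x == 0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**e with m in [0.5, 1), so a = 1.xxx * 2**(e - 1).
    _, e = math.frexp(a)
    exp = max(e - 1, E4M3_MIN_EXP)           # clamp into the subnormal range
    step = 2.0 ** (exp - E4M3_MANT_BITS)     # grid spacing at this exponent
    # Python's round() rounds halves to even, matching round-to-nearest-even.
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize_fp8(values):
    """Hypothetical user-facing entry point: per-tensor scaled quantization.

    Returns the values snapped to the E4M3 grid plus the scale needed to
    map them back to the original range.
    """
    amax = max((abs(v) for v in values), default=0.0)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize_fp8(quantized, scale):
    """Map quantized values back to the original range."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.0, 0.5, -1.0, 2.0]
    q, scale = quantize_fp8(weights)
    print(q, scale)
    print(dequantize_fp8(q, scale))
```

The scale maps the largest absolute value in the tensor onto 448, so the full FP8 range is used and no value saturates. Per-tensor scaling is only one possible design; per-channel or per-block scaling would follow the same pattern with one scale per slice.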