Plan for BF16 datatype? #80
Any plans for SIMD optimizations for the BFloat16 datatype? Thanks!

Comments
Hi @pauldintel! That shouldn’t be too hard to add and can help a lot on older x86 and newer mobile CPUs. Would you like to contribute? Any specific distance functions you are looking for?
I'm adding support for this. Would it make sense for f16 and bf16 to use check_c_source_compiles in cmake to detect compiler support?
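For reference, a minimal sketch of such a check; the `COMPILER_SUPPORTS_BF16` and `SIMSIMD_NATIVE_BF16` names here are hypothetical, not necessarily what the repo uses:

```cmake
include(CheckCSourceCompiles)

# Probe whether the C compiler understands the __bf16 type and its arithmetic.
check_c_source_compiles("
    __bf16 add(__bf16 a, __bf16 b) { return a + b; }
    int main(void) { return 0; }
" COMPILER_SUPPORTS_BF16)

if (COMPILER_SUPPORTS_BF16)
    add_compile_definitions(SIMSIMD_NATIVE_BF16=1)
endif ()
```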
We can retain the ability to disable it. Note the bench disables native F16 - I think we can leave it on by default.
Current benchmark results vs native f16:
Yes, @MarkReedZ, the …
I put bf16, f16, and f32 dot_serial() in Godbolt. You can add and remove flags (avx2, avx512fp16, etc.) to see what's going on. Without flags the bf16 version is longer. Is the compiler using avx2/avx512 on the serial f16? That would explain the difference. https://godbolt.org/z/EKE66h9GM The bf16 … I'll play around with different compilers.
Note avx512_bf16 only has support for conversion between bf16 and f32, and a dot product. So I believe our SIMD-accelerated functions will be converting bf16 to f32 and running the f32 algorithms. Or perhaps it is possible to do a bf16 -> f16 conversion if we can find a way to just shift the exponent.
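Widening bf16 to f32 doesn't even need the BF16 extension, since a bf16 is just the top half of an f32. A minimal sketch of the widening step with plain AVX-512F (the function name is illustrative):

```c
#include <immintrin.h>

// Widen 16 packed bf16 values (raw bits in a __m256i) to 16 f32 lanes:
// zero-extend each u16 to u32, then shift it into the top half of the word.
static inline __m512 bf16x16_to_f32(__m256i halfs) {
    __m512i widened = _mm512_cvtepu16_epi32(halfs);
    return _mm512_castsi512_ps(_mm512_slli_epi32(widened, 16));
}
```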
In most cases it would be better to perform dot products in bf16, upcasting and accumulating in f32 along the way.
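That is essentially what `VDPBF16PS` (exposed as `_mm512_dpbf16_ps`) does: multiply bf16 pairs and accumulate into f32 lanes. A minimal sketch, assuming `n` is a multiple of 32; the function name and tail handling are hypothetical:

```c
#include <immintrin.h> // needs AVX-512F + AVX512-BF16, e.g. -mavx512f -mavx512bf16
#include <stddef.h>
#include <stdint.h>

// Dot product over packed bf16 inputs, accumulating in f32.
// Assumes n is a multiple of 32; a scalar tail loop is omitted for brevity.
float dot_bf16_avx512(uint16_t const *a, uint16_t const *b, size_t n) {
    __m512 acc = _mm512_setzero_ps();
    for (size_t i = 0; i != n; i += 32) {
        __m512bh va = (__m512bh)_mm512_loadu_si512(a + i); // reinterpret raw bits as bf16 lanes
        __m512bh vb = (__m512bh)_mm512_loadu_si512(b + i);
        acc = _mm512_dpbf16_ps(acc, va, vb); // acc += va * vb, pairwise, widened to f32
    }
    return _mm512_reduce_add_ps(acc);
}
```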
I added the conversion function for compilers that don't support __bf16; see the sketch below.
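A minimal sketch of such a fallback conversion, exploiting that bf16 is the top 16 bits of an IEEE-754 f32 (the function name is illustrative):

```c
#include <stdint.h>
#include <string.h>

// Fallback for compilers without a native __bf16 type: treat the bf16
// as the upper half of an f32 bit pattern and type-pun via memcpy.
static inline float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float result;
    memcpy(&result, &bits, sizeof(result));
    return result;
}
```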
And using the conversion to f32 instead of native bf16, we get almost the same timings as with plain f32.
A PR will be up when I have a minute.
@MarkReedZ, which machine is this benchmark running on? Intel BF16 should show better results on 4th Gen Xeon (Sapphire Rapids, SPR) with the AMX accelerators enabled, because BF16 is supposed to outperform FP16 in matrix multiplication. I am not sure from the above which distance calculations require matrix-multiply operations.
Alternatively, you can also test on AMD Genoa chips. Like Intel Sapphire Rapids they support AVX-512 BF16, but unlike Intel they don't support F16... so the relative win will be much larger.
As far as I know, Genoa has no BF16 support; at this moment it works on Intel SPR with AMX acceleration only.
Hey @pauldintel! Have you had the chance to check the …
@ashvardanian, we have tested two matrices with inner product. Before using the AMX inner product, at runtime we used Intel oneDNN to reorder F32 data to BF16. For certain datasets and batch operations we have seen 1.51x to 14x improvement (64 to 1024 dimensions) over FAISS IndexFlat Scalar IP, all with Intel AMX. FAISS IndexFlat BLAS IP shows up to a 4.8x gain with AMX. For a single query, comparing to native fp32: … Hope that helps!