Plan for BF16 datatype? #80

Closed
pauldintel opened this issue Feb 7, 2024 · 14 comments
Labels: good first issue, help wanted

Comments

@pauldintel

Any plans for SIMD optimizations for the BFloat16 datatype?
Thanks!

@ashvardanian
Owner

Hi @pauldintel! That shouldn’t be too hard to add and can help a lot on older x86 and newer mobile CPUs. Would you like to contribute? Any specific distance functions you are looking for?

ashvardanian added the help wanted and good first issue labels on May 18, 2024
@MarkReedZ
Contributor

MarkReedZ commented May 29, 2024

@ashvardanian

I'm adding support for this. Would it make sense for f16 and bf16 to use check_c_source_compiles in cmake to detect compiler support?

include(CheckCSourceCompiles)
check_c_source_compiles(
  [=[
int main(void) {
  __bf16 foo = 1.0f;
  (void)foo;
  return 0;
}
]=]
  HAS_BFLOAT16)

We can retain the ability to disable it with #define SIMSIMD_NATIVE_F16 0.

Note that the benchmark currently disables native F16; I think we can leave it on by default.
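
As a rough sketch of how the detection could gate the native type on the C side, mirroring the existing SIMSIMD_NATIVE_F16 switch (the SIMSIMD_NATIVE_BF16 macro and simsimd_bf16_t typedef names here are assumptions, not the library's confirmed API):

#if !defined(SIMSIMD_NATIVE_BF16)
#define SIMSIMD_NATIVE_BF16 1 // CMake can define this to 0 when HAS_BFLOAT16 is not set
#endif

#if SIMSIMD_NATIVE_BF16
typedef __bf16 simsimd_bf16_t;         // compiler-native brain-float type
#else
typedef unsigned short simsimd_bf16_t; // raw 16-bit storage, converted in software
#endif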

@MarkReedZ
Contributor

Current benchmark results for the bf16 serial kernels vs native f16:

dot_bf16_serial_1536d/min_time:10.000/threads:12        1372 ns
cos_bf16_serial_1536d/min_time:10.000/threads:12        1485 ns
l2sq_bf16_serial_1536d/min_time:10.000/threads:12       1393 ns
kl_bf16_serial_1536d/min_time:10.000/threads:12         3352 ns
js_bf16_serial_1536d/min_time:10.000/threads:12         5069 ns

dot_f16_serial_1536d/min_time:10.000/threads:12          264 ns
cos_f16_serial_1536d/min_time:10.000/threads:12          264 ns
l2sq_f16_serial_1536d/min_time:10.000/threads:12         264 ns
kl_f16_serial_1536d/min_time:10.000/threads:12          2983 ns
js_f16_serial_1536d/min_time:10.000/threads:12          7858 ns

@ashvardanian
Owner

Yes, @MarkReedZ, the check_c_source_compiles approach makes a lot of sense! Can you please clarify the benchmarking results? I'd assume bf16 should be faster than f16, so the duration/latency should be lower 🤔

@MarkReedZ
Contributor

MarkReedZ commented May 30, 2024

I put the bf16, f16, and f32 dot_serial() kernels into Godbolt. You can add and remove flags (avx2, avx512fp16, etc.) to see what's going on. Without flags, the bf16 version compiles to longer code. Is the compiler using AVX2/AVX-512 on the f16 serial path? That would explain the difference.

https://godbolt.org/z/EKE66h9GM

The bf16 and unsigned short f16 have the same performance in dot/cos/l2sq, but bf16 is faster in kl/js.

I'll play around with different compilers.
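
For anyone reproducing the comparison, here is a sketch of the kind of serial kernels being compared, using the compilers' native __bf16 and _Float16 types where available; the function names are made up here, and the generated widening/FMA code depends heavily on the target flags:

float dot_bf16_native(__bf16 const *a, __bf16 const *b, int n) {
    float sum = 0;
    for (int i = 0; i != n; ++i)
        sum += (float)a[i] * (float)b[i]; // widen each bf16 operand to f32, accumulate in f32
    return sum;
}

float dot_f16_native(_Float16 const *a, _Float16 const *b, int n) {
    float sum = 0;
    for (int i = 0; i != n; ++i)
        sum += (float)a[i] * (float)b[i]; // same shape, but with the 16-bit IEEE half type
    return sum;
}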

@MarkReedZ
Contributor

MarkReedZ commented May 30, 2024

Note that AVX-512 BF16 only provides conversions between bf16 and f32, plus a dot-product instruction. So I believe our SIMD-accelerated functions will be converting bf16 to f32 and running the f32 algorithms.

Or perhaps it is possible to do a bf16 -> f16 conversion if we can find a way to just shift the exponent.

@ashvardanian
Owner

In most cases it would be better to perform dot products in bf16, upscaling and accumulating in f32 down the road.
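
A minimal sketch of that approach with the AVX-512 BF16 extension, where _mm512_dpbf16_ps multiplies bf16 pairs and accumulates into f32 lanes. The function name is illustrative, tail handling and masking are omitted, n is assumed to be a multiple of 32, and the (__m512bh) casts rely on GCC/Clang vector-conversion extensions:

#include <immintrin.h>
#include <stddef.h>

// Compile with -mavx512f -mavx512bf16
float dot_bf16_avx512(unsigned short const *a, unsigned short const *b, size_t n) {
    __m512 sum_vec = _mm512_setzero_ps();
    for (size_t i = 0; i != n; i += 32) {
        __m512bh a_vec = (__m512bh)_mm512_loadu_si512(a + i); // 32 bf16 values
        __m512bh b_vec = (__m512bh)_mm512_loadu_si512(b + i);
        sum_vec = _mm512_dpbf16_ps(sum_vec, a_vec, b_vec);    // bf16 products, f32 accumulation
    }
    return _mm512_reduce_add_ps(sum_vec);
}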

@MarkReedZ
Contributor

I added the conversion function for compilers that don't support __bf16

SIMSIMD_PUBLIC simsimd_f32_t simsimd_uncompress_bf16(unsigned short x) {
    union { unsigned int i; simsimd_f32_t f; } conversion;
    conversion.i = (unsigned int)x << 16; // place the bf16 bits in the upper half, zero-filling the low mantissa bits
    return conversion.f;
}
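
For context, a rough sketch (not the exact PR code) of the kind of serial loop this helper plugs into:

simsimd_f32_t dot_bf16_serial(unsigned short const *a, unsigned short const *b, size_t n) {
    simsimd_f32_t sum = 0;
    for (size_t i = 0; i != n; ++i)
        sum += simsimd_uncompress_bf16(a[i]) * simsimd_uncompress_bf16(b[i]);
    return sum;
}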

Using the conversion to f32 instead of the native __bf16 type, we get almost the same timings as with plain f32.

unsigned short bf16 -> f32 conversion

dot_bf16_serial_1536d/min_time:10.000/threads:12          183 ns
cos_bf16_serial_1536d/min_time:10.000/threads:12          202 ns
l2sq_bf16_serial_1536d/min_time:10.000/threads:12         166 ns
kl_bf16_serial_1536d/min_time:10.000/threads:12          1505 ns
js_bf16_serial_1536d/min_time:10.000/threads:12          3795 ns

A PR will be up when I have a minute.

@pauldintel
Author

@MarkReedZ which machine is this benchmark running on? Intel BF16 should show better results on 4th Gen Xeon Sapphire Rapids (SPR) with the AMX accelerators enabled, because BF16 is supposed to outperform FP16 in matrix multiplication. I am not sure from the above which of the distance calculations require matrix-multiplication operations.

@ashvardanian
Owner

Alternatively, you can also test on AMD Genoa chips. Like Intel Sapphire Rapids, they support AVX-512 BF16, but unlike Intel they don't support F16, so the relative win will be much larger.
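
To double-check a given machine, here is a quick sketch of a runtime test for the AVX-512 BF16 extension; the feature bit is CPUID leaf 7, sub-leaf 1, EAX bit 5, queried via the GCC/Clang <cpuid.h> helper:

#include <cpuid.h>
#include <stdio.h>

static int supports_avx512bf16(void) {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx)) return 0;
    return (eax >> 5) & 1; // AVX512_BF16 feature flag
}

int main(void) {
    printf("AVX512_BF16 supported: %s\n", supports_avx512bf16() ? "yes" : "no");
    return 0;
}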

@pauldintel
Author

As far as I know, Genoa has no BF16 support; at the moment it works on Intel SPR with AMX acceleration only.

@ashvardanian
Owner

@pauldintel it should be.

@ashvardanian
Owner

Hey @pauldintel! Have you had a chance to check the bf16 functionality? Have you ever tried to use AMX for vector-matrix operations, i.e., when one of the arguments contains just one non-zero row/column?

@pauldintel
Author

@ashvardanian we have tested inner products between two matrices. Before using the AMX inner product, at runtime we used Intel oneDNN to reorder the F32 data into BF16. For certain datasets and batch operations we have seen 1.51x to 14x improvement (64 to 1024 dimensions) over the FAISS IndexFlat scalar IP, all with Intel AMX. The FAISS IndexFlat BLAS IP shows up to a 4.8x gain with AMX.

For a single query, compared to native FP32:
FP32->BF16 AMX speedup: about 4.85x
BF16 AMX speedup: about 33.7x

Hope that helps
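
For reference, the F32 -> BF16 reorder mentioned above boils down to a downcast like the following sketch, with round-to-nearest-even and no special NaN handling; oneDNN's actual reorder primitive also changes the memory layout, which is omitted here:

#include <string.h>

static unsigned short f32_to_bf16(float value) {
    unsigned int bits;
    memcpy(&bits, &value, sizeof(bits));  // reinterpret the f32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1);  // round to nearest, ties to even
    return (unsigned short)(bits >> 16);  // keep the sign, exponent, and top 7 mantissa bits
}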
