Use usize as the Block instead of u32 #74
Conversation
Reran the benchmarks and gathered some insights.
Feels like this should be behind a feature flag(?). For some applications, the performance of `contains` may be the main bottleneck.
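For illustration, gating the block width behind a Cargo feature could look roughly like the sketch below; the `block-usize` feature name is hypothetical, not an existing feature of this crate.

```rust
// Hypothetical sketch: select the block width at compile time via a feature.
// The "block-usize" feature name is illustrative only.
#[cfg(feature = "block-usize")]
pub type Block = usize;

#[cfg(not(feature = "block-usize"))]
pub type Block = u32;

// Number of bits per block, derived from whichever alias is selected.
pub const BITS: usize = std::mem::size_of::<Block>() * 8;
```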
@Calandiel ended up removing the smallvec changes; it was indeed an unacceptable overhead. Reran the benchmarks, and this seems to address the regression.
@jrraymond mind taking a look? This is being used as a core component in Bevy's ECS scheduler, and this PR optimizes many of the operations behind it.
Thanks James. I see about the same speedups. How do you feel about splitting this up into smaller PRs for 1) the u32 -> usize change, 2) adding the zeroes iterator, and 3) moving the tests to a separate module?
@jrraymond I've updated this PR to only include the usize changes. Please take another look.
Before I merge this, I would like to resolve the performance regressions on the contains benchmarks.
I've been looking at the generated assembly, and the differences are exactly what I would expect.
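For reference, here is a simplified sketch (not the crate's exact code) of what `contains` looks like with usize blocks: one index computation to pick the block, one mask to test the bit.

```rust
const BITS: usize = std::mem::size_of::<usize>() * 8; // 64 on 64-bit targets

struct FixedBitSet {
    data: Vec<usize>,
}

impl FixedBitSet {
    fn contains(&self, bit: usize) -> bool {
        // Pick the block holding `bit`, then test the bit within it.
        match self.data.get(bit / BITS) {
            None => false,
            Some(block) => (block & (1 << (bit % BITS))) != 0,
        }
    }
}
```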
Thanks for the deep dive into this. What's the policy for this crate on the use of `unsafe`? I've also, in the meantime, experimented with explicitly vectorized operations.
I think we should use u32 sub-blocks for `contains` only if it clearly pays off. I took a stab at this to see if it makes a difference:

```rust
type SubBlock = u32;
const SUB_BLOCK_BITS: usize = std::mem::size_of::<SubBlock>() * 8;
const SUB_BLOCKS_PER_BLOCK: usize = BITS / SUB_BLOCK_BITS;

let block_ix = bit / BITS;
let block_offset = bit % BITS;
// Index of the u32 sub-block within the flattened view of the usize blocks.
let sub_block_ix = block_ix * SUB_BLOCKS_PER_BLOCK + block_offset / SUB_BLOCK_BITS;
let offset = bit % SUB_BLOCK_BITS;
let blocks: &[usize] = &self.data;
// Reinterpret the usize blocks as a slice of u32 sub-blocks.
let sublocks: &[SubBlock] = unsafe {
    let p: *const SubBlock = std::mem::transmute(blocks.as_ptr());
    std::slice::from_raw_parts(p, blocks.len() * SUB_BLOCKS_PER_BLOCK)
};
match sublocks.get(sub_block_ix) {
    None => false,
    Some(b) => (b & (1 << offset)) != 0,
}
```

The additional logic to compute the indexes only adds overhead :(.
However, judging by my latest round of benchmarks, I think the overhead of computing the additional offsets outweighs any benefits, although I'm sure you could improve my logic for calculating the sub-block offsets:

```asm
usize::FixedBitSet::contains:
        mov     rax, rsi
        shr     rax, 5
        cmp     rax, qword ptr [rdi + 16]
        jae     .LBB10_1
        mov     rcx, qword ptr [rdi]
        mov     eax, dword ptr [rcx + 4*rax]
        bt      eax, esi
        setb    al
        ret
```
```asm
sublocks::FixedBitSet::contains:
        mov     rax, rsi
        shr     rax, 3
        mov     rcx, qword ptr [rdi + 16]
        shl     rcx, 3
        cmp     rax, rcx
        jae     .LBB9_1
        mov     rcx, qword ptr [rdi]
        movzx   eax, byte ptr [rcx + rax]
        and     esi, 7
        bt      eax, esi
        setb    al
        ret
```

Given this, I'd say this LGTM.
Ultimately I think we should improve the benchmark coverage, and maybe convert the benchmarks to Criterion for more stable results.
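For example, a minimal Criterion benchmark for `contains` might look like the sketch below; the `bench_contains` name and the setup are illustrative, not existing benches in this repo.

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use fixedbitset::FixedBitSet;

fn bench_contains(c: &mut Criterion) {
    let mut set = FixedBitSet::with_capacity(1_000_000);
    // Populate a sparse set: every 64th bit.
    for i in (0..1_000_000).step_by(64) {
        set.insert(i);
    }
    c.bench_function("contains", |b| {
        b.iter(|| {
            let mut hits = 0usize;
            for i in 0..1_000_000 {
                if set.contains(black_box(i)) {
                    hits += 1;
                }
            }
            hits
        })
    });
}

criterion_group!(benches, bench_contains);
criterion_main!(benches);
```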
src/lib.rs (Outdated)

```diff
@@ -274,12 +274,13 @@ impl FixedBitSet {
     /// **Panics** if the range extends past the end of the bitset.
     #[inline]
     pub fn set_range<T: IndexRange>(&mut self, range: T, enabled: bool) {
+        let slice = self.data.as_mut_slice();
```
Was this a small optimization?
I think this was an attempt to resolve the smallvec issues. This has since been reverted.
src/lib.rs (Outdated)

```diff
-        while let Some(nxt) = self.iter.next() {
-            if self.other.contains(nxt) {
-                return Some(nxt);
+        while self.bitset == 0 {
```
Does this yield a measurable speedup?
No, actually this only served to make it slower, though not by much. Ultimately, I reverted it.
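For context, a word-at-a-time intersection iterator of the kind being discussed could look roughly like the sketch below; the struct shape and field names are assumptions, not the exact PR code.

```rust
// Assumed sketch: AND one block from each bitset at a time and pop the set
// bits of the result, instead of testing every candidate bit with contains().
const BITS: usize = std::mem::size_of::<usize>() * 8;

struct Intersection<'a> {
    a: std::slice::Iter<'a, usize>,
    b: std::slice::Iter<'a, usize>,
    bitset: usize,    // remaining set bits of the current ANDed block
    block_idx: usize, // index of the block after the current one
}

impl<'a> Iterator for Intersection<'a> {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        while self.bitset == 0 {
            // Advance to the next pair of blocks; stop when either side ends.
            self.bitset = self.a.next()? & self.b.next()?;
            self.block_idx += 1;
        }
        // Pop the lowest set bit of the current combined block.
        let offset = self.bitset.trailing_zeros() as usize;
        self.bitset &= self.bitset - 1;
        Some((self.block_idx - 1) * BITS + offset)
    }
}
```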
LGTM. Please squash the commits, then merge.
Changed `Block` to be an alias for `usize` instead of `u32`, which should use roughly 2x fewer instructions for similarly sized bit sets on 64-bit machines. This also allows us to skip more bits at a time in sparse bitsets.
This partially addresses #73, though full SIMD support is probably more desirable here.
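Roughly, the core of the change is the block alias and the constant derived from it (a simplified sketch, not the exact diff):

```rust
// Sketch of the change (not the exact diff): Block becomes a machine word.
pub type Block = usize; // was: u32

// 64 on 64-bit targets (was 32), so each load, AND, and zero-check now covers
// twice as many bits, and iteration over sparse sets can skip 64 bits at once.
pub const BITS: usize = std::mem::size_of::<Block>() * 8;
```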