Here's a question: on $current_year x86-64 or ARM, which of these is fastest?

:mgsgb_1: ((b>>7)&1)+((b>>6)&1)+((b>>5)&1)+((b>>4)&1)+((b>>3)&1)+((b>>2)&1)+((b>>1)&1)+(b&1)

:mgsgb_2: lookuptable[b]

:mgsgb_3: switch/case with 256 entries

:mgsgb_4: for(i=0;b!=0;b&=b-1)i++;
tomatocomputer.gif

@p Whatever accesses memory the least so 1 probably

@matrix The first one is ~25 instructions. A single deref might be faster, but 3/4 also don't touch memory. They do both branch, though. 4 is a loop that boils down to maybe four instructions. I suspect 3 is the slowest but I don't know.

@p Hmmm, thinking about it more, I think a lookup table might be the fastest, because the whole table could fit into L1 cache.
Manual addition would probably be best on some older CPU.
Switch case is pretty fast, but could get slow with a poor branch prediction.
Loop has too many jumps and is variable.

@matrix I think 4 would be fastest on older CPUs, but it depends on whether older means "1980s" or "Pentium IV". (4 would definitely be slowest on a P4.)

It is hard to actually benchmark something that is potentially less expensive than the handful of lines that actually perform the benchmarking.
Follow

@p By "old" I was thinking around turn of the century.
I don't know how to properly benchmark such tiny tasks either. I think you just have to run each one like billion times and see the difference.

· · Web · 1 · 0 · 1
@matrix Yeah, but if it's buried in the branch, then you might not get an actual difference. There are CPU instructions that give you access to high-res timers but I think this might still be hard to do.
Sign in to participate in the conversation
Game Liberty Mastodon

Mainly gaming/nerd instance for people who value free speech. Everyone is welcome.