This is known as the 'Hamming Weight', 'popcount' or 'sideways addition'.
The 'best' algorithm really depends on which CPU you are on and what your usage pattern is.
Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors. The parallel instructions will almost certainly be fastest. The single built-in instructions, however, are 'usually microcoded loops that test a bit per cycle; a log-time algorithm coded in C is often faster'.
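As a rough sketch of how the built-in route is usually reached from C: GCC and Clang expose a `__builtin_popcount` intrinsic, which the compiler may lower to a hardware instruction when the target supports it and otherwise replaces with a software fallback. The wrapper name `popcount_builtin` is made up for this example, and the code assumes one of those two compilers, since the intrinsic is an extension and not standard C.

```c
#include <stdint.h>

/* Assumption: GCC or Clang. __builtin_popcount is a compiler extension
   that takes an unsigned int and returns the number of 1 bits in it. */
int popcount_builtin(uint32_t x)
{
    return __builtin_popcount(x);
}
```

Whether this beats a hand-written version depends on the target CPU and the compiler flags in use, so it is worth measuring on your own hardware.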
A pre-populated table lookup method can be very fast if your CPU has a large cache and/or you are performing many of these operations in a tight loop. However, it can suffer from the cost of a 'cache miss', where the CPU has to fetch some of the table from main memory.
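A minimal sketch of the table approach, assuming a 256-entry table indexed by byte (other table sizes are possible). The names `bits_in_byte`, `init_bits_in_byte` and `popcount_table` are invented for this example.

```c
#include <stdint.h>

/* One entry per possible byte value; 256 bytes in total. */
static uint8_t bits_in_byte[256];

/* Fill the table once at startup.
   The count for i is its lowest bit plus the count for i / 2. */
void init_bits_in_byte(void)
{
    bits_in_byte[0] = 0;
    for (int i = 1; i < 256; i++)
        bits_in_byte[i] = (uint8_t)((i & 1) + bits_in_byte[i / 2]);
}

/* Look up each of the four bytes of x and add the results. */
int popcount_table(uint32_t x)
{
    return bits_in_byte[x & 0xFF]
         + bits_in_byte[(x >> 8) & 0xFF]
         + bits_in_byte[(x >> 16) & 0xFF]
         + bits_in_byte[x >> 24];
}
```

`init_bits_in_byte()` must be called before the first lookup. A table this small normally stays in cache when it is used in a hot loop, which is the situation where this method tends to do well.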
If you know that your input will consist mostly of 0 bits or mostly of 1 bits, then there are very efficient algorithms for these scenarios.
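One commonly used technique for the mostly-0 case, often attributed to Brian Kernighan, clears the lowest set bit on each pass, so the loop runs once per 1 bit instead of once per bit position. The sketch below is one possible take on it, with made-up function names. The mostly-1 variant simply counts the zeros of the input by running the same loop on its complement.

```c
#include <stdint.h>

/* Loop runs once per set bit, so it is cheap when few bits are set. */
int popcount_sparse(uint32_t x)
{
    int count = 0;
    while (x != 0) {
        x &= x - 1;   /* clears the lowest set bit of x */
        count++;
    }
    return count;
}

/* For inputs that are mostly 1 bits: count the 0 bits via the
   complement and subtract from the word width (32). */
int popcount_dense(uint32_t x)
{
    return 32 - popcount_sparse((uint32_t)~x);
}
```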
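I believe a very good general purpose algorithm is the following, known as the 'parallel' or 'variable-precision SWAR' algorithm. I have expressed it in C. It relies on an unsigned 32-bit value, so that right shifts fill with zeros and the multiplication wraps around predictably; you may need to adjust it for other languages (e.g. using >>> instead of >> in Java):

    #include <stdint.h>

    int NumberOfSetBits(uint32_t i)
    {
        /* Unsigned type: the shifts and the wrapping multiply are well defined. */
        i = i - ((i >> 1) & 0x55555555);                 /* count per 2-bit pair */
        i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  /* count per 4-bit nibble */
        i = (i + (i >> 4)) & 0x0F0F0F0F;                 /* count per byte */
        return (int)((i * 0x01010101) >> 24);            /* sum of the four bytes */
    }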
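To illustrate the 'variable-precision' part of the name, here is a sketch of the same steps widened to 64 bits. The masks are the 32-bit ones repeated across the wider word, and the final shift becomes 56 so that the top byte is the one extracted. The function name `popcount64_swar` is invented for this example.

```c
#include <stdint.h>

int popcount64_swar(uint64_t x)
{
    /* Each 2-bit field becomes the count of set bits in that pair. */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* Add adjacent pairs: each 4-bit field holds a count of 0..4. */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* Add adjacent nibbles: each byte holds a count of 0..8. */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* The multiply accumulates all eight byte counts into the top byte. */
    return (int)((x * 0x0101010101010101ULL) >> 56);
}
```

The largest possible result, 64, still fits in a single byte, which is why the multiply-and-shift trick carries over unchanged.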