Sandy Bridge:
   For popcnt method, use AVX compiler option (-mavx, /arch:AVX)
   For SSE3 method, do not use AVX compiler option
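
   As a rough illustration of what the popcnt method does per 256-bit
   section, the sketch below sums four 64-bit population counts. The names
   popcnt256 and popcnt_buffer are hypothetical and not taken from the
   benchmarked code. It assumes GCC or Clang, since __builtin_popcountll is
   a compiler builtin; MSVC would need its own intrinsic instead. Whether
   the builtin becomes a hardware popcnt instruction depends on the target
   options passed to the compiler.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: count the set bits in one 256-bit section,
 * viewed as four 64-bit words. */
static inline uint64_t popcnt256(const uint64_t w[4])
{
    return (uint64_t)__builtin_popcountll(w[0])
         + (uint64_t)__builtin_popcountll(w[1])
         + (uint64_t)__builtin_popcountll(w[2])
         + (uint64_t)__builtin_popcountll(w[3]);
}

/* Hypothetical driver: sum the counts over n_sections consecutive
 * 256-bit sections (4 * n_sections 64-bit words in total). */
uint64_t popcnt_buffer(const uint64_t *data, size_t n_sections)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n_sections; i++)
        total += popcnt256(data + 4 * i);
    return total;
}
```

   The four counts are independent of each other, so an unrolled inner
   loop can keep several of them in flight at once.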

Best-case code generation (unrolled inner loop), per 256-bit section

              gcc         msvc
avx2           14         14
avx            21         21
xmm popcnt     19         18-20