We used to increment the source buffer index by width * bpp, not by pitch.
This is incorrect and the root cause behind #670.
This is a regression that appeared in 196050bd23.
Extensive profiling reveals that the glibc memcpy performs up to 2x
faster then the existing SIMD implementation that was in use here. This
patch also will copy large 1MB chunks if the pitch of the source and
destination match further increasing throughput.
Note: This only works with the KVMFR kernel module in a VM->VM
configuration. If this causes issues it can be disabled with the new
option `app:allowDMA`
This changes the method of the memory copy from the host application to
the guest. Instead of performing a full copy from the capture device
into shared memory, and then flagging the new frame, we instead set a
write pointer, flag the client that there is a new frame and then copy
in chunks of 1024 bytes until the entire frame is copied. The client
upon seeing the new frame flag begins to poll at high frequency the
write pointer and upon each update copies as much as it can into the
texture.
This should improve latency but also slightly increase CPU usage on the
client due to the high frequency polling.