If the wait times out, we used to simply restart the loop, which causes
it to check this->stop and exit if set to true. However, nvfbc_stop
already calls lgSignalEvent, which would wake up the pointer thread to
perform the check, so there is no need to set a timeout on the wait.
This commit adds damage tracking to the DXGI textures, and only copies the
damaged areas to the textures with ID3D11DeviceContext::CopySubresourceRegion.
The sleep logic in waitFrame makes it difficult for this to reduce the
latency, but removing it shows significant improvements (6-7 ms to ~3 ms)
when a tiny portion of the screen is damaged, while showing no difference on
full screen damage.
This implementation uses a line sweep algorithm to copy the precisely the
intersection of all accumulated damage rectangles, ensuring that every
pixel is copied exactly once, and no pixel is ever copied multiple times.
Furthermore, once a row has been swept, we update the framebuffer write
pointer immediately.
Certain drivers do not support pitches that are not multiples of 128 bytes,
and instead just does some kind of rounding internally. On DXGI, this is not
a problem because the API rounds pixel pitch, but NvFBC does not. This causes
certain resolutions to simply not work with dmabuf, most notably 3440x1440,
which is 1440p ultrawide.
Since we are copying pixels with the CPU anyways, we might as well round the
pitch up to 128 bytes (32 pixels).
This commit adds a new host configuration option, nvfbc:diffRes, which
specifies the dimensions of every block in the diff map. This defaults to
128, meaning the default 128x128 block size.
Since block sizes other than 128x128 is not guaranteed to be supported by
NvFBC, the function NvFBCGetDiffMapBlockSize was introduced to query the
support and output the actual block size used.
For adjacent changed regions, we actually use the bounding box for the
entire polygon. This may result in more area being damaged than strictly
necessary, but is nevertheless desirable since it reduces the number of
rectangles.
Before we try and perhaps fail to init DXGI, we should print out what
the device is so that when there is an error report we can immediately
see if the user has the QXL device attached still.
While it's correct for DXGI to use a asyncronous waitFrame model, other
capture interfaces such as NvFBC it is not correct. This change allows
the capture interface to specify which is more correct for it and moves
the waitFrame/post into the main thread if async is not desired.
Before, we only break out of the current row when a change is detected,
and all subsequent rows are still scanned. Now we break out of the entire
loop. This should make change detection ever so slightly faster.
Testing shows that `D3DKMTSetProcessSchedulingPriorityClass` has a
positive performance impact for NvFBC as well as DXGI, as such always
try to boost the priority for the windows host.
This so called "enhanced" event logic is completely flawed and can never
work correctly, better to strip it out and put our faith in windows to
handle the events for us.
And yes, I am fully aware I wrote the utter trash in the first place :)
This change adds an average function to time how long it takes the GPU
to copy and map the texture, and then uses this average to sleep for 80%
of this average lowering CPU usage and potentially decreasing lock
contention.
It has been detemined that a failure to init NvFBC causes a 20-30%
performance penalty on non NvFBC supported hardware (GeForce) when using
DXGI, as such reverse the order and default to using DXGI as our first
option.
If NvFBC is still desired, pr #500 added the option `app:capture` which
can be used to force NvFBC.