CUDA-optimized ProRes decoder

I haven’t had much free time over the last few months, so I’ve really struggled to release a new version of AMCDX Video Patcher. And to be fair, with the main challenges solved, I don’t have that much motivation.

I’m still interested in the performance side of the project, which, to be fair, was the main goal.
There isn’t really much left to optimize on the CPU side, though (the ProRes encoder/decoder outperforms the FFmpeg implementation, and Frame Editor and File To File are fast enough)…

So I finally found some time to start the part of the project I had been dreaming of for the last year: GPU optimization.

So today I’m officially releasing a CUDA-optimized ProRes decoder. It’s an early beta: it doesn’t support interlaced frames yet and there’s plenty of room for further optimization, but it’s good enough to show.

So what is supported in v0.1b?
1) Decoding of progressive ProRes frames on GPU (both 444 and 422 are supported)
2) Always decodes to 12 bit

API
amcdx_cu_prores_decoder.dll exports the following functions:

1) void * amcdx_cupr_decoder_create() – creates a decoder instance, allocates memory, and so on.
Returns decoder handle.

2) int amcdx_cupr_decoder_decode(void * decoder, void * buffer, int size) – decodes the passed frame
decoder – decoder handle returned by amcdx_cupr_decoder_create
buffer – encoded ProRes frame
size – size of the encoded frame
Returns 0 on success, otherwise returns an error code.

3) unsigned int amcdx_cupr_get_pitch(void * decoder) – returns the plane line size. In the 444 case the line size is the same for all 3 planes; in the 422 case the line size of the chroma planes equals amcdx_cupr_get_pitch / 2
decoder – decoder handle returned by amcdx_cupr_decoder_create

4) unsigned int amcdx_cupr_get_width(void * decoder) – returns the width of the decoded frame
decoder – decoder handle returned by amcdx_cupr_decoder_create

5) unsigned int amcdx_cupr_get_height(void * decoder) – returns the height of the decoded frame
decoder – decoder handle returned by amcdx_cupr_decoder_create

6) int amcdx_cupr_is_444(void * decoder) – returns 1 if the decoded frame is 444, otherwise returns 0 (422)
decoder – decoder handle returned by amcdx_cupr_decoder_create

7) void amcdx_cupr_decoder_read(void * decoder, void ** buffer) – copies the decoded frame from GPU to CPU. This function should be called if your frame buffer line sizes are equal to amcdx_cupr_get_pitch
decoder – decoder handle returned by amcdx_cupr_decoder_create
buffer – output frame plane buffers

8) void amcdx_cupr_decoder_read_pitch(void * decoder, void ** buffer, int * pitch) – copies the decoded frame from GPU to CPU. This function should be called if your frame buffer line sizes are not equal to amcdx_cupr_get_pitch
decoder – decoder handle returned by amcdx_cupr_decoder_create
buffer – output frame plane buffers
pitch – output frame plane line sizes

9) void amcdx_cupr_decoder_destroy(void * decoder) – destroys the decoder instance
decoder – decoder handle returned by amcdx_cupr_decoder_create

10) const char * amcdx_cupr_version() – returns the library version string
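
To make the pitch rules from items 3, 7 and 8 concrete, here is a minimal sketch of how a caller might size its output planes and pick between the two read functions. The names plane_layout, make_plane_layout and needs_read_pitch are hypothetical helpers for illustration only and are not exported by the DLL. The sketch also assumes that the pitch is expressed in bytes, that plane 0 is luma with planes 1 and 2 being chroma, and that every plane holds exactly amcdx_cupr_get_height lines. None of these assumptions is confirmed above, so check them against the actual library before relying on them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type, not part of the DLL: per-plane line sizes and
   total sizes, in the same units as amcdx_cupr_get_pitch (assumed bytes). */
typedef struct {
    size_t pitch[3];  /* line size of planes 0..2 (assumed order: Y, Cb, Cr) */
    size_t size[3];   /* pitch * height for each plane */
} plane_layout;

/* pitch, height and is_444 would come from amcdx_cupr_get_pitch,
   amcdx_cupr_get_height and amcdx_cupr_is_444 after a successful decode. */
static plane_layout make_plane_layout(unsigned int pitch,
                                      unsigned int height,
                                      int is_444)
{
    plane_layout l;
    /* 444: all planes share the pitch; 422: chroma lines are half as long. */
    size_t chroma_pitch = is_444 ? pitch : pitch / 2;

    assert(pitch > 0 && height > 0);
    l.pitch[0] = pitch;
    l.pitch[1] = chroma_pitch;
    l.pitch[2] = chroma_pitch;
    /* 422 halves chroma horizontally only, so each plane has `height` lines
       (assuming the decoder does not pad the height). */
    for (int i = 0; i < 3; i++)
        l.size[i] = l.pitch[i] * (size_t)height;
    return l;
}

/* Returns 1 if the caller's line sizes differ from the decoder's, i.e. when
   amcdx_cupr_decoder_read_pitch is the call to use instead of
   amcdx_cupr_decoder_read; returns 0 when the line sizes match. */
static int needs_read_pitch(const plane_layout *l, const int dst_pitch[3])
{
    for (int i = 0; i < 3; i++)
        if (dst_pitch[i] < 0 || (size_t)dst_pitch[i] != l->pitch[i])
            return 1;
    return 0;
}
```

In a real caller the three size values would be used to allocate the plane buffers passed as the buffer argument, and the result of needs_read_pitch would decide whether to call amcdx_cupr_decoder_read or amcdx_cupr_decoder_read_pitch.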

I added a simple wrapper so it can be used with FFmpeg.
Pre-built Binaries

P.S. As I mentioned before, it’s an early beta, so I haven’t done a lot of benchmarking. Currently I get ~80 FPS decoding 4K ProRes 4444 XQ on a Quadro P4000.