
CUDA-optimized ProRes decoder V0.2b

After I released the early beta I received a couple of messages. To be fair, most of them were something like: “80 fps? I’m not impressed…”
But one message came from a company that is also working on a GPU-optimized decoder/encoder. Honestly, it’s good to know someone else is working on this, as I hope to get a chance to compare performance and quality at some point.
I don’t have much info on what exactly they are trying to do (most probably just optimizing the FFmpeg version), but as far as I know, their initial goal is to decode/encode ProRes 4:2:2 HD at 1000 fps.

As my decoder is at an early stage and the encoder is way behind the decoder, I’m still not sure whether 1000 fps is the limit, but I guess it will be the first goal I try to reach.

Even though I haven’t had much time over the last few weeks, I was inspired by the fact that someone else is working on this, so today I’m ready to release a new version of the decoder.

V0.2b is still an early beta and still decodes only progressive frames, but it’s twice as fast as the previous version (~150 fps for 4K and ~610 fps for HD).

The API has not changed.

Some thoughts on performance optimization

I have a bad habit of trying to make my code highly optimized even when the unit is not fully finished, and I’m serious when I say it’s a bad habit, as it really slows down development and sometimes affects code quality…

For example, here is an interesting trick I added to my ProRes encoder/decoder:

When you encode/decode a DC coefficient you use one of 4 existing codebooks, and the codebook is chosen based on the value of the DC that was just encoded, which obviously can be way bigger than 3. So what most of us would do here is just add a branch:

if (codebook > 3) {
    codebook = 3;
}

And in 999 cases out of 1000 that will probably be the best solution. In my case, though, I have 3 * 30 DCs per slice (technically 3 * 32, but the first 2 DCs don’t need any branching to pick the codebook), or 364500 DCs per UHD frame, which is kinda bad…
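For anyone wondering where 364500 comes from, here is a rough sketch of the arithmetic. The 3 * 32 and "first 2" figures are the ones from above; the rest (16x16 macroblocks, 8 macroblocks per slice, four 8x8 blocks per macroblock in each of 3 full-resolution planes) are my assumptions about the layout, and the names are made up for illustration:

```c
/* Assumed layout (not spelled out in the post): 16x16 macroblocks,
 * 8 macroblocks per slice, four 8x8 blocks per macroblock per plane,
 * 3 full-resolution planes. All names here are hypothetical. */
enum {
    MB_SIZE            = 16, /* macroblock width/height in pixels      */
    MBS_PER_SLICE      = 8,  /* macroblocks in one slice               */
    BLOCKS_PER_MB      = 4,  /* 8x8 blocks per macroblock, per plane   */
    PLANES             = 3,  /* colour planes                          */
    DCS_WITHOUT_BRANCH = 2   /* first 2 DCs per plane need no clamping */
};

/* Number of DC coefficients per frame that go through the clamp.
 * Assumes width is a multiple of MB_SIZE * MBS_PER_SLICE and height
 * is rounded down to whole macroblock rows. */
static int branching_dcs_per_frame(int width, int height)
{
    int mbs_per_row = width / MB_SIZE;
    int mb_rows     = height / MB_SIZE;
    int slices      = (mbs_per_row / MBS_PER_SLICE) * mb_rows;
    int per_slice   = PLANES * (MBS_PER_SLICE * BLOCKS_PER_MB - DCS_WITHOUT_BRANCH);

    return slices * per_slice;
}
```

With 3840x2160 that gives 240 x 135 macroblocks, 4050 slices and 90 clamped DCs per slice, which matches the number above.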

As you can see, the max value of the codebook is 3, i.e. (2^2 - 1), and this is a case where we can easily avoid branching, so I replaced the branch with this code:

codebook = (3 & (4 - !!(codebook & 0xfffc))) + ((codebook & 3) & (4 - !(codebook & 0xfffc)));

This avoids branching and in my particular case improves performance a bit (~0.3-0.5%), but to be fair this code would be hard to maintain and I still doubt whether I should push it.
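Since an expression like this is easy to get wrong, a quick way to gain some confidence is to compare it against the plain branch for every input. Below is a minimal sketch of such a check; the function names are hypothetical, and I assume the value fits in 16 bits, because the 0xfffc mask only looks at bits 2..15:

```c
/* Straightforward version: clamp to 3 with a branch. */
static unsigned clamp_branch(unsigned codebook)
{
    if (codebook > 3)
        codebook = 3;
    return codebook;
}

/* Branchless version, same expression as in the post.
 * Assumes codebook fits in 16 bits (the mask is 0xfffc). */
static unsigned clamp_branchless(unsigned codebook)
{
    return (3 & (4 - !!(codebook & 0xfffc)))
         + ((codebook & 3) & (4 - !(codebook & 0xfffc)));
}

/* Returns how many 16-bit inputs make the two versions disagree. */
static unsigned count_mismatches(void)
{
    unsigned mismatches = 0;

    for (unsigned v = 0; v <= 0xffff; v++)
        if (clamp_branch(v) != clamp_branchless(v))
            mismatches++;

    return mismatches;
}
```

If count_mismatches() returns 0, the two versions agree on the whole 16-bit range. For inputs above 0xffff the mask no longer sees the high bits, so the branchless version is only safe when the value is known to fit.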

This is a really tricky example, which probably shouldn’t even be considered, especially when your unit is not 100% complete and there are surely a million other ways to optimize your code.

One more example

This one is more critical, and I ran into it while doing contract work.
I had to fix a performance regression that the company hit after upgrading from FFmpeg 3.x to FFmpeg 4.2, as they use FFmpeg to demux MOV files.

One of the developers found out that FFmpeg had finally added 12-bit decoding support and now reports all HQX and 4444 profiles as 12-bit, and indeed that commit caused the regression. It sounds weird if you consider that they use the official libs from Apple for decoding…

So how is that possible? My first thought was that the decoder was still being used somewhere; how else could file open become 2x slower? So how do you open a file with the FFmpeg libs? Something like:

avformat_open_input ...
avformat_find_stream_info ..

What I found out is that avformat_find_stream_info reads the first frame from the file and decodes it, and does so single-threaded. How do you like that? To be fair, there is a reason behind it, as sometimes there is no way to get all the needed metadata without decoding the frame header (for example bit depth, pixel format and so on). But the problem is that we don’t need to decode the whole frame to get that metadata, we just need to decode the frame header… So I added an extra flag which forces the FFmpeg ProRes and DNx decoders to stop after the headers are decoded, something like:

// proresdec2.c, line 784
if (avctx->flags & AV_CODEC_STOP_AFTER_HEADER_DECODED) {
    return avpkt->size;
}

Believe it or not, instead of a 2x slowdown we achieved a 2x speedup, and file open now takes constant time regardless of resolution, whereas with the previous version the higher the resolution, the slower file open would be…
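To show the idea in isolation, here is a toy sketch of a "decoder" with the same kind of early exit. This is not FFmpeg code: the header layout, the flag and all the names are hypothetical, and the loop is just a stand-in for the expensive part of real decoding:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag, standing in for the custom flag from the patch. */
#define TOY_FLAG_STOP_AFTER_HEADER 0x1u

/* Hypothetical 5-byte header: width (2), height (2), bit depth (1). */
enum { TOY_HEADER_SIZE = 5 };

typedef struct {
    int    width;
    int    height;
    int    bit_depth;
    size_t payload_bytes_touched; /* how much work the decoder did */
} ToyFrameInfo;

/* Returns the number of bytes consumed, or -1 if the packet is too short. */
static int toy_decode(const uint8_t *pkt, size_t size, unsigned flags,
                      ToyFrameInfo *info)
{
    if (size < TOY_HEADER_SIZE)
        return -1;

    /* Header parsing: cheap and independent of resolution. */
    info->width     = (pkt[0] << 8) | pkt[1];
    info->height    = (pkt[2] << 8) | pkt[3];
    info->bit_depth = pkt[4];
    info->payload_bytes_touched = 0;

    /* Early exit: the caller only wanted the metadata. */
    if (flags & TOY_FLAG_STOP_AFTER_HEADER)
        return (int)size;

    /* Stand-in for the expensive part: walk every payload byte. */
    for (size_t i = TOY_HEADER_SIZE; i < size; i++)
        info->payload_bytes_touched++;

    return (int)size;
}
```

With the flag set, the work done no longer depends on the packet size, which is the same effect as in the file open case: the metadata is filled in, but the payload is never touched.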