Tag Archives: ffmpeg

Cuda optimized ProRes decoder

I haven't had much free time over the last few months, so I have really struggled to release a new version of AMCDX Video Patcher. And to be fair, now that the main challenges are solved, I don't have that much motivation.

I'm still interested in the performance part of the project, which, to be fair, was the main goal.
There isn't much left to optimize on the CPU side, though (the ProRes encoder/decoder outperforms the FFmpeg implementation, and Frame Editor and File To File are fast enough)…

So I finally found some time to start the part of the project I had been dreaming of for the last year: GPU optimization.

So today I am officially releasing the CUDA-optimized ProRes decoder. It's an early beta: it doesn't support interlaced frames yet and there is plenty of room for extra optimizations, but it's good enough to show.

So what is supported in v0.1b?
1) Decoding of progressive ProRes frames on the GPU (both 444 and 422 are supported)
2) Output is always decoded to 12 bit

API
amcdx_cu_prores_decoder.dll exports functions:

1) void * amcdx_cupr_decoder_create() – creates a decoder instance, allocates memory, and so on.
Returns the decoder handle.

2) int amcdx_cupr_decoder_decode(void * decoder, void * buffer, int size) – decodes the passed frame
decoder – decoder handle returned by amcdx_cupr_decoder_create
buffer – encoded ProRes frame
size – size of the encoded frame
Returns 0 on success, otherwise returns an error code.

3) unsigned int amcdx_cupr_get_pitch(void * decoder) – returns the plane line size. In the 444 case the line size is the same for all 3 planes; in the 422 case the line size of the chroma planes equals amcdx_cupr_get_pitch / 2
decoder – decoder handle returned by amcdx_cupr_decoder_create

4) unsigned int amcdx_cupr_get_width(void * decoder) – returns width of decoded frame
decoder – decoder handle returned by amcdx_cupr_decoder_create

5) unsigned int amcdx_cupr_get_height(void * decoder) – returns height of decoded frame
decoder – decoder handle returned by amcdx_cupr_decoder_create

6) int amcdx_cupr_is_444(void * decoder) – returns 1 if we have 444 chroma subsampling, otherwise returns 0
decoder – decoder handle returned by amcdx_cupr_decoder_create

7) void amcdx_cupr_decoder_read(void * decoder, void ** buffer) – copies the decoded frame from GPU to CPU. This function should be called if your frame buffer line sizes are equal to amcdx_cupr_get_pitch
decoder – decoder handle returned by amcdx_cupr_decoder_create
buffer – output frame plane buffers

8) void amcdx_cupr_decoder_read_pitch(void * decoder, void ** buffer, int * pitch) – copies the decoded frame from GPU to CPU. This function should be called if your frame buffer line sizes are not equal to amcdx_cupr_get_pitch
decoder – decoder handle returned by amcdx_cupr_decoder_create
buffer – output frame plane buffers
pitch – output frame plane line sizes

9) void amcdx_cupr_decoder_destroy(void * decoder) – destroys decoder instance
decoder – decoder handle returned by amcdx_cupr_decoder_create

10) const char * amcdx_cupr_version() – returns library version string
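
To size the output buffers for amcdx_cupr_decoder_read, the pitch rule from item 3 can be turned into a small helper. This is only a sketch: plane_layout and PlaneLayout are hypothetical names that are not part of the DLL, and it assumes the pitch is expressed in bytes and that all three planes share the frame height (422 subsamples chroma horizontally only).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not exported by amcdx_cu_prores_decoder.dll.
 * Assumption: pitch is a line size in bytes. */
typedef struct {
    unsigned int pitch[3]; /* line size of the Y, Cb and Cr planes */
    size_t       size[3];  /* pitch * height for each plane */
} PlaneLayout;

static PlaneLayout plane_layout(unsigned int pitch, unsigned int height, int is_444)
{
    PlaneLayout l;
    l.pitch[0] = pitch;
    /* 444: chroma lines are as long as luma lines; 422: half as long */
    l.pitch[1] = l.pitch[2] = is_444 ? pitch : pitch / 2;
    for (int i = 0; i < 3; i++) {
        /* every plane has the full frame height */
        l.size[i] = (size_t)l.pitch[i] * height;
    }
    return l;
}
```

With the DLL linked, the expected order of calls would be amcdx_cupr_decoder_create, amcdx_cupr_decoder_decode, the three getters (pitch, height, is_444), amcdx_cupr_decoder_read into three buffers of the sizes computed above, and finally amcdx_cupr_decoder_destroy.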

I added a simple wrapper so it could be used with FFmpeg
Pre-built Binaries

P.S. As I mentioned before, it's an early beta, so I haven't done a lot of benchmarking. Currently I get ~80 FPS decoding 4K ProRes 4444 XQ on a Quadro P4000

Some thoughts on performance optimization

I have a bad habit of trying to make my code highly optimized even when the unit is not fully finished, and I'm serious when I say it's a bad habit, as it really slows down the development process and sometimes affects code quality…

For example here is an interesting trick I added to my ProRes encoder/decoder:

When you encode/decode a DC coefficient you use one of 4 existing codebooks, and the codebook is chosen based on the value of the DC just encoded, which obviously can be way bigger than 3. So what most of us would do here is just add a branch:

if (codebook > 3) {
    codebook = 3;
}

And in 999 of 1000 cases that is probably the best solution. In my case, I have 3 * 30 DCs per slice (technically 3 * 32, but the first 2 DCs don't need any branching to know the codebook), or 364500 DCs per UHD frame, which is kinda bad…

As you can see, the max value of codebook is 3, i.e. (2^2 – 1), and this is a case where we can easily avoid branching, so I replaced that branch with this code:

codebook = (3 & (4 - !!(codebook & 0xfffc))) + ((codebook & 3) & (4 - !(codebook & 0xfffc)));

This avoids branching and in my particular case improves performance a bit (~0.3-0.5%), but to be fair this code would be hard to maintain and I still doubt whether I should push it.

This is a really tricky example which probably shouldn't be considered, especially when your unit is not 100% complete and there are surely a million other ways to optimize your code.
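
Since the branchless expression is hard to read, one way to gain confidence in it is an exhaustive comparison against the plain branch. The sketch below wraps both forms in functions (the function names are mine, for illustration only) and counts disagreements over every 16-bit input. Note that the 0xfffc mask means the trick only holds while codebook fits in 16 bits.

```c
#include <assert.h>

/* Reference version with a branch. */
static unsigned clamp3_branch(unsigned codebook)
{
    if (codebook > 3)
        codebook = 3;
    return codebook;
}

/* Branchless version from the post; valid while codebook fits in 16 bits.
 * If any bit above the low two is set, the first term yields 3 and the
 * second yields 0; otherwise the first yields 0 and the second codebook. */
static unsigned clamp3_branchless(unsigned codebook)
{
    return (3 & (4 - !!(codebook & 0xfffc))) +
           ((codebook & 3) & (4 - !(codebook & 0xfffc)));
}

/* Returns the number of 16-bit inputs on which the two versions disagree. */
static unsigned count_mismatches(void)
{
    unsigned mismatches = 0;
    for (unsigned v = 0; v <= 0xffff; v++) {
        if (clamp3_branch(v) != clamp3_branchless(v))
            mismatches++;
    }
    return mismatches;
}
```

A check like this is cheap to keep next to the optimized code, which takes away some of the maintenance worry.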

One more example

This one is more critical; I faced it while doing contract work.
I had to fix a performance regression that the company hit after upgrading from FFmpeg 3.x to FFmpeg 4.2, as they use FFmpeg to demux MOV files.

One of the developers found out that FFmpeg had finally added 12-bit decoding support and now reports all HQX and 4444 profiles as 12 bit, and indeed that commit caused the regression. It sounds weird if you consider the fact that they use the official libs from Apple for decoding…

So how is it possible? My first thought was that the decoder was still being used somewhere; how else could file open become 2x slower? So how do you open a file with the FFmpeg libs? Something like:

avformat_open_input ...
avformat_find_stream_info ..

What I found out is that avformat_find_stream_info reads the first frame from the file and decodes it, and does so single-threaded. How do you like that? To be fair, there is a reason behind it, as sometimes there is no way to get all the needed metadata without decoding the frame header (for example bit depth or pixel format). But the problem is that we don't need to decode the whole frame to get that metadata; we just need to decode the frame header… So I added an extra flag which forces the FFmpeg ProRes and DNx decoders to stop after the headers are decoded, something like:

//proresdec2.c .  line 784
if (avctx->flags & AV_CODEC_STOP_AFTER_HEADER_DECODED) {
    return avpkt->size;
}

Believe it or not, instead of a 2x slowdown we achieved a 2x speedup, and file open now takes constant time regardless of resolution, whereas with the previous version the higher the resolution, the slower the file open.
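
To illustrate why stopping after the header is enough, here is a minimal sketch of reading the metadata straight from the first bytes of a ProRes frame. The field offsets follow the way FFmpeg's ProRes decoder reads the frame header as far as I understand it (treat them as assumptions), and prores_peek_header and ProresHeaderInfo are hypothetical names, not FFmpeg API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    unsigned width, height;
    int is_444;     /* 1 for 4:4:4 chroma, 0 for 4:2:2 */
    int interlaced; /* 0 for progressive frames */
} ProresHeaderInfo;

/* Read a big-endian 16-bit value. */
static unsigned rb16(const uint8_t *p)
{
    return ((unsigned)p[0] << 8) | p[1];
}

/* Returns 0 on success, -1 if the buffer does not look like a ProRes frame.
 * Assumed layout: 4-byte frame size, 'icpf' tag, then the frame header with
 * width at +8, height at +10 and a flags byte at +12. */
static int prores_peek_header(const uint8_t *buf, size_t size, ProresHeaderInfo *out)
{
    const uint8_t *hdr;

    if (size < 8 + 13 || memcmp(buf + 4, "icpf", 4) != 0)
        return -1;

    hdr = buf + 8;
    out->width      = rb16(hdr + 8);
    out->height     = rb16(hdr + 10);
    out->is_444     = (hdr[12] & 0xC0) == 0xC0;   /* top two bits: chroma format */
    out->interlaced = ((hdr[12] >> 2) & 3) != 0;  /* frame type, 0 = progressive */
    return 0;
}
```

The cost of this is a handful of byte reads, independent of resolution, which matches the constant-time file open described above.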

FFMPEG + GPL

Just to make things clear

0) Yes, I agree I violated the GPL and I feel sorry, but I didn't do it on purpose (and to be fair, I didn't know many licensing details until the week I was notified about my violation)

1) After Kieran left the comment about the violation I made the repo private (to spend some time understanding all the details), and today I fully removed the repo

2) To be fair, it's easy to see it wasn't done on purpose, as I never posted updates even though the posted version had an interlaced coding bug. I was asked a couple of times privately to make a custom build with my encoder and I declined, as the only purpose I pursued was to show that the encoder exists and performs well

3) Why don't I disclose the source code?

3.1) I was going to rework prores_ks or add one more encoder, I had an exact plan for how to do it, and I started on it. I sent a couple of patches: the 1st was approved, the 2nd is still under review. I decided not to wait forever, and I'm not one to ping every day/week to get it pushed (I believe that if the community needs something it will be pushed, and that's easy to prove with my MXF OP1b research, where my changes were pushed even though I never sent that patch)

3.2) I was still interested in finishing my ProRes encoder, so I continued to work on it. And now that the encoder is done, I plan to make a product based on it. It will be free but probably closed-source.

3.3) Even if one day I want to make it part of FFmpeg, it won't be easy to do, as my implementation is done in C++ and FFmpeg is C

3.4) So the build I shared (and later removed) is literally a bag of tricks.

3.5) So when Kieran/Martin/Carl or whoever says I should share prores_amcdx_encoder, I can easily do it, but what will you see there? It's basically a skeleton copied from prores_anatoly with the whole logic replaced by calls to functions from a private static library:

#include "libavutil/opt.h"
#include "avcodec.h"
#include "internal.h"
#include "profiles.h"
#include "prores_defs.hpp"

#define DEFAULT_SLICE_MB_WIDTH 8

static const AVProfile profiles[] = {
    { FF_PROFILE_PRORES_PROXY,    "apco"},
    { FF_PROFILE_PRORES_LT,       "apcs"},
    { FF_PROFILE_PRORES_STANDARD, "apcn"},
    { FF_PROFILE_PRORES_HQ,       "apch"},
    { FF_PROFILE_PRORES_4444,     "ap4h"},
    { FF_PROFILE_PRORES_XQ,       "ap4x"},
    { FF_PROFILE_UNKNOWN }
};

static const int valid_primaries[9]  = { AVCOL_PRI_RESERVED0, AVCOL_PRI_BT709, AVCOL_PRI_UNSPECIFIED, AVCOL_PRI_BT470BG,
                                         AVCOL_PRI_SMPTE170M, AVCOL_PRI_BT2020, AVCOL_PRI_SMPTE431, AVCOL_PRI_SMPTE432,INT_MAX };
static const int valid_trc[4]        = { AVCOL_TRC_RESERVED0, AVCOL_TRC_BT709, AVCOL_TRC_UNSPECIFIED, INT_MAX };
static const int valid_colorspace[5] = { AVCOL_SPC_BT709, AVCOL_SPC_UNSPECIFIED, AVCOL_SPC_SMPTE170M,
                                         AVCOL_SPC_BT2020_NCL, INT_MAX };

typedef struct {
    AVClass *class;
    void *encoder;
    int cs;
    int qual;
    int field_order;
    int planes;
    int target_size;
} ProresContext;

static int prores_encode_frame2(AVCodecContext *avctx, AVPacket *pkt,
                               const AVFrame *pict, int *got_packet)
{
    ProresContext *ctx = avctx->priv_data;
    int ret;
    int frame_size = amcdx_pr_encoder_encode(ctx->encoder, (void **)pict->data, (int *)pict->linesize, ctx->planes); //for the time being


    if ((ret = ff_alloc_packet2(avctx, pkt, frame_size, 0)) < 0)
        return ret;

    amcdx_pr_encoder_read(ctx->encoder, pkt->data, &pkt->size);


    pkt->flags |= AV_PKT_FLAG_KEY;

    *got_packet = 1;
    return 0;
}

static av_cold int prores_encode_init2(AVCodecContext *avctx)
{
    ProresContext* ctx = avctx->priv_data;

    avctx->bits_per_raw_sample = 10;

    if (avctx->width & 0x1) {
        av_log(avctx, AV_LOG_ERROR,
                "frame width needs to be multiple of 2\n");
        return AVERROR(EINVAL);
    }

    if (avctx->width > 65534 || avctx->height > 65535) {
        av_log(avctx, AV_LOG_ERROR, "The maximum dimensions are 65534x65535\n");
        return AVERROR(EINVAL);
    }

    switch (avctx->profile) {
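    case FF_PROFILE_UNKNOWN:
        /* default to Standard so the profiles[] lookup below stays in range */
        avctx->profile = FF_PROFILE_PRORES_STANDARD;
        /* fall through */
    case FF_PROFILE_PRORES_STANDARD:
        ctx->qual = Quality_422;
        break;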
    case FF_PROFILE_PRORES_4444:
        ctx->qual = Quality_4444;
        break;
    case FF_PROFILE_PRORES_HQ:
        ctx->qual = Quality_422HQ;
        break;
    case FF_PROFILE_PRORES_LT:
        ctx->qual = Quality_422LT;
        break;
    case FF_PROFILE_PRORES_PROXY:
        ctx->qual = Quality_422Proxy;
        break;
    case FF_PROFILE_PRORES_XQ:
        ctx->qual = Quality_4444XQ;
        break;
    default:
        return -1;
        break;
    }

    switch (avctx->pix_fmt) {
    case AV_PIX_FMT_UYVY422:
        ctx->cs = ColorSpace_uyvy;
        ctx->planes = 1;
        break;
    case AV_PIX_FMT_YUV422P10:
        ctx->cs = ColorSpace_yuv10_422_planar;
        ctx->planes = 3;
        break;
    case AV_PIX_FMT_YUV422P12:
        ctx->cs = ColorSpace_yuv12_422_planar;
        ctx->planes = 3;
        break;
    case AV_PIX_FMT_YUV444P12:
        ctx->cs = ColorSpace_yuv12_444_planar;
        ctx->planes = 3;
        break;
    default:
        break;
    }

     //for the time being

    /* map FFmpeg's field order to the encoder's own enum; store it in ctx,
     * which is what gets passed to amcdx_pr_encoder_init below */
    switch (avctx->field_order)
    {
    case AV_FIELD_BT:
        ctx->field_order = FieldOrder_BottomFieldFirst;
        break;
    case AV_FIELD_TB:
        ctx->field_order = FieldOrder_TopFieldFirst;
        break;
    case AV_FIELD_PROGRESSIVE:
    default: //otherwise we think its progressive
        ctx->field_order = FieldOrder_Progressive;
        break;
    }

    ctx->encoder = amcdx_pr_encoder_create();

    if (ctx->target_size != 0) {
        amcdx_pr_encoder_set_frame_size(ctx->encoder, ctx->target_size);
    }

    avctx->codec_tag = MKTAG(profiles[avctx->profile].name[0], profiles[avctx->profile].name[1], profiles[avctx->profile].name[2], profiles[avctx->profile].name[3]);// AV_RL32((const uint8_t*)profiles[avctx->profile].name);

    return amcdx_pr_encoder_init(ctx->encoder, avctx->width, avctx->height, ctx->cs, ctx->qual, ctx->field_order) - 1;
}

static av_cold int prores_encode_close2(AVCodecContext *avctx)
{
    ProresContext* ctx = avctx->priv_data;
    amcdx_pr_encoder_destroy(ctx->encoder);

    return 0;
}

#define OFFSET(x) offsetof(ProresContext, x)
#define VE     AV_OPT_FLAG_VIDEO_PARAM | AV_OPT_FLAG_ENCODING_PARAM

static const AVOption options[] = {
    { "target_size", "force frame size", OFFSET(target_size), AV_OPT_TYPE_INT, { .i64 = 0 }, 0, INT_MAX, VE },
    { NULL }
};

static const AVClass proresamcdx_enc_class = {
    .class_name = "ProRes amcdx encoder",
    .item_name  = av_default_item_name,
    .option     = options,
    .version    = LIBAVUTIL_VERSION_INT,
};

AVCodec ff_prores_amcdx_encoder = {
    .name           = "prores_amcdx",
    .long_name      = NULL_IF_CONFIG_SMALL("Apple ProRes"),
    .type           = AVMEDIA_TYPE_VIDEO,
    .id             = AV_CODEC_ID_PRORES,
    .priv_data_size = sizeof(ProresContext),
    .init           = prores_encode_init2,
    .close          = prores_encode_close2,
    .encode2        = prores_encode_frame2,
    .pix_fmts       = (const enum AVPixelFormat[]){AV_PIX_FMT_UYVY422, AV_PIX_FMT_YUV422P10, AV_PIX_FMT_YUV422P12, AV_PIX_FMT_YUV444P12, AV_PIX_FMT_NONE},
    .capabilities   = AV_CODEC_CAP_FRAME_THREADS | AV_CODEC_CAP_INTRA_ONLY,
    .priv_class     = &proresamcdx_enc_class,
    .profiles       = NULL_IF_CONFIG_SMALL(ff_prores_profiles),
};

Prores QUALITY

UPD: The FFmpeg builds I shared were added just to show that the encoder is not a myth, but as FFmpeg (or at least some guys from the community) has something against it, I had to remove the repo… so all GitHub links below are invalid, sorry

In the previous post I forgot to mention a problem I have brought up a couple of times before: quality. It's not always easy to spot a big difference by eye, but I have some test files where every FFmpeg ProRes encoder really fails.

I uploaded one to github if you want to check:

https://github.com/da8eat/ffmpeg_prores_encoder/blob/master/1.bmp

and you can see how badly FFmpeg encodes it if you choose the Proxy profile:

ffmpeg -i 1.bmp -c:v prores_aw -profile:v 0 -pix_fmt yuv422p10le aw.mov

ffmpeg -i 1.bmp -c:v prores_ks -profile:v 0 -pix_fmt yuv422p10le ks.mov

As you can see, both look quite blurry (aw looks better, but as I said before it does nothing about rate control, and aw guarantees nothing except a correct bitstream)

This is how the same frame looks when encoded with the encoder I made:

ffmpeg -i 1.bmp -c:v prores_amcdx -profile:v 0 -pix_fmt yuv422p10le amcdx.mov

I uploaded all 3 mov files so you can compare the results yourself:
https://github.com/da8eat/ffmpeg_prores_encoder/blob/master/aw.mov
https://github.com/da8eat/ffmpeg_prores_encoder/blob/master/ks.mov
https://github.com/da8eat/ffmpeg_prores_encoder/blob/master/amcdx.mov

I also believe you have your own test footage that you want to try the encoder with, so I built the FFmpeg master branch with one more ProRes encoder added, so you can test and check the results. Usage:
https://github.com/da8eat/ffmpeg_prores_encoder/blob/master/build/ffmpeg_win_MSVS2015.7z
https://github.com/da8eat/ffmpeg_prores_encoder/blob/master/build/ffmpeg_osx_clang.7z

ffmpeg.exe -i 1.bmp -c:v prores_amcdx -profile:v 5 -pix_fmt yuv444p12le xq.mov

The profiles are the same as in the other FFmpeg ProRes encoders: 0 – Proxy, 1 – LT, 2 – Standard, 3 – HQ, 4 – 4444, 5 – XQ

Supported pixel formats: uyvy422, yuv422p10le, yuv422p12le, yuv444p12le

I do believe my encoder still has some bugs, so if you face any, do not hesitate to message me

Prores progress updates

As I got some questions about progress I decided to post some updates and clarifications:

  1. I managed to improve performance, so the encoder is now a bit faster than the Apple implementation with identical output (and I still see room for improvement)
  2. I fixed some minor issues and fully implemented the XQ profile
  3. About 12-bit support: there was a thread on the FFmpeg dev list where some core developers claimed that 12-bit ProRes is a myth. So you know: the Apple encoder encodes all data as 12 bit; even if you pass 8-bit uyvy, it is first converted to 12 bits and encoded after that
  4. Based on statement (3) I can expose a big mistake I made in Cinedeck ProRes Insert-Edit. Basically, Cinedeck checks some stream parameters to decide whether the input stream should be re-encoded before the insert. One of them is the source pixel format: if input and output have different source pixel formats, the video gets re-encoded before the insert. I can now say this is wrong behavior, since on the encoder side it is always 12 bit, and it is chroma subsampling rather than source pixel format that should have been checked
  5. There is one more utility I am working on: a ProRes smart transcoder.
    Let's say we need to transcode ProRes HQ to ProRes Proxy. This is how any transcoding app will do it:
    1) Decode (vlc decode -> dequantize -> inverse dct -> assemble slices into the frame buffer)
    2) Encode (disassemble the frame into slices -> forward dct -> rate control -> quantize -> vlc encode)

    At first glance it looks OK, but from my point of view it should be:
    1) vlc decode -> dequantize -> rate control -> quantize -> vlc encode
    So basically I got rid of some heavy but useless steps, which makes the transcode almost 2x faster compared to the classical way.
    Obviously, it works only if you transcode from ProRes to ProRes
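
The "dequantize -> quantize" core of the smart transcoder idea can be sketched for a single 8x8 block as below. This is not the author's implementation: requantize_block is a hypothetical name, the quantizer is assumed to be a plain linear one (level * matrix entry * scale) with no special DC handling, and rate control is reduced to "someone already picked qscale_out".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical coefficient-domain requantization of one 8x8 block.
 * Assumes a simple linear quantizer and a positive output step size;
 * no inverse or forward DCT is involved. */
static void requantize_block(const int16_t in[64], int16_t out[64],
                             const uint8_t qmat_in[64], int qscale_in,
                             const uint8_t qmat_out[64], int qscale_out)
{
    for (int i = 0; i < 64; i++) {
        int coef = in[i] * qmat_in[i] * qscale_in; /* dequantize */
        int step = qmat_out[i] * qscale_out;       /* new quantizer step */
        /* quantize, rounding to nearest symmetrically around zero */
        int mag  = (coef < 0 ? -coef : coef) + step / 2;
        int lvl  = mag / step;
        out[i] = (int16_t)(coef < 0 ? -lvl : lvl);
    }
}
```

Requantizing with the same matrix and scale returns the input levels unchanged, which is a handy sanity check for any real implementation of this step.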

I'm still quite far from being able to show a demo (apart from some command-line applications), but here is the priority list:

  1. Make a user-friendly UI so it is easy to show and use
  2. Finish the MOV parser/muxer (as I'm not a fan of using FFmpeg for the demo)

MXF OP1b + FFmpeg Part1

Some time ago I was asked to fix a strange FFmpeg bug. In the customer's words, they had OP1b files from a Panasonic camera whose audio FFmpeg didn't read correctly (the first couple of seconds were looped).

My first thought was “easy money”, so I signed up.
The problem was trivial. OP1b allows more than one essence container (SMPTE 391M), and each essence is stored in its own essence container. As a result, a track number is unique only within its essence container, and all the audio tracks had the same track number. So when FFmpeg assigned a newly read essence packet to a stream, it checked only the track number and always assigned the packet to the first audio track:

static int mxf_get_stream_index(AVFormatContext *s, KLVPacket *klv)
{
    int i;
    for (i = 0; i < s->nb_streams; i++) {
        MXFTrack *track = s->streams[i]->priv_data;
        /* SMPTE 379M 7.3 */
        if (track && !memcmp(klv->key + sizeof(mxf_essence_element_key), track->track_number, sizeof(track->track_number))) {
            return i;
        }
    }
    /* return 0 if only one stream, for OP Atom files with 0 as track number */
    return s->nb_streams == 1 ? 0 : -1;
}

As the file had 4 audio tracks and all of them carried the same audio, the packets of all 4 tracks ended up in the first stream, which produced the loop effect.

To resolve this we need to:

1) In MXFContentStorage, read the field which contains the list of all EssenceContainerData (0x1902)

static int mxf_read_content_storage(void *arg, AVIOContext *pb, int tag, int size, UID uid, int64_t klv_offset)
{
    MXFContext *mxf = arg;
    switch (tag) {
    case 0x1901:
        if (mxf->packages_refs)
            av_log(mxf->fc, AV_LOG_VERBOSE, "Multiple packages_refs\n");
        av_free(mxf->packages_refs);
        return mxf_read_strong_ref_array(pb, &mxf->packages_refs, &mxf->packages_count);
    case 0x1902:
        av_free(mxf->essence_container_data_refs);
        return mxf_read_strong_ref_array(pb, &mxf->essence_container_data_refs, &mxf->essence_container_data_count);
    }
    return 0;
}


2) Read each EssenceContainerData (FFmpeg didn't read it at all).

3) Each EssenceContainerData has a reference to a SourcePackage and also holds the index SID and body SID

typedef struct MXFEssenceContainerData {
    UID uid;
    enum MXFMetadataSetType type;
    UID package_uid;
    UID package_ul;
    int index_sid;
    int body_sid;
} MXFEssenceContainerData;

static int mxf_read_essence_container_data(void *arg, AVIOContext *pb, int tag, int size, UID uid, int64_t klv_offset)
{
    MXFEssenceContainerData * essence_data = arg;
    switch(tag) {
        case 0x2701:
            /* linked package umid UMID */
            avio_read(pb, essence_data->package_ul, 16);
            avio_read(pb, essence_data->package_uid, 16);
            break;
        case 0x3f06:
            essence_data->index_sid = avio_rb32(pb);
            break;
        case 0x3f07:
            essence_data->body_sid = avio_rb32(pb);
            break;
    }
    return 0;
}

static const MXFMetadataReadTableEntry mxf_metadata_read_table[] = {
// other entries removed to keep the post short
    { { 0x06,0x0e,0x2b,0x34,0x02,0x53,0x01,0x01,0x0d,0x01,0x01,0x01,0x01,0x01,0x23,0x00 }, mxf_read_essence_container_data, sizeof(MXFEssenceContainerData), EssenceContainerData },
    { { 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00 }, NULL, 0, AnyType },
};


4) Go through all tracks of each SourcePackage and assign to each track the index SID and body SID found in the corresponding EssenceContainerData

for (k = 0; k < mxf->essence_container_data_count; k++) {
    if (!(essence_data = mxf_resolve_strong_ref(mxf, &mxf->essence_container_data_refs[k], EssenceContainerData))) {
        av_log(mxf, AV_LOG_TRACE, "could not resolve essence container data strong ref\n");
        continue;
    }

    if (memcmp(component->source_package_ul, essence_data->package_ul, sizeof(UID)) || memcmp(component->source_package_uid, essence_data->package_uid, sizeof(UID))) {
        continue;
    }

    source_track->body_sid = essence_data->body_sid;
    source_track->index_sid = essence_data->index_sid;
}


5) When we read the next KLV triplet, we look for the partition this triplet belongs to; each partition has a body SID

static int find_body_sid_by_offset(MXFContext *mxf, int64_t offset) {
    // we basically look for the partition in which the current KLV triplet is placed

    int i;
    MXFPartition * prev = NULL;

    for (i = 0; i < mxf->partitions_count; ++i) {
        MXFPartition * partition = &mxf->partitions[i];

        if (partition->body_sid) {
            if (partition->this_partition < offset) {
                prev = partition;
            }
            else {
                break;
            }
        }
    }

    if (prev) {
        return prev->body_sid;
    }

    return 0;
}


6) When we decide which track to assign this triplet to, we compare both the track number and the body SID (before, only the track number was compared)

static int mxf_get_stream_index(AVFormatContext *s, KLVPacket *klv, int body_sid)
{
    int i;
    for (i = 0; i < s->nb_streams; i++) {
        MXFTrack *track = s->streams[i]->priv_data;
        /* SMPTE 379M 7.3 */
        // a zero body_sid or track->body_sid matches anything, to stay compatible with the old behavior where no body_sid is assigned to the track
        if (track && (body_sid == 0 || track->body_sid == 0 || track->body_sid == body_sid) && !memcmp(klv->key + sizeof(mxf_essence_element_key), track->track_number, sizeof(track->track_number))) {
            return i;
        }
    }
    /* return 0 if only one stream, for OP Atom files with 0 as track number */
    return s->nb_streams == 1 ? 0 : -1;
}

Those changes fixed audio read 🙂
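
To see the effect of steps 5 and 6 in isolation, here is a self-contained toy model. ToyPartition, ToyTrack and both functions are simplified stand-ins of my own, not FFmpeg structures, and the partitions are assumed to be sorted by offset. Two tracks share one track number but live in different essence containers, and the body SID is what tells them apart.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-ins for the demuxer structures (hypothetical, not FFmpeg's). */
typedef struct { int64_t offset; int body_sid; } ToyPartition;
typedef struct { uint8_t track_number[4]; int body_sid; } ToyTrack;

/* Body SID of the last essence partition that starts before `offset`. */
static int toy_body_sid_at(const ToyPartition *p, int count, int64_t offset)
{
    int sid = 0;
    for (int i = 0; i < count; i++) {
        if (!p[i].body_sid)
            continue;           /* partition carries no essence */
        if (p[i].offset >= offset)
            break;
        sid = p[i].body_sid;
    }
    return sid;
}

/* Match on track number AND body SID; a zero SID on either side matches anything. */
static int toy_stream_index(const ToyTrack *t, int count,
                            const uint8_t track_number[4], int body_sid)
{
    for (int i = 0; i < count; i++) {
        int sid_ok = body_sid == 0 || t[i].body_sid == 0 || t[i].body_sid == body_sid;
        if (sid_ok && !memcmp(track_number, t[i].track_number, 4))
            return i;
    }
    /* same fallback as the original: a single stream always matches */
    return count == 1 ? 0 : -1;
}
```

With a body SID of 0 (the old behavior) both tracks match and the first one always wins, which is exactly the bug described above.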

Unfortunately that wasn't the end of it for me… seek issues showed up and the audio packets were too big (the audio was custom wrapped, i.e. clip wrapped but split into 2-second chunks, each chunk in a new partition), but this is a topic for the next post

P.S. The changes described in this post can be found at this link:

https://github.com/da8eat/FFmpeg/blob/master/libavformat/mxfdec.c
