12-01-2013 06:38 PM
Some time ago I opened this thread and marked it as solved, but this is actually a sequel to the discussion started there.
So let me explain what is going on.
I am making an application for BB10 that has the following feature:
- user records several MP4 video segments.
- app has to merge them into 1 file and prepare for upload to a service.
This is the structure of the MP4 file that the BB10 camera_api creates:
As you can see, there are plenty of atoms. I will go through all of them, explain what I did, and lay out my doubts and my problems. Hopefully someone can help and point me in the right direction.
Starting from top, there are:
These 2 are straightforward: the first is a type declaration with major and minor versions, the second is an 8-byte placeholder in case you need to expand a little. The wide atom can be left out, but for the sake of authenticity I create one in my end-result file too.
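As a minimal sketch of what writing that pair looks like (the helper names and the brand list here are my own illustration, not my actual merger class):

```cpp
#include <cstdint>
#include <vector>

// Append a 32-bit big-endian value to a byte buffer.
static void putBE32(std::vector<uint8_t>& out, uint32_t v) {
    out.push_back(uint8_t(v >> 24));
    out.push_back(uint8_t(v >> 16));
    out.push_back(uint8_t(v >> 8));
    out.push_back(uint8_t(v));
}

// Append a four-character atom/brand code.
static void putFourCC(std::vector<uint8_t>& out, const char* cc) {
    out.insert(out.end(), cc, cc + 4);
}

// Build an 'ftyp' atom (major brand, minor version, compatible brands)
// followed by an 8-byte 'wide' placeholder atom with no payload.
std::vector<uint8_t> buildFtypAndWide() {
    std::vector<uint8_t> out;
    const char* brands[] = { "mp42", "isom" };   // brand list is an assumption
    uint32_t ftypSize = 8 + 4 + 4 + 4 * 2;       // header + major + minor + 2 brands
    putBE32(out, ftypSize);
    putFourCC(out, "ftyp");
    putFourCC(out, "mp42");                      // major brand (assumption)
    putBE32(out, 0);                             // minor version
    for (const char* b : brands) putFourCC(out, b);
    putBE32(out, 8);                             // 'wide': size 8, empty body
    putFourCC(out, "wide");
    return out;
}
```

The real brands would be mirrored from whatever the BB10 camera writes, of course.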
Following next are:
A word in general: if they are ordered mdat - moov, the file is not really suitable for streaming. That is not my case nor my problem, as I download the files, cache them, and play them from the filesystem, so I kept that order too. If you want to optimise for streaming, the first step is to switch to moov - mdat order and then adjust several other atoms as well, but that is a subject for another topic.
In my merger class, I handled mdat like this:
This seems to work because mdat is just a blob of data that needs its moov to be interpreted.
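The mdat merge essentially boils down to concatenating the payloads and recomputing the 4-byte size field. A rough sketch with illustrative names (not my real class):

```cpp
#include <cstdint>
#include <vector>

// One new mdat whose payload is the payloads of all segment mdats
// concatenated in merge order; the 4-byte size field is recomputed.
std::vector<uint8_t> mergeMdat(const std::vector<std::vector<uint8_t>>& payloads) {
    uint64_t total = 8;                          // 4-byte size + 'mdat' tag
    for (const auto& p : payloads) total += p.size();
    std::vector<uint8_t> out;
    // Assumes the merged payload stays under 4 GiB; beyond that, MP4 uses
    // size == 1 plus a 64-bit extended size field right after the tag.
    out.push_back(uint8_t(total >> 24));
    out.push_back(uint8_t(total >> 16));
    out.push_back(uint8_t(total >> 8));
    out.push_back(uint8_t(total));
    out.insert(out.end(), { 'm', 'd', 'a', 't' });
    for (const auto& p : payloads) out.insert(out.end(), p.begin(), p.end());
    return out;
}
```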
As for the moov atom, its structure is a bit more complex.
In its first sublevel you get:
In mvhd everything stays the same except the duration. I recalculate the duration after merging all my segment files and choose the larger of the audio and video track durations, all according to the specification for MP4 files.
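That pick-the-larger step can be sketched like this (a simplified stand-in for the real code; it assumes each track duration is first converted from its own mdhd timescale into the movie timescale before comparing):

```cpp
#include <algorithm>
#include <cstdint>

// mvhd duration = the longest track duration, expressed in the movie
// timescale. Track durations live in each track's own mdhd timescale,
// so convert first, then take the maximum.
uint64_t movieDuration(uint64_t audioDur, uint32_t audioTimescale,
                       uint64_t videoDur, uint32_t videoTimescale,
                       uint32_t movieTimescale) {
    uint64_t a = audioDur * movieTimescale / audioTimescale;
    uint64_t v = videoDur * movieTimescale / videoTimescale;
    return std::max(a, v);
}
```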
Going deeper into the sublevels, we enter the trak atom. Here, among many atoms, the first of interest is mdhd. The so-called media header atom has a few properties, but again only the duration is recalculated and replaced to cover all the newly merged samples. The duration calculation follows the specification: the TimeToSample atom is responsible for it if you don't have an edts atom, which I don't. Audio and video track durations don't have to be equal; that's something I learned the hard way.
I can skip a few atoms like:
smhd (or vmhd in video trak atom)
These haven't been tampered with, just created in the image of what BB10 creates.
Now comes the fun part, the sample table atom - stbl.
The first sublevel of stbl is stsd.
In the case of the audio track, it holds mp4a with its esds.
In the case of the video track, it holds avc1 with avcC inside.
To be completely honest, I mirrored all the properties exactly as the BB10 camera creates them. After examining dozens of esds and avcC atoms, I came to the conclusion they are always the same, since I choose the encoding parameters.
In my merger class, these are just plucked from any file in the array.
This is where all hell breaks loose...
This is the TimeToSample atom. I create the version and flags. The entry count is calculated as the sum of all stts entries from all files to be merged. The entries are then concatenated one after the other in their respective merge order. At this point the above-mentioned duration is calculated and set.
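A sketch of that stts concatenation (hypothetical names, not my actual code; it also coalesces adjacent entries with equal deltas, which keeps the table sparse but isn't strictly required):

```cpp
#include <cstdint>
#include <vector>

struct SttsEntry { uint32_t sampleCount; uint32_t sampleDelta; };

// Concatenate the stts tables of all segments in merge order. The track
// duration (in the media timescale) falls out as sum(count * delta).
std::vector<SttsEntry> mergeStts(const std::vector<std::vector<SttsEntry>>& segments,
                                 uint64_t& durationOut) {
    std::vector<SttsEntry> merged;
    durationOut = 0;
    for (const auto& seg : segments)
        for (const auto& e : seg) {
            durationOut += uint64_t(e.sampleCount) * e.sampleDelta;
            if (!merged.empty() && merged.back().sampleDelta == e.sampleDelta)
                merged.back().sampleCount += e.sampleCount;  // same delta, extend run
            else
                merged.push_back(e);
        }
    return merged;
}
```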
This is the SampleToChunk atom. I take the list of entries, go through all of them, and increase their first_chunk property where needed. The rest of each entry is not touched. This is where I am not sure if it's the correct thing to do. The entry count again is the sum across all files; version and flags are set to 0.
This is the SampleSize atom. The name is self-explanatory. I just concatenate them all together. Version, flags, and entry count are set accordingly. No voodoo here from my merger class.
This is the Chunk Offset atom, and this one is a bit tricky as it involves mdat sizes. It holds the offsets of chunks inside mdat, but relative to the start of the file, not the mdat atom; this is important to know. I go through all of them and recalculate their offsets, using the mdat sizes to shift them further. Again, not sure if it's the correct approach.
In addition to the atoms above this line, the video track has 2 more:
The Sync Sample atom. These are also iterated through, and each subsequent segment's list is shifted by a factor multiplied by 30, since I hardcoded the encoding like that. It marks the keyframes (sync samples) of the track. Since I control both the start and end of its creation, the math here is simple.
The Composition Offset atom is not really needed, but good to have. I didn't include it in my own result file; I did several comparisons with transcoded segments, and one or none of them had it. My problem may easily lie in the fact that I chose to exclude it from the end result.
Lastly, the udta atom is just iTunes metadata. For the sake of argument, I will say BB10 adds a year of creation and the full date as the name of the video. I have bigger plans for this, but that's child's play to manipulate later.
OK, now you know all I know, be that little or enough.
My problem is the following...
VLC and ffmpeg can chew on it, but I see errors like "cannot decode frame" and "no frame found", with negative NAL sizes.
But the goal here is to make it work on BB10 devices, and it has to be compatible with iOS and Android, which both use ffmpeg for this sort of black magic.
My own code for merging MP4 files is a Qt class with more than 1400 lines of code. I didn't include it in this post for the sake of humanity and neatness, but if the code is requested, it will be provided promptly. Snippets too.
If example files are needed, I will provide them too. Both segments, end result and a comparison file created by MP4Joiner.
I know this was a long and boring cry for help, but I have no other choice after more than 3 weeks of bashing at this.
I promise to provide a picture of a cat after solving this, please help.
12-02-2013 10:59 AM
will take a look later this week.
I'm going to suggest, at a high level, that you:
1. merge the multiple MDAT segments into one big mdat
2. move all the atoms except the start-of-file ones to the end (or to the beginning, if you want to support streaming easier)
3. rewrite that 1st atom so it knows the complete file size
4. re-offset all of the STCO atoms
If you are only working with small clips, then your STCOs should all be the 32-bit variants, so I don't think you will run into any overflow problem at >4GB sizes, in which case you would have to go back, rewrite the WIDE as well, and build the 64-bit STCO variants.
Note that, as you point out, mdat precedes moov in our files. This is just an artifact of the way the file is created and is typical for a "recording" use-case. We cannot write all the headers until the file is complete, and cannot estimate how much space we would have to reserve at the start of the file. This is the reason the headers all end up at the tail of the file.
Also, given that you seem pretty savvy in this department so far, you might want to investigate using camera_start_encode(), which just gives you the MDAT chunks and leaves you free to build your own file however you want. Maybe a future optimization. It would also let you avoid the "stops-early" bug that we have in the current release when ending a recording.
12-04-2013 12:22 PM
Hey, so here's the feedback I got from someone a bit more familiar with our mp4 file format...
His general approach seems good.
'ctts' should not be required, as he indicated; it is only required if we were using B-frames, where decoding and presentation order differ, which is not the case for us.
Nothing obvious in his post on what he is doing wrong. Assuming he is still struggling, he could give us an example of a file merge that does not work (the original files prior to merging and the merged file) and I could have a quick look to see if I can find what is not working for him. This is likely faster than looking at his source code.
So maybe you can attach a set of files (before & after) that are not working.
12-05-2013 08:58 AM
Hi Sean and The Mysterious Stranger that is helping out. Thanks both for your time.
Of course I can upload the files. Here is the Dropbox link:
I still need the segment files, so I won't go as low-level as recording only mdats. The chop bug isn't really a big deal for now; I always target times between 6.00 and 7.00 seconds anyway.
I have to admit, when I started the merger class I had no idea how MP4 is structured. I learned a lot from the Apple QuickTime docs, and I've been perfecting it ever since.
Once it works, I plan to hand it over to other devs who need it, for things like Instagram apps. Perhaps even make a standalone app later on.
Anyway, while I'm still poking it with a stick and examining STCO and STTS, keep me in the loop, and if sources are needed, I'll upload those as well.
Thanks once again.
12-05-2013 11:50 AM
we're looking at it now.
can you give us the same thing, but without audio? We're still looking, but it's easier to look at when it's not interleaved data :)
are you using the C API? if so, you should be able to just do:
camera_set_video_property(handle, CAMERA_IMGPROP_AUDIOCODEC, CAMERA_AUDIOCODEC_NONE);
12-05-2013 12:03 PM
Ok, same folder, the NoAudio files.
I do use C API, CAMERA_IMGPROP_AUDIOCODEC is set to CAMERA_AUDIOCODEC_NONE.
Used to be AAC.
12-05-2013 12:49 PM
You can thank the helpful guy who sits next to me
The main issue I found was with the 'stsc' atom. This is the sample-to-chunk table showing how many samples are in each chunk. It is a sparse list, in that it only adds a new entry to the table if a chunk has a different # of samples than the previous chunk. Each entry indicates:
First chunk: the first chunk that has this size (the 1st chunk is 1)
Samples per chunk: how many samples this chunk and subsequent ones have
Sample description ID: index of the 'stsd' entry describing the samples (we only have 1 stsd entry, so always 1)
In the provided example:
Segment 1 had 5 stsc entries (first chunk, samples per chunk): (1, 16), (2, 15), (5, 13), (6, 14), (7, 15) and has 9 chunks in total (# of entries in stco atom)
Segment 2 has 2 stsc entries: (1, 16), (2, 15) and has 3 chunks in total
Your concatenated clip has 7 stsc entries: (1, 16), (2, 15), (5, 13), (6, 14), (7, 15), (8, 16), (9, 15) and has 12 chunks total
The error is the value you set for 'first chunk' in your 6th and 7th entries above. The first segment had 9 chunks, so chunks #7, #8 and #9 all have 15 samples. So your first chunk should be 10 when you start the 2nd segment, not 8.
I attached a fixed concatenated clip which fixes this table to be:
7 stsc entries: (1, 16), (2, 15), (5, 13), (6, 14), (7, 15), (10, 16), (11, 15)
With this change, the clip plays well.
Other small things I noticed:
- stts entry for the 1st frame after a new segment – you may want to insert some time here to properly space the last frame of the previous segment from the first frame of the new one, as simply copying the value leaves a very small delta between those 2 frames that does not match the frame rate. You would need to add the same amount of time to the audio, the video, and the durations. Likely not noticeable by any users, but it would be nice to fix.
- duration in 'mdhd' for the video does not seem to be the sum of the two segments; I'm not sure if you recomputed this for a reason. The first segment had a duration of 222276 and the second 94591. The sum should be 316867, but you have 289245 in your concatenated clip. It likely doesn't matter much, as most players ignore the duration of individual tracks, but I'm wondering if there is a reason for the difference.
I will attach the modified clip.
12-05-2013 01:11 PM
I see what you mean about STSC; I have to adjust the math completely to merge them properly.
However, I transferred the fixed MP4 to a Z10, and the video still doesn't play on the device.
I did try it on the desktop with VLC, and the video plays properly. No more glitches or freezing. I noticed that Windows can't extract a thumbnail, though.
Could there be something else wrong with it? Did you try to play it on a BlackBerry 10 device?