What would MPEG be without Systems?

The most visited articles on this blog Forty years of video coding and counting and Thirty years of audio coding and counting prove that MPEG is known for its audio and video coding standards. But I will not tire of saying that MPEG would not it be what it has become if the Systems aspects had not been part of most of its standards. This is what I intend to talk about in this article.

It is hard to acknowledge, but MPEG was not the first to deal with the problem of putting together digital audio and video for delivery purposes. In the second half of the 1990’s ITU-T dealt with the problem of handling audio-visual services using the basic ISDN access at 2B+D (2×64 kbit/s) or the primary ISDN access, the first digital streams made possible by ITU-T Recommendations.

Figure 1 depicts the solution specified in ITU Recommendation H.221. Let’s assume that we have 2 B channels at 64 kbit/s (Basic Access ISDN). H.221 creates on each B channel a Frame Structure of 80 bytes, i.e. 640 bits repeating itself 100 times per second. Each bit position in an octet can be considered as an 8 kbit/s sub-channel. The 8th bit in each octet represents the 8th sub-channel, called the Service Channel.

Within the Service Channel bits 1-8 are used by the Frame Alignment Signal (FAS) and bits 9-16 are used by the Bit Alignment Signal (BAS). Audio is always carried by the first B channel, e.g. by the first 2 subchannels, and Video and Data by the other subchannels (less the bitrate allocated to FAS and BAS).

Figure 1 – ITU Recommendation H.221

MPEG-1 Systems

The solution depicted in Figure 1 bears the mark of the transmission part of the telecom industry that had never been much friendly to packet communication. That is why MPEG in the late 1990’s had an opportunity to bring some fresh air in this space. Starting from a blank sheet of paper (at that time MPEG still used paper ?) MPEG designed a flexible packet-based multiplexer to convey in a single stream compressed audio and video, and clock information in such a way as to enable audio‑video synchronisation (Figure 2).

Figure 2 – MPEG-1 Systems

The MPEG Systems revolution took time to take effect. Indeed the European EU 95 project used MPEG-1 Audio layer 2, but designed a frame-based multiplexer for the Digital Audio Broadcasting service.

MPEG-2 Systems

In the early 1990’s MPEG started working on another blank sheet of paper. MPEG had the experience of MPEG-1 Systems design but the requirements were significantly different. In MPEG-1, audio and video (possibly many of them in the same stream) had a common time base, but the main users of MPEG-2 wanted a system that could deliver a plurality of TV programs, possibly coming from different sources (i.e. with different time bases) and with possibly a lot of metadata related to the programs, not to mention some key business enabler like conditional access information. Moreover, unlike MPEG-1 where it was safe to assume that the bits issuing from a Compact Disc would travel without errors to a demultiplexer, in MPEG-2 it was mandatory to assume that the transmission channel was anything but error-free.

MPEG-2 Transport Stream (TS) provides efficient mechanisms to multiplex multiple audio-visual data streams into one delivery stream. Audio-visual data streams are packetised into small fixed-size packets and interleaved to form a single stream. Information about the multiplexing structure is interleaved with the data packets so that the receiving entity can efficiently identify a specific stream. Sequence numbers help identify missing packets at the receiving end, and timing information is assigned after multiplexing with the assumption that the multiplexed stream will be delivered and played in sequential order.

MPEG-2 Systems is actually two specifications in one (Figure 3). The Transport Stream (TS) is a fixed-length packet-based transmission system designed to work for digital television distribution on error-prone physical channels, while the Program Stream (PS) is a packet-based multiplexer with many points in common with MPEG-1 Systems. `While TS and PS share significant information, moving from one to the other may not be immediate.

Figure 3 – MPEG-2 Systems

MPEG-4 Systems

MPEG-4 gave MPEG the opportunity to experience an epochal transition in data delivery. When MPEG-2 Systems was designed Asynchronous Transfer Mode (ATM) was high on the agenda of the telecom industry and was considered as the vehicle to transport MPEG-2 TS streams on telecommunication networks. Indeed, the Digital Audio-Visual Council (DAVIC) designed its specifications on that assumption. At that time, however, IP was still unknown to the telecom (at least to the transmission part, broadcast and consumer electronics worlds.

The MPEG-4 Systems work was a completely different story than MPEG-2 Systems. An MPEG4 Mux (M4Mux) was developed along the lines of MPEG-1 and MPEG-2 Systems, but MPEG had to face an unknown world where many transports were surging as possible candidates. MPEG was obviously unable to make choices (today, 25 years later, the choice is clear) and developed the notion of Delivery Multimedia Integration Framework (DMIF), where all communications and data transfers between the data source and the terminal were abstracted through a logical API called the DAI (DMIF Application Interface), independent of the transport type (broadcast, network, storage).

MPEG-4 Systems, however, was about more than interfacing with transport and multiplexing. The MPEG-4 model was a 3D space populated with dynamic audio, video and 3D Graphics objects. Binary Format for Scenes (BIFS) was the technology designed to provide the needed functionality.

Figure 4 shows the 4 MPEG-4 layers: Transport, Synchonisation, Compression and Composition.

Figure 4 – MPEG-4 Systems

MPEG-4 File Format

For almost 10 years – until 1997 – MPEG was a group who made intense use of IT tools (in the form of computer programs that simulated encoding and decoding operation of the standards it was developing) but was not an “IT group”. The proof? Until that time it had not developed a single file format. Today MPEG can claim to have another such attribute (IT group) along with the many others it has.

In those years MP3 files were already being created and exchanged by the millions, but the files did not provide any structure. The MP4 File Format, officially called ISO Base Media File Format (ISO BMFF), filled that gap as it can be used for editing, HTTP streaming and broadcasting.

Let’s have a high level look to understand the sea that separates MP3 files from the MP4 FF. MP4 FF contains tracks for each media type (audio, video etc.), with additional information: a four-character the media type ‘name’ with all parameters needed by the media type decoder. “Track selection data” helps a decoder identify what aspect of a track can be used and to determine which alternatives are available.

Data are stored in a basic structure called box with attributes of length, type (4 printable characters), possibly version and flags. No data can be found outside of a box. Figure 5 shows a possible organisation of an MP4 file


Figure 5 – Boxes in an MP4 File

MP4 FF can store:

  1. Structural and media data information for timed presentations of media data (e.g. audio, video, subtitles);
  2. Un-timed data (e.g. meta-data);
  3. Elementary stream encryption and encryption parameter (CENC);
  4. Media for adaptive streaming (e.g. DASH);
  5. High Efficiency Image Format (HEIF);
  6. Omnidirectional Media Format (OMAF);
  7. Files partially received over lossy links for further processing such as playback or repair (Partial File Format);
  8. Web resources (e.g. HTML, JavaScript, CSS, …).

Save for the first two features, all others were added in the years following 2001 when MP4 FF was approved. The last two are still under development.

MPEG-7 Systems

With MPEG-7, MPEG made the first big departure from media compression and turned its attention to media description including ways to compress that information. In addition to descriptors for visual audio and multimedia information, MPEG-7 includes a Systems layer used by an application, say, navigation of a multimedia information repository, to access coded information coming from a delivery layer in the form of coded descriptors (in XML or in BiM, MPEG’s XML compression technology). The figure illustrates the operation of MPEG-7 Systems decoder.

Figure 6 – MPEG-7 Systems

An MPEG-7 Systems decoder operates in two phases

  1. Initialisation when DecoderInit initialises the decoder by conveying description format information (textual or binary), a list of URIs that identifies schemas, parameters to configure the Fragment Update decoder, and an initial description. The list of URIs is passed to a schema resolver that associates the URIs with schemas to be passed to Fragment Update Decoder.
  2. Main operation, when the Description Stream (composed of Access Units containing fragment updates) is fed to the decoder which processes
    1. Fragment Update Command specifying the update type (i.e., add, replace or delete content or a node, or reset the current description tree);
    2. Fragment Update Context that identifies the data type in a given schema document, and points to the location in the current description tree where the fragment update command applies; and
    3. Fragment Update Payload conveying the coded description fragment to be added toor replaced in the description.


MPEG Multimedia Middleware (M3W), also called MPEG-E, is an 8-part standard defining the protocol stack of consumer-oriented multimedia devices, as depicted in Figure 7.

Figure 7 – MPEG Multimedia Middleware (M3W)

The M3W model includes 3 layers:

  1. Applications non part of the specifications but enabled by the M3W Middleware API;
  2. Middleware consisting of
    1. M3W middleware exposing the M3W Middleware API;
    2. Multimedia platform supporting the M3W Middleware by exposing the M3W Multimedia API;
    3. Support platform providing the means to manage the lifetime of, and interaction with, realisation entities by exposing the M3W Support API (it also enables management of support properties, e.g. resource management, fault management and integrity management);
  3. Computing platform: whose API are outside of M3W scope.


Multimedia service platform technologies (MPEG-M) specifies two main components of a multimedia device, called peer in MPEG-M.

As shown in Figure 8, the first component is API: High-Level for applications and Low Level for network, energy and security.

Figure 8 – High Level and Low Level API

The second components is a middleware called MXM that relies on its multimedia technologies

Figure 9 – The MXM architecture

The Middleware is composed of two types of engine. Technology Engines are used to call functionalities defined by MPEG standards such as creating or interpreting a licence attached to a content item. Protocol Engines are used to communicate with other peer, e.g. in case a peer does not have a particular Technology Engine that another peer has. For instance, a peer can use a Protocol Engine to call a licence server to get a licence to attach to a multimedia content item. The MPEG-M middleware has the ability to create chains of Technology Engines (Orchestration) or Protocol Engines (Aggregation).


MPEG Media Transport (MMT) is part 1 of High efficiency coding and media delivery in heterogeneous environments (MPEG-H). It is the solution for the new world of broadcasting where delivery of content can take place over different channels each with different characteristics, e.g. one-way (traditional broadcasting) and two-way (the ever more pervasive broadband network). MMT assumes that the Internet Protocol is common to all channels.

Figure 10 depicts the MMT protocol stack

Figure 10 – The MMT protocol stack

Figure 11 focuses on the MMT Payload, i.e. on the content structure.

Figure 11 – Structure of MMT Payload

The MMT Payload has an onion-like structure:

  1. . Media Fragment Unit (MFU), the atomic unit which can be independently decoded;
  2. Media Processing Unit (MPU), the atomic unit for storage and consumption of MMT content (structured according to ISO BMFF), containing one or more MFUs;
  3. MMT Asset, the logical unit for elementary streams of multimedia component, e.g. audio, video and data, containing one or more MPU files;
  4. MMT Package, a logical unit of multimedia content such as a broadcasting program, containing one or more MMT Assets, also containing
    1. Composition Information (CI), describing the spatio-temporal relationships among MMT Assets
    2. Delivery Information, describing the network characteristics.


Dynamic adaptive streaming over HTTP (DASH) is another MPEG Systems standard that was motivated by the popularity of HTTP streaming and the existence of different protocols used in different streaming platforms, e.g. different manifest and segment formats. By developing the DASH standard for HTTP streaming of multimedia content, MPEG has enabled a standard-based client to stream content from any standard-based server, thereby enabling interoperability between servers and clients of different make.

Figure 12 – DASH model

As depicted in Figure 12, the multimedia content is stored on an HTTP server in two components: 1) Media Presentation Description (MPD) which describes a manifest of the available content, its various alternatives, their URL addresses and other characteristics, and 2) Segments which contain the actual multimedia bitstreams in form of chunks, in single or multiple files.

A typical operation of the system would follow the steps

  1. DASH client obtains the MPD;
  2. Parses the MPD;
  3. Gets information on several parameters, e.g. program timing, media content availability, media types, resolutions, min/max bandwidths, existence of alternatives of multimedia components, accessibility features and the required protection mechanism, the location of each media component on the network and other content characteristics;
  4. Selects the appropriate encoded alternative and starts streaming the content by fetching the segments using HTTP GET requests;
  5. Fetches the subsequent segments after appropriate buffering to allow for network throughput variations
  6. Monitors the network bandwidth fluctuations;
  7. Decides how to adapt to the available bandwidth depending on its measurements by fetching segments of different alternatives (with lower or higher bitrate) to maintain an adequate buffer.

DASH only defines the MPD and the segment formats. MPD delivery and media encoding formats containing the segments as well as client behavior for fetching, adaptation heuristics and content playing are outside of MPEG-DASH’s scope.


The reader should not think that this is an exhaustive presentation of MPEG’s Systems work. I hope the description will reveal the amount of work that MPEG has invested in Systems aspects, sometimes per se, and sometimes to provide adequate support to users of its media coding standards. This article also describes some of the most successful MPEG standards. At the top certainly towers MPEG-2 Systems of which 9 editions have been produced to keep up with continuous user demands for new functionalities.

Without mentioning the fact that MPEG-2 Systems has received an Emmy Award ?.

Posts in this thread