Moving intelligence around


Artificial intelligence has reached the attention of mass media and technologies supporting it – Neural Networks (NN) – are being deployed in several contexts affecting end users, e.g. in their smart phones.

If a NN is used locally, it is possible to use existing digital representations of NNs (e.g., NNEF, ONNX). However, these formats lack vital features for distributing intelligence, such as compression, scalability and incremental updates.

To appreciate the need for compression, let’s consider the case of adjusting the automatic mode of a camera based on the scene or object recognised by a properly trained NN. As this area is intensely investigated, very soon there will be a new, better trained version of the NN or a new NN with additional features. However, as the process of creating the necessary “intelligence” usually takes time and labor (skilled and unskilled), in most cases the newly created intelligence must be moved from the center to where the user handset is. With today’s NNs reaching a size of several hundred Mbytes and growing, a scenario where millions of users clog the network because they are all downloading the latest NN with great new features looks likely.

This article describes some elements of the MPEG work plan to develop one or more standards that enable compression of neural networks. Those wishing to know more should read the Use cases and Requirements, and the Call for Proposals.

About Neural Networks

A Neural Network is a system composed of connected nodes each of which can

  1. Receive input signals from other nodes,
  2. Process them and
  3. Transmit an output signal to other nodes.
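The three steps above can be illustrated with a minimal sketch of a single node. The weighted sum and the sigmoid activation used here are common choices assumed for illustration, and the function name `node_output` is hypothetical.

```python
import math

def node_output(inputs, weights, bias):
    """Receive input signals, process them, and return the output signal."""
    # Step 1: receive the input signals (one weight per incoming connection).
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Step 2: process them (here with a sigmoid activation, an assumed choice).
    # Step 3: the returned value is the signal passed on to other nodes.
    return 1.0 / (1.0 + math.exp(-total))

# Two nodes in a first layer feed one node in the next layer.
signals = [0.5, -1.0]
hidden = [node_output(signals, [0.8, 0.2], 0.0),
          node_output(signals, [-0.4, 0.9], 0.1)]
output = node_output(hidden, [1.0, -1.0], 0.0)
```

The weights and biases are the numbers that training adjusts, and they make up most of what has to be shipped when a NN is distributed.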

Nodes are typically aggregated into layers, each performing different functions. Typically the “first layers” are rather specific to the type of signal (audio, video, various forms of text information etc.). Nodes can send signals to subsequent layers but, depending on the type of network, also to the preceding layers.

Training is the process of “teaching” a network to do a particular job, e.g. recognising a particular object or a particular word. This is done by presenting to the NN data from which it can “learn”. Inference is the process of presenting to a trained network new data to get a response about what the new data is.
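The difference between training and inference can be sketched with a toy example. The code below assumes a single-node perceptron that learns the logical AND function. The names `train` and `predict`, the learning rate and the number of epochs are all illustrative assumptions, not part of any MPEG specification.

```python
def predict(weights, bias, x):
    """Inference: apply fixed, already trained parameters to an input."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

def train(samples, epochs=20, lr=1.0):
    """Training: adjust the parameters whenever the prediction is wrong."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Labelled data from which the network "learns" the AND function.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(AND_SAMPLES)

# Inference only reads the trained parameters; it never changes them.
predictions = [predict(weights, bias, x) for x, _ in AND_SAMPLES]
```

Only `weights` and `bias` need to reach the device that runs inference, which is why the size of the trained parameters matters for distribution.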

When is NN compression useful?

Compression is useful whenever there is a need to distribute NNs to remotely located devices. Depending on the specific use case, compression should be accompanied by other features. In the following two major use cases will be analysed.

Public surveillance

In 2009 MPEG developed the Surveillance Application Format. This is a standard that specifies the package (file format) containing audio, video and metadata to be transmitted to a surveillance center. Today, however, it is possible to ask the surveillance network to do more intelligent things by distributing intelligence even down to the level of visual and audio sensors.

For these more advanced scenarios MPEG is developing a suite of specifications under the title of Internet of Media Things (IoMT) where Media Things (MThing) are the media “versions” of IoT’s Things. The IoMT standard (ISO/IEC 23093) will reach FDIS level in March 2019.

The IoMT reference model is represented in the figure

IoMT standardises the following interfaces:

1: User commands (setup info.) between a system manager and an MThing

1’: User commands forwarded by an MThing to another MThing, possibly in a modified form (e.g., subset of 1)

2: Sensed data (Raw or processed data in the form of just compressed data or resulting from a semantic extraction) and actuation information

2’: Wrapped interface 2 (e.g. for transmission)

3: MThing characteristics, discovery

IoMT is neutral as to the type of semantic extraction or, more generally, to the nature of the intelligence actually present in the cameras. However, as NNs are demonstrating better and better results for visual pattern recognition, such as object detection, object tracking and action recognition, cameras can be equipped with NNs capable of processing the captured information to achieve a level of understanding and of transmitting that understanding through interface 2.

Therefore, one can imagine that re-trained or brand new NNs can be regularly uploaded to a server that distributes NNs to surveillance cameras. Distribution need not be uniform since different neural networks may be needed at different areas, depending on the tasks that need to be specifically carried out at given areas.

NN compression is a vitally important technology to make the described scenarios real because automatic surveillance systems may use many cameras (e.g. thousands and even millions of units) and because, as the technology to create NNs matures, the time between NN updates will progressively become shorter.

Distribution of NN-based apps to devices

There are many cases where compression is useful to efficiently distribute heavy NN-based apps to a large number of devices, in particular mobile ones. Here 3 cases are considered.

  1. Visual apps. Updating a NN-based camera app in one’s mobile handset will soon become commonplace. Ditto for the many conceivable applications where the smart phone understands some of the objects in the world around it. Both will happen at an accelerated frequency.
  2. Machine translation (speech-to-text, translation, text-to-speech). NN-based translation apps already exist and their number, efficiency, and language support can only increase.
  3. Adaptive streaming. As AI-based methods can improve the QoE, the coded representation of NNs can initially be made available to clients prior to streaming while updates can be made during streaming to enable better adaptation decisions, i.e. better QoE.


The MPEG Call for Proposals identifies a number of requirements that a compressed neural network should satisfy. Even though not all applications need the support of all requirements, the NN compression algorithm must eventually be able to support all the identified requirements.

  1. Compression shall have a lossless mode, i.e. the performance of the decompressed NN is exactly the same as that of the uncompressed NN
  2. Compression shall have a lossy mode, i.e. the performance of the decompressed NN can differ from that of the uncompressed NN, in exchange for more compression
  3. Compression shall be scalable, i.e. even if only a subset of the compressed NN is used, there is still a level of performance
  4. Compression shall support incremental updates, i.e. as more data are received the performance of NN improves
  5. Decompression shall be possible with limited resources, i.e. with limited processing performance and memory
  6. Compression shall be error resilient, i.e. if an error occurs during transmission, the file is not lost
  7. Compression shall be robust to interference, i.e. it is possible to detect that the compressed NN has been tampered with
  8. Compression shall be possible even if there is no access to the original training data
  9. Inference shall be possible using compressed NN
  10. Compression shall support incremental updates from multiple providers to improve the performance of a NN
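The difference between the lossless and lossy modes (requirements 1 and 2) can be sketched as follows. This is only an illustration under stated assumptions: it uses general-purpose `zlib` compression and a plain 8-bit uniform quantiser, neither of which is claimed to be the technology MPEG will select, and all function names are hypothetical.

```python
import math
import struct
import zlib

def compress_lossless(weights):
    """Lossless mode: every weight is recovered bit for bit."""
    raw = struct.pack(f"<{len(weights)}d", *weights)
    return zlib.compress(raw)

def decompress_lossless(blob):
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))

def compress_lossy(weights):
    """Lossy mode: quantise each weight to 8 bits before entropy coding."""
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / 255 or 1.0  # fall back to 1.0 if all weights are equal
    codes = bytes(min(255, round((w - lo) / step)) for w in weights)
    return lo, step, zlib.compress(codes)

def decompress_lossy(lo, step, blob):
    return [lo + code * step for code in zlib.decompress(blob)]

# Stand-in "weights" for illustration only; a real NN has millions of them.
weights = [math.sin(0.1 * i) for i in range(1000)]

exact = decompress_lossless(compress_lossless(weights))
lo, step, lossy_blob = compress_lossy(weights)
approx = decompress_lossy(lo, step, lossy_blob)
max_error = max(abs(a - w) for a, w in zip(approx, weights))
```

In this sketch the lossy stream is much smaller than the lossless one, at the price of a reconstruction error of at most half a quantisation step per weight. Whether such an error changes the performance of the NN depends on the network and the task.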


The currently published Call for Proposals is not requesting technologies for all requirements listed above (which are themselves a subset of all identified requirements). It is expected, however, that the responses to the CfP will provide enough technology to produce a base layer standard that will help the industry move its first steps in this exciting field that will shape the way intelligence is added to things near to all of us.


More standards – more successes – more failures


I have seen people ask the question: MPEG makes many very successful standards but many are not widely used. Why do you make so many standards?

I know they ask this question because they dare not ask this other question “Why don’t you make just the good standards?”. They do not do it because they know that the easy answer would be the famous phrase attributed to John Wanamaker: “Half the money I spend on advertising is wasted; the trouble is I don’t know which half”.

In this article I do not want to brush off a serious question with an aphorism. MPEG deciding to develop a standard and a company deciding to develop a product face similar problems.

Therefore, I will first compare the processes to develop a company product and an MPEG standard, highlighting the similarities and the differences. Then I will analyse some successes and failures of MPEG standards. I will also explain how MPEG can turn standards that appear doomed to failure into unexpected successes.

Those looking for the perfect recipe that will lead to only successful standards should look at Mr. Wanamaker’s epigones, in the hope they have found an answer to his question.

Standards as products

A standard is the product that MPEG delivers to its customers. I would first like to show that the process used by a company to develop a product is somewhat aligned with the MPEG process of standard development – with some remarkable differences.

Let’s see how a company could decide to make a new product:

  1. A new product idea is proposed
  2. Product idea is supported by market studies
  3. Technology is available/accessible to make the product
  4. Design resources are available
  5. “Product board” approves the project
  6. Design is developed.

Let us see the corresponding work flow of an MPEG standard (look at How does MPEG actually work to have more details about the process):

  1. An idea is proposed/discussed at a meeting
  2. Idea is clarified in context and objectives
  3. Use cases of the idea are developed
  4. Requirements are derived from use cases
  5. A Call for Evidence (CfE) is issued to check that technologies meeting the requirements exist
  6. A Call for Proposals (CfP) is issued to make the necessary technologies available to the committee
  7. National Bodies (NB) approve the project
  8. The standard is developed.

Let us compare and align the two processes because there are significant differences next to similarities:

| # | Company product steps | MPEG standard steps |
|---|---|---|
| 1 | A new product idea is proposed | Idea is aired/proposed at a meeting |
| 2 | Market studies support product | Context & objectives of idea drafted; Use cases developed |
| 3 | Product requirements are developed | Requirements derived from use cases |
| 4 | Technology is available/accessible | Call for Evidence is issued; Call for Proposals is issued |
| 5 | Design resources are available | MPEG looks for those interested |
| 6 | “Product board” approves the product | NBs approve the project |
| 7 | Design is developed | Test Model developed; Core Experiments carried out; Working Drafts produced; Standard improved in NB balloting |
| 8 | The design is approved | NBs approve the standard |

Comparing products and standards

With reference to the table above the following comparison can be made

  1. Product proposal: process is hard to compare. Any company has its own processes. In MPEG, proposals can come to the fore spontaneously from any member.
  2. Proposal justification: process is hard to compare. Any company has its own specific means to assess the viability of a proposed new product. In MPEG, when enough support exists, first the context in which the idea would be applied and for what purposes is documented. Then MPEG develops use cases to prove that a standard implementing the idea would support the use cases better than it is possible today or make possible use cases that today are not. As an entity, MPEG does not make “market studies” (because it does not have the means). It relies instead on members bringing relevant information into the committee when “context and objectives” and “use cases” are developed.
  3. Requirements definition: happens under different names and processes in companies and in MPEG.
  4. Technology availability is quite different. A company often owns a technology as a result of some R&D effort. If it does not have a technology for a product, it either develops it or acquires it. MPEG, too, does “own” a body of technologies, but typically a new proposal requires new technology. While MPEG members may know that a technology is actually available, they may not be allowed to talk about it. Therefore, MPEG needs in general two steps: 1) to become aware of technology (via CfE) and 2) to have the technology available (via CfP). In some cases, like in Systems standards, MPEG members may develop the technology collaboratively from a clean sheet of paper.
  5. Design resource availability is very different in the two environments. If a company sees a product opportunity, it has the means to deploy the appropriate resources (well, that also depends on internal product advocates’ influence). If MPEG sees a standard opportunity, it has no means to “command” members to do something because members report to their companies, not to MPEG. It would be great if some MPEG members who insist on MPEG pursuing certain opportunities without offering resources to achieve them understood this.
  6. Product approval: is very different in the two environments. Companies have their own internal processes to approve products. In MPEG the project for a new standard is approved by the shareholders, i.e. by the NBs, the simple majority of which must approve the project and a minimum of five NBs must commit resources to execute it.
  7. Design development: is very different in the two environments. Companies have their own internal processes to design a new product. In MPEG work obviously stops at the design phase but it entails the following steps: 1) Test Model creation, 2) Core Experiments execution, 3) Working Drafts development and 4) Standard improvement through NB balloting.
  8. Design approval: is very different in the two environments. Companies have their own internal processes to approve the design of a new product. In MPEG, again, the shareholders, i.e. the NBs, approve the standard with a qualified majority.
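The project approval rule in item 6 can be written down as a small check. This is a sketch under an assumption: the text says only “simple majority” of the NBs, and the code takes that to mean more than half of the votes cast. The function and parameter names are hypothetical.

```python
def project_approved(yes_votes, votes_cast, nbs_committing_resources):
    """Return True if a new project passes both conditions in item 6."""
    # Assumption: simple majority means strictly more than half of votes cast.
    simple_majority = yes_votes * 2 > votes_cast
    # At least five NBs must commit resources to execute the project.
    enough_resources = nbs_committing_resources >= 5
    return simple_majority and enough_resources

approved = project_approved(yes_votes=11, votes_cast=20, nbs_committing_resources=5)
tied_vote = project_approved(yes_votes=10, votes_cast=20, nbs_committing_resources=7)
too_few_nbs = project_approved(yes_votes=15, votes_cast=20, nbs_committing_resources=4)
```

The third case shows that broad approval is not enough: a well supported project still fails without enough NBs willing to provide resources.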

What is certainly common to the two processes is that the market response to the company product or to the MPEG standard is anybody’s guess. Some products/standards are widely successful, some fare so-so, and some are simply rejected by the market. Companies have resources that allow them to put in place other strategies to reduce the number of failures, but the reality is that even companies that are darlings of the market stumble from time to time.

MPEG is no exception.

How to cope with uncertainty

MPEG, being an organisation whose basis of operation is consensus, has advantages and disadvantages compared to a company. Let us see now how MPEG has managed the uncertainty surrounding its standards.

An interesting case is MPEG-1. The project was driven by the idea of video interactivity on CD and digital audio broadcasting. MPEG-1 did not have commercial success for either target. However, Video CD, not even on the radar when MPEG-1 was started, used MPEG-1 and sold 1 billion units (and tens of billions of CDs). MP3, too, was not on the radar when MPEG-1 was approved and some members even argued against the inclusion of such a “complex” technology into the standard. I doubt there is anybody now regretting the decision to make MP3 part of the MPEG-1 standard. If there is, it is for completely different reasons. The reason why the standard was eventually successful is that MPEG-1 was designed as a system (VCD is exactly that), but its parts were designed to be usable as stand-alone components (as in MP3).

The second case is MPEG-2. The project was driven by the idea of making television digital. When the first 3 MPEG-2 parts (Systems-Video-Audio) were consolidated, the possibility to use MPEG-2 for interactive video services on the telecom and cable networks became real. MPEG-2 Audio did not fare well in broadcasting (the demand for multichannel was also not there), but it did fare well in other domains. In any case many thought that MPEG-1 Audio delivered just enough. MPEG-2 AAC did fare well in broadcasting and laid the ground for the 20-year long MPEG-4 Audio ride. MPEG started the Digital Storage Media Command and Control (DSM-CC) standard (part 6 of MPEG-2). The DSM-CC carousel is used in broadcasting because it provides the means for a set top box to access various types of information that a broadcaster sends/updates at regular intervals.

MPEG-4 is rich in relevant examples. The MPEG-4 model was a 3D scene populated by “objects” that could be 1) static or dynamic, 2) natural or synthetic, 3) audio or visual in any combination. BIFS (the MPEG name for the 3D scene technology, an extension of VRML) did not fly (but VRML did not fly either). However, 10 years later the Korea-originated Digital Multimedia Broadcasting technology, which used BIFS scaled down to 2D, had a significant success in radio broadcasting.

Much of the MPEG-4 video work was driven by the idea of video “objects” which, along with BIFS, did not fly (the standard specified video objects but did not say how to make them, because that was an encoder issue). For a few years, MPEG-4 video was used in various environments. Unfortunately the main use – video streaming – was stopped by the “content fees” clause of the licensing terms. Part 10 of MPEG-4 Advanced Video Coding (AVC) was very successful, especially because patent holders did not repeat some of the mistakes they had made for MPEG-4 Visual. None of the 3 “royalty free” (Option 1 in ISO language) MPEG-4 video coding standards did fly, showing that in ISO today it is not practically possible to make a media-related standard that does not require onerous licensing of third party technology.

The MPEG-4 Parametric coding for high-quality audio did not fly, but a particular tool in it – Parametric Stereo (PS) – could very efficiently encode stereo music as a mono signal plus a small amount of side-information. MPEG combined the PS tool with HE-AAC and produced HE-AAC v2, an audio decoder that is on board of billions of mobile handsets today as it enables transmission of a stereo signal at 32 kb/s with very good audio quality.

For most MPEG standards, the reference model is the figure below

Different groups with different competences develop the different parts of a standard. Some parts are designed to work together with others in systems identified in the Context-Objectives-Use cases phase. However, the parts are not tightly bound because in general it is possible to use them separately.

The MPEG-7 project was driven by the idea of a world rich in audio-video-multimedia descriptors that would allow users to navigate the large amount of media content expected at that time and that we have today. Content descriptors were expressed in verbose XML, a tool at odds with the MPEG bit-thrifty approach. So MPEG developed the first standard for XML compression, a technology adopted in many fields.

Within MPEG-A, the Common Media Application Format (CMAF) standard is remarkable. Several technologies drawn from different MPEG standards are integrated to efficiently deliver large scale, possibly protected, video applications, e.g. streaming of televised events. CMAF Segments can be delivered once to edge servers in content delivery networks, then accessed from cache by thousands of streaming video players without additional network backbone traffic or transmission delay.

MPEG-V – Media context and control is another typical example. The work was initiated in the wake of the success of Second Life, a service that looked like it could take over the world. The purpose of part 4 of MPEG-V Virtual world object characteristics was not to standardise a Second Life like service but the interfaces that would allow a user to move assets from one virtual space to another virtual space. The number of Second Life users dived and part 4 never took off. Other parts of MPEG-V concern formats and interfaces to enrich the audio-visual user experience with, say, a breeze when there is a little wind in the movie, a smell when you are in a field of violets etc. So far, this apparently interesting extension of the user experience did not fly, but MPEG-V provides a very solid communication framework for sensors and actuators that finds use in other standards.

The MPEG-H MPEG Media Transport (MMT) project showed how it is possible to innovate without destabilising existing markets. MPEG-2 Transport Stream (TS) has been in use for 25 years (and MPEG has received an Emmy for that) and will continue to be used for the foreseeable future. But MPEG-2 TS shows the signs of time because it has been designed for a one-way channel – an obvious choice 25 years ago – while so much video distribution today happens on two-way channels. MMT uses IP transport instead of MPEG-2 TS transport and achieves content delivery unification in both one-way and two-way distribution channels.

Is MPEG in the research business?

The simple and flat answer is NO. However, MPEG CfPs are great promoters of corporate research because they push companies to improve their technologies to enable them to make successful proposals in response to CfPs.

One of the reasons for MPEG’s success, but also for the difficulties highlighted in this article, is that, in the MPEG domain, standardisation is a process closer to research than to product design.

Roughly speaking, in the MPEG standardisation process, research happens in two phases: in the companies, in preparation for CfEs or CfPs (MPEG calls this competitive phase) and in what MPEG calls collaborative phase, i.e. during the development of Core Experiments (of course this research phase is still done by the companies, but in the coordinating framework of an MPEG standard).

The power of the MPEG competitive phase lies in the fact that MPEG receives many submissions from respondents to a CfP and pools together the component technologies. Therefore, the MPEG “product” has a much better performance than any autarchic product developed by an independent company because it uses many good technologies from many more companies than a single company could draw on.

Actually, the improvement is even greater because the MPEG collaborative phase offers another opportunity to do more research. This has a much more limited scope because it is in the context of optimising a subset of the entire scope of the standard, but the sum of many small optimisations can provide big gains in performance. The shortcoming of this process is the possible introduction of a large number of IP items for a gain that some may well consider not to justify the added IP onus and complexity.

With its MPEG-5 project MPEG is trying to see if a suitably placed lower limit to performance improvements can help solve the problems identified in the HEVC standard.


MPEG has a large number of successful standards. For many of them the unit of measure is billions of units, be they hardware, software or firmware.

MPEG has also had failures. The reasons for these can be manifold. One that is often quoted by outsiders is “the standard was too technology driven”. Of course technology plays an important part in MPEG standards. But then, what should MPEG do? Stop standards that are too technology driven? And how much is too much?

Excluding technology would be a mistake, as I will show in two examples. If MPEG had done that, we would not have MP3. In 1992 Layer 3 was a costly appendix to Layers 1 and 2, which already did a good job. If we had done that, we would not have Point Cloud Compression (now at CD level), a standard that industry is eager for today. Sure, MPEG should establish firmer contacts with market needs, but the necessary expertise can only be provided by companies sending experts to MPEG.

MPEG needs more market, but do not expect that more market will necessarily have miraculous effects. The basic logic that has guided MPEG when making a decision on a standard has been “if there is a legitimate request to have a standard (within the constraints of 50% + 1 approval and 5 countries willing to provide experts to do the work), we do it”. More market information can certainly be useful to articulate a complete proposal and add more evidence at the time shareholders (NBs) vote.

MPEG’s value is in its capability to produce standards that anticipate the needs of the market in a process that may take years from the time an idea is launched to the time the standard is produced. People in our age are volatile and so are the markets. In comparison technology is stable.


Thirty years of audio coding and counting


Obviously, the electrical representation of sound information happened before the electrical representation of visual information and so did the services that used that representation to distribute sound information. The digital representation of audio, too, happened at a different time than video’s. In the early 1980s the Compact Disc (CD) allowed record companies to distribute digital audio for the consumer market, while the D1 digital tape, available in the late 1980s, was for the exclusive use of professional applications such as in the studio. Compression technologies reversed the order: compressed digital video happened before compressed digital audio by some 10 years. Therefore, unlike the title of the article Forty years of video coding and counting, the title of this post is Thirty years of audio coding and counting.

This statement can become a source of dispute if a proper definition of Audio is not adopted. In this article by Audio we mean sound in the human audible range that is not generated by a human phonatory system, or that comes from any other sound source for which a sound production model is not available or not used. Indeed digital speech happened in professional applications (trunk network) some 20 years before the CD. ITU-T G.721 “32 kbit/s adaptive differential pulse code modulation (ADPCM)” dates back to 1984, the same year H.120 was approved as a recommendation.

Therefore the title of this article could very well have been Forty years of audio coding and counting. This would have come at the cost of a large number of speech compression standards and this article would have been overwhelmed by them. Therefore this article will only deal with audio compression standards where audio does not include speech, with one exception that will be mentioned later.

Unlike video compression, where ITU-T is the non-MPEG body that develops video coding standards, in audio compression MPEG dominance is total. Indeed ITU-R, which does need audio compression for its digital audio broadcasting standards, prefers to rely on external sources, including MPEG.

MPEG-1 Audio

Those interested in knowing why and how a group – MPEG – working in video compression ended up also working on audio compression (and a few more other things) can look here. The kick off of the MPEG Audio group took place on 1-2 December 1988, when, in line with a tradition that at that time had not been fully established yet, a most diverse group of audio coding experts met in Hannover and kick-started the work that eventually gave rise to the MPEG-1 Audio standard released by MPEG in November 1992.

The Audio group in MPEG is very often the forerunner of things to come. In this instance the first is that while the broadcasting world shunned the low resolution MPEG-1 Video compression standard, it very much valued the MPEG-1 Audio compression standard. The second is that, unlike video, which relied on essentially the same coding architecture, the Audio Call for Proposals had yielded two classes of algorithms, one that was well established and easy to implement but less performing, and the other that was more recent and harder to implement (at that time) but more performing. The work to merge the two technologies was painstaking but eventually the standard included 3 layers (a notion later called profiles) where both technologies were used.

Layer 1 was used in Digital Compact Cassette (DCC), a product discontinued a few years later. Layer 2 was used in audio broadcasting and as the audio component of Video CD (VCD). Layer 3 (MP3) does not need a particular introduction 😉. As revised in the subsequent MPEG-2 effort, MP3 provided a user experience with no perceivable difference as compared to the original CD signal for most content at 128 kbit/s from a CD source of 1.41 Mbit/s, i.e. with a compression of 11:1.
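The compression ratio quoted above can be checked with a few lines of arithmetic. The sketch assumes the standard CD audio parameters (44.1 kHz sampling, 16 bits per sample, 2 channels), which give a source bitrate of about 1.41 Mbit/s. The function names are illustrative.

```python
def cd_bitrate_bps(sample_rate_hz=44_100, bits_per_sample=16, channels=2):
    """Bitrate of uncompressed CD audio in bit/s."""
    return sample_rate_hz * bits_per_sample * channels

def compression_ratio(coded_bitrate_bps):
    """Ratio between the CD source bitrate and the coded bitrate."""
    return cd_bitrate_bps() / coded_bitrate_bps

source = cd_bitrate_bps()                 # 1_411_200 bit/s
mp3_ratio = compression_ratio(128_000)    # about 11:1
```

The same function gives the ratio for any other coded bitrate, so it can be reused to compare codecs operating at lower rates.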

MPEG-2 Audio

The main goal of this standard, approved in 1994, was multi-channel audio with the key requirement that an MPEG-1 Audio decoder should be able to decode a stereo component of an MPEG-2 Audio bitstream. Backward compatibility is particularly useful in the broadcasting world because an operator can upgrade to a multi-channel service without losing the customers who only have an MPEG-1 Audio decoder.


Work on MPEG-2 Advanced Audio Coding (AAC) was motivated by the request of those who wished to provide the best possible audio quality without backward compatibility constraints. Those constraints meant that a layer 2 decoder had to decode both layers 1 and 2, and a layer 3 decoder had to decode all layers. MPEG-2 AAC, released in April 1997, is built upon the MP3 technology and can provide perceptually transparent audio quality at 128 kbit/s for a stereo signal, and 320 kbit/s for a 5.1 channel signal (i.e. as in digital television).


In 1998 MPEG-4 Audio was released with the other 2 MPEG-4 components – Systems and Visual. Again MPEG-4 AAC is built on MPEG-2 AAC. The dominating role of MP3 in music distribution was shaken in 2003 when Apple announced that its iTunes and iPod products would use MPEG-4 AAC as primary audio compression algorithm. Most PCs, smart phones and later tablets could play AAC songs. Far from using AAC as a pure player technology, Apple started the iTunes service that provides songs in AAC format packaged in the MPEG-4 File Format, with filename extension “.m4a”.


In 1999 MPEG released MPEG-4 amendment 1 with a low delay version of AAC, called Low Delay AAC (AAC-LD). While a typical AAC encoder/decoder has a one-way latency of ~55 ms (transform delay plus look-ahead processing), AAC-LD achieves a one-way latency of only 21 ms by simplifying and replacing some AAC tools (new transform with lower latency and removal of look-ahead processing). AAC-LD can be used as a conversational codec with the signal bandwidth and perceived quality of a music coder, delivering excellent audio quality at 64 kb/s for a mono signal.


In 2003 MPEG released the MPEG-4 High Efficiency Advanced Audio Coding (HE-AAC), as amendment 1 to MPEG-4.  HE-AAC helped to consolidate the role of the mobile handset as the tool of choice to access very good audio quality stereo music at 48 kbit/s, more than a factor of 2.5 better than AAC, for a compression ratio of almost 30:1 relative to the CD signal.

HE-AAC adds the spectral bandwidth replication (SBR) tool to the core AAC compression engine. Since AAC was already widely deployed, this permitted extending this base to HE-AAC by only adding the SBR tool to existing AAC implementations.


In the same 2003, 9 months later, MPEG released the MPEG HE-AAC v2 profile. This originated from a tool contained in amendment 2 to MPEG-4 (Parametric coding for high-quality audio). While the core parametric coder did not enjoy wide adoption, the Parametric Stereo (PS) tool in the amendment could very efficiently encode stereo music as a mono signal plus a small amount of side-information. HE-AAC v2, the combination of the PS tool with HE-AAC, enabled transmission of a stereo signal at 32 kb/s with very good audio quality.
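The idea of “a mono signal plus a small amount of side-information” can be illustrated with a deliberately simplified toy. This is not the MPEG Parametric Stereo algorithm, which is considerably more elaborate. The sketch assumes that the only side-information is one pair of channel gains per frame, and all names and the frame length are hypothetical.

```python
import math

def ps_encode(left, right, frame=4):
    """Downmix to mono and keep one (left, right) gain pair per frame."""
    mono = [(l + r) / 2 for l, r in zip(left, right)]
    side = []
    for start in range(0, len(mono), frame):
        e_left = sum(x * x for x in left[start:start + frame])
        e_right = sum(x * x for x in right[start:start + frame])
        e_mono = sum(x * x for x in mono[start:start + frame]) or 1.0
        side.append((math.sqrt(e_left / e_mono), math.sqrt(e_right / e_mono)))
    return mono, side

def ps_decode(mono, side, frame=4):
    """Rebuild two channels by scaling the mono signal with the frame gains."""
    left, right = [], []
    for i, m in enumerate(mono):
        gain_left, gain_right = side[i // frame]
        left.append(gain_left * m)
        right.append(gain_right * m)
    return left, right

# Toy input: the right channel is a quieter copy of the left channel.
left_in = [math.sin(0.3 * i) for i in range(8)]
right_in = [0.5 * x for x in left_in]

mono, side = ps_encode(left_in, right_in)
left_out, right_out = ps_decode(mono, side)
```

For this toy input the two channels are recovered from 8 mono samples and 2 gain pairs instead of 16 stereo samples. Real stereo content differs in more than level, which is why an actual tool needs richer parameters than this sketch uses.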

This profile was also adopted by 3GPP under the name Enhanced aacPlus. Adoption by 3GPP paved the way for HE-AAC v2 technology to be incorporated into mobile phones. Today, more than 10 billion mobile devices support streaming and playout of HE-AAC v2 format songs. Since HE-AAC is built on AAC, these phones also support streaming and playout of AAC format songs.


In 2005 MPEG released two algorithms for lossless compression of audio, MPEG Audio LosslesS coding (ALS) and Scalable to LosslesS coding (SLS). Both provide perfect (i.e. lossless) reconstruction of a standard Compact Disk audio signal with a compression ratio approximately 2:1. An important feature of SLS is that it has a variable compression ratio: it can compress a stereo signal to 128 kb/s (11:1 compression ratio) with excellent quality as an AAC codec but it can achieve lossless reconstruction with a compression ratio of 2:1 by increasing the coded bitrate (i.e. by decreasing the compression ratio) in a continuous fashion.
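
The compression ratios quoted in this section follow from the bitrate of a stereo Compact Disc signal (44.1 kHz sampling, 16 bits per sample, 2 channels). The following is a minimal Python sketch of that arithmetic; the function name and the labels are chosen here for illustration only.

```python
# Bitrate of a stereo CD signal: 44.1 kHz x 16 bit x 2 channels.
CD_BITRATE_KBPS = 44_100 * 16 * 2 / 1000  # 1411.2 kbit/s


def compression_ratio(coded_kbps: float) -> float:
    """Return the compression ratio relative to the stereo CD signal."""
    return CD_BITRATE_KBPS / coded_kbps


# Bitrates taken from the text above (kbit/s, stereo).
for label, kbps in [
    ("HE-AAC", 48),
    ("HE-AAC v2", 32),
    ("SLS at AAC-like quality", 128),
]:
    print(f"{label}: {kbps} kbit/s -> {compression_ratio(kbps):.1f}:1")
```

By the same arithmetic, the 2:1 ratio of lossless coding corresponds to roughly 700 kbit/s for a stereo CD signal.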

MPEG Surround

ALS/SLS were the last significant standards in MPEG-4 Audio, which is MPEG’s most long-lived audio standard. First issued in 1999, 20 years later (in 2019) MPEG is issuing its Fifth Edition.

After “closing the MPEG-4 era”, MPEG created the MPEG-D suite of audio compression standards. The first of these was MPEG Surround, issued in 2007. This technology is a generalisation of the PS tool of HE-AAC v2, in the sense that MPEG Surround can operate as a 5-to-2 channel compression tool or as an M-to-N channel compression tool. This “generalised PS” tool is followed by a HE-AAC codec. Therefore MPEG Surround builds on HE-AAC as much as HE-AAC builds on AAC. MPEG Surround provides an efficient bridge between stereo and multi-channel presentations in low-bitrate applications. It has very good compression while maintaining very good audio quality and also low computational complexity. While HE-AAC can transmit stereo at 48 kbit/s, MPEG Surround can transmit 5.1 channel audio within the same 48 kbit/s transmission budget. The complexity is no greater than stereo HE-AAC’s. Hence MPEG Surround is a “drop-in” replacement for stereo services to extend them to 5.1 channel audio!


In 2007 MPEG released Enhanced Low Delay AAC (AAC-ELD) technology. This combines tools from other profiles: SBR and PS from HE-AAC v2 profile and AAC-LD. The new codec provides even greater signal compression with only a modest increase in latency: AAC-ELD provides excellent audio quality at 48 kb/s for a mono signal with a one-way latency of only 32 ms.


In 2010 MPEG released MPEG-D Spatial Audio Object Coding (SAOC), which allows very efficient coding of a multi-channel signal that is a mix of objects (e.g. individual musical instruments). SAOC down-mixes the multi-channel signal, e.g. stereo to mono, codes and transmits the mono signal along with some side-information, and then up-mixes the received and decoded mono signal back to a stereo signal such that the user perceives the instruments to be placed at the correct positions and the resulting stereo signal to be the same as the original. This is done by exploiting the fact that at any instant in time and in any frequency region one of the instruments will tend to dominate the others, so that in this time/frequency region the other signals will be perceived with much less acuity, if at all. SAOC analyses the input signal, divides each channel into time and frequency “tiles” and then decides to what extent each tile dominates. This is coded as side information.

An example SAOC application is teleconferencing, in which a multi-location conference call can be mixed at the conference bridge down to a single channel and transmitted to each conference participant, along with the SAOC side information. At the user’s terminal, the mono channel is up-mixed to stereo (or 3 channels – Left-Center-Right) and presented such that each remote conference participant is at a distinct location in the front sound stage.


Unified Speech and Audio Coding (USAC), released in 2011, combines the tools for speech coding and audio coding into one algorithm. USAC combines the tools from MPEG AAC (exploiting the means of human perception of audio) with the tools from a state-of-the-art speech coder (exploiting the means of human production of speech). Therefore, the encoder has both a perceptual model and a speech excitation/vocal tract model and dynamically selects the music/speech coding tools every 20 ms. In this way USAC achieves a high level of performance for any input signal, be it music, speech or a mix of speech and music.

In the tradition of MPEG standards, USAC extends the range of “good” performance down to as low as 16 kb/s for a stereo signal and provides higher quality as the bitrate is increased. The quality at 128 kbit/s for a stereo signal is slightly better than that of MPEG-4 AAC, so USAC can replace AAC because its performance is equal to or better than AAC’s at all bit rates. USAC can similarly code multichannel audio signals, and can also optimally encode speech content.


MPEG-D Dynamic Range Control (DRC) is a technology that gives the listener the ability to control the audio level. It can be a post-processor for every MPEG audio coding technology and modifies the dynamic range of the decoded signal as it is being played. It can be used to reduce the loudest part of a movie so as not to disturb your neighbours, to make the quiet portions of the audio louder in hostile audio environments (car, bus, room with many people), or to match the dynamics of the audio to that of a smart phone speaker output, which typically has very limited dynamic range. The DRC standard also performs the very important function of normalizing the loudness of the audio output signal, which may be mandated in some regulatory environments. DRC was released in 2015 and extended in 2017 as Amendment 1 Parametric DRC, gain mapping and equalization tools.

3D Audio

MPEG-H 3D Audio, released in 2015, is part of the typical suite of MPEG tools: Systems, Video and Audio. It provides very efficient coding of immersive audio content, typically from 11 to 22 channels of content. The 3D Audio algorithms can actually process any mix of channels, objects and Higher Order Ambisonics (HOA) content, where objects are single-channel audio whose position can be dynamic in time and HOA can encode an entire sound scene as a multi-channel “HOA coefficient” signal.

Since 3D Audio content is immersive, it is conceived as being consumed as a 360-degree “movie” (i.e. video plus audio). The user sits at the center of a sphere (“sweet spot”) and the audio is decoded and presented so that the user perceives it to be coming from somewhere on the surrounding sphere. MPEG-H 3D Audio can also be presented via headphones because not every consumer has an 11 or 22 channel listening space. Moreover MPEG-H 3D Audio supports use of a default or personalised Head Related Transfer Function (HRTF) to allow the listener to perceive the audio content as if it is from sources all around the listener, just as it would be when using loudspeakers. An added feature of 3D Audio playout to headphones is that the audio heard by the listener can remain at the “correct” position when the user turns his or her head. In other words, a sound that is “straight ahead” when the user is looking straight ahead is perceived as coming from the left if the user turns to look right. Hence, MPEG-H 3D Audio is already a nearly complete solution for Video 360 applications.
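
The head-tracking behaviour described above amounts to subtracting the head orientation from the source direction before rendering. The following is a minimal sketch for the horizontal plane only, assuming angles in degrees with positive values to the listener's left. The function name and the sign convention are assumptions made for this example and are not taken from the MPEG-H 3D Audio specification.

```python
def relative_azimuth(source_deg: float, head_yaw_deg: float) -> float:
    """Direction of a source relative to the direction the listener faces.

    Angles are in degrees, positive to the left (assumed convention).
    The result is wrapped to the range [-180, 180).
    """
    return (source_deg - head_yaw_deg + 180.0) % 360.0 - 180.0


# Source straight ahead, listener looking straight ahead.
print(relative_azimuth(0.0, 0.0))    # 0.0
# Listener turns 90 degrees to the right (yaw = -90 in this convention).
print(relative_azimuth(0.0, -90.0))  # 90.0, i.e. the sound comes from the left
```

A real renderer would apply the full yaw, pitch and roll rotation and then select or interpolate the HRTF for the resulting direction; the sketch only illustrates why the source stays fixed in the world while the head moves.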

Immersive Audio

This activity (to be released as a standard sometime in 2021) is part of the emerging MPEG-I Immersive Audio standard. MPEG is still defining the requirements and functionality of this standard, which will support audio in Virtual and Augmented Reality applications. It will be based on MPEG-H 3D Audio, which already supports a 360 degree view of a virtual world from one listener position (“3 degrees of freedom” or 3DoF), in which the listener can move his or her head left, right, up, down or tilted left or right (so-called “yaw, pitch, roll”). The Immersive Audio standard will add three additional degrees of freedom, i.e., permit the user to get up and walk around in the Virtual World. This additional movement is designated “x, y, z,” so that MPEG-I Immersive Audio supports 6 degrees of freedom (6 DoF) which are “yaw, pitch, roll and x, y, z.” It is envisioned that MPEG-I Immersive Audio will use MPEG-H 3D Audio to compress the audio signals, and will specify additional metadata and technology so that the audio signals can be rendered in a fully flexible 6 DoF way.


MPEG is proud of the work done by the Audio group. For 30 years the group has injected generations of audio coding standards into the market. In the best MPEG tradition, the standards are generic, in the sense that they can be used in audio-only or audio+video applications, and often scalable, with a new generation of audio coding standards building on previous ones.

This long ride is represented in the figure that ventures into the next step of the ride.

Today MPEG Audio already provides a realistic 3DoF experience in combination with MPEG Video standards. More will be needed to provide a complete and rewarding 6DoF experience, but MPEG’s ability to draw the necessary multi-domain expertise from its membership promises that the goal will be successfully achieved.


This article would not have been possible without the competent assistance – and memory – of Schuyler Quackenbush, the MPEG Audio Chair.


Is there a logic in MPEG standards?

So far MPEG has developed, is completing or is planning to develop 22 standards for a total of 201 specifications. For those not in MPEG, and even for some active in MPEG, there is a natural question: what is the purpose of all these standards? Assuming that the answer to this question is given, a second one pops up: is there a logic in all these MPEG standards?

Depending on the amount of understanding of the MPEG phenomenon, you can receive different answers ranging from

“There is no logic. MPEG started its first standard with a vision of giving the telco and CE industries a single format. Later it exploited the opportunities that its growing expertise allowed.”


“There is a logic. The driver of MPEG work was to extend its vision to more industries leveraging its assets while covering more functionalities.”

I will leave it to the reader to decide where to place their decision on this continuum of possibilities after reading this article that will only deal with the first 5 standards.


The goal of MPEG-1 was to leverage the manufacturing power of the Consumer Electronics (CE) industry to develop the basic audio and video compression technology for an application that was considered particularly attractive when MPEG was established (1988), namely interactive audio and video on CD-ROM. This was the logic of the telco industry who thought that their future would be “real time audio-visual communication” but did not have a friendly industry to ask to develop the terminal equipment.

The bitrate of 1.5 Mbit/s mentioned in the official title of MPEG-1 Coding of moving pictures and associated audio at up to about 1,5 Mbit/s was an excellent common point for the telecom industry, with their ADSL technology whose first generation targeted that bitrate, and for the CE industry, whose Compact Disc had a throughput of 1.41 Mbit/s (1.2 for the CD-ROM). With that bitrate, compression technology of the late 1980’s could only deal with a rather low, but still acceptable resolution (1/2 the horizontal and 1/2 the vertical resolution obtained by subsampling every other field, so that the input video is progressive). Considering that audio had to be impeccable (that is what humans want), at least 200 kbit/s had to be assigned to audio.

The figure below depicts the model of an MPEG-1 decoder


Figure 1 – Model of the MPEG-1 standard

The structure adopted for MPEG-1 set the pattern for most MPEG standards:

  1. Part 1 – Systems specifies how to combine one or more audio and video data streams with timing information to form a single stream (link)
  2. Part 2 – Video specifies the video coding algorithm applied to so-called SIF video of ¼ the standard definition TV (link)
  3. Part 3 – Audio specifies the audio compression. Audio is stereo and can be compressed with 3 different performance “layers”: layer 1 is for an entry level digital audio, layer 2 for digital broadcasting and layer 3, aka MP3, for digital music. The MPEG-1 Audio layers were the predecessors of MPEG-2 profiles (and of most subsequent MPEG standards) (link)
  4. Part 4 – Compliance testing (link)
  5. Part 5 – Software simulation (link).


MPEG-2 was a more complex beast to deal with. A digitised TV channel can yield 20-24 Mbit/s, depending on the delivery system (terrestrial/satellite broadcasting or cable TV). Digital stereo audio can take 0.2 Mbit/s and standard resolution video 4 Mbit/s (say a little less with more compression). Audio could be multichannel (say, 5.1) and hopefully consume less bitrate, for a total bitrate of a TV program of 4 Mbit/s. Hence the bandwidth taken by an analogue TV program can be used for 5-6 digital TV programs.
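
The “5-6 digital TV programs” figure is simply the channel capacity divided by the bitrate of one program. The following is a minimal sketch of that arithmetic using the numbers quoted above; the function and constant names are hypothetical, and the calculation ignores multiplexing overhead.

```python
PROGRAM_MBPS = 4  # total bitrate of one digital TV program, from the text


def programs_per_channel(channel_mbps: int, program_mbps: int = PROGRAM_MBPS) -> int:
    """Number of whole programs that fit in one digitised TV channel."""
    return channel_mbps // program_mbps


# Channel capacity depends on the delivery system: 20 to 24 Mbit/s.
for capacity in (20, 24):
    print(f"{capacity} Mbit/s channel -> {programs_per_channel(capacity)} programs")
```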

The fact that digital TV programs that are part of a multiplex may come from independent sources, and that digital channels in the real world are subject to errors, forced the design of an entirely different Systems layer for MPEG-2. The fact that users need to access other data sent in a carousel, that in an interactive scenario (with a return channel) there is a need for session management, and that a user may interact with a server forced MPEG to add a new stream for user-to-network and user-to-user protocols.

In conclusion the MPEG-2 model is a natural extension of the MPEG-1 model (superficially the only addition is the DSM-CC line, but the impact is more pervasive).

Figure 2 – Model of the MPEG-2 standard

The official title of MPEG-2 is Generic coding of moving pictures and associated audio information. It was originally intended for coding of standard definition television (MPEG-3 was expected to deal with coding of High Definition Television). As the work progressed, however, it became clear that a single format for both standard and high definition was not only desirable but possible. Therefore the MPEG-3 project never took off.

The standard is not specific to a video resolution (this was already the case for MPEG-1 Video) but rationalises the notion of profiles, i.e. assemblies of coding tools, and of levels, a notion that applies to, say, resolution, bitrate etc. Profiles and levels have subsequently been adopted in most MPEG standardisation areas.

The standard is composed of 10 parts, some of which are

  1. Part 1 – Systems specifies the Systems layer to enable the transport of a multichannel digital TV stream on a variety of delivery media (link)
  2. Part 2 – Video specifies the video coding algorithm. Video is interlaced and may have a wide range of resolutions with support to scalability and multiview in appropriate profiles (link)
  3. Part 3 – Audio specifies a MPEG-1 Audio backward-compatible multichannel audio coding algorithm. This means that an MPEG-1 Audio decoder is capable of extracting and decoding an MPEG-1 Audio bitstream (link)
  4. Part 6 – Extensions for DSM-CC specifies User-to-User and User-to-Network protocols for both broadcasting and interactive applications. For instance DSM-CC can be used to enable such functionalities as carousel or session set up (link)
  5. Part 7 – Advanced Audio Coding (AAC) specifies a non backward compatible multichannel audio coding algorithm. This was done because backward compatibility imposes too big a penalty for some applications, e.g. those that do not need backward compatibility (link). This was the first time MPEG was forced to develop two standards for apparently the same applications.


MPEG-4 had the ambition of bringing interactive 3D spaces to every home. Media objects such as audio, video, 2D graphics were an enticing notion in the mid-1990’s. The WWW had shown that it was possible to implement interactivity inexpensively and the extension to media interactivity looked like it would be the next step. Hence the official title of MPEG-4 Coding of audio-visual objects.

This vision did not become true and one could say that even today it is not entirely clear what is interactivity and what is the interactive media experience a user is seeking, assuming that just one exists.

Is this a signal that MPEG-4 was a failure?

  • Yes, it was a failure, and so what? MPEG operates like a company. Its “audio-visual objects” product looked like a great idea, but the market thought differently.
  • No, it was a success, because 6 years after MPEG-2, MPEG-4 Visual yielded some 30% improvement in terms of compression.
  • Yes, it was a failure because a patent pool dealt a fatal blow with their “content fee” (i.e. “you pay royalties by the amount of time you stream”).
  • No, it was a success because MPEG-4 has 34 parts, the largest number ever achieved by MPEG in a standard, that include some of the most foundational and successful standards such as the AAC audio coding format, the MP4 File Format, the Open Font Format and, of course, the still ubiquitous Advanced Video Coding (AVC) video coding format, whose success was dictated not so much by the 20% more compression that it delivers compared to MPEG-4 Visual (always nice to have) as by the industry-friendly licence released by a patent pool. Most important, the development of most MPEG standards is driven by a vision. Therefore, users have available a packaged solution, but they can also take the pieces that they need.

Figure 3 – Model of the MPEG-4 standard

An overview of the entire MPEG-4 standard is available here. The standard is composed of 34 parts, some of which are

  1. Part 1 – Systems specifies the means to interactively and synchronously represent and deliver audio-visual content composed of various objects (link)
  2. Part 2 – Visual specifies the coded representation of visual information in the form of natural objects (video sequences of rectangular or arbitrarily shaped pictures) and synthetic visual objects (moving 2D meshes, animated 3D face and body models, and texture) (link).
  3. Part 3 – Audio specifies a multi-channel perceptual audio coder with transparent quality compression of Compact Disc music coded at 128 kb/s that made it the standard of choice for many streaming and downloading applications (link)
  4. Part 6 – Delivery Multimedia Integration Framework (DMIF) specifies interfaces to virtualise the network
  5. Part 9 – Reference hardware description specifies the VHDL representation of MPEG-4 Visual (link)
  6. Part 10 – Advanced Video Coding adds another 20% of performance to part 2 (link)
  7. Part 11 – Scene description and application engine provides a time dependent interactive 3D environment building on VRML (link)
  8. Part 12 – ISO base media file format specifies a file format that has been enriched with many functionalities over the years to satisfy the needs of the multiple MPEG client industries (link)
  9. Part 16 – Animation Framework eXtension (AFX) specifies a range of 3D Graphics technologies, including 3D mesh compression (link)
  10. Part 22 – Open Font Format (OFF) is the result of the MPEG effort that took over an industry initiative (OpenType font format specification), brought it under the folds of international standardisation and expanded/maintained it in response to evolving industry needs (link)
  11. Part 29 – Web video coding (WebVC) specifies the Constrained Baseline Profile of AVC in a separate document
  12. Part 30 – Timed text and other visual overlays in ISO base media file format supports applications that need to overlay other media to video (link)
  13. Part 31 – Video coding for browsers (VCB) specifies a video compression format (unpublished)
  14. Part 33 – Internet Video Coding (IVC) specifies a video compression format (link).

Parts 29, 31 and 33 are the results of 3 attempts made by MPEG to develop Option 1 Video Coding standards with a good performance. None of them reached the goal because ISO rules allow a company to make a patent declaration without specifying which patented technology the declaring company alleges to be affected by a standard. The patented technologies could not be removed because MPEG did not have a clue about which were the allegedly infringing technologies.


In the late 1990’s the industry had been captured by the vision of “500 channels” and telcos thought they could offer interactive media services. With MPEG-1 and MPEG-2 then being deployed, and with MPEG-4 under development, MPEG expected that users would have zillions of media items.

MPEG-7 started with the idea of providing a standard that would enable users to find the media content of their interest in a sea of media content. MPEG-7 definitely deviates from the logic of the previous two standards, and the technologies used reflect that, because it provides formats for data (called metadata) extracted from multimedia content to facilitate searching in multimedia items. As shown in the figure, metadata can be classified as Descriptions (metadata extracted from the media items, especially audio and video) and Description Schemes (compositions of descriptions). The figure also shows two additional key MPEG-7 technologies. The first is the Description Definition Language (DDL) used to define new Descriptors and the second is XML Compression. With Descriptions and Description Schemes represented in verbose XML, it is clear that MPEG needed a technology to effectively compress XML.


Figure 4 – Components of the MPEG-7 standard

 An overview of the entire MPEG-7 standard is available here. The official title of MPEG-7 is Multimedia content description interface and the standard is composed of 16 parts, some of which are:

  1. Part 1 – Systems has similar functions as the parts 1 of previous standards. In addition, it specifies a compression method for XML schemas used to represent MPEG-7 Descriptions and Description Schemes.
  2. Part 2 – Description definition language breaks the Systems-Video-Audio traditional sequences of previous standards to provide a language to describe descriptions (link)
  3. Part 3 – Visual specifies a wide variety of visual descriptors such as colour, texture, shape, motion etc. (link)
  4. Part 4 – Audio specifies a wide variety of audio descriptors such as signature, instrument timbre, melody description, spoken content description etc. (link)
  5. Part 5 – Multimedia description schemes specifies description tools that are not visual and audio ones, i.e., generic and multimedia description tools such as description of the content structural aspects (link)
  6. Part 8 – Extraction and use of MPEG-7 descriptions explains how MPEG-7 descriptions can be practically extracted and used
  7. Part 12 – Query format defines a format to query multimedia repositories (link)
  8. Part 13 – Compact descriptors for visual search specifies a format that can be used to search images (link)
  9. Part 15 – Compact descriptors for video analysis specifies a format that can be used to analyse video clips (link).


In the year 1999 MPEG understood that its technologies were having a disruptive impact on the media business. MPEG thought that the industry should not fend off a new threat with old repressive tools. The industry should convert the threat into an opportunity, but there were no standard tools to do that.

MPEG-21 is the standard resulting from the effort by MPEG to create a framework that would facilitate electronic commerce of digital media. It is a suite of specifications for end-to-end multimedia creation, delivery and consumption that can be used to enable open media markets.

This is represented in the figure below. The basic MPEG-21 element is the Digital Item, a structured digital object with a standard representation, identification and metadata, around which a number of specifications were developed. MPEG-21 also includes specifications of Rights and Contracts and basic technologies such as the file format.

Figure 5 – Components of the MPEG-21 standard

An overview of the entire MPEG-21 standard, whose official title is Multimedia Framework, is available here. Some of the 21 MPEG-21 parts are briefly described below:

  1. Part 2 – Digital Item Declaration specifies Digital Item (link)
  2. Part 3 – Digital Item Identification specifies identification methods for Digital Items and their components (link)
  3. Part 4 – Intellectual Property Management and Protection (IPMP) Components specifies how to include management and protection information and protected parts in a Digital Item (link)
  4. Part 5 – Rights Expression Language specifies a language to express rights (link)
  5. Part 6 – Rights Data Dictionary specifies a dictionary of rights-related data (link)
  6. Part 7 – Digital Item Adaptation specifies description tools to enable optimised adaptation of multimedia content (link)
  7. Part 15 – Event Reporting specifies a format to report events (links)
  8. Part 17 – Fragment Identification of MPEG Resources specifies a syntax for URI Fragment Identifiers (link)
  9. Part 19 – Media Value Chain Ontology specifies an ontology for Media Value Chains (link)
  10. Part 20 – Contract Expression Language specifies a language to express digital contracts (link)
  11. Part 21 – Media Contract Ontology specifies an ontology for media-related digital contracts (link).


The standards from MPEG-1 to MPEG-21 contain 86 specifications covering the entire 30 years of MPEG activity. They should give a rough idea of how MPEG started from the vision of single standards for all industries belonging to what we can call today the “media industry” and has kept on adapting – without disowning – its vision. The original vision has been a seed that has grown – and continues to grow – into a tree. MPEG keeps track of the evolution of technologies, to provide more efficient standards, and of the needs of the industry, to which it responds with refurbished old and brand new standards.


Forty years of video coding and counting


For about 150 years, the telephone service has provided a socially important communication means to billions of people. For at least a century the telecom industry wanted to offer a more complete user experience (as we would call it today) by adding the visual to the speech component.

Probably the first large scale attempt at offering such an audio-visual service was AT&T’s PicturePhone in the mid 1960’s. The service was eventually discontinued but the idea of expanding the telephone service with a video service caught the attention of telephone companies. Many expected that digital video-phone or video-conference services on the emerging digital networks would guarantee the success that the PicturePhone service did not have and research in video coding was funded in many research labs of the telephone companies.

This article will tell the story of how this original investment, seconded by other industries, gave rise to the ever improving digital video experience that our generation is experiencing in ever greater number.

First Video Coding Standard

The first international standard that used video coding techniques – ITU-T Recommendation H.120 – originated from the European research project called COST 211. H.120 was intended for video-conference services, especially on satellite channels, was approved in 1984 and implemented in a limited number of specimens.

Second Video Coding Standard

The second international standard that used video coding techniques – ITU-T Recommendation H.261 – was intended for audio-visual services and was approved in 1988. This signaled the maturity of video coding standardisation that left the old and inefficient algorithms to enter the DCT/motion compensation age.

For several reasons H.261 was implemented by a limited number of manufacturing companies for a limited number of customers.

Third Video Coding Standard

Television broadcasting has always been – and, with challenges, continues to be today – a socially important communication tool. Unlike audio-visual services that were mostly a strategic target on the part of the telecom industry, television broadcasting in the 1980’s was a thriving industry served by the Consumer Electronics (CE) industry providing devices to hundreds of millions of consumers.

ISO MPEG-1, the third international standard that used video coding techniques, was intended for interactive video applications on CD-ROM and was approved by MPEG in November 1992. Besides the declared goal, the intention was to popularise video coding technologies by relying on the manufacturing prowess of the CE industry. MPEG-1 was the first example of a video coding standard developed by two industries that had had until that time very little in common: telecom and CE (terminals for the telecom market were developed by a special industry with little contact with the CE industry).

Fourth Video Coding Standard

Even though in the late 1990’s MPEG-1 Video eventually reached the 1 billion units sold with the nickname “Video CD”, especially in the Far East, the big game started with the fourth international standard that used video coding techniques – ISO MPEG-2 – whose original target was “digital television”. The number of industries interested in it made MPEG crowded: telecom had always sought to have a role in television, CE was obviously interested in having existing analogue TV sets replaced by shining digital TV sets or at least supplemented by a set top box, satellite broadcasters and cable were very keen on the idea of hundreds of TV programs in their bouquets, terrestrial broadcasters had different strategies in different regions but eventually joined, as well as the package media sector of the CE industry, with their tight contacts with the movie industry. This explains why the official title of MPEG-2 is “Generic coding of moving pictures and associated audio information” to signal the fact that MPEG-2 could be used by all the industries that, at that time, had an interest in digital video, a unique feat in the industry.

Fifth and Sixth Video Coding Standards

Remarkably, MPEG-2 Video (and Systems) was a standard jointly developed by MPEG and ITU-T. The world, however, follows the dictum of the Romance of Three Kingdoms (三國演義): 話說天下大勢.分久必合,合久必分. Adapted to the context this can be translated as in the world things divided for a long time shall unite, things united for a long time shall divide. So, the MPEG and ITU paths divided in the following phase. ITU-T developed its own H.263 Recommendation “Video coding for low bit rate communication” and MPEG developed its own MPEG-4 Visual standard, part 2 “Coding of audio-visual objects”. The conjunction of the two standards is a very tiny code that simply tells the decoder that a bitstream is H.263 or MPEG-4 Visual. A lot of coding tool commonality exists, but not at the bitstream level.

H.263 focused on low bitrate video communication, while MPEG-4 Visual kept on making real the vision of extending video coding to more industries: this time Information Technology and Mobile. MPEG-4 Visual was released in 2 versions in 1999 and 2000, while H.263 went through a series of updates documented in a series of Annexes to the H.263 Recommendation. H.263 enjoyed some success thanks to the common belief that it was “royalty free”, while MPEG-4 Visual suffered a devastating blow from a patent pool that decided to impose “content fees” in their licensing terms.

Seventh Video Coding Standard

The year 2001 marked the return to the first half of the Romance of Three Kingdoms’ dictum: 分久必合 (things divided for a long time shall unite), even though it was not too 久 (long time) since they had divided, certainly not on the scale intended by the Romance of Three Kingdoms. MPEG and ITU-T (through its Video Coding Experts Group – VCEG) joined forces again in 2001 and produced the seventh international standard in 2003. The standard is called Advanced Video Coding by both MPEG and ITU, but is labelled as AVC by MPEG and as H.264 by ITU-T. Reasonable licensing terms (of course always considered unreasonable by licensees) ensured AVC’s long-lasting success in the market place that continues to this day (for another 4 years and 3 months, I mean).

Eighth Video Coding Standard

The eighth international standard that used video coding techniques stands by itself because it is not a standard with “new” video coding technologies, but a standard that enables a device to assemble a decoder matching the bitstream using standardised tools represented in a standard form available at the decoder. The technique, called Reconfigurable Video Coding (RVC) or, more generally, Reconfigurable Media Coding (RMC), because MPEG has applied the same technology to 3D Graphics Coding, is enabled by two standards: ISO/IEC 23001-4 Codec configuration representation and ISO/IEC 23002-4 Video tool library (VTL). The former defines the methods and general principles to describe codec configurations. The latter describes the MPEG VTL and specifies the Functional Units that are required to build a complete decoder for the following standards: MPEG-4 Simple Profile, AVC Constrained Baseline Profile and Progressive High Profile, MPEG-4 SC3DMC, and HEVC Main Profile.
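The idea of building a decoder from a library of standardised tools can be pictured with a toy sketch. Everything in it is an assumption made for illustration: the unit names are hypothetical, the units are trivial arithmetic rather than real Functional Units, and the configuration is a simple chain, whereas the actual standards describe a network of Functional Units.

```python
from typing import Callable, Dict, List

FunctionalUnit = Callable[[List[int]], List[int]]

# Hypothetical stand-ins for entries of a tool library (not real VTL units).
TOOL_LIBRARY: Dict[str, FunctionalUnit] = {
    "dequantise_x2": lambda data: [v * 2 for v in data],
    "add_offset_128": lambda data: [v + 128 for v in data],
    "clip_8bit": lambda data: [max(0, min(255, v)) for v in data],
}


def build_decoder(configuration: List[str]) -> FunctionalUnit:
    """Assemble a decoder by chaining the units named in the configuration."""
    missing = [name for name in configuration if name not in TOOL_LIBRARY]
    if missing:
        raise KeyError(f"unknown functional units: {missing}")
    units = [TOOL_LIBRARY[name] for name in configuration]

    def decoder(data: List[int]) -> List[int]:
        # Pass the data through each unit in the configured order.
        for unit in units:
            data = unit(data)
        return data

    return decoder


# The configuration travels with the content; the tools are already at the decoder.
decoder = build_decoder(["dequantise_x2", "add_offset_128", "clip_8bit"])
decoded = decoder([-70, 0, 70])
```

The point of the sketch is only the split of responsibilities: the library is known in advance, while the configuration tells the device which tools to connect for a given bitstream.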

Ninth Video Coding Standard

In 2010 MPEG and VCEG extended their collaboration to a new project: High Efficiency Video Coding (HEVC). A few months after the HEVC FDIS had been released, the HEVC Verification Tests showed that the standard had achieved 60% improvement on AVC, 10% more than originally planned. After that HEVC has been enriched with a number of features that at the time of development were not supported by previous standards such as High Dynamic Range (HDR) and Wide Colour Gamut (WCG), and support for Screen Content and omnidirectional video (video 360). Unfortunately, technical success did not translate into full market success because adoption of HEVC is still hampered – 6 years after its approval by MPEG – by an unclear licensing situation. The articles IP counting or revenue counting?, Business model based ISO/IEC standards, Can MPEG overcome its Video “crisis”? and A crisis, the causes and a solution analyse the reasons for the currently stalled situation and propose possible remedies.

Tenth Video Coding Standard

ISO, IEC and ITU share a common policy vis-à-vis patents in their standards. Using few imprecise but clear words (where a patent attorney would use many precise but unclear words), the policy is: it is good if a standard has no patents or if the patent holders allow use of their patents for free (Option 1); it is tolerable if a standard has patents but the patent holders allow use of their patents on fair, reasonable and non-discriminatory terms (Option 2); it is not permitted to have a standard with patents whose holders do not allow use of their patents (Option 3).

The target of MPEG standards until AVC had always been “best performance no matter what is the IPR involved” (of course if the IPR holders allow), but as the use of AVC extended to many domains, it was becoming clear that there was so much “old” IP (i.e. more than 20 years) that it was technically possible to make a standard whose IP components were Option 1.

In 2013 MPEG released the FDIS of WebVC, strictly speaking not a new standard because MPEG had simply extracted what was the Constrained Baseline Profile of AVC and made it a separate standard with the intention of making it Option 1. The attempt failed because some companies confirmed their Option 2 patent declarations already made against the AVC standard.

Eleventh Video Coding Standard

WebVC has not been the only effort made by MPEG to develop an Option 1 video coding standard (i.e. a standard for which only Option 1 patent declarations have been made). A second effort, called Internet Video Coding (IVC), was concluded in 2017 with the release of the IVC FDIS. Verification Tests performed showed that the performance of IVC exceeded that of the best profile of AVC, by then a 14-year-old standard. Three companies made Option 2 patent declarations that did not contain any detail so that MPEG could not remove the technologies in IVC that the companies claimed infringed their patents.

Twelfth Video Coding Standard

MPEG achieved a different result with its third attempt at developing an Option 1 video coding standard. The proposal made by a company in response to an MPEG Call for Proposals was reviewed by MPEG and achieved FDIS with the name of Video Coding for Browsers (VCB). However, a company made an Option 3 patent declaration that, like those made against IVC, did not contain any detail that would enable MPEG to remove the allegedly infringing technologies. Eventually ISO did not publish VCB.

Today ISO and IEC have disabled the possibility for companies to make Option 3 patent declarations without details (a policy that ITU had not allowed). As the VCB approval process has been completed, it is not possible to resume the study of VCB if MPEG does not restart the process. Therefore, VCB is likely to remain unpublished and therefore not an ISO standard.

Thirteenth Video Coding Standard

For the third time MPEG and ITU are collaborating in the development of a new video coding standard with the target of a 50% reduction of bitrate compared to HEVC. The development of Versatile Video Coding (VVC), as the new standard is called, is still under way and involves close to 300 experts attending VVC sessions. MPEG expects to reach the FDIS of VVC in October 2020.
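A short worked example shows how successive bitrate reductions compound. It is only a sketch under stated assumptions: the 60% improvement reported for HEVC over AVC is read here as a bitrate reduction at equal quality, the VVC figure is the stated target rather than a measured result, and the 10 Mbit/s starting bitrate is hypothetical.

```python
def reduced_bitrate(bitrate_mbps: float, reduction: float) -> float:
    """Bitrate after applying a fractional reduction (0.5 means 50% fewer bits)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return bitrate_mbps * (1.0 - reduction)


avc_mbps = 10.0                               # hypothetical AVC bitrate
hevc_mbps = reduced_bitrate(avc_mbps, 0.60)   # assumed reading of the HEVC result
vvc_mbps = reduced_bitrate(hevc_mbps, 0.50)   # VVC target relative to HEVC

# Reductions multiply rather than add: 60% followed by 50% leaves 0.4 * 0.5 = 20%.
cumulative_reduction = 1.0 - vvc_mbps / avc_mbps
```

Under these assumptions the same content would need about one fifth of the AVC bitrate, i.e. a cumulative reduction of about 80%, not 110%.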

Fourteenth Video Coding Standard

Thirteen is a large number for video coding standards but this number should be measured against the number of years covered – close to 40. In this long period of time we have gone from 3 initial standards that were mostly application/industry-specific (H.120, MPEG-1 and H.261) to a series of generic (i.e. industry-neutral) standards (MPEG-2, MPEG-4 Visual, MPEG-4 AVC and HEVC) and then to a group of standards that sought to achieve Option 1 status (WebVC, IVC and VCB). Other proprietary video coding formats that have found significant use in the market point to the fact that MPEG cannot stay forever in its ivory tower of “best video coding standards no matter what”. MPEG has to face the reality of a market that becomes more and more diversified and where – unlike the golden age of a single coding standard – there is no longer one size that fits all.

At its 125th meeting MPEG reviewed the responses to its Call for Proposals on a new video coding standard that sought proposals with a simplified coding structure and an accelerated development time of 12 months from working draft to FDIS. The new standard will be called MPEG-5 Essential Video Coding (EVC) and is expected to reach FDIS in January 2020.

The new video coding project will have a base layer/profile which is expected to be Option 1 and a second layer/profile that already has a performance ~25% better than HEVC. Licensing terms are expected to be published by patent holders within 2 years.

VCEG has decided not to work with MPEG on this coding standard. Are we back to the 合久必分 (things combined for a long time must split) situation? This is half true because the MPEG-VCEG collaboration in VVC is continuing. In any case VVC is expected to deliver its target of a 50% bitrate reduction compared to HEVC.

Fifteenth Video Coding Standard

If there was a need to prove that there is no longer “one size fits all” in video coding, just look at the Call for Proposals for a “Low Complexity Video Coding Enhancements” standard issued by MPEG. This Call is not for a “new video codec”, but for a technology capable of extending the capabilities of an existing video codec. A typical usage scenario is the addition of, say, the high definition capability to set top boxes (typically deployed by the millions) that cannot be recalled. Proposals are due at the March 2019 meeting and FDIS is expected in April 2020.

Sixteenth Video Coding Standard

Point Clouds are not really the traditional “video” content as we know it, namely sequences of “frames” at a frequency sufficiently high to fool the eye into believing that the motion is natural. In point clouds motion is given by dynamic point clouds that represent the surface of objects moving in the scene. For the eye, however, the end-result is the same: moving pictures displayed on a 2D surface, whose objects can be manipulated by the viewer (this, however, requires a system layer that MPEG is already developing).

MPEG is working on two different technologies: the first one uses HEVC to compress projections of portions of a point cloud (and is therefore well-suited for entertainment applications because it can rely on an existing HEVC decoder) and the second one uses computer graphics technologies (and is currently more suited to automotive applications). The former will achieve FDIS in January 2020 and the latter in April 2020.

Seventeenth and Eighteenth Video Coding Standards

Unfortunately, the crystal ball gets blurred as we move into the future. Therefore MPEG is investigating several technologies capable of providing solutions for alternative immersive experiences. After providing HEVC and OMAF for 3DoF experiences (where the user can only have roll, pitch, and yaw movement of the head), MPEG is working on OMAF v2 for 3DoF+ experiences (where the user can have a limited translation of the head). A Call for Proposals has been issued, responses are due in March 2019 and the FDIS is expected in July 2020. Investigations are being carried out on 6DoF (where the user can have full translation of the head) and on light field.
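The difference between the three classes of experience can be sketched as the part of the head pose that a renderer is allowed to honour. This is an illustrative sketch, not anything taken from OMAF: the class and function names are hypothetical, and the 0.3 m translation limit for 3DoF+ is an arbitrary assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HeadPose:
    yaw: float        # rotation, degrees
    pitch: float
    roll: float
    x: float = 0.0    # translation, metres
    y: float = 0.0
    z: float = 0.0


LIMITED_TRANSLATION_M = 0.3  # hypothetical bound for 3DoF+


def _clamp(value: float, limit: float) -> float:
    return max(-limit, min(limit, value))


def effective_pose(pose: HeadPose, mode: str) -> HeadPose:
    """Return the pose actually used for rendering in the given mode."""
    if mode == "3DoF":   # rotation only: translation is ignored
        return replace(pose, x=0.0, y=0.0, z=0.0)
    if mode == "3DoF+":  # rotation plus a limited translation
        lim = LIMITED_TRANSLATION_M
        return replace(pose, x=_clamp(pose.x, lim),
                       y=_clamp(pose.y, lim), z=_clamp(pose.z, lim))
    if mode == "6DoF":   # rotation plus full translation
        return pose
    raise ValueError(f"unknown mode: {mode}")


tracked = HeadPose(yaw=10.0, pitch=0.0, roll=0.0, x=1.0, y=0.1, z=-2.0)
pose_3dof = effective_pose(tracked, "3DoF")
pose_3dof_plus = effective_pose(tracked, "3DoF+")
pose_6dof = effective_pose(tracked, "6DoF")
```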


The last 40 years have seen digital video converted from a dream into a reality that involves billions of users every day. This long ride is represented in the figure that ventures into the next steps of the ride.

MPEG keeps working to make sure that manufacturers and content/services providers have access to more and better standard visual technologies for an increasingly diversified market of increasingly demanding users.



The MPEG ecosystem


An ecosystem is composed of elements variously interconnected and variously dependent on one another. Standardisation is a particular type of ecosystem. The purpose of this article is to analyse the elements of the MPEG ecosystem and their relationships.

Standardisation in the past

In days long bygone, standardisation in what today we would call the “media industry” followed a rather simple process. A company wishing to attach a “standard” label to a product that had become successful in the market made a request to a standards committee whose members, typically from companies in the same industry, had an interest in getting an open specification of what had been until then a closed system. A good example is offered by the video cassette player for which two products from two different companies, ostensibly for the same functionality – VHS and Beta – were approved by the same standards organisation – the International Electrotechnical Commission (IEC) – and by the same committee – SC 60 B at that time.

Things were a little different in the International Telecommunication Union (ITU) where ITU-T (then called CCITT) had a Study Group where the telecommunication industry – represented by the Post and Telecommunication Administrations of the member countries, at that time the only ones admitted to the committee – requested a standard (called recommendation in the ITU) for digital telephony speech. ITU-T ended up with two different specifications in the same standard: one called A-law and the other called µ-law.
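To make the A-law/µ-law split concrete, here is a minimal sketch of the two companding curves in their textbook continuous form. The parameter values µ = 255 and A = 87.6 are the ones commonly quoted; note that the actual ITU-T recommendation specifies piecewise-linear approximations with 8-bit codewords, which this sketch does not reproduce.

```python
import math

MU = 255.0   # commonly quoted µ-law parameter
A = 87.6     # commonly quoted A-law parameter


def mu_law_compress(x: float) -> float:
    """Continuous µ-law curve for a sample x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def a_law_compress(x: float) -> float:
    """Continuous A-law curve for a sample x in [-1, 1]."""
    ax = abs(x)
    if ax < 1.0 / A:
        y = A * ax / (1.0 + math.log(A))                    # linear segment near zero
    else:
        y = (1.0 + math.log(A * ax)) / (1.0 + math.log(A))  # logarithmic segment
    return math.copysign(y, x)


# The same quiet sample maps to different values under the two laws,
# so speech crossing from one environment to the other must be converted.
quiet_sample = 0.01
mu_value = mu_law_compress(quiet_sample)
a_value = a_law_compress(quiet_sample)
```

Both curves agree at 0 and at full scale but differ in between, which is why one specification cannot simply be read as the other.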

In ITU-R (then called CCIR) National Administrations were operating, or had authorised various entities to operate, television broadcasting services (some had even started doing so before WW II) and were therefore unable to settle on even a limited number of television systems. The only thing they could do was to produce a document called Report 624 Television Systems that collected the 3 main television systems (NTSC, PAL and SECAM) with tens of pages where country A selected, e.g., a different frequency or a different tolerance of the colour subcarrier than country B or C.

Standardisation, à la MPEG

Not unaware of past failures of standardisation and taking advantage of the radical technology discontinuity, MPEG took a different approach to standardisation which can be expressed by the synthetic expression “one functionality – one tool”. To apply this expression to the example of ITU-T’s A-law – µ-law dichotomy, if MPEG had to decide on a standard for digital speech, it would

  1. Develop requirements
  2. Select speech samples to be used for tests
  3. Issue a Call for Proposals (CfP)
  4. Run the selected test speech with the proposals
  5. Subjectively assess the quality
  6. Check the proposals for any issue such as complexity etc.
  7. Create a Test Model with the proposals
  8. Create Core Experiments (CE)
  9. Iterate the Test Model with the results of CEs
  10. Produce WD, CD, DIS and FDIS

The process would be long – an overkill in this case because a speech digitiser is a simple analogue-to-digital (A/D) converter – but not necessarily longer than having a committee decide on competing proposals with the goal of accepting only one. The result would be a single standard providing seamless bitstream interoperability without the need to convert speech from one format to another when speech moves from one environment (country, application etc.) to another.

If there were only the 10 points listed above, the MPEG process would not be much more complex than the ITU’s. The real difference is that MPEG does not have the mindset of the telecom industry that decided on A-law – µ-law digital speech 50+ years ago. MPEG is different because it would address speech digitisation taking into consideration the needs of a range of other industries who intend to use the standard and hence want to have a say in how it is made: Consumer Electronics (CE), Information Technology (IT), broadcasting, gaming and probably more. Taking into account so many views is a burden for those developing the standard (actually, not necessarily) but the standard eventually produced is abstracted from the little (or big) needs that are specific to individual industries. Profiles and Levels allow an industry not to be overburdened by technologies that were introduced to satisfy requirements from other industries and are irrelevant (and possibly costly) to it. Those who need the functionality, no matter what the cost, can get it with different profiles and levels.

Exploring the MPEG ecosystem

The article It worked twice and will work again contains a figure, reproduced below, that colourfully depicts how MPEG has succeeded in its role of “abstracting” the needs of client “digital media” industries currently served by MPEG. The figure does not include other industries, such as genomics, that MPEG has begun to serve.

Figure 1 – MPEG and its client “digital media” industries

Figure 1, however, does not describe all the ecosystem actors. In MPEG-1 the Consumer Electronics industry was typically able to develop by itself (or found it more convenient to develop) the technology needed to make products that used the MPEG-1 standard. With MPEG-2 this was less the case as pointed out in the paragraph “A standard for all” in Why is MPEG successful? Today the industry implementing (as opposed to using or selling products based on) MPEG standards has grown to be a very important element of the MPEG ecosystem. This industry typically provides components to companies who actually manufacture a complete product (sometimes this happens inside the same company, but the logic is the same).

MPEG standards can be implemented using various combinations of software, hardware and hybrid software/hardware technologies. The choice for hardware is very wide: from various integrated circuit architectures to analogue technologies. The latter choice is for devices with extremely low power consumption, although with limited compression. Just about to come are devices that use neural networks. Other technologies are likely to find use in the future, such as quantum computing or even genomic technologies.

Figure 2 identifies 3 “layers” in the MPEG ecosystem, and the arrows show their relationships.

Figure 2 – MPEG, its Client Industries and Implementation Industries

Client industries in need of a standard provide requirements. However, the “Implementation layer” industries, examples of which have been provided above, also provide requirements. The MPEG layer eventually develops standards that are fed to the Client Industry layer that requested it, but also to the Implementation layer. Requests to implement a standard are generated by companies in the Client industry layer and directed to companies in the Implementation layer who eventually deliver the implementation to the companies requesting it. Conformance testing typically plays a role in assessing conformance of an implementation to the standard.

Figure 2, however, is not a full description of the MPEG ecosystem. More elements are provided by Figure 3 which describes how the MPEG process actually takes place.

Figure 3 – The MPEG process

The new elements highlighted by the Figure are

  1. The MPEG Toolkit assembling all technologies that have been used in MPEG standards
  2. The MPEG Competence Centres mastering specific technology areas and
  3. The Technology industries providing new technologies to MPEG by responding to Calls for Proposals (CfP).

In the early days the Technology Industries did not have a clear identity and were largely merged with the Client Industries and Implementation Industries. Today, as highlighted above, the providers of basic technologies are well identified and separate industries.

Revisiting the MPEG process

Using Figure 3 it is possible to describe how the MPEG process unfolds (the elements of the MPEG ecosystem are in italic).

  1. MPEG receives a request for a standard from a Client Industry
  2. The MPEG Requirements Competence Centre develops requirements by interacting with Client Industries and Implementation Industries
  3. MPEG issues CfPs (Calls for technologies in the figure)
  4. Technology Industries respond to CfP by submitting technologies
  5. MPEG mobilises appropriate Competence Centres
  6. Competence Centres, coordinated by MPEG, develop standards by selecting/adapting existing technologies (drawn from the toolkit) and submitted technologies
  7. MPEG updates the toolkit with new technologies

It should be clear now that MPEG cannot be described by the simple “Standards Provider – Client Industry” relationship. MPEG is a complex ecosystem that works because all its entities play the role proper to them.

Dividing MPEG by Client Industries means losing the commonality of technologies. Dividing MPEG by Implementation Industries makes no sense, because in principle any MPEG standard must be implementable with different technology. Dividing MPEG by Competence Centres means losing the interaction between them.


MPEG is a complex ecosystem that has successfully operated for decades serving the needs of the growing number of its component industries. As much as you would not allow a child to open a toy with complicated gears inside just to see “how it works”, industry should not allow apprentice sorcerers to undo this wonderful ecosystem called MPEG.


Why is MPEG successful?

There are people who do not like MPEG (I wonder why), but so far I have not found anybody disputing the success of MPEG. Some people claim that only a few MPEG standards are successful, but maybe that is because some MPEG standards are so successful.

In this article the reasons for MPEG’s success are identified and analysed by using the 18 elements of the figure below.

A standard for all. In the late 1980’s many industries, regions and countries had understood that the state of digital technologies justified the switch from analogue to digital (some acted against that understanding and paid dearly for it). At that time several companies had developed prototypes, regional initiatives were attempting to develop formats for specific countries and industries, some companies were planning products and some standards organisations were actually developing standards for their industries. The MPEG proposal of a generic standard, i.e. a common technology for all industries, caught the attention because it offered global interoperability, created a market that was global – geographically and across industries – and placed the burden of developing the very costly VLSI technology on a specific industry accustomed to doing that. The landscape today has changed beyond recognition, but the revolutionary idea of that time is now taken for granted.

One step at a time. Even before MPEG came to the fore many players were trying to be “first” and “impose” their early solution on other countries or industries or companies. If the newly-born MPEG had proposed itself as the developer of an ambitious generic standard digital media technology for all industries, the proposal would have been seen as far fetched. So, MPEG started with a moderately ambitious project: a video coding standard for interactive applications on digital storage media (CD-ROM) at a rather low bitrate (1.5 Mbit/s) targeting the market covered by the video cassette (VHS/Beta). Moving one step at a time has been MPEG policy for all its subsequent standards.

Complete standards. Within 6 months of its inception MPEG had already realised the obvious, namely that digital media is not just video (although this is the first component that catches the attention), but it is also audio (no less challenging and with special quality requirements). Within 12 months it had realised that bits do not flow in the air but that a stream of bits needs some means to adapt the stream to the mechanism that carries it (originally the CD-ROM). If the transport mechanism is analogue (as it was 25 years ago and, to a large extent, still is today), the adaptation is even more challenging. Later MPEG also realised that a user interacts with the bits (even though it is so difficult to understand what exactly is the interaction that the user wants). With its MPEG-2 standard MPEG was able to provide the industry with a complete Audio-Video-Systems (and DSM-CC) solution whose pieces could also be used independently. That was possible because MPEG could attract, organise and retain the necessary expertise to address such a broad problem area and provide not just a solution that worked, but the best that technology could offer at the time.

Requirements first. Clarifying to yourself the purpose of something you want to make is a rule that should apply to any human endeavour. This rule is a must when you are dealing with a standard developed by a committee of like-minded people. When the standard is not designed by and for a single industry but by many, keeping this rule is vital for the success of the effort. When the standard involves disparate technologies whose practitioners are not even accustomed to talk to one another, complying with this rule is a prerequisite. Starting from its early days MPEG has developed a process designed to achieve the common understanding that lies at the basis of the technical work to follow: describe the environment (context and objectives), single out a set of exemplary uses of the target standard (use cases), and identify requirements.

Leveraging research. In the late 1980s compression of video and audio (and other data, e.g. facsimile) had been the subject of research for a quarter of a century, but how could MPEG access that wealth of technologies and know-how? The choice was the mechanism of Call for Proposals (CfP) because an MPEG CfP is open to anybody (not just the members of the committee) – see How does MPEG actually work? All respondents are given the opportunity to present their proposals that can, at their choice, address individual technologies, subsystems or full systems, and defend them (by becoming MPEG members). MPEG does not do research; it uses the best research results to assemble the system specified in the requirements that always accompany a CfP. Therefore MPEG has a symbiotic relationship with research. MPEG could not operate without a tight relationship with research and research would certainly lose a big customer if that relationship did not exist.

Minimum standards. Industries can be happy to share the cost of an enabling technology but not at the cost of compromising their individual needs. MPEG develops standards so that the basic technology can be shared, but it must allow room for customisation. The notion of Profiles and Levels provides the necessary flexibility to the many different users of MPEG standards. With profiles MPEG defines subsets of the general interoperability, with levels it defines different levels of performance within a profile. Further, by restricting standardisation to the decoding functionality MPEG extends the life of its standards because it allows industry players to compete on the basis of their constantly improved encoders.
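How profiles and levels partition interoperability can be sketched with a simple check. This is a hedged illustration, not a rule from any MPEG specification: the profile names, tool names and integer levels are hypothetical, and real profiles are not always nested subsets of one another.

```python
# Hypothetical profiles: each is the set of coding tools a decoder must implement.
PROFILES = {
    "simple": frozenset({"intra", "inter"}),
    "rich": frozenset({"intra", "inter", "scalable"}),
}


def can_decode(decoder_profile: str, decoder_level: int,
               stream_profile: str, stream_level: int) -> bool:
    """A decoder handles a stream if it has every tool the stream's profile
    may use and offers at least the performance level the stream needs."""
    tools_ok = PROFILES[stream_profile] <= PROFILES[decoder_profile]
    level_ok = stream_level <= decoder_level
    return tools_ok and level_ok


set_top_box = ("simple", 2)  # hypothetical low-cost device
simple_stream_ok = can_decode(*set_top_box, "simple", 1)  # within profile and level
rich_stream_ok = can_decode(*set_top_box, "rich", 1)      # needs a tool the box lacks
too_demanding_ok = can_decode(*set_top_box, "simple", 3)  # exceeds the box's level
```

In this picture an industry that only needs the “simple” tools builds cheaper decoders, while still sharing the basic technology with industries that need more.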

Best out of good. When the responses to a CfP are on the table, how can MPEG select the best from the good? MPEG uses five tools:

  1. Comprehensive description of the technology proposed in each response (no black box allowed)
  2. Assessment of the performance of the technology proposed (e.g. subjective or objective tests)
  3. Line-up of aggressive “judges” (meeting participants, especially other proponents)
  4. Test Model assembling the candidate components selected by the “judges”
  5. Core Experiments to improve the Test Model.

By using these tools MPEG is able to provide the best standard in a given time frame.

Competition & collaboration. MPEG favours competition to the maximum extent possible. Many participants in the meeting are actual proponents of technologies in response to a CfP and obviously keen to have their proposals accepted. Extending competition beyond a certain point, however, is counterproductive and prevents the group from reaching the goal with the best results. MPEG uses the Test Model as the platform that helps participants to collaborate by improving different areas of the Test Model. Improvements are obtained through

  1. Core Experiments, first defined in March 1992 as “a technical experiment where the alternatives considered are fully documented as part of the test model, ensuring that the results of independent experimenters are consistent”, a definition that applies unchanged to the work being done today;
  2. Reference Software, which today is a shared code base that is progressively improved with the addition of all the software validated through Core Experiments.

IPR (un)aware. MPEG uses the process described above and seeks to produce the best performing standards that satisfy the requirements independently of the existence of IPR. Should an IPR in the standard turn out not to be available, the technology protected by that IPR should be removed. Since only the best technologies are adopted and MPEG members are uniquely adept at integrating them, industry knows that the latest MPEG standards are top of the range. Of course one should not think that the best is free. In general it has a cost because IP holders need to be remunerated. Market (outside of MPEG) decides how that can be achieved.

Internal competition. If competition is the engine of innovation, why should those developing MPEG standards be shielded from it? The MPEG mission is not to please its members but to provide the best standards to industry. Probably the earliest example of this tenet is provided by MPEG-2 part 3 (Audio). When backward compatibility requirements did not allow the standard to yield the performance of algorithms not constrained by compatibility, MPEG issued a CfP and developed MPEG-2 part 7 (Advanced Audio Coding) that eventually evolved and became the now ubiquitous MPEG-4 AAC. Had MPEG not made this decision, probably we would still have MP3 everywhere, but no other MPEG Audio standards. MPEG values all its standards but cannot afford not to provide the best technology to those who demand it.

Separation of concerns. Even for the purpose of developing its earliest standards such as MPEG-1 and MPEG-2, MPEG needed to assemble disparate technological competences that had probably never worked together in a project (with its example MPEG has favoured the organisational aggregation of audio and video research in many institutions where the two were separate). To develop MPEG-4 (a standard with 34 parts whose development continues unabated), MPEG has assembled the largest ever number of competences ranging from audio and video to scene description, to XML compression, to font, timed text and many more. MPEG keeps competences organisationally separate in different MPEG subgroups, but retains all flexibility to combine and deploy the needed resources to respond to specific needs.

New ways for new standards. MPEG works at the forefront of digital media technologies and it would be odd if it had not innovated the way it makes its own standards. Since its early days, MPEG has made massive use of ad hoc groups to progress collaborative work, innovated the way input and output documents are shared in the community and changed the way documents are discussed at meetings and edited in groups.

Talk to the world. Using extreme words, MPEG does not have an industry of its own. It only has the industry that develops the technologies used to make standards for the industries it serves. Therefore MPEG needs to communicate its plans, the progress of its work and the results achieved more actively than other groups. See MPEG communicates for the many ways MPEG uses to achieve this goal.

Standards are for a time. Digital media is one of the fastest evolving digital technology areas because most of the developers of good technologies incorporated in MPEG standards invest the royalties earned from previous standards to develop new technologies for new standards. As soon as a new technology shows interesting performance (which MPEG assesses by issuing Calls for Evidence – CfE) or the context changes offering new opportunities, MPEG swiftly examines the case, develops requirements and issues CfPs. For instance this has happened for its many video and audio compression standards. A paradigmatic case of a standard addressing a change of context is MPEG Media Transport (MMT) that MPEG designed having in mind a broadcasting system for which the layer below it is IP, unlike MPEG-2 Transport Stream, originally designed for a digitised analogue channel (but also used today for transport over IP as in IPTV).

Standards lead. When technology moves fast, as in the case of digital media, waiting is a luxury MPEG cannot afford. MPEG-1 and MPEG-2 were standards whose enabling technologies were already considered by some industries and MPEG-4 (started in 1993) was a bold and successful attempt to bring media into the IT world (or the other way around). That it is no longer possible to wait is shown by MPEG-I, a challenging undertaking where MPEG is addressing standards for interfaces that are still shaky or just hypothetical. Having standards that lead, as opposed to trail, is a tough trial-and-error game, but the only possible game today. The alternative is to stop making standards for digital media because if MPEG waits until market needs are clear, the market is already full of incompatible solutions and there is no room left for standards.

Standards as enablers. An MPEG standard cannot be “owned” by an industry. Therefore MPEG, keeping faith to its “generic standards” mission, tries to accommodate all legitimate functional requirements when it develops a new standard. MPEG assesses each requirement for its merit (value of functionality, cost of implementation, possibility to aggregate the functionality with others etc.). Ditto if an industry comes with a legitimate request to add a functionality to an existing standard. The decision to accept or reject a request is only driven by a value substantiated by use cases, not because an industry gets an advantage or another is penalised.

What is “standard”? In human societies there are laws and entities (tribunals) with the authority to decide if a specific human action conforms to the law. In certain regulated environments (e.g. terrestrial broadcasting in many countries) there are standards and entities (authorised test laboratories) with the authority to decide if a specific implementation conforms to the standard. MPEG has neither but, in keeping with its “industry-neutral” mission, it provides the technical means – tools for conformance assessment, e.g. bitstreams and reference software – for industries to use in case they want to establish authorised test laboratories for their own purposes.

Rethinking what we are. MPEG started as a “club” of Telecommunication and Consumer Electronics companies. With MPEG-2 the “club” was enlarged to Terrestrial and Satellite Broadcasters, and Cable concerns. With MPEG-4, IT companies joined forces. Later, a large number of research institutions and academia joined (today they account for ~25% of the total membership). With MPEG-I, MPEG faces new challenges because the demand for standards for immersive services and applications is there, but technology immaturity deprives MPEG of its usual “anchors”. Thirty years ago MPEG was able to invent itself and, subsequently, to morph itself to adapt to the changed conditions while keeping its spirit intact. If MPEG is able to continue to do as it did in the last 30 years, it can continue to support the industry it serves in the future, whatever the changes of context.
I mean, if some mindless industry elements do not get in the way.

Suggestions? If you have comments or suggestions about MPEG, please write to




MPEG can also be green


MPEG has given humans the means to add significantly more effectiveness and enjoyment to their lives. This comes at a cost, though. Giving billions of people the means to stream video anywhere at any time of the day adds to global energy consumption. Enhanced experiences provided by newer features such as High Dynamic Range further add to the energy consumption of the display. More sophisticated compression algorithms consume more energy, even though this can be mitigated by more advanced circuit geometry.

In 2013 MPEG issued a Call for Proposals on “Green MPEG” requesting technologies that enable reduction of energy consumption in video codecs. In 2016 MPEG released ISO/IEC 23001-11 Green Metadata, followed by a number of ancillary activities.

It should be clear that Green Metadata should not be seen as an attempt at solving the global problem of energy consumption. More modestly Green Metadata seeks to reduce power consumption in the encoding, decoding, and display process while preserving the user’s quality of experience (QoE). At worst Green Metadata can be used to reduce the QoE in a controlled way.

The standard does not require changing the operation of a given encoder or decoder (i.e. changing the video coding standard). It only requires the ability to “access” and “influence” appropriate operating points of any of the encoder, decoder or display.

A system view

Green Metadata has been developed with the target of metadata suitable for influencing the video encoding, decoding and display process. The framework could easily be generalised by replacing “video” and “display” with “media” and “presentation”, but the numerical results obtained in the video case cannot be directly extrapolated to other media.

Let’s start from the figure representing a conceptual diagram of a green encoder-decoder pair.

Figure 1 – Conceptual diagram of a green encoder-decoder pair

The Green Video Encoder (GVE) is a regular video encoder that generates a compressed video bitstream and also a stream of metadata (G-Metadata) for use by a Green Video Decoder (GVD) to reduce power consumption. When a return channel is available (e.g. on the internet), the GVD may generate feedback information (G-Feedback) that the GVE may use to generate a compressed video bitstream that demands less power for the GVD to decode.

To understand what Green Metadata actually standardises, it is worth looking inside the high-level diagram above to see which new “green components” are added. The figure below shows them.

Figure 2 – Inside a green encoder-decoder pair

The GVE generates G-Metadata packaged by the G-Metadata Generator for transmission to a GVD. The GVD G-Metadata Extractor extracts the G-Metadata payload and passes the GVE G-Metadata to the GVD Power Manager along with G-Metadata coming from the GVD. The GVD Power Manager, based on the two G-Metadata streams and possibly other input such as user’s input (not shown in figure), may send

  1. Power Control data to the Video Decoder to change its operation
  2. G-Feedback data to the G-Feedback Generator to package it for transmission to the GVE.

At the GVE side the G-Feedback Extractor extracts the G-Feedback data and passes them to the GVE Power Manager. This may send Power Control data to the Video Encoder to change its operation.
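The decoder-side flow described above can be sketched in a few lines. This is only an illustration: the record types, the field names, the 0-to-1 complexity scale and the decision rule are all assumptions made for the example, since the actual payload syntax is defined in ISO/IEC 23001-11 and the Power Manager logic is left to implementers.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GMetadata:
    """Hypothetical payload; the real syntax is defined in ISO/IEC 23001-11."""
    complexity: float  # 0.0 (trivial) to 1.0 (worst case), an assumed scale


@dataclass
class PowerControl:
    decoder_speed: float  # fraction of the maximum decoder clock


@dataclass
class GFeedback:
    requested_complexity: float  # what the GVD asks the GVE to stay below


def gvd_power_manager(
    from_gve: GMetadata,
    from_gvd: GMetadata,
    battery_level: float,
    low_battery: float = 0.2,
) -> Tuple[PowerControl, Optional[GFeedback]]:
    """Combine the two G-Metadata streams into Power Control and optional G-Feedback."""
    # Run just fast enough for the more demanding of the two estimates.
    speed = max(from_gve.complexity, from_gvd.complexity)
    control = PowerControl(decoder_speed=min(1.0, max(0.25, speed)))
    # On low battery, ask the encoder for a bitstream that is cheaper to decode.
    feedback = GFeedback(requested_complexity=0.5) if battery_level < low_battery else None
    return control, feedback
```

Calling `gvd_power_manager(GMetadata(0.4), GMetadata(0.6), battery_level=0.9)` yields a Power Control record and no G-Feedback; with a low battery level the same call also returns a G-Feedback record for the GVE.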

To examine a bit more in detail how G-Metadata can be used, it is helpful to dissect the Video Encoder and Decoder pair.

Figure 3 – Inside the encoder and decoder

The Video Encoder is composed of a Media Preprocessor (e.g. a video format converter) and a Media Encoder. The Video Decoder is made of a Media Decoder and a Presentation Subsystem (e.g. to drive the display). All subsystems send G-Metadata and receive Power Control, except the Presentation Subsystem, which only receives Power Control.

What is standardised in Green Metadata? As always, the minimum that is required for interoperability. This means the Encoder Green Metadata and the Decoder Green Feedback (in red in the figure) that are exchanged by systems which are potentially manufactured by different entities. Other data formats inside the GVE and the GVD are a matter for GVE and GVD manufacturers to decide because they do not affect interoperability but may affect performance. In particular, the logic of the Power Manager that generates Power Control is the differentiating factor between implementations.

Achieving reduced power consumption

In the following the 3 areas positively affected by the use of the Green Metadata standard – encoder, decoder and display – will be illustrated.

Encoder. By using a segmented delivery mechanism (e.g. DASH), encoder power consumption can be reduced by encoding video segments with alternate high/low quality. Low-quality segments are generated by using lower-complexity encoding (e.g. fewer encoding modes and reference pictures, smaller search ranges etc.). Green Metadata include the quality of the last picture of each segment. The video decoder enhances the low-quality segment by using the metadata and the last high-quality video segment.

Decoder. Lowering the clock frequency of a CMOS circuit implementing a video decoder reduces power consumption, because power increases roughly linearly with the clock frequency and quadratically with the voltage applied. In a software decoder, picture complexity can be used to control the CPU frequency.
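The dependence on frequency and voltage is commonly written as P ≈ C·V²·f, where C is an effective switched capacitance. The sketch below uses that textbook approximation (it is not taken from the Green Metadata standard) with made-up operating points, simply to show why lowering the clock and the voltage together pays off.

```python
def dynamic_power(capacitance_f: float, voltage_v: float, frequency_hz: float) -> float:
    """Approximate CMOS dynamic power in watts: P = C * V^2 * f."""
    return capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical operating points (illustrative numbers only).
full_speed = dynamic_power(1e-9, 1.0, 1.0e9)   # 1.0 V at 1 GHz
half_speed = dynamic_power(1e-9, 0.8, 0.5e9)   # 0.8 V at 500 MHz
saving = 1.0 - half_speed / full_speed         # fraction of power saved
```

With these assumed numbers, halving the clock alone would halve the power, while the accompanying voltage reduction brings the total saving to roughly two thirds.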

One type of Green Metadata signals the duration and degree of complexity of upcoming pictures. This can be used to select the most appropriate setting and offer the best QoE for a desired power-consumption level.
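As an illustration of how such complexity information might be used, the sketch below picks the lowest clock setting that still decodes the upcoming pictures in real time. The `ComplexityHint` record, its fields and the cycle-count model are assumptions made for this example; they do not reproduce the syntax of the standard.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ComplexityHint:
    """Hypothetical stand-in for one complexity metadata record."""
    duration_s: float          # how long the hint applies
    cycles_per_picture: float  # estimated decoding cost of each upcoming picture


def pick_frequency(hint: ComplexityHint, frame_rate: float, available_hz: Sequence[float]) -> float:
    """Return the lowest available clock that still decodes in real time."""
    required_hz = hint.cycles_per_picture * frame_rate
    for freq in sorted(available_hz):
        if freq >= required_hz:
            return freq
    # No setting is fast enough: fall back to the maximum and accept a lower QoE.
    return max(available_hz)


clocks = [200e6, 400e6, 800e6]
easy = ComplexityHint(duration_s=2.0, cycles_per_picture=10e6)
chosen = pick_frequency(easy, frame_rate=30.0, available_hz=clocks)
```

A decoder following this approach would hold the chosen setting for `duration_s` seconds and then re-evaluate with the next hint.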

Display. The display adaptation technique known as backlight dimming reduces power consumption by dimming the LCD backlight while RGB values are scaled in proportion to the dimming level (RGB values do not have a strong influence on power consumption).
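A minimal sketch of the idea, assuming a simplified linear model in which perceived brightness is the product of pixel value and backlight level (real displays also involve gamma, which is ignored here). The function name and the 8-bit pixel representation are choices made for the example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def dim_with_compensation(pixels: List[Pixel], backlight_level: float) -> List[Pixel]:
    """Scale 8-bit RGB values up to compensate for a dimmed backlight.

    backlight_level is the fraction of full backlight power, in (0, 1].
    """
    if not 0.0 < backlight_level <= 1.0:
        raise ValueError("backlight_level must be in (0, 1]")
    gain = 1.0 / backlight_level
    # Values that would exceed 255 are clipped, which is where quality can suffer.
    return [tuple(min(255, round(c * gain)) for c in px) for px in pixels]


compensated = dim_with_compensation([(100, 40, 200), (240, 240, 240)], 0.8)
```

Dark and mid-tone pixels are fully compensated, while very bright pixels saturate; the deeper the dimming, the more pixels are clipped.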

Green Metadata need to be carried

ISO/IEC 23001-11 only specifies the Green Metadata. The way this information is transported depends on the specific use scenarios (some of them are described in Context, Objectives, Use Cases and Requirements for Green MPEG).

Two transports have been standardised by MPEG. In the first, Green Metadata is transported by a Supplemental Enhancement Information (SEI) message embedded in the video stream. This is a natural solution since Green Metadata are due to be processed in a Green Video Decoder that includes a regular video decoder. In this case, however, transport is limited to decoder metadata, not display metadata. In the second, suitable for a broadcast scenario, all Green Metadata is transported in the MPEG-2 Transport Stream.


Power consumption is a dimension that had not been tackled by MPEG, but the efforts that have led to the Green Metadata standard have been rewarded: with the currently standardised metadata 38% of video decoder power and 12% of video encoder power can be saved without affecting QoE and up to 80% of power can be saved with some degradation of the QoE. Power saving data were obtained using the Google Nexus 7 platform and the Monsoon power monitor, and a selection of video test material.

Interested readers can learn more by visiting the MPEG web site and, more so, by purchasing the Green Metadata standard from the ISO website or from a National Body.






The life of an MPEG standard


In How does MPEG actually work? I described the way MPEG develops its standards, an implementation of the ISO/IEC Directives for technical work. This article describes the life of one of MPEG’s most prestigious standards: MPEG-2 Systems, which turned 24 in November 2018 and has played a major role in creating the digital world that we know.

What is MPEG-2 Systems?

When MPEG started, standards for compressed video and, later, audio were the immediate goal. But it was clear that the industry needed more than that. So, after starting MPEG-1 video compression and audio compression, MPEG soon started to investigate “systems” aspects. Seen with today’s eyes, the interactive CD-ROM target of MPEG-1 was an easy problem because all videos on a CD-ROM are assumed to have the same time base, and bit delivery is error free and on time because the time interval between two bytes leaving the transmitter is the same as the time interval between their arrivals at the receiver.

In July 1990, even before delivering the MPEG-1 standard (November 1992), MPEG started working on the much more challenging “digital television” problem. This can be described as the delivery of a package of digital TV programs with different time bases and associated metadata over a variety of analogue channels – terrestrial, satellite and cable. Of course operators expected to be able to do the same operations in the network that the television industry had been accustomed to doing in the several decades since TV distribution had become commonplace.

A unique group of experts from different – and competing – industries and many countries, with their different cultural backgrounds and the common experience of designing the MPEG-1 Systems standard from scratch, designed the MPEG-2 Systems standard, again from a blank sheet of paper.

The figure illustrates the high-level structure of an MPEG-2 decoder: waveforms are received from a physical channel (e.g. a Hertzian channel) and decoded to provide a bitstream containing multiplexed TV programs. A transport stream demultiplexer and decoder extracts audio and video streams (and typically other streams not shown in the figure) and a clock that is used to drive the video and audio decoders.

The structure of the transport bitstream is depicted in the figure. The stream is organised in fixed-length packets of 188 bytes of which 184 bytes are used for the payload.
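To make the 188/184 split concrete, here is a minimal sketch of reading the 4-byte packet header. The field layout (sync byte 0x47, 13-bit PID, 4-bit continuity counter and so on) follows the commonly documented transport stream packet syntax rather than anything stated in this article, and the function and dictionary key names are choices made for the example.

```python
TS_PACKET_SIZE = 188
TS_HEADER_SIZE = 4
SYNC_BYTE = 0x47


def parse_ts_header(packet: bytes) -> dict:
    """Split one transport stream packet into header fields and the bytes after the header."""
    if len(packet) != TS_PACKET_SIZE:
        raise ValueError("a transport stream packet is exactly 188 bytes")
    if packet[0] != SYNC_BYTE:
        raise ValueError("missing sync byte 0x47")
    return {
        "transport_error": bool(packet[1] & 0x80),
        "payload_unit_start": bool(packet[1] & 0x40),
        "pid": ((packet[1] & 0x1F) << 8) | packet[2],   # 13-bit packet identifier
        "adaptation_field_control": (packet[3] >> 4) & 0x03,
        "continuity_counter": packet[3] & 0x0F,
        # 184 bytes follow the header; they may begin with an adaptation field.
        "payload": packet[TS_HEADER_SIZE:],
    }


example = parse_ts_header(bytes([0x47, 0x41, 0x00, 0x17]) + bytes(184))
```

A demultiplexer would use the `pid` value to route each packet to the right audio, video or table decoder, and the continuity counter to detect lost packets.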

The impact of MPEG-2 Systems

MPEG-2 Systems is the container and adapter of the digital audio and video information to the physical world. It is used every day by billions of people who receive TV programs from a variety of sources, analogue and, often, digital as well (e.g. IPTV).

MPEG-2 Systems was approved in November 1994, while some companies who could not wait had already made implementations before the formal release of the standard. That date, however, far from marking the “end” of the standard, as often happens, signaled the beginning of a story that continues unabated today. Indeed, in the 24 years since its release, MPEG-2 Systems has been constantly evolving, while keeping complete backward compatibility with the original 1994 specification.

MPEG-2 Systems in action

So far MPEG has developed 34 amendments (ISO language to indicate the addition of functionality to a standard), 3 additional amendments are close to completion and one is planned. After a few amendments are developed, ISO requests that they be integrated in a new edition of the standard. So far 7 MPEG-2 Systems editions have been produced covering the transport of non-MPEG-2 native media and non-media data. This is an incomplete list of the transport functionality added:

  1. Audio: MPEG-2 AAC, MPEG-4 AAC and MPEG-H 3D Audio
  2. Video: MPEG-4 Visual, MPEG-4 AVC and its extensions (SVC and MVC), HEVC, HDR/WCG, JPEG2000, JPEG XS etc.
  3. Other data: streaming text, quality metadata, green metadata etc.
  4. Signalling: format descriptor, extensions of the transport stream format (e.g. Tables for splice parameters, DASH event signalling, virtual segment etc.), etc.

Producing an MPEG-2 Systems amendment is a serious job. You need experts with full visibility of a 24-year-old standard (i.e. don’t break what works) and the collaboration of experts of the carrier (MPEG-2 Systems) and of the data carried (audio, video etc.). MPEG can respond to the needs of the industry because it has all the component expertise available.


MPEG-2 Systems is probably one of the MPEG standards that are least “visible” to their users. Still it is one of the most important enablers of television distribution applications impacting the life of billions of people and tens of thousands of professionals. Its continuous support is vital for the well-being of the industry.

The importance of MPEG-2 Systems has been recognised by the Academy of Television Arts and Sciences, which has awarded MPEG an Emmy for it.

MPEG-2 Systems Amendments

The table below reports the full list of MPEG-2 Systems amendments. The 1st column gives the edition, the 2nd column the sequential number of the amendment of that edition, the 3rd column the title of the amendment and the 4th the dates of the approval stages.


Ed. Amd. Title                                  Date (yy/mm)
1   1    Format descriptor registration         95/11
1   2    Copyright descriptor registration      95/11
1   3    Transport Stream Description           97/04
1   4    Tables for splice parameters           97/07
1   5    Table entries for AAC                  98/02
1   6    4:2:2 @HL splice parameters
1   7    Transport of MPEG-4 content            99/12
2   1    Transport of Metadata                  02/10
2   2    IPMP support                           03/03
2   3    Transport of AVC                       03/07
2   4    Metadata Application Format CP         04/10
2   5    New Audio P&L Signaling                04/07
3   1    Transport of Streaming Text            06/10
3   2    Transport of Auxiliary Video Data
3   3    Transport of SVC                       08/07
3   4    Transport of MVC                       09/06
3   5    Transport of JPEG2000                  11/01
3   6    MVC operation point descriptor         11/01
3   7    Signalling of stereoscopic video       12/02
3   8    Simplified carriage of MPEG-4          12/10
4   1    Simplified carriage of MPEG-4          12/07
4   2    MVC view, MIME type etc.               12/10
4   3    Transport of HEVC                      13/07
4   4    DASH event signalling                  13/07
4   5    Transport of MVC depth etc.            14/03
5   1    Timeline for External Data             14/10
5   2    Transport of layered HEVC              15/06
5   3    Transport of Green Metadata            15/06
5   4    Transport of MPEG-4 Audio P&L          15/10
5   5    Transport of Quality Metadata          16/02
5   6    Transport of MPEG-H 3D Audio           16/02
5   7    Virtual segment                        16/10
5   8    Signaling of HDR/WCG                   17/01
5   9    Ultra-Low-Latency & JPEG 2000          17/07
5   10   Media Orchestration & sample variants
5   11   Transport of HEVC tiles
6   1    Transport of JPEG XS
6   2    Carriage of associated CMAF boxes



Genome is digital, and can be compressed


The well-known double helix carries the DNA of living beings. The human DNA contains about 3.2 billion nucleotide base pairs represented by the quaternary symbols (A, G, C, T). With high-speed sequencing machines today it is possible to “read” the DNA. The resulting file contains millions of “reads”, short segments of symbols, typically all of the same length, and weighs an unwieldy few Terabytes.

The upcoming MPEG-G standards, developed jointly by MPEG and ISO TC 276 Biotechnology, will reduce the size of the file, without loss of information, by exploiting the inherent redundancy of the reads and make at the same time the information in the file more easily accessible.

This article provides some context, and explains the basic ideas of the standard and the benefits it can yield to those who need to access genomic information.

Reading the DNA

There are two main obstacles preventing a direct use of files from sequencing machines: the position of a read on the DNA sample is unknown and the value of each symbol of the read is not entirely reliable.

The picture below represents 17 reads with a read length of 15 nucleotides. These have been aligned to a reference genome (first line). Reads with a higher number start further down in the reference genome.

Reading column-wise, we see that in most cases the symbols match the reference genome exactly. A single difference (represented by isolated red symbols) may be caused by a read error, while an almost completely different column (most symbols in red) may be caused by the fact that 1) a given DNA is unlikely to be exactly equal to a reference genome or 2) the person with this particular DNA may have health problems.
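The column-wise reading can be sketched as a small program that counts, for every position of the reference, how many aligned reads cover it and how many of them disagree. The function names, the toy sequences and the 50% threshold separating an isolated difference from a mostly different column are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def mismatches_per_column(reference: str, aligned_reads: List[Tuple[int, str]]) -> Dict[int, Tuple[int, int]]:
    """Map each reference position to (number of mismatching reads, number of covering reads)."""
    stats: Dict[int, Tuple[int, int]] = {}
    for start, seq in aligned_reads:
        for offset, symbol in enumerate(seq):
            pos = start + offset
            mismatches, coverage = stats.get(pos, (0, 0))
            stats[pos] = (mismatches + (symbol != reference[pos]), coverage + 1)
    return stats


def classify_column(mismatches: int, coverage: int, variant_fraction: float = 0.5) -> str:
    """Label a column using an assumed threshold on the fraction of mismatching reads."""
    if mismatches == 0:
        return "match"
    if mismatches / coverage >= variant_fraction:
        return "likely variant"
    return "possible read error"


reference = "ACGTACGT"
reads = [(0, "ACGTT"), (1, "CGTTC"), (2, "GTTCA"), (3, "TTCGT"), (4, "TCGT")]
stats = mismatches_per_column(reference, reads)
```

In this toy data every read covering position 4 disagrees with the reference, while only one of the three reads covering position 6 does, so the two columns fall into different categories.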

Use of genomics today

Genomics is already used in clinical practice. An example of a genomic workflow is depicted in the figure below, which could very well represent a blood test workflow if “DNA” were replaced by “blood”. Patients go to a hospital where a sample of their DNA is taken and read by a sequencing machine. The files are analysed by experts who produce reports, which are read and analysed by doctors who decide actions.

Use of genomics tomorrow

Today genomic workflows take time – even months – and are costly – thousands of USD per DNA sample. While there is not much room to cut the time it takes to obtain a DNA sample, sequencing costs have been decreasing and are expected to continue doing so.

Big savings could be achieved by acting on data transport and processing. If the size of a 3 Terabytes file is reduced by, say, a factor of 100, the transport of the resulting 30 Gigabytes would be compatible with today’s internet access speeds of 1 Gbit/s (~4 min). Faster data access, a by-product of compression, would allow doctors to get the information they are searching, locally or from remote, in a fraction of a second.
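The arithmetic behind the "~4 min" figure can be checked in a few lines. Decimal units (1 Terabyte = 10^12 bytes, 1 Gbit/s = 10^9 bit/s) and an ideal link with no protocol overhead are assumed.

```python
def transfer_time_s(size_bytes: float, link_bit_per_s: float) -> float:
    """Ideal transfer time in seconds, ignoring protocol overhead."""
    return size_bytes * 8 / link_bit_per_s


TERABYTE = 10 ** 12
LINK = 10 ** 9  # 1 Gbit/s

raw_size = 3 * TERABYTE
compressed_size = raw_size / 100          # assumed compression factor of 100

raw_minutes = transfer_time_s(raw_size, LINK) / 60
compressed_minutes = transfer_time_s(compressed_size, LINK) / 60
```

Under these assumptions the uncompressed file would occupy the link for 400 minutes (over six and a half hours), against 4 minutes for the compressed one.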

The new possible scenario is depicted in the figure below.

MPEG makes genome compression real

Not much had been done to make the scenario above real (zip is the oft-used compression technology today) until the time (April 2013) MPEG received a proposal to develop a standard to losslessly compress files from DNA sequencing machines.

The MPEG-G standard – titled Genomic Information Representation – has 5 parts: Parts 1 and 2 are expected to be approved at MPEG 125 (January 2019) and the other parts are expected to follow suit shortly after.

MPEG-G is an excellent example of how MPEG can apply its expertise to a field other than media. Part 1, an adaptation of the MP4 File Format present in all smartphones/tablets/PCs, specifies how to make and transport compressed files. Part 2 specifies how to compress reads and Part 3 how to invoke the APIs to access specific compressed portions of a file. Parts 4 and 5 are Conformance and Reference Software, respectively.

The figure below depicts the very sophisticated operation specified in Part 2 in a simplified way.

An MPEG-G file can be created with the following sequence of operations:

  1. Put the reads in the input file (aligned or unaligned) in bins corresponding to segments of the reference genome
  2. Classify the reads in each bin in 6 classes: P (perfect match with the reference genome), M (reads with variants), etc.
  3. Convert the reads of each bin to a subset of 18 descriptors specific of the class: e.g., a class P descriptor is the start position of the read etc.
  4. Put the descriptors in the columns of a matrix
  5. Compress each descriptor column (MPEG-G uses the very efficient CABAC compressor already present in several video coding standards)
  6. Put compressed descriptors of a class of a bin in an Access Unit (AU) for a maximum of 6 AUs per bin

Therefore an MPEG-G file contains all AUs of all bins corresponding to all segments of the reference genome. A file may contain the compressed reads of more than one DNA sample.
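The sequence of operations can be mimicked with a toy program. It is a heavily simplified sketch and not the MPEG-G algorithm: it handles only two of the six classes (P and M), uses three invented descriptors, assumes reads already aligned and lying within the reference, picks an arbitrary bin size, and uses zlib as a stand-in for CABAC, which is not available in the Python standard library.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

BIN_SIZE = 8  # hypothetical length of a reference segment


def classify(reference: str, start: int, seq: str) -> str:
    """Class P for a perfect match with the reference, class M otherwise."""
    return "P" if reference[start:start + len(seq)] == seq else "M"


def descriptors(reference: str, start: int, seq: str) -> Dict[str, list]:
    """Turn one read into (invented) descriptor values: position plus any mismatches."""
    values: Dict[str, list] = {"pos": [start]}
    mismatches = [(i, s) for i, s in enumerate(seq) if reference[start + i] != s]
    if mismatches:
        values["mismatch_count"] = [len(mismatches)]
        values["mismatch_offset"] = [i for i, _ in mismatches]
        values["mismatch_symbol"] = [s for _, s in mismatches]
    return values


def build_access_units(reference: str, aligned_reads: List[Tuple[int, str]]) -> Dict[Tuple[int, str], Dict[str, bytes]]:
    """Return one toy Access Unit per (bin, class), each holding compressed descriptor columns."""
    columns = defaultdict(lambda: defaultdict(list))
    for start, seq in aligned_reads:
        key = (start // BIN_SIZE, classify(reference, start, seq))  # steps 1 and 2
        for name, vals in descriptors(reference, start, seq).items():  # steps 3 and 4
            columns[key][name].extend(vals)
    return {  # steps 5 and 6: compress each column, group columns per bin and class
        key: {name: zlib.compress(repr(vals).encode()) for name, vals in cols.items()}
        for key, cols in columns.items()
    }


reference = "ACGTACGTACGTACGT"
reads = [(0, "ACGT"), (2, "GTAC"), (9, "CGTT"), (10, "GTAC")]
access_units = build_access_units(reference, reads)
```

Because each Access Unit is keyed by bin and class, an application interested in, say, the variants of one region only needs to decompress the class M unit of the corresponding bin, which is the kind of selective access the structure is meant to enable.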

The benefits of MPEG-G

Compression is beneficial but is not necessarily the only or primary benefit. More important is the fact that while designing compression, MPEG has given a structure to the information. In MPEG-G the structure is provided by Part 1 (File and transport) and by Part 2 (Compression).

In MPEG-G most information relevant to applications is immediately accessible, locally and, more importantly, also remotely, without the need to download the entire file to access the information of interest. Part 3 (Application Programming Interfaces) makes this fast access even more convenient because it facilitates the work of developers of genomics applications who may not have in-depth knowledge of the – certainly complex – MPEG-G standard.


In the best MPEG tradition, MPEG-G is a generic standard, i.e. a standard that can be employed in a wide variety of applications that require small footprint of and fast access to genomic information.

A certainly incomplete list includes: Assistance to medical doctors’ decisions; Lifetime Genetic Testing; Personal DNA mapping on demand; Personal design of pharmaceuticals; Analysis of immune repertoire; Characterisation of micro-organisms living in the human host; Mapping of micro-organisms in the environment (e.g. biodiversity).

Standards are living beings, but MPEG standards have a DNA that allows them to grow and evolve to cope with the manifold needs of their ever-growing number of users.

I look forward to welcoming new communities in the big family of MPEG users.
