Moving intelligence around


Artificial intelligence has reached the attention of mass media, and the technologies supporting it – Neural Networks (NN) – are being deployed in several contexts affecting end users, e.g. in their smart phones.

If a NN is used locally, it is possible to use existing digital representations of NNs (e.g., NNEF, ONNX). However, these formats lack vital features for distributing intelligence, such as compression, scalability and incremental updates.

To appreciate the need for compression, let’s consider the case of adjusting the automatic mode of a camera based on recognition of the scene/object obtained by using a properly trained NN. As this area is intensely investigated, very soon there will be a new, better trained version of the NN or a new NN with additional features. However, as the process to create the necessary “intelligence” usually takes time and labor (skilled and unskilled), in most cases the newly created intelligence must be moved from the center to where the user handset is. With today’s NNs reaching a size of several hundred Mbytes and growing, a scenario where millions of users clog the network because they are all downloading the latest NN with great new features looks likely.
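To get a feel for the scale of the problem, here is a back-of-envelope sketch. The model size (300 Mbytes), the number of handsets (10 million) and the compression ratios are illustrative assumptions picked to match the orders of magnitude mentioned above; they are not MPEG figures.

```python
# Back-of-envelope estimate of the traffic caused by one NN update.
# All numbers below are illustrative assumptions, not measurements.

MODEL_SIZE_MBYTES = 300      # assumed: "several hundred Mbytes"
NUM_HANDSETS = 10_000_000    # assumed: "millions of users"


def update_traffic_tbytes(model_mbytes: float, handsets: int,
                          compression_ratio: float = 1.0) -> float:
    """Total traffic in Tbytes if every handset downloads the update once."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    total_mbytes = model_mbytes * handsets / compression_ratio
    return total_mbytes / 1_000_000  # 1 Tbyte = 10^6 Mbytes


if __name__ == "__main__":
    for ratio in (1, 10, 50):  # hypothetical compression ratios
        tbytes = update_traffic_tbytes(MODEL_SIZE_MBYTES, NUM_HANDSETS, ratio)
        print(f"compression {ratio:>2}:1 -> {tbytes:,.0f} Tbytes")
```

Under these assumptions a single uncompressed update moves about 3,000 Tbytes across the network, and every factor gained in compression reduces that figure proportionally.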

This article describes some elements of the MPEG work plan to develop one or more standards that enable compression of neural networks. Those wishing to know more can read the Use cases and Requirements and the Call for Proposals.

About Neural Networks

A Neural Network is a system composed of connected nodes, each of which can:

  1. Receive input signals from other nodes,
  2. Process them and
  3. Transmit an output signal to other nodes.

Nodes are typically aggregated into layers, each performing different functions. Typically the “first layers” are rather specific to the type of signal (audio, video, various forms of text information etc.). Nodes can send signals to subsequent layers but, depending on the type of network, also to the preceding layers.

Training is the process of “teaching” a network to do a particular job, e.g. recognising a particular object or a particular word. This is done by presenting to the NN data from which it can “learn”. Inference is the process of presenting to a trained network new data to get a response about what the new data is.

When is NN compression useful?

Compression is useful whenever there is a need to distribute NNs to remotely located devices. Depending on the specific use case, compression should be accompanied by other features. Two major use cases are analysed below.

Public surveillance

In 2009 MPEG developed the Surveillance Application Format. This is a standard that specifies the package (file format) containing audio, video and metadata to be transmitted to a surveillance center. Today, however, it is possible to ask the surveillance network to do more intelligent things by distributing intelligence even down to the level of visual and audio sensors.

For these more advanced scenarios MPEG is developing a suite of specifications under the title of Internet of Media Things (IoMT), where Media Things (MThing) are the media “versions” of IoT’s Things. The IoMT standard (ISO/IEC 23093) will reach FDIS level in March 2019.

The IoMT reference model is represented in the figure

IoMT standardises the following interfaces:

1: User commands (setup info.) between a system manager and an MThing

1’: User commands forwarded by an MThing to another MThing, possibly in a modified form (e.g., subset of 1)

2: Sensed data (Raw or processed data in the form of just compressed data or resulting from a semantic extraction) and actuation information

2’: Wrapped interface 2 (e.g. for transmission)

3: MThing characteristics, discovery

IoMT is neutral as to the type of semantic extraction or, more generally, to the nature of the intelligence actually present in the cameras. However, as NNs are demonstrating better and better results for visual pattern recognition, such as object detection, object tracking and action recognition, cameras can be equipped with NNs capable of processing the captured information to achieve a level of understanding and of transmitting that understanding through interface 2.

Therefore, one can imagine that re-trained or brand new NNs can be regularly uploaded to a server that distributes NNs to surveillance cameras. Distribution need not be uniform since different neural networks may be needed at different areas, depending on the tasks that need to be specifically carried out at given areas.

NN compression is a vitally important technology to make the described scenarios real because an automatic surveillance system may use many cameras (thousands or even millions of units) and because, as the technology to create NNs matures, the time between NN updates will progressively become shorter.

Distribution of NN-based apps to devices

There are many cases where compression is useful to efficiently distribute heavy NN-based apps to a large number of devices, in particular mobile ones. Here three cases are considered.

  1. Visual apps. Updating a NN-based camera app in one’s mobile handset will soon become commonplace. Ditto for the many conceivable applications where the smart phone understands some of the objects in the world around it. Both will happen with increasing frequency.
  2. Machine translation (speech-to-text, translation, text-to-speech). NN-based translation apps already exist and their number, efficiency, and language support can only increase.
  3. Adaptive streaming. As AI-based methods can improve the QoE, the coded representation of NNs can initially be made available to clients prior to streaming while updates can be made during streaming to enable better adaptation decisions, i.e. better QoE.


The MPEG Call for Proposals identifies a number of requirements that a compressed neural network should satisfy. Even though not all applications need the support of all requirements, the NN compression algorithm must eventually be able to support all the identified requirements.

  1. Compression shall have a lossless mode, i.e. the performance of the compressed NN is exactly the same as the uncompressed NN
  2. Compression shall have a lossy mode, i.e. the performance of the decompressed NN can differ from the performance of the uncompressed NN, in exchange for more compression
  3. Compression shall be scalable, i.e. even if only a subset of the compressed NN is used, there is still a level of performance
  4. Compression shall support incremental updates, i.e. as more data are received the performance of NN improves
  5. Decompression shall be possible with limited resources, i.e. with limited processing performance and memory
  6. Compression shall be error resilient, i.e. if an error occurs during transmission, the file is not lost
  7. Compression shall be robust to interference, i.e. it is possible to detect that the compressed NN has been tampered with
  8. Compression shall be possible even if there is no access to the original training data
  9. Inference shall be possible using compressed NN
  10. It shall be possible to use intelligence from multiple providers to extend the performance of a NN
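Requirements 1 and 2 can be illustrated with a minimal sketch. It uses zlib and uniform 8-bit quantisation purely as stand-ins chosen for illustration; they are not the technologies of any MPEG standard, and all function names are hypothetical. Note also that the sketch measures reconstruction error of the weights, whereas the requirements are stated in terms of NN performance.

```python
# Sketch of a lossless and a lossy mode for NN weights.
# zlib and 8-bit uniform quantisation are illustrative stand-ins only.
import random
import struct
import zlib


def pack_weights(weights):
    """Serialise weights as 32-bit little-endian floats."""
    return struct.pack(f"<{len(weights)}f", *weights)


def compress_lossless(weights):
    """Lossless mode: entropy-code the exact float32 bytes."""
    return zlib.compress(pack_weights(weights), 9)


def decompress_lossless(blob):
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


def compress_lossy(weights):
    """Lossy mode: quantise each weight to one of 256 levels, then zlib."""
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / 255 or 1.0  # fall back to 1.0 if all weights are equal
    codes = bytes(min(255, round((w - lo) / step)) for w in weights)
    header = struct.pack("<dd", lo, step)  # needed to dequantise
    return header + zlib.compress(codes, 9)


def decompress_lossy(blob):
    lo, step = struct.unpack("<dd", blob[:16])
    codes = zlib.decompress(blob[16:])
    return [lo + c * step for c in codes]


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical weights, rounded to float32 so the lossless path is exact.
    raw = [random.gauss(0.0, 0.1) for _ in range(10_000)]
    weights = list(struct.unpack(f"<{len(raw)}f", pack_weights(raw)))

    lossless = compress_lossless(weights)
    lossy = compress_lossy(weights)
    max_err = max(abs(a - b) for a, b in zip(weights, decompress_lossy(lossy)))

    print("uncompressed bytes:", 4 * len(weights))
    print("lossless bytes:    ", len(lossless))
    print("lossy bytes:       ", len(lossy))
    print("lossless exact:    ", decompress_lossless(lossless) == weights)
    print("lossy max error:   ", max_err)
```

The lossless path returns exactly the float32 values it was given, while the lossy path trades a bounded reconstruction error (at most half a quantisation step) for a much smaller file.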


The currently published Call for Proposals is not requesting technologies for all requirements listed above (which are themselves a subset of all identified requirements). It is expected, however, that the responses to the CfP will provide enough technology to produce a base layer standard that will help the industry take its first steps in this exciting field that will shape the way intelligence is added to things near to all of us.


More standards – more successes – more failures


I have seen people ask the question: MPEG makes many very successful standards but many are not widely used. Why do you make so many standards?

I know they ask this question because they dare not ask this other question “Why don’t you make just the good standards?”. They do not do it because they know that the easy answer would be the famous phrase attributed to John Wanamaker: “Half the money I spend on advertising is wasted; the trouble is I don’t know which half”.

In this article I do not want to brush off a serious question with an aphorism. MPEG deciding to develop a standard and a company deciding to develop a product face similar problems.

Therefore, I will first compare the processes to develop a company product and an MPEG standard, highlighting the similarities and the differences. Then I will analyse some successes and failures of MPEG standards. I will also explain how MPEG can turn standards that seem doomed to failure into unexpected successes.

Those looking for the perfect recipe that will lead to only successful standards should look at Mr. Wanamaker’s epigones, in the hope they have found an answer to his question.

Standards as products

A standard is the product that MPEG delivers to its customers. I would first like to show that the process used by a company to develop a product is somewhat aligned with the MPEG process of standard development – with some remarkable differences.

Let’s see how a company could decide to make a new product:

  1. A new product idea is proposed
  2. Product idea is supported by market studies
  3. Technology is available/accessible to make the product
  4. Design resources are available
  5. “Product board” approves the project
  6. Design is developed.

Let us see the corresponding work flow of an MPEG standard (see How does MPEG actually work for more details about the process):

  1. An idea is proposed/discussed at a meeting
  2. Idea is clarified in context and objectives
  3. Use cases of the idea are developed
  4. Requirements are derived from use cases
  5. A Call for Evidence (CfE) is issued to check that technologies meeting the requirements exist
  6. A Call for Proposals (CfP) is issued to make the necessary technologies available to the committee
  7. National Bodies (NB) approve the project
  8. The standard is developed.

Let us compare and align the two processes because there are significant differences next to similarities:

#  | Company product steps                | MPEG standard steps
1  | A new product idea is proposed       | Idea is aired/proposed at a meeting
2  | Market studies support product       | Context & objectives of idea drafted; Use cases developed
3  | Product requirements are developed   | Requirements derived from use cases
4  | Technology is available/accessible   | Call for Evidence is issued; Call for Proposals is issued
5  | Design resources are available       | MPEG looks for those interested
6  | “Product board” approves the product | NBs approve the project
7  | Design is developed                  | Test Model developed; Core Experiments carried out; Working Drafts produced; Standard improved in NB balloting
8  | The design is approved               | NBs approve the standard

Comparing products and standards

With reference to the table above, the following comparison can be made:

  1. Product proposal: process is hard to compare. Any company has its own process. In MPEG, proposals can come to the fore spontaneously from any member.
  2. Proposal justification: process is hard to compare. Any company has its own specific means to assess the viability of a proposed new product. In MPEG, when enough support exists, first the context in which the idea would be applied and for what purposes is documented. Then MPEG develops use cases to prove that a standard implementing the idea would support the use cases better than it is possible today or make possible use cases that today are not. As an entity, MPEG does not make “market studies” (because it does not have the means). It relies instead on members bringing relevant information into the committee when “context and objectives” and “use cases” are developed.
  3. Requirements definition: happens under different names and processes in companies and in MPEG.
  4. Technology availability is quite different. A company often owns a technology as a result of some R&D effort. If it does not have a technology for a product, it either develops it or acquires it. MPEG, too, does “own” a body of technologies, but typically a new proposal requires new technology. While MPEG members may know that a technology is actually available, they may not be allowed to talk about it. Therefore, MPEG needs in general two steps: 1) to become aware of technology (via CfE) and 2) to have the technology available (via CfP). In some cases, like in Systems standards, MPEG members may develop the technology collaboratively from a clean sheet of paper.
  5. Design resource availability is very different in the two environments. If a company sees a product opportunity, it has the means to deploy the appropriate resources (well, that also depends on internal product advocates’ influence). If MPEG sees a standard opportunity, it has no means to “command” members to do something because members report to their companies, not to MPEG. It would be great if some MPEG members who insist on MPEG pursuing certain opportunities without offering resources to achieve them understood this.
  6. Product approval: is very different in the two environments. Companies have their own internal processes to approve products. In MPEG the project for a new standard is approved by the shareholders, i.e. by the NBs, the simple majority of which must approve the project and a minimum of five NBs must commit resources to execute it.
  7. Design development: is very different in the two environments. Companies have their own internal processes to design a new product. In MPEG work obviously stops at the design phase but it entails the following steps: 1) Test Model creation, 2) Core Experiments execution, 3) Working Drafts development and 4) Standard improvement through NB balloting.
  8. Design approval: is very different in the two environments. Companies have their own internal processes to approve the design of a new product. In MPEG, again, the shareholders, i.e. the NBs, approve the standard with a qualified majority.
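The project approval rule in item 6 (a simple majority of NBs in favour, and at least five NBs committing resources) can be sketched as a small check. The function name is hypothetical, and the way the majority is counted (over votes cast, ignoring abstentions) is an assumption made here for illustration, not a statement of the actual ISO balloting rules.

```python
# Sketch of the project-approval rule: simple majority of NBs in favour
# AND at least five NBs committing resources to the work.


def project_approved(votes_for: int, votes_against: int,
                     nbs_committing_resources: int,
                     min_committed: int = 5) -> bool:
    """Return True if both approval conditions are met.

    Assumption: the majority is computed over votes cast
    (for + against); abstentions are not modelled.
    """
    votes_cast = votes_for + votes_against
    simple_majority = votes_cast > 0 and votes_for * 2 > votes_cast
    enough_resources = nbs_committing_resources >= min_committed
    return simple_majority and enough_resources


if __name__ == "__main__":
    print(project_approved(12, 8, 5))   # majority and five committed NBs
    print(project_approved(12, 8, 4))   # majority but too few committed NBs
    print(project_approved(10, 10, 6))  # a tie is not a simple majority
```

The second condition is what distinguishes this from a plain vote: a project can be popular and still not start if too few NBs are willing to send experts.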

What is certainly common in the two processes is that the market response to the company product or to the MPEG standard is anybody’s guess. Some products/standards are widely successful, some fare so-so, and some are simply rejected by the market. Companies have resources that allow them to put in place other strategies to reduce the number of failures, but it is a reality that even companies that are darlings of the market stumble from time to time.

MPEG is no exception.

How to cope with uncertainty

MPEG, being an organisation whose basis of operation is consensus, has advantages and disadvantages compared to a company. Let us see now how MPEG has managed the uncertainty surrounding its standards.

An interesting case is MPEG-1. The project was driven by the idea of video interactivity on CD and digital audio broadcasting. MPEG-1 did not achieve commercial success in either target. However, Video CD, not even on the radar when MPEG-1 was started, used MPEG-1 and sold 1 billion units (and tens of billions of CDs). MP3, too, was not on the radar when MPEG-1 was approved, and some members even argued against the inclusion of such a “complex” technology into the standard. I doubt there is anybody now regretting the decision to make MP3 part of the MPEG-1 standard. If there is, it is for completely different reasons. The reason why the standard was eventually successful is that MPEG-1 was designed as a system (VCD is exactly that), but its parts were designed to be usable as stand-alone components (as in MP3).

The second case is MPEG-2. The project was driven by the idea of making television digital. When the first 3 MPEG-2 parts (Systems-Video-Audio) were consolidated, the possibility to use MPEG-2 for interactive video services on the telecom and cable networks became real. MPEG-2 Audio did not fare well in broadcasting (the demand for multichannel was also not there), but it did fare well in other domains. In any case many thought that MPEG-1 Audio delivered just enough. MPEG-2 AAC did fare well in broadcasting and laid the ground for the 20-year long MPEG-4 Audio ride. MPEG started the Digital Storage Media Command and Control (DSM-CC) standard (part 6 of MPEG-2). The DSM-CC carousel is used in broadcasting because it provides the means for a set top box to access various types of information that a broadcaster sends/updates at regular intervals.

MPEG-4 is rich in relevant examples. The MPEG-4 model was a 3D scene populated by “objects” that could be 1) static or dynamic, 2) natural or synthetic, 3) audio or visual in any combination. BIFS (the MPEG name for the 3D scene technology, an extension of VRML) did not fly (but VRML did not fly either). However, 10 years later the Korea-originated Digital Multimedia Broadcasting technology, which used BIFS scaled down to 2D, had a significant success in radio broadcasting.

Much of the MPEG-4 video work was driven by the idea of video “objects” which, along with BIFS, did not fly (the standard specified video objects but did not say how to make them, because that was an encoder issue). For a few years, MPEG-4 video was used in various environments. Unfortunately the main use – video streaming – was stopped by the “content fees” clause of the licensing terms. Part 10 of MPEG-4 Advanced Video Coding (AVC) was very successful, especially because patent holders did not repeat some of the mistakes they had made for MPEG-4 Visual. None of the 3 “royalty free” (Option 1 in ISO language) MPEG-4 video coding standards did fly, showing that in ISO today it is not practically possible to make a media-related standard that does not require onerous licensing of third party technology.

The MPEG-4 Parametric coding for high-quality audio did not fly, but a particular tool in it – Parametric Stereo (PS) – could very efficiently encode stereo music as a mono signal plus a small amount of side-information. MPEG combined the PS tool with HE-AAC and produced HE-AAC v2, an audio decoder that is on board of billions of mobile handsets today as it enables transmission of a stereo signal at 32 kb/s with very good audio quality.

For most MPEG standards, the reference model is the figure below

Different groups with different competences develop the different parts of a standard. Some parts are designed to work together with others in systems identified in the Context-Objectives-Use cases phase. However, the parts are not tightly bound because in general it is possible to use them separately.

The MPEG-7 project was driven by the idea of a world rich in audio-video-multimedia descriptors that would allow users to navigate the large amount of media content expected at that time and that we have today. Content descriptors were expressed in verbose XML, a tool at odds with the MPEG bit-thrifty approach. So MPEG developed the first standard for XML compression, a technology adopted in many fields.

Of MPEG-A, the Common Media Application Format (CMAF) standard is remarkable. Several technologies drawn from different MPEG standards are integrated to efficiently deliver large scale, possibly protected, video applications, e.g. streaming of televised events. CMAF Segments can be delivered once to edge servers in content delivery networks, then accessed from cache by thousands of streaming video players without additional network backbone traffic or transmission delay.

MPEG-V – Media context and control is another typical example. The work was initiated in the wake of the success of Second Life, a service that looked like it could take over the world. The purpose of part 4 of MPEG-V, Virtual world object characteristics, was not to standardise a Second Life like service but the interfaces that would allow a user to move assets from one virtual space to another virtual space. The number of Second Life users dived and part 4 never took off. Other parts of MPEG-V concern formats and interfaces to enrich the audio-visual user experience with, say, a breeze when there is a little wind in the movie, a smell when you are in a field of violets etc. So far, this apparently interesting extension of the user experience did not fly, but MPEG-V provides a very solid communication framework for sensors and actuators that finds use in other standards.

The MPEG-H MPEG Media Transport (MMT) project showed how it is possible to innovate without destabilising existing markets. MPEG-2 Transport Stream (TS) has been in use for 25 years (and MPEG has received an Emmy for that) and will continue to be used for the foreseeable future. But MPEG-2 TS shows the signs of time because it was designed for a one-way channel – an obvious choice 25 years ago – while so much video distribution today happens on two-way channels. MMT uses IP transport instead of MPEG-2 TS transport and achieves content delivery unification in both one-way and two-way distribution channels.

Is MPEG in the research business?

The simple and flat answer is NO. However, MPEG CfPs are great promoters of corporate research because they push companies to improve their technologies to enable them to make successful proposals in response to CfPs.

One of the reasons for MPEG’s success, but also for the difficulties highlighted in this article, is that, in the MPEG domain, standardisation is a process closer to research than to product design.

Roughly speaking, in the MPEG standardisation process, research happens in two phases: in the companies, in preparation for CfEs or CfPs (MPEG calls this competitive phase) and in what MPEG calls collaborative phase, i.e. during the development of Core Experiments (of course this research phase is still done by the companies, but in the coordinating framework of an MPEG standard).

The power of the MPEG competitive phase lies in the fact that MPEG receives many submissions from respondents to a CfP and pools together the component technologies. Therefore, the MPEG “product” has a much better performance than any autarchic product developed by an independent company because it uses many good technologies from many more companies than a single company could draw on.

Actually, the improvement is even greater because the MPEG collaborative phase offers another opportunity to do more research. This has a much more limited scope because it is in the context of optimising a subset of the entire scope of the standard, but the sum of many small optimisations can provide big gains in performance. The shortcoming of this process is the possible introduction of a large number of IP items for a gain that some may well consider not to justify the added IP onus and complexity.

With its MPEG-5 project MPEG is trying to see if a suitably placed lower limit to performance improvements can help solve the problems identified in the HEVC standard.


MPEG has a large number of successful standards. For many of them the unit of measure is billions of units, be they hardware, software or firmware.

MPEG has also had failures. The reasons for these can be manifold. One that is often quoted by outsiders is “the standard was too technology driven”. Of course technology plays an important part in MPEG standards. But then, what should MPEG do? Stop standards that are too technology driven? And how much is too much?

Excluding technology would be a mistake, as I will show in two examples. If MPEG had done that, we would not have MP3: in 1992 Layer 3 was a costly appendix to Layers 1 and 2, which already did a good job. If we had done that, we would not have Point Cloud Compression (now at CD level), a standard that industry is dying for today. Sure, MPEG should establish firmer contacts with market needs, but the necessary expertise can only be provided by companies sending experts to MPEG.

MPEG needs more market, but do not expect that more market will necessarily have miraculous effects. The basic logic that has guided MPEG when making a decision on a standard has been “if there is a legitimate request to have a standard (within the constraints of 50% + 1 approval and 5 countries willing to provide experts to do the work), we do it”. More market information can certainly be useful to articulate a complete proposal and add more evidence at the time shareholders (NBs) vote.

MPEG’s value is in its capability to produce standards that anticipate the needs of the market in a process that may take years from the time an idea is launched to the time the standard is produced. People in our age are volatile and so are the markets. In comparison technology is stable.


Thirty years of audio coding and counting


Obviously, the electrical representation of sound information happened before the electrical representation of visual information and so did the services that used that representation to distribute sound information. The digital representation of audio, too, happened at different times than video’s. In the early 1980s the Compact Disc (CD) allowed record companies to distribute digital audio for the consumer market, while the D1 digital tape, available in the late 1980’s, was for the exclusive use of professional applications such as in the studio. Compression technologies reversed the order: compressed digital video happened before compressed digital audio by some 10 years. Therefore, unlike the title of the article Forty years of video coding and counting, the title of this post is Thirty years of audio coding and counting.

This statement can become a source of dispute if a proper definition of Audio is not adopted. In this article by Audio we mean sound in the human audible range that is not generated by a human phonatory system, or that comes from any other sound source for which a sound production model is not available or not used. Indeed digital speech happened in professional applications (trunk network) some 20 years before the CD. ITU-T G.721 “32 kbit/s adaptive differential pulse code modulation (ADPCM)” dates back to 1984, the same year H.120 was approved as a recommendation.

Therefore the title of this article could very well have been Forty years of audio coding and counting. This would have come at the cost of a large number of speech compression standards and this article would have been overwhelmed by them. Therefore this article will only deal with audio compression standards where audio does not include speech. With one exception that will be mentioned later, I mean.

Unlike video compression, where ITU-T is the non-MPEG body that develops video coding standards, in audio compression MPEG dominance is total. Indeed ITU-R, which does need audio compression for its digital audio broadcasting standards, prefers to rely on external sources, including MPEG.

MPEG-1 Audio

Those interested in knowing why and how a group – MPEG – working in video compression ended up also working on audio compression (and a few more other things) can look here. The kick off of the MPEG Audio group took place on 1-2 December 1988, when, in line with a tradition that at that time had not been fully established yet, a most diverse group of audio coding experts met in Hannover and kick-started the work that eventually gave rise to the MPEG-1 Audio standard released by MPEG in November 1992.

The Audio group in MPEG is very often the forerunner of things to come. In this instance the first is that while the broadcasting world shunned the low resolution MPEG-1 Video compression standard, it very much valued the MPEG-1 Audio compression standard. The second is that, unlike video, where all proposals relied on essentially the same coding architecture, the Audio Call for Proposals had yielded two classes of algorithms: one that was well established and easy to implement but lower performing, and another that was more recent and harder to implement (at that time) but better performing. The work to merge the two technologies was painstaking but eventually the standard included 3 layers (a notion later called profiles) where both technologies were used.

Layer 1 was used in Digital Compact Cassette (DCC), a product discontinued a few years later; Layer 2 was used in audio broadcasting and as the audio component of Video CD (VCD). Layer 3 (MP3) does not need a particular introduction 😉. As revised in the subsequent MPEG-2 effort, MP3 provided a user experience with no perceivable difference as compared to the original CD signal for most content at 128 kbit/s from a CD source of 1.41 Mbit/s, i.e. with a compression of 11:1.
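The compression ratio quoted above can be checked from the CD parameters (44.1 kHz sampling, 16 bits per sample, two channels). These are well-known properties of the CD format rather than figures taken from this article, and the function names are only illustrative.

```python
# Compression ratio of a coded audio stream relative to the CD signal.

CD_SAMPLE_RATE_HZ = 44_100
CD_BITS_PER_SAMPLE = 16
CD_CHANNELS = 2


def cd_bitrate_bps() -> int:
    """Bitrate of uncompressed stereo CD audio in bit/s."""
    return CD_SAMPLE_RATE_HZ * CD_BITS_PER_SAMPLE * CD_CHANNELS


def compression_ratio(coded_kbps: float) -> float:
    """Ratio between the CD bitrate and a coded bitrate given in kbit/s."""
    return cd_bitrate_bps() / (coded_kbps * 1000)


if __name__ == "__main__":
    print(f"CD bitrate: {cd_bitrate_bps() / 1e6:.2f} Mbit/s")
    print(f"MP3 at 128 kbit/s: {compression_ratio(128):.1f}:1")
```

The CD bitrate comes out at 1,411,200 bit/s, so 128 kbit/s corresponds to a ratio of roughly 11:1.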

MPEG-2 Audio

The main goal of this standard, approved in 1994, was multi-channel audio with the key requirement that an MPEG-1 Audio decoder should be able to decode a stereo component of an MPEG-2 Audio bitstream. Backward compatibility is particularly useful in the broadcasting world because an operator can upgrade to a multi-channel service without losing the customers who only have an MPEG-1 Audio decoder.


Work on MPEG-2 Advanced Audio Coding (AAC) was motivated by the request of those who wished to provide the best possible audio quality without backward compatibility constraints. Such constraints meant, for instance, that a Layer 2 decoder must decode both Layer 1 and Layer 2, and a Layer 3 decoder must decode all layers. MPEG-2 AAC, released in April 1997, is built upon the MP3 technology and can provide perceptually transparent audio quality at 128 kbit/s for a stereo signal, and 320 kbit/s for a 5.1 channel signal (i.e. as in digital television).


In 1998 MPEG-4 Audio was released with the other 2 MPEG-4 components – Systems and Visual. Again MPEG-4 AAC is built on MPEG-2 AAC. The dominating role of MP3 in music distribution was shaken in 2003 when Apple announced that its iTunes and iPod products would use MPEG-4 AAC as primary audio compression algorithm. Most PCs, smart phones and later tablets could play AAC songs. Far from using AAC as a pure player technology, Apple started the iTunes service that provides songs in AAC format packaged in the MPEG-4 File Format, with filename extension “.m4a”.


In 1999 MPEG released MPEG-4 amendment 1 with a low delay version of AAC, called Low Delay AAC (AAC-LD). While a typical AAC encoder/decoder has a one-way latency of ~55 ms (transform delay plus look-ahead processing), AAC-LD achieves a one-way latency of only 21 ms by simplifying and replacing some AAC tools (new transform with lower latency and removal of look-ahead processing). AAC-LD can be used as a conversational codec, with a signal bandwidth and perceived quality of a music coder with excellent audio quality at 64 kb/s for a mono signal.


In 2003 MPEG released the MPEG-4 High Efficiency Advanced Audio Coding (HE-AAC), as amendment 1 to MPEG-4.  HE-AAC helped to consolidate the role of the mobile handset as the tool of choice to access very good audio quality stereo music at 48 kbit/s, more than a factor of 2.5 better than AAC, for a compression ratio of almost 30:1 relative to the CD signal.

HE-AAC adds the spectral bandwidth replication (SBR) tool to the core AAC compression engine. Since AAC was already widely deployed, this permitted extending this base to HE-AAC by only adding the SBR tool to existing AAC implementations.


In the same 2003, 9 months later, MPEG released the MPEG HE-AAC v2 profile. This originated from a tool contained in amendment 2 to MPEG-4 (Parametric coding for high-quality audio). While the core parametric coder did not enjoy wide adoption, the Parametric Stereo (PS) tool in the amendment could very efficiently encode stereo music as a mono signal plus a small amount of side-information. HE-AAC v2, the combination of the PS tool with HE-AAC, enabled transmission of a stereo signal at 32 kb/s with very good audio quality.

This profile was also adopted by 3GPP under the name Enhanced aacPlus. Adoption by 3GPP paved the way for HE-AAC v2 technology to be incorporated into mobile phones. Today, more than 10 billion mobile devices support streaming and playout of HE-AAC v2 format songs. Since HE-AAC is built on AAC, these phones also support streaming and playout of AAC format songs.


In 2005 MPEG released two algorithms for lossless compression of audio, MPEG Audio LosslesS coding (ALS) and Scalable to LosslesS coding (SLS). Both provide perfect (i.e. lossless) reconstruction of a standard Compact Disc audio signal with a compression ratio of approximately 2:1. An important feature of SLS is that it has a variable compression ratio: it can compress a stereo signal to 128 kb/s (11:1 compression ratio) with the excellent quality of an AAC codec, but it can also achieve lossless reconstruction at a compression ratio of 2:1 by increasing the coded bitrate (i.e. by decreasing the compression ratio) in a continuous fashion.

MPEG Surround

ALS/SLS were the last significant standards in MPEG-4 Audio, which is MPEG’s most long-lived audio standard. It was first issued in 1999 and, 20 years later (in 2019), MPEG is issuing its Fifth Edition.

After closing the “MPEG-4 era,” MPEG created the MPEG-D suite of audio compression standards. The first of these was MPEG Surround, issued in 2007. This technology is a generalisation of the PS tool of HE-AAC v2, in the sense that MPEG Surround can operate as a 5-to-2 channel compression tool or as an M-to-N channel compression tool. This “generalised PS” tool is followed by an HE-AAC codec. Therefore MPEG Surround builds on HE-AAC as much as HE-AAC builds on AAC. MPEG Surround provides very good compression while maintaining very good audio quality and low computational complexity. While HE-AAC can transmit stereo at 48 kbit/s, MPEG Surround can transmit 5.1 channel audio within the same 48 kbit/s transmission budget. The complexity is no greater than stereo HE-AAC’s. Hence MPEG Surround is a “drop-in” replacement for stereo services to extend them to 5.1 channel audio!


In 2007 MPEG released Enhanced Low Delay AAC (AAC-ELD) technology. This combines tools from other profiles: SBR and PS from the HE-AAC v2 profile, together with AAC-LD. The new codec provides even greater signal compression with only a modest increase in latency: AAC-ELD provides excellent audio quality at 48 kb/s for a mono signal with a one-way latency of only 32 ms.


In 2010 MPEG released MPEG-D Spatial Audio Object Coding (SAOC), which allows very efficient coding of a multi-channel signal that is a mix of objects (e.g. individual musical instruments). SAOC down-mixes the multi-channel signal, e.g. stereo to mono, codes and transmits the mono signal along with some side-information, and then up-mixes the received and decoded mono signal back to a stereo signal such that the user perceives the instruments to be placed at the correct positions and the resulting stereo signal to be the same as the original. This is done by exploiting the fact that, at any instant in time and in any frequency region, one of the instruments will tend to dominate the others, so that in this time/frequency region the other signals will be perceived with much less acuity, if at all. SAOC analyses the input signal, divides each channel into time and frequency “tiles” and then decides to what extent each tile dominates. This is coded as side information.

An example SAOC application is teleconferencing, in which a multi-location conference call can be mixed at the conference bridge down to a single channel and transmitted to each conference participant, along with the SAOC side information. At the user’s terminal, the mono channel is up-mixed to stereo (or 3 channels – Left-Center-Right) and presented such that each remote conference participant is at a distinct location in the front sound stage.
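
The tile-based analysis can be illustrated with a toy example. The sketch below is not the SAOC algorithm. It assumes that the energy of each object in every time/frequency tile is already known (the hypothetical `tile_energies` input) and computes each object's share of the energy per tile, which is the kind of quantity that could travel as side information next to the down-mix. All names are illustrative.

```python
from typing import Dict, List

# tile_energies[name][t][f] = energy of object `name` in time slot t, band f.
# This layout is an assumption made for the example.
TileGrid = List[List[float]]


def energy_shares(tile_energies: Dict[str, TileGrid]) -> Dict[str, TileGrid]:
    """Return, for every tile, the fraction of total energy due to each object."""
    names = list(tile_energies)
    n_time = len(tile_energies[names[0]])
    n_freq = len(tile_energies[names[0]][0])
    shares = {name: [[0.0] * n_freq for _ in range(n_time)] for name in names}
    for t in range(n_time):
        for f in range(n_freq):
            total = sum(tile_energies[name][t][f] for name in names)
            if total > 0.0:  # a silent tile keeps a share of 0.0 for everyone
                for name in names:
                    shares[name][t][f] = tile_energies[name][t][f] / total
    return shares


def dominant_object(shares: Dict[str, TileGrid], t: int, f: int) -> str:
    """Name of the object with the largest energy share in tile (t, f)."""
    return max(shares, key=lambda name: shares[name][t][f])


if __name__ == "__main__":
    energies = {
        "guitar": [[9.0, 1.0]],  # one time slot, two frequency bands
        "voice": [[1.0, 3.0]],
    }
    shares = energy_shares(energies)
    print(shares["guitar"])               # [[0.9, 0.25]]
    print(dominant_object(shares, 0, 1))  # voice
```

In this toy case the guitar dominates the first band and the voice the second, so a decoder that only has the mono down-mix and these shares could assign most of each tile to the dominant object.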


Unified Speech and Audio Coding (USAC), released in 2011, combines the tools for speech coding and audio coding into one algorithm. USAC combines the tools from MPEG AAC (exploiting the means of human perception of audio) with the tools from a state-of-the-art speech coder (exploiting the means of human production of speech). Therefore, the encoder has both a perceptual model and a speech excitation/vocal tract model and dynamically selects the music/speech coding tools every 20 ms. In this way USAC achieves a high level of performance for any input signal, be it music, speech or a mix of speech and music.

In the tradition of MPEG standards, USAC extends the range of “good” performance down to as low as 16 kb/s for a stereo signal and provides higher quality as the bitrate is increased. The quality at 128 kbit/s for a stereo signal is slightly better than that of MPEG-4 AAC. USAC can therefore replace AAC: its performance is equal to or better than AAC’s at all bit rates, it can similarly code multichannel audio signals, and it can also optimally encode speech content.


MPEG-D Dynamic Range Control (DRC) is a technology that gives the listener the ability to control the audio level. It can be a post-processor for every MPEG audio coding technology and modifies the dynamic range of the decoded signal as it is being played. It can be used to reduce the loudest part of a movie so as not to disturb your neighbours, to make the quiet portions of the audio louder in hostile audio environments (car, bus, room with many people), or to match the dynamics of the audio to that of a smart phone speaker output, which typically has very limited dynamic range. The DRC standard also plays the very important function of normalizing the loudness of the audio output signal, which may be mandated in some regulatory environments. DRC was released in 2015 and extended in 2017 as Amendment 1 Parametric DRC, gain mapping and equalization tools.
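
To make the idea of reducing the loudest parts concrete, here is a minimal sketch of a generic static compression curve. It is not the MPEG-D DRC gain format or algorithm. The threshold of -20 dB and the ratio of 4:1 are illustrative assumptions, as are the function names.

```python
def drc_gain_db(level_db: float,
                threshold_db: float = -20.0,
                ratio: float = 4.0) -> float:
    """Gain in dB to apply to a signal whose current level is `level_db`.

    Above the threshold, every `ratio` dB of input level produce only 1 dB
    of output level. At or below the threshold the signal is left untouched.
    """
    if ratio < 1.0:
        raise ValueError("ratio must be >= 1")
    if level_db <= threshold_db:
        return 0.0
    excess_db = level_db - threshold_db
    # Desired output excess is excess_db / ratio; the gain is the difference.
    return excess_db / ratio - excess_db


def apply_gain(sample: float, gain_db: float) -> float:
    """Scale a linear sample value by a gain expressed in dB."""
    return sample * 10.0 ** (gain_db / 20.0)


if __name__ == "__main__":
    print(drc_gain_db(0.0))    # -15.0: a 0 dB peak is pulled down to -15 dB
    print(drc_gain_db(-30.0))  # 0.0: quiet passages are not attenuated
```

Adding a constant make-up gain after this curve would raise the quiet portions relative to the loud ones, which corresponds to the noisy-environment use described above.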

3D Audio

MPEG-H 3D Audio, released in 2015, is part of the typical suite of MPEG tools: Systems, Video and Audio. It provides very efficient coding of immersive audio content, typically from 11 to 22 channels of content. The 3D Audio algorithms can actually process any mix of channels, objects and Higher Order Ambisonics (HOA) content, where objects are single-channel audio whose position can be dynamic in time and HOA can encode an entire sound scene as a multi-channel “HOA coefficient” signal.

Since 3D Audio content is immersive, it is conceived as being consumed as a 360-degree “movie” (i.e. video plus audio). The user sits at the center of a sphere (“sweet spot”) and the audio is decoded and presented so that the user perceives it to be coming from somewhere on the surrounding sphere. MPEG-H 3D Audio can also be presented via headphones, because not every consumer has an 11 or 22 channel listening space. Moreover, MPEG-H 3D Audio supports use of a default or personalised Head Related Transfer Function (HRTF) to allow the listener to perceive the audio content as if it came from sources all around the listener, just as it would when using loudspeakers. An added feature of 3D Audio playout to headphones is that the audio heard by the listener can remain at the “correct” position when the user turns his or her head. In other words, a sound that is “straight ahead” when the user is looking straight ahead is perceived as coming from the left if the user turns to look right. Hence, MPEG-H 3D Audio is already a nearly complete solution for Video 360 applications.
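
The head-turning behaviour reduces to a small angle computation. The sketch below is only an illustration and considers yaw alone. It assumes angles in degrees, with positive values to the listener's left (counter-clockwise seen from above). The function names are hypothetical.

```python
def wrap_degrees(angle: float) -> float:
    """Wrap an angle to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def azimuth_relative_to_head(source_azimuth: float, head_yaw: float) -> float:
    """Direction of a world-fixed source as seen from the rotated head.

    Both arguments are in degrees, positive to the left (assumed convention).
    """
    return wrap_degrees(source_azimuth - head_yaw)


if __name__ == "__main__":
    # Source straight ahead, user turns 90 degrees to the right (yaw = -90):
    print(azimuth_relative_to_head(0.0, -90.0))  # 90.0, i.e. from the left
```

A renderer would then pick the HRTF for the head-relative direction, so the source stays fixed in the world while the head moves.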

Immersive Audio

This activity (to be released as a standard sometime in 2021) is part of the emerging MPEG-I Immersive Audio standard. MPEG is still defining the requirements and functionality of this standard, which will support audio in Virtual and Augmented Reality applications. It will be based on MPEG-H 3D Audio, which already supports a 360 degree view of a virtual world from one listener position (“3 degrees of freedom” or 3DoF), in that the listener can turn his or her head left or right, up or down, or tilt it left or right (so-called “yaw, pitch, roll”). The Immersive Audio standard will add three additional degrees of freedom, i.e., permit the user to get up and walk around in the Virtual World. This additional movement is designated “x, y, z,” so that MPEG-I Immersive Audio supports 6 degrees of freedom (6DoF), which are “yaw, pitch, roll and x, y, z.” It is envisioned that MPEG-I Immersive Audio will use MPEG-H 3D Audio to compress the audio signals, and will specify additional metadata and technology so that the audio signals can be rendered in a fully flexible 6DoF way.
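
The difference between 3DoF and 6DoF can be sketched as a data structure. This is only an illustration of the terminology, not anything taken from the MPEG-I specification. The class names, the units (degrees and metres) and the helper function are assumptions.

```python
import math
from dataclasses import dataclass, field
from typing import Tuple


@dataclass(frozen=True)
class Orientation3DoF:
    """Head rotation only: the listener stays at one position."""
    yaw: float = 0.0    # degrees, turning left/right
    pitch: float = 0.0  # degrees, looking up/down
    roll: float = 0.0   # degrees, tilting left/right


@dataclass(frozen=True)
class Pose6DoF:
    """Head rotation plus translation in the virtual world."""
    orientation: Orientation3DoF = field(default_factory=Orientation3DoF)
    x: float = 0.0  # metres (assumed unit)
    y: float = 0.0
    z: float = 0.0


def distance_to_source(listener: Pose6DoF,
                       source_xyz: Tuple[float, float, float]) -> float:
    """Distance between the listener and a source; it changes only in 6DoF."""
    return math.dist((listener.x, listener.y, listener.z), source_xyz)


if __name__ == "__main__":
    walker = Pose6DoF(x=3.0, y=4.0)
    print(distance_to_source(walker, (0.0, 0.0, 0.0)))  # 5.0
```

With 3DoF the distance to every source is constant, so only the direction of arrival has to be updated. With 6DoF the renderer must also account for the changing distance, which is one reason additional metadata is needed.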


MPEG is proud of the work done by the Audio group. For 30 years the group has injected generations of audio coding standards into the market. In the best MPEG tradition, the standards are generic, in the sense that they can be used in audio-only or audio+video applications, and often scalable, with each new generation of audio coding standards building on previous ones.

This long ride is represented in the figure that ventures into the next step of the ride.

Today MPEG Audio already provides a realistic 3DoF experience in combination with MPEG Video standards. More will be needed to provide a complete and rewarding 6DoF experience, but MPEG’s ability to draw the necessary multi-domain expertise from its membership promises that the goal will be successfully achieved.


This article would not have been possible without the competent assistance – and memory – of Schuyler Quackenbush, the MPEG Audio Chair.
