Matching technology supply with demand

Introduction

There have always been people in need of technology and, most of the time, people ready to provide something in response to the demand. In book XVIII of Iliad, Thetis, Achilles’s mother, asks Hephestus, the god of, blacksmiths, craftsmen, artisans, sculptors and more, to provide a new armour to her son who had lost it to Hector. Hephestus duly complied. Still in the fictional domain, but in more recent years, Agent 007 visits Q Branch to get the latest gadgets for his next spy mission, which are inevitably put to good use in the mission.

Wars have always been times when the need for technologies stretches the ability to supply them. In our, supposedly peaceful, age, there are lots of technology around, but it is often difficult for companies needing a particular technology to find the solution matching their needs and budget.

Supply and demand in standardisation

Standardisation is an interesting case of an entity,  typically non-commercial and non-governmental, needing technologies to make a standard. Often standards organisations, too, need technologies to accomplish their mission. How can they access the needed technologies?

A few decades back, if industry needed a standard for, say, a video cassette recorder, the process was definitely supply-driven: a company who had developed a successful product (call it Sony or JVC) submitted a proposal (call it Betamax or VHS) to a standards committee (call it IEC SC 60B). Too bad if the process produced two standards.

In the second half of the 1980’s, ITU SG XV (ITU numbering of that time) started developing the H.261 recommendation. Experts developed the standard piece by piece in the committee (so-called Okubo group) by acquiring the necessary technologies in a process where the roles of demand and supply were rather blurred. Only participants were entitled to provide their technologies to fulfill the needs of the standard.

A handful of years later, MPEG further innovated technology procurement in a standardisation environment. To get the technologies needed to make a standard, it used a demand-driven tool – MPEG’s Call for Proposals (CfP). Since then, technologies provided by respondents and assessed for relevance by the group are used to 1) create the initial reference model (RM0) and 2) initiate a first round of Core Experiments (CE). CEs result from the agreement among participating experts that there is room for improving the performance of the standard under development by opening a particular area to optimisation. CEs are continued until available room for optimisation is exhausted. While anybody is entitled to respond to a CfP and contribute technology to RM0, only experts participating in the standardisation project can provide technology for CEs. This, however, is not really a limitation because the process is open to anybody wishing to join a recognised standards organisation who is a member of ISO.

The MPEG process of standards development has allowed the industry to maintain a sustained development and expansion for many years. In fairness, this is not entirely MPEG’s merit. Patent pools have played a synergistic role to MPEG’s by providing (industry) users with the means to practice MPEG standards and IP holders the means to be remunerated for the use of their IP.

The situation today

The HEVC case has shown that the cooperation of different parties to achieve the common goal of enabling the use of a standard is not discounted (see, e.g., A crisis, the causes and a solution). There are several reasons for this: the increasing number of individual technologies needed to make a high-performance MPEG standard, the increasing number of IP holders, the increasing number of Non-Performing Entities (NPE) as providers of technology and the increasing number of patent pools who stand as independent licence providers of a portion of a patent.

I have already made several proposals with the intention of helping MPEG from this stalemate (see, e.g., Business model based ISO/IEC standards). Here I would like to present an additional idea that extends the MPEG process of standards development (see How does MPEG actually work?).

A new process proposal

A possible implementation of the proposal applying to an entity (a company, an industry forum or a standards organisation) wishing to develop a specification or a standard, could run like this (some details could be fine tuned or changed on a case-by-case basis):

  1. An entity (a company, an industry forum or a standards organisation) wishes to develop a specification or a standard
  2. The entity issues a Call for Proposals (CfP) including requirements requesting proponents to to accept the process (as defined in this numbered list) and to commit to RAND licensing of their technologies for the specification or standard
  3. The entity assesses the proposals received
  4. The entity sets aside a certain amount of tokens for the entire standard (e.g. 1,000)
  5. The entity builds a “minimal” Reference Model zero (RM0) using technologies contained in the proposals in a conservative way so as to create ample space for a healthy Core Experiment (CE) process
  6. The entity
    1. Assigns a percentage of tokens to RM0
    2. Establishes the amount of tokens that will be given to the proponent who achieves the highest performance in a CE (e.g. 1 token for each 0.1% improvements)
  7. The entity identifies and publishes CEs at the pace required by the specification or standard
  8. For each CE the entity makes a call containing
    1. A description of the CE
    2. A minimum performance target (say, at least 1% improvement)
    3. A deadline for submission of 1) results and 2) code that proves that the target has been achieved
    4. The maximum level of associated complexity
  9. CE proponents should do due diligence and
    1. Make proposals that contain only their own technologies, or
    2. Ask any third party to join in the response (ad accept the conditions of the CfP)
  10. If the tokens are all used and there is still room for optimisation, new tokens are created and token holders have their tokens scaled down so that the total number of tokens is still 1,000
  11. If room for RM optimisation is exhausted but there are still tokens unassigned, token holders have their tokens scaled up so that the total number of tokens is 1,000.

Depending on the nature of the entity (company-industry forum-standards organisation) another entity, which can be the same entity who has managed the process or a patent pool

  1. Identifies who are IP holders in RM0 and CEs
  2. Removes technology in case the status of IP has not been clarified
  3. After completing steps 1 and 2, pays royalties to IP holders based on the number of tokens they have acquired in the process (i.e. RM0 and CEs)

Merits and limits of the proposal

The proposal achieves the goal to

  1. Associate patent holders to RM0 and CE areas as opposed to associate patent holders just to the standard
  2. Enablthe turning off of technologies of a CE area if this has unclear IP status and turning them on again if the status is clarified

In case the entity developing the specification is a standards organisation, more than one patent pool can develop a licence using the results of the process.

Conclusions

This idea was developed in collaboration with Malvika Rao, the founder of Incentives Research and holder of a PhD from Harvard University, and Don Marti, an open source expert and an advisor at Incentives Research.

Converting the basic concept described above into a workable market design requires further work. There may be opportunities to game the system, and the design must consider issues such as how to attract and retain participation. In addition the design must be tested (e.g., via simulation or usability study) to understand its performance.

Please send comments to Leonardo.

Posts in this thread

 

 

 

What would MPEG be without Systems?

The most visited articles on this blog Forty years of video coding and counting and Thirty years of audio coding and counting prove that MPEG is known for its audio and video coding standards. But I will not tire of saying that MPEG would not it be what it has become if the Systems aspects had not been part of most of its standards. This is what I intend to talk about in this article.

It is hard to acknowledge, but MPEG was not the first to deal with the problem of putting together digital audio and video for delivery purposes. In the second half of the 1990’s ITU-T dealt with the problem of handling audio-visual services using the basic ISDN access at 2B+D (2×64 kbit/s) or the primary ISDN access, the first digital streams made possible by ITU-T Recommendations.

Figure 1 depicts the solution specified in ITU Recommendation H.221. Let’s assume that we have 2 B channels at 64 kbit/s (Basic Access ISDN). H.221 creates on each B channel a Frame Structure of 80 bytes, i.e. 640 bits repeating itself 100 times per second. Each bit position in an octet can be considered as an 8 kbit/s sub-channel. The 8th bit in each octet represents the 8th sub-channel, called the Service Channel.

Within the Service Channel bits 1-8 are used by the Frame Alignment Signal (FAS) and bits 9-16 are used by the Bit Alignment Signal (BAS). Audio is always carried by the first B channel, e.g. by the first 2 subchannels, and Video and Data by the other subchannels (less the bitrate allocated to FAS and BAS).

Figure 1 – ITU Recommendation H.221

MPEG-1 Systems

The solution depicted in Figure 1 bears the mark of the transmission part of the telecom industry that had never been much friendly to packet communication. That is why MPEG in the late 1990’s had an opportunity to bring some fresh air in this space. Starting from a blank sheet of paper (at that time MPEG still used paper 😊) MPEG designed a flexible packet-based multiplexer to convey in a single stream compressed audio and video, and clock information in such a way as to enable audio‑video synchronisation (Figure 2).

Figure 2 – MPEG-1 Systems

The MPEG Systems revolution took time to take effect. Indeed the European EU 95 project used MPEG-1 Audio layer 2, but designed a frame-based multiplexer for the Digital Audio Broadcasting service.

MPEG-2 Systems

In the early 1990’s MPEG started working on another blank sheet of paper. MPEG had the experience of MPEG-1 Systems design but the requirements were significantly different. In MPEG-1, audio and video (possibly many of them in the same stream) had a common time base, but the main users of MPEG-2 wanted a system that could deliver a plurality of TV programs, possibly coming from different sources (i.e. with different time bases) and with possibly a lot of metadata related to the programs, not to mention some key business enabler like conditional access information. Moreover, unlike MPEG-1 where it was safe to assume that the bits issuing from a Compact Disc would travel without errors to a demultiplexer, in MPEG-2 it was mandatory to assume that the transmission channel was anything but error-free.

MPEG-2 Transport Stream (TS) provides efficient mechanisms to multiplex multiple audio-visual data streams into one delivery stream. Audio-visual data streams are packetised into small fixed-size packets and interleaved to form a single stream. Information about the multiplexing structure is interleaved with the data packets so that the receiving entity can efficiently identify a specific stream. Sequence numbers help identify missing packets at the receiving end, and timing information is assigned after multiplexing with the assumption that the multiplexed stream will be delivered and played in sequential order.

MPEG-2 Systems is actually two specifications in one (Figure 3). The Transport Stream (TS) is a fixed-length packet-based transmission system designed to work for digital television distribution on error-prone physical channels, while the Program Stream (PS) is a packet-based multiplexer with many points in common with MPEG-1 Systems. `While TS and PS share significant information, moving from one to the other may not be immediate.

Figure 3 – MPEG-2 Systems

MPEG-4 Systems

MPEG-4 gave MPEG the opportunity to experience an epochal transition in data delivery. When MPEG-2 Systems was designed Asynchronous Transfer Mode (ATM) was high on the agenda of the telecom industry and was considered as the vehicle to transport MPEG-2 TS streams on telecommunication networks. Indeed, the Digital Audio-Visual Council (DAVIC) designed its specifications on that assumption. At that time, however, IP was still unknown to the telecom (at least to the transmission part, broadcast and consumer electronics worlds.

The MPEG-4 Systems work was a completely different story than MPEG-2 Systems. An MPEG4 Mux (M4Mux) was developed along the lines of MPEG-1 and MPEG-2 Systems, but MPEG had to face an unknown world where many transports were surging as possible candidates. MPEG was obviously unable to make choices (today, 25 years later, the choice is clear) and developed the notion of Delivery Multimedia Integration Framework (DMIF), where all communications and data transfers between the data source and the terminal were abstracted through a logical API called the DAI (DMIF Application Interface), independent of the transport type (broadcast, network, storage).

MPEG-4 Systems, however, was about more than interfacing with transport and multiplexing. The MPEG-4 model was a 3D space populated with dynamic audio, video and 3D Graphics objects. Binary Format for Scenes (BIFS) was the technology designed to provide the needed functionality.

Figure 4 shows the 4 MPEG-4 layers: Transport, Synchonisation, Compression and Composition.

Figure 4 – MPEG-4 Systems

MPEG-4 File Format

For almost 10 years – until 1997 – MPEG was a group who made intense use of IT tools (in the form of computer programs that simulated encoding and decoding operation of the standards it was developing) but was not an “IT group”. The proof? Until that time it had not developed a single file format. Today MPEG can claim to have another such attribute (IT group) along with the many others it has.

In those years MP3 files were already being created and exchanged by the millions, but the files did not provide any structure. The MP4 File Format, officially called ISO Base Media File Format (ISO BMFF), filled that gap as it can be used for editing, HTTP streaming and broadcasting.

Let’s have a high level look to understand the sea that separates MP3 files from the MP4 FF. MP4 FF contains tracks for each media type (audio, video etc.), with additional information: a four-character the media type ‘name’ with all parameters needed by the media type decoder. “Track selection data” helps a decoder identify what aspect of a track can be used and to determine which alternatives are available.

Data are stored in a basic structure called box with attributes of length, type (4 printable characters), possibly version and flags. No data can be found outside of a box. Figure 5 shows a possible organisation of an MP4 file

 

Figure 5 – Boxes in an MP4 File

MP4 FF can store:

  1. Structural and media data information for timed presentations of media data (e.g. audio, video, subtitles);
  2. Un-timed data (e.g. meta-data);
  3. Elementary stream encryption and encryption parameter (CENC);
  4. Media for adaptive streaming (e.g. DASH);
  5. High Efficiency Image Format (HEIF);
  6. Omnidirectional Media Format (OMAF);
  7. Files partially received over lossy links for further processing such as playback or repair (Partial File Format);
  8. Web resources (e.g. HTML, JavaScript, CSS, …).

Save for the first two features, all others were added in the years following 2001 when MP4 FF was approved. The last two are still under development.

MPEG-7 Systems

With MPEG-7, MPEG made the first big departure from media compression and turned its attention to media description including ways to compress that information. In addition to descriptors for visual audio and multimedia information, MPEG-7 includes a Systems layer used by an application, say, navigation of a multimedia information repository, to access coded information coming from a delivery layer in the form of coded descriptors (in XML or in BiM, MPEG’s XML compression technology). The figure illustrates the operation of MPEG-7 Systems decoder.

Figure 6 – MPEG-7 Systems

An MPEG-7 Systems decoder operates in two phases

  1. Initialisation when DecoderInit initialises the decoder by conveying description format information (textual or binary), a list of URIs that identifies schemas, parameters to configure the Fragment Update decoder, and an initial description. The list of URIs is passed to a schema resolver that associates the URIs with schemas to be passed to Fragment Update Decoder.
  2. Main operation, when the Description Stream (composed of Access Units containing fragment updates) is fed to the decoder which processes
    1. Fragment Update Command specifying the update type (i.e., add, replace or delete content or a node, or reset the current description tree);
    2. Fragment Update Context that identifies the data type in a given schema document, and points to the location in the current description tree where the fragment update command applies; and
    3. Fragment Update Payload conveying the coded description fragment to be added toor replaced in the description.

MPEG-E

MPEG Multimedia Middleware (M3W), also called MPEG-E, is an 8-part standard defining the protocol stack of consumer-oriented multimedia devices, as depicted in Figure 7.

Figure 7 – MPEG Multimedia Middleware (M3W)

The M3W model includes 3 layers:

  1. Applications non part of the specifications but enabled by the M3W Middleware API;
  2. Middleware consisting of
    1. M3W middleware exposing the M3W Middleware API;
    2. Multimedia platform supporting the M3W Middleware by exposing the M3W Multimedia API;
    3. Support platform providing the means to manage the lifetime of, and interaction with, realisation entities by exposing the M3W Support API (it also enables management of support properties, e.g. resource management, fault management and integrity management);
  3. Computing platform: whose API are outside of M3W scope.

MPEG-M

Multimedia service platform technologies (MPEG-M) specifies two main components of a multimedia device, called peer in MPEG-M.

As shown in Figure 8, the first component is API: High-Level for applications and Low Level for network, energy and security.

Figure 8 – High Level and Low Level API

The second components is a middleware called MXM that relies on its multimedia technologies

Figure 9 – The MXM architecture

The Middleware is composed of two types of engine. Technology Engines are used to call functionalities defined by MPEG standards such as creating or interpreting a licence attached to a content item. Protocol Engines are used to communicate with other peer, e.g. in case a peer does not have a particular Technology Engine that another peer has. For instance, a peer can use a Protocol Engine to call a licence server to get a licence to attach to a multimedia content item. The MPEG-M middleware has the ability to create chains of Technology Engines (Orchestration) or Protocol Engines (Aggregation).

MMT

MPEG Media Transport (MMT) is part 1 of High efficiency coding and media delivery in heterogeneous environments (MPEG-H). It is the solution for the new world of broadcasting where delivery of content can take place over different channels each with different characteristics, e.g. one-way (traditional broadcasting) and two-way (the ever more pervasive broadband network). MMT assumes that the Internet Protocol is common to all channels.

Figure 10 depicts the MMT protocol stack

Figure 10 – The MMT protocol stack

Figure 11 focuses on the MMT Payload, i.e. on the content structure.

Figure 11 – Structure of MMT Payload

The MMT Payload has an onion-like structure:

  1. . Media Fragment Unit (MFU), the atomic unit which can be independently decoded;
  2. Media Processing Unit (MPU), the atomic unit for storage and consumption of MMT content (structured according to ISO BMFF), containing one or more MFUs;
  3. MMT Asset, the logical unit for elementary streams of multimedia component, e.g. audio, video and data, containing one or more MPU files;
  4. MMT Package, a logical unit of multimedia content such as a broadcasting program, containing one or more MMT Assets, also containing
    1. Composition Information (CI), describing the spatio-temporal relationships among MMT Assets
    2. Delivery Information, describing the network characteristics.

MPEG-DASH

Dynamic adaptive streaming over HTTP (DASH) is another MPEG Systems standard that was motivated by the popularity of HTTP streaming and the existence of different protocols used in different streaming platforms, e.g. different manifest and segment formats. By developing the DASH standard for HTTP streaming of multimedia content, MPEG has enabled a standard-based client to stream content from any standard-based server, thereby enabling interoperability between servers and clients of different make.

Figure 12 – DASH model

As depicted in Figure 12, the multimedia content is stored on an HTTP server in two components: 1) Media Presentation Description (MPD) which describes a manifest of the available content, its various alternatives, their URL addresses and other characteristics, and 2) Segments which contain the actual multimedia bitstreams in form of chunks, in single or multiple files.

A typical operation of the system would follow the steps

  1. DASH client obtains the MPD;
  2. Parses the MPD;
  3. Gets information on several parameters, e.g. program timing, media content availability, media types, resolutions, min/max bandwidths, existence of alternatives of multimedia components, accessibility features and the required protection mechanism, the location of each media component on the network and other content characteristics;
  4. Selects the appropriate encoded alternative and starts streaming the content by fetching the segments using HTTP GET requests;
  5. Fetches the subsequent segments after appropriate buffering to allow for network throughput variations
  6. Monitors the network bandwidth fluctuations;
  7. Decides how to adapt to the available bandwidth depending on its measurements by fetching segments of different alternatives (with lower or higher bitrate) to maintain an adequate buffer.

DASH only defines the MPD and the segment formats. MPD delivery and media encoding formats containing the segments as well as client behavior for fetching, adaptation heuristics and content playing are outside of MPEG-DASH’s scope.

Conclusions

The reader should not think that this is an exhaustive presentation of MPEG’s Systems work. I hope the description will reveal the amount of work that MPEG has invested in Systems aspects, sometimes per se, and sometimes to provide adequate support to users of its media coding standards. This article also describes some of the most successful MPEG standards. At the top certainly towers MPEG-2 Systems of which 9 editions have been produced to keep up with continuous user demands for new functionalities.

Without mentioning the fact that MPEG-2 Systems has received an Emmy Award 😉.

Posts in this thread

 

 

MPEG: what it did, is doing, will do

Introduction

If I exchange words with taxi drivers in a city somewhere in the world, one of the questions I am usually asked is: “where are you from?”. As I do not like straight answers, I usually ask back “where do you think I am from?” It usually takes time before the driver gets the information he asked for. Then the next question is: “what is your job?”. Again, instead of giving a straight answer, I ask the question: “do you know MPEG?” Well, believe it or not, 9 out of 10 times the answer is “yes”, often supplemented by an explanation decently connected with what MPEG is.

Wow! Do we need a more convincing proof that MPEG has conquered the minds of the people of the world?

The interesting side of the story, though, is that, even if the name MPEG is known by billions of people, it is not a trademark. Officially, the word MPEG does not even exist. When talking to ISO you should say “ISO/IEC JTC 1/SC 29/WG 11” (next time, ask your taxi driver if they know this letter soup). The last insult is that the mpeg.org domain is owned by somebody who just keeps it without using it.

Should all this be of concern? Maybe for some, but not for me. What I have just talked about is just one aspect of what MPEG has always been. Do you think that MPEG was the result of high-level committees made of luminaries advising governments to take action on the future of media? You are going to be disappointed. MPEG was born haphazardly (read here, if you want to know how). Its strength is that it has been driven by the idea that the epochal transition from analogue to digital should not become another PAL-SECAM-NTSC or VHS-Betamax trap.

In 30 years MPEG has grown 20-fold, changed the way companies do business with media, made music liquid, multiplied the size of TV screens, brought media where there were stamp-size displays, made internet the primary delivery for media, created new experiences, shown that its technologies can successfully be applied beyond media…

There is no sign that its original driving force is abating, unless… Read until the end if you want to know more.

What did MPEG do?

MPEG-1 & MPEG-2

MPEG was the first standards group that brought digital media to the masses. In the 2nd half of the 1990’s the MPEG-1 and MPEG-2 standards were converted to products and services as the list below will show (not that the use of MPEG-1 and MPEG-2 is confined to the 1990’s).

  • Digital Audio Broadcasting: in 1995, just 3 years after MPEG-1 was approved, DAB services began to appear in Europe with DAB receivers becoming available some time later.
  • Portable music: in 1997, 5 years after MPEG-1 was approved, Saehan Information Systems launched MPMan, probably the first portable digital audio player for the mass market that used MP3. This was followed by a long list of competing players until the mobile handset largely took over that function.
  • Video CD: in the second half of the 1990’s VCD spread especially in South East Asia until the MPEG-2 based DVD, with its superior quality, slowly replaced it. It uses all 3 parts of MPEG-1 (layer 2 for audio).
  • Digital Satellite broadcasting: in June 1994 DirecTV launched its satellite TV broadcasting service for the US market, even before MPEG released the MPEG-2 standard in November of that year! It used MPEG-2 and its lead was followed by many other regions who gradually converted their analogue broadcast services to digital.
  • Digital Cable distribution: in 1992 John Malone launched the “500-channel” vision for future cable services and MPEG gave the cable industry the means to make that vision real.
  • Digital Terrestrial broadcasting:
    • In 1996 the USA Federal Communications Commission adopted the ATSC A/53 standard. It took some time, however, before wide coverage of the country, and of other countries following the ATSC standards, was achieved.
    • In 1998 the UK introduced Digital Terrestrial Television (DTT).
    • In 2003 Japan started DTT services using MPEG-2 AAC for audio in addition to MPEG-2 Video and TS.
    • DTT is not deployed in all countries yet, and there are regularly news of a country switching to digital, the MPEG way of course.
  • Digital Versatile Disc (DVD): toward the end of the 1990’s the first DVD players were put to market. They used MPEG-2 Program Stream (part 1 of MPEG-2) and MPEG-2 Video, and a host of audio formats, some from MPEG.

MPEG-4

In the 1990s the Consumer Electronics industry provided devices to the broadcasting and telecom industries. and devices for package media. The shift to digital services called for the IT industry to join as providers of big servers for broadcasting and interactive services (even though in the 1990’s the latter did not take off). The separate case of portable audio players provided by startups did not fit the established categories.

MPEG-4 played the fundamental role of bringing the IT industry under the folds of MPEG as a primary player in the media space.

  • Internet-based audio services: The great original insight of Steve Jobs and other industry leaders transformed Advanced Audio Coding (AAC) from a promising technology to a standard that dominates mobile devices and internet services
  • Internet video: MPEG-4 Visual, with the MP4 nickname, did not repeat the success of MP3 for video. Still it was the first example of digital media on the internet as DivX (a company name). Its hopes to become the streaming video format for the internet were dashed by the licensing terms of MPEG-4 Visual, the first example of ill-influence of technology rights on an MPEG standard
  • Video for all: MPEG-4 Advanced Video Coding (AVC) became a truly universal standard adopted in all areas and countries. Broadcasting, internet distribution, package media (Blu-ray) and more.
  • Media files: the MP4 File Format is the general structure for time-based media files, that has become another ubiquitous standard at the basis of modern digital media.
  • Advanced text and graphics: the Open Font Format (OFF), based on the OpenType specification, revised and extended by MPEG, is universally used.

MPEG-7

MPEG-A

  • Format for encrypted, adaptable multimedia presentation: is provided by the Common Media Application Format (CMAF), a format optimised for large scale delivery of protected media with a variety of adaptive streaming, broadcast, download, and storage delivery methods including DASH and MMT.
  • Interoperable image format: the Multi-Image Application Format (MIAF) enables precise interoperability points for creating, reading, parsing, and decoding images embedded in HEIF.

MPEG-B

  • Generic binary format for XML: is provided by Binary format for XML (BiM), a standard used by products and services designed to work according to ARIB and DVB specifications.
  • Common encryption for files and streams: is provided by Common Encryption (CENC) defined in two MPEG-B standards – Part 7 for MP4 Files and Parts 9 for MPEG-2 Transport Stream. CENC is widely used for the delivery of video to billions of devices capable to access internet-delivered stored files, MPEG-2 Transport Syteam and live adaptive streaming.

MPEG-H

  • IP-based television: MPEG Media Transport (MMT) is the “transport layer” of IP-based television. MMT assumes that delivery is achieved by an IP network with in-network intelligent caches close to the receiving entities. Caches adaptively packetise and push the content to receiving entities. MMT has been adopted by the ATSC 3.0 standard and is currently being deployed in countries adopting ATSC standards and also used in low-delay streaming applications.
  • More video compression, siempre!: has been provided by High Efficiency Video Coding (HEVC), the AVC successor yielding an improved compression up to 60% compared to AVC. Natively, HEVC supports High Dynamic Range (HDR) and Wider Colour Gamut (WCG). However, its use is plagued by a confused licensing landscape as described, e.g. in A crisis, the causes and a solution
  • Not the ultimate audio experience, but close: MPEG-H 3D Audio is a comprehensive audio compression standard capable of providing very satisfactory immersive audio experiences in broadcast and interactive applications, It is part of the ATSC 3.0 standard.
  • Comprehensive image file format: High Efficiency Image File Format (HEIF) is a file format for individual HEVC-encoded images and sequences of images. It is a container capable of storing HEVC intra-images and constrained HEVC inter-images, together with other data such as audio in a way that is compatible with the MP4 File Format. HEIF is widely used and supported by major OSs and image editing software.

MPEG-DASH

Streaming on the unreliable internet: Dynamic Adapting Streaming on HTTP (DASH) is the widely used standard that enables a media client connected to a media server via the internet to obtain instant-by-instant the version, among those available on the server, that best suites the momentary network conditions.

What is MPEG doing now?

In the preceding chapter I singled out only MPEG standards that have been (and often still continue to be) extremely successful.

I am  unable to single out those that will be successful in the future 😊, so the reasonable thing to do is to show the entire MPEG work plan

At the risk of making the wrong bet 😊. let me introduce some of the most high profile standards under development,  subdivided in the three categories Media Coding, Systems and Tools, and Beyond Media. But you have better become acquainted with all ongoing activities. In MPEG sometimes the last become the first.

Media Coding

  • Versatile Video Coding (VVC): is the flagship video compression activity that will deliver another round of improved video compression. It is expected to be the platform on which MPEG will build new technologies for immersive visual experiences (see below).
  • Enhanced Video Coding (EVC): is the shorter term project with less ambitious goals. EVC is designed to satisfy urgent needs from those who need a standard with a less complex IP landscape
  • Immersive visual technologies: investigations on technologies applicable to visual information captured by different camera arrangements are under way, as described in The MPEG drive to immersive visual experiences.
  • Point Cloud Compression (PCC): refers to two standards capable of compressing 3D point clouds captured with multiple cameras and depth sensors. The algorithms in both standards are lossy, scalable, progressive and support random access to point cloud subsets. See The MPEG drive to immersive visual experiences for more details.
  • Immersive audio: MPEG-H 3D Audio supports a 3 Degrees of Freedom or 3DoF (yaw, pitch, roll) experience at the movie “sweet spot”. More complete user experiences, however, are needed, i.e. 6 DoF (adding x, y, z). These can be achieved with additional metadata and rendering technology.

Systems and Tools

  • Omnidirectional media format: Omnidirectional Media Application Format (OMAF) v1 is a format supporting the interoperable exchange of omnidirectional (VR 360) content for a user who can only Yaw, Pitch and Roll their head. OMAF v2 will support some head translation movements. See The MPEG drive to immersive visual experiences for more details.
  • Storage of PCC data in MP4 FF: MPEG is developing systems support to enable storage and transport of compressed point clouds with DASH, MMT etc.
  • Scene Description Interface: MPEG is investigating the interface to the scene description (not the technology) to enable rich immersive experiences.
  • Service interface for immersive media: Network-based Media Processing will enable a user to obtain potentially very sophisticated processing functionality from a network service via standard API.
  • IoT when Things are Media Things: Internet of Media Things (IoMT) will enable the creation of networks of intelligent Media Things (i.e. sensors and actuators)

Beyond Media

  • Standards for biotechnology applications: MPEG is finalising all 5 parts of the MPEG-G standard and establishing new liaisons to investigate new opportunities.
  • Coping with neural networks everywhere: shortly (25 March 2019) MPEG will receive responses to its Call for Proposals for Neural Network Compression as described in Moving intelligence around.

What will MPEG do in the future?

At the risk of being considered boastful, I would think that MPEG should have deserved attention from some of the business schools that study socio-economic phenomena. Why? Because many have talked about media convergence, but they have forgotten that MPEG, with its standards, has actually triggered that convergence. MPEG people know the ecosystem at work in MPEG and I for one see how it is unique.

This has not happened. Let’s say that it is better to be neglected than to receive unwanted attention.

I would also think that a body that started from a Subcommittee on character sets and has become the reference standards group for the media industry, i.e. devices, content, services and applications, worth hundreds of billion USD with potent influences on a nearby industry such as telecommunication, should have suggested standards organisations to study the work method and possibly apply it to other domains.

This has not happened. Let’s say, again, that its is better to be neglected than to receive unwanted attention.

So can we expect MPEG to continue its mission, and apply its technologies and know how to continue delivering compression standards for immersive experiences and new compression standards for other domains?

Maybe this time MPEG will attract attention. So, don’t count on it.

Posts in this thread

 

 

The MPEG drive to immersive visual experiences

Introduction

In How does MPEG actually work? I described the MPEG process: once an idea is launched, context and objectives of the idea are identified; use cases submitted and analysed; requirements derived from use cases; and technologies proposed, validated for their effectiveness for eventual incorporation into the standard.

Some people complain that MPEG standards contain too many technologies supporting “non-mainstream” use cases. Such complaints are understandable but misplaced. MPEG standards are designed to satisfy the needs of different industries and what is a must for some, may well not be needed by others.

To avoid burdening a significant group of users of the standard with technologies considered irrelevant, from the very beginning MPEG adopted the “profile approach”. This allows to retain a technology for those who need it without encumbering those who do not.

It is true that there are a few examples where some technologies in an otherwise successful standard get unused. Was adding such technologies a mistake? In hindsight yes, but at the time a standard is developed the future is anybody’s guess and MPEG does not want find out later that one of its standards misses a functionality that was deemed to be necessary in some use cases and that technology could support at the time the standard was developed.

For sure there is a cost in adding the technology to the standard – and this is borne by the companies proposing the technology – but there is no burden to those who do not need it because they can use another profile.

Examples of such “non-mainstream” technologies are provided by those supporting stereo vision. Since as early as MPEG-2 Video, multiview and/or 3D profile(s) have been present in most MPEG video coding standards. Therefore, this article will review the attempts made by MPEG at developing new and better technologies to support what are called today immersive experiences.

The early days

MPEG-1 did not have big ambitions (but the outcome was not modest at all ;-). MPEG-2 was ambitious because it included scalability – a technology that reached maturity only some 10 years later – and multiview. As depicted in Figure 1, multiview was possible because if you have two close cameras pointing to the same scene, you can exploit intraframe, interframe and interview redundancy.

Figure 1 – Redundancy in multiview

Both MPEG-2 scalability and multiview saw little take up.

Both MPEG-4 Visual and AVC had multiview profiles. AVC had 3D profiles next to multiview profiles. Multiview Video Coding (MVC) of AVC was adopted by the Blu-ray Disc Association, but the rest of the industry took another turn as depicted in Figure 2.

Figure 2 – Frame packing in AVC and HEVC

If the left and right frames of two video streams are packed in one frame, regular AVC compression can be applied to the packed frame. At the decoder, the frames are de-packed after decompression and the two video streams are obtained.

This is a practical but less that optimal solution. Unless the frame size of the codec is not doubled, you either compromise the horizontal or the vertical resolution depending on the frame-packing method used. Because of this a host of other more sophisticates, but eventually non successful, frame packing methods have been introduced into the AVC and HEVC standards. The relevant information is carried by Supplemental Enhancement Information (SEI) messages, because the specific frame packing method used is not normative.

The HEVC standard, too, supports 3D vision with tools that efficiently compress depth maps, and exploit the redundancy between video pictures and associated depth maps. Unfortunately use of HEVC for 3D video has also been limited.

MPEG-I

The MPEG-I project – ISO/IEC 23090 Coded representation of immersive media – was launched at a time when the word “immersive” was prominent in many news headings. Figure 3 gives three examples of immersivity where technology challenges increase moving from left to right.


Figure 3 – 3DoF (left), 3DoF+ (centre) and 6DoF (left)

In 3 Degrees of Freedom (3DoF) the user is static but the head that can Yaw, Pitch and Roll. In 3DoF+ the user has the added capability of some head movements in the three directions. In 6 Degrees of Freedom the user can freely walk in a 3D space.

Currently there are several activities in MPEG that aim at developing standards that support some form of immersivity. While they had different starting points, they are likely to converge to one or, at least, a cluster of points (hopefully not to a cloud😊).

OMAF

Omnidirectional Media Application Format (OMAF) is not a way to compress immersive video but a storage and delivery format. Its main features are:

  1. Support of several projection formats in addition to the equi-rectangular one
  2. Signalling of metadata for rendering of 360ᵒ monoscopic and stereoscopic audio-visual data
  3. Use of MPEG-H video (HEVC) and audio (3D Audio)
  4. Several ways to arrange video pixels to improve compression efficiency
  5. Use of the MP4 File Format to store data
  6. Delivery of OMAF content with MPEG-DASH and MMT.

MPEG has released OMAF in 2018 that is now published as an ISO standard (ISO/IEC 23090-2).

3DoF+

If the current version of OMAF is applied to a 3DoF+ scenario, the user may feel parallax errors that are more annoying the larger the movement of the head.

To address this problem, at the January 2019 meeting MPEG has issued a call for proposals requesting appropriate metadata (see the red blocks in Figure 4) to help the Post-processor to present the best image based on the viewer’s position if available, or to synthesise a missing one, if not available.

Figure 4 – 3DoF+ use scenario

The 3DoF+ standard will be added to OMAF which will be published as 2nd edition. Both standards are planned to be completed in October 2020.

VVC

Versatile Video Coding (VVC) is the latest in the line of MPEG video compression standards supporting 3D vision. Currently VVC does not specifically include full-immersion technologies, as it only supports omnidirectional video as in HEVC. However, VVC could not only replace HEVC in the Figure 4, but also be the target of other immersive technologies as will be explained later.

Point Cloud Compression

3D point clouds can be captured with multiple cameras and depth sensors with points that can number a few thousands up to a few billions, and with attributes such as colour, material properties etc. MPEG is developing two different standards whose choice depends on whether the points are dense (Video-based PCC) or less so (Graphic-based PCC). The algorithms in both standards are lossy, scalable, progressive and support random access to subsets of the point cloud. See here for an example of a Point Cloud test sequence being used by MPEG for developing the V-PCC standard.

MPEG plans to release Video-based Point Cloud Compression as FDIS in October 2019 and Graphic-based PCC Point Cloud Compression as FDIS in April 2020.

Next to PCC compression MPEG is working on Carriage of Point Cloud Data with the goal to specify how PCC data can be stored in ISOBMFF and transported with DASH, MMT etc.

Other immersive technologies

6DoF

MPEG is carrying out explorations on technologies that enable 6 degrees of freedom (6DoF). The reference diagram for that work is what looks like a minor extension of the 3DoF+ reference model (see Figure 5), but may have huge technology implications.

Figure 5 – 6DoF use scenario

To enable a viewer to freely move in a space and enjoy a 3D virtual experience that matches the one in the real world, we still need some metadata as in 3DoF+ but also additional video compression technologies that could be plugged into the VVC standard.

Light field

The MPEG Video activity is all about standardising efficient technologies that compress digital representations of sampled electromagnetic fields in the visible range captured by digital cameras. Roughly speaking we have 4 types of camera:

  1. Conventional cameras with a 2D array of sensors receiving the projection of a 3D scene
  2. An array of cameras, possibly supplemented by depth maps
  3. Point clouds cameras
  4. Plenoptic cameras whose sensors capture the intensity of light from a number of directions that the light rays travel to reach the sensor.

Technologically speaking, #4 is an area that has not been shy in promises and is delivering on some of them. However, economic sustainability for companies engaged in developing products for the entertainment market has been a challenge.

MPEG is currently engaged in Exploration Experiments (EE) to check

  1. The coding performance of Multiview Video Data (#2) for 3DoF+ and 6DoF, and Lenslet Video Data (#4) for Light Field
  2. The relative coding performance of Multiview coding and Lenslet coding, both for Lenslet Video Data (#4).

However, MPEG is not engaged in checking the relative coding performance of #2 data and #4 data because there are no #2 and #4 test data for the same scene.

Conclusion

In good(?) old times MPEG could develop video coding standards – from MPEG-1 to VVC – by relying on established input video formats. This somehow continues to be true for Point Clouds as well. On the other hand, Light Field is a different matter because the capture technologies are still evolving and the actual format in which the data are provided has an impact on the actual processing that MPEG applies to reduce the bitrate.

MPEG has bravely picked up the gauntlet and its machine is grinding data to provide answers that will eventually lead to one or more visual compression standards to enable rewarding immersive user experiences.

MPEG is planning a “Workshop on standard coding technologies for immersive visual experiences” in Gothenburg (Sweden) on 10 July 2019. The workshop, open to the industry, will be an opportunity for MPEG to meet its client industries, report on its results and discuss industries’ needs for immersive visual experiences standards.

Posts in this thread

There is more to say about MPEG standards

Introduction

In Is there a logic in MPEG standards? I described the first steps in MPEG life that look so “easy” now: MPEG-1 (1988) for interactive video and digital audio broadcasting; MPEG-2 (1991) for digital television; MPEG-4 (1993) for digital audio and video on fixed and mobile internet; MPEG-7 (1997) for audio-video-multimedia metadata; MPEG-21 (2000) for trading of digital content. Just these 5 standards, whose starting dates cover 12 years i.e. 40% of MPEG’s life time, include 86 specifications, i.e. 43% of the entire production of MPEG standards.

MPEG-21 was the first to depart from the one and trine nature of MPEG standards: Systems, Video and Audio, and that departure has continued until MPEG-H. Being one and trine is a good qualification for success, but MPEG standards do not have to be one and trine to be successful, as this paper will show.

A bird’s eye view of MPEG standards

The figure below presents a view of all MPEG standards, completed, under development or planned. Yellow indicates that the standard has been dormient for quite some time, light brown indicates that the standard is still active, and white indicates standards that will be illustrated in the future

MPEG-A

The official title of MPEG-A is Multimedia Application Formats. The idea behind MPEG-A is kind of obvious: we have standards for media elements (audio, video 3D graphics, metadata etc.), but what should one do to be interoperable when combining different media elements? Therefore MPEG-A is a suite of specifications that define application formats integrating existing MPEG technologies to provide interoperability for specific applications. Unlike the preceding standards that provided generic technologies for specific contexts, the link that unites MPEG-A specifications is the task of combing MPEG and, when necessary, other technologies for specific needs.

An overview of the MPEG-A standard is available here. Some of the 20 MPEG-A specifications are briefly described below:

  1. Part 2 – MPEG music player application format specifies an “extended MP3 format” to enable augmented sound experiences (link)
  2. Part 3 – MPEG photo player application format specifies additional information to a JPEG file to enable augmented photo experiences (link)
  3. Part 4 – Musical slide show application format is a superset of the Music and Photo Player Application Formats enabling slide shows accompanied by music
  4. Part 6 – Professional archival application format specifies a format for carriage of content, metadata and logical structure of stored content and related data protection, integrity, governance, and compression tools (link)
  5. Part 10 – Surveillance application format specifies a format for storage and exchange of surveillance data that include compression video and audio, file format and metadata (link)
  6. Part 13 – Augmented reality application format specifies a format to enable consumption of 2D/3D multimedia content including both stored and real time, and both natural and synthetic content (link)
  7. Part 15 – Multimedia Preservation Application Format specifies the Multimedia Preservation Description Information (MPDI) that enables a user to discover, access and deliver multimedia resources (link)
  8. Part 18 – Media Linking Application Format specifies a data format called “bridget”, a link from a media item to another media item that includes source, destination, metadata etc. (link)
  9. Part 19 – Common Media Application Format combines and restricts different technologies to deliver and combine CMAF Media Objects in a flexible way to form multimedia presentations adapted to specific users, devices, and networks (link)
  10. Part 22 – Multi-Image Application Format enables precise interoperability points for creating, reading, parsing, and decoding images embedded in a High Efficiency Image File (HEIF).

MPEG-B

The official title of MPEG-B is MPEG systems technologies. After developing MPEG-1, -2, -4 and -7, MPEG realised that there were specific systems technologies that did not fit naturally into any part 1 of those standards. Thus, after using the letter A in MPEG-A, MPEG decided to use the letter B for this new family of specifications.

MPEG-B is composed of 13 parts, some of which are

  1. Part 1 – Binary MPEG format for XML, also called Binary MPEG format for XML or BiM, specifies a set of generic technologies for encoding XML documents adding to and integrating the specifications developed in MPEG-7 Part 1 and MPEG-21 part 16 (link)
  2. Part 4 – Codec configuration representation specifies a framework that enables a terminal to build a new video or 3D Graphics decoder by assembling standardised tools expressed in the RVC CAL language (link)
  3. Part 5 – Bitstream Syntax Description Language (BSDL) specifies a language to describe the syntax of a bistream
  4. Part 7 – Common encryption format for ISO base media file format files specifies elementary stream encryption and encryption parameter storage to enable a single MP4 file to be used on different devices supporting different content protection systems (link)
  5. Part 9 – Common Encryption for MPEG-2 Transport Streams is a similar specification as Part 7 for MPEG-2 Transport Stream
  6. Part 11 – Green metadata specifies metadata to enable a decoder to consume less energy while still providing a good quality video (link)
  7. Part 12 – Sample Variants defines a Sample Variant framework to identify the content protection system used in the client (link)
  8. Part 13 – Media Orchestration contains tools for orchestrating in time (synchronisation) and space the automated combination of multiple media sources (i.e. cameras, microphones) into a coherent multimedia experience rendered on multiple devices simultaneously
  9. Part 14 – Partial File Format enables the description of an MP4 file partially received over lossy communication channels by providing tools to describe reception data, the received data and document transmission information
  10. Part 15 – Carriage of Web Resource in MP4 FF specifies how to use MP4 File Format tools to enrich audio/video content, as well as audio-only content, with synchronised, animated, interactive web data, including overlays.

 MPEG-C

The official title of MPEG-C is MPEG video technologies. As for systems, MPEG realised that there were specific video technologies supplemental to video compression that did not fit naturally into any part 2 of the MPEG-1, -2, -4 and -7 standards.

MPEG-B is composed of 6 parts, two of which are

  1. Part 1 – Accuracy requirements for implementation of integer-output 8×8 inverse discrete cosine transform, was created after IEEE had discontinued a similar standard which is at the basis of important video coding standards
  2. Part 4 – Media tool library contains modules called Functional Units expressed in the RVC-CAL language. These can be used to assemble some of the main MPEG Video coding standards, including HEVC and 3D Graphics compression standards, including 3DMC.

MPEG-D

The official title of MPEG-C is MPEG audio technologies. Unlike MPEG-C, MPEG-D parts 1, 2 and 3 actually specify audio codecs that are not generic, as MPEG-1, MPEG-2 and MPEG-4 but intended to address specific application targets.

MPEG-D is composed of 5 parts, the first 4 of which are

  1. Part 1 – MPEG Surround specifies an extremely efficient method for coding of multi-channel sound via the transmission of a 1) compressed stereo or monoaudio program and 2) a low-rate side-information channel with the advantage of retaining backward compatibility to now ubiquitous stereo playback systems while giving the possibility to next-generation players to present a high-quality multi-channel surround experience
  2. Part 2 – Spatial Audio Object Coding (SAOC) specifies an audio coding algorithm capable to efficiently handle individual audio objects (e.g. voices, instruments, ambience, ..) in an audio mix and to allow the listener to adjust the mix based on their personal taste, e.g. by changing the rendering configuration of the audio scene from stereo over surround to possibly binaural reproduction
  3. Part 3 – Unified Speech and Audio Coding (USAC) specifies an audio coding algorithm capable to provide consistent quality for mixed speech and music content with a quality that is better than codecs that are optimized for either speech content or music content
  4. Part 4 – Dynamic Range Control (DRC) specifies a unified and flexible format supporting comprehensive dynamic range and loudness control, addressing a wide range of use cases including media streaming and broadcast applications. The DRC metadata attached to the audio content can be applied during playback to enhance the user experience in scenarios such as ‘in a crowded room’ or ‘late at night’

MPEG-E

This standard is the result of an entirely new direction of MPEG standardisation. Starting from the need to define API that applications can call to access key MPEG technologies MPEG developed a Call for Proposal to which several responses were received  MPEG reviewed the responses and developed the ISO/IEC standard called Multimedia Middleware.

MPEG-E is composed of 8 parts

  1. Part 1 – Architecture specifies the MPEG Multimedia Middleware (M3W) architecture that allows applications to execute multimedia functions without requiring detailed knowledge of the middleware and to update, upgrade and extend the M3W
  2. Part 2 – Multimedia application programming interface (API) specifies the M3W API that provide media functions suitable for products with different capabilities for use in multiple domains
  3. Part 3 – Component model specifies the M3W component model and the support API for instantiating and interacting with components and services
  4. Part 4 – Resource and quality management, Part 5 – Component download, Part 6 – Fault management and Part 7 – System integrity management specify the support API and the technology used for M3W Component Download Fault Management Integrity Management and Resource Management, respectively

MPEG-V

The development of the MPEG-V standard Media context and control started in 2006 from the consideration that MPEG media – audio, video, 3D graphics etc. – offer virtual experiences that may be a digital replica of a real world, a digital instance of a virtual world or a combination of natural and virtual worlds. At that time, however, MPEG could not offer users any means to interact with those worlds.

MPEG undertook the task to provide standard interactivity technologies that allow a user to

  1. Map their real-world sensor and actuator context to a virtual-world sensor and actuator context, and vice-versa and
  2. Achieve communication between virtual worlds.

This is depicted in the figure

All data streams indicated are specified in one or more of the 7 MPEG-V parts

  1. Part 1 – Architecture expands of the figure above
  2. Part 2 – Control information specifies control devices interoperability (actuators and sensors) in real and virtual worlds
  3. Part 3 – Sensory information specifies the XML Schema-based Sensory Effect Description Language to describe actuator commands such as light, wind, fog, vibration, etc. that trigger human senses
  4. Part 4 – Virtual world object characteristics defines a base type of attributes and characteristics of the virtual world objects shared by avatars and generic virtual objects
  5. Part 5 – Data formats for interaction devices specifies syntax and semantics of data formats for interaction devices – Actuator Commands and Sensed Information – required to achieve interoperability in controlling interaction devices (actuators) and in sensing information from interaction devices (sensors) in real and virtual worlds
  6. Part 6 – Common types and tools specifies syntax and semantics of data types and tools used across MPEG-V parts.

Conclusion

The standards from MPEG-A to MPEG-V include 59 specifications that extend over the entire 30 years of MPEG activity. These standards account for 29% of the entire production of MPEG standards.

In this period of time MPEG standards have addessed more of the same technologies – systems (MPEG-B), video (MPEG-C) and audio (MPEG-D) – and have covered other features beyond those initially addressed: application formats (MPEG-A), media application life cycle (MPEG-E), and interaction of the real world with virtual worlds, and between virtual world (MPEG-V).

Media technologies evolve and so do their applications. Sometimes applications succeed and sometimes fail. So do MPEG standards.

Posts in this thread

Moving intelligence around

Introduction

Artificial intelligence has reached the attention of mass media and technologies supporting it – Neural Networks (NN) – are being deployed in several contexts affecting end users, e.g. in their smart phones.

If a NN is used locally, it is possible to use existing digital representation of NNs (e.g., NNEF, ONNX). However, these format miss vital features for distributing intelligence, such as compression, scalability and incremental updates.

To appreciate the need for compression let’s consider the case of adjusting the automatic mode of a camera based on recognition of scene/object obtained by using a properly trained NN. As this area is intensely investigated, very soon there will be a new better trained version of the NN or a new NN with additional features. However, as the process to create the necessary “intelligence” usually takes time and labor (skilled and unskilled), in most cases the new created intelligence must be moved from the center to where the user handset is. With today’s NNs reaching a size of several hundred Mbytes and growing, a scenario where millions of users clog the network because they are all downloading the latest NN with great new features looks likely.

This article describes some elements of the MPEG work plan to develop one or more standards that enable compression of neural networks. Those wishing to know more please read Use cases and Requirements, and Call for Proposals.

About Neural Networks

A Neural Network is a system composed of connected nodes each of which can

  1. Receive input signals from other nodes,
  2. Process them and
  3. Transmit an output signal to other nodes.

Nodes are typically aggregated into layers, each performing different functions. Typically the “first layers” are rather specific of the signals (audio, video, various forms of text information etc.). Nodes can send signals to subsequent layers but, depending on the type of network, also to the preceding layers.

Training is the process of “teaching” a network to do a particular job, e.g. recognising a particular object or a particular word. This is done by presenting to the NN data from which it can “learn”. Inference is the process of presenting to a trained network new data to get a response about what the new data is.

When is NN compression useful?

Compression is useful whenever there is a need to distribute NNs to remotely located devices. Depending on the specific use case, compression should be accompanied by other features. In the following two major use cases will be analysed.

Public surveillance

In 2009 MPEG developed the Surveillance Application Format. This is a standard that specifies the package (file format) containing audio, video and metadata to be transmitted to a surveillance center. Today, however, it is possible to introduce to ask the surveillance network to do more more intelligent things by distributing intelligence even down to the level of visual and audio sensors.

For this more advanced scenarios MPEG is developing a suite of specifications under the title of Internet of Media Things (IoMT) where Media Things (MThing) are the media “versions” of IoT’s Things. The IoMT standard (ISO/IEC 23093) will reach FDIS level in March 2019.

The IoMT reference model is represented in the figure

IoMT standardises the following interfaces:

1: User commands (setup info.) between a system manager and an MThing

1’: User commands forwarded by an MThing to another MThing, possibly in a modified form (e.g., subset of 1)

2: Sensed data (Raw or processed data in the form of just compressed data or resulting from a semantic extraction) and actuation information

2’: Wrapped interface 2 (e.g. for transmission)

3: MThing characteristics, discovery

IoMT is neutral as to the type of semantic extraction or, more generally, to nature of intelligence actually present in the cameras. However, as NNs networks are demonstrating better and better results for visual pattern recognition, such as object detection, object tracking and action recognition, cameras can be equipped with NNs capable to process the information captured to achieve a level of understanding and transmit that understanding through interface 2.

Therefore, one can imagine that re-trained or brand new NNs can be regularly uploaded to a server that distributes NNs to surveillance cameras. Distribution need not be uniform since different neural networks may be needed at different areas, depending on the tasks that need to be specifically carried out at given areas.

NN compression is a vitally important technology to make the described scenarios real because automatic surveillance system may use many cameras (e.g. thousands and even million units) and because, as the technology to create NNs matures, the time between NN updates will progressively become shorter.

Distribution of NN-based apps to devices

There are many cases where compression is useful to efficiently distribute heavy NN-based apps to a large number of devices, in particular mobile. Here 3 case are considered.

  1. Visual apps. Updating a NN-based camera app in one’s mobile handset will soon become common place. Ditto for the many conceivable application where the smart phone understand some of the objects in the world around. Both will happen at an accelerated frequency.
  2. Machine translation (speech-to-text, translation, text-to-speech). NN-based translation apps already exist and their number, efficiency, and language support can only increase.
  3. Adaptive streaming. As AI-based methods can improve the QoE, the coded representation of NNs can initially be made available to clients prior to streaming while updates can be made during streaming to enable better adaptation decisions, i.e. better QoE.

Requirements

The MPEG Call for Proposals identifies a number of requirements that a compressed neural network should satisfy. Even though not all applications need the support of all requirements, the NN comnpression algorithm must eventually be able to support all the identified requirements.

  1. Compression shall have a lossless mode, i.e. the performance of the compressed NN is exactly the same as the uncompressed NN
  2. Compression shall have a lossy mode, i.e. the performance of the decompressed NN can be different than the performance of the uncompressed NN of course in exchange for more compression
  3. Compression shall be scalable, i.e. even if only a subset of the compressed NN is used, there is still a level of performance
  4. Compression shall support incremental updates, i.e. as more data are received the performance of NN improves
  5. Decompression shall be possible with limited resources, i.e. with limited processing performance and memory
  6. Compression shall be error resilient, i.e. if an error occurs during transmission, the file is not lost
  7. Compression shall be robust to interference, i.e. it is possible to detect that the compressed NN has been tampered with
  8. Compression shall be possible even if there is no access to the original training data
  9. Inference shall be possible using compressed NN
  10. Compression shall supportincremental updates from multiple providers to improve performance of a NN

Conclusions

The currently published Call for Proposals is not requesting technologies for all requirements listed above (which are themselves a subset of all identified requirements). It is expected, however, that the responses to the CfP will provide enough technology to produce a base layer standard that will help the industry move its first steps in this exciting field that will shape the way intelligence is added to things near to all of us.

Posts in this thread

More standards – more successes – more failures

Introduction

I have seen people ask the question: MPEG makes many very successful standard but many are not widely used. Why do you make so many standards?

I know they ask this question because they dare not ask this other question “Why don’t you make just the good standards?”. They do not do it because they know that the easy answer would be the famous phrase attributed to John Wanamaker: “Half the money I spend on advertising is wasted; the trouble is I don’t know which half”.

In this article I do not want to brush off a serious question with an aphorism. When MPEG decides on developing a standard and when a company decides on developing a product face similar problems.

Therefore, I will first compare the processes to develop a company product and an MPEG standard, highlighting the similarities and the differences. Then I analyse some successes and failures of MPEG standards. I will also explain how MPEG can turn standards that should apparently be doomed to failure to an unexpected success.

Those looking for the perfect recipe that will lead to only successful standards should look at Mr. Wanamaker’s epigones, in the hope they have found an answer to his question.

Standards as products

A standard is the product that MPEG delivers to its customers. I would first like to show that the process used by a company to develop a product is somehow aligned to the MPEG process of standard development – with some remarkable differences.

Let’s see how a company could decide to make a new product:

  1. A new product idea is proposed
  2. Product idea is supported by market studies
  3. Technology is available/accessible to make the product
  4. Design resources are available
  5. “Product board” approves the project
  6. Design is developed.

Let us see the corresponding work flow of an MPEG standard (look at How does MPEG actually work to have more details about the process):

  1. An idea is proposed/discussed at a meeting
  2. Idea is clarified in context and objectives
  3. Use cases of the idea are developed
  4. Requirements are derived from use cases
  5. A Call for Evidence (CfE) is issued to check that technologies meeting the requirements exist
  6. A Call for Proposals (CfP) is issued to make the necessary technologies available to the committee
  7. National Bodies (NB) approve the project
  8. The standard is developed.

Let us compare and align the two processes because there are significant differences next to similarities:

# Company product steps MPEG standard steps
1 A new product idea is proposed Idea is aired/proposed at a meeting
2 Market studies support product Context & objectives of idea drafted

Use cases developed

3 Product requirements are developed Requirements derived from use cases
4 Technology is available/accessible Call for Evidence is issued

Call for Proposals is issued

5 Design resources are available MPEG looks for those interested
6 “Product board” approves the product NBs approve the project
7 Design is developed Test Model developed

Core Experiments carried out

Working Drafts produced

Standard improved in NB balloting

8 The design is approved NBs approve the standard

Comparing products and standards

With reference to the table above the following comparison can be made

  1. Product proposal: process is hard to compare. Any company has its own processs. In MPEG, proposals can come to the fore spontaneously from any member.
  2. Proposal justification: process is hard to compare. Any company has its own specific means to assess the viability of a proposed new product. In MPEG, when enough support exists, first the context in which the idea would be applied and for what purposes is documented. Then MPEG develops use cases to prove that a standard implementing the idea would support the use cases better than it is possible today or make possible use cases that today are not. As an entity, MPEG does not make “market studies” (because it does not have the means). It relies instead on members bringing relevant information into the committee when “context and objectives” and “use cases” are developed.
  3. Requirements definition: happens under different names and processes in companies and in MPEG.
  4. Technology availability is quite different. A company often owns a technology as a result of some R&D effort. If it does not have a technology for a product, it either develops it or acquires it. MPEG, too, does “own” a body of technologies, but typically a new proposal requires new technology. While MPEG members may know that a technology is actually available, they may not be allowed to talk about it. Therefore, MPEG needs in general two steps: 1) to become aware of technology (via CfE) and 2) to have the technology available (via CfP). In some cases, like in Systems standards, MPEG members may develop the technology collaboratively from a clean sheet of paper.
  5. Design resource availability is very different in the two environments. If a company sees a product opportunity, it has the means to deploy the appropriate resources (well, that also depends on internal product advocates’ influence). If MPEG sees a standard opportunity, it has no means to “command” members to do something because members report to their companies, not to MPEG. It would be great if some MPEG members who insist on MPEG pursuing certain opportunities without offering resources to achieve them understood this.
  6. Product approval: is very different in the two environments. Companies have their own internal processes to approve products. In MPEG the project for a new standard is approved by the shareholders, i.e. by the NBs, the simple majority of which must approve the project and a minimum of five NBs must commit resources to execute it.
  7. Design development: is very different in the two environments. Companies have their own internal processes to design a new product. In MPEG work obviously stops at the design phase but it entails the following steps: 1) Test Model creation, 2) Core Experiments execution, 3) Working Drafts development and 4) Standard improvement though NB balloting.
  8. Design approval: is very different in the two environments. Companies have their own internal processes to approve the design of a new product. In MPEG, again, the shareholders, i.e. the NBs, approve the standard with a qualified majority.

What is certainly common in the two processes is that the market response to the company product or to the MPEG standard is anybody’s guess. Some products/standards are widely successful, some fare so and so, and some are simply rejected by the market. Companies have resources that allow them to put in place other strategies to reduce the number of failures, but it is a reality that even companies darling of the market stumble from time to time.

MPEG is no exception.

How to cope with uncertainty

MPEG, being an organisation whose basis of operation is consensus, has advantages and disadvantages compared to a company. Let us see now how MPEG has managed the uncertainty surrounding its standards.

An interesting case is MPEG-1. The project was driven by the idea of video interactivity on CD and digital audio broadcasting. MPEG-1 did not have commercial success for both targets. However, Video CD, not even in the radar when MPEG-1 was started, used MPEG-1 and sold 1 billion units (and tens of billion CDs). MP3, too, was also not in the radar when MPEG-1 was approved and some members even argumented against the inclusion of such a “complex” technology into the standard. I doubt there is anybody now regretting the decision to make MP3 part of the MPEG-1 standard. If there is, it is for completely different reasons. The reason why the standard was eventually successful is that MPEG-1 was designed as a system (VCD is exactly that), but its parts were designed to be usable as stand-alone components (as in MP3).

The second case is MPEG-2. The project was driven by the idea of making television digital. When the first 3 MPEG-2 parts (Systems-Video-Audio) were consolidated, the possibility to use MPEG-2 for interactive video services on the telecom and cable networks became real. MPEG-2 Audio did not fare well in broadcasting (the demand for multichannel was also not there), but it did fare well in other domains. In any case many thought that MPEG-1 Audio delivered just enough. MPEG-2 AAC did fare well in broadcasting and laid the ground for the 20-year long MPEG-4 Audio ride. MPEG started the Digital Storage Media Command and Control (DSM-CC) standard (part 6 of MPEG-2). The DSM-CC carousel is used in broadcasting because it provides the means for a set top box to access various types of information that a broadcaster sends/updates at regular intervals.

MPEG-4 is rich in relevant examples. The MPEG-4 model was a 3D scene populated by “objects” that could be 1) static or dynamic, 2) natural or synthetic, 3) audio or visual in any combination. BIFS (the MPEG name for the 3D scene technology, an extension of VRML) did not fly (but VRML did not fly either). However, 10 years later the Korea-originated Digital Multimedia Broadcasting technology, which used BIFS scaled down to 2D, had a significant success in radio broadcasting.

Much of the MPEG-4 video work was driven by the idea of video “objects” which, along with BIFS, did not fly (the standard specified video objects but did not say how to make them, because that was an encoder issue). For a few years, MPEG-4 video was used in various environments. Unfortunately the main use – video streaming – was stopped by the “content fees” clause of the licensing terms. Part 10 of MPEG-4 Advanced Video Coding (AVC) was very successful, especially because patent holders did not repeat some of the mistakes they had made for MPEG-4 Visual. None of the 3 “royalty free” (Option 1 in ISO language) MPEG-4 video coding standards did fly, showing that in ISO today it is not practically possible to make a media-related standard that does not require onerous licensing of thirty party technology.

The MPEG-4 Parametric coding for high-quality audio did not fly, but a particular tool in it – Parametric Stereo (PS) – could very efficiently encode stereo music as a mono signal plus a small amount of side-information. MPEG combined the PS tool with HE-AAC and produced HE-AAC v2, an audio decoder that is on board of billions of mobile handsets today as it enables transmission of a stereo signal at 32 kb/s with very good audio quality.

For most MPEG standards, the reference model is the figure below

Different groups with different competences develop the different parts of a standard. Some parts are designed to work together with others in systems identified in the Context-Objectives-Use cases phase. However, the parts are not tightly bound because in general it is possible to use them separately.

The MPEG-7 project was driven by the idea of a world rich of audio-video-multimedia descriptors that would allow users to navigate the large amount of media content expected at that time and that we have today. Content descriptors were expressed in verbose XML, a tool at odds with the MPEG bit-thrifty approach. So MPEG developed the first standard for XML compression, a technology adopted in many fields.

Of MPEG-A is remarkable the Common Media Application Format (CMAF) standard. Several technologies drawn from different MPEG standards are integrated to efficiently deliver large scale, possibly protected, video applications, e.g. streaming of televised events. CMAF Segments can be delivered once to edge servers in content delivery networks, then accessed from cache by thousands of streaming video players without additional network backbone traffic or transmission delay.

MPEG-V – Media context and control is another typical example. The work was initiated in the wake of the success of Second Life, a service that looked like it could take over the world. The purpose of part 4 of MPEG-V Virtual world object characteristics was not to standardise a Second Life like service but the interfaces that would allow a user to move assets from one virtual space to another virtual space. The number of Second Life users dived and part 4 never took off. Other parts of MPEG-V concern formats and interfaces to enrich the the audio-visual user experience with, say, a breeze when there is a little wind in the movie, a smell when you are in a field of violets etc. So far, this apparently interesting extension of the user experience did not fly, but MPEG-V provides a very solid communication framework for sensors and actuator that finds use in other standards.

The MPEG-H MPEG Media Transport (MMT) project showed how it is possible to innovate without destabilising existing markets. MPEG-2 Transport Stream (TS) has been in use for 25 years (and MPEG has received an Emmy for that) and will continue to be used for the foreseable future. But MPEG-2 TS shows the signs of time because it has been designed for a one-way channel – an obvious choice 25 years ago – while so much video distribution today happens on two-way channels. MMT uses IP transport instead of MPEG-2 TS transport and achieves content delivery unification in both one-way and two-way distribution channels.

Is MPEG in the research business?

The simple and flat answer is NO. However, MPEG CfPs are great promoters of corporate research because they push companies to improve their technologies to enable them to make successful proposals in response to CfPs.

One of the reasons of MPEG success, but also of the difficulties highlighted in this article, is that, in the MPEG domain, standardisation is a process closer to research than to product design.

Roughly speaking, in the MPEG standardisation process, research happens in two phases: in the companies, in preparation for CfEs or CfPs (MPEG calls this competitive phase) and in what MPEG calls collaborative phase, i.e. during the development of Core Experiments (of course this research phase is still done by the companies, but in the coordinating framework of an MPEG standard).

The power of the MPEG competitive phase lies in the fact that MPEG receives many submissions from respondents to a CfP and pools together the components technologies. Therefore, the MPEG “product” has a much better performance than any autarchic product developed by an independent company because it uses many good technologies from many more companies than a single company could do.

Actually, improvement is even greater and the MPEG collaborative phase offers another opportunity to do more research. This has a much more limited scope because it is in the context of optimising a subset of the entire scope of the standard, but the sum of many small optimisations can provide big gains in performance. The shortcoming of this process is the possible introduction of a large number of IP items for a gain that some may may well consider not to justify the added IP onus and complexity.

With its MPEG-5 project MPEG is trying to see if a suitably placed lower limit to performance improvements can help solve the problems identified in the HEVC standard.

Conclusions

MPEG has a large number of successful standards. For many of them the unit of measure is billion of units, be they hardware, software and firmware.

MPEG has also had failures. The reasons for these can be manifold. One that is often quoted by outsiders is “the standard was too technology driven”. Of course technology plays an important part in MPEG standards. But then, what should MPEG do? Stop standards that are too technology driven? And how much is too much?

Excluding technology would be a mistake as I will show in two examples. If MPEG had done that, we would not have MP3. In 1992 layer 3 was a costly appendix to Layer 1 and 2 that just did a good job. If we had done that, we would not have Point Cloud Compression (now at CD level), a standard that industry dies for today. Sure, MPEG should establish firmer contacts with market needs, but the necessary expertise can only be provided by companies sending experts to MPEG.

MPEG needs more market, but do not expect that more market will necessarily have miraculous effects. The basic logic that has guided MPEG when making a decision on a standard has been “if there is a legitimate request to have a standard (within the constraints of 50% + 1 approval and 5 countries willing to provide experts to do the work), we do it”. More market information can certainly be useful to articulate a complete proposal and add more evidence at the time shareholders (NBs) vote.

MPEG’s value is in its capability to produce standards that anticipate the needs of the market in a process that may take years from the time an idea is launched to the time the standard is produced. People in our age are volatile and so are the markets. In comparison technology is stable.

Posts in this thread

Thirty years of audio coding and counting

Introduction

Obviously, the electrical representation of sound information happened before the electrical representation of visual information and so did the services that used that representation to distribute sound information. The digital representation of audio, too, happened at different times than video’s. In the early 1980s the Compact Disc (CD) allowed record companies to distribute digital audio for the consumer market, while the D1 digital tape, available in the late 1980’s, was for the exclusive use of professional applications such as in the studio. Compression technologies reversed the order: compressed digital video happened before compressed digital audio by some 10 years. Therefore, unlike the title of the article Forty years of video coding and counting, the title of this post is Thirty years of audio coding and counting.

This statement can become a source of dispute, if a proper definition of Audio is not adopted. In this article by Audio we mean sound in the human audible range not generated by a human phonatory system or for any other sound source for which a sound production model is not available or not used. Indeed digital speech happened in professional applications (trunk network) some 20 years before the CD. ITU-T G.721 “32 kbit/s adaptive differential pulse code modulation (ADPCM)” dates back to 1984, the same year H.120 was approved as a recommendation.

Therefore the title of this article could very well have been Forty years of audio coding and counting. This would have come at the cost of a large number of speech compression standards and this article would have been overwhelmed by them. Therefore this article will only deal with audio compression standards where audio does not include speech. With one exception that will be mentioned later, I mean.

Unlike video compression where ITU-T is the non-MPEG body that develops video coding standards, in audio compression MPEG dominance is total. Indeed ITU-R, who does need audio compression for its digital audio broadcasting standards, prefers to rely on external sources, including MPEG.

MPEG-1 Audio

Those interested in knowing why and how a group – MPEG – working in video compression ended up also working on audio compression (and a few more other things) can look here. The kick off of the MPEG Audio group took place on 1-2 December 1988, when, in line with a tradition that at that time had not been fully established yet, a most diverse group of audio coding experts met in Hannover and kick-started the work that eventually gave rise to the MPEG-1 Audio standard released by MPEG in November 1992.

The Audio group in MPEG is very often the forerunner of things to come. In this instance the first is that while the broadcasting world shunned the low resolution MPEG-1 Video compression standard, it very much valued the MPEG-1 Audio compression standard. The second is that, unlike video, which relied on essentially the same coding architecture, the Audio Call for Proposals had yielded two classes of algorithms, one that was a well established, easy to implement but less performing and the other that was more recent, harder to implement (at that time) but more performing. The work to merge the two technologies was painstaking but eventually the standard included 3 layers (a notion later called profiles) where both technologies were used.

Layer 1 was used in Digital Compact Cassette (DCC), a product discontinued a few years later, Layer 2 was used in audio broadcasting and as the audio component of Video CD (VCD). Layer 3 (MP3) does not need a particular introduction 😉. As revised in the subsequent MPEG-2 effort, MP3 provided a user experience with no perceivable difference as compared to the original CD signal for most content at 128 kbit/s from a CD source of 1.44 Mbit/s, i.e with a compression of 11:1.

MPEG-2 Audio

The main goal of this standard, approved in 1994, was multi-channel audio with the key requirement that an MPEG-1 Audio decoder should be able to decode a stereo component of an MPEG-2 Audio bitstream. Backward compatibility is particularly useful in the broadcasting world because an operator can upgrade to a multi-channel services without losing the customers who only have an MPEG-1 Audio decoder.

MPEG-2 AAC

Work on MPEG-2 Advanced Video Coding (AAC) was motivated by the request of those who wished to provide the best possible audio quality without backward compatibility constraints. This meant that layer 2 must decode both layer 1 and 2, and layer 3 must decode all layers. MPEG-2 AAC, released in April 1997, is built upon the MP3 technology and can provide perceptually transparent audio quality at 128 kbit/s for a stereo signal, and 320 kbit/s for a 5.1 channel signal (i.e. as in digital television).

MPEG-4 AAC

In 1998 MPEG-4 Audio was released with the other 2 MPEG-4 components – Systems and Visual. Again MPEG-4 AAC is built on MPEG-2 AAC. The dominating role of MP3 in music distribution was shaken in 2003 when Apple announced that its iTunes and iPod products would use MPEG-4 AAC as primary audio compression algorithm. Most PCs, smart phones and later tablets could play AAC songs. Far from using AAC as a pure player technology, Apple started the iTunes service that provides songs in AAC format packaged in the MPEG-4 File Format, with filename extension “.m4a”.

AAC-LD

In 1999 MPEG released MPEG-4 amendment 1 with a low delay version of AAC, called Low Delay AAC (AAC-LD). While a typical AAC encoder/decoder has a one-way latency of ~55 ms (transform delay plus look-ahead processing), AAC-LD achieves a one-way latency of only 21 ms by simplifying and replacing some AAC tools (new transform with lower latency and removal of look-ahead processing). AAC-LD can be used as a conversational codec, with a signal bandwidth and perceived quality of a music coder with excellent audio quality at 64 kb/s for a mono signal.

MPEG-4 HE-AAC

In 2003 MPEG released the MPEG-4 High Efficiency Advanced Audio Coding (HE-AAC), as amendment 1 to MPEG-4.  HE-AAC helped to consolidate the role of the mobile handset as the tool of choice to access very good audio quality stereo music at 48 kbit/s, more than a factor of 2.5 better than AAC, for a compression ratio of almost 30:1 relative to the CD signal.

HE-AAC adds the spectral bandwidth replication (SBR) tool to the core AAC compression engine. Since AAC was already widely deployed, this permitted extending this base to HE-AAC by only adding the SBR tool to existing AAC implementations.

MPEG HE-AAC v2

In the same 2003, 9 months later, MPEG released the MPEG HE-AAC v2 profile. This originated from a tools contained in amendment 2 to MPEG-4 (Parametric coding for high-quality audio).  While the core parametric coder did not enjoy wide adoption, the Parametric Stereo (PS) tool in the amendment could very efficiently encode stereo music as a mono signal plus a small amount of side-information.  HE-AAC v2, the combination of PS tool with HE-AAC, enabled transmission of a stereo signal at 32 kb/s with very good audio quality.

This profile was also adopted by 3GPP under the name Enhanced aacPlus. Adoption by 3GPP paved the way for HE-AAC v2 technology to be incorporated into mobile phones.  Today, more than 10 billion mobile devices support streaming and playout of HE-AAC v2 format songs. Since HE-AAC is built on AAC, these phone also support streaming and playout of AAC format songs.

ALS and SLS

In 2005 MPEG released two algorithms for lossless compression of audio, MPEG Audio LosslesS coding (ALS) and Scalable to LosslesS coding (SLS). Both provide perfect (i.e. lossless) reconstruction of a standard Compact Disk audio signal with a compression ratio approximately 2:1. An important feature of SLS is that it has a variable compression ratio: it can compress a stereo signal to 128 kb/s (11:1 compression ratio) with excellent quality as an AAC codec but it can achieve lossless reconstruction with a compression ratio of 2:1 by increasing the coded bitrate (i.e. by decreasing the compression ratio) in a continuous fashion.

MPEG Surround

ALS/SLS were the last significant standards in MPEG-4 Audio, which is MPEG’s most long-lived audio standard. First issued in 1999, 20 years later (in 2019) MPEG is issuing its Fifth Edition.

After closing the “MPEG-4 era,” MPEG created the MPEG-D suite of audio compression standards. The first of these was MPEG Surround, issued in 2007. This technology is a generalised PS of HE-AAC v2 tool in the sense that, MPEG Surround can operate as a 5-to-2 channel compression tool or as an M-to-N channel compression tool. This “generalised PS” tool is followed by a HE-AAC codec. Therefore MPEG Surround builds on HE-AAC as much as HE-AAC builds on AAC. MPEG Surround provides very good compression while maintaining very good audio quality and also low computational complexity. While HE-AAC can transmit stereo at 48 kbit/s, MPEG Surround can transmit 5.1 channel audio within the same 48 kbit/s transmission budget. The complexity is no greater than stereo HE-AAC’s. Hence MPEG Surround is a “drop-in” replacement for stereo services to extend them to 5.1 channel audio!

AAC-ELD

In 2007 MPEG released Enhanced Low Delay AAC (AAC-ELD) technology. This combines tools from other profiles: SBR and PS from HE-AAC v2 profile and AAC-LD. The new codec provides even greater signal compression with only a modest increase in latency: AAC-ELD provides excellent audio quality at 48 kb/s for a mono signal with a one-way latency of only 32 ms.

SAOC

In 2010 MPEG released MPEG-D Spatial Audio Object Coding (SAOC) which allows very efficient coding of a multi-channel signal that is a mix of objects (e.g. individual musical instruments). SAOC down-mixes the multi-channel signal, e.g. stereo to mono, codes and transmits the mono signal along with some side-information, and then up-mixes the received and decoded mono signal back to a stereo signal such that user perceives the instruments to be placed at the correct positions and the resulting stereo signal to be the same as the original. This is done by exploiting the fact that at any instant in time and any frequency region one of the instruments will tend to dominate the others so that in this time/frequency region the other signals will be perceived with much less acuity, if at all. SAOC analyses the input signal, divides each channel into time and frequency “tiles” and then decides to what extent each tile dominates. This is coded as side information.

An example SAOC application is teleconferencing, in which a multi-location conference call can be mixed at the conference bridge down to a single channel and transmitted to each conference participant, along with the SAOC side information. At the user’s terminal, the mono channel is up-mixed to stereo (or 3 channels – Left-Center-Right) and presented such that each remote conference participant is at a distinct location in the front sound stage.

USAC

Unified Speech and Audio Coding (USAC), released in 2011, combines the tools for speech coding and audio coding into one algorithm. USAC combines the tools from MPEG AAC (exploiting the means of human perception of audio) with the tools from a state-of-the-art speech coder (exploit the means of human production of speech). Therefore, the encoder has both a perceptual model and a speech excitation/vocal tract model and dynamically selects the music/speech coding tools every 20 ms. In this way USAC achieves a high level of performance for any input signal, be is music, speech or a mix of speech and music.

In the tradition of MPEG standards, USAC extends the range of “good” performance down to as low as 16 kb/s for a stereo signal and provides higher quality as the bitrate is increased. The quality at 128 kbit/s for a stereo signal is slightly better that MPEG-4 AAC so USAC can replace AAC, because its performance is equal or better than AAC at all bit rates and can similarly code multichannel audio signals, and can also optimally encode speech content.

DRC

MPEG-D Dynamic Range Control (DRC) is a technology that gives the listener the ability to control the audio level. It can be a post-processor for every MPEG audio coding technology and modifies the dynamic range of the decoded signal as it is being played.  It can be used to reduce the loudest part of a movie so as not to disturb your neighbours, to make the quiet portions of the audio louder in hostile audio environments (car, bus, room with many people), to match the dynamics of the audio to that of a smart phone speaker output, which typically has very limited dynamic range. The DRC standard also plays the very important function of normalizing the loudness of the audio output signal, which may be mandated in some regulatory environments.  DRC was released in 2015 and extended in 2017 as Amendment 1 Parametric DRC, gain mapping and equalization tools.

3D Audio

MPEG-H 3D Audio, released in 2015, is part of the typical suite of MPEG tools: Systems, Video and Audio. It provides very efficient coding of immersive audio content, typically from 11 to 22 channels of content. The 3D Audio algorithms can actually process any mix of channels, objects and Higher Order Ambisonics (HOA) content, where objects are single-channel audio whose position can be dynamic in time and HOA can encode an entire sound scene as a multi-channel “HOA coefficient” signal.

Since 3D Audio content is immersive, it is conceived as being consumed as a 360-degree “movie” (i.e. video plus audio). The user sits at the center of a sphere (“sweet spot”) and the audio is decoded and presented so that the user perceives it to be coming from somewhere on the surrounding sphere. MPEG-H 3D audio also can be presented via headphones because not every consumer has an 11 or 22 channel listening space. Moreover MPEG-H 3D Audio supports use of a default or personalised Head Related Transfer Function (HRTF) to allow the listener to perceive the audio content as if it is from sources all around the listener, just as it would be when using loudspeakers. An added feature of 3D Audio playout to headphones, is that the audio heard by the listener can remain at the “correct” position when the user turns his or her head. In other words, a sound that is “straight ahead” when the user is looking straight ahead is perceived as coming from the left if the user turns to look right. Hence, MPEG-H 3D Audio is already a nearly complete solution for Video 360 applications.

Immersive Audio

This activity (to be released as a standard sometime in 2021) is part of the emerging MPEG-I Immersive Audio standard. MPEG is still defining the requirements and functionality of this standard, which will support audio in Virtual and Augmented Reality applications. It will be based on MPEG-H 3D Audio, which already supports a 360 degree view of a virtual world from one listener position (“3 degrees of freedom” or 3DoF) that the listener can move his or her head left, right, up, down or tilted left or right (so-called “yaw, pitch roll”). The Immersive Audio standard will add three additional degrees of freedom, i.e., permit the user to get up and walk around in the Virtual World. This additional movement is designated “x, y, z,” so that MPEG-I Immersive Audio supports 6 degrees of freedom (6 DoF) which are “yaw, pitch roll and x, y, z.” It is envisioned that MPEG-I Immersive Audio will use MPEG-H 3D Audio to compress the audio signals, and will specify additional metadata and technology so that the audio signals can be rendered in a fully flexible 6 DoF way.

Conclusions

MPEG is proud of the work done by the Audio group. For 30 years the group has injected generations of audio coding standards into the market. In the best MPEG tradition, the standards are generic in the sense that can be used in audio-only or audio+video applications and often scalable, with a new generation of audio coding standards building on previous ones.

This long ride is represented in the figure that ventures into the next step of the ride.

Today MPEG Audio already provides a realistic 3DoF experience in combination with MPEG Video standards. More will be needed to provide a complete and rewarding 6DoF experience, but MPEG’s ability to draw the necessary multi-domain expertise from its membership promises that the goal will be successfully achieved.

Acknowledgements

This article would not have been possible without the competent assistance – and memory – of Schuyler Quackenbush, the MPEG Audio Chair.

Posts in this thread (in bold this post)

Is there a logic in MPEG standards?

So far MPEG has developed, is completing or is planning to develop 22 standards for a total of 201 specifications. For those not in MPEG, and even for some active in MPEG, there is natural question: what is the purpose of all these standards? Assuming that the answer to this question is given, a second one pops up: is there a logic in all these MPEG standards?

Depending on the amount of understanding of the MPEG phenomenon, you can receive different answers ranging from

“There is no logic. MPEG started its first standard with a vision of giving the telco and CE industries a single format. Later it exploited the opportunities that that its growing expertise allowed.”

to

“There is a logic. The driver of MPEG work was to extend its vision to more industries leveraging its assets while covering more functionalities.”

I will leave it to the reader to decide where to place their decision on this continuum of possibilities after reading this article that will only deal with the first 5 standards.

MPEG-1

The goal of MPEG-1 was to leverage the manufacturing power of the Consumer Electronics (CE) industry to develop the basic audio and video compression technology for an application that was considered particularly attractive when MPEG was established (1988), namely interactive audio and video on CD-ROM. This was the logic of the telco industry who thought that their future would be “real time audio-visual communication” but did not have a friendly industry to ask to develop the terminal equipment.

The bitrate of 1.5 Mbit/s mentioned in the official title of MPEG-1 Coding of moving pictures and associated audio at up to about 1,5 Mbit/s was an excellent common point for the telecom industry with their ADSL technology whose first generation targeted that bitrate and for the CE industry whose Compact Disc had a throughput of 1.44 Mbit/s (1.2 for the CD-ROM). With that bitrate, compression technology of the late 1980’s could only deal with a rather low, but still acceptable resolution (1/2 the horizontal and 1/2 the vertical resolution obtained by subsampling every other field, so that the input video is progressive), Considering that audio had to be impeccable (that is what humans want), at least 200 kbit/s had to be assigned to audio.

The figure below depicts the model of an MPEG-1 decoder

 

Figure 1 – Model of the MPEG-1 standard

The structure adopted for MPEG-1 set the pattern for most MPEG standards:

  1. Part 1 – Systems specifies how to combine one or more audio and video data streams with timing information to form a single stream (link)
  2. Part 2 – Video specifies the video coding algorithm applied to so-called SIF video of ¼ the standard definition TV (link)
  3. Part 3 – Audio specifies the audio compression. Audio is stereo and can be compressed with 3 different perfomance “layers”: layer 1 is for an entry level digital audio, layer 2 for digital broadcasting and layer 3, aka MP3, for digital music. The MPEG-1 Audio layers were the predecessors of MPEG-2 profiles (and of most subsequent MPEG standards) (link)
  4. Part 4 – Compliance testing (link)
  5. Part 5 – Software simulation (link).

 MPEG-2

MPEG-2 was a more complex beast to deal with. A digitised TV channel can yield 20-24 Mbit/s, depending on the delivery system (terrestrial/satellite broadcasting or cable TV). Digital stereo audio can take 0.2 Mbit/s and standard resolution 4 Mbit/s (say a little less with more compression). Audio could be multichannel (say, 5.1) and hopefully consume less bitrate for a total bitrate of a TV program of 4 Mbit/s. Hence the bandwidth taken by an analogue TV program can be used for 5-6 digital TV programs.

The fact that digital TV programs part of a multiplex may come from independent sources and that digital channels in the real world are subject to errors force the design of an entirely different Systems layer for MPEG-2. The fact that users need to access other data sent in a carousel, that in an interactive scenario (with a return channel) there is a need for session management and that a user may interact with a server forced MPEG to add a new stream for user-to-network and user-to-user protocols.

In conclusion the MPEG-2 model is a natural extension of the MPEG-1 model (superficially, the DSM-CC line, but the impact is more pervasive).

Figure 2 – Model of the MPEG-2 standard

The official title of MPEG-2 is Generic coding of moving pictures and associated audio information. It was originally intended for coding of standard definition television (MPEG-3 was expected to deal with coding of High Definition Television). As the work progressed, however, it became clear that a single format for both standard and high definition was not only desirable but possible. Therefore the MPEG-3 project never took off.

The standard is not specific of a video resolution (this was already the case for MPEG-1 Video) but rationalises the notion of profiles, i.e. assemblies of coding tools and levels a notion that applies to, say, resolution, bitrate etc. Profiles and levels have subsequently adopted in most MPEG standardisation areas.

The standard is composed of 10 parts, some of which are

  1. Part 1 – Systems specifies the Systems layer to enable the transport of a multichannel digital TV stream on a variety of delivery media (link)
  2. Part 2 – Video specifies the video coding algorithm. Video is interlaced and may have a wide range of resolutions with support to scalability and multiview in appropriate profiles (link)
  3. Part 3 – Audio specifies a MPEG-1 Audio backward-compatible multichannel audio coding algorithm. This means that an MPEG-1 Audio decoder is capable of extracting and decoding an MPEG-1 Audio bitstream (link)
  4. Part 6 – Extensions for DSM-CC specifies User-to-User and User-to-Network protocols for both broadcasting and interactive applications. For instance DSM-CC can be used to enable such functionalities as carousel or session set up (link)
  5. Part 7 – Advanced Audio Coding (AAC) specifies a non backward compatible multichannel audio coding algorithm. This was done because backward compatibility imposes too big a penalty for some applications, e.g. those that do not need backward compatibility (link), the first time MPEG was forced to develop two standards for apparently the same applications.

MPEG-4

MPEG-4 had the ambition of bringing interactive 3D spaces to every home. Media objects such as audio, video, 2D graphics were an enticing notion in the mid-1990’s. The WWW had shown that it was possible to implement interactivity inexpensively and the extension to media interactivity looked like it would be the next step. Hence the official title of MPEG-4 Coding of audio-visual objects.

This vision did not become true and one could say that even today it is not entirely clear what is interactivity and what is the interactive media experience a user is seeking, assuming that just one exists.

Is this a signal that MPEG-4 was a failure?

  • Yes, it was a failure, and so what? MPEG operates like a company. Its “audio-visual objects” product looked like a great idea, but the market thought differently.
  • No, it was a success, because 6 years after MPEG-2, MPEG-4 Visual yielded some 30% improvement in terms of compression.
  • Yes, it was a failure because a patent pool dealt a fatal blow with their “content fee” (i.e. “you pay royalties by the amount of time you stream”).
  • No it was a success because MPEG-4 has 34 parts, the largest number ever achieved by MPEG in a standard, that include some of the most foundational and successful standards such as the AAC audio coding format, the MP4 File Format, the Open Font Format and, of course the still ubiquitous Advanced Video Coding AVC video coding format whose success was not dictated so much by the 20% more compression that it delivers compared to MPEG-4 Visual (always nice to have), but to the industry-friendly licence released by a patent pool. Most important, the development of most MPEG standards is driven by a vision. Therefore, users have available a packaged solution, but they can also take the pieces that they need.

Figure 3 – Model of the MPEG-4 standard

An overview of the entire MPEG-4 standard is available here. The standard is composed of 34 parts, some of which are

  1. Part 1 – Systems specifies the means to interactively and synchronously represent and deliver audio-visual content composed of various objects (link)
  2. Part 2 – Visual specifies the coded representation of visual information in the form of natural objects (video sequences of rectangular or arbitrarily shaped pictures) and synthetic visual objects (moving 2D meshes, animated 3D face and body models, and texture) (link).
  3. Part 3 – Audio specifies a multi-channel perceptual audio coder with transparent quality compression of Compact Disc music coded at 128 kb/s that made it the standard of choice for many streaming and downloading applications (link)
  4. Part 6 – Delivery Multimedia Integration Framework (DMIF) specifies interfaces to virtualise the network
  5. Part 9 – Reference hardware description specifies the VHDL representation of MPEG-4 Visual (link)
  6. Part 10 – Advanced Video Coding adds another 20% of performance to part 2 (link)
  7. Part 11 – Scene description and application engine provides a time dependent interactive 3D environment building on VRML (link)
  8. Part 12 – ISO base media file format specifies a file format that has been enriched with many functionalities over the years to satisfy the needs of the multiple MPEG client industries (link)
  9. Part 16 – Animation Framework eXtension (AFX) specifies a range of 3D Graphics technologies, including 3D mesh compression (link)
  10. Part 22 – Open Font Format (OFF) is the result of the MPEG effort that took over an industry initiative (OpenType font format specification), brought it under the folds of international standardisation and expanded/maintained it in response to evolving industry needs (link)
  11. Part 29 – Web video coding (WebVC) specifies the Constrained Baseline Profile of AVC in a separate document
  12. Part 30 – Timed text and other visual overlays in ISO base media file format supports applications that need to overlay other media to video (link)
  13. Part 31 – Video coding for browsers (VCB) specifies a video compression format (unpublished)
  14. Part 33 – Internet Video Coding (IVC) specifies a video compression format (link).

Parts 29, 31 and 33 are the results of 3 attempts made by MPEG to develop Option 1 Video Coding standards with a good performance. All did not reach the goal because ISO rules allow a company to make a patent declaration without specifying which is the patented technology that the declaring company alleges to be affected by a standard. The patented technologies could not be removed because MPEG did not have a clue about which were the allegedly infringing technologies.

MPEG-7

In the late 1990’s the industry had been captured by the vision of “500 hundred channels” and telcos thought they could offer interactive media services. With the then being deployed MPEG-1 and MPEG-2, and with MPEG-4 under development,  MPEG expected that users would have zillions of media items.

MPEG-7 started with the idea of providing a standard that would enable users to find the media content of their interest in a sea of media content. Definitely MPEG-7 deviates from the logic of the previous two standards and the technologies used reflect that because it provides formats for data (called metadata) extracted from multimedia content to facilitate searching in multimedia items. As shown in the figure, metadata can be classified as Descriptions (metadata extracted from the media items, especially audio and video) and Description Schemes (compositions of descriptions). The figure also shows two additional key MPEG-7 technologies. The first is the Description Definition Language (DDL) used to define new Descriptors and the second id XML Compression. With Descriptions and Description Schemes represented in verbose XML, it is clear that MPEG needed a technology to effectively compress XML.

 

Figure 4 –Components of the MPEG-7 standard

 An overview of the entire MPEG-7 standard is available here. The official title of MPEG-7 is Multimedia content description interface and the standard is composed of 16 parts, some of which are:

  1. Part 1 – Systems has similar functions as the parts 1 of previous standards. In addition, it specifies a compression method for XML schemas used to represent MPEG-7 Descriptions and Description Schemes.
  2. Part 2 – Description definition language breaks the traditional Systems-Video-Audio sequence of previous standards and provides a language to define Descriptors and Description Schemes (link)
  3. Part 3 – Visual specifies a wide variety of visual descriptors such as colour, texture, shape, motion etc. (link)
  4. Part 4 – Audio specifies a wide variety of audio descriptors such as signature, instrument timbre, melody description, spoken content description etc. (link)
  5. Part 5 – Multimedia description schemes specifies description tools other than visual and audio ones, i.e. generic and multimedia description tools, such as the description of structural aspects of the content (link)
  6. Part 8 – Extraction and use of MPEG-7 descriptions explains how MPEG-7 descriptions can be practically extracted and used
  7. Part 12 – Query format defines a format to query multimedia repositories (link)
  8. Part 13 – Compact descriptors for visual search specifies a format that can be used to search images (link)
  9. Part 15 – Compact descriptors for video analysis specifies a format that can be used to analyse video clips (link).

 MPEG-21

In the year 1999 MPEG understood that its technologies were having a disruptive impact on the media business. MPEG thought that the industry should not fend off the new threat with old repressive tools. The industry should convert the threat into an opportunity, but there were no standard tools to do that.

MPEG-21 is the standard resulting from the effort by MPEG to create a framework that would facilitate electronic commerce of digital media. It is a suite of specifications for end-to-end multimedia creation, delivery and consumption that can be used to enable open media markets.

This is represented in the figure below. The basic MPEG-21 element is the Digital Item, a structured digital object with a standard representation, identification and metadata, around which a number of specifications were developed. MPEG-21 also includes specifications of Rights and Contracts and basic technologies such as the file format.

Figure 5 – Components of the MPEG-21 standard
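
A hedged sketch of the Digital Item concept is given below in Python. The field names are hypothetical and do not reproduce the actual DIDL schema; they only mirror the idea of a structured object that bundles identification, metadata, resources and rights information.

# Illustrative sketch only: field names are hypothetical and do not reproduce
# the actual MPEG-21 schemas; they only mirror the idea of a Digital Item as a
# structured object with identification, metadata, resources and rights.
from dataclasses import dataclass, field
from typing import List, Dict

@dataclass
class Resource:
    uri: str           # where the actual media sits
    mime_type: str     # e.g. "video/mp4"

@dataclass
class DigitalItem:
    identifier: str                                            # identification (Part 3)
    metadata: Dict[str, str] = field(default_factory=dict)     # descriptive metadata
    resources: List[Resource] = field(default_factory=list)    # the media itself
    rights_expression: str = ""                                # e.g. an REL/CEL statement (Parts 5/20)

item = DigitalItem(
    identifier="urn:example:item:0001",
    metadata={"title": "Concert recording", "creator": "Example Band"},
    resources=[Resource("https://example.com/concert.mp4", "video/mp4")],
    rights_expression="play: non-commercial only",
)
print(item.identifier, len(item.resources), "resource(s)")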

An overview of the entire MPEG-21 standard, whose official title is Multimedia Framework, is available here. Some of the 21 MPEG-21 parts are briefly described below:

  1. Part 2 – Digital Item Declaration specifies the Digital Item (link)
  2. Part 3 – Digital Item Identification specifies identification methods for Digital Items and their components (link)
  3. Part 4 – Intellectual Property Management and Protection (IPMP) Components specifies how to include management and protection information and protected parts in a Digital Item (link)
  4. Part 5 – Rights Expression Language specifies a language to express rights (link)
  5. Part 6 – Rights Data Dictionary specifies a dictionary of rights-related data (link)
  6. Part 7 – Digital Item Adaptation specifies description tools to enable optimised adaptation of multimedia content (link)
  7. Part 15 – Event Reporting specifies a format to report events (link)
  8. Part 17 – Fragment Identification of MPEG Resources specifies a syntax for URI Fragment Identifiers (link)
  9. Part 19 – Media Value Chain Ontology specifies an ontology for Media Value Chains (link)
  10. Part 20 – Contract Expression Language specifies a language to express digital contracts (link)
  11. Part 21 – Media Contract Ontology specifies an ontology for media-related digital contracts (link).

 Conclusions

The standards from MPEG-1 to MPEG-21 contain 86 specifications covering the entire 30 years of MPEG activity. They should give a rough idea of how MPEG started from the vision of single standards for all industries belonging to what we can call today the “media industry” and has kept on adapting – without disowning – that vision. The original vision has been a seed that has grown – and continues to grow – into a tree. MPEG keeps track of the evolution of technologies to provide more efficient standards and responds to the needs of the industry with refurbished old and brand new standards.


Forty years of video coding and counting

Introduction

For about 150 years, the telephone service has provided a socially important communication means to billions of people. For at least a century the telecom industry wanted to offer a more complete user experience (as we would call it today) by adding the visual to the speech component.

Probably the first large-scale attempt at offering such an audio-visual service was AT&T’s PicturePhone in the mid 1960’s. The service was eventually discontinued, but the idea of expanding the telephone service with a video service caught the attention of telephone companies. Many expected that digital video-phone or video-conference services on the emerging digital networks would guarantee the success that the PicturePhone service did not have, and research in video coding was funded in many research labs of the telephone companies.

This article tells the story of how this original investment, seconded by other industries, gave rise to the ever improving digital video experience that our generation is enjoying in ever greater numbers.

First Video Coding Standard

The first international standard that used video coding techniques – ITU-T Recommendation H.120 – originated from the European research project called COST 211. Intended for video-conference services, especially on satellite channels, H.120 was approved in 1984 and implemented in a limited number of units.

Second Video Coding Standard

The second international standard that used video coding techniques – ITU-T Recommendation H.261 – was intended for audio-visual services and was approved in 1990. It signaled the maturity of video coding standardisation, which left behind the old, less efficient algorithms and entered the DCT/motion compensation age.

For several reasons H.261 was implemented by a limited number of manufacturing companies for a limited number of customers.

Third Video Coding Standard

Television broadcasting has always been – and, with challenges, continues to be today – a socially important communication tool. Unlike audio-visual services, which were mostly a strategic target of the telecom industry, television broadcasting in the 1980’s was a thriving industry served by the Consumer Electronics (CE) industry, which provided devices to hundreds of millions of consumers.

ISO MPEG-1, the third international standard that used video coding techniques, was intended for interactive video applications on CD-ROM and was approved by MPEG in November 1992. Besides the declared goal, the intention was to popularise video coding technologies by relying on the manufacturing prowess of the CE industry. MPEG-1 was the first example of a video coding standard developed by two industries that until that time had had very little in common: telecom and CE (terminals for the telecom market were developed by a specialised industry with little contact with the CE industry).

Fourth Video Coding Standard

Even though in the late 1990’s MPEG-1 Video eventually reached 1 billion units sold under the nickname “Video CD”, especially in the Far East, the big game started with the fourth international standard that used video coding techniques – ISO MPEG-2 – whose original target was “digital television”. The number of industries interested in it made MPEG crowded: telecom had always sought to have a role in television; CE was obviously interested in having existing analogue TV sets replaced by shining digital TV sets, or at least supplemented by a set top box; satellite and cable broadcasters were very keen on the idea of hundreds of TV programs in their bouquets; terrestrial broadcasters had different strategies in different regions but eventually joined, as did the package media sector of the CE industry, with its tight contacts with the movie industry. This explains why the official title of MPEG-2 is “Generic coding of moving pictures and associated audio information”: it signals the fact that MPEG-2 could be used by all the industries that, at that time, had an interest in digital video – a unique feat in the industry.

Fifth and Sixth Video Coding Standards

Remarkably, MPEG-2 Video (and Systems) was a standard jointly developed by MPEG and ITU-T. The world, however, follows the dictum of the Romance of the Three Kingdoms (三國演義): 話說天下大勢.分久必合,合久必分. Adapted to the context, this can be translated as: in the world, things divided for a long time shall unite, and things united for a long time shall divide. So the MPEG and ITU paths divided in the following phase. ITU-T developed its own H.263 Recommendation “Video coding for low bit rate communication” and MPEG developed its own MPEG-4 Visual standard, part 2 “Coding of audio-visual objects”. The only conjunction of the two standards is a very short code that simply tells the decoder whether a bitstream is H.263 or MPEG-4 Visual. A lot of coding tool commonality exists, but not at the bitstream level.

H.263 focused on low bitrate video communication, while MPEG-4 Visual kept pursuing the vision of extending video coding to more industries, this time Information Technology and Mobile. MPEG-4 Visual was released in 2 versions in 1999 and 2000, while H.263 went through a series of updates documented in a series of Annexes to the H.263 Recommendation. H.263 enjoyed some success thanks to the common belief that it was “royalty free”, while MPEG-4 Visual suffered a devastating blow from a patent pool that decided to impose “content fees” in its licensing terms.

Seventh Video Coding Standard

The year 2001 marked the return to the first half of the Romance of the Three Kingdoms’ dictum: 分久必合 (things divided for a long time shall unite), even though it was not too 久 (long) since they had divided, certainly not on the scale intended by the Romance of the Three Kingdoms. MPEG and ITU-T (through its Video Coding Experts Group – VCEG) joined forces again in 2001 and produced the seventh international standard in 2003. The standard is called Advanced Video Coding by both MPEG and ITU, but is labelled AVC by MPEG and H.264 by ITU-T. Reasonable licensing terms (of course always considered unreasonable by licensees) ensured AVC’s long-lasting success in the market place, a success that continues to this day (for another 4 years and 3 months, I mean).

Eighth Video Coding Standard

The eighth international standard that used video coding techniques stands by itself because it is not a standard with “new” video coding technologies, but a standard that enables a receiver to build a decoder matching the bitstream, using standardised tools represented in a standard form available at the decoder. The technique, called Reconfigurable Video Coding (RVC) or, more generally, Reconfigurable Media Coding (RMC) because MPEG has applied the same technology to 3D Graphics Coding, is enabled by two standards: ISO/IEC 23001-4 Codec configuration representation and ISO/IEC 23002-4 Video tool library (VTL). The former defines the methods and general principles to describe codec configurations. The latter describes the MPEG VTL and specifies the Functional Units that are required to build a complete decoder for the following standards: MPEG-4 Simple Profile, AVC Constrained Baseline Profile and Progressive High Profile, MPEG-4 SC3DMC, and HEVC Main Profile.
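
The following Python sketch illustrates the idea, under the assumption that a plain dictionary of callables can stand in for the Video Tool Library and a list of names for the configuration; the actual standards describe Functional Units in the RVC-CAL dataflow language and networks of units in a dedicated network description language.

# Illustrative sketch only: plain Python callables stand in for the Functional
# Units of the Video Tool Library, and a list of names stands in for the codec
# configuration that accompanies the bitstream.
from typing import Callable, Dict, List

# A toy "video tool library": each entry is a stage that transforms the data.
TOOL_LIBRARY: Dict[str, Callable[[list], list]] = {
    "entropy_decode": lambda data: data,                    # placeholder stages
    "inverse_quantise": lambda data: [x * 2 for x in data],
    "inverse_transform": lambda data: [x + 1 for x in data],
}

def build_decoder(configuration: List[str]) -> Callable[[list], list]:
    """Assemble a decoder by chaining the functional units named in the configuration."""
    stages = [TOOL_LIBRARY[name] for name in configuration]
    def decode(bitstream: list) -> list:
        for stage in stages:
            bitstream = stage(bitstream)
        return bitstream
    return decode

decoder = build_decoder(["entropy_decode", "inverse_quantise", "inverse_transform"])
print(decoder([1, 2, 3]))   # -> [3, 5, 7]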

Ninth Video Coding Standard

In 2010 MPEG and VCEG extended their collaboration to a new project: High Efficiency Video Coding (HEVC). A few months after the HEVC FDIS had been released, the HEVC Verification Tests showed that the standard had achieved a 60% improvement over AVC, 10% more than originally planned. Since then HEVC has been enriched with a number of features not supported by previous standards, such as High Dynamic Range (HDR) and Wide Colour Gamut (WCG), and support for Screen Content and omnidirectional video (video 360). Unfortunately, technical success did not translate into market success because adoption of HEVC is still hampered – 6 years after its approval by MPEG – by an unclear licensing situation. An analysis of the reasons for the current stalemate, and possible remedies, can be found in IP counting or revenue counting?, Business model based ISO/IEC standards, Can MPEG overcome its Video “crisis”? and A crisis, the causes and a solution.

Tenth Video Coding Standard

ISO, IEC and ITU share a common policy vis-à-vis patents in their standards. Using a few imprecise but clear words (where a patent attorney would use many precise but unclear words), the policy is: it is good if a standard has no patents or if the patent holders allow use of their patents for free (Option 1); it is tolerable if a standard has patents but the patent holders allow use of their patents on fair, reasonable and non-discriminatory terms (Option 2); it is not permitted to have a standard with patents whose holders do not allow use of their patents (Option 3).

The target of MPEG standards until AVC had always been “best performance no matter what IPR is involved” (provided, of course, that the IPR holders allow its use), but as the use of AVC extended to many domains, it was becoming clear that there was so much “old” IP (i.e. more than 20 years old) that it was technically possible to make a standard whose IP components were all Option 1.

In 2013 MPEG released the FDIS of WebVC, strictly speaking not a new standard because MPEG had simply extracted what was the Constrained Baseline Profile of AVC and made it a separate standard with the intention of making it Option 1. The attempt failed because some companies confirmed, for WebVC, the Option 2 patent declarations they had already made against the AVC standard.

Eleventh Video Coding Standard

WebVC has not been the only effort made by MPEG to develop an Option 1 video coding standard (i.e. a standard for which only Option 1 patent declarations have been made). A second effort, called Internet Video Coding (IVC), was concluded in 2017 with the release of the IVC FDIS. Verification Tests showed that the performance of IVC exceeded that of the best profile of AVC, by then a 14-year-old standard. However, three companies made Option 2 patent declarations that did not contain any detail, so MPEG could not remove the technologies in IVC that the companies claimed infringed their patents.

Twelfth Video Coding Standard

MPEG achieved a different result with its third attempt at developing an Option 1 video coding standard. The proposal made by a company in response to an MPEG Call for Proposals was reviewed by MPEG and achieved FDIS with the name of Video Coding for Browsers (VCB). However, a company made an Option 3 patent declaration that, like those made against IVC, did not contain any detail that would enable MPEG to remove the allegedly infringing technologies. Eventually ISO did not publish VCB.

Today ISO and IEC have removed the possibility for companies to make Option 3 patent declarations without details (a possibility that ITU had never allowed). However, as the VCB approval process has been completed, the study of VCB cannot be resumed unless MPEG restarts the process. VCB is therefore likely to remain unpublished and hence not to become an ISO standard.

Thirteenth Video Coding Standard

For the third time MPEG and ITU are collaborating in the development of a new video coding standard with the target of a 50% reduction of bitrate compared to HEVC. The development of Versatile Video Coding (VVC), as the new standard is called, is still under way and involves close to 300 experts attending VVC sessions. MPEG expects to reach the FDIS of Versatile Video Coding (VVC) in October 2020.

Fourteenth Video Coding Standard

Thirteen is a large number for video coding standards, but it should be measured against the number of years covered – close to 40. In this long period of time we have gone from 3 initial standards that were mostly application/industry-specific (H.120, MPEG-1 and H.261) to a series of generic (i.e. industry-neutral) standards (MPEG-2, MPEG-4 Visual, MPEG-4 AVC and HEVC) and then to a group of standards that sought to achieve Option 1 status (WebVC, IVC and VCB). Other proprietary video coding formats that have found significant use in the market point to the fact that MPEG cannot stay forever in its ivory tower of “best video coding standards no matter what”. MPEG has to face the reality of a market that becomes more and more diversified and where – unlike in the golden age of a single coding standard – one size no longer fits all.

At its 125th meeting MPEG reviewed the responses to its Call for Proposals for a new video coding standard that sought proposals with a simplified coding structure and an accelerated development time of 12 months from working draft to FDIS. The new standard will be called MPEG-5 Essential Video Coding (EVC) and is expected to reach FDIS in January 2020.

The new video coding project will have a base layer/profile, which is expected to be Option 1, and a second layer/profile whose performance is already ~25% better than HEVC. Licensing terms are expected to be published by patent holders within 2 years.

VCEG has decided not to work with MPEG on this coding standard. Are we back to the 合久必分 (things united for a long time shall divide) situation? This is only half true, because the MPEG-VCEG collaboration on VVC is continuing. In any case VVC is expected to provide 50% better compression performance than HEVC.

Fifteenth Video Coding Standard

If there was a need to prove that there is no longer “one size fits all” in video coding, just look at the Call for Proposals for a “Low Complexity Video Coding Enhancements” standard issued by MPEG. This Call is not for a “new video codec”, but for a technology capable of extending the capabilities of an existing video codec. A typical usage scenario is the addition of, say, high definition capability to set top boxes (typically deployed by the millions) that cannot be recalled. Proposals are due at the March 2019 meeting and FDIS is expected in April 2020.
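
The general idea can be sketched as follows, with the caveat that this is only a conceptual illustration and not the actual tool set under development: the legacy codec keeps decoding the base layer, which is then upscaled, and a low-cost enhancement layer adds a correction on top.

# Illustrative sketch only: mirrors the general enhancement-layer idea (upscale
# the base-layer output, then add a decoded enhancement residual); it is not
# the actual design of the standard under development.
import numpy as np

def upscale2x(frame: np.ndarray) -> np.ndarray:
    """Nearest-neighbour 2x upscaling of the base-layer decoded frame."""
    return frame.repeat(2, axis=0).repeat(2, axis=1)

# Pretend these come from the legacy decoder and from the enhancement bitstream.
base_decoded = np.array([[100, 110], [120, 130]], dtype=np.int16)   # low resolution
enhancement_residual = np.zeros((4, 4), dtype=np.int16)
enhancement_residual[0, 0] = 5                                      # small correction

enhanced = upscale2x(base_decoded) + enhancement_residual
print(enhanced)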

Sixteenth Video Coding Standard

Point Clouds are not really the traditional “video” content as we know it, namely sequences of “frames” at a frequency sufficiently high to fool the eye into believing that the motion is natural. In point clouds, motion is given by dynamic point clouds that represent the surface of objects moving in the scene. For the eye, however, the end result is the same: moving pictures displayed on a 2D surface, whose objects can be manipulated by the viewer (this, however, requires a system layer that MPEG is already developing).

MPEG is working on two different technologies: the first uses HEVC to compress projections of portions of a point cloud (and is therefore well suited for entertainment applications, because it can rely on an existing HEVC decoder), and the second uses computer graphics technologies (and is currently more suited to automotive applications). The former will achieve FDIS in January 2020 and the latter in April 2020.
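
A minimal sketch of the projection idea behind the first technology is given below; the real design packs patches of geometry and texture into video frames for an HEVC encoder, while here a single orthographic depth map stands in for that process and all names are illustrative.

# Illustrative sketch only: the projection-based approach turns 3D points into
# 2D images that an existing video codec (e.g. HEVC) can compress; here a
# single orthographic depth map stands in for the patch-packing process.
import numpy as np

points = np.array([            # a tiny point cloud: (x, y, z) coordinates
    [0, 0, 3],
    [1, 0, 5],
    [0, 1, 2],
    [1, 1, 4],
])

depth_map = np.zeros((2, 2), dtype=np.int16)   # project onto the x-y plane
for x, y, z in points:
    # keep the nearest point if several project onto the same pixel
    if depth_map[y, x] == 0 or z < depth_map[y, x]:
        depth_map[y, x] = z

print(depth_map)   # a 2D image a video encoder could now compress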

Seventeenth and Eighteenth Video Coding Standards

Unfortunately, the crystal ball gets blurred as we move into the future. Therefore MPEG is investigating several technologies capable of providing solutions for alternative immersive experiences. After providing HEVC and OMAF for 3DoF experiences (where the user can only have roll, pitch and yaw movements of the head), MPEG is working on OMAF v2 for 3DoF+ experiences (where the user can have a limited translation of the head). A Call for Proposals has been issued, responses are due in March 2019 and the FDIS is expected in July 2020. Investigations are being carried out on 6DoF (where the user can have full translation of the head) and on light fields.
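
For readers unfamiliar with the DoF jargon, the sketch below expresses the difference in terms of a viewer pose; the structure is purely illustrative and not part of any MPEG specification.

# Illustrative sketch only: a hypothetical viewer pose showing which degrees of
# freedom each category allows.
from dataclasses import dataclass

@dataclass
class ViewerPose:
    yaw: float = 0.0     # head rotation: available in 3DoF, 3DoF+ and 6DoF
    pitch: float = 0.0
    roll: float = 0.0
    tx: float = 0.0      # head translation: none in 3DoF, limited in 3DoF+,
    ty: float = 0.0      # unrestricted in 6DoF
    tz: float = 0.0

def clamp_translation(pose: ViewerPose, limit: float) -> ViewerPose:
    """Constrain translation to a small range, as a 3DoF+ renderer might."""
    c = lambda v: max(-limit, min(limit, v))
    return ViewerPose(pose.yaw, pose.pitch, pose.roll, c(pose.tx), c(pose.ty), c(pose.tz))

print(clamp_translation(ViewerPose(tx=0.9), limit=0.1))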

Conclusions

The last 40 years have seen digital video converted from a dream into a reality that involves billions of users every day. This long ride is represented in the figure that ventures into the next steps of the ride.

MPEG keeps working to make sure that manufacturers and content/services providers have access to more and better standard visual technologies for an increasingly diversified market of increasingly demanding users.
