Quality, more quality and more more quality

Quality measurement is an essential ingredient of the MPEG business model that targets the development of the best performing standards that satisfy given requirements.

MPEG was certainly not the first to discover the importance of media quality assessment. Decades ago, when still called Comité Consultatif International des Radiocommunications (CCIR), ITU-R developed Recommendation 500, “Methodologies for the subjective assessment of the quality of television images”. This recommendation guided the work of television labs for decades. It was not possible, however, to satisfy all MPEG needs with BT.500, the modern name of CCIR Recommendation 500, for three main reasons: MPEG needed methods to assess the impact of coding on video quality, MPEG dealt with a much wider range of moving pictures than television, and MPEG ended up dealing with more than just 2D rectangular moving pictures.

Video quality assessment in MPEG began in November 1989 at the research laboratories of JVC in Kurihama, when all aspects of the responses to the MPEG-1 Call for Proposals (CfP), including quality, were considered. Two years later MPEG met again in Kurihama to consider the responses to the MPEG-2 CfP. At that time video quality was assessed with the so-called Double Stimulus Impairment Scale (DSIS) method, using a 5-grade impairment scale. In both tests massive use of digital D1 tapes was made to deliver undistorted digital video to the test facility. The Test subgroup, led by its chair Tsuneyoshi Hidaka, managed all the logistics of D1 tapes coming from the 4 corners of the world.

The MPEG Test chair was able to convince the JVC management to offer free use of the testing facilities for MPEG-1. However, he could not achieve the same for MPEG-2. Therefore MPEG-2 respondents were asked to pay for the tests. Since then participation in most, if not all, subjective test campaigns has been subject to the payment of a fee to cover the use of facilities and/or the human subjects who were asked to view the video sequences under test. The MPEG-1 and MPEG-2 tests were carried out in the wake of Recommendation BT.500.

The MPEG-4 tests, carried out in 1995, fundamentally changed the scope because the CfP addressed multimedia content, i.e. progressively scanned moving images, typically at lower resolution than TV, that were supposed to be transmitted over noisy channels (videophone over fixed subscriber line or the nascent mobile networks). The statistical processing of the subjective data collected for the MPEG-4 CfP introduced the use of ANOVA (analysis of variance). Until then tests had only used the simple mean value and the Grand Mean, i.e. the mean value computed over the scores assigned to several video sequences.
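As an illustration of what ANOVA adds over a plain mean, the sketch below computes the F statistic of a textbook one-way ANOVA over the scores given to several codecs. It is a minimal sketch and not the procedure MPEG actually applied, whose details are not given here; the function name `one_way_anova_f` and the sample scores are hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of score lists, one per technology under test.
    This is the generic textbook formula, not MPEG's actual analysis.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 groups and more scores than groups")

    grand_mean = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]

    # Variation explained by differences between the group means.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Residual variation of the scores around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical scores for three codecs on the same material.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat, df_b, df_w)
```

A large F value, compared against the critical value of the F distribution for the two degrees of freedom, indicates that at least one technology differs from the others by more than the scatter of the viewers' scores would explain.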

The use of Statistically Significant Difference (SSD) allowed a precise ranking of the technologies under test. Traditional test methods (DSIS and Single Stimulus, SS) were used together with the new Single Stimulus Continuous Quality Evaluation (SSCQE) test method to evaluate “long” video sequences of 3 minutes and measure how well a video compression technology could recover from transmission errors. The tests were carried out using the D1 digital professional video recorder and Professional Studio Quality “grade 1” CRT displays.

The Digital Cinema test, carried out in 2001 at the Entertainment Technology Centre (ETC) of the University of Southern California, was designed to evaluate cinematic content in a real theatrical environment, i.e. on a 20 m base perforated screen, projected by a cinema projector fed with digital content. The subjective evaluations were done with three new test methods: the Expert Viewing Test (EVT), a two-step procedure in which the results of a DSIS test were refined by means of careful observation by a selected number of “golden eye” observers; the Double Stimulus Perceived Difference Scale (DSPDS), a double stimulus impairment detection test method using a 5-grade impairment scale; and the Double Stimulus Split-Screen Perceived Difference Scale (S3PDS), a test method based on a split-screen approach where both halves of the screen were observed in sequence.

The test for the Call for New Tools to Further Improve Coding Efficiency was done using traditional test methods and the same methodology and devices as the MPEG-4 Call for Proposals. The test demonstrated the existence of a new technology in video compression and allowed the collaboration between ISO and ITU-T in the area of digital video coding to resume. This was the first test to use the 11-grade impairment scale, which became a reference for the DSIS and SS test experiments and provided a major improvement in result accuracy.

A new test method, the VSMV-M Procedure, was designed in 2004 to assess the submissions received for the Core Experiment on Scalable Video Coding. The Procedure was made of two phases: a “controlled assessment” phase and a “deep analysis” phase. The first phase followed the DSIS and SS test methods. In the second phase, designed by MPEG, a panel of experts confirmed the ranking obtained from the formal subjective assessment. These tests were the first to be entirely based on digital video servers and DLP projectors. Therefore, 15 years after they were first used in the MPEG-1 tests, D1 tapes were finally put to rest.

The SVC Verification Tests, carried out in 2007, represented another important step in the evolution of the MPEG testing methodology. Two new test methods were designed: the Single Stimulus Multi-Media (SSMM) and the Double Stimulus Unknown Reference (DSUR). The SSMM method minimised the contextual effect typical of the Single Stimulus (SS) method. The DSUR method was derived from the Double Stimulus Impairment Scale (DSIS) Variant II and brought some of the advantages of the Double Stimulus Continuous Quality Scale (DSCQS) method into the DSIS method, while avoiding the tricky and difficult data processing of DSCQS.

The Joint Call for Proposals on Video Compression Technology (HEVC) covered 5 different classes of content, with resolutions ranging from WQVGA (416×240) to 2560×1600, in two configurations (low delay and random access) for different classes of target applications. It was a very large test effort: it covered a total of 29 submissions, lasted 4 months and involved 3 laboratories, which assessed more than 5000 video files and hired more than 2000 non-expert viewers. The ranking of submissions was done considering the Mean Opinion Score (MOS) and Confidence Interval (CI) values. A procedure was introduced to check that the results provided by different test laboratories were consistent. The results of the three laboratories included a common test set that made it possible to measure the impact of a laboratory on the results of a test experiment.
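The MOS and CI values used for ranking can be sketched as follows. This is a minimal illustration that assumes the usual 95% interval of the form mean ± 1.96·s/√N, with s the sample standard deviation; the function names, the overlap rule and the sample scores are hypothetical and are not taken from the actual HEVC test plan.

```python
import math
from statistics import mean, stdev


def mos_and_ci(scores, z=1.96):
    """Return (MOS, CI half-width) for one test condition.

    `scores` holds one opinion score per viewer. z=1.96 gives an
    approximate 95% confidence interval (an assumption of this sketch).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mos = mean(scores)
    half_width = z * stdev(scores) / math.sqrt(n)
    return mos, half_width


def clearly_different(scores_a, scores_b):
    """True when the two confidence intervals do not overlap.

    Non-overlap is a simple, conservative way to decide that two
    submissions can be ranked apart; it is not a formal test.
    """
    mos_a, ci_a = mos_and_ci(scores_a)
    mos_b, ci_b = mos_and_ci(scores_b)
    return abs(mos_a - mos_b) > ci_a + ci_b


# Hypothetical scores from five viewers for two submissions.
submission_a = [8, 7, 9, 8, 8]
submission_b = [3, 4, 3, 2, 3]
print(mos_and_ci(submission_a))
print(clearly_different(submission_a, submission_b))
```

When the intervals of two submissions overlap, their MOS values alone do not justify placing one above the other, which is why the ranking considers both numbers.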

A total of 24 complete submissions were received in response to the Joint Call for Proposals on 3D Video Coding (stereo and auto-stereo) issued in 2012. For each test case each submission produced 24 files representing the different viewing angles. Two sets of two and three viewing angles were blindly selected to synthesise the stereo and auto-stereo test files. The test was done on standard 3D displays (with glasses) and autostereoscopic displays. A total of 13 test laboratories took part in the test, running a total of 224 test sessions and hiring around 5000 non-expert viewers. The test applied a full redundancy scheme, where each test case was run by two laboratories to increase the reliability and the accuracy of the results. The ranking of the submissions was done considering the MOS and CI values. This test represented a further improvement in the control of the performance of each test laboratory. The test could ensure full result recovery in the case of failure of up to 6 out of 13 testing laboratories.

The Joint CfP for Coding of Screen Content was issued to extend the HEVC standard in order to improve the coding performance of typical computer screen content. When it became clear that the set of test conditions defined in the CfP was not suitable to obtain valuable results, the test method was changed from the original “side by side” scheme to a sequential presentation scheme. The complexity of the test material led to the design of an extremely accurate and lengthy training of the non-expert viewers. Four laboratories participated in the formal subjective assessment test, assessing and ranking the seven responses to the CfP. More than 30 test sessions were run (including the “dry-run” phase), hiring around 250 non-expert viewers.

The CfP on Point Cloud Coding was issued to assess coding technologies for 3D point clouds. MPEG had no experience (in fact, no one had) in assessing the visual quality of point clouds. MPEG projected the 3D point clouds to 2D spaces and evaluated the resulting 2D video according to formal subjective assessment protocols. The video clips were produced using a rendering tool that generated two different video clips for each of the received submissions, under the same creation conditions. The two clips were rotating views of 1) a fixed synthesised image and 2) a moving synthesised video clip. The rotations were blindly selected.

The CfP for Video Compression with Capability beyond HEVC included three test categories, for which different test methods had to be designed. The Standard Dynamic Range category was a compression efficiency evaluation process where the classic DSIS test method was applied with good results. The High Dynamic Range category required two separate sessions, according to the peak luminance of the video content taken into account, i.e. below (or equal to) 1K nits and above 1K nits (namely 4K nits); in both cases the DSIS test method was used. The quality of the 360° category was assessed in a “viewport” extracted at HD resolution from the whole 360° screen.

When the test was completed, the design of the 36 “SDR”, 14 “HDR” and 8 “360°” test sessions was verified. For each test session the distribution of the raw quality scores assigned during each session was analysed to verify that the level of visual quality across the many test sessions was equally distributed.

This was a long but still incomplete review of 30 years of subjective visual quality assessment in MPEG. This ride across 3 decades should demonstrate that MPEG draws from established knowledge to create new methods suited to obtaining the results MPEG is seeking. It should also show the level of effort involved in actually assigning tasks, coordinating the work and producing integrated results that provide the responses. Most important is the level of human participation involved: 2000 people (non-experts) for the HEVC tests!


Many thanks to the MPEG Test chair Vittorio Baroncini for providing the initial text of this article. Many parts of the activities described here were conducted by him as Test chair.
