Genome is digital, and can be compressed

Introduction

The well-known double helix carries the DNA of living beings. Human DNA contains about 3.2 billion nucleotide base pairs, each represented by one of four symbols (A, G, C, T). Today's high-speed sequencing machines make it possible to “read” the DNA. The resulting file contains millions of “reads”, short segments of symbols typically all of the same length, and weighs an unwieldy few terabytes.
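Since there are only four symbols, each one needs no more than 2 bits. The sketch below packs four symbols into a byte to show the arithmetic; the symbol-to-bit mapping and the function names are assumptions made for illustration and are not taken from MPEG-G.

```python
# Hypothetical 2-bit code for the four nucleotide symbols (illustration only).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
SYMBOLS = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a nucleotide string into bytes, four symbols per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for symbol in chunk:
            byte = (byte << 2) | CODE[symbol]
        # Left-align a final partial chunk so unpack() reads every byte the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` symbols from packed bytes."""
    symbols = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            symbols.append(SYMBOLS[(byte >> shift) & 0b11])
    return "".join(symbols[:length])


def packed_size_bytes(n_symbols: int) -> int:
    """Bytes needed to store n_symbols at 2 bits per symbol."""
    return (n_symbols + 3) // 4


print(unpack(pack("ACGTA"), 5))          # ACGTA
print(packed_size_bytes(3_200_000_000))  # 800000000
```

At 2 bits per symbol, 3.2 billion symbols fit in about 800 megabytes, far less than the few terabytes of a sequencing file. The difference hints at how much redundancy the reads carry.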

The upcoming MPEG-G standard, developed jointly by MPEG and ISO TC 276 Biotechnology, will reduce the size of the file without loss of information by exploiting the inherent redundancy of the reads, and at the same time will make the information in the file more easily accessible.

This article provides some context, and explains the basic ideas of the standard and the benefits it can yield to those who need to access genomic information.

Reading the DNA

There are two main obstacles preventing a direct use of files from sequencing machines: the position of a read on the DNA sample is unknown and the value of each symbol of the read is not entirely reliable.

The picture below represents 17 reads with a read length of 15 nucleotides. These have been aligned to a reference genome (first line). Reads with a higher number start further down in the reference genome.

Reading column-wise, we see that in most cases the symbols are exactly those of the reference genome. A single difference (represented by isolated red symbols) may be caused by a read error, while an almost completely different column (most symbols in red) may be caused by the fact that 1) a given DNA is unlikely to be exactly equal to a reference genome or 2) the person with this particular DNA may have health problems.
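The column-wise reading can be sketched in a few lines. The example below counts, for every position of a toy reference, how many reads cover it and how many of them disagree with the reference. The reference, the reads, the 0.5 threshold and the labels are all assumptions chosen for illustration; real variant calling is far more involved.

```python
from typing import List, Tuple


def column_stats(reference: str, reads: List[Tuple[int, str]]) -> List[Tuple[int, int]]:
    """For each reference position return (coverage, mismatches).

    Each read is (start position on the reference, symbols).
    """
    coverage = [0] * len(reference)
    mismatches = [0] * len(reference)
    for start, seq in reads:
        for offset, symbol in enumerate(seq):
            pos = start + offset
            if pos >= len(reference):
                break  # ignore symbols that run past the end of the reference
            coverage[pos] += 1
            if symbol != reference[pos]:
                mismatches[pos] += 1
    return list(zip(coverage, mismatches))


def label_column(coverage: int, mismatches: int, threshold: float = 0.5) -> str:
    """Rough label for one column; the threshold is an arbitrary assumption."""
    if coverage == 0:
        return "uncovered"
    if mismatches == 0:
        return "match"
    if mismatches / coverage >= threshold:
        return "likely variant"
    return "possible read error"


reference = "ACGTACGTAC"
reads = [(0, "ACGTAG"), (1, "CGAAGG"), (2, "GTAGGT"), (3, "TAGGTA")]
stats = column_stats(reference, reads)
print(stats[3], label_column(*stats[3]))  # (4, 1) possible read error
print(stats[5], label_column(*stats[5]))  # (4, 4) likely variant
```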

Use of genomics today

Genomics is already used in clinical practice. An example of a genomic workflow is depicted in the figure below; it could very well represent a blood test workflow if “DNA” were replaced by “blood”. Patients go to a hospital where a sample of their DNA is taken and read by a sequencing machine. The files are analysed by experts, who produce reports that are read and analysed by doctors, who decide on actions.

Use of genomics tomorrow

Today genomic workflows take time (even months) and are costly (thousands of USD per DNA sample). While there is not much room to cut the time it takes to obtain a DNA sample, sequencing costs have been decreasing and are expected to continue doing so.

Big savings could be achieved by acting on data transport and processing. If the size of a 3 terabyte file is reduced by, say, a factor of 100, the transport of the resulting 30 gigabytes would be compatible with today’s internet access speeds of 1 Gbit/s (~4 min). Faster data access, a by-product of compression, would allow doctors to get the information they are searching for, locally or remotely, in a fraction of a second.
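The arithmetic behind the 4 minute figure can be checked as follows. The sketch assumes decimal units (1 terabyte = 10^12 bytes, 1 Gbit/s = 10^9 bit/s) and an ideal link with no protocol overhead, so real transfers would take somewhat longer.

```python
TERABYTE = 10**12  # bytes, decimal units (assumption)
GIGABIT = 10**9    # bits


def transfer_time_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal transfer time: total bits divided by link speed, no overhead."""
    return size_bytes * 8 / link_bits_per_second


raw_size = 3 * TERABYTE
compressed_size = raw_size / 100  # the factor of 100 used in the text

print(transfer_time_seconds(compressed_size, GIGABIT) / 60)  # 4.0 minutes
print(transfer_time_seconds(raw_size, GIGABIT) / 3600)       # about 6.7 hours
```

Under the same assumptions, the uncompressed 3 terabyte file would need about 6.7 hours on the same link.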

The new possible scenario is depicted in the figure below.

MPEG makes genome compression real

Not much had been done to make the scenario above real (zip is the oft-used compression technology today) until April 2013, when MPEG received a proposal to develop a standard to losslessly compress files from DNA sequencing machines.

The MPEG-G standard, titled Genomic Information Representation, has 5 parts: Parts 1 and 2 are expected to be approved at MPEG 125 (January 2018) and the other parts are expected to follow suit shortly after.

MPEG-G is an excellent example of how MPEG could apply its expertise to a field other than media. Part 1, an adaptation of the MP4 File Format present in all smartphones/tablets/PCs, specifies how to make and transport compressed files. Part 2 specifies how to compress reads and Part 3 how to invoke the APIs to access specific compressed portions of a file. Parts 4 and 5 are Conformance and Reference Software, respectively.

The figure below depicts the very sophisticated operation specified in Part 2 in a simplified way.

An MPEG-G file can be created with the following sequence of operations:

  1. Put the reads in the input file (aligned or unaligned) in bins corresponding to segments of the reference genome
  2. Classify the reads in each bin into 6 classes: P (perfect match with the reference genome), M (reads with variants), etc.
  3. Convert the reads of each bin to a subset of 18 descriptors specific to the class: e.g., a class P descriptor is the start position of the read etc.
  4. Put the descriptors in the columns of a matrix
  5. Compress each descriptor column (MPEG-G uses the very efficient CABAC compressor already present in several video coding standards)
  6. Put the compressed descriptors of each class of a bin in an Access Unit (AU), for a maximum of 6 AUs per bin

Therefore an MPEG-G file contains all AUs of all bins corresponding to all segments of the reference genome. A file may contain the compressed reads of more than one DNA sample.
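The six steps can be mimicked in a toy sketch. It is not the MPEG-G algorithm: only the P and M classes named above are modelled, the descriptor names (`pos`, `mm_count`, `mm_offset`, `mm_symbol`) and the bin width are hypothetical, zlib stands in for CABAC, and reads are assumed to be aligned and to lie fully inside the reference.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

BIN_SIZE = 8  # hypothetical bin width, in reference positions


def classify(reference: str, start: int, seq: str) -> str:
    """'P' for a perfect match with the reference, otherwise 'M'."""
    return "P" if reference[start:start + len(seq)] == seq else "M"


def descriptors(reference: str, start: int, seq: str, read_class: str) -> Dict[str, List[int]]:
    """Hypothetical descriptors: start position, plus mismatch details for class M."""
    result = {"pos": [start]}
    if read_class == "M":
        diffs = [(i, s) for i, s in enumerate(seq) if reference[start + i] != s]
        result["mm_count"] = [len(diffs)]
        result["mm_offset"] = [i for i, _ in diffs]
        result["mm_symbol"] = [ord(s) for _, s in diffs]
    return result


def build_access_units(reference: str, reads: List[Tuple[int, str]]):
    """Return {(bin index, class): {descriptor name: compressed column}}."""
    # Steps 1-4: bin, classify, convert to descriptors, gather them in columns.
    columns = defaultdict(lambda: defaultdict(list))
    for start, seq in reads:
        read_class = classify(reference, start, seq)
        key = (start // BIN_SIZE, read_class)
        for name, values in descriptors(reference, start, seq, read_class).items():
            columns[key][name].extend(values)
    # Steps 5-6: compress each column (zlib as a stand-in), one unit per bin and class.
    return {
        key: {
            name: zlib.compress(",".join(map(str, values)).encode())
            for name, values in cols.items()
        }
        for key, cols in columns.items()
    }


reference = "ACGTACGTACGTACGT"
reads = [(0, "ACGT"), (2, "GTAC"), (4, "ACCT"), (9, "CGTA")]
access_units = build_access_units(reference, reads)
print(sorted(access_units))  # [(0, 'M'), (0, 'P'), (1, 'P')]
```

The point of the sketch is the grouping: values of the same kind end up next to each other in a column, which is what lets a compressor exploit their similarity, and each (bin, class) unit can be decoded without touching the others.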

The benefits of MPEG-G

Compression is beneficial but is not necessarily the only or primary benefit. More important is the fact that while designing compression, MPEG has given a structure to the information. In MPEG-G the structure is provided by Part 1 (File and transport) and by Part 2 (Compression).

In MPEG-G most information relevant to applications is immediately accessible, locally and, more importantly, also remotely, without the need to download the entire file to reach the information of interest. Part 3 (Application Programming Interfaces) makes this fast access even more convenient because it facilitates the work of developers of genomics applications who may not have in-depth knowledge of the (certainly complex) MPEG-G standard.
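To illustrate why a bin-based structure allows access without downloading the whole file, the sketch below looks up only the byte ranges that overlap a region of interest. The index layout, the bin width and the function name are hypothetical; this is neither the MPEG-G file format nor the Part 3 API.

```python
from typing import Dict, List, Tuple

BIN_SIZE = 1_000_000  # hypothetical bin width, in reference positions


def byte_ranges_for_region(
    index: Dict[int, Tuple[int, int]], start: int, end: int
) -> List[Tuple[int, int]]:
    """Return (offset, length) of every indexed bin overlapping [start, end)."""
    first_bin = start // BIN_SIZE
    last_bin = (end - 1) // BIN_SIZE
    return [index[b] for b in range(first_bin, last_bin + 1) if b in index]


# Hypothetical index: bin number -> (byte offset in the file, length in bytes).
index = {0: (0, 500), 1: (500, 420), 2: (920, 610)}
print(byte_ranges_for_region(index, 1_500_000, 2_100_000))  # [(500, 420), (920, 610)]
```

A remote client holding such an index would request only those byte ranges instead of the entire file.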

Conclusions

In the best MPEG tradition, MPEG-G is a generic standard, i.e. a standard that can be employed in a wide variety of applications that require a small footprint for, and fast access to, genomic information.

A certainly incomplete list includes: Assistance to medical doctors’ decisions; Lifetime Genetic Testing; Personal DNA mapping on demand; Personal design of pharmaceuticals; Analysis of immune repertoire; Characterisation of micro-organisms living in the human host; Mapping of micro-organisms in the environment (e.g. biodiversity).

Standards are living beings, but MPEG standards have a DNA that allows them to grow and evolve to cope with the manifold needs of their ever-growing number of users.

I look forward to welcoming new communities in the big family of MPEG users.
