A Two Bit Decoder

John Jacobsen
Saturday, July 6, 2013

Home All Posts

Earlier post Exploratory Genomics with Clojure code clojure

Later post Validating the Genome Decoder code clojure

The entire human genome is available as a single .2bit file here (click on “Full Data Set”, then download hg19.2bit). Unlike the stellar signal in His Master’s Voice, the 2bit format is reasonably clearly documented.

We want to write Clojure code to:

  1. Provide base pairs in symbolic (rather than raw binary) form as lazy sequences – i.e., sequences which need not all fit in memory at once, but can be consumed and processed as needed;
  2. Provide “random access” to this data selectively, e.g. by chromosome, rather than always reading through the entire file;
  3. Provide access to metadata encoded in the file.

The functionality to do this is posted in the jenome project on GitHub. In this post, we’ll explore this code a little; in following posts, we’ll do some investigating of the actual genome per se.

Capability (2) is provided, in part, by implementing random-access reads from file fname of len bytes starting at offset:

(ns jenome.rafile
  (:import (java.io RandomAccessFile)))

(defn read-with-offset [fname offset len]
  (let [raf (RandomAccessFile. fname "r")
        bb (byte-array len)]
    (.seek raf offset)
    (.readFully raf bb)
    (.close raf)
    bb))

Armed with this, we can get the .2bit file header:

(defn file-header [fname]
  (let [[sig ver seqcnt resvd] (->> (read-with-offset fname 0 16)
                                    (partition 4)
                                    (map bytes-to-number))]
    (assert (= sig 0x1A412743))
    (assert (= ver 0))
    (assert (= resvd 0))
    seqcnt))

file-header basically just gives us the number of sequences (usually, chromosomes) in the file, doing some sanity checks along the way. bytes-to-number converts an arbitrary sequence of bytes to the appropriate unsigned integer. (For brevity’s sake, I won’t show every utility function in this blog post; the source code on GitHub is reasonably short.)

2bitfig1.png

Figure 1: Header and File Index

The next part of the file, as shown in Figure 1, is called the “file index,” and contains a list of sequences contained the rest of the file. It can be read as follows:

(defn file-index [fname seqcnt]
  (loop [i 0
         ofs 16
         ret []]
    (if (< i seqcnt)
      (let [[nlen] (read-with-offset fname ofs 1)
            name (apply str (map char (read-with-offset fname 
                                                        (+ ofs 1) nlen)))
            seq-offset (get32 fname (+ ofs 1 nlen))]
        (recur (inc i) (+ ofs nlen 5) (conj ret [nlen name seq-offset])))
      ret)))

This somewhat imperative code walks through the seqcnt sequence portions of the index, pulling out sequence names and lengths as we go.

It’s here that we introduce a new friend, the yeast Saccharomyces cerevisiae (SacCer3), used since antiquity for making bread and fermented beverages. Relatively small in comparison with the human genome, SacCer3 will be our “unit test” organism. Available here and checked into resources, the file can be accessed as

(def yeast
   (as-file (resource "sacCer3.2bit")))

(I have imported resource and as-file from clojure.java.io; again, see the source code.)

Our index-reading code yields:

  (let [seqcnt (file-header yeast)]
    (file-index yeast seqcnt))
  ;=>
[[4 "chrI" 191]
 [5 "chrII" 57762]
 [6 "chrIII" 261074]
 [5 "chrIV" 340245]
 [5 "chrIX" 723245]
 [4 "chrV" 833233]
 [5 "chrVI" 977468]
 [6 "chrVII" 1045025]
 [7 "chrVIII" 1317776]
 [4 "chrX" 1458453]
 [5 "chrXI" 1644907]
 [6 "chrXII" 1811627]
 [7 "chrXIII" 2081188]
 [6 "chrXIV" 2312312]
 [5 "chrXV" 2508412]
 [6 "chrXVI" 2781251]
 [4 "chrM" 3018284]]

The apparent consistency of these values give us some initial confidence that we are reading the index correctly. Note, however, the curious fact that chrIX appears between IV and V.

With this encouraging start, we can now attack the sequences proper. These are laid out as shown in Figure 2, with block metadata prepended to the actual DNA sequences:

2bitfig2.png

Figure 2: Sequence Record Layout

The “N blocks” are blocks of unknown sequences with specified offsets and lengths. Masked blocks are blocks which are known repetitions (indicated as lower case “a”, “g”, “c” and “t” in the text-based ‘FASTA’ file format). We are obviously most interested in dnaSize, the number of base pairs in the sequence, and the actual sequence values themselves.

Unpacking the above data format (except the base pairs per se) makes heavy use of get32, which just returns the unsigned 32-bit integer at the specified file location. This code doesn’t need to be super efficient, since the block headers themselves are quite small. The metadata for the entire file is returned as a sequence of maps.

(defn getblk [fname offset n]
  (let [ret (map #(get32 fname (+ offset (* 4 %))) (range n))
        offset (skip offset n)]
    [ret offset]))


(defn sequence-headers
  "
  Get sequence headers from .2bit file, as documented in
  http://genome.ucsc.edu/FAQ/FAQformat#format7. Returns a list of maps
  with details for each sequence.
  "
  [fname]
  (let [seqcnt (file-header fname)]
    (for [[nlen name ofs] (file-index fname seqcnt)]
      (let [[[dna-size]         ofs] (getblk fname ofs 1)
            [[n-block-count]    ofs] (getblk fname ofs 1)
            [n-block-starts     ofs] (getblk fname ofs n-block-count)
            [n-block-sizes      ofs] (getblk fname ofs n-block-count)
            [[mask-block-count] ofs] (getblk fname ofs 1)
            [mask-block-starts  ofs] (getblk fname ofs mask-block-count)
            [mask-block-sizes   ofs] (getblk fname ofs mask-block-count)
            [[reserved]         ofs] (getblk fname ofs 1)]
        (assert (zero? reserved))
        {:name name
         :nlen nlen
         :dna-size dna-size
         :n-block-count n-block-count
         :n-block-starts n-block-starts
         :n-block-sizes n-block-sizes
         :mask-block-starts mask-block-starts
         :mask-block-sizes mask-block-sizes
         :dna-offset ofs}))))

A few sanity checks are included in test_core.clj to make sure we’re decoding the metadata correctly. Requirement (3) is done!

The final step (Requirement (1)), is to actually get our base pairs (BPs). Since we have to assume the data set is very large (as is the case with the human genome), we cannot read the entire DNA sequence at once. The first part is to break the dna-size base pairs (remember we have 2 bits per BP, or 4 BP/byte), starting at dna-offset. First we obtain the “coordinates” of the sequences we want to read:

(defn get-buffer-starts-and-lengths 
  "
  Return buffer offsets (starting at ofs) required to cleanly read a
  total of m bytes no more than n at a time
  "
  [ofs n m]
  (loop [a ofs
         len n
         ret []]
    (if (>= a (+ m ofs))
      ret
      (recur (+ a n)
             n
             (conj ret [a (min len (- (+ m ofs) a))])))))

;; Example:
(get-buffer-starts-and-lengths 100 200 512)
;=> [[100 200] [300 200] [500 112]]

At long last, having obtained the locations and lengths we want to read from, we can get our sequences out:

(defn genome-sequence
  "
  Read a specific sequence, or all sequences in a file concatenated
  together; return it as a lazy seq.
  "
  ([fname]
     (let [sh (sequence-headers fname)]
       (mapcat #(genome-sequence fname %1 %2)
               (map :dna-offset sh)
               (map :dna-size sh))))
  ([fname ofs dna-len]
     (take dna-len
           (apply concat
                  (let [byte-len (rounding-up-divide dna-len 4)
                        starts-and-lengths (get-buffer-starts-and-lengths
                                              ofs 10000 byte-len)]
                    (for [[offset length] starts-and-lengths
                          :let [buf (read-with-offset fname offset length)]
                          b buf]
                      (byte-to-base-pairs b)))))))

;; Example:
 (->> yeast
      genome-sequence
      (take 30))
;=> (:C :C :A :C :A :C :C :A :C :A :C :C :C :A :C
;    :A :C :A :C :C :C :A :C :A :C :A :C :C :A :C)

This function allows us to choose the entire genome, or that for a given offset and number of base pairs (whether from the metadata for an entire chromosome, or some smaller region, thus satisfying (2)).

The next post will focus on verification of correctness of this code; subsequent posts will begin to explore the characteristics of this data for various genomes, human or otherwise.

Next: Validating the Genome Decoder.

Earlier post Exploratory Genomics with Clojure code clojure

Later post Validating the Genome Decoder code clojure

Blog Posts (140)


Home

Reflections on a Year of Daily Memory Drawings art

Repainting art

Daily Memory Drawings art

Questions to Ask art

Macro-writing Macros code clojure

Lazy Physics code clojure

Fun with Instaparse code clojure

Nucleotide Repetition Lengths code clojure

Updating the Genome Decoder code clojure

Getting Our Hands Dirty (with the Human Genome) code clojure

Validating the Genome Decoder code clojure

Exploratory Genomics with Clojure code clojure

Rosalind Problems in Clojure code clojure

Introduction to Context Managers in Python code python

Processes vs. Threads for Integration Testing code python

Resources for Learning Clojure code clojure

Continuous Testing in Python, Clojure and Blub code clojure python

Programming Languages code

Milvans and Container Malls southpole

Oxygen southpole

Ghost southpole

Turkey, Stuffing, Eclipse southpole

Wind Storm and Moon Dust southpole

Shower Instructions southpole

Fresh Air and Bananas southpole

Traveller and the Human Chain southpole

Reveille southpole

Drifts southpole

Bon Voyage southpole

A Nicer Guy? southpole

The Quiet Earth southpole

Ten southpole

ISO50 southpole art

In Defense of Hobbies misc code art

Closure southpole

Takeoff southpole

Mummification southpole

Eleventh Hour southpole

Diamond southpole

Baby, It’s Cold Outside southpole

Fruition southpole

Settling In southpole

Revolution Number Nine southpole

Red Eye to McMurdo southpole

Faults in Ice and Rock southpole

Bardo southpole

Chasing the Sun southpole

Downhole southpole

Coming Out of Hibernation southpole

Managing the Most Remote Data Center in the World code southpole

Photoshop on a Dime art

Man on Wire misc

Posing Rigs art

Metric art

Cuba southpole

Wickets southpole

Safe southpole

Broken Glasses southpole

End of the Second Act southpole

Pigs and Fish southpole

Last Arrivals southpole

Lily White southpole

In a Dry and Waterless Place southpole

Immortality southpole

Routine southpole

Tourists southpole

Passing Notes southpole

Translation southpole

The Usual Delays southpole

RNZAF southpole

CHC southpole

Wyeth on Another Planet art

Detox southpole

Packing southpole

Nails southpole

Gearing Up southpole

Gouache, and a new system for conquering the world art

Fall 2008 HPAC Studies art

YABP (Yet Another Blog Platform) southpole

A Bath southpole

Green Marathon southpole

Sprung southpole

Outta Here southpole

Lame Duck DAQer southpole

Eclipse Town southpole

One More Week southpole

IceCube Laboratory Video Tour; Midrats Finale southpole

SPIFF, Party, Shift Change southpole

Good things come in threes, or 18s southpole

Sleepless in the Station southpole

Post Deploy southpole

Midrats southpole

IceCube and The Beatles southpole

Video: Flight to South Pole southpole

The Pure Land southpole

Almost There southpole

There are no mice in the Hotel California Bunkroom southpole

Short Timer southpole

Sleepy in MacTown southpole

Superposition of Luggage States southpole

Sir Ed southpole

Shortcut to Toast southpole

Pynchon, Redux southpole

Flights: Round 1 southpole

Packing for the Pole southpole

Goals for Trip southpole

Balaklavas southpole

Tree and Man (Test Post) southpole

Schedule southpole

How to mail stuff to John at the South Pole southpole

Summer and Winter southpole

Homeward Bound southpole

Redeployment southpole

Short-timer southpole

The Cleanest Air in the World southpole

One more day (?) southpole

One more week (?) southpole

Closing Softly southpole

More Photos southpole

Super Bowl Wednesday southpole

Night Owls southpole

First Week southpole

More Ice Pix southpole

Settling In southpole

NPX southpole

Pole Bound southpole

Bad Dirt southpole

The Last Laugh southpole

Nope southpole

First Delay southpole

Batteries and Sheep southpole

All for McNaught southpole

The Big (Really really big…) Picture southpole

t=0 southpole

Giacometti southpole

Descent southpole

Video Tour southpole

How to subscribe to blog updates southpole

What The Blog is For southpole

Home