Updating the Genome Decoder

John Jacobsen
Saturday, July 13, 2013

Home All Posts

Earlier post Getting Our Hands Dirty (with the Human Genome) code clojure

Later post Nucleotide Repetition Lengths code clojure

In our last post we saw that so-called “N-blocks” (regions of the genome for which sequences are not available) were not getting decoded correctly by our decoder.

The solution is to look back into the so-called “file index” which specifies where the N-blocks are, and how long each block is. The modified decoder looks like this:

(defn genome-sequence
  "
  Read a specific sequence, or all sequences in a file concatenated
  together; return it as a lazy seq.
  "
  ([fname]
     (let [sh (sequence-headers fname)]
       (lazy-mapcat (partial genome-sequence fname) sh)))
  ([fname hdr]
     (let [ofs (:dna-offset hdr)
           dna-len (:dna-size hdr)
           byte-len (rounding-up-divide dna-len 4)
           starts-and-lengths (get-buffer-starts-and-lengths ofs 10000 byte-len)]
       (->> (for [[offset length] starts-and-lengths
                  b (read-with-offset fname offset length)]
              (byte-to-base-pairs b))
            (apply concat)
            (take dna-len)
            (map-indexed (fn [i x] (if (is-in-an-n-block i hdr)
                                    :N
                                    x)))))))

The primary modification is the map-indexed bit at the end, which looks up the position of the base pair in the index using the is-in-an-n-block function:

(defn is-in-an-n-block
  ([x hdr]
     (let [{:keys [n-block-starts n-block-sizes]} hdr]
       (is-in-an-n-block x n-block-starts n-block-sizes)))
  ([x n-block-starts n-block-lengths]
     (let [pairs (map (fn [a b] [a (+ a b)])
                      n-block-starts n-block-lengths)]
       (some (fn [[a b]] (< (dec a) x b)) pairs))))

I tested the new decoder against FASTA files of the first two chromosomes of the human genome using the same procedure we used for yeast in my validation post The N-blocks (and all other sequences) were decoded correctly.

The astute reader will note the use of lazy-mapcat instead of =mapcat:

(defn lazy-mapcat
  "
  Fully lazy version of mapcat.  See:
  http://clojurian.blogspot.com/2012/11/beware-of-mapcat.html
  "
  [f coll]
  (lazy-seq
   (if (not-empty coll)
     (concat
      (f (first coll))
      (lazy-mapcat f (rest coll))))))

A version of genome-sequence which used mapcat instead of lazy-mapcat worked fine for individual chromosomes (file sections), but consistently ran out of memory when processing whole genome files. It took a bit of research and hair-pulling to figure out that there is a rough edge with mapcat operating on large lazy sequences.

Lazy sequences are a strength of Clojure in general – they allow one to process large, or even infinite, sequences of data without running out of memory, by consuming and emitting values only as needed, rather than all at once. Many of the core functions and macros in Clojure operate on sequences lazily (producing lazy sequences as output), including map, for, concat, and so on. mapcat is among these; however, mapcat is apparently not “maximally lazy” when concatenating other lazy seqs, causing excessive memory consumption as explained in this blog post. Using that post’s fully lazy (if slightly slower) version of mapcat fixed my memory leak as well. Though lazy seqs are awesome in many ways, one does have to be careful of gotchas such as this one.

So, in the process of handling this relatively large dataset, we have discovered two rough edges of Clojure, namely with mapcat and count. We shall see if other surprises await us. Meanwhile, the latest code is up on GitHub.

Back to Frequencies

Now that we are armed with a decoder which handles N-blocks correctly, let us return to our original problem: nucleotide frequencies. For yeast, we again have:

jenome.core> (->> yeast genome-sequence frequencies)
{:C 2320576, :A 3766349, :T 3753080, :G 2317100}

Same as last time. For humans,

jenome.core> (->> human genome-sequence frequencies)
{:N 239850802, :T 856055361, :A 854963149, :C 592966724, :G 593325228}

As expected, the GAC numbers are the same as what we had previously; only T and N have changed. The distributions look like so:

hg-yeast-frequencies-2.png

Figure 1: Updated nucleotide frequencies

We can see that Chargaff’s Rule obtains now: A/T ratios are equal, as are G/C ratios. (BTW, this rule is intuitively obvious if you consider that in the double-helix structure, As are paired with Ts and Gs are paired with Cs.) Interestingly, the relative abundance of GC pairs in the human genome is higher than it is in yeast.

Timing and Parallelizing

Many of these “queries” on our data take a long time to run (particularly for the human genome). So as not to tie up my REPL and preventing me from doing other (presumably shorter) experiments, I find it helpful to run such tasks inside a little macro (similar to Clojure’s time) which performs the computation in a thread and, when the task is finished, prints the time elapsed, the code that was run, and the result:

(defmacro tib
  "
  tib: Time in the Background
  Run body in background, printing body and showing result when it's done.
  "
  [expr]
  `(future (let [code# '~expr
                 start# (. System (nanoTime))]
             (println "Starting" code#)
             (let [result# ~expr
                   end# (. System (nanoTime))
                   dursec# (/ (double (- end# start#)) (* 1000 1000 1000.0))]
               (println (format "Code: %s\nTime: %.6f seconds\nResult: %s"
                                code#
                                dursec#
                                result#))
               result#))))

For example, our overflowing count runs as follows:

(tib (count (range (* 1000 1000 1000 3))))
Starting (count (range (* 1000 1000 1000 3)))
;; ...
Code: (count (range (* 1000 1000 1000 3)))
Time: 399.204845 seconds
Result: -1294967296

Another tool which has proved useful is:

(defmacro pseq
  "
  Apply threading of funcs in parallel through all sequences specified
  in the index of fname.
  "
  [fname & funcs]
  `(pmap #(->> (genome-sequence ~fname %)
               ~@funcs)
         (sequence-headers ~fname)))

This threads one or more funcs through all the sequences in the file, as mapped out by the file headers. The magic here is pmap, which is map parallelized onto separate threads across all available cores. pseq maxes out the CPU on my quad-core Macbook Pro and gives results significantly faster:

Code: (->> yeast genome-sequence frequencies)
Time: 33.090855 seconds
Result: {:C 2320576, :A 3766349, :T 3753080, :G 2317100}

Code: (apply (partial merge-with +) (pseq yeast frequencies))
Time: 16.056695 seconds
Result: {:C 2320576, :A 3766349, :T 3753080, :G 2317100}

With our updated decoder and these new tools, we can continue our poking and prodding of the genome in subsequent posts.

Earlier post Getting Our Hands Dirty (with the Human Genome) code clojure

Later post Nucleotide Repetition Lengths code clojure

Blog Posts (140)


Home

Reflections on a Year of Daily Memory Drawings art

Repainting art

Daily Memory Drawings art

Questions to Ask art

Macro-writing Macros code clojure

Lazy Physics code clojure

Fun with Instaparse code clojure

Nucleotide Repetition Lengths code clojure

Getting Our Hands Dirty (with the Human Genome) code clojure

Validating the Genome Decoder code clojure

A Two Bit Decoder code clojure

Exploratory Genomics with Clojure code clojure

Rosalind Problems in Clojure code clojure

Introduction to Context Managers in Python code python

Processes vs. Threads for Integration Testing code python

Resources for Learning Clojure code clojure

Continuous Testing in Python, Clojure and Blub code clojure python

Programming Languages code

Milvans and Container Malls southpole

Oxygen southpole

Ghost southpole

Turkey, Stuffing, Eclipse southpole

Wind Storm and Moon Dust southpole

Shower Instructions southpole

Fresh Air and Bananas southpole

Traveller and the Human Chain southpole

Reveille southpole

Drifts southpole

Bon Voyage southpole

A Nicer Guy? southpole

The Quiet Earth southpole

Ten southpole

ISO50 southpole art

In Defense of Hobbies misc code art

Closure southpole

Takeoff southpole

Mummification southpole

Eleventh Hour southpole

Diamond southpole

Baby, It’s Cold Outside southpole

Fruition southpole

Settling In southpole

Revolution Number Nine southpole

Red Eye to McMurdo southpole

Faults in Ice and Rock southpole

Bardo southpole

Chasing the Sun southpole

Downhole southpole

Coming Out of Hibernation southpole

Managing the Most Remote Data Center in the World code southpole

Photoshop on a Dime art

Man on Wire misc

Posing Rigs art

Metric art

Cuba southpole

Wickets southpole

Safe southpole

Broken Glasses southpole

End of the Second Act southpole

Pigs and Fish southpole

Last Arrivals southpole

Lily White southpole

In a Dry and Waterless Place southpole

Immortality southpole

Routine southpole

Tourists southpole

Passing Notes southpole

Translation southpole

The Usual Delays southpole

RNZAF southpole

CHC southpole

Wyeth on Another Planet art

Detox southpole

Packing southpole

Nails southpole

Gearing Up southpole

Gouache, and a new system for conquering the world art

Fall 2008 HPAC Studies art

YABP (Yet Another Blog Platform) southpole

A Bath southpole

Green Marathon southpole

Sprung southpole

Outta Here southpole

Lame Duck DAQer southpole

Eclipse Town southpole

One More Week southpole

IceCube Laboratory Video Tour; Midrats Finale southpole

SPIFF, Party, Shift Change southpole

Good things come in threes, or 18s southpole

Sleepless in the Station southpole

Post Deploy southpole

Midrats southpole

IceCube and The Beatles southpole

Video: Flight to South Pole southpole

The Pure Land southpole

Almost There southpole

There are no mice in the Hotel California Bunkroom southpole

Short Timer southpole

Sleepy in MacTown southpole

Superposition of Luggage States southpole

Sir Ed southpole

Shortcut to Toast southpole

Pynchon, Redux southpole

Flights: Round 1 southpole

Packing for the Pole southpole

Goals for Trip southpole

Balaklavas southpole

Tree and Man (Test Post) southpole

Schedule southpole

How to mail stuff to John at the South Pole southpole

Summer and Winter southpole

Homeward Bound southpole

Redeployment southpole

Short-timer southpole

The Cleanest Air in the World southpole

One more day (?) southpole

One more week (?) southpole

Closing Softly southpole

More Photos southpole

Super Bowl Wednesday southpole

Night Owls southpole

First Week southpole

More Ice Pix southpole

Settling In southpole

NPX southpole

Pole Bound southpole

Bad Dirt southpole

The Last Laugh southpole

Nope southpole

First Delay southpole

Batteries and Sheep southpole

All for McNaught southpole

The Big (Really really big…) Picture southpole

t=0 southpole

Giacometti southpole

Descent southpole

Video Tour southpole

How to subscribe to blog updates southpole

What The Blog is For southpole

Home