Tuesday, June 10, 2008

And now for something completely different

One of the interesting things about working at a start-up company (indeed, one of the things that got me excited about moving to this new job) is that there is a lot of work to be done and people need to be able to step up and take on a variety of projects which require different skill sets. Now that my work on the customer-focused software has wound down, I'm picking up a new project working on some internal workflow software for managing the way we create optical maps and do identification of those maps.

This is a completely different part of the system than what I've been working on, involves a completely different set of code and, in fact, involves a completely different programming language. After languishing and collecting dust for over 5 years I'm brushing off my Perl programming knowledge and starting to put it to use. So far I've been working on understanding the code that is already there (thankfully the man who wrote it was very conscientious about clean, understandable code) and getting an idea of how things are laid out. That being said, it's still Perl! For those that don't have programming experience with Perl, it's often said that it's a "write once, read never" language ... meaning that you just write something to get a job done and then you never go back and look at the code again. It's not the prettiest programming language. Here's a short example:


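# @vals holds (count, flag) pairs; expand each pair into @rle
for (my $i=0; $i<=$#vals; $i+=2)
{
    my ($c, $flag) = ( $vals[$i], $vals[$i+1] );
    $flag = ($flag eq "n") ? 0 : 1;    # "n" becomes 0, anything else becomes 1
    push @rle, $flag for ( 1 .. $c);   # append the flag $c times
}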


Yeah... isn't that nice and easy on the eyes?

Yesterday I had my first meeting with some of the scientists in the company who will be using the software that I'll be creating and got some very good ideas about things that it should do and how it should look and behave. It was very nice to be talking to users (even if they're my coworkers) and seeing how the software I'm going to write is going to make a huge difference in the way they do their work. I'm excited to have something new and interesting to work on and I'm glad that it's going to be something useful.

Wednesday, June 4, 2008

Ship It!

The last month at work has been a non-stop push of implementing features, testing, and fixing bugs. We spent hours and hours testing the software that I've been working on and I think we really polished it up nicely and have something that's going to make customers happy. We've actually had a couple of customers using a beta release of my application and so far the feedback has been quite positive, which makes me happy.

Today the last official document was signed off and the software I've been working on for the last 8 months has now gone gold for the first release to customers! I'm very excited, relieved, and apprehensive all at the same time. It's exciting to know that people are going to use the software that I created and put all of my work into. I banged my head against the desk trying to solve some very hairy and insidious problems, spent hours discussing how features should work and how to make the program as useful as possible, and delved deep into performance profiling and benchmarking to try to eke out as much speed improvement as possible. It was a lot of work and I'm thrilled that we came this far and that things are looking this good.

But all of that is followed by niggling doubts and worries that I've missed something that's going to be a big problem once people start actually using the software. One thing is true about writing software: you can plan all you want for how you think customers are going to use your software, but you can never be certain that you've got it just right. I worry that there are hidden bugs that we didn't find that are going to crop up and cause big problems or make us look bad. I'm concerned that the technology choices I made might not be the right ones... maybe we didn't fully understand how people are going to use the application and it just won't scale well.

But after I calm down for a little bit I remind myself that I'm surrounded by a group of people that are working hard to make sure that we don't send out something that's going to be a flop. After all the hours of testing and of talking to customers and other users I think that it's going to be a success.

But I will still worry a little bit.

Monday, April 28, 2008

Help Wanted

Things are moving along very quickly here at work and we've hit our first big milestone. It kind of feels like we're just over the top of the first big climb on a roller coaster... things are about to start getting wild! There is a lot of software work to be done and it looks like we have a number of openings in software development that we're hoping to fill soon.

In general, we're looking for bright, self-motivated, and effective people that can come on board and happily pick up some new projects and jump right into making some software. In particular we're looking for Java and C++ programmers for a variety of projects including production database management, image processing, and algorithm development to name a few.

The job postings are not yet up on the website, but they should be posted some time this week. If you know a person interested in a new opportunity with a cool and fun company, please direct them either to me or to the hiring manager, Erik.

Wednesday, April 16, 2008

Here comes the science, Part 3

Before I went on hiatus it was just about time to talk about the software that I've been working on and how it pertains to the process of working with optical genome maps and actually doing something interesting with them.

Building a database
As I mentioned in previous posts, the most interesting thing you can do with an optical map is to compare it to other maps of similar genomes for the purposes of looking at similarities and differences. But in order to do that you need a repository of maps and a way of categorizing and searching that repository to find what you're looking for. We didn't have anything like that at the time I started, so it was the first thing I worked on and we now have a nicely categorized, searchable database of over 40,000 genome maps.

Making maps in software
Creating optical maps is currently a time-consuming process and we'd need a lot of people to make 40,000 maps by hand. The vast majority of the maps that we have in the database are what are called "in-silico" maps, which is a cutesy way of saying that they were made in software. When you think about what mapping is, you're taking little bits and pieces of DNA, cutting them up with an enzyme, and then measuring the fragments that get created. We don't necessarily know what the actual DNA sequence of that genome is, and it's actually irrelevant for the purposes of creating optical maps (which can be very helpful, as I'll describe later). However, there are plenty of people out there who are working hard at sequencing the genomes of all sorts of organisms. We can take those sequences (the literal nucleotide sequence, e.g. ATCGGACT) and simulate the process of applying a restriction enzyme to cut that sequence into fragments to create in-silico maps. Luckily someone out there already wrote libraries for doing these sorts of things, so it was pretty easy to use that code to populate our database.
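To make the idea of an in-silico map a little more concrete, here is a minimal sketch in Python of simulating a digest. This is not the library code we actually used; the function name in_silico_digest, the toy sequence, and the cut_offset parameter are all made up for illustration, and it assumes a linear molecule that is cut at every occurrence of the recognition site.

```python
def in_silico_digest(sequence, site="GGATCC", cut_offset=1):
    """Return fragment lengths (in bases) for a linear sequence cut at every site.

    cut_offset is where the cut falls inside the site, counted from the
    site's first base (a hypothetical parameter for this sketch).
    """
    sequence = sequence.upper()
    cut_positions = []
    start = sequence.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    # Fragment lengths are the gaps between consecutive cut positions,
    # including both ends of the molecule.
    boundaries = [0] + cut_positions + [len(sequence)]
    return [right - left for left, right in zip(boundaries, boundaries[1:])]


# Toy sequence containing two GGATCC sites (real genomes are millions of bases).
toy_sequence = "AAAAGGATCCTTTTTTGGATCCCC"
fragments = in_silico_digest(toy_sequence)
print(fragments)  # [5, 12, 7]
```

The ordered list of fragment lengths is the map. Everything else (storing it, searching it, drawing it as a barcode) builds on that list.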

Comparing maps
Really the critical functionality of the software I'm working on is the ability to compare maps to each other. In a nutshell, we compare maps by looking at the series of fragments in each map and use some complicated math that I'll likely never understand to figure out whether or not they're "close enough" to each other to confidently say that they probably represent the same underlying DNA structure. While it's very likely that the actual DNA sequences are different in some respects, those differences are small enough that they don't show up at the map level, and we can reasonably assume that these are regions of similarity between the genomes. Using maps, these similarities and differences are very easy to visualize. Here's an example of a couple of similar strains of P. aeruginosa:


The purple parts are regions of similarity while the white parts represent regions that do not appear to be similar at all. It's immediately obvious where these particular strains differ and where they appear to have common structure.
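The real comparison uses statistics I won't pretend to reproduce here, but the basic "close enough" idea can be sketched with a toy example. The Python below is only an illustration under simple assumptions: two fragments "match" if their sizes differ by no more than 10%, and we look for the longest run of consecutive matching fragments. The names fragments_match and longest_shared_run, the tolerance, and the fragment sizes are all invented for this sketch, and it ignores things a real aligner has to handle, like missing or extra cuts.

```python
def fragments_match(a, b, tolerance=0.10):
    """Two fragment sizes match if they differ by at most `tolerance` (relative)."""
    return abs(a - b) <= tolerance * max(a, b)


def longest_shared_run(map_a, map_b, tolerance=0.10):
    """Return (run_length, start_in_a, start_in_b) for the longest run of
    consecutive fragments that match between the two maps."""
    best = (0, 0, 0)
    prev = [0] * (len(map_b) + 1)
    for i, frag_a in enumerate(map_a, start=1):
        curr = [0] * (len(map_b) + 1)
        for j, frag_b in enumerate(map_b, start=1):
            if fragments_match(frag_a, frag_b, tolerance):
                # Extend the run that ended at the previous fragment pair.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[0]:
                    best = (curr[j], i - curr[j], j - curr[j])
        prev = curr
    return best


# Made-up fragment sizes (arbitrary units) for two hypothetical strains.
strain_a = [12.0, 30.5, 8.2, 45.0, 17.3, 60.1]
strain_b = [5.0, 31.0, 8.0, 44.0, 17.0, 22.0]
shared = longest_shared_run(strain_a, strain_b)
print(shared)  # (4, 1, 1)
```

In the terms of the picture above, the matching run would be a purple region and the fragments on either side of it would be white.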

Extracting meaning
All of this leads up to my final point, which is how you can use this software for comparing maps to extract meaning. As we know, the DNA structure of organisms dictates what they look like and what they are capable of in the physical world. In our particular realm we're mainly looking at bacteria, specifically bacteria that make people sick. There are a lot of species of bacteria that make people sick and, within those species, there are several sub-species or strains that act differently. Some are particularly nasty, some are resistant to certain antibiotic medications, and some are just run-of-the-mill. Since these strains are all of the same species they (frequently, but not always) end up sharing a lot of similar DNA. So by comparing maps of the different strains you can fairly easily see places where the DNA structure differs, and that can really help you isolate the region of the genome that is causing a particular strain to be especially nasty.

I guess that's about it for now... this is getting pretty long. I may revisit this topic a little later as more code gets written and I start in on more new things. Right now I'm kind of in the middle of a round of bug-fixing and polish and that's just not that interesting!! Bye for now.

Brushing the dust off

It's embarrassing to say it, but it's been almost 3 months since my last blog post. I've had a few people comment on this and my wife actually asked if she should remove my blog from her RSS aggregator... ouch!! Well my only comment is that I took a while off when our daughter, Sylvia, was born in early February and I got out of the habit. But I'm back in the saddle now and looking at getting back to moving forward.

Thursday, January 24, 2008

Here comes the science, Part 2

This time around I'm going to dive down into a little more detail about what actually goes on in the process of making these optical maps of DNA samples. At a theoretical level, the process is fairly straightforward and sounds pretty basic. However, as I'm learning while working here, real life does not think very highly of our nice, simple, straightforward theories. So the process has to be very robust, especially on the software side, which makes me very glad that I work with a lot of really smart people.

Making DNA lay down straight
The whole linchpin of this process is being able to measure the length of the strands of DNA (more accurately, the lengths of DNA fragments but we'll get to that shortly). In order for length to have any meaning, we need to have the subjects we're measuring be as close to a straight line as possible. In order to do that we use a glass surface (you remember those microscope slides you used in high school) and a cover slip that has microscopically small channels carved into it. The DNA is placed, in solution, onto this surface and, using a magical process I know nothing about, the DNA is stretched out along those channels which serve as guides for straightening out the molecules.

Cutting it up into fragments
In order to create meaningful maps that can be used to identify and compare organisms we need to break up the DNA molecules into fragments. It's these fragments that we measure to create that nice-looking barcode map. In order to create those maps you use what is called a "restriction enzyme", which is an enzyme that actually cuts DNA. A particular restriction enzyme always cuts in the same place, at a particular sequence of base pairs in the DNA strand. For example, the enzyme BamHI cuts at restriction sites of GGATCC. As an aside, most restriction enzymes cut at sites that are palindromic (the site reads the same on both strands of the DNA). There's no obvious reason that I know of why that is, but it sure is a neat coincidence. Due to the genetic makeup of different organisms, different enzymes will cut different numbers of fragments for each organism. Part of our process is picking an enzyme that will cut the "right number" of fragments, that is, enough to make a meaningful map. Too few fragments or too many fragments often make the maps indistinguishable from each other.
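If you want to see what "palindromic" means for DNA, and how you might count how often an enzyme would cut, here is a small Python sketch. It is only an illustration: the helper names and the toy sequence are invented, GAATTC is the well-known EcoRI site, and counting sites in a known sequence is just one rough way to think about the "right number" of fragments, not a description of how we actually pick enzymes.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(site):
    """Return the sequence of the opposite strand, read in its own 5'-to-3' direction."""
    return site.upper().translate(COMPLEMENT)[::-1]


def is_palindromic(site):
    """A site is palindromic in the DNA sense if it equals its reverse complement."""
    return site.upper() == reverse_complement(site)


def count_sites(sequence, site):
    """Count occurrences of a recognition site in a sequence (overlaps included)."""
    sequence, site = sequence.upper(), site.upper()
    count = 0
    start = sequence.find(site)
    while start != -1:
        count += 1
        start = sequence.find(site, start + 1)
    return count


print(is_palindromic("GGATCC"))  # True  (BamHI)
print(is_palindromic("GGATCA"))  # False (not a real enzyme site, just a counterexample)

# Toy sequence: a linear molecule with n cut sites yields n + 1 fragments.
toy_sequence = "GGATCCAAGAATTCTTGGATCC"
bamhi_sites = count_sites(toy_sequence, "GGATCC")
ecori_sites = count_sites(toy_sequence, "GAATTC")
print(bamhi_sites, ecori_sites)  # 2 1
```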

Measuring those fragments
As part of the preparation of the DNA, a stain is applied that will cause the fragments to light up or "fluoresce" when exposed to a laser. The glass slip containing the DNA solution is placed on a fluorescent microscope that has a camera attached to it and an automated software system moves the camera up and down the length of those channels and takes pictures through the microscope. Here's an example of what it looks like. Remember this is one tiny fraction of a single image from the microscope. You can see several broken strands of DNA. The colored one is one that has been picked out by the software as clean enough to be measured and recorded.


Hundreds of such images are acquired and finally fed through some image processing software that picks out the nicest looking DNA molecules, finds their fragments, and measures the fragment lengths in some unit of measure that is smaller than anything I can imagine. Finally those fragment lengths are recorded in order and they can be visually represented by that barcode-like display I showed last time. Here's an artist's interpretation of how that looks:


Assembling the pieces
Here's where the real world comes in and whacks you in the head. DNA will almost never stay fully intact throughout the process I just described. And even if it did, there are usually a lot of molecules and they tend to overlap each other or they don't straighten out exactly right (or sometimes at all). So what you end up with is lots and lots of small maps that each represent just a chunk of the entire strand of DNA. And here is where some intense computing power is brought to bear on the problem as we take all of those smaller maps and try to determine how and where they overlap each other. It's kind of analogous to trying to put a piece of paper back together after it has been through a shredder. When this process is finally done (and everything worked out okay) you end up with a "consensus map", which is the amalgamation of all of those smaller map chunks.
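To give a feel for the "shredded paper" problem, here is a deliberately tiny Python sketch that merges just two map chunks. It is not how the real assembler works. It assumes the end of one chunk overlaps the start of the next, treats fragments within 10% of each other as the same fragment, and averages the overlapping measurements. The function names, the tolerance, the minimum overlap, and the fragment sizes are all hypothetical, and it ignores real-world wrinkles such as the broken partial fragments at the ends of each molecule.

```python
def close_enough(a, b, tolerance=0.10):
    """Two fragment sizes agree if they differ by at most `tolerance` (relative)."""
    return abs(a - b) <= tolerance * max(a, b)


def find_overlap(left, right, min_overlap=3, tolerance=0.10):
    """Return the largest k where the last k fragments of `left` agree with
    the first k fragments of `right`, or 0 if no overlap of min_overlap exists."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if all(close_enough(a, b, tolerance) for a, b in zip(left[-k:], right[:k])):
            return k
    return 0


def merge_chunks(left, right, min_overlap=3, tolerance=0.10):
    """Merge two overlapping chunks into one map, averaging the shared fragments.
    Returns None when no sufficient overlap is found."""
    k = find_overlap(left, right, min_overlap, tolerance)
    if k == 0:
        return None
    averaged = [(a + b) / 2 for a, b in zip(left[-k:], right[:k])]
    return left[:-k] + averaged + right[k:]


# Made-up fragment sizes (arbitrary units) for two chunks of the same molecule.
chunk_1 = [20.0, 35.0, 10.0, 50.0, 15.0]
chunk_2 = [10.5, 49.0, 15.5, 42.0, 28.0]
consensus = merge_chunks(chunk_1, chunk_2)
print(consensus)  # [20.0, 35.0, 10.25, 49.5, 15.25, 42.0, 28.0]
```

Now imagine doing that with many thousands of chunks, where any chunk might overlap any other, and you can see why the computing power is needed.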


Once you've got that consensus map, then the doors open to a wide range of things you can do with it and that's where the software that I'm working on comes into play. But I'll save that stuff for the third and final installment of this series. Thanks for reading!

Wednesday, January 16, 2008

Here comes the science, Part 1

I've been wanting for some time to make a post or two to try to explain the basics of the science that is the core of our company. The techniques described here are in the public domain so there's nothing secret here. I've been working here long enough that I now feel pretty confident that I understand the process at a fairly high level, which is just the right amount to be able to describe it to someone else who might find it interesting. For part 1 here, I will just describe the basic premise and will cover more actual detail in later posts. I've really enjoyed learning this stuff and I hope that other people find it interesting too.

The "Op" stands for Optical
The company's name, OpGen, is based on the fact that the core scientific process behind the business is called "optical mapping" which is, in short, a technique for taking physical samples of DNA and creating a visual representation of it such that unique organisms can be easily differentiated from each other and similarities between other organisms can be easily spotted. The whole concept is that you can break up DNA into many fragments and then put those fragments together in a line and you get what we call a "map". What's useful about this is that similar organisms will consistently and repeatedly break up in the same way such that their maps are very similar. As you'll see later, these maps almost look like barcodes and you can actually think of them as such, or as a "fingerprint" which uniquely identifies an organism. Here's an example of what one might look like:


What's it for?
The most interesting applications for optical mapping that I am aware of are in the area of what we call "comparative genomics" (other people might call it something else). Basically, it's the practice of looking at a number of similar or related organisms and analyzing what's different about them. For instance, say you have maps of two isolates of the same species of bacteria that cause infections in humans. Furthermore, say that one of those isolates is known to be extremely nasty and hard to treat, while the other is easily killed off with a round of antibiotics. By comparing the maps of these two bugs, you can actually see where the two are genetically different. Those parts that are different most likely indicate where the nastiness of the bacteria is regulated and can point the way for researchers to know where to look when trying to figure out how to combat that strain.

Coming soon...
In future posts I'll go into more detail about how we actually create those maps and talk about the software I'm working on and how it pertains to these maps.