We need to re-think the Ph.D...

A few months ago, I wrote a post about the differences between academic and industry interviews (including some advice for the latter). Since then, I've been involved in more interviews, and I've come to the conclusion that the field of academia whence I came is largely failing Ph.D. students.

One of my frustrations with academic research was a lack of appreciation for details. The 'point' of everything seems to be the results themselves - which generate publications - rather than the process of getting to those results. Papers in genomics often illustrate this: to name just a few examples, many publications choose arbitrary thresholds for analysis without explanation, or don't bother explaining why they chose a particular statistical test and/or normalization method over others. I suppose that you can argue that it doesn't really matter: you're probably going to accept a false-discovery rate of 10%+ anyway, and ultimately, nothing's going to happen if you have a bunch of false-positives/negatives in your data.

Things are quite different on the industry side. The results that you generate are directly responsible for your livelihood as well as that of your co-workers. The point isn't to meet some minimum publishable standard of analysis, but rather to convince yourself and others that your results are legitimate. Consequently, it's no surprise that good companies want you to prove that you are an expert in the methods required to generate the results that you've touted on your resume.

Which brings me to the title of this post: shockingly few Ph.D.s actually understand the details of the methods underlying their work. They can probably cite every paper published about their chosen discipline, but when pressed they'll admit that they analyzed their data a certain way because a postdoc told them to, or that they performed such and such normalization/analysis/test because it's what another paper did.

I completely understand - I spent 6 years in grad school and another 5-and-a-half as a postdoc. I've actually seen PIs tell students to skip the details of their analyses during lab meetings because they're not interested; they only want to see results. Furthermore, I've never seen a lab where members are actively encouraged to spend time at work improving their skills rather than performing experiments and analyzing data as quickly as possible [1].

As we all know, 85% of Ph.D.s will not be going into academia, and I expect that this percentage will only grow as time goes on and academic jobs continue to become less and less attractive. So regardless of the underlying factors (publish-or-perish, etc.), by focusing on results rather than skills, academia is leaving most of its trainees ill-prepared for the job market in which they will find themselves [2].

If you think that I'm blowing things out of proportion, then consider the following observations: most industry job postings require candidates to have 2-5 years of postdoctoral and/or industry experience above the Ph.D. in order to apply for a scientist position (rather than a technician). Also, my own employer interviews many, many candidates in order to fill each position, and a very common complaint is that candidates fail to show that they understand the methodology upon which their work is based.

The saddest aspect of all of this is that I've been hearing versions of these complaints since I was an undergrad: most university grads aren't going on to become professors, so why are we training all of them as if they were? My fear is that we're just going to accumulate more and more unemployed Ph.D.s until the system breaks under their weight.     

[1] 1) I assume that PIs generally believe that you'll learn by doing, but there's a surprising amount of stuff you can accomplish by jury-rigging random stuff off of the internet while learning very little of substance. 2) Labs that encourage such 'personal development' must exist, but have any biologists ever seen anyone give a weekly update at lab meeting about how they made their code more elegant, or efficient, or that they generalized it and shared it on the lab server? This should be part of the culture. 

[2] There's a stronger case to be made here: I honestly think that academic labs are under-performing because their members aren't learning the most efficient ways to accomplish their objectives. There's a total lack of knowledge-sharing among members of many labs and a lot of reinventing the wheel ad nauseam.

Book Club: The Patient Will See You Now...

Eric Topol is a cardiologist known for his advocacy for technology-based disruption of the healthcare industry. I heard him make some provocative statements about creative destruction in medicine on the EconTalk podcast, and since I'm now working in the broader healthcare industry, I decided to read his book, The Patient Will See You Now (2015; Basic Books).

I'm not sure for whom this book is intended: it covers a lot of ground, and many of the technical concepts that it discusses are far more controversial than presented. Topol tries to tackle a multitude of weighty subjects in a single book, and I'm sure that the general exuberance for all things 'omic' and 'big data' is going to ruffle a few feathers. In the interest of my time and yours, I'll comment on three major themes.

Paternalism in medicine

The first section of TPWSYN criticizes pervasive paternalism in medical practice. While all professions are expected to be self-promoting (and self-serving), medicine is somewhat unique in its degree of self-congratulation and self-importance. In particular, Topol is critical of the field's lack of interest in 'democratizing' the healthcare process: essentially, there's a lot of information available for patients to make informed decisions about their care, but they rarely have access to their own medical data [1].

I understand why physicians could be wary of too much patient 'involvement': doctors already complain about patients citing 'Dr. Oz' when questioning diagnoses and prescriptions, and it's easy for desperate patients to fall for misinformed woo that they read online. But ultimately I agree with Topol: patients are already organizing support groups and sharing information online, and MDs can either be there to shape the process, or allow it to happen without their involvement (the lack of practitioner involvement goes a long way to explaining why so-called Electronic Medical Records, or EMRs, are so physician-unfriendly [2]).

Omics will revolutionize everything

Anyone who's followed the literature on things like genome-wide association studies (GWAS) [3] knows that, for many complex diseases, they've been quite controversial and/or disappointing, with little of the phenotypic variance explained by genomic factors (see Visscher et al. 2012, for example). They're also very expensive. Regardless, Topol presents them without any controversy, as if they're going to explain the root causes of everything - he's firmly on the side of 'we just need more data'.

But that's the rub: the root cause of every disease isn't purely genomic. Rather, disease phenotypes result from the interaction between genes and the environment. Furthermore, no law says that these interactions need be 'additive', so saying that a disease is 40% genetic and 60% environmental doesn't make sense. More data may be good from an academic perspective - but much more work needs to be done to translate this into clinically actionable findings - there's a big difference between the statistical significance of an effect and its magnitude [4].

It's also worth pointing out that a major challenge in applying the results of GWAS in a clinical setting is that, in addition to being sample- and sample-size-specific, the results are also often very population-specific. This means separate studies are required to identify risk loci associated with cancer in Caucasians (the most well-studied group), versus Africans, or Asians, or Latinos, etc. So unless the diagnostic value of these studies increases dramatically, it may be difficult to justify the costs.

The smartphone as the all-in-one medical diagnosis device

TPWSYN spends a lot of time discussing how technological advancement is shrinking the cost and size footprint of complex medical devices. In particular, there are apparently several excellent proof-of-principle technologies that can attach to your smartphone and collect information on things like blood pressure, temperature, or the visual status of your inner ear, nose, or throat, among others. Via software and/or telemedicine, there's a possibility that such devices could allow routine diagnoses of minor conditions without the need for expensive, time-consuming hospital visits.

Clearly, there's a lot of exciting potential in such devices: as an example, diabetics have been able to monitor their own blood-sugar levels for years now. However, I think that this type of technology brings up one of the major caveats of the entire book: 'more data' is only useful if clinicians know what to do with it. Consider the following: maternity wards have largely adopted continuous fetal monitors that affix to the mother's belly, over the traditional practice of 'checking in' every so often. This has coincided with a large spike in the number of unplanned, emergency C-sections. However, there has been no corresponding drop in rates of infant mortality. Most likely, continuous monitors exposed a large number of 'normal' fluctuations in prenatal heart rates and contractions, which spooked unfamiliar medical staff into performing unnecessary operations [5].

In the fullness of time, we'll likely figure out how to perform analytics on 'big data' in order to produce meaningful effects on individual patient outcomes (not simply 'statistically significant', but actually noticeable in magnitude at the individual level). A lot of this is going to come from combining 'omics' and monitoring with work unraveling the underlying mechanisms of disease. But the results that one obtains from data are only as good as the data themselves, as well as the hypotheses under which they are interpreted. I'm not sure that collecting ever more data of untested quality is the best place to focus the bulk of our efforts.

Ultimately, much of what Topol discusses in his book will likely come to pass - at least in implementation if not in actual value to patients. But without serious discussion of the subtleties of the underlying science, it amounts to much more hype than information.

[1] Topol also criticizes medical associations for levying non-evidence-based criticisms against things like allowing registered nurses to handle diagnosis and prescription in 'routine' practice.  

[2] See The Digital Doctor, by Robert Wachter.

[3] e.g., the entire journal Nature Genetics.

[4] Consider the types of results that you (used to) get from 23andMe: if you have a variant that increases your risk of disease X by 2%, are you going to change anything about the way you live your life? Would it even help at an individual level, or are you only going to see an effect in aggregate?

[5] See Expecting Better, by Emily Oster.

Rant: putting a little effort into public speaking...

I'm continuously baffled by how little effort scientists put into public presentations. It's easy to downplay the importance of talks when there are so many other constraints on our time, but we need to take into consideration the sheer amount of collective time that bad talks are wasting.

It's odd that there seem to be no incentives to improve the quality, or most importantly, the timing of talks. For instance, I can't count the number of talks that I've attended where the speaker's gone way over time [1]. Conversely, I can count the number of times that I've seen someone ask the speaker to stop on one or two fingers. Scientists have never struck me as sheepish about offending colleagues' feelings when it comes to reviewing papers or criticizing work during lab meetings. And yet there seems to be some kind of universal ban on offending folks, or even providing constructive criticism regarding presentations.  

I wish that it would become culturally accepted that unnecessarily long or uninformative presentations waste the precious time of every single attendee. It should be acceptable and expected that a moderator first give a speaker a signal that their time is coming up (a five-minute warning, for example), and then politely cut them off when that time arrives. I have a feeling that people would feel embarrassed to be cut off, which would provide at least some incentive to do a better job putting together their talks [2].

Here's a general observation: despite over a decade in the 'biz', I have never heard a seminar attendee describe a presentation as 'too short', while the converse is as regular as clockwork. This is probably an excellent indication upon which side to err when prepping a presentation. 

Finally, I'd be remiss not to bring up two personal pet-peeves about seminars:

1) I've noticed a trend towards a particular presentation style that I call the 'look how much work I've done!!!'-talk. This is where the speaker focuses on telling you about the effort they've put into something, usually by presenting a lot of slides without going into detail about any of them. In my experience, this is always a bad idea. It's much better to focus on one aspect of a project in sufficient detail to convey why it's important, and why people should care - both of which are rarely as self-evident as people would like to think.

2) The purpose of overview slides is to help the audience put the various parts of a talk into context. However, I notice that most people use them as a long-winded abstract. On top of taking up valuable time, I don't find it helpful to receive a barrage of concepts all at once, before they're properly explained. Again, I'd focus more on why it's important, and why people should care at this point. Also, I don't think that any talk shorter than 30 mins needs a minute-long overview slide [3].

P.S. I think that these concepts should apply to all talks - not just big, public seminars. No need to waste time polishing lab meeting presentations, but it's no less important to be considerate of your audience's time.

[1] I've actually been to a conference where our entire session had to miss dinner because we were so ridiculously behind schedule.

[2] While practicing a talk before the official delivery is ideal, I don't think that this is required. With a bit of experience, you can develop pretty reliable rules-of-thumb about how long you should spend on each background or data-heavy slide, and so on.

[3] I know that a lot of people disagree with me on overview slides, but I've seen so many talks begin with a 'First I'm going to give you an introduction to X. Then I'll talk about some of the results I've obtained, before discussing their implications. Finally, I'll end with some conclusions'-slide. I don't think that we need to be reminded of how a talk works. If it's not helpful, it's unnecessary. 

Wasting my time...

One of the most irksome aspects of working in computational biology is how frustrating it can be to analyze other people's data (OPD) [1]. By OPD, I don't mean quickie files generated for personal use; rather, I'm talking about datasets ostensibly provided so that other folks can build upon, or at the very least, replicate published work. I'm talking about anything from supplementary material included with papers and/or software, to big, taxpayer-funded public databases.

Here's a typical scenario: I need to combine two or more pieces of data, such as a list of human disease-associated variants identified in a study with some database of previously published variant associations. Conveniently, both datasets use the same format for identifying variants, which means that this should boil down to finding the intersection between a particular column in each of the tables. This shouldn't take more than five minutes, right?
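In principle, that five-minute job really is a one-liner. Here's a minimal sketch in Python, using made-up variant IDs standing in for the relevant column of each table:

```python
# Hypothetical variant IDs pulled from the ID column of each table
study_variants = {"rs12345", "rs98765", "rs55555"}
database_variants = {"rs12345", "rs98765", "rs11111"}

# Variants reported in the study that are also in the database
shared = study_variants & database_variants
```

That's the intersection of the two ID columns, and nothing more; the rest of this post is about what happens when the IDs can't be trusted.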

Unfortunately, I quickly notice that some proportion of variants aren't being found in the database, even though the referenced origin of said variants is in there. 15 minutes of searching reveals that many of these are just typos; the others I'll have to check in more detail. I decide that I'd better write a script that cross-references the references [2] against the variants to catch any further mistakes, but this ends up spitting out a lot of garbage. Some time later, I realize that one of the tables doesn't stick to a consistent referencing style [3], so I can either go through the column and fix the entries manually, or try to write a script that handles all possibilities. A few hours later, I've finally got the association working, minus a dozen or so oddball cases that I'll have to go through one-by-one, only to find out that much of the numeric data I wanted to extract in the first place is coded as 'free text'. Now I'll need to write more code to extract the values I want. However, it's now 7 pm, and this will have to wait until tomorrow.
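To make the cleanup steps concrete - flagging IDs that fail to match, then scraping numbers out of free text - here's a sketch of what that scripting ends up looking like. The data and the 'OR = x' text convention are entirely made up for illustration:

```python
import re

# Hypothetical study results: variant ID -> free-text effect field
study = {
    "rs12345": "OR = 1.4 (p < 1e-8)",
    "rs98765": "OR = 2.1 (p < 5e-6)",
    "rs55555": "see text",
}

# Hypothetical reference database of known associations
database = {"rs12345", "rs98765", "rs11111"}

# Step 1: flag study variants missing from the database (typos, etc.)
missing = [v for v in study if v not in database]

# Step 2: pull numeric odds ratios out of the free-text field;
# entries that don't parse get flagged for manual review
or_pattern = re.compile(r"OR\s*=\s*([\d.]+)")
parsed, unparsed = {}, []
for variant, text in study.items():
    m = or_pattern.search(text)
    if m:
        parsed[variant] = float(m.group(1))
    else:
        unparsed.append(variant)
```

And this is the optimistic version: every new formatting inconsistency in the source table means another pattern, or another pile of entries to fix by hand.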

I've encountered this sort of problem many, many times when working with scientific data. Why are we so tolerant of poorly formatted, error-ridden, undocumented datasets? Or, perhaps a more appropriate question is: why don't scientists have more respect for each other's time? Is it more reasonable for the dataset generator to spend a little bit of time checking the reliability of their data programmatically, or for each person who downloads the data to waste hours (or days) working through typos and errors?

I get it: after spending months writing and rewriting a manuscript, rarely do you feel like spending a lot of time polishing off the supplementary materials. Mistakes happen simply because you're in a rush to get a draft out the door. On the more cynical side, I have also been told that spending time making it easier for people to use my data isn't worth my time. Neither of these considerations explains errors found in public databases, however.

I don't have a solution to the problem, but I'm pretty sure that the root cause is one of incentives: that is to say, there are few professional incentives for making it easier for your colleagues (competitors) to replicate and/or build upon your work. Perhaps we need a culture shift towards teaching better values to students or, more realistically, we need journals to actually require that data meet minimal standards, perhaps including requiring that mistakes in supplementary tables be fixed when pointed out by downstream users.
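To be concrete about what 'checking the reliability of their data programmatically' could mean: even a handful of sanity checks, run once before release, would catch most of the problems I've described. A sketch, where the variant-ID and citation conventions are assumptions on my part:

```python
import re

RSID = re.compile(r"^rs\d+$")  # assumed variant-ID convention
CITATION = re.compile(r"^[A-Z][A-Za-z-]+ et al\. \d{4}$")  # assumed citation style

# Hypothetical rows from a supplementary table
rows = [
    {"variant": "rs12345", "source": "Smith et al. 2014", "value": "0.32"},
    {"variant": "rs9876",  "source": "Jones 2015",        "value": "n/a"},
]

def validate(row):
    """Return a list of problems with one row (empty if the row is clean)."""
    errors = []
    if not RSID.match(row["variant"]):
        errors.append("bad variant ID")
    if not CITATION.match(row["source"]):
        errors.append("inconsistent citation style")
    try:
        float(row["value"])
    except ValueError:
        errors.append("non-numeric value")
    return errors

problems = {}
for row in rows:
    errs = validate(row)
    if errs:
        problems[row["variant"]] = errs
```

A dozen lines like these, written once by the dataset generator, versus hours of detective work repeated by every downstream user: the asymmetry is the whole argument.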

[1] Who's down with OPD? Very few folks, I'm afraid.

[2] Cross-referencing has always struck me as the lamest, overused, 'nerd word' on TV. I cross-reference all the time, but I think this is the first time I've actually referred to it as such.

[3] e.g., [First author's first name] YYYY. I wish I was making this up.