Some hospitals are better than others. Some physicians are better than others. That’s obvious [1].

But let’s pretend it’s not. How do we prove that one hospital or one physician is better than others? That is more difficult than it may initially appear. This is true for several reasons:

  1. All patients are different.
  2. “Better” is a vague word with multiple definitions.
  3. Even if a definition of “better” is agreed upon, it may not be measurable.
  4. Even if “better” is defined and measured, it may not apply to me.

To start our series dissecting hospital and physician ranking systems, we’ll consider the first of these: that patients are different.

People Are Not All the Same: Apples and Oranges

If people were all the same, and had all the same medical problems, then finding the best physician would be easy – just compare the death and disability rates for the patients in each doctor’s practice.

But the world is not set up this way. An orthopedic surgeon who only treats tennis elbow in athletes should have a near-zero death rate among his or her patients, while an equally competent orthopedic surgeon who treats broken hips in osteoporotic elderly persons is likely to have quite a noticeable death rate. Although both physicians are board-certified in the same specialty (orthopedic surgery), their patients are too different for simple statistical comparisons. It’s apples and oranges.

This same principle applies to hospitals. Because the Mayo Clinic is in rural Minnesota, many of its patients get there only if they can afford an airplane flight, and most of those patients have good health insurance as well. At the spectrum’s opposite end, the University of Maryland Hospital is located in an economically challenged section of Baltimore. Here you have two hospitals with patient populations as different as those of the doctors treating tennis elbow and broken hips. Again, we end up comparing apples and oranges.

How do ranking systems get around this fundamental limitation? There are two main approaches.

Approach #1: Get Apples-to-Apples Comparisons by Making Patients Homogeneous

We have already seen that, even within a single specialty like orthopedic surgery, the patients are so heterogeneous that it is not possible to get apples-to-apples comparisons. But what if we sliced each specialty into dozens of extremely narrow sub-sub-sub-specialties so that the patients in each slice were quite alike for every physician?

This could work in theory, but there are two practical difficulties. First, it is not clear how thin the slicing needs to be. Consider coronary artery bypass surgery. Some cardiovascular surgeons specialize in “re-do” bypasses that are always more difficult and riskier than “first-time” operations. So even slicing down to a single type of operative procedure is problematic.

The second difficulty with this approach is that it requires very detailed data. Unfortunately, detailed data are both costly and rare. In comparing tennis elbow treatment results, we would probably want to distinguish the results in professional athletes, weekend athletes, and non-athletes, but these patient characteristics are not captured by routine data collection tools. Manually extracting them from notes in the medical record would be expensive and ultimately impractical.

Approach #2: Adjust for Patient Differences by Inventing New Statistics That Can Be Compared

The second approach accepts the inevitability of apples-to-oranges comparisons but tries to transform both the apples and the oranges into bananas, which can then be meaningfully compared.

For example, one influential study [2] compared the survival of patients in intensive care units (ICUs) at 13 different hospitals. One might expect the thin-slicing of Approach #1 to work here, since ICU patients are all very sick and therefore similar from one hospital to another, but in this study it did not. Death rates ranged from 9% to 38%, and, perhaps surprisingly, the world-famous Johns Hopkins Hospital had the highest mortality.

For Approach #2 the researchers needed a new statistic to compare hospitals, so they turned to the then-new APACHE II algorithm, which predicts each patient’s probability of death from physiological characteristics such as blood pressure, blood oxygen level, and kidney function, together with age and chronic health status. Using APACHE II, the researchers could see that, among the 13 hospitals, Hopkins patients were by far the sickest – they had the highest APACHE II scores. APACHE II predicted that 43% of Hopkins ICU patients would die.
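
In rough outline, an algorithm of this kind awards points for how far each measurement departs from normal and then converts the total score into a probability of death. The Python sketch below is purely schematic – the thresholds and coefficients are invented for illustration and are not the published APACHE II weights – but it shows the general shape of the calculation.

    import math

    # Schematic only: invented thresholds and coefficients, NOT the real APACHE II weights.
    def severity_points(heart_rate, mean_bp, creatinine):
        points = 0
        if heart_rate < 55 or heart_rate > 110:   # abnormal heart rate earns points
            points += 2
        if mean_bp < 70:                          # low blood pressure earns points
            points += 3
        if creatinine > 1.5:                      # impaired kidney function earns points
            points += 2
        return points

    def predicted_death_probability(score, intercept=-3.0, slope=0.25):
        # Logistic conversion of a severity score into a probability (illustrative numbers).
        return 1 / (1 + math.exp(-(intercept + slope * score)))

    score = severity_points(heart_rate=120, mean_bp=60, creatinine=2.2)
    print(score, round(predicted_death_probability(score), 2))   # prints 7 and 0.22

Averaging such per-patient probabilities over all of a hospital’s ICU admissions gives the death rate the model expects, against which the hospital’s actual death rate can then be judged.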

But, as noted, “only” 38% of Hopkins ICU patients died, giving Hopkins an observed-to-predicted death rate ratio of 38/43, or about 90%. In other words, about 10% more patients lived than the model expected [3] – far from a shameful performance. By contrast, another hospital with an APACHE II-predicted ICU mortality of 16.7% actually had a mortality of 26.4% – 58% more deaths than expected.

In summary, for each hospital the researchers used APACHE II to convert the hospital’s raw mortality rate into a ratio of observed-to-predicted death rates. Had they compared the raw mortality rates between hospitals, it would have been apples-to-oranges, but comparing the ratios was banana-to-banana and, therefore, was informative.
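
To make the arithmetic concrete, here is a minimal Python sketch of that conversion, using made-up patient records rather than the study’s actual data; in a real analysis the predicted probabilities would come from a model such as APACHE II.

    # Turning raw ICU death counts into observed-to-expected (O/E) mortality ratios.
    # Each patient record is (died?, model-predicted probability of death) – invented data.
    hospitals = {
        "Hospital A": [(True, 0.60), (False, 0.40), (False, 0.30), (True, 0.55)],
        "Hospital B": [(False, 0.10), (True, 0.20), (False, 0.15), (False, 0.05)],
    }

    for name, patients in hospitals.items():
        observed = sum(died for died, _ in patients)   # actual number of deaths
        expected = sum(p for _, p in patients)         # deaths the model predicted
        oe_ratio = observed / expected                 # < 1: better than predicted; > 1: worse
        print(f"{name}: observed={observed}, expected={expected:.1f}, O/E={oe_ratio:.2f}")

An O/E ratio of about 0.9, like the Hopkins figure above, means fewer patients died than the model predicted; a ratio near 1.6 means substantially more died than predicted.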

Lessons Learned From the ICU Study

Having set aside thin-slicing as impractical, we saw how a risk-adjustment tool like APACHE II enables an informed ranking of hospitals by ICU death rate.

If you are now thinking “This is all fine, but how do we know the APACHE II predictions were accurate?”, then you are absolutely asking the right question. Any comparison that incorporates risk adjustment will critically depend on the quality of the adjustment model.
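
One simple sanity check – sketched below with fabricated numbers, not data from the study – is to group patients into bands by their predicted probability of death and compare each band’s average predicted rate with its observed death rate; a well-calibrated model tracks the observed rates across all bands.

    # Calibration check sketch: fabricated (predicted probability, died?) pairs.
    patients = [(0.05, False), (0.08, False), (0.12, True), (0.30, False),
                (0.35, True), (0.45, False), (0.70, True), (0.85, True)]

    bands = [(0.00, 0.25), (0.25, 0.50), (0.50, 1.00)]   # low / medium / high predicted risk
    for lo, hi in bands:
        group = [(p, died) for p, died in patients if lo <= p < hi]
        if not group:
            continue
        mean_predicted = sum(p for p, _ in group) / len(group)
        observed_rate = sum(died for _, died in group) / len(group)
        print(f"risk {lo:.2f}-{hi:.2f}: predicted {mean_predicted:.2f}, observed {observed_rate:.2f}")

If the predicted and observed columns diverge badly, the “bananas” produced by the model cannot be trusted, no matter how carefully the hospitals are then compared.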

APACHE II was built from a data set of 5000 patients, so its predictions are generally accepted as valid [4]. But constructing such models is difficult, and they are data-hungry, which is why so few are in wide use. The USNews ranking system, for instance, uses the “Elixhauser model” of risk adjustment – perhaps a subject for a future post.

The Universal Lesson-Learned

A more general lesson worth noting is the substantial complexity needed to make fair comparisons between hospitals (the same holds for physicians). In the ICU study this included not only the study of 5000 patients from which APACHE II was built, but also all the effort in the 13 hospitals to systematically collect the two dozen data values on each of their ICU patients. Absent a mandate from the government or an accrediting body, no hospital is going to compile nice spreadsheets of patient data just to make life easy for ranking organizations like USNews or Healthgrades.

This dependence on rich underlying data is the Achilles heel of ranking systems.

Or not. As we’ll see in later posts, the difficulties in building ranking systems are so numerous that the animal mascot for ranking systems should be the centipede – the only creature with enough feet to have the required number of Achilles heels.

Our next installment looks at the Achilles heel that comes into play as soon as a ranking system uses the word “best.”

REFERENCES

[1] Obvious though it is, for decades the American Medical Association’s implied position was that one physician is just as good as another. While that position makes perfect sense for a physician advocacy organization, it defies statistical reality.

[2] Knaus and colleagues (1986) were attempting to learn whether the organization of ICU services has an impact on mortality. Their conclusions in that realm have been difficult to replicate, but the underlying data on the variability of adjusted ICU mortality rates remain valid. // Knaus WA, Draper EA, Wagner DP, Zimmerman JE. An evaluation of outcome from intensive care in major medical centers. Ann Intern Med. 1986;104(3):410-418. // PubMed 3946981 // DOI 10.7326/0003-4819-104-3-410

[3] Technically, the result for Hopkins was not quite statistically significant. But there were statistical outliers among the 13 hospitals: one with a mortality 59% of expected (good) and the hospital described above with a mortality 158% of expected (bad).

[4] APACHE II remains in use today, but hospitals are still learning how to use it correctly. // Polderman KH, Girbes AR, Thijs LG, Strack van Schijndel RJ. Accuracy and reliability of APACHE II scoring in two intensive care units. Problems and pitfalls in the use of APACHE II and suggestions for improvement. Anaesthesia. 2001;56(1):47-50. // PubMed 11167435 // DOI 10.1046/j.1365-2044.2001.01763.x // One question beyond the scope of this blog: with the enormous changes in ICU medicine since APACHE II’s birth in 1982, are the mortality predictions from that era still accurate?