Wednesday, May 27, 2020

Change to my Tweets about coronavirus data

It's frustrating to do this, but it has to be done. I'm going to change the thing I've been reporting. And (possibly worse), I'm going to tell you why. (Basically, I'm fixing something I should've been doing all along.)

I've been trying to report a measurement that communicates how fast the rate of infection was, how the rate differed from place to place, and how it was changing (thankfully, slowing down) over time. This has always involved some inference: the new cases each day aren't the same as the total new infections, just the ones that were being diagnosed; similarly, the total cases aren't the same as the total infections in the community, just the ones that were being treated. The assumption is that, roughly speaking, the new cases were some (relatively constant) percentage of the new infections, and the total cases were some (relatively constant) percentage of the total infections.

 Early on, "total cases" seemed as good a measure as any to substitute for the total number of people infected. But for a while now, there's been both good and bad news. People don't stay infected, as "active cases". Many get better. Sadly, some die. All of them, hopefully, once they become "active cases", are isolated and are much less likely to infect someone else. The "total cases", which I've been using, counts all the people in the state/region/etc. who have been diagnosed with the virus. It includes the recovered and the dead, as well as the cases still being treated. 

 Early on, there wasn't much difference between active cases and total cases. Now, in New York, there are over 279,000 active cases, which is a LOT. There are over 29,000 deaths, which is TOO MANY. And there are over 64,000 people who have recovered, which is a GREAT START. And I'm trying to figure out which number is the best representative of the number of infected people in the population. 

It doesn't have to be the closest to that number, but it does need to correlate with it: if there are twice as many infected people out there right now, you'd expect twice as many... 
active cases. 

 So I'm going to start reporting areas based on the threshold of new cases over the last 7 days being higher or lower than 10% (or 5%) of the _active cases_, not _total cases_. This will momentarily mean more areas will be above each threshold. But I think it's more accurate, and over time, the trend will tell us more about how fast (or slow) the disease is spreading. 
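
For anyone who wants to follow along at home, here is a rough sketch of that comparison in Python. The function names are made up for illustration, and it assumes active cases are total cases minus recoveries and deaths. It also takes the "new cases" figure as an input rather than deciding how to build it from the last 7 days, so treat it as an approximation of the idea, not the exact calculation behind the Tweets.

```python
def active_cases(total, recovered, deaths):
    """Assumed definition: diagnosed cases that have neither recovered nor died."""
    return total - recovered - deaths


def spread_bucket(new_cases, active, high=0.10, low=0.05):
    """Place a region in a bucket by comparing new cases to active cases.

    new_cases: the new-case figure for the reporting window (supplied by the caller)
    active:    the current number of active cases
    """
    if active <= 0:
        # No active cases means the ratio is not defined.
        return "undefined"
    ratio = new_cases / active
    if ratio > high:
        return f">{high:.0%}"
    if ratio > low:
        return f">{low:.0%}"
    return "less"


# Hypothetical region: 1,000 total cases, 300 recovered, 50 deaths.
active = active_cases(1000, 300, 50)
print(active, spread_bucket(80, active))
```

Swapping `active` for the total case count in the last line shows why the change moves some areas above a threshold: the same number of new cases is a bigger share of a smaller denominator.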

 As was always the case, this can do weird things for areas where the numbers are smaller generally, especially if the reporting of active cases is peculiar (Vermont, for example, lists 967 total cases, but only 65 active cases; I don't know if they've had 850 or so recoveries, or if those people have been moved to other states, or what. I have no reason to question these figures, but it could be Vermont's really good recovery rate is temporarily making them look like they have a faster spread rate).
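
To make the small-numbers effect concrete, here is a minimal sketch using Vermont's two figures from above (967 total, 65 active). The new-case counts are invented purely for illustration, and `share` is a hypothetical helper, not anything from the actual reporting.

```python
VT_TOTAL = 967   # total cases, from the figures above
VT_ACTIVE = 65   # active cases, from the figures above


def share(new_cases, base):
    """New cases as a fraction of the chosen base (total or active cases)."""
    return new_cases / base


# Hypothetical new-case counts, chosen only to show the effect.
for new_cases in (1, 5, 10):
    of_total = share(new_cases, VT_TOTAL)
    of_active = share(new_cases, VT_ACTIVE)
    print(f"{new_cases:>2} new: {of_total:.1%} of total, {of_active:.1%} of active")
```

With a base as small as 65, a handful of new cases is enough to cross the 5% line, while the same handful barely registers against 967.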

Another thing that should eventually put a hitch in my numbers is testing. As testing gets more and more widespread, the number of new cases should go up, as more infected people are found and treated. But that's a good thing! So if a state jumps back into the >10% category when it starts testing more broadly, that isn't a reason for alarm. It's the longer-term trend after the jump that really matters.

Okay, without further ado... as of today (I'm sorry, I'm going to have to report the new numbers going forward, and haven't recalculated the data for the past), here are the parts of the US that had a daily increase in COVID cases >10% of their current active cases: ID, MN, Veterans Affairs, WV. >5%: AL, AK, AR, KY, MS, NV, SC, SD, TX, US Military, VT (the rest were less).
