One way to mine data is to analyse past values for trends, cycles, or temporal correlations with other variables. I've made a start by looking at 12 years (1999 to 2010) of monthly US mortality data.
The data comes from the Multiple Cause of Death files in the WONDER database at the CDC.
The graph below shows the raw PD data (blue, left-hand scale) and the raw all-cause data (red, right-hand scale). The most noticeable features are:
- the variability from month to month;
- the values go through an annual cycle: high in winter, low in summer;
- the PD and all cause numbers follow similar patterns.
In the results shown below, the following terms are used:
PD: number of deaths with Parkinson's Disease as one of the multiple causes.
ALL: number of deaths from any cause.
DELTA: the PD value minus the ALL value (applied below to the index values).
RAW: unprocessed data from the query shown in the Data section.
1. Basic monthly statistics.
RAW PD: mean=2822, min=2156, max=3673
RAW ALL: mean=202611, min=180153, max=247828
2. Seasonality, distribution of deaths by month.
INDEX: processed data taking into account the number of days in each month, shown as an index centred on 100. For instance, if you had just January and February data from two years, the raw data could be Jan=310, Feb=280. January has 31 days and February 28, so both months have the same number of deaths per day and the index for both is 100.
SEQ: 12 values, one for each month, in order from January to December.
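The index calculation described above can be sketched as follows. This is a minimal illustration in Python rather than the R code actually used, and it assumes the index is the average deaths per day for each calendar month, scaled so that the mean over the months present is 100. The function name `day_adjusted_index` and the input layout are hypothetical.

```python
import calendar
from collections import defaultdict


def day_adjusted_index(counts):
    """Map {(year, month): deaths} to {month: index centred on 100}.

    Assumption: the index is deaths per day, averaged over the years
    available for each calendar month, then scaled so that the mean
    across the months present equals 100.
    """
    per_day = defaultdict(list)
    for (year, month), deaths in counts.items():
        # monthrange returns (weekday of first day, number of days in month)
        days = calendar.monthrange(year, month)[1]
        per_day[month].append(deaths / days)

    # Average daily rate for each calendar month across years
    monthly = {m: sum(v) / len(v) for m, v in per_day.items()}
    overall = sum(monthly.values()) / len(monthly)
    return {m: 100.0 * rate / overall for m, rate in sorted(monthly.items())}


# The January/February example from the text, split evenly over two
# non-leap years: totals are Jan=310 and Feb=280.
example = {(2001, 1): 155, (2002, 1): 155, (2001, 2): 140, (2002, 2): 140}
print(day_adjusted_index(example))  # {1: 100.0, 2: 100.0}
```

Using `calendar.monthrange` means leap-year Februaries are counted as 29 days without any special handling.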
INDEX PD: mean = SEQ(115.6, 112.9, 108.9, 99.1, 93.1, 88.6, 88.2, 88.0, 92.6, 99.0, 103.4, 111.2)
INDEX ALL: mean = SEQ(110.2, 109.3, 106.3, 100.4, 96.4, 94.2, 93.2, 92.5, 93.9, 97.4, 99.9, 106.6)
INDEX DELTA: mean = SEQ(5.4, 3.6, 2.6, -1.4, -3.3, -5.6, -5.0, -4.5, -1.2, 1.6, 3.5, 4.6)
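Both the all-cause and the PD mortality figures are higher in the winter than in the summer. (It would be interesting to know whether this is true in the southern states.) PD shows more seasonality than the all-cause data: the PD index ranges from 88.0 in August to 115.6 in January, whereas the all-cause index ranges from 92.5 to 110.2.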
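As a quick check on the seasonality comparison, the DELTA values and the peak-to-trough range of each index can be recomputed from the figures listed above. This is a Python sketch, not the original R code; the helper name `amplitude` is made up for illustration. Because it starts from the rounded index values, a recomputed DELTA can differ from the listed one by 0.1 (April and September do), presumably because the listed values were calculated before rounding.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Index values as listed in the results above
INDEX_PD = [115.6, 112.9, 108.9, 99.1, 93.1, 88.6,
            88.2, 88.0, 92.6, 99.0, 103.4, 111.2]
INDEX_ALL = [110.2, 109.3, 106.3, 100.4, 96.4, 94.2,
             93.2, 92.5, 93.9, 97.4, 99.9, 106.6]

# DELTA = PD - ALL, month by month, rounded to one decimal place
delta = [round(pd - al, 1) for pd, al in zip(INDEX_PD, INDEX_ALL)]


def amplitude(seq):
    """Peak-to-trough range of a 12-month index sequence."""
    return round(max(seq) - min(seq), 1)


for month, d in zip(MONTHS, delta):
    print(f"{month}: {d:+.1f}")

print(amplitude(INDEX_PD), amplitude(INDEX_ALL))  # 27.6 17.7
```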
3. Trend.
RAW PD: trend = +0.74%/year
RAW ALL: trend = +0.12%/year
This needs further analysis, but I think the faster growth in PD deaths is most likely due to the ageing population, since the prevalence of PD increases with age.
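For readers who want to estimate a percentage trend of this kind themselves, here is one possible approach, sketched in Python. It is an assumption, not necessarily the method behind the figures above: it fits a straight line to the logarithm of the monthly counts by ordinary least squares and converts the slope to a percentage change per year. The function name `percent_trend_per_year` is hypothetical, and the sketch treats months as equally spaced, ignoring their different lengths.

```python
import math


def percent_trend_per_year(monthly_counts):
    """Estimate growth in percent per year from monthly counts in time order.

    Assumption: constant percentage growth, estimated by least squares
    on log(count) against time measured in years.
    """
    n = len(monthly_counts)
    t = [i / 12 for i in range(n)]            # time in years
    y = [math.log(c) for c in monthly_counts]
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    slope = (sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
             / sum((ti - t_mean) ** 2 for ti in t))
    return 100.0 * (math.exp(slope) - 1.0)


# Synthetic series (not real data): 12 years growing at exactly 1% per year
synthetic = [1000 * 1.01 ** (i / 12) for i in range(144)]
print(round(percent_trend_per_year(synthetic), 3))  # 1.0
```

Fitting over whole years, as with the 144 months here, should limit the extent to which the strong winter peak distorts the slope.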
The calculations were done using a programming language called R (a free implementation of the long-standing statistical programming language S). Its advantage is that it has an enormous library of statistical functions.
The data comes from the WONDER database at the CDC. The database queries are shown in the Appendix.
WONDER does not require any programming skill to use. It is an excellent system, which I recommend to anyone with an interest in epidemiology. Although I used a different approach on this occasion, you can export the results to a file that can be imported into Excel.
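An exported file can also be read by a short script instead of Excel. The sketch below is in Python and rests on assumptions about the export layout that should be checked against a real file: a tab-delimited table with a header row containing "Notes", "Month Code" (in the form year/month) and "Deaths", followed by a "---" line and the query notes. The function name `read_wonder_export` is hypothetical, and the sample rows are invented purely to exercise the parser.

```python
import csv
import io


def read_wonder_export(text):
    """Return {(year, month): deaths} from tab-delimited export text.

    Assumed layout (verify against your own export): header row with
    "Notes", "Month Code" and "Deaths" columns, then data rows, then a
    "---" separator followed by free-text notes about the query.
    """
    data_lines = []
    for line in text.splitlines():
        if line.lstrip('"').startswith("---"):
            break  # everything after the separator is query notes
        data_lines.append(line)

    reader = csv.DictReader(io.StringIO("\n".join(data_lines)), delimiter="\t")
    counts = {}
    for row in reader:
        code = row.get("Month Code") or ""
        if row.get("Notes") or "/" not in code:
            continue  # skip total rows and rows without a year/month code
        year, month = code.split("/")
        counts[(int(year), int(month))] = int(row["Deaths"])
    return counts


# Invented sample in the assumed layout (the numbers are not real data)
sample = (
    '"Notes"\t"Month"\t"Month Code"\tDeaths\n'
    '\t"Jan., 1999"\t"1999/01"\t2500\n'
    '\t"Feb., 1999"\t"1999/02"\t2300\n'
    '"Total"\t\t\t4800\n'
    '"---"\n'
    '"Dataset: Multiple Cause of Death, 1999-2010"\n'
)
print(read_wonder_export(sample))  # {(1999, 1): 2500, (1999, 2): 2300}
```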
"Dataset: Multiple Cause of Death, 1999-2010"
"Hispanic Origin: All"
"MCD - ICD-10 Codes: G20 (Parkinson's disease)"
"Place of Death: All"
"Ten-Year Age Groups: All"
"UCD - ICD-10 Codes: All"
"Group By: Month"
"Show Totals: True"
"Show Zero Values: False"
"Show Suppressed: False"
"Calculate Rates Per: 100,000"
"Rate Options: Default intercensal populations for years 2001-2009 (except Infant Age Groups)"
"Help: See http://wonder.cdc.gov/wonder/help/mcd.html
for more information."
"Query Date: Mar 25, 2013 11:45:47 PM"
"Suggested Citation: Centers for Disease Control and Prevention , National Center for Health Statistics. Multiple Cause of Death"
"1999-2010 on CDC WONDER Online Database, released 2012. Data are from the Multiple Cause of Death Files, 1999-2010, as compiled"
"from data provided by the 57 vital statistics jurisdictions through the Vital Statistics Cooperative Program. Accessed at"
"http://wonder.cdc.gov/mcd-icd10.html on Mar 25, 2013 11:45:47 PM"