International Classification of Diseases (ICD).
Getting editions Nine (ICD9) & Ten (ICD10) into a relational database can be tricky, even though these ontologies are available free online from numerous US government websites. Finding them in data science-friendly formats can be time consuming and frustrating. So in this post I’ll outline how and where to obtain them, cleanse them and get them into a simple database for use elsewhere. Hopefully saving you hours of unnecessary searching or using data sets of uncertain provenance.
- ICD9 (1MB) : This zip bundle contains convenient Excel .xlsw files. One containing the Diagnoses (Dx) and the other containing Surgical Procedures.
- ICD10 (15MB) : This bundle only offers fixed width text files. At the time of writing I couldn’t locate a delimited version anywhere.
Both files make copious use of commas (",") in the Diagnosis and Procedure description fields. I’m going to remove these to avoid having to use escape codes in any downstream code I write.
I’m choosing to do this in Excel rather than a Shell script because I plan to do this only once, rather than regularly and automatically. Also, the file row count is well within the capabilities of even old versions of Excel.
Step 1 : Pre-processing the ICD9 data file in Excel.
CMS31_DESC_LONG_SHORT_DX.xls and open it in Excel. Then press
<CTRL + H> to open the Find and Replace dialog box.
Enter a comma (",") in the
Find what field and a bank space in the
Replace with field. Then select
Save As a CSV file. Not a Windows or Mac CSV, just a plain old CSV.
Repeat this step if you want the Procedure Codes file
Step 2 : Importing ICD data into a SQLite database.
Extract it and put the executable (sqlite3.exe) in the same folder you saved the
CMS31_DESC_LONG_SHORT_DX.csv you created in Step 1. Double clicking sqlite3.exe will open the command line tool.
Create a database in which to import your csv data
Instruct SQLite to expect a csv file
IMPORT the csv data file into a table
.import filename.csv tablename
Let it import then verify with a query
SELECT * FROM icdTable LIMIT 5;
Finally, quit the SQLite command line tool
This is your ICD9 data Extracted, Transformed and Loaded.
Step 3 : Pre-processing the ICD10 data file in Excel
As this data is “fixed width” the procedure to import it is slightly different though still done in Excel for the same reasons as above.
Launch Excel and open the file
icd10cm_order_2014.txt. Excel should try and launch the
Text Import Wizard.
If it doesn’t, you can select the
Text to Columns Tool which can be found in the
Then follow the wizard to place your column separators.
If you want to be really precise, refer to the PDF document that came in the zip file you downloaded. This will give you the precise character positions where columns start and end.
You should have a nice clean worksheet that you can Save As a CSV file.
Finally, repeat Step 2 to get the file
icd10cm_order_2014.csv into a SQLite database table.
The same procedure will work for MS SQL Server if you use the Data Import Tool.