<h2>International Classification of Diseases (ICD).</h2>

International Classification of Diseases (ICD).

Getting editions Nine (ICD9) & Ten (ICD10) into a relational database can be tricky, even though these ontologies are available free online from numerous US government websites. Finding them in data science-friendly formats can be time consuming and frustrating. So in this post I’ll outline how and where to obtain them, cleanse them and get them into a simple database for use elsewhere. Hopefully saving you hours of unnecessary searching or using data sets of uncertain provenance.

  • ICD9 (1MB) : This zip bundle contains convenient Excel .xlsw files. One containing the Diagnoses (Dx) and the other containing Surgical Procedures.
  • ICD10 (15MB) : This bundle only offers fixed width text files. At the time of writing I couldn’t locate a delimited version anywhere.

Both files make copious use of commas (",") in the Diagnosis and Procedure description fields. I’m going to remove these to avoid having to use escape codes in any downstream code I write.

I’m choosing to do this in Excel rather than a Shell script because I plan to do this only once, rather than regularly and automatically. Also, the file row count is well within the capabilities of even old versions of Excel.

Step 1 : Pre-processing the ICD9 data file in Excel.

Double click CMS31_DESC_LONG_SHORT_DX.xls and open it in Excel. Then press <CTRL + H> to open the Find and Replace dialog box.

Excel Find Replace

Enter a comma (",") in the Find what field and a bank space in the Replace with field. Then select Replace All.

Finally, Save As a CSV file. Not a Windows or Mac CSV, just a plain old CSV.

Excel Save As CSV

Repeat this step if you want the Procedure Codes file CMS31_DESC_LONG_SHORT_SG.xlsx too.

Step 2 : Importing ICD data into a SQLite database.

In the interests of simplicity I’m going to use a SQLite database on a Windows OS. So you’ll want to download the SQLite Command Line Tool.

Extract it and put the executable (sqlite3.exe) in the same folder you saved the CMS31_DESC_LONG_SHORT_DX.csv you created in Step 1. Double clicking sqlite3.exe will open the command line tool.

Create a database in which to import your csv data

  • .open icdDatabase

Instruct SQLite to expect a csv file

  • .mode csv

Bulk IMPORT the csv data file into a table

  • .import filename.csv tablename

Let it import then verify with a query

  • SELECT * FROM icdTable LIMIT 5;

Finally, quit the SQLite command line tool

  • .q

SQLite Console

This is your ICD9 data Extracted, Transformed and Loaded.

Step 3 : Pre-processing the ICD10 data file in Excel

As this data is “fixed width” the procedure to import it is slightly different though still done in Excel for the same reasons as above.

Launch Excel and open the file icd10cm_order_2014.txt. Excel should try and launch the Text Import Wizard.

Excel Screen Shot 1

If it doesn’t, you can select the Text to Columns Tool which can be found in the Data tab. Excel Screen Shot 1a

Then follow the wizard to place your column separators.

Excel Screen Shot 2

If you want to be really precise, refer to the PDF document that came in the zip file you downloaded. This will give you the precise character positions where columns start and end.

Excel Screen Shot 3

You should have a nice clean worksheet that you can Save As a CSV file.

Excel Screen Shot 4

Finally, repeat Step 2 to get the file icd10cm_order_2014.csv into a SQLite database table.

The same procedure will work for MS SQL Server if you use the Data Import Tool.

Adventures in Data...
Hey There!
What is This?