Covered Entity

A covered entity (CE) under HIPAA is a health care provider (e.g. doctors, dentists, pharmacies), a health plan (e.g. private insurance, government programs like Medicare), or a health care clearinghouse (i.e. an entity that processes and transmits health care information).

De-identified Health Data

De-identified health data is data from which personally identifiable information (PII) has been removed. Per the HIPAA Privacy Rule, healthcare data not in use for clinical support must have all information that can identify a patient removed before use. The rule offers two compliant paths for removing this information: the Safe Harbor method and the Statistical (Expert Determination) method. Once these identifying elements have been removed, the resulting de-identified health data set is no longer subject to the Privacy Rule’s restrictions on use and disclosure.

Deterministic Matching

Deterministic matching links records in two data sets on a value that uniquely identifies an entity. In practice, this value can be a social security number, Medicare Beneficiary ID, or any other value that is known to correspond to only a single entity. Deterministic matching has higher accuracy rates than probabilistic matching, but it is not perfect: a data entry error (e.g. a mistyped social security number) can cause a match on that field to link two different individuals.
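
As a minimal sketch of the idea, the Python below pairs records from two small data sets whose key field is exactly equal. The function name, field names, and sample records are hypothetical and exist only for illustration.

```python
def deterministic_match(left, right, key):
    """Pair records from two data sets whose `key` field is exactly equal."""
    # Index the right-hand data set by the key so each lookup is a dict access.
    index = {}
    for rec in right:
        value = rec.get(key)
        if value:  # records with a missing key can never match
            index.setdefault(value, []).append(rec)
    matches = []
    for rec in left:
        for other in index.get(rec.get(key), []):
            matches.append((rec, other))
    return matches


# Hypothetical sample data.
claims = [
    {"ssn": "123-45-6789", "claim_id": "C1"},
    {"ssn": "987-65-4321", "claim_id": "C2"},
]
labs = [
    {"ssn": "123-45-6789", "lab_id": "L1"},
    {"ssn": "555-55-5555", "lab_id": "L2"},
]
pairs = deterministic_match(claims, labs, "ssn")
```

Only claim C1 and lab record L1 share a key value, so they form the single pair. A mistyped key in either data set would silently produce a wrong pair or a missed one, which is the failure mode described above.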

Encrypted Patient Token

Encrypted patient tokens are non-reversible 44-character strings created from a patient’s PHI, allowing a patient’s records to be matched across different de-identified health data sets without exposure of the original PHI.
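
This glossary does not say how such tokens are generated. One construction that happens to yield a non-reversible 44-character string is a keyed SHA-256 hash (HMAC) of normalized PHI fields, base64-encoded. The sketch below assumes that construction; the function name, the choice of fields, and the key are all hypothetical.

```python
import base64
import hashlib
import hmac


def make_token(first_name, last_name, dob, secret_key):
    """Derive a one-way token from PHI fields (assumed construction)."""
    # Normalize so trivial formatting differences do not change the token.
    normalized = "|".join(
        part.strip().upper() for part in (first_name, last_name, dob)
    )
    digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).digest()
    # A 32-byte digest base64-encodes to 44 characters, padding included.
    return base64.b64encode(digest).decode("ascii")


SECRET = b"hypothetical-site-key"  # placeholder; a real key would be managed securely
t1 = make_token("James", "Smith", "1970-01-31", SECRET)
t2 = make_token(" james ", "SMITH", "1970-01-31", SECRET)
t3 = make_token("Jane", "Smith", "1970-01-31", SECRET)
```

Here `t1` and `t2` are identical because both inputs normalize to the same string, while `t3` differs. Two data sets tokenized with the same key can therefore be joined on the token without either side exchanging the underlying PHI.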

Expert Determination Method of De-Identification (HIPAA)

The HIPAA Privacy Rule allows two methods for de-identifying healthcare data – the Safe Harbor method, and the Expert Determination method. Unlike Safe Harbor, Expert Determination allows the retention in the data set of quasi-identifiers (commonly, geography and dates of service) that are critical for more in-depth analyses of healthcare treatment. Direct identifiers like names, addresses, etc. must still be removed in the Expert Determination method, and users of this method often have an independent certifier (the “Expert”) review their de-identification rules and the resulting de-identified data set for compliance with the HIPAA Privacy Rule.

False positive

A false positive is a result that incorrectly states that a test condition is positive. In the case of matching patient records between data sets, a false positive is the condition where a “match” of two records does not actually represent records for the same patient. False positives are more common in probabilistic matching than in deterministic matching.

Fuzzy matching

Fuzzy matching is the process of finding values that match approximately rather than exactly. In the case of matching PHI, fuzzy matching can include matching on different variants of a name (Jamie, Jim, and Jimmy all being allowed as a match for “James”). To facilitate fuzzy matching, algorithms like Soundex can allow for differently spelled character strings to generate the same output value.
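
A minimal sketch of the name-variant case: each first name is mapped to a canonical form before comparison. The nickname table is hypothetical and deliberately tiny; a production table would be far larger.

```python
# Hypothetical nickname table mapping variants to a canonical first name.
NICKNAMES = {
    "jamie": "james",
    "jim": "james",
    "jimmy": "james",
}


def canonical_first_name(name):
    """Lower-case the name and replace a known nickname with its canonical form."""
    cleaned = name.strip().lower()
    return NICKNAMES.get(cleaned, cleaned)


def first_names_match(a, b):
    """Treat two first names as a match if their canonical forms are equal."""
    return canonical_first_name(a) == canonical_first_name(b)
```

With this table, `first_names_match("Jimmy", "James")` and `first_names_match("Jim", "Jamie")` are both true, while `first_names_match("John", "James")` is false.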

Health Information Technology for Economic and Clinical Health (HITECH) Act

The HITECH Act was passed as part of the American Recovery and Reinvestment Act of 2009 (ARRA) economic stimulus bill. HITECH was designed to accelerate the adoption of electronic medical records (EMR) through the use of financial incentives for “meaningful use” of EMRs until 2015, and financial penalties for failure to do so thereafter. HITECH added important security regulations and data breach liability rules that built on the rules laid out in HIPAA.

Health Insurance Portability and Accountability Act of 1996 (HIPAA)

HIPAA is a U.S. law requiring the U.S. Department of Health and Human Services (HHS) to develop security and privacy regulations for protected health information. Prior to HIPAA, no such standards existed in the industry. HHS created the HIPAA Privacy Rule and HIPAA Security Rule to fulfill its obligation, and the Office for Civil Rights (OCR) within HHS is responsible for enforcing these rules.

Personally-identifiable information (PII)

Personally-identifiable information (PII) is a general term in information and security laws describing any information that allows an individual to be identified either directly or indirectly. PII is a U.S.-centric abbreviation, but is generally equivalent to “personal information” and similar terms outside the United States. PII can consist of informational elements like name, address, social security number or other identifying number or code, telephone number, and email address, but can also include non-specific data elements such as gender, race, birth date, and geographic indicator that together can still allow indirect identification of an individual.

Probabilistic matching

Probabilistic matching is when fields in two data sets are matched using values that are known not to be unique, but the combination of values gives a high probability that the correct entity is matched. In practice, names, birth dates, and other identifying but non-unique values can be used (often in combination) to facilitate probabilistic matching.
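
One simple way to picture this is a weighted agreement score: each non-unique field that agrees adds to the score, and the pair is accepted once the score clears a threshold. The sketch below assumes that approach; the weights, threshold, and field names are hypothetical and chosen only for illustration, not derived from real data.

```python
# Hypothetical agreement weights and acceptance threshold.
WEIGHTS = {"last_name": 4.0, "first_name": 2.0, "birth_date": 5.0, "zip3": 1.0}
THRESHOLD = 9.0


def match_score(a, b):
    """Sum the weights of the fields on which two records agree."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        va, vb = a.get(field), b.get(field)
        if va and vb and va == vb:  # missing values never count as agreement
            score += weight
    return score


def is_probable_match(a, b):
    """Accept the pair when the combined evidence clears the threshold."""
    return match_score(a, b) >= THRESHOLD


# Hypothetical sample records.
rec_a = {"last_name": "smith", "first_name": "james", "birth_date": "1970-01-31", "zip3": "021"}
rec_b = {"last_name": "smith", "first_name": "jim", "birth_date": "1970-01-31", "zip3": "021"}
rec_c = {"last_name": "smith", "first_name": "james", "birth_date": "1982-07-04", "zip3": "945"}
```

Records `rec_a` and `rec_b` agree on last name, birth date, and ZIP prefix, so they clear the threshold despite the differing first name. Records `rec_a` and `rec_c` share only a common name and do not. Lowering the threshold would accept more true matches at the cost of more false positives.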

Protected health information (PHI)

Protected health information (PHI) refers to information that includes health status, health care (physician visits, prescriptions, procedures, etc.), or payment for that care and can be linked to an individual. Under U.S. law, PHI is information that is specifically created or collected by a covered entity.

Social Security Death Master File

The U.S. Social Security Administration maintains a file of over 86 million records of deaths collected from social security payments, but it is not a complete compilation of deaths in the United States. In recent years, multiple states have opted out of contributing their information to the Death Master File and its level of completeness has declined substantially. This Death Master File has limited access, and users must be certified to receive it. This file contains PHI elements like social security numbers, names, and dates of birth – therefore, bringing the raw data into a healthcare data environment could risk a HIPAA violation.

Safe Harbor Method of De-Identification (HIPAA)

The Safe Harbor method of anonymization and de-identification under the US Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy Rule eliminates 18 patient identifiers in healthcare data. These identifiers are also known as protected health information (PHI). The Safe Harbor rule is defined in 45 CFR 164.514(b)(2) by the US Department of Health and Human Services. The intent is that, once PHI has been manipulated or eliminated in compliance with the Safe Harbor rule, records in the resulting data set cannot be traced back to an individual patient. These 18 identifiers include:

  1. Names
  2. All geographic subdivisions smaller than a state, generally with the exception of the initial three digits of the ZIP code
  3. All elements of dates except years
  4. Telephone numbers
  5. Fax numbers
  6. Email addresses
  7. Social security numbers
  8. Medical record numbers
  9. Health plan beneficiary numbers
  10. Account numbers
  11. Certificate/license numbers
  12. Vehicle identifiers and serial numbers including license plates
  13. Device identifiers and serial numbers
  14. Web URLs
  15. Internet protocol addresses
  16. Biometric identifiers (e.g. retinal scans, fingerprints)
  17. Photos
  18. Any other unique identifying number, characteristic, or code
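
To make a few of these rules concrete, the sketch below applies three of them to a single record: it drops some direct identifiers, keeps only the first three digits of the ZIP code, and reduces dates to the year. It is a partial illustration and not a compliant implementation. The field names are hypothetical, dates are assumed to be ISO-formatted strings, and the rule's further conditions (for example, the handling of sparsely populated ZIP prefixes and of very advanced ages) are not shown.

```python
# Hypothetical field names standing in for some of the direct identifiers.
DIRECT_IDENTIFIER_FIELDS = {"name", "phone", "fax", "email", "ssn", "mrn"}


def safe_harbor_sketch(record):
    """Apply a subset of the Safe Harbor rules to one record (illustration only)."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIER_FIELDS:
            continue  # drop direct identifiers entirely
        if field == "zip":
            out["zip3"] = value[:3]  # keep only the initial three digits
        elif field.endswith("_date"):
            # Assumes an ISO date string (YYYY-MM-DD); keep the year only.
            out[field[: -len("_date")] + "_year"] = value[:4]
        else:
            out[field] = value
    return out


# Hypothetical sample record.
patient = {
    "name": "James Smith",
    "ssn": "123-45-6789",
    "zip": "02139",
    "birth_date": "1970-01-31",
    "service_date": "2021-03-15",
    "diagnosis_code": "E11.9",
}
deidentified = safe_harbor_sketch(patient)
```

The result keeps `zip3`, `birth_year`, `service_year`, and `diagnosis_code`, and contains neither the name nor the social security number.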


Soundex

Soundex is a phonetic algorithm that codes similar-sounding names (in English) as a consistent value. Soundex is commonly used when matching surnames across data sets, as variations in spelling are common in data entry. Each Soundex code generated from an input text string has 4 characters: the first letter of the name, followed by 3 digits generated from the remaining characters, with similar-sounding phonetic elements coded the same (e.g. D and T are both coded as a 3, M and N are both coded as a 5).
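
The coding scheme described above can be sketched in Python as follows. This follows the commonly published American Soundex rules, in which vowels are not coded, adjacent letters with the same code are collapsed, and H and W do not separate two letters with the same code. The paragraph above only gives the D/T and M/N codes, so the remaining letter groups are taken from that standard scheme.

```python
# Letter groups and their digits under American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the 4-character Soundex code for a name, or '' if it has no letters."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")  # the first letter's code also suppresses repeats
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code:
            if code != prev:
                result += code
            prev = code
        elif ch not in "HW":
            prev = ""  # a vowel separates repeats, so the same code is emitted again
        if len(result) == 4:
            break
    return result.ljust(4, "0")  # pad short codes with zeros
```

For example, `soundex("Smith")` and `soundex("Smyth")` both return `S530`, and `soundex("Robert")` and `soundex("Rupert")` both return `R163`, so either spelling can be matched on the code.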

Statistical Method of De-Identification (HIPAA)

Alternate name for the Expert Determination method (see details above) allowed under the HIPAA Privacy Rule. Expert Determination is often called the Statistical Method because certifiers use statistics to determine what level of quasi-identifiers (e.g. ages, geographies, date granularity) can be retained in a data set while still satisfying the HIPAA Privacy Rule’s standards for very low risk of re-identification.
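
As an illustration of the kind of statistic involved, the sketch below counts how many records share each combination of quasi-identifier values and reports the size of the smallest group. Small groups indicate records that are easier to single out. This is only one commonly used measure, assumed here for illustration; it is not the standard a certifier is required to apply, and the field names and records are hypothetical.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest set of records sharing the same quasi-identifier values."""
    groups = Counter(
        tuple(rec[q] for q in quasi_identifiers) for rec in records
    )
    if not groups:
        return 0  # no records, so no groups
    return min(groups.values())


# Hypothetical sample records.
records = [
    {"zip3": "021", "birth_year": "1970"},
    {"zip3": "021", "birth_year": "1970"},
    {"zip3": "021", "birth_year": "1971"},
]
by_zip_and_year = smallest_group_size(records, ["zip3", "birth_year"])
by_zip_only = smallest_group_size(records, ["zip3"])
```

Retaining both fields leaves one record alone in its group, while retaining only the ZIP prefix puts all three records in the same group. Coarsening or dropping quasi-identifiers trades analytic detail for larger groups.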