MediLang Glossary

This glossary provides explanations of healthcare and bioinformatics-related terms that developers might encounter while working with MediLang. Each section includes relevant resources, citations, and practical examples for better understanding.

Healthcare Data Standards

FHIR (Fast Healthcare Interoperability Resources)

A modern standard for exchanging healthcare information electronically. FHIR combines the best features of HL7's v2, v3, and CDA while leveraging current web technologies.

Key Features:
- RESTful API architecture
- JSON, XML, and RDF formats
- Modular components called "Resources"
- Human-readable base schemas
- Strong focus on implementability

Example FHIR Resource (Patient in JSON):

{
  "resourceType": "Patient",
  "id": "example",
  "active": true,
  "name": [{
    "use": "official",
    "family": "Smith",
    "given": ["John", "Edward"]
  }],
  "gender": "male",
  "birthDate": "1974-12-25",
  "address": [{
    "use": "home",
    "line": ["123 Healthcare Street"],
    "city": "Ann Arbor",
    "state": "MI",
    "postalCode": "48105"
  }]
}

Example FHIR API Endpoints:

GET  /Patient/[id]           # Get specific patient
POST /Patient               # Create new patient
GET  /Patient?name=Smith    # Search patients by name
GET  /Observation?patient=[id] # Get patient's observations
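
The Patient resource and search endpoints above can be exercised with a few lines of standard-library Python. This is a minimal sketch, not a full client: the base URL `https://fhir.example.org/baseR4` and the helper names `search_url` and `display_name` are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Hypothetical base URL, used only for illustration.
BASE_URL = "https://fhir.example.org/baseR4"


def search_url(resource_type: str, **params: str) -> str:
    """Build a FHIR search URL such as [base]/Patient?name=Smith."""
    query = urlencode(params)
    return f"{BASE_URL}/{resource_type}" + (f"?{query}" if query else "")


def display_name(patient: dict) -> str:
    """Join given and family names, preferring the 'official' name entry."""
    names = patient.get("name", [])
    if not names:
        return ""
    chosen = next((n for n in names if n.get("use") == "official"), names[0])
    parts = chosen.get("given", []) + [chosen.get("family", "")]
    return " ".join(part for part in parts if part)


# A trimmed copy of the Patient resource shown above.
patient = {
    "resourceType": "Patient",
    "name": [{"use": "official", "family": "Smith", "given": ["John", "Edward"]}],
}

print(search_url("Patient", name="Smith"))
print(display_name(patient))
```

Because `name` is a list and `given` is itself a list, code that reads a Patient should always handle several names and several given names.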

Resources:
- Official FHIR Documentation
- FHIR for Developers
- SMART on FHIR

HL7 (Health Level Seven)

Health Level Seven International (HL7) is a standards-development organization for health technology interoperability. It publishes healthcare data exchange and information modeling standards, including the versions listed below and FHIR.

Versions:
- HL7 v2.x: Legacy message-based standard using pipe-delimited syntax

Example HL7 v2 Message (ADT-A01 Patient Admission):

MSH|^~\&|SENDING_APP|SENDING_FAC|RECEIVING_APP|RECEIVING_FAC|20230401123045||ADT^A01|MSG00001|P|2.3
EVN|A01|20230401123045
PID|1||12345^^^MRN||SMITH^JOHN^E||19741225|M|||123 HEALTHCARE ST^^ANN ARBOR^MI^48105
NK1|1|SMITH^JANE^|SPOUSE|734-555-0123
PV1|1|I|2000^2012^01||||0123^WATSON^ROBERT|||||||||V|||||||||||||||||||||||||20230401123045
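
The pipe-delimited layout can be read with plain string splitting. The following is a rough sketch that assumes the default delimiters (`|` for fields, `^` for components) and ignores escape sequences and repetitions; the helper names `parse_segments` and `patient_summary` are made up for this example, and real integrations would normally use a dedicated HL7 library.

```python
# Segments in HL7 v2 are separated by carriage returns.
MESSAGE = "\r".join([
    r"MSH|^~\&|SENDING_APP|SENDING_FAC|RECEIVING_APP|RECEIVING_FAC|20230401123045||ADT^A01|MSG00001|P|2.3",
    "EVN|A01|20230401123045",
    "PID|1||12345^^^MRN||SMITH^JOHN^E||19741225|M|||123 HEALTHCARE ST^^ANN ARBOR^MI^48105",
])


def parse_segments(message: str) -> dict:
    """Map each segment ID (MSH, PID, ...) to a list of its field lists."""
    segments = {}
    for line in message.replace("\n", "\r").split("\r"):
        if not line.strip():
            continue
        fields = line.split("|")
        segments.setdefault(fields[0], []).append(fields)
    return segments


def patient_summary(message: str) -> dict:
    """Pull a few demographics out of the first PID segment."""
    pid = parse_segments(message)["PID"][0]
    # For non-MSH segments, list index N corresponds to field N (PID-5 is pid[5]).
    family, given, *_ = pid[5].split("^") + ["", ""]
    return {
        "mrn": pid[3].split("^")[0],
        "family": family,
        "given": given,
        "birth_date": pid[7],
        "sex": pid[8],
    }


print(patient_summary(MESSAGE))
```

Note that MSH is a special case: MSH-1 is the field separator itself, so list indexes in that segment are off by one relative to field numbers.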

Example HL7 v3-style Message (the same admission as simplified, illustrative XML; real v3 interactions are considerably more verbose):

<ADT_A01 xmlns="urn:hl7-org:v3">
  <id root="2.16.840.1.113883.19.3.2409" extension="MSG00001"/>
  <creationTime value="20230401123045"/>
  <patient>
    <name>
      <given>John</given>
      <family>Smith</family>
    </name>
    <administrativeGenderCode code="M"/>
    <birthTime value="19741225"/>
  </patient>
</ADT_A01>

- HL7 v3: XML-based standard with Reference Information Model (RIM)
- CDA: Clinical Document Architecture for clinical documents

Resources:
- HL7 Standards
- HL7 v2 to FHIR

Genomics File Formats

FASTQ

A text format that stores both biological sequences and their quality scores. Essential in modern high-throughput sequencing workflows.

Format Structure:

@SEQ_ID                 # Sequence identifier
GATTTGGGGTTCAAAGCAG    # Raw sequence
+                      # Separator line
!''*((((***+))%%%++   # Quality scores (ASCII-encoded)

Example FASTQ Entry (Illumina Format):

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=35
GATTACGAATGCTAGGTCGGATCTGAGGCAATGTG
+
!''*(((((***+))%%%++)(%%%%).1***-+*

Quality Score Interpretation:

# Phred+33 encoding (Illumina 1.8+)
def qual_to_prob(ascii_char):
    """Convert ASCII quality char to error probability"""
    phred = ord(ascii_char) - 33
    error_prob = 10 ** (-phred/10)
    return error_prob

# Example
print(qual_to_prob('!'))  # Lowest quality (Q=0)
print(qual_to_prob('I'))  # High quality (Q=40)
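
The four-line structure described above can be read record by record. Here is a minimal sketch, assuming unwrapped records (exactly four lines each) and Phred+33 encoding; `read_fastq` and `mean_phred` are illustrative names, not part of any library.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of file (or a trailing blank line)
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


def mean_phred(quality: str, offset: int = 33) -> float:
    """Average Phred score of a quality string (Phred+33 by default)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


example = io.StringIO("@read1\nGATTACA\n+\nIIIII5!\n")
for name, seq, qual in read_fastq(example):
    print(name, len(seq), round(mean_phred(qual), 1))
```

Records are parsed by position rather than by looking for `@`, because `@` is also a valid quality character and can appear at the start of a quality line.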

Resources:
- FASTQ Format Specification
- Quality Score Encoding

VCF (Variant Call Format)

The standard format for storing DNA sequence variations. Critical for genomic medicine and population genetics studies.

Example VCF File:

##fileformat=VCFv4.3
##reference=GRCh38
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE1 SAMPLE2
20      14370   rs6054257    G       A       29      PASS    AF=0.5  GT      0/1     1/1
20      17330   .           T       A       3       q10     AF=0.017 GT      0/0     0/1
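
To make the column layout concrete, the sketch below parses one data line into a dictionary. It assumes tab-delimited input (as real VCF files are, even though the example above is shown with spaces for readability) and only extracts the GT field per sample; `parse_vcf_line` is a hypothetical helper, and a library such as pysam's `VariantFile` is the better choice for real files.

```python
from typing import Dict, List


def parse_vcf_line(line: str, sample_names: List[str]) -> Dict:
    """Parse a single tab-delimited VCF data line (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, flt, info = fields[:8]

    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info_dict = {}
    for entry in info.split(";"):
        if entry == ".":
            continue
        key, _, value = entry.partition("=")
        info_dict[key] = value if value else True

    record = {
        "chrom": chrom,
        "pos": int(pos),
        "id": None if var_id == "." else var_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
        "genotypes": {},
    }

    # Column 9 (FORMAT) names the colon-separated keys used in each sample column.
    if len(fields) > 9:
        keys = fields[8].split(":")
        for name, sample in zip(sample_names, fields[9:]):
            record["genotypes"][name] = dict(zip(keys, sample.split(":"))).get("GT")
    return record


line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tAF=0.5\tGT\t0/1\t1/1"
print(parse_vcf_line(line, ["SAMPLE1", "SAMPLE2"]))
```

In the genotype strings, `0` refers to the REF allele and `1` to the first ALT allele, so `0/1` is heterozygous and `1/1` is homozygous for the alternate allele.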

Common Operations with VCF:

# Filter variants by quality using bcftools
bcftools filter -i 'QUAL>20' input.vcf > high_quality.vcf

# Extract SNPs only
bcftools view -v snps input.vcf > snps_only.vcf

# Calculate allele frequencies
bcftools +fill-tags input.vcf -- -t AF

Resources:
- VCF 4.3 Specification
- VCF Guide by GATK

BAM/SAM

SAM (Sequence Alignment/Map) is a tab-delimited text format for storing sequence alignments against reference sequences; BAM is its compressed binary equivalent.

Example SAM File:

@HD     VN:1.6  SO:coordinate
@SQ     SN:chr1 LN:248956422
@RG     ID:S1   SM:sample1    LB:lib1     PL:ILLUMINA
read1   99      chr1    45646  60      23M     =       45867   244     AGCTGCAGTCAGTTCTGTACACC    BBBBBBBBBBBBBBBBBBBBBBB    NM:i:0  AS:i:23
read2   147     chr1    45867  60      23M     =       45646   -244    TGCACCTGTACAGAACTGACTGC    BBBBBBBBBBBBBBBBBBBBBBB    NM:i:0  AS:i:23
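
The second column of each alignment line is a bitwise FLAG. As a small illustration, the sketch below decodes it using the bit values from the SAM specification; the short label strings and the `decode_flag` helper are naming choices made for this example.

```python
# FLAG bits as defined in the SAM specification; the labels are informal.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list:
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAGS.items() if flag & bit]


for flag in (99, 147):
    print(flag, decode_flag(flag))
```

Flags 99 and 147 are the typical combination for a properly paired read and its mate: the first read maps to the forward strand and the second to the reverse strand.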

Common SAM/BAM Operations:

# Convert SAM to BAM
samtools view -b input.sam > output.bam

# Sort BAM file
samtools sort input.bam -o sorted.bam

# Index BAM file
samtools index sorted.bam

# View alignment statistics
samtools flagstat sorted.bam

Example Python Code (using pysam):

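import pysam

# Open a coordinate-sorted BAM file; region queries also need its index
# (created with `samtools index sorted.bam`)
with pysam.AlignmentFile("sorted.bam", "rb") as bam:
    # Count reads overlapping a region
    region_reads = bam.count("chr1", 100000, 200000)
    print(f"Reads in region: {region_reads}")

    # Print the name and 0-based start position of each read in the region
    for read in bam.fetch("chr1", 100000, 200000):
        print(read.query_name, read.reference_start)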

Resources:
- SAM/BAM Specifications
- Samtools Documentation
- Pysam Documentation

Computational Biology Terms

NGS (Next Generation Sequencing)

Modern high-throughput DNA sequencing technologies that revolutionized genomics research.

Key Technologies:
- Illumina (Short-read sequencing)
- PacBio (Long-read sequencing)
- Oxford Nanopore (Long-read, real-time sequencing)
- 10x Genomics (Linked-read sequencing)

Example Workflow:

graph TD
    A[Sample Collection] --> B[DNA Extraction]
    B --> C[Library Preparation]
    C --> D[Sequencing]
    D --> E[Base Calling]
    E --> F[Quality Control]
    F --> G[Data Analysis]

Example Quality Metrics:

# Common NGS QC metrics
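class NGSMetrics:
    def __init__(self, total_reads, mapped_reads, q30_bases, read_length=150):
        self.total_reads = total_reads
        self.mapped_reads = mapped_reads
        self.q30_bases = q30_bases
        self.read_length = read_length  # bases per read; defaults to 150 bp

    @property
    def mapping_rate(self):
        """Percentage of reads that mapped to the reference"""
        return self.mapped_reads / self.total_reads * 100

    @property
    def q30_rate(self):
        """Percentage of bases at Q30 or above, assuming fixed-length reads"""
        return self.q30_bases / (self.total_reads * self.read_length) * 100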

Resources:
- Nature Review: NGS Technologies
- Illumina Sequencing Methods

Bioinformatics Pipeline

A series of computational steps for processing biological data, typically implemented as workflows.

Example Nextflow Pipeline:

// Basic RNA-seq pipeline
nextflow.enable.dsl=2

process FASTQC {
    input:
        tuple val(sample_id), path(reads)
    output:
        path "fastqc_${sample_id}_logs"

    script:
    """
    # FastQC does not create the output directory itself
    mkdir fastqc_${sample_id}_logs
    fastqc -o fastqc_${sample_id}_logs ${reads}
    """
}

process ALIGN {
    input:
        tuple val(sample_id), path(reads)
        path index
    output:
        tuple val(sample_id), path("${sample_id}.Aligned.sortedByCoord.out.bam")

    script:
    """
    STAR --genomeDir $index \
         --readFilesIn ${reads} \
         --outFileNamePrefix ${sample_id}. \
         --outSAMtype BAM SortedByCoordinate \
         --runThreadN ${task.cpus}
    """
}

Example Snakemake Rule:

# Rule for variant calling
rule call_variants:
    input:
        bam="mapped/{sample}.bam",
        ref="reference/genome.fa"
    output:
        vcf="variants/{sample}.vcf"
    conda:
        "envs/gatk.yaml"
    shell:
        """gatk HaplotypeCaller \
           -R {input.ref} \
           -I {input.bam} \
           -O {output.vcf}"""
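
Workflow engines such as Nextflow and Snakemake work out the execution order from the inputs and outputs each step declares. The idea can be sketched in a few lines with the standard-library `graphlib` module (Python 3.9+). The step names below are hypothetical and only loosely mirror the examples above.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical names).
DEPENDENCIES = {
    "fastqc": {"raw_reads"},
    "align": {"raw_reads", "genome_index"},
    "call_variants": {"align"},
    "report": {"fastqc", "call_variants"},
}


def execution_order(dependencies: dict) -> list:
    """Return one valid order in which every step runs after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


print(execution_order(DEPENDENCIES))
```

Several valid orders usually exist. A real engine uses that freedom to run independent steps (here `fastqc` and `align`) in parallel.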

Resources:
- Nextflow Documentation
- nf-core: Curated Pipelines
- Snakemake Documentation

Clinical Terms

EHR/EMR Systems

Digital systems for managing patient health information.

Example EHR Data Structure:

{
  "patient": {
    "demographics": {
      "id": "12345",
      "name": {
        "first": "John",
        "last": "Smith"
      },
      "dob": "1974-12-25",
      "gender": "M"
    },
    "encounters": [{
      "date": "2023-03-15",
      "type": "office_visit",
      "provider": "Dr. Jane Wilson",
      "diagnosis": [{
        "code": "E11.9",
        "system": "ICD-10",
        "description": "Type 2 diabetes without complications"
      }],
      "vitals": {
        "blood_pressure": "120/80",
        "temperature": "98.6",
        "pulse": 72
      }
    }],
    "medications": [{
      "name": "Metformin",
      "dose": "500mg",
      "frequency": "BID",
      "rxnorm": "860975"
    }]
  }
}

Example HL7 FHIR Query (Python):

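from fhirclient import client
import fhirclient.models.condition as c
import fhirclient.models.patient as p

# Connect to a FHIR server (public HAPI test server)
settings = {
    'app_id': 'my_app',
    'api_base': 'https://hapi.fhir.org/baseR4'
}
smart = client.FHIRClient(settings=settings)

# Patient has no "condition" search parameter, so search Condition
# resources coded as diabetes mellitus (SNOMED CT 73211009) instead
search = c.Condition.where(struct={'code': 'http://snomed.info/sct|73211009'})
conditions = search.perform_resources(smart.server)

# Look up the patient each condition refers to and print the name
for condition in conditions:
    patient_id = condition.subject.reference.split('/')[-1]
    patient = p.Patient.read(patient_id, smart.server)
    if patient.name:
        name = patient.name[0]
        print(f"Found patient {' '.join(name.given or [])} {name.family}")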

Resources:
- HealthIT.gov EHR Basics
- ONC Health IT Certification
- SMART on FHIR Apps

ICD (International Classification of Diseases)

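The World Health Organization's classification of diseases and related health problems, used worldwide as the basis for mortality and morbidity statistics. ICD-10 and its national adaptations (such as ICD-10-CM in the United States) are also widely used for billing; ICD-11 is the current revision.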

Example ICD-10 Codes:

E11.9    Type 2 diabetes mellitus without complications
I10      Essential (primary) hypertension
J45.909  Unspecified asthma, uncomplicated
F32.1    Major depressive disorder, single episode, moderate
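
Codes like these often arrive without the dot or in lower case. The sketch below normalizes them and checks their general shape: a letter, a digit, one more alphanumeric character, then an optional dot with up to four further characters. This pattern is a simplifying assumption and not an official grammar, and it cannot tell whether a code actually exists in a given ICD-10 release; `normalize_icd10` is an illustrative name.

```python
import re

# Simplified shape check (an assumption, not an official ICD-10 grammar).
ICD10_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")


def normalize_icd10(raw: str) -> str:
    """Upper-case a code, insert the dot after the third character, and validate its shape."""
    code = raw.strip().upper()
    if "." not in code and len(code) > 3:
        code = f"{code[:3]}.{code[3:]}"
    if not ICD10_SHAPE.match(code):
        raise ValueError(f"Not a plausible ICD-10 code: {raw!r}")
    return code


for raw in ("e119", "I10", "J45909", "F32.1"):
    print(raw, "->", normalize_icd10(raw))
```

A shape check like this is only a first filter; validating against the published code list for the relevant ICD-10 edition is still needed before codes are stored or billed.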

Example Python Code for ICD Processing:

from typing import Dict, List

class ICDCode:
    def __init__(self, code: str, description: str):
        self.code = code
        self.description = description
        self.category = self._get_category()

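    def _get_category(self) -> str:
        """Coarse category lookup by first letter.

        Real ICD-10 chapters are code ranges, and some letters span more
        than one chapter (D covers both neoplasms and blood disorders).
        """
        categories = {
            'A': 'Infectious diseases',
            'C': 'Neoplasms',
            'E': 'Endocrine disorders',
            'F': 'Mental disorders',
            'I': 'Circulatory system',
            'J': 'Respiratory system'
        }
        return categories.get(self.code[0], 'Other')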

# Example usage
diagnoses = [
    ICDCode('E11.9', 'Type 2 diabetes'),
    ICDCode('I10', 'Hypertension'),
    ICDCode('F32.1', 'Depression')
]

# Group by category
by_category: Dict[str, List[ICDCode]] = {}
for dx in diagnoses:
    by_category.setdefault(dx.category, []).append(dx)

Resources:
- WHO ICD-11
- CDC ICD Guidelines
- ICD-10 Data Files

SNOMED CT

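A comprehensive, multilingual clinical terminology maintained by SNOMED International. Each concept has a numeric identifier, human-readable descriptions, and typed relationships (such as "Is a") to other concepts.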

Example SNOMED CT Concept (illustrative structure; the // annotations are for the reader and are not valid JSON):

{
  "conceptId": "73211009",
  "fsn": "Diabetes mellitus (disorder)",
  "preferredTerm": "Diabetes mellitus",
  "relationships": [
    {
      "type": "116680003",  // Is a
      "target": "6475002"    // Endocrine disorder
    },
    {
      "type": "363698007", // Finding site
      "target": "113331007" // Endocrine system
    }
  ],
  "mappings": {
    "ICD10": "E11.9",
    "ICD11": "5A11"  
  }
}

Example SNOMED CT Query (using snowstorm):

from typing import Dict, List

import requests

def search_snomed(term: str) -> List[Dict]:
    """Search active SNOMED CT concepts matching a term"""
    url = "https://snowstorm.ihtsdotools.org/snowstorm/snomed-ct/MAIN/concepts"
    params = {
        "term": term,
        "activeFilter": True,
        "offset": 0,
        "limit": 50
    }
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # Fail loudly on HTTP errors
    return response.json().get('items', [])

# Example usage
diabetes_concepts = search_snomed("diabetes mellitus")
for concept in diabetes_concepts:
    # Snowstorm returns the fully specified name as an object with a "term" field
    print(f"{concept['conceptId']}: {concept['fsn']['term']}")

Resources:
- SNOMED International
- SNOMED CT Browser
- SNOMED CT Implementation Guide

Data Analysis in Healthcare

Machine Learning Applications

Common applications of ML in healthcare:

Example: Disease Prediction Model

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load and prepare data
def prepare_diabetes_data(df: pd.DataFrame) -> tuple:
    """Prepare diabetes prediction dataset"""
    features = ['age', 'bmi', 'blood_pressure', 'glucose']
    X = df[features]
    y = df['diabetes']
    return train_test_split(X, y, test_size=0.2)

# Train model
def train_diabetes_model(X_train, y_train) -> RandomForestClassifier:
    """Train diabetes prediction model"""
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)
    return model

# Evaluate model
def evaluate_model(model, X_test, y_test) -> dict:
    """Evaluate model performance"""
    y_pred = model.predict(X_test)
    return classification_report(y_test, y_pred, output_dict=True)

# Example usage
df = pd.read_csv('diabetes_data.csv')
X_train, X_test, y_train, y_test = prepare_diabetes_data(df)
model = train_diabetes_model(X_train, y_train)
metrics = evaluate_model(model, X_test, y_test)
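
Clinical evaluations usually report sensitivity and specificity, which correspond to the recall of the positive and negative class in the classification report above. As a small dependency-free illustration, they can be computed directly from labels and predictions; the helper name `sensitivity_specificity` and the sample labels are made up for this sketch.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels where 1 means disease present."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for truth, pred in pairs if truth == 1 and pred == 1)
    fn = sum(1 for truth, pred in pairs if truth == 1 and pred == 0)
    tn = sum(1 for truth, pred in pairs if truth == 0 and pred == 0)
    fp = sum(1 for truth, pred in pairs if truth == 0 and pred == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
print(sensitivity_specificity(y_true, y_pred))
```

Which of the two matters more depends on the use case: a screening model typically favors sensitivity, while a confirmatory test favors specificity.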

Example: Medical Image Analysis

import tensorflow as tf
from tensorflow.keras import layers, Model

def create_cnn_model(input_shape=(224, 224, 3)):
    """Create CNN for medical image classification"""
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, activation='relu')(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation='relu')(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation='relu')(x)
    outputs = layers.Dense(1, activation='sigmoid')(x)

    return Model(inputs, outputs)

# Example training loop
def train_model(model, train_ds, val_ds, epochs=10):
    """Train medical image classification model"""
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy', tf.keras.metrics.AUC()]
    )

    return model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=epochs,
        callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)]
    )

Resources:
- Nature Medicine ML Review
- Google Health AI
- Fast.ai Medical Imaging

Statistical Concepts in Medical Research

Example: Statistical Analysis in Python

from scipy import stats
import numpy as np

class ClinicalTrial:
    def __init__(self, treatment_group, control_group):
        self.treatment = treatment_group
        self.control = control_group

    def calculate_statistics(self):
        """Calculate key statistical measures"""
        n1, n2 = len(self.treatment), len(self.control)
        mean_diff = np.mean(self.treatment) - np.mean(self.control)

        # T-test for difference between groups (assumes equal variances)
        t_stat, p_value = stats.ttest_ind(self.treatment, self.control)

        # Pooled sample variance (ddof=1), consistent with the t-test above
        pooled_var = ((n1 - 1) * np.var(self.treatment, ddof=1) +
                      (n2 - 1) * np.var(self.control, ddof=1)) / (n1 + n2 - 2)

        # Effect size (Cohen's d)
        effect_size = mean_diff / np.sqrt(pooled_var)

        # 95% confidence interval for the difference in means. The level is
        # passed positionally because its keyword was renamed from `alpha`
        # to `confidence` in newer SciPy releases.
        t_interval = stats.t.interval(
            0.95,
            df=n1 + n2 - 2,
            loc=mean_diff,
            scale=np.sqrt(pooled_var * (1 / n1 + 1 / n2))
        )

        return {
            'p_value': p_value,
            'effect_size': effect_size,
            'confidence_interval': t_interval
        }

# Example usage
trial = ClinicalTrial(
    treatment_group=np.random.normal(loc=10, scale=2, size=100),
    control_group=np.random.normal(loc=8, scale=2, size=100)
)

results = trial.calculate_statistics()
print(f"P-value: {results['p_value']:.4f}")
print(f"Effect size: {results['effect_size']:.2f}")
print(f"95% CI: [{results['confidence_interval'][0]:.2f}, {results['confidence_interval'][1]:.2f}]")

Resources:
- BMJ Statistics Notes
- Nature Methods Statistics Guide
- statsmodels Documentation

MediLang-Specific Features

Runtime Features

MEDI_BACKTRACE: Environment variable enabling detailed call stack traces for runtime errors
Similar to Rust's RUST_BACKTRACE
Helps debug execution flow
Shows function call hierarchy
Includes source locations
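
An environment variable like this is usually set for a single run from a wrapper script. The sketch below shows one way to do that from Python. Both the value `"1"` (borrowed from the `RUST_BACKTRACE=1` convention) and the `medi run analysis.medi` command line are assumptions made for illustration; check the MediLang documentation for the accepted values and the actual CLI.

```python
import os
from typing import Dict, List, Optional


def backtrace_env(base: Optional[Dict[str, str]] = None, value: str = "1") -> Dict[str, str]:
    """Return a copy of the environment with MEDI_BACKTRACE set.

    The default value "1" is an assumption based on RUST_BACKTRACE=1.
    """
    env = dict(os.environ if base is None else base)
    env["MEDI_BACKTRACE"] = value
    return env


# Hypothetical command line; replace it with your actual MediLang invocation.
command: List[str] = ["medi", "run", "analysis.medi"]
env = backtrace_env()
print("Would run:", " ".join(command), "with MEDI_BACKTRACE =", env["MEDI_BACKTRACE"])
# To launch for real: subprocess.run(command, env=env, check=True)
```

Copying the environment instead of modifying `os.environ` keeps the setting scoped to the child process.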

Development Tools

Task management system
Code analysis tools
Testing frameworks
Documentation generators

Resources:
- MediLang Documentation
- GitHub Repository

Note: This glossary is actively maintained. For corrections or additions, please submit a pull request or issue on GitHub.