Getting Started with tcrdistR • tcrdistR

Introduction

tcrdistR computes pairwise distances between T-cell receptors (TCRs) using the TCRdist metric. Each TCR is represented by its V-gene and CDR3 amino acid sequence for both alpha and beta chains. The distance incorporates V-region similarity (via BLOSUM62-derived substitution matrices) and CDR3 sequence alignment with variable gap positioning matching tcrdist3.

All distance computations are implemented in C++ for high performance.

Installation

tcrdistR uses C++ code (via Rcpp) for fast distance computation, so you need a C++ compiler:

macOS: Install Xcode Command Line Tools by running xcode-select --install in Terminal.
Windows: Install Rtools (match the version to your R version).
Linux: Install r-base-dev (Debian/Ubuntu) or R-devel (Fedora/RHEL).

Then install from GitHub:

# install.packages("devtools")
devtools::install_github("shihanli92/tcrdistR")

Optional dependencies for visualization and UMAP (install any you need):

install.packages(c("ggplot2", "ggseqlogo", "patchwork", "uwot", "igraph"))

The DASH Dataset

tcrdistR ships with the DASH dataset (Dash et al., 2017) — 1924 paired alpha-beta mouse TCRs responding to 7 viral epitopes, collected from 78 subjects. We use it throughout these vignettes.

library(tcrdistR)
data(dash)

dim(dash)
#> [1] 1924   12
colnames(dash)
#>  [1] "subject"      "epitope"      "count"        "va"           "ja"          
#>  [6] "cdr3a"        "cdr3a_nucseq" "vb"           "jb"           "cdr3b"       
#> [11] "cdr3b_nucseq" "clone_id"
table(dash$epitope)
#> 
#>   F2 m139  M38  M45   NP   PA  PB1 
#>  117   87  158  291  305  324  642

The required columns for distance computation are va, cdr3a, vb, and cdr3b. Additional columns like epitope and subject are carried along for downstream analysis.

Loading Your Own Data

If you have a CSV or TSV file from a sequencing core, you can load it directly. tcrdistR needs four columns for paired-chain analysis: va (V-alpha gene), cdr3a (CDR3-alpha amino acid sequence), vb (V-beta gene), and cdr3b (CDR3-beta amino acid sequence). If you have only beta-chain data, you need just vb and cdr3b.

# Step 1: Read your file
my_data <- read.csv("my_tcr_data.csv")

# Step 2: Check your columns
colnames(my_data)
# e.g., [1] "V_alpha" "CDR3_alpha" "V_beta" "CDR3_beta" "patient"

# Step 3: Map your column names to tcrdistR names
tcrs <- as_tcr_df(my_data, col_map = c(
  va   = "V_alpha",
  cdr3a = "CDR3_alpha",
  vb   = "V_beta",
  cdr3b = "CDR3_beta"
))

# Step 4: Check the result
head(tcrs[, c("va", "cdr3a", "vb", "cdr3b")])

V-gene names should look like TRAV1-1*01 or TRBV5-1*01. If your genes lack allele suffixes (e.g., TRAV1-1 instead of TRAV1-1*01), set normalize_genes = TRUE in as_tcr_df() to append *01 automatically.

Beta-only data (common from bulk sequencing): just provide vb and cdr3b columns. All distance functions automatically detect single-chain input:

beta_tcrs <- as_tcr_df(my_data, col_map = c(
  vb   = "V_beta",
  cdr3b = "CDR3_beta"
))
d <- tcrdist_matrix(beta_tcrs, "human")  # works with beta-only data

Reading from Common Formats

tcrdistR also has dedicated readers for common pipeline outputs. All produce a data.frame with standardized column names.

# Auto-detect format from column naming conventions
tcrs <- read_tcr_table("my_data.tsv")

# 10X Genomics (produces paired alpha-beta data; cells with only one chain
# are dropped with a message showing how many were excluded)
tcrs <- read_10x("filtered_contig_annotations.csv")

# AIRR Community Standard format (paired if both chains present)
tcrs <- read_airr("airr_rearrangements.tsv")

# Adaptive Biotechnologies (beta-chain only)
tcrs <- read_adaptive("immunoseq_export.tsv")

Computing the Distance Matrix

What is TCRdist? TCRdist quantifies how similar two TCR sequences are. It aligns the CDR3 regions of both TCRs, scores each amino acid match/mismatch using a biochemical similarity matrix (BSD4, derived from BLOSUM62), adds a penalty for insertions/deletions (gaps), then adds the distance between the germline-encoded CDR1, CDR2, and CDR2.5 loops (looked up from the V-gene). Lower distance means more similar TCRs. TCRs with small distances (roughly < 50) often recognize the same antigen.

The core function tcrdist_matrix() computes a dense N x N pairwise distance matrix. We use a subset of 50 PA-specific TCRs for speed:

pa <- dash[dash$epitope == "PA", ]
pa_sub <- pa[1:50, ]

dist_mat <- tcrdist_matrix(pa_sub, organism = "mouse")
dim(dist_mat)
#> [1] 50 50
dist_mat[1:5, 1:5]
#>     1   2   3   4   5
#> 1   0 317 341 326 281
#> 2 317   0 177 243 204
#> 3 341 177   0 288 216
#> 4 326 243 288   0 138
#> 5 281 204 216 138   0

Each entry is an integer TCRdist value. Identical TCRs have distance 0; typical distances range from 0 to ~400. As a rough guide:

0–25: Very similar; likely recognize the same epitope with similar binding mode.
25–75: Moderately similar; may share specificity.
75–200: Distantly related; different CDR3 or V-gene usage.
> 200: Unrelated.

Sparse Distances

For large datasets, the full N x N matrix becomes impractical — a 10,000 TCR dataset needs ~800 MB, and 50,000 TCRs would need ~20 GB. Use tcrdist_sparse() to only store distances below a threshold:

sparse_mat <- tcrdist_sparse(pa_sub, organism = "mouse", threshold = 100)
class(sparse_mat)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"

The result is a Matrix::dgCMatrix (a compressed sparse column matrix). Non-stored entries represent distances above the threshold. Since most TCR pairs are unrelated, the sparse matrix is dramatically smaller than the dense version.

Rectangular Distances

Compute distances between two different sets of TCRs (query vs reference):

query <- pa_sub[1:5, ]
reference <- pa_sub[6:50, ]
rect_mat <- tcrdist_rect(query, reference, organism = "mouse")
dim(rect_mat)
#> [1]  5 45

K-Nearest Neighbors

Find the K closest TCRs for each input. This computes distances and extracts neighbors in a single pass without materializing the full N x N matrix:

knn <- tcrdist_knn(pa_sub, organism = "mouse", K = 5L)
dim(knn$knn_indices)    # N x K
#> [1] 50  5
dim(knn$knn_distances)  # N x K
#> [1] 50  5

# Nearest neighbor of TCR 1
knn$knn_indices[1, ]
#> [1] 49 43  9 14 15
knn$knn_distances[1, ]
#> [1] 211 227 235 245 248

Radius-Based Neighbors

Find all neighbors within a distance threshold:

neighbors <- tcrdist_radius_neighbors(pa_sub, organism = "mouse", radius = 50)
length(neighbors)  # one entry per TCR
#> [1] 50

# First TCR's neighbors
neighbors[[1]]$indices
#> integer(0)
neighbors[[1]]$distances
#> numeric(0)

Per-Component Distances

All distance functions accept a components parameter to compute distances from specific TCR regions. This lets you ask focused biological questions:

CDR3 only ("cdr3"): captures the unique rearrangement-derived binding interface — the most variable part of the TCR.
V-region only ("v_region"): captures germline-encoded contacts (CDR1, CDR2, CDR2.5) that interact with the MHC molecule.
Single chain ("alpha" or "beta"): useful when you suspect one chain dominates antigen recognition.

# CDR3 only (both chains, no V-region)
d_cdr3 <- tcrdist_matrix(pa_sub, "mouse", components = "cdr3")

# V-region only (CDR1 + CDR2 + CDR2.5)
d_v <- tcrdist_matrix(pa_sub, "mouse", components = "v_region")

# Components are additive: v_region + cdr3 = all
d_all <- tcrdist_matrix(pa_sub, "mouse")
stopifnot(all(abs((d_v + d_cdr3) - d_all) < 1e-10))

# Single chain
d_alpha <- tcrdist_matrix(pa_sub, "mouse", components = "alpha")
d_beta  <- tcrdist_matrix(pa_sub, "mouse", components = "beta")
stopifnot(all(abs((d_alpha + d_beta) - d_all) < 1e-10))

# Custom combination
d_custom <- tcrdist_matrix(pa_sub, "mouse", components = c("cdr3a", "vb"))

Available presets: "all" (default), "cdr3", "v_region", "alpha", "beta". Individual terms: "va", "cdr3a", "vb", "cdr3b".

Single-Chain Mode

Many datasets contain only beta-chain data (e.g., from bulk sequencing). All distance functions automatically detect single-chain input and compute distances for the available chain only:

# Beta-only data (no alpha columns)
beta_only <- pa_sub[, c("vb", "cdr3b")]
d_beta_only <- tcrdist_matrix(beta_only, "mouse")
dim(d_beta_only)
#> [1] 50 50

# Result matches components="beta" on paired data
d_beta_paired <- tcrdist_matrix(pa_sub, "mouse", components = "beta")
stopifnot(all(abs(d_beta_only - d_beta_paired) < 1e-10))

# Works with all wrappers: sparse, rect, knn, radius_neighbors
knn_beta <- tcrdist_knn(beta_only, "mouse", K = 3L)

Standardizing TCR Data

Use as_tcr_df() to convert data.frames from common tools (scRepertoire, scirpy, dandelion, tcrdist3) to tcrdistR’s canonical column names:

# Auto-detect format
tcrs <- as_tcr_df(my_data)

# Explicit format
tcrs <- as_tcr_df(my_data, format = "screpertoire")

# Custom column mapping
tcrs <- as_tcr_df(my_data, col_map = c(va = "V_alpha", cdr3a = "CDR3_alpha",
                                        vb = "V_beta",  cdr3b = "CDR3_beta"))

Single-Pair Distances

For comparing individual CDR3 sequences:

# Weighted CDR3 distance (uses BLOSUM62-derived BSD4 matrix)
weighted_cdr3_distance("CAVSLDSNYQLIW", "CALGDRATGGNNKLTF")
#> [1] 102

# Simple Hamming distance (counts mismatches at each position)
hamming_distance("CASSI", "CASSK")
#> [1] 1

Diversity Metrics

Diversity metrics summarize how “spread out” a repertoire is. A repertoire dominated by one or two clonotypes (high clonality) suggests a strong antigen-driven expansion, while a highly diverse repertoire suggests a broad, unstimulated sample.

# Clone counts for PA-specific TCRs
pa_counts <- pa$count

# Generalized diversity (order 2 = inverse Simpson)
# entropy: 0 = dominated by one clone, 1 = perfectly even
div <- tcr_diversity(pa_counts, order = 2)
div$entropy
#> [1] 0.9958049
div$effective_number  # "how many equally-abundant clones would give this diversity"
#> [1] 238.3727

# Species richness: simply the number of distinct clonotypes
tcr_richness(pa_counts)
#> [1] 324

# Clonality: 0 = maximally diverse, 1 = single dominant clone
tcr_clonality(pa_counts)
#> [1] 0.05175504

The effective_number is particularly intuitive: if your repertoire has an effective number of 50, it behaves as if it had 50 equally abundant clonotypes, even though the actual distribution may be uneven.

The TCRrep Object

For more structured workflows, wrap your data in a TCRrep S4 object. By default, TCRrep() deduplicates identical clones (matching tcrdist3 behavior), grouping by chain columns plus subject if present, and summing clone counts:

rep <- TCRrep(pa_sub, organism = "mouse")
#> deduplicate: 50 -> 49 clones
nrow(pa_sub)        # before dedup
#> [1] 50
nrow(rep@clone_df)  # after dedup (within-subject duplicates merged)
#> [1] 49
rep
#> TCRrep object: 49 clonotypes
#>   organism: mouse
#>   chains: AB
#>   metric: tcrdist
#>   distances: not computed

You can control deduplication with the deduplicate argument:

# Chain columns only (collapse across subjects)
rep_strict <- TCRrep(pa_sub, organism = "mouse", deduplicate = character(0))
#> deduplicate: 50 -> 47 clones
nrow(rep_strict@clone_df)
#> [1] 47

# No deduplication
rep_raw <- TCRrep(pa_sub, organism = "mouse", deduplicate = FALSE)
nrow(rep_raw@clone_df)
#> [1] 50

Choosing Parameters

Several functions take a radius or threshold parameter. Here are practical starting points:

Parameter	Recommended range	Notes
`tcrdist_sparse(threshold=)`	50–200	Use 50 for tight clusters, 200 for broad search
`tcrdist_radius_neighbors(radius=)`	24–96	48 is a common default for paired chains
`neighborhood_test(radius=)`	48–96	Smaller radius = more specific neighborhoods
`find_meta_clonotypes(radius=)`	36–72	Balance sensitivity vs. specificity
`find_clumping(radii=)`	c(24, 48, 72)	Test multiple radii
`cluster_tcrs(k=)`	5–20	Start with expected number of groups

These values assume paired alpha-beta TCRs with default weights. For single-chain analysis, distances are roughly halved, so use proportionally smaller thresholds.

Performance and Scalability

All distance computations are implemented in C++ for performance. Approximate guidance for dataset sizing:

N (TCRs)	`tcrdist_matrix`	`tcrdist_sparse`	`tcrdist_knn`	Memory (dense)
1,000	< 1 sec	< 1 sec	< 1 sec	~8 MB
5,000	~5 sec	~5 sec	~3 sec	~200 MB
10,000	~20 sec	~15 sec	~10 sec	~800 MB
50,000	minutes	~minutes	~1 min	~20 GB

When to use each function:

< 5,000 TCRs: tcrdist_matrix() is fast and convenient.
5,000–20,000 TCRs: tcrdist_sparse() saves memory; use for downstream analyses that only need nearby neighbors.
> 20,000 TCRs: Use tcrdist_knn() which avoids materializing the full matrix. For dimensionality reduction, use the UMAP KNN path rather than kernel PCA (which needs the full matrix).

tcrdist3 Feature Map

For users migrating from Python tcrdist3:

tcrdist3 concept	tcrdistR equivalent	Notes
`TCRrep`	`TCRrep()` S4 class	Similar dedup behavior
`compute_distances()`	`tcrdist_matrix()`	Per-component via `components=`
`compute_sparse_rect_distances()`	`tcrdist_sparse()`, `tcrdist_rect()`
`nn_approach()`	`tcrdist_knn()`
`TCRpublic` / `find_metaclonotypes()`	`find_meta_clonotypes()`
`TCRsampler`-based clumping	`find_clumping()`	Chain-resampling background
Kernel PCA	`compute_tcrdist_kernel_pca()`	Linear + Gaussian kernels
Diversity metrics	`tcr_diversity()`, `tcr_clonality()`, etc.
CDR3 motif logos	`plot_cdr3_logo()`, `plot_tcr_logo_panel()`	ggseqlogo-based
DB search	`match_tcrs_to_db()`	Human only
CD8 scoring	`make_cd8_score_table_column()`	Human only

Next Steps

vignette("tcrdistR-tcrrep-workflow") — End-to-end analysis with the TCRrep S4 object
vignette("tcrdistR-advanced") — Clustering, neighborhood tests, meta-clonotypes, database matching, kernel PCA, fuzzy diversity
vignette("tcrdistR-visualization") — Heatmaps, dendrograms, CDR3 logos, scatter plots
vignette("tcrdistR-comparing-repertoires") — Comparing repertoires between conditions
vignette("tcrdistR-glossary") — Glossary of technical terms