Getting Started with tcrdistR
Source:vignettes/tcrdistR-getting-started.Rmd
tcrdistR-getting-started.RmdIntroduction
tcrdistR computes pairwise distances between T-cell receptors (TCRs) using the TCRdist metric. Each TCR is represented by its V-gene and CDR3 amino acid sequence for both alpha and beta chains. The distance incorporates V-region similarity (via BLOSUM62-derived substitution matrices) and CDR3 sequence alignment with variable gap positioning matching tcrdist3.
All distance computations are implemented in C++ for high performance.
Installation
tcrdistR uses C++ code (via Rcpp) for fast distance computation, so you need a C++ compiler:
-
macOS: Install Xcode Command Line Tools by running
xcode-select --installin Terminal. - Windows: Install Rtools (match the version to your R version).
-
Linux: Install
r-base-dev(Debian/Ubuntu) orR-devel(Fedora/RHEL).
Then install from GitHub:
# install.packages("devtools")
devtools::install_github("shihanli92/tcrdistR")Optional dependencies for visualization and UMAP (install any you need):
install.packages(c("ggplot2", "ggseqlogo", "patchwork", "uwot", "igraph"))The DASH Dataset
tcrdistR ships with the DASH dataset (Dash et al., 2017) — 1924 paired alpha-beta mouse TCRs responding to 7 viral epitopes, collected from 78 subjects. We use it throughout these vignettes.
library(tcrdistR)
data(dash)
dim(dash)
#> [1] 1924 12
colnames(dash)
#> [1] "subject" "epitope" "count" "va" "ja"
#> [6] "cdr3a" "cdr3a_nucseq" "vb" "jb" "cdr3b"
#> [11] "cdr3b_nucseq" "clone_id"
table(dash$epitope)
#>
#> F2 m139 M38 M45 NP PA PB1
#> 117 87 158 291 305 324 642The required columns for distance computation are va,
cdr3a, vb, and cdr3b. Additional
columns like epitope and subject are carried
along for downstream analysis.
Loading Your Own Data
If you have a CSV or TSV file from a sequencing core, you can load it
directly. tcrdistR needs four columns for paired-chain analysis:
va (V-alpha gene), cdr3a (CDR3-alpha amino
acid sequence), vb (V-beta gene), and cdr3b
(CDR3-beta amino acid sequence). If you have only beta-chain data, you
need just vb and cdr3b.
# Step 1: Read your file
my_data <- read.csv("my_tcr_data.csv")
# Step 2: Check your columns
colnames(my_data)
# e.g., [1] "V_alpha" "CDR3_alpha" "V_beta" "CDR3_beta" "patient"
# Step 3: Map your column names to tcrdistR names
tcrs <- as_tcr_df(my_data, col_map = c(
va = "V_alpha",
cdr3a = "CDR3_alpha",
vb = "V_beta",
cdr3b = "CDR3_beta"
))
# Step 4: Check the result
head(tcrs[, c("va", "cdr3a", "vb", "cdr3b")])V-gene names should look like TRAV1-1*01 or
TRBV5-1*01. If your genes lack allele suffixes (e.g.,
TRAV1-1 instead of TRAV1-1*01), set
normalize_genes = TRUE in as_tcr_df() to
append *01 automatically.
Beta-only data (common from bulk sequencing): just
provide vb and cdr3b columns. All distance
functions automatically detect single-chain input:
beta_tcrs <- as_tcr_df(my_data, col_map = c(
vb = "V_beta",
cdr3b = "CDR3_beta"
))
d <- tcrdist_matrix(beta_tcrs, "human") # works with beta-only dataReading from Common Formats
tcrdistR also has dedicated readers for common pipeline outputs. All produce a data.frame with standardized column names.
# Auto-detect format from column naming conventions
tcrs <- read_tcr_table("my_data.tsv")
# 10X Genomics (produces paired alpha-beta data; cells with only one chain
# are dropped with a message showing how many were excluded)
tcrs <- read_10x("filtered_contig_annotations.csv")
# AIRR Community Standard format (paired if both chains present)
tcrs <- read_airr("airr_rearrangements.tsv")
# Adaptive Biotechnologies (beta-chain only)
tcrs <- read_adaptive("immunoseq_export.tsv")Computing the Distance Matrix
What is TCRdist? TCRdist quantifies how similar two TCR sequences are. It aligns the CDR3 regions of both TCRs, scores each amino acid match/mismatch using a biochemical similarity matrix (BSD4, derived from BLOSUM62), adds a penalty for insertions/deletions (gaps), then adds the distance between the germline-encoded CDR1, CDR2, and CDR2.5 loops (looked up from the V-gene). Lower distance means more similar TCRs. TCRs with small distances (roughly < 50) often recognize the same antigen.
The core function tcrdist_matrix() computes a dense N x
N pairwise distance matrix. We use a subset of 50 PA-specific TCRs for
speed:
pa <- dash[dash$epitope == "PA", ]
pa_sub <- pa[1:50, ]
dist_mat <- tcrdist_matrix(pa_sub, organism = "mouse")
dim(dist_mat)
#> [1] 50 50
dist_mat[1:5, 1:5]
#> 1 2 3 4 5
#> 1 0 317 341 326 281
#> 2 317 0 177 243 204
#> 3 341 177 0 288 216
#> 4 326 243 288 0 138
#> 5 281 204 216 138 0Each entry is an integer TCRdist value. Identical TCRs have distance 0; typical distances range from 0 to ~400. As a rough guide:
- 0–25: Very similar; likely recognize the same epitope with similar binding mode.
- 25–75: Moderately similar; may share specificity.
- 75–200: Distantly related; different CDR3 or V-gene usage.
- > 200: Unrelated.
Sparse Distances
For large datasets, the full N x N matrix becomes impractical — a
10,000 TCR dataset needs ~800 MB, and 50,000 TCRs would need ~20 GB. Use
tcrdist_sparse() to only store distances below a
threshold:
sparse_mat <- tcrdist_sparse(pa_sub, organism = "mouse", threshold = 100)
class(sparse_mat)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"The result is a Matrix::dgCMatrix (a compressed sparse
column matrix). Non-stored entries represent distances above the
threshold. Since most TCR pairs are unrelated, the sparse matrix is
dramatically smaller than the dense version.
Rectangular Distances
Compute distances between two different sets of TCRs (query vs reference):
query <- pa_sub[1:5, ]
reference <- pa_sub[6:50, ]
rect_mat <- tcrdist_rect(query, reference, organism = "mouse")
dim(rect_mat)
#> [1] 5 45K-Nearest Neighbors
Find the K closest TCRs for each input. This computes distances and extracts neighbors in a single pass without materializing the full N x N matrix:
knn <- tcrdist_knn(pa_sub, organism = "mouse", K = 5L)
dim(knn$knn_indices) # N x K
#> [1] 50 5
dim(knn$knn_distances) # N x K
#> [1] 50 5
# Nearest neighbor of TCR 1
knn$knn_indices[1, ]
#> [1] 49 43 9 14 15
knn$knn_distances[1, ]
#> [1] 211 227 235 245 248Radius-Based Neighbors
Find all neighbors within a distance threshold:
neighbors <- tcrdist_radius_neighbors(pa_sub, organism = "mouse", radius = 50)
length(neighbors) # one entry per TCR
#> [1] 50
# First TCR's neighbors
neighbors[[1]]$indices
#> integer(0)
neighbors[[1]]$distances
#> numeric(0)Per-Component Distances
All distance functions accept a components parameter to
compute distances from specific TCR regions. This lets you ask focused
biological questions:
-
CDR3 only (
"cdr3"): captures the unique rearrangement-derived binding interface — the most variable part of the TCR. -
V-region only (
"v_region"): captures germline-encoded contacts (CDR1, CDR2, CDR2.5) that interact with the MHC molecule. -
Single chain (
"alpha"or"beta"): useful when you suspect one chain dominates antigen recognition.
# CDR3 only (both chains, no V-region)
d_cdr3 <- tcrdist_matrix(pa_sub, "mouse", components = "cdr3")
# V-region only (CDR1 + CDR2 + CDR2.5)
d_v <- tcrdist_matrix(pa_sub, "mouse", components = "v_region")
# Components are additive: v_region + cdr3 = all
d_all <- tcrdist_matrix(pa_sub, "mouse")
stopifnot(all(abs((d_v + d_cdr3) - d_all) < 1e-10))
# Single chain
d_alpha <- tcrdist_matrix(pa_sub, "mouse", components = "alpha")
d_beta <- tcrdist_matrix(pa_sub, "mouse", components = "beta")
stopifnot(all(abs((d_alpha + d_beta) - d_all) < 1e-10))
# Custom combination
d_custom <- tcrdist_matrix(pa_sub, "mouse", components = c("cdr3a", "vb"))Available presets: "all" (default), "cdr3",
"v_region", "alpha", "beta".
Individual terms: "va", "cdr3a",
"vb", "cdr3b".
Single-Chain Mode
Many datasets contain only beta-chain data (e.g., from bulk sequencing). All distance functions automatically detect single-chain input and compute distances for the available chain only:
# Beta-only data (no alpha columns)
beta_only <- pa_sub[, c("vb", "cdr3b")]
d_beta_only <- tcrdist_matrix(beta_only, "mouse")
dim(d_beta_only)
#> [1] 50 50
# Result matches components="beta" on paired data
d_beta_paired <- tcrdist_matrix(pa_sub, "mouse", components = "beta")
stopifnot(all(abs(d_beta_only - d_beta_paired) < 1e-10))
# Works with all wrappers: sparse, rect, knn, radius_neighbors
knn_beta <- tcrdist_knn(beta_only, "mouse", K = 3L)Standardizing TCR Data
Use as_tcr_df() to convert data.frames from common tools
(scRepertoire, scirpy, dandelion, tcrdist3) to tcrdistR’s canonical
column names:
Single-Pair Distances
For comparing individual CDR3 sequences:
# Weighted CDR3 distance (uses BLOSUM62-derived BSD4 matrix)
weighted_cdr3_distance("CAVSLDSNYQLIW", "CALGDRATGGNNKLTF")
#> [1] 102
# Simple Hamming distance (counts mismatches at each position)
hamming_distance("CASSI", "CASSK")
#> [1] 1Diversity Metrics
Diversity metrics summarize how “spread out” a repertoire is. A repertoire dominated by one or two clonotypes (high clonality) suggests a strong antigen-driven expansion, while a highly diverse repertoire suggests a broad, unstimulated sample.
# Clone counts for PA-specific TCRs
pa_counts <- pa$count
# Generalized diversity (order 2 = inverse Simpson)
# entropy: 0 = dominated by one clone, 1 = perfectly even
div <- tcr_diversity(pa_counts, order = 2)
div$entropy
#> [1] 0.9958049
div$effective_number # "how many equally-abundant clones would give this diversity"
#> [1] 238.3727
# Species richness: simply the number of distinct clonotypes
tcr_richness(pa_counts)
#> [1] 324
# Clonality: 0 = maximally diverse, 1 = single dominant clone
tcr_clonality(pa_counts)
#> [1] 0.05175504The effective_number is particularly intuitive: if your
repertoire has an effective number of 50, it behaves as if it
had 50 equally abundant clonotypes, even though the actual distribution
may be uneven.
The TCRrep Object
For more structured workflows, wrap your data in a
TCRrep S4 object. By default, TCRrep()
deduplicates identical clones (matching tcrdist3 behavior), grouping by
chain columns plus subject if present, and summing clone
counts:
rep <- TCRrep(pa_sub, organism = "mouse")
#> deduplicate: 50 -> 49 clones
nrow(pa_sub) # before dedup
#> [1] 50
nrow(rep@clone_df) # after dedup (within-subject duplicates merged)
#> [1] 49
rep
#> TCRrep object: 49 clonotypes
#> organism: mouse
#> chains: AB
#> metric: tcrdist
#> distances: not computedYou can control deduplication with the deduplicate
argument:
# Chain columns only (collapse across subjects)
rep_strict <- TCRrep(pa_sub, organism = "mouse", deduplicate = character(0))
#> deduplicate: 50 -> 47 clones
nrow(rep_strict@clone_df)
#> [1] 47
# No deduplication
rep_raw <- TCRrep(pa_sub, organism = "mouse", deduplicate = FALSE)
nrow(rep_raw@clone_df)
#> [1] 50Choosing Parameters
Several functions take a radius or threshold parameter. Here are practical starting points:
| Parameter | Recommended range | Notes |
|---|---|---|
tcrdist_sparse(threshold=) |
50–200 | Use 50 for tight clusters, 200 for broad search |
tcrdist_radius_neighbors(radius=) |
24–96 | 48 is a common default for paired chains |
neighborhood_test(radius=) |
48–96 | Smaller radius = more specific neighborhoods |
find_meta_clonotypes(radius=) |
36–72 | Balance sensitivity vs. specificity |
find_clumping(radii=) |
c(24, 48, 72) | Test multiple radii |
cluster_tcrs(k=) |
5–20 | Start with expected number of groups |
These values assume paired alpha-beta TCRs with default weights. For single-chain analysis, distances are roughly halved, so use proportionally smaller thresholds.
Performance and Scalability
All distance computations are implemented in C++ for performance. Approximate guidance for dataset sizing:
| N (TCRs) | tcrdist_matrix |
tcrdist_sparse |
tcrdist_knn |
Memory (dense) |
|---|---|---|---|---|
| 1,000 | < 1 sec | < 1 sec | < 1 sec | ~8 MB |
| 5,000 | ~5 sec | ~5 sec | ~3 sec | ~200 MB |
| 10,000 | ~20 sec | ~15 sec | ~10 sec | ~800 MB |
| 50,000 | minutes | ~minutes | ~1 min | ~20 GB |
When to use each function:
-
< 5,000 TCRs:
tcrdist_matrix()is fast and convenient. -
5,000–20,000 TCRs:
tcrdist_sparse()saves memory; use for downstream analyses that only need nearby neighbors. -
> 20,000 TCRs: Use
tcrdist_knn()which avoids materializing the full matrix. For dimensionality reduction, use the UMAP KNN path rather than kernel PCA (which needs the full matrix).
tcrdist3 Feature Map
For users migrating from Python tcrdist3:
| tcrdist3 concept | tcrdistR equivalent | Notes |
|---|---|---|
TCRrep |
TCRrep() S4 class |
Similar dedup behavior |
compute_distances() |
tcrdist_matrix() |
Per-component via components=
|
compute_sparse_rect_distances() |
tcrdist_sparse(), tcrdist_rect()
|
|
nn_approach() |
tcrdist_knn() |
|
TCRpublic / find_metaclonotypes()
|
find_meta_clonotypes() |
|
TCRsampler-based clumping |
find_clumping() |
Chain-resampling background |
| Kernel PCA | compute_tcrdist_kernel_pca() |
Linear + Gaussian kernels |
| Diversity metrics |
tcr_diversity(), tcr_clonality(),
etc. |
|
| CDR3 motif logos |
plot_cdr3_logo(),
plot_tcr_logo_panel()
|
ggseqlogo-based |
| DB search | match_tcrs_to_db() |
Human only |
| CD8 scoring | make_cd8_score_table_column() |
Human only |
Next Steps
-
vignette("tcrdistR-tcrrep-workflow")— End-to-end analysis with the TCRrep S4 object -
vignette("tcrdistR-advanced")— Clustering, neighborhood tests, meta-clonotypes, database matching, kernel PCA, fuzzy diversity -
vignette("tcrdistR-visualization")— Heatmaps, dendrograms, CDR3 logos, scatter plots -
vignette("tcrdistR-comparing-repertoires")— Comparing repertoires between conditions -
vignette("tcrdistR-glossary")— Glossary of technical terms