Contextual Analysis of TF Occupancy (CATO) scores

CATO scores are used to predict which genetic variants affect transcription factor (TF) binding and DNA accessibility.

CATO scores are standardized regression scores from logistic regression models that use the following variables to predict the probability of a variant affecting the binding of a TF:

The cell type-specific activity spectrum
The position of the SNP relative to the TF motif
The score of the match to the TF motif
Read depth
Number of heterozygous samples
TF occupancy measured by DNase I footprinting
Phylogenetic conservation

"This approach resulted in a simple scoring scheme, termed contextual analysis of TF occupancy (CATO), that provides a recalibrated probability of affecting the binding of any TF, as well as a quantitatively ranked list of TF families whose binding might be altered" (information here and below is from Maurano et al., Nature Genetics, 2015).

CATO model: significant ∼ log(Read depth) + Num. hets.^2 + MCV^2 + CpG Island + 3′ UTR + coding + intron + intergenic + Dist. to TSS^2 + DHS strength^2 + Width of DHS + #nearby binding sites^2 + PhastCons + Footprint presence + Footprint occupancy + log(score)^2 + logodds difference + x2 + ... + xn
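
To make the form of the model concrete, the sketch below shows how a logistic regression of this kind turns a handful of predictors into a probability. It is an illustration only, not the published CATO implementation: the function names, the three predictors chosen, and every coefficient are hypothetical placeholders, and the `^2` notation is read here as a literal squared term, which is only one possible reading of the formula.

```python
import math


def logistic(z):
    """Map a linear predictor z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def expand_features(read_depth, num_hets, footprint_present):
    """Build a small subset of model terms from raw annotations.

    Only three of the terms in the model line are included, and the
    transformations are an assumed reading of that line.
    """
    return {
        "log_read_depth": math.log(read_depth),    # log(Read depth)
        "num_hets_sq": num_hets ** 2,              # Num. hets.^2 (assumed literal square)
        "footprint_presence": 1.0 if footprint_present else 0.0,
    }


def predict_probability(features, coefficients, intercept):
    """Weighted sum of the terms, passed through the logistic function."""
    z = intercept + sum(coefficients[name] * value
                        for name, value in features.items())
    return logistic(z)


# Placeholder values for illustration; these are NOT the fitted CATO coefficients.
EXAMPLE_COEFFICIENTS = {
    "log_read_depth": 0.5,
    "num_hets_sq": 0.01,
    "footprint_presence": 1.0,
}
EXAMPLE_INTERCEPT = -3.0

if __name__ == "__main__":
    features = expand_features(read_depth=40, num_hets=5, footprint_present=True)
    p = predict_probability(features, EXAMPLE_COEFFICIENTS, EXAMPLE_INTERCEPT)
    print(f"Illustrative probability that the variant affects binding: {p:.3f}")
```

In a real analysis the coefficients would come from fitting the model to variants labelled as significant or not; the sketch only shows how fitted coefficients would be applied to a new variant.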

For additional information, see the CATO explainer accompanying the paper at: http://www.mauranolab.org/CATO/