Virtual ChIP-seq: Predicting transcription factor binding by learning from the transcriptome

Karimzadeh M. and Hoffman MM. Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome. Genome Biology 23, 126 (2022). doi: https://doi.org/10.1186/s13059-022-02690-2. (BibTeX)

Virtual ChIP-seq predicts transcription factor binding in any cell type from RNA-seq and ATAC-seq (or DNase-seq).

The Virtual ChIP-seq track hub contains genome-wide predictions for binding of 36 TFs in 33 different human tissues.

Predicting transcription factor binding

Virtual ChIP-seq uses a multi-layer perceptron (MLP) to predict binding of individual transcription factors (TFs). It uses data on chromatin accessibility, genomic conservation, and the binding characteristics of TFs from previous experiments in other cell types. It also learns from the association of gene expression and TF binding at different genomic regions. Because it incorporates existing ChIP-seq data, it no longer needs to represent TF sequence preferences in the form of position weight matrices. For a new cell type with data on chromatin accessibility and gene expression, Virtual ChIP-seq predicts indirect TF binding, as well as binding of TFs without known sequence preference.
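As a rough illustration of how a multi-layer perceptron turns per-bin features into a binding probability, the sketch below runs a forward pass through a single hidden layer. The feature names, layer sizes, and weights are all invented for illustration. The actual Virtual ChIP-seq model learns its parameters from the training cell types, and its architecture and features may differ from this toy version.

```python
import math


def relu(x):
    """Rectified linear unit, used here as the hidden-layer activation."""
    return max(0.0, x)


def sigmoid(x):
    """Squash a real-valued score into the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-x))


def mlp_forward(features, hidden_weights, hidden_biases,
                output_weights, output_bias):
    """Forward pass of a one-hidden-layer perceptron for one genomic bin."""
    hidden = [
        relu(sum(w * x for w, x in zip(row, features)) + bias)
        for row, bias in zip(hidden_weights, hidden_biases)
    ]
    logit = sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    return sigmoid(logit)


# Hypothetical feature order for one bin (an assumption, not the real input):
# [chromatin accessibility, conservation,
#  binding frequency in other cell types, expression association score]
HIDDEN_WEIGHTS = [[1.0, 0.5, 2.0, 1.0],
                  [0.5, 0.2, 1.0, 0.5]]  # made-up values
HIDDEN_BIASES = [-1.0, -0.5]             # made-up values
OUTPUT_WEIGHTS = [1.5, 1.0]              # made-up values
OUTPUT_BIAS = -2.0                       # made-up value

accessible_bin = [1.0, 0.8, 0.9, 0.7]
closed_bin = [0.0, 0.1, 0.0, 0.0]

p_accessible = mlp_forward(accessible_bin, HIDDEN_WEIGHTS, HIDDEN_BIASES,
                           OUTPUT_WEIGHTS, OUTPUT_BIAS)
p_closed = mlp_forward(closed_bin, HIDDEN_WEIGHTS, HIDDEN_BIASES,
                       OUTPUT_WEIGHTS, OUTPUT_BIAS)
print(f"accessible bin: {p_accessible:.3f}, closed bin: {p_closed:.3f}")
```

With these invented weights, the accessible bin with prior binding evidence scores higher than the closed bin, which is the qualitative behavior a trained model would be expected to show.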

Accuracy of predictions

To build a generalizable classifier that performs well on new cell types with only transcriptome and chromatin accessibility data, we train the multi-layer perceptron on training cell types (A549, GM12878, HepG2, HeLa-S3, HCT-116, BJ, Jurkat, NHEK, Raji, Ishikawa, LNCaP, and T47D). We assess the performance of the model in validation cell types (K562, PANC-1, IMR-90, MCF-7, H1-hESC, and liver). For each TF, we use the posterior probability cutoff that maximizes the Matthews correlation coefficient (MCC) in H1-hESC. If we don't have ChIP-seq data for the TF in H1-hESC, we use the mode of the optimal cutoffs across other TFs (0.4). Below, we report the median and standard deviation of each performance measure among validation cell types. Column N gives the number of validation cell types for each TF.

Datasets and software on Zenodo

TF F1 Accuracy MCC auROC auPR N
ATF2 0.270±0.002 0.990±0.001 0.314±0.008 0.917±0.026 0.443±0.022 1
BHLHE40 0.334±0.021 0.997±0.000 0.356±0.010 0.974±0.002 0.382±0.010 1
CEBPB 0.510±0.091 0.992±0.002 0.515±0.072 0.964±0.017 0.534±0.073 3
CHD2 0.270±0.051 0.996±0.000 0.332±0.040 0.950±0.012 0.386±0.046 1
CREB1 0.362±0.131 0.997±0.002 0.371±0.121 0.868±0.135 0.335±0.174 2
CTCF 0.667±0.143 0.995±0.004 0.686±0.107 0.988±0.055 0.849±0.121 4
E2F1 0.256±0.097 0.998±0.002 0.314±0.078 0.978±0.019 0.291±0.105 2
ELF1 0.431±0.047 0.997±0.001 0.456±0.038 0.949±0.042 0.493±0.066 2
ELK1 0.430±0.069 1.000±0.000 0.465±0.054 0.991±0.009 0.420±0.054 2
ESR1 0.270±0.024 0.988±0.003 0.380±0.018 0.846±0.012 0.476±0.010 1
FOS 0.333±0.027 0.997±0.001 0.393±0.020 0.861±0.004 0.394±0.008 1
FOSL1 0.319±0.006 0.994±0.001 0.316±0.006 0.929±0.006 0.272±0.012 1
FOXA1 0.407±0.045 0.994±0.005 0.444±0.061 0.961±0.022 0.467±0.131 2
GABPA 0.298±0.049 0.994±0.002 0.393±0.036 0.986±0.012 0.496±0.036 3
GTF2F1 0.235±0.120 0.996±0.001 0.312±0.070 0.985±0.015 0.191±0.081 2
HCFC1 0.459±0.021 0.999±0.000 0.487±0.024 0.990±0.005 0.515±0.044 2
HDAC2 0.303±0.033 0.986±0.005 0.370±0.018 0.948±0.051 0.281±0.040 2
HSF1 0.350±0.149 1.000±0.000 0.378±0.145 0.999±0.012 0.309±0.240 1
JUN 0.218±0.127 0.998±0.001 0.311±0.153 0.983±0.009 0.456±0.257 2
JUND 0.363±0.080 0.994±0.002 0.399±0.053 0.971±0.020 0.370±0.078 3
MAFK 0.354±0.041 0.997±0.001 0.423±0.028 0.989±0.005 0.513±0.103 3
MAX 0.400±0.045 0.996±0.002 0.444±0.059 0.961±0.012 0.491±0.111 3
MAZ 0.370±0.025 0.997±0.001 0.422±0.019 0.987±0.005 0.493±0.070 2
MXI1 0.394±0.018 0.999±0.000 0.402±0.017 0.993±0.004 0.381±0.025 1
NRF1 0.668±0.051 1.000±0.000 0.680±0.046 0.996±0.018 0.725±0.062 2
RAD21 0.593±0.062 0.996±0.002 0.626±0.056 0.983±0.033 0.740±0.095 3
REST 0.482±0.120 0.999±0.001 0.493±0.091 0.985±0.008 0.567±0.095 3
SIN3A 0.389±0.048 0.998±0.002 0.394±0.029 0.966±0.004 0.411±0.037 3
SMC3 0.733±0.016 0.999±0.000 0.734±0.016 0.998±0.001 0.792±0.018 1
SRF 0.353±0.060 0.998±0.001 0.364±0.070 0.982±0.008 0.365±0.115 2
TAF1 0.378±0.073 0.999±0.001 0.437±0.097 0.987±0.009 0.490±0.168 3
TEAD4 0.344±0.061 0.990±0.002 0.385±0.020 0.967±0.023 0.343±0.019 2
TP53 0.275±0.103 1.000±0.000 0.382±0.086 1.000±0.008 0.660±0.222 1
USF1 0.353±0.047 0.993±0.001 0.382±0.040 0.891±0.012 0.372±0.046 1
USF2 0.410±0.040 0.999±0.000 0.427±0.028 0.982±0.007 0.437±0.032 1
YY1 0.397±0.049 0.996±0.001 0.408±0.058 0.945±0.043 0.417±0.104 2
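The cutoff selection described before the table can be sketched as follows: scan candidate posterior probability cutoffs, compute the MCC at each one, and keep the cutoff with the highest MCC. When a TF has no reference data, fall back to the mode of the cutoffs chosen for other TFs. The function names, the candidate grid, and the toy labels and probabilities are assumptions made for illustration, not part of the Virtual ChIP-seq software.

```python
import math
import statistics


def confusion_counts(labels, probabilities, cutoff):
    """Count TP, FP, TN, FN when bins with probability >= cutoff are called bound."""
    tp = fp = tn = fn = 0
    for label, prob in zip(labels, probabilities):
        predicted = prob >= cutoff
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


def best_mcc_cutoff(labels, probabilities, cutoffs):
    """Return the candidate cutoff with the highest MCC."""
    return max(cutoffs,
               key=lambda c: mcc(*confusion_counts(labels, probabilities, c)))


def cutoff_for_tf(tf, per_tf_cutoffs):
    """Use the TF's own cutoff if known, else the mode across other TFs."""
    if tf in per_tf_cutoffs:
        return per_tf_cutoffs[tf]
    return statistics.mode(per_tf_cutoffs.values())


# Toy data (invented): 1 = bound bin in the reference cell type, 0 = unbound.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
probabilities = [0.9, 0.7, 0.45, 0.5, 0.3, 0.2, 0.1, 0.05]
candidate_cutoffs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

chosen = best_mcc_cutoff(labels, probabilities, candidate_cutoffs)
print("cutoff maximizing MCC on the toy data:", chosen)
```

MCC is a sensible target here because bound bins are rare: the accuracy column in the table is close to 1 for every TF, while MCC and F1 still separate well-predicted TFs from poorly predicted ones.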

Virtual ChIP-seq accepts chromatin accessibility data in narrowPeak format and RNA-seq data as a matrix in which rows are human gene symbols and columns are cell types (at least one column, for your cell type of interest). The RNA-seq values must be normalized for length and library size (RPKM, FPKM, and TPM are accepted, but raw read counts are not). Generating the input tables for your TF of interest takes an average of 6 CPU hours (depending on the TF) and at least 8 GB of RAM. Applying the trained model takes less than 20 minutes for most TFs and datasets.
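To make the input requirements concrete, the sketch below parses one line of a narrowPeak file (the standard 10-column BED6+4 layout) and converts raw read counts to TPM, one of the accepted normalized measures. The helper names, gene symbols, and gene lengths are hypothetical, and this is not code from the Virtual ChIP-seq package. If your expression values are already RPKM, FPKM, or TPM, no conversion is needed.

```python
# Column names of the narrowPeak (BED6+4) format, in order.
NARROWPEAK_FIELDS = (
    "chrom", "chromStart", "chromEnd", "name", "score",
    "strand", "signalValue", "pValue", "qValue", "peak",
)


def parse_narrowpeak_line(line):
    """Parse one tab-separated narrowPeak line into a typed dictionary."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(NARROWPEAK_FIELDS):
        raise ValueError(f"expected 10 columns, found {len(values)}")
    record = dict(zip(NARROWPEAK_FIELDS, values))
    for key in ("chromStart", "chromEnd", "score", "peak"):
        record[key] = int(record[key])
    for key in ("signalValue", "pValue", "qValue"):
        record[key] = float(record[key])
    return record


def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts to transcripts per million (TPM).

    Each count is divided by the gene length, then the resulting rates
    are rescaled so that they sum to one million within the sample.
    """
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("all counts are zero; TPM is undefined")
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Hypothetical single-sample example: gene symbols and lengths are made up.
raw_counts = {"GENE_A": 100, "GENE_B": 100}
gene_lengths_bp = {"GENE_A": 1000, "GENE_B": 2000}
tpm = counts_to_tpm(raw_counts, gene_lengths_bp)

peak = parse_narrowpeak_line("chr1\t100\t200\tpeak1\t500\t.\t12.5\t8.1\t6.2\t45")
print(tpm, peak["chrom"], peak["chromStart"], peak["chromEnd"])
```

In the expression matrix, each column would hold values like `tpm` for one cell type, with the gene symbols as row names.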

Track hub, file access, and software

UCSC Genome Browser

View the Virtual ChIP-seq track hub in the UCSC genome browser.

There are 36 supertracks, one for each transcription factor. Each supertrack contains a bigBed9 track for Cistrome and ENCODE ChIP-seq data, and one bigWig file with predictions of TF binding for each of the Roadmap consortium datasets.

Using the track hub

There are 36 supertracks, one for each transcription factor. Each supertrack contains two bigBed9 files: one showing genomic bins with TF binding in Cistrome DB datasets, and one showing Virtual ChIP-seq predictions in the Roadmap consortium datasets.

View the Virtual ChIP-seq track hub in the UCSC genome browser.

List of Roadmap consortium tissue types with Virtual ChIP-seq predictions

Tissue Age ENCODE accession
adrenal gland 108day ENCFF551HRI
B cell 37year ENCFF444ZRC
CD14-positive monocyte 37year ENCFF007TSW
CD4-positive helper T cell 21year ENCFF276EBZ
CD8-positive-alpha-beta T cell 21year ENCFF614QQR
fibroblast of skin of abdomen 97day ENCFF696SPY
forelimb muscle 108day ENCFF060JZA
heart 120day ENCFF203FLV
hindlimb muscle 120day ENCFF856UQI
kidney 108day ENCFF577ZMC
large intestine 120day ENCFF250JHL
left kidney 96day ENCFF456NFP
left lung 108day ENCFF610OWH
left renal cortex interstitium 120day ENCFF602DIZ
left renal pelvis 120day ENCFF714RWU
muscle of arm 127day ENCFF517DTZ
muscle of back 127day ENCFF066LTB
muscle of leg 127day ENCFF207RZS
muscle of trunk 120day ENCFF979SJD
ovary NA ENCFF916EFR
renal cortex interstitium 120day ENCFF330NPA
renal pelvis 105day ENCFF155DZV
right lung 105day ENCFF828HED
right renal cortex interstitium 120day ENCFF198WIN
right renal pelvis 120day ENCFF832UZR
skin fibroblast 97day ENCFF969YOA
small intestine 108day ENCFF227RVA
spinal cord 113day ENCFF412SKC
spleen 112day ENCFF180AEX
stomach 127day ENCFF803IZB
T-cell 37year ENCFF410MHQ
testis NA ENCFF518XTM
thymus 127day ENCFF178AYH

Software and documentation

Read the documentation for Virtual ChIP-seq software, which begins with a quick start.

Support

Please ask questions about Virtual ChIP-seq on our mailing list. If you want to report a bug or request a feature, use the Virtual ChIP-seq issue tracker. We are interested in all comments on the package, including how easy the installation and documentation are to use.

Source code

Credits

Virtual ChIP-seq was developed by Mehran Karimzadeh during his PhD in the Michael Hoffman Lab.