Karimzadeh M. and Hoffman MM. Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome. Genome Biology 23, 126 (2022). doi: https://doi.org/10.1186/s13059-022-02690-2.
Virtual ChIP-seq predicts transcription factor binding in any cell type from RNA-seq and ATAC-seq (or DNase-seq).
The Virtual ChIP-seq track hub contains genome-wide predictions for binding of 36 TFs in 33 different human tissues:
Virtual ChIP-seq uses a multi-layer perceptron to predict binding of individual transcription factors (TFs). It uses data on chromatin accessibility, genomic conservation, and binding characteristics of TFs from previous experiments in other cell types. It also learns from the association of gene expression and TF binding at different genomic regions. Because Virtual ChIP-seq incorporates existing ChIP-seq data, it no longer needs to represent TF sequence preferences in the form of position weight matrices. For a new cell type with data on chromatin accessibility and gene expression, Virtual ChIP-seq predicts indirect TF binding, as well as binding of TFs without known sequence preference.
To build a generalizable classifier that performs well on new cell types with only transcriptome and chromatin accessibility data, we train the multi-layer perceptron on training cell types (A549, GM12878, HepG2, HeLa-S3, HCT-116, BJ, Jurkat, NHEK, Raji, Ishikawa, LNCaP, and T47D). We assess the performance of the model in validation cell types (K562, PANC-1, IMR-90, MCF-7, H1-hESC, and liver). For each TF, we use the posterior probability cutoff that maximizes the Matthews correlation coefficient (MCC) in H1-hESC. If we don't have ChIP-seq data for the TF in H1-hESC, we use the mode of the optimal cutoffs for other TFs (0.4). Below, we report median ± standard deviation of performance among validation cell types. Column N gives the number of validation cell types for each TF.
TF | F1 | Accuracy | MCC | auROC | auPR | N |
---|---|---|---|---|---|---|
ATF2 | 0.270±0.002 | 0.990±0.001 | 0.314±0.008 | 0.917±0.026 | 0.443±0.022 | 1 |
BHLHE40 | 0.334±0.021 | 0.997±0.000 | 0.356±0.010 | 0.974±0.002 | 0.382±0.010 | 1 |
CEBPB | 0.510±0.091 | 0.992±0.002 | 0.515±0.072 | 0.964±0.017 | 0.534±0.073 | 3 |
CHD2 | 0.270±0.051 | 0.996±0.000 | 0.332±0.040 | 0.950±0.012 | 0.386±0.046 | 1 |
CREB1 | 0.362±0.131 | 0.997±0.002 | 0.371±0.121 | 0.868±0.135 | 0.335±0.174 | 2 |
CTCF | 0.667±0.143 | 0.995±0.004 | 0.686±0.107 | 0.988±0.055 | 0.849±0.121 | 4 |
E2F1 | 0.256±0.097 | 0.998±0.002 | 0.314±0.078 | 0.978±0.019 | 0.291±0.105 | 2 |
ELF1 | 0.431±0.047 | 0.997±0.001 | 0.456±0.038 | 0.949±0.042 | 0.493±0.066 | 2 |
ELK1 | 0.430±0.069 | 1.000±0.000 | 0.465±0.054 | 0.991±0.009 | 0.420±0.054 | 2 |
ESR1 | 0.270±0.024 | 0.988±0.003 | 0.380±0.018 | 0.846±0.012 | 0.476±0.010 | 1 |
FOS | 0.333±0.027 | 0.997±0.001 | 0.393±0.020 | 0.861±0.004 | 0.394±0.008 | 1 |
FOSL1 | 0.319±0.006 | 0.994±0.001 | 0.316±0.006 | 0.929±0.006 | 0.272±0.012 | 1 |
FOXA1 | 0.407±0.045 | 0.994±0.005 | 0.444±0.061 | 0.961±0.022 | 0.467±0.131 | 2 |
GABPA | 0.298±0.049 | 0.994±0.002 | 0.393±0.036 | 0.986±0.012 | 0.496±0.036 | 3 |
GTF2F1 | 0.235±0.120 | 0.996±0.001 | 0.312±0.070 | 0.985±0.015 | 0.191±0.081 | 2 |
HCFC1 | 0.459±0.021 | 0.999±0.000 | 0.487±0.024 | 0.990±0.005 | 0.515±0.044 | 2 |
HDAC2 | 0.303±0.033 | 0.986±0.005 | 0.370±0.018 | 0.948±0.051 | 0.281±0.040 | 2 |
HSF1 | 0.350±0.149 | 1.000±0.000 | 0.378±0.145 | 0.999±0.012 | 0.309±0.240 | 1 |
JUN | 0.218±0.127 | 0.998±0.001 | 0.311±0.153 | 0.983±0.009 | 0.456±0.257 | 2 |
JUND | 0.363±0.080 | 0.994±0.002 | 0.399±0.053 | 0.971±0.020 | 0.370±0.078 | 3 |
MAFK | 0.354±0.041 | 0.997±0.001 | 0.423±0.028 | 0.989±0.005 | 0.513±0.103 | 3 |
MAX | 0.400±0.045 | 0.996±0.002 | 0.444±0.059 | 0.961±0.012 | 0.491±0.111 | 3 |
MAZ | 0.370±0.025 | 0.997±0.001 | 0.422±0.019 | 0.987±0.005 | 0.493±0.070 | 2 |
MXI1 | 0.394±0.018 | 0.999±0.000 | 0.402±0.017 | 0.993±0.004 | 0.381±0.025 | 1 |
NRF1 | 0.668±0.051 | 1.000±0.000 | 0.680±0.046 | 0.996±0.018 | 0.725±0.062 | 2 |
RAD21 | 0.593±0.062 | 0.996±0.002 | 0.626±0.056 | 0.983±0.033 | 0.740±0.095 | 3 |
REST | 0.482±0.120 | 0.999±0.001 | 0.493±0.091 | 0.985±0.008 | 0.567±0.095 | 3 |
SIN3A | 0.389±0.048 | 0.998±0.002 | 0.394±0.029 | 0.966±0.004 | 0.411±0.037 | 3 |
SMC3 | 0.733±0.016 | 0.999±0.000 | 0.734±0.016 | 0.998±0.001 | 0.792±0.018 | 1 |
SRF | 0.353±0.060 | 0.998±0.001 | 0.364±0.070 | 0.982±0.008 | 0.365±0.115 | 2 |
TAF1 | 0.378±0.073 | 0.999±0.001 | 0.437±0.097 | 0.987±0.009 | 0.490±0.168 | 3 |
TEAD4 | 0.344±0.061 | 0.990±0.002 | 0.385±0.020 | 0.967±0.023 | 0.343±0.019 | 2 |
TP53 | 0.275±0.103 | 1.000±0.000 | 0.382±0.086 | 1.000±0.008 | 0.660±0.222 | 1 |
USF1 | 0.353±0.047 | 0.993±0.001 | 0.382±0.040 | 0.891±0.012 | 0.372±0.046 | 1 |
USF2 | 0.410±0.040 | 0.999±0.000 | 0.427±0.028 | 0.982±0.007 | 0.437±0.032 | 1 |
YY1 | 0.397±0.049 | 0.996±0.001 | 0.408±0.058 | 0.945±0.043 | 0.417±0.104 | 2 |
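The cutoff selection described before the table (pick the posterior probability cutoff that maximizes MCC, or fall back to 0.4 when no ChIP-seq labels exist for the TF) can be illustrated with a minimal sketch. This is not the Virtual ChIP-seq implementation: the function names, the candidate grid of cutoffs, and the toy labels and probabilities are all assumptions made for illustration.

```python
import math

FALLBACK_CUTOFF = 0.4  # mode of the optimal cutoffs, as stated in the text


def confusion(labels, probs, cutoff):
    """Count (tp, fp, tn, fn) when predicting 'bound' for prob >= cutoff."""
    tp = fp = tn = fn = 0
    for label, prob in zip(labels, probs):
        predicted = prob >= cutoff
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def choose_cutoff(labels, probs, candidates=None):
    """Return the candidate cutoff with the highest MCC.

    labels is None when no ChIP-seq data exist for the TF; in that case
    the fallback cutoff is returned. The candidate grid is an assumption.
    """
    if labels is None:
        return FALLBACK_CUTOFF
    if candidates is None:
        candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
    return max(candidates, key=lambda c: mcc(*confusion(labels, probs, c)))


# Toy data (hypothetical): 3 bound bins out of 10.
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.05, 0.1, 0.2, 0.3, 0.35, 0.45, 0.5, 0.55, 0.7, 0.9]

best = choose_cutoff(labels, probs)
print(best, round(mcc(*confusion(labels, probs, best)), 3))
print(choose_cutoff(None, None))
```

MCC is a natural choice here because bound bins are rare: accuracy stays close to 1 in the table even when F1 and MCC are modest, since most genomic bins are correctly predicted as unbound.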
Virtual ChIP-seq accepts chromatin accessibility data in narrowPeak format and RNA-seq data as a matrix in which rows are human gene symbols and columns are cell types (at least one column, for your cell type of interest). The RNA-seq measure must be normalized to length and library size (RPKM, FPKM, or TPM are accepted, but raw read counts are not). Generating the input tables for your TF of interest takes an average of 6 CPU hours (depending on the TF) and at least 8 GB of RAM. Applying the trained model takes less than 20 minutes for most TFs and datasets.
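If you only have raw read counts, they need to be normalized before use. The sketch below shows the standard TPM calculation (divide counts by gene length, then scale each cell type so the values sum to one million) and arranges the result as a genes-by-cell-types matrix. It is an illustration only: the function names, the in-memory dictionary layout, and the counts and gene lengths are hypothetical, and the exact file layout Virtual ChIP-seq expects is described in its documentation, not here.

```python
def counts_to_tpm(counts, length_kb):
    """Convert raw read counts for one cell type to TPM.

    counts:    {gene_symbol: raw read count}
    length_kb: {gene_symbol: gene length in kilobases}
    """
    # Reads per kilobase corrects for gene length.
    rates = {gene: counts[gene] / length_kb[gene] for gene in counts}
    total = sum(rates.values())
    # Scaling to one million corrects for library size.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


def expression_matrix(counts_by_cell_type, length_kb):
    """Build {gene_symbol: {cell_type: TPM}}: rows are genes, columns are cell types."""
    matrix = {}
    for cell_type, counts in counts_by_cell_type.items():
        for gene, tpm in counts_to_tpm(counts, length_kb).items():
            matrix.setdefault(gene, {})[cell_type] = tpm
    return matrix


# Hypothetical counts and gene lengths, for illustration only.
length_kb = {"GAPDH": 1.0, "TP53": 2.0, "CTCF": 3.0}
counts_by_cell_type = {
    "my_cell_type": {"GAPDH": 1000, "TP53": 200, "CTCF": 300},
}

matrix = expression_matrix(counts_by_cell_type, length_kb)
for gene, row in matrix.items():
    print(gene, round(row["my_cell_type"], 1))
```

In this toy example TP53 and CTCF end up with the same TPM even though CTCF has more reads, because CTCF is assumed to be proportionally longer.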
View the Virtual ChIP-seq track hub in the UCSC genome browser.
There are 36 supertracks, one for each transcription factor. Each supertrack contains a bigBed9 track showing genomic bins with TF binding in Cistrome DB and ENCODE ChIP-seq datasets, along with Virtual ChIP-seq predictions of binding of the TF in each of the Roadmap consortium datasets.
List of Roadmap consortium tissue types with Virtual ChIP-seq predictions
Tissue | Age | ENCODE accession |
---|---|---|
adrenal gland | 108day | ENCFF551HRI |
B cell | 37year | ENCFF444ZRC |
CD14-positive monocyte | 37year | ENCFF007TSW |
CD4-positive helper T cell | 21year | ENCFF276EBZ |
CD8-positive-alpha-beta T cell | 21year | ENCFF614QQR |
fibroblast of skin of abdomen | 97day | ENCFF696SPY |
forelimb muscle | 108day | ENCFF060JZA |
heart | 120day | ENCFF203FLV |
hindlimb muscle | 120day | ENCFF856UQI |
kidney | 108day | ENCFF577ZMC |
large intestine | 120day | ENCFF250JHL |
left kidney | 96day | ENCFF456NFP |
left lung | 108day | ENCFF610OWH |
left renal cortex interstitium | 120day | ENCFF602DIZ |
left renal pelvis | 120day | ENCFF714RWU |
muscle of arm | 127day | ENCFF517DTZ |
muscle of back | 127day | ENCFF066LTB |
muscle of leg | 127day | ENCFF207RZS |
muscle of trunk | 120day | ENCFF979SJD |
ovary | NA | ENCFF916EFR |
renal cortex interstitium | 120day | ENCFF330NPA |
renal pelvis | 105day | ENCFF155DZV |
right lung | 105day | ENCFF828HED |
right renal cortex interstitium | 120day | ENCFF198WIN |
right renal pelvis | 120day | ENCFF832UZR |
skin fibroblast | 97day | ENCFF969YOA |
small intestine | 108day | ENCFF227RVA |
spinal cord | 113day | ENCFF412SKC |
spleen | 112day | ENCFF180AEX |
stomach | 127day | ENCFF803IZB |
T-cell | 37year | ENCFF410MHQ |
testis | NA | ENCFF518XTM |
thymus | 127day | ENCFF178AYH |
Read the documentation for Virtual ChIP-seq software, which begins with a quick start.
Please ask questions about Virtual ChIP-seq on our mailing list. If you want to report a bug or request a feature, use the Virtual ChIP-seq issue tracker. We are interested in all comments on the package, including the ease of installation and the usability of the documentation.
Virtual ChIP-seq was developed by Mehran Karimzadeh during his PhD in the Michael Hoffman Lab.