Chapter 3 Download Data

3.1 ChIP-seq

Chen, X., et al. (2008). Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell, 133(6), 1106-1117.

3.1.1 samFile

cat <<END  > samFile.txt
SRR002004   ES_Nanog_1  ES_Nanog
SRR002005   ES_Nanog_2  ES_Nanog
SRR002011   ES_Nanog_3  ES_Nanog
SRR002010   ES_Nanog_4  ES_Nanog
SRR002009   ES_Nanog_5  ES_Nanog
SRR002008   ES_Nanog_6  ES_Nanog
SRR002007   ES_Nanog_7  ES_Nanog
SRR002006   ES_Nanog_8  ES_Nanog
SRR002012   ES_Oct4_1   ES_Oct4
SRR002013   ES_Oct4_2   ES_Oct4
SRR002014   ES_Oct4_3   ES_Oct4
SRR002015   ES_Oct4_4   ES_Oct4
SRR002025   ES_Sox2_1   ES_Sox2
SRR002024   ES_Sox2_2   ES_Sox2
SRR002023   ES_Sox2_3   ES_Sox2
SRR002026   ES_Sox2_4   ES_Sox2
SRR002021   ES_Smad1_1  ES_Smad1
SRR002020   ES_Smad1_2  ES_Smad1
SRR002022   ES_Smad1_3  ES_Smad1
SRR001991   ES_E2f1_1   ES_E2f1
SRR001990   ES_E2f1_2   ES_E2f1
SRR001989   ES_E2f1_3   ES_E2f1
SRR001988   ES_E2f1_4   ES_E2f1
SRR002034   ES_Tcfcp2I1_1   ES_Tcfcp2I1
SRR002033   ES_Tcfcp2I1_2   ES_Tcfcp2I1
SRR002032   ES_Tcfcp2I1_3   ES_Tcfcp2I1
SRR002031   ES_Tcfcp2I1_4   ES_Tcfcp2I1
SRR001987   ES_CTCF_1   ES_CTCF
SRR001986   ES_CTCF_2   ES_CTCF
SRR001985   ES_CTCF_3   ES_CTCF
SRR002035   ES_Zfx_1    ES_Zfx
SRR002036   ES_Zfx_2    ES_Zfx
SRR002037   ES_Zfx_3    ES_Zfx
SRR002038   ES_Zfx_4    ES_Zfx
SRR002019   ES_STAT3_1  ES_STAT3
SRR002018   ES_STAT3_2  ES_STAT3
SRR002017   ES_STAT3_3  ES_STAT3
SRR002016   ES_STAT3_4  ES_STAT3
SRR002000   ES_Klf4_1   ES_Klf4
SRR002001   ES_Klf4_2   ES_Klf4
SRR002002   ES_Klf4_3   ES_Klf4
SRR002003   ES_Klf4_4   ES_Klf4
SRR001992   ES_Esrrb_1  ES_Esrrb
SRR001993   ES_Esrrb_2  ES_Esrrb
SRR001994   ES_Esrrb_3  ES_Esrrb
SRR001995   ES_Esrrb_4  ES_Esrrb
SRR002039   ES_c-Myc_1  ES_c-Myc
SRR002040   ES_c-Myc_2  ES_c-Myc
SRR002041   ES_c-Myc_3  ES_c-Myc
SRR002042   ES_c-Myc_4  ES_c-Myc
SRR002046   ES_n-Myc_1  ES_n-Myc
SRR002045   ES_n-Myc_2  ES_n-Myc
SRR002044   ES_n-Myc_3  ES_n-Myc
SRR002043   ES_n-Myc_4  ES_n-Myc
SRR001996   ES_GFP_1    ES_GFP
SRR001997   ES_GFP_2    ES_GFP
SRR001998   ES_GFP_3    ES_GFP
SRR001999   ES_GFP_4    ES_GFP
SRR023866   ES_p300_1   ES_p300
SRR023867   ES_p300_2   ES_p300
SRR023868   ES_p300_3   ES_p300
SRR023869   ES_p300_4   ES_p300
SRR002027   ES_Suz12_1  ES_Suz12
SRR002028   ES_Suz12_2  ES_Suz12
SRR002029   ES_Suz12_3  ES_Suz12
SRR002030   ES_Suz12_4  ES_Suz12
END
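
Before downloading anything, it can help to confirm that the sample sheet is well formed. The sketch below is not part of the original workflow: `check_sheet` is a hypothetical helper that reports rows with an unexpected number of fields and accessions that appear more than once. The sheet above has three columns and 66 rows.

```shell
# Hypothetical helper (not from the original workflow): report rows whose
# field count differs from the expected one, and repeated accessions.
check_sheet() {
    local file=$1 ncol=$2
    awk -v n="$ncol" '
        NF != n    { printf "line %d: %d fields\n", NR, NF; bad = 1 }
        seen[$1]++ { printf "line %d: duplicate %s\n", NR, $1; bad = 1 }
        END        { exit bad + 0 }
    ' "$file"
}

# Usage: the ChIP-seq sheet has three whitespace-separated columns
if [ -f samFile.txt ]; then
    check_sheet samFile.txt 3 &&
        echo "samFile.txt: $(wc -l < samFile.txt) rows, all consistent"
fi
```

The function returns a non-zero status when it finds a problem, so it can be used to stop a pipeline before any download starts.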

3.1.2 download

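#!/bin/bash
# Download each run from the ENA FTP server and rename it to its sample name.
while read -r srr sampleName experiment
do
    # ENA groups runs by the first six characters of the accession, e.g. SRR002
    srrHead=${srr:0:6}
    # Nine-character accessions such as SRR002004 sit directly under that
    # directory, so no extra subdirectory is needed here
    url="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${srrHead}/${srr}/${srr}.fastq.gz"
    wget -O "${sampleName}.fastq.gz" "$url"
done < samFile.txt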
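
The URL pattern used in the loop can be checked with a dry run that prints the links without downloading anything. The sketch below assumes single-end runs (one `.fastq.gz` per accession, as in the script above). `ena_fastq_url` is a hypothetical helper name. The zero-padded subdirectory for accessions longer than nine characters reflects my understanding of ENA's directory layout and is not needed for this data set, so treat that branch as an assumption to verify.

```shell
# Sketch: build the ENA FTP URL for a single-end run accession.
# The extra subdirectory for accessions longer than nine characters is an
# assumption about ENA's layout; none of the runs in this chapter need it.
ena_fastq_url() {
    local srr=$1
    local head extra sub=""
    head=$(printf '%s' "$srr" | cut -c1-6)     # e.g. SRR002
    extra=$(printf '%s' "$srr" | cut -c10-)    # characters after the ninth, if any
    case ${#extra} in
        1) sub="00${extra}/" ;;
        2) sub="0${extra}/" ;;
        3) sub="${extra}/" ;;
    esac
    echo "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${head}/${sub}${srr}/${srr}.fastq.gz"
}

# Dry run: print one URL per sample instead of calling wget
if [ -f samFile.txt ]; then
    while read -r srr sampleName experiment; do
        echo "${sampleName}: $(ena_fastq_url "$srr")"
    done < samFile.txt
fi
```

Opening one or two of the printed links in a browser or with `wget --spider` is a cheap way to confirm the pattern before starting 66 downloads.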

3.2 uliCUT&RUN

Hainer, S. J., et al. (2019). Profiling of pluripotency factors in single cells and early embryos. Cell, 177(5), 1319.

For the CUT&RUN data, I start from the processed bedGraph files, which can be downloaded from GEO (accession GSE111121).

In this project we are only interested in the blastocyst samples, of which there are 28 in total.

Table 3.1: Experimental design of the CUT&RUN samples
cellType     Experiment  Antibody  Rep
blastocysts  None        NoAb      2
blastocysts  None        CTCF      2
blastocysts  EGFPKD      NoAb      4
blastocysts  EGFPKD      NANOG     4
blastocysts  Brg1KD      NoAb      4
blastocysts  Brg1KD      NANOG     4
blastocysts  NanogKD     NoAb      4
blastocysts  NanogKD     NANOG     4

3.2.1 samFile

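The command below writes the sample sheet to samFile.txt. Because the ChIP-seq section uses the same file name, run it in a separate directory.

# create a file holding the accession numbers and experiment information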
cat <<END  > samFile.txt
GSM3022469  blast_NoAb_rep1 NoAb_1  NoAb
GSM3022470  blast_NoAb_rep2 NoAb_2  NoAb
GSM3022471  blast_CTCF_rep1 CTCF_1  CTCF
GSM3022472  blast_CTCF_rep2 CTCF_2  CTCF
GSM3022473  blast_EGFPKD_NoAb_rep1  EGFPKD_NoAb_1   EGFPKD_NoAb
GSM3022474  blast_EGFPKD_NoAb_rep2  EGFPKD_NoAb_2   EGFPKD_NoAb
GSM3022475  blast_EGFPKD_NoAb_rep3  EGFPKD_NoAb_3   EGFPKD_NoAb
GSM3022476  blast_EGFPKD_NoAb_rep4  EGFPKD_NoAb_4   EGFPKD_NoAb
GSM3022477  blast_EGFPKD_Nanog_rep1 EGFPKD_1    EGFPKD
GSM3022478  blast_EGFPKD_Nanog_rep2 EGFPKD_2    EGFPKD
GSM3022479  blast_EGFPKD_Nanog_rep3 EGFPKD_3    EGFPKD
GSM3022480  blast_EGFPKD_Nanog_rep4 EGFPKD_4    EGFPKD
GSM3022481  blast_Brg1KD_NoAb_rep1  Brg1KD_NoAb_1   Brg1KD_NoAb
GSM3022482  blast_Brg1KD_NoAb_rep2  Brg1KD_NoAb_2   Brg1KD_NoAb
GSM3022483  blast_Brg1KD_NoAb_rep3  Brg1KD_NoAb_3   Brg1KD_NoAb
GSM3022484  blast_Brg1KD_NoAb_rep4  Brg1KD_NoAb_4   Brg1KD_NoAb
GSM3022485  blast_Brg1KD_Nanog_rep1 Brg1KD_1    Brg1KD
GSM3022486  blast_Brg1KD_Nanog_rep2 Brg1KD_2    Brg1KD
GSM3022487  blast_Brg1KD_Nanog_rep3 Brg1KD_3    Brg1KD
GSM3022488  blast_Brg1KD_Nanog_rep4 Brg1KD_4    Brg1KD
GSM3022489  blast_NanogKD_NoAb_rep1 NanogKD_NoAb_1  NanogKD_NoAb
GSM3022490  blast_NanogKD_NoAb_rep2 NanogKD_NoAb_2  NanogKD_NoAb
GSM3022491  blast_NanogKD_NoAb_rep3 NanogKD_NoAb_3  NanogKD_NoAb
GSM3022492  blast_NanogKD_NoAb_rep4 NanogKD_NoAb_4  NanogKD_NoAb
GSM3022493  blast_NanogKD_Nanog_rep1    NanogKD_1   NanogKD
GSM3022494  blast_NanogKD_Nanog_rep2    NanogKD_2   NanogKD
GSM3022495  blast_NanogKD_Nanog_rep3    NanogKD_3   NanogKD
GSM3022496  blast_NanogKD_Nanog_rep4    NanogKD_4   NanogKD
END
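
The sample sheet can be compared against Table 3.1 by counting rows per experiment label. This is a small sketch rather than part of the original workflow, and `count_by_column` is a hypothetical helper name. For the sheet above, column 4 should give 2 rows each for NoAb and CTCF, 4 rows for each of the six knockdown groups, and 28 rows overall.

```shell
# Hypothetical helper: count rows per distinct value of one column.
count_by_column() {
    local file=$1 col=$2
    awk -v c="$col" '{ n[$c]++ } END { for (k in n) print k, n[k] }' "$file" | sort
}

if [ -f samFile.txt ]; then
    count_by_column samFile.txt 4    # column 4 holds the experiment label
    wc -l < samFile.txt              # total number of samples
fi
```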

3.2.2 download

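Save the download script below as downfile.sh.

The download links look like the two examples below. Note that the first ends in _1 where the sample sheet has _rep1, so the file-name suffix is not uniform across samples: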

https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3022nnn/GSM3022469/suppl/GSM3022469_blast_NoAb_1_1-120.ucsc.bedGraph.gz

https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3022nnn/GSM3022473/suppl/GSM3022473_blast_EGFPKD_NoAb_rep1_1-120.ucsc.bedGraph.gz
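
To make the pattern explicit, the sketch below assembles a link from a GSM accession and the second column of the sample sheet. `geo_bedgraph_url` is a hypothetical helper name, and the `_1-120.ucsc.bedGraph.gz` suffix is copied from the two links above rather than from any GEO rule, so it only applies to this series.

```shell
# Sketch: build the GEO supplementary-file URL for one sample.
geo_bedgraph_url() {
    local gsm=$1 rep=$2
    # GEO directory names replace the last three digits with "nnn",
    # e.g. GSM3022473 -> GSM3022nnn
    local gsmSub="${gsm%???}nnn"
    echo "https://ftp.ncbi.nlm.nih.gov/geo/samples/${gsmSub}/${gsm}/suppl/${gsm}_${rep}_1-120.ucsc.bedGraph.gz"
}

# Reproduces the second link shown above
geo_bedgraph_url GSM3022473 blast_EGFPKD_NoAb_rep1
```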

#!/bin/bash
# Download one bedGraph per sample from GEO and rename it to its sample name.
while read -r gsm rep sampleName experiment
do
    # GEO groups samples into directories such as GSM3022nnn
    gsmSub="${gsm:0:7}nnn"
    base="https://ftp.ncbi.nlm.nih.gov/geo/samples/${gsmSub}/${gsm}/suppl"
    # Some files use "_1" instead of "_rep1" (see the first example link above),
    # so try the name from the sample sheet first, then the shortened form
    repShort="${rep:0:${#rep}-4}${rep: -1}"
    wget -O "${sampleName}.bedGraph.gz" "${base}/${gsm}_${rep}_1-120.ucsc.bedGraph.gz" ||
        wget -O "${sampleName}.bedGraph.gz" "${base}/${gsm}_${repShort}_1-120.ucsc.bedGraph.gz"
done < samFile.txt

Then make the script executable and run it from the terminal:

# download data
chmod +x downfile.sh
./downfile.sh

# check the number of .gz file (should be 28)
ls *.gz | wc -l
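
Counting files is a useful first check, but `wget -O` creates the output file even when a download fails, so an empty or truncated file still counts. The sketch below adds an integrity test. `find_bad_gz` is a hypothetical helper name, and it relies only on `gzip -t`, which tests a compressed file without unpacking it.

```shell
# Hypothetical helper: list .gz files in a directory that are empty
# or fail gzip's integrity test.
find_bad_gz() {
    local dir=${1:-.}
    local f
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue    # the glob matched nothing
        if [ ! -s "$f" ] || ! gzip -t "$f" 2>/dev/null; then
            echo "$f"
        fi
    done
}

# Prints nothing when every download in the current directory is intact
find_bad_gz .
```

Any file listed here can be deleted and downloaded again before moving on to the next chapter.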