The following table shows some of the common types of data files and includes some information about them:
File Extension | Type of Data | Format | Example(s)
---|---|---|---
txt | multi-format | Text | study metadata, tab-delimited data
fastq | nucleotide | Text | sequencing reads
fasta | nucleotide, protein | Text | the human genome
sff | nucleotide | Binary | Roche/454 sequencing data
vcf | variant calls | Text | variation/SNP calls
sam | alignment | Text | reads aligned to a reference
bam | alignment | Binary | reads aligned to a reference
bed | metadata / feature definitions | Text | genome coverage
h5 | hierarchical data | Binary | PacBio sequencing data
pileup | alignment | Text | mpileup, SNP and indel calling
In this workshop, there are a few bioinformatics-related data types we will focus on (beyond simple text files - although in principle many of the files are text). First let's consider the definition/documentation for these file types:

- Plain-text
- Compressed/binary
In addition to understanding how to work with these files, we also need to understand how to verify their integrity. It is not uncommon for downloads to fail partway through, throw error messages, or need to be restarted, and files often get moved around. In these cases, knowing how to verify that two files are the same (not simply named the same) is very important. To do this we use a process called checksums:
According to Wikipedia, a checksum is 'a small-size datum from a block of digital data for the purpose of detecting errors which may have been introduced during its transmission or storage.' In other words, it is a result we can generate that uniquely corresponds to a file. Any change to that file (adding a space, deleting a character, etc.) will change the file's checksum.
### File integrity - download a file and verify the download
Working with genomics files almost always requires using the shell, so although this isn't the official shell lesson, we're going to use many BASH functions. If you've already covered the shell lesson, consider this a refresher.
First we're going to download a zip file that is saved in a public Dropbox folder. We're going to download it using either `wget` or `curl`; which one you need to use depends on your operating system, as most computers will only have one or the other installed by default.
To see which one you have:
```bash
$ which curl
$ which wget
```
`which` is a BASH program that looks through everything you have installed and tells you what folder it is installed in. If it can't find the program you asked for, it returns nothing, i.e. gives you no results.
On OSX, you'll likely get:
```bash
$ which curl
/usr/bin/curl
$ which wget
$
```
Once you know whether you have `curl` or `wget`, use one of the following commands to download the zip file:
```bash
$ cd
$ wget https://www.dropbox.com/s/d7zitckb5fz8494/GenomicsLesson.zip
```
or
```bash
$ cd
$ curl -LO https://www.dropbox.com/s/d7zitckb5fz8494/GenomicsLesson.zip
```
Most genomics files that you download will be very large, which can make them more prone to download errors. Flaky internet can make you think that your file is finished downloading, even if it really just stopped partway through. Furthermore, even files that are filled with familiar characters are nearly impossible to fact-check by eye. Imagine trying to make sure your favorite genome had downloaded correctly by manually comparing each base on your computer to the one at NCBI. Luckily there's a better way.
Whenever you download a large or important file, you should check to make sure that it is an exact match to the copy online. The most common way to do this is to run the file through a cryptographic hash function, which processes all of the information in the file through a complex algorithm to produce a hash value. A hash value looks a bit like an ideal password: a random-looking mix of letters and numbers. Because the hash function is a very complex equation, in theory, for any given hash function, every unique input will have a unique hash value. So if you get the same hash value from two different files, those files are identical.
There are many available hash functions, and as computers get more sophisticated, older ones become easier to break, so there will likely always be new ones. But right now, we're going to use MD5 because it comes pre-installed on most computers, and is most likely the one your sequencing facility or collaborator will use when they send you large files.
There are two common versions of MD5, and we're going to use `which` again to see which one you have installed:
```bash
$ which md5
$ which md5sum
```
Then get the hash value for the zip file you downloaded by running either:
```bash
$ md5 GenomicsLesson.zip
```
or
```bash
$ md5sum GenomicsLesson.zip
```
The file I uploaded gave this answer:
```bash
$ md5 GenomicsLesson.zip
MD5 (GenomicsLesson.zip) = e237e30985867e6bea741949e42a0c3b
```
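To see the 'any change to the file changes its checksum' idea in action, here's a minimal sketch (the test file name and contents are made up for illustration; swap in `md5` if that's what your machine has):

```bash
# make a tiny test file and checksum it
$ echo "The quick brown fox" > test.txt
$ md5sum test.txt

# change one character and checksum it again -- the two hashes won't match
$ echo "The quick brown fox?" > test.txt
$ md5sum test.txt
```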
Practice: Download the best practices file from Data Carpentry: https://raw.githubusercontent.com/ACharbonneau/2016-01-18-MSU/gh-pages/Files/GoodBetterBest.md
### Working with compressed files
As previously mentioned, genomics data files tend to be large. Since larger files are slower and more costly to move around, you will often encounter files that have been compressed to save time/space/money. The three most commonly encountered types of compressed files are Zip archives (e.g. `filename.zip`), Gzip archives (e.g. `filename.gz`) and Tarballs (e.g. `filename.tar` or `filename.tar.gz`).
Once you've convinced yourself that the file you have is the file that you ought to have, the next thing that you'll want to do is unzip it (a.k.a. uncompress or decompress or extract). You can unzip your .zip archive using the `unzip` program:

```bash
unzip <filename.zip>
```
If you don't want to extract everything, but rather check the contents, you can view what a zip contains using the `-l` flag ('list'):

```bash
unzip -l <filename.zip>
```
When you want to go in the other direction and make your own archive, the command is simply `zip`. It works like this:

```bash
zip <mynewarchive.zip> <myfirstfile.txt> <mysecondfile.sam>
```
Note that you can also use the `-r` flag ('recursive') to zip up a folder and all its contents, including subfolders, like so:

```bash
zip -r <myproject.zip> myproject/
```
If you have been sent a big bundle of data as a gzip archive, then happily the same procedure applies for viewing and extracting as with zip archives, but with the `gunzip` program:

```bash
gunzip -l <bundle.gz>
gunzip <bundle.gz>
```
Things are slightly different (read 'complex') if you encounter a tarball: `thisfile.tar` or `thatfile.tar.gz` or `tacofile.tgz`.
You can view the contents of tarballs using the `tar` program:

```bash
tar -tf <thisfile.tar>
tar -ztvf <thatfile.tar.gz>
tar -ztvf <tacofile.tgz>
```
…and extract them like this:

```bash
tar -xf <thisfile.tar>
tar -zxvf <thatfile.tar.gz>
tar -zxvf <tacofile.tgz>
```
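If you ever need to make a tarball yourself, `-c` ('create') is the counterpart to `-x`. Here's a minimal round-trip sketch, assuming a hypothetical `myproject/` folder:

```bash
# bundle a folder (and its subfolders) into a gzipped tarball
tar -zcvf myproject.tar.gz myproject/

# check what's inside without extracting
tar -ztvf myproject.tar.gz

# extract it again
tar -zxvf myproject.tar.gz
```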
Other types of compressed files and archives do exist, but these are the most common.
First, let's see what data files we have available:
```bash
ls
```

`ls` stands for 'list', and if you call it all by itself, it just returns a list of whatever is inside the folder you're currently looking at. It should give you a fairly big list of files in alphabetical order. However, they're hard to understand like this, so let's ask `ls` to make the list a little easier to read:
```bash
ls -lah
```
Here we've added modifiers to `ls`. Computer people usually call these modifiers 'flags' or 'arguments', and here we've added 3 flags:

- `-l` directs `ls` to give us the results in 'long format', so we get more information
- `-a` tells `ls` to show us 'all' of the things in the folder, even if they're usually hidden
- `-h` makes the output 'human readable', so you see file sizes in KB or GB instead of bytes
#### FASTA

You're likely already familiar with FASTA files, as this is the most common way to distribute sequence information. Let's look at one:
```bash
head Raphanus.fa
```
`head` is another program, and it shows you just the top few lines of a file. By default it shows ten (so five sequences here), but we can also change that behavior with flags:
```bash
head -4 Raphanus.fa
```
Now, you should see the first four lines of the Raphanus.fa file.
Exercise: Try looking at `EV813540.fa`
FASTA files always have at least one comment line, which almost always begins with ">", but can start with ";". A given sequence in the file is allowed to have multiple comment lines, but usually doesn't. Extra comment lines for sequences can break some downstream processes.
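For instance, a minimal FASTA record (this name and sequence are invented for illustration) looks like:

```
>example_sequence_1 an optional description can follow the name
ATGGCGTACGATCGATCGAATTCGGCTAGCTAGCTAGCATCGATCGA
```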
After the comment line is the sequence. Usually this is all on one line, but you can see that this one is formatted so that each sequence line is only 80 characters wide. This makes it easy to read, but makes it slightly more difficult to search within the file. For searching, it's nice to have files where all of each sequence is on a single line. For instance, let's see whether there are any EcoRI sites in the Raphanus.fa file:
grep "GAATTC" Raphanus.fa
`grep` is a program that searches for any string, and by default returns the entire line that your string is found in. For a file this big, this isn't very helpful. So let's modify how `grep` reports its findings:
```bash
grep -B 1 "GAATTC" Raphanus.fa
```
- `-B <number>`: grep will return the line with your string plus 'number' lines of 'before context', so here we'll get one previous line…the comment that tells us the sequence name
Now we know which of the sequences have the restriction site we're looking for, but there are so many that they've overfilled the screen. So let's redirect the output from the screen into a file:
```bash
grep -B 1 "GAATTC" Raphanus.fa > Raphanus_EcoRI.fa
```
The greater than sign `>` takes everything from the command on its left and dumps it into the file named on its right. So, all of the output from that `grep` command got saved into a new file called `Raphanus_EcoRI.fa`. Since we didn't specify a place to save it, the new file is just saved in the same folder we're in, and we can see it by using `ls` again:
```bash
ls -lahtr
```

- `-t` sorts the list by modification time, and `-r` reverses that sort, so the newest files end up on the bottom. This makes it easier to find what we're looking for.
`grep`, `ls` and `head` all have lots of useful flags, but we'll only do one more for now:

```bash
grep -c "GAATTC" Raphanus.fa
```
- `-c`: grep 'counts' the matching lines instead of printing them; here it counted 88 instances of the EcoRI site
#### SFF, FASTQ, FNA & QUAL

SFF stands for Standard Flowgram Format and is for 454 data. FASTQ is named after FASTA, and is the output from most Illumina sequencers. FNA stands for FASTA Nucleic Acid, and QUAL for QUALity; if you have Solexa data you might have these. These are all files that you might get from your sequencing facility, and they all tell you about the machine, the sequence, and the quality of the sequence.
These are more complex than FASTA files, because they include quality information, but that often makes them more useful.
Most likely, you'll only be using FASTQ, as most people are doing Illumina sequencing right now.
```bash
head -4 33_20081121_2_RH2.fastq
```
FASTQ files have four lines of data per sequence.
Line one should look something like `@30LWAAAXX_KD1_4:2:1:1428:1748` or `@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG`, depending on the vintage of your particular sequencer.
In either case, the jumble of letters between the `@` and the first `:` is a unique instrument name. After that, each of the numbers between the colons represents things like the run ID, the flowcell lane, the physical coordinates of the cluster that sequence came from, and whether the sequencing was single or paired end. (See Wikipedia to decipher your own.)
Line two is the sequencer's base calls, based on light/pH/etc.

Line three is often just a '+', but can be followed by a description.

Line four holds the quality scores for each base call, encoded as ASCII symbols.
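Putting those four lines together, here is the example record from the Wikipedia FASTQ article:

```
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
```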
Probably the most confusing thing about FASTQ is the wide variety of quality score encodings that different sequencing companies have adopted. If you keep good records, including what type of machine each of your sequences was run on, and at what time, this isn't such a problem. However, if you've just inherited a pile of poorly managed sequences, you'll need your detective hat:
```
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
.................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                            104                   126

0........................26...31.......40
                         -5....0........9.............................40
                               0........9.............................40
                                  3.....9.............................40
0.2......................26...31........41

S - Sanger        Phred+33, raw reads typically (0, 40)
X - Solexa        Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+ Phred+64, raw reads typically (0, 40)
J - Illumina 1.5+ Phred+64, raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
L - Illumina 1.8+ Phred+33, raw reads typically (0, 41)
```
(See complete image and much more discussion at Wikipedia)
Quality scores range from -5 to 41, on a somewhat overlapping range of ASCII-encoded numbers, such that if your data happened to all have quality scores of 'F' and you didn't know how they were sequenced, it would be impossible to know whether 'F' meant they were all quality 38/40 from Sanger (which is quite good), or quality 6 from Illumina 1.5 (which is quite bad). In practice, you'll almost always have a good range of scores in your data, and by carefully comparing your quality scores to the chart above, you can usually work out where your documentation-challenged colleague got those samples they want you to analyze. (There's also a much easier way, which we'll get to with FASTQC.)
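One quick way to eyeball the raw quality characters yourself is to pull out every fourth line of the FASTQ file. Here's a minimal sketch (it assumes the standard four-lines-per-record layout described above):

```bash
# print the quality lines of the first three records
awk 'NR % 4 == 0' 33_20081121_2_RH2.fastq | head -3
```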
#### SAM & BAM

SAM files are tab-delimited files that describe how reads align to a reference sequence. They generally start with header lines (which always start with `@`) before the actual alignments.
BAM files hold all the same information, but in binary format, which makes them much faster for computers to use, but impossible for us to read. Let's check:
```bash
head 12724.bam
head 12724.sam
```
The `.bam` file just looks like nonsense, but the `.sam` file looks sort of like we expected, except it's all headers. So let's look at more of the SAM file:
```bash
head -20 12724.sam
```
…hmm
```bash
head -100 12724.sam
```
…that's a lot of headers. Rather than try to guess how far the header goes, let's just look at the other end of the file:
```bash
tail -20 12724.sam
```
`tail` works just like `head`, except it counts up from the end of the file instead of down from the top. So now we can see an example of the alignment part of the file. The alignments all have at least 11 standard columns (although the values might be zero), but can have lots of extra ones as well. These are the 11 required columns:
Col | Field | Type | Brief description
---|---|---|---
1 | QNAME | String | Query template NAME
2 | FLAG | Int | bitwise FLAG
3 | RNAME | String | Reference sequence NAME
4 | POS | Int | 1-based leftmost mapping POSition
5 | MAPQ | Int | MAPping Quality
6 | CIGAR | String | CIGAR string
7 | RNEXT | String | Ref. name of the mate/next read
8 | PNEXT | Int | Position of the mate/next read
9 | TLEN | Int | observed Template LENgth
10 | SEQ | String | segment SEQuence
11 | QUAL | String | ASCII of Phred-scaled base QUALity+33
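To make the columns concrete, here is the single alignment line used in the SAM specification's worked example (the fields are separated by tabs in a real file):

```
r001  99  ref  7  30  8M2I4M1D3M  =  37  39  TTAGATAAAGGATACTG  *
```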
Because SAM files are tab-delimited, they are easy for both people and computers to read (just not as quickly as BAM files). For instance, we can use the program `cut` to get the flags from a SAM file:
```bash
cut -f 2 12724.sam
```
- `-f`: which 'field' (column) do you want?
That was way too much stuff to look at. So let's build our first mini pipeline! All we're going to do is take the output from `tail` and send it into `cut` using the pipe operator `|`:
```bash
tail -20 12724.sam | cut -f 2
```
Now we have just the flags from the last 20 lines. Next let's get both the flags and their sequences from the last 20 lines:

```bash
tail -20 12724.sam | cut -f 2,10
```
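Pipes can be chained as many times as you like. As a sketch (using the standard `sort` and `uniq` programs, which we haven't formally introduced), you could tally how often each flag value appears:

```bash
# count how many times each flag value occurs in the last 20 alignments
tail -20 12724.sam | cut -f 2 | sort | uniq -c
```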
Exercise 1: Get all of the integer type data from the last 30 lines
Exercise 2: Get the quality scores from the penultimate 10 lines
#### VCF

Variant Call Format (VCF) files contain the genotype information for variable bases in reads mapped to a reference. Only variable sites are included.
Example:

```
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
```
File meta-information is included after the `##` string. The 'fileformat' field details the VCF format version number:

```
##fileformat=VCFv4.0
```
INFO fields are defined in header lines with this syntax:

```
##INFO=<ID=ID,Number=number,Type=type,Description="description">
```
The Number entry is the number of values that can be included with the INFO field. For example, if the INFO field contains a single number, then this value should be 1. However, if the INFO field describes a pair of numbers, then this value should be 2 and so on. If the number of possible values varies, is unknown, or is unbounded, then this value should be '.'. Possible Types are: Integer, Float, Character, String and Flag. The 'Flag' type indicates that the INFO field does not contain a Value entry, and hence the Number should be 0 in this case. The Description value must be surrounded by double-quotes.
Every INFO key used in the body of the file should have a corresponding INFO line in the header.
The header line syntax
The header line names the 8 fixed, mandatory columns. These columns are as follows:
1. #CHROM
2. POS
3. ID
4. REF
5. ALT
6. QUAL
7. FILTER
8. INFO
If genotype data is present in the file, these are followed by a FORMAT column header, then an arbitrary number of sample IDs. The header line is tab-delimited.
Fixed fields
There are 8 fixed fields per record. All data lines are tab-delimited. In all cases, missing values are specified with a dot ("."). Fixed fields are:
- CHROM - chromosome: an identifier from the reference genome.
- POS - position: the reference position, with the 1st base having position 1.
- ID - semi-colon separated list of unique identifiers.
- REF - reference base(s): each base must be one of A, C, G, T, N.
- ALT - comma separated list of alternate non-reference alleles called on at least one of the samples.
- QUAL - phred-scaled quality score for the assertion made in ALT.
- FILTER - filter status: PASS if this position has passed all filters, i.e. a call is made at this position. Otherwise, a semicolon-separated list of codes for the filters that failed, e.g. "q10;s50" might indicate that at this site the quality is below 10 and the number of samples with data is below 50% of the total number of samples.
- INFO - additional information, encoded as a semicolon-separated series of keys with optional values (the keys described by the INFO lines in the header).
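Because the body of a VCF is just tab-delimited text, the same tools we used on SAM files work here too. As a sketch (the filename `variants.vcf` is hypothetical, not part of the lesson data), you could strip the header and pull out a few of the fixed fields like this:

```bash
# drop all header lines (they start with '#') and keep CHROM, POS, REF and ALT
grep -v "^#" variants.vcf | cut -f 1,2,4,5
```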
#### README, TXT, MD, HTML, R, Python & limitless others
There are lots of text file types that are not specific to genomics, but that you'll end up using all the time. These can have any format at all, and any file extension; the only stipulation is that they be written in ASCII text. The text you're reading right now is a .md (markdown) file rendered as .html (so you can read it online). All of them can also be opened in a text editor (like Sublime Text). Technically, you can open just about anything in a text editor (like we did with the BAM file), but text format files can be meaningfully opened in a text editor. So if you have a new file, how do you tell what kind it is? The way we've been doing it is by opening each file and looking at it, but that won't always work. Once you have thousands of new files, you don't want to open each one and do an eyeball check. It's much better to let the computer do it for you:
```bash
file *
```
`file` is yet another program, and it does exactly what it sounds like: it tells you about files. The `*` is a 'wildcard', and works just like in playing cards. `*` specifically means 'any number of characters', but there are more specific wildcards too; for instance, we could ask for anything with a number in the name:
```bash
file *[0-9]*
```
Or anything that starts with a number:

```bash
file [0-9]*
```
but for now, let's look at all of them:

```bash
file *
```
Now we can see that most of our files are ASCII text, that the BAM file is actually gzip-compressed data, and lots of other details.
An aside: let's look at `AmandaShellHistory.txt`:

```
AmandaShellHistory.txt: ASCII English text, with CRLF line terminators
```
This one specifically points out the line terminators, that is, the invisible characters at the end of each line. For reasons not worth discussing, UNIX, Macs and Windows computers all use different invisible characters, and most of the time you can't tell the difference. If you open a file with UNIX line endings on your Windows machine, it will almost always look the way you expect, because most professional software can handle the difference seamlessly. Furthermore, most text files you get from a sequencing facility will have UNIX line terminators. However, some programs can't seamlessly handle unfamiliar line endings, and will display your text as if there aren't any (just one huge, never-ending line), which is a problem. And, if you open and edit text files on your Windows or Mac computer, some programs will save that file with the line endings that your operating system uses instead of UNIX ones, and that can cause all sorts of downstream problems as well. The take home lesson is that if a file looks right to you, but is giving you weird output, check the line endings.
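If `file` reports line terminators you don't want, converting is a one-liner. Here's a minimal sketch (the output filename is made up) that strips Windows-style carriage returns to leave UNIX line endings:

```bash
# delete every carriage-return character, writing a new UNIX-style copy
tr -d '\r' < AmandaShellHistory.txt > AmandaShellHistory_unix.txt

# confirm the fix
file AmandaShellHistory_unix.txt
```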
#### CSV & XLS

These files will all look the same when opened in Excel or a similar program; however, they are very different when viewed in a text editor:
```bash
head Ex_combined.csv
head SeqProductionSumm.xls
```
When given the choice, it's usually best to save spreadsheet-type data as a `.csv`. It won't keep formatting like colors and bolding, but it can be opened on just about any computer, and in any text editor. CSV files also tend to be much smaller than the same data saved as an Excel file.
#### FASTQC files

We've already seen FASTQC, but a FASTQC 'file' is actually a combination of two files: an `.html` and a `.zip` file. Generally the two files have the same name, just with different file extensions. The `.html` file is for viewing: it has all of the information about what the webpage should look like, but none of the actual data. It is designed to be looked at in a GUI, not the command line, and looking at it with `head` will just show us HTML code. All of the data is stored in the `.zip` file.
Exercise: Unzip `64_20081121_2_RH2_fastqc.zip` using the command line
Inside this archive are mostly text files of summary data. Let's look at a couple of potentially useful ones:

```bash
unzip 64_20081121_2_RH2_fastqc.zip
cd 64_20081121_2_RH2_fastqc/
ls
less fastqc_data.txt
```
This file has all of the numeric data that goes into creating the images in the `.html` file. That's useful for making your own plots, or if you want to `grep` out, say, the quality of base 30 in all of your sequencing runs. Press `q` to exit `less`.
```bash
ls Images/
file Images/*
```
These are all PNG files, which is a type of image file. If we open one with `head` (or `less`, or `more`, or any other text reader), we get nonsense output:
```bash
head Images/adapter_content.png
```
However, if we look at them in the GUI, they correspond to all of the images in the HTML document. These are useful if you want images for presentations or similar uses. Please DON'T take a screen-shot of the `.html` file in your browser; these PNGs are higher quality, and already there!
As mentioned above, FASTQC files can also be helpful if you don't know your sequence encoding. Just run the files through FASTQC: it checks through the above chart for you, makes its best guess as to which encoding best fits your data, and usually guesses pretty well. Let's look at an example. Here, we're going to look at the `.html` files in a browser, so use your normal file browser to navigate to the Genomics folder we've been looking at, and click on `33_20081121_2_RH2_fastqc.html`; in another window, also open `64_20081121_2_RH2_fastqc.html`. These two reports come from files that are exactly the same, except that I've (somewhat arbitrarily) changed all of the quality scores in the 64_ file to look like they came from a different kind of sequencer. However, you can see that FASTQC wasn't fooled, and shows us exactly the same graphs regardless.
This tutorial could never be a complete list; there are nearly as many file formats as there are bioinformatics packages, as everyone seems to invent their own. Most of the common ones are easily google-able, e.g. "bed file format", and the uncommon ones should at least be described in the documentation for the program that created them. Sequencing facilities are also often great resources for finding more information; UCSC, for instance, has a particularly good section on file formats.
Since most genomics files are just text, it's relatively easy to convert between them, and you could write your own file converter with a complicated enough find/replace command. However, LOTS of people have already written scripts that do this for you, and have made them available online. When you need to convert files, the easiest thing to do is simply Google the conversion, e.g. "sff convert to fastq"; this will almost always give you many options to choose from. You can also ask at your sequencing facility, as they have likely already written scripts to convert their output to popular formats. Here are a couple of links to groups that have made free conversion tools available for anyone to download: Bioinformatics at COMAV and khmer.
Use the `wget` command to download the contents of the ftp site (don't forget to use the '*' wildcard to download all files):

```bash
$ wget ftp://ftp.ensemblgenomes.org/pub/bacteria/release-27/fasta/bacteria_5_collection/escherichia_coli_b_str_rel606/dna/*
```
You should have downloaded the following files:
```
CHECKSUMS
Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna.chromosome.Chromosome.fa.gz
Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna.genome.fa.gz
Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna_rm.chromosome.Chromosome.fa.gz
Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna_rm.genome.fa.gz
Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna_rm.toplevel.fa.gz
Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna_sm.chromosome.Chromosome.fa.gz
Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna_sm.genome.fa.gz
Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna_sm.toplevel.fa.gz
Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna.toplevel.fa.gz
README
```
Use the `less` command to examine the `README` file - in particular, look at how the different kinds of files in this directory are named and described.
Generate a checksum for the `Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna.genome.fa.gz` file using the `sum` command (`sum` is used by Ensembl and is an alternative to `md5sum`), and compare it with the last few digits of the sum displayed in the CHECKSUMS file.
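A minimal sketch of that comparison (the `grep` pattern is just one convenient way to find the right line in CHECKSUMS):

```bash
# checksum the compressed genome
$ sum Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna.genome.fa.gz

# find the matching entry in the CHECKSUMS file
$ grep "dna.genome.fa.gz" CHECKSUMS
```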
Preview the first few lines (`head`) of the compressed (gzip'd) reference genome using the `zcat` command:

```bash
$ zcat Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna.genome.fa.gz | head
```
Uncompress the reference genome, and put it in a new `ref_genome` folder inside your `data` folder:

```bash
$ gzip -d Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna.genome.fa.gz
```
Tip: create the `ref_genome` folder in `~/dc_workshop/data` and use the `cp` command to copy the data.