Extracting interesting information from Fastq files with Bash

Let me start by stating something that everyone who knows me has heard over and over again: I love bash and am always trying to find ways to utilize it. I love how fast it is, and I love how you just have to open your terminal and start hacking away. I always use it to extract information rather than go through the arduous (at least for me) task of opening a website or a program or R or python… Yesterday a colleague came to me wanting to know what sequencing strategy had been used for his last run. I knew the information had to be stored somewhere (since this was an Illumina machine, it was probably stored in their BaseSpace service). However, the colleague was not aware of such things, and it would have taken some time to find out who had access to that information. So I just SSH’ed into our server to access the raw fastq files from that run.

For those of you who are not aware of it, fastq is a format for the storage of sequencing data (i.e. the DNA base composition of whatever you had sequenced, be it a whole genome, an exome, or a handful of genes). The file consists of many strings of a defined length made up of four letters (A, T, C, and G), plus some N’s, which indicate that the sequencer was not able to resolve the correct base at that position. Each string is called a read, and its length is set by the user who starts the run, depending on the aim of the experiment. In addition, the file contains metadata such as the type of sequencer, the flow cell ID, the coordinates of each molecule on the flow cell, and the quality score of each base in the read. These are distributed over four rows; in other words, each read occupies four rows in the file, which together encompass all of the above. You can read more about the fastq file format in this Wikipedia article.
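To make this concrete, here is what a single read looks like inside a fastq file (the values below are made up for illustration): an identifier line carrying the instrument name, run and flow cell IDs, and the coordinates on the flow cell; the sequence itself; a separator line; and the quality score of each base, encoded as one ASCII character per base:

@M00123:45:000000000-A1B2C:1:1101:15589:1332 1:N:0:1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65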

Unfortunately, finding out the run length is not as straightforward as taking a peek inside the file. For reasons that depend on, among other things, the fragment length, the file will include reads of varying lengths. This is what I get when I query a fastq file for the lengths of 1% of its reads:

| Read length (bp) | Count |
| ---------------- | ----- |
| 50 | 1 |
| 64 | 1 |
| 69 | 1 |
| 97 | 1 |
| 66 | 2 |
| 67 | 2 |
| 68 | 2 |
| 51 | 3 |
| 52 | 3 |
| 70 | 3 |
| 58 | 4 |
| 65 | 4 |
| 98 | 5 |
| 59 | 6 |
| 71 | 6 |
| 54 | 8 |
| 61 | 8 |
| 72 | 8 |
| 60 | 9 |
| 62 | 9 |
| 53 | 10 |
| 56 | 10 |
| 40 | 11 |
| 57 | 11 |
| 73 | 12 |
| 55 | 20 |
| 63 | 38 |
| 99 | 74 |
| 100 | 2,152 |
| 101 | 42,413 |
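For the curious, a pipeline along the following lines produced the table above; it uses the same sample tool that appears below, and uniq -c prints each distinct length next to the number of reads that have it:

gunzip -c file.fastq.gz | awk 'NR%4==2' | sample -r 1% | awk '{print length($0)}' | sort -n | uniq -c | sort -k1,1n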

As you can see, the read lengths vary from as small as 40 to as large as 101, which, incidentally, is the length of the run. Since we don’t need all this information and are only interested in the maximum read length, we can do something like this:

gunzip -c file.fastq.gz | awk 'NR%4==2' | sample -r 1% | awk '{print length($0)}' | stats | grep max

The output will show the following:

max: 101.000000000000

Thus, this is a 101-bp run. Whether it’s a single-end or paired-end run can be gleaned from the file names: if only _R1 files exist, it’s single-end, but if there are both _R1 and _R2 files, it’s paired-end.
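A quick way to check is to glob for both read files in the run folder; this assumes the standard Illumina naming scheme:

ls *_R{1,2}_001.fastq.gz

If the _R2 pattern matches nothing, the run was single-end.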

The above relies on the stats package, which can be downloaded from this website. There are other ways, of course, but I find this one requires the least amount of typing. The command unzips the file on the fly, keeps only the rows containing the sequence, randomly samples 1% of the reads, prints the length of each read, computes summary statistics, and prints only the maximum value.
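If you don’t have sample and stats installed, plain awk can do the same job on its own; this sketch scans every read instead of a 1% sample, which is slower but needs no extra tools:

gunzip -c file.fastq.gz | awk 'NR%4==2 { if (length($0) > max) max = length($0) } END { print "max: " max }'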

We can also determine the base composition and quality of a whole file. The scenario came up that some colleagues wanted to know whether we had had any problems with the demultiplexing of our samples on the MiSeq. Now, I had previously hacked the MiSeq so that for every run it would generate not only the normal reads but also fastq files of the indices that were used for multiplexing. I had checked the composition of those files with the following command:

for file in *_L001_I{1,2}_001.fastq.gz; do echo $file; gunzip -c ${file} | awk 'NR%4==2' | sort | uniq -c; done

This showed that the demultiplexing had gone as expected, with the greatest number of reads carrying the expected index sequences. Because the demultiplexing program allows one mismatch, sequences that are one base away from a correct index were also included, but in far smaller numbers. I had checked the Hamming distances between all the indices included in our run, and they had a minimum distance of 3 bases, so there was no chance of a mix-up there. But the question arose as to whether the machine had had trouble reading the index sequences in the first place. This would be reflected in the Phred scores of the bases, which, as mentioned previously, are stored in the fastq file. At least 3 bases would need to be of low quality to cause problems with the demultiplexing, so running the following should do the trick:

for file in *_L001_I{1,2}_001.fastq.gz; do echo $file; gunzip -c ${file} | awk 'NR%4==0' | grep -c -E '([!-4].*){3}'; done

The grep call counts the quality lines that contain at least 3 bases with a Phred score below 20. This assumes we are working with the Phred+33 encoding, in which scores below 20 are represented by the ASCII characters from ! through 4; that is exactly the range the bracket expression [!-4] covers. The above will output, for each index file, the number of indices that have at least 3 bases with a Phred score below 20. We then only need the total number of indices (easy enough to get) to calculate what percentage of the indices is of low quality.
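In fact, awk can count the low-quality indices and the totals in one pass and print the percentage directly. A minimal sketch, assuming the same file-naming scheme as above (gsub returns the number of characters it replaces, i.e. the number of bases below Q20 in each quality line):

for file in *_L001_I{1,2}_001.fastq.gz; do echo ${file}; gunzip -c ${file} | awk 'NR%4==0 { total++; if (gsub(/[!-4]/, "") >= 3) low++ } END { printf "%d of %d indices (%.2f%%) are of low quality\n", low, total, 100 * low / total }'; done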

Easily Back Up Folders to a USB Flash Drive on Mac

I have a problem and am ready to admit it: I hate creating regular backups of my data. I know that it is a crime in this day and age not to have regular backups (and several copies at that) of your data. I own a MacBook Pro and have set up Time Machine on a 1-TB external HDD, which I use to back up my whole MacBook once a year, namely before I upgrade to the latest macOS in September. So why don’t I do it more often? Well, for one, I use my MacBook for work and my HDD is at home. (I can hear the eyes rolling out there: “why don’t you just move your HDD to your office?” Well, this brings me to the second and third points.) Point two: it takes a lot of time to create snapshots of my hard drive. Actually, the data I really care about are located in one folder that I use for work; everything else I don’t need to back up regularly. Point three: I don’t want my backups to stay in the office, since they can be stolen. What I would like is to have a folder mirrored on a server somewhere, updated in real time. There used to be a perfect solution, Bitcasa, but they shut down. Since then, I haven’t been able to find a replacement, so I thought I’d recreate the Bitcasa experience myself on a local server.

Unfortunately, that turned out to be easier said than done. I have a 4-TB local server at home, and I installed ownCloud on it. I can easily access the thing while at home, but opening up the ports to the outside world was not possible, as I do not control the firewall. That is the way things stood until a few days back, when we suffered several break-ins at work and computers were stolen. This got me thinking: had my computer been among those stolen, I would have lost invaluable data that I would not have been able to recover. There is nothing more sobering than having your worst-case scenario play out in front of you. I could not put this off any longer and needed a solution.

I can put the solution in two words: RSync and Automator. Automator is a cool tool found on Macs that makes it unbelievably easy to automate complex tasks with just a few clicks of a mouse. RSync is a program that comes preinstalled on many Unix systems. The workflow I will describe watches for your USB flash drive being mounted and then checks whether any changes have been made to a certain folder. If so, the changes get synchronized to your USB flash drive.

Open Automator and start a new Folder Action

Under ‘Folder Action receives files and folders added to:’, select ‘Volumes’. This will then monitor the mounting of any external volume, be it an external HDD, a flash drive, or even a network drive. You can drag-and-drop a ‘Display Notification’ action and write whatever you want there (for example: ‘A new volume has been mounted’). Now comes the meat of the workflow:

Type in the following under ‘Run Shell Script’

# Compare a recursive listing of the flash drive with one of the folder to back up
ls -R /Volumes/INTENSO/ > /tmp/a
ls -R /Users/User/Documents/Folder/ > /tmp/b
DIFF=$(diff /tmp/a /tmp/b)
# Only sync if the listings differ and the drive is actually mounted
if [ "$DIFF" != "" ] && [ -e /Volumes/INTENSO ]; then
    # -a preserves permissions and timestamps, -E copies extended attributes (macOS rsync),
    # and --delete removes files from the drive that were deleted from the source folder
    rsync -aE --delete /Users/User/Documents/Folder/ /Volumes/INTENSO/
fi

You’ll want to change ‘INTENSO’ to whatever name your USB flash drive has. I would suggest you rename your drive to something unique so that your data doesn’t get copied onto a stranger’s generically named drive. The script checks whether your flash drive is mounted and whether any difference exists between its contents and those of your folder of interest (don’t forget to put the path to that folder in place of ‘/Users/User/Documents/Folder’); if so, it runs RSync to copy the changed files and folders to your drive. It will also delete from the drive any files and folders that were deleted from your source folder. If nothing has changed between your source folder and your target folder, nothing is copied.
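Before trusting the workflow with real data, it is worth doing a dry run from the Terminal first: rsync’s -n flag makes it list what it would copy or delete without actually touching anything:

rsync -aEn --delete /Users/User/Documents/Folder/ /Volumes/INTENSO/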

You can now add another notification that tells you when the script has finished running, and you’re done! Just save this Automator workflow, and the next time you plug in your flash drive you’ll get an instant backup of your most important data. You’re welcome!

Note: I am posting this after getting positive feedback from colleagues who were looking to do something similar. Unfortunately, as I said earlier, this only works on Macs due to its reliance on Automator. I will try to put together something similar for Windows machines; if anyone has something that is easy to implement on Windows, I would be very interested to hear from you.

Hello world!

Welcome to your new Blog! We’re really excited to see what you do with it.

This draft post is here to show you what your posts will look like and to give you a few tips on getting started. Feel free to edit it, delete it or keep it saved as a draft for reference later.

Publishing

If you’re familiar with WordPress, you’ll be right at home. To get started creating your own posts head to your Dashboard and click Add New to bring up the editor. Fill it up with whatever you choose; it could be a recipe, a review of a new product you love, or simply a new idea that needs to be shared with the world. The world is your oyster.

Hit Publish and that’s it – your post will be live and ready for reading.

The new post will be included in the Reader of other members and may also make an appearance on the Community front page (vivaldi.net).

P.S. Don’t forget to share your new creation far and wide! Tag Vivaldi (on Twitter or Facebook) and we’ll help you spread the word about your new blog.

Customization

There are a number of ways to customize the look of your new Blog. Head to your site’s Admin Dashboard to adjust the theme, site icon, header images, page layouts, custom widgets and much more. Many of these settings can be found in the Appearance menu.

For the more technically savvy out there, you can of course also use custom CSS to make things just right. To add custom CSS, head to Appearance > Customize.

Import

To import content from another blog, select Tools > Import from the menu in your dashboard. Right now there are importers for WordPress, Blogger and Tumblr. If you’d like to import content from another service, let us know!

FAQ

What is the Vivaldi Community?

A place for our friends to hang out online. We want to create a place where people can publish, read and discuss ideas with likeminded folks from around the world. We hope you like it.

Do I have to use Vivaldi’s browser to be here?

No. Many Community members use our browser. But many don’t. Everyone is welcome.

What’s included?

Every member gets a free webmail account (yourname@vivaldi.net), access to the Vivaldi Forums and a free Blog with a custom domain (yourblog.vivaldi.net).

What’s the catch?

We have no plans to monetize, share your data or start charging for any of these services. The Community is simply a way for us to give back something to our users. No catch.

Help and Feedback

Help articles for the Community can be found at help.vivaldi.com. If something seems off or you run into a bug, please let us know by using our contact form or leaving a comment in the forum.

Have a read of our Terms of Use and Privacy Policy and let us know if you have any questions.

Enjoy, and welcome!