Bioinformatics Notes for Major League Dummies

Mäneka, Tue Sep 24 2024 • notes

I've never had any formal bioinformatics training. Instead I've suffered through hours of trial and error to accomplish what I now realize are simple tasks. This is a growing list of quick notes I've made for myself.

Convert a fastq to fasta file

If you are interested in analyzing DNA sequence data, such as tandem repeats or satellites, the necessary software will likely operate on fasta files. Examples include RepeatMasker and the nucleotide-level functions from EMBOSS. Converting from raw fastq is simple provided you have the necessary tools installed, as our cluster does. There are several options, but I have used seqtk. I can run the following bash script on our cluster:

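# Load seqtk (the module name and version are specific to our cluster)
module load seqtk/1.3
# -a forces fasta output, discarding the quality lines
seqtk seq -a path/to/fastq.gz > path/to/output.fasta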

For me this took about 30 minutes to run on a single fastq.gz file of ~70 GB.
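If seqtk is not available, the same idea can be sketched with awk. This is not what I ran on the cluster, just a minimal illustration of what the conversion does, assuming every fastq record is exactly four lines (header, sequence, plus line, qualities). The file names demo.fastq and demo.fasta and the two reads are made up for the example.

```shell
# Build a tiny two-read fastq file for illustration (made-up reads).
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > demo.fastq

# For each 4-line record, keep line 1 (header, with '@' swapped for '>')
# and line 2 (sequence); the '+' line and the quality line are dropped.
awk 'NR % 4 == 1 { print ">" substr($0, 2) }
     NR % 4 == 2 { print }' demo.fastq > demo.fasta

cat demo.fasta
```

For a gzipped input you could pipe it in with gzip -dc path/to/fastq.gz | awk '...' instead of naming the file. For real data I would still reach for seqtk, which reads gzipped files directly.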

Don't forget that fasta files converted from raw reads carry no genome location data. If you need to know where a sequence maps, you must use samtools or similar to work with the files (likely BAM) that store this alignment information.

Slurm

For cluster computing, Slurm is the go-to workload manager. Sometimes it is useful to run things interactively rather than in batch mode, for example when debugging.

In the shell, presumably on a login node:

srun -n3 -t06:00:00 --pty bash

This will get you an interactive bash shell for 6 hours (-t06:00:00) with 3 tasks (-n3), which Slurm allocates as one core each by default, so 3 cores. Modify as necessary.
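For comparison, the batch-mode counterpart is a job script submitted with sbatch. The sketch below writes a hypothetical script, fastq_to_fasta.sbatch, that wraps the seqtk conversion from the previous section. The job name, log file name, and single-task request are assumptions for illustration, not settings our cluster requires.

```shell
# Write a minimal Slurm job script (file name and job name are hypothetical).
cat > fastq_to_fasta.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=fq2fa
#SBATCH --ntasks=1
#SBATCH --time=06:00:00
#SBATCH --output=fq2fa_%j.log

# Same commands as the interactive conversion; %j above expands to the job ID.
module load seqtk/1.3
seqtk seq -a path/to/fastq.gz > path/to/output.fasta
EOF

# Submit from a login node with:
#   sbatch fastq_to_fasta.sbatch
```

Once the commands work in the interactive shell, moving them into a script like this means you do not have to keep a terminal open for the whole run.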

Conda

My training is in statistics, so I'm most familiar with R. While there is a wealth of bioinformatics tools on Bioconductor, some familiarity with Python is necessary too. For some reason, I found the concept of Python environments confusing, all the more so with the particularities of cluster computing thrown in.

I think of a conda environment as all the packages you would need to load at the top of your R script or R markdown file, except that you have neatly put them all into one environment, which you can load with one simple call.
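To make the analogy concrete, an environment can be described in a small YAML file, much like a block of library() calls. The sketch below writes a hypothetical environment.yml. The environment name myenvname and the two packages are only examples, and the channel list assumes the packages come from conda-forge and Bioconda.

```shell
# Write an example environment file (contents are illustrative assumptions).
cat > environment.yml <<'EOF'
name: myenvname
channels:
  - conda-forge
  - bioconda
dependencies:
  - repeatmasker
  - seqtk
EOF

# With conda installed, build and load the environment with:
#   conda env create -f environment.yml
#   conda activate myenvname
```

The file doubles as a record of what the analysis needs, so the environment can be rebuilt on another machine.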

Installing RepeatMasker

The documentation for this software leaves something to be desired, for n00bs anyway. Credit to this blog post for a detailed explanation of installation. However, I ran into an issue on my cluster when building this way. I found it easier, faster, and just as effective to install via Bioconda, as outlined here.

I had already installed Miniconda in a custom directory, so I added mamba and created an environment containing RepeatMasker with

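# Get miniconda installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Execute installer (when prompted, set the install location to match the paths below)
bash Miniconda3-latest-Linux-x86_64.sh

# Install mamba into the base environment
/ix/<group>/<user>/custom_miniconda/bin/conda install mamba -n base -c conda-forge
# Install RepeatMasker into base (not needed if you only use the dedicated environment below)
/ix/<group>/<user>/custom_miniconda/bin/mamba install -c conda-forge -c bioconda repeatmasker
# Create a dedicated environment containing RepeatMasker (a Bioconda package, so name the channels)
/ix/<group>/<user>/custom_miniconda/bin/mamba create --name myenvname -c conda-forge -c bioconda repeatmasker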

Once the environment, myenvname, is set up, activate it:

source /ix/<group>/<user>/custom_miniconda/bin/activate myenvname

Then you can call RepeatMasker (or its help menu, as below):

RepeatMasker -h

Installing this way avoided the RepeatMasker::createLib() error I got with the manual installation.