I received some Illumina data from collaborators without knowing much about how it had been generated.

Inspecting the files, I found that the reads had already been demultiplexed and stripped of their barcodes. There were also paired reads for each sample. I wasn’t familiar with how to deal with this sort of data, but Robert Edgar has a discussion here, with Example 2 being the appropriate case:

It’s a simple matter to adapt his helpful solution for the multiple-file case, but I always find myself googling basic shell scripting, so here’s my version.

First we need to get a list of all of the sample names. Assuming that your file names are in the standard form ‘SampleName_L001_R1_001.fastq’, this can be done by the following:

ls *.fastq | awk -F '_L001' '{print $1}' | sort -u > sample_names.txt
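If you would rather not parse the output of ls, the same list can be built with a glob and shell parameter expansion. This is only a sketch: list_samples is a hypothetical helper name of my own, and it assumes the same ‘SampleName_L001_R1_001.fastq’ naming as above.

```shell
# Hypothetical helper: print one sample name per line for every
# forward-read file in the given directory (default: current directory).
# Assumes files are named SampleName_L001_R1_001.fastq.
list_samples() {
    dir="${1:-.}"
    for f in "$dir"/*_L001_R1_001.fastq; do
        [ -e "$f" ] || continue          # the glob matched nothing
        name="${f##*/}"                  # strip the directory part
        printf '%s\n' "${name%_L001_R1_001.fastq}"   # strip the suffix
    done
}

# Usage (writes the same kind of file as the one-liner above):
#   list_samples > sample_names.txt
```

Because it only looks at the R1 files, each sample is listed once without needing a sort or uniq step.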

Then loop through all the samples and, for each one: 1) merge the forward and reverse reads; 2) filter the merged reads; 3) add the barcodelabel=SampleName annotation to each read header; 4) append the reads to a single file.

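while read -r p; do
    echo "Processing reads for $p"
    # 1) merge the forward and reverse reads
    usearch61 -fastq_mergepairs "${p}_L001_R1_001.fastq" \
     -reverse "${p}_L001_R2_001.fastq" -fastqout "${p}_merged.fastq"
    # 2) quality-filter the merged reads and write them out as FASTA
    usearch61 -fastq_filter "${p}_merged.fastq" \
     -fastaout "${p}_filtered.fa" -fastq_maxee 1.0
    # 3) append ;barcodelabel=SampleName; to every header line
    sed -e 's/^>\(.*\)/>\1;barcodelabel='"$p"';/' \
     "${p}_filtered.fa" > "${p}.fa"
    # 4) append this sample's reads to the combined file
    cat "${p}.fa" >> reads.fa
done < sample_names.txt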
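
As a quick sanity check afterwards, it can be useful to count how many reads each sample contributed to the combined file. The function below is my own sketch, not part of Edgar's example: count_reads_per_sample is a hypothetical name, and it assumes headers carry a ;barcodelabel=SampleName; annotation as produced by the loop above.

```shell
# Hypothetical helper: print "SampleName<TAB>count" for every
# barcodelabel found in the headers of a FASTA file (default: reads.fa).
count_reads_per_sample() {
    fasta="${1:-reads.fa}"
    # Split each header on "barcodelabel="; the label is whatever follows,
    # up to the next semicolon.
    awk -F 'barcodelabel=' '
        /^>/ && NF > 1 { split($2, parts, ";"); n[parts[1]]++ }
        END { for (s in n) print s "\t" n[s] }
    ' "$fasta" | sort
}

# Usage:
#   count_reads_per_sample reads.fa
```

A sample that is missing from the output, or has a much lower count than the others, is worth checking at the merge and filter steps.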