Running Offline Read Until Scripts

ampbalance.py

Ampbalance is a script which will map entire 2D reads to a reference sequence, identify those likely to give a good 2D read by determining if the template and complement reads overlap with one another appropriately and subsequently copy a defined number of those reads up to a specific depth into a folder for further analysis. This script was originally written to facilitate the sequencing and analysis of Ebola virus and to simplify the transfer of read data prior to base calling where network bandwidth was limited (see Quick, J., Loman, N. J., Duraffour, S., Simpson, J. T. & Severi, E. Real-time, portable genome sequencing for Ebola Surveillance. Nature, doi:10.1038/nature16996 (2016).)

Unlike read until applications this script uses the entire read and so requires both the template and complement models.

The image below illustrates how it works. Note that event data flow along blue lines.

An image illustrating the ampbalance Workflow

Here the script is implemented on reads that have been basecalled to facilitate testing and comparisons.

First navigate to the RUscripts folder in a windows command terminal:

cd \path\to\RUscripts

To print the ampbalance.py help statement at the prompt ($) type:

python ampbalance.py -h

which will output:

usage: ampbalance.py [-h] -fasta FASTA -ids IDS -w WATCHDIR -o TARGETPATH -d
                     DEPTH -procs PROCS [-cautious] [-l LENGTH] -t TEMP_MODEL
                     -c COMP_MODEL [-v] [-ver]

ampbalance: A program designed to balance amplicons from a specific reference
sequence post sequencing on ONT minIONs but prebasecalling. Developed by Matt
Loose @mattloose or matt.loose@nottingham.ac.uk for help! Args that start with
'--' (eg. --reference_fasta_file) can also be set in a config file
(/path/to/your/script/amp.config or ) by using .ini or
.yaml-style syntax (eg. reference_fasta_file=value). If an arg is specified in
more than one place, then command-line values override config file values
which override defaults.

optional arguments:
  -h, --help            show this help message and exit
  -fasta FASTA, --reference_fasta_file FASTA
                        The fasta format file for the reference sequence for
                        your organism.
  -ids IDS, --reference_amplicon_positions IDS
                        A file containing a list of amplicon positions defined
                        for the reference sequence. 1 amplicon per line in the
                        format fasta_sequence_name:start-stop e.g
                        J02459:27-1938
  -w WATCHDIR, --watch-dir WATCHDIR
                        The path to the folder containing the downloads
                        directory with fast5 reads to analyse - e.g.
                        C:\data\minion\downloads (for windows).
  -o TARGETPATH, --output-dir TARGETPATH
                        The path to the destination folder for the
                        preprocessed reads
  -d DEPTH, --depth DEPTH
                        The desired coverage depth for each amplicon. Note
                        this is unlikely to be achieved for each amplicon and
                        should probably be an overestimate of the minimum
                        coverage required.
  -procs PROCS, --proc_num PROCS
                        The number of processors to run this on.
  -cautious, --cautious
                        DTW of long reads on low memory systems can cause
                        unexpected crashes. This option will prevent automatic
                        skipping on any reads over 10,000 events. You can
                        optionally increase this length with the -l parameter.
                        USE WITH CAUTION AS THIS MAY CAUSE A SYSTEM TO CRASH.
  -l LENGTH, --length LENGTH
                        A limit on the length of read that ampbalance will
                        attempt to align using DTW - Long reads can cause
                        problems on low memory systems
  -t TEMP_MODEL, --template_model TEMP_MODEL
                        The appropriate template model file to use
  -c COMP_MODEL, --complement_model COMP_MODEL
                        The appropriate complement model file to use
  -v, --verbose-true    Print detailed messages while processing files.
  -ver, --version       show program's version number and exit

This script is designed to match amplicon sequences of known and approximately uniform length to a reference sequence. The reference sequence should be a single sequence.

We provide an example set of reads using lambda, found in the ampbalancetest folder. A typical usage command would be:

python ampbalance.py -fasta J02459.fasta -ids lambda_amplicons.txt -w ampbalancetest -o ampbalanceoutput -d 5 -procs 4 -t template_r7.3_e6_70bps_6mer_6.model -c complement_r7.3_e6_70bps_6mer_6.model -l 3000

This will output the following:

Reading amplicons
******AMP DICTIONARY*******
<type 'dict'>
{1: 52, 2: 2065, 3: 4070, 4: 6059, 5: 8012, 6: 10008, 7: 12006, 8: 14011, 9: 16076, 10: 18022, 11: 20053}
{'DO': 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 'BF': 0, 'HF': 0, 'NH': 0}
Now we are going to try and open the raw reads and do the same as we have done above...
{'DO': 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 'BF': 0, 'TF': 109, 'HF': 4, 'NH': 0}
{'DO': 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 1, 8: 0, 9: 0, 10: 0, 11: 0, 'BF': 0, 'TF': 108, 'HF': 5, 'NH': 0}
.
. (many more lines here)
.
{'DO': 0, 1: 10, 2: 10, 3: 10, 4: 10, 5: 9, 6: 10, 7: 10, 8: 10, 9: 10, 10: 10, 11: 10, 'BF': 0, 'TF': 1, 'HF': 110, 'NH': 0}
{'DO': 0, 1: 10, 2: 10, 3: 10, 4: 10, 5: 10, 6: 10, 7: 10, 8: 10, 9: 10, 10: 10, 11: 10, 'BF': 0, 'TF': 0, 'HF': 110, 'NH': 0}
Amplicon Read Counts
Amplicon Number: 1 Reads: 10
Amplicon Number: 2 Reads: 10
Amplicon Number: 3 Reads: 10
Amplicon Number: 4 Reads: 10
Amplicon Number: 5 Reads: 10
Amplicon Number: 6 Reads: 10
Amplicon Number: 7 Reads: 10
Amplicon Number: 8 Reads: 10
Amplicon Number: 9 Reads: 10
Amplicon Number: 10 Reads: 10
Amplicon Number: 11 Reads: 10
Copying Amplicon Data
Amplicon Number 1
Amplicon Number 2
Amplicon Number 3
Amplicon Number 4
Amplicon Number 5
Amplicon Number 6
Amplicon Number 7
Amplicon Number 8
Amplicon Number 9
Amplicon Number 10
Amplicon Number 11

Key: DO: Template and Complement Strands don't overlap on amplicons as expected. Read Discarded. Unlikely to generate good 2D. 1...11: Amplicon numbers followed by current coverage counts satisfying highest stringency tests. BF: Bad Files - these files contain more events than the -length threshold. They are unlikely to be of interest in an amplicon run and are skipped unless the -cautious flag is set. TF: Total Files - this counts down the number of files left to process. HF: Hairpin Found - this is the number of reads containing a hairpin - these are tested. NH: No Hairpin - these reads are ignored as no hairpin has been found.

This program will output files to the specified output directory. Files which are already basecalled will be written to a subfolder called "Downloads". Raw files will just be written to the specified output directory.

Note that the example read set here has been preselected to contain 10 reads mapping to each amplicon and all the reads are 2D - thus the HF value is 110 and the DO value is 0.

It is also important to note that this script uses the hairpin_found flag in the read files. Since minKNOW version 0.51.1.51 (released February 22nd 2016) the writing of this flag to read files has been problematic. Thus reads generated with this version of minKNOW may appear to have far fewer 2D reads than they really do.

ampliconSPLIT.py

ampliconSPLIT.py simulates read until on either raw or basecalled reads. It ignores the first 50 events of a read and matches the subsequent 250 events against a reference squiggle. Read until only ever processes template data and so only a template model is required.

The ampliconSPLIT workflow is illustrated below. Note that unlike ampbalance, ampliconSPLIT will separate read files into a directory per amplicon.

An image illustrating the ampliconSPLIT Workflow

This script will process reads into subfolders in the targetpath corresponding to each amplicon described in the -ids file - e.g /targetpath/{amplicon_number}. If the read is basecalled, it will be placed in a downloads subfolder - e.g /targetpath/{amplicon_number}/downloads .

First navigate to the RUscripts folder in a windows command terminal:

cd \path\to\RUscripts

To print the ampliconSPLIT.py help statement at the prompt ($) type:

python ampliconSPLIT.py -h

which will output:

usage: ampliconSPLIT.py [-h] -fasta FASTA -ids IDS -w WATCHDIR -o TARGETPATH
                    -d DEPTH -procs PROCS -t TEMP_MODEL [-v] [-ver]

ampliconSPLIT: A program designed to identify and group individual amplicons
from minION reads prior to base calling. The depth setting limits the number
of reads copied to each sub folder. Developed by Matt Loose @mattloose or
matt.loose@nottingham.ac.uk for help! Args that start with '--' (eg.
--reference_fasta_file) can also be set in a config file
(/Users/mattloose/fixes/RUscripts/UPDATES/amp.config or ) by using .ini or
.yaml-style syntax (eg. reference_fasta_file=value). If an arg is specified in
more than one place, then command-line values override config file values
which override defaults.

optional arguments:
-h, --help            show this help message and exit
-fasta FASTA, --reference_fasta_file FASTA
                    The fasta format file for the reference sequence for
                    your organism.
-ids IDS, --reference_amplicon_positions IDS
                    A file containing a list of amplicon positions defined
                    for the reference sequence. 1 amplicon per line in the
                    format fasta_sequence_name:start-stop e.g
                    EM_079517:27-1938
-w WATCHDIR, --watch-dir WATCHDIR
                    The path to the folder containing the downloads
                    directory with fast5 reads to analyse - e.g.
                    C:\data\minion\downloads (for windows).
-o TARGETPATH, --output-dir TARGETPATH
                    The path to the destination folder for the
                    preprocessed reads
-d DEPTH, --depth DEPTH
                    The desired coverage depth for each amplicon. Note
                    this is unlikely to be achieved for each amplicon and
                    should probably be an overestimate of the minimum
                    coverage required.
-procs PROCS, --proc_num PROCS
                    The number of processors to run this on.
-t TEMP_MODEL, --template_model TEMP_MODEL
                    The appropriate template model file to use. This file
                    can be generated uing the getmodels.py script.
-v, --verbose-true    Print detailed messages while processing files.
-ver, --version       show program's version number and exit

An example run command using test data:

python ampliconSPLIT.py -fasta J02459.fasta -ids lambda_amplicons.txt -w RUtestset/ -o RUtestsetOUT -d 10 -procs 8 -t template_r7.3_e6_70bps_6mer_6.model

Which will output:

processing the reference fasta.
ID J02459
length 48502
FORWARD STRAND
REVERSE STRAND
Groking amplicons
******AMP DICTIONARY*******
<type 'dict'>
{1: 52, 2: 2065, 3: 4070, 4: 6059, 5: 8012, 6: 10008, 7: 12006, 8: 14011, 9: 16076, 10: 18022, 11: 20053}
{'DO': 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 'BF': 0, 'HF': 0, 'NH': 0}
We want to build a custom reference that is smaller than the original reference.
First get a list of all the positions we will need to search
Generating a custom fasta
J02459
processing the custom fasta
Attempting to match reads and split into folders based on 250 events, excluding the first 50.
Amplicon Read Counts
    Amplicon Number: 1 Reads: 4
    Amplicon Number: 2 Reads: 5
    Amplicon Number: 3 Reads: 4
    Amplicon Number: 4 Reads: 5
    Amplicon Number: 5 Reads: 6
    Amplicon Number: 6 Reads: 6
    Amplicon Number: 7 Reads: 5
    Amplicon Number: 8 Reads: 5
    Amplicon Number: 9 Reads: 5
    Amplicon Number: 10 Reads: 5
    Amplicon Number: 11 Reads: 5
Copying Amplicon Data
Amplicon Number 1
Amplicon Number 2
Amplicon Number 3
Amplicon Number 4
Amplicon Number 5
Amplicon Number 6
Amplicon Number 7
Amplicon Number 8
Amplicon Number 9
Amplicon Number 10
Amplicon Number 11

The depth parameter (-d) sets the number of reads that will be copied. Setting this to a value greater than the number of reads analysed will sort all reads in the dataset.

test_gReadUntil.py

*** Note this script is located with the ReadUntil folder of this repository

This script runs entirely independently of the read until API and allows for simulation of selective sequencing of a genome. You can provide a pool of reads and the script will use the first 250 events to map a read and copy those mapping within the desired area to a specified folder.

A schematic of the workflow is provided below:

An image illustrating the test_gReadUntil Workflow

First navigate to the ReadUntil folder within the RUscripts folder in a windows command terminal:

cd \path\to\RUscripts\ReadUntil

Help is available by typing:

python test_gReadUntil.py -h

which will output:

***********************************************************************************************
**** This code will open a collection of reads and simulate read until on them. It will    ****
**** copy reads into a secondary folder for subsequent processing by another analysis      ****
**** package.                                                                              ****
***********************************************************************************************
usage: test_gReadUntil.py [-h] -fasta FASTA -targets [TARGETS [TARGETS ...]]
                      -procs PROCS -m TEMP_MODEL [-log LOGFILE] -w
                      WATCHDIR [-length LENGTH] [-o OUTPUT_FOLDER] [-v]
                      [-ver]

real_read_until: A program providing read until with the Oxford Nanopore
minION device. This program will ultimately be driven by minoTour to enable
selective remote sequencing. This program is heavily based on original code
generously provided by Oxford Nanopore Technologies.

optional arguments:
-h, --help            show this help message and exit
-fasta FASTA, --reference_fasta_file FASTA
                    The fasta format file describing the reference
                    sequence for your organism.
-targets [TARGETS [TARGETS ...]]
                    Positional IDs to enrich for in the form seqid:start-
                    stop . Can be space seperated eg: J02459:10000-15000
                    J02459:35000-40000
-procs PROCS, --proc_num PROCS
                    The number of processors to run this on.
-m TEMP_MODEL, --model TEMP_MODEL
                    The appropriate template model file to use
-log LOGFILE, --log-file LOGFILE
                    The name of the log file that data will be written to
                    regarding the decision made by this program to process
                    read until.
-w WATCHDIR, --watch-dir WATCHDIR
                    The path to the folder containing the downloads
                    directory with fast5 reads to analyse - e.g.
                    C:\data\minion\downloads (for windows).
-length LENGTH, --library-length LENGTH
                    Provide the average expected length of your library.
                    This offset will be applied to reads that are likely
                    to extend into your region of interest on either
                    strand.
-o OUTPUT_FOLDER, --output OUTPUT_FOLDER
                    Path to a folder to symbolically place reads
                    representing match and not match.
-v, --verbose-true    Print detailed messages while processing files.
-ver, --version       show program's version number and exit

A typical command line to select reads mapping from 10-15kb in the lambda genome would be:

python test_gReadUntil.py -fasta ../J02459.fasta -targets J02459:10000-15000 -procs 4  -m ../template_r7.3_e6_70bps_6mer_6.model -w ../RUtestset/ -o RUgOUT -length 3000

This would give the following output:

***********************************************************************************************
**** This code will open a collection of reads and simulate read until on them. It will    ****
**** copy reads into a secondary folder for subsequent processing by another analysis      ****
**** package.                                                                              ****
***********************************************************************************************

Checking Files...

All OK.

J02459
processing the reference fasta.
ID J02459
length 48502
FORWARD STRAND
REVERSE STRAND
J02459 R 19692
../RUtestset/llssbzms2p35x_20151004_readuntiludududududu_RU21_lambdaPCR_2922_1_ch10_file49_strand.fast5 No Match
J02459 F 4084
../RUtestset/llssbzms2p35x_20151004_readuntiludududududu_RU21_lambdaPCR_2922_1_ch10_file69_strand.fast5 No Match
.
.
.
J02459 R 11844
../RUtestset/llssbzms2p35x_20151004_readuntiludududududu_RU21_lambdaPCR_2922_1_ch8_file58_strand.fast5 Sequence Found
J02459 R 21791
../RUtestset/llssbzms2p35x_20151004_readuntiludududududu_RU21_lambdaPCR_2922_1_ch8_file11_strand.fast5 No Match
J02459 F 9998
../RUtestset/llssbzms2p35x_20151004_readuntiludududududu_RU21_lambdaPCR_2922_1_ch9_file100_strand.fast5 Sequence Found

Sequence found indicates that the read is derived from the desired region. No Match indicates that the read is from another region. The line "J02459 F 9998" indicates the read was matched to sequence ID J02459 on the forward strand at position 9998.

This script will output reads to the output folder in two separate directories, named "reject" and "sequence". Files which would have been rejected will be placed in subfolders within the "reject" folder and those sequenced within the "sequence" folder.