Sequentially, each sequency is either assinged to an existing cluster or is classified as a new cluster representative if no matching cluster can be found.

runClustering(cdhit_path, sequences, out_dir, identity_cutoff,
  length_cutoff, wordlength, map, write_fastas = FALSE, optional = "")

Arguments

cdhit_path

Path to cd-hit-est executable

sequences

Vector of sequences in FASTA style generated by the sequencesAsFasta

out_dir

Directory to save output files of clustering

identity_cutoff

Sequence identity cutoff used for clustering

length_cutoff

Length difference cutoff

wordlength

CD-Hit word length

map

A data frame with sequences as row names and sequence identifiers in first column. Can be generated by createMap

write_fastas

Boolean that indicates whether a fasta file will be generated for each cluster

optional

Optional execution parameters

Value

A data frame with the columns 'SequenceID' and 'ClusterID' assigning each sequence to a cluster of similar sequences via their identifiers. Additionally, a file CD-HIT.fa, CD-HIT.fa.clstr and a folder Clusters is generated in the given output directory. The CD-HIT.fa file is the FASTA file of all cluster representatives. The CD-HIT.fa.clustr file lists all identified clusters and the assigned sequence identifiers together with the percentage of overlapping sequence with the cluster representative. In the Clusters directory there is a FASTA file for each cluster.