Run CD-HIT clustering

Sequences are processed one at a time: each sequence is either assigned to an existing cluster or, if no matching cluster is found, becomes the representative of a new cluster.
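
The sequential assignment described above can be illustrated with a small, self-contained sketch. This is not CD-HIT's implementation: the function name greedy_cluster is hypothetical, and the identity measure (one minus the edit distance divided by the longer length, via utils::adist) is a simple stand-in for CD-HIT's word filtering and alignment. The sketch processes the longest sequences first, which is how CD-HIT orders its input.

```r
# Toy greedy incremental clustering (illustration only, not CD-HIT itself).
# 'seqs' is a named character vector; names act as sequence identifiers.
greedy_cluster <- function(seqs, identity_cutoff = 0.9) {
  ord <- order(nchar(seqs), decreasing = TRUE)  # longest sequences first
  reps <- character(0)                          # representatives found so far
  cluster <- integer(length(seqs))

  for (i in ord) {
    assigned <- FALSE
    for (k in seq_along(reps)) {
      # Stand-in identity: 1 - edit distance / length of the longer sequence
      d <- drop(utils::adist(seqs[i], reps[k]))
      identity <- 1 - d / max(nchar(seqs[i]), nchar(reps[k]))
      if (identity >= identity_cutoff) {
        cluster[i] <- k          # join the first cluster that matches
        assigned <- TRUE
        break
      }
    }
    if (!assigned) {
      reps <- c(reps, seqs[i])   # no match: start a new cluster
      cluster[i] <- length(reps)
    }
  }

  data.frame(SequenceID = names(seqs), ClusterID = cluster,
             stringsAsFactors = FALSE)
}

toy <- c(s1 = "ACGTACGTACGT", s2 = "ACGTACGTACG", s3 = "TTTTGGGGCC")
result <- greedy_cluster(toy, identity_cutoff = 0.9)
print(result)
```

Because each sequence is compared only against the current representatives, the outcome depends on processing order, which is why the sketch sorts by length before clustering.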

runClustering(cdhit_path, sequences, out_dir, identity_cutoff,
  length_cutoff, wordlength, map, write_fastas = FALSE, optional = "")

Arguments

cdhit_path	Path to cd-hit-est executable
sequences	Vector of sequences in FASTA style, as generated by sequencesAsFasta
out_dir	Directory to save output files of clustering
identity_cutoff	Sequence identity cutoff used for clustering
length_cutoff	Length difference cutoff
wordlength	CD-HIT word length
map	A data frame with sequences as row names and sequence identifiers in the first column. Can be generated by createMap
write_fastas	Boolean that indicates whether a FASTA file is written for each cluster
optional	Optional execution parameters
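
As a rough sketch of how these arguments fit together, the inputs can be built by hand and passed to runClustering. Several details here are assumptions rather than documented behaviour: the executable path, the identifiers, the column name SequenceID in map, the exact layout of the FASTA-style vector (a ">id" header followed by the sequence), and the cutoff values are all illustrative. In practice sequencesAsFasta and createMap would normally produce sequences and map. The call is guarded so the snippet runs even when the package or the cd-hit-est binary is not available.

```r
# Toy input: three nucleotide sequences with hypothetical identifiers
ids  <- c("seq1", "seq2", "seq3")
seqs <- c("ACGTACGTACGTACGTACGTACGT",
          "ACGTACGTACGTACGTACGTACGA",
          "TTTTGGGGCCCCAAAATTTTGGGG")

# 'map' as described above: sequences as row names, identifiers in column 1.
# The column name is an assumption; only its position is documented.
map <- data.frame(SequenceID = ids, row.names = seqs,
                  stringsAsFactors = FALSE)

# FASTA-style vector (assumed layout; sequencesAsFasta may differ)
sequences <- paste0(">", ids, "\n", seqs)

cdhit_path <- "/usr/local/bin/cd-hit-est"      # hypothetical location
out_dir    <- file.path(tempdir(), "cdhit_out")

# Only attempt the call when both the function and the binary are present
if (exists("runClustering") && file.exists(cdhit_path)) {
  dir.create(out_dir, showWarnings = FALSE)
  clusters <- runClustering(cdhit_path, sequences, out_dir,
                            identity_cutoff = 0.9,  # illustrative value
                            length_cutoff = 0.9,    # illustrative value
                            wordlength = 8,         # illustrative value
                            map = map,
                            write_fastas = TRUE)
  print(head(clusters))
}
```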

Value

A data frame with the columns 'SequenceID' and 'ClusterID' that assigns each sequence, by identifier, to a cluster of similar sequences. Additionally, the files CD-HIT.fa and CD-HIT.fa.clstr and a folder Clusters are generated in the given output directory. CD-HIT.fa is the FASTA file of all cluster representatives. CD-HIT.fa.clstr lists all identified clusters and their assigned sequence identifiers, together with each sequence's percent identity to the cluster representative. The Clusters directory contains one FASTA file per cluster.
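
To show how a .clstr file relates to a SequenceID/ClusterID table, here is a minimal parsing sketch. It assumes the usual CD-HIT cluster file layout (a ">Cluster N" header followed by member lines such as "0	120nt, >id... *" for the representative and "1	118nt, >id... at +/98.31%" for other members). The helper parse_clstr and the extra columns are hypothetical and are not part of this package; the cluster numbering returned by runClustering may differ.

```r
# Parse the lines of a CD-HIT .clstr file into a data frame (sketch).
parse_clstr <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]        # drop blank lines
  is_header <- grepl("^>Cluster ", lines)
  stopifnot(is_header[1])                      # file must start with a header

  # Cluster number of each line = number in the most recent header
  header_ids <- as.integer(sub("^>Cluster ", "", lines[is_header]))
  cluster_of_line <- header_ids[cumsum(is_header)]

  members <- lines[!is_header]

  # Percent identity is present only on non-representative lines
  pct <- rep(NA_real_, length(members))
  has_pct <- grepl("[0-9.]+%\\s*$", members)
  pct[has_pct] <- as.numeric(
    sub("^.*?([0-9.]+)%\\s*$", "\\1", members[has_pct], perl = TRUE))

  data.frame(
    # Identifier sits between ">" and the trailing "..."
    SequenceID = sub("^[^>]*>(.*?)\\.\\.\\..*$", "\\1", members, perl = TRUE),
    ClusterID = cluster_of_line[!is_header],
    Representative = grepl("\\*\\s*$", members),
    PercentIdentity = pct,
    stringsAsFactors = FALSE
  )
}

example_lines <- c(
  ">Cluster 0",
  "0\t120nt, >seqA... *",
  "1\t118nt, >seqB... at +/98.31%",
  ">Cluster 1",
  "0\t90nt, >seqC... *"
)
parsed <- parse_clstr(example_lines)
print(parsed)
```

For a real run, the input would come from readLines(file.path(out_dir, "CD-HIT.fa.clstr")), where out_dir is the directory passed to runClustering.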
