Phylogenetic networks represent reticulate evolution by explicitly modeling gene flow. Most existing network methods do not scale to large datasets. We introduce a novel method that reconstructs phylogenetic networks from algebraic invariants, without a heuristic search of network space. Our methodology is available in the Julia package phylo-diamond.jl, and it is at least 10 times faster than the fastest-to-date network methods.
The abundance of gene flow in the Tree of Life challenges the notion that evolution can be represented as a fully bifurcating process, because such a process cannot capture important biological realities like hybridization, introgression, or horizontal gene transfer. Coalescent-based network methods are increasingly popular, yet they do not scale to big data: they require a heuristic search in the space of networks as well as numerical optimization, which can be NP-hard. Here, we introduce a novel method to reconstruct phylogenetic networks based on algebraic invariants. While there is a long tradition of using algebraic invariants in phylogenetics, our work is the first to define phylogenetic invariants on concordance factors (the frequencies of 4-taxon splits in the input gene trees) to identify level-1 phylogenetic networks under the multispecies coalescent model. Our inference methodology is optimization-free, as it only requires the evaluation of polynomial equations. It therefore bypasses the traversal of network space, yielding a computational speed at least 10 times faster than the fastest-to-date network methods. We illustrate the accuracy and speed of our new method on a variety of simulated scenarios and in the estimation of a phylogenetic network for the genus Canis. We implement our theory in an open-source, publicly available Julia package, phylo-diamond.jl, with broad applicability within the evolutionary biology community.
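To make the notion of concordance factors concrete, the following is a minimal Python sketch of how the three split frequencies for a single set of four taxa could be tallied from gene trees. It is an illustration only and is not the API of phylo-diamond.jl: the function names `quartet_split` and `concordance_factors` are hypothetical, and the input is simplified by assumption to one cherry (pair of sister taxa) per gene tree, already restricted to the four taxa of interest.

```python
from collections import Counter


def quartet_split(pair, taxa):
    """Return the canonical 4-taxon split induced by grouping `pair` together.

    Hypothetical helper: a split such as AB|CD is encoded as a sorted tuple
    of two sorted pairs, so AB|CD and CD|AB map to the same key.
    """
    side = tuple(sorted(pair))
    rest = tuple(sorted(t for t in taxa if t not in pair))
    return tuple(sorted([side, rest]))


def concordance_factors(gene_tree_cherries, taxa):
    """Estimate concordance factors for one set of four taxa.

    Assumption: each gene tree is summarized by one cherry observed among
    the four taxa. The result maps each of the three possible splits to its
    observed frequency across the gene trees.
    """
    taxa = sorted(taxa)
    if len(taxa) != 4:
        raise ValueError("exactly four taxa are required")
    # The three unrooted quartet topologies: the first taxon paired with
    # each of the other three in turn.
    splits = [quartet_split((taxa[0], other), taxa) for other in taxa[1:]]
    counts = Counter(quartet_split(c, taxa) for c in gene_tree_cherries)
    total = sum(counts[s] for s in splits)
    if total == 0:
        raise ValueError("no gene trees supplied")
    return {s: counts[s] / total for s in splits}


# Ten toy gene trees: six support AB|CD, two AC|BD, two AD|BC.
cherries = (
    [("A", "B")] * 5 + [("C", "D")] + [("A", "C")] * 2 + [("B", "C")] * 2
)
cfs = concordance_factors(cherries, ["A", "B", "C", "D"])
for split, freq in cfs.items():
    print(split, freq)
```

The three frequencies sum to one by construction. Summaries of this kind, computed for every set of four taxa, are the quantities on which the invariants are defined. The evaluation of the invariants themselves is not shown here.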