Hamming distance project
Initial problem
- Bottleneck in computation: parallelized Python code to calculate Hamming distances from a large set of gene expressions
- Pre-processing step that takes many hours to complete, even when using many cores
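
For illustration, the all-pairs Hamming distance computation can be sketched in plain Python as below. This is a minimal sketch and not the original project code: it assumes the inputs are equal-length strings, and the names `hamming`, `pairwise_hamming` and the toy data are hypothetical. The work grows with the square of the number of inputs, which is why this step dominates the run time on large datasets.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_hamming(seqs):
    """Return a symmetric n x n matrix (list of lists) of Hamming distances."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    # Each unordered pair is compared once and mirrored across the diagonal.
    for i, j in combinations(range(n), 2):
        d = hamming(seqs[i], seqs[j])
        dist[i][j] = dist[j][i] = d
    return dist


seqs = ["ACGT", "ACGA", "TCGA"]  # hypothetical toy input
matrix = pairwise_hamming(seqs)
print(matrix)  # [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
```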
What we did
- Replaced it with a highly optimized, vectorized and parallelized C++ version
- Packaged as a Python library, hammingdist, to fit into the existing user workflow
- Set up Continuous Integration and Continuous Delivery to provide automated testing and deployment of the code
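
One way to see why a vectorized implementation can be much faster is a bit-packed comparison, where many positions are compared in a single operation. The sketch below is an illustration only and is not taken from the hammingdist source: the 2-bit encoding, the 4-letter alphabet, and the names `CODE`, `pack` and `packed_hamming` are all assumptions made for the example.

```python
# Assumed 2-bit encoding for a hypothetical 4-letter alphabet.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq: str) -> int:
    """Pack a sequence into one integer, 2 bits per character."""
    value = 0
    for ch in seq:
        value = (value << 2) | CODE[ch]
    return value


def packed_hamming(x: int, y: int, length: int) -> int:
    """Count differing 2-bit symbols between two packed sequences."""
    # Mask with the low bit of every 2-bit symbol set: 0b0101...01.
    low_bits = (4**length - 1) // 3
    # XOR leaves a non-zero 2-bit group wherever the symbols differ.
    diff = x ^ y
    # Fold each group's high bit onto its low bit, then keep only low bits,
    # so every differing symbol contributes exactly one set bit.
    collapsed = (diff | (diff >> 1)) & low_bits
    return bin(collapsed).count("1")


a, b = pack("ACGT"), pack("TCGA")
print(packed_hamming(a, b, 4))  # 2
```

The XOR and the bit count each handle the whole sequence at once, instead of looping over characters. An optimized compiled implementation can apply the same idea using SIMD registers.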
Result
- Code now runs in minutes on a single core with the original dataset
- Code now scales using HPC resources to process hundreds of thousands of gene expressions
- Publications:
  - arXiv:2207.03394 [math.AT] MuRiT: Efficient Computation of Pathwise Persistence Barcodes in Multi-Filtered Flag Complexes via Vietoris-Rips Transformations
  - arXiv:2106.07292 [q-bio.PE] Topological data analysis identifies emerging adaptive mutations in SARS-CoV-2
- Pipeline: CoVtRec Topological Surveillance of Recurrent Mutations in SARS-CoV-2