Deep learning and proteins
A central challenge is to predict the functional properties of a protein from its sequence, and thereby (i) discover new proteins with specific functionality and (ii) better understand the functional effect of genomic mutations. Experimental breakthroughs in our ability to read and write DNA allow data on the relationship between sequence and function to be acquired rapidly. These data can be used to train and validate machine learning models that predict protein function from sequence. Because phenotypic changes are in many cases controlled by more than one amino acid, the mutations that separate different phenotypes may be epistatic, so models must take the correlation structure into account. Such models rival the accuracy of existing hidden Markov models at sequence annotation, even when given relatively little training data. The representation of sequence space learned by the model can also be used to build families that the model did not see during training. Finally, prospective experiments show that machine learning models can identify, with >55% accuracy, variants of the AAV capsid protein that assemble integral capsids and package their genome, a result relevant to gene therapy applications.
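
The point about epistasis can be made concrete with a toy calculation. The sketch below is only an illustration: the scores are invented, and the names `additive_prediction` and `epistasis` are hypothetical helpers, not part of any model described here. It shows how a prediction that treats two mutations as independent can miss the measured function of the double mutant.

```python
# Toy illustration of epistasis. All scores are made-up numbers, chosen
# only to show the effect; they are not measurements.
scores = {
    "WT": 1.0,   # wild-type sequence
    "A": 0.6,    # mutation A alone
    "B": 0.7,    # mutation B alone
    "AB": 1.2,   # mutations A and B together
}


def additive_prediction(scores):
    """Predict the double mutant assuming single-mutant effects simply add."""
    effect_a = scores["A"] - scores["WT"]
    effect_b = scores["B"] - scores["WT"]
    return scores["WT"] + effect_a + effect_b


def epistasis(scores):
    """Pairwise epistasis: observed double mutant minus the additive prediction."""
    return scores["AB"] - additive_prediction(scores)


predicted = additive_prediction(scores)
interaction = epistasis(scores)

print(f"additive prediction:    {predicted:.2f}")
print(f"observed double mutant: {scores['AB']:.2f}")
print(f"epistatic term:         {interaction:.2f}")
```

In this invented example each mutation is harmful on its own, yet the pair is beneficial, so a model that scores positions independently would get the sign of the effect wrong. A nonzero epistatic term is what a model has to capture when it accounts for correlations between positions.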