Representation Learning

Machine learning promises automation. In an era where we can generate millions of images in a day, we’d really like a computer to look at these images for us and identify which ones are most interesting. But in practice, to train a neural network, users often have to label potentially millions of images first, so it’s not much less effort than just looking at screens manually.

To address this issue, we use self-supervised learning. Self-supervised learning methods train deep learning models to solve proxy tasks whose answers come from the data itself rather than from human labels. The proxy tasks do not need to produce useful outputs; they are only meant to teach the models transferable skills and perceptions of data. Self-supervised proxy tasks often resemble puzzles, like solving a jigsaw or determining how an image has been rotated.
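The rotation puzzle mentioned above can be sketched in a few lines. This is only an illustration of how a proxy task manufactures its own labels, not the method used in this work; the function name `make_rotation_task` and the random stand-in images are assumptions made for the example.

```python
import numpy as np

def make_rotation_task(images, rng):
    """Turn unlabeled square images into (rotated_image, label) pairs.

    images: array of shape (n, size, size).
    A label k means the image was rotated counterclockwise by k * 90 degrees.
    (Hypothetical helper for illustration only.)
    """
    # Draw one rotation per image; these draws become the training labels.
    labels = rng.integers(0, 4, size=len(images))
    rotated = np.stack(
        [np.rot90(img, int(k)) for img, k in zip(images, labels)]
    )
    return rotated, labels

rng = np.random.default_rng(0)
unlabeled = rng.random((8, 32, 32))  # stand-in for real, unlabeled images
inputs, targets = make_rotation_task(unlabeled, rng)
```

A model trained to predict `targets` from `inputs` never sees a human annotation: the labels are a by-product of the transformation applied to the data.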

We design specialized self-supervised learning tasks that require an understanding of nuanced biology to solve, so our models automatically teach themselves protein biology, with no manual labeling required. We learned “representations” of images using a task called “paired cell inpainting”. These representations perform better than those laboriously designed by experts in analyses like classifying protein subcellular localization, suggesting that they capture more biology. In addition, our methods are general: we were able to analyze datasets that had never before been analyzed computationally because of technical challenges.

We also applied these methods to the intrinsically disordered “dark proteome”. These are regions of proteins that don’t fold into a stable secondary or tertiary structure. Although they are widespread and critical to protein function, we still don’t understand them well because they evolve too rapidly to be analyzed by classic bioinformatics methods. By creating a self-supervised method, “reverse homology”, that exploits principles of evolution to learn about conserved elements of these sequences, we can automatically learn hundreds of important features that must be conserved over evolution for these regions of proteins to carry out their function. Interpreting these features lets us produce hypotheses that fuel biological discovery.

Alex Lu
Senior Researcher at Microsoft Research New England.