How to Get Started Using AI for Protein Design: A Guide to Data Collection

Think you need thousands of data points to use AI for protein engineering? We've seen substantial improvements with just 96-well plate assays and smart data collection.

Feb 7

Stef

If you are reading this blog, chances are you are at least curious about the potential of generative machine learning (ML) as a valuable new tool for designing proteins. But you may be thinking that to take advantage of generative ML you need to gather tens of thousands of data points, the kind of throughput that is only attainable by industry behemoths with dedicated ML programs. Turns out, that is not the case.

Here at Cradle, we’ve helped many teams with their protein engineering projects. Through this work, we found that even with relatively low throughput, such as 96-well plate assays, ML models can learn enough to substantially enhance your protein design process. We believe it is a tool everyone should use if they want to obtain the best results.

So, what do you need to do to take advantage of generative ML? The simple answer is: just start using it. You don’t need to uproot your entire workflow or become an ML expert. Don’t worry: We will teach you everything you need to know to get started, including what the “ideal” training dataset looks like.

First, let’s go over some of the misconceptions about what kind of data ML models require and banish those myths.

Myth #1: You need a ton of data to train generative ML models

Truth: This is true—but only partially. Large volumes of data are only required in the initial steps of teaching generative ML models the language of proteins. These models use a process called “unsupervised learning”, which means they identify patterns in unlabelled data and learn to mimic or predict those patterns. The more data points are used to train the model, the “smarter” it will be. For example, Cradle models are trained on nearly 1 billion sequences.

Training unsupervised models on hundreds of millions of protein sequences teaches ML to understand the basic rules of proteins. This is similar to how large language models such as ChatGPT are trained on vast amounts of text from the internet to learn the rules of writing. What this means for you is that you don’t have to start from scratch. The knowledge level of an “off-the-shelf” protein ML model is on par with that of a broadly trained protein engineer. Now, that model just needs to become an expert on your specific protein.

With the unsupervised model already in place, any lab can take advantage of generative ML, provided it can consistently test around 96 protein sequence variants for each property it wants to improve. You can even get started and make significant progress with just a few dozen sequences, as long as they are chosen wisely. At Cradle, we have identified a few important aspects of good training datasets:

Wide distribution: Whether you are engineering thermostability or catalytic activity, including a wide range of performance values helps the model understand the fitness landscape of your protein. The fitness landscape is a theoretical map that relates sequence (the equivalent of the GPS coordinates of your location) with performance (your elevation). It is impossible to cover the entire protein fitness map, but we can measure points across it to infer what the landscape might look like, which will help us find the fitness “peaks”. Including both low- and high-performing variants will teach the model which sequences, or map areas, correlate with better performance and which ones to avoid.
Frequency over volume: Instead of waiting to collect 1000 data points in one large experimental round, it is better to provide the model with 100 data points every other month. The model improves with every new dataset and will suggest better sequences for you to test in order to refine its predictive capabilities.

Even if you don’t have ANY data, we can provide suggestions for initial sequences you can test in your lab. By generating sequences that mimic natural diversity (that is, sampling points across the entire fitness landscape, including high- and low-performing variants), we can start training the model on your specific protein and mapping its fitness landscape.
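
To make the “wide distribution” idea concrete, here is a minimal Python sketch of one way to pick variants that span the range of measured values instead of only the top performers. It is an illustration under our own assumptions, not Cradle's selection method: the function name pick_spread, the variant names, and the values are all hypothetical.

```python
def pick_spread(measurements, n):
    """Pick n variants whose measured values span the observed range.

    measurements: dict mapping a variant name to its measured value.
    Returns variant names ordered from lowest to highest value.
    """
    # Rank all variants from worst to best performer.
    ranked = sorted(measurements.items(), key=lambda item: item[1])
    if n >= len(ranked):
        return [name for name, _ in ranked]
    if n == 1:
        # With a single pick, take the middle of the ranking.
        return [ranked[len(ranked) // 2][0]]
    # Take evenly spaced positions in the ranking, so the lowest and
    # highest performers are always included.
    step = (len(ranked) - 1) / (n - 1)
    return [ranked[round(i * step)][0] for i in range(n)]


# Made-up example values for five hypothetical variants.
example = {"v1": 0.1, "v2": 0.5, "v3": 0.9, "v4": 0.3, "v5": 0.7}
chosen = pick_spread(example, 3)
```

With these toy values, the selection keeps the worst, the middle, and the best variant, which is the kind of spread that gives a model information about both the valleys and the peaks.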

Myth #2: Using AI will disrupt your entire workflow

Bioengineering teams typically follow the Design-Build-Test-Learn (DBTL) process, and most research labs are designed to run these cycles. For example, many labs invest in expensive genotyping and phenotyping instruments (like an in-house sequencer or mass spec) and hire trained personnel to operate them. It can take hundreds of thousands of dollars to set up a lab. Does adopting generative ML methods mean that those instruments will be left to collect dust and all the work will be done by AI instead?

Truth: The DBTL framework is still the cornerstone of protein engineering. ML just makes the ‘Learn’ and ‘Design’ parts a lot more sophisticated and fruitful, increasing your chances of success with each round. ML can improve the efficiency of your workflow and help you achieve better results than what is possible with traditional protein engineering methods.

For example, ML can help you take bigger “design steps”. Traditionally, protein engineers include up to 3 mutations per sequence in every round. But with AI you can introduce many more mutations per sequence and the model will help discern which of those generate improvements and which ones do not. ML helps you gain insights from each data point, good or bad.
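
If you want to keep track of how big your “design steps” are, a small helper like the one below can list the substitutions in each variant relative to the wild type. This is a minimal sketch that assumes substitutions only (sequences of equal length); the function name and the toy sequences are made up for illustration.

```python
def list_substitutions(wild_type, variant):
    """List substitutions in a variant as strings such as 'T3S'.

    Positions are 1-based. Insertions and deletions are not handled
    in this sketch, so both sequences must have the same length.
    """
    if len(wild_type) != len(variant):
        raise ValueError("this sketch only handles equal-length sequences")
    return [
        f"{wt}{position}{mut}"
        for position, (wt, mut) in enumerate(zip(wild_type, variant), start=1)
        if wt != mut
    ]


# Toy sequences, not real proteins.
mutations = list_substitutions("MKTAYIAK", "MKSAYIVK")
mutation_count = len(mutations)
```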

Additionally, ML can help with hypothesis generation and save you time coming up with design ideas yourself. For example, if you are using rational design to generate hypotheses, you can find yourself stuck or limited by the strategies that have previously shown success. ML can supplement your workflow and diversify your design strategy by suggesting unintuitive designs to try, as well as help in cases where rational hypotheses are not available or have been exhausted.

Myth #3: Data quality has to be impeccable for ML models

Truth: The well-known saying “garbage in, garbage out” holds just as true for generative ML as for any other type of model or method. It is important to generate the best possible data; however, we can help you deal with potential data quality issues to make your data usable for ML. For example, we can enrich the dataset with publicly available data if some important values are missing. Here are a few other tricks we can use to enhance the quality of your data:

Include confidence intervals. If you know the error margin of your instrument or assay, we can add those parameters to the model. This helps the model discern when a variant shows a statistically significant improvement. Alternatively, you can simply provide the raw values of the individual measurements and we will use those to calculate the error margin.
Assign “weight” to certain measurements. If you have collected some potentially useful data points but you are not entirely sure how much you can trust those measurements (let’s say they were gathered right after your lab moved to a new location), we can assign a lower confidence weight to those data.
Track experimental variation. If you are collecting measurements for the same variants across different experiments (for example, by including controls such as the wild-type sequence or best-performing variant from the previous round), we can use those measurements to help control for experimental variation between rounds.
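
As a rough illustration of why raw replicate values and shared controls are useful, the sketch below estimates an error margin from replicates and measures how much a control drifted between two rounds. It uses only the Python standard library. The function names, the "wild_type" key, and the numbers are hypothetical, and this is not a description of how Cradle's models handle these quantities internally.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(replicates):
    """Return (mean, standard error of the mean) for raw replicate values."""
    average = mean(replicates)
    if len(replicates) < 2:
        # A single measurement carries no information about its own spread.
        return average, float("nan")
    return average, stdev(replicates) / sqrt(len(replicates))


def control_shift(round_a, round_b, control="wild_type"):
    """Difference in the control's mean value between two rounds.

    Each round is a dict mapping a variant name to its replicate values.
    A shift far from zero suggests round-to-round experimental variation.
    """
    return mean(round_b[control]) - mean(round_a[control])


# Made-up numbers for illustration only.
average, error = summarize([1.0, 2.0, 3.0])
shift = control_shift({"wild_type": [10.0, 12.0]}, {"wild_type": [13.0, 15.0]})
```

Neither calculation is possible if only averages are reported or if no variant is measured in more than one round.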

That said, you can help us get the best value out of your data by following these guidelines for data collection:

Include individual measurements and not the average if your process involves running replicates.
Do not normalize values if you are measuring improvement against a baseline. The model prefers direct measurements of a property as opposed to fold improvement.
Do not use pooled data. Each sequence should correspond to one measurement. If there is no way to deconvolute pooled data, it cannot be used to train the model.
Use consistent conditions. While it may be tempting to try out different conditions, our current models require consistency across experiments. Bundling together multiple experimental conditions may add noise and “confuse” the model.

Improve data confidence with more replicates. If you are not confident in the quality of the measurements, run a higher number of replicates. It also helps to tell us about possible sources of error.
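
One way to follow these guidelines in practice is to keep one row per individual measurement and run a quick sanity check before sharing the data. The sketch below shows the idea; the row layout (the keys "sequence", "value", and "condition") and the function names are our own assumptions for illustration, not a required upload format.

```python
from collections import defaultdict


def check_dataset(rows):
    """Return a list of warnings for a list of measurement rows.

    Each row is a dict with the hypothetical keys 'sequence', 'value'
    (the raw assay readout, or None if missing) and 'condition'.
    """
    warnings = []
    # All measurements should come from the same assay conditions.
    conditions = sorted({row["condition"] for row in rows})
    if len(conditions) > 1:
        warnings.append(f"mixed assay conditions: {', '.join(conditions)}")
    # Rows without a measured value cannot be used as they are.
    missing = sum(1 for row in rows if row["value"] is None)
    if missing:
        warnings.append(f"{missing} row(s) have no measured value")
    return warnings


def group_replicates(rows):
    """Collect the raw replicate values for each sequence, without averaging."""
    grouped = defaultdict(list)
    for row in rows:
        if row["value"] is not None:
            grouped[row["sequence"]].append(row["value"])
    return dict(grouped)


# Toy rows for illustration only.
rows = [
    {"sequence": "MKT", "value": 1.2, "condition": "pH7"},
    {"sequence": "MKT", "value": 1.4, "condition": "pH7"},
    {"sequence": "MKS", "value": None, "condition": "pH5"},
]
problems = check_dataset(rows)
replicates = group_replicates(rows)
```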

Myth #4: Negative data is useless

Truth: Machine learning models value diversity over performance. Biologists typically select and recombine the variants that perform well, ignoring the vast amount of “negative data”. However, with next-generation sequencing becoming much more affordable, you can (and should) obtain the sequences of ALL variants, not just the best ones. This information is super valuable for ML models.

When you are training an ML model, the more diverse sequences you give it, the better it understands the factors that make the protein perform better or worse. Imagine you are exploring a landscape in the Swiss Alps. In searching for the highest peaks, it is just as important to note the valleys: the areas on the map where you don’t want to go.

Similarly, while high-performing protein sequences teach the model about the areas of the fitness landscape that should be explored more, the low-performing sequences help it understand which areas to avoid.

Myth #5: ML will give you great results right off the bat

When people talk about generative ML, some imagine it as this magical model trained on all of the naturally occurring protein sequences that can generate new proteins with any set of specified properties. This type of ML approach—designing proteins with properties that the model has not been specifically trained on—is called 'zero-shot' learning. Zero-shot models are kind of like generally trained protein engineers: they understand the underlying concepts of protein design and can make educated guesses about how to improve protein functions. But they have no experience with your specific protein.

Truth: While zero-shot models are impressive, you can make even more headway by training the model to become an expert on your specific protein.

Every project we work on is unique: customers come to us with different proteins and requirements for optimization, different environments those proteins need to function in, and different assays they use. The chances of getting the optimal performance the first time you run a model on a new protein are low.

The good news is—our models are really good students. They will get better at predicting the sequence-function relationships with each round of training on your custom data and come up with progressively improved protein designs.

It is helpful to think of AI as a hypothesis-generation tool, rather than a magical black box that knows everything about the world of proteins. By testing those hypotheses and feeding the data back to the model, you improve its performance. This way, the model quickly becomes an expert on your specific protein.

Myth #6: You have to understand ML to take advantage of it

Some people claim that all protein engineers need to become ML experts—or even that ML models will replace bioengineers altogether. At Cradle, we do not want to oversell the capabilities of ML, nor understate its power. The truth is somewhere in the middle.

Truth: For a protein engineer, ML is a valuable tool you can (and should) use to make your job easier and improve your results. It can give you the edge over your competition and help you get your product to market faster. As Richard Baldwin, Professor of International Economics at the Geneva Graduate Institute, said: “AI won't take your job. It is somebody using AI that will take your job.”

Don’t be the one who gets left behind.

So, what does the “ideal” ML training dataset look like?

Generative ML can be an incredibly powerful tool to help you engineer proteins. The models you train on the specific data for your protein will get progressively better and help you uncover protein variants that you would not find using traditional methods. But keep in mind that ML models learn differently from you. If you want to obtain the best results with generative AI, follow these data collection guidelines:

Wide data distribution is key. This includes both sequence diversity (amino acid diversity, distribution of mutations across the protein, single- and multi-amino acid mutations, etc.) and the distribution of measured property values.
It is better to provide fewer measurements more frequently than many data points all at once. The model is refined with each round of data collection.
Negative data are valuable as well. Variants that don’t perform well will teach the model what NOT to do.
Individual measurements are preferred over averages. Each data point fine-tunes the model, and that includes teaching it what kind of variability in measurements it can expect.
Raw data is preferred over heavily processed data.
Data needs to be consistent. Assays should be carried out under the same conditions.
Improve the confidence of your measurements. If your process has a high error rate, you will want to run more replicates. We can also help account for this by assigning weight to certain measurements, as well as incorporating parameters like assay error into the model.
Remember: the quality of the model predictions will be as good as the data it is trained on.

Incorporating generative machine learning into your workflow is easier than you think. At Cradle, we try to remove the barriers to entry and help guide you to the best results. Although ML does not necessarily replace other protein engineering tools, it can make the process a lot more efficient and help you achieve your protein design goals faster.

Generative ML is not wizardry, but it can create some impressive results. To achieve that, we have to work together to provide the models with the best quality data possible. We can help you do that by informing the experimental design and tending to the data analysis process. Remember: Your data are the seeds, our model is the soil.

Let’s see what ML can help you grow.
