
Make Sense of Science: Predicting Protein Instructions Using Context

CoCoPred: Context-based Codon Prediction

Welcome back to Make Sense of Science, where we break down complex research into easier-to-understand insights. Today, we're talking about CoCoPred*, a new tool our colleague developed for her Master's thesis to improve the production of proteins. Proteins are essential for medicine, food, and other industries, and CoCoPred helps choose the best "DNA instructions" for making them.

This post summarizes the Master's thesis of Friederike Lüdtke, Software Developer at Computomics. We are proud to share the essence of her work!

 


Friederike Lüdtke
Software Developer

 

The Challenge

The general challenge is to make an organism produce a protein that it would not produce naturally. Human insulin, for example, is nowadays produced in bacteria or yeast, neither of which makes it naturally. When scientists want to make a specific protein for medicine or industry, they need to insert the right DNA instructions into a host organism such as a bacterium or yeast. But not all organisms “read” these instructions the same way. Different organisms prefer different "words" (called codons) for the same building block of a protein. Choosing the wrong words can lead to the protein being made poorly or not at all.

The elements and their functions:

 

DNA
DNA is the instruction manual for every living thing. It tells cells what to do, e.g. how to build proteins. DNA is made up of four letters (A, T, C, G), which are arranged in sequences that provide different instructions. The entire set of DNA instructions found in a cell is the genome.

 

Codons
A codon is a group of three DNA letters that codes for an amino acid. Amino acids are the building blocks of proteins. There are 64 codons (61 code for amino acids and 3 signal "stop"), but only 20 amino acids, so several codons can encode the same amino acid. Different organisms prefer certain codons.
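To make the "many codons, few amino acids" idea concrete, here is a minimal Python sketch that builds the standard genetic code and groups codons by the amino acid they encode. It is an illustration only and not part of CoCoPred; the names `CODON_TABLE` and `synonymous_codons` are made up for this example, and amino acids are written as one-letter codes ("L" for leucine, "*" for stop).

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_codons(table=CODON_TABLE):
    """Group all non-stop codons by the amino acid they encode."""
    groups = defaultdict(list)
    for codon, aa in table.items():
        if aa != "*":  # "*" marks the three stop codons
            groups[aa].append(codon)
    return dict(groups)


groups = synonymous_codons()
# Amino acids that have exactly one codon and therefore leave no choice.
single_codon = sorted(aa for aa, codons in groups.items() if len(codons) == 1)

print(len(CODON_TABLE), "codons in total")
print(len(groups), "amino acids")
print("Leucine (L) can be written as:", groups["L"])
print("Only one codon available for:", single_codon)
```

Leucine, for instance, can be written with six different codons, which is exactly the kind of choice a codon prediction tool has to make.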

 

Amino Acids
Amino acids are the building blocks of proteins, which are essential for almost every function in living organisms.

 

 

Image 1: Illustration of DNA, Genome, Codons, and Amino Acids 

 

 

The Solution: CoCoPred

CoCoPred is a new tool that uses machine learning (a type of AI) to predict the best codons for making proteins. It was developed by Friederike Lüdtke as part of her Master's thesis. Instead of always using the most common codon, CoCoPred considers all possible codons for an amino acid and picks the one that best fits the organism. This helps improve protein production.

 

How Does CoCoPred Work?

CoCoPred takes the DNA from an organism (like bacteria or yeast). It breaks the DNA down into small pieces and looks at how codons are used. Then, it trains a machine learning model to predict the best codon for each part of the protein. Scientists could use this prediction to design new genes, ensuring the best protein production.
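The first step described above, breaking DNA into pieces and looking at how codons are used, can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not CoCoPred's actual code: the function names and the two toy gene sequences are invented, and real genomes need more careful handling (for example, finding the protein-coding regions first).

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code ("*" = stop), in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_into_codons(coding_sequence):
    """Cut a protein-coding DNA sequence into consecutive three-letter codons."""
    seq = coding_sequence.upper()
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def codon_usage(coding_sequences):
    """Count, per amino acid, how often each of its codons is used."""
    usage = defaultdict(Counter)
    for seq in coding_sequences:
        for codon in split_into_codons(seq):
            aa = CODON_TABLE.get(codon)
            if aa is None or aa == "*":
                continue  # skip stop codons and codons with unknown letters
            usage[aa][codon] += 1
    return usage


# Two made-up toy genes, for illustration only.
toy_genes = ["ATGCTGCTGAAATAA", "ATGCTTCTGAAGTGA"]
usage = codon_usage(toy_genes)
for aa, counts in sorted(usage.items()):
    print(aa, dict(counts))
```

In this toy data, leucine (L) is written as CTG three times and as CTT once. Counts like these are the raw material from which codon preferences are learned.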

Bacteria and Yeast in Science
Bacteria and yeast are used in science and industry to help produce important products such as medicines and food. Both are single-celled organisms; yeast cells have a nucleus, while bacteria do not. Bacteria are often "engineered" to produce medicines like insulin or to break down waste. Yeasts are widely used in research to understand human biology and diseases.

 

 

Which Methods Were Used?

To figure out the best way to predict codon usage, the study tested three different methods:

1. CoCoPred:

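  • Genomes of different organisms were used to develop the method.
  • Protein sequences were divided into small pieces, and 18 machine learning models (called random forests) were trained, one for each amino acid that can be encoded by more than one codon (methionine and tryptophan each have only a single codon, so there is nothing to choose for them).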
  • These models predict which codon is best, based on the surrounding amino acids.

 

Machine Learning Models
A machine learning model is a type of computer program that learns from data to make predictions or decisions without being explicitly programmed. A random forest, the type of model used in CoCoPred, makes predictions based on patterns in the data by combining many decision trees.

 

2. PresynCodon:

  • This method uses data from several genomes to create something called a Codon Selection Index (CSI), which helps predict codon usage.
  • Protein fragments from different species were tested against this index.
  • To avoid bias, similar sequences were removed from the training data.

3. Base Model:

  • A simple model that just picks the most common codon for each amino acid from the filtered genome data.
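The Base Model is simple enough to sketch in a few lines of Python. The following is our own illustration rather than the thesis code: the function names and the toy codon counts are hypothetical, and ties between equally common codons are resolved by taking whichever codon was listed first.

```python
def build_base_model(usage):
    """For each amino acid, pick the codon with the highest count."""
    return {aa: max(counts, key=counts.get) for aa, counts in usage.items()}


def design_gene(protein, model):
    """Write a protein as DNA, always using the most common codon."""
    return "".join(model[aa] for aa in protein)


# Hypothetical codon counts for three amino acids (one-letter codes).
toy_usage = {
    "M": {"ATG": 2},
    "L": {"CTG": 3, "CTT": 1},
    "K": {"AAA": 2, "AAG": 1},
}

model = build_base_model(toy_usage)
print(model)
print(design_gene("MLK", model))
```

Because the choice depends only on the amino acid itself, this model writes every leucine the same way, no matter where in the protein it appears.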

These methods were then used to predict the best codons when designing new genes, with the aim of helping scientists improve protein production.

When designing new genes, these tools (all except the Base Model) look at a "window" of seven amino acids at a time. They take the middle amino acid of the window and use one of the prediction methods to choose the best codon for it. The three amino acids on either side provide context that makes the prediction more accurate. The window then moves along the protein one amino acid at a time, so that every position gets a codon suited to its surroundings, with the aim of better protein production in the organism.
Scientists can use these predicted codons to create a synthetic gene, which is then inserted into the host organism (e.g. bacteria).
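The window idea can be sketched in Python as follows. This is a simplified illustration, not the actual implementation: the names are invented, `toy_predictor` is a made-up rule standing in for a trained model, and padding the ends of the protein with "-" is our own assumption, since this summary does not say how positions near the ends are handled.

```python
WINDOW = 7          # seven amino acids per window
HALF = WINDOW // 2  # three neighbours on each side of the middle one
PAD = "-"           # assumed filler for positions beyond the protein ends


def windows(protein):
    """Yield one window of seven per amino acid, centred on that amino acid."""
    padded = PAD * HALF + protein + PAD * HALF
    for i in range(len(protein)):
        yield padded[i:i + WINDOW]


def design_gene(protein, predict_codon):
    """Build a gene by asking the predictor for a codon for each window."""
    return "".join(predict_codon(window) for window in windows(protein))


def toy_predictor(window):
    """Made-up stand-in for a trained model: the codon depends on the context."""
    middle, previous = window[HALF], window[HALF - 1]
    if middle == "L":
        # Invented rule: a leucine right after methionine gets a different codon.
        return "CTT" if previous == "M" else "CTG"
    return {"M": "ATG", "K": "AAA"}[middle]


print(list(windows("MLLK")))
print(design_gene("MLLK", toy_predictor))
```

Unlike the Base Model, the two leucines in this toy protein receive different codons because their neighbours differ. In CoCoPred, the role of `toy_predictor` would be played by the trained random forest for the amino acid in the middle of the window.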

 

What Did the Study Find?

CoCoPred was tested on 15 species (bacteria, fungi, insects, and mammals), and it performed better than older methods for predicting codons. Here’s what was found:

  • Better Accuracy: CoCoPred was up to 2.6 times more accurate than the older methods.
  • Larger Genomes: For organisms with more DNA (like humans), CoCoPred worked even better, improving accuracy by up to 41.6%.
  • Rare Codons: CoCoPred also did well at predicting rare codons, which can be crucial for making proteins that work correctly.

Image 2: Performance of CoCoPred in comparison to other methods

 

Why Does This Matter?

CoCoPred could help scientists make proteins more efficiently in organisms used in medicine, agriculture, and industry. For example, it could be used to produce proteins for gene therapy in humans or enzymes in bacteria that help in food production.

CoCoPred is a powerful new tool that improves how we design DNA instructions for making proteins. It is especially useful in organisms with large genomes, like humans, and could lead to better results in medical and industrial applications.

 

If you wish to know more about CoCoPred or the Master's thesis, feel free to contact Friederike Lüdtke directly!

 

*Note: During the development of CoCoPred by Friederike Lüdtke, a tool with a different functionality was developed under the same name. It predicts the coiled coil structure of a protein.

 
