Computomics - An Overview of Foundation Models in Plant Breeding

An Overview of Foundation Models in Plant Breeding

Why Foundation Models?

The growth of plant breeding over the past decades has led to a significant increase in both the volume and the variety of data that breeders must collect and analyze to run efficient breeding programs. Bioinformatics tools and machine-learning (ML) techniques have proved useful for inspecting such complex data, allowing the development of more efficient and less expensive breeding pipelines. One successful application is the use of ML in the management of plant health. Traditional methods rely on manual, labor-intensive field inspection by breeders, which is extremely time-consuming and prone to error. ML models can speed up the process by identifying early indicators of disease or pest infestation, which enables fast intervention and reduces crop loss [1, 2].

Another example is the use of ML to predict plant traits and promising crosses from different types of information (for example, genotypic and environmental) in order to promote genetic gain and shorten the breeding cycle. Although ML techniques have brought numerous benefits to plant breeding, they need very large datasets, which are often expensive and difficult to acquire. Another limitation of ML models is their lack of flexibility in the tasks they can handle. For these reasons, current research in AI for plant breeding is moving towards Foundation Models [1]. This article offers a high-level overview of what Foundation Models (FMs) are, the current progress in the bioinformatics domain, and the open research directions for FMs in plant breeding.

What are Foundation Models?

Foundation Models were developed to overcome the limitations of traditional ML models and to improve generalization across different tasks. They are very large deep learning systems that can be customized and fine-tuned to the user's needs. Their ability to handle different tasks, even ones they were not explicitly trained for, comes from learning from large and diverse datasets.

To build intuition for what Foundation Models are and why they work, here are two analogies.
If we consider an agriculture company, Foundation Models can be seen as its most experienced employees. Over their careers, they have been exposed to an enormous amount of information and a wide variety of tasks, allowing them to transfer their knowledge to new situations (for example, to new plant types) and make better decisions.
Another way of looking at Foundation Models is to think of them as the base for a cake. Depending on the occasion, you can add extra layers or add wedding decorations rather than birthday decorations. The foundation cake provides a versatile starting point, allowing you to adapt it quickly for any event and optimize the time you need to deliver the final product.

Main Features of a Foundation Model

Given this high-level intuition of Foundation Models and the challenges they are trying to overcome, we can summarize their main features as follows:

Scalability and Architecture: Deep learning architectures with a very large number of parameters
Pre-training process: Pre-training is carried out on very large amounts of unlabeled data using self-supervised learning techniques (e.g. predicting missing parts of a DNA sequence from the surrounding genetic information)
Transferability and Fine-Tuning: The patterns learned during pre-training can be transferred to more domain-specific tasks by adjusting (fine-tuning) the model parameters
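
The self-supervised idea of "predicting missing parts of a DNA sequence" can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the function names, the `_` mask symbol and the 15% masking rate are choices made for this example, and the "model" is a trivial baseline that guesses the most common visible base. A real FM would replace that baseline with a large neural network trained to maximize accuracy on the hidden positions.

```python
import random
from collections import Counter

MASK = "_"  # symbol used for a hidden base (an arbitrary choice for this sketch)

def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Hide a random subset of bases; return the masked string and the hidden targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, base in enumerate(seq):
        if rng.random() < mask_rate:
            masked.append(MASK)
            targets[i] = base  # the label comes from the data itself: no annotation needed
        else:
            masked.append(base)
    return "".join(masked), targets

def majority_baseline(masked):
    """Trivial stand-in for a model: guess the most common visible base everywhere."""
    visible = Counter(b for b in masked if b != MASK)
    if not visible:
        return {}
    guess = visible.most_common(1)[0][0]
    return {i: guess for i, b in enumerate(masked) if b == MASK}

def masked_accuracy(predictions, targets):
    """Fraction of hidden bases that were predicted correctly."""
    if not targets:
        return 0.0
    hits = sum(predictions.get(i) == base for i, base in targets.items())
    return hits / len(targets)

sequence = "ATGCGTACGTTAGCATGCGTACGATCGATCGTAGCTAGCT"
masked, targets = mask_sequence(sequence, mask_rate=0.15, seed=1)
predictions = majority_baseline(masked)
print(masked)
print(f"baseline accuracy on hidden bases: {masked_accuracy(predictions, targets):.2f}")
```

The key point is that the training labels (`targets`) are derived from the unlabeled sequence itself, which is what allows pre-training on very large genome collections without manual annotation.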

Foundation Models for general biological problems impacting plant breeding

For the scope of this article, it is useful to cover the main research areas where FMs have been successfully applied in bioinformatics, since they lay the groundwork for the emerging field of FMs for plant breeding. The extensive survey by Li et al. (2024) [3] identifies five types of biological problems that bioinformatics-specific FMs have tackled so far. Two are worth mentioning here: function prediction for targets such as proteins and genes, and sequence analysis, which is fundamental to identifying single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variations that influence phenotypes. The survey describes several models and their technical advances for both problems. This article singles out only two of them: the Geneformer network for function prediction and HyenaDNA for sequence analysis.

Foundation Models and Plant Breeding

Figure: Timeline and key milestones of Foundation Models (FMs) and their deep learning origins. The development of FMs in bioinformatics closely follows the rise of deep learning, gaining momentum as these models demonstrated groundbreaking capabilities in the big data era. (Adapted from Li et al., 2024 [3])

In August 2024, Zhai et al. released PlantCaduceus, an open-source multi-species plant DNA language model (LM) pre-trained on a curated set of 16 evolutionarily distant angiosperm genomes, enabling cross-species prediction of functional annotations with limited data [5]. The authors show that the model outperforms supervised deep learning models on several tasks, including modelling transcription, translation and evolutionary constraints, across several species, including species that were not part of the pre-training set.
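
The general pattern behind "prediction with limited data" is to keep a pre-trained encoder frozen and fit only a small model on top of its outputs. The sketch below illustrates that pattern and nothing more: `embed` is a hypothetical stand-in that uses 2-mer frequencies instead of a real pre-trained network, the head is a simple nearest-centroid classifier chosen for brevity, and the labels are invented. It is not the method used by the PlantCaduceus authors.

```python
from itertools import product

K = 2
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]

def embed(seq):
    """Hypothetical stand-in for a frozen pre-trained encoder: normalised 2-mer frequencies."""
    counts = {kmer: 0 for kmer in KMERS}
    for i in range(len(seq) - K + 1):
        kmer = seq[i:i + K]
        if kmer in counts:
            counts[kmer] += 1
    total = max(1, sum(counts.values()))
    return [counts[kmer] / total for kmer in KMERS]

def fit_centroids(labelled):
    """Fit the small 'head': one mean embedding per label. The encoder is never updated."""
    sums, sizes = {}, {}
    for seq, label in labelled:
        vec = embed(seq)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, value in enumerate(vec):
            acc[j] += value
        sizes[label] = sizes.get(label, 0) + 1
    return {label: [v / sizes[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, seq):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    vec = embed(seq)
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# A handful of labelled examples is enough to fit the head in this toy setting.
labelled = [
    ("ATATATATAT", "AT-rich"),
    ("TATATATA", "AT-rich"),
    ("GCGCGCGC", "GC-rich"),
    ("CGCGCGCGCG", "GC-rich"),
]
centroids = fit_centroids(labelled)
print(predict(centroids, "ATATTATA"))
print(predict(centroids, "GCGGCGCC"))
```

With a real FM, the quality of the frozen embeddings is what makes such a small head workable for species or tasks with few labelled examples.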

These cross-species results suggest that PlantCaduceus could not only accelerate plant genomic research [5] but also help improve the prediction of complex phenotypic traits across diverse plant species. Such functional insights could accelerate the development of predictive models that link genotype to phenotype, ultimately enhancing the precision of crop breeding and genomic selection.

However, future work should take into account that achieving high accuracy in the prediction of phenotypic traits presents some important challenges:

Phenotype prediction is intrinsically linked to multidimensional and multimodal data (genotypes, environment and their interaction), which can be very complex to model without multimodal FMs [1]
Heterogeneous and imbalanced tabular data are often the main data source, which favours tree-based methods over deep learning methods [4]
Genes associated with complex traits can be located in distant parts of the genome, which requires a very long context (orders of magnitude longer than for annotation tasks, for example)
Such models need to capture the effects of allelic variation, i.e., the genomic differences between individuals
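
To make the last point concrete, one way a sequence model can express the effect of an allelic variant is to compare how probable it finds the alternate allele versus the reference allele in the same context. The sketch below assumes a hypothetical `base_probs(seq, pos)` interface that returns a probability for each base at a position. `toy_base_probs` is a placeholder built from flanking-base counts, not a trained model, and none of these names come from an existing library.

```python
import math
from collections import Counter

BASES = "ACGT"

def toy_base_probs(seq, pos, window=5):
    """Placeholder for a DNA language model: base probabilities at `pos`,
    estimated from counts in the flanking window with add-one smoothing."""
    left = seq[max(0, pos - window):pos]
    right = seq[pos + 1:pos + 1 + window]
    counts = Counter(left + right)
    total = sum(counts[b] for b in BASES) + len(BASES)
    return {b: (counts[b] + 1) / total for b in BASES}

def allele_llr(seq, pos, alt, base_probs=toy_base_probs):
    """Log-likelihood ratio log P(alt) - log P(ref) at one position.
    Negative values mean the model finds the alternate allele less plausible."""
    ref = seq[pos]
    probs = base_probs(seq, pos)
    return math.log(probs[alt]) - math.log(probs[ref])

reference = "AAAAACAAAAA"
for alt in BASES:
    print(alt, round(allele_llr(reference, 5, alt), 3))
```

Any model that exposes per-position base probabilities could be passed in as `base_probs`. Note that this scores one variant in a local window, so it does not by itself address the long-context or genotype-by-environment challenges listed above.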

Conclusion

At Computomics, we recognize the transformative potential of FMs and the impact they will have on plant breeding.
Traditional ML has already accelerated key breeding steps, but FMs offer the next leap: cross-species generalization, multi-modal data integration, and robust phenotype prediction, even in low-data regimes.
Pretrained, domain-aware FMs can outperform task-specific models, aligning well with our focus on predictive breeding and data-driven decision support.

Outlook

We see three key innovation directions for the integration of FMs into Computomics’ solutions:
Cross-crop trait prediction: Leveraging FMs to extend predictive insights to underrepresented or orphan crops.
Genotype-to-phenotype modeling: Using FMs to bridge large-scale genotypic data with complex field-level traits.
Sustainable breeding strategies: Applying FMs to simulate breeding outcomes under climate variability scenarios, improving long-term crop resilience.
As early adopters of AI in plant science, we evaluate, benchmark, and co-develop FM-powered pipelines that can scale across partners, geographies, and species, keeping our focus on speeding up breeding decisions while reducing trial-and-error costs.

We believe FMs represent a powerful complement to our current ML toolkit and an opportunity to push the boundaries of what predictive breeding can achieve.

References

[1] Li, Jiajia, Mingle Xu, Lirong Xiang, Dong Chen, Weichao Zhuang, Xunyuan Yin, and Zhaojian Li. "Large Language Models and Foundation Models in Smart Agriculture: Basics, Opportunities, and Challenges." arXiv preprint arXiv:2308.06668 (2023).

doi: https://doi.org/10.48550/arXiv.2308.06668

[2] Dolatabadian, Aria, Ting Xiang Neik, Monica F. Danilevicz, Shriprabha R. Upadhyaya, Jacqueline Batley, and David Edwards. "Image‐based crop disease detection using machine learning." Plant Pathology 74, no. 1 (2025): 18-38.

doi: https://doi.org/10.1111/ppa.14006

[3] Li, Qing, Zhihang Hu, Yixuan Wang, Lei Li, Yimin Fan, Irwin King, Gengjie Jia, Sheng Wang, Le Song, and Yu Li. "Progress and opportunities of foundation models in bioinformatics." Briefings in Bioinformatics 25, no. 6 (2024): bbae548.

doi: https://doi.org/10.1093/bib/bbae548

[4] Hollmann, Noah, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. "Accurate predictions on small data with a tabular foundation model." Nature 637, no. 8045 (2025): 319-326.

doi: https://doi.org/10.1038/s41586-024-08328-6

[5] Zhai, Jingjing, Aaron Gokaslan, Yair Schiff, Ana Berthel, Zong-Yan Liu, Wei-Yun Lai, Zachary R. Miller et al. "Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model." bioRxiv (2024).

doi: https://doi.org/10.1101/2024.06.04.596709
