The growth of plant breeding over the past decades has led to a significant increase in both the volume and the variety of data that breeders must collect and analyze to run efficient breeding programs. Bioinformatics tools and machine-learning (ML) techniques have proved useful for inspecting such complex data, allowing the development of a more efficient and less expensive breeding pipeline. One successful application is the use of ML in the management of plant health. Traditional methods rely on manual, labor-intensive inspection by breeders in the field, which is extremely time consuming and prone to error. ML models can speed up the process by identifying early indicators of disease or pest infestation, enabling fast intervention and reducing crop loss [1, 2].
Another example is the use of ML to predict plant traits and potential crosses from different types of information (for example, genotypic and environmental) to promote genetic gain and shorten the breeding cycle. Even though ML techniques have brought numerous benefits to plant breeding, they come with the limitation of needing very large datasets that are often expensive and difficult to acquire. Another limitation of ML models is their lack of flexibility in the tasks they can handle. For these reasons, current research in AI for plant breeding is moving towards the use of Foundation Models [1]. This article offers a high-level overview of what Foundation Models (FMs) are, the current progress in the bioinformatics domain, and the open research directions for FMs in plant breeding.
Foundation models were developed to overcome the limitations of traditional ML models and to improve their generalization across different tasks. These models consist of very large deep learning systems that can be customized and fine-tuned to the users' needs. Their ability to deal with different tasks, even those they were not explicitly trained for, results from Foundation Models learning from large and diverse datasets.
To build a better intuition of what Foundation Models are and why they work, here are two analogies.
If we consider an agriculture company, Foundation Models can be seen as the most experienced employees. Over their career, they have been exposed to an enormous amount of information and various tasks, allowing them to transfer their knowledge to new situations (maybe to new plant types) and make better decisions.
Another way of looking at Foundation Models is to think of them as the base for a cake. Depending on the occasion, you can add extra layers or add wedding decorations rather than birthday decorations. The foundation cake provides a versatile starting point, allowing you to adapt it quickly for any event and optimize the time you need to deliver the final product.
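The "shared base plus task-specific layers" idea can be illustrated with a toy fine-tuning sketch. This is only a minimal illustration under stated assumptions: the `backbone` function is a hypothetical stand-in for a pre-trained model (here just a fixed random feature map, not a real Foundation Model), the data are synthetic, and only a small logistic-regression "head" is trained on top of the frozen features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pre-trained backbone: a fixed nonlinear feature map.
# In a real setting this would be a large network pre-trained on broad data.
W_frozen = rng.normal(size=(8, 32))


def backbone(x):
    """Frozen feature extractor: its weights are never updated below."""
    return np.tanh(x @ W_frozen)


def train_head(features, labels, lr=0.1, epochs=1000):
    """Fit a small logistic-regression head on top of the frozen features."""
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))  # predicted probabilities
        grad = p - labels                              # gradient of the log-loss
        w -= lr * features.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b


def predict(x, w, b):
    """Classify new inputs by reusing the backbone and applying the trained head."""
    return (backbone(x) @ w + b > 0).astype(int)


# A small, synthetic labelled dataset for the downstream task.
X = rng.normal(size=(40, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

w, b = train_head(backbone(X), y)
train_accuracy = (predict(X, w, b) == y).mean()
print(f"training accuracy of the small head: {train_accuracy:.2f}")
```

The design choice to note is that only `w` and `b` are learned for the new task; the backbone is reused unchanged, which is why comparatively little task-specific data is needed.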
Given this high-level intuition of Foundation Models and the challenges they are trying to overcome, we can summarize their main features as follows:
For the scope of this article, it is useful to cover the main research areas where FMs have been successfully used in bioinformatics, since this work lays the basis for the emerging field of FMs for plant breeding. The extensive survey by Li et al. (2024) [3] identifies five types of biological problems that bioinformatics-specific FMs have tackled so far. Two are worth mentioning here: function prediction of targets such as proteins and genes, and sequence analysis, which is fundamental to identifying single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variations that influence phenotypes. The survey discusses several models and their technical advances for both problems. However, this article covers only the main results of Geneformer for function prediction and HyenaDNA for sequence analysis.
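To make the notion of a SNP concrete, the toy sketch below compares two sequences base by base. It is not how Foundation Models such as HyenaDNA operate; it only illustrates the kind of variation that sequence analysis aims to capture. The function name `find_snps` and the example sequences are hypothetical, and the sketch assumes the sequences are already aligned and of equal length, so insertions and deletions are out of scope.

```python
def find_snps(reference, sample):
    """Return (position, ref_base, alt_base) for each single-base mismatch.

    Assumes both sequences are already aligned and have the same length;
    insertions and deletions are not handled in this toy example.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]


# Hypothetical aligned sequences (0-based positions).
reference = "ATGCCGTA"
sample = "ATGACGTT"

snps = find_snps(reference, sample)
print(snps)  # [(3, 'C', 'A'), (7, 'A', 'T')]
```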
Figure: Timeline and key milestones of Foundation Models (FMs) and their deep learning origins. The development of FMs in bioinformatics closely follows the rise of deep learning, gaining momentum as these models demonstrated groundbreaking capabilities in the big data era. (Adapted from Li et al., 2024)
In August 2024, Zhai et al. released PlantCaduceus, an open-source, multi-species plant DNA language model (LM) pre-trained on a curated set of 16 evolutionarily distant Angiosperm genomes, enabling cross-species prediction of functional annotations with limited data [5]. The authors show that the model outperforms supervised deep learning models on several tasks, including modelling transcription, translation, and evolutionary constraints, across several species, including species that were not part of the pre-training process.
These results in cross-species prediction hint that PlantCaduceus could not only accelerate plant genomic research [5] but also help improve the prediction of complex phenotypic traits across diverse plant species. Such functional insights could accelerate the development of predictive models that link genotype to phenotype, ultimately enhancing the precision of crop breeding and genomic selection.
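As a rough illustration of what linking genotype to phenotype could look like downstream of such a model, the sketch below fits a ridge regression from per-line feature vectors to a trait value and scores it on held-out lines. Everything here is an assumption made for illustration: the `embeddings` are random numbers standing in for features a DNA language model might produce, the phenotype is simulated, and ridge regression is simply a common baseline in genomic prediction, not the method used in [5].

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical inputs: one feature vector per breeding line (e.g. derived from
# a pre-trained DNA model) and a simulated quantitative trait.
n_lines, n_features = 60, 16
embeddings = rng.normal(size=(n_lines, n_features))
true_effects = rng.normal(size=n_features)
phenotype = embeddings @ true_effects + 0.1 * rng.normal(size=n_lines)


def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: solve (X^T X + alpha * I) beta = X^T y."""
    n_cols = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_cols), X.T @ y)


# Train on the first 45 lines and evaluate on the 15 held-out lines.
train, test = slice(0, 45), slice(45, 60)
beta = ridge_fit(embeddings[train], phenotype[train])
predicted = embeddings[test] @ beta

# Predictive ability is commonly reported as the correlation between
# predicted and observed trait values on held-out lines.
predictive_ability = np.corrcoef(predicted, phenotype[test])[0, 1]
print(f"predictive ability on held-out lines: {predictive_ability:.2f}")
```

With real data the held-out correlation would typically be far lower than in this clean simulation, since real traits also depend on environment and on genetic effects that a linear model cannot capture.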
However, future work should consider that achieving high accuracy in predicting phenotypic traits presents some important challenges:
We believe FMs represent a powerful complement to our current ML toolkit and an opportunity to push the boundaries of what predictive breeding can achieve.
[1] Li, Jiajia, Mingle Xu, Lirong Xiang, Dong Chen, Weichao Zhuang, Xunyuan Yin, and Zhaojian Li. "Large Language Models and Foundation Models in Smart Agriculture: Basics, Opportunities, and Challenges." arXiv preprint arXiv:2308.06668 (2023).
doi: https://doi.org/10.48550/arXiv.2308.06668
[2] Dolatabadian, Aria, Ting Xiang Neik, Monica F. Danilevicz, Shriprabha R. Upadhyaya, Jacqueline Batley, and David Edwards. "Image‐based crop disease detection using machine learning." Plant Pathology 74, no. 1 (2025): 18-38.
doi: https://doi.org/10.1111/ppa.14006
[3] Li, Qing, Zhihang Hu, Yixuan Wang, Lei Li, Yimin Fan, Irwin King, Gengjie Jia, Sheng Wang, Le Song, and Yu Li. "Progress and opportunities of foundation models in bioinformatics." Briefings in Bioinformatics 25, no. 6 (2024): bbae548.
doi: https://doi.org/10.1093/bib/bbae548
[4] Hollmann, Noah, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. "Accurate predictions on small data with a tabular foundation model." Nature 637, no. 8045 (2025): 319-326.
doi: https://doi.org/10.1038/s41586-024-08328-6
[5] Zhai, Jingjing, Aaron Gokaslan, Yair Schiff, Ana Berthel, Zong-Yan Liu, Wei-Yun Lai, Zachary R. Miller et al. "Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model." bioRxiv (2024).
doi: https://doi.org/10.1101/2024.06.04.596709