Best Agtech Open Source Datasets and Strategies for How To Use and Build Off Them

Hanan Othman

Content Writer | 2022/11/3 | 8 min read

Data is an essential resource: it's the key to creating powerful machine learning models, but it comes at a high monetary cost. Industries like agriculture, which are rapidly growing and profiting from AI technologies that increase crop yields while reducing the cost of producing them, particularly need affordable data to support these advancements.

On course to profit $4 billion by 2026, agriculture is within reach of a future where CV-powered applications replace traditional human-operated machines. Smart farming promises higher returns come harvest, more efficient herbicide use, and better pest control measures, paving the way to stability in the form of food security for society.

Start any machine learning project with less expense and more value by taking advantage of existing datasets for some of the most popular AgTech use cases, such as crop health monitoring and protection, field mapping, and ripeness detection. The open source datasets discussed below make an ideal foundation to build off of.

We will cover:

  • What Open Source Offers AgTech AI

  • Where To Find the Best Agriculture Open Source Datasets

  • How To Clean and Prep Open Source Datasets

  • Building on Open Source Datasets

  • What’s in Store for Data Management in AgTech

What Public or Open Source Datasets Offer Agriculture

For any computer vision system to accurately perform even the most basic agricultural tasks, it needs a solid understanding of the "objects of interest" relevant to its working environment.

Specifically, it needs the ability to recognize and differentiate between the species of crops, weeds, and fruit it will interact with and be surrounded by once approved for production and deployed out in the field (in both a literal and figurative sense).

The first step to enabling a CV system or model to recognize these objects is sourcing or collecting data suited to its individual and unique use case. The more adept the model becomes at perceiving these objects, the more accurately and proficiently it will carry out the purpose it was designed for.

AgTech applications benefit from the overlapping "objects of interest" that public dataset pools offer (i.e., variants of weed, crop, and plant species, their distinct growth stages, and their appearance in various environments and climates).

Some AgTech systems are used for crop monitoring, which might call for image data of the many plant types they'll encounter while tracking indicators such as crop growth and health; an application focused on weeding, on the other hand, might exclusively require data for classifying weeds correctly so it can remove them.

That second example also affirms the importance of precision in agricultural technologies: imagine a robot meant to apply herbicide to weeds mistakenly spraying only half or a quarter of the invasive plants. The consequences down the road are too easy to overlook for comfort.

Acquiring Open Source Data for AgTech Applications

The standard methods of sourcing data for computer vision development cycles are: in-house collection, crowdsourcing, outsourcing, synthetic data generation, and open source or publicly available datasets (also referred to as existing datasets in this article).

Each type has its own distinct advantages and suitability depending on a data labeling team's needs, and more than one can be used in a single development project or build. Open source datasets, however, stand out as the practical option to start with, regardless of project scale.

Existing datasets are a great asset for object detection and segmentation tasks, thanks to their variety and the multitude available for easy access online. Having these datasets publicly available saves data prep teams a considerable amount of time and resources in the early stages of data processing.

However, while there is a wide range of image datasets to choose from, most were collected for general purposes and aren't necessarily oriented toward the specialized needs of a custom-built model aiming for precision agriculture goals.

Precision agriculture, often shortened to precision ag or precision farming, refers to observing, measuring, and responding to crop and field conditions with intelligent technologies to improve upon and further farming practices.

Procuring data for precision farming efforts has proven especially challenging, given the amount of detail and specialization necessary at each phase of the data pipeline: acquisition, categorization, and annotation.

Shared, open source datasets make the early days of constructing a complex CV framework much easier: teams can start with the basics and work up to creating the specialized datasets they'll eventually need, but certainly don't need at the beginning.

Cleaning and Prepping Data

After collecting as much open source data as you can find that is relevant to your use case, there's the necessary duty of cleaning it, and possibly reformatting it as well.

Handling "dirty," disordered data is an essential part of any proper data preparation and training pipeline. It typically involves two tasks: first detecting the issues, then addressing them effectively.

Some operations that are commonly performed to clean data include, in no particular order:

  • Extracting structure

  • Dealing with missing values

  • Removing duplicates

  • Handling incorrect data

  • Correcting values to fall within certain ranges

  • Adjusting values to map to existing values in external data sources
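As a rough sketch of several of these operations, here is what cleaning might look like with pandas on a small, hypothetical metadata table for an open source crop image dataset (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical metadata for an open source crop image dataset
df = pd.DataFrame({
    "image_id": ["a1", "a1", "b2", "c3", "d4"],
    "crop": ["maize", "maize", "Maize", None, "wheat"],
    "ndvi": [0.61, 0.61, 1.4, 0.52, -0.2],  # valid NDVI range is [-1, 1]
})

# Removing duplicates: drop exact duplicate rows
df = df.drop_duplicates()

# Dealing with missing values: drop rows with no crop label
df = df.dropna(subset=["crop"])

# Handling incorrect data: normalize inconsistent label casing
df["crop"] = df["crop"].str.lower()

# Correcting values to fall within a valid range
df["ndvi"] = df["ndvi"].clip(lower=-1.0, upper=1.0)

print(df)
```

In a real pipeline these steps would be driven by the dataset's documented schema rather than hard-coded rules, but the shape of the work is the same: detect an issue, apply a fix, and verify the result.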

Data cleaning will also likely require iteration: issues are detected, addressed, and verified, then possibly cleaned further to resolve them for certain. It's also quite common to return to the cleaning stage during model evaluation, since issues that went undetected until that point can surface; typical examples are missing values, label misspellings, and incorrect formatting.

Any existing dataset should be thoroughly examined to help identify and understand any inconsistencies or flaws it may have. Public data may be quite messy, so labeling teams should plan on spending some time on cleaning them before they're considered ready for training use.

Furthermore, cleaning public datasets is necessary because, although an open source dataset may seem like an endless and bountiful gift of detailed data that's free to use, it was likely created for a different, original purpose and needs to be adapted to new training goals.

Only after ensuring the data is clean and relevant is it time to annotate it. Multiple avenues are available, but they are costly in terms of finances, time, and human resources; outsourcing the annotation work is a popular example.

Another common solution businesses fall back on, especially when constrained by project scope, budget, or organization size, is to keep annotation work in-house. But that poses a different concern: data scientists and ML engineers end up focusing on annotation tasks rather than actual programming and model development.

End-to-end data labeling or annotation platforms are an excellent remedy to those concerns, featuring the most useful labeling techniques and tools for training an effective agriculture CV model:

  • Bounding box

  • Semantic segmentation

  • Polygon

  • Cuboid

  • Polyline

  • Landmark
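To make the first of these concrete, here is a minimal sketch of what a single bounding-box annotation might look like in the widely used COCO format; the file name, image dimensions, and box coordinates are hypothetical:

```python
import json

# A minimal COCO-style record: one weed bounding box in one field image.
# All concrete values here are made up for illustration.
annotation = {
    "images": [
        {"id": 1, "file_name": "field_0001.jpg", "width": 1920, "height": 1080}
    ],
    "categories": [
        {"id": 1, "name": "weed"},
        {"id": 2, "name": "maize"},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,              # this box is a weed
            "bbox": [412.0, 230.0, 96.0, 140.0],  # [x, y, width, height] in pixels
            "area": 96.0 * 140.0,
            "iscrowd": 0,
        }
    ],
}

print(json.dumps(annotation, indent=2)[:60])
```

Polygon, polyline, and landmark labels follow the same pattern with different geometry fields; the key point is that every labeled object ties a category to precise image coordinates.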

The right data labeling platform can also relieve a business of these hard decisions by taking on the labeling workflow, enabling a team to step back and monitor labeling processes without manually performing repetitive, time-consuming labeling tasks. That frees them up to concentrate on developing more valuable datasets specific to their project and its particular requirements.

Building On Open Source Datasets for AgTech

As noted at the start of this article, a sound strategy for using open source data treats it as the foundational, initial step of the data curation process.

Regardless of the agriculture application or system the model is built for, existing datasets give ML teams a base of available data that goes a long way toward fulfilling training needs and satisfying a portion of the high data volumes the average AI system requires.

In agriculture, the robots, drones, and other automated machines currently being developed as upgraded solutions to traditional farming operations are the most prominent examples of visual perception and computer vision systems.

To reiterate, these systems require the annotation of specific objects in image and video data in order to navigate their environment properly, mainly objects of interest like crops, fruits, vegetables, and other plants (whether they're weeds or not). They also need to be annotated with specific techniques to be precise.

While open source data is helpful in providing this object matter through image datasets, it likely won't be sufficient for the iterative data labeling and training needs of agricultural applications. Below is a breakdown of the most important functions of a CV agriculture model and the image data each requires for precise and reliable performance:


Monitoring Plant Health

Robots are utilized in precision farming operations to detect diseases in a variety of plants, and should be capable of recognizing pests and evaluating plant health and nutrition needs.

Identifying Crop Ripeness Levels

Labeling the various stages of fructification lets a CV system accurately sort and grade crop growth stages and determine when crops are ripe or ready to harvest. Since the size and color of a fruit or vegetable indicate its ripeness, this training data needs to be customized to the ML model.
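As an illustration of how size and color can map to a ripeness grade, here is a toy heuristic for a red-when-ripe fruit such as a tomato. The thresholds are invented for the example; a real model would learn them from labeled training data:

```python
def ripeness_stage(mean_hue: float, diameter_mm: float) -> str:
    """Toy ripeness heuristic for a red-when-ripe fruit.

    mean_hue: average HSV hue in degrees (green ~120, red ~0).
    diameter_mm: measured fruit diameter.
    All thresholds are illustrative only, not calibrated values.
    """
    if diameter_mm < 40:
        return "immature"   # too small to grade by color yet
    if mean_hue > 80:
        return "unripe"     # still predominantly green
    if mean_hue > 30:
        return "ripening"   # turning orange
    return "ripe"           # predominantly red

print(ripeness_stage(15.0, 62.0))  # a large, red fruit
```

A production system replaces these hand-set cutoffs with a classifier trained on annotated images of each growth stage, which is exactly why stage-labeled data has to be customized to the crop.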

Detecting Weeds

Any agricultural activity is negatively impacted by the presence of invasive or unwanted plants, which lower harvest output by invading and suffocating crops. With sensors, these weeds can be detected and the right herbicide applied accordingly, improving crop yields; and through AI and CV, the amount of herbicide used can be significantly reduced.
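One simple, classical cue for separating green plants from soil in RGB imagery is the excess-green index, ExG = 2G − R − B. The sketch below uses a made-up threshold; note that it only finds vegetation, and telling weed from crop within that mask is the part that requires a trained classifier and labeled data:

```python
import numpy as np

def vegetation_mask(rgb: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """Per-pixel vegetation mask using the excess-green index (ExG = 2G - R - B).

    rgb: H x W x 3 array of floats in [0, 1].
    The threshold is illustrative; real pipelines tune it or learn a model.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    exg = 2.0 * g - r - b
    return exg > threshold

# Two pixels: soil-brown, then plant-green
pixels = np.array([[[0.45, 0.30, 0.20], [0.20, 0.60, 0.15]]])
print(vegetation_mask(pixels))
```

Heuristics like this often serve as a cheap first pass; the downstream weed-versus-crop decision is where the specialized, annotated datasets discussed in this article come in.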

3D Field Mapping and Surveillance

Through drone and satellite image data, 3D field mapping applications powered by deep learning technology can help agriculturalists predict and manipulate crop yield by measuring soil conditions, nitrogen levels, moisture, seasonal weather and historical yield.
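For example, drone or satellite imagery that includes a near-infrared band is commonly summarized with the Normalized Difference Vegetation Index, NDVI = (NIR − Red) / (NIR + Red), one standard input to the yield and crop-condition measurements described above. A small sketch:

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """NDVI = (NIR - Red) / (NIR + Red), ranging from -1 to 1.

    Inputs are reflectance arrays for the near-infrared and red bands;
    higher values generally indicate denser, healthier vegetation.
    """
    nir = nir.astype(float)
    red = red.astype(float)
    # Guard against division by zero where both bands are zero
    denom = np.where(nir + red == 0, 1.0, nir + red)
    return (nir - red) / denom

# Healthy crop pixels reflect strongly in NIR and absorb red
print(ndvi(np.array([0.6, 0.3]), np.array([0.1, 0.3])))
```

Per-pixel indices like this, stacked over time, are what deep learning field-mapping models consume alongside soil, weather, and historical yield data.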

Public and Custom Data Team Up

In short, meeting precision agriculture demands and smart farming functions is more easily achieved through the combined use of open source and custom training datasets.

Having both types allows ML teams to adapt affordably to their unique agricultural use case, while producing a high-quality dataset that enables CV applications to recognize objects more accurately and make the right predictions.

The Future of Data for Agriculture

The agricultural or farming industry is known to be slow to adopt new technologies, mainly because of long-held operating standards and traditions and uncertainty around investing in unfamiliar, unproven innovations. As the amount of data publicly accessible to AI development teams increases, more capable and promising applications are encouraging the agriculture sector to embrace these technologies.

Although the agricultural industry is in a better position to use these systems, there's still the age-old challenge of sourcing large amounts of high-quality data to power them. By leveraging existing datasets and the strategies described in this article, data practitioners and ML model developers can get a head start on delivering even more exciting applications to this sector, in addition to making a tangible difference in the world.


About Superb AI

Superb AI is an enterprise-level training data platform that is reinventing the way ML teams manage and deliver training data within organizations. Launched in 2018, the Superb AI Suite provides a unique blend of automation, collaboration and plug-and-play modularity, helping teams drastically reduce the time it takes to prepare high quality training datasets. If you want to experience the transformation, sign up for free today.


