Quality data is at the heart of successful enterprise artificial intelligence (AI), yet sourcing and maintaining it remains the primary challenge for companies aiming to apply machine learning (ML) in their applications and operations. Despite significant advances in helping enterprises overcome data sourcing and preparation barriers, there is still much work to be done at various levels, including organizational structure and company policies, according to the latest State of AI Report.
The Costs and Challenges of Data in AI
The enterprise AI lifecycle can be divided into four stages: data sourcing, data preparation, model testing and deployment, and model evaluation. Advances in computing and ML tools have automated and accelerated tasks such as training and testing different ML models. Cloud computing platforms enable the simultaneous training and testing of numerous models of various sizes and structures. However, as machine learning models increase in number and size, they require more training data.
Obtaining and annotating training data still demands considerable manual effort and is largely application-specific. The report highlights several obstacles, including insufficient data for specific use cases, new ML techniques requiring larger data volumes, and inefficient data sourcing processes. “High-quality training data is essential for accurate model performance, and large, inclusive datasets are expensive,” says one industry expert. “However, valuable AI data can significantly increase the chances of transitioning projects from pilot to production, justifying the expense.”
ML teams may start with pre-labeled datasets but will eventually need to collect and label their own custom data to scale their efforts. Depending on the application, labeling can become extremely costly and labor-intensive. Many companies have enough data but struggle with quality issues. Biased, mislabeled, inconsistent, or incomplete data reduces ML model quality, adversely affecting the ROI of AI initiatives. “Training ML models with bad data results in inaccurate predictions,” the expert continues. “To ensure AI works well in real-world scenarios, teams need a mix of high-quality datasets, synthetic data, and human-in-the-loop evaluation.”
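One concrete way teams surface the quality issues described above is an inter-annotator consistency check: when the same example receives conflicting labels from different annotators, it is a candidate for review or relabeling. The sketch below is a minimal, hypothetical illustration of that idea (the example IDs and labels are invented for demonstration), not any particular vendor's tooling.

```python
from collections import defaultdict

def find_label_conflicts(annotations):
    """Group annotations by example ID and flag items whose
    annotators disagree -- candidates for review or relabeling."""
    labels_by_example = defaultdict(set)
    for example_id, label in annotations:
        labels_by_example[example_id].add(label)
    return {ex: labels for ex, labels in labels_by_example.items()
            if len(labels) > 1}

# Hypothetical annotations: (example_id, label) pairs from multiple annotators
annotations = [
    ("img_001", "cat"), ("img_001", "cat"),
    ("img_002", "dog"), ("img_002", "wolf"),  # annotators disagree
    ("img_003", "cat"),
]
print(find_label_conflicts(annotations))  # flags img_002 only
```

Even a simple check like this catches the "inconsistent" slice of bad data before it degrades model quality; biased or mislabeled-but-consistent data requires deeper auditing.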
Bridging the Gap Between Data Scientists and Business Leaders
The report indicates a disconnect between business leaders and technical staff regarding the main challenges of AI initiatives. Business leaders are less likely to consider data sourcing and preparation as primary challenges. “There are still gaps between technologists and business leaders in understanding the greatest bottlenecks in the AI lifecycle, leading to misalignment in priorities and budget within organizations,” the report states.
The biggest bottlenecks for AI initiatives often lie in a lack of technical resources and executive buy-in. Data scientists, machine learning engineers, software developers, and executives are spread across different parts of the organization, and their conflicting priorities make an aligned strategy hard to achieve: developers manage the data, data scientists wrestle with ground-level issues, and executives make strategic decisions against different goals.
However, the gap is slowly narrowing as organizations better understand the importance of high-quality data to AI success. “Emphasizing the importance of data—especially high-quality data that matches application scenarios—has brought teams together to tackle these challenges,” the expert explains.
Promising Trends in Machine Learning
Data challenges are not new to applied ML, but as models grow larger and data becomes more abundant, scalable solutions are needed to assemble quality training data. Encouragingly, several trends are helping companies overcome these challenges. The report also shows a decline in the average time teams spend managing and preparing data.
Automated labeling is one such trend. Object detection models, for example, require manually specifying the bounding boxes of each object in training examples, which is labor-intensive. Automated and semi-automated labeling tools use deep learning models to predict bounding boxes, significantly speeding up the process. Although automated labels require human review and adjustment, the system improves with feedback from human labelers. “Many teams start with manual labeling but are increasingly turning to time-saving methods to automate the process partially,” the expert notes.
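The human-review step in semi-automated labeling is often implemented as confidence-based triage: predictions above a threshold are auto-accepted, while the rest are routed to a human labeler. The sketch below illustrates that pattern under assumed inputs (the prediction format, file names, and the 0.9 threshold are all hypothetical choices, not a reference to any specific labeling tool).

```python
def triage_predictions(predictions, confidence_threshold=0.9):
    """Split model-proposed bounding boxes into auto-accepted labels
    and items routed to a human reviewer, based on confidence."""
    auto_accepted, needs_review = [], []
    for pred in predictions:
        if pred["confidence"] >= confidence_threshold:
            auto_accepted.append(pred)
        else:
            needs_review.append(pred)
    return auto_accepted, needs_review

# Hypothetical model output: boxes as (x, y, width, height) with confidence
predictions = [
    {"image": "frame_01.jpg", "box": (34, 50, 120, 160), "confidence": 0.97},
    {"image": "frame_02.jpg", "box": (10, 22, 80, 95), "confidence": 0.58},
]
accepted, review = triage_predictions(predictions)
print(len(accepted), len(review))  # 1 1
```

In practice the threshold is tuned against review capacity, and corrections from the review queue feed back into retraining, which is how the system improves over time.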
The market for synthetic data is also growing. Companies use artificially generated data to supplement real-world data, particularly when obtaining real data is costly or dangerous. For instance, self-driving car companies use synthetic data to train AI models for complex or dangerous scenarios like accidents and emergency vehicle interactions, filling in gaps where human-sourced data is insufficient.
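At its simplest, synthetic data generation means sampling plausible examples of a scenario that is too rare or too dangerous to collect from the real world. The toy sketch below generates labeled "emergency braking" events by sampling from assumed parameter ranges; real pipelines use physics simulators or generative models, and every field and range here is an invented placeholder.

```python
import random

def synthesize_rare_scenarios(n, seed=0):
    """Generate synthetic 'emergency braking' events -- a scenario too
    rare and too risky to collect at scale from real driving logs."""
    rng = random.Random(seed)  # seeded for reproducible datasets
    return [
        {
            "speed_kmh": rng.uniform(30, 120),   # vehicle speed at event
            "distance_m": rng.uniform(5, 40),    # distance to obstacle
            "road": rng.choice(["dry", "wet", "icy"]),
            "label": "emergency_brake",
        }
        for _ in range(n)
    ]

synthetic = synthesize_rare_scenarios(1000)
print(len(synthetic))  # 1000
```

The point of the sketch is the workflow, not the fidelity: synthetic examples are mixed into the real training set specifically to cover the long tail where human-sourced data is thin.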
The evolution of the MLOps market is another positive trend. MLOps tooling helps companies manage various aspects of the ML pipeline, including labeling and versioning datasets, training, testing, and comparing models, deploying models at scale, tracking performance, and updating models with fresh data over time.
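A core idea behind the dataset-versioning piece of this tooling is content addressing: each snapshot of the data is keyed by a hash of its contents, so a training run can record exactly which data it used. The sketch below is a minimal illustration of that mechanism under simplifying assumptions (in-memory storage, JSON-serializable records); production tools add remote storage, lineage metadata, and diffing.

```python
import hashlib
import json

class DatasetRegistry:
    """Minimal sketch of dataset versioning: each snapshot is keyed by a
    content hash, so training runs can reference an exact data version."""

    def __init__(self):
        self.versions = {}

    def register(self, records):
        # Canonical serialization so identical data always hashes the same
        payload = json.dumps(records, sort_keys=True).encode()
        version = hashlib.sha256(payload).hexdigest()[:12]
        self.versions[version] = records
        return version

registry = DatasetRegistry()
v1 = registry.register([{"id": 1, "label": "cat"}])
v2 = registry.register([{"id": 1, "label": "cat"}, {"id": 2, "label": "dog"}])
print(v1 != v2)  # True: changed data yields a new version ID
```

Because the version ID is derived from the data itself, re-registering an unchanged dataset yields the same ID, which is what makes training runs reproducible and comparable.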
As ML becomes increasingly integral to enterprises, human control will become more important. “Human-in-the-loop (HITL) evaluations are crucial for delivering accurate, relevant information and avoiding bias,” the expert says. “Contrary to the belief that humans will take a backseat in AI training, we will see more HITL evaluations to promote responsible AI and ensure transparency in model performance.”
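One lightweight form of HITL evaluation is to reserve a random slice of model outputs for human audit, then use the audited sample to estimate error rates and surface bias. The sketch below shows only that sampling step, with an assumed 5% audit rate and invented prediction records; it is an illustration of the pattern, not a complete evaluation framework.

```python
import random

def sample_for_human_audit(predictions, rate=0.05, seed=42):
    """Reserve a random slice of model predictions for human review --
    a minimal human-in-the-loop audit to surface errors and bias."""
    rng = random.Random(seed)  # seeded so the audit batch is reproducible
    k = max(1, int(len(predictions) * rate))  # always audit at least one
    return rng.sample(predictions, k)

# Hypothetical model decisions to be spot-checked by human reviewers
preds = [{"id": i, "label": "approved"} for i in range(200)]
audit_batch = sample_for_human_audit(preds)
print(len(audit_batch))  # 10
```

Stratifying the sample by user segment or input type, rather than sampling uniformly, is a common refinement when the goal is specifically to detect bias.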
By addressing these challenges and leveraging these trends, enterprises can optimize their AI initiatives and drive significant value from their investments in data technology.