Enabling Machine Learning to Inquire Can Enhance its Intelligence

Credit: Pixabay

Researchers in Duke University’s biomedical engineering department have demonstrated an approach that significantly improves the performance of machine learning models in the search for new molecular therapeutics, even when the models use only a small portion of the available data. By using an algorithm that actively detects gaps in datasets, the researchers more than doubled the models' accuracy in some cases.

This innovative approach has the potential to simplify the identification and classification of molecules with valuable characteristics for the development of new drugs and materials. The research was published in the journal Digital Discovery by the Royal Society of Chemistry on June 23.

Challenges of Machine Learning Algorithms in Predicting Molecular Properties

Machine learning algorithms play an increasingly important role in predicting the properties of small molecules, including drug candidates and other compounds. However, their effectiveness is currently limited by imperfections in the datasets used for training, particularly data bias.

This bias arises when certain properties of molecules are overrepresented compared to others in the dataset, leading the algorithm to prioritize the overrepresented property and overlook other important features.

Daniel Reker, an assistant professor of biomedical engineering at Duke University, compared this bias issue to training an algorithm to differentiate between pictures of dogs and cats but providing it with an overwhelming number of dog pictures and only a few cat pictures. As a result, the algorithm becomes excessively proficient at identifying dogs and ignores other important distinctions.

Data Bias and Its Impact on Drug Discovery

This bias poses significant challenges in drug discovery, where datasets often consist of a vast majority of “ineffective” compounds, with only a small fraction showing potential usefulness. To address this, researchers resort to data subsampling, where the algorithm learns from a smaller but hopefully representative subset of the data. However, this process can lead to the loss of crucial information, impacting the accuracy of the algorithm.
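As a rough illustration of conventional subsampling, the sketch below randomly undersamples the majority class until the classes are balanced. It is a generic textbook technique, not necessarily one of the strategies the Duke team tested, and the function name `random_undersample` and the 95-to-5 split of labels are assumptions made up for this example.

```python
import random


def random_undersample(labels, seed=0):
    """Return sorted indices of a class-balanced subset.

    Hypothetical helper for illustration: every class is cut down to the
    size of the rarest class by picking rows at random.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # Size of the rarest class decides how many rows each class keeps.
    n_keep = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, n_keep))
    return sorted(kept)


# Made-up screening data: 95 "ineffective" compounds (0), 5 promising ones (1).
labels = [0] * 95 + [1] * 5
subset = random_undersample(labels)
```

Here 90 of the 95 majority-class rows are discarded at random, which is exactly where potentially useful information can be lost.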

The method proposed by the Duke biomedical engineers addresses this limitation by employing an algorithm that actively identifies gaps in datasets. With it, models trained on only a fraction of the available data can sometimes reach more than double their original accuracy.

Reker and his team set out to investigate whether active machine learning could address this longstanding data-bias problem.

An Interactive Approach

In active machine learning, the algorithm can ask questions or request more information when it is uncertain or detects gaps in the data, which makes it highly data-efficient. While active learning algorithms are usually used to guide the generation of new data, the team wanted to explore their application to existing datasets in molecular biology and drug development.
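The idea of applying active learning to an existing dataset can be sketched as a loop that repeatedly asks for the label of the data point the current model is least sure about. This is a minimal sketch under stated assumptions, not the Duke team's implementation: the one-dimensional made-up data, the nearest-centroid model, the margin-based uncertainty score, and the names `centroids`, `margin`, and `active_subsample` are all hypothetical choices for illustration.

```python
import random


def centroids(points, labels, chosen):
    """Mean feature value per class, using only the selected indices."""
    sums, counts = {}, {}
    for i in chosen:
        sums[labels[i]] = sums.get(labels[i], 0.0) + points[i]
        counts[labels[i]] = counts.get(labels[i], 0) + 1
    return {c: sums[c] / counts[c] for c in sums}


def margin(x, cents):
    """Gap between the two nearest class centroids; small means uncertain."""
    dists = sorted(abs(x - m) for m in cents.values())
    return dists[1] - dists[0]


def active_subsample(points, labels, budget):
    """Pick up to `budget` indices, always adding the most uncertain point.

    Assumes at least two classes. A label is read only after its point
    has been selected, mimicking the algorithm "asking" for it.
    """
    chosen, seen = [], set()
    for i, y in enumerate(labels):  # seed with one example per class
        if y not in seen:
            seen.add(y)
            chosen.append(i)
    pool = set(range(len(points))) - set(chosen)
    while len(chosen) < budget and pool:
        cents = centroids(points, labels, chosen)
        # Ties are broken by index so the result is deterministic.
        pick = min(pool, key=lambda i: (margin(points[i], cents), i))
        chosen.append(pick)
        pool.remove(pick)
    return chosen


# Made-up imbalanced data: 95 inactive compounds near 0, 5 active near 3.
rng = random.Random(0)
points = [rng.gauss(0, 1) for _ in range(95)] + [rng.gauss(3, 1) for _ in range(5)]
labels = [0] * 95 + [1] * 5
chosen = active_subsample(points, labels, budget=10)
```

Unlike random subsampling, this loop concentrates its small labeling budget on points near the boundary between the classes, where the current model has the most to learn.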

To assess the effectiveness of their active subsampling approach, the team compiled datasets of molecules with various characteristics, such as crossing the blood-brain barrier, inhibiting a protein linked to Alzheimer’s disease, or inhibiting HIV replication. They compared their active-learning algorithm with models that learned from the complete dataset and with 16 state-of-the-art subsampling strategies.

The results showed that active subsampling outperformed each of the standard subsampling strategies in identifying and predicting molecular characteristics. Moreover, it was up to 139 percent more effective than the algorithm trained on the full dataset in some cases. The model also demonstrated its ability to adapt to mistakes in the data, proving especially valuable for low-quality datasets.

Surprising Discoveries

Interestingly, the team found that the ideal amount of data was much lower than expected, in some cases only 10 percent of the available data. Beyond a certain point, adding more data to the active subsample actually reduced the model's performance.

While the team intends to explore this inflection point further in future research, they also plan to utilize this new approach to identify potential therapeutic target molecules. They believe their work will enhance understanding of active machine learning and its resilience to data errors in various research fields.

Besides boosting machine learning performance, this approach can reduce data storage needs and costs since it works with a more refined dataset, making machine learning more accessible, reproducible, and powerful for all researchers.


Read the original article on Tech Xplore.
