Optimization techniques applied to machine learning model development

This PhD thesis was developed by Jose Carlos Garcia Garcia, under the supervision of Dr. Ricardo Garcia Ródenas and Dr. Jose Angel Martín Baos at the University of Castilla-La Mancha. The thesis was defended on October 11, 2024.

Abstract

Machine learning (ML) techniques have emerged as essential tools for solving a wide range of challenges, while also providing critical insights that significantly improve decision-making processes within organizations. These insights are derived from the extensive volumes of data produced by users on a daily basis. Among the various ML methods, clustering is used in data analysis to group data points into clusters, where each cluster contains elements with similar characteristics. These algorithms help identify patterns and structures within the data. The Density Peak Clustering (DPC) method stands out as a clustering algorithm known for its efficiency in identifying clusters with varying densities and non-convex shapes. However, the dependence of DPC on the manual selection of cluster centers and the tuning of key parameters presents significant challenges, particularly when applied to unfamiliar or complex datasets.

To address these limitations, this thesis proposes an optimization-based methodology that automates the selection of cluster centers and parameter configuration in DPC. The approach utilizes internal and external cluster validity indices, such as Gaussian entropy and the V-Measure, respectively, to guide the process of cluster center selection and parameter tuning. Numerical experiments conducted on real-world datasets validated the proposed methodology, showing that the automatic tuning of DPC and Fuzzy Weighted K-Nearest Neighbor DPC (FKNN-DPC) outperforms traditional methods in terms of efficiency and accuracy.
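As a rough illustration of the idea, the sketch below implements a textbook form of DPC in plain Python and tunes it by maximizing the V-Measure against known labels. It is not the thesis implementation: the Gaussian density kernel, the exhaustive grid search (standing in for the optimization procedure), the function names `dpc`, `v_measure` and `tune_dpc`, and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter

def dpc(points, d_c, k):
    """Basic Density Peak Clustering: returns one cluster label per point."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density rho, using a Gaussian kernel of width d_c (an assumption).
    rho = [sum(math.exp(-(dist[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    order = sorted(range(n), key=lambda i: -rho[i])  # densest point first
    delta, parent = [0.0] * n, [-1] * n
    for pos, i in enumerate(order):
        if pos == 0:
            delta[i] = max(dist[i])  # convention for the densest point
            continue
        # delta: distance to the nearest point of higher density.
        j = min(order[:pos], key=lambda h: dist[i][h])
        delta[i], parent[i] = dist[i][j], j
    # Automatic center selection: the k points with the largest rho * delta.
    centers = sorted(order, key=lambda i: -(rho[i] * delta[i]))[:k]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    for i in order:  # parents are always labelled before their children
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(a, b):
    """H(a | b) estimated from label counts."""
    n, b_count = len(a), Counter(b)
    return -sum(c / n * math.log(c / b_count[bv])
                for (_, bv), c in Counter(zip(a, b)).items())

def v_measure(truth, pred):
    """Harmonic mean of homogeneity and completeness."""
    h_c, h_k = entropy(truth), entropy(pred)
    hom = 1.0 if h_c == 0 else 1 - conditional_entropy(truth, pred) / h_c
    com = 1.0 if h_k == 0 else 1 - conditional_entropy(pred, truth) / h_k
    return 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)

def tune_dpc(points, truth, d_c_grid, k_grid):
    """Grid search over (d_c, k); a simple stand-in for a real optimizer."""
    return max((v_measure(truth, dpc(points, d, k)), d, k)
               for d in d_c_grid for k in k_grid)

# Toy data: two well-separated groups of five points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
truth = [0] * 5 + [1] * 5
score, best_d_c, best_k = tune_dpc(points, truth, [0.5, 1.0, 2.0], [1, 2, 3])
print(score, best_d_c, best_k)
```

When no ground-truth labels are available, the same search could in principle be driven by an internal index instead of `v_measure`; that variant is not shown here.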

Moreover, the data collected from various sources often includes additional or supplementary labeled information, which can be used to enhance the value of machine learning methods. These labels may include socioeconomic factors or geographic data, providing richer context for analysis. In response to this, the present doctoral thesis introduces a novel piece-wise machine learning approach that incorporates these additional labeled variables into the analysis of time series data. This methodology was applied to electricity consumption data, where a bi-level optimization problem was formulated. The proposed approach enabled the segmentation of data according to geographical regions, allowing for the adjustment of regression models within each region. The piece-wise autoregressive model demonstrated lower relative error compared to traditional models, highlighting the effectiveness of this method in improving prediction accuracy and identifying patterns among electrical consumers.
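The bi-level structure can be pictured with a small toy example: an outer level decides how regions are grouped into pieces, and an inner level fits one regression model per piece. The sketch below is only an assumption-laden illustration, not the formulation used in the thesis. It uses an AR(1) model without intercept, enumerates every grouping exhaustively, and runs on noise-free synthetic series; the names `fit_ar1` and `piecewise_ar1` and the region labels are hypothetical.

```python
from itertools import product

def fit_ar1(series_list):
    """Inner level: least-squares AR(1) coefficient shared by all series.

    Returns (phi, sum of squared one-step-ahead errors).
    """
    pairs = [(s[t - 1], s[t]) for s in series_list for t in range(1, len(s))]
    den = sum(prev ** 2 for prev, _ in pairs)
    phi = sum(prev * cur for prev, cur in pairs) / den if den else 0.0
    sse = sum((cur - phi * prev) ** 2 for prev, cur in pairs)
    return phi, sse

def piecewise_ar1(series_by_region, n_pieces=2):
    """Outer level: try every assignment of regions to pieces, keep the best.

    Exhaustive enumeration is only feasible for a handful of regions.
    """
    regions = sorted(series_by_region)
    best = None
    for assign in product(range(n_pieces), repeat=len(regions)):
        total, models = 0.0, {}
        for piece in set(assign):
            members = [series_by_region[r]
                       for r, a in zip(regions, assign) if a == piece]
            models[piece], sse = fit_ar1(members)
            total += sse
        if best is None or total < best[0]:
            best = (total, dict(zip(regions, assign)), models)
    return best

def make_series(start, phi, length=12):
    """Noise-free synthetic series y_t = phi * y_(t-1), for illustration only."""
    out = [start]
    for _ in range(length - 1):
        out.append(phi * out[-1])
    return out

# Hypothetical regions: two decay slowly (phi = 0.9), two quickly (phi = 0.5).
data = {
    "north": make_series(100.0, 0.9),
    "south": make_series(80.0, 0.9),
    "east": make_series(100.0, 0.5),
    "west": make_series(60.0, 0.5),
}
total_sse, assignment, models = piecewise_ar1(data)
_, pooled_sse = fit_ar1(list(data.values()))
print(assignment, models, total_sse, pooled_sse)
```

On this toy data the piece-wise fit recovers the two groups of regions and its total error is far below that of a single pooled model, which mirrors the kind of comparison reported above.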

The bi-level optimization problem described above is computationally expensive. To reduce this cost, this doctoral thesis investigates the application of Response Surface Methods (RSMs) in parallel computing environments. It proposes a cooperative framework for parallel RSM algorithms and defines a class of methods referred to as Sequential Multi-point Infill Sampling Algorithms (SMISAs). The framework improves the robustness of the optimization process by allowing multiple algorithms to share solutions. The proposed algorithm, CPEI, integrates a radial-basis-function-based algorithm (CORS-RBF) with a kriging-based method (EGO-PEI) and was evaluated on benchmark functions. The results showed that CPEI not only reduces computational time but also improves convergence compared to traditional methods.
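The cooperative, multi-point infill idea can be sketched loosely as follows: several proposal strategies read from one shared archive of evaluated points, each contributes a new point per iteration, and the resulting batch could be evaluated in parallel. This is not CPEI. The inverse-distance-weighted surrogate, the distance-constrained exploitation step (loosely inspired by surrogate minimization under a minimum-distance constraint), the space-filling exploration step, the one-dimensional `objective`, and all function names are simplifying assumptions chosen to keep the example dependency-free.

```python
def objective(x):
    """Stand-in for an expensive black-box function (minimum at x = 0.3)."""
    return (x - 0.3) ** 2

def idw_surrogate(x, archive):
    """Inverse-distance-weighted prediction from the evaluated points."""
    num = den = 0.0
    for xi, yi in archive:
        d = abs(x - xi)
        if d == 0:
            return yi
        w = 1.0 / d ** 2
        num += w * yi
        den += w
    return num / den

def min_dist(x, archive):
    """Distance from x to the closest evaluated point."""
    return min(abs(x - xi) for xi, _ in archive)

def propose_exploit(candidates, archive, radius):
    """Minimize the surrogate among candidates at least `radius` away
    from every evaluated point; returns None if no candidate qualifies."""
    feasible = [x for x in candidates if min_dist(x, archive) >= radius]
    if not feasible:
        return None
    return min(feasible, key=lambda x: idw_surrogate(x, archive))

def propose_explore(candidates, archive):
    """Pick the candidate farthest from all evaluated points."""
    return max(candidates, key=lambda x: min_dist(x, archive))

def cooperative_search(n_iter=10):
    candidates = [i / 200 for i in range(201)]            # grid on [0, 1]
    archive = [(x, objective(x)) for x in (0.0, 0.5, 1.0)]  # shared archive
    for it in range(n_iter):
        radius = 0.1 / (it + 1)  # shrink the exclusion radius over time
        batch = {propose_explore(candidates, archive)}
        x_exploit = propose_exploit(candidates, archive, radius)
        if x_exploit is not None:
            batch.add(x_exploit)
        # The batch is independent, so these evaluations could run in parallel.
        archive.extend((x, objective(x)) for x in sorted(batch))
    return min(archive, key=lambda p: p[1])

best_x, best_y = cooperative_search()
print(best_x, best_y)
```

Because both strategies read and extend the same archive, a point found by one immediately informs the proposals of the other, which is the robustness argument made for the cooperative framework.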

Overall, this doctoral thesis makes significant contributions to the development and application of optimization techniques for building machine learning models, offering methodologies that are both innovative and practical for addressing complex, high-dimensional problems in the field.