Prof. Rubén Pérez Chacón is a Computer Science Engineer (Pablo de Olavide University, 2011), holds a Master's Degree in Information Security (Autonomous University of Barcelona, 2014), and has been a PhD student in the Big Data and Data Science lab since 2015. He is an Assistant Professor in the area of Languages and Information Systems at Pablo de Olavide University.
He has more than 5 years of experience in the private sector in IT roles such as software developer (J2EE, PHP and PL/SQL) and penetration tester. His research experience includes distributed technologies such as Apache Spark and the development of algorithms in programming languages such as Scala, Java and R.
The research lines of Prof. Rubén Pérez Chacón focus on data mining and machine learning, with direct application to big data time series forecasting. He is the author or co-author of 3 publications in JCR-indexed journals (two in Q1 and one in Q2) and a member of the program committee of numerous international conferences.
Publications
2020
F. Martínez-Álvarez, G. Asencio-Cortés, J. F. Torres, D. Gutiérrez-Avilés, L. Melgar-García, R. Pérez-Chacón, C. Rubio-Escudero, A. Troncoso and J. C. Riquelme: Coronavirus Optimization Algorithm: A bioinspired metaheuristic based on the COVID-19 propagation model. Journal Article. Big Data, 8 (4), pp. 308-322, 2020.

@article{MARTINEZ-ALVAREZ20,
  title     = {Coronavirus Optimization Algorithm: A bioinspired metaheuristic based on the COVID-19 propagation model},
  author    = {F. Martínez-Álvarez and G. Asencio-Cortés and J. F. Torres and D. Gutiérrez-Avilés and L. Melgar-García and R. Pérez-Chacón and C. Rubio-Escudero and A. Troncoso and J. C. Riquelme},
  url       = {https://www.liebertpub.com/doi/full/10.1089/big.2020.0051},
  doi       = {10.1089/big.2020.0051},
  year      = {2020},
  date      = {2020-07-22},
  journal   = {Big Data},
  volume    = {8},
  number    = {4},
  pages     = {308--322},
  abstract  = {This work proposes a novel bioinspired metaheuristic, simulating how the coronavirus spreads and infects healthy people. From a primary infected individual (patient zero), the coronavirus rapidly infects new victims, creating large populations of infected people who will either die or spread infection. Relevant terms such as reinfection probability, super-spreading rate, social distancing measures or traveling rate are introduced into the model in order to simulate the coronavirus activity as accurately as possible. The infected population initially grows exponentially over time, but taking into consideration social isolation measures, the mortality rate and the number of recoveries, the infected population gradually decreases. The Coronavirus Optimization Algorithm has two major advantages when compared to other similar strategies. Firstly, the input parameters are already set according to the disease statistics, preventing researchers from initializing them with arbitrary values. Secondly, the approach has the ability to end after several iterations, without setting this value either. Furthermore, a parallel multi-virus version is proposed, where several coronavirus strains evolve over time and explore wider search space areas in fewer iterations. Finally, the metaheuristic has been combined with deep learning models in order to find optimal hyperparameters during the training phase. As an application case, the problem of electricity load time series forecasting has been addressed, showing quite remarkable performance.},
  pubstate  = {published},
  tppubtype = {article}
}
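The propagation dynamics described in the abstract can be illustrated with a toy, heavily simplified sketch. This is not the paper's actual implementation: the function names, signatures and parameter values below are illustrative, and real disease-statistics parameters, reinfection and the multi-virus variant are omitted.

```python
import random

def cvoa_sketch(fitness, mutate, patient_zero, generations=20,
                spreading_rate=2, death_rate=0.05, max_pop=200, seed=0):
    """Toy sketch of a propagation-style metaheuristic (minimisation).

    fitness: objective to minimise; mutate(s, rng): a nearby solution.
    Starting from patient_zero, every infected solution spreads
    spreading_rate mutated copies per generation; a fraction "dies"
    (is discarded) and the population is capped, loosely mimicking
    mortality and isolation measures.
    """
    rng = random.Random(seed)
    infected = [patient_zero]
    best = patient_zero
    for _ in range(generations):
        # each infected individual spreads mutated copies of itself
        newly = [mutate(s, rng) for s in infected for _ in range(spreading_rate)]
        # some individuals die; the rest survive into the next generation
        infected = [s for s in infected + newly if rng.random() > death_rate]
        infected.sort(key=fitness)
        infected = infected[:max_pop]  # population cap mimics isolation
        if infected and fitness(infected[0]) < fitness(best):
            best = infected[0]
    return best
```

On a simple objective such as minimising (x - 3)^2 from patient zero 0.0, the infected population concentrates around the optimum within a few generations.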
R. Pérez-Chacón, G. Asencio-Cortés, F. Martínez-Álvarez and A. Troncoso: Big data time series forecasting based on pattern sequence similarity and its application to the electricity demand. Journal Article. Information Sciences, 540, pp. 160-174, 2020.

@article{PEREZ20,
  title     = {Big data time series forecasting based on pattern sequence similarity and its application to the electricity demand},
  author    = {R. Pérez-Chacón and G. Asencio-Cortés and F. Martínez-Álvarez and A. Troncoso},
  url       = {https://www.sciencedirect.com/science/article/pii/S0020025520306010},
  doi       = {10.1016/j.ins.2020.06.014},
  year      = {2020},
  date      = {2020-06-06},
  journal   = {Information Sciences},
  volume    = {540},
  pages     = {160--174},
  abstract  = {This work proposes a novel algorithm to forecast big data time series. Based on the well-established Pattern Sequence Forecasting algorithm, this new approach makes two major contributions to the literature. First, it improves the accuracy of the aforementioned algorithm's predictions, and second, it transfers the algorithm to the big data context, reaching meaningful results in terms of scalability. The algorithm uses the Apache Spark distributed computation framework and is a ready-to-use application with few parameters to adjust. Physical and cloud clusters have been used to carry out the experimentation, which consisted of applying the algorithm to real-world data from the Uruguayan electricity demand.},
  pubstate  = {published},
  tppubtype = {article}
}
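At its core, pattern-sequence forecasting labels each day with a cluster id and predicts from what historically followed the current sequence of labels. The matching step can be sketched on a single machine as below; the published algorithm runs distributed on Apache Spark, and the function name and fallback rule here are illustrative assumptions.

```python
def psf_forecast(labels, values, w):
    """Pattern-sequence matching step (simplified, single-node sketch).

    labels: cluster label per historical day
    values: the series value per historical day
    w: window length - match the last w labels against history
    Returns the average value observed the day after each past
    occurrence of the final w-label pattern.
    """
    pattern = tuple(labels[-w:])
    successors = [
        values[i + w]
        for i in range(len(labels) - w)  # only starts whose successor exists
        if tuple(labels[i:i + w]) == pattern
    ]
    if not successors:  # pattern never seen before: fall back to last value
        return float(values[-1])
    return float(sum(successors) / len(successors))
```

For example, with alternating labels [0, 1, 0, 1, 0, 1] and values [10, 20, 10, 20, 10, 20], the final pattern (0, 1) was always followed by 10, so the forecast is 10.0.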
2019
R. Talavera-Llames, R. Pérez-Chacón, A. Troncoso and F. Martínez-Álvarez: MV-kWNN: A novel multivariate and multi-output weighted nearest neighbors algorithm for big data time series forecasting. Journal Article. Neurocomputing, 353, pp. 56-73, 2019.

@article{NEUCOM2019,
  title     = {MV-kWNN: A novel multivariate and multi-output weighted nearest neighbors algorithm for big data time series forecasting},
  author    = {R. Talavera-Llames and R. Pérez-Chacón and A. Troncoso and F. Martínez-Álvarez},
  url       = {https://www.sciencedirect.com/science/article/pii/S0925231219303236?via%3Dihub},
  doi       = {10.1016/j.neucom.2018.07.092},
  year      = {2019},
  date      = {2019-01-01},
  journal   = {Neurocomputing},
  volume    = {353},
  pages     = {56--73},
  abstract  = {This paper introduces a novel algorithm for big data time series forecasting. Its main novelty lies in its ability to deal with multivariate data, i.e. to consider multiple time series simultaneously, in order to make multi-output predictions. Real-world processes are typically characterised by several interrelated variables, and the future occurrence of certain time series cannot be explained without understanding the influence that other time series might have on the target time series. One key issue in the context of multivariate analysis is to determine a priori whether exogenous variables must be included in the model or not. To deal with this, a correlation analysis is used to find a minimum correlation threshold that an exogenous time series must exhibit in order to be beneficial. Furthermore, the proposed approach has been specifically designed to be used in the context of big data, thus making it possible to efficiently process very large time series. To evaluate the performance of the proposed approach we use data from Spanish electricity prices. Results have been compared to other multivariate approaches, showing remarkable improvements in terms of both accuracy and execution time.},
  pubstate  = {published},
  tppubtype = {article}
}
2018
R. Talavera-Llames, R. Pérez-Chacón, A. Troncoso and F. Martínez-Álvarez: Big data time series forecasting based on nearest neighbors distributed computing with Spark. Journal Article. Knowledge-Based Systems, 161 (1), pp. 12-25, 2018.

@article{KNOSYS2018b,
  title     = {Big data time series forecasting based on nearest neighbors distributed computing with Spark},
  author    = {R. Talavera-Llames and R. Pérez-Chacón and A. Troncoso and F. Martínez-Álvarez},
  url       = {https://www.sciencedirect.com/science/article/pii/S0950705118303770},
  doi       = {10.1016/j.knosys.2018.07.026},
  year      = {2018},
  date      = {2018-01-01},
  journal   = {Knowledge-Based Systems},
  volume    = {161},
  number    = {1},
  pages     = {12--25},
  abstract  = {A new approach for big data forecasting based on the k-weighted nearest neighbours algorithm is introduced in this work. Such an algorithm has been developed for distributed computing under the Apache Spark framework. Every phase of the algorithm is explained in this work, along with how the optimal values of the input parameters required for the algorithm are obtained. In order to test the developed algorithm, a Spanish energy consumption big data time series has been used. The accuracy of the prediction has been assessed showing remarkable results. Additionally, the optimal configuration of a Spark cluster has been discussed. Finally, a scalability analysis of the algorithm has been conducted leading to the conclusion that the proposed algorithm is highly suitable for big data environments.},
  pubstate  = {published},
  tppubtype = {article}
}
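The nearest-neighbours idea behind this line of work is to find the k past windows most similar to the current one and combine their successors, weighted by inverse distance. A minimal NumPy sketch of that step follows; the distributed Spark version, parameter tuning and exact weighting scheme of the paper are not reproduced here, and the names are illustrative.

```python
import numpy as np

def kwnn_forecast(history, k):
    """k-weighted nearest neighbours forecast (minimal sketch).

    history: 2-D array, one row per past window (e.g. one day of readings).
    The last row is the query; its k nearest predecessors vote on the
    next row, weighted by inverse distance.
    """
    query, past = history[-1], history[:-1]
    # candidates are rows that have a successor row within `past`
    dists = np.linalg.norm(past[:-1] - query, axis=1)
    idx = np.argsort(dists)[:k]          # k nearest candidate windows
    weights = 1.0 / (dists[idx] + 1e-9)  # closer neighbours weigh more
    successors = past[idx + 1]           # the row following each neighbour
    return (weights[:, None] * successors).sum(axis=0) / weights.sum()
```

With an alternating history of [1, 1] and [5, 5] rows ending on [1, 1], both nearest neighbours were followed by [5, 5], which is then the forecast.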
R. Pérez-Chacón, J. M. Luna, A. Troncoso, F. Martínez-Álvarez and J. C. Riquelme: Big data analytics for discovering electricity consumption patterns in smart cities. Journal Article. Energies, 11 (3), pp. 683, 2018.

@article{Energies2018,
  title     = {Big data analytics for discovering electricity consumption patterns in smart cities},
  author    = {R. Pérez-Chacón and J. M. Luna and A. Troncoso and F. Martínez-Álvarez and J. C. Riquelme},
  url       = {http://www.mdpi.com/1996-1073/11/3/683},
  doi       = {10.3390/en11030683},
  year      = {2018},
  date      = {2018-01-01},
  journal   = {Energies},
  volume    = {11},
  number    = {3},
  pages     = {683},
  abstract  = {New technologies such as sensor networks have been incorporated into the management of buildings for organizations and cities. Sensor networks have led to an exponential increase in the volume of data available in recent years, which can be used to extract consumption patterns for the purposes of energy and monetary savings. For this reason, new approaches and strategies are needed to analyze information in big data environments. This paper proposes a methodology to extract electric energy consumption patterns in big data time series, so that very valuable conclusions can be made for managers and governments. The methodology is based on the study of four clustering validity indices in their parallelized versions along with the application of a clustering technique. In particular, this work uses a voting system to choose an optimal number of clusters from the results of the indices, as well as the application of the distributed version of the k-means algorithm included in Apache Spark's Machine Learning Library. The results, using electricity consumption for the years 2011-2017 for eight buildings of a public university, are presented and discussed. In addition, the performance of the proposed methodology is evaluated using synthetic big data, which can represent thousands of buildings in a smart city. Finally, policies derived from the patterns discovered are proposed to optimize energy usage across the university campus.},
  pubstate  = {published},
  tppubtype = {article}
}
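The voting system mentioned in the abstract can be sketched independently of Spark: each clustering validity index votes for the number of clusters it rates best, and the majority wins. The function and index names below are illustrative, not the paper's four actual indices.

```python
def vote_for_k(index_scores, maximize):
    """Choose the number of clusters by majority vote (sketch of the idea).

    index_scores: dict mapping index name -> {k: score for that k}
    maximize: dict mapping index name -> True if larger scores are better
    Each validity index votes for the k it rates best; the k with the
    most votes wins, ties broken in favour of the smaller k.
    """
    votes = {}
    for name, scores in index_scores.items():
        # pick this index's preferred k (max or min depending on the index)
        best = (max if maximize[name] else min)(scores, key=scores.get)
        votes[best] = votes.get(best, 0) + 1
    return min(votes, key=lambda k: (-votes[k], k))
```

For instance, if two of three indices prefer k = 3 and one prefers k = 4, the vote selects 3, which would then be passed to the distributed k-means run.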
2016
R. Talavera-Llames, R. Pérez-Chacón, M. Martínez-Ballesteros, A. Troncoso and F. Martínez-Álvarez: A Nearest Neighbours-Based Algorithm for Big Time Series Data Forecasting. Conference. HAIS 11th International Conference on Hybrid Artificial Intelligence Systems, Lecture Notes in Computer Science, 2016.

@conference{HAIS2016b,
  title     = {A Nearest Neighbours-Based Algorithm for Big Time Series Data Forecasting},
  author    = {R. Talavera-Llames and R. Pérez-Chacón and M. Martínez-Ballesteros and A. Troncoso and F. Martínez-Álvarez},
  url       = {https://link.springer.com/chapter/10.1007/978-3-319-32034-2_15},
  year      = {2016},
  date      = {2016-01-01},
  booktitle = {HAIS 11th International Conference on Hybrid Artificial Intelligence Systems},
  series    = {Lecture Notes in Computer Science},
  pubstate  = {published},
  tppubtype = {conference}
}
R. Pérez-Chacón, R. Talavera-Llames, F. Martínez-Álvarez and A. Troncoso: Finding Electric Energy Consumption Patterns in Big Time Series Data. Conference. DCAI 13th International Conference on Distributed Computing and Artificial Intelligence, Advances in Intelligent Systems and Computing, 2016.

@conference{DCAI2016,
  title     = {Finding Electric Energy Consumption Patterns in Big Time Series Data},
  author    = {R. Pérez-Chacón and R. Talavera-Llames and F. Martínez-Álvarez and A. Troncoso},
  url       = {https://link.springer.com/chapter/10.1007%2F978-3-319-40162-1_25},
  year      = {2016},
  date      = {2016-01-01},
  booktitle = {DCAI 13th International Conference on Distributed Computing and Artificial Intelligence},
  series    = {Advances in Intelligent Systems and Computing},
  pubstate  = {published},
  tppubtype = {conference}
}