Big Data Streaming

Big data was initially defined by three V's, and among them, volume is arguably the one on which researchers have spent the greatest effort. The technology developed so far provides approximate solutions to the usual problems of massive data analysis. Velocity, however, despite being present in many massive data problems, has not received the same attention. The analysis of continuous data flows, or data streaming (DS), has nonetheless been studied in depth over the last decade. As a result, there are now techniques that allow a model to be kept updated at all times, with the model learned from a set of data that appropriately summarizes the stream.

The objective of this project is to analyze massive data that arrive continuously and require the model to be updated in real time. To that end, we will study and develop models and technologies for the preprocessing, analysis and exploitation of big data streaming, with the aim of creating services that generate value in specific areas where the models can be applied: eHealth, energy or seismology. The general methodology will be based on a schema for DS processing consisting of an initial batch model and an online model that relies on the patterns of the former, deciding whether they are kept or modified.
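The following is a minimal sketch of that batch-then-online schema, not the project's actual implementation: the data source, the scikit-learn SGDClassifier, the window size and the accuracy threshold are all illustrative assumptions. An initial batch model is trained offline; the online phase then tests each incoming chunk, keeps the learned patterns via incremental updates while they hold, and re-learns from a recent window when they no longer do.

```python
# Illustrative sketch of the batch + online DS schema (hypothetical data and thresholds).
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def next_chunk(n=200, d=10):
    """Simulate one chunk of the stream (placeholder for a real DS source)."""
    X = rng.normal(size=(n, d))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    return X, y

# 1) Initial batch model learned from historical data.
X0, y0 = next_chunk(n=2000)
model = SGDClassifier(random_state=0)
model.fit(X0, y0)

# 2) Online phase: test-then-train on each arriving chunk.
window_X, window_y = [], []
for _ in range(50):
    X, y = next_chunk()
    acc = model.score(X, y)          # evaluate on the unseen chunk first
    window_X.append(X)
    window_y.append(y)
    if acc < 0.7:                    # patterns no longer hold: re-learn from a recent window
        Xw = np.vstack(window_X[-5:])
        yw = np.concatenate(window_y[-5:])
        model = SGDClassifier(random_state=0).fit(Xw, yw)
    else:                            # patterns persist: incremental update only
        model.partial_fit(X, y)
```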

The specific sub-goals match the taxonomy of data analysis tasks: preprocessing, supervised learning, unsupervised learning and applications. In preprocessing, we address two tasks: attribute selection and the generation of new attributes through the extension of deep learning to DS. In supervised learning, the classical classification and regression tasks are extended to two DS-related applications: anomaly detection and novelty detection. In unsupervised learning, we will work on models for quantitative association rules, where the group has a long track record with stationary data, and on clustering, where our objective is to find indices that approximate the optimal number of clusters.
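As a hedged illustration of the clustering sub-goal, the sketch below estimates the number of clusters on a window that summarizes the current state of the stream by maximizing an internal validity index. The silhouette coefficient, k-means, the candidate range of k and the synthetic window are all assumptions for the example, not the indices or algorithms the project will necessarily use.

```python
# Illustrative sketch: approximating the number of clusters with an internal index.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_k(window, k_range=range(2, 8)):
    """Return the k in k_range with the highest silhouette score on the window."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(window)
        scores[k] = silhouette_score(window, labels)
    return max(scores, key=scores.get), scores

# Synthetic data standing in for a summarized stream window with three groups.
rng = np.random.default_rng(1)
window = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0, 3, 6)])
best_k, scores = estimate_k(window)
print(best_k, scores)
```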