My full list of publications is available as a

Selected publications (with emphasis on recent years)

Interactive Visual Data Exploration with Subjective Feedback

Kai Puolamäki, Bo Kang, Jefrey Lijffijt, Tijl De Bie. ECML PKDD 2016, Part II, LNCS 9852, pp. 214-229, 2016. [slides pdf, slides pptx, poster] http://dx.doi.org/10.1007/978-3-319-46227-1_14
Topics: explorative data analysis, randomization

Data visualization and iterative/interactive data mining are attracting rapidly growing attention in both research and industry. However, integrated methods and tools that combine advanced visualization and data mining techniques are rare, and those that exist are often specialized to a single problem or domain. In this paper, we introduce a novel generic method for interactive visual exploration of high-dimensional data. In contrast to most visualization tools, it is not based on the traditional dogma of manually zooming and rotating data. Instead, the tool initially presents the user with an ‘interesting’ projection of the data and then employs data randomization with constraints to allow users to flexibly and intuitively express their interests or beliefs using visual interactions that correspond to exactly defined constraints. The constraints expressed by the user are then taken into account by a projection-finding algorithm to compute a new ‘interesting’ projection, a process that can be iterated until the user runs out of time or finds that the constraints explain everything she needs to find from the data. We present the tool by means of two case studies, one controlled study on synthetic data and another on real census data. The data and software related to this paper are available at http://www.interesting-patterns.net/forsied/interactive-visual-data-exploration-with-subjective-feedback/
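The core mechanism above is randomization subject to user-expressed constraints: structure the user has already marked as known is preserved exactly, while everything else is shuffled away. A minimal sketch of that idea, assuming the simplest possible constraint (the user has marked two groups of points whose group-level structure should be kept); the function name and the choice of constraint are illustrative, not the paper's exact constraint language:

```python
import numpy as np

def constrained_randomization(X, groups, rng):
    """Permute each column independently within each user-marked group of
    rows. Structure the user has 'explained' (here, the per-group column
    distributions and thus the group means) is preserved exactly; all
    other dependencies are randomized away."""
    Xr = X.copy()
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        for j in range(X.shape[1]):
            Xr[idx, j] = X[rng.permutation(idx), j]
    return Xr

rng = np.random.default_rng(0)
# toy data: two clusters separated along the first axis
X = np.vstack([rng.normal(0.0, 1.0, (50, 3)),
               rng.normal(5.0, 1.0, (50, 3))])
groups = np.repeat([0, 1], 50)  # the user's marked cluster constraint

Xr = constrained_randomization(X, groups, rng)
```

A projection-finding step would then search for the direction in which the real data `X` differs most from samples like `Xr`; whatever survives the randomization is, by construction, not yet explained by the user's constraints.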

Using regression makes extraction of shared variation in multiple datasets easy

Jussi Korpela, Andreas Henelius, Lauri Ahonen, Arto Klami, Kai Puolamäki. Data Mining and Knowledge Discovery, 30(5): 1112-1133, 2016. http://dx.doi.org/10.1007/s10618-016-0465-y
Topics: time series

In many data analysis tasks it is important to understand the relationships between different datasets. Several methods exist for this task but many of them are limited to two datasets and linear relationships. In this paper, we propose a new efficient algorithm, termed COCOREG, for the extraction of variation common to all datasets in a given collection of arbitrary size. COCOREG extends redundancy analysis to more than two datasets, utilizing chains of regression functions to extract the shared variation in the original data space. The algorithm can be used with any linear or non-linear regression function, which makes it robust, straightforward, fast, and easy to implement and use. We empirically demonstrate the efficacy of shared variation extraction using the COCOREG algorithm on five artificial and three real datasets.
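The chain-of-regressions idea can be sketched in a few lines: fit a regression from each dataset to the next, propagate the data through the whole chain, and only variation present in every dataset survives. This is a simplified linear sketch (single chain, ordinary least squares); the paper's COCOREG averages over chains and admits arbitrary regression functions:

```python
import numpy as np

def linreg_predict(X, Y, Xnew):
    """Ordinary least squares with intercept: fit Y ~ X, predict at Xnew."""
    X1 = np.column_stack([np.ones(len(X)), X])
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return np.column_stack([np.ones(len(Xnew)), Xnew]) @ B

def cocoreg_chain(datasets, target):
    """Propagate data through a chain of regressions that visits every
    dataset and ends at `target`; variation not shared by all datasets
    in the chain is progressively filtered out."""
    order = [i for i in range(len(datasets)) if i != target] + [target]
    Z = datasets[order[0]]
    for prev, nxt in zip(order, order[1:]):
        # fit each link on the observed data, apply it to the chained estimate
        Z = linreg_predict(datasets[prev], datasets[nxt], Z)
    return Z

rng = np.random.default_rng(1)
s = rng.normal(size=(200, 1))  # latent signal shared by all datasets
datasets = [np.hstack([s + 0.1 * rng.normal(size=(200, 1)),  # shared column
                       rng.normal(size=(200, 1))])           # private noise
            for _ in range(3)]
shared = cocoreg_chain(datasets, target=2)
```

On this toy collection the chained output recovers the shared signal `s` in the target's data space while the private noise columns are regressed away.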

Size matters: choosing the most informative set of window lengths for mining patterns in event sequences

Jefrey Lijffijt, Panagiotis Papapetrou, Kai Puolamäki. Data Mining and Knowledge Discovery, 29(6): 1838-1864, 2015. http://dx.doi.org/10.1007/s10618-014-0397-3
Topics: time series

In order to find patterns in data, it is often necessary to aggregate or summarise data at a higher level of granularity. Selecting the appropriate granularity is a challenging task and often no principled solutions exist. This problem is particularly relevant in analysis of data with sequential structure. We consider this problem for a specific type of data, namely event sequences. We introduce the problem of finding the best set of window lengths for analysis of event sequences for algorithms with real-valued output. We present suitable criteria for choosing one or multiple window lengths and show that these naturally translate into a computational optimisation problem. We show that the problem is NP-hard in general, but that it can be approximated efficiently and even analytically in certain cases. We give examples of tasks that demonstrate the applicability of the problem and present extensive experiments on both synthetic data and real data from several domains. We find that the method works well in practice, and that the optimal sets of window lengths themselves can provide new insight into the data.
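The setup above — an algorithm with real-valued output evaluated at many window lengths, from which a small representative set of lengths is chosen — can be sketched as follows. The summary statistic (sliding event frequency) and the greedy k-medoids-style criterion here are illustrative stand-ins for the paper's optimisation criteria:

```python
import numpy as np

def sliding_frequency(events, T, w):
    """Real-valued output: event frequency in each length-w window."""
    return np.array([np.sum((events >= t) & (events < t + w)) / w
                     for t in range(T - w + 1)])

def greedy_window_lengths(lengths, curves, k):
    """Greedily pick k window lengths whose output curves best represent
    all candidates: minimise the total distance from every candidate
    curve to its nearest chosen curve."""
    m = min(len(c) for c in curves)  # truncate to a common length
    C = np.array([(c[:m] - c[:m].mean()) / (c[:m].std() + 1e-9)
                  for c in curves])
    D = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
    chosen = []
    for _ in range(k):
        rest = [i for i in range(len(lengths)) if i not in chosen]
        best = min(rest, key=lambda i: D[:, chosen + [i]].min(axis=1).sum())
        chosen.append(best)
    return sorted(lengths[i] for i in chosen)

rng = np.random.default_rng(2)
T = 400
# bursty event sequence: dense events early, sparse later
events = np.concatenate([rng.integers(0, 100, 150),
                         rng.integers(100, T, 50)])
lengths = [5, 10, 20, 40, 80]
curves = [sliding_frequency(events, T, w) for w in lengths]
best = greedy_window_lengths(lengths, curves, k=2)
```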

A peek into the black box: exploring classifiers by randomization

Andreas Henelius, Kai Puolamäki, Henrik Boström, Lars Asker, Panagiotis Papapetrou. Data Mining and Knowledge Discovery, 28(5-6): 1503-1529, 2014. [slides, poster] http://dx.doi.org/10.1007/s10618-014-0368-8
Topics: randomization

Classifiers are often opaque and cannot easily be inspected to gain understanding of which factors are of importance. We propose an efficient iterative algorithm to find the attributes and dependencies used by any classifier when making predictions. The performance and utility of the algorithm is demonstrated on two synthetic and 26 real-world datasets, using 15 commonly used learning algorithms to generate the classifiers. The empirical investigation shows that the novel algorithm is indeed able to find groupings of interacting attributes exploited by the different classifiers. These groupings allow for finding similarities among classifiers for a single dataset as well as for determining the extent to which different classifiers exploit such interactions in general.
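The randomization idea can be illustrated directly: permute the data within predicted classes, jointly inside each candidate attribute group but independently across groups, and measure how well the classifier's predictions survive. A sketch under simplified assumptions (a synthetic black box with a known XOR interaction; the grouping search itself is omitted):

```python
import numpy as np

def fidelity(black_box, X, grouping, rng, n_rep=20):
    """Agreement between predictions on the original data and on data
    where rows are permuted within each predicted class, jointly inside
    each attribute group but independently across groups. Breaking an
    interaction the classifier exploits lowers the fidelity."""
    y0 = black_box(X)
    agree = []
    for _ in range(n_rep):
        Xp = X.copy()
        for group in grouping:
            perm = np.arange(len(X))
            for c in np.unique(y0):
                idx = np.where(y0 == c)[0]
                perm[idx] = rng.permutation(idx)  # class-stratified shuffle
            Xp[:, group] = X[perm][:, group]
        agree.append(np.mean(black_box(Xp) == y0))
    return float(np.mean(agree))

rng = np.random.default_rng(3)
# a black box that uses an XOR interaction between attributes 0 and 1
black_box = lambda X: (X[:, 0] > 0) ^ (X[:, 1] > 0)
X = rng.normal(size=(500, 3))

f_joint = fidelity(black_box, X, [[0, 1], [2]], rng)    # interaction kept
f_split = fidelity(black_box, X, [[0], [1], [2]], rng)  # interaction broken
```

Keeping attributes 0 and 1 in one group preserves fidelity, while splitting them collapses it to chance level, revealing that the classifier relies on their interaction.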

A statistical significance testing approach to mining the most informative set of patterns

Jefrey Lijffijt, Panagiotis Papapetrou, and Kai Puolamäki. Data Mining and Knowledge Discovery, 28(1):238–263, 2014. http://dx.doi.org/10.1007/s10618-012-0298-2
Topics: randomization, significance testing

Hypothesis testing using constrained null models can be used to compute the significance of data mining results given what is already known about the data. We study the novel problem of finding the smallest set of patterns that explains most about the data in terms of a global p value. The resulting set of patterns, such as frequent patterns or clusterings, is the smallest set that statistically explains the data. We show that the newly formulated problem is, in its general form, NP-hard and there exists no efficient algorithm with finite approximation ratio. However, we show that in a special case a solution can be computed efficiently with a provable approximation ratio. We find that a greedy algorithm gives good results on real data and that, using our approach, we can formulate and solve many known data-mining tasks. We demonstrate our method on several data mining tasks. We conclude that our framework is able to identify in various settings a small set of patterns that statistically explains the data and to formulate data mining problems in terms of statistical significance.
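The greedy algorithm mentioned above can be sketched with a simplified objective: repeatedly add the pattern that explains the most still-unexplained items, until everything coverable is covered. The set-cover objective here is an illustrative stand-in for the paper's global p-value criterion:

```python
def greedy_pattern_set(patterns, universe):
    """Greedily build a small pattern set that explains the data:
    repeatedly add the pattern accounting for the most still-unexplained
    items (a set-cover stand-in for the paper's global p-value)."""
    remaining = set(universe)
    chosen = []
    while remaining:
        best = max(patterns, key=lambda p: len(remaining & patterns[p]))
        if not remaining & patterns[best]:
            break  # nothing left can be explained
        chosen.append(best)
        remaining -= patterns[best]
    return chosen

# toy example: items 1..6 to explain, four candidate patterns
patterns = {"A": {1, 2, 3}, "B": {4, 5}, "C": {5, 6}, "D": {1, 4}}
selected = greedy_pattern_set(patterns, universe=range(1, 7))
# selected == ["A", "B", "C"]: three patterns suffice to explain all items
```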

Confidence bands for time series data

Jussi Korpela, Kai Puolamäki, Aristides Gionis. Data Mining and Knowledge Discovery, 28(5-6): 1530-1553, 2014. http://dx.doi.org/10.1007/s10618-014-0371-0
Topics: time series, significance testing

Simultaneous confidence intervals, or confidence bands, provide an intuitive description of the variability of a time series. Given a set of N time series of length M, we consider the problem of finding a confidence band that contains a (1−α)-fraction of the observations. We construct such confidence bands by finding the set of N−K time series whose envelope is minimized. We refer to this problem as the minimum width envelope problem. We show that the minimum width envelope problem is NP-hard, and we develop a greedy heuristic algorithm, which we compare to quantile- and distance-based confidence band methods. We also describe a method to find an effective confidence level α_eff and an effective number of observations to remove K_eff, such that the resulting confidence bands will keep the family-wise error rate below α. We evaluate our methods on synthetic and real datasets. We demonstrate that our method can be used to construct confidence bands with guaranteed family-wise error rate control, even when there is too little data for the quantile-based methods to work.
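The greedy heuristic for the minimum width envelope problem is straightforward to sketch: remove K series one at a time, each time dropping the series whose removal shrinks the envelope (pointwise max minus pointwise min) the most. A minimal sketch, assuming a small toy dataset with one obvious outlying series:

```python
import numpy as np

def envelope_width(X):
    """Total width of the pointwise [min, max] envelope of the rows."""
    return float(np.sum(X.max(axis=0) - X.min(axis=0)))

def greedy_min_width_envelope(X, K):
    """Greedily remove K of the N time series, at each step dropping the
    series whose removal shrinks the envelope the most; the remaining
    series define the confidence band."""
    keep = list(range(len(X)))
    for _ in range(K):
        drop = min(keep, key=lambda i:
                   envelope_width(X[[j for j in keep if j != i]]))
        keep.remove(drop)
    return keep, X[keep].min(axis=0), X[keep].max(axis=0)

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 50))   # N=10 series of length M=50
X[7] += 10.0                    # one clearly outlying series
keep, lo, hi = greedy_min_width_envelope(X, K=1)
```

With K=1 the greedy step discards the shifted series, and `[lo, hi]` is the band spanned by the remaining nine.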

Dental functional traits of mammals resolve productivity in terrestrial ecosystems past and present

Liping Liu, Kai Puolamäki, Jussi T. Eronen, Majid M. Ataabadi, Elina Hernesniemi, and Mikael Fortelius. Proceedings of the Royal Society B, 279(1739):2793–2799, 2012. http://dx.doi.org/10.1098/rspb.2012.0211
Topics: ecological modeling

We have recently shown that rainfall, one of the main climatic determinants of terrestrial net primary productivity (NPP), can be robustly estimated from mean molar tooth crown height (hypsodonty) of mammalian herbivores. Here, we show that another functional trait of herbivore molar surfaces, longitudinal loph count, can be similarly used to extract reasonable estimates of rainfall but also of temperature, the other main climatic determinant of terrestrial NPP. Together, molar height and the number of longitudinal lophs explain 73 per cent of the global variation in terrestrial NPP today and resolve the main terrestrial biomes in bivariate space. We explain the functional interpretation of the relationships between dental function and climate variables in terms of long- and short-term demands. We also show how the spatially and temporally dense fossil record of terrestrial mammals can be used to investigate the relationship between biodiversity and productivity under changing climates in geological time. The placement of the fossil chronofaunas in biome space suggests that they most probably represent multiple palaeobiomes, at least some of which do not correspond directly to any biomes of today’s world.

Visually controllable data mining methods

Kai Puolamäki, Panagiotis Papapetrou, and Jefrey Lijffijt. In IEEE International Conference on Data Mining Workshops 2010, 2010. http://dx.doi.org/10.1109/ICDMW.2010.141
Topics: explorative data analysis

A large number of data mining methods are, as such, not applicable to fast, intuitive, and interactive use. Thus, there is a need for visually controllable data mining methods. Such methods should comply with three major requirements: their model structure can be represented visually, they can be controlled using visual interaction, and they should be fast enough for visual interaction. We define a framework for using data mining methods in interactive visualization. These data mining methods are called “visually controllable” and combine data mining with visualization and user interaction, bridging the gap between data mining and visual analytics. Our main objective is to define the interactive visualization scenario and the requirements for visually controllable data mining. Basic data mining algorithms are reviewed, and it is demonstrated how they can be controlled visually. We also discuss how existing visual analytics tools fit the proposed framework. From a data mining perspective, this work creates a reference framework for designing and evaluating visually controllable algorithms and visual analytics systems.

Tell me something I don’t know: Randomization strategies for iterative data mining

Sami Hanhijärvi, Markus Ojala, Niko Vuokko, Kai Puolamäki, Nikolaj Tatti, and Heikki Mannila. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’09), pages 379-388, New York, NY, USA, 2009. ACM. http://dx.doi.org/10.1145/1557019.1557065
Topics: randomization, significance testing

There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, using clustering can give indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure.

In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets having similar values of the given statistics as the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.
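The best-known instance of such constrained randomization is swap randomization of a binary matrix, where local swaps preserve all row and column sums; the paper builds Metropolis sampling on top of moves like this to also match further statistics (e.g., cluster centers). A minimal sketch of the plain margin-preserving swap move:

```python
import numpy as np

def swap_randomize(A, n_swaps, rng, max_tries=200000):
    """Sample a binary matrix with the same row and column sums via
    local swaps: find a 2x2 submatrix [[1,0],[0,1]] and flip it to
    [[0,1],[1,0]]. Each accepted swap leaves all margins unchanged."""
    A = A.copy()
    n, m = A.shape
    swaps = tries = 0
    while swaps < n_swaps and tries < max_tries:
        tries += 1
        r1, r2 = rng.choice(n, 2, replace=False)
        c1, c2 = rng.choice(m, 2, replace=False)
        if (A[r1, c1] == 1 and A[r2, c2] == 1 and
                A[r1, c2] == 0 and A[r2, c1] == 0):
            A[r1, c1] = A[r2, c2] = 0
            A[r1, c2] = A[r2, c1] = 1
            swaps += 1
    return A

rng = np.random.default_rng(5)
A = (rng.random((20, 20)) < 0.3).astype(int)  # toy 0/1 data matrix
B = swap_randomize(A, n_swaps=200, rng=rng)
```

A Metropolis variant would additionally accept or reject each swap based on how much it changes the statistics being preserved; this sketch accepts every valid swap.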

Latent grouping models for user preference prediction

Eerika Savia, Kai Puolamäki, and Samuel Kaski. Machine Learning, 74(1):75–109, 2009. http://dx.doi.org/10.1007/s10994-008-5081-7
Topics: user modeling

We tackle the problem of new users or documents in collaborative filtering. Generalization over users by grouping them into user groups is beneficial when a rating is to be predicted for a relatively new document having only few observed ratings. Analogously, generalization over documents improves predictions in the case of new users. We show that if either users or documents, or both, are new, two-way generalization becomes necessary. We demonstrate the benefits of grouping of users, grouping of documents, and two-way grouping, with artificial data and in two case studies with real data. We have introduced a probabilistic latent grouping model for predicting the relevance of a document to a user. The model assumes a latent group structure for both users and items. We compare the model against a state-of-the-art method, the User Rating Profile model, where only the users have a latent group structure. We compute the posterior of both models by Gibbs sampling. The Two-Way Model predicts relevance more accurately when the target consists of both new documents and new users. The reason is that generalization over documents becomes beneficial for new documents and at the same time generalization over users is needed for new users.
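The benefit of two-way grouping can be illustrated with a deliberately simplified stand-in for the probabilistic model: predict a rating as the mean observed rating of its (user group, item group) block. Here the group assignments are assumed known, whereas the paper infers the latent groups with Gibbs sampling:

```python
import numpy as np

def block_means(R, user_group, item_group, n_ug, n_ig):
    """Predict a rating as the mean observed rating of its
    (user group, item group) block; NaN marks unobserved ratings."""
    M = np.zeros((n_ug, n_ig))
    for a in range(n_ug):
        for b in range(n_ig):
            block = R[np.ix_(user_group == a, item_group == b)]
            M[a, b] = np.nanmean(block)
    return M

rng = np.random.default_rng(6)
true_blocks = np.array([[1.0, 5.0], [4.0, 2.0]])  # block-level relevance
user_group = np.repeat([0, 1], 20)                # 40 users in 2 groups
item_group = np.repeat([0, 1], 15)                # 30 items in 2 groups
R = true_blocks[np.ix_(user_group, item_group)] + 0.1 * rng.normal(size=(40, 30))
R[rng.random((40, 30)) < 0.3] = np.nan            # 30% of ratings unobserved

M = block_means(R, user_group, item_group, 2, 2)
```

A new user with no ratings can still be scored via her user group, and a new document via its item group, which is exactly the two-way generalization argued for above.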

Combining eye movements and collaborative filtering for proactive information retrieval

Kai Puolamäki, Jarkko Salojärvi, Eerika Savia, Jaana Simola, and Samuel Kaski. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’05), pages 146–153, New York, NY, USA, 2005. ACM. http://dx.doi.org/10.1145/1076034.1076062
Topics: user modeling, gaze tracking

We study a new task, proactive information retrieval, by combining implicit relevance feedback and collaborative filtering. We have constructed a controlled experimental setting, a prototype application, in which the users try to find interesting scientific articles by browsing their titles. Implicit feedback is inferred from eye movement signals, with discriminative hidden Markov models estimated from existing data in which explicit relevance feedback is available. Collaborative filtering is carried out using the User Rating Profile model, a state-of-the-art probabilistic latent variable model, computed using Markov chain Monte Carlo techniques. For new document titles the prediction accuracy with eye movements, collaborative filtering, and their combination was significantly better than by chance. The best prediction accuracy still leaves room for improvement but shows that proactive information retrieval and combination of many sources of relevance feedback is feasible.