Trade-offs in Large-Scale Distributed Tuplewise Estimation and Learning

The development of cluster computing frameworks has allowed practitioners to scale out various statistical estimation and machine learning algorithms with minimal programming effort. This is especially true for machine learning problems whose objective function is nicely separable across individual data points, such as classification and regression. In contrast, statistical learning tasks involving pairs (or more generally tuples) of data points, such as metric learning, clustering or ranking, do not lend themselves as easily to data-parallelism and in-memory computing. In this paper, we investigate how to balance statistical performance and computational efficiency in such distributed tuplewise statistical problems. We first propose a simple strategy based on occasionally repartitioning data across workers between parallel computation stages, where the number of repartitioning steps governs the trade-off between accuracy and runtime. We then present theoretical results highlighting the benefits of the proposed method in terms of variance reduction, and extend these results to design distributed stochastic gradient descent algorithms for tuplewise empirical risk minimization. Our results are supported by numerical experiments in pairwise statistical estimation and learning on synthetic and real-world datasets.
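
The repartitioning strategy described in the abstract can be illustrated with a small single-machine simulation. This is only a minimal sketch under stated assumptions, not the authors' implementation: workers are simulated as NumPy array blocks, the pairwise kernel is assumed to be h(x, x') = (x - x')^2 / 2 (whose complete U-statistic is the unbiased sample variance), and the function names `local_u_statistic` and `repartitioned_estimate` are hypothetical.

```python
import numpy as np


def local_u_statistic(block):
    """Average the assumed kernel h(x, x') = (x - x')**2 / 2 over all pairs in one block."""
    n = len(block)
    diffs = block[:, None] - block[None, :]
    # The diagonal is zero, so the full sum covers the n * (n - 1) ordered pairs.
    return (diffs ** 2 / 2).sum() / (n * (n - 1))


def repartitioned_estimate(data, n_workers, n_repartitions, rng):
    """Average within-worker U-statistics over several random repartitions."""
    estimates = []
    for _ in range(n_repartitions):
        shuffled = rng.permutation(data)             # reshuffle data across workers
        blocks = np.array_split(shuffled, n_workers) # one block per simulated worker
        estimates.append(np.mean([local_u_statistic(b) for b in blocks]))
    return float(np.mean(estimates))


rng = np.random.default_rng(0)
data = rng.normal(size=1200)

full = local_u_statistic(data)  # complete U-statistic over all pairs
one_shot = repartitioned_estimate(data, n_workers=10, n_repartitions=1, rng=rng)
repeated = repartitioned_estimate(data, n_workers=10, n_repartitions=20, rng=rng)
print(full, one_shot, repeated)
```

With a single partition, only pairs that fall on the same worker are ever compared. Each additional repartition exposes new pairs at the cost of another shuffle, which is the accuracy versus runtime trade-off the abstract refers to.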

Keywords

Distributed Machine Learning; Distributed Data Processing; U-Statistics; AUC Optimization

Domains

Machine Learning [cs.LG]; Machine Learning [stat.ML]

Main file

ecml19.pdf (880.55 KB)

Origin: Files produced by the author(s)

Contributor: Aurélien Bellet

https://inria.hal.science/hal-02166428

Submitted on: Wednesday, June 26, 2019, 18:00:33

Last modified on: Wednesday, January 24, 2024, 09:54:24

Dates and versions

hal-02166428, version 1 (26-06-2019)

Identifiers

HAL Id: hal-02166428, version 1

Cite

Robin Vogel, Aurélien Bellet, Stéphan Clémençon, Ons Jelassi, Guillaume Papa. Trade-offs in Large-Scale Distributed Tuplewise Estimation and Learning. ECML PKDD 2019 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Sep 2019, Würzburg, Germany. ⟨hal-02166428⟩

Collections

INSTITUT-TELECOM CNRS INRIA CRISTAL INRIA2 CRISTAL-MAGNET UNIV-LILLE LTCI IDS S2A IP_PARIS
