systemml-dev mailing list archives

From Matthias Boehm <mboe...@gmail.com>
Subject Roadmap Merge and Rename SystemDS
Date Mon, 09 Mar 2020 17:37:22 GMT
Hi all,

as you're probably aware, development activity on Apache SystemML 
slowed down significantly and was virtually non-existent over the last 
year for various reasons. Part of that was that my team and I [1] 
decided to start SystemDS [2,3] as a fork of SystemML in 09/2018 with a 
new vision and roadmap for the future.

During PMC discussions regarding the retirement of SystemML, we came to 
the conclusion that the best path forward -- for the entire community 
-- would be to merge SystemDS back into Apache SystemML, rename it to 
SystemDS, and continue jointly. Before doing so, I want to share the 
plan with the entire community.

SystemDS aims at providing better systems support for the end-to-end 
data science lifecycle, with a special focus on ML pipelines from data 
integration, cleaning, and preparation, over efficient ML model 
training, to model debugging and serving. A key observation is that 
state-of-the-art data integration and cleaning primitives are themselves 
based on machine learning. Our main objectives are to support effective 
and efficient data preparation, ML training and debugging at scale, 
something that cannot be composed from existing libraries. The game plan 
includes three major parts:

1) DSL-based, High-level Abstractions: We aim to provide a hierarchy of 
abstractions for the different lifecycle tasks as well as for users with 
different expertise (ML researchers, data scientists, domain experts), 
based on our DSL for ML training and scoring. Exploratory data science 
interleaves data preparation, ML training, scoring, and debugging in an 
iterative process; and once these tasks are expressed in dense or sparse 
linear algebra, we expect very good performance.

2) Hybrid Runtime Plans and Optimizing Compiler: To support the wide 
variety of algorithm classes, we will continue to provide different 
parallelization strategies, enriched by a new backend for federated ML 
and privacy enhancing technologies. Since the hierarchy of language 
abstractions inevitably leads to redundancy, we further aim to improve 
the automatic optimization capabilities of the compiler and underlying 
runtime.

3) Data Model - Heterogeneous Tensors: Supporting data integration and 
cleaning primitives in linear algebra programs requires a more generic 
data model for handling heterogeneous and structured data. In contrast 
to existing ML systems, our central data model is heterogeneous 
tensors. Thus, we generalize SystemML's FP64 matrices to 
multi-dimensional arrays where one dimension may have a schema including 
JSON strings to represent nested data.
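
To make the data model in 3) more concrete, here is a minimal Python sketch of a 2D tensor whose second dimension carries a schema, including a JSON column for nested data. The class name, the value-type names (FP64, INT64, STRING, JSON), and the methods are assumptions made for illustration only; they are not the SystemDS API or its internal representation.

```python
import json
from dataclasses import dataclass
from typing import Any, List

# Hypothetical value types and their parsers (illustration only,
# not SystemDS's actual type system).
CASTS = {"FP64": float, "INT64": int, "STRING": str, "JSON": json.loads}


@dataclass
class HeterogeneousTensor2D:
    """Toy 2D tensor whose second dimension (columns) has a schema."""
    schema: List[str]
    rows: List[List[Any]]

    @classmethod
    def from_raw(cls, schema, raw_rows):
        # Parse every raw string cell according to its column's value type.
        rows = []
        for raw in raw_rows:
            if len(raw) != len(schema):
                raise ValueError("row length does not match schema")
            rows.append([CASTS[t](v) for t, v in zip(schema, raw)])
        return cls(list(schema), rows)

    def numeric_column(self, j):
        # Only numeric columns can be handed to linear algebra operations.
        if self.schema[j] not in ("FP64", "INT64"):
            raise TypeError(f"column {j} has non-numeric type {self.schema[j]}")
        return [float(r[j]) for r in self.rows]


t = HeterogeneousTensor2D.from_raw(
    ["FP64", "STRING", "JSON"],
    [["1.5", "a", '{"tags": ["x", "y"]}'],
     ["2.0", "b", '{"tags": []}']],
)
```

In this sketch, a plain FP64 matrix is the special case where every schema entry is FP64, which is the sense in which the tensor generalizes the matrix.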

Admin: We intend to create the SystemDS 0.2 release in March. Afterwards, 
we will rebase all our commits (369) back onto the SystemML 
codeline. Subsequently, we will rename Apache SystemML to Apache 
SystemDS and continue our development under the Apache umbrella. I just 
went through the Apache name search guidelines and we'll perform a 
'suitable name search' accordingly and then transfer SystemDS. The 
existing PMC and committer status will of course stay intact unless 
people want to leave. 
Shortly after the merge, I will nominate the four most active 
contributors of the last year to become committers. Regarding releases 
(and JIRA numbers), this is up for discussion, but either continuing with 
SystemML versions (i.e., 1.3) or switching to SystemDS versions (0.3) 
seems fine to me.

Roadmap: At a technical level, SystemDS will continue to support all 
operations and algorithms SystemML provided but significantly extend the 
scope and functionality via the mentioned hierarchy of language 
abstractions (in the form of builtin functions). However, during the 
fork we already removed old baggage like the MR backend, the script-level 
debugger, the PyDML frontend, and several other things [4]. Major new 
internals are native support for lineage tracing and reuse, the data 
model of heterogeneous tensors, and a new federated backend.
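
As a rough illustration of lineage tracing and reuse, the following Python sketch identifies each intermediate by its lineage (the operation plus the lineage of its inputs) and reuses a cached result when the same lineage is requested again. All names here (LineageCache, execute, literal) are hypothetical and chosen for this example; the sketch does not reflect how SystemDS implements the feature.

```python
class LineageCache:
    """Toy cache keyed by lineage: nested tuples (opcode, input lineages...)."""

    def __init__(self):
        self._cache = {}
        self.hits = 0

    def execute(self, opcode, fn, *inputs):
        # Each input is a (lineage, value) pair; the output lineage is the
        # opcode combined with the lineage of every input.
        lineage = (opcode,) + tuple(lin for lin, _ in inputs)
        if lineage in self._cache:
            # Same operation on the same inputs: reuse instead of recomputing.
            self.hits += 1
            return lineage, self._cache[lineage]
        value = fn(*(val for _, val in inputs))
        self._cache[lineage] = value
        return lineage, value


def literal(name, value):
    # Leaf of the lineage graph: a named input.
    return ("lit", name), value


cache = LineageCache()
x = literal("X", [1.0, 2.0, 3.0])
s1 = cache.execute("sum", sum, x)  # computed and cached
s2 = cache.execute("sum", sum, x)  # served from the cache
```

The sketch assumes deterministic operations and inputs that do not change under the same name; otherwise reusing by lineage would return stale results.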

[1] https://damslab.github.io/
[2] https://github.com/tugraz-isds/systemds
[3] http://cidrdb.org/cidr2020/papers/p22-boehm-cidr20.pdf
[4] https://github.com/tugraz-isds/systemds/releases/tag/v0.1.0

Regards,
Matthias
