systemml-dev mailing list archives

From Matthias Boehm <mboe...@gmail.com>
Subject Re: Roadmap Merge and Rename SystemDS
Date Sat, 21 Mar 2020 21:46:56 GMT
Just FYI, we created a ticket for the suitable name search and shared 
the related results [1]. From my perspective, it really boils down to 
the question of whether we accept the closeness to 'Linux systemd'. Back in 2018 
(when starting SystemDS), I came to the conclusion that it's fine 
because of the very different objectives and because SystemDS reflects 
both the origin from SystemML and its new focus on data science pipelines.

[1] https://issues.apache.org/jira/projects/PODLINGNAMESEARCH/issues/PODLINGNAMESEARCH-179?filter=allissues

Regards,
Matthias

On 3/9/2020 6:37 PM, Matthias Boehm wrote:
> Hi all,
> 
> as you're probably aware, development activities of Apache SystemML 
> significantly slowed down and were virtually non-existent in the last 
> year for various reasons. Part of that was that my team and I [1] 
> decided to start SystemDS [2,3] as a fork of SystemML in 09/2018 with a 
> new vision and roadmap for the future.
> 
> During PMC discussions regarding the retirement of SystemML, we came to 
> the conclusion that the best path forward -- for the entire community 
> -- would be to merge SystemDS back into Apache SystemML, rename it to 
> SystemDS, and continue jointly. Before doing so, I want to share the 
> plan with the entire community.
> 
> SystemDS aims at providing better systems support for the end-to-end 
> data science lifecycle, with a special focus on ML pipelines from data 
> integration, cleaning, and preparation, over efficient ML model 
> training, to model debugging and serving. A key observation is that 
> state-of-the-art data integration and cleaning primitives are themselves 
> based on machine learning. Our main objectives are to support effective 
> and efficient data preparation, ML training and debugging at scale, 
> something that cannot be composed from existing libraries. The game plan 
> includes three major parts:
> 
> 1) DSL-based, High-level Abstractions: We aim to provide a hierarchy of 
> abstractions for the different lifecycle tasks as well as users with 
> different expertise (ML researchers, data scientists, domain experts), 
> based on our DSL for ML training and scoring. Exploratory data science 
> interleaves data preparation, ML training, scoring, and debugging in an 
> iterative process; and once these tasks are expressed in dense or sparse 
> linear algebra, we expect very good performance.
> 
> 2) Hybrid Runtime Plans and Optimizing Compiler: To support the wide 
> variety of algorithm classes, we will continue to provide different 
> parallelization strategies, enriched by a new backend for federated ML 
> and privacy enhancing technologies. Since the hierarchy of language 
> abstractions inevitably leads to redundancy, we further aim to improve 
> the automatic optimization capabilities of the compiler and underlying 
> runtime.
> 
> 3) Data Model - Heterogeneous Tensors: Supporting data integration and 
> cleaning primitives in linear algebra programs requires a more generic 
> data model for handling heterogeneous and structured data. In contrast 
> to existing ML systems, our central data model is heterogeneous 
> tensors. Thus, we generalize SystemML's FP64 matrices to 
> multi-dimensional arrays where one dimension may have a schema including 
> JSON strings to represent nested data.
> 
> Admin: We intend to create the SystemDS 0.2 release in March. Afterwards 
> we would then rebase all our commits (369) back onto the SystemML 
> codeline. Subsequently, we will rename Apache SystemML to Apache 
> SystemDS and continue our development under the Apache umbrella. I just went 
> through the Apache name search guidelines and we'll perform a 'suitable 
> name search' accordingly and then transfer SystemDS. The existing PMC 
> and committer status stays of course intact unless people want to leave. 
> Shortly after the merge, I will nominate the four most active 
> contributors of the last year to become committers. Regarding releases 
> (and JIRA numbers), it's up for discussion, but both options, continuing with 
> SystemML versions (i.e., 1.3) or SystemDS versions (0.3), seem fine to me.
> 
> Roadmap: At a technical level, SystemDS will continue to support all 
> operations and algorithms SystemML provided but significantly extend the 
> scope and functionality via the mentioned hierarchy of language 
> abstractions (in the form of builtin functions). However, during the fork we 
> already removed old baggage like the MR backend, the script-level 
> debugger, the PyDML frontend and several other things [4]. Major new 
> internals are native support for lineage tracing and reuse, the data 
> model of heterogeneous tensors, and a new federated backend.
> 
> [1] https://damslab.github.io/
> [2] https://github.com/tugraz-isds/systemds
> [3] http://cidrdb.org/cidr2020/papers/p22-boehm-cidr20.pdf
> [4] https://github.com/tugraz-isds/systemds/releases/tag/v0.1.0
> 
> Regards,
> Matthias
