systemml-dev mailing list archives

Subject Re: Regarding incubator systemml/breast_cancer project
Date Tue, 18 Apr 2017 19:13:05 GMT
Hi Aishwarya,

Certainly, here is some more detailed information about``:

  * The preprocessing Python script is located at
 Note that this is different than the library module at

  * This script is used to preprocess a set of histology slide images, which are `.svs` files
in our case, and `.tiff` files in your case.
  * Lines 63-79 contain "settings" such as the output image sizes, folder paths, etc.  Of
particular interest, line 72 has the folder path for the original slide images that should
be commonly accessible from all machines being used, and lines 74-79 contain the names of
the output DataFrames that will be saved.
  * Line 82 performs the actual preprocessing and creates a Spark DataFrame with the following
columns: slide number, tumor score, molecular score, sample.  The "sample" in this case is
the actual small, chopped-up section of the image that has been extracted and flattened into
a row Vector.  For test images without labels (`training=false`), only the slide number and
sample will be contained in the DataFrame (i.e. no labels).  This calls the `preprocess(...)`
function located on line 371 of,
which is a different file.
  * Line 87 simply saves the above DataFrame to HDFS with the name from line 74.
  * Line 93 splits the above DataFrame row-wise into separate "training" and "validation"
DataFrames, based on the split percentage from line 70 (`train_frac`).  This is performed
so that downstream machine learning tasks can learn from the training set, and validate performance
and hyperparameter choices on the validation set.  These DataFrames will start with the same
columns as the above DataFrame.  If `add_row_indices` from line 69 is true, then an additional
row index column (`__INDEX`) will be prepended.  This is useful for SystemML in downstream
machine learning tasks as it gives the DataFrame row numbers like a real matrix would have,
and SystemML is built to operate on matrices.
  * Lines 97 & 98 simply save the training and validation DataFrames using the names defined
on lines 76 & 78.
  * Lines 103-137 create smaller train and validation DataFrames by taking small row-wise
samples of the full train and validation DataFrames.  The percentage of the sample is defined
on line 111 (`p=0.01` for a 1% sample).  This is generally useful for quicker downstream tasks
without having to load in the larger DataFrames, assuming you have a large amount of data.
 For us, we have ~7TB of data, so having 1% sampled DataFrames is useful for quicker downstream
tests.  Once again, the same columns from the larger train and validation DataFrames will
be used.
  * Lines 146 & 147 simply save these sampled train and validation DataFrames.
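
To make the split, row-index, and sampling steps above more concrete, here is a minimal plain-Python sketch of the same logic on a list of rows.  It is not the actual script: the helper names (`split_rows`, `add_row_index`, `sample_rows`), the column names, the 0.8 split fraction, and the 1-based index are all assumptions for illustration.  The real code operates on Spark DataFrames, where the closest built-ins would be along the lines of `DataFrame.randomSplit` and `DataFrame.sample`.

```python
import random


def split_rows(rows, train_frac, seed=0):
    """Randomly assign each row to train (with prob. train_frac) or validation."""
    rng = random.Random(seed)
    train, val = [], []
    for row in rows:
        (train if rng.random() < train_frac else val).append(row)
    return train, val


def add_row_index(rows, start=1):
    """Prepend an `__INDEX` column; start=1 (1-based) is an assumption."""
    return [{"__INDEX": i, **row} for i, row in enumerate(rows, start=start)]


def sample_rows(rows, p, seed=0):
    """Keep each row independently with probability p (e.g. p=0.01 for ~1%)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < p]


# Toy stand-in for the full DataFrame: one dict per extracted sample.
# Column names here are hypothetical.
rows = [
    {"slide_num": i // 10, "tumor_score": 1 + i % 3, "sample": [0.0] * 4}
    for i in range(1000)
]

train, val = split_rows(rows, train_frac=0.8)          # row-wise split
train, val = add_row_index(train), add_row_index(val)  # prepend __INDEX
train_small = sample_rows(train, p=0.01)               # ~1% samples
val_small = sample_rows(val, p=0.01)
```

The sampled sets keep exactly the same columns as the sets they were drawn from, which is why they can stand in for the full data in quick downstream tests.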

As a summary, after running ``, you will be left with the following saved DataFrames
in HDFS:
  * Full DataFrame
  * Training DataFrame
  * Validation DataFrame
  * Sampled training DataFrame
  * Sampled validation DataFrame

As for visualization, you may visualize a "sample" (i.e. a small, chopped-up section of the original
image) from a DataFrame by using the `breastcancer.visualization.visualize_sample(...)` function.
You will need to do this after creating the DataFrames.  Here is a snippet to visualize the
first row sample in a DataFrame, where `df` is one of the DataFrames from above:

from breastcancer.visualization import visualize_sample
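
For intuition about what gets plotted: each "sample" is a flattened vector, so visualizing it amounts to reshaping it back into an image.  Below is a minimal plain-Python sketch of that reshaping, assuming a square sample that was flattened channels-first.  The layout, the `unflatten_sample` helper, and the toy data are all assumptions for illustration; this is not the `visualize_sample` implementation.

```python
def unflatten_sample(vec, size, channels=3):
    """Reshape a flat vector into nested lists indexed as [row][col][channel].

    Assumes the vector was flattened from a (channels, size, size) array;
    this layout is an assumption made for illustration.
    """
    if len(vec) != channels * size * size:
        raise ValueError("vector length does not match channels * size * size")
    plane = size * size  # number of pixels in one channel
    return [
        [[vec[c * plane + y * size + x] for c in range(channels)]
         for x in range(size)]
        for y in range(size)
    ]


# Toy 2x2 RGB sample: channel 0 is all 10s, channel 1 all 20s, channel 2 all 30s.
flat = [10] * 4 + [20] * 4 + [30] * 4
img = unflatten_sample(flat, size=2)
```

The resulting rows-by-columns-by-channels structure is the shape most image plotting functions expect.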

Please let me know if you have any additional questions.


- Mike


Mike Dusenberry

Sent from my iPhone.

> On Apr 15, 2017, at 4:38 AM, Aishwarya Chaurasia <> wrote:
> Hello sir,
> Can you please elaborate more on what output we would be getting because we
> tried executing the file using spark submit it keeps on
> adding the tiles in rdd and while running the file it
> isn't showing any output. Can you please help us out asap stating the
> output we will be getting and the sequence of execution of files.
> Thank you.
>> On 07-Apr-2017 5:54 AM, <> wrote:
>> Hi Aishwarya,
>> Thanks for sharing more info on the issue!
>> To facilitate easier usage, I've updated the preprocessing code by pulling
>> out most of the logic into a `breastcancer/` module,
>> leaving just the execution in the `Preprocessing.ipynb` notebook.  There is
>> also a `` script with the same contents as the notebook for
>> use with `spark-submit`.  The choice of the notebook or the script is just
>> a matter of convenience, as they both import from the same
>> `breastcancer/` package.
>> As part of the updates, I've added an explicit SparkSession parameter
>> (`spark`) to the `preprocess(...)` function, and updated the body to use
>> this SparkSession object rather than the older SparkContext `sc` object.
>> Previously, the `preprocess(...)` function accessed the `sc` object that
>> was pulled in from the enclosing scope, which would work while all of the
>> code was colocated within the notebook, but not if the code was extracted
>> and imported.  The explicit parameter now allows for the code to be
>> imported.
>> Can you please try again with the latest updates?  We are currently using
>> Spark 2.x with Python 3.  If you use the notebook, the pyspark kernel
>> should have a `spark` object available that can be supplied to the
>> functions (as is done now in the notebook), and if you use the
>> `` script with `spark-submit`, the `spark` object will be
>> created explicitly by the script.
>> For a bit of context to others, Aishwarya initially reached out to find
>> out if our breast cancer project could be applied to TIFF images, rather
>> than the SVS images we are currently using (the answer is "yes" so long as
>> they are "generic tiled TIFF" images, according to the OpenSlide
>> documentation), and then followed up with Spark issues related to the
>> preprocessing code.  This conversation has been promptly moved to the
>> mailing list so that others in the community can benefit.
>> Thanks!
>> -Mike
>> --
>> Mike Dusenberry
>> Sent from my iPhone.
>>> On Apr 6, 2017, at 5:09 AM, Aishwarya Chaurasia <> wrote:
>>> Hey,
>>> The object sc is already defined in pyspark and yet this name error keeps
>>> occurring. We are using spark 2.*
>>> Here is the link to error that we are getting :
>> YhyRLivL9gydE=
