From: Aishwarya Chaurasia
Date: Wed, 19 Apr 2017 18:23:30 +0530
Subject: Re: Regarding incubator systemml/breast_cancer project
To: dev@systemml.incubator.apache.org
References: <3BBAF0E1-A2E9-4CDB-AF7F-137A0529C473@gmail.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a114ec68ce9e54f054d848240

--001a114ec68ce9e54f054d848240
Content-Type: text/plain; charset=UTF-8

Hello sir,

We also wanted to ensure that the spark-submit command we're using is the
correct one for running 'preprocess.py'.

Command : /home/new/sparks/bin/spark-submit preprocess.py

Thank you.
Aishwarya Chaurasia.

On 19-Apr-2017 3:55 PM, "Aishwarya Chaurasia" wrote:

Hello sir,
On running the file preprocess.py we are getting the following error :
https://paste.fedoraproject.org/paste/IAvqiiyJChSC0V9eeETe2F5M1UNdIGYhyRLivL9gydE=
Can you please help us by looking into the error and kindly tell us the
solution for it.
Thanks a lot.
Aishwarya Chaurasia

On 19-Apr-2017 12:43 AM, wrote:

> Hi Aishwarya,
>
> Certainly, here is some more detailed information about `preprocess.py`:
>
> * The preprocessing Python script is located at
>   https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/preprocess.py.
>   Note that this is different than the library module at
>   https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py.
> * This script is used to preprocess a set of histology slide images,
>   which are `.svs` files in our case, and `.tiff` files in your case.
> * Lines 63-79 contain "settings" such as the output image sizes, folder
>   paths, etc. Of particular interest, line 72 has the folder path for the
>   original slide images that should be commonly accessible from all
>   machines being used, and lines 74-79 contain the names of the output
>   DataFrames that will be saved.
> * Line 82 performs the actual preprocessing and creates a Spark
>   DataFrame with the following columns: slide number, tumor score,
>   molecular score, sample.
>   The "sample" in this case is the actual small, chopped-up
>   section of the image that has been extracted and flattened into a row
>   Vector. For test images without labels (`training=false`), only the
>   slide number and sample will be contained in the DataFrame (i.e. no
>   labels). This calls the `preprocess(...)` function located on line 371 of
>   https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py,
>   which is a different file.
> * Line 87 simply saves the above DataFrame to HDFS with the name from
>   line 74.
> * Line 93 splits the above DataFrame row-wise into separate "training"
>   and "validation" DataFrames, based on the split percentage from line 70
>   (`train_frac`). This is performed so that downstream machine learning
>   tasks can learn from the training set, and validate performance and
>   hyperparameter choices on the validation set. These DataFrames will
>   start with the same columns as the above DataFrame. If `add_row_indices`
>   from line 69 is true, then an additional row index column (`__INDEX`)
>   will be prepended. This is useful for SystemML in downstream machine
>   learning tasks as it gives the DataFrame row numbers like a real matrix
>   would have, and SystemML is built to operate on matrices.
> * Lines 97 & 98 simply save the training and validation DataFrames using
>   the names defined on lines 76 & 78.
> * Lines 103-137 create smaller train and validation DataFrames by taking
>   small row-wise samples of the full train and validation DataFrames. The
>   percentage of the sample is defined on line 111 (`p=0.01` for a 1%
>   sample). This is generally useful for quicker downstream tasks without
>   having to load in the larger DataFrames, assuming you have a large
>   amount of data. For us, we have ~7TB of data, so having 1% sampled
>   DataFrames is useful for quicker downstream tests. Once again, the same
>   columns from the larger train and validation DataFrames will be used.
> * Lines 146 & 147 simply save these sampled train and validation
>   DataFrames.
>
> As a summary, after running `preprocess.py`, you will be left with the
> following saved DataFrames in HDFS:
> * Full DataFrame
> * Training DataFrame
> * Validation DataFrame
> * Sampled training DataFrame
> * Sampled validation DataFrame
>
> As for visualization, you may visualize a "sample" (i.e. small, chopped-up
> section of original image) from a DataFrame by using the
> `breastcancer.visualization.visualize_sample(...)` function. You will
> need to do this after creating the DataFrames. Here is a snippet to
> visualize the first row sample in a DataFrame, where `df` is one of the
> DataFrames from above:
>
> ```
> from breastcancer.visualization import visualize_sample
> visualize_sample(df.first().sample)
> ```
>
> Please let me know if you have any additional questions.
>
> Thanks!
>
> - Mike
>
> --
>
> Mike Dusenberry
> GitHub: github.com/dusenberrymw
> LinkedIn: linkedin.com/in/mikedusenberry
>
> Sent from my iPhone.
>
>
> > On Apr 15, 2017, at 4:38 AM, Aishwarya Chaurasia <aishwarya2612@gmail.com> wrote:
> >
> > Hello sir,
> > Can you please elaborate more on what output we would be getting because we
> > tried executing the preprocess.py file using spark submit it keeps on
> > adding the tiles in rdd and while running the visualisation.py file it
> > isn't showing any output. Can you please help us out asap stating the
> > output we will be getting and the sequence of execution of files.
> > Thank you.
> >
> >> On 07-Apr-2017 5:54 AM, wrote:
> >>
> >> Hi Aishwarya,
> >>
> >> Thanks for sharing more info on the issue!
> >>
> >> To facilitate easier usage, I've updated the preprocessing code by pulling
> >> out most of the logic into a `breastcancer/preprocessing.py` module,
> >> leaving just the execution in the `Preprocessing.ipynb` notebook. There is
> >> also a `preprocess.py` script with the same contents as the notebook for
> >> use with `spark-submit`.
> >> The choice of the notebook or the script is just
> >> a matter of convenience, as they both import from the same
> >> `breastcancer/preprocessing.py` package.
> >>
> >> As part of the updates, I've added an explicit SparkSession parameter
> >> (`spark`) to the `preprocess(...)` function, and updated the body to use
> >> this SparkSession object rather than the older SparkContext `sc` object.
> >> Previously, the `preprocess(...)` function accessed the `sc` object that
> >> was pulled in from the enclosing scope, which would work while all of the
> >> code was colocated within the notebook, but not if the code was extracted
> >> and imported. The explicit parameter now allows for the code to be
> >> imported.
> >>
> >> Can you please try again with the latest updates? We are currently using
> >> Spark 2.x with Python 3. If you use the notebook, the pyspark kernel
> >> should have a `spark` object available that can be supplied to the
> >> functions (as is done now in the notebook), and if you use the
> >> `preprocess.py` script with `spark-submit`, the `spark` object will be
> >> created explicitly by the script.
> >>
> >> For a bit of context to others, Aishwarya initially reached out to find
> >> out if our breast cancer project could be applied to TIFF images, rather
> >> than the SVS images we are currently using (the answer is "yes" so long as
> >> they are "generic tiled TIFF" images, according to the OpenSlide
> >> documentation), and then followed up with Spark issues related to the
> >> preprocessing code. This conversation has been promptly moved to the
> >> mailing list so that others in the community can benefit.
> >>
> >>
> >> Thanks!
> >>
> >> -Mike
> >>
> >> --
> >>
> >> Mike Dusenberry
> >> GitHub: github.com/dusenberrymw
> >> LinkedIn: linkedin.com/in/mikedusenberry
> >>
> >> Sent from my iPhone.
> >>
> >>
> >>> On Apr 6, 2017, at 5:09 AM, Aishwarya Chaurasia <aishwarya2612@gmail.com> wrote:
> >>>
> >>> Hey,
> >>>
> >>> The object sc is already defined in pyspark and yet this name error keeps
> >>> occurring. We are using spark 2.*
> >>>
> >>> Here is the link to error that we are getting :
> >>> https://paste.fedoraproject.org/paste/89iQODxzpNZVbSfgwocH8l5M1UNdIGYhyRLivL9gydE=
> >>
>

--001a114ec68ce9e54f054d848240--
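
For anyone hitting the same `sc` name error discussed in this thread, here
is a minimal sketch, without Spark, of why a function that reads `sc` from
its enclosing scope can stop working once it is moved into an imported
module, and why an explicit parameter avoids the problem. The module names
`old_preprocessing` and `new_preprocessing` and the placeholder string are
hypothetical stand-ins, not the project's actual code, and this does not
reproduce the exact traceback from the paste links.

```python
import types

# Hypothetical stand-in for a library module whose function reads a global
# named `sc` that the module itself never defines.
old_module = types.ModuleType("old_preprocessing")
exec(
    "def preprocess():\n"
    "    return sc.upper()  # looks up `sc` in this module's globals\n",
    old_module.__dict__,
)

# Same idea, but the dependency is passed in as an explicit parameter.
new_module = types.ModuleType("new_preprocessing")
exec(
    "def preprocess(spark):\n"
    "    return spark.upper()\n",
    new_module.__dict__,
)

# The caller (notebook or script) has its own object. A plain string is used
# here as a placeholder for a real SparkContext/SparkSession.
sc = "session"


def call_old():
    """Call the old-style function and report the NameError as text."""
    try:
        return old_module.preprocess()
    except NameError as err:
        return "NameError: " + str(err)


# `sc` exists in the caller's namespace, but not in the module's namespace,
# so the old-style function cannot see it.
old_result = call_old()

# Passing the object explicitly works regardless of where the function lives.
new_result = new_module.preprocess(sc)

print(old_result)
print(new_result)
```

In the project itself, the analogous change is the explicit `spark`
parameter on `preprocess(...)` that Mike describes above.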