Hi Shirish,
Thank you for the explanation. That clears everything up. I misinterpreted
the meaning of the error message.
I'll add that useful table to the documentation.
Deron
On Wed, Dec 9, 2015 at 5:59 PM, Shirish Tatikonda <
shirish.tatikonda@gmail.com> wrote:
> Hi Deron,
>
> As the error says ("A column can not be binned and scaled."), no column can
> be subjected to both *binning* and *scaling*, because the combination does
> not make sense. *Binning* turns a scale column (one with continuous values)
> into a categorical column, whereas *scaling* can only be applied to
> continuous values.
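>
> For example, a spec fragment like the following (a minimal sketch built
> around the "sqft" column from your example) would trigger that error,
> because "sqft" appears under both "bin" and "scale":
>
> {
>   "bin":   [ { "name": "sqft", "method": "equi-width", "numbins": 4 } ]
>  ,"scale": [ { "name": "sqft", "method": "mean-subtraction" } ]
> }
>
> Dropping "sqft" from either list resolves the conflict.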
>
> The error *does not* mean that *Scaling* is not supported. We do support
> *Scaling*.
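>
> Scaling on its own works fine. For instance, this fragment from your spec
> is valid by itself, since "askingprice" appears in no other transformation
> list:
>
> "scale": [ { "name": "askingprice", "method": "z-score" } ]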
>
> At some point, I wanted to add the following table (currently present only
> as comments in the Java code) to our documentation, to indicate which
> transformations can be used *simultaneously* on a single column. While you
> are at it, could you make sure it gets added?
>
> x indicates the combination is invalid.
> * indicates the combination is allowed.
> - indicates the combination is not applicable.
>
>      OMIT MVI RCD BIN DCD SCL
> OMIT  -   x   *   *   *   *
> MVI   x   -   *   *   *   *
> RCD   *   *   -   x   *   x
> BIN   *   *   x   -   *   x
> DCD   *   *   *   *   -   x
> SCL   *   *   x   x   x   -
>
> OMIT = Missing value handling by *omitting* rows
> MVI = Missing value handling by *imputation*
> RCD = Recoding
> BIN = Binning
> DCD = Dummycoding
> SCL = Scaling
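>
> To read the table: the BIN/DCD cell is *, so a binned column may also be
> dummycoded, as your spec already does with "saleprice":
>
> "bin":       [ { "name": "saleprice", "method": "equi-width", "numbins": 3 } ]
> ,"dummycode": [ "saleprice" ]
>
> By contrast, the SCL row has x under RCD, BIN, and DCD, so a scaled column
> cannot also be recoded, binned, or dummycoded.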
>
> Let me know if you have any further questions.
>
> Thank you,
> Shirish
>
>
> On Wed, Dec 9, 2015 at 4:53 PM, Deron Eriksson <deroneriksson@gmail.com>
> wrote:
>
> > Hi,
> >
> > I'm working on updating the online docs for the DML transform() function,
> > since a couple of things didn't copy over in the conversion to markdown.
> > However, I've run into an issue when I execute the transform() example. In
> > summary, is the "scale" transformation no longer allowed, while "bin" still
> > is?
> >
> > I did the following:
> >
> > I created data.csv:
> >
> >
> > zipcode,district,sqft,numbedrooms,numbathrooms,floors,view,saleprice,askingprice
> > 95141,south,3002,6,3,2,FALSE,929,934
> > NA,west,1373,,1,3,FALSE,695,698
> > 91312,south,NA,6,2,2,FALSE,902,
> > 94555,NA,1835,3,,3,,888,892
> > 95141,west,2770,5,2.5,,TRUE,812,816
> > 95141,east,2833,6,2.5,2,TRUE,927,
> > 96334,NA,1339,6,3,1,FALSE,672,675
> > 96334,south,2742,6,2.5,2,FALSE,872,876
> > 96334,north,2195,5,2.5,2,FALSE,799,803
> >
> > I created data.csv.mtd:
> >
> > {
> >     "data_type": "frame",
> >     "format": "csv",
> >     "sep": ",",
> >     "header": true,
> >     "na.strings": [ "NA", "" ]
> > }
> >
> > I created data.spec.json:
> >
> > {
> > "omit": [ "zipcode" ]
> > ,"impute":
> > [ { "name": "district" , "method": "constant", "value": "south" }
> > ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
> > ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
> > ,{ "name": "floors" , "method": "constant", "value": 1 }
> > ,{ "name": "view" , "method": "global_mode" }
> > ,{ "name": "askingprice" , "method": "global_mean" }
> > ]
> >
> > ,"recode":
> > [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors",
> > "view" ]
> >
> > ,"bin":
> > [ { "name": "saleprice" , "method": "equi-width", "numbins": 3 }
> > ,{ "name": "sqft" , "method": "equi-width", "numbins": 4 }
> > ]
> >
> > ,"dummycode":
> > [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]
> >
> > ,"scale":
> > [ { "name": "sqft", "method": "mean-subtraction" }
> > ,{ "name": "saleprice", "method": "z-score" }
> > ,{ "name": "askingprice", "method": "z-score" }
> > ]
> > }
> >
> > I executed the following DML:
> >
> > D = read("data.csv");
> > tfD = transform(target=D,
> > transformSpec="data.spec.json",
> > transformPath="example-transform");
> > s = sum(tfD);
> > print("Sum = " + s);
> >
> > This generated the following error:
> >
> > java.lang.IllegalArgumentException: Invalid transformations on column ID 3.
> > A column can not be binned and scaled.
> >
> > So, I removed the "scale" section from data.spec.json:
> >
> > {
> > "omit": [ "zipcode" ]
> > ,"impute":
> > [ { "name": "district" , "method": "constant", "value": "south" }
> > ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
> > ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
> > ,{ "name": "floors" , "method": "constant", "value": 1 }
> > ,{ "name": "view" , "method": "global_mode" }
> > ,{ "name": "askingprice" , "method": "global_mean" }
> > ]
> >
> > ,"recode":
> > [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors",
> > "view" ]
> >
> > ,"bin":
> > [ { "name": "saleprice" , "method": "equi-width", "numbins": 3 }
> > ,{ "name": "sqft" , "method": "equi-width", "numbins": 4 }
> > ]
> >
> > ,"dummycode":
> > [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]
> >
> > }
> >
> > This generated:
> >
> > java.lang.RuntimeException: Encountered "NA" in column ID "3", when
> > expecting a numeric value. Consider adding "NA" to na.strings, along with
> > an appropriate imputation method.
> >
> > So, I added "sqft" with method "global_mean" to the "impute" section of the spec:
> >
> > {
> > "omit": [ "zipcode" ]
> > ,"impute":
> > [ { "name": "district" , "method": "constant", "value": "south" }
> > ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
> > ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
> > ,{ "name": "floors" , "method": "constant", "value": 1 }
> > ,{ "name": "view" , "method": "global_mode" }
> > ,{ "name": "askingprice" , "method": "global_mean" }
> > ,{ "name": "sqft" , "method": "global_mean" }
> > ]
> >
> > ,"recode":
> > [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors",
> > "view" ]
> >
> > ,"bin":
> > [ { "name": "saleprice" , "method": "equi-width", "numbins": 3 }
> > ,{ "name": "sqft" , "method": "equi-width", "numbins": 4 }
> > ]
> >
> > ,"dummycode":
> > [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]
> >
> > }
> >
> > This allowed the DML to execute successfully.
> >
> > So, is "scale" no longer allowed? And is "bin" allowed, despite the error
> > message suggesting it isn't?
> >
> > Thank you,
> > Deron
> >
>