hbase-user mailing list archives

From Yu Li <car...@gmail.com>
Subject Re: How spark writes to HBASE
Date Tue, 23 Jan 2018 03:49:11 GMT
Setting hbase.bulkload.locality.sensitive.enabled to true and
hbase.mapreduce.hfileoutputformat.table.name to the target table name
will keep locality on a best-effort basis during the bulk load, FYI. For
more details, please refer to HBASE-12596
<https://issues.apache.org/jira/browse/HBASE-12596>.
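
For example, a minimal sketch of setting these two properties on the
HBase configuration used by the HFile-generating job (the table name
"my_table" is a placeholder):

    import org.apache.hadoop.hbase.HBaseConfiguration

    val conf = HBaseConfiguration.create()
    // Best-effort HFile locality during bulk load (HBASE-12596)
    conf.setBoolean("hbase.bulkload.locality.sensitive.enabled", true)
    conf.set("hbase.mapreduce.hfileoutputformat.table.name", "my_table")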

Best Regards,
Yu

On 23 January 2018 at 01:13, Ted Yu <yuzhihong@gmail.com> wrote:

> I did a search in SHC for saveAsNewHadoop (case insensitive); there was
> no match.
>
> I suggest you use the SHC forum for related questions.
>
> On Mon, Jan 22, 2018 at 9:07 AM, vignesh <vignesh093@gmail.com> wrote:
>
> > It would be similar to case 2, right? Say, for example, in Spark I read
> > a file of size 512 MB, which would span 4 cores (if the block size is
> > 128 MB). Executors will be launched based on data locality, say on
> > machines 1, 2, 3 and 4. Now if block3's region is handled by machine6,
> > then when I bulk load via the Spark HBase connector (which uses
> > saveAsNewAPIHadoopDataset), the HFile write for block3 would go to any
> > of those 4 machines and not to machine6. Is that right? Or have I
> > misunderstood?
> >
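(For context: generating HFiles from Spark for such a bulk load is
typically done with HFileOutputFormat2. A minimal sketch, not SHC's
internal code; "my_table", the output path, and sortedRdd are
assumptions:)

    import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
    import org.apache.hadoop.hbase.client.ConnectionFactory
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
    import org.apache.hadoop.mapreduce.Job

    val conf = HBaseConfiguration.create()
    val conn = ConnectionFactory.createConnection(conf)
    val tn = TableName.valueOf("my_table")
    val job = Job.getInstance(conf)
    // Wires the table's region boundaries into the job so HFiles are cut per region
    HFileOutputFormat2.configureIncrementalLoad(job, conn.getTable(tn), conn.getRegionLocator(tn))
    // sortedRdd: RDD[(ImmutableBytesWritable, KeyValue)] sorted by row key (assumed to exist)
    sortedRdd.saveAsNewAPIHadoopFile("/tmp/hfiles",
      classOf[ImmutableBytesWritable], classOf[KeyValue],
      classOf[HFileOutputFormat2], job.getConfiguration)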
> > On Jan 22, 2018 22:27, "Ted Yu" <yuzhihong@gmail.com> wrote:
> >
> > > For case 1, the HFile would be loaded into the region (via a staging
> > > directory).
> > >
> > > Please see:
> > > http://hbase.apache.org/book.html#arch.bulk.load
> > >
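(A minimal sketch of that load step using the client API, with the HFile
directory and table name as placeholders:)

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.ConnectionFactory
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

    val conf = HBaseConfiguration.create()
    val conn = ConnectionFactory.createConnection(conf)
    val tn = TableName.valueOf("my_table")
    // Moves the generated HFiles into the regions' store directories via a staging dir
    new LoadIncrementalHFiles(conf).doBulkLoad(
      new Path("/tmp/hfiles"), conn.getAdmin, conn.getTable(tn), conn.getRegionLocator(tn))
    conn.close()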
> > > On Mon, Jan 22, 2018 at 8:52 AM, vignesh <vignesh093@gmail.com> wrote:
> > >
> > > > For bulk loads I use the Spark HBase connector provided by
> > > > Hortonworks. For time-series writes I use the normal HBase client
> > > > APIs.
> > > >
> > > > So does that mean that in case 2 (client API write) the write to the
> > > > memstore happens over the network? And in case 1 (bulk load), will
> > > > the HFile be moved to the region server that is supposed to hold it,
> > > > or will it be written locally, with that kept as one copy and the
> > > > second replica going to that particular region's server?
> > > >
> > > > On Jan 22, 2018 22:16, "Ted Yu" <yuzhihong@gmail.com> wrote:
> > > >
> > > > Which connector do you use to perform the write?
> > > >
> > > > bq. Or spark will wisely launch an executor on that machine
> > > >
> > > > I don't think that is the case. Multiple writes may be performed,
> > > > which would end up on different region servers. Spark won't provide
> > > > the affinity described above.
> > > >
> > > > On Mon, Jan 22, 2018 at 7:19 AM, vignesh <vignesh093@gmail.com>
> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I have a Spark job which reads some time-series data and pushes it
> > > > > to HBase using the HBase client API. I am executing this Spark job
> > > > > on a 10-node cluster. Say at first, when Spark kicks off, it picks
> > > > > machine1, machine2 and machine3 as its executors. Now the job
> > > > > inserts a row into HBase. Below is my understanding of what it
> > > > > does.
> > > > >
> > > > > Based on the row key, a particular region (from META) would be
> > > > > chosen, and that row will be pushed to that RegionServer's memstore
> > > > > and WAL; once the memstore is full, it will be flushed to disk. Now
> > > > > assume a particular row is being processed by an executor on
> > > > > machine2, and the RegionServer which handles the region the put is
> > > > > to be made to is on machine6. Will the data be transferred from
> > > > > machine2 to machine6 over the network and then stored in the
> > > > > memstore of machine6? Or will Spark wisely launch an executor on
> > > > > that machine during the write (if dynamic allocation is turned on)
> > > > > and push to it?
> > > > >
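(A minimal sketch of the client-API write path described above, with
table/column names as placeholders; the Put travels over the network to
whichever RegionServer hosts the row's region, regardless of where the
executor runs:)

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("my_table"))
    val put = new Put(Bytes.toBytes("row1"))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"))
    table.put(put) // RPC to the hosting RegionServer's memstore + WAL
    table.close()
    conn.close()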
> > > > >
> > > > > --
> > > > > I.VIGNESH
> > > > >
