kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Some bulk requests are missing when a tserver stopped
Date Mon, 24 Apr 2017 21:45:27 GMT
I think it's also worth trying 'kudu cluster ksck -checksum_scan
<master1,master2,master3>' to perform a consistency check. This will ensure
that the available replicas have matching data (and uses the SNAPSHOT scan
mode to avoid the inconsistency that David mentioned above).
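
For example, something like the following (the host names are placeholders for
your actual master addresses; 7051 is the default master RPC port):

  kudu cluster ksck -checksum_scan master1:7051,master2:7051,master3:7051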

On Mon, Apr 24, 2017 at 2:38 PM, David Alves <davidralves@gmail.com> wrote:

> Hi Jason
>
>   What do you mean that 2% are missing? Were you not able to insert them
> (got a timeout), or were there no errors but you can't see the rows in the
> result of a scan?
>   How are you checking that all the rows are there? Through a regular scan
> in Spark? In particular, the default ReadMode for scans makes no guarantees
> about replica recency, so it might happen that when you kill a tablet
> server, the other chosen replica is not up-to-date and returns fewer rows.
> In that case it's not that the rows are missing, just that the replica that
> served the scan doesn't have them yet.
>   These kinds of checks should likely be done with the READ_AT_SNAPSHOT
> ReadMode, but even if you can't change ReadModes, do you still observe that
> rows are missing if you run the scans again?
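>   For reference, a minimal sketch of such a check with the Kudu Java client
> in Scala (the master address and table name below are placeholders, not
> anything specific to your cluster):
>
>     import org.apache.kudu.client.KuduClient
>     import org.apache.kudu.client.AsyncKuduScanner.ReadMode
>
>     val client = new KuduClient.KuduClientBuilder("master1:7051").build()
>     val table = client.openTable("my_table")  // placeholder table name
>     // READ_AT_SNAPSHOT scans at a single timestamp, so the row count does
>     // not depend on which replica happens to serve the scan.
>     val scanner = client.newScannerBuilder(table)
>       .readMode(ReadMode.READ_AT_SNAPSHOT)
>       .build()
>     var count = 0L
>     while (scanner.hasMoreRows) {
>       val rows = scanner.nextRows()
>       while (rows.hasNext) { rows.next(); count += 1 }
>     }
>     scanner.close()
>     println(s"rows seen: $count")
>     client.shutdown()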
>   Currently some throttling might be required to make sure that the
> clients don't overload the server with writes, which causes writes to start
> timing out. More efficient bulk loads are something we're working on right
> now.
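>   If you do need to throttle writes from a plain Java/Scala client in the
> meantime, a rough sketch (made-up table and column names) is to use
> MANUAL_FLUSH and flush in bounded batches:
>
>     import org.apache.kudu.client.{KuduClient, SessionConfiguration}
>
>     val client = new KuduClient.KuduClientBuilder("master1:7051").build()
>     val table = client.openTable("my_table")     // placeholder
>     val session = client.newSession()
>     // Buffer a bounded number of operations and flush them explicitly so
>     // only one batch at a time is outstanding against the tservers.
>     session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH)
>     session.setMutationBufferSpace(2000)
>     var buffered = 0
>     for (i <- 0L until 100000L) {
>       val insert = table.newInsert()
>       insert.getRow.addLong("id", i)              // made-up schema
>       session.apply(insert)
>       buffered += 1
>       if (buffered == 1000) { session.flush(); buffered = 0 }
>     }
>     session.flush()
>     // In real code, also check session.getPendingErrors() after flushing.
>     session.close()
>     client.shutdown()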
>
> Best
> David
>
>
> On Sat, Apr 22, 2017 at 6:48 AM, Jason Heo <jason.heo.sde@gmail.com>
> wrote:
>
>> Hi.
>>
>> I'm using Apache Kudu 1.2. I'm currently testing high availability of
>> Kudu.
>>
>> During bulk loading, one tserver was intentionally stopped via CDH Manager,
>> and 2% of the rows went missing.
>>
>> I use Spark 1.6 and the package org.apache.kudu:kudu-spark_2.10:1.1.0 for
>> bulk loading.
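>>
>> The writes go through KuduContext, roughly like this simplified sketch
>> (master addresses and the table name are placeholders; "df" stands for the
>> DataFrame being loaded):
>>
>>   import org.apache.kudu.spark.kudu.KuduContext
>>
>>   // Assuming the kudu-spark 1.x constructor that takes the master list.
>>   val kuduContext = new KuduContext("master1:7051,master2:7051,master3:7051")
>>   kuduContext.insertRows(df, "my_table")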
>>
>> I got an error several times during insertion. 2% of the rows are lost when
>> the tserver is stopped and not started again, but if I start it right after
>> stopping it, there is no loss even though I get the same error messages.
>>
>>
>> I watched Comcast's recent presentation at Strata Hadoop, and they said that:
>>
>>
>> Spark is recommended for large inserts to ensure handling failures
>>>
>>>
>> I'm curious whether Comcast has had no issues with tserver failures, and how
>> I can prevent rows from being lost.
>>
>> ----------------------------------
>>
>> Below is a Spark error message. ("01d....b64" is the tserver that was killed.)
>>
>>
>> java.lang.RuntimeException: failed to write 2 rows from DataFrame to
>> Kudu; sample errors: Timed out: RPC can not complete before timeout:
>> Batch{operations=2, tablet='1e83668a9fa44883897474eaa20a7cad'
>> [0x00000001323031362D3036, 0x00000001323031362D3037),
>> ignoreAllDuplicateRows=false, rpc=KuduRpc(method=Write,
>> tablet=1e83668a9fa44883897474eaa20a7cad, attempt=25,
>> DeadlineTracker(timeout=30000, elapsed=29298), Traces: [0ms] sending RPC to
>> server 01d513bc5c1847c29dd89c3d21a1eb64, [589ms] received from server
>> 01d513bc5c1847c29dd89c3d21a1eb64 response Network error: [Peer
>> 01d513bc5c1847c29dd89c3d21a1eb64] Connection reset, [589ms] delaying RPC
>> due to Network error: [Peer 01d513bc5c1847c29dd89c3d21a1eb64] Connection
>> reset, [597ms] querying master, [597ms] Sub rpc: GetTableLocations sending
>> RPC to server 50cb634c24ef426c9147cc4b7181ca11, [599ms] Sub rpc:
>> GetTableLocations sending RPC to server 50cb634c24ef426c9147cc4b7181ca11,
>> [643ms
>> ...
>> ...
>> received from server 01d513bc5c1847c29dd89c3d21a1eb64 response Network
>> error: [Peer 01d513bc5c1847c29dd89c3d21a1eb64] Connection reset,
>> [29357ms] delaying RPC due to Network error: [Peer
>> 01d513bc5c1847c29dd89c3d21a1eb64] Connection reset)}
>> at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:184)
>> at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:179)
>> at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>> at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
>> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> ------------------
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
