hbase-user mailing list archives

From Anoop Sam John <anoo...@huawei.com>
Subject RE: MR missing lines
Date Thu, 20 Dec 2012 04:24:27 GMT
Hi all,

Be careful when choosing between Delete#deleteColumn() and Delete#deleteColumns().
deleteColumn() deletes just one version of a column in a given row, while
deleteColumns() deletes all versions of the column.
In Jean's case the choice makes no functional difference, since he has only one
version per column and only one column in every row.

But deleteColumn() carries an overhead. When it is called without a timestamp
(latestTimeStamp is used by default), a get operation has to happen within the
HRegion to find the timestamp of the most recent version of this column.
deleteColumn(cf, qualifier) says to delete the most recent version of
cf:qualifier, while deleteColumns(cf, qualifier) says to delete the whole
column from the row (all versions).
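
To make the difference concrete, here is a toy model in plain Java. It does not use HBase at all: the class VersionedRow and its methods are made up for illustration, it removes values directly instead of writing delete markers, and the lookup counter only mimics the extra read described above. Treat it as a sketch of the semantics, not of how the region server implements them.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a single row. NOT HBase code; every name here is hypothetical. */
final class VersionedRow {
    // column ("cf:qualifier") -> versions keyed by timestamp, sorted ascending
    private final Map<String, TreeMap<Long, String>> cells = new TreeMap<>();

    // Counts the extra reads needed to find the newest timestamp.
    int latestVersionLookups = 0;

    void put(String column, long timestamp, String value) {
        cells.computeIfAbsent(column, c -> new TreeMap<>()).put(timestamp, value);
    }

    /** Mimics deleteColumn(cf, qualifier) without a timestamp: drop only the newest version. */
    void deleteLatestVersion(String column) {
        TreeMap<Long, String> versions = cells.get(column);
        if (versions == null || versions.isEmpty()) {
            return;
        }
        latestVersionLookups++; // the read that deleteAllVersions does not need
        versions.remove(versions.lastKey());
        if (versions.isEmpty()) {
            cells.remove(column);
        }
    }

    /** Mimics deleteColumns(cf, qualifier): drop every version, no lookup required. */
    void deleteAllVersions(String column) {
        cells.remove(column);
    }

    int versionCount(String column) {
        TreeMap<Long, String> versions = cells.get(column);
        return versions == null ? 0 : versions.size();
    }
}
```

With three versions stored for a column, deleteLatestVersion leaves two behind and costs one lookup, while deleteAllVersions leaves none. With a single version per column, as in Jean's table, both calls end in the same state and only the lookup cost differs.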

From: Jean-Marc Spaggiari [jean-marc@spaggiari.org]
Sent: Thursday, December 20, 2012 6:09 AM
To: user@hbase.apache.org
Subject: Re: MR missing lines

Hi Anoop,

Thanks for the hint! Even if it's not fixing my issue, at least my
tests are going to be faster.

I will take a look at the documentation to understand what
deleteColumn was doing.


2012/12/19, Anoop Sam John <anoopsj@huawei.com>:
> Jean: just one thought after seeing the description and the code. It is
> not related to the missing rows as such.
> You want to delete the row fully right?
>>My table is only one CF with one C with one version
> And your code is like
>>  Delete delete_entry_proposed = new Delete(key);
>>  delete_entry_proposed.deleteColumn(KVs.get(0).getFamily(),
>> KVs.get(0).getQualifier());
> deleteColumn() is useful when you want to delete a specific version of a
> specific column in a row. In your case this is probably not needed. Just
> Delete delete_entry_proposed = new Delete(key); may be enough, so that the
> delete type is a ROW delete.
> You can see the javadoc of the deleteColumn() API, which clearly says
> it is an expensive op. On the server side a Get call is needed.
> In your case this is really unwanted overhead, I think.
> -Anoop-
> ________________________________________
> From: Jean-Marc Spaggiari [jean-marc@spaggiari.org]
> Sent: Tuesday, December 18, 2012 7:07 PM
> To: user@hbase.apache.org
> Subject: Re: MR missing lines
> I faced the issue again today...
> RowCounter gave me 104313 lines
> Here is the output of the job counters:
> 12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_ADDED=81594
> 12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_SIMILAR=434
> 12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_NO_CHANGES=14250
> 12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_DUPLICATE=428
> 12/12/17 22:32:52 INFO mapred.JobClient:     NON_DELETED_ROWS=0
> 12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_EXISTING=7605
> 12/12/17 22:32:52 INFO mapred.JobClient:     ROWS_PARSED=104311
> There is a 2-line difference between ROWS_PARSED and the RowCounter result.
> ENTRY_ADDED, ENTRY_SIMILAR, ENTRY_NO_CHANGES, ENTRY_DUPLICATE and
> ENTRY_EXISTING are the 5 states an entry can have. The total of all those
> counters is equal to the ROWS_PARSED value, so it's aligned. The code is
> handling all the possibilities.
> The ROWS_PARSED counter is incremented right at the beginning, like
> this (I removed the comments and javadoc for readability):
>     /**
>      * The comments ...
>      */
>     @Override
>     public void map(ImmutableBytesWritable row__, Result values, Context context)
>             throws IOException
>     {
>         context.getCounter(Counters.ROWS_PARSED).increment(1);
>         List<KeyValue> KVs = values.list();
>         try
>         {
>             // Get the current row.
>             byte[] key = values.getRow();
>             // First thing we do, we mark this line to be deleted.
>             Delete delete_entry_proposed = new Delete(key);
>             delete_entry_proposed.deleteColumn(KVs.get(0).getFamily(),
>                     KVs.get(0).getQualifier());
>             deletes_entry_proposed.add(delete_entry_proposed);
> deletes_entry_proposed is a list of rows to delete. After each
> call to the delete method, the number of lines remaining in this
> list is added to NON_DELETED_ROWS, which is 0 at the end, so all lines
> should have been deleted correctly.
> I re-ran the RowCounter after the job, and I still have ROWS=5971
> lines in the table. I checked all my "feeding processes" and they are
> all closed.
> My table is only one CF with one C with one version.
> I can guess that the remaining 5971 lines in the table are an error
> on my side, but I'm not able to find where, since all the counters
> match. I will add one counter which will count all the entries in the
> delete list before calling the delete method. This should match the
> number of rows.
> Again, I will re-feed the table today with fresh data and re-run the job...
> JM
> 2012/12/17, Jean-Marc Spaggiari <jean-marc@spaggiari.org>:
>> The job ran this morning, and of course, this time, all the rows got
>> processed ;)
>> So I will give it a few more tries and will keep you posted if I'm able
>> to reproduce that again.
>> Thanks,
>> JM
>> 2012/12/16, Jean-Marc Spaggiari <jean-marc@spaggiari.org>:
>>> Thanks for the suggestions.
>>> I already have logs to display all the exceptions and there is
>>> nothing. I can't display the work done, there is too much :(
>>> I have counters "counting" the rows processed and they match what is
>>> done, minus what is not processed. I have just added a few other
>>> counters: one right at the beginning, and one to count the
>>> records remaining on the delete list, as suggested.
>>> I will run the job again tomorrow, see the result and keep you posted.
>>> JM
>>> 2012/12/16, Asaf Mesika <asaf.mesika@gmail.com>:
>>>> Did you check the returned array of the delete method to make sure all
>>>> records sent for delete have been deleted?
>>>> Sent from my iPhone
>>>> On 16 Dec 2012, at 14:52, Jean-Marc Spaggiari
>>>> <jean-marc@spaggiari.org>
>>>> wrote:
>>>>> Hi,
>>>>> I have a table on which I run an MR job each time it exceeds 100 000
>>>>> rows.
>>>>> When the target is reached, all the feeding processes are stopped.
>>>>> Yesterday it reached 123608 rows. So I stopped the feeding processes
>>>>> and ran the MR.
>>>>> For each line, the MR creates a delete. The delete is placed on a
>>>>> list, and when the list reaches 10 elements, it's sent to the table.
>>>>> In the clean method, the list is sent to the table if there is any
>>>>> element in it.
>>>>> So at the end of the MR, I should have an empty table.
>>>>> The table is split over 128 regions, and I have 8 region servers.
>>>>> What is disturbing me is that after the MR, I had 38 lines remaining
>>>>> in the table. The MR took 348 minutes to run. So I ran the MR again,
>>>>> which this time took 2 minutes, and now I have 1 row remaining in the
>>>>> table.
>>>>> I looked at the logs (for the 38-line run) and there is nothing in
>>>>> them. There are some scanner timeout exceptions for the run of the 100K
>>>>> rows.
>>>>> I'm running HBase 0.94.3.
>>>>> I will have another 100K rows today, so I will re-run the job. I will
>>>>> increase the timeout to make sure I get no exceptions, but even when I
>>>>> ran the 38 lines with no exception, one was remaining...
>>>>> Any idea why, and where I can search? It's not really an issue for me
>>>>> since I can just re-run the job, but this might be an issue for some
>>>>> others.
>>>>> JM
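
The batching described in the original post (collect deletes in a list, send every 10, send the remainder in the cleanup step, count what is left in the list after each call) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: DeleteSink and BufferedDeleter are hypothetical names, not HBase classes, and the sink is assumed to remove every key it deleted from the list it receives, leaving failures behind, which is the behaviour the NON_DELETED_ROWS counter in the thread relies on.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the table; NOT an HBase interface. */
interface DeleteSink {
    // Assumed contract: removes each successfully deleted key from the list.
    void delete(List<String> rowKeys);
}

/** Buffers row keys and sends them to the sink in fixed-size batches. */
final class BufferedDeleter {
    private final DeleteSink sink;
    private final int batchSize;
    private final List<String> pending = new ArrayList<>();

    long submitted = 0;  // keys handed to add(); should equal the rows parsed
    long notDeleted = 0; // keys still in the list after a flush

    BufferedDeleter(DeleteSink sink, int batchSize) {
        this.sink = sink;
        this.batchSize = batchSize;
    }

    void add(String rowKey) {
        pending.add(rowKey);
        submitted++;
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Also call this from the cleanup step so a partial batch is not lost. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.delete(pending);
        notDeleted += pending.size(); // whatever the sink left behind
        pending.clear();              // a real job might retry these instead
    }
}
```

Comparing submitted with the number of rows parsed, and checking notDeleted after the final flush, gives the two extra counters discussed in the thread.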