hbase-user mailing list archives

From Jean-Adrien <a...@jeanjean.ch>
Subject Insertion and timestamps test
Date Fri, 08 Aug 2008 11:21:38 GMT


Hello.

I ran some tests with HBase 0.2.0 (RC2), focused on insertion and
timestamp behaviour. I got some surprising results, and I was wondering
whether people using HBase have already tried this kind of usage, and what
they concluded.

First of all, I created a table with the default column attributes, using
the HBase shell:



## TABLE

hbase(main):008:0> describe 'proxy-0.2'
{NAME => 'proxy-0.2', IS_ROOT => 'false', IS_META => 'false', FAMILIES => [
  {NAME => 'status', BLOOMFILTER => 'false', IN_MEMORY => 'false', LENGTH => '2147483647',
   BLOCKCACHE => 'false', VERSIONS => '3', TTL => '-1', COMPRESSION => 'NONE'},
  {NAME => 'header', BLOOMFILTER => 'false', IN_MEMORY => 'false', LENGTH => '2147483647',
   BLOCKCACHE => 'false', VERSIONS => '3', TTL => '-1', COMPRESSION => 'NONE'},
  {NAME => 'bytes', BLOOMFILTER => 'false', IN_MEMORY => 'false', LENGTH => '2147483647',
   BLOCKCACHE => 'false', VERSIONS => '3', TTL => '-1', COMPRESSION => 'NONE'},
  {NAME => 'info', BLOOMFILTER => 'false', IN_MEMORY => 'false', LENGTH => '2147483647',
   BLOCKCACHE => 'false', VERSIONS => '3', TTL => '-1', COMPRESSION => 'NONE'}]}


Test 1

I run a loop that inserts the same row with different values at different
timestamps, starting arbitrarily at 1000 and incrementing by 10. I have a
method for dumping the row history: it queries the latest version, then
queries each older version using the current version's timestamp minus 1.
Note that my table object is created once for the entire life cycle of the
program.


## GLOBAL CODE

	// somewhere in constructor
	t = new HTable(conf, TABLE_NAME);

	/**
	 * Dump the history of an HBase row, newest version first. Each older
	 * version is fetched using the max timestamp of all returned cells minus 1,
	 * until no cell is returned.
	 * @param rowKey key of the row to dump
	 */
	private void dumpRowVersions(String rowKey) {
		Logger.log.info("Versions or row : "+rowKey);
		try {
			// first query. The newest version of the row
			RowResult rr = t.getRow(rowKey);
			int version = 1;
			long maxTs;
			
			do {
				maxTs = -1;
				String line = "";
				// go through all cells of the row
				for (Map.Entry<byte[], Cell> en : rr.entrySet()) {
					long ts = en.getValue().getTimestamp();
					maxTs = Math.max(maxTs, ts);
					line += new String(en.getKey());
					line += " => " + new String(en.getValue().getValue());
					line += " ["+ts+"], ";
				}

				// remove the trailing comma and space for cleaner output
				if (line.length() > 0) {
					line = line.substring(0, line.length()-2);
				}

				// prefix result with a version counter and the max timestamp 
				// found in the cells
				line = "#"+version+" MXTS["+maxTs+"] "+line;
				if (maxTs != -1) {
					// there was resulting cell. Continue iteration
					Logger.log.info(line);
					
					// get previous version
					version++;
					rr = t.getRow(rowKey, maxTs-1);
				}
			} while (maxTs != -1);
			
		} catch (IOException ex) {
			throw new IllegalStateException("Cannot fetch history of row " + rowKey, ex);
		}
	}
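
The same walk can be tried without a cluster. The following is a minimal in-memory sketch, not HBase code: the class HistoryWalkSketch and its methods put, getRow and walk are invented for illustration, and getRow(ts) assumes the behaviour I rely on above, namely that a query at a timestamp returns, for each column, the newest cell at or before that timestamp.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a single row; not part of HBase.
class HistoryWalkSketch {
    // column name -> (timestamp -> value), both kept sorted
    private final Map<String, TreeMap<Long, String>> columns = new TreeMap<>();

    void put(String column, long ts, String value) {
        columns.computeIfAbsent(column, c -> new TreeMap<>()).put(ts, value);
    }

    // Assumed semantics: per column, the newest cell with timestamp <= ts.
    Map<String, Map.Entry<Long, String>> getRow(long ts) {
        Map<String, Map.Entry<Long, String>> row = new LinkedHashMap<>();
        for (Map.Entry<String, TreeMap<Long, String>> col : columns.entrySet()) {
            Map.Entry<Long, String> cell = col.getValue().floorEntry(ts);
            if (cell != null) {
                row.put(col.getKey(), cell);
            }
        }
        return row;
    }

    // Same loop as dumpRowVersions: record the max timestamp of the returned
    // cells, then query again at that timestamp minus 1 until nothing is left.
    List<Long> walk() {
        List<Long> seen = new ArrayList<>();
        Map<String, Map.Entry<Long, String>> row = getRow(Long.MAX_VALUE);
        while (!row.isEmpty()) {
            long maxTs = -1;
            for (Map.Entry<Long, String> cell : row.values()) {
                maxTs = Math.max(maxTs, cell.getKey());
            }
            seen.add(maxTs);
            row = getRow(maxTs - 1);
        }
        return seen;
    }

    public static void main(String[] args) {
        HistoryWalkSketch row = new HistoryWalkSketch();
        for (long ts = 1000; ts <= 1020; ts += 10) {
            row.put("bytes:", ts, "valbytes ts " + ts);
            row.put("status:", ts, "valstat ts" + ts);
        }
        System.out.println(row.walk()); // prints [1020, 1010, 1000]
    }
}
```

Under that assumption the walk visits every stored timestamp exactly once, newest first, which is what the Test 1 output shows.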

## LOOP CODE 

			long ts = 1000;
			do {
				// insert the testrow with a new timestamp
				BatchUpdate bu = new BatchUpdate("testrow", ts);
				bu.put("bytes:", ("valbytes ts "+ts).getBytes());
				bu.put("status:", ("valstat ts"+ts).getBytes());
				t.commit(bu);
				Logger.log.info("-- Inserted ts "+ts);
				
				// dump row history
				Thread.sleep(70);
				dumpRowVersions("testrow");
				
				// next iteration in two seconds
				ts += 10;
				Thread.sleep(2000);
			} while (true);

## OUTPUT

> Connecting to hbase master...
 > -- Inserted ts 1000
 > Versions or row : testrow
 > #1 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
 > -- Inserted ts 1010
 > Versions or row : testrow
 > #1 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
 > #2 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
 > -- Inserted ts 1020
 > Versions or row : testrow
 > #1 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
 > #2 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
 > #3 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
 > -- Inserted ts 1030
 > Versions or row : testrow
 > #1 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
 > #2 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
 > #3 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
 > #4 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
 > -- Inserted ts 1040
 > Versions or row : testrow
 > #1 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
 > #2 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
 > #3 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
 > #4 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
 > #5 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
 > -- Inserted ts 1050
 > Versions or row : testrow
 > #1 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
 > #2 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
 > #3 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
 > #4 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
 > #5 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
 > #6 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
 > -- Inserted ts 1060
 > Versions or row : testrow
 > #1 MXTS[1060] bytes: => valbytes ts 1060 [1060], status: => valstat
ts1060 [1060]
 > #2 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
 > #3 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
 > #4 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
 > #5 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
 > #6 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
 > #7 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
 > -- Inserted ts 1070
 > Versions or row : testrow
 > #1 MXTS[1070] bytes: => valbytes ts 1070 [1070], status: => valstat
ts1070 [1070]
 > #2 MXTS[1060] bytes: => valbytes ts 1060 [1060], status: => valstat
ts1060 [1060]
 > #3 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
 > #4 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
 > #5 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
 > #6 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
 > #7 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
 > #8 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
 > -- Inserted ts 1080
 > Versions or row : testrow
 > #1 MXTS[1080] bytes: => valbytes ts 1080 [1080], status: => valstat
ts1080 [1080]
 > #2 MXTS[1070] bytes: => valbytes ts 1070 [1070], status: => valstat
ts1070 [1070]
 > #3 MXTS[1060] bytes: => valbytes ts 1060 [1060], status: => valstat
ts1060 [1060]
 > #4 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
 > #5 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
 > #6 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
 > #7 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
 > #8 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
 > #9 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
 > -- Inserted ts 1090
 > Versions or row : testrow
 > #1 MXTS[1090] bytes: => valbytes ts 1090 [1090], status: => valstat
ts1090 [1090]
 > #2 MXTS[1080] bytes: => valbytes ts 1080 [1080], status: => valstat
ts1080 [1080]
 > #3 MXTS[1070] bytes: => valbytes ts 1070 [1070], status: => valstat
ts1070 [1070]
 > #4 MXTS[1060] bytes: => valbytes ts 1060 [1060], status: => valstat
ts1060 [1060]
 > #5 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
 > #6 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
 > #7 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
 > #8 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
 > #9 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
 > #10 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
 > -- Inserted ts 1100
 > Versions or row : testrow
 > #1 MXTS[1100] bytes: => valbytes ts 1100 [1100], status: => valstat
ts1100 [1100]
 > #2 MXTS[1090] bytes: => valbytes ts 1090 [1090], status: => valstat
ts1090 [1090]
 > #3 MXTS[1080] bytes: => valbytes ts 1080 [1080], status: => valstat
ts1080 [1080]
 > #4 MXTS[1070] bytes: => valbytes ts 1070 [1070], status: => valstat
ts1070 [1070]
 > #5 MXTS[1060] bytes: => valbytes ts 1060 [1060], status: => valstat
ts1060 [1060]
 > #6 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
 > #7 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
 > #8 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
 > #9 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
 > #10 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
 > #11 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]


Despite the VERSIONS attribute of the column families (3), it seems that all
versions are kept.

Question: is there some garbage-collection process that removes the old
versions? If so, when does it run?
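
To make explicit what I expected from VERSIONS => '3', here is a toy in-memory model. It is plain Java, not HBase code; the class VersionLimitSketch and its methods are invented for illustration. It drops the oldest cell eagerly on every write, which is only my assumption of the simplest possible behaviour. Whether HBase prunes eagerly, lazily, or only limits what a read returns is exactly the question.

```java
import java.util.TreeMap;

// Hypothetical model of one column that keeps at most maxVersions cells.
class VersionLimitSketch {
    private final int maxVersions;
    private final TreeMap<Long, String> cells = new TreeMap<>();

    VersionLimitSketch(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    void put(long ts, String value) {
        cells.put(ts, value);
        // Assumption: excess versions are removed immediately, oldest first.
        while (cells.size() > maxVersions) {
            cells.pollFirstEntry();
        }
    }

    int storedVersions() {
        return cells.size();
    }

    long oldestTimestamp() {
        return cells.firstKey();
    }

    public static void main(String[] args) {
        VersionLimitSketch col = new VersionLimitSketch(3);
        for (long ts = 1000; ts <= 1100; ts += 10) {
            col.put(ts, "valbytes ts " + ts);
        }
        // prints "3 versions, oldest 1080"
        System.out.println(col.storedVersions() + " versions, oldest " + col.oldestTimestamp());
    }
}
```

With this model the history dump after inserting ts 1100 would stop at 1080, whereas my output above goes all the way back to 1000.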

Test 2

A bit more surprising: I delete my row using the deleteall command in the
shell:


## SHELL

hbase(main):001:0> scan 'proxy-0.2'
ROW                          COLUMN+CELL
 testrow                     column=bytes:, timestamp=1100, value=valbytes ts 1100
 testrow                     column=status:, timestamp=1100, value=valstat ts1100
2 row(s) in 0.3560 seconds
hbase(main):002:0> deleteall 'proxy-0.2', 'testrow'
0 row(s) in 0.1050 seconds
hbase(main):003:0> scan 'proxy-0.2'
ROW                          COLUMN+CELL
0 row(s) in 0.2540 seconds


The table is now empty, and running my dumpRowVersions() method confirms
it. OK. Now I launch Test 1 again, restarting from timestamp 1000:


## OUTPUT

> Connecting to hbase master...
 > -- Inserted ts 1000
 > Versions or row : testrow
 > -- Inserted ts 1010
 > Versions or row : testrow
 > -- Inserted ts 1020
 > Versions or row : testrow
 > -- Inserted ts 1030
 > Versions or row : testrow
 > -- Inserted ts 1040
 > Versions or row : testrow
 > -- Inserted ts 1050
 > Versions or row : testrow
 > -- Inserted ts 1060
 > Versions or row : testrow
 > -- Inserted ts 1070
 > Versions or row : testrow


It seems that the rows are not inserted. Querying from the shell:


## SHELL

hbase(main):004:0> scan 'proxy-0.2'
ROW                          COLUMN+CELL
0 row(s) in 0.2030 seconds


But if I let the program run for more iterations than the first time
(ts > 1100), the newer timestamps are taken into account, as if the table
remembered the previous maximum timestamp.
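
That hypothesis can be stated precisely with a toy model. The sketch below is plain Java and says nothing about HBase internals: the class DeleteMarkerSketch, the deleteMarker field and the rule "only cells strictly newer than the marker are visible" are all assumptions of mine. It covers only the invisible writes, not the other effects reported further down.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical model: a delete leaves a marker that masks older timestamps.
class DeleteMarkerSketch {
    private final TreeMap<Long, String> cells = new TreeMap<>();
    private long deleteMarker = Long.MIN_VALUE; // nothing masked initially

    void put(long ts, String value) {
        cells.put(ts, value);
    }

    // Assumption: deleting the row "remembers" the newest timestamp present.
    void deleteAll() {
        if (!cells.isEmpty()) {
            deleteMarker = Math.max(deleteMarker, cells.lastKey());
        }
    }

    // Timestamps of the cells strictly newer than the marker, newest first.
    List<Long> visibleTimestamps() {
        return new ArrayList<>(cells.tailMap(deleteMarker, false).descendingMap().keySet());
    }

    public static void main(String[] args) {
        DeleteMarkerSketch row = new DeleteMarkerSketch();
        for (long ts = 1000; ts <= 1100; ts += 10) {
            row.put(ts, "valbytes ts " + ts);
        }
        row.deleteAll();
        for (long ts = 1000; ts <= 1070; ts += 10) {
            row.put(ts, "valbytes ts " + ts);
        }
        System.out.println(row.visibleTimestamps()); // prints []
        row.put(1110, "valbytes ts 1110");
        System.out.println(row.visibleTimestamps()); // prints [1110]
    }
}
```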

Relaunching the code of Test 1:


## OUTPUT

> Connecting to hbase master...
 > -- Inserted ts 1000
 > Versions or row : testrow
 > -- Inserted ts 1010
 > Versions or row : testrow
 > -- Inserted ts 1020
 > Versions or row : testrow
 > -- Inserted ts 1030
 > Versions or row : testrow
 > -- Inserted ts 1040
 > Versions or row : testrow
 > -- Inserted ts 1050
 > Versions or row : testrow
 > -- Inserted ts 1060
 > Versions or row : testrow
 > -- Inserted ts 1070
 > Versions or row : testrow
 > -- Inserted ts 1080
 > Versions or row : testrow
 > -- Inserted ts 1090
 > Versions or row : testrow
 > -- Inserted ts 1100
 > Versions or row : testrow
 > #1 MXTS[1100] bytes: => valbytes ts 1100 [1100]
 > #2 MXTS[1090] bytes: => valbytes ts 1090 [1090], status: => valstat
ts1090 [1090]
 > #3 MXTS[1080] bytes: => valbytes ts 1080 [1080], status: => valstat
ts1080 [1080]
 > #4 MXTS[1070] bytes: => valbytes ts 1070 [1070], status: => valstat
ts1070 [1070]
 > #5 MXTS[1060] bytes: => valbytes ts 1060 [1060], status: => valstat
ts1060 [1060]
 > #6 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
 > #7 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
 > #8 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
 > #9 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
 > #10 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
 > #11 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
 > -- Inserted ts 1110
 > Versions or row : testrow
 > #1 MXTS[1110] bytes: => valbytes ts 1110 [1110]
 > #2 MXTS[1100] bytes: => valbytes ts 1100 [1100], status: => valstat
ts1100 [1100]
 > #3 MXTS[1090] bytes: => valbytes ts 1090 [1090], status: => valstat
ts1090 [1090]
 > #4 MXTS[1080] bytes: => valbytes ts 1080 [1080], status: => valstat
ts1080 [1080]
 > #5 MXTS[1070] bytes: => valbytes ts 1070 [1070], status: => valstat
ts1070 [1070]
 > #6 MXTS[1060] bytes: => valbytes ts 1060 [1060], status: => valstat
ts1060 [1060]
 > #7 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
 > #8 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
 > #9 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
 > #10 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
 > #11 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
 > #12 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
 > -- Inserted ts 1120
 > Versions or row : testrow
 > #1 MXTS[1120] bytes: => valbytes ts 1120 [1120]
 > #2 MXTS[1110] bytes: => valbytes ts 1110 [1110], status: => valstat
ts1110 [1110]
 > #3 MXTS[1100] bytes: => valbytes ts 1100 [1100], status: => valstat
ts1100 [1100]
 > #4 MXTS[1090] bytes: => valbytes ts 1090 [1090], status: => valstat
ts1090 [1090]
 > #5 MXTS[1080] bytes: => valbytes ts 1080 [1080], status: => valstat
ts1080 [1080]
 > #6 MXTS[1070] bytes: => valbytes ts 1070 [1070], status: => valstat
ts1070 [1070]
 > #7 MXTS[1060] bytes: => valbytes ts 1060 [1060], status: => valstat
ts1060 [1060]
 > #8 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
 > #9 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
 > #10 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
 > #11 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
 > #12 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
 > #13 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
 > -- Inserted ts 1130
 > Versions or row : testrow
 > #1 MXTS[1130] bytes: => valbytes ts 1130 [1130]
 > #2 MXTS[1120] bytes: => valbytes ts 1120 [1120], status: => valstat
ts1120 [1120]
 > #3 MXTS[1110] bytes: => valbytes ts 1110 [1110], status: => valstat
ts1110 [1110]
 > #4 MXTS[1100] bytes: => valbytes ts 1100 [1100], status: => valstat
ts1100 [1100]
 > #5 MXTS[1090] bytes: => valbytes ts 1090 [1090], status: => valstat
ts1090 [1090]
 > #6 MXTS[1080] bytes: => valbytes ts 1080 [1080], status: => valstat
ts1080 [1080]
 > #7 MXTS[1070] bytes: => valbytes ts 1070 [1070], status: => valstat
ts1070 [1070]
 > #8 MXTS[1060] bytes: => valbytes ts 1060 [1060], status: => valstat
ts1060 [1060]
 > #9 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
 > #10 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
 > #11 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
 > #12 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
 > #13 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
 > #14 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]


Once the timestamp reaches a newer value, the row is inserted. Moreover,
the previous insertions reappear!

Notice another problem: the last insertion is missing one cell, the
'status:' column.

Scanning the table from the shell gives the same result:


## SHELL
hbase(main):003:0> scan 'proxy-0.2'
ROW                          COLUMN+CELL
 testrow                     column=bytes:, timestamp=1130, value=valbytes ts 1130


Restarting HBase with the stop-hbase.sh / start-hbase.sh scripts leads to
another unexpected behaviour.

When I run the scan command in the shell, I get the same result as above:


## SHELL
hbase(main):001:0> scan 'proxy-0.2'
ROW                          COLUMN+CELL
 testrow                     column=bytes:, timestamp=1130, value=valbytes ts 1130


but if I run the dumpRowVersions() method, it appears that most of the
history of the status: column is lost.

Note that I tried many times and never got the same behaviour twice
here: sometimes the other column is missing, or the row is entirely lost,
giving no result at all.


## OUTPUT

 > #1 MXTS[1130] bytes: => valbytes ts 1130 [1130]
 > #2 MXTS[1120] bytes: => valbytes ts 1120 [1120], status: => valstat
ts1120 [1120]
 > #3 MXTS[1110] bytes: => valbytes ts 1110 [1110]
 > #4 MXTS[1100] bytes: => valbytes ts 1100 [1100]
 > #5 MXTS[1090] bytes: => valbytes ts 1090 [1090]
 > #6 MXTS[1080] bytes: => valbytes ts 1080 [1080]
 > #7 MXTS[1070] bytes: => valbytes ts 1070 [1070]
 > #8 MXTS[1060] bytes: => valbytes ts 1060 [1060]
 > #9 MXTS[1050] bytes: => valbytes ts 1050 [1050]
 > #10 MXTS[1040] bytes: => valbytes ts 1040 [1040]
 > #11 MXTS[1030] bytes: => valbytes ts 1030 [1030]
 > #12 MXTS[1020] bytes: => valbytes ts 1020 [1020]
 > #13 MXTS[1010] bytes: => valbytes ts 1010 [1010]
 > #14 MXTS[1000] bytes: => valbytes ts 1000 [1000]


I tried other tests: replacing only one column, using an existing timestamp
to modify a single value, inserting past values, and so on. My
conclusion is that either I don't understand the intended behaviour, or I
am misusing the API.

However, normal insertion and normal queries (I mean without any
timestamp) give me coherent and predictable results, as does normal
insertion combined with querying at past timestamps.

Thanks for your work. If someone has more information about timestamps
and their intended behaviour, I'm very interested.

Have a nice day.


--

-- Jean-Adrien

-- 
View this message in context: http://www.nabble.com/Insertion-and-timestamps-test-tp18890143p18890143.html
Sent from the HBase User mailing list archive at Nabble.com.
