Saturday, 10 August 2013

Cassandra: choosing a Partition Key

I'm undecided whether it is better, performance-wise, for the partition
key of a compound primary key to be a column whose value is shared by
many rows (like Country) or a column whose values are nearly unique
(like Last_Name).
The Cassandra 1.2 documentation on indexes says this:
"When to use an index
Cassandra's built-in indexes are best on a table
having many rows that contain the indexed value. The more unique values
that exist in a particular column, the more overhead you will have, on
average, to query and maintain the index. For example, suppose you had a
user table with a billion users and wanted to look up users by the state
they lived in. Many users will share the same column value for state (such
as CA, NY, TX, etc.). This would be a good candidate for an index."
"When not to use an index
Do not use an index to query a huge volume of records for a small number
of results. For example, if you create an index on a column that has many
distinct values, a query between the fields will incur many seeks for very
few results. In the table with a billion users, looking up users by their
email address (a value that is typically unique for each user) instead of
by their state, is likely to be very inefficient. It would probably be
more efficient to manually maintain the table as a form of an index
instead of using the Cassandra built-in index. For columns containing
unique data, it is sometimes fine performance-wise to use an index for
convenience, as long as the query volume to the table having an indexed
column is moderate and not under constant load."
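To make the "manually maintain the table as a form of an index" remark
concrete for myself, here is a minimal Python sketch of the idea. It is
not Cassandra code: the two dicts, `add_user` and `find_by_email` are
hypothetical names I made up, and the sketch only assumes that the
application writes a second lookup entry keyed by email whenever it
writes a user row.

```python
# Plain-Python model of a hand-maintained lookup table (not Cassandra API).
users = {}            # stand-in for the users table: user_id -> row
users_by_email = {}   # hand-maintained lookup table: email -> user_id


def add_user(user_id, email, state):
    """Write the row and its lookup entry together, as the application would."""
    users[user_id] = {"email": email, "state": state}
    users_by_email[email] = user_id


def find_by_email(email):
    """One direct lookup on the email key, then one on the user id."""
    user_id = users_by_email.get(email)
    return users.get(user_id) if user_id is not None else None


add_user(1, "ann@example.com", "CA")
add_user(2, "bob@example.com", "NY")
add_user(3, "cat@example.com", "CA")

print(find_by_email("bob@example.com"))
```

The point of the sketch is only that a unique value such as an email
address works as a direct key into its own table, at the cost of keeping
both writes in step.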
Looking at the examples in the CQL SELECT documentation, under
"Querying compound primary keys and sorting results", I see something
like a UUID being used as the partition key. Does that indicate that it
is preferable to use something rather unique?
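
To see what the two choices might mean for data distribution, I put
together a rough Python simulation. Everything in it is an assumption
for illustration: a 4-node cluster, made-up user counts per country, MD5
as a stand-in for the partitioner's hash, and a simple modulo in place
of real token ranges. It only shows how many rows would land on each
node when all rows with the same partition key go to the same place.

```python
import hashlib
import uuid
from collections import Counter

NODES = 4  # hypothetical cluster size


def node_for(partition_key, nodes=NODES):
    """Map a partition key to a node by hashing it (simplified partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % nodes


def rows_per_node(keys):
    """Count how many rows land on each node for the given partition keys."""
    counts = Counter(node_for(k) for k in keys)
    return [counts.get(n, 0) for n in range(NODES)]


# Made-up data: 10,000 users, heavily skewed toward a few countries.
countries = ["US"] * 6000 + ["DE"] * 2500 + ["FR"] * 1000 + ["NZ"] * 500
# One distinct, UUID-formatted id per user (deterministic for repeatability).
user_ids = [str(uuid.UUID(int=i)) for i in range(len(countries))]

by_country = rows_per_node(countries)
by_user_id = rows_per_node(user_ids)

print("partitioned by country:", by_country)
print("partitioned by user id:", by_user_id)
```

With Country as the partition key there are only four partitions, so
one node ends up holding at least the 6,000 "US" rows, while the
per-user key spreads rows across all nodes. Whether that outweighs being
able to read all users of one country from a single partition is exactly
what I am unsure about.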
