Final Notes from the 2015 Cassandra Summit

This is my last post about the 2015 Cassandra Summit. This is mostly a list of random details that I wanted to keep track of. Most people may not find this useful.

DataStax 4.8 will have better ‘encryption at rest’ than previous DataStax versions. There are providers other than DataStax that should be looked at too. Note that you should use eCryptFS to encrypt the commit-log file, since it’s typically not encrypted at rest.

Slide reviews for getting encryption right can be found at Nate’s Slide Share site. He goes over a ton including what’s wrong with how DataStax documents how to install node-node and client-node encryption, and how to do it right.

Vnodes… 256 could be high. Cassandra 3.0 will start doing 64 vnodes per physical server instead. Do not mix single-token nodes and vnodes in the same datacenter. To get a mix in the same cluster, use two datacenters. Solr/Lucene and Spark currently want single-token nodes, but that’s changing.
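To get a feel for why the vnode count matters, here is a minimal sketch of token ownership on a ring. It assumes each node picks its tokens uniformly at random, which is a simplification. The `ownership` and `spread` helpers and the 64-bit ring size are my own illustration, not Cassandra code.

```python
import random

RING_SIZE = 2**64  # illustrative ring size, assumed for this sketch


def ownership(num_nodes, vnodes_per_node, seed=0):
    """Return the fraction of the ring owned by each node."""
    rng = random.Random(seed)
    # One (token, node) pair per vnode, sorted into ring order.
    tokens = sorted(
        (rng.randrange(RING_SIZE), node)
        for node in range(num_nodes)
        for _ in range(vnodes_per_node)
    )
    owned = [0] * num_nodes
    # A token owns the range from the previous token up to itself;
    # the first token wraps around and also covers the gap after the last one.
    previous = tokens[-1][0] - RING_SIZE
    for token, node in tokens:
        owned[node] += token - previous
        previous = token
    return [share / RING_SIZE for share in owned]


def spread(shares):
    """Gap between the most loaded and least loaded node."""
    return max(shares) - min(shares)


if __name__ == "__main__":
    for vnodes in (1, 16, 64, 256):
        shares = ownership(num_nodes=6, vnodes_per_node=vnodes)
        print(vnodes, "vnodes/node -> spread", round(spread(shares), 4))
```

In this toy model, more vnodes per node evens out ownership, at the price of many more ranges for the cluster to track, which is roughly the trade-off behind the 256 versus 64 discussion.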

Java drivers should have token-aware policies enabled. No load-balancer between clients and datacenter cluster. Seriously, your load balancer will do all the wrong things.

When developing code, use local consistency levels even if you have just one data center.  Also, you only think you need immediate consistency. When possible, use LOCAL_ONE for both reads and writes. (And don’t mistakenly use SimpleStrategy in production.)
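As a rough sketch of what each consistency level costs, the arithmetic below counts how many replicas have to answer. The quorum formula (half the replication factor, rounded down, plus one) is the standard one. The function names and the `rf_by_dc` dictionary are hypothetical, made up for this illustration.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_required(level, rf_by_dc, local_dc):
    """How many replicas must acknowledge a request at this level."""
    if level == "LOCAL_ONE":
        return 1
    if level == "LOCAL_QUORUM":
        return quorum(rf_by_dc[local_dc])
    if level == "QUORUM":
        # QUORUM counts replicas across every datacenter.
        return quorum(sum(rf_by_dc.values()))
    raise ValueError("level not covered by this sketch: " + level)


def reads_see_latest_write(read_level, write_level, rf_by_dc, local_dc):
    """Inside one datacenter, reads overlap writes when R + W > RF."""
    for level in (read_level, write_level):
        if not level.startswith("LOCAL_"):
            raise ValueError("only LOCAL_* levels are modelled here")
    reads = replicas_required(read_level, rf_by_dc, local_dc)
    writes = replicas_required(write_level, rf_by_dc, local_dc)
    return reads + writes > rf_by_dc[local_dc]


if __name__ == "__main__":
    rf = {"dc1": 3, "dc2": 3}  # assumed replication factors per datacenter
    for level in ("LOCAL_ONE", "LOCAL_QUORUM", "QUORUM"):
        print(level, "needs", replicas_required(level, rf, "dc1"), "replicas")
```

LOCAL_ONE on both sides is the cheapest combination but gives no read-your-writes guarantee in this model, which is the "you only think you need immediate consistency" trade-off.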

Dropping keyspaces does not remove data from disk. (Snapshots are kept.) Remember this, QA folks, and for integration tests.

In general, secondary indexes are useless, with the following caveats. If a partition has a ton of values, a secondary index is useful provided you also provide the partition key. Spark integration can actually benefit from secondary indexes with the DSE install, as each Spark instance will talk to its local Cassandra node.

Low values in commitlog_total_space_in_mb will reduce the number of memtables in memory. So you may need to up that number. 4G may be appropriate. There is a direct correlation between heap size and commit log size.

Compaction can be tuned to start when the SSTable count is between 4 and 32 SSTables per memtable. Fewer SSTables on disk make reads faster, but compaction causes high I/O… so… yeah.

Memtables are HashMaps of array lists (currently)

Remember, if you enable RowCache in your table, the cassandra.yaml file needs to have it enabled too. (Each node)

The leveled compaction strategy should only be used with SSDs. You don’t really need the commit log on SSDs, even if your SSTables are.

Do not manually compact. If you do, you will have to forever. Also, if you change the compaction strategy, the next compaction will be huge. So just don’t.

Cassandra is CPU-bound for writes, and uses memory for reads. 16G-64G of RAM is recommended even if the heap size is only 8G. Disk caching in Linux gets the rest of the memory, which helps you out a ton.

Cassandra’s sweet spot is 8 cores. More if you have Spark/Solr with Cassandra on the same box.

Size-tiered compaction needs 50% of disk free. Leveled compaction needs 10% free. SSDs give you 3-5T/node; with rotational drives, 1T/node. Be careful if you go as high as 20T/node… rebuilds will suck, as much as your admin’s life will.
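The free-space rule turns into simple arithmetic when sizing disks. This is a small sketch using the 50% and 10% figures from these notes. The strategy labels and function names are my own, and real headroom depends on your data and workload.

```python
# Free-space fractions taken from the notes above (not measured values).
FREE_SPACE_NEEDED = {"size_tiered": 0.50, "leveled": 0.10}


def usable_tb(raw_tb, strategy):
    """Data a node can hold once compaction headroom is set aside."""
    return raw_tb * (1.0 - FREE_SPACE_NEEDED[strategy])


def raw_tb_needed(data_tb, strategy):
    """Disk to provision so that data_tb still leaves the headroom free."""
    return data_tb / (1.0 - FREE_SPACE_NEEDED[strategy])


if __name__ == "__main__":
    for strategy in FREE_SPACE_NEEDED:
        print(strategy, "on a 4T disk holds about",
              round(usable_tb(4, strategy), 2), "T of data")
```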

Expect nodes to be added. With single-token nodes, you’ll have to double them up. With vnodes, you can just add them one at a time.

Use nodetool cleanup after you add nodes to the cluster or decrease the replication factor. That will clean up disk space. It’s an optimized compaction. If you wait, it’ll clean itself up.

Run repair weekly in 2.1. Looks like that will change in 3.0. Run repairs on a few nodes at a time to reduce overhead. Also, use the ‘pr’ setting so you’re not repairing too much. (Should have been the default)  ‘pr’ means only repair data it owns, not data from other nodes. Repairing the data you own will also cause repairs on other nodes… so, yeah.
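Here is a toy model of why ‘pr’ cuts the work down. It assumes single-token nodes and the simplest replica placement, where each range lives on its primary node plus the next nodes clockwise on the ring. `ranges_repaired` and `times_each_range_is_repaired` are made-up helpers, not anything from nodetool.

```python
from collections import Counter


def ranges_repaired(num_nodes, replication_factor, primary_only):
    """Map each node to the token ranges a repair on that node touches.

    Range i is the range whose primary owner is node i.
    """
    work = {}
    for node in range(num_nodes):
        if primary_only:
            # With 'pr', a node repairs only the range it is primary for.
            work[node] = [node]
        else:
            # Otherwise it also repairs the replicas it holds for the
            # replication_factor - 1 nodes before it on the ring.
            work[node] = [(node - k) % num_nodes
                          for k in range(replication_factor)]
    return work


def times_each_range_is_repaired(work):
    """Count how often each range gets repaired across the whole cluster."""
    return Counter(r for ranges in work.values() for r in ranges)


if __name__ == "__main__":
    for primary_only in (False, True):
        work = ranges_repaired(6, 3, primary_only)
        print("primary_only =", primary_only,
              dict(times_each_range_is_repaired(work)))
```

In this model, running repair on every node without ‘pr’ repairs each range once per replica, while ‘pr’ on every node covers each range exactly once.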

Always use prepared statements. Always. If you are not, you’re doing something wrong. (Reduces load)

Async queries are better, but more complicated.

Batch queries should stick to the same partition key for a performance gain.
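One way to follow that rule is to group rows by partition key before building batches, so each batch only touches one partition. This is a plain-Python sketch with no driver involved. The `sensor_id` column and the sample rows are hypothetical.

```python
from collections import defaultdict


def group_by_partition(rows, partition_key):
    """Group rows so that each batch only touches a single partition."""
    batches = defaultdict(list)
    for row in rows:
        batches[row[partition_key]].append(row)
    return dict(batches)


if __name__ == "__main__":
    # Hypothetical rows; 'sensor_id' stands in for the partition key column.
    rows = [
        {"sensor_id": "a", "ts": 1, "value": 20.5},
        {"sensor_id": "b", "ts": 1, "value": 18.0},
        {"sensor_id": "a", "ts": 2, "value": 20.7},
    ]
    for key, batch in group_by_partition(rows, "sensor_id").items():
        print("one batch for partition", key, "with", len(batch), "rows")
```

Each group would then go out as its own batch, rather than one big batch spanning partitions.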

Cassandra/Lucene plugin that is recommended outside of DataStax: cassandra-lucene-index by Stratio.


Cassandra Summit: Conference Sessions

The Cassandra summit that DataStax hosted this year had just shy of 140 sessions over two days. Each session was grouped into tracks such as operations, development and architecture. They had a half-way decent app built by Double Dutch that provided a way to schedule which sessions you wanted to see.  The app worked well, and provided a few ‘games’ mostly designed to get you to visit the vendors.

The sessions were divided into 3 groups. The first group was geared towards managers, covering what people did or how to integrate Cassandra into your company. Typically these were fairly useless; at least, the ones I accidentally attended were. There was a session defined as ‘hands on’ that was an overview of technologies installed.

The second group of sessions were technical deep-dives. A fairly crowded one consisted of folks from The Last Pickle going over the source code for how data is deleted in Cassandra. Extremely useful, as it shows why certain behaviors within Cassandra exist, and guides you into programming with those behaviors in mind. There could have been more of these types of sessions, and in bigger rooms. I had to sit on the floor for one of them even with my priority pass.

The third group was a best practices or “Hey, this is what worked for us.”  The tech head from the Weather Group did a great presentation about their attempts to scale up their ability to process incoming datasets… showing what they tried first that failed, and what actually worked.

A few notes from the summit: Spark is everywhere. People seem to be using Spark with Cassandra for any type of analytics or reporting. Zeppelin has been getting a lot of mentions too. It’s an electronic notebook to create and share Spark ‘recipes’ in the same way you can an RStudio project… perfect for folks in data analytics or just looking for a quick way to visualize data in Cassandra. I need to install both Spark and Zeppelin and see what I can do there.

Cassandra Summit: Training and Certification

This last week I went to the Cassandra summit that DataStax had put on. The first day was training and certification, and the following two days were the conference itself. I had been playing with Cassandra for years, though nothing major, and certainly nothing in production yet.

The training itself was six days’ worth of material supplied within six hours. DataStax has training online and, to a large degree, the session that day before the test was intended to be a review. You were supposed to take two classes online before the training; each class had about 3 hours of video and quizzes to go over. But many of the folks who were at the training never even looked at the site. So DataStax tried to cram tons of knowledge into everyone’s eye-socket in those six hours.

Honestly, I didn’t care about the certification; the training was more important to me. Hands-on usage of Cassandra is the only certification that’s really important here. If you don’t use Cassandra after getting your certification, then all that information you gained is likely lost within a few months of the course at best. If instead you set up a few nodes and tried to store/retrieve data from them after the online training, then you’d likely have the same level of knowledge as someone who passed the certification. Each of you will have some tidbits of information that help keep that cluster alive.

I’m glad I went for the training. Having the certification is a nice ‘feature’… but it’ll actually mean something once we have Cassandra in production.

Finally making progress

I finally took my cousin’s advice and I’m making a game with a story to it. I’ve made some progress getting a generic RPG together with libgdx where the game flow, stories, npcs and loot/abilities are loaded in dynamically. This means it will be easy to enhance over time. I’m going to add in the ability for friends to play against the same map at the same time, in a serverless connection. That’s something I’ve wanted to build into my games for a while.

Google opinions reward app is a bit creepy…

I installed the Google opinions app a while ago thinking it was just a standard marketing survey app. It comes up with a few questions during the week about various products, and for your time, it’ll reward you with a quarter or two of Google Play store credit. It does make you feel like you’re doing tricks for Google… rewarding you the way an animal performer rewards their dog with nibbles of food. But it’s been mostly harmless.

I say ‘mostly’ because lately it’s moved into the creepy stage. Twice now I’ve gone shopping only to have surveys pop up about the store I was just at. The first time it happened I thought it was a strange coincidence. This second time is giving me pause.

The survey app is on my tablet, so it’s not that the ‘app’ detected where I was; rather, it gives the impression that the servers the app talks to are asking where I’ve been lately. Funnier is that both those locations are where I tried to use Google Wallet with my phone but the transactions failed. Google’s “silent” war with Apple Pay, trying to get data on failed usage?

It still could be a coincidence… but I’m starting to get unnerved.

Sprint, S5 and bloatware

I had a ‘massive’ install of a bunch of apps on my phone. It looks like it’s Sprint-mandated bloatware. (I use Ting, but we all know how that goes…. had to get the S5 from Ting, so….)

Anyways…. I removed them quickly. Disabled ‘mobile id’ which is what I think is causing it. (Mobile ID runs ‘MVNO Configuration Update’ which provides wallpapers among other apps.)

What was added? 11 files, each less than 1M. Almost wiped my phone from this. Still might…. not sure if it was actually pwned or not. Here is the list:

  • Sprint Money Express
  • Scout
  • Featured Apps
  • NextRadio
  • NBA Game Time
  • NASCAR Mobile 2014
  • Messaging+
  • Eureka Offers
  • eBay
  • App Pass
  • Amazon Preloader

I should point out I just updated podkicker recently…. then this happened a few hours later. It could be that app. Either way I’m kinda pissed, but not sure exactly who to be pissed at.