BrainSINS is a 360º personalization solution for eCommerce, currently powering hundreds of eCommerce websites such as Toys’R Us, Mothercare, Caterpillar, etc.
Our solution has been designed to increase online sales and improve the online shopping experience. On average, we help our customers increase their sales by 20% using a combination of the products we offer. Our personalization suite includes personalized recommendations, an advanced cart abandonment recovery system, an in-site behavioral targeting solution and a set of gamification features.
As you can imagine from the lines above, we use a whole bunch of technologies to achieve our objectives, so to work on this awesome team you need to be a full-stack developer. I work as a web architect, leading the definition and implementation of our web architecture, and also supporting our backend developers as a sysadmin for our NoSQL architecture.
To provide full personalization to our customers, we need to track every possible action users perform in the online stores: visits to products, visits to any other page (categories, homepage, etc.), products added to the shopping cart, the start of the checkout process, and so on.
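To make this concrete, here is a minimal sketch of what such a tracked event might look like. This is illustrative Python (our production code is Java); the field and function names are hypothetical, not our actual schema.

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrackedEvent:
    """One user action in an online store (illustrative schema)."""
    store_id: str            # which customer's store the event came from
    user_id: str             # anonymous visitor identifier
    action: str              # "product_view", "page_view", "add_to_cart", "checkout_start", ...
    item_id: Optional[str]   # product involved, when applicable
    ts_millis: int           # when the action happened, epoch milliseconds

def product_view(store_id: str, user_id: str, product_id: str) -> TrackedEvent:
    """Build the event fired when a visitor views a product page."""
    return TrackedEvent(store_id, user_id, "product_view",
                        product_id, int(time.time() * 1000))
```

Every action type carries the same core identifiers, so downstream consumers can treat the stream uniformly.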
Tracking user actions is not an easy task, but it can be managed with a relational database up to a certain point. Once you reach that limit, performance degrades fast as hell. From that point on, you need to start thinking about what you can do to improve the overall performance of the system without increasing the investment in infrastructure.
Vertical scaling of the relational database is out of the question because it is too expensive, and it has its own limits. Horizontal scaling of a relational database is a “pain in the a..”, since you have to maintain the consistency of the data across all the servers in real time.
We receive thousands of data points per second and need to react to those inputs in real time, so we needed a robust data management system able to scale and process huge amounts of information. Cassandra gives us that, along with the reliability we need for our infrastructure.
Cassandra helped us achieve two objectives:
From the business point of view, cheaper and faster machines that do the same work.
From the technical point of view, a cluster of machines with no single point of failure, synced and easy to scale.
We are currently using Cassandra v1.2.15 in a cluster built on Amazon EC2 Amazon-Linux machines. Cassandra underpins our whole NoSQL infrastructure. Currently, the cluster goes from a minimum of 2 machines up to a finite number depending on traffic. It is very flexible, allowing us to boot up new machines and join them to the ring in less than two minutes.
When a user performs an action, it goes directly to an SQS queue which feeds a group of machines that we call “workers”. These machines write the action performed by the user into Cassandra in a way that is easy for us to retrieve later. We use Astyanax to read from and write to Cassandra, and we use the Cassandra File System instead of HDFS.
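A common way to make events “easy to retrieve later” in pre-CQL Cassandra is a time-bucketed wide row: one row per user per day, with timestamp-prefixed column names so Cassandra’s sorted columns give you events in time order. The sketch below shows that key layout in Python for illustration (our workers actually do this in Java with Astyanax); the bucketing scheme and names are assumptions, not our exact production layout.

```python
from datetime import datetime, timezone

def bucket_row_key(store_id: str, user_id: str, ts_millis: int) -> str:
    """One row per user per UTC day, so no single row grows without
    bound (a day-sized bucket is chosen here purely for illustration)."""
    day = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc).strftime("%Y%m%d")
    return f"{store_id}:{user_id}:{day}"

def column_name(ts_millis: int, action: str) -> str:
    """Zero-padded timestamp prefix keeps columns in chronological
    order under lexicographic comparison, so a column slice of the
    row returns the user's actions in time order."""
    return f"{ts_millis:013d}:{action}"
```

Reading a user’s activity for a day is then a single row read, and a time range is a column slice within that row.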
Later, to generate analytics reports/data for our clients, we run Hadoop MapReduce jobs written in Pig to calculate aggregate data, which is written into a SQL database that feeds the web application our clients use to set up the different services. To write the Pig results into SQL we use Sqoop, and to automate this pipeline we use Oozie. As you can see, we love the Apache Software Foundation.
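The kind of rollup those Pig jobs compute is essentially a grouped count over the raw event stream. Here is a minimal Python sketch of that aggregation step (the real jobs are Pig scripts on Hadoop; the grouping keys shown are an assumed example, not our exact report schema):

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

def daily_action_counts(
    events: Iterable[Tuple[str, str, str]]
) -> Dict[Tuple[str, str, str], int]:
    """Roll raw (store_id, day, action) events up into per-store,
    per-day, per-action counts, the sort of aggregate a Pig
    GROUP BY + COUNT job produces before Sqoop exports it to SQL."""
    counts: Counter = Counter()
    for store_id, day, action in events:
        counts[(store_id, day, action)] += 1
    return dict(counts)
```

In production this runs as a distributed job over a full day of events rather than an in-memory loop, but the shape of the output table is the same.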
Thanks to Cassandra’s flexibility, we have built a single machine image with all these technologies that allows us to scale the cluster up and down as needed without touching any config file.
We needed to read a lot of “Getting started with NoSQL technologies” documentation in order to learn about the different solutions out there. We had four candidates: Cassandra, HBase, MongoDB and DynamoDB (from Amazon). First of all we had to choose three of them to test:
The easiest to manage: DynamoDB.
The fastest: HBase.
The cheapest: Cassandra.
The coolest: MongoDB.
Since we already knew DynamoDB, because we were using it for some small processes, we divided our developers into three teams: Team HBase, Team Cassandra and Team MongoDB. Each team had to build a development environment with the technology assigned to them, including failure/performance tests. The throughput in HBase was a bit higher than in Cassandra, at least according to our test results, but Cassandra’s other benefits, such as its scalability, allowed it to surpass HBase.
HBase was removed from the list because of its single point of failure, and MongoDB because Cassandra’s throughput was better, so Cassandra and DynamoDB were the finalists. DynamoDB was easier to maintain, but it also imposed more restrictions and a higher cost, so in the end we decided to go with Cassandra.
I am sure any of you trying out Cassandra have read about needing at least 1GB of RAM in your system; that’s true. Think about the limits of your hardware when you see the first exceptions in the logs. From the sysadmin point of view, the best part of Cassandra is that you can automate every aspect of it, though you may have to fight a little to get there.
Since it is a relatively new technology, it has not been easy to find solutions to specific problems; nevertheless, the community is enthusiastic and everybody wants to contribute to establishing Cassandra as a standard in the NoSQL ecosystem.
Thank you for allowing us to share our experience. From here we encourage other teams around the world to do the same: more people using Cassandra means more people contributing knowledge to this outstanding technology.