Thierry Schellenbach: CTO and Co-Founder at Fashiolista
TL;DR: Fashiolista is a social network for fashion inspiration, a platform where bloggers and other users follow each other and see what type of shop-able fashion items they find. Think of it as Pinterest, aimed at fashion.
The main functionality for Fashiolista is a feed of the content from people you follow, very similar to Twitter or Facebook’s news feed. Fashiolista started out with a PostgreSQL database to build that functionality, but as they grew and hit two million members, problems arose. Fashiolista next deployed Redis, but their needs changed and they had to store more and more data, which quickly became too expensive in Redis.
Eventually, with the support of a strong community, Fashiolista migrated from Redis to Cassandra, making it possible for them to scale while lowering operating costs. Fashiolista now runs Apache Cassandra on a 5 to 20 node cluster on AWS, using m1.xlarge instances, and has open-sourced its news feed solution, known as Feedly.
Today, we have Thierry Schellenbach, the CTO and co-founder at Fashiolista. Thierry, thank you so much for joining us today. To get things started, could you tell us a little bit about what Fashiolista does?
Fashiolista is a social network for fashion inspiration. Basically, it’s a platform where bloggers and other users who have a high interest in fashion follow each other and see what type of shop-able fashion items they find. So, for instance, they will be browsing for clothes on Nelly or another website, and they’ll add the items they like to their Fashiolista profile. Think of it as Pinterest, aimed at fashion.
Very interesting. What are you using Apache Cassandra for?
The main functionality for Fashiolista is a feed of the content from people you follow. It’s a technical problem very similar to Twitter or Facebook’s news feed. We started out with a PostgreSQL database to build that functionality, but as we grew and hit two million members, it became more and more problematic to run it that way. Right now, we are using Apache Cassandra to store the large data sets of news feeds for users.
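As an illustration of the feed problem described above, here is a minimal in-memory sketch of one common approach, fan-out on write, where each new item is copied into the feed of every follower. The function names (`follow`, `add_item`, `get_feed`) and the dictionaries standing in for the database are hypothetical and chosen for illustration; this is not Fashiolista's code or the Feedly API.

```python
from collections import defaultdict

# In-memory stand-ins for the real storage layer (assumption: the
# production system would keep these in a database, not in dicts).
followers = defaultdict(set)   # author -> set of users following them
feeds = defaultdict(list)      # user -> list of items, newest first


def follow(follower, author):
    """Record that `follower` wants to see items added by `author`."""
    followers[author].add(follower)


def add_item(author, item):
    """Fan out on write: push the new item onto every follower's feed."""
    for user in followers[author]:
        feeds[user].insert(0, item)


def get_feed(user, limit=10):
    """Reading a feed is a cheap slice of a precomputed list."""
    return feeds[user][:limit]


follow("anna", "blogger")
add_item("blogger", "dress")
add_item("blogger", "shoes")
print(get_feed("anna"))
```

The trade-off in this sketch is that writes become more expensive (one copy per follower) so that reads stay cheap, which is also why the stored data grows quickly as the number of members and follows increases.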
Excellent! What was the motivation for choosing Cassandra? Were there other technologies that you evaluated it against when making the decision?
Yes. We started with a basic approach to getting the feed working. Soon after that, we switched to Redis, which is a common choice for building functionality like this, but our needs changed and we had to store more and more data. With a solution like Redis, that eventually becomes way too expensive. Some companies still use Redis; Twitter, for instance, still uses it. For our use, though, it became too expensive, which is why we started looking at solutions like HBase, Cassandra, etc.
Eventually, we picked Cassandra, because it seemed to have the strongest community support. There were a few widely publicized cases of other companies switching to it; for instance, Instagram. That is why we built our solution using Cassandra. We actually open-sourced it in a project called Feedly.
That’s great. Would you be able to share some insights into what your deployment looks like?
Sure. We have a very typical set-up, like most other companies running on AWS. It took a while to figure out which instance type to run on. We ended up settling on m1.xlarge instances, which gave us the best performance. We are running between 5 and 20 nodes, depending on how much functionality we have turned on at the moment.
Great! For future versions of Apache Cassandra, is there anything that you would like to see that might make your life a little easier?
I think the main thing is that we are using Python, as are a lot of other Cassandra users. In the Python ecosystem, it is still really hard to use the latest functionality. Solutions like Pycassa, which use the old API, are still referenced in most documents and examples, so it takes a while to develop a new application using Python and Cassandra. You have to fork a lot of projects before you can actually use the latest API provided by Cassandra. I think the ecosystem would be the main thing to improve.
Another thing we had trouble with is limiting the growth of certain data structures. If you want to trim a list, there is not a clear API you can use for that; at least, we didn’t find one. Thirdly, one other thing which would be great to improve is batch inserts. With the CQL approach, there is not a clear way of handling batch inserts efficiently. Eventually, we found the batch functionality, but it is quite hidden and we are still not sure if it is the right way to go. So, more docs in that area would be great.
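To make the two requests above concrete, here is a small sketch of what trimming a feed and grouping inserts into batches mean at the application level. It uses plain Python lists; the helper names `trim_feed` and `chunked` are hypothetical, and nothing here calls Cassandra or reflects how Feedly implements either operation.

```python
def trim_feed(feed, max_length):
    """Drop everything beyond the newest `max_length` items.

    Assumes `feed` is a list ordered newest first, so the oldest
    items sit at the end and can be deleted in place.
    """
    del feed[max_length:]
    return feed


def chunked(rows, batch_size):
    """Yield successive groups of at most `batch_size` rows.

    Each group could then be sent to the database as one batch
    instead of issuing one insert per row.
    """
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


feed = ["item5", "item4", "item3", "item2", "item1"]
print(trim_feed(feed, 3))

rows = [("anna", "dress"), ("bob", "dress"), ("carl", "dress")]
for batch in chunked(rows, 2):
    print(batch)
```

In this sketch the batch size is a tuning knob: larger batches mean fewer round trips but bigger individual requests, so the right value would have to be measured against the actual cluster.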
Interesting. That is good info to hear. What is your experience with the Apache Cassandra community?
Thank you for asking. I think it is a really important part of any open source project: the number of people who actually work on it and are trying to help. The community part of Cassandra is really good. We often found that there were people on and they were helpful. For example, Rick, from Instagram, which is one of the larger users of Cassandra, started helping. It is a really good community.
Thierry, thank you so much for joining us today. Before we sign off here, is there anything else you would like to add about Cassandra, or the community, or Fashiolista?
At Fashiolista, we open-sourced our solution to dealing with this feed problem and it would be great to get other companies involved and see if they can help out or have comments about what we are doing.
Check out the Feedly project here: Github.com/tschellenbach/Feedly