Episode 152 – Roaring News

Another fortnight, another Roaring News episode. This time we cover: de-anonymizing anonymized data is reportedly easy, Kubernetes is easier than Big Data, Big Data is hard, and hard-to-understand Kafka can be made easier using Factorio visualizations.

Just because you’re paranoid doesn’t mean they’re not out to get you!

Not entirely surprising to your co-hosts, this article discusses how easy it is to recombine previously anonymized data sets and regain the ability to identify a person.

Now this does involve combining multiple data sets, which is something legislators have warned against in the past. GDPR specifically has a clause that addresses this, and data owners need to exercise care to keep it from happening. That being the case, though, there are bound to be entities that are not bound by privacy legislation, or that simply choose not to follow it.
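To make the idea of recombining data sets a bit more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the records, the names, and the choice of postcode, birth year and sex as the linking fields are our assumptions, not taken from the article. It only shows the general principle that a record without a name can still be tied to a person when those fields match exactly one entry in a second, identified data set.

```python
# Hypothetical toy data: the "anonymized" set has no names,
# the public register has no diagnoses.
anonymized_health = [
    {"zip": "1011", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "1011", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "3512", "birth_year": 1980, "sex": "F", "diagnosis": "migraine"},
]
public_register = [
    {"name": "Alice", "zip": "1011", "birth_year": 1980, "sex": "F"},
    {"name": "Bob", "zip": "1011", "birth_year": 1975, "sex": "M"},
    {"name": "Carol", "zip": "3512", "birth_year": 1980, "sex": "F"},
    {"name": "Dana", "zip": "3512", "birth_year": 1980, "sex": "F"},
]

# Assumption: these three fields are present in both data sets.
LINK_FIELDS = ("zip", "birth_year", "sex")


def link(anonymized, identified, fields=LINK_FIELDS):
    """Map a name to an anonymized record when the shared fields match exactly one person."""
    names_by_key = {}
    for person in identified:
        key = tuple(person[f] for f in fields)
        names_by_key.setdefault(key, []).append(person["name"])

    matches = {}
    for record in anonymized:
        key = tuple(record[f] for f in fields)
        names = names_by_key.get(key, [])
        if len(names) == 1:  # unique match: the record is re-identified
            matches[names[0]] = record
    return matches


reidentified = link(anonymized_health, public_register)
for name, record in reidentified.items():
    print(name, "->", record["diagnosis"])
```

In this toy example Alice and Bob are re-identified, while the third record stays ambiguous because Carol and Dana share the same postcode, birth year and sex. The fewer people share a combination of fields, the easier the match becomes.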

So long story short (too late!), be careful about what information you share, and where!

Going on a little tangent, we discuss how bad this loss of privacy actually is. Is it really Dangerous (with a capital ‘D’), or merely an inconvenience where the lack of control is the main irritation?

Kubernetes is less complex than Hadoop! For now?

This first article in a series of three is more than a little click-baity, but the comparison of complexity between Kubernetes and the Hadoop ecosystem did intrigue us enough to warrant a discussion.

The inherent unfairness of comparing a scheduler (Kubernetes) with a full big data platform, including all the higher-level applications, does make the premise of the article a bit hard to take.

So rather than discuss the article itself, we try to predict the future of Kubernetes and how it will avoid the complexity pitfall.

As a bonus, here is an older article that goes a little deeper: “On complexity in big data”.

I’m not Gaming, I’m studying up on my Kafka skills!

Following on from the previous article, this little gem uses the indie game Factorio to explain the rather abstract concepts used by Kafka.

The advantages of distributed systems deployed in a de-coupled fashion are highlighted, and the concepts of topics and partitions are also demonstrated in what we find to be a very clear way.
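For those who prefer code to conveyor belts, here is a rough sketch of topics and partitions as a toy Python model. It is not the Kafka client API: the Topic class, the topic name and the keys are hypothetical, and CRC32 is only a stand-in for the hash a real Kafka producer applies to the message key. What it tries to show is that a topic is split into several append-only partitions, and that messages with the same key always end up in the same partition, in the order they were sent.

```python
import zlib


class Topic:
    """Toy stand-in for a topic: a fixed number of append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Assumption: CRC32 is a stand-in for the real client's key hash.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def send(self, key, value):
        """Append a message to its partition and return (partition, offset)."""
        index = self.partition_for(key)
        self.partitions[index].append((key, value))
        return index, len(self.partitions[index]) - 1


# Hypothetical topic and keys, loosely in the spirit of a factory game.
topic = Topic("ore-deliveries", num_partitions=3)
for key, value in [("iron", 1), ("copper", 1), ("iron", 2), ("iron", 3)]:
    topic.send(key, value)

# All "iron" messages sit in one partition, in the order they were sent.
iron_partition = topic.partitions[topic.partition_for("iron")]
iron_values = [value for key, value in iron_partition if key == "iron"]
print(iron_values)
```

Because each partition is an independent log, separate consumers can work through different partitions in parallel, which is where the de-coupling comes in. Ordering is only preserved within a partition, not across the topic as a whole.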

Now this is one of those articles that suffer from the visually-impaired nature of mp3 podcasting, so do pull up the article while you’re listening to this section, if at all possible.

And if the visuals intrigue, feel free to check out Factorio by Wube Software and let us know what you think! 🙂

Open Source to the rescue of Cloudera (stock price)!

For the final article in this episode, we take a look at how Cloudera, pretty much the only remaining Hadoop distribution vendor out there, is betting the farm on Open Source.

When Hortonworks was still a separate entity, one of the important differentiators between them and Cloudera was that the Hortonworks Data Platform was 100% open source, whereas Cloudera’s Data Hub was open core with proprietary add-ons you pretty much needed to get if you wanted a somewhat livable user experience.

Now Cloudera has announced that it will open source everything in its new distribution. This is definitely remarkable, and open source enthusiasts will welcome the move. However, the question remains why Cloudera is going down this path now and what they think they will gain from it.

Hortonworks, near the end of its existence, added proprietary add-ons on top of the open distribution in an attempt to increase revenue (or so we guess). And as the article mentions near the end, predatory behavior by the likes of Amazon has caused popular open source projects to move away from 100% open source in order to protect their sustainability… Only the future can tell how this will turn out.

Please use the Contact Form on this blog or our Twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.