Episode 56 – Dataworks Summit Sydney recap by Dave – Part 1

Dataworks Summit
Dave attended the Dataworks Summit in Sydney, and we go over the different sessions he attended there. In this first of two episodes, the focus is on the new goodness that Hadoop 3.0 will bring us soon.
  • Hadoop 3.0 – Sanjay Radia
    • https://www.slideshare.net/Hadoop_Summit/apache-hadoop-30-community-update-79999467
    • JDK 8+
    • Port number changes
    • Class-path isolation
    • HDFS – 3-node NameNode setup, intra-DataNode balancer for balanced storage across the disks within a node, erasure coding
    • 10TB node recovering in a few hours on a large cluster (3000 nodes)
    • Erasure coding 2012, 2013, 2014
    • Erasure coding methods: blocks or stripes
    • Surprisingly little performance difference for EC; what’s not shown is the network bandwidth cost, which is significantly higher
    • YARN 3.0
    • Scheduler, priorities within a queue
    • Q – Inter queue priorities
    • Long-running services, dynamic container configuration: CPU and IO are easy, memory is hard to do
    • Service discovery in YARN via ZooKeeper, DNS
    • Elastic resource model, graceful decommissioning of NodeManagers
    • Resource isolation with disk and network
    • YARN UI
    • YARN federation
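
To make the erasure coding trade-off from the notes above a bit more tangible, here is a rough back-of-the-envelope sketch. It is not taken from the talk: the `Scheme` class and its field names are made up for illustration, and RS(6,3) is assumed as the Reed-Solomon layout (six data units plus three parity units). It compares raw storage overhead, the number of lost units that can be tolerated, and how many units have to be read over the network to rebuild one lost unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    """Hypothetical helper describing a redundancy layout (illustration only)."""
    name: str
    data_units: int       # units holding the actual data
    redundant_units: int  # extra copies (replication) or parity units (EC)

    @property
    def storage_overhead(self) -> float:
        # Raw bytes stored per byte of user data.
        return (self.data_units + self.redundant_units) / self.data_units

    @property
    def tolerated_failures(self) -> int:
        # Number of units that can be lost without losing data.
        return self.redundant_units

    @property
    def repair_reads(self) -> int:
        # Units that must be read to rebuild one lost unit:
        # one surviving copy for replication, k surviving units for RS(k, m).
        return self.data_units


# 3x replication: one data unit plus two extra copies.
REPLICATION_3X = Scheme("3x replication", data_units=1, redundant_units=2)
# Assumed Reed-Solomon layout with 6 data and 3 parity units.
RS_6_3 = Scheme("RS(6,3)", data_units=6, redundant_units=3)

if __name__ == "__main__":
    for scheme in (REPLICATION_3X, RS_6_3):
        print(
            f"{scheme.name}: overhead {scheme.storage_overhead:.1f}x, "
            f"tolerates {scheme.tolerated_failures} lost units, "
            f"reads {scheme.repair_reads} unit(s) per repair"
        )
```

Under these assumptions erasure coding halves the raw storage compared to 3x replication, but a single repair has to pull six units across the network instead of one, which is in line with the bandwidth cost mentioned in the session.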


Please use the Contact Form on this blog or our Twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.