In the minds of many, big data is intrinsically tied to the cloud. In fact, big data has often been touted as the cloud's killer app. Big data is, quite obviously, big: it involves enormous volumes of data arriving at high velocity. Public cloud platforms offer an elasticity that makes handling such quantities of data more efficient and economical than would otherwise be possible.
Although the public cloud has many benefits where big data is concerned, bare metal infrastructure, in the form of dedicated server clusters, should not be so quickly discounted.
What makes a great big data platform? Scalability, certainly, but that is not the only prerequisite. Efficient big data processing depends on the ability to move vast amounts of data as quickly as possible: if big data applications are bottlenecked by I/O, they won't perform optimally. Ideally, big data applications would also have access to all the power a physical server can provide. Minimizing overhead from other software, including hypervisors and guest operating systems, reduces the infrastructure investment needed for effective big data processing. Applications like Spark and Hadoop will consume all the processing power and memory an organization can throw at them.
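As a rough illustration of that appetite for resources, a Spark job is typically sized to claim nearly an entire machine. On bare metal there is no hypervisor or guest OS taking a slice of those resources first. The host name, hardware figures, flag values, and job script below are hypothetical, chosen only to sketch how such a job might be sized:

```shell
# Hypothetical spark-submit for a dedicated 64-core, 256 GB bare metal node.
# On a virtualized host, the hypervisor and guest OS would claim part of
# these cores and memory before the job ever sees them.
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 15 \
  --executor-memory 56g \
  --conf spark.memory.fraction=0.8 \
  my_big_data_job.py
```

The exact numbers would be tuned to the actual hardware; the point is that frameworks like Spark are configured to saturate whatever processing power and memory the underlying server exposes.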
As InformationWeek's Charles Babcock puts it:
“Bare metal is well suited to tasks that require frequent imports of large amounts of data, such as applying inserts and updates to the database and doing quick analyses with export of results, such as analyzing activity on a social networking or large e-commerce site. In other words, bare metal shines on big data tasks associated with lots of I/O.”
Modern cloud platforms are no slouches where I/O is concerned: they can move data around quickly. But in that regard, the state of the art in virtualized infrastructure is always likely to lag behind dedicated hardware.
To be clear, I’m not claiming that the cloud is a poor choice for big data workloads. There’s a lot to be said for the scalability benefits the public cloud can bring. But I do think that public cloud platforms should not be the only option on the table when organizations are considering infrastructure deployments for big data applications.
Organizations should weigh the relative merits of each approach for their particular use case. Would your application benefit more from the elasticity of a public cloud, or from the lower latencies and greater server efficiencies of running closer to the metal?
It's worth mentioning that modern bare metal / dedicated server cluster platforms, although not as intrinsically elastic as a public cloud platform, can be scaled quickly enough for many applications. If an organization has long-running big data applications, deploying on dedicated hardware can be an effective strategy.