To describe what the Data Lake is, it is easier to start with what it is not: it is not a Data Warehouse.
The analogy of a lake can be expanded by imagining a large body of water into and out of which many streams and rivers flow. The incoming streams originate from many different sources, and the streams leading from the lake can flow to many different destinations. Some streams are a mere trickle and others are raging rivers. The water in the lake shares a common repository or location, and this common resource is used for many different purposes, such as fishing, swimming, sailing and speed boating.
Apply this imagery to data and it’s easy to understand the benefits of this common repository.
As business units and divisions in an enterprise independently develop applications for specific business functions, they generate and store related information, most often in siloed repositories. Many independent analysts have reported, and it is now commonly accepted, that the exponential growth in data storage is, and will continue to be, in unstructured and semi-structured (non-database) files. This brings multiple challenges, such as:
- Inefficient use of resources
- Administrative overhead of multiple repositories
- Multiple file access protocols
- Inconsistent security policies for corporate governance and regulatory compliance
- Inability to correlate information without shifting data around
- Isolation of data preventing a wider audience from gaining business insights
- Slow time to results when using Hadoop analytics platforms
Until now I’ve avoided using the term “Big Data” because I believe it should be understood as being the result of three V’s – Volume, Velocity & Variety. By that I mean the huge amounts of data that are being accumulated at high speed from many different sources.
The Data Lake is the place for Big Data. It is a single scale-out repository that services many different streams of data using multiple protocols such as SMB, NFS, FTP, HTTP, REST SWIFT, HDFS and NDMP. Data is deposited once, and common files are securely shared by many users using different protocols, with permissions maintained consistently across the multiple protocol stacks.
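To make the "deposit once, share over many protocols" idea concrete, here is a minimal toy sketch in Python. It is not how Isilon is implemented and it does not use any real storage API: the `DataLake` class, its `deposit` and `read` methods, the user names and the file path are all hypothetical, chosen only for illustration. It simply models a single copy of each file with a single permission table that is checked in the same way whichever protocol the request arrives on.

```python
from dataclasses import dataclass, field

# A subset of the protocol names mentioned in the text.
# Everything else in this sketch is hypothetical.
PROTOCOLS = {"SMB", "NFS", "FTP", "HTTP", "SWIFT", "HDFS"}


@dataclass
class DataLake:
    """Toy model: one copy of each file, one permission table, many protocols."""
    files: dict = field(default_factory=dict)    # path -> file contents (bytes)
    readers: dict = field(default_factory=dict)  # path -> set of permitted users

    def deposit(self, path, data, allowed_users):
        # Data is stored once, regardless of which protocol delivered it.
        self.files[path] = data
        self.readers[path] = set(allowed_users)

    def read(self, protocol, user, path):
        if protocol not in PROTOCOLS:
            raise ValueError(f"unsupported protocol: {protocol}")
        # The same permission check applies whatever the protocol.
        if user not in self.readers.get(path, set()):
            raise PermissionError(f"{user} may not read {path}")
        return self.files[path]


lake = DataLake()
lake.deposit("/sales/q1.csv", b"region,total\n", allowed_users={"alice"})

# The same stored bytes come back over two different protocols.
via_smb = lake.read("SMB", "alice", "/sales/q1.csv")
via_hdfs = lake.read("HDFS", "alice", "/sales/q1.csv")
```

In this sketch a user who is not in the permitted set is refused on every protocol, because there is only one permission table to consult. That is the property the siloed approach struggles to provide.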
EMC® Isilon® scale-out network attached storage (NAS) is a simple and scalable platform on which to build a scale-out data lake and persist enterprise files of all sizes, scaling from terabytes to petabytes in a single cluster.
A scale-out Data Lake built on Isilon can help an organization reduce costs, decrease storage complexity and comply with corporate governance and regulatory mandates.
With the growing popularity of Hadoop as the Big Data analytics platform, Isilon helps speed time to insights. It also reduces the risks associated with deploying new systems or extending existing ones as business needs change and develop, since the data already resides on the single platform and does not have to be copied to a separate analytics environment.
By: Phil Coombs
EMC Isilon Specialist SE