HDFS is an effective file system in the Hadoop ecosystem because of its storage capabilities. Comparable storage services are also available in the cloud. Both are discussed in detail below for your quick reference.
HDFS file system
Hadoop-supported file formats
- Plain text storage (e.g., CSV, TSV files)
- Sequence Files
- Avro
- Parquet
HDFS file system
You can store any type of data in HDFS: text data, binary data, images, or audio files. HDFS is primarily developed to be used by MapReduce, so file formats that fit MapReduce or Hive workloads are usually the ones stored there.
One challenge with implementing HDFS is achieving availability and scalability at the same time. You may have a large amount of data that cannot fit on a single physical machine's disk, so it is necessary to distribute the data among multiple machines.
HDFS can do this automatically and transparently while providing a user-friendly interface to developers.
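As an illustration of that interface, the following is a minimal sketch, assuming a configured Hadoop client and a hypothetical path and content, that writes a small file into HDFS and reads it back through the standard Hadoop FileSystem API:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (e.g. hdfs://namenode:8020) from core-site.xml.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a small text file; HDFS decides transparently which
    // DataNodes the blocks and their replicas are placed on.
    Path file = new Path("/tmp/hello.txt");   // hypothetical path
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back through the same API.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }
  }
}
```

The same calls work unchanged whether the cluster has one DataNode or hundreds, which is what makes the distribution transparent to the developer.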
HDFS snapshot
- A snapshot is a copy of the data in the filesystem at some point in time. A snapshot can be taken for a subtree or for the entire filesystem.
- Snapshots are typically used for data backup, protection against failures, and disaster recovery. A snapshot is read-only, because it would be meaningless if the snapshot data could be modified after it was created.
- HDFS snapshots are designed to copy data efficiently; the main benefits are:
- Creating a snapshot takes constant time, O(1), excluding inode lookup time, because it does not copy the actual data but only records a reference (a minimal sketch follows this list).
- Additional memory is used only when the original data is modified, and the amount of additional memory is proportional to the number of modifications.
- Modifications are recorded in reverse chronological order so that the current data can be accessed directly; the snapshot data is computed by subtracting the modifications from the current data.
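A minimal sketch of taking a snapshot through the same FileSystem API is shown below; the directory path and snapshot name are hypothetical, and the directory must first be made snapshottable by an administrator:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // An administrator must first allow snapshots on the directory:
    //   hdfs dfsadmin -allowSnapshot /data/warehouse
    Path dir = new Path("/data/warehouse");   // hypothetical directory

    // Creating the snapshot only records references to the current
    // inodes, so it completes in roughly constant time regardless of
    // how much data lives under the directory.
    Path snapshot = fs.createSnapshot(dir, "before-cleanup");
    System.out.println("Snapshot created at " + snapshot);

    // The snapshot is exposed as a read-only path:
    //   /data/warehouse/.snapshot/before-cleanup
  }
}
```

The equivalent shell command is `hdfs dfs -createSnapshot /data/warehouse before-cleanup`, and a snapshot can later be removed with `hdfs dfs -deleteSnapshot`.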
HDFS Cluster
In practice, creating and operating an HDFS cluster in the Hadoop ecosystem is expensive, so cloud storage can be a better option.
Cloud Storage Services
Amazon EMR
- Amazon Elastic MapReduce (EMR) is a cloud service for Hadoop. It provides an easy way to create Hadoop clusters on EC2 instances and to access HDFS or S3. You can also run major distributions on Amazon EMR, such as the Hortonworks Data Platform and MapR.
- The launching process is automated and simplified by Amazon EMR, and HDFS can be used to store intermediate data generated while running a job on an Amazon EMR cluster. Putting only the input and the final output on S3 is the best practice for EMR storage (see the sketch below).
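The sketch below illustrates that layout under a few assumptions: a hypothetical S3 bucket named my-bucket, the s3:// scheme that EMR clusters provide through EMRFS, and two chained MapReduce jobs that rely on the default identity mapper and reducer. The first stage reads its input from S3 and leaves intermediate output on the cluster's HDFS; the second reads the intermediate data and writes only the final result back to S3.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EmrStorageLayout {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Stage 1: read input from S3, keep intermediate output on cluster HDFS.
    Job stage1 = Job.getInstance(conf, "stage-1");
    stage1.setJarByClass(EmrStorageLayout.class);
    FileInputFormat.addInputPath(stage1, new Path("s3://my-bucket/input/"));
    FileOutputFormat.setOutputPath(stage1, new Path("hdfs:///tmp/intermediate/"));
    if (!stage1.waitForCompletion(true)) System.exit(1);

    // Stage 2: read the intermediate HDFS data, write only the final
    // output back to S3.
    Job stage2 = Job.getInstance(conf, "stage-2");
    stage2.setJarByClass(EmrStorageLayout.class);
    FileInputFormat.addInputPath(stage2, new Path("hdfs:///tmp/intermediate/"));
    FileOutputFormat.setOutputPath(stage2, new Path("s3://my-bucket/output/"));
    System.exit(stage2.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because the intermediate directory lives on HDFS, it disappears with the cluster, while the durable input and output remain on S3.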
Treasure Data Service
- Treasure Data is a fully managed cloud data platform. You can easily import any type of data into the storage system managed by Treasure Data, which uses HDFS and S3 internally but encapsulates their details, so you do not have to pay attention to these underlying storage systems.
- Treasure Data mainly uses Hive and Presto as its analytics platforms. You can write SQL to analyze the data imported into the Treasure Data storage service. Treasure Data uses HDFS and S3 as its backend and makes use of their respective advantages. If you do not want to operate HDFS yourself, Treasure Data can be a good choice.
Azure Blob Storage
- Azure Blob Storage is a cloud storage service provided by Microsoft. The combination of Azure Blob Storage and HDInsight provides a full-featured HDFS-compatible storage system.
- Users who are accustomed to HDFS can use Azure Blob Storage seamlessly, and much of the Hadoop ecosystem can operate directly on the data that Azure Blob Storage manages.
- Azure Blob Storage is optimized to be used by a computation layer such as HDInsight, and it provides various types of interfaces, such as PowerShell and, of course, Hadoop HDFS commands (see the sketch after this list).
- Developers who are already comfortable with Hadoop can get started easily with Azure Blob Storage.
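As a sketch of that compatibility, assuming a hypothetical storage account, container, and path, the same Hadoop FileSystem API used against HDFS can list data in Azure Blob Storage through the wasb:// scheme that HDInsight clusters configure:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WasbListing {
  public static void main(String[] args) throws Exception {
    // Assumes the hadoop-azure module is on the classpath and the
    // storage account credentials are configured (fs.azure.account.key.*).
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(
        URI.create("wasb://mycontainer@myaccount.blob.core.windows.net/"), conf);

    // List blobs exactly as you would list files on HDFS.
    for (FileStatus status : fs.listStatus(new Path("/data/"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
  }
}
```

The listing code is identical to what you would write against hdfs:// paths; only the URI scheme changes.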