When it comes to IaaS, many enterprises prefer Azure over AWS even though AWS offers more services. One reason is that Azure lets them keep using the same tried and trusted technologies that many enterprises and businesses have relied on in the past and still use today, such as Windows and Linux, Active Directory, and virtual machines. Azure is designed around the Security Development Lifecycle (SDL), a leading industry assurance process that puts security at its core, so private data and services stay secured and protected on the Azure cloud. Azure's compatibility with the .NET programming language is one of its most useful benefits, giving Microsoft a clear upper hand over AWS and the rest of the competition.
Azure Blob storage, however, has a shortcoming when it comes to parallelizing data download and data analysis. Many AI and Big Data projects involve processes such as sentiment analysis of text data like chat logs, user reviews, and feedback. Typically, each project has around 1 TB of GZIP data to analyze before useful insights can be derived. For AWS S3, RaRe Technologies has created an adapter, smart_open, to stream GZIP data, which allows data download and data processing to run in parallel. No such adapter is available for Azure Blob storage, so time is wasted every time data is processed. With the smart_open adapter on AWS, we have the option to process bulk data in chunks: the process does not have to wait until the entire file is downloaded. smart_open works asynchronously, processing chunks of 10 MB while the rest of the file is still being downloaded. This freedom to do parallel processing saves a lot of time.
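For context, here is a minimal sketch of how smart_open streams a GZIP file from S3; the bucket path and the analyze_sentiment helper are hypothetical placeholders, not part of our pipeline:

    from smart_open import open  # pip install "smart_open[s3]"

    def analyze_sentiment(line):
        """Placeholder for the per-record analysis step."""
        pass

    # smart_open streams the object and decompresses the .gz transparently,
    # so each record is processed while later chunks are still downloading.
    with open('s3://my-bucket/reviews/chat_data.json.gz', 'r', encoding='utf-8') as fin:
        for line in fin:
            analyze_sentiment(line)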
We built a Python adapter to stream GZIP data from Azure Blob storage. It uses 'get_blob_to_stream', the default functionality for streaming data from Blob storage in the Azure cloud. By defining start and stop indexes, setting the start index to 0, and advancing both indexes by a fixed chunk size, we can stream the data chunk by chunk until the remaining file size in Blob storage is less than the chunk size defined by the indexes. Once a chunk of blob data has been streamed, zlib in Python can decompress the GZIP chunk to fetch the data and process it, while the next chunk is still being streamed from Blob storage in parallel. Essentially, the adapter makes it possible to stream the GZIP data in parallel while processing it.
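Below is a minimal sketch of this chunked-streaming approach, assuming the legacy azure-storage-blob SDK (v2.x), whose BlockBlobService exposes get_blob_to_stream with start_range/end_range parameters; the credentials, container, and blob names are placeholders, and this is an illustration rather than the exact code of our adapter:

    import zlib
    from io import BytesIO
    from azure.storage.blob import BlockBlobService  # legacy azure-storage-blob v2.x

    CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB per range request, mirroring smart_open

    def stream_gzip_blob(account_name, account_key, container, blob_name):
        """Yield decompressed data from a GZIP blob, one chunk at a time."""
        service = BlockBlobService(account_name=account_name, account_key=account_key)
        blob = service.get_blob_properties(container, blob_name)
        total_size = blob.properties.content_length
        # zlib with MAX_WBITS | 16 understands the gzip header and trailer.
        decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
        start = 0
        while start < total_size:
            end = min(start + CHUNK_SIZE, total_size) - 1
            buf = BytesIO()
            # Range download: only bytes [start, end] of the blob are fetched.
            service.get_blob_to_stream(container, blob_name, buf,
                                       start_range=start, end_range=end)
            yield decomp.decompress(buf.getvalue())
            start = end + 1
        yield decomp.flush()

Because the function is a generator, the consumer can start analyzing the first decompressed chunk as soon as it arrives instead of waiting for the full file to download; wrapping the range downloads in a background thread would give true download/processing overlap.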
We are open sourcing this tool; you can fork the code from our GitHub.