Using Microsoft Azure Blob Storage from within Python
When working with cloud-born applications, it is sometimes nice to work without any local files. In my case I was building a Python pipeline to preprocess data before doing some machine learning with it. My Python code actually lives in a Jupyter notebook hosted by Azure Machine Learning Studio.
As my data lives in Azure Blob Storage (the fast and cheap generic file storage in the Microsoft cloud), I wanted to write Python scripts that read from blob storage and write back to blob storage without any local temp files. As the official documentation is not very clear (at least I find some parts confusing), I will share some bits of Python code that work for me. Obviously this is all at your own risk: I cannot guarantee that this solution will be stable, nor that it is the only or best way to do this.
# Connect to your storage account (legacy API: azure.storage before 0.30.0)
from azure.storage import BlobService

blob_service = BlobService(account_name='YourAccountName', account_key='YourKey')

# List all CSV files in the container whose names start with 'input_'
blobs = []
marker = None
while True:
    batch = blob_service.list_blobs('YourContainer', marker=marker, prefix='input_')
    blobs.extend(batch)
    if not batch.next_marker:
        break
    marker = batch.next_marker
for blob in blobs:
    print(blob.name)

# Read the blob as text; I just take the first blob from the previous list
data = blob_service.get_blob_to_text('YourContainer', blobs[0].name).split("\n")
print("Number of lines in CSV " + str(len(data)))

# Do your stuff: keep only the lines containing 'abc' or 'def' (case-sensitive)
matchers = ['abc', 'def']
matching = [s for s in data if any(xs in s for xs in matchers)]
print("Number of matching lines " + str(len(matching)))

# Write the text directly back to blob storage.
# Join with "\n" because split("\n") removed the line breaks.
blob_service.put_block_blob_from_text(
    'YourContainer',
    'YourOutputFile.csv',
    "\n".join(matching),
    x_ms_blob_content_type='text/csv'
)
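The filtering step in the middle does not depend on blob storage at all, so it can be tried out locally before wiring it to a storage account. Below is a minimal sketch of that step on its own; the function name `filter_lines` and the sample data are made up for illustration and are not part of any Azure library.

```python
def filter_lines(text, matchers):
    """Return only the lines of `text` that contain at least one matcher.

    The match is a plain, case-sensitive substring test.
    """
    lines = text.splitlines()
    kept = [line for line in lines if any(m in line for m in matchers)]
    # Re-join with newlines so the result is still valid line-based text.
    return "\n".join(kept)


# Hypothetical sample standing in for the text downloaded from a blob.
sample = "id,code\n1,abc\n2,xyz\n3,def"
filtered = filter_lines(sample, ["abc", "def"])
print(filtered)
```

Because the function takes and returns plain strings, its output can be passed straight to whatever upload call your version of the storage library provides.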
6 Replies to “Using Microsoft Azure Blob Storage from within Python”
Thank you. Your code is really helpful.
Please note that as of azure.storage 0.30.0, BlobService is split into BlockBlobService, AppendBlobService, and PageBlobService objects, so this code will no longer work.
Thanks for this, Pedro! I should rewrite this bit of code to make it compliant with the new Azure Storage APIs.
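For readers on a current SDK, here is a rough sketch of the same read, filter, and write-back flow. It assumes the azure-storage-blob v12 package, where a ContainerClient offers `download_blob(name).readall()` and `upload_blob(name, data, overwrite=True)`. The function `filter_blob` and the in-memory `FakeContainer` are hypothetical helpers added only so the sketch runs without an Azure account; it is untested against a live storage account.

```python
def filter_blob(container, source_name, target_name, matchers):
    """Download a text blob, keep the matching lines, upload the result.

    `container` is assumed to behave like azure.storage.blob.ContainerClient
    (azure-storage-blob v12). Returns the number of lines kept.
    """
    text = container.download_blob(source_name).readall().decode("utf-8")
    kept = [line for line in text.splitlines()
            if any(m in line for m in matchers)]
    container.upload_blob(target_name, "\n".join(kept).encode("utf-8"),
                          overwrite=True)
    return len(kept)


# Hypothetical in-memory stand-ins, NOT part of the Azure SDK.
class FakeDownload:
    def __init__(self, data):
        self._data = data

    def readall(self):
        return self._data


class FakeContainer:
    def __init__(self, blobs):
        self.blobs = dict(blobs)

    def download_blob(self, name):
        return FakeDownload(self.blobs[name])

    def upload_blob(self, name, data, overwrite=False):
        if name in self.blobs and not overwrite:
            raise ValueError("blob already exists")
        self.blobs[name] = data


fake = FakeContainer({"input_1.csv": b"1,abc\n2,xyz\n3,def\n"})
kept_count = filter_blob(fake, "input_1.csv", "YourOutputFile.csv",
                         ["abc", "def"])
print(kept_count, fake.blobs["YourOutputFile.csv"])
```

With the real package, the container object would come from something like `BlobServiceClient.from_connection_string(conn_str).get_container_client('YourContainer')`, and `container.list_blobs(name_starts_with='input_')` would take over the role of the marker loop. Nothing is written to local disk in either case.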
Hi Sander,
I am getting the following error. Do you by any chance know the reason for this:
(Caused by NewConnectionError(‘: Failed to establish a new connection: [Errno 11001] getaddrinfo failed’,))
Hi Pratik,
Can you reach the blob storage account that you want to use from your dev machine? "getaddrinfo failed" means the host name could not be resolved, so it sounds like a connection problem, or maybe a typo in the account name.
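A quick way to check this from Python is to ask the operating system to resolve the host name, which is the step that fails in a "getaddrinfo failed" error. The sketch below uses only the standard library; the helper names `blob_hostname` and `can_resolve` are made up for illustration, and it assumes the default public endpoint of the form account name plus `.blob.core.windows.net`.

```python
import socket


def blob_hostname(account_name):
    """Build the default public blob endpoint host for a storage account."""
    return account_name + ".blob.core.windows.net"


def can_resolve(hostname, port=443):
    """Return True if the host name resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        # Same failure class as the "getaddrinfo failed" error above.
        return False


host = blob_hostname("YourAccountName")
print(host)
# Sanity check that name resolution works at all on this machine.
print(can_resolve("localhost"))
```

Calling `can_resolve(host)` with your real account name then tells you whether the problem is name resolution (a typo, DNS, or a proxy) rather than the storage code itself.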
Hi Sander Timmer,
I have stored files of various types in an Azure Blob Storage container (.pdf, .docx, .pptx, .xlsx, .csv, etc.). My requirement is to loop through all the files in the container, read the content of each file using Python code, and store it in Python list variables. Could you please help me accomplish this task?
Please also suggest whether there is any alternative approach to reading the content of these files.
Regards,
Sandani Basha Shaik.