Straight to the point: there are cases in which it is necessary to create a custom data retention policy on a data lake, for example because GDPR requires companies not to retain data older than a certain period, depending on the industry.
Some products have built-in features to remove data after a given period, but with an Azure Logic App it is possible to easily implement data retention with more customized rules.
The aim of this article is to show how to programmatically remove blobs from an Azure Storage Account (Data Lake Storage Gen2) using an Azure Logic App.
Ready to go?
Strategy
Let’s assume it is necessary to remove blobs stored in a hierarchical folder structure where the deepest folder of the tree is named after the hour, and the hour folders sit inside daily folders named with the format yyyy-MM-dd, i.e. year-month-day. Be aware that not every day necessarily contains all hours, hence this edge case has to be managed by the Logic App.
For the sake of clarity, a portion of a path could look like “../2021-03-14/11/blobName.blob“, but the solution can easily be adapted to any other format, such as “../YYYY/MM/DD/hh/blobName.blob“, simply by updating the format as shown in the screenshots below.
The requirement is to remove data older than 395 days [(1 year = 365 days) + (1 month = 30 days)].
The strategy is to:
- Schedule the Logic App to run every day, take the current date and compute the date 395 days earlier.
- Extract the year, month and day from that date.
- Compose the directory path containing the blobs belonging to that day.
- Iterate through all the blobs and delete them.
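The first three steps can be sketched in Python. The Event folder name and the yyyy-MM-dd layout come from the example described below; the function name and the fixed 395-day window are illustrative, not part of the actual Logic App:

```python
from datetime import datetime, timedelta, timezone

def directory_to_remove(now: datetime, retention_days: int = 395) -> str:
    """Compose the path of the daily folder that fell out of retention."""
    # Equivalent of addDays(utcNow(), -395, 'yyyy-MM-dd') in the Logic App.
    expired_day = now - timedelta(days=retention_days)
    return "Event/" + expired_day.strftime("%Y-%m-%d") + "/"
```

For example, `directory_to_remove(datetime(2021, 4, 13, tzinfo=timezone.utc))` returns `"Event/2020-03-14/"`, matching the path format used in this article.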
Logic App
The first block of the Logic App is the Recurrence trigger, set to run daily.
The trigger is followed by the initialization of a variable, here called “DateToRemove”, of type string, populated dynamically by taking the utcNow() timestamp, subtracting 395 days, and then formatting the result according to how folders are organized in the storage account. In this case, inside the Event folder there is one folder per day, and each folder name follows the format yyyy-MM-dd, hence the variable is defined as: “addDays(utcNow(),-395,’yyyy-MM-dd’)“
The Scope block is used to manage any exceptions or errors, such as directories not found; whether to use it depends on how the data is organized and on the desired behavior (it is not mandatory). The example shown is part of a bigger solution where several folders are logically separated, hence the Logic App must not fail if one of those folders contains no data. In a simple flow it is possible to avoid the Scope and let the Logic App just fail. However, without wrapping the actions inside a Scope it won’t be possible to evaluate its output in the final condition of the Logic App.
Inside the Scope there is the “Lists blobs in a container” action, here called “CO Folders Selection”, which contains a fixed path completed with the dynamic value taken from the DateToRemove variable. By clicking on the small folder icon in the Blob Path field it is possible to navigate through the blob container.
The result is the list of all blobs inside that folder. The Delete Blob operation sits inside a loop that reads the output of the previous action and deletes each blob.
If the path actually exists, the Delete Blob operation is performed; otherwise it is not triggered. This avoids errors in case the folder is not found, since, as anticipated above, not all hours within a day necessarily contain data.
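Outside the designer, the same list-then-delete behavior can be sketched with the azure-storage-blob SDK. The connection string and container name are placeholders, and this is only an illustrative sketch of an equivalent approach, not the article’s Logic App itself. Listing by prefix naturally handles missing hour folders: an empty listing deletes nothing instead of raising an error.

```python
from datetime import datetime, timedelta, timezone

def expired_prefix(now: datetime, retention_days: int = 395) -> str:
    """Blob-name prefix of the daily folder to purge (yyyy-MM-dd layout)."""
    return (now - timedelta(days=retention_days)).strftime("Event/%Y-%m-%d/")

def purge_expired_blobs(connection_string: str, container_name: str) -> int:
    """List all blobs under the expired prefix and delete them; return count."""
    # Local import so the helper above stays usable without the SDK installed.
    from azure.storage.blob import BlobServiceClient

    client = BlobServiceClient.from_connection_string(connection_string)
    container = client.get_container_client(container_name)
    deleted = 0
    prefix = expired_prefix(datetime.now(timezone.utc))
    for blob in container.list_blobs(name_starts_with=prefix):
        container.delete_blob(blob.name)
        deleted += 1
    return deleted
```

The prefix-based listing plays the role of the Logic App’s “Lists blobs in a container” action, and the loop mirrors the Delete Blob cycle.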
Extra: Scope output management
The last step takes the result of what happens inside the Scope: based on its outcome it is possible to trigger different actions. In this case, if the Scope failed, nothing is done and the Logic App won’t fail.
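This Scope-plus-condition pattern is analogous to a try/except that swallows the failure. A minimal sketch, where the function names and status strings are illustrative:

```python
def run_retention(delete_day) -> str:
    """Mimic the Scope: run the inner actions and ignore their failure.

    delete_day is whatever callable performs the listing and deletion;
    if it raises (e.g. a directory is not found), the run carries on.
    """
    try:
        delete_day()
        return "Succeeded"
    except Exception:
        # The Scope failed: do nothing, so the overall run does not fail.
        return "Failed"

def missing_folder():
    raise FileNotFoundError("directory not found")
```

Here `run_retention(missing_folder)` reports "Failed" without propagating the error, just like the Logic App keeps running when one folder holds no data.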
Conclusion
Clearly this logic is simply based on a rolling date, but starting from this example it is possible to implement more complex data retention rules inside the data lake.