ADLS Gen2 Storage Audit & Optimization

Jerin Scaria
Jan 15, 2021

Organizations big and small are leveraging cloud-driven data lakes and Delta Lake technologies. With Delta Lake we get ACID compliance and time travel, along with other benefits. But with the introduction of time travel, every DML operation is a metadata operation, and none of the older files gets deleted until a manual VACUUM command is issued. This, along with the conviction that we can (and should) dump all our data into the lake and never worry about it because storage is cheap, has resulted in petabyte-scale lakes even for smaller organizations.
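As a reminder, VACUUM can be run from SQL or from the Delta Lake Python API. A minimal sketch, assuming a hypothetical table path dbfs:/mnt/lake/sales and the Delta Lake library available on the cluster:

from delta.tables import DeltaTable

# Hypothetical table path, for illustration only.
delta_table = DeltaTable.forPath(spark, "dbfs:/mnt/lake/sales")

# Physically remove files that are no longer referenced by the table and are
# older than 168 hours (the default retention); time travel beyond that
# window is lost once the files are gone.
delta_table.vacuum(168)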

To manage such a huge data store effectively, we need to understand the size of the data at different hierarchical (folder) levels.

On ADLS Gen2, neither Storage Explorer nor the storage metrics provides a way to get the size of the data at a folder level. Below, we explore a way to audit the file count and data size at any level of the folder structure using PySpark in a Databricks environment.

This works for storage accounts with petabytes of data and billions of files.

Import the multiprocessing package for parallel execution.

from pyspark.sql.functions import *
from multiprocessing.pool import ThreadPool
from multiprocessing import Value, Lock

Use a lock so that the file counter increments correctly in a multithreaded environment.

counter_lock = Lock()

def fileIncrement():
    # Increment the shared counter atomically across threads.
    with counter_lock:
        fileCounter.value += 1
        return fileCounter.value

A recursive run fetches the size parameters at each folder level and appends them to a list. Since dbutils.fs.ls in Python doesn't return the file type, we use a combination of file characteristics (zero size and no extension) to determine whether a list item is a folder or a file.

def getTableSizeIterator(path, FileSizeList):
    tableFileList = dbutils.fs.ls(path)
    if len(tableFileList) > 0:
        cnt = fileIncrement()
        print("paths: ", path, "cnt:", cnt, "cntFiles: ", len(tableFileList))
        # Aggregate file count and size (in GB) for the current folder.
        tableFileDF = spark.createDataFrame(data = tableFileList, schema = ["path", "name", "size"])
        FileSizeDF = tableFileDF.agg(count('*').alias('fileCount'), round((sum('size')/(1073741824)), 3).alias('sizeGB'))
        FileSizeDF = FileSizeDF.withColumn('timestamp', current_timestamp()).withColumn('path', lit(path))
        FileSizeDF = FileSizeDF.select("path", "timestamp", "fileCount", "sizeGB")
        FileSizeTempList = FileSizeDF.collect()
        FileSizeList += FileSizeTempList
        # Items with zero size and no extension are treated as folders; recurse into them.
        dirDF = tableFileDF.select("path").where("size = 0 and name not like '%.%' and name not like '$_%' ESCAPE '$'")
        for path in dirDF.rdd.map(lambda line: "|".join([str(x) for x in line])).collect():
            getTableSizeIterator(path, FileSizeList)
    return FileSizeList
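Before wiring this into the thread pool below, the function can be smoke-tested on a single folder. A minimal sketch, assuming a hypothetical mount point dbfs:/mnt/lake/raw/ in your workspace:

# Hypothetical single-folder run for a quick sanity check.
fileCounter = Value('i', 0)
sizes = getTableSizeIterator("dbfs:/mnt/lake/raw/", [])
for row in sizes[:5]:
    print(row.path, row.fileCount, row.sizeGB)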

A function to initiate the call and persist the results at the subfolder level. Persisting per subfolder avoids memory issues when auditing millions of files in a storage account.

def getFolderSizePool(row):
    print("getting size of: ", row.path)
    fileCounter = Value('i', 0)
    FinalList = []
    FinalList = getTableSizeIterator(row.path, FinalList)
    # Persist this subfolder's results so the driver doesn't hold everything in memory.
    FileSizeCheckDF = spark.createDataFrame(data = FinalList, schema = ["path", "timestamp", "fileCount", "sizeGB"])
    FileSizeCheckDF.write.mode("append").format("delta").saveAsTable("Lake_File_Details")

Provide the highest folder level from which you want the audit to start.

fileCounter = Value('i', 0)
tableFileList = dbutils.fs.ls("dbfs:/mnt/lake/<folder>")
# Multithreading to gather the stats in parallel.
pool = ThreadPool(10)
pool.map(getFolderSizePool, tableFileList)

Use the SQL below, or a variant of it, to get the folder sizes.

select substring_index(path,'/',5) as path
,count(*),sum(fileCount) as fileCount
,round(sum(sizeGB),2) as sizeGB
from Lake_File_Details
group by 1
order by 4 desc
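If you'd rather stay in PySpark, the same roll-up can be run from a notebook cell. A minimal sketch, assuming the Lake_File_Details table created above (the entries alias is mine):

# Roll up the audited folder sizes from PySpark instead of a SQL cell.
folder_sizes = spark.sql("""
    select substring_index(path, '/', 5) as path,
           count(*) as entries,
           sum(fileCount) as fileCount,
           round(sum(sizeGB), 2) as sizeGB
    from Lake_File_Details
    group by 1
    order by 4 desc
""")
display(folder_sizes)  # Databricks notebook display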

You can also find the original script in Git.
