Unzipping Files with Power Query: A Step-by-Step Guide
- mandarp0
- Mar 12, 2024
- 3 min read
Updated: Sep 4
One common task in data processing and analysis is unzipping files. You may be shocked to learn that Power Query, the data transformation language used in Power BI and Excel Power Query Editor, can be used to accomplish file extraction, even though there are many other tools and computer languages available for this purpose.
We'll look at a custom Power Query method in this blog article that lets you unzip a file by just supplying its name. This function takes advantage of Power Query's binary capabilities to manage the complexities of the ZIP file format. Now let's go specific.
Function
(ZIPFile) =>
let
ushort = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
uint = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
EDOCfn = BinaryFormat.Record([
ZipContent = BinaryFormat.Binary(Binary.Length(ZIPFile) - 22),
Magic = uint,
DiskNum = ushort,
CDirectoryDiskId = ushort,
CDirectoryRecordCountOnDisk = ushort,
CDirectoryRecordCount = ushort,
SizeOfCentralDirectory = uint,
CentralDirectoryOffset = uint,
CommendLength = ushort
]),
EDOC = EDOCfn(ZIPFile),
BeforeCentralDirectory = BinaryFormat.Binary(EDOC[CentralDirectoryOffset]),
CentralDirectory = BinaryFormat.Length(BinaryFormat.Record(
[
ZipContent = BeforeCentralDirectory,
Items = BinaryFormat.List(BinaryFormat.Record(
[
Magic = uint,
CurrentVersion = ushort,
MinVersion = ushort,
Flags = ushort,
CompressionMethod = ushort,
FileModificationTime = ushort,
FileModificationDate = ushort,
CRC32 = uint,
BinarySize = uint,
FileSize = uint,
FileInfo = BinaryFormat.Choice(
BinaryFormat.Record(
[
Len = ushort,
FieldsLen = ushort,
FileCommentLength = ushort,
Disk = ushort,
InternalFileAttr = ushort,
ExternalAttr = uint,
PosOfFileHeader = uint
]),
(fileInfo) => BinaryFormat.Record(
[
FileName = BinaryFormat.Text(fileInfo[Len], TextEncoding.Ascii),
Fields = BinaryFormat.Binary(fileInfo[FieldsLen]),
FileComment = BinaryFormat.Text(fileInfo[FileCommentLength], TextEncoding.Ascii),
Disk = BinaryFormat.Transform(BinaryFormat.Null, each fileInfo[Disk]),
InternalFileAttr = BinaryFormat.Transform(BinaryFormat.Null, each fileInfo[Disk]),
ExternalAttr = BinaryFormat.Transform(BinaryFormat.Null, each fileInfo[InternalFileAttr]),
PosOfFileHeader = BinaryFormat.Transform(BinaryFormat.Null, each fileInfo[PosOfFileHeader])
])
)
]),
EDOC[CDirectoryRecordCount]
)
]),
EDOC[CentralDirectoryOffset] + EDOC[SizeOfCentralDirectory]),
Contents = List.Transform(
CentralDirectory(ZIPFile)[Items],
(cdEntry) =>
let
ZipEntry = BinaryFormat.Record(
[
PreviousData = BinaryFormat.Binary(cdEntry[FileInfo][PosOfFileHeader]),
Magic = uint,
ZipVersion = ushort,
ZipFlags = ushort,
CompressionMethod = ushort,
FileModificationTime = ushort,
FileModificationDate = ushort,
CRC32 = uint,
BinarySize = uint,
FileSize = uint,
FileName = BinaryFormat.Choice(
BinaryFormat.Record(
[
Len = ushort,
FieldsLen = ushort
]),
(fileInfo) => BinaryFormat.Record(
[
FileName = BinaryFormat.Text(fileInfo[Len], TextEncoding.Ascii),
Fields = BinaryFormat.Binary(fileInfo[FieldsLen])
])
),
FileContent = BinaryFormat.Transform(
BinaryFormat.Binary(cdEntry[BinarySize]),
each Binary.Decompress(_, Compression.Deflate)
)
])(ZIPFile)
in
[FileName=ZipEntry[FileName][FileName], Content=ZipEntry[FileContent]]
)
in
Contents
Understanding the Function
Before we proceed, let's break down the provided code snippet and understand how the unzipping process works within Power Query.
The main function, named Contents, takes a ZIP file as input and performs the following steps:
Extracting Central Directory Information: The function starts by extracting the Central Directory (CD) information from the ZIP file. The CD contains metadata about the files in the ZIP archive, such as file names, sizes, and compression details.
Processing Central Directory Entries: Once the CD information is extracted, the function processes each entry in the Central Directory. It retrieves details such as file names, sizes, and compression methods for each file in the ZIP archive.
Decompressing File Content: The function then decompresses the content of each file using the DEFLATE compression algorithm. DEFLATE is a widely used algorithm for ZIP file compression.
Outputting Results: Finally, the function outputs a list containing the file names and corresponding decompressed content.
How to Use the Function
Now that we have an understanding of the function, let's see how you can use it to unzip files in Power Query.
Open Power Query Editor: In either Power BI or Excel, go to the Power Query Editor.
Load the Function: Copy and paste the provided code into the "Advanced Editor" in Power Query.
Invoke the Function: Create a new query or use an existing one, and invoke the Contents function by passing the ZIP file as an argument.
Review Output: The function will return a list with file names and their corresponding decompressed content.
Here's a simple example of how you can use the function:
let
ZIPFile = Binary.Buffer(File.Contents("C:\Path\To\Your\File.zip")),
UnzippedContents = Contents(ZIPFile)
in
UnzippedContents
Make sure to Replace "C:\Path\To\Your\File.zip" with the actual path to your ZIP file. The Binary.Buffer function is used to load the entire file into memory, enhancing performance.
In summaryThis custom Power Query function makes unzipping files in the Excel and Power BI environments a smooth operation. This feature is easy to include into your data processing workflows, which makes it a useful tool for working with ZIP archives without requiring complicated programming or additional tools.
Please feel free to modify the function to meet your own needs and investigate Power Query's data manipulation capabilities. Cheers to unzipping!
Comments