Compress Microsoft Office Word Documents (.docx)

DOCX File Structure

A .docx file is essentially a ZIP package containing XML documents and media resources:

[Content_Types].xml defines content types
.rels files define relationships
word/document.xml stores main document content
word/styles.xml stores styles
word/numbering.xml stores list definitions

If you rename .docx to .zip, you can extract and inspect the directory structure.
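The same inspection can be done without renaming anything. The sketch below uses only Python's standard zipfile module; the function name list_docx_parts and the path report.docx are placeholders chosen for illustration.

```python
import zipfile


def list_docx_parts(docx_path):
    """Return (name, uncompressed size, compressed size) for every entry in the package."""
    with zipfile.ZipFile(docx_path) as zf:
        return [(info.filename, info.file_size, info.compress_size)
                for info in zf.infolist()]


# Example use (report.docx is a placeholder path):
# for name, size, packed in list_docx_parts("report.docx"):
#     print(f"{packed:>10} {size:>10}  {name}")
```

Each tuple shows how large a part is before and after ZIP compression, which makes it easy to see where the bytes go.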

How to Compress DOCX

Large DOCX files are often dominated by files under word/media, so compression should focus there.
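Before doing any work, it can help to confirm that a particular document really is dominated by its images. A minimal sketch, assuming the standard zipfile module; media_share is a hypothetical helper name:

```python
import zipfile


def media_share(docx_path):
    """Fraction of the archive's compressed bytes stored under word/media/."""
    with zipfile.ZipFile(docx_path) as zf:
        infos = zf.infolist()
    total = sum(info.compress_size for info in infos)
    media = sum(info.compress_size for info in infos
                if info.filename.startswith("word/media/"))
    # An empty archive has no meaningful share; report 0.0 instead of dividing by zero.
    return media / total if total else 0.0
```

If the result is close to 1.0, image compression is where the savings are; if it is small, shrinking images will not change the file size much.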

Support for newer image formats inside DOCX is still inconsistent in some workflows, so sticking to common formats like JPG and PNG is usually safer.

1. Unzip

Rename the extension to .zip and extract the archive, or do it in Python:

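import os
import shutil

import pyzipper  # third-party package; mirrors the standard zipfile API


def unzip(file):
    # Extract next to the source file, into a folder named after it (extension removed).
    docname = os.path.splitext(file)[0]
    if os.path.exists(docname):
        print('Removing existing folder:', docname)
        shutil.rmtree(docname)

    with pyzipper.PyZipFile(file, "r") as zf:
        zf.extractall(docname)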

2. Compress JPG/PNG

Compress the images in place. Caesium's command-line tool (caesiumclt) gives good compression results.

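import logging
import subprocess


def compress_image(input_path: str, quality: int = 80):
    # Pass arguments as a list so paths containing spaces are handled correctly.
    command = ['caesiumclt.exe', '--same-folder-as-input', '--quality', str(quality), input_path]
    print(' '.join(command))

    try:
        # check=True raises on a non-zero exit code; os.system would only return it.
        subprocess.run(command, check=True)
    except (OSError, subprocess.CalledProcessError) as e:
        logging.error(f"An error occurred: {str(e)}")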

In practice:

quality=50 often gives strong size reduction with acceptable quality.
Even quality=80 can still noticeably reduce size.
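To see what a given quality setting saves on your own documents, you can measure the extracted word/media folder before and after compression. A minimal sketch; folder_size and percent_saved are hypothetical helper names, not part of Caesium:

```python
import os


def folder_size(path):
    """Total size in bytes of all files under path, including subfolders."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def percent_saved(before, after):
    """Size reduction as a percentage of the original size."""
    return 0.0 if before == 0 else 100.0 * (before - after) / before
```

Call folder_size on the media folder once before and once after the compression step, then pass both numbers to percent_saved to compare settings such as 50 and 80.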

3. Handle EMF files

EMF files are often large, and converting them to JPG or PNG can reduce their size significantly. ImageMagick can do the conversion; afterwards, update word/_rels/document.xml.rels wherever a file's extension or path has changed.
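The relationship update can be sketched as plain text substitution. The example below assumes the conversion itself has already been done (for example with ImageMagick, whose EMF support depends on the platform and build) and that the new file keeps the same base name. The helper names retarget_emf and ensure_default_type are hypothetical. The second helper reflects that [Content_Types].xml must declare every file extension used in the package, so a document that previously had no PNG images needs a png entry.

```python
import re

# Matches Target="<optional folder/>name.emf" inside a .rels file.
_EMF_TARGET = re.compile(r'Target="((?:[^"]*/)?)([^"/]+\.emf)"', re.IGNORECASE)


def retarget_emf(rels_xml, converted, new_ext="png"):
    """Point relationships at converted images.

    converted: set of EMF file names, e.g. {"image1.emf"}, that now exist
    with the new extension. Other targets are left untouched.
    """
    wanted = {name.lower() for name in converted}

    def swap(match):
        prefix, name = match.group(1), match.group(2)
        if name.lower() in wanted:
            return f'Target="{prefix}{name[:-4]}.{new_ext}"'
        return match.group(0)

    return _EMF_TARGET.sub(swap, rels_xml)


def ensure_default_type(content_types_xml, ext="png", mime="image/png"):
    """Add a <Default> entry for ext to [Content_Types].xml if it is missing."""
    if re.search(rf'Extension="{re.escape(ext)}"', content_types_xml, re.IGNORECASE):
        return content_types_xml
    entry = f'<Default Extension="{ext}" ContentType="{mime}"/>'
    return content_types_xml.replace("</Types>", entry + "</Types>", 1)
```

Both functions take and return the XML as a string, so they can be applied to the extracted files between the unzip and repack steps. Remember to delete the original .emf files, otherwise they are packed again and nothing is saved.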

4. Repack

Re-zip the extracted folder using ZIP_DEFLATED.

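import os
import shutil

import pyzipper  # third-party package; mirrors the standard zipfile API


# Note: the names zip and zipfile shadow a Python built-in and a standard module.
def zip(folder, zipfile):
    print('zip:', folder, ' -> ', zipfile)
    with pyzipper.PyZipFile(zipfile, "w", compression=pyzipper.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(folder):
            for file in files:
                abs_path = os.path.join(root, file)
                # Store paths relative to the folder so [Content_Types].xml sits at the archive root.
                rel_path = os.path.relpath(abs_path, folder)
                zf.write(abs_path, rel_path)

    # Remove the extracted folder once the archive has been written.
    shutil.rmtree(folder)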
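A common repacking mistake is zipping the parent folder, which nests every part one level too deep and makes the document unreadable. A quick sanity check on the result can catch that. This is a minimal sketch using the standard zipfile module; check_docx and REQUIRED_PARTS are hypothetical names, and passing the check does not guarantee that Word will open the file.

```python
import zipfile

# Parts expected at these exact paths, based on the structure described earlier.
REQUIRED_PARTS = ("[Content_Types].xml", "word/document.xml")


def check_docx(path):
    """Return a list of problems found; an empty list means the basic checks passed."""
    if not zipfile.is_zipfile(path):
        return ["not a ZIP archive"]
    problems = []
    with zipfile.ZipFile(path) as zf:
        # testzip() reads every entry and returns the first name with a bad CRC, or None.
        bad = zf.testzip()
        if bad is not None:
            problems.append(f"corrupt entry: {bad}")
        names = set(zf.namelist())
        for part in REQUIRED_PARTS:
            if part not in names:
                problems.append(f"missing part: {part}")
    return problems
```

Running this on each output file before deleting or replacing the originals gives an early warning when something went wrong.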

5. End-to-end script example

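import os


def compress_docx(indir, outdir):
    # Relies on unzip(), compress_image() and zip() from the previous steps.
    for root, dirs, files in os.walk(indir):
        for file in files:
            if file.lower().endswith('.docx'):
                docfile = os.path.join(root, file)
                unzip(docfile)
                docname = file[0:-5]
                imgpath = os.path.join(root, docname, 'word/media/')
                # Documents without images have no word/media folder.
                if os.path.isdir(imgpath):
                    compress_image(imgpath, 50)

                outfolder = os.path.join(outdir, os.path.relpath(root, indir))
                # makedirs also creates missing parent folders; os.mkdir fails on nested paths.
                os.makedirs(outfolder, exist_ok=True)
                zip(os.path.join(root, docname), os.path.join(outfolder, file))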

Summary

With this approach and quality=50, DOCX files can often be reduced to around one-third of the original size.