Compressing Microsoft Office Word documents (docx)

Structure of a docx document

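A docx file is essentially a ZIP package. It contains the content types defined in [Content_Types].xml, the relationships maintained in .rels files, the document content in document.xml, the style definitions in styles.xml, and the list styles in numbering.xml. Together, these components build and render the document's structure and styling. Changing the file extension from docx to zip lets you extract it into the following directory structure.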

│  [Content_Types].xml
├─docProps
│      app.xml
│      core.xml
│      custom.xml
│
├─word
│  │  document.xml
│  │  endnotes.xml
│  │  fontTable.xml
│  │  footer1.xml
│  │  footnotes.xml
│  │  settings.xml
│  │  styles.xml
│  │  webSettings.xml
│  │
│  ├─media
│  │      image1.jpg
│  │      image10.emf
│  │      image11.emf
│  │      image12.emf
│  │      image13.png
│  │      image14.emf
│  │      image15.emf
│  │      image16.emf
│  │      image17.emf
│  │      image2.jpg
│  │      image3.jpg
│  │      image4.jpg
│  │      image5.png
│  │      image6.png
│  │      image7.emf
│  │      image8.png
│  │      image9.png
│  │
│  ├─theme
│  │      theme1.xml
│  │
│  └─_rels
│          document.xml.rels
│
└─_rels
        .rels
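
The same structure can be inspected without renaming the file, because Python's standard zipfile module opens a .docx directly. The following is a minimal sketch rather than part of the original workflow; the function name list_parts is made up for illustration. It returns the part names and the default content types declared in [Content_Types].xml.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by [Content_Types].xml in Open Packaging Conventions packages.
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def list_parts(docx_path):
    """Return (part names, {extension: content type}) for a docx package.

    list_parts is a hypothetical helper name used only in this sketch.
    """
    with zipfile.ZipFile(docx_path) as zf:
        names = zf.namelist()
        types_root = ET.fromstring(zf.read("[Content_Types].xml"))
    # Default elements map a file extension to a content type.
    defaults = {
        d.get("Extension"): d.get("ContentType")
        for d in types_root.findall(f"{{{CT_NS}}}Default")
    }
    return names, defaults
```

Calling list_parts("report.docx") (a placeholder file name) returns entries such as word/document.xml and word/media/image1.jpg, matching the tree above.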

Compressing a docx file

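The files that take up the most space are usually those in the word/media directory, so compression mainly targets them. Docx does not currently support modern image formats such as JPEG XL, AVIF, or WebP 2, so the images have to stay in common formats such as jpg and png.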
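
To check how much of a given document is actually taken up by media, the compressed sizes stored in the archive can be summed per extension. This is a small sketch using only the standard library; media_size_by_extension is a hypothetical helper name, not something from the original script.

```python
import os
import zipfile
from collections import Counter


def media_size_by_extension(docx_path):
    """Return ({extension: bytes}, total bytes) using compressed sizes.

    Only entries under word/media/ are grouped; the total covers the
    whole archive so the two can be compared.
    """
    sizes = Counter()
    with zipfile.ZipFile(docx_path) as zf:
        infos = zf.infolist()
        total = sum(info.compress_size for info in infos)
        for info in infos:
            if info.filename.startswith("word/media/") and not info.is_dir():
                ext = os.path.splitext(info.filename)[1].lower()
                sizes[ext] += info.compress_size
    return dict(sizes), total
```

If the .emf or .png share dominates the total, that is where the steps below will save the most space.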

1. Unzip

Change the extension to zip and extract the archive. In code, the docx file can be opened as a ZIP archive directly.

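import os
import shutil

import pyzipper


def unzip(file):
    # Extract into a folder named after the document, next to the docx file.
    docname = os.path.splitext(file)[0]
    if os.path.exists(docname):
        print('extraction folder already exists, removing it:', docname)
        shutil.rmtree(docname)

    # A docx is an ordinary ZIP archive, so it can be opened without renaming.
    with pyzipper.PyZipFile(file, "r") as zf:
        zf.extractall(docname)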

2. Compress jpg, png, and similar files

These files can simply be compressed in place. Caesium gives good compression results; its command-line tool is used here.

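import logging
import subprocess


def compress_image(
    input_path: str,
    quality: int = 80
):
    # Pass the arguments as a list so paths containing spaces are not split.
    command = ['caesiumclt.exe', '--same-folder-as-input', '--quality', str(quality), input_path]
    print(' '.join(command))

    try:
        # check=True raises on a non-zero exit code; os.system never raises for that.
        subprocess.run(command, check=True)
    except (OSError, subprocess.CalledProcessError) as e:
        logging.error(f"An error occurred: {str(e)}")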
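
Before invoking the compressor, it can help to confirm that the tool is installed and that the media folder actually contains raster images. The sketch below is an optional addition, not part of the original script; the helper names and the set of extensions are assumptions made for illustration.

```python
import os
import shutil

# Extensions treated as directly compressible raster images
# (an assumption of this sketch, adjust as needed).
RASTER_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def find_raster_images(media_dir):
    """Return sorted paths of jpg/png files directly inside media_dir."""
    if not os.path.isdir(media_dir):
        return []  # a docx without images has no word/media folder
    return sorted(
        os.path.join(media_dir, name)
        for name in os.listdir(media_dir)
        if os.path.splitext(name)[1].lower() in RASTER_EXTENSIONS
    )


def compressor_available(executable="caesiumclt"):
    """Report whether the compressor executable can be found on PATH."""
    return shutil.which(executable) is not None
```

Skipping the external call when find_raster_images returns an empty list avoids a failed run on documents that contain no images.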

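A quality of 50 is the usual choice; the compressed images still look good. You can go as low as 20 for stronger compression. Even at a higher quality of 80, the compressed files are much smaller than the originals.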

3. Compress emf files

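emf files are usually fairly large, and converting them to jpg or png usually makes them smaller. ImageMagick can convert the format and compress the result. Because the file extension changes from emf to jpg or png, the word/_rels/document.xml.rels file has to be updated to point at the new file names.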
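
The original post gives no code for this step, so the following is only a sketch under stated assumptions. It assumes the ImageMagick 7 `magick` command is on PATH and that the local build can read EMF (this depends on the platform and build). The function names are hypothetical. Besides rewriting the relationship targets, it also adds a Default entry for the new extension to [Content_Types].xml when one is missing, since the package must declare a content type for every part.

```python
import subprocess
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def convert_emf(emf_path, png_path):
    """Convert one file with the ImageMagick 7 CLI (assumed to be on PATH)."""
    subprocess.run(["magick", emf_path, png_path], check=True)


def retarget_relationships(rels_xml, old_ext=".emf", new_ext=".png"):
    """Return rels XML (bytes) with targets ending in old_ext switched to new_ext."""
    ET.register_namespace("", REL_NS)  # serialize without an ns0: prefix
    root = ET.fromstring(rels_xml)
    for rel in root.findall(f"{{{REL_NS}}}Relationship"):
        target = rel.get("Target", "")
        if target.lower().endswith(old_ext):
            rel.set("Target", target[: -len(old_ext)] + new_ext)
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


def ensure_default_type(types_xml, ext="png", content_type="image/png"):
    """Return [Content_Types].xml (bytes) with a Default entry for ext."""
    ET.register_namespace("", CT_NS)
    root = ET.fromstring(types_xml)
    known = {
        d.get("Extension", "").lower()
        for d in root.findall(f"{{{CT_NS}}}Default")
    }
    if ext.lower() not in known:
        ET.SubElement(
            root, f"{{{CT_NS}}}Default", Extension=ext, ContentType=content_type
        )
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)
```

In use, each word/media/*.emf file would be converted, the original emf deleted, and both XML files read as bytes, passed through these functions, and written back before repackaging. Note that retarget_relationships rewrites every target ending in .emf, so it fits the case where all emf files are converted.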

4. Repackage

Package the modified files as they are, choosing ZIP_DEFLATED as the compression method.

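import os
import shutil

import pyzipper


# Note: the names zip and zipfile shadow the built-in zip() and the standard zipfile module.
def zip(folder, zipfile):
    print('zip:', folder, ' -> ', zipfile)
    with pyzipper.PyZipFile(zipfile, "w", compression=pyzipper.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(folder):
            for file in files:
                abs_path = os.path.join(root, file)
                # Store each file under its path relative to the package root.
                rel_path = os.path.relpath(abs_path, folder)
                zf.write(abs_path, rel_path)

    # Remove the extracted folder once the archive has been written.
    shutil.rmtree(folder)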
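
After repackaging, a quick sanity check can catch a broken archive before the file is opened in Word. This is a minimal sketch using the standard zipfile module; verify_docx and the list of required parts are assumptions for illustration, and passing the check does not guarantee that Word accepts the file.

```python
import zipfile

# Parts a Word package is expected to contain (a minimal set for this sketch).
REQUIRED_PARTS = ("[Content_Types].xml", "_rels/.rels", "word/document.xml")


def verify_docx(docx_path):
    """Return a list of problems found in the archive (empty if none)."""
    problems = []
    with zipfile.ZipFile(docx_path) as zf:
        # testzip() returns the name of the first entry that fails its
        # CRC check, or None when every entry reads back correctly.
        bad = zf.testzip()
        if bad is not None:
            problems.append(f"corrupt entry: {bad}")
        names = set(zf.namelist())
        for part in REQUIRED_PARTS:
            if part not in names:
                problems.append(f"missing part: {part}")
    return problems
```

A missing word/document.xml typically means the files were archived with an extra top-level folder instead of relative to the package root.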

5. Complete code for the whole process

def compress_docx(indir, outdir):
    for root, dirs, files in os.walk(indir):
        for file in files:
            if file.endswith('.docx'):
                docfile = os.path.join(root, file)
                unzip(docfile)
                docname = file[0:-5]
                imgpath = os.path.join(root, docname, 'word/media/')
                # Documents without images have no word/media folder.
                if os.path.isdir(imgpath):
                    compress_image(imgpath, 50)

                outfolder = os.path.join(outdir, os.path.relpath(root, indir))
                # makedirs also creates missing parent folders for nested inputs.
                os.makedirs(outfolder, exist_ok=True)
                zip(os.path.join(root, docname), os.path.join(outfolder, file))

Summary

With the method above and quality = 50, a docx file typically shrinks to about 1/3 of its original size.
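
To check the result on your own documents, the ratio can be measured directly from the file sizes. This is a tiny sketch; size_ratio is a hypothetical helper name.

```python
import os


def size_ratio(original_path, compressed_path):
    """Return compressed size divided by original size (about 0.33 means 1/3)."""
    original = os.path.getsize(original_path)
    if original == 0:
        raise ValueError("original file is empty")
    return os.path.getsize(compressed_path) / original
```

The actual ratio depends on how much of the document is images and on the chosen quality, so it will vary from file to file.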