go語言gzip壓縮 go語言打包

深入理解gzip原理

gzip 使用deflate算法進(jìn)行壓縮。gzip 對于要壓縮的文件，首先使用LZ77算法的一個變種進(jìn)行壓縮，對得到的結(jié)果再使用Huffman編碼的方法

和平ssl適用于網(wǎng)站、小程序/APP、API接口等需要進(jìn)行數(shù)據(jù)傳輸應(yīng)用場景，ssl證書未來市場廣闊！成為創(chuàng)新互聯(lián)的ssl證書銷售渠道，可以享受市場價格4-6折優(yōu)惠！如果有意向歡迎電話聯(lián)系或者加微信：18980820575（備注：SSL證書合作）期待與您的合作！

如果文件中有兩塊內(nèi)容相同的話，那么只要知道前一塊的位置和大小，我們就可以確定后一塊的內(nèi)容。所以我們可以用（兩者之間的距離，相同內(nèi)容的長度）這樣一對信息，來替換后一塊內(nèi)容。由于（兩者之間的距離，相同內(nèi)容的長度）這一對信息的大小，小于被替換內(nèi)容的大小，所以文件得到了壓縮。

舉一個例子

有一個文件的內(nèi)容如下

其中有些部分的內(nèi)容，前面已經(jīng)出現(xiàn)過了，下面用()括起來的部分就是相同的部分。

( .)nease(.net )

我們使用 (兩者之間的距離，相同內(nèi)容的長度) 這樣一對信息，來替換后一塊內(nèi)容。（22,13)中，22為相同內(nèi)容塊與當(dāng)前位置之間的距離，13為相同內(nèi)容的長度。(23,4)中，23為相同內(nèi)容塊與當(dāng)前位置之間的距離，4為相同內(nèi)容的長度。由于（兩者之間的距離，相同內(nèi)容的長度）這一對信息的大小，小于被替換內(nèi)容的大小，所以文件得到了壓縮。

LZ77算法使用"滑動窗口"的方法，來尋找文件中的相同部分，也就是匹配串.

解壓縮：

從文件開始到文件結(jié)束，每次先讀一位標(biāo)志位，通過這個標(biāo)志位來判斷下面是一個(之間的距離，匹配長度) 對，還是一個沒有改動的字節(jié)。如果是一個（之間的距離，匹配長度）對，就讀出固定位數(shù)的（之間的距離，匹配長度）對，然后根據(jù)對中的信息，將匹配串輸出到當(dāng)前位置。如果是一個沒有改動的字節(jié)，就讀出一個字節(jié)，然后輸出這個字節(jié)。

LZ77壓縮時需要做大量的匹配工作，而解壓縮時需要做的工作很少，也就是說解壓縮相對于壓縮將快的多。這對于需要進(jìn)行一次壓縮，多次解壓縮的情況，是一個巨大的優(yōu)點。

深入理解

要理解這種算法，我們先了解3個關(guān)鍵詞:短語字典，滑動窗口和向前緩沖區(qū)。

前向緩沖區(qū)

每次讀取數(shù)據(jù)的時候，先把一部分?jǐn)?shù)據(jù)預(yù)載入前向緩沖區(qū)。為移入滑動窗口做準(zhǔn)備

滑動窗口

一旦數(shù)據(jù)通過緩沖區(qū)，那么它將移動到滑動窗口中，并變成字典的一部分。

短語字典

從字符序列S1...Sn，組成n個短語。比如字符(A,B,D) ,可以組合的短語為{(A),(A,B),(A,B,D),(B),(B,D),(D)},如果這些字符在滑動窗口里面，就可以記為當(dāng)前的短語字典，因為滑動窗口不斷的向前滑動，所以短語字典也是不斷的變化。

LZ77的主要算法邏輯就是，先通過前向緩沖區(qū)預(yù)讀數(shù)據(jù)，然后再向滑動窗口移入（滑動窗口有一定的長度），不斷的尋找能與字典中短語匹配的最長短語，然后通過標(biāo)記符標(biāo)記。

當(dāng)壓縮數(shù)據(jù)的時候，前向緩沖區(qū)與移動窗口之間在做短語匹配的是后會存在2種情況:

短語標(biāo)記包含三部分信息：（滑動窗口中的偏移量（從匹配開始的地方計算）、匹配中的符號個數(shù)、匹配結(jié)束后的前向緩沖區(qū)中的第一個符號）。

我們采用圖例來看:

1、開始

2、滑動窗口中沒有數(shù)據(jù)，所以沒有匹配到短語，將字符A標(biāo)記為A

3、滑動窗口中有A,沒有從緩沖區(qū)中字符（BABC）中匹配到短語，依然把B標(biāo)記為B

4、緩沖區(qū)字符（ABCB）在滑動窗口的位移6位置找到AB,成功匹配到短語AB,將AB編碼為(6,2,C)

5、緩沖區(qū)字符（BABA）在滑動窗口位移4的位置匹配到短語BAB,將BAB編碼為(4,3,A)

6、緩沖區(qū)字符（BCAD）在滑動窗口位移2的位置匹配到短語BC，將BC編碼為（2,2,A）

7、緩沖區(qū)字符D,在滑動窗口中沒有找到匹配短語，標(biāo)記為D

8、緩沖區(qū)中沒有數(shù)據(jù)進(jìn)入了，結(jié)束

解壓類似于壓縮的逆向過程，通過解碼標(biāo)記和保持滑動窗口中的符號來更新解壓數(shù)據(jù)。

我們還是采用圖例來看下:

1、開始

2、符號標(biāo)記A解碼

3、符號標(biāo)記B解碼

4、短語標(biāo)記(6,2,C)解碼

5、短語標(biāo)記(4,3,A)解碼

6、短語標(biāo)記(2,2,A)解碼

7、符號標(biāo)記D解碼

大多數(shù)情況下LZ77壓縮算法的壓縮比相當(dāng)高，當(dāng)然了也和你選擇滑動窗口大小，以及前向緩沖區(qū)大小有關(guān)系。其壓縮過程是比較耗時的，因為要花費(fèi)很多時間尋找滑動窗口中的短語匹配，不過解壓過程會很快，因為每個標(biāo)記都明確告知在哪個位置可以讀取了。

哈夫曼樹也叫最優(yōu)二叉樹（哈夫曼樹）

問題：什么是哈夫曼樹？

例：將學(xué)生的百分制成績轉(zhuǎn)換為五分制成績：≥90 分: A，80～89分: B，70～79分: C，60～69分: D，＜60分: E。

判別樹：用于描述分類過程的二叉樹。

如果每次輸入量都很大，那么應(yīng)該考慮程序運(yùn)行的時間

如果學(xué)生的總成績數(shù)據(jù)有10000條，則5％的數(shù)據(jù)需 1 次比較，15％的數(shù)據(jù)需 2 次比較，40％的數(shù)據(jù)需 3 次比較，40％的數(shù)據(jù)需 4 次比較，因此 10000 個數(shù)據(jù)比較的

次數(shù)為： 10000 (5％＋2×15％＋3×40％＋4×40％)＝31500次

此種形狀的二叉樹，需要的比較次數(shù)是：10000 (3×20％＋2×80％)＝22000次，顯然：兩種判別樹的效率是不一樣的。

問題：能不能找到一種效率最高的判別樹呢?

那就是哈夫曼樹

樹的帶權(quán)路徑長度：樹中所有葉子結(jié)點的帶權(quán)路徑長度之和，通常記作：

其中，n表示葉子結(jié)點的數(shù)目，wi和li分別表示葉子結(jié)點ki的權(quán)值和樹根結(jié)點到葉子結(jié)點ki之間的路徑長度。

赫夫曼樹（哈夫曼樹，huffman樹）定義：

在權(quán)為w1,w2,…,wn的n個葉子結(jié)點的所有二叉樹中，帶權(quán)路徑長度WPL最小的二叉樹稱為赫夫曼樹或最優(yōu)二叉樹。

例：有4 個結(jié)點 a, b, c, d，權(quán)值分別為 7, 5, 2, 4，試構(gòu)造以此 4 個結(jié)點為葉子結(jié)點的二叉樹。

WPL=7′2+5′2+2′2+4′2= 36

WPL=7′3+5′3+2′1+4′2= 46

哈夫曼樹的構(gòu)造(哈夫曼算法)

1.根據(jù)給定的n個權(quán)值{w1,w2,…,wn}構(gòu)成二叉樹集合F={T1,T2,…,Tn},其中每棵二叉樹Ti中只有一個帶權(quán)為wi的根結(jié)點,其左右子樹為空.

2.在F中選取兩棵根結(jié)點權(quán)值最小的樹作為左右子樹構(gòu)造一棵新的二叉樹,且置新的二叉樹的根結(jié)點的權(quán)值為左右子樹根結(jié)點的權(quán)值之和.

3.在F中刪除這兩棵樹,同時將新的二叉樹加入F中.

4.重復(fù)2、3,直到F只含有一棵樹為止.(得到哈夫曼樹)

哈夫曼樹的應(yīng)用很廣，哈夫曼編碼就是其在電訊通信中的應(yīng)用之一。廣泛地用于數(shù)據(jù)文件壓縮的十分有效的編碼方法。其壓縮率通常在20%～90%之間。在電訊通信業(yè)務(wù)中，通常用二進(jìn)制編碼來表示字母或其他字符，并用這樣的編碼來表示字符序列。

例：如果需傳送的電文為 ‘ABACCDA’，它只用到四種字符，用兩位二進(jìn)制編碼便可分辨。假設(shè) A, B, C, D 的編碼分別為 00, 01,10, 11，則上述電文便為 ‘00010010101100’（共 14 位），譯碼員按兩位進(jìn)行分組譯碼，便可恢復(fù)原來的電文。

能否使編碼總長度更短呢？

實際應(yīng)用中各字符的出現(xiàn)頻度不相同，用短（長）編碼表示頻率大（小）的字符，使得編碼序列的總長度最小，使所需總空間量最少

數(shù)據(jù)的最小冗余編碼問題

在上例中，若假設(shè) A, B, C, D 的編碼分別為 0，00，1，01，則電文 ‘ABACCDA’ 便為 ‘000011010’（共 9 位），但此編碼存在多義性：可譯為： ‘BBCCDA’、‘ABACCDA’、‘AAAACCACA’ 等。

譯碼的惟一性問題

要求任一字符的編碼都不能是另一字符編碼的前綴，這種編碼稱為前綴編碼（其實是非前綴碼）。在編碼過程要考慮兩個問題，數(shù)據(jù)的最小冗余編碼問題，譯碼的惟一性問題，利用最優(yōu)二叉樹可以很好地解決上述兩個問題

以電文中的字符作為葉子結(jié)點構(gòu)造二叉樹。然后將二叉樹中結(jié)點引向其左孩子的分支標(biāo) ‘0’，引向其右孩子的分支標(biāo) ‘1’；每個字符的編碼即為從根到每個葉子的路徑上得到的 0, 1 序列。如此得到的即為二進(jìn)制前綴編碼。

編碼： A：0， C：10，B：110，D：111

任意一個葉子結(jié)點都不可能在其它葉子結(jié)點的路徑中。

用哈夫曼樹設(shè)計總長最短的二進(jìn)制前綴編碼

例：如果需傳送的電文為 ‘ABACCDA’，即：A, B, C, D 的頻率（即權(quán)值）分別為 0.43, 0.14, 0.29, 0.14，試構(gòu)造哈夫曼編碼。

編碼： A：0， C：10， B：110， D：111 。電文 ‘ABACCDA’ 便為 ‘0110010101110’（共 13 位）。

譯碼

從哈夫曼樹根開始，對待譯碼電文逐位取碼。若編碼是“0”，則向左走；若編碼是“1”，則向右走，一旦到達(dá)葉子結(jié)點，則譯出一個字符；再重新從根出發(fā)，直到電文結(jié)束。

電文為 “1101000” ，譯文只能是“CAT”

gzip怎么壓縮和怎么解壓縮文件到其他目錄

解決：gzip -c test.txt /root/test.gz，文件流重定向，解壓也是，gunzip -c /root/test.gz ./test.txt

經(jīng)驗：更常用的命令tar同樣可以解壓*.gz，參數(shù)為-c

附gzip幫助文件

GZIP(1) ? ? ? ? ? ? ? ? ? ? General Commands Manual ? ? ? ? ? ? ? ? ? ?GZIP(1)

NAME

?gzip, gunzip, zcat - compress or expand files

SYNOPSIS

?gzip [ -acdfhlLnNrtvV19 ] [-S suffix] [ name ... ?]

?gunzip [ -acfhlLnNrtvV ] [-S suffix] [ name ... ?]

?zcat [ -fhLV ] [ name ... ?]

OPTIONS

?-a --ascii

? ? ? ? Ascii text mode: convert end-of-lines using ?local ?conventions.

? ? ? ? This ?option ?is ?supported ?only ?on some non-Unix systems. For

? ? ? ? MSDOS, CR LF is converted to LF when compressing, and LF is con‐

? ? ? ? verted to CR LF when decompressing.

?-c --stdout --to-stdout

? ? ? ? Write ?output on standard output; keep original files unchanged.

? ? ? ? If there are several input ?files, ?the ?output ?consists ?of ?a

? ? ? ? sequence ?of ?independently compressed members. To obtain better

? ? ? ? compression, concatenate ?all ?input ?files ?before ?compressing

? ? ? ? them.

?-d --decompress --uncompress

? ? ? ? Decompress.

?-f --force

? ? ? ? Force compression or decompression even if the file has multiple

? ? ? ? links or the corresponding file already exists, or if ?the ?com‐

? ? ? ? pressed data is read from or written to a terminal. If the input

? ? ? ? data is not in a format recognized by gzip, and ?if ?the ?option

? ? ? ? --stdout ?is ?also ?given, copy the input data without change to

? ? ? ? the standard output: let zcat behave ?as ?cat. ? If ?-f ?is ?not

? ? ? ? given, ?and ?when not running in the background, gzip prompts to

? ? ? ? verify whether an existing file should be overwritten.

?-h --help

? ? ? ? Display a help screen and quit.

?-l --list

? ? ? ? For each compressed file, list the following fields:

? ? ? ? ? ? compressed size: size of the compressed file

? ? ? ? ? ? uncompressed size: size of the uncompressed file

? ? ? ? ? ? ratio: compression ratio (0.0% if unknown)

? ? ? ? ? ? uncompressed_name: name of the uncompressed file

? ? ? ? The uncompressed size is given as -1 for files not in gzip ?for‐

? ? ? ? mat, ?such ?as compressed .Z files. To get the uncompressed size

? ? ? ? for such a file, you can use:

? ? ? ? ? ? zcat file.Z | wc -c

? ? ? ? In combination with the --verbose option, the ?following ?fields

? ? ? ? are also displayed:

? ? ? ? ? ? method: compression method

? ? ? ? ? ? crc: the 32-bit CRC of the uncompressed data

? ? ? ? ? ? date time: time stamp for the uncompressed file

? ? ? ? The ?compression ?methods ?currently supported are deflate, com‐

? ? ? ? press, lzh (SCO compress -H) and pack. ? The ?crc ?is ?given ?as

? ? ? ? ffffffff for a file not in gzip format.

? ? ? ? With ?--name, ?the ?uncompressed name, ?date and time ?are those

? ? ? ? stored within the compress file if present.

? ? ? ? With --verbose, the size totals and compression ?ratio ?for ?all

? ? ? ? files ?is ?also ?displayed, ?unless some sizes are unknown. With

? ? ? ? --quiet, the title and totals lines are not displayed.

?-L --license

? ? ? ? Display the gzip license and quit.

?-n --no-name

? ? ? ? When compressing, do not save the original file ?name ?and ?time

? ? ? ? stamp by default. (The original name is always saved if the name

? ? ? ? had to be truncated.) When decompressing, ?do ?not ?restore ?the

? ? ? ? original ?file name if present (remove only the gzip suffix from

? ? ? ? the compressed file name) and do not restore the ?original ?time

? ? ? ? stamp if present (copy it from the compressed file). This option

? ? ? ? is the default when decompressing.

?-N --name

? ? ? ? When compressing, always save the original file ?name ?and ?time

? ? ? ? stamp; ?this ?is ?the ?default. ?When decompressing, restore the

? ? ? ? original file name and time stamp if ?present. ?This ?option ?is

? ? ? ? useful on systems which have a limit on file name length or when

? ? ? ? the time stamp has been lost after a file transfer.

?-q --quiet

? ? ? ? Suppress all warnings.

?-r --recursive

? ? ? ? Travel the directory structure recursively. If any of ?the ?file

? ? ? ? names ?specified ?on the command line are directories, gzip will

? ? ? ? descend into the directory and compress all the files ?it ?finds

? ? ? ? there (or decompress them in the case of gunzip ).

?-S .suf --suffix .suf

? ? ? ? When compressing, use suffix .suf instead of .gz. ?Any non-empty

? ? ? ? suffix can be given, but suffixes other than .z and ?.gz ?should

? ? ? ? be ?avoided ?to ?avoid ?confusion ?when files are transferred to

? ? ? ? other systems.

? ? ? ? When decompressing, add .suf to the beginning ?of ?the ?list ?of

? ? ? ? suffixes to try, when deriving an output file name from an input

? ? ? ? file name.

? ? ? ? pack(1).

?-t --test

? ? ? ? Test. Check the compressed file integrity.

?-v --verbose

? ? ? ? Verbose. Display the name and percentage reduction for each file

? ? ? ? compressed or decompressed.

?-V --version

? ? ? ? Version. Display the version number and compilation options then

? ? ? ? quit.

?-# --fast --best

? ? ? ? Regulate the speed of compression using the specified ?digit ?#,

? ? ? ? where ?-1 ?or ?--fast ?indicates ?the fastest compression method

? ? ? ? (less compression) and -9 or --best indicates the ?slowest ?com‐

? ? ? ? pression ?method ?(best ?compression). ? The default compression

? ? ? ? level is -6 (that is, biased towards high compression at expense

? ? ? ? of speed).

golang中compress/flate包

官方標(biāo)準(zhǔn)庫對flate包的定義是：flate包實現(xiàn)了deflate壓縮數(shù)據(jù)格式，參見 RFC 1951 。gzip包和zlib包實現(xiàn)了對基于deflate的文件格式的訪問。

這邊什么是deflate？

維基百科給出的解釋是： DEFLATE 是同時使用了 LZ77 算法與哈夫曼編碼（Huffman Coding）的一個無損數(shù)據(jù)壓縮算法。它最初是由菲爾·卡茨（Phil Katz）為他的 PKZIP 軟件第二版所定義的，后來被 RFC 1951 標(biāo)準(zhǔn)化。

1）func NewReader(r io.Reader) io.ReadCloser

2）func NewReaderDict(r io.Reader, dict []byte) io.ReadCloser

3）func NewWrite(w io.Write, level int) (*Write, error)

4）func NewWriteDict(w io.Writer, level int, dict []byte) (*Writer, error)

5）func (e InternalError) Error() string

6）func (e *ReadError) Error() string

7）func (e *WriteError) Error() string

8）func (w *Writer) Close() error

9）func (w *Writer) Flush() error

9）func (w *Writer) Reset(dst io.Writer)

10）func (w *Writer) Write(data []byte) (n int, err error)

非常好的一個資源鏈接：

如果有很好的資源，歡迎在評論區(qū)留言分享

zstd，未來可期的數(shù)據(jù)壓縮算法

最近了解到了 zstd 這種新的壓縮算法。不像lz4，lzo，snappy等近幾年流行的壓縮算法專注于壓縮和解壓縮性能，zstd在性能不錯的同時號稱壓縮率跟Deflate（zip/gzip的算法）相當(dāng)。下面是官網(wǎng) 列出的數(shù)據(jù)：

我們知道，壓縮算法的效果和性能跟被壓縮的數(shù)據(jù)類型和模式有很大的關(guān)系，光看別人的測試數(shù)據(jù)、benchmark是不夠的。正好有功能開發(fā)需要，于是結(jié)合我們的使用場景真實測試的一下。

驚喜的是，實測的結(jié)果比官方提供的還好，終于找到了我們的cup of tea。

Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz, 8G內(nèi)存

CentOS 7.0

對幾種支持流式寫入的壓縮算法，使用對應(yīng)的命令行工具進(jìn)行壓縮測試。

除了snappy，各種壓縮算法/工具都支持設(shè)置壓縮級別，高級別意味著以更長的壓縮時間換取更高的壓縮率。

100萬行不重復(fù)的某個應(yīng)用的日志文件，大小為977MB。

從上面可以看出：

zstd無論從處理時間還是壓縮率來看都占優(yōu)。snappy, lz4, lzo的壓縮率較低，但壓縮速度都很快，而zstd甚至比這些算法更快。Gzip的壓縮率比lz4等高不少，而zstd的壓縮率比gzip還提升一倍。

如果從上面的比較還不是特別直觀的話，我們再引入一個創(chuàng)造性的指標(biāo)（從網(wǎng)上其他壓縮算法對比沒有見過使用這項指標(biāo)）：

代表單位處理時間可以壓縮去掉多少冗余數(shù)據(jù)。其中權(quán)重系數(shù) 用來指定壓縮率和壓縮速度哪個更重要，這里我們認(rèn)為在我們的使用場景里兩者同樣重要，取系數(shù)為1。

從這里我們可以明顯看出， zstd lz4 lzo snappy 其他。

對1000行、大小約為1MB的文件進(jìn)行壓縮測試，各種算法的壓縮率跟1GB大文件的壓縮率幾乎一樣。

下面再對更小的數(shù)據(jù)量——10行日志數(shù)據(jù)的壓縮率進(jìn)行對比。雖然我們的使用場景里沒有對小數(shù)據(jù)量的壓縮處理，但還是比較好奇zstd字典模式的效果。

其中最后一組數(shù)據(jù)為zstd使用10000行日志進(jìn)行訓(xùn)練生成字典文件，并利用字典文件輔助壓縮測試數(shù)據(jù)。

可以看出來，除了zstd字典模式外，各種壓縮算法在處理更小的數(shù)據(jù)量時壓縮率都下降很多。而zstd字典模式對壓縮率帶來幫助非常明顯，與gzip對比，壓縮率從1000行時相差1倍，到10行時變?yōu)榱讼嗖罱咏?倍。

下一篇文章將給大家對比這幾種算法的golang開源庫的性能和壓縮率。敬請期待。

當(dāng)前名稱：go語言gzip壓縮 go語言打包
網(wǎng)頁URL：http://www.xueling.net.cn/article/dogcgjh.html

老熟女激烈的高潮_日韩一级黄色录像_亚洲1区2区3区视频_精品少妇一区二区三区在线播放_国产欧美日产久久_午夜福利精品导航凹凸

重慶分公司，新征程啟航

go語言gzip壓縮 go語言打包

深入理解gzip原理

gzip怎么壓縮和怎么解壓縮文件到其他目錄

golang中compress/flate包

zstd，未來可期的數(shù)據(jù)壓縮算法

其他資訊