Python File Encoding Issues

Introduction

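When reading or writing external files, it is easy to run into encoding-related errors because the default encoding differs between environments (especially on Windows). Whether you use Python's built-in open/read or the read_csv method provided by Pandas, you can solve the problem by specifying the encoding parameter when opening the file.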

Suppose you want to access a file using the utf-8 encoding. The usage is as follows:


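encoding = 'utf-8'

# Open the file in text mode with an explicit encoding;
# the with statement closes the file automatically.
with open('filename.txt', 'r', encoding=encoding) as f:
    text = f.read()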


import pandas as pd

encoding = 'utf-8'

# read_csv accepts the same encoding parameter
df = pd.read_csv('filename.csv', encoding=encoding)
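
To see why the explicit encoding matters, here is a minimal sketch using only the standard library. It writes a small UTF-8 file to a temporary path, then reads it back twice: once with a mismatched codec and once with the correct one. The sample text and the use of 'ascii' as a stand-in for a wrong platform default are assumptions made for illustration; on a real Windows machine the mismatched default would more likely be a code page such as cp950 or cp1252.

```python
import locale
import os
import tempfile

# The encoding open() uses when none is given is platform dependent.
print("Locale default encoding:", locale.getpreferredencoding(False))

# Hypothetical sample data, written to a temporary UTF-8 file.
sample = "編碼測試"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Reading with a codec that does not match the file raises an error.
    # 'ascii' stands in here for a wrong default encoding.
    try:
        with open(path, "r", encoding="ascii") as f:
            f.read()
        mismatch_failed = False
    except UnicodeDecodeError:
        mismatch_failed = True

    # Reading with the encoding the file was written in succeeds.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
finally:
    os.remove(path)

print("Mismatched read failed:", mismatch_failed)
print("Round trip matches:", text == sample)
```

A mismatched codec does not always raise an error. Some code pages accept almost any byte sequence and silently return garbled text instead, which is another reason to pass encoding explicitly.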

Another common problem is that you may not know the encoding of the original file. In that case, you can use the chardet tool to detect it:


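import chardet

# Read the raw bytes; detection works on bytes, not decoded text
with open('filename.txt', 'rb') as f:
    result = chardet.detect(f.read())

# result is a dict; result['encoding'] holds the guessed encoding name
print(result['encoding'], result['confidence'])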
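
When chardet is not installed, or its guess turns out to be wrong, one alternative is to try a short list of candidate encodings in order. The sketch below is not part of chardet; it is a different, standard-library-only technique, and the helper name read_with_fallback and the default candidate list are hypothetical choices for illustration.

```python
import os
import tempfile


def read_with_fallback(path, candidates=("utf-8", "big5")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError(f"none of {candidates} could decode {path}")


# Usage: write a Big5-encoded file to a temporary path and recover its text.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("中文".encode("big5"))
    demo_path = tmp.name

try:
    text, used = read_with_fallback(demo_path)
finally:
    os.remove(demo_path)

print(text, used)
```

The order of the candidates matters. Strict encodings such as utf-8 should come first, and permissive ones such as latin-1, which accepts any byte sequence, should come last if they are included at all. A successful decode only shows that the bytes are valid in that encoding, not that the result is the intended text.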

Finally, here is a quick overview of a few common encoding formats: