A bug in NLTK?
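Something I noticed while experimenting after recently buying the book 自然言語処理入門 (an introduction to natural language processing).
In Python, after importing nltk, calling strip() on "ム" (\xe3\x83\xa0) removes the trailing \xa0 byte. Normally the result should remain \xe3\x83\xa0.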
Python 2.6.5 (r265:79063, Apr 12 2010, 01:06:47)
[GCC 4.2.1 (Apple Inc. build 5646) (dot 1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> s='ム'
>>> s
'\xe3\x83\xa0'
>>> s.strip()
'\xe3\x83\xa0'
>>> s.strip().decode('utf-8')
u'\u30e0'
>>> import nltk
>>> nltk.__version__
'2.0b9'
>>> s
'\xe3\x83\xa0'
>>> s.strip()
'\xe3\x83'
>>> sys.getdefaultencoding()
'utf-8'
>>> s.strip().decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: unexpected end of data
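A possible workaround, which is not from my original session and is only a minimal sketch: whatever the cause turns out to be, the problem only shows up when strip() has to decide byte by byte what counts as whitespace. Assuming the input is UTF-8 encoded bytes, one option is to decode before stripping, and another is to pass strip() an explicit set of ASCII whitespace characters. The helper names strip_utf8_bytes and strip_ascii_only are hypothetical, made up for this example.

```python
# Explicit ASCII whitespace: space, tab, newline, carriage return,
# vertical tab, form feed. The byte \xa0 is deliberately not included.
ASCII_WHITESPACE = b" \t\n\r\x0b\x0c"


def strip_utf8_bytes(raw):
    """Strip whitespace at the character level instead of the byte level.

    Assumes `raw` holds valid UTF-8 (an assumption of this sketch).
    """
    return raw.decode("utf-8").strip().encode("utf-8")


def strip_ascii_only(raw):
    """Strip only the ASCII whitespace bytes listed above.

    Because the characters to remove are given explicitly, strip() does
    not need to ask whether \xa0 counts as whitespace.
    """
    return raw.strip(ASCII_WHITESPACE)


s = b"\xe3\x83\xa0"  # UTF-8 bytes for the katakana character U+30E0
print(repr(strip_utf8_bytes(b"  " + s + b"\n")))
print(repr(strip_ascii_only(b" " + s + b" \t")))
```

Note that the two helpers differ on purpose: the decoding version also removes a genuine no-break space (U+00A0) at either end, because Unicode treats it as whitespace, while the ASCII-only version leaves it alone.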
Update, 2011/2/7
With Python 2.7.1 the problem did not occur.
Python 2.7.1 (r271:86832, Feb 7 2011, 12:54:42)
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> s='ム'
>>> s
'\xe3\x83\xa0'
>>> s.strip()
'\xe3\x83\xa0'
>>> s.strip().decode('utf-8')
u'\u30e0'
>>> import nltk
>>> nltk.__version__
'2.0b9'
>>> s
'\xe3\x83\xa0'
>>> s.strip()
'\xe3\x83\xa0'

Posted: February 7th, 2011 | Author: yamakk | Filed under: Technology | Tags: nltk, python