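An NLTK bug?

I recently bought an introductory book on natural language processing (自然言語処理入門) and have been trying various things. Here is something I noticed along the way.

After importing nltk in Python, calling strip() on the byte string "ム" (\xe3\x83\xa0) removes the trailing \xa0 byte, leaving an invalid UTF-8 sequence. Normally the result should still be \xe3\x83\xa0.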

Python 2.6.5 (r265:79063, Apr 12 2010, 01:06:47)
[GCC 4.2.1 (Apple Inc. build 5646) (dot 1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> s='ム'
>>> s
'\xe3\x83\xa0'
>>> s.strip()
'\xe3\x83\xa0'
>>> s.strip().decode('utf-8')
u'\u30e0'
>>> import nltk
>>> nltk.__version__
'2.0b9'
>>> s
'\xe3\x83\xa0'
>>> s.strip()
'\xe3\x83'
>>> sys.getdefaultencoding()
'utf-8'
>>> s.strip().decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: unexpected end of data
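
If the goal is just to trim whitespace from UTF-8 text without cutting a multi-byte character in half, one way to sidestep the problem is to decode to Unicode before stripping, or to pass strip() an explicit set of ASCII whitespace bytes. This is only a sketch and not part of the session above. The helper names `strip_utf8` and `strip_ascii_ws` are made up for illustration, and the code is written to run on both Python 2.6+ and Python 3.

```python
# Hypothetical helpers (illustrative names, not part of NLTK or the stdlib).

# The six ASCII whitespace bytes: space, tab, LF, CR, VT, FF.
ASCII_WHITESPACE = b' \t\n\r\x0b\x0c'


def strip_utf8(raw):
    """Decode UTF-8 bytes first, then strip, so no character is cut in half."""
    return raw.decode('utf-8').strip()


def strip_ascii_ws(raw):
    """Strip only the listed ASCII whitespace bytes; 0xa0 is never touched."""
    return raw.strip(ASCII_WHITESPACE)


s = b' \xe3\x83\xa0\n'  # U+30E0 encoded as UTF-8, with surrounding whitespace
print(repr(strip_utf8(s)))
print(repr(strip_ascii_ws(s)))
```

Decoding first is usually the simpler choice when the data is known to be UTF-8. The explicit byte set is useful when the data has to stay a byte string.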

Update, 2011/2/7
With Python 2.7.1 the problem did not occur.

Python 2.7.1 (r271:86832, Feb  7 2011, 12:54:42) 
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> s='ム'
>>> s
'\xe3\x83\xa0'
>>> s.strip()
'\xe3\x83\xa0'
>>> s.strip().decode('utf-8')
u'\u30e0'
>>> import nltk
>>> nltk.__version__
'2.0b9'
>>> s
'\xe3\x83\xa0'
>>> s.strip()
'\xe3\x83\xa0'
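
Why the two interpreters behave differently is not established here. One unverified guess is that strip() on a Python 2 byte string relies on the C library's locale-dependent idea of whitespace, and that something pulled in by `import nltk` changes the process locale. A quick way to test that guess is to compare the LC_CTYPE setting before and after an import. The helper name `ctype_locale_around_import` is made up for illustration, and the sketch below uses `json` as a stand-in module so that it runs without NLTK installed.

```python
import locale


def ctype_locale_around_import(module_name):
    """Return the LC_CTYPE setting before and after importing module_name."""
    # Calling setlocale() without a second argument only queries the setting.
    before = locale.setlocale(locale.LC_CTYPE)
    __import__(module_name)
    after = locale.setlocale(locale.LC_CTYPE)
    return before, after


# Substitute 'nltk' for 'json' to check the case described in this post.
before, after = ctype_locale_around_import('json')
print('before: %r, after: %r' % (before, after))
```

If the two values differ for `nltk`, the locale is a plausible suspect. If they are the same, the cause lies elsewhere.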
Posted: February 7th, 2011 | Filed under: Technology
