Drawing graphs with NetworkX

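I wanted to look at the network between Wikipedia categories, so I drew a graph with NetworkX, a Python library. For example, the Wikipedia page 「ヒップホップ」 (Hip hop) is registered in the following categories:

ヒップホップ (Hip hop)

ヒップホップ用語 (Hip hop terminology)

アメリカ合衆国の音楽 (Music of the United States)

サブカルチャー (Subcultures)

風俗 (Customs)

アフリカ系アメリカ人の文化 (African-American culture)

Following the categories that each of these belongs to, and visualizing how they relate, gives the figure below (this figure goes up to categories three levels above).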
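The upward walk described above can be sketched with the standard library alone. This is a minimal sketch, not the code behind the figure: the `PARENTS` table is a hypothetical stand-in for real Wikipedia category data, the `upper-*` names are made up, and `category_edges` is a name chosen for illustration. It is written for Python 3.

```python
from collections import deque

# Hypothetical child -> parent-categories table (the "upper-*" names are made up).
PARENTS = {
    "ヒップホップ (page)": ["ヒップホップ", "サブカルチャー"],
    "ヒップホップ": ["upper-A"],
    "サブカルチャー": ["upper-A"],
    "upper-A": ["upper-B"],
    "upper-B": ["upper-C"],  # four levels above the page: must not be reached
}


def category_edges(start, parents, max_depth=3):
    """Breadth-first walk up the category tree, at most max_depth levels."""
    edges = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not look above the top level
        for parent in parents.get(node, []):
            edges.append((node, parent))
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return edges


EDGES = category_edges("ヒップホップ (page)", PARENTS)
for child, parent in EDGES:
    print(child, "->", parent)
```

The resulting list of (child, parent) pairs is in the shape that `networkx.DiGraph.add_edges_from` accepts, so it could be handed to NetworkX for drawing.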

Rendering Japanese text in the graph was a struggle. All I needed was for it to draw somehow, so I put together a rough patch as a workaround.
patch for networkx1.5 https://gist.github.com/1247256

Learning to Link with Wikipedia
http://www.cs.waikato.ac.nz/~dnk2/publications/CIKM08-LearningToLinkWithWikipedia.pdf

Posted: September 30th, 2011 | Author: yamakk | Filed under: Technology | Tags: graph, networkx, python, wikipedia

Using Japanese fonts with Matplotlib

I wanted to try out gist, so here is a sample that uses a Japanese font with Matplotlib.
http://matplotlib.sourceforge.net

	#coding:utf-8
	"""
	Sample that uses a Japanese font with matplotlib

	"""

	import matplotlib.pyplot
	import matplotlib.font_manager

	# for Mac
	font_path = '/Library/Fonts/Osaka.ttf'
	# for Linux
	#font_path = '/home/k01/.fonts/ipag00303/ipag.ttf'

	# If font_path cannot be found, this fails with
	# RuntimeError: Could not open facefile
	# /home/k01/.fonts/ipag00303/ipag.ttf; Cannot_Open_Resource

	font_prop = matplotlib.font_manager.FontProperties(fname=font_path)
	matplotlib.pyplot.title(u'日本語のテストタイトル', fontproperties=font_prop)
	matplotlib.pyplot.plot(range(10))

	# With matplotlib.pyplot.show() the Japanese text is not displayed correctly; savefig is OK
	matplotlib.pyplot.savefig('test.png')
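The RuntimeError mentioned in the comments comes from pointing `FontProperties(fname=...)` at a file that does not exist. One way to fail earlier and more readably is to check the candidate paths first. This is a minimal sketch for Python 3; `first_existing_font` is a hypothetical helper, and the two paths are simply the ones from the sample above.

```python
import os


def first_existing_font(candidates):
    """Return the first path in candidates that is an existing file, else None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


CANDIDATES = [
    '/Library/Fonts/Osaka.ttf',             # Mac path from the sample
    '/home/k01/.fonts/ipag00303/ipag.ttf',  # Linux path from the sample
]

font_path = first_existing_font(CANDIDATES)
if font_path is None:
    print('No Japanese font found; FontProperties(fname=...) would fail here')
else:
    print('Using font:', font_path)
```

The chosen `font_path` can then be passed to `matplotlib.font_manager.FontProperties(fname=font_path)` exactly as in the sample.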



Posted: September 28th, 2011 | Author: yamakk | Filed under: Technology | Tags: font, gist, matplotlib, python

A bug in NLTK?

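Something I noticed while experimenting after recently buying the book 自然言語処理入門 (an introduction to natural language processing).

After importing Python's nltk, calling strip() on "ム" (\xe3\x83\xa0) removes the trailing \xa0. Normally the result should still be \xe3\x83\xa0.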

Python 2.6.5 (r265:79063, Apr 12 2010, 01:06:47)
[GCC 4.2.1 (Apple Inc. build 5646) (dot 1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> s='ム'
>>> s
'\xe3\x83\xa0'
>>> s.strip()
'\xe3\x83\xa0'
>>> s.strip().decode('utf-8')
u'\u30e0'
>>> import nltk
>>> nltk.__version__
'2.0b9'
>>> s
'\xe3\x83\xa0'
>>> s.strip()
'\xe3\x83'
>>> sys.getdefaultencoding()
'utf-8'
>>> s.strip().decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: unexpected end of data

Update, 2011/2/7
With Python 2.7.1 the problem did not occur.

Python 2.7.1 (r271:86832, Feb  7 2011, 12:54:42) 
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> s='ム'
>>> s
'\xe3\x83\xa0'
>>> s.strip()
'\xe3\x83\xa0'
>>> s.strip().decode('utf-8')
u'\u30e0'
>>> import nltk
>>> nltk.__version__
'2.0b9'
>>> s
'\xe3\x83\xa0'
>>> s.strip()
'\xe3\x83\xa0'
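The last byte of "ム" in UTF-8 is 0xA0, which is a no-break space in Latin-1, so a byte-level strip() that counts 0xA0 as whitespace would cut the character in half. Whether that is what happened in the session above is only a guess; the post does not establish the cause. The sketch below, written for Python 3, reproduces the damage explicitly with `rstrip(b'\xa0')` and shows a defensive alternative: decode to text first, then strip. `safe_strip` is a hypothetical helper name.

```python
raw = 'ム'.encode('utf-8')  # b'\xe3\x83\xa0': the last byte is 0xA0

# Simulate a strip() that treats 0xA0 as whitespace.
damaged = raw.rstrip(b'\xa0')


def safe_strip(data):
    """Decode UTF-8 bytes first, then strip, so no character is split."""
    return data.decode('utf-8').strip()


try:
    damaged.decode('utf-8')
    decodes = True
except UnicodeDecodeError:
    decodes = False  # the truncated sequence is no longer valid UTF-8

print(repr(raw), repr(damaged), decodes)
print(safe_strip(b' ' + raw + b'\n') == 'ム')
```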

Posted: February 7th, 2011 | Author: yamakk | Filed under: Technology | Tags: nltk, python

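Parsing the iTunes feed with ElementTree

To collect data on iPhone apps I started out with Universal Feed Parser, which everyone uses in Python, only to find that it cannot handle the extended namespaces, so I switched to ElementTree.
I had assumed it would obviously work, so I wasted some time. Read the documentation.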

#coding:utf-8
import urllib
from xml.etree import ElementTree
from dateutil import parser as date_parser


def parse_app_feed():
    free_url = "http://itunes.apple.com/jp/rss/topfreeapplications/limit=10/xml"
    atom_ns = 'http://www.w3.org/2005/Atom'
    itunes_ns = 'http://itunes.apple.com/rss'
    xml_string = urllib.urlopen(free_url).read()
    #xml = ElementTree(file=urllib.urlopen(free_url))
    xml = ElementTree.fromstring(xml_string)
    xml_updated = date_parser.parse(
        xml.find('./{%s}updated' % atom_ns).text)
    xml_uri = xml.find('./{%s}id' % atom_ns).text
    for e in xml.findall('.//{%s}entry' % atom_ns):
        uri = e.find('./{%s}id' % atom_ns).text
        name = e.find('./{%s}name' % itunes_ns).text
        title = e.find('./{%s}title' % atom_ns).text
        category = e.find('./{%s}category' % atom_ns).attrib['term']
        description = e.find('./{%s}summary' % atom_ns).text
        release_at = date_parser.parse(
            e.find('./{%s}releaseDate' % itunes_ns).text)
        artist_name = e.find('./{%s}artist' % itunes_ns).text
        artist_uri = e.find('./{%s}artist' % itunes_ns).attrib['href']
        price = float(e.find('./{%s}price' % itunes_ns).attrib['amount'])
        currency = e.find('./{%s}price' % itunes_ns).attrib['currency']
        icon_uri = e.findall('./{%s}image' % itunes_ns)[-1].text
        screen_uri = [_ent for _ent in e.findall('./{%s}link' % atom_ns) \
                          if 'image' in _ent.attrib['type']][0].text

        print '%s\t%s\t%s' % (name, category, description[:10])

if __name__ == '__main__':
    parse_app_feed()

Tapic	Games	FREE FOR A
!	Games	テレビ朝日系列「お願
Kozeni Lite	Games	コインをタップして消
整形マニア	Entertainment	iPhoneで美容整
Chariso(Bike Rider)	Games	☆今だけ無料☆
Find My iPhone	Utilities	iPhone、iPa
ガイラルディア	Games	無料で遊べる王道系の
ZOZOTOWN	Lifestyle	日本最大級のファッシ
電卓少女	Entertainment	『電卓少女』で計算萌
Ringtone Maker - Make free ringtones from your music!	Music	☆ iPodの中の曲
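The `'{%s}tag' % ns` formatting in the listing above can also be written with a prefix map, which `find` and `findall` in `xml.etree.ElementTree` accept as a second argument. The following is a minimal sketch for Python 3 that parses an inline sample instead of the live feed; the sample document, the `im` prefix, and the `parse_entries` name are assumptions made for illustration, while the two namespace URIs are the ones from the listing.

```python
from xml.etree import ElementTree

NS = {
    'atom': 'http://www.w3.org/2005/Atom',
    'im': 'http://itunes.apple.com/rss',
}

# Made-up sample document shaped like one entry of the feed.
SAMPLE = """\
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:im="http://itunes.apple.com/rss">
  <id>urn:example:feed</id>
  <entry>
    <id>urn:example:app1</id>
    <title>Example App - Example Artist</title>
    <im:name>Example App</im:name>
    <category term="Games"/>
    <im:price amount="0.00000" currency="JPY">Free</im:price>
  </entry>
</feed>"""


def parse_entries(xml_text):
    """Return one dict per Atom entry, reading both Atom and im: elements."""
    root = ElementTree.fromstring(xml_text)
    rows = []
    for entry in root.findall('atom:entry', NS):
        price = entry.find('im:price', NS)
        rows.append({
            'name': entry.find('im:name', NS).text,
            'category': entry.find('atom:category', NS).attrib['term'],
            'price': float(price.attrib['amount']),
            'currency': price.attrib['currency'],
        })
    return rows


ROWS = parse_entries(SAMPLE)
print(ROWS)
```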

Posted: November 30th, 2010 | Author: yamakk | Filed under: Technology | Tags: feed, iphone, itunes, python, rss

rpy2 Scatter plot

Drawing a scatter plot with rpy2

import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

grdevices = importr('grDevices')
grdevices.png(file="file.png", width=512, height=512)
# plotting code here

# moon age
lstx = [0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0,
        10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0,
        20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0]

# fish catch
lsty = [190.0, 200.0, 50.0, 1700.0, 360.0, 620.0, 450.0, 2100.0,
        120.0, 5000.0, 1900.0, 780.0, 880.0, 1500.0, 1500.0, 1900.0,
        200.0, 270.0, 2100.0, 7000.0, 900.0, 1200.0, 3700.0, 2000.0,
        2900.0, 1300.0, 140.0, 2250.0, 120.0, 1000.0]

rx = robjects.FloatVector(lstx)
ry = robjects.FloatVector(lsty)

robjects.r.plot(x=rx, y=ry, xlab="moon age", ylab="fish", col="blue")

grdevices.dev_off()

When reading from a CSV file, you can also write the R statements as they are.

import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
grdevices = importr('grDevices')
grdevices.png(file="file.png", width=512, height=512)

robjects.r('''
data <-read.csv('fish_moonage.csv', header=TRUE)
plot(fish~moonage, xlab="foo", data=data, col="purple")
''')

grdevices.dev_off()
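The second listing expects a file named fish_moonage.csv with a header row and the columns `moonage` and `fish` (the names used in the R formula). The post does not show how that file was made; one possible way, sketched here for Python 3 with the standard `csv` module, is to write it from the two lists of the first listing. `write_fish_csv` is a hypothetical helper, only the first three data points are used, and the demo writes into a temporary directory rather than next to the script.

```python
import csv
import os
import tempfile


def write_fish_csv(path, moon_ages, catches):
    """Write paired values with the header that read.csv(header=TRUE) expects."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['moonage', 'fish'])
        writer.writerows(zip(moon_ages, catches))


# First three points of lstx / lsty from the listing above.
CSV_PATH = os.path.join(tempfile.mkdtemp(), 'fish_moonage.csv')
write_fish_csv(CSV_PATH, [0, 1.0, 2.0], [190.0, 200.0, 50.0])
print('wrote', CSV_PATH)
```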

Posted: August 27th, 2010 | Author: yamakk | Filed under: Technology | Tags: R, python, rpy2, statistics, visualization

SimpleGeo with Python

Today I tried SimpleGeo, a cloud service that provides a location API. The database it uses internally is apparently:

We call it GiselleDB. Based on Cassandra and other NoSQL

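SimpleGeo has storage called a Layer for accumulating location data, and you can create and run a single Layer for free. On this free plan, API calls are limited to one million per month. Paid plans range from $399 to $9999 and allow backups to Amazon S3.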
SimpleGeo – Plans & Pricing

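What looks interesting is the Layer MarketPlace, where several Layers are offered free of charge. There are Layers for weather and earthquake epicenters, and even a Layer of recreation areas provided by the US government. You can combine these with your own Layers to build mashup apps.

Below, I created an account and went through the tutorial for using the API from Python.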
How do I get started using Python?


Posted: July 6th, 2010 | Author: yamakk | Filed under: Technology | Tags: api, cassandra, geo, gis, nosql, python, simplegeo

Using R from Python rpy2

Today I installed R (version 2.11.1, 2010-05-31) with the Mac installer.
R is an environment for statistical computing and graphics.
Python has some packages for working with R, such as rpy and rpy2.

Here is my install log.

install numpy and rpy2 via pip

pip install numpy
pip install rpy2

import rpy2.robjects as robjects

ctl = robjects.FloatVector([4.17, 5.58, 5.18, 6.11, 4.50,
                            4.61, 5.17, 4.53, 5.33, 5.14])
trt = robjects.FloatVector([4.81, 4.17, 4.41, 3.59, 5.87,
                            3.83, 6.03, 4.89, 4.32, 4.69])

correlation = robjects.r('function(x, y) cor.test(x, y)')
print correlation(ctl, trt)

$ python Desktop/correlation_sample.py

	Pearson's product-moment correlation

data:  x and y 
t = -1.4559, df = 8, p-value = 0.1835
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 -0.8440680  0.2415684 
sample estimates:
       cor 
-0.4576683
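As a cross-check on the `cor` and `t` values R reports above, the same Pearson correlation can be computed in plain Python from the textbook formulas. This is a minimal sketch for Python 3, not a replacement for `cor.test` (it does not compute the p-value or the confidence interval); `pearson_r` is a name chosen for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


ctl = [4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14]
trt = [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69]

r = pearson_r(ctl, trt)
df = len(ctl) - 2
# t statistic for testing whether the true correlation is zero
t = r * math.sqrt(df / (1 - r ** 2))
print('cor = %.7f, t = %.4f, df = %d' % (r, t, df))
```

The printed values should agree with the `cor`, `t` and `df` lines of the R output to the digits shown there.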

Posted: July 4th, 2010 | Author: yamakk | Filed under: Technology | Tags: mac, math, python, rpy2, statistics

Fixture + GoogleAppEngine

Using fixture for Google App Engine DataStore tests: the JSON edition.

Reference: Using Fixture To Test A Google App Engine Site

fixture.dataset.converter provides dataset_to_json for converting DataSets to JSON. However, a field such as date_created (auto_now_add=True) cannot be given a value in advance in the DataSets, so JSON generated this way will not be equal to the actual JSON response, which does carry date_created.
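One way around the date_created mismatch, if a whole-document comparison is wanted, is to drop the auto-generated keys from the parsed response before comparing. This is a minimal sketch for Python 3 using only the standard `json` module; `drop_keys` is a hypothetical helper, and the timestamps in the sample response are made up.

```python
import json


def drop_keys(value, keys=('date_created',)):
    """Recursively remove the given keys from nested dicts and lists."""
    if isinstance(value, dict):
        return {k: drop_keys(v, keys) for k, v in value.items() if k not in keys}
    if isinstance(value, list):
        return [drop_keys(v, keys) for v in value]
    return value


# What the DataSets can predict (no timestamps).
expected = [{
    'title': 'Monday Was Great',
    'comments': [{'comment': 'I like Mondays too.'}],
}]

# Shape of a response that carries auto-generated timestamps (made-up values).
response_body = json.dumps([{
    'title': 'Monday Was Great',
    'date_created': '2010-05-26T12:00:00',
    'comments': [{'comment': 'I like Mondays too.',
                  'date_created': '2010-05-26T12:00:01'}],
}])

matches = drop_keys(json.loads(response_body)) == expected
print(matches)
```

The trade-off is that the test no longer checks the timestamps at all; asserting on individual fields, as the test further down does, is the other option.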

.zshrc

gaedir=/usr/local/google_appengine
if [ -d $gaedir ] ; then ;
    export PATH=${PATH}:$gaedir
    export PYTHONPATH=${PYTHONPATH}:${gaedir}:${gaedir}/lib/antlr3:\
${gaedir}/lib/cacerts:${gaedir}/lib/django:${gaedir}/lib/ipaddr:\
${gaedir}/lib/webob:${gaedir}/lib/yaml/lib
fi;

$ source ~/.zshrc
$ pip install WebTest
$ pip install nose
$ pip install NoseGAE

app.yaml

application: sampleblog # note: "_" cannot be used in the application identifier
version: 1
runtime: python
api_version: 1

handlers:
- url: /.*
  script: blog.py

blog.py

#coding:utf-8
import logging
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
from google.appengine.ext import db
from django.utils import simplejson


class Entry(db.Model):
    title = db.StringProperty()
    body = db.TextProperty()
    date_created = db.DateTimeProperty(auto_now_add=True)


class Comment(db.Model):
    entry = db.ReferenceProperty(Entry)
    comment = db.TextProperty()
    date_created = db.DateTimeProperty(auto_now_add=True)


class EntriesHandler(webapp.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'application/json;charset=utf-8'
        entries = []
        for entry in Entry.all():
            comments = []
            for comment in Comment.all().filter("entry =", entry):
                comment_dict = \
                    dict(comment=comment.comment,
                         date_created=comment.date_created.isoformat())
                comments.append(comment_dict)
            entry_dict = dict(title=entry.title,
                              body=entry.body,
                              date_created=entry.date_created.isoformat(),
                              comments=comments)
            entries.append(entry_dict)
        json = simplejson.dumps(entries, indent=True)
        #logging.info(json)
        self.response.out.write(json)


routing = [('/entries', EntriesHandler)]

application = webapp.WSGIApplication(
        routing,
        debug=True)


def main():
    run_wsgi_app(application)

if __name__ == '__main__':
    main()

tests/datasets.py (http://fixture.googlecode.com/hg/fixture/examples/google_appengine_example/tests/datasets.py)

from fixture import DataSet


class EntryData(DataSet):
    class great_monday:
        title = "Monday Was Great"
        body = """\
Monday was the best day ever.  I got up (a little late, but that's OK) then I ground some coffee.
Mmmm ... coffee!  I love coffee.  Do you know about
<a href="http://www.metropoliscoffee.com/">Metropolis</a> coffee?  It's amazing.  Delicious.
I drank a beautiful cup of french pressed
<a href="http://www.metropoliscoffee.com/shop/coffee/blends.php">Spice Island</a>, had a shave
and went to work.  What a day!
"""


class CommentData(DataSet):

    class monday_liked_it:
        entry = EntryData.great_monday
        comment = """\
I'm so glad you have a blog because I want to know what you are doing everyday.  Heh, that sounds
creepy.  What I mean is it's so COOL that you had a great Monday.  I like Mondays too.
"""

    class monday_sucked:
        entry = EntryData.great_monday
        comment = """\
Are you serious?  Mannnnnn, Monday really sucked.
"""

load_data_locally.py
(http://fixture.googlecode.com/hg/fixture/examples/google_appengine_example/load_data_locally.py), partially modified

import sys
import os
import optparse
from fixture import GoogleDatastoreFixture
from fixture.style import NamedDataStyle


def main():
    p = optparse.OptionParser(usage="%prog [options]")
    default = "/tmp/dev_appserver.datastore"
    p.add_option("--datastore_path", default=default, help=(
            "Path to datastore file.  This must match the value used for "
            "the same option when running dev_appserver.py if you want to view the data.  "
            "Default: %s" % default))
    default = "/tmp/dev_appserver.datastore.history"
    p.add_option("--history_path", default=default, help=(
            "Path to datastore history file.  This doesn't need to match the one you use for "
            "dev_appserver.py.  Default: %s" % default))
    default = "/usr/local/google_appengine"
    p.add_option("--google_path", default=default, help=(
            "Path to google module directory.  Default: %s" % default))
    (options, args) = p.parse_args()

    if not os.path.exists(options.google_path):
        p.error("Could not find google module path at %s.  You'll need to specify the path" % options.google_path)

    groot = options.google_path
    sys.path.append(groot)
    sys.path.append(os.path.join(groot, "lib/django"))
    sys.path.append(os.path.join(groot, "lib/webob"))
    sys.path.append(os.path.join(groot, "lib/yaml/lib"))

    from google.appengine.tools import dev_appserver
    import blog
    from tests import datasets

    config, explicit_matcher = dev_appserver.\
        LoadAppConfig(os.path.dirname(__file__), {})
    dev_appserver.SetupStubs(
        config.application,
        clear_datastore=False,  # just removes the files when True
        datastore_path=options.datastore_path,
        history_path=options.history_path,
        blobstore_path=None,  # added to avoid KeyError: 'blobstore_path'
        login_url=None)

    datafixture = GoogleDatastoreFixture(env={'EntryData': blog.Entry,
                                              'CommentData': blog.Comment})

    data = datafixture.data(datasets.CommentData, datasets.EntryData)
    data.setup()
    print "Data loaded into datastore %s" % \
        (options.datastore_path or "[default]")

if __name__ == '__main__':
    main()

tests/test_entries.py

#coding:utf-8
import unittest
from fixture import GoogleDatastoreFixture
from webtest import TestApp
import blog
from datasets import CommentData, EntryData
from django.utils import simplejson

datafixture = GoogleDatastoreFixture(env={'EntryData': blog.Entry,
                                          'CommentData': blog.Comment})


class TestListEntries(unittest.TestCase):
    def setUp(self):
        self.app = TestApp(blog.application)
        self.data = datafixture.data(CommentData, EntryData)
        self.data.setup()

    def tearDown(self):
        self.data.teardown()

    def test_entries(self):
        response = self.app.get("/entries")
        assert simplejson.dumps(EntryData.great_monday.title) in response
        assert simplejson.dumps(EntryData.great_monday.body) in response
        assert simplejson.dumps(CommentData.monday_liked_it.comment) \
            in response
        assert simplejson.dumps(CommentData.monday_sucked.comment) \
            in response

Create custom datasets.

$ ./load_data_locally.py --datastore_path=./my.datastore

Run server with custom data.

$ dev_appserver.py . --datastore_path=./my.datastore

Run tests.

$ nosetests -v --with-gae

test_entries (tests.test_entries.TestListEntries) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.296s
OK

Posted: May 26th, 2010 | Author: yamakk | Filed under: Technology | Tags: database, fixture, gae, googleappengine, json, python

Fixture + SQLAlchemy

Reading "Using LoadableFixture"

A DataSet class is loaded via some storage medium, say, an object that implements a Data Mapper or Active Record pattern. A Fixture is an environment that knows how to load data using the right objects. Behind the scenes the rows and columns of the DataSet are simply passed to the storage medium so that it can save the data.

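In other words, a DataSet class is loaded through some storage medium (called the ORMapper below), for example an object that implements the Data Mapper or Active Record pattern. The Fixture is the environment that knows how to load the data with the right objects; internally the rows and columns of the DataSet are simply handed to the ORMapper, which saves them.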

The Fixture class is designed to support many different types of databases and other storage media by hooking into 3rd party libraries that know how to work with that media. There is also a section later about creating your own Fixture.

That is, the Fixture class supports many kinds of databases and storage media by hooking into third-party libraries that already know how to talk to them. A later section covers creating your own Fixture.

Fixture is designed for applications that already have a way to store data; the LoadableFixture just hooks in to that interface.

So Fixture assumes the application already has its own way of storing data, and LoadableFixture does nothing more than hook into that interface.
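The division of labour in these passages (a dataset only declares rows, and a loader hands them to whatever storage layer the application already has) can be illustrated with a toy in plain Python. This is not the fixture library's API: `ListStorage`, `load`, and the way rows are discovered are all hypothetical, written for Python 3 purely to show the idea.

```python
class EntryData(object):
    """Toy dataset: each inner class is one named row."""

    class great_monday:
        title = "Monday Was Great"

    class dull_tuesday:
        title = "Tuesday Was Dull"


class ListStorage(object):
    """Stand-in for the application's existing storage layer."""

    def __init__(self):
        self.rows = []

    def save(self, **columns):
        self.rows.append(columns)


def load(dataset, storage):
    """Pass every named row of the dataset to storage.save()."""
    for name, row in vars(dataset).items():
        if name.startswith('_') or not isinstance(row, type):
            continue  # skip __doc__, __module__ and other non-row attributes
        columns = {k: v for k, v in vars(row).items() if not k.startswith('_')}
        storage.save(**columns)


storage = ListStorage()
load(EntryData, storage)
print(storage.rows)

# Rows stay addressable by meaningful names in assertions.
print(EntryData.great_monday.title)
```

The loader never needs to know what `save` does underneath, which is the sense in which the real library "just hooks in to that interface".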
—————-
Below, I wrote a dataset and test code with reference to An Example of Loading Data Using SQLAlchemy.

Posted: May 13th, 2010 | Author: yamakk | Filed under: Technology | Tags: database, fixture, python, sqlalchemy, test

fixture: making tests around the DB easier

The Python module fixture appears to take care of the database side of testing.
Personally, I think of it as a rich, centralized way to manage teardown and setup.
http://farmdev.com/projects/fixture

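You want to load data into a test DB and refer to it easily when making assertions
You want to load data with foreign keys and relations automatically, and delete it easily without integrity errors
You want to refer to linked rows by meaningful names rather than IDs
You don't want to worry about auto-increment
You want to run SQL against the actual data in a DB and rebuild the environment in order to verify a bug
You want to test with files without depending on the file system

Loosely translated, it is apparently useful in cases like these.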

Loading and referencing test data

There are a couple ways to test a database-backed application. You can create mock objects and concentrate entirely on unit testing individual components without testing the database layer itself, or you can simply load up sample data before you run a test. Thanks to sqlite in-memory connections, the latter may be more efficient than you think.

But it’s easy enough to insert data line by line in code, right? Or simply load a SQL file? Yes, but this has two major downsides: you often have to worry about and manage complex chains of foreign keys manually; and when referencing data values later on, you either have to copy / paste the values or pass around lots of variables.

The fixture module simplifies this by breaking the process down to two independent components:

DataSet
Defines sets of sample data
Fixture
Knows how to load data

In short: rather than mocking out the DB layer, you can load sample data before each test, which sqlite in-memory connections make cheaper than you might expect. Inserting rows by hand or loading a SQL file leaves you managing foreign-key chains manually and copying values around, so the fixture module splits the job into a DataSet, which defines the sample data, and a Fixture, which knows how to load it.

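The situation where I personally want to use it: say there are two tables, User and Email. When a User record (say, bob) is deleted, bob's addresses bob@example.com and bob@bob.net should be deleted from Email as well. If the ORMapper is configured incorrectly, deleting an email address can end up deleting the user instead, which is very dangerous (this actually happened), so I want to verify basics like this in day-to-day tests.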
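The two directions described here (deleting a user removes the addresses, deleting an address leaves the user alone) can be checked against an in-memory SQLite database with the standard library. This is a minimal sketch for Python 3 that uses plain SQL rather than an ORMapper or the fixture module; the table and column names are assumptions chosen to match the example.

```python
import sqlite3


def make_db():
    """Create an in-memory DB with bob and his two addresses."""
    conn = sqlite3.connect(':memory:')
    conn.execute('PRAGMA foreign_keys = ON')  # SQLite enforces FKs only when enabled
    conn.execute('CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)')
    conn.execute(
        'CREATE TABLE emails ('
        ' id INTEGER PRIMARY KEY,'
        ' user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE,'
        ' address TEXT)')
    conn.execute("INSERT INTO users (id, name) VALUES (1, 'bob')")
    conn.executemany('INSERT INTO emails (user_id, address) VALUES (1, ?)',
                     [('bob@example.com',), ('bob@bob.net',)])
    return conn


def count(conn, table):
    """Number of rows in a table (table name is trusted, test-only input)."""
    return conn.execute('SELECT COUNT(*) FROM %s' % table).fetchone()[0]


conn = make_db()

# Deleting one address must leave the user in place.
conn.execute("DELETE FROM emails WHERE address = 'bob@bob.net'")
after_email_delete = (count(conn, 'users'), count(conn, 'emails'))

# Deleting the user must remove the remaining address as well.
conn.execute('DELETE FROM users WHERE id = 1')
after_user_delete = (count(conn, 'users'), count(conn, 'emails'))

print(after_email_delete, after_user_delete)
```

With an ORMapper the same two assertions would be made through its session instead of raw SQL; the point is to test both directions, since the failure described above was the cascade running the wrong way.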

Posted: May 10th, 2010 | Author: yamakk | Filed under: Technology | Tags: database, python, test

yamakk blog