Google App Engine: Slicing utf-8 strings - 中級プログラマの自宅でPHP ブログ

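In Python, utf-8 and Unicode are apparently two different things: a utf-8 string is a sequence of encoded bytes, while a Unicode string is a sequence of characters.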

As long as a string stays in utf-8, string operations do not work correctly character by character. Once it is converted to Unicode, they do.
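To make the difference concrete, here is a minimal sketch. It is written for Python 3, which is an assumption on my part (the post itself uses Python 2): in Python 3, `str` plays the role of Python 2's `unicode`, and `bytes` plays the role of a utf-8 encoded Python 2 `str`. The sample text is made up for illustration.

```python
# Python 3 sketch: `str` is Unicode text, `bytes` holds its utf-8 encoding.
text = "日本語のテキスト"        # hypothetical sample value
utf8 = text.encode("utf-8")     # the same text as utf-8 bytes

char_count = len(text)          # counts characters: 8
byte_count = len(utf8)          # counts bytes: 24 (3 bytes per character here)

print(char_count, byte_count)
```

The two lengths differ because each of these Japanese characters takes three bytes in utf-8, so any index or slice on the byte form counts something other than characters.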

That makes extracting part of a Japanese utf-8 string fairly cumbersome.

≪Example≫

utf8 = str.encode('utf-8')

html = '<td>'+utf8[:5]+'</td>'

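This slices the string byte by byte, as if it were ascii, so a multi-byte character can be cut in the middle and the output comes out garbled.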
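The garbling can be reproduced with a short sketch. Again this assumes Python 3 (`bytes` standing in for the utf-8 string) and a made-up sample text; it is only meant to show where the byte slice goes wrong.

```python
text = "日本語のテキスト"        # hypothetical sample value
utf8 = text.encode("utf-8")

# 5 bytes = one complete 3-byte character plus 2 bytes of the next one.
chunk = utf8[:5]

try:
    chunk.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    # The trailing 2 bytes are an incomplete character.
    decodes_cleanly = False

# errors="replace" substitutes U+FFFD for the broken tail, which is
# roughly what shows up as a garbled character on the page.
shown = chunk.decode("utf-8", errors="replace")
print(decodes_cleanly, shown)
```

Only the first character survives intact; the rest of the slice is a partial byte sequence that no longer decodes.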

utf8 = str.encode('utf-8')

html = '<td>'+unicode(utf8,'utf-8')[:5].encode('utf-8')+'</td>'

Something like this works. It feels roundabout. I may just be misunderstanding something.
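One possible reading of why it feels roundabout: if the original value is already a Unicode string (the `.encode('utf-8')` call suggests it is), then encoding and decoding it again gets back the same text, and slicing before encoding should give the same bytes. A Python 3 sketch of that comparison, under that assumption and with a made-up sample value:

```python
text = "日本語のテキスト"        # hypothetical sample value, already Unicode

# The round trip: encode, decode back to text, slice, encode again.
roundabout = text.encode("utf-8").decode("utf-8")[:5].encode("utf-8")

# Slice the text first, then encode once.
direct = text[:5].encode("utf-8")

# In Python 3 the surrounding markup must also be bytes to concatenate.
html = b"<td>" + direct + b"</td>"
print(roundabout == direct)
```

If the value arrives as utf-8 bytes instead (for example from a request body), the decode step is still needed before slicing.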

I wrapped it in a function and tried using it like this.

html = '<td>'+utf8left(str,5)+'</td>'

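def utf8left(text, length):
    # Python 2: accept either a unicode object or a utf-8 encoded str.
    # (Calling .encode('utf-8') on a non-ascii byte str would raise an error.)
    if not isinstance(text, unicode):
        text = unicode(text, 'utf-8')
    # Short enough: return it unchanged, encoded as utf-8.
    if len(text) <= length:
        return text.encode('utf-8')
    # Otherwise keep the first `length` characters and mark the cut.
    return (text[:length] + u'...').encode('utf-8')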
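For reference, here is a sketch of how the same helper might look on Python 3. This is a port under my own assumptions, not something from the original App Engine code: `bytes` input is treated as utf-8, and the result is returned as utf-8 `bytes` to mirror the behavior above.

```python
def utf8left(text, length):
    """Return the first `length` characters of `text` as utf-8 bytes.

    `text` may be a str or utf-8 encoded bytes (assumption of this sketch).
    "..." is appended when the text had to be cut.
    """
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    if len(text) <= length:
        return text.encode("utf-8")
    return (text[:length] + "...").encode("utf-8")


# Usage with made-up sample values.
short = utf8left("日本語", 5)                                   # not cut
cut = utf8left("日本語のテキスト", 5)                           # cut after 5 characters
from_bytes = utf8left("日本語のテキスト".encode("utf-8"), 5)    # bytes input
html = b"<td>" + cut + b"</td>"
```

Counting happens on the decoded text, so the cut always falls on a character boundary regardless of how many bytes each character takes.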