OrderbookのDataFrameを整然データに変換

OrderbookのDataFrameを整然データに変換

ここでは Public Data API From CryptoChassis から、Orderbook(depth=10)のデータを取得します。

import gzip
import io
import json
from urllib import request

import pandas as pd
url = "https://api.cryptochassis.com/v1/market-depth/coinbase/btc-usd?startTime=1594166400&depth=10"

with request.urlopen(url) as res:
    json_data = json.loads(res.read().decode())

json_data
{'urls': [{'startTime': {'seconds': 1594166400,
    'iso': '2020-07-08T00:00:00.000Z'},
   'endTime': {'seconds': 1594252800, 'iso': '2020-07-09T00:00:00.000Z'},
   'url': 'https://marketdata-e0323a9039add2978bf5b49550572c7c.s3.amazonaws.com/market_depth/coinbase/btc_usd/1594166400.csv.gz?AWSAccessKeyId=ASIATPNB7YZIRYRUYH6E&Expires=1647314231&Signature=15BugAnUze5vX9Xp4d%2ByxR5Dmjw%3D&x-amz-security-token=IQoJb3JpZ2luX2VjEPD%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJIMEYCIQC9757DDmjrKc%2Ff37ZukleooFJnSjX5yNP60fqvjsnZKAIhAKXxh%2Fb8GX941U13FecZJMSN9UZKrjJUAVVPq8RWURXDKvoDCGkQAhoMMjM5MjQ3Mjc5Njk3Igx6%2BHbD4P%2B%2FvMwGNXMq1wNLsVB7ZHoDj9XV%2BNveAYPCD7AJC2IiwjLz9Ai6295MQ7g%2FENH4FzM%2FLiP0h6pkRZ0lhjXvyFAQ%2FhY6pwDrkIn11IbO6JbrN4zcYvv6GcdRghTXrVr7LFQaXd%2Foqrp5DRj5n9jVPeENClWMjh11y1ry%2BA7MHgDXKTU6g03K2uwUBgmD7Q%2BR%2BP46EgZP3SfOTKRokZEooMmHwg9QMQ0M6DYUzUh8JLZMpDWX9IbsqnOAv%2B%2BHD7OFiUAzMdpjWynLbM5ar%2Fd0JV8P8mRyQoEm63VdXAGWF4g0bWcUwhXcy%2BRBQIeHhqvfH0cBaJCd0G5103bYsOtgV1H%2FDOWv2lUK4jNo06f29FO50OpPOUcl69G1zjSiCwrYNgOBHrsCyA3EM9xRySZG85oeFgt6ka11cYLRAPVexAz9dYq%2B1bJCptk6uaeoglRLEVm%2FTXRRpoZpwSj0IDRdRJmDSSzlm%2Fe5mb6Vxl%2Ft7J%2BCA7kHwaWgA3XWta9wvlOPN%2FR6rt8ZS3ir3EDeAsRo9IY6KXTW1X00Qgsz4LX%2F16bDwWjVMFXfnv715AdgYKUAc5%2BeqyGGUNfpyb7N%2FAIVQEZvqBH0gvFCbggbZvcgDO%2FQHFFPhfVXUR9wrl6Em8ku%2FvAwwqi%2FkQY6pAEV0v%2B9yGcGGUT%2BWVfEcVRuUUuNdM4QrrOfxZXhdgEGzv9SRMCbRNjxmpnXYJAY54gXRb9Ymbqi0mm75V1x54sHc8b7ccfzQaOF37HNWdkRy2LQ%2B77ou4IOgyGP0Mf5MHUP%2FQxff3MLsrmzwTUnHj5ferevzVtozKeH%2Be%2BPtsqQ3nU%2BlosLtUljiYnwzeR1%2Fz4CbSmAU%2Fdp%2FaMjU%2F0NkpwxP5LknQ%3D%3D'}],
 'expiration': '300 seconds'}

APIから得られたURLをもとに、gzipファイルを展開し、CSVファイルを read_csv 関数でDataFrameに読み込みます。

with request.urlopen(json_data["urls"][0]["url"]) as res:
    gz_data = res.read()

with gzip.open(io.BytesIO(gz_data), "rt") as f:
    csv_data = f.read()

raw_df = pd.read_csv(io.StringIO(csv_data))
del json_data, gz_data, csv_data
raw_df.columns = "timestamp", "bid", "ask"

注意

実際には上記のコードで取得しますが、ここでは取得済みのCSVファイルから読み込みます。

raw_df = pd.read_csv("btcusd_2020-07-08.zip")
raw_df.head()
timestamp bid ask
0 1594245605 9433.24_2.30558256|9433.07_1.6680282|9432.49_0... 9433.25_0.2|9433.63_0.159|9433.68_0.06|9434_0....
1 1594245606 9433.24_2.30558256|9433.07_1.6680282|9432.24_1... 9433.25_0.2|9433.63_0.159|9433.67_0.06|9434_0....
2 1594245607 9433.24_2.32361076|9433.07_1.65|9432.24_1.65|9... 9433.25_0.17401468|9433.63_0.159|9433.67_0.06|...
3 1594245608 9433.61_0.60504644|9433.51_1.15223697|9433.25_... 9433.62_0.2|9433.63_0.159|9433.66_0.06|9434_0....
4 1594245609 9433.51_1.17026517|9433.25_1.65|9433.24_1.65|9... 9433.52_0.637005|9433.61_0.10947951|9433.62_0....

読み込んだデータは雑然データのため、整然データに整形します。

ヒント

整然データとは何か: https://id.fnshr.info/2017/01/09/tidy-data-intro/

def format_dataframe_by_side(df, side):
    new_df = (
        df.loc[:, side]
        .str.split("|")
        .explode()
        .str.split("_", expand=True)
        .rename({0: "price", 1: "size"}, axis=1)
    )
    new_df = new_df.astype(float)
    new_df.loc[:, "timestamp"] = pd.to_datetime(raw_df.loc[:, "timestamp"], unit="s")
    new_df.loc[:, "side"] = side
    new_df.set_index("timestamp", drop=False, inplace=True)
    new_df.index.name = None
    return new_df


def format_dataframe(df):
    new_df = pd.concat(
        [format_dataframe_by_side(df, "bid"), format_dataframe_by_side(df, "ask")]
    )
    return new_df.sort_index()


df = format_dataframe(raw_df)
df.head()
price size timestamp side
2020-07-08 22:00:05 9433.24 2.305583 2020-07-08 22:00:05 bid
2020-07-08 22:00:05 9434.32 1.000000 2020-07-08 22:00:05 ask
2020-07-08 22:00:05 9434.35 0.530100 2020-07-08 22:00:05 ask
2020-07-08 22:00:05 9434.90 0.015848 2020-07-08 22:00:05 ask
2020-07-08 22:00:05 9435.00 0.050000 2020-07-08 22:00:05 ask

ここからは関数の内容を分解して解説します。

raw_df
timestamp bid ask
0 1594245605 9433.24_2.30558256|9433.07_1.6680282|9432.49_0... 9433.25_0.2|9433.63_0.159|9433.68_0.06|9434_0....
1 1594245606 9433.24_2.30558256|9433.07_1.6680282|9432.24_1... 9433.25_0.2|9433.63_0.159|9433.67_0.06|9434_0....
2 1594245607 9433.24_2.32361076|9433.07_1.65|9432.24_1.65|9... 9433.25_0.17401468|9433.63_0.159|9433.67_0.06|...
3 1594245608 9433.61_0.60504644|9433.51_1.15223697|9433.25_... 9433.62_0.2|9433.63_0.159|9433.66_0.06|9434_0....
4 1594245609 9433.51_1.17026517|9433.25_1.65|9433.24_1.65|9... 9433.52_0.637005|9433.61_0.10947951|9433.62_0....
... ... ... ...
7063 1594252795 9435.27_16.94086454|9435.26_0.13966131|9435.05... 9435.28_8.05085315|9435.3_0.110163|9439.5_0.11...
7064 1594252796 9435.27_16.94086454|9435.26_0.13966131|9435.03... 9435.28_7.80012384|9435.3_0.110163|9439.5_0.11...
7065 1594252797 9435.27_16.94086454|9435.26_0.13966131|9435.04... 9435.28_7.79200253|9435.3_0.110163|9439.41_0.0...
7066 1594252798 9435.27_17.05242522|9435.26_0.13966131|9435.03... 9435.28_7.79200253|9435.3_0.110163|9439.4_0.11...
7067 1594252799 9435.27_17.05242522|9435.26_0.13966131|9435.04... 9435.28_7.00936353|9435.3_0.110163|9439.4_0.05...

7068 rows × 3 columns

「bid」列、「ask」列は板情報が | の文字列で連結されています。これを str.split メソッドで分割します。

split_ser = raw_df.loc[:, "bid"].str.split("|")
split_ser.head()
0    [9433.24_2.30558256, 9433.07_1.6680282, 9432.4...
1    [9433.24_2.30558256, 9433.07_1.6680282, 9432.2...
2    [9433.24_2.32361076, 9433.07_1.65, 9432.24_1.6...
3    [9433.61_0.60504644, 9433.51_1.15223697, 9433....
4    [9433.51_1.17026517, 9433.25_1.65, 9433.24_1.6...
Name: bid, dtype: object

Seriesの各要素がリスト型になっていることが確認できます。

split_ser[0]
['9433.24_2.30558256',
 '9433.07_1.6680282',
 '9432.49_0.04',
 '9432.24_1.65',
 '9432.16_1.65',
 '9432.15_1.65',
 '9432.09_0.25112245',
 '9432_0.60011883',
 '9431.6_1.585',
 '9430.88_1.586']
type(split_ser[0])
list

explode メソッドは各要素のオブジェクトを行に展開します。

ちなみに

explode メソッドは整然データを生成する際に便利です

explode_ser = split_ser.explode()
explode_ser
0       9433.24_2.30558256
0        9433.07_1.6680282
0             9432.49_0.04
0             9432.24_1.65
0             9432.16_1.65
               ...        
7067           9435_0.0477
7067    9434.65_0.06359115
7067             9434.64_2
7067         9434.39_0.555
7067         9434.25_0.203
Name: bid, Length: 70680, dtype: object

explode_ser の各要素は 価格_サイズ(枚数) の形式になっています。これを str.split メソッドで分割します。

ちなみに

str.split メソッドの引数 expand=True にすることで、分割されたデータがDataFrameの列として展開されます

bid_df = explode_ser.str.split("_", expand=True)
bid_df
0 1
0 9433.24 2.30558256
0 9433.07 1.6680282
0 9432.49 0.04
0 9432.24 1.65
0 9432.16 1.65
... ... ...
7067 9435 0.0477
7067 9434.65 0.06359115
7067 9434.64 2
7067 9434.39 0.555
7067 9434.25 0.203

70680 rows × 2 columns

列名を変更します。

bid_df = bid_df.rename({0: "price", 1: "size"}, axis=1)
bid_df
price size
0 9433.24 2.30558256
0 9433.07 1.6680282
0 9432.49 0.04
0 9432.24 1.65
0 9432.16 1.65
... ... ...
7067 9435 0.0477
7067 9434.65 0.06359115
7067 9434.64 2
7067 9434.39 0.555
7067 9434.25 0.203

70680 rows × 2 columns

データ型をfloat型に変換します。

bid_df = bid_df.astype(float)
bid_df.dtypes
price    float64
size     float64
dtype: object

raw_df の「timestamp」列をdatetime型に変換し、 new_df の「timestamp」列として追加します。

bid_df.loc[:, "timestamp"] = pd.to_datetime(raw_df.loc[:, "timestamp"], unit="s")
bid_df.dtypes
price               float64
size                float64
timestamp    datetime64[ns]
dtype: object
bid_df.head()
price size timestamp
0 9433.24 2.305583 2020-07-08 22:00:05
0 9433.07 1.668028 2020-07-08 22:00:05
0 9432.49 0.040000 2020-07-08 22:00:05
0 9432.24 1.650000 2020-07-08 22:00:05
0 9432.16 1.650000 2020-07-08 22:00:05

「side」列を追加し、値に "bid" を入力します。

bid_df.loc[:, "side"] = "bid"
bid_df.head()
price size timestamp side
0 9433.24 2.305583 2020-07-08 22:00:05 bid
0 9433.07 1.668028 2020-07-08 22:00:05 bid
0 9432.49 0.040000 2020-07-08 22:00:05 bid
0 9432.24 1.650000 2020-07-08 22:00:05 bid
0 9432.16 1.650000 2020-07-08 22:00:05 bid

set_index メソッドで「timestamp」列をインデックスにします。

ちなみに

set_index メソッドの引数 drop=False では、インデックス対象列を列としてそのまま残します

ちなみに

bid_df.index.name 属性はインデックスの名前が格納されており、この属性を操作してインデックス名を変更できます

bid_df.set_index("timestamp", drop=False, inplace=True)
bid_df.index.name = None
bid_df.head()
price size timestamp side
2020-07-08 22:00:05 9433.24 2.305583 2020-07-08 22:00:05 bid
2020-07-08 22:00:05 9433.07 1.668028 2020-07-08 22:00:05 bid
2020-07-08 22:00:05 9432.49 0.040000 2020-07-08 22:00:05 bid
2020-07-08 22:00:05 9432.24 1.650000 2020-07-08 22:00:05 bid
2020-07-08 22:00:05 9432.16 1.650000 2020-07-08 22:00:05 bid

ask側で同様の処理をします。

ask_df = (
    raw_df.loc[:, "ask"]
    .str.split("|")
    .explode()
    .str.split("_", expand=True)
    .rename({0: "price", 1: "size"}, axis=1)
)
ask_df = ask_df.astype(float)
ask_df.loc[:, "timestamp"] = pd.to_datetime(raw_df.loc[:, "timestamp"], unit="s")
ask_df.loc[:, "side"] = "ask"
ask_df.set_index("timestamp", drop=False, inplace=True)
ask_df.index.name = None
ask_df.head()
price size timestamp side
2020-07-08 22:00:05 9433.25 0.200 2020-07-08 22:00:05 ask
2020-07-08 22:00:05 9433.63 0.159 2020-07-08 22:00:05 ask
2020-07-08 22:00:05 9433.68 0.060 2020-07-08 22:00:05 ask
2020-07-08 22:00:05 9434.00 0.750 2020-07-08 22:00:05 ask
2020-07-08 22:00:05 9434.32 1.000 2020-07-08 22:00:05 ask

concat 関数で bid_dfask_df を連結します。

new_df = pd.concat([bid_df, ask_df])
new_df.head()
price size timestamp side
2020-07-08 22:00:05 9433.24 2.305583 2020-07-08 22:00:05 bid
2020-07-08 22:00:05 9433.07 1.668028 2020-07-08 22:00:05 bid
2020-07-08 22:00:05 9432.49 0.040000 2020-07-08 22:00:05 bid
2020-07-08 22:00:05 9432.24 1.650000 2020-07-08 22:00:05 bid
2020-07-08 22:00:05 9432.16 1.650000 2020-07-08 22:00:05 bid
new_df.tail()
price size timestamp side
2020-07-08 23:59:59 9439.90 0.500000 2020-07-08 23:59:59 ask
2020-07-08 23:59:59 9440.31 0.100000 2020-07-08 23:59:59 ask
2020-07-08 23:59:59 9440.33 1.500000 2020-07-08 23:59:59 ask
2020-07-08 23:59:59 9440.61 1.197847 2020-07-08 23:59:59 ask
2020-07-08 23:59:59 9440.63 0.393151 2020-07-08 23:59:59 ask

インデックスの降順にソートします。

new_df.sort_index()
price size timestamp side
2020-07-08 22:00:05 9433.24 2.305583 2020-07-08 22:00:05 bid
2020-07-08 22:00:05 9434.32 1.000000 2020-07-08 22:00:05 ask
2020-07-08 22:00:05 9434.35 0.530100 2020-07-08 22:00:05 ask
2020-07-08 22:00:05 9434.90 0.015848 2020-07-08 22:00:05 ask
2020-07-08 22:00:05 9435.00 0.050000 2020-07-08 22:00:05 ask
... ... ... ... ...
2020-07-08 23:59:59 9435.04 0.050000 2020-07-08 23:59:59 bid
2020-07-08 23:59:59 9435.26 0.139661 2020-07-08 23:59:59 bid
2020-07-08 23:59:59 9435.27 17.052425 2020-07-08 23:59:59 bid
2020-07-08 23:59:59 9440.61 1.197847 2020-07-08 23:59:59 ask
2020-07-08 23:59:59 9440.63 0.393151 2020-07-08 23:59:59 ask

141360 rows × 4 columns

DataFrameを to_pickle メソッドで直列化し、ほかのNotebookでも利用できるようにします。

df.to_pickle("btcusd_2020-07-08.pickle", protocol=4)