OrderbookのDataFrameを整然データに変換
OrderbookのDataFrameを整然データに変換¶
ここでは Public Data API From CryptoChassis から、Orderbook(depth=10)のデータを取得します。
import gzip
import io
import json
from urllib import request
import pandas as pd
url = "https://api.cryptochassis.com/v1/market-depth/coinbase/btc-usd?startTime=1594166400&depth=10"
with request.urlopen(url) as res:
json_data = json.loads(res.read().decode())
json_data
{'urls': [{'startTime': {'seconds': 1594166400,
'iso': '2020-07-08T00:00:00.000Z'},
'endTime': {'seconds': 1594252800, 'iso': '2020-07-09T00:00:00.000Z'},
'url': 'https://marketdata-e0323a9039add2978bf5b49550572c7c.s3.amazonaws.com/market_depth/coinbase/btc_usd/1594166400.csv.gz?AWSAccessKeyId=ASIATPNB7YZIRYRUYH6E&Expires=1647314231&Signature=15BugAnUze5vX9Xp4d%2ByxR5Dmjw%3D&x-amz-security-token=IQoJb3JpZ2luX2VjEPD%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJIMEYCIQC9757DDmjrKc%2Ff37ZukleooFJnSjX5yNP60fqvjsnZKAIhAKXxh%2Fb8GX941U13FecZJMSN9UZKrjJUAVVPq8RWURXDKvoDCGkQAhoMMjM5MjQ3Mjc5Njk3Igx6%2BHbD4P%2B%2FvMwGNXMq1wNLsVB7ZHoDj9XV%2BNveAYPCD7AJC2IiwjLz9Ai6295MQ7g%2FENH4FzM%2FLiP0h6pkRZ0lhjXvyFAQ%2FhY6pwDrkIn11IbO6JbrN4zcYvv6GcdRghTXrVr7LFQaXd%2Foqrp5DRj5n9jVPeENClWMjh11y1ry%2BA7MHgDXKTU6g03K2uwUBgmD7Q%2BR%2BP46EgZP3SfOTKRokZEooMmHwg9QMQ0M6DYUzUh8JLZMpDWX9IbsqnOAv%2B%2BHD7OFiUAzMdpjWynLbM5ar%2Fd0JV8P8mRyQoEm63VdXAGWF4g0bWcUwhXcy%2BRBQIeHhqvfH0cBaJCd0G5103bYsOtgV1H%2FDOWv2lUK4jNo06f29FO50OpPOUcl69G1zjSiCwrYNgOBHrsCyA3EM9xRySZG85oeFgt6ka11cYLRAPVexAz9dYq%2B1bJCptk6uaeoglRLEVm%2FTXRRpoZpwSj0IDRdRJmDSSzlm%2Fe5mb6Vxl%2Ft7J%2BCA7kHwaWgA3XWta9wvlOPN%2FR6rt8ZS3ir3EDeAsRo9IY6KXTW1X00Qgsz4LX%2F16bDwWjVMFXfnv715AdgYKUAc5%2BeqyGGUNfpyb7N%2FAIVQEZvqBH0gvFCbggbZvcgDO%2FQHFFPhfVXUR9wrl6Em8ku%2FvAwwqi%2FkQY6pAEV0v%2B9yGcGGUT%2BWVfEcVRuUUuNdM4QrrOfxZXhdgEGzv9SRMCbRNjxmpnXYJAY54gXRb9Ymbqi0mm75V1x54sHc8b7ccfzQaOF37HNWdkRy2LQ%2B77ou4IOgyGP0Mf5MHUP%2FQxff3MLsrmzwTUnHj5ferevzVtozKeH%2Be%2BPtsqQ3nU%2BlosLtUljiYnwzeR1%2Fz4CbSmAU%2Fdp%2FaMjU%2F0NkpwxP5LknQ%3D%3D'}],
'expiration': '300 seconds'}
APIから得られたURLをもとに、gzipファイルを展開し、CSVファイルを read_csv
関数でDataFrameに読み込みます。
with request.urlopen(json_data["urls"][0]["url"]) as res:
gz_data = res.read()
with gzip.open(io.BytesIO(gz_data), "rt") as f:
csv_data = f.read()
raw_df = pd.read_csv(io.StringIO(csv_data))
del json_data, gz_data, csv_data
raw_df.columns = "timestamp", "bid", "ask"
注意
実際には上記のコードで取得しますが、ここでは取得済みのCSVファイルから読み込みます。
raw_df = pd.read_csv("btcusd_2020-07-08.zip")
raw_df.head()
timestamp | bid | ask | |
---|---|---|---|
0 | 1594245605 | 9433.24_2.30558256|9433.07_1.6680282|9432.49_0... | 9433.25_0.2|9433.63_0.159|9433.68_0.06|9434_0.... |
1 | 1594245606 | 9433.24_2.30558256|9433.07_1.6680282|9432.24_1... | 9433.25_0.2|9433.63_0.159|9433.67_0.06|9434_0.... |
2 | 1594245607 | 9433.24_2.32361076|9433.07_1.65|9432.24_1.65|9... | 9433.25_0.17401468|9433.63_0.159|9433.67_0.06|... |
3 | 1594245608 | 9433.61_0.60504644|9433.51_1.15223697|9433.25_... | 9433.62_0.2|9433.63_0.159|9433.66_0.06|9434_0.... |
4 | 1594245609 | 9433.51_1.17026517|9433.25_1.65|9433.24_1.65|9... | 9433.52_0.637005|9433.61_0.10947951|9433.62_0.... |
読み込んだデータは雑然データのため、整然データに整形します。
ヒント
整然データとは何か: https://id.fnshr.info/2017/01/09/tidy-data-intro/
def format_dataframe_by_side(df, side):
new_df = (
df.loc[:, side]
.str.split("|")
.explode()
.str.split("_", expand=True)
.rename({0: "price", 1: "size"}, axis=1)
)
new_df = new_df.astype(float)
new_df.loc[:, "timestamp"] = pd.to_datetime(raw_df.loc[:, "timestamp"], unit="s")
new_df.loc[:, "side"] = side
new_df.set_index("timestamp", drop=False, inplace=True)
new_df.index.name = None
return new_df
def format_dataframe(df):
new_df = pd.concat(
[format_dataframe_by_side(df, "bid"), format_dataframe_by_side(df, "ask")]
)
return new_df.sort_index()
df = format_dataframe(raw_df)
df.head()
price | size | timestamp | side | |
---|---|---|---|---|
2020-07-08 22:00:05 | 9433.24 | 2.305583 | 2020-07-08 22:00:05 | bid |
2020-07-08 22:00:05 | 9434.32 | 1.000000 | 2020-07-08 22:00:05 | ask |
2020-07-08 22:00:05 | 9434.35 | 0.530100 | 2020-07-08 22:00:05 | ask |
2020-07-08 22:00:05 | 9434.90 | 0.015848 | 2020-07-08 22:00:05 | ask |
2020-07-08 22:00:05 | 9435.00 | 0.050000 | 2020-07-08 22:00:05 | ask |
ここからは関数の内容を分解して解説します。
raw_df
timestamp | bid | ask | |
---|---|---|---|
0 | 1594245605 | 9433.24_2.30558256|9433.07_1.6680282|9432.49_0... | 9433.25_0.2|9433.63_0.159|9433.68_0.06|9434_0.... |
1 | 1594245606 | 9433.24_2.30558256|9433.07_1.6680282|9432.24_1... | 9433.25_0.2|9433.63_0.159|9433.67_0.06|9434_0.... |
2 | 1594245607 | 9433.24_2.32361076|9433.07_1.65|9432.24_1.65|9... | 9433.25_0.17401468|9433.63_0.159|9433.67_0.06|... |
3 | 1594245608 | 9433.61_0.60504644|9433.51_1.15223697|9433.25_... | 9433.62_0.2|9433.63_0.159|9433.66_0.06|9434_0.... |
4 | 1594245609 | 9433.51_1.17026517|9433.25_1.65|9433.24_1.65|9... | 9433.52_0.637005|9433.61_0.10947951|9433.62_0.... |
... | ... | ... | ... |
7063 | 1594252795 | 9435.27_16.94086454|9435.26_0.13966131|9435.05... | 9435.28_8.05085315|9435.3_0.110163|9439.5_0.11... |
7064 | 1594252796 | 9435.27_16.94086454|9435.26_0.13966131|9435.03... | 9435.28_7.80012384|9435.3_0.110163|9439.5_0.11... |
7065 | 1594252797 | 9435.27_16.94086454|9435.26_0.13966131|9435.04... | 9435.28_7.79200253|9435.3_0.110163|9439.41_0.0... |
7066 | 1594252798 | 9435.27_17.05242522|9435.26_0.13966131|9435.03... | 9435.28_7.79200253|9435.3_0.110163|9439.4_0.11... |
7067 | 1594252799 | 9435.27_17.05242522|9435.26_0.13966131|9435.04... | 9435.28_7.00936353|9435.3_0.110163|9439.4_0.05... |
7068 rows × 3 columns
「bid」列、「ask」列は板情報が |
の文字列で連結されています。これを str.split
メソッドで分割します。
split_ser = raw_df.loc[:, "bid"].str.split("|")
split_ser.head()
0 [9433.24_2.30558256, 9433.07_1.6680282, 9432.4...
1 [9433.24_2.30558256, 9433.07_1.6680282, 9432.2...
2 [9433.24_2.32361076, 9433.07_1.65, 9432.24_1.6...
3 [9433.61_0.60504644, 9433.51_1.15223697, 9433....
4 [9433.51_1.17026517, 9433.25_1.65, 9433.24_1.6...
Name: bid, dtype: object
Seriesの各要素がリスト型になっていることが確認できます。
split_ser[0]
['9433.24_2.30558256',
'9433.07_1.6680282',
'9432.49_0.04',
'9432.24_1.65',
'9432.16_1.65',
'9432.15_1.65',
'9432.09_0.25112245',
'9432_0.60011883',
'9431.6_1.585',
'9430.88_1.586']
type(split_ser[0])
list
explode
メソッドは各要素のオブジェクトを行に展開します。
ちなみに
explode
メソッドは整然データを生成する際に便利です
explode_ser = split_ser.explode()
explode_ser
0 9433.24_2.30558256
0 9433.07_1.6680282
0 9432.49_0.04
0 9432.24_1.65
0 9432.16_1.65
...
7067 9435_0.0477
7067 9434.65_0.06359115
7067 9434.64_2
7067 9434.39_0.555
7067 9434.25_0.203
Name: bid, Length: 70680, dtype: object
explode_ser
の各要素は 価格_サイズ(枚数)
の形式になっています。これを str.split
メソッドで分割します。
ちなみに
str.split
メソッドの引数 expand=True
にすることで、分割されたデータがDataFrameの列として展開されます
bid_df = explode_ser.str.split("_", expand=True)
bid_df
0 | 1 | |
---|---|---|
0 | 9433.24 | 2.30558256 |
0 | 9433.07 | 1.6680282 |
0 | 9432.49 | 0.04 |
0 | 9432.24 | 1.65 |
0 | 9432.16 | 1.65 |
... | ... | ... |
7067 | 9435 | 0.0477 |
7067 | 9434.65 | 0.06359115 |
7067 | 9434.64 | 2 |
7067 | 9434.39 | 0.555 |
7067 | 9434.25 | 0.203 |
70680 rows × 2 columns
列名を変更します。
bid_df = bid_df.rename({0: "price", 1: "size"}, axis=1)
bid_df
price | size | |
---|---|---|
0 | 9433.24 | 2.30558256 |
0 | 9433.07 | 1.6680282 |
0 | 9432.49 | 0.04 |
0 | 9432.24 | 1.65 |
0 | 9432.16 | 1.65 |
... | ... | ... |
7067 | 9435 | 0.0477 |
7067 | 9434.65 | 0.06359115 |
7067 | 9434.64 | 2 |
7067 | 9434.39 | 0.555 |
7067 | 9434.25 | 0.203 |
70680 rows × 2 columns
データ型をfloat型に変換します。
bid_df = bid_df.astype(float)
bid_df.dtypes
price float64
size float64
dtype: object
raw_df
の「timestamp」列をdatetime型に変換し、 new_df
の「timestamp」列として追加します。
bid_df.loc[:, "timestamp"] = pd.to_datetime(raw_df.loc[:, "timestamp"], unit="s")
bid_df.dtypes
price float64
size float64
timestamp datetime64[ns]
dtype: object
bid_df.head()
price | size | timestamp | |
---|---|---|---|
0 | 9433.24 | 2.305583 | 2020-07-08 22:00:05 |
0 | 9433.07 | 1.668028 | 2020-07-08 22:00:05 |
0 | 9432.49 | 0.040000 | 2020-07-08 22:00:05 |
0 | 9432.24 | 1.650000 | 2020-07-08 22:00:05 |
0 | 9432.16 | 1.650000 | 2020-07-08 22:00:05 |
「side」列を追加し、値に "bid"
を入力します。
bid_df.loc[:, "side"] = "bid"
bid_df.head()
price | size | timestamp | side | |
---|---|---|---|---|
0 | 9433.24 | 2.305583 | 2020-07-08 22:00:05 | bid |
0 | 9433.07 | 1.668028 | 2020-07-08 22:00:05 | bid |
0 | 9432.49 | 0.040000 | 2020-07-08 22:00:05 | bid |
0 | 9432.24 | 1.650000 | 2020-07-08 22:00:05 | bid |
0 | 9432.16 | 1.650000 | 2020-07-08 22:00:05 | bid |
set_index
メソッドで「timestamp」列をインデックスにします。
ちなみに
set_index
メソッドの引数 drop=False
では、インデックス対象列を列としてそのまま残します
ちなみに
bid_df.index.name
属性はインデックスの名前が格納されており、この属性を操作してインデックス名を変更できます
bid_df.set_index("timestamp", drop=False, inplace=True)
bid_df.index.name = None
bid_df.head()
price | size | timestamp | side | |
---|---|---|---|---|
2020-07-08 22:00:05 | 9433.24 | 2.305583 | 2020-07-08 22:00:05 | bid |
2020-07-08 22:00:05 | 9433.07 | 1.668028 | 2020-07-08 22:00:05 | bid |
2020-07-08 22:00:05 | 9432.49 | 0.040000 | 2020-07-08 22:00:05 | bid |
2020-07-08 22:00:05 | 9432.24 | 1.650000 | 2020-07-08 22:00:05 | bid |
2020-07-08 22:00:05 | 9432.16 | 1.650000 | 2020-07-08 22:00:05 | bid |
ask側で同様の処理をします。
ask_df = (
raw_df.loc[:, "ask"]
.str.split("|")
.explode()
.str.split("_", expand=True)
.rename({0: "price", 1: "size"}, axis=1)
)
ask_df = ask_df.astype(float)
ask_df.loc[:, "timestamp"] = pd.to_datetime(raw_df.loc[:, "timestamp"], unit="s")
ask_df.loc[:, "side"] = "ask"
ask_df.set_index("timestamp", drop=False, inplace=True)
ask_df.index.name = None
ask_df.head()
price | size | timestamp | side | |
---|---|---|---|---|
2020-07-08 22:00:05 | 9433.25 | 0.200 | 2020-07-08 22:00:05 | ask |
2020-07-08 22:00:05 | 9433.63 | 0.159 | 2020-07-08 22:00:05 | ask |
2020-07-08 22:00:05 | 9433.68 | 0.060 | 2020-07-08 22:00:05 | ask |
2020-07-08 22:00:05 | 9434.00 | 0.750 | 2020-07-08 22:00:05 | ask |
2020-07-08 22:00:05 | 9434.32 | 1.000 | 2020-07-08 22:00:05 | ask |
concat
関数で bid_df
と ask_df
を連結します。
new_df = pd.concat([bid_df, ask_df])
new_df.head()
price | size | timestamp | side | |
---|---|---|---|---|
2020-07-08 22:00:05 | 9433.24 | 2.305583 | 2020-07-08 22:00:05 | bid |
2020-07-08 22:00:05 | 9433.07 | 1.668028 | 2020-07-08 22:00:05 | bid |
2020-07-08 22:00:05 | 9432.49 | 0.040000 | 2020-07-08 22:00:05 | bid |
2020-07-08 22:00:05 | 9432.24 | 1.650000 | 2020-07-08 22:00:05 | bid |
2020-07-08 22:00:05 | 9432.16 | 1.650000 | 2020-07-08 22:00:05 | bid |
new_df.tail()
price | size | timestamp | side | |
---|---|---|---|---|
2020-07-08 23:59:59 | 9439.90 | 0.500000 | 2020-07-08 23:59:59 | ask |
2020-07-08 23:59:59 | 9440.31 | 0.100000 | 2020-07-08 23:59:59 | ask |
2020-07-08 23:59:59 | 9440.33 | 1.500000 | 2020-07-08 23:59:59 | ask |
2020-07-08 23:59:59 | 9440.61 | 1.197847 | 2020-07-08 23:59:59 | ask |
2020-07-08 23:59:59 | 9440.63 | 0.393151 | 2020-07-08 23:59:59 | ask |
インデックスの降順にソートします。
new_df.sort_index()
price | size | timestamp | side | |
---|---|---|---|---|
2020-07-08 22:00:05 | 9433.24 | 2.305583 | 2020-07-08 22:00:05 | bid |
2020-07-08 22:00:05 | 9434.32 | 1.000000 | 2020-07-08 22:00:05 | ask |
2020-07-08 22:00:05 | 9434.35 | 0.530100 | 2020-07-08 22:00:05 | ask |
2020-07-08 22:00:05 | 9434.90 | 0.015848 | 2020-07-08 22:00:05 | ask |
2020-07-08 22:00:05 | 9435.00 | 0.050000 | 2020-07-08 22:00:05 | ask |
... | ... | ... | ... | ... |
2020-07-08 23:59:59 | 9435.04 | 0.050000 | 2020-07-08 23:59:59 | bid |
2020-07-08 23:59:59 | 9435.26 | 0.139661 | 2020-07-08 23:59:59 | bid |
2020-07-08 23:59:59 | 9435.27 | 17.052425 | 2020-07-08 23:59:59 | bid |
2020-07-08 23:59:59 | 9440.61 | 1.197847 | 2020-07-08 23:59:59 | ask |
2020-07-08 23:59:59 | 9440.63 | 0.393151 | 2020-07-08 23:59:59 | ask |
141360 rows × 4 columns
DataFrameを to_pickle
メソッドで直列化し、ほかのNotebookでも利用できるようにします。
df.to_pickle("btcusd_2020-07-08.pickle", protocol=4)