{ "cells": [ { "cell_type": "markdown", "id": "bbee10b5-c5e1-4dbe-aeb1-f5289372e89e", "metadata": {}, "source": [ "# pandasの高速化" ] }, { "cell_type": "code", "execution_count": 1, "id": "7adf116d-5898-4a7b-b292-25a9ad65208c", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from pandarallel import pandarallel\n", "import swifter" ] }, { "cell_type": "markdown", "id": "e32d3a36-a97e-4b72-b470-2725181b3641", "metadata": {}, "source": [ "ここではorderbookのデータをDataFrameに読み込みます。" ] }, { "cell_type": "code", "execution_count": 2, "id": "36b5fd1f-463e-43f0-8212-265f9efeb3b6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pricesizetimestampside
2020-07-08 22:00:059433.242.3055832020-07-08 22:00:05bid
2020-07-08 22:00:059434.321.0000002020-07-08 22:00:05ask
2020-07-08 22:00:059434.350.5301002020-07-08 22:00:05ask
2020-07-08 22:00:059434.900.0158482020-07-08 22:00:05ask
2020-07-08 22:00:059435.000.0500002020-07-08 22:00:05ask
\n", "
" ], "text/plain": [ " price size timestamp side\n", "2020-07-08 22:00:05 9433.24 2.305583 2020-07-08 22:00:05 bid\n", "2020-07-08 22:00:05 9434.32 1.000000 2020-07-08 22:00:05 ask\n", "2020-07-08 22:00:05 9434.35 0.530100 2020-07-08 22:00:05 ask\n", "2020-07-08 22:00:05 9434.90 0.015848 2020-07-08 22:00:05 ask\n", "2020-07-08 22:00:05 9435.00 0.050000 2020-07-08 22:00:05 ask" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_df = pd.read_pickle(\"btcusd_2020-07-08.pickle\")\n", "raw_df.head()" ] }, { "cell_type": "markdown", "id": "bd1441dc-bf62-452a-8b9d-215452886725", "metadata": {}, "source": [ "depth=10の板情報からbid/askが最も近い価格(price)と枚数(size)を抽出します。" ] }, { "cell_type": "code", "execution_count": 3, "id": "62585f89-eae2-4f88-b35b-c500e6d8f12c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bid_pricebid_sizeask_priceask_size
2020-07-08 22:00:059433.242.3055839433.250.015848
2020-07-08 22:00:069433.242.3055839433.250.015848
2020-07-08 22:00:079433.242.3236119433.250.015848
2020-07-08 22:00:089433.611.7000009433.620.015848
2020-07-08 22:00:099433.511.6500009433.520.015848
\n", "
" ], "text/plain": [ " bid_price bid_size ask_price ask_size\n", "2020-07-08 22:00:05 9433.24 2.305583 9433.25 0.015848\n", "2020-07-08 22:00:06 9433.24 2.305583 9433.25 0.015848\n", "2020-07-08 22:00:07 9433.24 2.323611 9433.25 0.015848\n", "2020-07-08 22:00:08 9433.61 1.700000 9433.62 0.015848\n", "2020-07-08 22:00:09 9433.51 1.650000 9433.52 0.015848" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bid = (\n", " raw_df.groupby(\"side\")\n", " .get_group(\"bid\")\n", " .groupby(\"timestamp\")[[\"price\", \"size\"]]\n", " .max()\n", ")\n", "ask = (\n", " raw_df.groupby(\"side\")\n", " .get_group(\"ask\")\n", " .groupby(\"timestamp\")[[\"price\", \"size\"]]\n", " .min()\n", ")\n", "df = pd.concat([bid, ask], axis=1)\n", "df.index.name = None\n", "df.columns = \"bid_price\", \"bid_size\", \"ask_price\", \"ask_size\"\n", "df.head()" ] }, { "cell_type": "markdown", "id": "fb123b4b-ad45-41d2-bef1-a2b3524a47c2", "metadata": {}, "source": [ "サンプルとして、priceとsizeから仲値を算出する関数を作成します。" ] }, { "cell_type": "code", "execution_count": 4, "id": "af5b0670-d0c6-40e5-bd7f-84557bc04736", "metadata": {}, "outputs": [], "source": [ "def get_mid_price(bid, bid_sz, ask, ask_sz):\n", " try:\n", " mid_price = ask + (ask_sz / (ask_sz + bid_sz) * (bid - ask))\n", " except ZeroDivisionError:\n", " mid_price = None\n", " return mid_price\n", "\n", "\n", "def get_mid_price_from_series(ser):\n", " return get_mid_price(*ser)\n", "\n", "\n", "vfunc = np.vectorize(get_mid_price)" ] }, { "cell_type": "markdown", "id": "b582b179-9561-4c77-b143-ecd28e3c8b97", "metadata": {}, "source": [ "作成した関数を各行に対して `apply` メソッドで適用します。" ] }, { "cell_type": "code", "execution_count": 5, "id": "364947fe-b843-49c5-ac1b-dc8920d572e8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "60.6 ms ± 3.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "%timeit df.apply(get_mid_price_from_series, axis=1)" ] }, { "cell_type": "markdown", "id": "ade45794-5bce-41db-85d0-55701c879413", "metadata": {}, "source": [ "[numpy.vectorize](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html) はベクトル化した関数を定義します。引数には array-like なオブジェクトを渡します。" ] }, { "cell_type": "code", "execution_count": 6, "id": "aca65840-495e-4e51-9ebf-7d7d328e6c5e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.98 ms ± 83.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], "source": [ "%timeit vfunc(*df.T.values)" ] }, { "cell_type": "markdown", "id": "1656a1ca-f19a-4cdd-9143-290d4f78f24c", "metadata": {}, "source": [ "```{tip}\n", "get_mid_price 関数の処理は関数化をしなくとも、演算子を使った式で算出できます\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "id": "c28235ba-4dc1-44be-97eb-ec04354bfa7e", "metadata": {}, "outputs": [], "source": [ "bid_, bid_sz_, ask_, ask_sz_ = df.T.values" ] }, { "cell_type": "code", "execution_count": 8, "id": "7ac9e478-9388-444d-8e5f-497477d74f3c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "22.8 µs ± 440 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n" ] } ], "source": [ "%timeit ask_ + (ask_sz_ / (ask_sz_ + bid_sz_) * (bid_ - ask_))" ] }, { "cell_type": "markdown", "id": "23a1f037-5e58-49d4-855a-8c3978d57e34", "metadata": {}, "source": [ "```{tip}\n", "get_mid_price 関数内の処理は配列(numpy.ndarray)に対応しているため、この関数の引数にSeriesなどを渡せます\n", "```" ] }, { "cell_type": "code", "execution_count": 9, "id": "419d7973-ae1e-4920-92d0-70dcbbe6ec94", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "120 µs ± 7.4 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n" ] } ], "source": [ "%timeit get_mid_price(*df.T.values)" ] }, { "cell_type": "markdown", "id": "58fd4c32-1378-4571-a7dc-57d722905b33", "metadata": {}, "source": [ "- [Pandaral·lel](https://github.com/nalepae/pandarallel)\n", "- [swifter](https://github.com/jmcarpenter2/swifter)" ] }, { "cell_type": "code", "execution_count": 10, "id": "0f379ded-690c-4ab3-a818-26399817be24", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO: Pandarallel will run on 4 workers.\n", "INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.\n" ] } ], "source": [ "pandarallel.initialize()" ] }, { "cell_type": "markdown", "id": "ab51949d-03b1-4c41-8bb4-ebe9354877cd", "metadata": {}, "source": [ "DataFrameから `apply` メソッドの代わりに `parallel_apply` メソッドを実行すると、関数を並列に適用します。" ] }, { "cell_type": "code", "execution_count": 11, "id": "4fea4994-6b46-4e78-affe-e4b9464c89bf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "83 ms ± 3.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "%timeit df.parallel_apply(get_mid_price_from_series, axis=1)" ] }, { "cell_type": "code", "execution_count": 12, "id": "d2889fe3-26e0-4cdf-a2d9-3dc32a8fcb14", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "3ba3da8d5415441ea93efc873f3f36ea", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Pandas Apply: 0%| | 0/7068 [00:00