queue size exceeds limit in td-agent

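The fluentd (td-agent) setup I built last year had been running stably ever since, but around the new year it started logging the errors below and stopping day after day. It cost me some time, so I am writing it down here (although I never pinned down a definite root cause).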

The td-agent version in use is td-agent-1.1.18-0.x86_64.

Symptoms

fluent.log shows the following, and td-agent has stopped functioning.

2014-01-09T20:28:59+09:00       fluent.error    {"error":"#<Fluent::BufferQueueLimitError: queue size exceeds limit>","error_class":"Fluent::BufferQueueLimitError","message":"forward error"}
2014-01-09T20:28:59+09:00       fluent.warn     {"error_class":"Fluent::BufferQueueLimitError","error":"#<Fluent::BufferQueueLimitError: queue size exceeds limit>","message":"emit transaction failed "}
2014-01-09T20:28:59+09:00       fluent.warn     {"error_class":"Fluent::BufferQueueLimitError","error":"#<Fluent::BufferQueueLimitError: queue size exceeds limit>","message":"emit transaction failed "}
2014-01-09T20:28:59+09:00       fluent.warn     {"error_class":"Fluent::BufferQueueLimitError","error":"#<Fluent::BufferQueueLimitError: queue size exceeds limit>","message":"emit transaction failed "}
2014-01-09T20:28:59+09:00       fluent.warn     {"error_class":"Fluent::BufferQueueLimitError","error":"#<Fluent::BufferQueueLimitError: queue size exceeds limit>","message":"emit transaction failed "}
2014-01-09T20:28:59+09:00       fluent.warn     {"error_class":"Fluent::BufferQueueLimitError","error":"#<Fluent::BufferQueueLimitError: queue size exceeds limit>","message":"emit transaction failed "}

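dmesg keeps printing the message below without end. @hiboma quickly wrote some test code and found that "possible SYN flooding on port 24224" appears because td-agent has the port open but is not calling accept(2), so incoming connections pile up in the listen queue.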

possible SYN flooding on port 24224. Sending cookies.
possible SYN flooding on port 24224. Sending cookies.
possible SYN flooding on port 24224. Sending cookies.

The setup is unchanged from the one I described in an earlier post.

Fix

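I investigated and tried a lot of things, but to give the fix first: following the post that @hiboma found, "fluentdで死の宣告queue size exceeds limit - boku no blog", I changed buffer_type from memory to file. After that the problem subsided and the service recovered.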

The change was roughly as follows.
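
As a minimal sketch of this kind of change (not the exact configuration used here; the match pattern, buffer_path, server address, and limit values are placeholders assumed for illustration), a forward output with a file buffer in td-agent 1.x syntax might look like this:

```
<match **>
  type forward

  # Previously: buffer_type memory
  buffer_type file
  # buffer_path is required for the file buffer; this path is a placeholder
  buffer_path /var/log/td-agent/buffer/forward

  # Upper bound on queued data is roughly
  # buffer_chunk_limit x buffer_queue_limit (8 MiB x 256 = 2 GiB here)
  buffer_chunk_limit 8m
  buffer_queue_limit 256
  flush_interval 10s

  <server>
    # placeholder address of the aggregator
    host 192.0.2.10
    port 24224
  </server>
</match>
```

With a file buffer the queued chunks live on disk rather than in the process heap, so a large queue costs disk space instead of memory, and chunks that have not been flushed yet survive a td-agent restart.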

I had been fairly careful about the values of buffer_chunk_limit and buffer_queue_limit, but on a server with 64 GB of memory and an SSD, blindly setting buffer_type memory seems to have been my undoing.

What I did during the investigation

Tuning buffer_chunk_limit and buffer_queue_limit
Getting backtraces with strace and gdb
Detailed resource monitoring with the monitoring agent
Sending kill -USR1 to the process

With help from @hiboma and @hfm I tried these and a good many other things, but none of them improved the symptoms.
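
For the resource-monitoring step in the list above, here is a minimal sketch that assumes the monitoring agent refers to Fluentd's built-in monitor_agent input plugin. The bind address and port 24220 are example values, not taken from the actual setup:

```
<source>
  type monitor_agent
  # Listen on localhost only; adjust if metrics are scraped remotely
  bind 127.0.0.1
  port 24220
</source>

# Query per-plugin metrics as JSON, for example:
#   curl http://127.0.0.1:24220/api/plugins.json
```

For buffered outputs the JSON response includes fields such as buffer_queue_length, buffer_total_queued_size, and retry_count, which show how close the queue is getting to buffer_queue_limit before the error appears.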

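What the investigation did show is that td-agent does not stop abruptly when queue size exceeds limit appears. Rather, it seems to die at logrotate time. When I manually ran the kill -USR1 that logrotate executes, the old process sometimes stayed around. This puzzling behavior looks like a separate problem from the one under investigation, and at that point things stopped making sense...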

Glide Note

glidenote's blog
