PostgreSQL 7.2.x - 気まぐれSE日記

It went down again...

The database behind our company's in-house application is PostgreSQL 7.2.8.
In an earlier entry I wrote about that DB server running out of memory and hanging.

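On Friday the same hang happened again, so I had no choice but to come in on Saturday and look into it.
The conclusion we reached: running VACUUMDB while there are many connections to the DB seems to make the DB unstable,
and it eventually crashes.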

Here is the log.



# ↓ The VACUUMDB (optimization) batch runs at 3 a.m.

Mar 16 03:13:07 localhost postgres[7545]: [4] DEBUG:  --Relation pg_type--

Mar 16 03:13:07 localhost postgres[7545]: [5-1] DEBUG:  Pages 6: Changed 0, reaped 0, Empty 0, New 0; Tup 425: Vac 0, Keep/VTL 0/0, UnUsed 0, MinLen 106, MaxLen 106; Re-using:
# (snip)
# ↓ During the VACUUM, "too many clients" errors occur

Mar 16 03:24:01 localhost postgres[9271]: [4] FATAL 1:  Sorry, too many clients already

Mar 16 03:24:01 localhost postgres[9274]: [4] FATAL 1:  Sorry, too many clients already
# ↓ During the VACUUM, the PostgreSQL server processes are terminated

Mar 16 03:34:27 localhost postgres[7545]: [119-1] DEBUG:  Rel pg_toast_1358704: Pages: 63 --> 63; Tuple(s) moved: 0.

Mar 16 03:34:29 localhost postgres[1471]: [5] DEBUG:  terminating any other active server processes

Mar 16 03:34:44 localhost postgres[10451]: [4-1] NOTICE:  Message from PostgreSQL backend:

Mar 16 03:34:46 localhost postgres[10451]: [4-2] ^IThe Postmaster has informed me that some other backend
# Abnormal termination... a message saying shared memory may have been corrupted.

Mar 16 03:35:04 localhost postgres[10451]: [4-3] ^Idied abnormally and possibly corrupted shared memory.
# Attempting a rollback?

Mar 16 03:35:23 localhost postgres[10451]: [4-4] ^II have rolled back the current transaction and am
# The DB goes into recovery mode

Mar 16 03:34:45 localhost postgres[10812]: [6] FATAL 1:  The database system is in recovery mode
# The DB connection is forcibly closed

Mar 16 03:35:42 localhost postgres[10451]: [4-5] ^Igoing to terminate your database system connection and exit.
# A message asking the client to reconnect and re-run the query

Mar 16 03:36:01 localhost postgres[10451]: [4-6] ^IPlease reconnect to the database system and repeat your query.

Mar 16 03:36:46 localhost postgres[8897]: [4-2] ^IThe Postmaster has informed me that some other backend
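
To line up the "too many clients" errors with the moment the server kills its backends, the log can be summarized mechanically. The following is only a rough Python sketch: it assumes every line has the same syslog shape as the excerpt above, and the function names (`parse_line`, `too_many_clients_per_minute`, `first_crash_time`) are made up for this example.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt:
#   Mar 16 03:24:01 localhost postgres[9271]: [4] FATAL 1:  Sorry, too many clients already
LINE_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+postgres\[(?P<pid>\d+)\]:\s+\[(?P<seq>[\d-]+)\]\s+(?P<msg>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None


def too_many_clients_per_minute(lines):
    """Count 'too many clients' errors, keyed by 'Mon DD HH:MM'."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and "too many clients" in rec["msg"]:
            minute = "%s %s %s" % (rec["month"], rec["day"], rec["time"][:5])
            counts[minute] += 1
    return counts


def first_crash_time(lines):
    """Return the timestamp of the first 'terminating any other active
    server processes' message, or None if there is none."""
    for line in lines:
        rec = parse_line(line)
        if rec and "terminating any other active server processes" in rec["msg"]:
            return "%s %s %s" % (rec["month"], rec["day"], rec["time"])
    return None
```

Feeding it the whole syslog file (for example `too_many_clients_per_minute(open(path))`, where `path` is wherever syslog writes on your machine) shows how long the connection limit was being hit before the crash. Comment lines and blank lines simply fail to match and are skipped.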

As things stand, the DB no longer starts properly, so our in-house system is down (tears).

I'm at a loss as to what to do.
The worst case is that the database itself may be gone for good.

Rebuilding the server itself would be easy, but the cause... I just can't pin it down.

Afterwards

For now, the DB itself did not appear to be corrupted, so we decided to keep it running for a while and see how it goes.

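We also decided to install MRTG or munin and monitor how high the connection count gets before the DB falls over.
A lot of the log makes little sense to me, so a tool that visualizes things should be the more effective approach after all.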
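
For the munin route, a connection-count plugin might look roughly like the sketch below. Treat it as a starting point, not a tested plugin: it assumes that each client connection shows up in `ps` as a process whose title starts with `postgres: `, and that the helper processes to exclude have "stats" in their title. Both should be checked against the real `ps` output on the 7.2.8 server. The function names and the `connections` field name are made up for this example.

```python
import subprocess

# Assumption: process titles containing these words are helper processes
# of the server, not client connections.
HELPER_MARKERS = ("stats buffer", "stats collector")


def count_backends(ps_output):
    """Count lines of ps output that look like client backend processes."""
    count = 0
    for line in ps_output.splitlines():
        if "postgres: " not in line:
            continue  # not a process with a 'postgres: ...' title
        if any(marker in line for marker in HELPER_MARKERS):
            continue  # helper process, not a client connection
        count += 1
    return count


def munin_config():
    """Text a munin plugin prints when it is called with the 'config' argument."""
    return "\n".join([
        "graph_title PostgreSQL connections",
        "graph_vlabel connections",
        "connections.label connections",
    ])


def munin_values(count):
    """Text a munin plugin prints on a normal run."""
    return "connections.value %d" % count


def read_ps():
    """Return the current process list as text (Linux-style ps options)."""
    result = subprocess.run(["ps", "axww"], stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return result.stdout
```

The actual plugin script would print `munin_config()` when called with `config` and `munin_values(count_backends(read_ps()))` otherwise. Graphing that value next to the 3 a.m. VACUUMDB run should show whether the crashes really line up with the connection count approaching the server's limit.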