3 Jun 2020
My Ceph cluster at home isn't designed for performance. It's not designed for maximum availability. It's designed for a low cost per TiB while still maintaining usability and decent disk-level redundancy. Here is some recent tuning to help with performance and corruption prevention.
- mds
    - mds_cache_memory_limit: 4GiB => 8GiB
        - Target maximum memory usage of MDS cache
        - The default wasn't enough for the CephFS instance I have. Probably related to the number of objects or the frequency of FS scans for metadata.
- osd
    - osd_deep_scrub_interval: 1w => 8w
        - Deep scrub each PG (i.e., verify data checksums) at least this often
        - Deep scrubs on 200TiB of data (about to grow by another 150TiB) aren't feasible once a week
    - osd_op_queue: wpq => mclock_client
        - Which operation priority queue algorithm to use
        - Changed from wpq to mclock_client so that per-OSD disk queuing takes the client and workload type into account
    - osd_max_backfills: 1 => 4
        - Maximum number of concurrent local and remote backfills or recoveries per OSD
        - While most spinning disks should keep this at one, this should help overcome network/CPU latency. I think.
    - osd_recovery_max_active: 0 => 4
        - Number of simultaneous active recovery operations per OSD (overrides _ssd and _hdd if non-zero)
        - Helps ensure the recovery operations go as fast as possible. This may impact client read performance during recovery, but I'd prefer that over a higher chance of corruption, data loss, or other problems.
    - osd_recovery_max_single_start: 1 => 4
        - The maximum number of recovery operations per OSD that will be newly started when an OSD is recovering
        - Same reasoning as osd_max_backfills above
    - osd_scrub_auto_repair: false => true
        - Automatically repair damaged objects detected during scrub
        - If something corrupt or out of sync is found, let's get that fixed asap.
    - osd_scrub_during_recovery: false => true
        - Allow scrubbing when PGs on the OSD are undergoing recovery
        - Same reasoning as osd_max_backfills above
    - osd_scrub_load_threshold: 0.5 => 4
        - Allow scrubbing when system load divided by number of CPUs is below this value
        - The load on my OSD servers is usually above 1 during normal operations. It only seems to go above 4 during heavy recovery and other heavy ops, so this seemed like a good middle ground.
    - osd_scrub_max_interval: 1w => 4w
        - Scrub each PG no less often than this interval
        - With 15 million objects / 220 million replicas as of 2020-06-03, scrubbing every day seems like overkill.
    - osd_scrub_min_interval: 1d => 1w
        - Scrub each PG no more often than this interval
        - Same reasoning as osd_scrub_max_interval above
- global
    - target_max_misplaced_ratio: 5% => 1%
        - Max ratio of misplaced objects to target when throttling data rebalancing activity
        - Given the large number of objects in my cluster, I figured this should be low so that rebalancing is a higher priority.
These can be fetched with `ceph config get <level 0> <level 1>` and set with `ceph config set <level 0> <level 1> <value>`, where "level 0" is the top-level item in the list above (mds, osd, or global) and "level 1" is the option name at the second indentation level.
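As a concrete sketch of what that looks like for the changes above (note that interval options such as osd_deep_scrub_interval are stored in seconds and size options such as mds_cache_memory_limit in bytes, so the human-friendly values need converting):

```sh
# Check the current value first (reported in seconds for interval options).
ceph config get osd osd_deep_scrub_interval

# 8 weeks = 8 * 7 * 24 * 3600 = 4838400 seconds.
ceph config set osd osd_deep_scrub_interval 4838400

# 8 GiB = 8 * 1024^3 = 8589934592 bytes.
ceph config set mds mds_cache_memory_limit 8589934592

# Plain integers, strings, and ratios can be set directly.
ceph config set osd osd_max_backfills 4
ceph config set osd osd_op_queue mclock_client
ceph config set global target_max_misplaced_ratio 0.01
```

Note that some of these (osd_op_queue in particular) only take effect after the OSD daemons are restarted.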
Details for the OSD configuration can be found at https://docs.ceph.com/docs/octopus/rados/configuration/osd-config-ref/
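As a quick sanity check that deep scrubs still complete within the longer 8-week window, something like the following can list the PGs with the oldest deep-scrub timestamps. This is just a sketch: it assumes jq is installed and that the pg dump JSON exposes pgid and last_deep_scrub_stamp per PG, which may vary by release.

```sh
# Oldest deep-scrub timestamps first. The JSON layout and field names are
# assumptions based on recent releases and may differ on other versions.
ceph pg dump pgs --format json 2>/dev/null | \
  jq -r '(.pg_stats // .pg_map.pg_stats // .)[] | [.pgid, .last_deep_scrub_stamp] | @tsv' | \
  sort -k2 | head -20
```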