Kernel RBD的QOS配置方案

January 4, 2018, 11:23 pm

前言

关于qos的讨论有很多，ceph内部也正在实现着一整套的基于dmclock的qos的方案，这个不是本篇的内容，之前在社区的邮件列表看过有研发在聊qos的相关的实现的，当时一个研发就提出了在使用kernel rbd的时候，可以直接使用linux的操作系统qos来实现，也就是cgroup来控制读取写入

cgroup之前也有接触过，主要测试了限制cpu和内存相关的，没有做io相关的测试，这个当然可以通过ceph内部来实现qos，但是有现成的解决方案的时候，可以减少很多开发周期，以及测试的成本

本篇将介绍的是kernel rbd的qos方案

时间过长

首先介绍下几个测试qos相关的命令，用来比较设置前后的效果
验证写入IOPS命令

fio -filename=/dev/rbd0 -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=4K -size=1G -numjobs=1 -runtime=60 -group_reporting -name=mytest

验证写入带宽的命令

fio -filename=/dev/rbd0 -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=4M -size=1G -numjobs=1 -runtime=60 -group_reporting -name=mytest

验证读取IOPS命令

fio -filename=/dev/rbd0 -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=4K -size=1G -numjobs=1 -runtime=60 -group_reporting -name=mytest

验证读取带宽命令

fio -filename=/dev/rbd0 -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=4M -size=1G -numjobs=1 -runtime=60 -group_reporting -name=mytest

上面为什么会设置不同的块大小，这个是因为测试的存储是会受到带宽和iops的共同制约的，当测试小io的时候，这个时候的峰值是受到iops的限制的，测试大io的时候，受到的是带宽限制，所以在做测试的时候，需要测试iops是否被限制住的时候就使用小的bs=4K，需要测试大的带宽的限制的时候就采用bs=4M来测试

测试的时候都是，开始不用做qos来进行测试得到一个当前不配置qos的性能数值，然后根据需要进行qos设置后通过上面的fio去测试是否能限制住

启用cgroup的blkio模块

mkdir -p  /cgroup/blkio/
mount -t cgroup -o blkio blkio /cgroup/blkio/

获取rbd磁盘的major/minor numbers

[root@lab211 ~]# lsblk 
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
rbd0   252:0    0  19.5G  0 disk 
sda      8:0    1 238.4G  0 disk 
├─sda4   8:4    1     1K  0 part 
├─sda2   8:2    1  99.9G  0 part 
├─sda5   8:5    1     8G  0 part [SWAP]
├─sda3   8:3    1     1G  0 part /boot
├─sda1   8:1    1   100M  0 part 
└─sda6   8:6    1 129.4G  0 part /

通过lsblk命令可以获取到磁盘对应的major number和minor number,这里可以看到rbd0对应的编号为252:0

设置rbd0的iops的qos为10

echo "252:0 10" > /cgroup/blkio/blkio.throttle.write_iops_device

如果想清理这个规则,把后面的数值设置为0就清理了，后面几个配置也是相同的方法

echo "252:0 0" > /cgroup/blkio/blkio.throttle.write_iops_device

限制写入的带宽为10MB/s

echo "252:0 10485760" > /cgroup/blkio/blkio.throttle.write_bps_device

限制读取的qos为10

echo "252:0 10" > /cgroup/blkio/blkio.throttle.read_iops_device

限制读取的带宽为10MB/s

echo "252:0 10485760" > /cgroup/blkio/blkio.throttle.read_bps_device

以上简单的设置就完成了kernel rbd的qos设置了，我测试了下，确实是生效了的

总结

这个知识点很久前就看到了，一直没总结，现在记录下，个人观点是能快速，有效，稳定的实现功能是最好的，所以使用这个在kernel rbd方式下可以不用再进行qos的开发了

变更记录

Why	Who	When
创建	武汉-运维-磨渣	2018-01-05

Source: zphj1987@gmail (Kernel RBD的QOS配置方案)

The post Kernel RBD的QOS配置方案 appeared first on Ceph.

↧

CTDB使用rados object作为lock file

January 6, 2018, 7:29 am

≫ Next: How to create a vagrant VM from a libvirt vm/image

≪ Previous: Kernel RBD的QOS配置方案

object

前言

服务器的服务做HA有很多种方式，其中有一种就是是用CTDB，之前这个是独立的软件来做HA的，现在已经跟着SAMBA主线里面了，也就是跟着samba发行包一起发行

之前CTDB的模式是需要有一个共享文件系统，并且在这个共享文件系统里面所有的节点都去访问同一个文件，会有一个Master会获得这个文件的锁

在cephfs的使用场景中可以用cephfs的目录作为这个锁文件的路径，这个有个问题就是一旦有一个节点down掉的时候，可能客户端也会卡住目录，这个目录访问会被卡住，文件锁在其他机器无法获取到，需要等到这个锁超时以后，其它节点才能获得到锁，这个切换的周期就会长一点了

CTDB在最近的版本当中加入了cluster mutex helper using Ceph RADOS的支持，本篇将介绍这个方式锁文件配置方式

实践过程

安装CTDB

[root@customos ~]# yum install samba ctdb

检查默认包里面是否有rados的支持

[root@customos ~]# rpm -qpl ctdb-4.6.2-12.el7_4.x86_64.rpm
…
usr/libexec/ctdb
/usr/libexec/ctdb/ctdb_event
/usr/libexec/ctdb/ctdb_eventd
/usr/libexec/ctdb/ctdb_killtcp
/usr/libexec/ctdb/ctdb_lock_helper
/usr/libexec/ctdb/ctdb_lvs
/usr/libexec/ctdb/ctdb_mutex_fcntl_helper
/usr/libexec/ctdb/ctdb_natgw
/usr/libexec/ctdb/ctdb_recovery_helper
/usr/libexec/ctdb/ctdb_takeover_helper
/usr/libexec/ctdb/smnotify
…

这个可以看到默认并没有包含这个rados的支持，这个很多通用软件都会这么处理，因为支持第三方插件的时候需要开发库，而开发库又有版本的区别，所以默认并不支持，需要支持就自己编译即可，例如fio支持librbd的接口就是这么处理的，等到插件也通用起来的时候，可能就会默认支持了

很多软件的编译可以采取源码的编译方式，如果不是有很强的代码合入和patch跟踪能力，直接用发行包的方式是最稳妥的，所以为了不破坏这个稳定性，本篇采用的是基于发行版本，增加模块的方式，这样不会破坏核心组件的稳定性，并且后续升级也是比较简单的，这个也是个人推荐的方式

查询当前使用的samba版本

[root@customos ~]# rpm -qa|grep samba
samba-4.6.2-12.el7_4.x86_64

打包新的CTDB

可以查询得到这个的源码包为samba-4.6.2-12.el7_4.src.rpm,进一步搜索可以查询的到这个src源码rpm包

http://vault.centos.org/7.4.1708/updates/Source/SPackages/samba-4.6.2-12.el7_4.src.rpm

下载这个rpm包

[root@customos ~]# wget http://vault.centos.org/7.4.1708/updates/Source/SPackages/samba-4.6.2-12.el7_4.src.rpm

如果下载比较慢的话就用迅雷下载，会快很多，国内的源里面把源码包的rpm都删除掉了，上面的是官网会有最全的包

解压这个rpm包

[root@customos ~]# rpm2cpio samba-4.6.2-12.el7_4.src.rpm |cpio -div

检查包的内容

[root@customos myctdb]# ls
CVE-2017-12150.patch                                 samba-v4-6-fix-cross-realm-refferals.patch
CVE-2017-12151.patch                                 samba-v4-6-fix-kerberos-debug-message.patch
CVE-2017-12163.patch                                 samba-v4-6-fix_net_ads_changetrustpw.patch
CVE-2017-14746.patch                                 samba-v4-6-fix-net-ads-keytab-handling.patch
CVE-2017-15275.patch                                 samba-v4-6-fix_path_substitutions.patch
CVE-2017-7494.patch                                  samba-v4-6-fix_smbclient_session_setup_info.patch
gpgkey-52FBC0B86D954B0843324CDC6F33915B6568B7EA.gpg  samba-v4-6-fix_smbclient_username_parsing.patch
pam_winbind.conf                                     samba-v4.6-fix_smbpasswd_user_pwd_change.patch
README.dc                                            samba-v4-6-fix-spoolss-32bit-driver-upload.patch
README.downgrade                                     samba-v4-6-fix-vfs-expand-msdfs.patch
samba-4.6.2-12.el7_4.src.rpm                         samba-v4-6-fix_winbind_child_crash.patch
samba-4.6.2.tar.asc                                  samba-v4-6-fix_winbind_normalize_names.patch
samba-4.6.2.tar.xz                                   samba-v4.6-graceful_fsctl_validate_negotiate_info.patch
samba.log                                            samba-v4.6-gss_krb5_import_cred.patch
samba.pamd                                           samba-v4.6-lib-crypto-implement-samba.crypto-Python-module-for-.patch
samba.spec                                           samba-v4.7-config-dynamic-rpc-port-range.patch
samba-v4.6-credentials-fix-realm.patch               smb.conf.example
samba-v4-6-fix-building-with-new-glibc.patch         smb.conf.vendor

可以看到在源码包基础上还打入了很多的patch，内部的编译采用的是waf编译的方式，内部的过程就不做太多介绍了，这里只去改动我们需要的部分即可，也就是去修改samba.spec文件

我们先获取相关的编译选项，这个我最开始的时候打算独立编译ctdb的rpm包，发现有依赖关系太多，后来多次验证后，发现直接可以在samba编译里面增加选项的，选项获取方式

[root@lab211 samba-4.6.2]# ./configure --help|grep ceph
  --with-libcephfs=LIBCEPHFS_DIR
            Directory under which libcephfs is installed
  --enable-cephfs
            Build with cephfs support (default=yes)
  --enable-ceph-reclock

这个可以知道需要添加ceph-reclock的支持就添加这个选项，我们把这个选项添加到samba.spec当中
修改samba.spec文件

…
%configure 
        --enable-fhs 
        --with-piddir=/run 
        --with-sockets-dir=/run/samba 
        --with-modulesdir=%{_libdir}/samba 
        --with-pammodulesdir=%{_libdir}/security 
        --with-lockdir=/var/lib/samba/lock 
        --with-statedir=/var/lib/samba 
        --with-cachedir=/var/lib/samba 
        --disable-rpath-install 
        --with-shared-modules=%{_samba4_modules} 
        --bundled-libraries=%{_samba4_libraries} 
        --with-pam 
        --with-pie 
        --with-relro 
        --enable-ceph-reclock 
        --without-fam 
…
%dir %{_libexecdir}/ctdb
%{_libexecdir}/ctdb/ctdb_event
%{_libexecdir}/ctdb/ctdb_eventd
%{_libexecdir}/ctdb/ctdb_killtcp
%{_libexecdir}/ctdb/ctdb_lock_helper
%{_libexecdir}/ctdb/ctdb_lvs
%{_libexecdir}/ctdb/ctdb_mutex_fcntl_helper
%{_libexecdir}/ctdb/ctdb_mutex_ceph_rados_helper
…
%{_mandir}/man1/ctdb.1.gz
%{_mandir}/man1/ctdb_diagnostics.1.gz
%{_mandir}/man1/ctdbd.1.gz
%{_mandir}/man1/onnode.1.gz
%{_mandir}/man1/ltdbtool.1.gz
%{_mandir}/man1/ping_pong.1.gz
%{_mandir}/man7/ctdb_mutex_ceph_rados_helper.7.gz
%{_mandir}/man1/ctdbd_wrapper.1.gz
…

这个文件当中一共添加了三行内容

--enable-ceph-reclock 
%{_libexecdir}/ctdb/ctdb_mutex_ceph_rados_helper
%{_mandir}/man7/ctdb_mutex_ceph_rados_helper.7.gz

把解压后的目录里面的所有文件都拷贝到源码编译目录,就是上面ls列出的那些文件，以及修改好的samba.spec文件都一起拷贝过去

[root@customos myctdb]# cp -ra * /root/rpmbuild/SOURCES/

安装librados2的devel包

[root@customos myctdb]# yum install librados2-devel

如果编译过程缺其他的依赖包就依次安装即可，这个可以通过解压源码先编译一次的方式来把依赖包找全，然后再打rpm包

开始编译rpm包

[root@customos myctdb]# rpmbuild -bb samba.spec

这个可以就在当前的目录执行即可

检查生成的包

[root@customos myctdb]# rpm -qpl /root/rpmbuild/RPMS/x86_64/ctdb-4.6.2-12.el7.centos.x86_64.rpm|grep rados
/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper
/usr/share/man/man7/ctdb_mutex_ceph_rados_helper.7.gz

可以看到已经生成了这个，把这个包拷贝到需要更新的机器上面

配置ctdb

首先要升级安装下新的ctdb包，因为名称有改变，会提示依赖问题,这里忽略依赖的问题

[root@customos ~]# rpm -Uvh ctdb-4.6.2-12.el7.centos.x86_64.rpm --nodeps

添加一个虚拟IP配置

[root@customos ~]# cat /etc/ctdb/public_addresses 
192.168.0.99/16 ens33

添加node配置

[root@customos ~]# cat /etc/ctdb/nodes 
192.168.0.18
192.168.0.201

修改配置文件

[root@customos ~]# cat /etc/ctdb/ctdbd.conf|grep -v "#"
 CTDB_RECOVERY_LOCK="!/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.admin rbd lockctdb"
 CTDB_NODES=/etc/ctdb/nodes
 CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
 CTDB_LOGGING=file:/var/log/log.ctdb
# CTDB_DEBUGLEVEL=debug

上面为了调试，我开启了debug来查看重要的信息

CTDB_RECOVERY_LOCK=”!/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.admin rbd lockctdb”
最重要的是这行配置文件规则是
CTDB_RECOVERY_LOCK="!/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper [Cluster] [User] [Pool] [Object]"
Cluster: Ceph cluster name (e.g. ceph)
User: Ceph cluster user name (e.g. client.admin)
Pool: Ceph RADOS pool name
Object: Ceph RADOS object name

在ctdb的机器上面准备好librados2和ceph配置文件，这个配置的rbd的lockctdb对象会由ctdb去生成

[root@customos ~]# systemctl restart ctdb

配置好了以后就可以启动进程了，上面的/etc/ctdb/ctdbd.conf配置文件最好是修改好一台机器的，然后scp到其它机器，里面内容有一点点偏差都会判断为异常的，所以最好是相同的配置文件

查看进程状态

[root@customos ceph]# ctdb status
Number of nodes:2
pnn:0 192.168.0.18     OK (THIS NODE)
pnn:1 192.168.0.201    OK
Generation:1662303628
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:1

查看/var/log/log.ctdb日志

2018/01/06 23:18:11.399849 ctdb-recoverd[129134]: Node:1 was in recovery mode. Start recovery process
2018/01/06 23:18:11.399879 ctdb-recoverd[129134]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
2018/01/06 23:18:11.399903 ctdb-recoverd[129134]: Attempting to take recovery lock (!/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.admin rbd lockctdb)
2018/01/06 23:18:11.400657 ctdb-recoverd[129134]: ../ctdb/server/ctdb_cluster_mutex.c:251 Created PIPE FD:17
2018/01/06 23:18:11.579865 ctdbd[129038]: ../ctdb/server/ctdb_daemon.c:907 client request 40 of type 7 length 72 from node 1 to 4026531841

日志中可以看到ctdb-recoverd已经是采用的ctdb_mutex_ceph_rados_helper来获取的recovery lock

停掉ctdb的进程，IP可以正常的切换，到这里，使用对象作为lock文件的功能就实现了，其他更多的ctdb的高级控制就不在这个里作过多的说明

总结

本篇是基于发行版本的ctdb包进行模块的加入重新发包，并且把配置做了一次实践，这个可以作为一个ctdb的方案之一，具体跟之前的方案相比切换时间可以改善多少，需要通过数据进行对比，这个进行测试即可

资源

已经打好包的ctdb共享一下，可以直接使用

http://7xweck.com1.z0.glb.clouddn.com/ctdb-4.6.2-12.el7.centos.x86_64.rpm

变更记录

Why	Who	When
创建	武汉-运维-磨渣	2018-01-06

Source: zphj1987@gmail (CTDB使用rados object作为lock file)

The post CTDB使用rados object作为lock file appeared first on Ceph.

↧

How to create a vagrant VM from a libvirt vm/image

January 11, 2018, 2:52 am

≫ Next: 定位一个网络问题引起的ceph异常

≪ Previous: CTDB使用rados object作为lock file

It cost’s me some nerves and time to figure out how to create a vagrant image from
a libvirt kvm vm and how to modify an existing one. Thanks to pl_rock from stackexchange
for the awesome start.

First of all you have to install a new vm as usual. I’ve installed a new vm with Ubuntu 16.04 LTS.
I’m not sure if it’s really neccessary but set the root password to “vagrant”, just to be sure.
Connect to your VM via ssh or terminal and do the following steps.

定位一个网络问题引起的ceph异常

January 16, 2018, 7:10 am

≫ Next: Placement Groups with Ceph Luminous stay in activating state

≪ Previous: How to create a vagrant VM from a libvirt vm/image

network

前言

有一个ceph环境出现了异常，状态就是恢复异常的慢，但是所有数据又都在走，只是非常的慢，本篇将记录探测出问题的过程，以便以后处理类似的问题有个思路

处理过程

问题的现象是恢复的很慢，但是除此以外并没有其它的异常，通过iostat监控磁盘，也没有出现异常的100%的情况，暂时排除了是osd底层慢的问题

检测整体写入的速度

通过rados bench写入

rados -p rbd bench 5 write

刚开始写入的时候没问题，但是写入了以后不久就会出现一只是0的情况，可以判断在写入某些对象的时候出现了异常

本地生成一些文件

seq 0 30|xargs -i dd if=/dev/zero of=benchmarkzp{} bs=4M count=2

通过rados put 命令把对象put进去

for a in `ls ./`;do time rados -p rbd put $a $a;echo $a;ceph osd map rbd $a;done

得到的结果里面会有部分是好的，部分是非常长的时间，对结果进行过滤，分为bad 和good

开始怀疑会不会是固定的盘符出了问题，首先把磁盘组合分出来，完全没问题的磁盘全部排除，结果最后都排除完了，所以磁盘本省是没问题的

根据pg的osd组合进行主机分类

1  2  4  ok
3  1   2  bad
2  4   1 ok
3  1 2   bad
3  4  2  bad
……

上面的编号是写入对象所在的pg对应的osd所在的主机，严格按照顺序写入，第一个主机为发送数据方，第二个和第三个为接收数据方，并且使用了cluster network

通过上面的结果发现了从3往2进行发送副本数据的时候出现了问题，然后去主机上排查网络

在主机2上面做iperf -s
在主机3上面做iperf -c host2然后就发现了网络异常了

最终还是定位在了网络上面

已经在好几个环境上面发现没装可以监控实时网络流量dstat工具或者ifstat的动态监控，做操作的时候监控下网络，可以发现一些异常

总结

这个环境在最开始的时候就怀疑是网络可能有问题，但是没有去进行全部服务器的网络的检测，这个在出现一些奇奇怪怪的异常的时候，还是可能出现在网络上面，特别是这种坏掉又不是完全坏掉，只是掉速的情况，通过集群的一些内部告警还没法完全体现出来，而主机很多的时候，又没有多少人愿意一个个的去检测，就容易出现这种疏漏了

在做一个ceph的管理平台的时候，对整个集群做全员对等网络带宽测试还是很有必要的，如果有一天我来设计管理平台，一定会加入这个功能进去

变更记录

Why	Who	When
创建	武汉-运维-磨渣	2018-01-16

Source: zphj1987@gmail (定位一个网络问题引起的ceph异常)

The post 定位一个网络问题引起的ceph异常 appeared first on Ceph.

↧

Placement Groups with Ceph Luminous stay in activating state

January 29, 2018, 2:54 am

≫ Next: Building Ceph master with C++17 support on openSUSE Leap 42.3

≪ Previous: 定位一个网络问题引起的ceph异常

Placement Groups stuck in activating When migrating from FileStore with BlueStore with Ceph Luminuous you might run into the problem that certain Placement Groups stay stuck in the activating state. 44 activating+undersized+degraded+remapped PG Overdose This is a side-effect of the new PG overdose protection in Ceph Luminous. Too many PGs on your OSDs can cause … Continue reading Placement Groups with Ceph Luminous stay in activating state

Source: widodh (Placement Groups with Ceph Luminous stay in activating state)

The post Placement Groups with Ceph Luminous stay in activating state appeared first on Ceph.

↧

Building Ceph master with C++17 support on openSUSE Leap 42.3

January 29, 2018, 11:07 am

≫ Next: Ceph Manager Dashboard v2

≪ Previous: Placement Groups with Ceph Luminous stay in activating state

Ceph now requires C++17 support, which is available with modern compilers such as gcc-7. openSUSE Leap 42.3, my current OS of choice, includes gcc-7. However, it’s not used by default.

Using gcc-7 for the Ceph build is a simple matter of:

> sudo zypper in gcc7-c++
> CC=gcc-7 CXX=/usr/bin/g++-7 ./do_cmake.sh ...
> cd build && make -j

Source: David Disseldorp (Building Ceph master with C++17 support on openSUSE Leap 42.3)

The post Building Ceph master with C++17 support on openSUSE Leap 42.3 appeared first on Ceph.

↧

Ceph Manager Dashboard v2

February 2, 2018, 6:13 am

≫ Next: REDHAT 7.5beta 新推出的VDO功能

≪ Previous: Building Ceph master with C++17 support on openSUSE Leap 42.3

The original Ceph Manager Dashboard that was introduced in
Ceph “Luminous” started out as a simple, read-only view into various run-time
information and performance data of a Ceph cluster, without authentication or
any administrative functionality.

However, as it turns out, there is a growing demand for adding more web-based
management capabilities, to make it easier for administrators that prefer a
WebUI to manage Ceph over the command line. Sage Weil also touched upon this
topic in the Ceph Developer monthly call in December and created an etherpad with some
ideas for improvement.

/galleries/ceph-dashboard-v2-screenshots-2018-02-02/dashboard-v2-health.png

A preliminary screen shot of the Ceph health dashboard

After learning about this, we approached Sage and John Spray from the Ceph
project and offered our help to implement the missing functionality. Based on
our experiences in developing the Ceph support in openATTIC, we think we have a
lot to offer in the form of code and experience in creating a Ceph
administration and monitoring UI.

REDHAT 7.5beta 新推出的VDO功能

February 10, 2018, 12:25 am

≫ Next: How to do a Ceph cluster maintenance/shutdown

≪ Previous: Ceph Manager Dashboard v2

network

前言

关于VDO

VOD的技术来源于收购的Permabit公司，一个专门从事重删技术的公司，所以技术可靠性是没有问题的

VDO是一个内核模块，目的是通过重删减少磁盘的空间占用，以及减少复制带宽，VDO是基于块设备层之上的，也就是在原设备基础上映射出mapper虚拟设备，然后直接使用即可，功能的实现主要基于以下技术：

零区块的排除：

在初始化阶段，整块为0的会被元数据记录下来，这个可以用水杯里面的水和沙子混合的例子来解释，使用滤纸（零块排除），把沙子（非零空间）给过滤出来，然后就是下一个阶段的处理

重复数据删除：

在第二阶段，输入的数据会判断是不是冗余数据（在写入之前就判断），这个部分的数据通过UDS内核模块来判断（U niversal D eduplication S ervice），被判断为重复数据的部分不会被写入，然后对元数据进行更新，直接指向原始已经存储的数据块即可
压缩：

一旦消零和重删完成，LZ4压缩会对每个单独的数据块进行处理，然后压缩好的数据块会以固定大小4KB的数据块存储在介质上，由于一个物理块可以包含很多的压缩块，这个也可以加速读取的性能

上面的技术看起来很容易理解，但是实际做成产品还是相当大的难度的，技术设想和实际输出还是有很大距离，不然redhat也不会通过收购来获取技术，而不是自己去重新写一套了

如何获取VDO

主要有两种方式，一种是通过申请测试版的方式申请redhat 7.5的ISO，这个可以进行一个月的测试

另外一种方式是申请测试版本，然后通过源码在你正在使用的ISO上面进行相关的测试，从适配方面在自己的ISO上面进行测试能够更好的对比，由于基于redhat的源码做分发会涉及法律问题，这里就不做过多讲解，也不提供rpm包，自行申请测试即可

实践过程

安装VDO

安装的操作系统为CentOS Linux release 7.4.1708

[root@lab101 ~]# lsb_release -a
LSB Version:	:core-4.1-amd64:core-4.1-noarch
Distributor ID:	CentOS
Description:	CentOS Linux release 7.4.1708 (Core) 
Release:	7.4.1708
Codename:	Core

内核版本如下

[root@lab101 ~]# uname -a
Linux lab101 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

[root@lab101 ~]# rpm -qa|grep kernel
kernel-tools-libs-3.10.0-693.el7.x86_64
abrt-addon-kerneloops-2.1.11-48.el7.centos.x86_64
kernel-3.10.0-693.el7.x86_64

我们把内核升级一下，因为这个模块比较新，所以选择目前updates里面最新的

wget http://mirror.centos.org/centos/7/updates/x86_64/Packages/kernel-3.10.0-693.17.1.el7.x86_64.rpm

大版本一致，小版本不同，直接安装即可

[root@lab101 ~]# rpm -ivh kernel-3.10.0-693.17.1.el7.x86_64.rpm 
Preparing...                          ################################# [100%]
Updating / installing...
   1:kernel-3.10.0-693.17.1.el7       ################################# [100%]
[root@lab101 ~]# grub2-set-default 'CentOS Linux (3.10.0-693.17.1.el7.x86_64) 7 (Core)'

重启服务器
安装

[root@lab101 ~]# rpm -ivh kmod-kvdo-6.1.0.98-11.el7.centos.x86_64.rpm 
Preparing...                          ################################# [100%]
Updating / installing...
   1:kmod-kvdo-6.1.0.98-11.el7.centos ################################# [100%]
[root@lab101 ~]# yum install PyYAML   
[root@lab101 ~]# rpm -ivh vdo-6.1.0.98-13.x86_64.rpm 
Preparing...                          ################################# [100%]
Updating / installing...
   1:vdo-6.1.0.98-13                  ################################# [100%]

到这里安装就完成了

配置VDO

创建一个vdo卷

[root@lab101 ~]# vdo create --name=my_vdo  --device=/dev/sdb1   --vdoLogicalSize=80G --writePolicy=sync
Creating VDO my_vdo
Starting VDO my_vdo
Starting compression on VDO my_vdo
VDO instance 0 volume is ready at /dev/mapper/my_vdo

参数解释：
name是创建的vdo名称，也就是生成的新设备的名称，device是指定的设备，vdoLogicalSize是指定新生成的设备的大小，因为vdo是支持精简配置的，也就是你原来1T的物理空间，这里可以创建出超过1T的逻辑空间，因为内部支持重删，可以根据数据类型进行放大，writePolicy是指定写入的模式的

如果磁盘设备是write back模式的可以设置为aysnc，如果没有的话就设置为sync模式

如果磁盘没有写缓存或者有write throuth cache的时候设置为sync模式
如果磁盘有write back cache的时候就必须设置成async模式

默认是sync模式的，这里的同步异步实际上是告诉vdo，我们的底层存储是不是有写缓存，有缓存的话就要告诉vdo我们底层是async的，没有缓存的时候就是sync

检查我们的磁盘的写入方式

[root@lab101 ~]# cat /sys/block/sdb/device/scsi_disk/0:0:1:0/cache_type 
write through

这个输出的根据上面的规则，我们设置为sync模式

修改缓存模式的命令

vdo changeWritePolicy --writePolicy=sync_or_async --name=vdo_name

格式化硬盘

[root@lab101 ~]# mkfs.xfs -K /dev/mapper/my_vdo 
meta-data=/dev/mapper/my_vdo     isize=512    agcount=4, agsize=5242880 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=20971520, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=10240, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

使用-K参数是加速了格式化的操作，也就是不发送丢弃的请求，因为之前创建了vdo，已经将其初始化为0了，所以可以采用这个操作

我们挂载的时候最好能加上discard的选项，精简配置的设备需要对之前的空间进行回收，一般来说有在线的和离线的回收，离线的就通过fstrim来进行回收即可

挂载设备

[root@lab101 ~]# mount -o discard /dev/mapper/my_vdo /myvod/

[root@lab101 ~]# vdostats --human-readable 
Device                    Size      Used Available Use% Space saving%
/dev/mapper/my_vdo       50.0G      4.0G     46.0G   8%           99%

默认创建完vdo设备就会占用4G左右的空间，这个用来存储UDS和VDO的元数据

检查重删和压缩是否开启

[root@lab101 ~]# vdo status -n my_vdo|grep Deduplication
    Deduplication: enabled
[root@lab101 ~]# vdo status -n my_vdo|grep Compress
    Compression: enabled

如果没有开启，可以通过下面的命令开启

vdo enableCompression -n <vdo_vol_name>
vdo enableDeduplication -n <vdo_vol_name>

验证重删功能

[root@lab101 ~]# df -h|grep vdo
/dev/mapper/my_vdo   80G   33M   80G   1% /myvod
[root@lab101 ~]# vdostats --hu
Device                    Size      Used Available Use% Space saving%
/dev/mapper/my_vdo       50.0G      4.0G     46.0G   8%           99%

传入一个ISO文件CentOS-7-x86_64-NetInstall-1708.iso 422M的

[root@lab101 ~]# df -h|grep vdo
/dev/mapper/my_vdo   80G  455M   80G   1% /myvod
[root@lab101 ~]# vdostats --hu
Device                    Size      Used Available Use% Space saving%
/dev/mapper/my_vdo       50.0G      4.4G     45.6G   8%            9%

然后重复传入3个相同文件，一共四个文件

[root@lab101 ~]# df -h|grep vdo
/dev/mapper/my_vdo   80G  1.7G   79G   3% /myvod
[root@lab101 ~]# vdostats --hu
Device                    Size      Used Available Use% Space saving%
/dev/mapper/my_vdo       50.0G      4.4G     45.6G   8%           73%

可以看到后面传入的文件，并没有占用底层存储的实际空间

验证压缩功能

测试数据来源 silesia的资料库

http://sun.aei.polsl.pl/~sdeor/corpus/silesia.zip

通过资料库里面的文件来看下对不同类型的数据的压缩情况

Filename	描述	类型	原始空间（KB）	实际占用空间（KB）
dickens	狄更斯文集	英文原文	9953	9948
mozilla	Mozilla的1.0可执行文件	执行程序	50020	33228
mr	医用resonanse图像	图片	9736	9272
nci	结构化的化学数据库	数据库	32767	10168
ooffice	Open Office.org 1.01 DLL	可执行程序	6008	5640
osdb	基准测试用的MySQL格式示例数据库	数据库	9849	9824
reymont	瓦迪斯瓦夫·雷蒙特的书	PDF	6471	6312
samba	samba源代码	src源码	21100	11768
sao	星空数据	天文格式的bin文件	7081	7036
webster	辞海	HTML	40487	40144
xml	XML文件	HTML	5220	2180
x-ray	透视医学图片	医院数据	8275	8260

可以看到都有不同程度的压缩，某些类型的数据压缩能达到50%的比例

停止vdo操作

[root@lab101 ~]# vdo stop  -n my_vdo

启动vdo操作

[root@lab101 ~]# vdo start  -n my_vdo

删除vdo操作

[root@lab101 ~]# vdo remove -n my_vdo

VDO和CEPH能产生什么火花？

在ceph里面可以用到vdo的地方有两个，一个是作为Kernel rbd的前端，在块设备的上层，另外一个是作为OSD的底层，也就是把VDO当OSD来使用，我们看下怎么使用

作为rbd的上层

[root@lab101 ceph]# rbd create testvdorbd --size 20G
[root@lab101 ceph]# rbd map testvdorbd

创建rbd的vdo

[root@lab101 ceph]# vdo create --name=rbd_vdo  --device=/dev/rbd/rbd/testvdorbd
Creating VDO rbd_vdo
vdo: ERROR -   Device /dev/rbd/rbd/testvdorbd not found (or ignored by filtering).

被默认排除掉了，这个以前正好见过类似的问题，比较好处理

这个地方因为vdo添加存储的时候内部调用了lvm相关的配置，然后lvm默认会排除掉rbd，这里修改下lvm的配置文件即可
在/etc/lvm/lvm.conf的修改如下

types = [ "fd", 16 ,"rbd", 64 ]

把types里面增加下rbd 的文件类型即可

[root@lab101 ceph]# vdo create --name=rbd_vdo  --device=/dev/rbd/rbd/testvdorbd
Creating VDO rbd_vdo
Starting VDO rbd_vdo
Starting compression on VDO rbd_vdo
VDO instance 2 volume is ready at /dev/mapper/rbd_vdo

挂载

mount -o discard /dev/mapper/rbd_vdo /mnt

查看容量

[root@lab101 mnt]# vdostats --human-readable
Device                    Size      Used Available Use% Space saving%
/dev/mapper/rbd_vdo      20.0G      4.4G     15.6G  22%            3%

[root@lab101 mnt]# ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED 
    57316M     49409M        7906M         13.79 
POOLS:
    NAME     ID     USED     %USED     MAX AVAIL     OBJECTS 
    rbd      0      566M      1.20        46543M         148

[root@lab101 mnt]# ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED 
    57316M     48699M        8616M         15.03 
POOLS:
    NAME     ID     USED      %USED     MAX AVAIL     OBJECTS 
    rbd      0      1393M      2.95        45833M         355

多次传入相同的时候可以看到对于ceph内部来说还是会产生对象的，只是这个在vdo的文件系统来看是不占用物理空间的

对镜像做下copy

[root@lab101 ~]# rbd cp testvdorbd testvdorbdclone

[root@lab101 ~]#rbd map  testvdorbdclone

[root@lab101 ~]# cat /etc/vdoconf.yml |grep device
      device: /dev/rbd/rbd/testvdorbdclone

修改配置文件为对应的设备，就可以启动了,这个操作说明vdo设备是不绑定硬件的，只需要有相关的配置文件，即可对文件系统进行启动

那么这个在一个数据转移用途下，就可以利用vdo对数据进行重删压缩，然后把整个img转移到远端去，这个也符合现在的私有云和公有云之间的数据传输量的问题，会节省不少空间

vdo作为ceph的osd

ceph对设备的属性有要求，这里直接采用目录部署的方式

[root@lab101 ceph]# vdo create --name sdb1 --device=/dev/sdb1
[root@lab101 ceph]# vdo create --name sdb2 --device=/dev/sdb2
[root@lab101 ceph]# mkfs.xfs -K -f /dev/mapper/sdb1
[root@lab101 ceph]# mkfs.xfs -K -f /dev/mapper/sdb2
[root@lab101 ceph]# mkdir /osd1
[root@lab101 ceph]# mkdir /osd2
[root@lab101 ceph]# mount /dev/mapper/sdb1 /osd1/
[root@lab101 ceph]# mount /dev/mapper/sdb2 /osd2/
[root@lab101 ceph]# chown ceph:ceph /osd1
[root@lab101 ceph]# chown ceph:ceph /osd2
[root@lab101 ceph]# ceph-deploy osd prepare lab101:/osd1/
[root@lab101 ceph]# ceph-deploy osd prepare lab101:/osd2/
[root@lab101 ceph]# ceph-deploy osd activate lab101:/osd1/
[root@lab101 ceph]# ceph-deploy osd activate lab101:/osd2/

写入测试数据

[root@lab101 ceph]# rados  -p rbd bench 60 write --no-cleanup

[root@lab101 ceph]# df -h
Filesystem        Size  Used Avail Use% Mounted on
/dev/sda2          56G  2.0G   54G   4% /
devtmpfs          983M     0  983M   0% /dev
tmpfs             992M     0  992M   0% /dev/shm
tmpfs             992M  8.8M  983M   1% /run
tmpfs             992M     0  992M   0% /sys/fs/cgroup
/dev/sda1        1014M  151M  864M  15% /boot
tmpfs             199M     0  199M   0% /run/user/0
/dev/mapper/sdb1   22G  6.5G   16G  30% /osd1
/dev/mapper/sdb2   22G  6.5G   16G  30% /osd2
[root@lab101 ceph]# vdostats --human-readable 
Device                    Size      Used Available Use% Space saving%
/dev/mapper/sdb2         25.0G      3.0G     22.0G  12%           99%
/dev/mapper/sdb1         25.0G      3.0G     22.0G  12%           99%

可以看到虽然在df看到了空间的占用，实际上因为rados bench写入的是填充的空洞数据，vdo作为osd对数据直接进行了重删了，测试可以看到vdo是可以作为ceph osd的，由于我的测试环境是在vmware虚拟机里面的，所以并不能做性能测试，有硬件环境的情况下可以对比下开启vdo和不开启的情况的性能区别

参考文档

vdo-qs-creating-a-volume
Determining the space savings of virtual data optimizer (VDO) in RHEL 7.5 Beta

总结

本篇从配置和部署以及适配方面对vdo进行一次比较完整的实践，从目前的测试情况来看，配置简单，对环境友好，基本是可以作为一个驱动层嵌入到任何块设备之上的，未来应该有广泛的用途，目前还不清楚红帽是否会把这个属性放到centos下面去，目前可以通过在https://access.redhat.com/downloads/ 申请测试版本的ISO进行功能的测试

应该是农历年前的最后一篇文章了，祝新春快乐！

变更记录

Why	Who	When
创建	武汉-运维-磨渣	2018-02-10

Source: zphj1987@gmail (REDHAT 7.5beta 新推出的VDO功能)

The post REDHAT 7.5beta 新推出的VDO功能 appeared first on Ceph.

↧

How to do a Ceph cluster maintenance/shutdown

February 18, 2018, 11:18 pm

≫ Next: The Ceph Dashboard v2 pull request is ready for review!

≪ Previous: REDHAT 7.5beta 新推出的VDO功能

Last week someone asked on the ceph-users ML how to shutdown a Ceph cluster
and I would like to summarize the steps that are neccessary to do that.

Stop the clients from using your Cluster
(this step is only neccessary if you want to shutdown your whole cluster)
Important – Make sure that your cluster is in a healthy state before proceeding

Now you have to set some OSD flags:

# ceph osd set noout
# ceph osd set nobackfill
# ceph osd set norecover

Those flags should be totally suffiecient to safely powerdown your cluster but you
could also set the following flags on top if you would like to pause your cluster completely::

# ceph osd norebalance
# ceph osd nodown
# ceph osd pause

## Pausing the cluster means that you can't see when OSDs come
back up again and no map update will happen

Shutdown your service nodes one by one
Shutdown your OSD nodes one by one
Shutdown your monitor nodes one by one
Shutdown your admin node

After maintenance just do everything mentioned above in reverse order.

Source: SUSE (How to do a Ceph cluster maintenance/shutdown)

The post How to do a Ceph cluster maintenance/shutdown appeared first on Ceph.

↧

The Ceph Dashboard v2 pull request is ready for review!

March 1, 2018, 4:55 am

≫ Next: CephFS Admin Tips – Create a new user and share

≪ Previous: How to do a Ceph cluster maintenance/shutdown

About a month ago, we shared the news that we
started working on a replacement for the Ceph dashboard, to set the stage for
creating a full-fledged, built-in web-base management tool for Ceph.

We’re happy to announce that we have now finalized the preparations for the
initial pull request, which marks
our first milestone in this venture: reaching feature parity with the existing
dashboard.

/galleries/ceph-dashboard-v2-screenshots-2018-02-23/dashboard-v2-health.png

Screen shot of the Ceph health dashboard

In fact, compared to the dashboard shipped with Ceph Luminous, we already
included a number of additional features that were added after the Luminous
release and added a simple authentication mechanism.

CephFS Admin Tips – Create a new user and share

March 6, 2018, 8:42 am

≫ Next: openATTIC 3.6.2 has been released

≪ Previous: The Ceph Dashboard v2 pull request is ready for review!

Hi my name is Stephen McElroy, and in this guide I will be showing how to create a new user, set permissions, set quotas, mount the share, and make them persistent on the client.

Creating the user

On the Ceph admin nodeLets create a basic user and give it capabilities to read the / and the /test_folder in CephFS.

$ ceph-authtool --create-keyring /etc/ceph/ceph.client.test_user.keyring --gen-key -n client.test_user
$ vi /etc/ceph/ceph.client.test_user.keyring
# Initial keyring only has key value
[client.test_user]
    key = AQAX4PBZw5tcGhAaaaaaBCSJR8qZ25uQB3yYA2gw==
    
# We will add in our capabilities here
    caps mds = "allow r path=/, allow rw path=/test_folder"
    caps mon = "allow r"
    caps osd = "allow class-read object_prefix rbd_children, allow rw pool=cephfs_data"

Once we are done adding capabilities, we will use ceph auth import to update -or- create our user entry. I personally like this way of updating the capabilities for a user for two reasons. First, it allows me to backup clients CAPS, most importantly, It allows me to not accidentally override their CAPS with ceph auth caps command.

1	$ ceph auth import -i /etc/ceph/ceph.client.test_user.keyring

If you don’t already have CephFS mounted somewhere to be able to create directories, lets mount the root directory now. Then create a subdirectory names test_folder.

Note – If you want to set user quotas on directory, use ceph-fuse when mounting. So far its the only way I’ve been able to get quotas to work.

1
2
3

$ mkdir /mnt/cephfs
$ ceph-fuse /mnt/cephfs
$ mkdir /mnt/cephfs/test_folder

Lets set a quota on test_folder.

1 2	$ cd /mnt/cephfs/ $ setfattr -n ceph.quota.max_bytes -v 107300000000 test_folder

Lets mount up the test folder to ensure quotas worked.

$ ceph-fuse -r /test_folder /mnt/cephfs
$ df -h
~~~
ceph-fuse                     100G     0  100G   0% /mnt/cephfs

Next, install packages for ceph-fuse

1	$ yum install ceph-fuse

Create Mount Points

Copy over your client key you made on the admin node, and ceph.conf, to “/etc/ceph/“
Then we will make two directories that will be use for mounting CephFS.
Personally I like to keep the mount directory and Ceph directory name the same.

1 2	$ mkdir /etc/cephfs_root $ mkdir /etc/test_folder

Make this a persistent mount by adding entries in “/etc/fstab”. Change the information as needed.

# Ceph Fuse mount of root cephfs
id=test_user,conf=/etc/ceph/ceph.conf,client_mountpoint=/ /mnt/cephfs_root fuse.ceph noatime 0 0
# Specific Directory in cephfs
id=test_user,conf=/etc/ceph/ceph.conf,client_mountpoint=/test_folder /mnt/test_folder fuse.ceph noatime 0 0

Run mount -a and df -h to ensure everything mounted correctly.

$ mount -a
$ df -h
~~~
ceph-fuse                     4.2E     0  4.2E   0% /mnt/cephfs_root
ceph-fuse                     100G     0  100G   0% /mnt/test_folder

Fin

There you have it, you should now have a fully working CephFS share. I hope this helps out peeps and makes like a little easier. If this even helped out one admin, then it was well worth it. If you have any questions, or need to hire a Ceph Engineer, free to contact me at magusnebula@gmail.com!

Source: Stephen McElroy (CephFS Admin Tips – Create a new user and share)

The post CephFS Admin Tips – Create a new user and share appeared first on Ceph.

↧

openATTIC 3.6.2 has been released

March 7, 2018, 4:20 am

≫ Next: The initial Ceph Dashboard v2 pull request has been merged!

≪ Previous: CephFS Admin Tips – Create a new user and share

We’re happy to announce version 3.6.2 of openATTIC!

Version 3.6.2 is the second bugfix release of the 3.6 stable branch, containing fixes for multiple
issues that were reported by users.

One new feature that we want to point out is the internationalization. openATTIC has been
translated to Chinese and German to be present on other markets as well.
We are working on other translations, for example Spanish. If you would like to see your native
language as part of openATTIC get in touch with us and we guide you how you can contribute and
help us with the translation.
We also had some packaging changes: Due to new requirements, we now use _fillupdir RPM macro in
our SUSE spec file.

As usual the release comes with several usability enhancements and security improvements.
For example we improved the modal deletion dialog in general – instead of just entering “yes”
when deleting an item, it is now required to enter the item name itself – so users do not
accidentally remove the wrong item.
Furthermore we fixed incorrect API endpoint URLs for RGW buckets.
We also adapted/changed some test cases – e.g. e2e tests were converted into Angular unit tests.

The initial Ceph Dashboard v2 pull request has been merged!

March 13, 2018, 2:02 am

≫ Next: Ansible module to create CRUSH hierarchy

≪ Previous: openATTIC 3.6.2 has been released

It actually happened exactly one week ago while I was on vacation: it’s our
great pleasure and honor to announce that we have reached our first milestone –
the initial Ceph Dashboard v2 pull request has now been merged into the upstream
Ceph master git branch, so it will become part of the upcoming Ceph “Mimic”
release!

Ansible module to create CRUSH hierarchy

March 15, 2018, 2:32 pm

≫ Next: Huge changes in ceph-container

≪ Previous: The initial Ceph Dashboard v2 pull request has been merged!

Title

First post of the year after a long time with no article, three months…
I know it has been a while, I wish I had more time to do more blogging.
I have tons of draft articles that never made it through, I need to make up for lost time.

So for this first post, let me introduce an Ansible I wrote for ceph-ansible: ceph_crush.

I. Rationale

ceph-ansible is feature-full, but we lack modules.
I’ve long thought that everything that can be done via a simple command task in Ansible does not deserve a module.
I was wrong.

Day 2 operations, as we call them, refers to consuming and giving access to the storage.
In the context of Ceph, this means several things:

RGW Configuration
- Users
- Buckets
- Bucket policies
- S3 acls
RBD
- Create/delete/modify RBD images
- Map them if kRBD
Mon:
- create pools
- create user and keys

Of course, all of that can be handled by the main playbook, but people are unlikely going to re-run the entire playbook to do that.
What they want is a simple playbook via a simple interface to interact with the cluster.
They don’t want to know anything about Ceph and its CLI, this only thing they care about is to finalize the task they were assigned too.

One the idea behind this is to unify the operational experience through a standard interface, which Ansible and language description, YAML.

II. Ceph CRUSH module

This module, as its name state, allows you to create CRUSH hierarchy.
The creation is done by passing to each host of your inventory a dictionary containing a set of keys where each determines a CRUSH bucket location.
Here is an inventory example:

ceph-osd-0 osd_crush_location="{ 'root': 'mon-roottt', 'rack': 'mon-rackkkk', 'pod': 'monpod', 'host': 'ceph-osd-0' }"

The module is configured like this:

- name: configure crush hierarchy
  ceph_crush:
    cluster: "{{ cluster }}"
    location: "{{ hostvars[item]['osd_crush_location'] }}"
    containerized: "{{ docker_exec_cmd }}"
  with_items: "{{ groups[osd_group_name] }}"

The resulting CRUSH map will be following:

ID CLASS WEIGHT  TYPE NAME                STATUS REWEIGHT PRI-AFF
-5       0.09738 root mon-roottt
-4       0.09738     pod monpod
-3       0.09738         rack mon-rackkkk
-2       0.09738             host ceph-osd-0

The module takes care of the ordering for you so that you can declare the keys of osd_crush_location in any order.
The pre-requisites for the module to successfully run are the following:

at least two buckets must be declared
a ‘host’ bucket must be declared

That’s it :).

This module saves us from hundreds of complex Ansible lines. As I said, more modules are coming for daily operations so stay tuned!
We are planing on adding this module to Ansible core and we are aiming for 2.6.

Source: Sebastian Han (Ansible module to create CRUSH hierarchy)

The post Ansible module to create CRUSH hierarchy appeared first on Ceph.

↧

Huge changes in ceph-container

March 18, 2018, 4:19 pm

≫ Next: See you at the first Cephalocon

≪ Previous: Ansible module to create CRUSH hierarchy

Title

A massive refactor done a week ago on ceph-container.
And yes, I’m saying ceph-container, not ceph-docker anymore.
We don’t have anything against Docker, we believe it’s excellent and we use it extensively.
However, having the ceph-docker name does not reflect the content of the repository.
Docker is only the Dockerfile, the rest is either entrypoints or examples.
In the end, we believe ceph-container is a better match for the repository name.

I. We were doing it wrong…

Hosting and building images from the Docker Hub made us do things wrong.
The old structure we came up with was mostly to workaround the Docker Hub’s limitation which is basically:

You can not quickly build more than one image from a single repository. We have multiple Linux distribution and Ceph releases to support. This was a show stopper for us.

To workaround this, we designed a branching strategy which primarily consisted of each branch as a specific version of the code (distribution and Ceph release), and so at the root of the repository, we had a daemon directory so Docker Hub would fetch all of that and build our images.

The master branch, the one containing all the distribution and Ceph releases had a bunch a symlinks everywhere making the whole structure hard to maintain, modify and this without impacting the rest. Moreover, we had sooo much code duplication, terrible.

But with that, we lost traceability of the code inside the images.
Since the image name was always the same (the tag) and got overwritten for each new content on master (or stable branch).
We only had a single version of a particular distribution and Ceph release.
This made rollbacks pretty hard to achieve for anyone who removed the previous image…

II. New structure: the matriochka approach

The new structure allows us to isolate each portion of the code, from distribution to Ceph release.
One can maintain its distribution; this eases cont maintainer’s life. Importantly, symlinks and code duplication are no more.
The code base has dropped too, 2,204 additions and 8,315 deletions.

For an in-depth description of this approach, please refer to the slides at the end of the blog post.

III. Make make make!

Some would say “Old School,” I’d say, we don’t need to re-invent the wheel and clearly make has demonstrated to be robust.
Our entire image build process relies on make.

So the make approach lets you do a bunch of things, see the list:

Usage: make [OPTIONS] ... <TARGETS>

TARGETS:

  Building:
    stage             Form staging dirs for all images. Dirs are reformed if they exist.
    build             Build all images. Staging dirs are reformed if they exist.
    build.parallel    Build default flavors in parallel.
    build.all         Build all buildable flavors with build.parallel
    push              Push release images to registry.
    push.parallel     Push release images to registy in parallel

  Clean:
    clean             Remove images and staging dirs for the current flavors.
    clean.nones       Remove all image artifacts tagged <none>.
    clean.all         Remove all images and all staging dirs. Implies "clean.nones".
                      Will only delete images in the specified REGISTRY for safety.
    clean.nuke        Same as "clean.all" but will not be limited to specified REGISTRY.
                      USE AT YOUR OWN RISK! This may remove non-project images.

  Testing:
    lint              Lint the source code.
    test.staging      Perform stageing integration test.

  Help:
    help              Print this help message.
    show.flavors      Show all flavor options to FLAVORS.
    flavors.modified  Show the flavors impacted by this branch's changes vs origin/master.
                      All buildable flavors are staged for this test.
                      The env var VS_BRANCH can be set to compare vs a different branch.

OPTIONS:

  FLAVORS - ceph-container images to operate on in the form
    <ceph rel>,<arch>,<os name>,<os version>,<base registry>,<base repo>,<base tag>
    and multiple forms may be separated by spaces.
      ceph rel - named ceph version (e.g., luminous, mimic)
      arch - architecture of Ceph packages used (e.g., x86_64, aarch64)
      os name - directory name for the os used by ceph-container (e.g., ubuntu)
      os version - directory name for the os version used by ceph-container (e.g., 16.04)
      base registry - registry to get base image from (e.g., "_" ~ x86_64, "arm64v8" ~ aarch64)
      base repo - The base image to use for the daemon-base container. generally this is
                  also the os name (e.g., ubuntu) but could be something like "alpine"
      base tag - Tagged version of the base os to use (e.g., ubuntu:"16.04", alpine:"3.6")
    e.g., FLAVORS_TO_BUILD="luminous,x86_64,ubuntu,16.04,_,ubuntu,16.04 
                            luminous,aarch64,ubuntu,16.04,arm64v8,alpine,3.6"

  REGISTRY - The name of the registry to tag images with and to push images to.
             Defaults to "ceph".
    e.g., REGISTRY="myreg" will tag images "myreg/daemon{,-base}" and push to "myreg".

  RELEASE - The release version to integrate in the tag. If omitted, set to the branch name.

IV. We are back to two images

daemon-base is back!
For a while we used to have daemon and base, then we dropped base to include everything in daemon.
However, we recently started to work on Rook.
Rook was having its own Ceph container image; they shouldn’t have to build a Ceph image, we should be providing one.

So now, we have two images:

daemon-base, contains Ceph packages
daemon, contains daemon-base plus ceph-container’s entrypoint / specific packages

So now Rook can build its Rook image but from daemon-base and then add their Rook binary on top of it.
This is not only true for Rook but for any project that would like to use a Ceph container image.

V. Moving away from automated builds

We spent too much time workarounding Docker Hub’s limitation. This even caused us to go with our previous terrible approach.
Now things are different. We are no longer using automated builds from the Docker Hub; we just use it as a registry to store our Ceph images.
Each time a pull request is merged into Github, our CI runs a job that builds and push images to the Docker Hub.
We also have a similar mechanism we stable releases, each time we tag a new release our CI runs triggers a job that builds that our stable container images.

Current images that can be found on this Docker Hub page.

Later, we are planning on pushing our images on Quay, before we do I’d just like to find who’s using the Ceph organization or the Ceph username as I can’t create any of the two… Once this is solved, we will have a Ceph organization on Quay, and we will start pushing Ceph container images in it.

VI. Lightweight baby! (container images)

We now have smaller container images; we went from almost 1GB unzipped to 600MB.
The build mechanism shrinks all the layers to a single one; this drastically reduces the size of the final container image.
Compressed the images went from 320 MB to 231 MB. So this 100MB saved, which is nice.
We could go further, but we decided it was too time-consuming and the value versus the risk is low.

These are just a couple of highlights, if you want to learn more, you should look into this presentation.
So you will learn more about the new project structure, our templating mechanism, and more benefits.

This is huge for ceph-container and I’m so proud of what we achieved. Big shout out to Blaine Gardner and Erwan Velu who did this refactoring work.

Source: Sebastian Han (Huge changes in ceph-container)

The post Huge changes in ceph-container appeared first on Ceph.

↧

See you at the first Cephalocon

March 20, 2018, 3:52 am

≫ Next: parted会启动你的ceph osd，意外不？

≪ Previous: Huge changes in ceph-container

Title

Tomorrow, the first conference fully dedicated to Ceph will start in Beijin, China.
I’m attending and super excited.
I will see you there!

Source: Sebastian Han (See you at the first Cephalocon)

The post See you at the first Cephalocon appeared first on Ceph.

↧

parted会启动你的ceph osd，意外不？

March 23, 2018, 8:54 am

≫ Next: Handling app signals in containers

≪ Previous: See you at the first Cephalocon

disk

前言

如果看到标题，你是不是第一眼觉得写错了，这个怎么可能，完全就是两个不相关的东西，最开始我也是这么想的，直到我发现真的是这样的时候，也是很意外，还是弄清楚下比较好，不然在某个操作下，也许就会出现意想不到的情况

定位

如果你看过我的博客，正好看过这篇ceph在centos7下一个不容易发现的改变，那么应该还记得这个讲的是centos 7 下面通过udev来实现了osd的自动挂载，这个自动挂载就是本篇需要了解的前提

[root@lab101 ~]# df -h|grep ceph
/dev/sdf1                233G   34M  233G   1% /var/lib/ceph/osd/ceph-1
[root@lab101 ~]# systemctl stop ceph-osd@1
[root@lab101 ~]# umount /dev/sdf1 
[root@lab101 ~]# parted -l &>/dev/null
[root@lab101 ~]# df -h|grep ceph
/dev/sdf1                233G   34M  233G   1% /var/lib/ceph/osd/ceph-1
[root@lab101 ~]# ps -ef|grep osd
ceph      62701      1  1 23:25 ?        00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph
root      62843  35114  0 23:25 pts/0    00:00:00 grep --color=auto osd

看这个操作过程，是不是很神奇，是不是很意外，不管怎么说，parted -l的一个操作把我们的osd给自动mount 起来了，也自动给启动了

出现这个以后，我们先看下日志怎么出的，大概看起来的是这样的
parted.gif-4083.1kB

可以看到确实是实时去触发的

服务器上面是有一个这个服务的

systemd-udevd.service
看到在做parted -l 后就会起一个这个子进程的

在尝试关闭这个服务后，再做parted -l操作就不会出现自动启动进程

原因

执行parted -l 对指定设备发起parted命令的时候，就会对内核做一个trigger，而我们的

/lib/udev/rules.d/95-ceph-osd.rules
这个文件一旦触发是会去调用
/usr/sbin/ceph-disk —log-stdout -v trigger /dev/$name

也就是自动挂载加上启动osd的的操作了

可能带来什么困扰

其实这个我也不知道算不算bug，至少在正常使用的时候是没有问题的，以至于这个功能已经有了这么久，而我并没有察觉到，也没有感觉到它给我带来的干扰，那么作为一名测试人员，现在来构思一种可能出现的破坏场景，只要按照正常操作去做的，还会出现的，就是有可能发生的事情

cd /var/lib/ceph/osd/
[root@lab101 osd]# df -h|grep osd
/dev/sdf1                233G   34M  233G   1% /var/lib/ceph/osd/ceph-1
[root@lab101 osd]# systemctl stop ceph-osd@1
[root@lab101 osd]# umount /dev/sdf1
[root@lab101 osd]# parted -l  &>/dev/null
[root@lab101 osd]# rm -rf ceph-1/
rm: cannot remove ‘ceph-1/’: Device or resource busy
[root@lab101 osd]# ll ceph-1/
total 0
[root@lab101 osd]# df -h|grep ceph
/dev/sdf1                233G   33M  233G   1% /var/lib/ceph/osd/ceph-1

可以看到除了上面的parted -l以外，其他操作都是一个正常的操作，umount掉挂载点，然后清理掉这个目录，然后数据就被删了，当然正常情况下也许没人在正好那个点来了一个parted,但是不是完全没有可能

还有种情况就是我是要做维护，我想umount掉挂载点，不想进程起来，执行parted是很常规的操作了，结果自己给我拉起来了，这个操作应该比较常见的

如何解决这个情况

第一种方法
什么都不动，你知道这个事情就行，执行过parted后再加上个df多检查下

第二种方法

systemctl stop systemd-udevd

这个会带来其他什么影响，暂时不好判断，还没深入研究，影响应该也只会在硬件变动和一些udev触发的需求，不确定的情况可以不改，不推荐此方法

第三种方法
不用这个/lib/udev/rules.d/95-ceph-osd.rules做控制了，自己去写配置文件，或者写fstab，都可以，保证启动后能够自动mount，服务能够正常启动就可以了，个人从维护角度还是偏向于第三种方法，记录的信息越多，维护的时候越方便，这个是逼着记录了一些信息，虽然可以什么信息也不记

总结

其实这个问题梳理清楚了也还好，最可怕的也许就是不知道为什么，特别是觉得完全不搭边的东西相互起了关联，至少在我们的研发跟我描述这个问题的时候，我想的是，还有这种神操作，是不是哪里加入了钩子程序什么的，花了点时间查到了原因，也方便在日后碰到不那么惊讶了

ceph北京大会已经顺利开完了，等PPT出来以后再学习一下新的东西，内容应该还是很多的，其实干货不干货，都在于你发现了什么，如果有一个PPT里面你提取到了一个知识点，你都是赚到了，何况分享的人并没有告知的义务的，所以每次看到有分享都是很感谢分享者的

变更记录

Why	Who	When
创建	武汉-运维-磨渣	2018-03-23

Source: zphj1987@gmail (parted会启动你的ceph osd，意外不？)

The post parted会启动你的ceph osd，意外不？ appeared first on Ceph.

↧

Handling app signals in containers

March 26, 2018, 2:22 pm

≫ Next: The Ceph MON synchronization (election)

≪ Previous: parted会启动你的ceph osd，意外不？

Title

A year ago, I was describing how we were debugging our ceph containers; today I’m back with yet another great thing we wrote :).
Sometimes, when a process receives a signal and if that process runs within a container, you might want to do something before or after its termination.
That’s what we are going to discuss.

Running actions before or after terminating a process

Performing actions before or after a process get terminated on a host is easy because you don’t lose its environment.
In the micro-services world, your application runs in a container, and this application is PID 1, which means if it exits then your container goes away.

However, sometimes you want to gracefully terminate your programs, just like if they were running on a host and that systemd was doing this for you.
For example, on ceph-container we realized at some point that stopping an OSD running on an encrypted partition (dmcrypt + LUKS) was causing issues.
Indeed LUKS was not being closed after the OSD process exited which caused us a lot of trouble to merely do a restart of that container.

Typically what we are looking at here is to unmount OSD partitions and close LUKS devices, but after the OSD termination.
Remember the lines above, how can you perform that action if the container stops? Well your LUKS remained open and stuck in your dead container namespace… Not appealing right?

Fortunately, we came up with a solution that supersedes our debugging mechanism.

As explained in the previous article we remapped the exec function.
Traditionally, we start our container process with exec, so we fork the entrypoint process by breaking any relationship with it. Our new exec function contains a trap that ‘traps’ signal, we look for SIGTERM here. If the container receives a SIGTERM by let’s say docker stop then our trap gets activated. The trap calls a function that has two capabilities:

run a pre task function before sending SIGTERM to the process
run a post task function after SIGTERM was sent to the process

In our scenario, this is our sigterm_cleanup_post function.

Et voilà, that’s how you handle signal for your containers.

More articles to follow on containers!

Source: Sebastian Han (Handling app signals in containers)

The post Handling app signals in containers appeared first on Ceph.

↧

The Ceph MON synchronization (election)

March 28, 2018, 10:01 am

≫ Next: Ceph Dashboard v2 update

≪ Previous: Handling app signals in containers

The Ceph MON synchronization (election)

Here recently I had got asked a question about Ceph that I wasn’t entirely sure how to answer. It had to do with how the synchronization (election) process worked between monitors. I had an idea, but wasn’t quite sure. So here is a quick synopsis of what I found out.

Ceph monitor needs to join the quorum

When a Ceph monitor needs to regain its status in the cluster, it goes through a pretty simple process. For this purpose, each monitor has a role to play. The roles are as follows:

Leader: The leader is the first monitor to achieve the most recent version of the cluster map. And like all good leaders, this monitor will delegate sync duties to a Provider, as not to over burden himself.
Provider: When it comes to the Cluster map olympics, this was the guy who got silver. He has the most recent version of the cluster map, he just wasn’t the first to achieve it. He will be delegated sync duties from the Leader and will then sync his cluster map with the …
Requester: The monitor that wants to join the cool kids club. He no longer has the most recent info, and will make a request to the leader to to join. Before he can do that though the leader will want him to sync up with another monitor.

Lets see how this process would go in normal operation. If this were from a movie, would it be from “The MONchurian Candidate”?

# Ask to sync
Requester: Hey Leader my cluster map is outta date,
and I want back in to the quorum.
Leader: Look man I really don't have time for this,
talk to Provider Cephmon2 to get back up to speed.
# Sync with provider
Requester: Hey bro, the leader told me to talk to you
about getting the current cluster map.
Provider: No problem, I'll send these over to you in
chances, just let me know you received them ok
Requester: Cool man I got them all and I'm up to date.
# Let the Leader know you're done
Requester: Hey there Leader, my sync is done
everything is good to go.
Leader: It sure is, welcome to Quorum, bro.
### END SCENE ###

And thats it in a nutshell. As always, if this even helped out one admin, then it was well worth it. For a more complete and deep dive into this process check out Ceph Monitor Config Reference. Thanks for reading and feel free to contact me if you have any questions!

Source: Stephen McElroy (The Ceph MON synchronization (election))

The post The Ceph MON synchronization (election) appeared first on Ceph.

↧

Ceph Dashboard v2 update

April 9, 2018, 3:30 am

≫ Next: ceph的ISCSI GATEWAY

≪ Previous: The Ceph MON synchronization (election)

It’s been a little over a month now since we reached Milestone
1 (feature parity
with Dashboard v1), which was merged into the Ceph master branch on
2018-03-06.

After the initial merge, we had to resolve a few build and packaging related
issues, to streamline the ongoing development, testing and packaging of the new
dashboard as part of the main Ceph project.

With these teething problems out of the way, the team has started working on
several topics in parallel. A lot of these are “groundwork/foundation” kind of
tasks, e.g. adding UI components and backend functionality that pave the way to
enable the additional user-visible management features.

In the meanwhile, we have submitted over 80 additional pull requests, of
which more than 60 have been merged already.

In this post, I’d like to summarize some of the highlights and notable
improvements we’re currently working on or that have been added to the code base
already. This is by no means a complete list – it’s more a subjective selection
of changes that caught my attention.

It’s also noteworthy that we’ve already received a number of pull requests from
Ceph community members outside of the original openATTIC team that started this
project – we’re very grateful for the support and look forward to future
contributions!