Channel: Planet – Ceph

Ceph iSCSI Gateway


Preface

I first came across this in the Luminous monitoring dashboard, which showed an iSCSI gateway, but there was no real introduction to it, so I dug up some information through the API. At the time quite a few pieces required a new kernel and newer package versions, so I never got it configured. Now that the kernel has gone through a few more point releases, I have verified through testing that it can be brought up. This post only covers getting everything running; performance comparisons still need to be done separately.

Hands-on walkthrough

Architecture

[Figure: Ceph iSCSI HA architecture, taken from Red Hat's documentation]

The diagram above is Red Hat's architecture diagram. It can be understood as a multipath setup, so what makes it different from what came before?

The main difference is the new tcmu-runner, a daemon that handles the userspace side of LIO TCM-backed storage. It adds a userspace driver layer on top of the kernel, so a backend only needs to implement the TCMU interface instead of talking to the kernel directly.

Required software

A Ceph Luminous cluster or newer
RHEL/CentOS 7.5, or a Linux kernel v4.16 or newer
Other control packages:

targetcli-2.1.fb47 or newer package
python-rtslib-2.1.fb64 or newer package
tcmu-runner-1.3.0 or newer package
ceph-iscsi-config-2.4 or newer package
ceph-iscsi-cli-2.5 or newer package

The above is what this setup requires. Below are the versions I used, bundled together under a single download path.
The versions I installed are:

kernel-4.16.0-0.rc5.git0.1
targetcli-fb-2.1.fb48
python-rtslib-2.1.67
tcmu-runner-1.3.0-rc4
ceph-iscsi-config-2.5
ceph-iscsi-cli-2.6

Download link:

Link: https://pan.baidu.com/s/12OwR5ZNtWFW13feLXy3Ezg  password: m09k

If other versions were installed before, uninstall them first, and have an up-to-date Luminous cluster deployed in advance.

Officially recommended parameter adjustments:

# ceph tell osd.* injectargs '--osd_client_watch_timeout 15'
# ceph tell osd.* injectargs '--osd_heartbeat_grace 20'
# ceph tell osd.* injectargs '--osd_heartbeat_interval 5'

Configuration

Create a storage pool

An rbd pool is needed to store the iSCSI configuration objects; create a pool named rbd in advance.
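A minimal sketch of creating it (the PG count of 32 here is an assumption; size it for your own cluster):

ceph osd pool create rbd 32 32
# Luminous and later expect an application tag on the pool
ceph osd pool application enable rbd rbd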

Create the iscsi-gateway configuration file

touch /etc/ceph/iscsi-gateway.cfg

Edit the iscsi-gateway.cfg configuration file

[config]
# Name of the Ceph storage cluster. A suitable Ceph configuration file allowing
# access to the Ceph storage cluster from the gateway node is required, if not
# colocated on an OSD node.
cluster_name = ceph

# Place a copy of the ceph cluster's admin keyring in the gateway's /etc/ceph
# directory and reference the filename here
gateway_keyring = ceph.client.admin.keyring


# API settings.
# The API supports a number of options that allow you to tailor it to your
# local environment. If you want to run the API under https, you will need to
# create cert/key files that are compatible for each iSCSI gateway node, that is
# not locked to a specific node. SSL cert and key files *must* be called
# 'iscsi-gateway.crt' and 'iscsi-gateway.key' and placed in the '/etc/ceph/' directory
# on *each* gateway node. With the SSL files in place, you can use 'api_secure = true'
# to switch to https mode.

# To support the API, the bare minimum settings are:
api_secure = false

# Additional API configuration options are as follows, defaults shown.
# api_user = admin
# api_password = admin
# api_port = 5001
# trusted_ip_list = 192.168.0.10,192.168.0.11

Change trusted_ip_list on the last line to the IPs of the hosts used as gateways. In my environment:

trusted_ip_list =192.168.219.128,192.168.219.129

The contents of this file must be identical on all gateway nodes; edit it on one node and scp it to every gateway node.
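For example (lab102 here stands for a second gateway node; adjust the host name to your environment):

scp /etc/ceph/iscsi-gateway.cfg lab102:/etc/ceph/
# the gateways also need the cluster config and the admin keyring referenced above
scp /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring lab102:/etc/ceph/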

Start the API service

[root@lab101 install]# systemctl daemon-reload
[root@lab101 install]# systemctl enable rbd-target-api
[root@lab101 install]# systemctl start rbd-target-api
[root@lab101 install]# systemctl status rbd-target-api
● rbd-target-api.service - Ceph iscsi target configuration API
Loaded: loaded (/usr/lib/systemd/system/rbd-target-api.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2018-03-15 09:44:34 CST; 18min ago
Main PID: 1493 (rbd-target-api)
CGroup: /system.slice/rbd-target-api.service
└─1493 /usr/bin/python /usr/bin/rbd-target-api

Mar 15 09:44:34 lab101 systemd[1]: Started Ceph iscsi target configuration API.
Mar 15 09:44:34 lab101 systemd[1]: Starting Ceph iscsi target configuration API...
Mar 15 09:44:58 lab101 rbd-target-api[1493]: Started the configuration object watcher
Mar 15 09:44:58 lab101 rbd-target-api[1493]: Checking for config object changes every 1s
Mar 15 09:44:58 lab101 rbd-target-api[1493]: * Running on http://0.0.0.0:5000/

Configure iSCSI

Run the gwcli command to enter the configuration shell.

Go into iscsi-target and create a target:

/> cd iscsi-target 
/iscsi-target> create iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
ok

Create the iSCSI gateways. The IPs used below are for iSCSI data traffic; they can be the same as the management IPs listed in trusted_ip_list or different, depending on whether you separate traffic across NICs.

/iscsi-target> cd iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/
/iscsi-target...-gw:iscsi-igw> cd gateways
/iscsi-target...-igw/gateways> create lab101 192.168.219.128 skipchecks=true
OS version/package checks have been bypassed
Adding gateway, syncing 0 disk(s) and 0 client(s)
/iscsi-target...-igw/gateways> create lab102 192.168.219.129 skipchecks=true
OS version/package checks have been bypassed
Adding gateway, sync'ing 0 disk(s) and 0 client(s)
ok
/iscsi-target...-igw/gateways> ls
o- gateways ............. [Up: 2/2, Portals: 2]
o- lab101 ............. [192.168.219.128 (UP)]
o- lab102 ............. [192.168.219.129 (UP)]

Create an RBD device disk_1

/iscsi-target...-igw/gateways> cd /disks 
/disks> create pool=rbd image=disk_1 size=100G
ok

Create a client named iqn.1994-05.com.redhat:75c3d5efde0

/disks> cd /iscsi-target/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/hosts 
/iscsi-target...csi-igw/hosts> create iqn.1994-05.com.redhat:75c3d5efde0
ok

Create the CHAP username and password. Both have format requirements, so if you are unsure, use the values given here. CHAP must be set; otherwise the target refuses connections.

/iscsi-target...t:75c3d5efde0> auth chap=iqn.1994-05.com.redhat:75c3d5efde0/admin@a_12a-bb
ok

The CHAP naming rules can be queried like this:

/iscsi-target...t:75c3d5efde0> help auth

SYNTAX
======
auth [chap]


DESCRIPTION
===========

Client authentication can be set to use CHAP by supplying the
a string of the form <username>/<password>

e.g.
auth chap=username/password | nochap

username ... the username is 8-64 character string. Each character
may either be an alphanumeric or use one of the following
special characters .,:,-,@.
Consider using the hosts 'shortname' or the initiators IQN
value as the username

password ... the password must be between 12-16 chars in length
containing alphanumeric characters, plus the following
special characters @,_,-

WARNING: Using unsupported special characters may result in truncation,
resulting in failed logins.


Specifying 'nochap' will remove chap authentication for the client
across all gateways.

Add the disk to the client

/iscsi-target...t:75c3d5efde0> disk add rbd.disk_1
ok

At this point the configuration is complete; you can review the final layout with ls at the gwcli root.

Windows client configuration

When I tried this with Windows 10 the connection failed, possibly because Windows 10's own authentication requirements conflict with the target, so the connection test here uses Windows Server 2016 instead.

Enable Multipath I/O on Windows Server.

Change the Windows iSCSI initiator name to the client name created above.

Discover the portal

Click Discover Portal, fill in the target server's IP, and click OK; the Advanced settings are not needed at this step.

The Targets tab now shows a discovered target with status Inactive; click Connect.

Click Advanced, select the portal IP, and fill in the CHAP logon information: the name is the username set above (which is the same as the client/initiator name) and the password is admin@a_12a-bb as set above.

Switch to Volumes and Devices and click Auto Configure.

You can see the device has been mounted.

Check the device in Server Manager under File and Storage Services > Volumes > Disks.

It shows up as the configured LIO-ORG TCMU device; just format it.

The connection is complete.

Linux client connection

For the Linux client I recommend staying on the default 3.10 kernel; with newer kernels I hit kernel crashes while configuring multipath.
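A quick way to confirm the client kernel before continuing (a stock CentOS 7 kernel reports something like 3.10.0-xxx.el7.x86_64):

[root@lab103 ~]# uname -r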

Install the client software

[root@lab103 ~]# yum install iscsi-initiator-utils
[root@lab103 ~]# yum install device-mapper-multipath

Configure multipath

Enable the service

[root@lab103 ~]# mpathconf --enable --with_multipathd y

Edit the configuration file /etc/multipath.conf

devices {
device {
vendor "LIO-ORG"
hardware_handler "1 alua"
path_grouping_policy "failover"
path_selector "queue-length 0"
failback 60
path_checker tur
prio alua
prio_args exclusive_pref_bit
fast_io_fail_tmo 25
no_path_retry queue
}
}

Reload the multipath service

[root@lab103 ~]# systemctl reload multipathd

Configure CHAP authentication

Set the initiator name to the client name configured above:

[root@lab103 ~]# cat /etc/iscsi/initiatorname.iscsi 
InitiatorName=iqn.1994-05.com.redhat:75c3d5efde0

Edit the authentication settings:

[root@lab103 ~]# cat /etc/iscsi/iscsid.conf |grep -E "node.session.auth.username|node.session.auth.password|node.session.auth.authmethod"
node.session.auth.authmethod = CHAP
node.session.auth.username = iqn.1994-05.com.redhat:75c3d5efde0
node.session.auth.password = admin@a_12a-bb

Discover the iSCSI targets

[root@lab103 ~]# iscsiadm -m discovery -t st -p 192.168.219.128
192.168.219.128:3260,1 iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
192.168.219.129:3260,2 iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw

Log in to the target

[root@lab103 ~]# iscsiadm -m node -T iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw -l
Logging in to [iface: default, target: iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw, portal: 192.168.219.129,3260] (multiple)
Logging in to [iface: default, target: iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw, portal: 192.168.219.129,3260] (multiple)
Login to [iface: default, target: iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw, portal: 192.168.219.129,3260] successful.
Login to [iface: default, target: iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw, portal: 192.168.219.129,3260] successful.
[root@lab101 ~]# multipath -ll
mpathb (360014052fc39ba627874fdba9aefcf6c) dm-4 LIO-ORG ,TCMU device
size=100G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='queue-length 0' prio=10 status=active
| `- 5:0:0:0 sdc 8:32 active ready running
`-+- policy='queue-length 0' prio=10 status=enabled
`- 6:0:0:0 sdd 8:48 active ready running

Check the device

[root@lab101 ~]# parted -s /dev/mapper/mpathb print
Model: Linux device-mapper (multipath) (dm)
Disk /dev/mapper/mpathb: 107GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number Start End Size File system Name Flags
1 17.4kB 134MB 134MB Microsoft reserved partition msftres
2 135MB 107GB 107GB ntfs Basic data partition

Just use the /dev/mapper/mpathb device directly.

Change log

Why Who When
Created 武汉-运维-磨渣 2018-04-11

Source: zphj1987@gmail (ceph的ISCSI GATEWAY)



How to use cosbench


Preface

cosbench is very powerful, but it may not be obvious how to configure it. This post walks through the configuration of a test, along with some points to watch, so you do not end up unable to build the test model you had in mind.

Installation

cosbench works as one controller driving several drivers, which issue requests against the RGW backends.

Download the latest version

https://github.com/intel-cloud/cosbench/releases/download/v0.4.2.c4/0.4.2.c4.zip

[root@lab102 cosbench]# unzip 0.4.2.zip
[root@lab102 cosbench]# yum install java-1.7.0-openjdk nmap-ncat

The number of workloads that may run at the same time is controlled by the following controller parameter:

concurrency=1

The default is one. To make sure a single machine's hardware resources are sufficient, keep one workload per machine.
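For reference, a sketch of what the controller configuration (conf/controller.conf) can look like; the driver host addresses and count here are assumptions for a two-driver setup:

[controller]
concurrency=1
drivers = 2
log_level = INFO
log_file = log/system.log
archive_dir = archive

[driver1]
name = driver1
url = http://192.168.19.101:18088/driver

[driver2]
name = driver2
url = http://192.168.19.102:18088/driver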

Create an S3 user

[root@lab101 ~]# radosgw-admin user create --uid=test1 --display-name="test1" --access-key=test1  --secret-key=test1
{
"user_id": "test1",
"display_name": "test1",
"email": "",
"suspended": 0,
"max_buckets": 1000,
"auid": 0,
"subusers": [],
"keys": [
{
"user": "test1",
"access_key": "test1",
"secret_key": "test1"
}
],
"swift_keys": [],
"caps": [],
"op_mask": "read, write, delete",
"default_placement": "",
"placement_tags": [],
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"temp_url_keys": []
}

Configuration

Structure of a cosbench configuration file:

  • A workload can define one or more work stages
  • Work stages execute sequentially; the works inside the same work stage can execute in parallel
  • Within each work, workers are what you adjust to tune the load
  • Authentication can be defined at several levels; a lower-level auth setting overrides a higher-level one

Concurrency is achieved by configuring multiple works, and within a work by adding workers, which gives many-to-many access. The workers are spread across the drivers. With multiple works, make sure the containers do not share names; partition the bucket ranges cleanly.


Notes on work

  • A work can be ended by write time, bytes written, or write ops
  • interval (default 5 s) is the interval between performance snapshots, i.e. the sampling points
  • division controls how the work is divided among workers: by container (bucket), by object, or none
  • By default every driver takes part in the work; a parameter can restrict it to a subset of drivers
  • Runtime governs execution: if the time is not up but the specified objects have all been written, they get overwritten, so be clear whether the test is driven by object count or by time

For a read test, if an object does not exist the run aborts with an error, so all the objects to be read must be populated before the read test (best to check first).

Configuration files for individual stages

Create buckets through a single gateway

<?xml version="1.0" encoding="UTF-8"?>
<workload name="create-bucket" description="create s3 bucket" config="">
<auth type="none" config=""/>
<workflow config="">
<workstage name="create bucket" closuredelay="0" config="">
<auth type="none" config=""/>
<work name="rgw1" type="init" workers="2" interval="5"
division="container" runtime="0" rampup="0" rampdown="0"
afr="0" totalOps="1" totalBytes="0" config="containers=r(1,32)">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7481;path_style_access=true"/>
<operation type="init" ratio="100" division="container"
config="containers=r(1,32);containers=r(1,32);objects=r(0,0);sizes=c(0)B;containers=r(1,32)" id="none"/>
</work>
</workstage>
</workflow>
</workload>

With the configuration above, if workers=1, one of the current drivers is picked and it creates the buckets against the configured storage; if workers=2, two drivers are picked and each handles half of the work, which is equivalent to two clients issuing create operations against one gateway at the same time.

RGW gateways are peers, so naturally there is another setup: creating buckets through more than one gateway. That is done by adding work entries; see the configuration below.

Create buckets through multiple gateways

<?xml version="1.0" encoding="UTF-8"?>
<workload name="create-bucket" description="create s3 bucket" config="">
<auth type="none" config=""/>
<workflow config="">
<workstage name="create bucket" closuredelay="0" config="">
<auth type="none" config=""/>
<work name="rgw1" type="init" workers="2" interval="5"
division="container" runtime="0" rampup="0" rampdown="0"
afr="0" totalOps="1" totalBytes="0" config="containers=r(1,16)">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7481;path_style_access=true"/>
<operation type="init" ratio="100" division="container"
config="containers=r(1,16);containers=r(1,16);objects=r(0,0);sizes=c(0)B;containers=r(1,16)" id="none"/>
</work>
<work name="rgw2" type="init" workers="2" interval="5"
division="container" runtime="0" rampup="0" rampdown="0"
afr="0" totalOps="1" totalBytes="0" config="containers=r(17,32)">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7482;path_style_access=true"/>
<operation type="init" ratio="100" division="container"
config="containers=r(17,32);containers=r(17,32);objects=r(0,0);sizes=c(0)B;containers=r(17,32)" id="none"/>
</work>
</workstage>
</workflow>
</workload>

The configuration above creates buckets through two gateways. Next comes the prepare-related configuration. In cosbench there are two places where writes can happen: in a prepare stage and in a main stage.
The reason for this split is that a mixed read/write test needs the read data prepared up front before the concurrent read/write test starts, hence a prepare stage. In the configuration file the only difference is the type attribute, everything else is the same; you can also check the options the web UI offers. The remaining stages below all use two-way concurrency.

prepare

The write part is the same.

Write data through multiple gateways

<workstage name="putobject" closuredelay="0" config="">
<auth type="none" config=""/>
<work name="rgw1-put" type="normal" workers="2" interval="5"
division="container" runtime="60" rampup="0" rampdown="0"
afr="200000" totalOps="0" totalBytes="0" config="">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7481;path_style_access=true"/>
<operation type="write" ratio="100" division="container"
config="containers=u(1,16);objects=u(1,5);sizes=u(2,2)MB" id="none"/>
</work>
<work name="rgw2-put" type="normal" workers="2" interval="5"
division="container" runtime="60" rampup="0" rampdown="0"
afr="200000" totalOps="0" totalBytes="0" config="">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7482;path_style_access=true"/>
<operation type="write" ratio="100" division="container"
config="containers=u(17,32);objects=u(1,5);sizes=u(2,2)MB" id="none"/>
</work>
</workstage>

A few parameters worth noting:

containers=u(1,16);objects=u(1,5);sizes=u(2,2)MB

This controls which buckets are written to; whether the load is spread over all of them or split up is under your control. objects sets the names of the objects written into the bucket, and sizes sets the object size: two different values give a range, identical values give a fixed size.
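A few illustrative variations of the selector syntax (the values are only examples): c() is a constant, u() picks uniformly at random from the range, and r() enumerates the range:

containers=u(1,16);objects=u(1,1000);sizes=c(4)MB     (fixed 4 MB objects)
containers=u(1,16);objects=u(1,1000);sizes=u(1,8)MB   (sizes uniformly between 1 MB and 8 MB)
containers=r(1,16);objects=r(1,1000);sizes=c(512)KB   (enumerate every bucket and object exactly once)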

runtime="60" rampup="0" rampdown="0" afr="200000" totalOps="0" totalBytes="0"

This controls when writing stops; it can be by time, by total ops, or by total bytes, whichever fits your need. afr is the allowed failure rate, in parts per million.

interval="5"

This controls the interval at which performance data is sampled.

That completes the write configuration.

Concurrent read configuration

<workstage name="getobj" closuredelay="0" config="">
<auth type="none" config=""/>
<work name="rgw1-get" type="normal" workers="2" interval="5"
division="none" runtime="30" rampup="0" rampdown="0"
afr="200000" totalOps="0" totalBytes="0" config="">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7481;path_style_access=true"/>
<operation type="read" ratio="100" division="none"
config="containers=u(1,16);objects=u(1,5);" id="none"/>
</work>
<work name="rgw2-get" type="normal" workers="2" interval="5"
division="none" runtime="30" rampup="0" rampdown="0"
afr="200000" totalOps="0" totalBytes="0" config="">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7482;path_style_access=true"/>
<operation type="read" ratio="100" division="none"
config="containers=u(17,32);objects=u(1,5);" id="none"/>
</work>
</workstage>

Object deletion (cleanup) configuration

<workstage name="cleanup" closuredelay="0" config="">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7481;path_style_access=true"/>
<work name="rgw1-cleanup" type="cleanup" workers="1" interval="5"
division="object" runtime="0" rampup="0" rampdown="0"
afr="0" totalOps="1" totalBytes="0" config="containers=r(1,16);objects=r(1,5);">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7481;path_style_access=true"/>
<operation type="cleanup" ratio="100" division="object"
config="containers=r(1,16);objects=r(1,5);;deleteContainer=false;" id="none"/>
</work>
<work name="rgw2-cleanup" type="cleanup" workers="1" interval="5"
division="object" runtime="0" rampup="0" rampdown="0"
afr="0" totalOps="1" totalBytes="0" config="containers=r(17,32);objects=r(1,5);">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7482;path_style_access=true"/>
<operation type="cleanup" ratio="100" division="object"
config="containers=r(17,32);objects=r(1,5);;deleteContainer=false;" id="none"/>
</work>
</workstage>

Bucket deletion (dispose) configuration

<workstage name="dispose" closuredelay="0" config="">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7481;path_style_access=true"/>
<work name="rgw1-dispose" type="dispose" workers="1" interval="5"
division="container" runtime="0" rampup="0" rampdown="0"
afr="0" totalOps="1" totalBytes="0" config="containers=r(1,16);">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7481;path_style_access=true"/>
<operation type="dispose" ratio="100"
division="container"
config="containers=r(1,16);;objects=r(0,0);sizes=c(0)B;;" id="none"/>
</work>
<work name="rgw2-dispose" type="dispose" workers="1" interval="5"
division="container" runtime="0" rampup="0" rampdown="0"
afr="0" totalOps="1" totalBytes="0" config="containers=r(17,32);">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7481;path_style_access=true"/>
<operation type="dispose" ratio="100"
division="container"
config="containers=r(17,32);;objects=r(0,0);sizes=c(0)B;;" id="none"/>
</work>
</workstage>

The workstages above cover the following types:

  • init: create buckets
  • normal write: write objects
  • normal read: read objects
  • cleanup: delete objects
  • dispose: delete buckets

division controls how operations are split among multiple workers; it is best to set it at the operation level.

Questions to ask yourself before testing

  • How many workloads per machine? (normally one, so a single test has exclusive use of the machine's resources)
  • How many drivers? (this determines how many clients issue requests; one per machine is enough)
  • Which stages are being tested (init, prepare or normal, remove), separately or mixed?
  • How many works are started in each workstage? (a work controls where requests are sent)
  • How many workers in each work? (this controls how many drivers run concurrently)
  • How many RGW gateways does the Ceph cluster under test have, and how many buckets are created for the test?
  • How many objects are written per bucket, at what object size, and for how long?

When testing a large number of files you can drive the run by ops, setting totalOps higher than the number of files you want so that much data really gets written; if the performance is already fairly well known, you can also drive the run by time.

Now let me describe a test model based on my own requirements and then configure it:

  • Use two clients for the test, so prepare two drivers
  • Two RGW gateways are planned, so each workstage gets two works, each mapped to one storage endpoint
  • Run a complete suite: create buckets, write, read, delete objects, delete buckets
  • Set workers to a multiple of 2, starting at 2, so each driver takes half the load; after each round double the count to raise concurrency, and when performance stops increasing while latency keeps rising, record the performance and parameter values as the maximum for this environment
  • Create 32 buckets, write 5 objects of 2 MB into each, write for 600 s and read for 60 s

The framework is simple: two drivers, and in each stage two works, one per RGW endpoint.

The configuration file is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<workload name="create-bucket" description="create s3 bucket" config="">
<auth type="none" config=""/>
<workflow config="">
<workstage name="create bucket" closuredelay="0" config="">
<auth type="none" config=""/>
<work name="rgw1-create" type="init" workers="2" interval="5"
division="container" runtime="0" rampup="0" rampdown="0"
afr="0" totalOps="1" totalBytes="0" config="containers=r(1,16)">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7481;path_style_access=true"/>
<operation type="init" ratio="100" division="container"
config="containers=r(1,16);objects=r(0,0);sizes=c(0)B" id="none"/>
</work>
<work name="rgw2-create" type="init" workers="2" interval="5"
division="container" runtime="0" rampup="0" rampdown="0"
afr="0" totalOps="1" totalBytes="0" config="containers=r(17,32)">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7482;path_style_access=true"/>
<operation type="init" ratio="100" division="container"
config="containers=r(17,32);objects=r(0,0);sizes=c(0)B" id="none"/>
</work>
</workstage>

<workstage name="putobject" closuredelay="0" config="">
<auth type="none" config=""/>
<work name="rgw1-put" type="normal" workers="2" interval="5"
division="container" runtime="600" rampup="0" rampdown="0"
afr="200000" totalOps="0" totalBytes="0" config="">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7481;path_style_access=true"/>
<operation type="write" ratio="100" division="container"
config="containers=u(1,16);objects=u(1,5);sizes=u(2,2)MB" id="none"/>
</work>
<work name="rgw2-put" type="normal" workers="2" interval="5"
division="container" runtime="600" rampup="0" rampdown="0"
afr="200000" totalOps="0" totalBytes="0" config="">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7482;path_style_access=true"/>
<operation type="write" ratio="100" division="container"
config="containers=u(17,32);objects=u(1,5);sizes=u(2,2)MB" id="none"/>
</work>
</workstage>

<workstage name="getobj" closuredelay="0" config="">
<auth type="none" config=""/>
<work name="rgw1-get" type="normal" workers="2" interval="5"
division="none" runtime="60" rampup="0" rampdown="0"
afr="200000" totalOps="0" totalBytes="0" config="">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7481;path_style_access=true"/>
<operation type="read" ratio="100" division="none"
config="containers=u(1,16);objects=u(1,5);" id="none"/>
</work>
<work name="rgw2-get" type="normal" workers="2" interval="5"
division="none" runtime="60" rampup="0" rampdown="0"
afr="200000" totalOps="0" totalBytes="0" config="">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7482;path_style_access=true"/>
<operation type="read" ratio="100" division="none"
config="containers=u(17,32);objects=u(1,5);" id="none"/>
</work>
</workstage>

<workstage name="cleanup" closuredelay="0" config="">
<auth type="none" config=""/>
<work name="rgw1-cleanup" type="cleanup" workers="1" interval="5"
division="object" runtime="0" rampup="0" rampdown="0"
afr="0" totalOps="1" totalBytes="0" config="containers=r(1,16);objects=r(1,100);">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7481;path_style_access=true"/>
<operation type="cleanup" ratio="100" division="object"
config="containers=r(1,16);objects=r(1,100);;deleteContainer=false;" id="none"/>
</work>
<work name="rgw2-cleanup" type="cleanup" workers="1" interval="5"
division="object" runtime="0" rampup="0" rampdown="0"
afr="0" totalOps="1" totalBytes="0" config="containers=r(17,32);objects=r(1,100);">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7482;path_style_access=true"/>
<operation type="cleanup" ratio="100" division="object"
config="containers=r(17,32);objects=r(1,100);;deleteContainer=false;" id="none"/>
</work>
</workstage>

<workstage name="dispose" closuredelay="0" config="">
<auth type="none" config=""/>
<work name="rgw1-dispose" type="dispose" workers="1" interval="5"
division="container" runtime="0" rampup="0" rampdown="0"
afr="0" totalOps="1" totalBytes="0" config="containers=r(1,16);">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7481;path_style_access=true"/>
<operation type="dispose" ratio="100"
division="container"
config="containers=r(1,16);;objects=r(0,0);sizes=c(0)B;;" id="none"/>
</work>
<work name="rgw2-dispose" type="dispose" workers="1" interval="5"
division="container" runtime="0" rampup="0" rampdown="0"
afr="0" totalOps="1" totalBytes="0" config="containers=r(17,32);">
<auth type="none" config=""/>
<storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://192.168.19.101:7482;path_style_access=true"/>
<operation type="dispose" ratio="100"
division="container"
config="containers=r(17,32);;objects=r(0,0);sizes=c(0)B;;" id="none"/>
</work>
</workstage>

</workflow>
</workload>

The test above is meant as a template, so it uses a small object count and a short duration.

You can design the test model according to your hardware or your customer's requirements; in a large enough environment, only enough RGW gateways and enough clients will produce high performance numbers.

When testing, try to separate the write test and the read test into two runs, so the read test does not abort because the first run did not write enough objects. For tests lasting several hours an interruption is painful, and splitting the run shortens the time needed to resume after one.

For the write test, write as many objects as your limits allow and avoid overwrites as much as possible; this also spreads the reads well enough.

Make the test as long as you can.

Test results

cosbench produces a results table and per-interval line graphs for each stage.

The line graph shows how a test item behaves over time; what you usually look for is large fluctuations: at the same performance level, the less jitter the better.

Other tuning

With a fixed hardware environment you can try adding an nginx or LVS load balancer to push performance further; that is outside the scope of this post.

Change log

Why Who When
Created 武汉-运维-磨渣 2018-04-12

Source: zphj1987@gmail (cosbench使用方法)


Ansible module to manage CephX Keys


Following our recent initiative on writing more Ceph modules for Ceph Ansible, I’d like to introduce one that I recently wrote: ceph_key.

The module is pretty straightforward to use and will ease your day two operations for managing CephX keys. It has several capabilities such as:

  • create: will create the key on the filesystem with the right permissions (mode/owner supported) and import it into Ceph (this can be enabled/disabled) with the given capabilities
  • update: will update the capabilities of a particular key
  • delete: will delete the key from Ceph
  • info: will retrieve all the information about a particular key
  • list: will list all the available keys

The module also works on containerized Ceph clusters.

See the following examples:

---
# This playbook is used to manage CephX Keys
# You will find examples below on how the module can be used on daily operations
#
# It currently runs on localhost

- hosts: localhost
  gather_facts: false
  vars:
    cluster: ceph
    keys_to_info:
      - client.admin
      - mds.0
    keys_to_delete:
      - client.leseb
      - client.leseb1
      - client.pythonnnn
    keys_to_create:
      - { name: client.pythonnnn, caps: { mon: "allow rwx", mds: "allow *" }, mode: "0600" }
      - { name: client.existpassss, caps: { mon: "allow r", osd: "allow *" }, mode: "0600" }
      - { name: client.path, caps: { mon: "allow r", osd: "allow *" }, mode: "0600" }

  tasks:
    - name: create ceph key(s) module
      ceph_key:
        name: "{{ item.name }}"
        state: present
        caps: "{{ item.caps }}"
        cluster: "{{ cluster }}"
        secret: "{{ item.key | default('') }}"
      with_items: "{{ keys_to_create }}"

    - name: update ceph key(s)
      ceph_key:
        name: "{{ item.name }}"
        state: update
        caps: "{{ item.caps }}"
        cluster: "{{ cluster }}"
      with_items: "{{ keys_to_create }}"

    - name: delete ceph key(s)
      ceph_key:
        name: "{{ item }}"
        state: absent
        cluster: "{{ cluster }}"
      with_items: "{{ keys_to_delete }}"

    - name: info ceph key(s)
      ceph_key:
        name: "{{ item }}"
        state: info
        cluster: "{{ cluster }}"
      register: key_info
      ignore_errors: true
      with_items: "{{ keys_to_info }}"

    - name: list ceph key(s)
      ceph_key:
        state: list
        cluster: "{{ cluster }}"
      register: list_keys
      ignore_errors: true

The goal is to have all of our Ceph modules included by default in Ansible. Stay tuned, more modules to come!

Source: Sebastian Han (Ansible module to manage CephX Keys)


Ceph Nano big updates


With its two latest versions (v1.3.0 and v1.4.0), Ceph Nano brought some nifty new functionality that I'd like to highlight in this article.

Multi cluster support

This feature has been available since v1.3.0.

You can now run more than a single instance of cn, as many as your system allows (CPU- and memory-wise). This is how you run a new cluster:

 $ ./cn cluster start s3 -d /tmp
2018/04/30 16:12:07 Running cluster s3...

HEALTH_OK is the Ceph status
S3 object server address is: http://10.36.116.231:8001
S3 user is: nano
S3 access key is: JZYOITC0BDLPB0K6E5WX
S3 secret key is: sF0Vu6seb64hhlsmtxKT6BSrs2KY8cAB8la8kni1
Your working directory is: /tmp

And how you can retrieve the list of running clusters:

$ ./cn cluster ls
+--------+---------+--------------------+----------------+--------------------------------+
| NAME | STATUS | IMAGE | IMAGE RELEASE | IMAGE CREATION TIME |
+--------+---------+--------------------+----------------+--------------------------------+
| s3 | running | ceph/daemon:latest | master-d0d98c4 | 2018-04-20T13:37:06.933085171Z |
| trolol | exited | ceph/daemon:latest | master-d0d98c4 | 2018-04-20T13:37:06.933085171Z |
| e | running | ceph/daemon:latest | master-d0d98c4 | 2018-04-20T13:37:06.933085171Z |
+--------+---------+--------------------+----------------+--------------------------------+

This feature works well in conjunction with the image support.
You can run any container using any container image available on the Docker Hub. You can even use your own one if you want to test a fix.

You can list the available images like this:

$ ./cn image ls
latest-bis
latest
latest-luminous
latest-kraken
latest-jewel
master-da37788-kraken-centos-7-x86_64
master-da37788-jewel-centos-7-x86_64
master-da37788-kraken-ubuntu-16.04-x86_64
master-da37788-jewel-ubuntu-14.04-x86_64
master-da37788-jewel-ubuntu-16.04-x86_64

Use -a to list all our images.
Then use the -i option when starting a cluster to run the image you want.

Dedicated device or directory support

This feature is available since v1.4.0.

You might be after providing more persistent and fast storage for cn. This is possible by specifying either a dedicated block device (a partition works too) or a directory that you might have configured on a particular device.

You have to run cn with sudo here since it performs a couple of checks on that device to make sure it's eligible for usage; hence the higher privileges required to run cn.

sudo ./cn cluster start -b /dev/disk/by-id/wwn-0x600508b1001c4257dacb9870dbc6b1c8 block

Using a directory is identical, just run with -b /srv/cn/ for instance.

I'm so glad to see how cn has evolved; I'm proud of this little tool that I use on a daily basis for so many things. I hope you are enjoying it as much as I do.

Source: Sebastian Han (Ceph Nano big updates)


See you at the Red Hat summit

Crypto Unleashed


Cryptography made easy…er

Cryptography does not have to be mysterious, as Jean-Philippe Aumasson, author of Serious Cryptography, points out. It is meant to be fiendishly complex to break, and it remains very challenging to implement (see the jokes on rolling your own crypto found all over the Net), but it is well within the grasp of most programmers to understand.


While many are intimidated by the prospect of digging into what is effectively a branch of number theory, the reality is that cryptography is squarely based in discrete mathematics—and good coders are all, without exception and often unknowingly, natural discrete math jugglers. If you are interested and you aced your data structures course, chances are that crypto will not be an unsurmountable challenge to you. Aumasson certainly seems to think so, and he walks us along his own path to the discovery of the cryptographic realm.

 The Contenders

Other books can take you along this journey, all with their distinctive traits. The classic 1994-vintage Applied Cryptography by Bruce Schneier is the unchallenged, most authoritative reference in the field. Squarely focused on crypto algorithms and their weaknesses, it belongs on every security nerd's shelf, but it may not be an engineer's first choice when looking at this space: actual production use, or even mention, of protocols like TLS and SSL is entirely outside of its scope.

Schneier revisited the subject again in 2003 with Niels Ferguson and gave us Practical Cryptography, covering every conceivable engineering aspect of implementing and consuming cryptographic code while having a clue to what is happening inside the system. This is an eminently practical book, and it was re-issued in updated form in 2010 under the new title of Cryptography Engineering under the new co-authorship of Tadayoshi Kohno.

While I had read Schneier's original tome in installments during my Summer visits at the University of Jyväskylä, my deep dive into the field came through a so-called MIT course bible, Lecture Notes on Cryptography, compiled for 6.87s, a weeklong course on cryptography taught at MIT by future Turing laureate Shafi Goldwasser and Mihir Bellare during the Summers of 1996–2002, which I myself was privileged to attend in 2000. This was one of the most intellectually challenging and mind-stretching weeks of my life, with a new, exciting idea being served every 10 minutes in an amazing tour de force. These notes are absolutely great, and I still go back to them, but I do not know if they would serve you as well without the live instructors guiding you along the first time.

Graduate Course in Applied Cryptography, by Dan Boneh and Victor Shoup of Stanford and NYU respectively, is another similar course bible, and I am mentioning it here because it has been updated more recently than Goldwasser and Bellare’s, which was last refreshed in 2008.

 Enter the New Tome

No Starch Press has been lining up an impressive computer security catalog, and it was inevitable they would venture into crypto at one point or another. Aumasson’s entry into the pantheon of the explainers of cryptography is characterized by his focus on teaching us how the algorithms work with the most meager use of mathematical notation. This is, like most of the other books I referenced, a book aiming to increase the understanding of how cryptography works, covering primitives such as block and stream modes, hash functions, and keyed hashing. But what is noteworthy is how this book also straddles the ranges defined earlier, spanning from pre-requisites like the need for good randomness, hard problems, and the definition of cryptographic security on the one end and the operation of the RSA algorithm and the TLS protocol on the other. This is not a book targeted at experts in the field, but it does not trivialize the subject matter either and it is impressive in its breadth: the state of the art in applied cryptography is distilled here in a mere 282 pages.

It is hard to overstate how pleasing the broad reach of this single book is to this reader: despite my keen interest in the field and all my reading, I myself did not hand-roll the RSA algorithm until Professor H.T. Kung made me a few years later, in a networking graduate course. Isolation between the study of the algorithms and the study of the protocols implementing them is exceedingly common, and it is delightful to see this book bridge the two.

I was drawn to the book by its concise and yet comprehensive coverage of randomness for a talk I have been developing, and stayed to read the explanation of keyed hashing and message authentication codes (MACs), a jewel in its own right as the author co-developed two hash functions now in widespread use. As someone who had to self-start his own coding in both subjects, I wish this book had been available when I was in grad school. My loss is your gain, dear reader: you can catch up to the state of the art much faster than I did a decade ago!

This is still a complex subject, yet Aumasson’s tome should help increase the ranks of those that can confidently contribute when the topic is being discussed. Most programmers need not be cryptanalysts, but many will benefit from a deeper understanding of how security in computer systems is actually achieved.

Source: Federico Lucifredi (Crypto Unleashed)


See you at the OpenStack Summit

OpenStack Summit Vancouver: How to Survive an OpenStack Cloud Meltdown with Ceph


How to Survive an OpenStack Cloud Meltdown with Ceph


Los Tres Caballeros —sans sombreros— descended on Vancouver this week to participate in the “Rocky” OpenStack Summit. For the assembled crowd of clouderati, Sébastien Han, Sean Cohen and yours truly had one simple question: what if your datacenter was wiped out in its entirety, but your users hardly even noticed?


We have touched on the disaster recovery theme before, but this time we decided to discuss backup as well as HA, which made for a slightly longer talk than we had planned—we hope you enjoyed our “choose your disaster” tour, we definitely enjoyed leading it.

Tiering to AWS roadmap

The recording of our OpenStack Summit session is now live on the OpenStack Foundation’s YouTube channel. It is impressive how quickly the Foundation’s media team releases now:

Our slides are available as a PDF and can be viewed inline below — we are including our backup slides, so you can find out what we could have talked about, had we run over even longer 😉

See you all in Berlin this fall!

Source: Federico Lucifredi (How to Survive an OpenStack Cloud Meltdown with Ceph)


Ceph and Ceph Manager Dashboard presentations at openSUSE Conference 2018


Last weekend, the openSUSE Conference 2018 took place in Prague (Czech
Republic). Our team was present to talk about Ceph and our
involvement in developing the Ceph manager dashboard, which will be available as
part of the upcoming Ceph “Mimic” release.

The presentations were held by Laura Paduano and Kai Wagner from our team – thank you for your engagement!
The openSUSE conference team did an excellent job in streaming and recording each session, and the resulting videos can already be viewed on their YouTube channel.

Ceph – The Distributed Storage Solution

Ceph Manager Dashboard

Source: SUSE (Ceph and Ceph Manager Dashboard presentations at openSUSE Conference 2018)


Recovering a damaged CephFS metadata pool


Preface

CephFS has become fairly stable as of the L release. To me that stability lies mainly in the maturity of its failure recovery: being recoverable is a property a file system must have before it can be called stable. This post follows the official documentation to practice that recovery process.

Walkthrough

Deploy a Ceph Luminous cluster

[root@lab102 ~]# ceph -v
ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)

Create a FileStore OSD

ceph-deploy osd create  lab102  --filestore  --data /dev/sdb1  --journal /dev/sdb2

To test with FileStore, just create the OSD as shown above.

Load test data

  • doc
  • pic
  • video
    A download link is provided here:

Link: https://pan.baidu.com/s/19tlFi4butA2WjnPAdNEMwg  password: ugjo

This is template data downloaded from the web, convenient for simulating real files; files produced by dd are empty and can sometimes skew a test.

If you need more test documents, they can be downloaded from the sites below.

Videos:

https://videos.pexels.com/popular-videos

Images:

https://www.pexels.com/

Documents:

http://office.mmais.com.cn/Template/Home.shtml

Simulating a metadata failure

Metadata-related failures come down to the MDS failing to start or metadata PGs being damaged. Here we simulate a rather extreme case: wiping out every object in the metadata pool, which covers just about the worst failure. Damage to the data itself is outside the scope of metadata damage.

Empty the metadata pool:

for object in `rados -p metadata ls`;do rados -p metadata rm $object;done

Restart the MDS daemon; the MDS should not be able to return to normal.

cluster:
id: 9ec7768a-5e7c-4f8e-8a85-89895e338cca
health: HEALTH_ERR
1 filesystem is degraded
1 mds daemon damaged
too few PGs per OSD (16 < min 30)

services:
mon: 1 daemons, quorum lab102
mgr: lab102(active)
mds: ceph-0/1/1 up , 1 up:standby, 1 damaged
osd: 1 osds: 1 up, 1 in

Now we can start the repair.

Metadata failure recovery

Allow multiple filesystems:

ceph fs flag set enable_multiple true --yes-i-really-mean-it

Create a new metadata pool. This is so we do not touch the original metadata pool and risk damaging the original metadata further.

ceph osd pool create recovery 8

Associate the old data pool with the new recovery metadata pool and create a new filesystem recovery-fs:

[root@lab102 ~]# ceph fs new recovery-fs recovery data --allow-dangerous-metadata-overlay
new fs with metadata pool 3 and data pool 2

Do the initialization work for the new filesystem:

[root@lab102 ~]#cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery

Reset the new fs:

[root@lab102 ~]#ceph fs reset recovery-fs --yes-i-really-mean-it
[root@lab102 ~]#cephfs-table-tool recovery-fs:all reset session
[root@lab102 ~]#cephfs-table-tool recovery-fs:all reset snap
[root@lab102 ~]#cephfs-table-tool recovery-fs:all reset inode

Run the recovery:

[root@lab102 ~]# cephfs-data-scan scan_extents --force-pool --alternate-pool recovery --filesystem ceph  data
[root@lab102 ~]# cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem ceph --force-corrupt --force-init data
[root@lab102 ~]# cephfs-data-scan scan_links --filesystem recovery-fs
[root@lab102 ~]# systemctl start ceph-mds@lab102
(wait for the MDS to become active before running the next command)
[root@lab102 ~]# ceph daemon mds.lab102 scrub_path / recursive repair

Set it as the default fs:

[root@lab102 ~]# ceph fs set-default recovery-fs

Mount and check the data

[root@lab102 ~]#  mount -t ceph 192.168.19.102:/ /mnt
[root@lab102 ~]# ll /mnt
total 0
drwxr-xr-x 1 root root 1 Jan 1 1970 lost+found
[root@lab102 ~]# ll /mnt/lost+found/
total 226986
-r-x------ 1 root root 569306 May 25 16:16 10000000001
-r-x------ 1 root root 16240627 May 25 16:16 10000000002
-r-x------ 1 root root 1356367 May 25 16:16 10000000003
-r-x------ 1 root root 137729 May 25 16:16 10000000004
-r-x------ 1 root root 155163 May 25 16:16 10000000005
-r-x------ 1 root root 118909 May 25 16:16 10000000006
-r-x------ 1 root root 1587656 May 25 16:16 10000000007
-r-x------ 1 root root 252705 May 25 16:16 10000000008
-r-x------ 1 root root 1825192 May 25 16:16 10000000009
-r-x------ 1 root root 156990 May 25 16:16 1000000000a
-r-x------ 1 root root 3493435 May 25 16:16 1000000000b
-r-x------ 1 root root 342390 May 25 16:16 1000000000c
-r-x------ 1 root root 1172247 May 25 16:16 1000000000d
-r-x------ 1 root root 2516169 May 25 16:16 1000000000e
-r-x------ 1 root root 3218770 May 25 16:16 1000000000f
-r-x------ 1 root root 592729 May 25 16:16 10000000010

You can see the recovered files are in lost+found.

[root@lab102 ~]# file /mnt/lost+found/10000000010 
/mnt/lost+found/10000000010: Microsoft PowerPoint 2007+
[root@lab102 ~]# file /mnt/lost+found/10000000011
/mnt/lost+found/10000000011: Microsoft Word 2007+
[root@lab102 ~]# file /mnt/lost+found/10000000012
/mnt/lost+found/10000000012: Microsoft Word 2007+
[root@lab102 ~]# file /mnt/lost+found/10000000013
/mnt/lost+found/10000000013: Microsoft PowerPoint 2007+

The generated file name is the prefix of the objects that actually store the file's data, i.e. it is computed from the original inode number.
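A rough sketch of that relationship (the object names are from the listing above; /mnt/somefile is a hypothetical path on a healthy mount):

# objects backing the recovered file 10000000010 carry it as their name prefix
rados -p data ls | grep '^10000000010\.'
# on a healthy mount the hex prefix is simply the file's inode number in hex
printf '%x\n' "$(stat -c %i /mnt/somefile)"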

If the original metadata information was backed up in advance:

[root@lab102 ~]# ceph daemon mds.lab102 dump cache > /tmp/mdscache

then it is fairly easy to map the lost files back.

Summary

As I wrote in another article, a file's inode ties the file to its backend objects. My earlier recovery approach was to pull all the objects out of the backend and stitch them back together by hand. What actually happens here, when the data still exists, is that the files are re-linked into a path; this is the recovery method the upstream project provides. The biggest worry around the MDS has always been that damage to its own metadata could bring down the whole filesystem. Now, as long as the data pool is intact, there is no need to fear losing the data: even if the path information is gone, the files are still there.

Backing up the MDS cache associates file names, paths, sizes, and inodes, while the recovered files are named by object prefix, so with an MDS cache backup the whole picture of a file can be chained back together.

CephFS failures do not happen often, but just in case.

Next I plan to write up a procedure for recovering data deleted by mistake from a CephFS mount point; I have already run recovery experiments on a small number of files, and I will share it after testing recovery from a large-scale deletion.

References

disaster-recovery

Change log

Why Who When
Created 武汉-运维-磨渣 2018-05-29

Source: zphj1987@gmail (cephfs元数据池故障的恢复)


Storage for Data Platforms in 10 minutes


Kyle Bader and I teamed up to deliver a quick (and hopefully painless) review of what types of storage your Big Data strategy needs to succeed alongside the better-understood (and more traditional) existing approaches to structured data.

Data platform engineers need support from both the Compute and the Storage infrastructure teams to deliver. We look at how the public cloud, and Amazon AWS in particular, tackles these challenges and what the equivalent technology strategies are in OpenStack and Ceph.

F2 in demo session

Tradeoffs between IO latency, availability of storage space, cost and IO performance lead to storage options fragmenting into three broad solution areas: network-backed persistent block, application-focused object storage (also network based), and directly-attached low-latency NVME storage for highest-performance scratch and overflow space.

Ideally, the infrastructure designer would choose to adopt similarly-behaving approaches to the public and private cloud environments, which is what makes OpenStack and Ceph a good fit: scale-out, cloud-native technologies naturally have much more in common with public cloud than legacy vendors. Interested? Listen to our quick survey of the field, the OpenStack Foundation kindly published a recording of our session:

Our slides are available as a PDF download and can be viewed inline below.


Source: Federico Lucifredi (Storage for Data Platforms in 10 minutes)


An analysis of the default min_size for Ceph erasure-coded pools


Introduction

Recently I worked with two clusters that both use erasure coding, one on Hammer and one on Luminous. Both ran into incomplete PGs, and the triggers were similar: OSDs going offline.

While preparing to reproduce this in a local environment, I found a difference from the erasure-coding behavior I had seen before; I am writing it down to avoid hitting the same problem later.

Analysis

I prepared a Luminous cluster and created a pool using the default erasure code profile.

[root@lab102 ~]# ceph osd erasure-code-profile get default
k=2
m=1
plugin=jerasure
technique=reed_sol_van

The default is a 2+1 erasure code layout; after creation the pool configuration looks like this:

[root@lab102 ~]# ceph osd dump|grep pool
pool 1 'rbd' erasure size 3 min_size 3 crush_rule 2 object_hash rjenkins pg_num 256 pgp_num 256 last_change 41 flags hashpspool stripe_width 8192 application rbdrc

Then, after stopping one OSD, the status became:

[root@lab102 ~]# ceph -s
cluster:
id: 9ec7768a-5e7c-4f8e-8a85-89895e338cca
health: HEALTH_WARN
1 osds down
Reduced data availability: 42 pgs inactive, 131 pgs incomplete

services:
mon: 1 daemons, quorum lab102
mgr: lab102(active)
osd: 6 osds: 5 up, 6 in

data:
pools: 3 pools, 288 pgs
objects: 1666k objects, 13331 MB
usage: 319 GB used, 21659 GB / 21979 GB avail
pgs: 45.486% pgs not active
157 active+clean
131 incomplete

So stopping a single OSD already produces incomplete PGs. In other words, with the defaults not even one OSD may be down, or PGs become unusable. I could not make sense of this at first and assumed it was a bug in the L release, but after some digging it turned out not to be.

I found this patch: default min_size for erasure pools.

It discusses exactly this min_size question. As noted above, with the default 2+1 profile my expectation was min_size 2, i.e. still readable and writable with one OSD down.

In fact, in /src/mon/OSDMonitor.cc the min_size control for erasure pools has been changed from

*min_size = erasure_code->get_data_chunk_count();
to
*min_size = erasure_code->get_data_chunk_count() + 1;

At the end the author explains his concern. With a K+M layout, if reads and writes are allowed with only K OSDs up, consider the case where the K OSDs are healthy and the M OSDs are down: when one of the M OSDs is brought back, backfilling starts, and if one of the K OSDs dies during that backfill the PG's data really is incomplete. With K+1, losing one more OSD during recovery still leaves the data complete. In other words, the developers want one unit of redundancy left over for failures that happen during recovery, and in day-to-day operations we do indeed see OSDs die while recovery is running. Other file systems handle this situation by going readable but not writable.

So Ceph's erasure pool min_size is now designed as

min_size = K + 1

which means that with the default profile min_size is 3.

Now the behavior above makes sense, and it is something to account for when choosing an encoding. For example, with 4+2 EC, how many OSDs may fail? The confident answer used to be 2; under the new rule min_size is 4+1=5, so only one OSD may be down for the pool to stay readable and writable.
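A quick way to check the effective value on a live cluster (the 4+2 profile and pool names below are only examples):

ceph osd pool get rbd min_size                      # the 2+1 pool above reports 3
ceph osd erasure-code-profile set ec42 k=4 m=2
ceph osd pool create ecpool 64 64 erasure ec42
ceph osd pool get ecpool min_size                   # expected: 5, i.e. k+1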

Of course, if a production 4+2 pool loses two OSDs and goes incomplete, the data is still complete and reconstructable, so you can force mark-complete, or change min_size in the code yourself to trigger recovery.

Summary

I have worked with EC for a long time and there is a lot of interesting material in it worth studying. The best fit for EC is archival, although under some profiles the performance is quite good and it can even support some low-latency workloads. The key point is that you must run performance tests against your actual environment: how a given K+M split performs is hard to predict and also depends on the file model being written.

Although the author's design intent is sound, the default does not really meet production requirements, so personally I do not find it very reasonable: a default should be usable without tuning, and a pool that cannot tolerate even one OSD down cannot really be used. I am not aware of another configurable way to change this behavior, so pay attention to min_size when you set up EC; if a control parameter appears in the future, I will add it to this article.

Change log

Why Who When
Created 武汉-运维-磨渣 2018-06-12

Source: zphj1987@gmail (ceph erasure默认的min_size分析)


Using s3-tests to check Ceph's S3 API compatibility


Preface

Ceph's RGW provides an S3-compatible interface. Compatible, of course, does not mean every API is supported, so we need a tool to verify and exercise the interfaces, much like the POSIX-compliance suites found among other test tools: it runs test cases and outputs a list of passes and failures.

A nice property of this kind of tool is that the interfaces can be validated, catching API breakage introduced by version updates.

Installation

Clone the official repository; there are not many files, so the download is quick.

[root@lab101 s3]# git clone https://github.com/ceph/s3-tests.git
[root@lab101 s3]# cd s3-tests/

Note that there are per-release branches and you need the one matching the version under test; since we are testing Jewel here, switch to the ceph-jewel branch (a key step).

[root@lab101 s3-tests]# git branch -a
[root@lab101 s3-tests]# git checkout -b jewel remotes/origin/ceph-jewel
[root@lab101 s3-tests]# ./bootstrap

Go into the directory and run ./bootstrap to initialize. This downloads the required libraries and packages and creates a Python virtualenv. If the code was copied over from another machine, it is best to delete the existing virtualenv and let the script recreate the environment.
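Something along these lines recreates it (virtualenv is the directory that ./bootstrap creates inside the checkout):

[root@lab101 s3-tests]# rm -rf virtualenv
[root@lab101 s3-tests]# ./bootstrap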

After that, create the test configuration file test.conf:

[DEFAULT]
## this section is just used as default for all the "s3 *"
## sections, you can place these variables also directly there

## replace with e.g. "localhost" to run against local software
host = 192.168.19.101

## uncomment the port to use something other than 80
port = 7481

## say "no" to disable TLS
is_secure = no

[fixtures]
## all the buckets created will start with this prefix;
## {random} will be filled with random characters to pad
## the prefix to 30 characters long, and avoid collisions
bucket prefix = cephtest-{random}-

[s3 main]
## the tests assume two accounts are defined, "main" and "alt".

## user_id is a 64-character hexstring
user_id = test1

## display name typically looks more like a unix login, "jdoe" etc
display_name = test1

## replace these with your access keys
access_key = test1
secret_key = test1

## replace with key id obtained when secret is created, or delete if KMS not tested
#kms_keyid = 01234567-89ab-cdef-0123-456789abcdef

[s3 alt]
## another user account, used for ACL-related tests
user_id = test2
display_name = test2
## the "alt" user needs to have email set, too
email = test2@qq.com
access_key = test2
secret_key = test2

The user accounts above must be created in advance, which is done with radosgw-admin on a node in the cluster:

radosgw-admin user create --uid=test1 --display-name=test1 --access-key=test1 --secret-key=test1 --email=test1@qq.com
radosgw-admin user create --uid=test2 --display-name=test2 --access-key=test2 --secret-key=test2 --email=test2@qq.com

Once the users are created, the tests can be run:

[root@lab101 s3-tests]# S3TEST_CONF=test.conf ./virtualenv/bin/nosetests -a '!fails_on_rgw'
..................................................SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................................................................................SSSS.......................................................................................................................................SSSS.......................................................
----------------------------------------------------------------------
Ran 408 tests in 122.087s

OK (SKIP=51)

A normal run should finish in the OK state shown above. It is also possible that a test case in a given branch claims support for something RGW has not actually implemented yet, so judge the results for yourself.

Summary

Know which interfaces the software is adapting to, and test against those interfaces.

Change log

Why Who When
Created 武汉-运维-磨渣 2018-06-27

Source: zphj1987@gmail (利用s3-test进行ceph的接口兼容性测试)


Quickly building a visual monitoring system for Ceph


Preface

There are many visualization options for Ceph. This post describes a fairly simple one, with all the packages repackaged, so a visual monitoring system can be built in a very short time.

The system consists of the following components:

  • Ceph (Jewel)
  • ceph_exporter (Jewel build)
  • Prometheus 2.3.2
  • Grafana 5.2.1
  • The Grafana Ceph dashboard: Ceph - Cluster, by Cristian Calin

The target OS is CentOS 7.

Resources:

http://static.zybuluo.com/zphj1987/jiwx305b8q1hwc5uulo0z7ft/ceph_exporter-2.0.0-1.x86_64.rpm
http://static.zybuluo.com/zphj1987/1nu2k4cpcery94q2re3u6s1t/ceph-cluster_rev1.json
http://static.zybuluo.com/zphj1987/7ro7up6r03kx52rkwy1qjuwm/prometheus-2.3.2-1.x86_64.rpm
http://7xweck.com1.z0.glb.clouddn.com/grafana-5.2.1-1.x86_64.rpm

All of the above can be downloaded directly with wget and then installed.

Monitoring architecture

ceph_exporter collects Ceph metrics and listens locally on port 9128.

Prometheus scrapes ceph_exporter on port 9128 and stores the data locally under /var/lib/prometheus/.

Grafana pulls the data from Prometheus and renders it into web pages.

The page template is the Grafana Ceph dashboard plugin.

Now let's configure the system step by step following this architecture.

Configuring the monitoring system

Install ceph_exporter

[root@lab101 install]# wget http://static.zybuluo.com/zphj1987/jiwx305b8q1hwc5uulo0z7ft/ceph_exporter-2.0.0-1.x86_64.rpm
[root@lab101 install]# rpm -qpl ceph_exporter-2.0.0-1.x86_64.rpm
/usr/bin/ceph_exporter
/usr/lib/systemd/system/ceph_exporter.service
[root@lab101 install]# rpm -ivh ceph_exporter-2.0.0-1.x86_64.rpm
Preparing... ################################# [100%]
Updating / installing...
1:ceph_exporter-2:2.0.0-1 ################################# [100%]
[root@lab101 install]# systemctl start ceph_exporter
[root@lab101 install]# systemctl enable ceph_exporter
[root@lab101 install]# netstat -tunlp|grep 9128
tcp6 0 0 :::9128 :::* LISTEN 35853/ceph_exporter

Once the port is listening, the installation succeeded. It is recommended to install ceph_exporter on an admin node, that is, a node where ceph -s works.
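A quick way to confirm the exporter is serving data (9128 is the port shown above; /metrics is the usual Prometheus endpoint path):

[root@lab101 install]# curl -s http://127.0.0.1:9128/metrics | head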

Install Prometheus

[root@lab101 install]#  wget http://static.zybuluo.com/zphj1987/7ro7up6r03kx52rkwy1qjuwm/prometheus-2.3.2-1.x86_64.rpm
[root@lab101 install]# rpm -qpl prometheus-2.3.2-1.x86_64.rpm
/etc/ceph/prometheus.yml
/usr/bin/prometheus
/usr/lib/systemd/system/prometheus.service
[root@lab101 install]# rpm -ivh prometheus-2.3.2-1.x86_64.rpm
Preparing... ################################# [100%]
Updating / installing...
1:prometheus-2:2.3.2-1 ################################# [100%]
[root@lab101 install]# systemctl start prometheus
[root@lab101 install]# netstat -tunlp|grep 9090
tcp6 0 0 :::9090 :::* LISTEN 36163/prometheus

This assumes by default that Prometheus and ceph_exporter run on the same machine, so the targets entry in /etc/ceph/prometheus.yml is 127.0.0.1; change it to the ceph_exporter host's IP as needed.
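A minimal sketch of the relevant part of /etc/ceph/prometheus.yml, assuming ceph_exporter runs on 192.168.19.101 (the job name and scrape interval are assumptions):

scrape_configs:
  - job_name: 'ceph_exporter'
    scrape_interval: 15s
    static_configs:
      - targets: ['192.168.19.101:9128']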

Prometheus listens on port 9090 by default; at this point you can already see the scraped data in the Prometheus web UI.

With the data flowing into Prometheus, the next step is the Grafana configuration.

Install Grafana

[root@lab101 install]# wget http://7xweck.com1.z0.glb.clouddn.com/grafana-5.2.1-1.x86_64.rpm
[root@lab101 install]# yum localinstall grafana-5.2.1-1.x86_64.rpm
[root@lab101 install]# systemctl start grafana-server.service
[root@lab101 install]# netstat -tunlp|grep gra
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp6 0 0 :::3000 :::* LISTEN 36730/grafana-serve

Grafana listens on port 3000 by default.

The default login is admin/admin; after the first login you are forced to change the password.

Configure Grafana

First add a Prometheus data source pointing at port 9090, then import the dashboard.

If the machine has Internet access, simply enter dashboard id 917; if not, copy the ceph-cluster_rev1.json file listed above to the machine and import it.

That completes the configuration.

Summary

For convenience all of the above was packaged as RPMs. From an installation point of view grafana, ceph_exporter, and prometheus are all single binaries, so a little bundling greatly lowers the deployment effort: ceph_exporter, for example, normally has to be compiled with Go, and with a prebuilt package that step disappears. And since the interfaces are version-dependent, installing matching versions this way also avoids mistakes.

Everything described in this post is for the Jewel-compatible versions.

Change log

Why Who When
Created 武汉-运维-磨渣 2018-07-17

Source: zphj1987@gmail (快速构建ceph可视化监控系统)



openATTIC 3.7.0 has been released


We’re happy to announce version 3.7.0 of openATTIC!

Version 3.7.0 is the first bugfix release of the 3.7 stable branch, containing fixes for multiple issues that were mainly reported by users.

There has been an issue with self-signed certificates in combination with the RGW proxy which is now configurable. We also improved the openATTIC user experience and adapted some of our frontend tests in order to make them more stable.

As mentioned in our last blog post our team was working on a Spanish translation. We are very proud to have the translation included in this release. Thank you Gustavo for your contribution.

Another highlight of the release is the newly added RBD snapshot management. openATTIC is now capable of creating, cloning, rolling back, protecting/unprotecting and deleting RBD snapshots. In addition, it is now also possible to copy RBD images.
Furthermore, the "pool edit" feature received a slight update: we implemented the option to set the "EC overwrite" flag when editing erasure coded pools.


Source: SUSE (openATTIC 3.7.0 has been released)


Making CephFS df show per-pool capacity


Preface

If you use CephFS a lot you probably know that after mounting a CephFS client, the capacity shown is the total capacity of the cluster, i.e. whatever your total raw disk space is.

It has always been displayed this way. Back in the Hammer days, 阿茂 and 大黄 implemented this feature internally at our company, and the community is gradually integrating similar requirements aimed at commercial users.

The community has already developed a version of this and the interfaces are mostly in place, so with a small change we can get the behavior we want.

The change in this post is made to the kernel client code; it is small and should be easy to follow.

Making the change

First find this patch:

Improve accuracy of statfs reporting for Ceph filesystems comprising exactly one data pool. In this case, the Ceph monitor can now report the space usage for the single data pool instead of the global data for the entire Ceph cluster. Include support for this message in mon_client and leverage it in ceph/super.

Link: https://www.spinics.net/lists/ceph-devel/msg37937.html

It says this improves the statfs reporting (statfs is what backs the output of df and mount on Linux), specifically in the single-data-pool case: with one data pool, the pool's space usage is reported instead of the global figures for the whole cluster.
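For reference, df reads these numbers through the statfs call; once a CephFS is mounted (as done further below) you can also query it directly:

stat -f /mnt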

That raises a question: a single data pool? What about multiple data pools? Let's test it.

This patch is already in the default CentOS 7.5 kernel, i.e. kernel version

Linux lab103 3.10.0-862.el7.x86_64

The corresponding RPM packages are:

[root@lab103 ceph]# rpm -qa|grep  3.10.0-862
kernel-devel-3.10.0-862.el7.x86_64
kernel-3.10.0-862.el7.x86_64

Download from:

http://mirrors.163.com/centos/7.5.1804/os/x86_64/Packages/kernel-3.10.0-862.el7.x86_64.rpm

Or just install CentOS 7.5; the only requirement is this kernel.

Let's look at the default behavior first.

[root@lab102 ~]# ceph -s
data:
pools: 3 pools, 72 pgs
objects: 22 objects, 36179 bytes
usage: 5209 MB used, 11645 GB / 11650 GB avail
pgs: 72 active+clean

[root@lab102 ~]# ceph fs ls
name: ceph, metadata pool: metadata, data pools: [data ]
[root@lab102 ~]# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
11650G 11645G 5209M 0.04
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
data 9 0 0 3671G 0
metadata 10 36179 0 11014G 22
newdata 11 0 0 5507G 0
[root@lab102 ~]# ceph osd dump|grep pool
pool 9 'data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 136 flags hashpspool stripe_width 0 application cephfs
pool 10 'metadata' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 112 flags hashpspool stripe_width 0 application cephfs
pool 11 'newdata' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 134 flags hashpspool stripe_width 0 application cephfs

From the above, my raw disk space is about 12T; the data pool is 3-replica so its usable space is about 4T, and the filesystem has a single data pool. Let's look at the mount:

[root@lab101 ~]# uname -a
Linux lab101 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@lab101 ~]# df -Th|grep mnt
192.168.19.102:/ ceph 3.6T 0 3.6T 0% /mnt

You can see that the capacity displayed as the total (3.6T) is the data pool's available capacity, not the cluster total. Now let's add another data pool.

[root@lab102 ~]# ceph mds add_data_pool newdata
added data pool 11 to fsmap

Check the df output again:

[root@lab101 ~]# df -Th|grep mnt
192.168.19.102:/ ceph 12T 5.1G 12T 1% /mnt

The capacity goes back to the original, cluster-wide display, which is exactly what the patch described above leads us to expect. Let's look at how the code controls this.

Getting the source code for the current kernel version

First, find the src.rpm package for the current kernel, which gives us the source for exactly this kernel version:

wget http://vault.centos.org/7.5.1804/os/Source/SPackages/kernel-3.10.0-862.el7.src.rpm

Unpack the source package:

[root@lab103 origin]# rpm2cpio kernel-3.10.0-862.el7.src.rpm |cpio -div
[root@lab103 origin]# tar -xvf linux-3.10.0-862.el7.tar.xz
[root@lab103 origin]# cd linux-3.10.0-862.el7/fs/ceph/

After the steps above we are inside the source directory we want to look at.
Let's look at the file super.c; the logic that controls the df display lives in this file.

[root@lab103 ceph]# cat super.c |less

Look at this piece of code:

static int ceph_statfs(struct dentry *dentry, struct kstatfs *buf)
{
        struct ceph_fs_client *fsc = ceph_inode_to_client(dentry->d_inode);
        struct ceph_monmap *monmap = fsc->client->monc.monmap;
        struct ceph_statfs st;
        u64 fsid;
        int err;
        u64 data_pool;

        if (fsc->mdsc->mdsmap->m_num_data_pg_pools == 1) {
                data_pool = fsc->mdsc->mdsmap->m_data_pg_pools[0];
        } else {
                data_pool = CEPH_NOPOOL;
        }

        dout("statfs\n");
        err = ceph_monc_do_statfs(&fsc->client->monc, data_pool, &st);
        if (err < 0)
                return err;

The check fsc->mdsc->mdsmap->m_num_data_pg_pools == 1 together with data_pool = fsc->mdsc->mdsmap->m_data_pg_pools[0]; means: if the filesystem contains exactly one data pool, data_pool is set to that pool and statfs reports that pool's capacity; with more than one pool it shows the global capacity instead. This matches exactly what we observed in the test above.

Now let's change the requirement a little, building on the functionality that is already there:

We want to be able to choose, according to our own needs, which data pool's capacity is displayed, by passing a parameter to the kernel client at mount time.

Code changes

[root@lab103 ceph]# vim super.h
Define a default value in super.h:

#define ZP_POOL_DEFAULT      0  /* pool id */
#define CEPH_CAPS_WANTED_DELAY_MAX_DEFAULT 60 /* cap release delay */
struct ceph_mount_options {
        int flags;
        int sb_flags;

        int wsize;   /* max write size */
        int rsize;   /* max read size */
        int zp_pool; /* pool id */
        int rasize;  /* max readahead */

Two additions here: zp_pool and ZP_POOL_DEFAULT.
That is all that changes in this file.

Modifying the super.c code
Add Opt_zp_pool to the enum:

enum {
        Opt_wsize,
        Opt_rsize,
        Opt_rasize,
        Opt_caps_wanted_delay_min,
        Opt_zp_pool,

Add the Opt_zp_pool entry to match_table_t fsopt_tokens. Note that the value we pass is the pool's index within the filesystem's data pool list, i.e. the order shown by ceph fs ls (in this environment data is 0 and newdata is 1):

static match_table_t fsopt_tokens = {
        {Opt_wsize, "wsize=%d"},
        {Opt_rsize, "rsize=%d"},
        {Opt_rasize, "rasize=%d"},
        {Opt_caps_wanted_delay_min, "caps_wanted_delay_min=%d"},
        {Opt_zp_pool, "zp_pool=%d"},

Add the following in static int parse_fsopt_token:

        case Opt_caps_wanted_delay_max:
                if (intval < 1)
                        return -EINVAL;
                fsopt->caps_wanted_delay_max = intval;
                break;
        case Opt_zp_pool:
                if (intval < 0)
                        return -EINVAL;
                fsopt->zp_pool = intval;
                break;
        case Opt_readdir_max_entries:
                if (intval < 1)
                        return -EINVAL;
                fsopt->max_readdir = intval;
                break;

If the value is less than 0 we return an error; the index starts at 0 and counts up, so negative values are not allowed. Note that there is no upper-bound check here; as we will see later, an index beyond the number of data pools effectively falls back to the global capacity.

Add the following in static int parse_mount_options:

        fsopt->caps_wanted_delay_min = CEPH_CAPS_WANTED_DELAY_MIN_DEFAULT;
        fsopt->zp_pool = ZP_POOL_DEFAULT;
        fsopt->caps_wanted_delay_max = CEPH_CAPS_WANTED_DELAY_MAX_DEFAULT;

Add the following in static int ceph_show_options:

        if (fsopt->caps_wanted_delay_min != CEPH_CAPS_WANTED_DELAY_MIN_DEFAULT)
                seq_printf(m, ",caps_wanted_delay_min=%d",
                           fsopt->caps_wanted_delay_min);
        if (fsopt->zp_pool)
                seq_printf(m, ",zp_pool=%d",
                           fsopt->zp_pool);
        if (fsopt->caps_wanted_delay_max != CEPH_CAPS_WANTED_DELAY_MAX_DEFAULT)
                seq_printf(m, ",caps_wanted_delay_max=%d",
                           fsopt->caps_wanted_delay_max);

This is what prints the option's value in the output of the mount command.
With the changes so far in place, let's double-check everything we have changed in super.c:

[root@lab103 ceph]# cat super.c |grep zp_pool
Opt_zp_pool,
{Opt_zp_pool, "zp_pool=%d"},
case Opt_zp_pool:
fsopt->zp_pool = intval;
fsopt->zp_pool = ZP_POOL_DEFAULT;
if (fsopt->zp_pool)
seq_printf(m, ",zp_pool=%d",
fsopt->zp_pool);

With the changes above the parameter can now be passed in at mount time; next we need to deliver it to the place that actually uses it,
namely static int ceph_statfs.

Add the line struct ceph_mount_options *fsopt = fsc->mount_options; inside static int ceph_statfs:

static int ceph_statfs(struct dentry *dentry, struct kstatfs *buf)
{
        struct ceph_fs_client *fsc = ceph_inode_to_client(dentry->d_inode);
        struct ceph_monmap *monmap = fsc->client->monc.monmap;
        struct ceph_statfs st;
        struct ceph_mount_options *fsopt = fsc->mount_options;
        u64 fsid;

Then change the fsc->mdsc->mdsmap->m_num_data_pg_pools == 1 check; testing for greater than 0 is enough:

        if (fsc->mdsc->mdsmap->m_num_data_pg_pools > 0) {
                data_pool = fsc->mdsc->mdsmap->m_data_pg_pools[fsopt->zp_pool];
        } else {
                data_pool = CEPH_NOPOOL;
        }

And replace the hard-coded 0 with our variable fsopt->zp_pool.

That completes the code changes, but we are not done yet: we still need to build the module we want.

[root@lab103 ceph]# modinfo ceph
filename: /lib/modules/3.10.0-862.el7.x86_64/kernel/fs/ceph/ceph.ko.xz

You can see that newer kernels ship the module compressed with xz, so there will be one extra step to handle later.
We only need this single module, so we just build ceph.ko on its own.
Building requires the kernel-devel package, kernel-devel-3.10.0-862.el7.x86_64.

[root@lab103 ceph]# pwd
/home/origin/linux-3.10.0-862.el7/fs/ceph
[root@lab103 ceph]# make CONFIG_CEPH_FS=m -C /lib/modules/3.10.0-862.el7.x86_64/build/ M=`pwd` modules
make: Entering directory `/usr/src/kernels/3.10.0-862.el7.x86_64'
CC [M] /home/origin/linux-3.10.0-862.el7/fs/ceph/super.o
CC [M] /home/origin/linux-3.10.0-862.el7/fs/ceph/inode.o
CC [M] /home/origin/linux-3.10.0-862.el7/fs/ceph/dir.o
CC [M] /home/origin/linux-3.10.0-862.el7/fs/ceph/file.o
CC [M] /home/origin/linux-3.10.0-862.el7/fs/ceph/locks.o
CC [M] /home/origin/linux-3.10.0-862.el7/fs/ceph/addr.o
CC [M] /home/origin/linux-3.10.0-862.el7/fs/ceph/ioctl.o
CC [M] /home/origin/linux-3.10.0-862.el7/fs/ceph/export.o
CC [M] /home/origin/linux-3.10.0-862.el7/fs/ceph/caps.o
CC [M] /home/origin/linux-3.10.0-862.el7/fs/ceph/snap.o
CC [M] /home/origin/linux-3.10.0-862.el7/fs/ceph/xattr.o
CC [M] /home/origin/linux-3.10.0-862.el7/fs/ceph/mds_client.o
CC [M] /home/origin/linux-3.10.0-862.el7/fs/ceph/mdsmap.o
CC [M] /home/origin/linux-3.10.0-862.el7/fs/ceph/strings.o
CC [M] /home/origin/linux-3.10.0-862.el7/fs/ceph/ceph_frag.o
CC [M] /home/origin/linux-3.10.0-862.el7/fs/ceph/debugfs.o
CC [M] /home/origin/linux-3.10.0-862.el7/fs/ceph/acl.o
LD [M] /home/origin/linux-3.10.0-862.el7/fs/ceph/ceph.o
Building modules, stage 2.
MODPOST 1 modules
CC /home/origin/linux-3.10.0-862.el7/fs/ceph/ceph.mod.o
LD [M] /home/origin/linux-3.10.0-862.el7/fs/ceph/ceph.ko
make: Leaving directory `/usr/src/kernels/3.10.0-862.el7.x86_64'

If everything went well, you should see output like the above with no errors.
Compress the ko module:

[root@lab103 ceph]# find * -name '*.ko' | xargs -n 1 xz
[root@lab103 ceph]# rmmod ceph
[root@lab103 ceph]# rm -rf /lib/modules/3.10.0-862.el7.x86_64/kernel/fs/ceph/ceph.ko.xz
[root@lab103 ceph]# cp -ra ceph.ko.xz /lib/modules/3.10.0-862.el7.x86_64/kernel/fs/ceph/
[root@lab103 ceph]# lsmod |grep ceph
ceph 345111 0
libceph 301687 1 ceph
dns_resolver 13140 1 libceph
libcrc32c 12644 2 xfs,libceph

The module is now loaded, so let's try it out.

[root@lab103 ceph]# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
11650G 11645G 5210M 0.04
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
data 9 0 0 3671G 0
metadata 10 36391 0 11014G 22
newdata 11 0 0 5507G 0

[root@lab103 ceph]# mount -t ceph 192.168.19.102:/ /mnt
[root@lab103 ceph]# df -h|grep mnt
192.168.19.102:/ 3.6T 0 3.6T 0% /mnt
[root@lab103 ceph]# ceph fs ls
name: ceph, metadata pool: metadata, data pools: [data newdata ]

We set the default pool index to 0, so what is displayed now is the capacity of data, as expected. Suppose we want to display the newdata pool instead:

[root@lab103 ceph]# mount -t ceph 192.168.19.102:/ /mnt -o zp_pool=1
[root@lab103 ceph]# df -h|grep mnt
192.168.19.102:/ 5.4T 0 5.4T 0% /mnt

So far we can show either pool 0 or pool 1; what if we want the global capacity? Just pass an index that does not exist:

[root@lab103 ceph]# mount -t ceph 192.168.19.102:/ /mnt -o zp_pool=1000
[root@lab103 ceph]# mount|grep ceph|grep zp_pool
192.168.19.102:/ on /mnt type ceph (rw,relatime,acl,wsize=16777216,zp_pool=1000)
[root@lab103 ceph]# df -h|grep mnt
192.168.19.102:/ 12T 5.1G 12T 1% /mnt

You could also change the code so that passing something like all selects the global value; here we simply use a nonexistent index to drop through to the global-capacity logic, which is simpler.
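For anyone who would rather not rely on an out-of-range array index, a minimal alternative sketch (using the same fields introduced above; this is not what was tested in this post) is to bound-check the index and fall back to CEPH_NOPOOL explicitly:

        if (fsc->mdsc->mdsmap->m_num_data_pg_pools > 0 &&
            fsopt->zp_pool < fsc->mdsc->mdsmap->m_num_data_pg_pools) {
                /* the index names a real data pool: report that pool */
                data_pool = fsc->mdsc->mdsmap->m_data_pg_pools[fsopt->zp_pool];
        } else {
                /* anything else falls back to cluster-wide totals */
                data_pool = CEPH_NOPOOL;
        }

With this variant, zp_pool=1000 still ends up showing the global capacity, but through the explicit CEPH_NOPOOL path rather than by reading past the end of m_data_pg_pools.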

The mount options used can be checked with the mount command.

With that, the change is complete and meets the requirement.

Summary

The topics touched on in this post include fetching and unpacking the source of an rpm package, building a single kernel module on its own, modifying and replacing an individual module, and passing a custom parameter to the CephFS kernel client at mount time. The third article on this blog has a similar example of building a standalone ext4 module.

Changelog

Why Who When
Created 武汉-运维-磨渣 2018-08-20

Source: zphj1987@gmail (cephfs根据存储池显示df容量)

The post cephfs根据存储池显示df容量 appeared first on Ceph.

mountpoint presentation of Ceph Nano

Ceph meetup Paris

OpenStack Summit Berlin: Distributed Hyperconvergence Pushing Openstack and Ceph to the Edge
