Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcdctl endpoint health should check for alarms #12032

Open
jingyih opened this issue Jun 20, 2020 · 8 comments · May be fixed by #12150
Open

etcdctl endpoint health should check for alarms #12032

jingyih opened this issue Jun 20, 2020 · 8 comments · May be fixed by #12150

Comments

@jingyih
Copy link
Member

@jingyih jingyih commented Jun 20, 2020

Today, etcdctl endpoint health only performs a quorum GET. If server is corrupted or exceeded quota (which put the server into read only mode), this command incorrectly returns healthy.

@cfc4n
Copy link
Contributor

@cfc4n cfc4n commented Jun 20, 2020

Do you have any more information?

@jingyih
Copy link
Member Author

@jingyih jingyih commented Jun 20, 2020

Yes. Ref: #11973

@agargi
Copy link
Contributor

@agargi agargi commented Jun 21, 2020

@jingyih Should "endpoint health" not simply call '/health' endpoint on each member instead of performing a quorum read on cluster?

@dvtkrlbs
Copy link

@dvtkrlbs dvtkrlbs commented Jun 21, 2020

i can take look at this issue but the problem is am not that familiar with the project. so can anyone guide me through this

@jingyih
Copy link
Member Author

@jingyih jingyih commented Jun 22, 2020

@jingyih Should "endpoint health" not simply call '/health' endpoint on each member instead of performing a quorum read on cluster?

I think by design etcdctl uses clientv3 library, for example the KV API [1]. The library communicates with etcd server via gRPC, not via HTTP. I think we just need to add additional call to make sure there is no alarm raised at the server by calling the clinetv3 maintenance API [2].

[1]

type KV interface {

[2]

type Maintenance interface {
// AlarmList gets all active alarms.
AlarmList(ctx context.Context) (*AlarmResponse, error)

@jingyih
Copy link
Member Author

@jingyih jingyih commented Jun 22, 2020

@gyuho @spzala Just want to double check with you. Is the current behavior of etcdctl endpoint health by design? I think if the server has alarms, this should return false.

$ etcdctl put foo bar
{"level":"warn","ts":"2020-06-22T02:44:40.368-0700","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-1fb6632a-4ac3-4f99-a1c3-c84860cddfb1/127.0.0.1:2379","attempt":0,"error":"rpc error: code = ResourceExhausted desc = etcdserver: mvcc: database space exceeded"}
Error: etcdserver: mvcc: database space exceeded

$ etcdctl endpoint health -w json
[{"endpoint":"127.0.0.1:2379","health":true,"took":"1.339424ms"}]
@cfc4n
Copy link
Contributor

@cfc4n cfc4n commented Jul 10, 2020

use endpoint status can do it .

$ ./bin/etcdctl endpoint status -w json  
[{"Endpoint":"127.0.0.1:2379","Status":{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":1,"raft_term":4},"version":"3.5.0-pre","dbSize":24576,"leader":10276657743932975437,"raftIndex":11,"raftTerm":4,"raftAppliedIndex":11,"errors":["memberID:10276657743932975437 alarm:NOSPACE "],"dbSizeInUse":24576}}]

and /health is a metric api,not a standard api.

@agargi
Copy link
Contributor

@agargi agargi commented Jul 20, 2020

@jingyih Should "endpoint health" not simply call '/health' endpoint on each member instead of performing a quorum read on cluster?

I think by design etcdctl uses clientv3 library, for example the KV API [1]. The library communicates with etcd server via gRPC, not via HTTP. I think we just need to add additional call to make sure there is no alarm raised at the server by calling the clinetv3 maintenance API [2].

[1]

type KV interface {

[2]

type Maintenance interface {
// AlarmList gets all active alarms.
AlarmList(ctx context.Context) (*AlarmResponse, error)

@jingyih I have made the change and issued a pull request. /cc @gyuho @xiang90

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Linked pull requests

Successfully merging a pull request may close this issue.

4 participants
You can’t perform that action at this time.