vIDM Elasticsearch failing due to idm plugin messing with node count – hard fix

Robert Kloosterhuis, March 29, 2021

I ran into a strange issue with a vIDM 3.3.2.0 appliance today

 

(vIDM, or ‘VMware Identity Manager’, is now called ‘VMware Workspace ONE Access’)

 

The issue I had involved a single-node deployment.

 

This is an important detail. vIDM can be clustered, which means that many of the services it runs internally (RabbitMQ, Elasticsearch) can also be clustered. The ‘fix’ I describe in this post (actually more of a workaround) should only ever be attempted on a single-node deployment. It will break the ability to make your vIDM install clustered. This is unsupported and totally at your own risk.

 

The issue was this:

 

I was getting frequent errors in the vIDM user interface, referencing the Analytics service.

 

 

“Call to Analytics failed with status: 500”

 

The Analytics service is, basically, a local installation of Elasticsearch included on the vIDM appliances, and clustered if you have more than one vIDM node.

 

I don’t have any screenshot myself, but this post also nicely demonstrates what you would see in the vIDM Health Dashboard:

 

VMware Identity Manager Cluster 19.03 – Elastic Search Service Issues

Basically, the ‘Integrated Components’ check in the Health Dashboard would be red. But in my case, no data at all was being produced by Elasticsearch. All the values were ‘unknown’.

 

To troubleshoot, we need to SSH into the vIDM VM with the local ‘sshuser’ account, and then sudo to root.

 

When I tried to troubleshoot, it was obvious that I could not get into the Elasticsearch API at all. It was throwing nothing but 500 errors:

 

curl 'http://localhost:9200/_cluster/health'
{"error":{"root_cause":[{"type":"null_pointer_exception","reason":null}],"type":"null_pointer_exception","reason":null},"status":500}

curl -XGET 'http://localhost:9200/_cat/indices?v'
{"error":{"root_cause":[{"type":"null_pointer_exception","reason":null}],"type":"null_pointer_exception","reason":null},"status":500}

 

The Elasticsearch log can be found here: /opt/vmware/elasticsearch/logs/horizon.log

 

I tailed it and gave Elasticsearch itself a restart:

 

service elasticsearch restart

Stopping elasticsearch: process in pidfile `/opt/vmware/elasticsearch/elasticsearch.pid'done.
horizon-workspace service is running
Waiting for IDM: Ok.
Number of nodes in cluster is : 1
Configuring /opt/vmware/elasticsearch/config/elasticsearch.yml file

 

I then queried its health status a few times while it started, to see if it came up at all.
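A simple poll loop works for this (a sketch; stop it with Ctrl-C once you have seen enough):

```shell
# Poll the local cluster health endpoint once a second while Elasticsearch starts up
while true; do
  curl -s 'http://localhost:9200/_cluster/health?pretty'
  sleep 1
done
```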

 

Briefly, it did, before it died again!

 


curl 'http://localhost:9200/_cluster/health?pretty'
{
  "cluster_name" : "horizon",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 60,
  "active_shards" : 60,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 60,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 50.0
}

When I examined the log, I saw pretty quickly why I was getting an error 500.

 

It shows Elasticsearch starting normally. It then discovers it has 244 indices to clean up (more on that later), so it sets health to yellow. But that is fine; at least it’s not a complete failure.

 

But then something odd happens.

 

Something called ‘com.vmware.idm.elasticsearch.plugin’ makes an appearance and somehow starts messing with the node count that Elasticsearch itself maintains for its cluster.

 

This VMware KB kind of explains what might be going on here: https://kb.vmware.com/s/article/74709. It references a similar but different error, though: a timing situation involving a cluster of more than one node.

 

The point though, is this:

 

‘com.vmware.idm.elasticsearch.plugin’ is a plugin for Elasticsearch that asks IDM for the list of nodes that are expected to be in the cluster. It uses that list to determine how many nodes it should be able to see before a primary can be elected and the cluster formed.

 

Seems logical: Elasticsearch can’t know by itself what kind of cluster topology you build with vIDM, but vIDM does.

 

Based on the log, what seemed to be happening is that Elastic starts normally and loads its config from

/opt/vmware/elasticsearch/config/elasticsearch.yml

This config includes how many cluster nodes are expected (in my case just 1, because there is no cluster).
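You can inspect what vIDM wrote there with a quick grep. This is just a way to look; I haven’t catalogued exactly which discovery keys the appliance generates:

```shell
# Show the discovery-related settings the appliance generated
grep -i 'discovery' /opt/vmware/elasticsearch/config/elasticsearch.yml
```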

 

 

But then, for some reason, the IDM plugin tries to update the running cluster count again, and here something goes wrong. The Elastic cluster service ends up removing its only node, and then, of course, the Elastic service dies. The next message is that the cluster service can no longer connect.

 

 

I have no idea why this is happening, and it was pretty consistent: it happened every time I restarted the Elastic service or rebooted the VM. The config for the IDM plugin is also contained in /opt/vmware/elasticsearch/config/elasticsearch.yml, but it doesn’t keep its own node-count value, so I am not sure why it thinks it can safely tell the cluster service to remove the only node.

 

Anyway, the workaround here is pretty straightforward: simply disable the IDM plugin by setting the ‘discovery.zen.idm.enabled’ value to false in /opt/vmware/elasticsearch/config/elasticsearch.yml.
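A minimal sketch of the change, assuming the setting already exists on its own line in the file (check with grep first, and take a backup):

```shell
# Back up the config, flip the IDM plugin off, then restart Elasticsearch
cp /opt/vmware/elasticsearch/config/elasticsearch.yml /opt/vmware/elasticsearch/config/elasticsearch.yml.bak
sed -i 's/^discovery\.zen\.idm\.enabled:.*/discovery.zen.idm.enabled: false/' \
    /opt/vmware/elasticsearch/config/elasticsearch.yml
service elasticsearch restart
```

If the line is not present in your file, append `discovery.zen.idm.enabled: false` instead of running the sed.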

 

Obviously this is unsupported, so do this at your own risk. If you ever expand the vIDM installation into a cluster, that will now break, and you will have to turn the plugin back on. At that point, it is perhaps best to raise a VMware support ticket about this.

 

Bonus: Cleaning up unassigned shards

 

If your health stays yellow due to a number of ‘unassigned shards’ hanging around forever, you can force-delete the indices they belong to with the following one-liner:

 

curl -s 'http://localhost:9200/_cat/shards' | grep UNASSIGNED | awk '{print $1}' | sort -u | xargs -I{} curl -XDELETE "http://localhost:9200/{}"
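Note that this deletes the entire index each unassigned shard belongs to, not just the stray shard copy, so it’s worth previewing the list first:

```shell
# List the indices that the delete one-liner would remove
curl -s 'http://localhost:9200/_cat/shards' | grep UNASSIGNED | awk '{print $1}' | sort -u
```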
