Wednesday, December 9, 2015

com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.255.235.17 (Timeout during read)

In a recent project, seemingly randomly, this exception occurred when doing a CQL 'select' statement from a Spring Boot project to Cassandra:

com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.255.235.17 (Timeout during read), /10.255.235.16 (Timeout during read))
...


After a lot of research, some people seemed to have reported the same issue. But no clear answer anywhere. Except that some Cassandra driver versions might be the cause of it: they mark (all) the node(s) as down and don't recognize it when it becomes available again.

But, the strange this is we have over 10 (micro) services running, all running at with least 2 instances. But only one of these services had this timeout problem. So it almost couldn't be the driver.... Though it did seem to be related with not using the connection for a while, because often our end-to-end tests just ran fine, time after time. But after a few hours, the tests would just fail. Then we didn't see the pattern yet...

But, as a test, we decided to let nobody use the environment against which the end-to-end tests run for a few hours; especially also because some of the below articles do mention as a solution to set the heartbeat (keep-alive) of the driver.

And indeed, the end-to-end tests started failing again after the grace period. Then we realized it: all our services have a Spring Boot health-check implemented, which is called every X seconds. EXCEPT the service that has the timeouts; it only recently got connected with Cassandra!

After fixing that, the error disappeared! Of course depending on the healthcheck for a connection staying alive is not the ideal solution. A better solution is probably setting the heartbeat interval on the driver on Cluster creation:

var poolingOptions = new PoolingOptions()
  .SetCoreConnectionsPerHost(1)
  .SetHeartBeatInterval(10000);
var cluster = Cluster
  .Builder()
  .AddContactPoints(hosts).
  .WithPoolingOptions(poolingOptions)
  .Build();


In the end it was the firewall which resets all TCP connections every two hours!

References

Tips to analyse the problem:

Similar error reports



No comments: