Monday, August 1, 2022

I do not recommend Domino 12.0.1 even with FP1 - still many issues

Hi

Unfortunately, FP1 for Domino 12.0.1 still contains some important bugs.

I know of two different Domino environments running 12.0.1 FP1 (at two different companies) that have constantly experienced issues since moving to Domino 12.0.1 and then to 12.0.1 FP1.

Before FP1 there were issues with DAOS.

Now, with FP1, there are issues with other tasks (mainly Replicator), and every case was accompanied by the error "The caller's SemWait timeout expired." for a specific database.

Here is a common case:

Environment: two or more Domino 12.0.1 FP1 servers in a cluster, Windows Server 2016 or higher

The case: you start to see errors like

[1358:0002-135C] 16.07.2022 22:07:47 Unable to replicate <SERVER> <DATABASE>: The caller's SemWait timeout expired.
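
To see which databases are affected and how often, the console log can be scanned for this message. Below is a minimal Python sketch, assuming log lines follow exactly the format shown above (a bracketed prefix, a `DD.MM.YYYY HH:MM:SS` timestamp, then the message). The function name `count_semwait_errors` and the sample server and database names are hypothetical, and the date format may differ on servers with other regional settings.

```python
import re
from collections import Counter

SEMWAIT = "The caller's SemWait timeout expired."

# Assumed line layout, taken from the single example in this post:
# [prefix] DD.MM.YYYY HH:MM:SS Unable to replicate <SERVER> <DATABASE>: <error>
LINE_RE = re.compile(
    r"^\[[^\]]*\]\s+"
    r"(?P<ts>\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2})\s+"
    r"Unable to replicate (?P<target>.+?): "
    + re.escape(SEMWAIT)
)


def count_semwait_errors(lines):
    """Count SemWait replication errors per '<SERVER> <DATABASE>' string."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            # Server and database are kept together as one string, because
            # server names may contain spaces and cannot be split reliably.
            counts[match.group("target")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; replace with lines read from a real log file.
    sample = [
        "[1358:0002-135C] 16.07.2022 22:07:47 Unable to replicate "
        "SRV1/Acme mail\\jdoe.nsf: The caller's SemWait timeout expired.",
        "[1358:0002-135C] 16.07.2022 22:08:01 Some unrelated console message",
    ]
    for target, n in count_semwait_errors(sample).most_common():
        print(n, target)
```

A database that keeps showing up in this count is a candidate for the stuck state described below.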

It is unclear why this started happening, but some of the cases might be connected with a stuck nCompact.exe on the affected database and high CPU usage by the MTA task. It is also not clear whether nCompact hung and caused the error "The caller's SemWait timeout expired." or whether the error caused nCompact to hang.

There was no way to fix this other than restarting Domino.

Sometimes Domino even refused to shut down properly: most of the tasks quit successfully, but the final message "server shutdown complete" never appeared, so in some cases it was necessary to kill the server.

After the server started again, you could see a new error for the database: "<DATETIME> Database <DATABASE> time is too far in the future."

Checking the database icon showed that its modification time was in the future.


Again, it was not clear how to fix this new error:
- nFixup.exe ran but didn't fix anything
- copy-style compact refused to run, with the same error that the database time was in the future
- design refresh didn't change anything

The only way was to delete the replica and create it again by replicating from another cluster mate. This has already happened several times at each company.

One more interesting point about all this: several users whose mail databases were affected complained that emails they had deleted or moved to custom folders in their mailboxes suddenly returned to the Inbox. My guess is that the users worked in another replica while the affected one was inaccessible; then the server restart cleared the error "The caller's SemWait timeout expired", and the cluster replicator replicated changes from the previously inaccessible replica back to the normal replicas. Since the previously inaccessible replica looked more recent (because of its time in the future), some or maybe even all of the changes the users had made in the normal replicas were overwritten with what was actually old data from the replica that was in the future.
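
The guess above can be illustrated with a toy model. This is only a sketch of a "later timestamp wins" rule, not Domino's actual replication algorithm, which involves more than a single timestamp; the names `DocVersion` and `pick_winner` and all dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DocVersion:
    folder: str          # where the user sees the message
    modified: datetime   # modification time recorded by the replica


def pick_winner(a: DocVersion, b: DocVersion) -> DocVersion:
    """Toy conflict rule: the version with the later timestamp wins."""
    return a if a.modified >= b.modified else b


# Old state held by the replica whose clock had crept into the future.
stale = DocVersion(folder="Inbox", modified=datetime(2022, 8, 1, 0, 0))

# The user's real, more recent action in a healthy replica.
fresh = DocVersion(folder="Custom folder", modified=datetime(2022, 7, 17, 9, 0))

winner = pick_winner(stale, fresh)
print(winner.folder)  # the stale "Inbox" state wins under this toy rule
```

Under this simplified rule, the stale version wins purely because its timestamp is later, which matches what the users reported.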

7 comments:

  1. Hi, did you open a support case with HCL specifically for 12.0.1 FP1?

    1. Keep us posted about the outcome of the support ticket.

  2. What you are seeing is time creep: time-related actions (e.g. creating a unique number, creating a document unique ID, creating a new document, creating new databases, etc.) are performed so often that Domino's internal time moves into the future. This is not a product issue; it is working as designed. It is typically caused by your own, self-developed applications, not by the product. The error message you are seeing is a warning to admins to watch out before more damage is caused. Typically, the root cause of time creep is an agent that is going crazy (e.g. creating millions of documents in a short period of time). So please take a look at what your agents are doing...

    1. All the issues happened with mail databases without any customizations. There were no custom agents dealing with the mail databases either.
      We never had such issues before moving to 12.0.1.
      We didn't change server config and didn't develop/introduce anything new.
      Besides that, HCL confirmed that 12.0.1 had several deadlock errors related to DAOS (see https://ds_infolib.hcltechsw.com/ldd/fixlist.nsf?OpenDatabase&Start=1&Count=30&Expand=2.11). This is just another example of a deadlock, but in another place.

    2. Hmm, I don't see that the problem you're facing is related to the DAOS deadlock mentioned in the fixlist. What is the outcome of your support ticket so far?

    3. The text of the DAOS-related error was exactly the same; see my earlier post https://ypastov.blogspot.com/2022/01/both-domino-1201-and-if1-contain-bug.html
      There is no outcome yet; HCL is still investigating the problem...
