What was the issue?
- All of a sudden on a Sunday morning our content editor team started reporting:
- In the CM servers, the Sitecore items(the beauty products) are not getting published to web database.
- the same products are not getting updated to Solr Master and thus to Solr slave
- They also reported that on the CD servers the searches are not showing up the newly updated products.
What is the overall architecture?
- Our Sitecore implementation receives product notifications though a RabbitMQ notification service where the messages are pooled by the product service(repository of all products)
- In Sitrecore, we have written a processor to listen to these notifications and update the product features and attributes.
- Once the updated product feature or attribute is saved in the master database first, then publishing service is called to update it to web database.
- Finally, re-indexing is executed to update it to Solr Master.
- The Solr master through the replication process updates it to Solr Slave.
- While Solr master is used as an indexing destination, Solr slave is used for searches.
- All this is done through the automated process called as “Notification Service” which is nothing but a listening processor over the pipeline.
What was done to resolve the issue after initial investigation?
- We checked for any publishing error logs, there were none.
- We checked for any notification service logs, there were none.
- We also checked the Solr logs, there were none.
- We then restarted the publishing service, reset the app pools on the CM and CD servers, reset the Solr master and the slave.
Did this resolve the issue?
- Yes and No.
- Yes because when we restart the CM and CD servers, it will use the initialization pipeline processor to update all the products from the Product service into master and web databases and thus also in Solr master and slave after auot-re-indexing. So we thus saw the new products which were not earlier showing up, they were showing up now.
- No because, when we tried to update any new products, it was again not showing it up in web databases, Solr master or Solr slave.
- This was very peculiar scenario that on server reset the products were getting synched the first time but when we tried to update any other product it will not.
What was the root cause then?
- The products from Sitecore master database were not being saved to web database because of “SQL deadlocking”.
- The deadlock was killing the save event and thus the product was not getting saved and thus also not reflecting in Solr
- The tables which were deadlocking in core, master as well as web databases were –
- EventQueue table,
- History table and
- PublishQueue table
- Sitecore uses these tables to keep track of events, history and publishing and thus know which products to publish or not based on these tables
- If these tables consists hundreds of thousands of records then it will fail to retrieve the info in timely fashion and thus slowing the CPU and memory and thus resulting in deadlocks.
So what was the final solution?
- The solution is :
- you have to prune the data in these 3 tables( say more than 12 hours or so).
- You can archive any data older than 12 hours in an archive table so that you can reclaim it in case of Disaster recovery.
- You can also create a job which regularly cleans up data from these 3 tables, archives it in another table and later clean the archive table too say on weekly basis.
- One of the solutions is mentioned as below
- Commands to clean the data older than 12 hours
- Delete data from EventQueue table
deleteFROM[TheDatabase_Master].[dbo].[EventQueue]where[Created] < DATEADD(HOUR, -4, GETDATE())
deleteFROM[TheDatabase_Core].[dbo].[EventQueue]where[Created] < DATEADD(HOUR, -4, GETDATE())
deleteFROM[TheDatabase_Web].[dbo].[EventQueue]where[Created] < DATEADD(HOUR, -4, GETDATE())
Delete data from History table
deleteFROM[TheDatabase_Core].[dbo].[History]whereCreated < DATEADD(HOUR, -12, GETDATE())
deleteFROM[TheDatabase_Master].[dbo].[History]whereCreated < DATEADD(HOUR, -12, GETDATE())
deleteFROM[TheDatabase_Web].[dbo].[History]whereCreated < DATEADD(HOUR, -12, GETDATE())Delete data from Publishqueue table
deleteFROM[TheDatabase_Core].[dbo].[PublishQueue]whereDate< DATEADD(HOUR, -12, GETDATE());
deleteFROM[TheDatabase_Master].[dbo].[PublishQueue]whereDate< DATEADD(HOUR, -12, GETDATE());
deleteFROM[TheDatabase_Web].[dbo].[PublishQueue]whereDate< DATEADD(HOUR, -12, GETDATE());
- Delete data from EventQueue table
- After cleaning the databases, we saw that the dead locks were gone and now the data was saving into the master and web database as well as the Solr master and slave. This resolved our issue.