SharePoint Performance Tuning Emergency
Hello. I’m Dr Kinley. In this chapter of Dr Kinley’s Casebook, we look at the Case of the SharePoint Performance Tuning Emergency. Last week, a new customer got in touch. Their business users were screaming: SharePoint had slowed to a crawl. Their three-server SharePoint farm used to perform well, but had recently started taking a long time to load pages and was frequently returning time-outs and other errors. We managed to get them back on track in a matter of hours.
SharePoint Performance Tuning Triage
The first question we ask under these circumstances is this: what’s changed? As it turned out, there were quite a few recent changes to their environment.
Firstly, they had recently – in the last few months – moved from hosting the farm on VMware to Hyper-V. That’s the virtual equivalent of changing datacentres.
Secondly, they had very recently removed their Search Service Application in an attempt to resolve other errors.
But how would these changes suddenly require SharePoint performance tuning?
Diagnosis: Internet Connectivity
After asking the old, usual chestnuts, “What customisations have you made? What custom software has been deployed?”, it was time for the dog to see the rabbit. We started a screen sharing session using the customer’s favoured sharing app.
Pages were definitely taking a long time to load. 30-60 seconds, usually accompanied by a time-out and an error.
A very cursory glance at the Windows NLB solution they had in place quickly eliminated that from the pool of likely candidates.
Next, we checked Internet connectivity on the servers. Could we launch a browser from the SharePoint servers and reach Google? No, we couldn’t, although, anecdotally, the servers had this capability when they were hosted on VMware. This is not a limitation of Hyper-V, of course; it was simply something omitted during the move from one environment to the other.
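The browser test can be repeated from a script on each server. What follows is a minimal sketch rather than the tooling we used on the day: the function name can_reach, the choice of port 80 and the five-second time-out are all assumptions made for illustration.

```python
import socket


def can_reach(host: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    This only proves that an outbound connection can be opened; it does not
    download anything or validate any certificate.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and time-outs alike.
        return False
```

Calling can_reach("crl.microsoft.com") on each SharePoint server should return True where outbound access exists. A False result only says the connection could not be opened; it does not say why, and a server that reaches the Internet only through a web proxy may also return False from this direct test.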
If your SharePoint servers have no Internet connectivity, and nobody has invested in SharePoint performance tuning to compensate, you will get poor page load times.
Certificate Revocation List Hell
The problem with not having Internet connectivity on SharePoint servers is twofold, and related to digital certificates.
Firstly, SharePoint uses .NET assemblies (DLL files) signed by Microsoft for security. When an assembly is loaded – pretty much whenever SharePoint does anything at all: loading pages, running timer jobs, anything – .NET needs to check that the certificate used to sign the assembly is still valid. To do this, it reaches out across the Internet to crl.microsoft.com and requests the list of digital certificates that Microsoft has revoked. Assuming success, it checks this list, and if the assembly’s certificate is on it, .NET prevents the assembly from loading and raises a security error. But, I hear you cry, what happens if there is no Internet connection? Surely it won’t prevent SharePoint from running? What happens is this: it tries to reach crl.microsoft.com; it fails after a 15-30 second time-out; it then shrugs its shoulders, says “what the hell”, and loads the assembly anyway.
This problem becomes obvious if you try and run the old STSADM tool from the command line. If it returns within a second or so, you’re good to go. If it waits 15 or so seconds, then you know you have a CRL problem.
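The stopwatch test can be scripted too. The sketch below times any command and flags a suspiciously slow start. The helper names and the 10-second cut-off are assumptions (chosen to sit well above a normal start-up and below the 15 or so seconds described above), so adjust them to suit your farm.

```python
import subprocess
import time
from typing import Sequence

# Assumed cut-off: comfortably above a normal start-up, below the ~15 s CRL time-out.
CRL_DELAY_THRESHOLD_SECONDS = 10.0


def time_command(argv: Sequence[str]) -> float:
    """Run a command, discard its output, and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(
        list(argv),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,  # we only care how long it took, not whether it succeeded
    )
    return time.perf_counter() - start


def looks_like_crl_delay(
    elapsed_seconds: float,
    threshold: float = CRL_DELAY_THRESHOLD_SECONDS,
) -> bool:
    """Return True if the start-up time is long enough to suggest a CRL time-out."""
    return elapsed_seconds >= threshold
```

On a SharePoint server you might pass something like ["stsadm.exe", "-help"] to time_command, assuming the tool is on the path, and feed the result to looks_like_crl_delay. A slow result is a hint, not proof: a cold or heavily loaded server can also be slow to start a process.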
Secondly, SharePoint uses SSL certificates to encrypt all traffic between servers in the SharePoint farm, for instance when Server A needs to call a service on Server B, perhaps the Search Service or the Managed Metadata Service. The certificate used for all the servers in the farm is called the SharePoint Root certificate. A default installation of SharePoint does not trust the issuer of this certificate, so every time it is used, the server needs to confirm the certificate is still valid by reaching out to crl.microsoft.com again. If it gets no response after a 15-second time-out, it will simply error and abort whatever it was trying to invoke on the other server.
Both these issues are resolvable. To make the problem go away, the servers can be given outbound Internet connectivity. However, not all security teams will sanction this, so it’s not a one-size-fits-all solution. There are also parts of SharePoint that insist on asking the server for a web/HTTP proxy before continuing. Commands like “netsh winhttp set proxy my.proxy.server:portnumber” will help with this, although your security team may wish to limit the Internet access of the SharePoint farm to crl.microsoft.com and nothing else.
But we still need to tell .NET and SharePoint not to attempt certificate revocation list (CRL) checking. After all, if there is no Internet access, you can’t perform the check anyway. We can disable some CRL checks outright, and persuade Windows that the SharePoint root certificate is to be trusted without a CRL check by adding it to the Trusted Root Certification Authorities store in Windows.
However, it’s never easy, and there’s no single master switch. Have a look at the references below to help you resolve them yourself.
Further SharePoint Performance Tuning
Having applied these various fixes, the server farm was back to its old self. In fact, it was faster for some pages. However, there were still pages that, every so often, seemed to take a very long time to return results.
There were two further problems:
Data Volume Tipping Points
Their data seemed to have reached a tipping point. Past a certain number of rows, and without any SharePoint performance tuning, you will notice poor page load times.
They made heavy use of roll-up queries, with a mix of out-of-the-box and third-party web parts performing recursive scoped queries throughout their site collection. It wasn’t that their data was big in terms of gigabytes, just big in terms of numbers of rows. They had one site collection for most of their data, and it was less than 100GB. Their roll-up web parts, however, produced queries that trawled through thousands of list items and SQL rows. Under SharePoint’s default configuration, that would have triggered query throttling, and the errors would have been spotted much sooner. Close inspection through Central Admin showed that someone had previously increased the throttling limit from 5,000 list items to 50,000. This was high enough to effectively switch off throttling altogether.
The problem with this is down to the fundamental SQL table design used by SharePoint. All the list items and library items, in all the sites and subsites, in all the site collections that share a content database – in effect everything – are stored in one big, wide table called “AllUserData”. Any SQL query that attempts to lock more than 5,000 SQL rows – even for reading – will trigger an escalation to a whole-table lock. When that happens, everybody loses. Every query on anything in that table (list/library items, lists/libraries, sites, everything) now has to wait for the query with the table lock to complete.
Setting the threshold back to 5,000 items and (temporarily, until an alternative strategy can be used) disabling the third-party web parts returned stability to the farm. In the medium term, the customer needs to visit each list used in these queries and turn on suitable column indexing for each one. In the long term, they should consider moving to a search-based strategy.
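To decide which lists to visit first, it helps to rank them by size against the 5,000-item threshold. The sketch below assumes you have already gathered item counts per list by some other means, such as an inventory export; the function name and the sample list names are hypothetical.

```python
from typing import Dict, List, Tuple

# The default throttling limit discussed above.
LIST_VIEW_THRESHOLD = 5000


def lists_over_threshold(
    item_counts: Dict[str, int],
    threshold: int = LIST_VIEW_THRESHOLD,
) -> List[Tuple[str, int]]:
    """Return (list name, item count) pairs that exceed the threshold, largest first."""
    over = [(name, count) for name, count in item_counts.items() if count > threshold]
    return sorted(over, key=lambda pair: pair[1], reverse=True)


# Hypothetical inventory, for illustration only.
sample_counts = {
    "Projects": 1200,
    "Timesheets": 48000,
    "Documents": 7300,
    "Announcements": 40,
}

for name, count in lists_over_threshold(sample_counts):
    print(f"{name}: {count} items, review its indexed columns")
```

The largest lists come out first, which is usually the order in which indexing pays off. A list under the threshold can still be swept up by a recursive roll-up query, so treat the output as a starting point rather than a complete answer.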
Rogue Antivirus Misconfiguration: Check Your Exclusions!
Does antivirus configuration on non-SharePoint servers count as SharePoint performance tuning?
Well, the customer spotted and resolved this problem themselves: they had installed an antivirus program on their database server. Although MDF and LDF (database and log) files had been explicitly excluded from scanning, there is a third kind of database file – NDF (“secondary data files”) – used in SharePoint databases such as the Web Analytics and Logging databases. These files are usually smaller than the rest of your SharePoint content, but when on-access scanning kicks in, it effectively locks out everything else (including SQL Server itself) for the duration of the scan… which was around 30 seconds. You can resolve this by adding NDF files to the exclusion list in your antivirus program. They resolved this by uninstalling the antivirus on their database servers.
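A quick way to catch this kind of gap is to compare your antivirus exclusion list against all three SQL Server file types. The sketch below assumes you can export the exclusions as a simple list of patterns such as "*.mdf"; the function names and the pattern format are assumptions, since every antivirus product stores its exclusions differently.

```python
from typing import Iterable, List

# The three SQL Server file types mentioned above: data, log and secondary data files.
REQUIRED_EXTENSIONS = (".mdf", ".ldf", ".ndf")


def normalise(pattern: str) -> str:
    """Reduce patterns like '*.MDF', '.mdf' or 'mdf' to the form '.mdf'."""
    cleaned = pattern.strip().lower()
    if cleaned.startswith("*"):
        cleaned = cleaned[1:]
    if not cleaned.startswith("."):
        cleaned = "." + cleaned
    return cleaned


def missing_exclusions(excluded_patterns: Iterable[str]) -> List[str]:
    """Return the required extensions that the exclusion list does not cover."""
    covered = {normalise(pattern) for pattern in excluded_patterns}
    return [ext for ext in REQUIRED_EXTENSIONS if ext not in covered]
```

With the customer’s original configuration, missing_exclusions(["*.mdf", "*.ldf"]) would report ".ndf" as missing. The check only looks at file extensions; it knows nothing about folder-level or process-level exclusions, which your antivirus product may also need.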
Conclusion and Wrap Up
Once their servers were performing again, we were asked to perform a full health check on their various SharePoint farms. We usually take a couple of days for small farms, and up to a week for larger environments. We run tooling to capture the configuration of your farms and bring those back to base for further scrutiny. We then write a report to document all the key settings and highlight areas of concern, and suggest appropriate changes to meet best practices.
SharePoint Performance Tuning References
Table lock issues: