Sakai Performance Issues

Tracking known Sakai performance issues that we need to address.

sakai_2-5-x

sakai_2-4-x

  • Lance's response to "UC Berkeley in crisis"
    • There has been a lot of excellent input on this issue.  I would add the following comments:
      
      1) Do not discount the possibility of SAK-8932 as Stephen Marquard suggested.  This issue is not limited to Chat but
      can be triggered by any JSF application AFAIK.  The stuck threads could lead to the kind of request backlogs you are
      reporting.  We have only seen this crop up once or twice, but it does seem to be load related so you may be seeing it
      in your environment.
      
      2) Hardware is cheap - Since we have upgraded to eight servers with 10GB heaps (i.e. total 80GB heap), Sakai is
      behaving *MUCH* better under load.  Our hardware change included moving from a 32-bit to a 64-bit OS, cutting the
      number of app servers in half (i.e. 16 --> 8), keeping the total number of CPUs in the cluster at 32, and going
      from 32GB total heap to 80GB (see the sketch after this message).
      
      3) Are you seeing any OutOfMemoryErrors?  Before our 64-bit upgrade, we were seeing 10-15 of these a day.  Since
      the upgrade, I have not seen a single OOM error.
      
      4) Turning off quotas does significantly reduce the amount of XML DOM parsing you will do, but it was not a major
      contributing factor to our stability.
      
      Let us know what we can do to help... L
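
      For a rough sense of what the heap described in point 2 looks like in practice, here is a minimal sketch of the
      JVM options involved, assuming a standard Tomcat setup; the email does not quote the actual flags, so the values
      below are illustrative only:

      # Illustrative Tomcat JVM options (assumed, not quoted from the email)
      # 10GB heap per app server x 8 servers = 80GB total heap; requires a 64-bit JVM and OS
      JAVA_OPTS="-server -Xms10g -Xmx10g"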
      
  • Email from Lance Speelmon
    • https://oncourse.iu.edu/access/wiki/site/3001b886-1069-4fb7-00d5-8db4b3a85f74/home.html
    • Adi,
      
      Let me see if I can outline the changes:
      
      1) DBCP settings we have been running for 2+ years:
      minSize=10
      initialSize=10
      maxSize=50
      
      2) When we started seeing DBCP having problems establishing new database connections, we switched to:
      minSize=50
      initialSize=50
      maxSize=50
      * These settings served us pretty well until we saw the 2x load increase the first week of classes.
      
      3) Once the load really hit we tried:
      minSize=150
      initialSize=150
      maxSize=150
      * We were still seeing errors with creating new database connections and DBCP deadlocks.
      
      4) Our current settings after switching to c3p0:
      minSize=150
      initialSize=150
      maxSize=150
      * We still saw connection errors, but c3p0 was able to cope without any deadlocking.
      
      5) Now that we think we have resolved our Oracle connection issues, we are considering moving to the following settings for c3p0:
      minSize=10
      initialSize=10
      maxSize=150
      * The changes that we think resolved the Oracle connection issues were increasing the number of dispatchers and disabling automatic memory management.
      
      Thanks, L
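
      For reference, pool settings like the ones above are expressed in sakai.properties using the
      @javax.sql.BaseDataSource syntax quoted in the emails below.  The following is only a sketch of step 2, assuming
      Lance's minSize/maxSize correspond to DBCP's minIdle/maxActive; the c3p0 property names differ:

      # Sketch of the step 2 (DBCP) pool in sakai.properties -- illustrative only
      initialSize@javax.sql.BaseDataSource=50
      minIdle@javax.sql.BaseDataSource=50
      maxIdle@javax.sql.BaseDataSource=50
      maxActive@javax.sql.BaseDataSource=50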
      
      Do you have minIdle and maxIdle set? And does maxIdle = maxActive? That will
      ensure you don't create new db connections and will help you scale much
      better.
      
      We have 8 appservers and use:
      
      minIdle@javax.sql.BaseDataSource=1
      maxIdle@javax.sql.BaseDataSource=14
      initialSize@javax.sql.BaseDataSource=15
      maxActive@javax.sql.BaseDataSource=14
      
      With 400 requests per second peak, I don't see why you would
      need 2400 db pool connections -- maybe 400 * 2 for safety, but you are just
      eating PGA unnecessarily with all those connections, and that memory could be
      used for SGA instead (we reduced our PGA from 512m to 256m and haven't seen
      problems).
      
      Adi
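
      To put those numbers in perspective, the cluster-wide pool Adi describes works out to

      8 app servers x 14 maxActive = 112 pooled connections for a ~400 request/second peak

      versus the 2400 connections implied by 150 connections per server across a 16-node cluster (presumably the
      pre-consolidation size mentioned in the first email above), most of which would sit idle while still consuming
      Oracle PGA memory.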
      
  • Email exchange with R. P. Aditya : aditya@grot.org
    • On Fri, Aug 31, 2007 at 11:50:34AM -0700, Thomas Amsler wrote:
      > > Are the 15 connections in the DBCP connection pool your max setting? I think 
      > > the default is max=50 out of the box.
      
      on our 8 appservers, we use:
      
      minIdle@javax.sql.BaseDataSource=1
      maxIdle@javax.sql.BaseDataSource=14
      initialSize@javax.sql.BaseDataSource=15
      maxActive@javax.sql.BaseDataSource=14
      
      and in typical use, even at peak, we only see 2-3 active sessions from the Oracle side
      
      The most important thing for Oracle is that maxIdle = maxActive so that the
      pool connections are never dropped or recycled since setting up new
      connections is terribly expensive...
      
      Adi
      
    • Adi,
      
      Would you mind sharing your Oracle memory settings?  We are currently running with:
      db_cache_size = 4096M (from 5120M)
      shared_pool_size = 3072M (from 4096M)
      java_pool_size = 250M (no change)
      large_pool_size= 2048M (from 4096M)
      sga_max_size = 20480M (from 24576M)
      
      Thanks, L 
      
    • Hi Lance,
      
      We are using automatic shared memory management in Oracle. Based on your settings and ours, I think the key points are as follows:
      1. Your shared_pool is too large.  The Sakai application code does not need such a large shared pool; 1 GB is a good starting point (unless you have other applications in the same database).
      2. You can set db_cache_size much higher.  We have a total SGA of 6560M, of which 5872M is used for the buffer cache (Oracle assigned it automatically).
      3. 256M of PGA is enough based on our settings.
      4. If you can set sga_max_size = 20480M (or even higher, as yours was before), try to use AMM and set sga_target to at least 18 GB (see the sketch after this message).
      
      The following are our parameter settings:
      sga_max_size =6560M
      sga_target=6560M
      pga_aggregate_target=256M
      
      The following are automatically generated by Oracle based on our target:
      Shared Pool     624M
      Buffer Cache     5872M
      Large Pool     16M
      Java Pool     32M
      Other     16M
      
      Luke has created a page with the parameters as a reference:
      
      http://confluence.sakaiproject.org/confluence/display/ENC/Oracle+Admini
      
      All the parameters can be seen below.
      
      Thanks,
      Drew Zhu
      Oracle DBA
      ITCS, University of Michigan 
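
      Applying Drew's suggestion in point 4 to Lance's instance, the parameter file would look roughly like the
      following.  The values are a sketch inferred from the numbers quoted above, not a tested configuration:

      # Sketch only -- values inferred from Drew's recommendations above
      sga_max_size=20480M
      sga_target=18432M           # "at least 18 gigs"; lets Oracle size the individual pools automatically
      pga_aggregate_target=256M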
      
      What is your "cursor_sharing" parameter set to? Setting it to FORCE or SIMILAR will force the sharing of similar SQL statements and may help in reducing the shared_pool_size.  We use FORCE, as you can see in the parameter file.  Also, if you are running more tools than we use, the shared pool may need to be larger.
      
      Thanks,
      Drew 
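
      In parameter-file terms the setting Drew describes is a single line.  FORCE substitutes system-generated bind
      variables for literals, so SQL statements that differ only in their literals share one cursor, which is what
      reduces the pressure on the shared pool:

      # Drew's setting, as described above
      cursor_sharing=FORCE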
      
  • SAK-9860 : Excessive db queries generated from Site Info / user service
    • From Ian:
      Just committed a fix against SAK-9860.
      
      It's not a total fix, but you should be able to patch 2.4.x (once fully tested), and the profiler is saying the 
      number of queries for a single request is now 1 rather than 4 the first time per user and then 0 after that.
      
      Needs testing though, and only eliminates the EID/ID SQL.
      
  • SAK-11279 : Spurious presence events
    • From Stephen:
      Hi all,
      
      If you're running 2.4.0 or 2-4-x in production with presence enabled, you will probably want to apply the fix to presence/courier from:
      
      http://jira.sakaiproject.org/jira/browse/SAK-11279
      
      This is a bug that logs 2 presence events every time a presence refresh is made (every 30s per user). Fixing this reduced the volume of presence events in our production system by a factor of 10 or more.
      
      Regards
      Stephen