IRC Chat 12162009

Irc 12162009 chat with rSmart

#ucdsakai

09:29	INFO	Channel view for â€œ#ucdsakaiâ€� opened.
09:29	=-=	User mode for smavocet is now +i
09:29	-->\|	YOU (smavocet) have joined #ucdsakai
09:30	<kdalex>	yes, they look okay to me too.
09:31	-->\|	markpankow (n=chatzill@72.44.192.164) has joined #ucdsakai
09:31	<mwenk>	If its the problem we had, and IIRC the problem was the cost based optimizer wasn't using the index
09:31	<mwenk>	because the stats on the table were whacked
09:32	<kdalex>	yes, we should have Mark look at that as well as give us the load average on the db machine
09:32	-->\|	tamsler (n=chatzill@69-12-224-247.dsl.static.sonic.net) has joined #ucdsakai
09:33	\|<--	tamsler has left freenode (Remote closed the connection)
09:33	-->\|	tamsler (n=chatzill@69-12-224-247.dsl.static.sonic.net) has joined #ucdsakai
09:34	<tamsler>	sorry for the dealay
09:34	<tamsler>	I just replied to John's email
09:34	<tamsler>	here is the reply:
09:34	<tamsler>	Assignment definitely makes calls to CM. We added the OSIVF to that
09:34	<tamsler>	tool. I think that's one of the tools that you guys have to retrofit
09:35	<tamsler>	with osiv.properties
09:35	<mwenk>	Isn't assignments under OSIV now?
09:35	<kdalex>	how many other places did we do this?
09:35	<tamsler>	checking ..
09:36	<tamsler>	I am looking at https://svn.ucdavis.edu/svn/ucd-sakai/patches/branches/sakai_2-5-x-prod/patch-list.txt
09:36	<mwenk>	web ?
09:36	<mwenk>	it uses section awareness?
09:36	<mwenk>	wow
09:37	<tamsler>	roster, assignment, msgcntr, sections, chat, mailtool, web, gradebook, sections
09:38	<tamsler>	is somebody analyzing the catalina.out w.r.t. osivf errors
09:39	<tamsler>	I still don't have access to their system
09:39	<tamsler>	because 2.6 may have other tools that use cm
09:39	<tamsler>	like OSP etc
09:40	<mwenk>	I can copy those somewhere you can look at em
09:40	<mwenk>	lemme do that now
09:40	-->\|	Prabhu (n=prabhuem@psl-87.ucdavis.edu) has joined #ucdsakai
09:41	<tamsler>	brb
09:41	<Prabhu>	Mark, When was the last table switch(ta/tb) ran. And was the successful?
09:42	<kdalex>	John Bush...are they any other URL's that are failing?
09:42	<kdalex>	or hanging in the load test?
09:42	<jbush>	that was the worst
09:43	<jbush>	once sec, I'm not sure I saved the report
09:43	<markpankow>	I just sent the group the explain plans for the queries Prabhu sent. Full table scans on coursemanagement_section.
09:44	<smavocet>	Statistics problem like we had, Prabhu?
09:45	<kdalex>	any idea from anyone why we didn't see this on last weeks load test?
09:46	<Prabhu>	i need to know when and how the last table switch script ran
09:46	<Prabhu>	Mark?
09:46	<mwenk>	Thomas: rsmart catalina files copied to sakai@caje:/ucd/opt/sakai/rsmart_logs
09:50	-->\|	kzinti34 (n=chatzill@psl-194.ucdavis.edu) has joined #ucdsakai
09:50	<tamsler>	thanx mike
09:52	<kdalex>	Prabhu, what will the time of the table switch tell you...about stats? Can we see if the stats are in place from here?
09:53	<tamsler>	looks like auto.ddl is enabled not that this is a problem per see but we never did that
09:53	<kdalex>	number of oracle connections is dropping but performance is still slow
09:53	<jbush>	its really bad
09:54	<jbush>	I'm not running any tests over here
09:55	<mwenk>	Probably the queries from the test havent' returned yet
09:55	<mwenk>	We saw that too
09:55	<mwenk>	sessions would hang until we killed the app
09:55	<smavocet>	Is Mark still on IRC? Prabhu is looking for him.
09:55	<mwenk>	sessions as in DB sessions
09:55	<rsmartmike>	thomas, i have copied your pubkey to the server. you will be able to log in now. the other key was messed up and had the incorrect permissions...so that's all fixed.
09:56	<tamsler>	I wonder if the assignment tables are ok in the internal DB
09:56	<tamsler>	lots of errors in catalina.out
09:57	<rsmartmike>	sakai_external is the proper schema name...
09:58	<Prabhu>	OK
09:58	<rsmartmike>	we use 'sakai' and 'sakai_external'
09:58	<Prabhu>	thanks
09:59	<markpankow>	I just replied to your email prabhu. Sorry, I was on the phone with Dave.
09:59	<markpankow>	sakaiext in the pilot. Sakai_external here.
09:59	<tamsler>	I am still not able to ssh tpamsler@smartsite-test.rsmart.com
09:59	<mwenk>	try tamsler@
10:00	<kdalex>	i tried running the query you sent, Prabhu, from TOAD but I get
10:00	<kdalex>	ORA-02404: specified plan table not found
10:00	<jbush>	do we want to recycle the app servers ?
10:00	<rsmartmike>	yes, your userid is 'tamsler'
10:01	<Prabhu>	Kirk, we will not be able to run these quries from our end...we need dba priv on the server itself
10:02	<rsmartmike>	oracle server load is still high--do we know why if there are no load tests running?
10:03	<Prabhu>	Mark, I have sent drop stats statements. Please run them and repeat the autotrace
10:03	<kdalex>	I show valid stats on course_management_ta with last date = 12/16/2009 5:02:54 AM
10:04	<mwenk>	jbush, your assignment check relied on an preexisting assignment being there?
10:04	<mwenk>	or did you create an assignment for the test?
10:04	<kdalex>	db sessions have just dropped to 3 from 400
10:05	<smavocet>	Prabhu sent him delete stats from various tables.
10:05	<mwenk>	did you guys cycle the app servers?
10:05	<smavocet>	sent Mark
10:05	<rsmartmike>	oracle server load plummeted
10:05	<rsmartmike>	have not cycled tomcat...i can do that now, just give the word.
10:06	<kdalex>	dont think they recycled the servers, I didn't loose my session
10:06	<mwenk>	cool
10:06	<mwenk>	did mark kill the stats then?
10:06	<smavocet>	So once Mark runs the delete stats from several tables, we'll start the load test again?
10:07	<jbush>	much better now
10:07	<rsmartmike>	so, cycle tomcat or no/
10:07	<Prabhu>	:) Lets hear from Mark first
10:07	<jbush>	I don't think its necessary
10:07	<mwenk>	I think the assignment NPEs are because whatever assignment it was looking for doesn't exist
10:07	<mwenk>	probably because it was created on the old data but not the new data
10:08	<Prabhu>	we need to look at the trace and explain plans, to make sure there are no full scans
10:08	<jbush>	right
10:08	<jbush>	we should have corey re-record the test
10:08	<jbush>	now that the server are responsive again he can do that
10:08	<mwenk>	I can't believe grinder cant do thinks like clicking etc
10:08	<jbush>	I was going to have him do test8, once with chat and once without chat
10:08	<mwenk>	reminds me of jmeter
10:09	<jbush>	well thats why we use selenium
10:09	<mwenk>	silk is like selenium
10:09	<mwenk>	in that way anyways
10:09	<mwenk>	well it can do both
10:09	<kdalex>	so did mark do anything to the db to get it to clear? any word on dropping the stats
10:10	<tamsler>	same with tamsler it just prompts for pw instead pass phrase
10:11	<smavocet>	Mark just sent something in email.
10:11	<Prabhu>	i got explain plans from mark...they look great
10:11	<smavocet>	Explain plan
10:12	<mwenk>	owa is being evil to me
10:12	<smavocet>	so all stats have been removed, Prabhu?
10:12	<Prabhu>	yes
10:12	<smavocet>	no more full table scans?
10:12	<Prabhu>	CM quries should go fine now
10:12	<smavocet>	so are we ready to rerun the load test again now?
10:12	<mwenk>	So then do we need to have him cron run the delete stats?
10:12	<Prabhu>	yea...i will send him details
10:13	<smavocet>	Load test?
10:13	<Prabhu>	they may have to adjust auto stats windows too
10:13	<jbush>	corey is re-recording tests, then I have to adjust manually adjust to get the users in there, will probably take 15-20 minutes or so
10:13	<smavocet>	So stats run only at a particular time?
10:13	<smavocet>	Prabhu?
10:14	<smavocet>	ok, wrt jbush
10:14	<smavocet>	Didn't Mark say he had stats running hourly?
10:14	<mwenk>	Once you do that jbush, please commit em to svn
10:16	<kdalex>	I didn't get Mark's latest email. Can someone forward to the list, please?
10:16	<Prabhu>	yea...we can define the windows based on our work load.
10:17	<smavocet>	Done wrt Kirk.
10:18	<kdalex>	Doe we need a phone call between Prabhu and Mark and others to make sure we get this db configuration and stats issue correct? This is absolutely critical!
10:19	<rsmartmike>	thomas, the authorized_keys file was owned by root...i changed that, but i don't think that will make a difference.
10:19	<rsmartmike>	not sure why your key's not workin.
10:19	<tamsler>	did you get it from my last email?
10:19	<rsmartmike>	yes
10:20	<tamsler>	hmm I am using that same key just fine on my systems
10:20	<tamsler>	have you added all the IPs that I have specified in that email?
10:20	<smavocet>	wrt kirk, Mark and Prabhu can call each other anytime. They have each other's cell numbers.
10:20	<tamsler>	or is ssh wide open?
10:21	<kdalex>	no ssh needs to be from the ip you gave them originally
10:21	<rsmartmike>	not sure about the ip's...i will double check
10:21	<tamsler>	I know, I specified my home IP and it is blocked
10:21	<rsmartmike>	dave may not have acted on that one, i will ping him right now
10:21	<Prabhu>	Kirk, I will explain about stats on an e-mail to mark and cc to you all
10:22	<mwenk>	Alright, since it will take a cpl for the load test, gonna use this time to get into PSL
10:22	<kdalex>	Thanks, Prabhu
10:22	<mwenk>	I'll be able to start the load test once I'm in
10:22	<mwenk>	that ok with kirk/sandra?
10:23	<kdalex>	Let's get you in here first...
10:23	<--\|	mwenk has left #ucdsakai
10:24	-->\|	littlelucca (n=joncarlo@bly.ucdavis.edu) has joined #ucdsakai
10:25	<tamsler>	here are the IPs just in case 169.237.11.246
10:26	<tamsler>	169.237.11.247
10:26	<tamsler>	69.12.224.247
10:32	<tamsler>	Ok I was able to login now
10:32	<tamsler>	but only from the 169.237.11.246 ip and not the other two
10:38	<kdalex>	Prabhu, when can we expect your email?
10:40	<jbush>	ok tests committed, I haven't started test yet to make sure they work
10:41	<kdalex>	ok. guess you can run a short one to test and let us know
10:41	<jbush>	there are 2 news tests, test_8_with_chat.py and test_8_without_chat.py
10:41	<smavocet>	I'll let QA know.
10:42	<smavocet>	Starting when?
10:42	<kdalex>	lets start w/o chat. can we have mark watch...start when we know he is ready
10:42	<jbush>	running short test of first one now, just to make sure it works
10:42	<jbush>	ok
10:42	<smavocet>	Mark's not on irc
10:42	<smavocet>	Not sure he's watching. Is he?
10:43	<kdalex>	yes he is...
10:43	<smavocet>	He's there but not responding here.
10:43	<smavocet>	Responding only in email that I can tell.
10:44	<markpankow>	I am here.
10:44	<markpankow>	I am ready...
10:45	<smavocet>	ok, great. Hadn't seen input from you in some time.
10:45	<smavocet>	I'll let QA know. We're starting now, right?
10:46	<kdalex>	23 sessions now in the db again...showing cm queries. I cannot tell if there are full table scans
10:46	<jbush>	50 user test in process
10:46	<markpankow>	Sorry. I just had 4 other windows open looking at other issues.
10:46	<kdalex>	got this
10:46	<kdalex>	Table Scan: SAKAI_EXTERNAL.COURSEMANAGEMENT_SECTION_TA: 45792 out of 45792 Blocks done
10:47	<smavocet>	What does that mean?
10:49	<smavocet>	so far so good wrt QA and response time as you might expect with such a low load.
10:52	<jbush>	well, its almost done
10:52	<jbush>	looks like assignment is still a problem
10:52	<jbush>	https://smartsite-test.rsmart.com:8081//xsl-portal/tool/f3275f0a-74ed-467b-8cbd-42354ddbff11
10:52	<jbush>	that url is taking on average 179 seconds
10:53	<kdalex>	W
10:53	<jbush>	I'll killing the test
10:53	<jbush>	what do what want to do at this point, seems like there is still a problem
10:53	<kdalex>	there are 2 consecutive slashes in that URL...is that ok?
10:55	<jbush>	we need some ideas
10:55	<jbush>	seems like to big things have changed here, new code, and new database
10:55	<jbush>	we could roll back to old tag and test that way
10:56	<kdalex>	We need input from Mark first...
10:56	<jbush>	ok
10:57	<kdalex>	To the UCD programmers....any more on the interrelationship between the URL above and CM?
10:57	<kdalex>	I just got a timeout so I cannot make that URL work for me
10:58	<tamsler>	checking ...
10:58	<markpankow>	837
10:58	<markpankow>	sorry... wrong window
10:58	<kdalex>	Also, what is the meaning of the 8081 port? I just tried it without the port and it came back
10:58	<tamsler>	I can access it just fine ..??
10:59	<jbush>	oh mike exposed 8081-8086 so you can hit tomcat directly
10:59	<jbush>	makes it easier when looking at logs
10:59	<kdalex>	What is the 837?
10:59	<kdalex>	Oh, great! That's good to know
10:59	<markpankow>	nothing... I typed that in the wrong window
10:59	<kdalex>	Well the URL works okay w/o the port...and I can get into Grade mode on the tool which accesses cm
11:00	<tamsler>	what's strange is that every other time I try to access https://smartsite-test.rsmart.com:8081//xsl-portal/tool/f3275f0a-74ed-467b-8cbd-42354ddbff11 it doesn't work
11:00	<kdalex>	yeah, it's hanging for me too....do we have to login first to that port to get a valid session on that server?
11:00	<jbush>	yes
11:01	<rsmartmike>	ports 8081-8086 should be accessible via those 30 or so IPs you sent
11:01	<rsmartmike>	if you want, i can paste them in here
11:02	<jbush>
11:03	<kdalex>	oh, that's my problem...I am trying from home.
11:04	<kdalex>	Mark - any word on stats or current table scans? Looks like they are still happening though w/o DBA permissions I can't be sure
11:05	<kdalex>	DB load dropped to 0 again
11:05	<markpankow>	I am sending a report from the 11:25 to 11:45 timeframe. Things that looked out of whack before are looking better, but it takes me a bit to read through it.
11:05	<markpankow>	I am sending the email now.
11:07	<kdalex>	URL works from my fixed ip
11:07	<jbush>	yeah seems ok now for me too
11:09	<kdalex>	should we start the load again then and have Mark start both a report and do some realtime checking?
11:09	<kdalex>	Also, could he re-run Prabhu's query now while its quiet to see if we still show full table scan in the explain plan?
11:09	<smavocet>	That url resolved for me once but the second time, it couldn't access it.
11:10	<smavocet>	I'm on my fixed IP address also.
11:10	<jbush>	how many users do we want to do ?
11:11	<kdalex>	worked for me.
11:12	<jbush>	how about 1000 users ?
11:12	<smavocet>	Try it again, using a different browser, Kirk.
11:12	<kdalex>	If you switch browsers you have to relogin to the specific port (e.g. 8081)
11:12	<kdalex>	1000 seems ok. should we limit to 10 mins or so?
11:13	<jbush>	ok, 1000 users for 10 mins, here goes
11:13	<kdalex>	Mark, can you confirm you are monitoring?
11:15	<markpankow>	Okay, I am watching. Give me 10 seconds
11:15	<markpankow>	Olau gp
11:15	<markpankow>	okay go
11:15	<jbush>	about 250 users in so far
11:16	<markpankow>	I have it set to get a before snapshot and another in 10 minutes.
11:16	<jbush>	the assignment url looks fine so far
11:17	<jbush>	about 500 users now
11:17	<kdalex>	about 200 db connections
11:18	<kdalex>	so far so good
11:18	<jbush>	750 users
11:19	<Prabhu>	Whats the DB server load?
11:20	<markpankow>	24
11:20	<kdalex>	trying to load nut 011 001-010 FQ 2009 - still waiting
11:20	<smavocet>	24 is a bit high, not so?
11:21	<Prabhu>	That is too high......is the load increasing with increase in users?
11:21	<jbush>	all 1000 are in now
11:21	<markpankow>	load is still at ~24
11:22	<markpankow>	it does appear to increase with added users.
11:22	<kdalex>	Can anybody else get to a Site Info in a course site?
11:22	<kdalex>	576 database connections
11:22	<jbush>	try pulling up assignment tool in SPA 001 001-0011 site, crazy slow again
11:23	<kdalex>	yes, unfortunately. If Mark's report doesnt' show full table scans then we must have ruined it with the realm refreshes????
11:24	<jbush>	this is concerning
11:24	<jbush>	2009-12-16 11:23:01,076 INFO ajp-10.10.12.20-8009-357 org.sakaiproject.authz.impl.BaseAuthzGroupService - refreshAuthzGroupIfNecessary(): refreshing /site/1b3bce07-6659-4aae-ba30-50b94f6d0955
11:24	<jbush>	2009-12-16 11:23:02,193 INFO ajp-10.10.12.20-8009-360 org.sakaiproject.authz.impl.BaseAuthzGroupService - refreshAuthzGroupIfNecessary(): refreshing /site/1b3bce07-6659-4aae-ba30-50b94f6d0955
11:24	<jbush>	2009-12-16 11:23:02,554 INFO ajp-10.10.12.20-8009-359 org.sakaiproject.authz.impl.BaseAuthzGroupService - refreshAuthzGroupIfNecessary(): refreshing /site/319258e7-c2ee-4eab-80c8-65365458c345
11:24	<jbush>	2009-12-16 11:23:03,161 INFO ajp-10.10.12.20-8009-361 org.sakaiproject.authz.impl.BaseAuthzGroupService - refreshAuthzGroupIfNecessary(): refreshing /site/1b3bce07-6659-4aae-ba30-50b94f6d0955
11:24	<jbush>	2009-12-16 11:23:05,609 INFO ajp-10.10.12.20-8009-362 org.sakaiproject.authz.impl.BaseAuthzGroupService - refreshAuthzGroupIfNecessary(): refreshing /site/1b3bce07-6659-4aae-ba30-50b94f6d0955
11:24	<jbush>	2009-12-16 11:23:06,381 INFO ajp-10.10.12.20-8009-363 org.sakaiproject.authz.impl.BaseAuthzGroupService - refreshAuthzGroupIfNecessary(): refreshing /site/1b3bce07-6659-4aae-ba30-50b94f6d0955
11:25	<smavocet>	Kirk, I can get to site info but it takes a long time.
11:26	<smavocet>	So jbush, the authz refresh is too frequent? Thomas?
11:26	<kdalex>	not me. did su to liz applegate and cannot visit her course sites at all
11:28	<smavocet>	spoke too soon. That was section info. Site info hasn't returned yet.
11:28	<jbush>	ah, they logs are cluttered with refreshes
11:29	<jbush>	its a refresh attack
11:29	<kdalex>	Thomas can you look at these logs too, please?
11:29	<jbush>	something is not right with this code, it shouldn't be refreshing like this, its multiple times a second, different threads
11:30	<jbush>	I'm looking at ironmaiden, ucdavis1
11:30	<kdalex>	okay, good find. Thomas and Mike can you look at the code again. Do the logs show where in the code?
11:31	<smavocet>	Mike hasn't made it in yet. On his way.
11:31	<smavocet>	I'll share this script with him though when he gets in.
11:34	<rsmartmike>	well, at least we'll get up to date info in the site roster ;)
11:35	<kdalex>	Yeah, to the nanosecond!
11:35	<kdalex>	Glad you can keep a sense of humor while we're all pestering you in realtime!
11:39	<smavocet>	Are you stopping the load test now or keep it going for continued diagnostics?
11:39	<jbush>	I'm stopping it
11:40	<smavocet>	ok
11:40	<kdalex>	I just got Internal Server Error
11:40	<kdalex>	The server encountered an internal error or misconfiguration and was unable to complete your request.
11:40	<kdalex>	Please contact the server administrator, support@rsmart.com and inform them of the time the error occurred, and anything you might have done that may have caused the error.
11:40	<kdalex>	while trying to get to site info on a big site
11:40	<jbush>	this is what I think we should do
11:41	<jbush>	modify the refresh log to output the stack, so we can see where its coming from, make this change on one server, stop all the server and run with only the one server, try a load of 250
11:41	<kdalex>	okay with me. Thomas, Mike: any comments?
11:42	<smavocet>	Mike's on his way into Chiles Road.
11:42	<smavocet>	Sounds good to me also but I'm no substitute for them.
11:43	<tamsler>	the logs show that refreshAuthzGroupIfNecessary is called way to many times
11:44	<kdalex>	can the code be made to check to not do it more than once in a session for a site or something to that effect?
11:44	<jbush>	yes
11:44	<jbush>	use a thread local or something
11:45	<tamsler>	or use a cache indicating when the last refresh was done
11:45	<tamsler>	similar we did before but without the timer
11:45	<jbush>	well that is what its supposed to be doing right now
11:45	<tamsler>	and then use the caches ttl to expire
11:47	-->\|	wenk (n=mwenk@psl-242.ucdavis.edu) has joined #ucdsakai
11:47	<smavocet>	Hi Mike. I saved the irc script to update you.
11:48	<wenk>	Ok
	IRCchat12162009 with rSmart	(missing from 11:48 to 13:05)

#ucdsakai

13:05	<wenk>	but then I'd wonder why it works on our 2.5 impl
13:06	<jbush>	synchronized (azGroup) Unknown macro: { \|\| 13}
13:06	<jbush>	I'm not sure all the azGroup objects are the same to all threads
13:08	<jbush>	I'm not sure this is even the problem
13:08	<jbush>	that 7 refresh calls in a 1/10 of sec
13:09	<wenk>	dunno either
13:09	<wenk>	but considering that down in that refresh code it is calling out CM and is wanting sections
13:09	<wenk>	its definitely not helping
13:09	<kdalex>	211 db connections but no long ops
13:10	<kdalex>	now only 5 db connections...clears fast
13:10	<wenk>	Look, all the reading I've done in the past(there was a discussion on it on sakai-dev a while back) about CHM vs the collections api wrapper says we should be using CHM
13:10	<wenk>	not the wrapper
13:11	<jbush>	ok well thats easy to change
13:12	<wenk>	Not sure if it will fix it
13:12	<jbush>	so like this? realmsRefreshed = new ConcurrentHashMap(size);
13:14	<wenk>	yu
13:14	<wenk>	p
13:14	<wenk>	CHM is supposed to scale better
13:14	<wenk>	so lets see if that's true
13:14	<wenk>	anyways, gonna grab some food, bb in 15
13:15	<jbush>	ok we have server back up with the wrapper, lets test that first
13:15	<smavocet>	Response time seem good from ux perspective.
13:16	<jbush>	starting 1000 user test now
13:17	<jbush>	brb in 5
13:19	-->\|	corey1 (n=corey@internal.rsmart.com) has joined #ucdsakai
13:19	<tamsler>	why don't we init the cache in the init() method rather then in the refresh method ?
13:20	<jbush>	700 users on...
13:21	<jbush>	1000 users loaded up
13:22	<smavocet>	still looks pretty good for me.
13:25	<smavocet>	Can we let it run a little longer. QA was caught off guard. Thanks.
13:26	<jbush>	yep, response is really good from grinder
13:26	<jbush>	I don't see how this synchronized thing would have fixed it, did mark do something else ?
13:30	<markpankow>	I didn't do anything. I wish I could take credit.
13:30	<tamsler>	The logs still show serveral refresh entries per second
13:31	\|<--	kdalex has left freenode (Read error: 110 (Connection timed out))
13:34	-->\|	kdalex (n=chatzill@moobilenet-100-51.ucdavis.edu) has joined #ucdsakai
13:34	<wenk>	back
13:37	<kdalex>	me too
13:39	<tamsler>	so does the code use the concurrentHashMap now?
13:40	<jbush>	no, its using the wrapper
13:44	<jbush>	we can try this, change to use ConcurrentHashMap and the synchronize on the authgroup id, like this
13:44	<jbush>	public void refreshAuthzGroupIfNecessary(AuthzGroup azGroup){
13:44	<jbush>	if(azGroup != null && azGroup.getId().startsWith("/site/") && !azGroup.getId().contains("/group/")) {
13:44	<jbush>	if(null==realmsRefreshed){
13:44	\|<--	jbush has left freenode (Excess Flood)
13:44	-->\|	jbush (n=Adium@internal.rsmart.com) has joined #ucdsakai
13:45	<wenk>	So what exactly changed?
13:46	<jbush>	add this synchronized (azGroup.getId()) {
13:46	<jbush>	and switched to CHM
13:46	<wenk>	is it running with that?
13:46	<wenk>	want to use pastebin ?
13:46	<jbush>	no, its only running with the wrapper
13:46	<jbush>	but there doesn't seem to be a performance problem right now
13:46	<wenk>	I know
13:46	<jbush>	should we try to ramp up to 3000 users first
13:47	<wenk>	which makes me wonder if its not this but something else
13:47	<wenk>	Sure, can do that
13:47	<jbush>	do we want to split it between us ?
13:47	<wenk>	I'll fire off 1800 from here
13:48	<wenk>	Also, lets turn debug off for this run
13:48	<tamsler>	I think we still have an issue
13:48	<tamsler>	looking at the logs, I see several calls to refresh
13:48	<jbush>	if we are going to bounce should we then make the code mods
13:49	<tamsler>	but only the first log message shows up containing refreshing but not the one containing group
13:49	<tamsler>	this is due to the exception being thrown
13:49	<tamsler>	if we take debug out we will tank again
13:49	<jbush>	what exception ?
13:49	<jbush>	I put a stackTrace in on purpose, its not an exception if that is what you are talking about
13:50	<wenk>	yes but there's a ton of overhead
13:50	<jbush>	ah
13:50	<wenk>	If its a timing issue between the threads that will smooth it out a bit
13:51	<kdalex>	are we taking time out for the conference call in 10 min?
13:51	<jbush>	oh I see
13:51	<tamsler>	I think it's still not working as we expect it to
13:52	<tamsler>	if you do: grep refreshAuthzGroupIfNecessary catalina.out.server6 \| grep -v site/~ \| grep -v 1128
13:52	<tamsler>	you will see that we are getting several refreshed per second
13:53	<jbush>	then I think we should add in the azgroup blocking
13:53	<jbush>	seems like that would address it if it a timing issue
13:55	<wenk>	sync around refreshAuthzGroupIfNecessary
13:55	<jbush>	really, that seems agressive
13:55	<wenk>	wait
13:57	<wenk>	sorry looking at the timer
13:57	<wenk>	something seems odd between the old and new
13:58	<wenk>	yeah, we had that timer thread thing
13:58	<jbush>	we are joining the call now
13:59	<wenk>	will you stay on irc as well?
13:59	<kdalex>	coming
14:00	<kdalex>	room confloct coming
14:03	<kdalex>	thomas and mike...come join the call
14:04	<Prabhu>	I will stay on IRC
14:04	<kdalex>	ok
14:08	\|<--	jbush has left freenode (Read error: 60 (Operation timed out))
14:11	-->\|	jbush (n=Adium@internal.rsmart.com) has joined #ucdsakai
14:35	<wenk>	Back
14:35	<wenk>	Lemme know when to kick off the ucd side of the load test
14:41	\|<--	jbush has left freenode ("Leaving.")
14:43	-->\|	jbush (n=Adium@internal.rsmart.com) has joined #ucdsakai
14:46	<rsmartmike>	ok, app servers coming down for new tag deployment. probably be about 15-20 minutes before they can be back up.
14:48	<wenk>	okay
14:48	<wenk>	lemme know when I should start the load test from here
14:49	<rsmartmike>	roger
15:02	<smavocet>	Oh, I wasn't watching
15:02	<smavocet>	We're starting the load test now?
15:04	<--\|	corey1 has left #ucdsakai
15:13	<wenk>	mike, how we doin?
15:15	<wenk>	rsmartmike: How we doin ?
15:16	\|<--	kzinti34 has left freenode ("ChatZilla 0.9.86 Firefox 3.5.6/20091201220228")
15:17	<rsmartmike>	compiling...prob less than 5 minutes
15:18	<smavocet>	ok
15:19	<wenk>	ok
15:22	<rsmartmike>	starting up...
15:23	<--\|	littlelucca has left #ucdsakai
15:27	<wenk>	so kick off the load test?
15:27	<jbush>	yes, started 1200 users over here
15:28	<jbush>	still ramping mine up
15:28	<wenk>	which script should I call? w/o chat?
15:29	<jbush>	that is the one I'm using
15:29	<wenk>	ok
15:32	<wenk>	starting
15:34	<jbush>	my 1200 are going, response time avg is 72 millis
15:36	<wenk>	Oops, didn't put the script in for my last node
15:36	<wenk>	only running 1200
15:36	<wenk>	last 6 will be up in a sec
15:37	<rsmartmike>	db load is < 1
15:37	<jbush>	grinder is getting timeouts now, but response time still good
15:37	<jbush>	dave is checking firewall/apache, could be there
15:37	<rsmartmike>	db is looking great so far, at least...
15:38	<jbush>	nice!
15:38	<rsmartmike>	we are seeing tons of cache hits for authz stuff
15:38	<wenk>	All 1800 up
15:38	<wenk>	Oops, somehow the props I had to ramp up were lost
15:38	<wenk>	so all 1200 and then the 600 connected at once
15:38	<wenk>	Getting errors
15:39	<rsmartmike>	authz cache size is 350-500 on the 3 servers i looked at
15:39	<wenk>	again I wish there was a ghost busting button in the Online tool
15:40	<wenk>	another min b4 we get the new catalina.out
15:40	<rsmartmike>	could we just clean up the session table? woudl that effect the same result?
15:40	<wenk>	cluster table I think
15:40	<wenk>	I'd not worry about it, its just annoyin
15:40	<jbush>	as soon as you go live we'll start working on that :)
15:41	<rsmartmike>	oracle load 1.5
15:41	<wenk>	I'm kicking off timings
15:41	<tamsler>	refreshes look really good
15:41	<tamsler>	lot's of cache hits
15:42	<tamsler>	and no multiple refreshed per realm
15:42	<wenk>	Yeah
15:43	<rsmartmike>	oracle load 3.31, not bad.
15:44	<rsmartmike>	better than 47!
15:44	<kdalex>	Now that's more like it.
15:45	<wenk>	slow performance now
15:45	<wenk>	tho honestly I dunno how much its the site and how much its my client at this point
15:46	<rsmartmike>	oracle load .23
15:46	<rsmartmike>	it is not responsive all of a sudden for me too
15:46	<tamsler>	I am not getting in anymore
15:47	<jbush>	yeah my browser got a timeout
15:47	<tamsler>	what is the server load on the two systems that run apache compared to the other nodes
15:47	<rsmartmike>	.13 & .15
15:47	<rsmartmike>	on 2 apache servers
15:48	<rsmartmike>	.35 now on one of them
15:48	<tamsler>	I wonder if we are running out of tomcat threads
15:48	<rsmartmike>	and it's back
15:48	<rsmartmike>	oracle load to 7.49
15:49	<jbush>	weird its happy again
15:49	<tamsler>	how many threads are allocated per tomcat in server.xml
15:49	<tamsler>	is it the 150 default?
15:50	<rsmartmike>	kswapd1 is active on teh oracle box
15:50	<rsmartmike>	that doesn't seem like a good ting
15:51	<rsmartmike>	apache load: 2.26/0.65
15:51	<tamsler>	I am seeing lots of the following errors:
15:51	<tamsler>	2009-12-16 15:28:22,247 ERROR ajp-10.10.12.45-8009-54 org.sakaiproject.tool.su.SuTool - SuTool Fatal Error: You must be an administrator to become another user. null
15:52	<wenk>	maybe 3000 is too many users
15:52	<wenk>	brb
15:54	<rsmartmike>	1.28gb swap in use on oracle box
15:54	<tamsler>	we are running 250 threads per tomcat instance
15:58	<tamsler>	Total sessions: 12428
16:02	<tamsler>	so how are we doing ?
16:03	<rsmartmike>	not responsive for me at the moment
16:03	<wenk>	same here
16:03	<rsmartmike>	oracle load is very low, .1
16:03	<wenk>	So we killed the app this time
16:03	<wenk>	but not the db
16:03	<rsmartmike>	seeing errors in apache
16:03	<rsmartmike>	could not connect to tomact
16:03	<wenk>	I guess that's a positive
16:03	<wenk>	just came back
16:03	<jbush>	yeah I think thats positive
16:03	<rsmartmike>	so we exceeded tomcat's 500 threads per server seems like
16:04	<wenk>	you'd see that message in the logs
16:04	<rsmartmike>	yeah, weird, i don't think i did
16:05	<wenk>	when i first started using silk performer I did that
16:05	<wenk>	it gave a nice message
16:05	<wenk>	cant remember what it was exactly
16:05	<jbush>	yeah, damn open source apps
16:05	<tamsler>	any out of memory errors
16:06	<wenk>	Sorry this was a bit over 2 years ago ;)
16:06	<wenk>	no, don't see any out of memory
16:06	<wenk>	you guys use jconsole ?
16:07	<rsmartmike>	i've use jvmstat, jconsole is newer
16:07	<rsmartmike>	?
16:07	<wenk>	sigh, I'd love to look at the memory use graph
16:07	<wenk>	plus tomcat gives a bunch of mbeans with fun stuff
16:08	<wenk>	stats on stuff and what not
16:08	\|<--	jbush has left freenode ("Leaving.")
16:09	<wenk>	so any of those firewall blocks ?
16:09	<rsmartmike>	are you guys dialing in to conf. line b?
16:09	<rsmartmike>	we are on.
16:09	<rsmartmike>	dave says no.
16:10	<rsmartmike>	fw looks good
16:10	<wenk>	are we supposed to call in a conf ?
16:10	<wenk>	I didn't know I needed to, but sure can
16:11	<kdalex>	okay, on our conf line or yours...sorry we missed the time due to a local crisis
16:11	<rsmartmike>	ours, one sec
16:11	<markpankow>	should I me on this as well?
16:12	<kdalex>	same number as the last call?
16:12	<rsmartmike>	916-233-4242 / 571605
16:12	<rsmartmike>	mark: yes
16:12	<kdalex>	we are coming
16:12	<rsmartmike>	there was some oracle memeory swapping
16:12	<rsmartmike>	but the load was ok
16:24	<Prabhu>	Is the DB server 64 bit? whats the RAM? what % of RAM was used for swap?
16:25	<rsmartmike>	16gb ram, about 1.3 or 1.4 gb showed as swap used
16:25	<rsmartmike>	64-bit centOS 5.4
16:29	<Prabhu>	We got to give 0.75 times the size of RAM for swap on 64bit
16:31	<Prabhu>	Well...that is oracle recommended!
16:34	<kdalex>	so what is the swap size on the database server?
16:36	<Prabhu>	If it is in the right size (12 GB), using 1.3 or 1.4 is absolutely not an issue
16:37	<wenk>	mike, what's the output of free -m ?
16:38	<wenk>	Well 1800 users is still running on the server
16:38	<wenk>	and its quite fast
16:39	<wenk>	is your deployment done ?
16:40	<wenk>	Looks like it to me, see the startup time in server logs
16:46	<wenk>	rsmartmike: You still there?
16:50	-->\|	jbush (n=Adium@ip68-3-76-94.ph.ph.cox.net) has joined #ucdsakai
16:50	<wenk>	Hey John
16:53	<wenk>	our 1800 are running
17:01	<jbush>	I've got 250 running and a bunch of timeouts
17:01	<jbush>	oh wait 661
17:01	<jbush>	you seeing any timeouts from grinder ?
17:02	<jbush>	now I'm up to 1000, still ramping up
17:02	<wenk>	errors
17:03	<wenk>	I wonder what we'll see when we just run 1800
17:03	\|<--	tamsler has left freenode (Remote closed the connection)
17:03	<wenk>	how's the db server?
17:04	<wenk>	allright folks
17:04	<wenk>	I'm gonna drop, I'll monitor thru night
17:04	<wenk>	I'll leave this up for the moment
17:16	\|<--	kdalex has left freenode (Read error: 110 (Connection timed out))
17:46	\|<--	Prabhu has left freenode ()
18:21	-->\|	markpankow_ (n=chatzill@68.3.111.203) has joined #ucdsakai
18:30	\|<--	markpankow has left freenode (Read error: 110 (Connection timed out))
20:38	\|<--	jbush has left freenode ("Leaving.")
22:07	-->\|	jbush (n=Adium@68.3.76.94) has joined #ucdsakai
22:13	\|<--	jbush has left freenode ("Leaving.")
23:30	-->\|	jbush (n=Adium@68.3.76.94) has joined #ucdsakai
03:33	\|<--	markpankow_ has left freenode ("ChatZilla 0.9.86 Firefox 3.5.5/20091102134505")
06:43	christel	Global Notice Hi all, we are (surprise, surprise) still experiencing DDoS, after a quiet period it just started up again. Apologies for the inconvenience.
07:03	\|<--	jbush has left freenode ("Leaving.")	Â Â