Tuesday, January 22, 2013

Sudden Massive Sim Loss


Beach (Atrebor Zenovka) sent us this analysis ...

=====================


My unfinished report 

I have been trying to make sense of the Sudden Massive Sim Lag events as they affect Dance Performances.

This covers Sudden Massive Sim Lag in general, and then my description of what happened during the last A&M Mocap Dancers Home Show.

http://community.secondlife.com/t5/Second-Life-Server/Increase-in-Instant-SIM-LAG-amp-Crashes-During-Larger-Events/td-p/1683765/page/5

http://blog.nalates.net/2012/12/18/sl-news-2-week-51/#more-9260

http://208.74.205.111/t5/Second-Life-Server/Increase-in-Instant-SIM-LAG-amp-Crashes-During-Larger-Events/td-p/1683765

During each of the most recent A&M Home Shows there was a Sudden Massive Sim Lag event (SMSLE) that interfered with the performance. Dazzler Sim was struck during the most recent Cutie Nominations Performance on Sunday. During the earliest Home Show I was attentive to the show and not watching the stats. During the most recent Home Show, and now again at Dazzler, I kept my stats window open the whole time.

OBSERVATION:

The sim fluctuations appear [to me] to begin at the point Dancers jumped off their balls at the end of a performance. 

DISCUSSION:

An avatar posed on a dance ball becomes just another stationary object to the Sim Server. It is easy for the Sim Server to track that object's rendering, physics and position factors, forces and vectors. A loose avatar standing or dancing around on the sim is very lag-inducing by comparison: the Sim Server is constantly checking to see if the avatar has collided with anything, and checking, recording and sending on position (vector) and movement information to all the observing viewers present.

OBSERVATION:

Once the sim fluctuations began, they were first noticeable in huge sim ping and (local) bandwidth use escalating, then in a subsequent increase in packet loss. They were also noticeable under Time Dilation in Physics FPS fluctuation, then in Sim FPS, and consequently in the megajump in Total Frame Time (should be about 22; at Dazzler sim it was going up to 47000) and in the rubberbanding of Time Dilation from .99 to .3 to 80.0 to .3 to 0.0.
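As an illustration of how readings like these could be checked after a show, here is a minimal Python sketch that scans stats-window values jotted down by hand and flags the ones that look like an SMSLE. It is not a tool anyone in this report used. The sample format, the names `StatSample`, `is_lag_event` and `flag_lag_events`, and both thresholds are assumptions; only the "about 22" baseline and the 47000, .99, .3 and 0.0 figures come from the observation above, and the way they are paired into samples here is made up for the example.

```python
from dataclasses import dataclass


@dataclass
class StatSample:
    """One hand-recorded reading from the viewer stats window (hypothetical format)."""
    time_dilation: float
    total_frame_time: float


NORMAL_FRAME_TIME = 22.0   # "should be about 22", per the observation above
FRAME_TIME_FACTOR = 10.0   # assumed: 10x normal counts as a lag event
DILATION_FLOOR = 0.5       # assumed: Time Dilation below this counts as a lag event


def is_lag_event(sample: StatSample) -> bool:
    """True when either stat is far outside its normal range."""
    slow_frame = sample.total_frame_time > NORMAL_FRAME_TIME * FRAME_TIME_FACTOR
    low_dilation = sample.time_dilation < DILATION_FLOOR
    return slow_frame or low_dilation


def flag_lag_events(samples):
    """Return the positions of the samples that look like a lag event."""
    return [i for i, s in enumerate(samples) if is_lag_event(s)]


# Illustrative pairing of the figures quoted above, not a real log.
readings = [
    StatSample(time_dilation=0.99, total_frame_time=22.0),
    StatSample(time_dilation=0.3, total_frame_time=47000.0),
    StatSample(time_dilation=0.0, total_frame_time=47000.0),
    StatSample(time_dilation=0.99, total_frame_time=22.0),
]

print(flag_lag_events(readings))
```

Noting the time next to each reading would make it possible to line the flagged samples up with what was happening on stage, such as dancers leaving their pose balls.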

DISCUSSION:

The Dazzler sim did not crash; it just rubberbanded (oscillated) to ZERO FUNCTIONING before it recovered. During that time numerous avatars were logged out. Those avatars did not crash; they simply got logged out of SL due to timing issues. The Dazzler sim, which had been keeping their info and reporting them as present in SL, had stopped such reporting until it oscillated back into full function. At that point the Dazzler sim still showed those avatars as present [ghosted], even though they had been logged out by SL for failure to pony up when called on to report in through Dazzler.

FURTHER DISCUSSION, HYPOTHESIS and PROPOSAL

IMPACT of PATHFINDING

Since the pathfinding capability was added to the Sim Server functions, SUDDEN MASSIVE SIM LAG has become a problem in a way not seen before and not yet fully understood by customer-Users or by the provider, Linden Lab. Things the Sim Server used to take in its stride easily now require it to perform significant positioning recalculation. Some people have the notion that turning off Pathfinding for the region will solve this, BUT IT DOES NOT SOLVE THE PROBLEM. Turning off Pathfinding prevents Users from accessing the Sim Server's pathfinding capabilities. It does not relieve the Sim Server of its duty to perform all those recalculations (LAG & LAG & LAG) and to communicate them (PING SIM ESCALATION & ESCALATION & ESCALATION) to ALL Client Viewers Currently Connected, so as to be able to report the forces, vectors, collisions, and positions if someone COULD ask it for the report.

IMPACT OF ASSET SERVER (INVENTORY)

Since the changes in the Asset Server, access to items in our inventories should be faster. Overall loading of my inventory at logon is actually faster, but my experience indicates that finding and retrieving any individual item is really slower, somewhere in the mechanics of sorting and reporting and then grabbing and dragging out (rezzing) in my viewer.

OLD LAG REMEDIES

    REDUCE AVATAR RENDERING COST = take off all your clothes and become invisible. RESULT, THE TRUTH = the sim still has to place your avatar and keep track of its whereabouts and collision data, so the benefit is minimal, and adding an ARC report to your viewer is itself lag-inducing.

    SCRIPT COUNTERS AND SCRIPT REPORT BOARDS = on a timer basis, sense, read and report on either script count or render data for all avatars in range, so as to make them feel guilty enough to reduce their huds, scripts, attached prims or textures. RESULT, THE TRUTH = SENSORS AND TIMERS CAUSE LAG BECAUSE THE SIM SERVER HAS TO MONITOR THEM, an added burden on the server CPU.

    REDUCE SCRIPTS = even inactive or not-running scripts require the dutiful Sim Server to log those scripts in when the scripted objects enter the region, and to keep checking them every frame to see if their status has changed to active and if they require some Sim Server response. THE TRUTH = YEP, this does sometimes make an observable difference in overall sim lag.

PROPOSALS FOR DANCERS

    1. Know the exact location and rotation coordinates your stage will need. Rezz the copiable-disposable Stage from your Huddles or Barre Hud and use a macro Dialogue to script-command it to its proper location and orientation = ALL JUST BUTTON PUSHING IN YOUR HUD: no need to Open inventory, find and rezz; no need to Open Edit or Build Flyout and drag the stage into place.

    In Huddles, the HUDMASTER:

    2. LOAD your Stage Rezzing Macro Notecard [Your stage was previously loaded into the hud]

    3. Hit a HudButton - rezz your prim-linked stage BACKSTAGE SOMEWHERE at Embarkation Point

    4. Hit a HudButton - rezz your dancer-position linked pose balls BACKSTAGE SOMEWHERE at Embarkation Point

    5. DANCERS in readiness each jump on their assigned ball

    6. Hit a HudButton - open a dialogue menu

        6a. >>hit a location button and the stage jumps to where it is scripted to belong, and the dancer-loaded pose ball set jumps to where it is scripted to belong
        6b. >>hit a curtain open button and the curtain opens

    7. LOAD your already warmed up DANCE SEQUENCE

    8. DANCE DANCE DANCE, BOW, AND soak up the adulation, applause and yeehaws whilst pocketing all thrown vegetables to make soup later.
    DANCERS STAY ON YOUR POSE BALL

    HUDMASTER:

    9. LOAD the Stage Rezzing Macro Notecard and Hit a Button - Open the Dialogue Menu
        9a. >>Hit a Location Button and the Dancers posed on their balls immediately go off stage to the prearranged Disembark Point
        9b. >>Hit a Button and the Stage immediately DIES
        9c. Dancers now vacate their dance pose balls ONE AT A TIME IN ORDER CALLED, but they do not stand, walk or dance in place; they select their prearranged pose ball from which they will observe the next dance routine
        9d. >>Hit a Button and the dancer-position linked pose balls DIE

PROPOSALS FOR SIM (THEATER) OWNERS

        1. Enable Pathfinding on your Sim, because Pathfinding-Enabled Sims function more efficiently, and disabling it only stops users; it doesn't save the Sim Server any effort

        2. Encourage your audience to either SIT on provided pose chairs or STAND on prearranged dance pose balls if they want to dance during the performance, using either the Venue Dance ball or their own Hud animations

        3. Any script counting or avatar render cost checking and reporting should be done on a separate TICKET BOOTH (like an airport scanner booth) Entry SIM (not parcel), because this activity is a high lag-inducing Sim Server cost activity

        4. Even Hosts, Hostesses and Coordinators or Stage Managers should be mounted on a pose ball at all times

        5. Waiting Dancers ready for the next Routine should be mounted on pose balls, not loose hud dancing or walking around backstage

        
I have utilized all the actions proposed here and they do reduce the lagtax on Sim Server CPU time and make stage and dancer placement more rapid and expedient. I find them much more efficient than Inventory dragging-rezzing + Edit Move, or even Place-at-last-position rezzing from Inventory (available in Firestorm viewer). Some of the HudButton pushing can actually be accomplished more expediently with FunctionKey Gesture calling.

    As always, if you know your system has been tested and DOES WORK, then REFRAIN FROM panic-induced repetitive and redundant button pushing, which only issues more commands into the queue for the Sim Server to act on.

The use of pose balls for audience, staff, and waiting performers radically decreases the collision and physics engine calculation burden on the sim server and may well help to stave off Sudden Massive Sim Lag.
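The advice about redundant button pushing can be pictured with a small Python sketch of a gate that lets a command through once and ignores repeats until that command has finished. This is only a model of the idea, not how Huddles, Barre Hud or the Sim Server work; the class name `CommandGate`, its methods and the command text are all hypothetical.

```python
class CommandGate:
    """Pass each command through once; ignore repeats until it completes."""

    def __init__(self):
        self._pending = set()

    def press(self, command: str) -> bool:
        """Return True if the command should be sent, False if it is already waiting."""
        if command in self._pending:
            return False
        self._pending.add(command)
        return True

    def done(self, command: str) -> None:
        """Mark a command as finished so it may be sent again later."""
        self._pending.discard(command)


gate = CommandGate()

# Three panicked presses of the same button: only the first one is sent.
sent = [gate.press("move stage") for _ in range(3)]
print(sent)

# Once the stage has actually moved, the button is usable again.
gate.done("move stage")
print(gate.press("move stage"))
```

The point of the sketch is that two of the three presses never reach the queue at all, which is the effect the HUDMaster gets by simply waiting.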

When 6 or 8 dancers vacate their dance pose balls at the same time 
it is to the Sim Server like 6 or 8 avatars all teleported in at the same time 
(posed on their balls the dancers were just 'linked objects' to the server
and jumping off they become 'loose cannonballs' that have to be tracked actively for movement and collisions)
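The ONE AT A TIME IN ORDER CALLED step can be sketched as a simple call schedule that a HUDMaster might read from. This is a minimal Python illustration, not a feature of any dance HUD; the function name `dismount_schedule`, the dancer names and the 5 second gap are assumptions chosen for the example, and the right gap for a real show would have to be found by testing.

```python
def dismount_schedule(dancers, gap_seconds=5.0):
    """Return (dancer, seconds_after_start) pairs, one dancer per time slot.

    gap_seconds is an assumed spacing between calls, not a measured value.
    """
    if gap_seconds <= 0:
        raise ValueError("gap_seconds must be positive")
    return [(name, i * gap_seconds) for i, name in enumerate(dancers)]


# Hypothetical cast list, in the order they will be called.
cast = ["Dancer A", "Dancer B", "Dancer C"]

for name, at in dismount_schedule(cast):
    print(f"t+{at:.0f}s  {name}: move to your observer pose ball")
```

Spreading the calls out means the server picks up one newly loose avatar per slot instead of the whole cast in the same moment.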

I hope some of this makes sense.

I can provide, for demo, the stage rezz, placement and die using Huddles.
Barre Hud is an easy sub.
Dancemaster Pro handles much of this for those familiar with it.
Other Dance and animation huds with which I am not familiar may well do it.
Custom Gestures called by Shift + Fkey can duplicate or sub in for many of the dialogue menu button commands.
    
Atrebor Zenovka (Beach)
2013 January 21
    

7 comments:

  1. This comment has been removed by the author.

  2. ...lol what?.. Lag is caused, and always has been caused, when the sim has to use swap filing because all of the sim's RAM (memory) is used.. used by the things you mentioned and some that you haven't.. also..

    There are up to 8 (not sure on the exact number) other sims running off of one simulator.. that number is not fixed and it changes all the time. So.. one week your sim might be sharing with one, sometimes it might be alone, and other times it might be sharing with 5 horse breeding porn sims.. SL lag is luck (or bad luck) of the draw.. it is not predictable, and only preventive measures can be implemented on your sim, but those don't eliminate the horse breeding porn ranches.

    Also.. day and time.. a surge of people on a Sunday after church.. or school (just using as an example) will cause these "surges".

    SL's infrastructure is a fragile one.. if we compare it to online games like World of Warcraft (like I did in my early days) we ask ourselves... wth is this? SL was never intended in its design for avs to be leaping about and syncing to music.. it was designed to be a hybrid chat room and for people to craft things.. it's come a long way but it has a long way more to go to keep up with the demands we ask of it..

    In a nutshell, there are way too many variables to pin down a source of lag, and the few ways to prevent it can be destroyed by many things.

  3. LL has admitted that the statistics meter is inaccurate. They have made some efforts to correct it; review the server updates for more information. Gathering statistics based on this method is somewhat flawed due to it being designed to calculate on a different system core. It is our only point of reference, so we have to use what we've got.
    Further testing has shown that MONO updates & pathfinding have affected regions extensively; MONO changes seem to be affecting things as much as pathfinding.
    I did a little research into the server updates, and “12.10.26.266333”, where we see “Updated the Havok physics engine to version 2012.1” and “This is the server modernization project”, appears to be associated with much of the slowdown and negative population effect. If I remember correctly there were stress tests that took place somewhere around or just previous to this time. This also indicates there was a question as to the ability to sustain population, thus the need for a stress test. Please note that previous updates had also decayed quality; the decay is not exclusive to the update referenced above.
    A normal region where events were held before does not deal as well with higher populations as it has in the past. This potentially forces region limits down from what was “OK” at 73 avatars to 59 avatars. The reason I give the number of 73 is that this was the threshold of avatar population at which a region became too laggy to function on. These numbers are based on many events where I have charted population against usability. Sure, your results may vary, but this was the general rule of thumb for most cases. Now we are seeing that 59 +- avatars is the usability threshold.
    Your comments about pose balls are very relevant to the core issue, that being the inability of the server to sustain a higher population. For us old timers, we knew of this ball hopping / physics issue 6 years ago. It is nothing new, and a variant technique was used to collapse regions “for the fun of it”. As a point of definition it is a usability boundary point. Here we can establish the TRUE region limitations. With this understood, region administrators simply need to change the population limit on the region OR ask LL to fix the decay in quality of service. Either way you still pay the same amount for your region but you get less population usability.
    To summarize, Linden Lab needs to fix what they broke. http://community.secondlife.com/t5/Tools-and-Technology/JIRA-Update-Changes-to-The-Bug-Reporting-Process/ba-p/1660981

  4. Yes, both Di and Anon:
    the situation you describe identifies certain elements which are endemic to the platform we stand on, or which require remedial action that will have to come from Linden Lab.
    In the meantime we can be devising strategies and methods which effectively reduce the impact of a diversity of lag-inducing factors on our performances.

    The Sudden Massive Sim Lag is a fairly new kind of lag event. I am accustomed to using the stats meters and, while they may be in some ways inaccurate for fine monitoring, they do break down the total sim functioning into differentiable phenomena. I am not giving up using my eyeballs to execute vision events because they do not reliably measure an inch. I will use an added tool to measure the inch.

    We do have some more recent clues from Linden Lab:

    "Object Rez
    Baker Linden came by and explained,
    “I wanted to pop in and mention that the new object rez code
    I’ve been working on is going to QA for testing in the next few weeks.

    We’re looking at some sim stats, and it seems like the threading is doing a good job of keeping the simulator running at full frame rate. That’s good because that means more things can get processed every frame.

    Once the code is live somewhere, I’ll announce the location here and
    in Thursday’s Beta User Group, so y’all can start testing it to find
    any additional bugs we didn’t catch in QA.”

    This is the code that takes the rez process into separate threads
    so that the SERVER PHYSICS is not STALLED
    when objects like LARGE LINK SETS are REZZED."
    -- http://blog.nalates.net/2013/01/22/second-life-news-2013-4-2/
    (Caps mine)
    I wonder if this is one of the sources of the Sudden Massive Sim Lag
    that has been afflicting so many dance performances, and will it fix that problem?
    (It makes a simple explanation for the timing of this problem arriving in SL and
    for the statistics tables I have been observing, which Anon's comments seem to validate.)
    Yet to be seen.

    Replies
    1. Well, for what it's worth... 2 cents... maybe 3 in a good economy... I have furniture (I won't say what kind, but you know the kind) whose "rez option" recently started giving stack-heap errors out of the blue and then would freeze. When I approached the maker they asked me a million questions, but in the end they said to ask for a sim restart. That cleared it up... for now, but I have another one of "those type of objects", made by a different creator, that also recently got stack-heap errors out of the blue and then would lock up.

      Well... I just happen to dabble a lil bit in the code world and found that an object I wrote, which always worked, suddenly started throwing stack-heap errors. Luckily I was able to pinpoint the general area and execution point when it happened and actually inspect it just prior to it locking up. All of a sudden I was running out of memory.

      It appears (and I am speculating) that a new LL server release algorithm probably gains efficiency time-wise in cycle usage at the expense of more cost in memory resources... just my gut. Bottom line, I ran out of memory resources sooner than I would have prior to the server release. So I split my script into 2 scripts, which it needed to be anyways, and "voila", no more stack-heaps.

      Not a scientific study, but with little insight and everyone taking stabs in the dark, take it for what it can be worth to you.

      And that's another Yummy's (this time bitter) taste of things.

    2. Interesting, Yummy! What you are pointing at can very well be associated with a MONO issue. I have noticed issues of this sort too. I found that I needed to add a cap check with a sleep to stop the Stack-Heap violence. My thought was that the MONO garbage collector was not running in the same cycle, or in a cycle relevant to the ones needed, to clear the memory. Forcing a sleep moment gives MONO time to do its thingy. Strange thing is it can happen with an idle script! It is odd to watch 5 boxes with “Hello World” Stack-Heap crash when you have not done a thing.

      Di points out file swapping on a server. This is a very clear point of lag, and LL had expressed concerns when they introduced MONO into the mix, noting that going from 16K to 64K per script would have a massive effect on things. This issue won't go away until we see massive RAM added to the servers.

      “The Sudden Massive Sim Lag is a fairly new kind of lag event.” Not really. This has always been a spotty issue. What is new is this new round of SMSL. In the past LL have found and fixed the issue causing it. KUDOS to LL for that. I am not trying to split hairs here, just inform. Anonymous points out “Updated the Havok physics engine to version 2012.1”. I think we can see the relationship with the Baker Linden comment that ATREBOR quoted.

      The stats meter info is not correct. Some information is reported OS side while other information is reported region side. This was written for a different method, as Anonymous points out. The conflict most fail to consider is that LL run MORE regions on a server than they did in the past, clouding it further. It is the only tool we really have, so we are stuck with it.

      It does appear that we are dealing with a vast array of server side issues. The bottom line is that we should not need to change how many avatars can be on a sim because of it. Going into martial law on your audience in a laggy region is just wrong in so many ways.

      I think the article points out a healthy attempt at a solution. That is to have as many avatars as possible sit sit sit! Just hope you don't bore the audience and have 5 jump at the same time, crashing your performance.

      Given time I think LL will sort it all out. Offering patience is all we can do.

  5. I think the article points out a healthy attempt at a solution. That is to have as many avatars as possible sit sit sit! Just hope you don't bore the audience and have 5 jump at that same time, crashing your performance. LOL
