Автор: Milberg K.  

Теги: software   computer science  

ISBN: 978-158347-098-5

Год: 2009

Текст
                    
Driving the Power of AIX

Driving the Power of AIX Performance Tuning on IBM Power Systems Ken Milberg MC Press Online, LP Lewisville, TX 75077
® ™ Driving the Power of AIX : Performance Tuning on IBM Power Systems Ken Milberg Photography by Michele Huttler Silver, Michele Silver Photography First Printing—October 2009 © 2009 Ken Milberg. All rights reserved. Portions © MC Press Online, LP Every attempt has been made to provide correct information. However, the publisher and the author do not guarantee the accuracy of the book and do not assume responsibility for information included in or omitted from it. IBM is a registered trademark of International Business Machines Corporation in the United States, other countries, or both. AIX, POWER and POWER6 are registered trademarks of International Business Machines Corporation in the United States, other countries, or both. All other product names are trademarked or copyrighted by their respective manufacturers. Printed in Canada. All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. MC Press offers excellent discounts on this book when ordered in quantity for bulk purchases or special sales, which may include custom covers and content particular to your business, training goals, marketing focus, and branding interest. For information regarding permissions or special orders, please contact: MC Press Corporate Offices 125 N. Woodland Trail Lewisville, TX 75077 USA For information regarding sales and/or customer service, please contact: MC Press P.O. Box 4300 Big Sandy, TX 75755-4300 USA ISBN: 978-158347-098-5
Acknowledgements First and foremost, this book is dedicated to my children—Hadara, Ori, Rani and Elana, whom I love and adore with all my heart and who have been a constant source of joy to me throughout their lives. Thank you Vera, for providing me with these incredible children. Thank you Mom and Dad, for all the love you have given me through the years. This book is dedicated to my parent’s family, all of whom perished during the Holocaust, except for my dear Aunt Molly, who passed away several years ago and whom I still miss dearly. The publication of this book could not have been possible without the support and encouragement of many individuals throughout my career. I want to thank David Brodt for giving me my first job in systems and keeping me around even after I mistakenly destroyed his entire B90 Burroughs system (even though it was a Burroughs VMS bug) along with all his backups during a failed operations activity. I stayed on and led their project, my first, to convert their legacy system to Unix over 20 years ago—SCO Unix 3.2.2. I want to thank Terry Every for giving me my first opportunity in NYC in the early 1990s as a Unix Systems Manager, working on HP9000s and HP-UX. I learned so much from him, less about systems (though he is technical), and more about people and class. I want to thank Mark Mulconry for giving me my first opportunity to manage a large production IBM AIX environment and my homeboys at Empire BC/BS (Greg Pastuzyn, Steven Goldman, Steven Gerasimovich, Amit Goel, Arkady
Getselis) as well as my homegal, Marilyn Walter. To Winston, an AIX system administrator who worked for me at the World Trade Center. We’ll always remember you. You will never be forgotten! I want to thank the folks at IBM, who at the turn of the century thought enough of me to put me on their AIX performance team in Washington DC, working for the US Census Bureau (which is perhaps where this whole train started). I want to thank Nicolete McFadden and Bharvi Parikh for their work helping me through many IBM initiatives, including founding and leading the NY Metro PowerAIX/Linux Users Group. And thanks go to Randy Default, the former President of COMMON, who made me a permanent Guest on their Board of Directors representing AIX interests. I want to thank Bess Protacio and her AIX team of Bradd Baldwin, Abid Khwaja, and Jonathan Mencher for the times we had at Adecco migrating to AIX from that nameless Sun Unix operating system. I want to thank Dan Raju and Wahid Ullah for the great AIX fun we had in Ann Arbor and Ed Braunstein for providing my first exposure to AIX in 1996, when I was a CIO (before my career starting going downhill) and for the great times we had at LAS. I want to thank Brian Shorter, Mitch Diodato, Bruce Slaven, Jennifer Weems and Tim Paramore at Arrow for giving me the confidence and tools to start my own company, PowerTCO an IBM Business Partner, and for Raffi Princian for believing in me and leading our first assessment. Thanks also to the fine folks at Future Tech (Bob Venero, Phil Preston, Karen Sinda, Mike Rosatto, Steven Vames, Bill Daub, and Lynn Keegan) who showed me the ropes of working for a BP. It must be said that I would not even have considered writing if not for the folks at TechTarget who took a chance years ago on a neophyte writer. Thank you TechTarget (in the early days it was Amy Kucharik and Jan Stafford) for sticking by me and helping me launch my Ask The Expert Linux site as well as my writing career. I still do quite a bit of work for searchdatacenter.techtarget.com and searchenterpriselinux.com and love the assignments (thank you Matt Stansberry and Leah Rosin). You can see my blog also at itknowledgeexchange, another TechTarget offering. I want to thank James Proescholdt, formerly of IBM Systems Magazine for giving me the opportunity to write for them and Rob McNelly, who runs their AIXchange blog, who provided me with contact information that enabled me to further my writing career with IBM. Thank you to Natalie Boike, my present editor at IBM Systems Magazine for all the fun work. I am also very thankful to Troy Mott at Backstop Media for being my editor/publisher on content through
IBM developerWorks and for helping advise me during the early conceptual stages of my book. I want to thank Susan Schreitmueller, IBM’s most renowned and well-known performance expert, who reviewed my book and from whom I learned so much. And Jaqui Lynch, among other performance gurus, from whom I also learned so much through the years. Finally the publication of this book could not have been possible but for the ungrudging efforts put in by the writer of the foreword of my book, IBM Distinguished Engineer Joefon Jann, and for Chris Gibson, IBM AIX guru and writer who took the time out of his busy schedule to proofread the myriad mistakes in my first drafts. I want to thank Michele Huttler Silver, with Michele Silver Photography (msilverphotograpy.com) for the incredible job she did with the breathtaking photographs you will see interspersed throughout the book. And thanks again to my publisher Merrikay Lee—for giving me the opportunity to write this book, for believing in me, for sponsoring our book signing, book fair, and presentation seminar during the summer of 2009 in NYC and for taking a chance on an IBM Power AIX book. Thanks also go to my copy editor, Katie, for the stellar job. You are amazing! I’ll add a special mention to my dear friends, Steven and Shelly, Mitch and Candy, David and Laurie, who’ve always been there for me and my children, through thick and thin. Last, but definitely not least, thank you M—the love of my life, the one who makes my heart sing and race, and the one person in my life who has never wavered in her belief in me. You’re my muse and inspiration to keep going (with this book and through all life’s trials and tribulations), and one of the few folks who think that I am more than an idiot savant. You are the one who has helped keep things together for me, through good times and bad. —Ken Milberg September 2009

Contents Foreword Preface xi xiii SECTION I: INTRODUCTION Chapter 1: Performance Tuning Methodology Step 1. Establishing a Baseline Step 2. Stress Testing and Monitoring Step 3. Identifying the Bottleneck Step 4. Tuning Step 5. Repeat 3 3 4 5 5 6 Chapter 2: Introduction to AIX Unix AIX AIX Market Share 7 7 8 9 Chapter 3: Introduction to POWER Architecture POWER5 POWER6 11 13 14 Section I: Summary, Tips, and Quiz Summary Tips 17 17 18 QUIZ Multiple Choice True or False Fill In the Blank(s) 19 19 20 20
vi Contents SECTION II: CPU Chapter 4: CPU: Introduction 23 Chapter 5: CPU: Monitoring vmstat (Unix-generic) sar (Unix-generic) iostat (Unix-generic) w (Unix-generic) lparstat (AIX-specific) mpstat (AIX-specific) topas (AIX-specific) nmon Using nmon for Historical Analysis ps (Unix-generic) Tracing Tools tprof Timing Tools time timex 25 25 28 31 31 32 33 35 36 37 38 39 39 41 41 42 Chapter 6: CPU: Tuning Process and Thread Management nice renice ps schedo sched_R and sched_D fixed_pri_global timeslice bindprocessor smtctl gprof 45 45 46 47 48 48 50 51 51 52 53 54 Section II: Summary, Tips, and Quiz Summary Tips 55 55 55 QUIZ Multiple Choice True or False Fill in the Blank(s) 57 57 59 59
Contents vii SECTION III: MEMORY Chapter 7: Memory: Introduction Virtual Memory Manager Computational Memory File Memory Paging and Swapping VMM Tuning Evolution 63 63 65 65 65 66 Chapter 8: Memory: Monitoring vmstat (Unix-generic) Virtual Memory Summary sar (Unix-generic) lsps (AIX-specific) ps (Unix-generic) svmon (AIX-specific) Memory Leak 67 68 71 71 73 73 74 77 Chapter 9: Memory: Tuning vmo minperm, maxperm, maxclient, and lru_file_repage minfree and maxfree Page Space Allocation How Much Paging Space? Paging Space Tuning Thrashing and Load Control Memory Scanning and lrubucket rmss 81 81 82 84 85 86 87 87 88 89 Section III: Summary, Tips, and Quiz Summary Tips 91 91 92 QUIZ Multiple Choice True or False Fill in the Blank(s) 94 94 96 96 SECTION IV: DISK I/O Chapter 10: Disk I/O: Introduction Direct I/O Concurrent I/O 99 101 101
viii Contents Asynchronous I/O Logical Volumes and Disk Placement: Intra- and Inter-Policy Inter-Disk Policy File Systems 102 102 105 105 Chapter 11: Disk I/O: Monitoring sar topas Logical Volume Monitoring AIX LVM Commands filemon and fileplace filemon fileplace 107 107 108 111 112 116 116 117 Chapter 12: Disk I/O: Tuning lvmo ioo JFS2 Tuning Options 119 119 120 122 Section IV: Summary, Tips, and Quiz Summary Tips 125 125 126 QUIZ Multiple Choice True or False Fill in the Blank 128 128 129 130 SECTION VNETWORK I/O 131 Chapter 13: Network I/O: Introduction Network I/O Overview NFS Media Speed Network Subsystem Memory Management Virtual and Shared Ethernet 133 134 136 139 141 141 Chapter 14: Network I/O: Monitoring netpmon Monitoring NFS nfsstat nfs4cl 143 145 148 149 151
Contents ix netpmon and NFS Monitoring Network Packets iptrace, ipreport, and ipfilter tcpdump 152 154 154 156 Chapter 15: Network I/O: Tuning Name Resolution Maximum Transfer Unit Tuning: Client Tuning: Server 157 161 162 162 164 Section V: Summary, Tips, and Quiz Summary Tips 167 167 168 QUIZ Multiple Choice True or False Fill in the Blank 170 170 171 172 SECTION VI: BONUS TOPICS Chapter 16: AIX 6.1 Introduction Memory CPU Disk I/O JFS2 iSCSI I/O Pacing Asynchronous I/O Network NFS 175 175 176 179 179 179 179 180 180 182 183 Section VI: Chapter 16 Quiz Multiple Choice True or False Fill in the Blank 185 185 187 187 Chapter 17: Tuning AIX for Oracle Memory CPU 189 189 192
x Contents Asynchronous I/O Servers Concurrent I/O Oracle Tools Statspack Oracle Enterprise Manager 192 193 194 194 195 Section VI: Chapter 17 Quiz Multiple Choice True or False Fill in the Blank 197 197 198 198 Chapter 18: Linux on Power Monitoring Handy Linux Commands Virtualization Tuning 199 199 200 201 202 Section VI: Chapter 18 Quiz Multiple Choice True or False Fill in the Blank(s) 205 205 205 206 Quiz Answers Section I: Introduction Section II: CPU Section III: Memory Section IV: Disk I/O Section V: Network I/O Section VI / Chapter 16: AIX 6.1 Section VI / Chapter 17: Tuning AIX for Oracle Section VI / Chapter 18: Linux on Power 207 207 207 207 207 208 208 208 208
Foreword As computers have become increasingly sophisticated, the task of tuning the operating system to yield high performance for its applications while providing optimal total cost of ownership (TCO) for the IT owners has become increasingly complex. In the early days of computers, the OS typically ran only one application at a time, and most performance tuning was targeted at minimizing the number of instructions required to run the application within the limited resources (CPU, memory, disk/tape, networking) of a uniprocessor system. With advances in virtual memory, multitasking, multicore, caches, faster networks, huge storage devices and databases, and, in the past decade, the flourishing of virtualization technologies (e.g., LPARs, DLPARs, simultaneous multithreading, WPARs, virtual Ethernet, virtual SCSI), the task of performance optimization has become far more complex and has shifted to tuning the OS and balancing the hardware resources across LPARs within a hardware box. Nonetheless, the tuning goals remain the same: to yield high performance for applications while providing optimal TCO for IT owners. Ken Milberg, with his rich background in managing, operating, and writing about Unix and Linux systems, has abstracted the essence of the complex tuning process, which he clearly describes in Chapter 1. In fact, the tuning methodology described therein is applicable to most OS types: establish a baseline, stress test and monitor, identify the bottleneck, tune, and repeat. The rest of the book highlights the important monitoring and tuning tools for each major subcomponent of the AIX/POWER system. The progression of the topics is great, from the core to progressively further-away
xii Foreword components — from CPU to memory to disk to network, paralleling the AIX tools schedo, vmo, ioo, no, and nfso. The tips and quiz at the end of each section are a treat. Not only do they give a summary review of the key items covered, but they also provide a lot of fun and satisfaction, especially when you can verify whether you’ve understood everything correctly by checking against the provided answers. To sum up, this is a book that every AIX system administrator and systems manager should read. —Joefon Jann Distinguished Engineer, Research Lead in AIX and POWER Systems Software IBM Thomas J. Watson Research Center, Yorktown Heights, New York
Preface Why this book? Although a Google search may show a fair number of books about AIX, including a couple about performance tuning, just about all of them are at least a decade old. IBM provides a tremendous amount of information through its portals and Redbooks, but it is not unusual for administrators seeking to tune their boxes to examine dozens of Web sites and Redbooks before finding the information they need. This book brings it all together for you, and more. Further, I review best practices and provide tips and tricks that are not usually covered in the IBM literature. Last, the book provides an impartial view (I don’t work for IBM) of systems performance tuning based on the real-world experiences of a battle-scarred systems administration veteran. This book is intended for systems professionals who need to understand, monitor, and control the factors that affect AIX performance on their IBM POWER servers. It also includes bonus chapters on the recent innovations of AIX 6.1, Linux on Power (LoP) performance, and running Oracle on AIX. This is an intermediate book about AIX performance analysis and systems tuning. The material comes both from IBM sources and from real life, based on my experiences as a Unix professional supporting production systems for more than 20 years (almost half of them on AIX), in many capacities and for a broad range of industries. Because this book is not an introduction to Unix, prior knowledge of Unix (and AIX in particular) is recommended, although I would not say it is a prerequisite. The book covers tuning methodology, systems monitoring,
xiv Preface and performance tuning on all subsystems, including CPU, RAM, and I/O (network and disk). As an introduction, I review time-tested tuning and analysis methodology, steps that will assist you throughout the tuning lifecycle. The monitoring sections describe tools that will let you immediately gain a foothold (taking quick-and-dirty snapshots on the health of the system) on your system. They also discuss tools that will help you collect historic data for the purpose of analyzing trends and results. All the tools used in this book either are part of the standard IBM AIX systems build or are opensource products written by folks who work for IBM (e.g., nmon) and used widely in the field of battle. —Ken Milberg August 2009
Section I Introduction This section introduces the concept of performance tuning methodology and discusses the AIX operating system and how it has evolved through the years. We also explore the development of IBM’s POWER architecture and how it has changed from its early stages to the POWER6.

C h a p t e r 1 Performance Tuning Methodology Performance tuning is a never-ending process, and an important concept to understand is that it is not unusual to fix one bottleneck only to create another. That’s part of what makes our lives as AIX administrators so indispensable! The following time-tested tuning and analysis methodology will aid you throughout the tuning lifecycle: 1. Establish a baseline 2. Stress test and monitor 3. Identify bottleneck 4. Tune 5. Repeat (starting with step 2) Step 1. Establishing a Baseline Well before you ever tune a system, it is imperative to establish a baseline. The baseline is a snapshot of what the system looks like when you first put it into production, while it is performing at acceptable enough levels to the business for it to be deployed. The baseline should not only capture performance statistics but also document the actual configuration of the system (amount of memory, CPU, and disk). It’s important to document the system configuration because otherwise you won’t be comparing apples with apples when the time comes to examine the baseline to your current
4 Chapter 1: Performance Tuning Methodology configuration. This step is particularly relevant in our new partitioned world, when you can dynamically add or subtract CPU resources at a moment’s notice. To come up with a proper baseline, you must first identify the appropriate tools to use for monitoring. Some tools are more suited to immediate gratification, while others are geared more toward historical trending and analysis. Tools such as nmon and topas, which we’ll discuss in detail in Chapter 5, can serve both purposes. Once you’ve identified your monitoring tools, you need to gather your statistics and performance measurements. This information helps you to define what an acceptable level of performance is for a given system. You need to know what a well-performing system looks like before you start receiving calls complaining about performance. You should also work with the appropriate application and functional teams to define exactly what a well-behaved system is. At that time, you would translate that definition into an acceptable service level agreement (SLA), on which the customer would sign off. Step 2. Stress Testing and Monitoring This step is where you monitor the system at peak workloads and during problem periods. Stressing your system, preferably in a controlled environment, can help you make the right diagnosis — an essential part of performance tuning. Is your bottleneck really a CPU bottleneck, or is it related more to memory or I/O? It’s also important not to fall too much in love with any one utility. I like to use several monitoring tools here to help validate my findings. For example, I might use an interactive tool (e.g., vmstat) and then a data capturing tool (nmon) to help me track data historically. The monitoring step is critical because you cannot effectively tune anything without having an accurate historical record of what has been going on in your system, particularly during periods of stress. Larger organizations that recognize the importance of this process even have their own stress-testing teams, which work together with application and infrastructure teams to test new deployments before putting them into production.
Step 4. Tuning 5 It’s also essential here to establish performance policies for the system. You can determine the measures that are relevant during monitoring, analyze them historically, and then examine them further during stress testing. Step 3. Identifying the Bottleneck The objective of stressing and monitoring the system is to determine the bottleneck. Ask any doctor: you cannot provide the correct medicine (the tuning) without the proper diagnosis. If the system is in fact CPU-bound, you can run additional tools, such as curt, ps, splat, tprof, and trace (we’ll discuss these utilities later), to further identify the actual processes that are causing the bottleneck. It’s possible that your system might in fact be memory- or I/O-bound and not CPU-bound. Fixing one bottleneck, such as a memory problem, can actually cause another, such as a CPU bottleneck, because in this case your system is now letting the CPU perform to its optimum capacity. At one point in time, it might not have had the capacity to handle the increased amount of resources given to it. I’ve seen this situation quite often, and it isn’t necessarily a bad thing. Quite the opposite: it ultimately helps you isolate all your bottlenecks and tune the system to its max. You’ll find that monitoring and tuning systems is quite a dynamic process and not always predictable. That’s what makes performance tuning as challenging as it is. Step 4. Tuning Once you’ve identified the bottleneck, it’s time to tune it. For a CPU bottleneck, that usually means one of four solutions: ● ● Balancing system workload — This solution involves running processes at different intervals to more efficiently use the 24-hour day. More often that not, this is what we usually do to resolve CPU bottlenecks. Tuning the scheduler — Tuning the scheduler using nice or renice helps you assign different priorities to running processes to prevent CPU hogs.
6 Chapter 1: Performance Tuning Methodology ● ● Tuning scheduler parameters — Adjust scheduler parameters to finetune priority formulas. For example, you can use the schedo command to change the amount of time the operating system lets a given process run before calling the dispatcher to choose another. Increasing resources — Add CPUs or, in a virtualized environment, reconfigure logical partitions (LPARs) to boost available resources. This solution might include uncapping partitions or adding more virtual processors to existing partitions. Virtualizing the partitioned environment appropriately can help increase physical resource utilization, decrease CPU bottlenecks on specific LPARs, and reduce the expense of idle capacity in LPARs that are not “breathing heavy.” Step 5. Repeat After tuning, you need to go through the process again, starting with step 2, stress testing and monitoring. Only by repeating your tests and consistently monitoring your systems can you determine whether your tuning has made an impact. I know some administrators who simply tune certain parameters based on best practices for a specific application and then move on. That is the worst thing you can do. For one thing, what works in some environments might not work in yours. More important, how do you really know whether what you’ve tuned has helped the bottleneck unless you look at the data? To reiterate, AIX performance tuning is a dynamic and reiterative process, and to achieve real success, you need to consistently monitor your systems, which can only happen once you’ve established a baseline and SLA. The bottom line is, if you can’t define the behavior of a system that runs well, how will you define the behavior of a system that doesn’t?
C h a p t e r 2 Introduction to AIX AIX — which stands for Advanced Interactive eXecutive — is a POSIXcompliant and X/Open-certified Unix operating system introduced by IBM in 1986. While AIX is based on UNIX System V, it has roots in the Berkeley Software Distribution (BSD) version of Unix as well. Today, AIX has an abundance of both flavors (you can go with chocolate one day and vanilla the next), providing another reason for its popularity. Unix From its introduction in 1969 and development in the mid-1970s, Unix has evolved into one of the most successful operating systems to date. The roots of this operating system go as far back as the mid-1960s, when AT&T’s Bell Labs partnered with General Electric and the Massachusetts Institute of Technology (MIT) to develop a multi-user operating system called Multics (which stood for Multiplexed Information and Computer Service). Dennis Ritchie and Ken Thompson worked on this project until AT&T withdrew from it. The two eventually created another operating system in an effort to port a computer game that simulated space travel. They did so on a Digital Equipment Corporation (DEC) PDP-7 computer, and they named the new operating system Unics (for Uniplexed Information and Computing Service). Somewhere along the way, “Unics” evolved into “Unix.”
8 Chapter 2: Introduction to AIX AIX AIX was the first operating system to introduce the idea of a journaling file system, an advance that enabled fast boot times by avoiding the need to perform file system checking (fsck) for disks on reboot. AIX also has a strong, built-in Logical Volume Manager (LVM), introduced as early as 1990, which helps to partition and administer groups of disks. Another important innovation was the introduction of shared libraries, which avoided the need for an application to statically link to the libraries it used. The resulting smaller binaries used less of the hardware RAM to run and required less disk space for installation. IBM ported AIX to its RS/6000 platform of products in 1989. The release of AIX Version 3 coincided with the announcement of the first RS/6000 models. At the time, these systems were considered unique in that they not only outperformed all other machines in integer compute performance but also beat the competition by a factor of 10 in floating-point performance. Version 4, introduced in 1994, added support for symmetric multiprocessing (SMP) with the first RS/6000 SMP servers. The operating system evolved until 1999, when AIX 4.3.3 introduced workload management (WLM). In May 2001, IBM unveiled AIX 5L (the L stands for “Linux affinity”), coinciding with the release of its POWER4 servers, which provided for the logical partitioning of servers. In October of the following year, IBM announced dynamic logical partitioning (DLPAR) with AIX 5.2. The latest update to AIX 5L, AIX 5.3 (introduced in August 2004), provided innovative new features for virtualization, security, reliability, systems management, and administration. Most important, AIX 5.3 fully supported the Advanced Power Virtualization (APV) capabilities of the POWER5 architecture, including micropartioning, virtual I/O servers, and symmetric multithreading (SMT). Arguably, this was the most important release of AIX in more than a decade, and it remains the most popular (as of this writing). That is why we’ll primarily focus on AIX 5.3 for the purposes of this book. IBM announced AIX 6-Beta in May 2007 and formally introduced AIX 6.1 in November 2007. Major innovations of AIX 6.1 include workload
AIX Market Share 9 partitions (WPARs), which are similar to Solaris containers, and Live Application Mobility (not available with Solaris), which lets you move the partitions without application down time. Chapter 16 discusses performance monitoring and tuning on AIX 6.1. AIX Market Share AIX celebrated its 20th anniversary in January 2006, and it appears to have an extremely bright future in the Unix space. IBM’s AIX has been the only Unix that increased its market share through the years, and IBM continues to own the market space for Unix servers. Most of the Unix growth at this time stems from IBM. AIX has benefited from the many hardware innovations that the POWER platform has introduced through the years, and it continues to do so. IBM has also made good decisions around its Linux strategy. Linux, supported natively on the POWER5, more or less complements, rather than competes with, AIX on the POWER architecture.

C h a p t e r 3 Introduction to POWER Architecture The “POWER” in POWER architecture stands for Power Optimization with Enhanced RISC, and it is the processor used by IBM’s midrange Unix offering, AIX. POWER is a descendant of IBM’s 801 CPU and is a second-generation Reduced Instruction Set Computer (RISC) processor. It was introduced in 1990 to support Unix RS/6000 systems. The POWER architecture incorporated many characteristics that were already common in most RISC architectures. The instructions were fixed in length (four bytes) and had consistent formats. What made the architecture unique among existing RISC architectures was that it was functionally partitioned, separating the functions of program flow control, fixed-point computation, and floating-point computation. The objective of most RISC architectures was to be extremely simple so that implementations would have an extremely short cycle type. This approach would result in processors that could execute instructions at the fastest possible clock rate. The designers of the POWER architecture chose to minimize the total time spent to complete a task. This time was a byproduct of three different components: path length, the number of cycles needed to complete an instruction, and cycle time. During the early 1990s, five different RISC architectures actively competed with one another. IBM partnered with Apple and Motorola to come up with a common architecture that would meet the standards of an alliance they would form. The first design was very simple, and all its instructions
12 Chapter 3: Introduction to POWER Architecture were completed in one cycle. It lacked floating-point and parallel processing capability. The POWER architecture was a real attempt to correct this flaw. It consisted of more than 100 instructions and was known as a complex RISC system. The POWER1 chip consisted of 800,000 transistors per chip and was functionally partitioned. It had separate floating-point registers and could scale from low-end to the highest-end workstations. The first chip actually consisted of several chips on a single motherboard but was refined to one RISC chip with more than a million transistors. Some of you may be surprised to learn that this chip was actually used as the CPU for the Mars Pathfinder mission. The POWER2 chip was released in 1993 and was the standard-bearer for nearly five years. It contained 15 million transistors per chip. It also added a second floating-point unit (FPU) and extra cache. This chip was known for powering the IBM Deep Blue supercomputer that would beat Garry Kasparov at chess in 1997. (Joefon Jann, whose team developed this system, wrote the Foreword to this book.) The POWER3 architecture was the first 64-bit symmetric multiprocessor. Designed to work on both scientific and technical computer applications, it included a data prefetch engine, dual floating-point execution units, and a nonblocked interleaved data cache. It used copper interconnect, which delivered double the performance for the same price. The POWER4 (code-named Regatta) architecture, released in 2001, featured 174 million transistors per processor. It incorporated micron copper and silicon-based technology. Each processor had 64-bit, 1 GHz PowerPC cores and could execute as many as 200 instructions simultaneously. POWER4 became the driving force behind the IBM Regatta Servers, which supported logical partitioning. The POWER4 processor supported logical partitioning with a new privileged processor state called the POWER Hypervisor mode.
POWER5 13 As wonderful as the Regattas were, if you purchased one shortly before the POWER5 systems were released, you were not a happy camper. POWER5 IBM’s POWER5 architecture, introduced in 2003, contained 276 million transistors per processor. It was based on the 130 nm copper/silicon-oninsulator (SOI) process and featured chip multiprocessing, a larger cache, a memory controller on the chip, simultaneous multithreading (SMT), advanced power management, and improved Hypervisor technology. The POWER5 was built to allow up to 256 logical partitions and was available on IBM’s System i and System p servers. Each POWER5 core is designed to support SMT and single-threaded modes. The software (the Hypervisor) switches the processor from SMT to single-threaded mode. Some of the objectives of the POWER5 were ● To maintain binary capability with older POWER4 systems ● To enhance and extend symmetric multiprocessing (SMP) scalability ● To improve performance and reliability ● To provide additional server flexibility ● To improve power efficiency ● To provide virtualization capabilities As a result of its dual-core design and support for SMT, one POWER5 chip appears as a four-way microprocessor to the operating system. Processors using SMT can issue multiple instructions from different code paths during a single cycle. Multiple instructions from both hardware threads can be issued from one cycle.
14 Chapter 3: Introduction to POWER Architecture Figure 3.1 depicts the Hypervisor, without which there is no virtualization. Programs AIX 5L Programs Linux Programs IBM i Open Firmware RTAS Open Firmware RTAS TIMI SLIC POWER Hypervisor POWER 64-bit Processor Figure 3.1: Hypervisor architecture As you examine this architecture, you can see that the layers above the POWER Hypervisor are similar, but the contents are characterized by the operating system. The layers of code supporting AIX and Linux consist of system firmware and Run-Time Abstraction Services (RTAS). Open Firmware and RTAS are both platform-specific firmware, and both are tailored by the platform developer to manipulate the specific platform hardware. In the POWER5 processor, IBM introduced further design enhancements that enabled the sharing of processors by multiple partitions. The POWER Hypervisor Decrementer (HDEC) is a new hardware facility in the POWER5 design that is programmed to provide the POWER Hypervisor with a timed interrupt independent of partition activity. It was the POWER5 architecture, along with the extraordinary virtualization capabilities of Advanced Power Virtualization (APV) that really paved the way for server consolidation around IBM POWER systems. (IBM has since rebranded the term Advanced Power Virtualization to PowerVM.) POWER6 The POWER6, with approximately 790 million transistors, debuted in June 2007. Its dual-core design enabled it to reach 4.7 GHz. Innovations
POWER6 15 in energy and cooling let it retain the same power consumption as the POWER5 while almost doubling performance. The POWER6 has hardware support for decimal arithmetic. It also has the first decimal floating-point unit integrated in silicon. Several important APV enhancements were also released with the POWER6, including Live Partition Mobility, Decimal Floating Point, and Dynamic Energy Management. It was around this time that IBM rebranded APV to PowerVM.

Section I Summary, Tips, and Quiz Summary ● The five-step performance tuning methodology is: 1. Establish a baseline 2. Stress test and monitor 3. Identify bottleneck 4. Tune 5. Repeat (starting with step 2) ● ● ● ● ● Unix was “invented” in 1969, the result of an effort by Dennis Ritchie and Ken Thompson to port a computer game to a DEC PDP-7 following their work with AT&T’s Bell Labs. AIX, which stands for Advanced Interactive eXecutive, was introduced by IBM in 1986. It is the first version of Unix to provide a journaling file system and to incorporate a Logical Volume Manager (LVM) in the base operating system. IBM’s Power Optimization with Enhanced RISC (POWER) architecture was introduced in 1990 to support RS/6000 systems. AIX 5L, introduced in May 2001, provided for the logical partitioning of servers with the POWER4 architecture. AIX 5.3, released in 2004, would become the most important release of AIX in more than a decade. It boasted support for Advanced Power Virtualization (APV) and the new POWER5 architecture. IBM has since rebranded the term Advanced Power Virtualization to PowerVM.
18 Section I: Summary, Tips, and Quiz ● AIX 6 and the POWER6 architecture were released in 2007 (the former in the spring and the latter in the fall). AIX 6 enhancements include workload partitioning and Live Application Mobility. POWER6 innovations include Live Partition Mobility, Decimal Floating Point, and Dynamic Energy Management. Tips ● ● ● ● ● ● ● ● Do not, under any circumstances, introduce an application into production without first implementing a proactive performance monitoring strategy. Otherwise, you will never really know what your subsystems (CPU, I/O, memory) should look like when the system is performing well and its performance has been deemed acceptable to the business and/or application folks. The time to start monitoring your system is before you’ve been told that the system is slow, not after. Use more than one monitoring tool so that you can use each to validate the findings of the others. Create multiple environments for your application architecture, including development, test, and/or quality assurance. Establish a deployment and stress-testing strategy for how applications are tested and deployed into production. These measures will help you ensure the reliability and performance of your applications. Spend time analyzing your performance data. Remember, you can’t prescribe the right medicine (tune) without a proper diagnosis (analysis of historic data). Introduce one change at a time when tuning your systems. Otherwise, how will you really know what the true effect of each change is? Use the virtualization capabilities of AIX 5.3 and APV (now PowerVM). These innovations can help you save big money on total cost of ownership and help drive a large return on investment for server and data center consolidation projects. Don’t upgrade to AIX 6.1 simply because you’ve fallen in love with the new technology. Remember that your production application might not share that love. Create a 6.1 partition on your POWER server so
Multiple Choice 19 that you can start playing nicely in the sandbox. Note that POWER6 innovations such as Live Partition Mobility are fully supported on AIX 5.3 (Technology Level 7, or TL_7). Quiz Multiple Choice 1. AIX stands for a. Advanced Interactive Unix b. Advanced Interactive eXecutive c. Advanced Unix d. It’s just an acronym. 2. AIX was introduced in a. 1969 b. 1986 c. 1990 d. 1994 3. Which is the first Unix that introduced journaling file systems? a. Solaris b. HP-UX c. AIX d. Linux
20 Section I: Summary, Tips, and Quiz 4. Advanced Power Virtualization was introduced with which combination? a. AIX 5.3 and POWER5 b. AIX 5.2 and POWER5 c. AIX5L and POWER4 d. AIX 6.1 and POWER5 5. DLPAR stands for a. Logical partitioning b. Advanced power virtualization c. Dynamic logical partitioning d. Nothing True or False 6. Linux cannot run natively on the POWER architecture. 7. Performance monitoring and tuning is a never-ending process. 8. Fixing a bottleneck should not cause another bottleneck to occur. 9. Never make more than one tuning change at the same time. Fill In the Blank(s) 10. Fill in the missing steps of the five-step tuning methodology described in this book: 1. __________________ 2. Stress test and monitor 3. __________________ 4. __________________ 5. __________________
Section II CPU This section provides an overview of CPU monitoring and tuning and discusses best practices for CPU performance tuning, given the various considerations that can impact performance.

C h a p t e r 4 CPU: Introduction Unlike other subsystems (e.g., memory, I/O), when it comes to CPU, there is less to actually tune and more you can do on the back end (e.g., balancing systems workload) to ensure your systems are running smoothly. As a Unix administrator, you need to understand which tools are best used for which purpose. As far as monitoring is concerned, some tools are better suited to quick-and-dirty system snapshots, while others are clearly more effective for long-term trending and analysis. Choose the tool that best fits the situation you’re faced with. For example, if you’re experiencing a serious production problem, you don’t have five days to perform long-term analysis — you may not even have more than five minutes to come up with something. Nevertheless, you still need to arrive at the right diagnosis to help determine the bottleneck. Often, you’ll find that the bottleneck isn’t actually CPU but relates to memory or I/O. Most users assume CPU is the problem and figure the box needs more horsepower. However, CPU usually isn’t the culprit, and throwing more iron at a problem is neither the quickest nor the most costeffective way to solve the issue. Furthermore, trying to tune the CPU subsystem when virtual memory is the problem could be a real disaster. Before you look for a way to tune, take the time to analyze the system properly. I don’t mean to be condescending here. It’s just that sometimes we don’t take the time to monitor and analyze. We rush to judgment because of the pressure we’re under to solve problems and move on to the next issue or
24 Chapter 4: CPU: Introduction production concern. This is one reason that, when first investigating any performance bottleneck, I prefer to use tools that focus less on a specific area but provide a better understanding of the big picture. The bottom line is that you really want to make sure you have a CPU problem if that’s what you’re trying to tune. More on this point later. As an AIX administrator, you should already know some of the basic tools of performance monitoring — commands such as vmstat and topas — and you should be familiar with ways to identify processes that are CPU hogs. What some people have a hard time understanding is that CPU performance tuning isn’t about running some tuning commands but about proactively monitoring systems, particularly when you’re not experiencing performance problems. Without historical data to analyze, there can be no effective performance tuning. Performance in a virtualized environment provides challenges to even the most senior of administrators, so I’ll also go over specific concepts for a virtualized environment, including simultaneous multithreading (SMT), virtual processors, and the POWER Hypervisor. As far as the methodology, when investigating a perceived performance problem, start by monitoring the statistics of CPU utilization. It’s important to continuously observe system performance because you need to compare the loaded system data with normal usage data, which is the baseline. Because the CPU is one of the fastest components of the system, if CPU utilization keeps the CPU 100 percent busy (which happens to every system at some time), you’ll need to investigate the process that causes this situation. AIX provides many trace and profiling tools to follow the most complex of processes. Don’t be afraid to also use any application or database tools at your disposal to help you further. In a CPU-bound system, all the processors are 100 percent busy, and some jobs may be waiting for CPU time in the run queue. Generally speaking, a system has an excellent chance of becoming CPU-bound if the CPU is 100 percent busy, has a large run queue compared with the number of CPUs, and requires more context switches than usual. That’s the quick and dirty. We’ll get into much more detail in the next couple of chapters.
C h a p t e r 5 CPU: Monitoring AIX systems administrators have much more at their disposal than the average Unix administrator. Not only can you use the standard Unix generic monitoring tools that have been around nearly as long as Unix itself, but a potpourri of AIX-specific commands is also available. Some of these commands come standard with an AIX build, while others are tools that, although not officially supported by IBM, are widely distributed and are used by most administrators. We’ll discuss all these types of monitoring tools in this chapter, including those we don’t use very often. As we go through the tools, note that four commands — mpstat, sar, topas, and vmstat — have been enhanced in AIX 5.3 to enable the tools to report back accurate statistics about shared partitions using Advanced Power Virtualization (PowerVM). The trace-based tools curt, filemon, netpmon, pprof, and splat have also been updated. One command not covered here, lparmon, is the most comprehensive tool you can use in a partitioned environment. vmstat (Unix-generic) vmstat [-fsviItlw] [[-p|-P] pagesize|ALL] [Drives] [Interval [Count]] While the vmstat command is more commonly associated with viewing information about virtual memory (hence the “vm”), it is the first tool most administrators invoke when trying to get a quick assessment of their systems. That’s because vmstat reports back all kinds of pertinent
26 Chapter 5: CPU: Monitoring performance-related information, including data about memory, paging, blocked I/O, and overall CPU activity. Because it reports virtually all subsystem information line by line in a quick and painless way, running vmstat is probably the simplest and most efficient way to gauge exactly what is going on in your system. A common way to run vmstat is for five iterations every two seconds: vmstat 2 5 Running the command in this way produces the following results: # vmstat 2 5 System configuration: lcpu=4 mem=3072MB ent=0.40 kthr memory page faults ----- ------------- ---------------------avm fre cpu ---------- ---------------------- r b re pi po fr sr cy pc ec 1 0 128826 641397 0 0 0 0 0 0 448 87 138 in sy cs us sy id wa 0 1 98 0 0.01 2.8 1 0 128826 641397 0 0 0 0 0 0 385 10 136 0 1 99 0 0.01 2.2 1 0 128826 641397 0 0 0 0 0 0 381 13 138 0 1 99 0 0.01 2.2 1 0 128826 641397 0 0 0 0 0 0 364 40 138 0 1 99 0 0.01 2.4 1 0 128826 641397 0 0 0 0 0 0 610 13 138 0 2 98 0 0.01 3.3 In addition to specific monitoring information, vmstat provides a very high-level snapshot of the system, which can be useful. Just by running vmstat in the preceding snapshot, we know that we have a system with four logical CPUs and 3 GB of RAM and are using shared processors. (In actuality, this partition is using two physical CPUs; symmetric multithreading is enabled, yielding the four logical CPUs. More about SMT later.) Some of the more important fields in the vmstat output include the following: ● r — The average number of runnable kernel threads over the sampling interval you have chosen.
vmstat (Unix-generic) ● ● 27 b — The average number of kernel threads in the virtual memory waiting queue over the sampling interval. The r value should always be higher than b; if it is not, you probably have a CPU bottleneck. fre — The size of the memory free list. Don’t worry too much if this number is really small. More important, determine whether any paging is going on if this size is small. ● pi — Pages paged in from paging space. ● po — Pages paged out to paging space. Our focus in this chapter is on the last section of output, CPU: ● us — User time ● sy — System time ● id — Idle time ● wa — Time spent waiting on I/O ● ● pc — Number of physical processors consumed (displayed only if the partition is configured with shared processors) ec — Percentage of entitled capacity (displayed only if the partition is configured with shared processors) Clearly, the system in our example has no bottleneck to speak of. How can we tell this? Let’s look at us and sy. If these entries combined consistently averaged more than 80 percent, you more than likely would have a CPU bottleneck. If you are in a state where the CPU is running at 100 percent (which happens on occasion to everyone), your system is really breathing hot and heavy. If the numbers are small but the wait time (wa) is on the high side (usually greater than 30), this usually signals that there may be I/O problems, which in turn can cause the CPU not to work as hard as it can. Alternatively, if more time is spent in sy time than us time, your system is probably spending less time crunching numbers and more time processing kernel data. When this happens, it is usually a sign either of badly written code or that something has run amok.
28 Chapter 5: CPU: Monitoring Let’s look at another system: # vmstat 2 5 System configuration: lcpu=4 mem=3072MB ent=0.40 kthr ----r b 2 1 3 2 4 1 2 1 6 2 memory page faults cpu ------------- ---------------------- ------------- ----------------------avm fre re pi po fr sr cy in sy cs us sy id wa pc ec 169829 600290 0 0 0 0 0 0 553 36538 175 64 32 4 0 0.79 84.9 169829 600290 0 0 0 0 0 0 778 33033 175 60 29 11 0 0.84 73.2 169828 600291 0 0 0 0 0 0 403 11904 179 76 10 4 10 0.69 87.8 169828 600291 0 0 0 0 0 0 368 30745 175 82 14 2 2 0.91 85.5 169830 600289 0 0 0 0 0 0 395 27898 173 57 34 4 5 0.89 91.5 What kind of determination can we make here? When we add us and sy, our numbers come out much differently this time — fairly close to 100 percent. This system is clearly CPU-bound. If paging were going on, we would see numbers in the paging (page) columns. In this case, no paging is occurring, nor are there any I/O problems to speak of. Because vmstat is an all-purpose utility, it can help you perform this quick-and-dirty analysis on the fly. If the blocked processes represented three times the number of runnable processes and everything else stayed the same, I/O would likely be causing the CPU bottleneck. In that case, you should be prepared to have even more of a CPU bottleneck once you fix the I/O problem. As I explained previously, this is all part of systems tuning; fixing one bottleneck often causes another. sar (Unix-generic) sar {-A [-M]|[-a][-b][-c][-d][-k][-m][-q][-r][-u][-v][-w][-y][-M]} [-s hh[:mm[:ss]]] [-e hh[:mm[:ss]]] [-P processor_id[,...] | ALL] [-f file] [-i seconds] [-o file] [interval [number]] [-X file] [-i seconds] [-o file] [interval [number]] The sar command is the Unix System Activity Reporting tool (part of the bos.acct fileset). It is most commonly used to analyze CPU activity. The command writes to standard output the contents of the cumulative activity, similar to vmstat. The default version of sar produces a CPU utilization report:
sar (Unix-generic) 29 # sar 2 5 AIX lpar30p682e_pub 3 5 00CED82E4C00 12/24/07 System configuration: lcpu=4 ent=0.40 mode=Uncapped 10:13:40 10:13:42 10:13:44 10:13:46 10:13:48 10:13:50 %sys 31 30 35 11 24 %wio 0 0 0 0 0 %idle 57 58 51 83 67 physc 0.18 0.17 0.20 0.07 0.14 %entc 44.5 43.5 50.8 18.0 34.5 11 26 0 63 0.15 38.3 Average %usr 13 12 14 6 9 Used this way, the sar command provides the same type of high-level information that vmstat does, although it also lets you know the mode in which the system is running, which is helpful. In the example, we can see that our partition is an uncapped partition, which, when configured as such, lets the partition use more resources than its entitled capacity. In this default view, the fields themselves are the same as the vmstat fields, but us becomes usr, sy becomes sys, id becomes idle, io becomes wio, pc becomes physc, and ec becomes entc. A more effective way to run sar is to view all processors by using the ALL flag: # sar -u -P ALL 2 5 AIX lpar30p682e_pub 3 5 00CED82E4C00 12/24/07 System configuration: lcpu=4 ent=0.40 mode=Uncapped 10:24:18 cpu 10:24:20 0 1 2 3 U 10:24:22 0 1 2 %usr 27 0 0 0 10 32 0 0 %sys 71 35 36 29 27 66 37 35 %wio 0 0 0 0 0 0 0 0 0 %idle 2 65 64 71 62 63 2 63 65 physc 0.15 0.00 0.00 0.00 0.25 0.15 0.15 0.00 0.00 %entc 37.5 0.5 0.0 0.0 61.8 38.2 37.2 0.6 0.0
30 Chapter 5: CPU: Monitoring 10:24:24 3 1 2 3 U 0 0 0 0 0 12 29 30 37 35 30 25 69 0 0 0 0 0 0 0 70 63 65 70 62 63 2 0.00 0.00 0.00 0.00 0.25 0.15 0.15 0.0 0.6 0.0 0.0 62.1 37.9 37.7 I prefer using vmstat to sar because vmstat provides a quick snapshot of all subsystems, not just CPU. Although you can use other flags to obtain additional subsystem information using sar, it just is not as efficient or simple. One advantage sar provides that vmstat does not is the ability to capture information and analyze data. This is done through the System Activity Data Collector (sadc), which is essentially a back end to sar. When enabled through cron (it is commented out on a typical default AIX partition), sadc collects data periodically in binary format. In the following example, we run it from the command line: # /usr/lib/sa/sadc 2 5 /tmp/sarinfo To view the results (remember it’s in binary format), we need to use the –f flag: # sar -f /tmp/sarinfo AIX lpar30p682e_pub 3 5 00CED82E4C00 12/24/07 System configuration: lcpu=4 ent=0.40 mode=Uncapped 10:41:42 10:41:44 10:41:46 10:41:48 10:41:50 Average %usr 0 0 0 0 0 %sys 1 1 1 1 1 %wio 0 0 0 0 0 %idle 99 98 99 99 99 physc 0.01 0.01 0.01 0.01 0.01 %entc 2.4 2.6 2.1 1.9 2.3
w (Unix-generic) 31 iostat (Unix-generic) iostat [-a][-l][-s][-t][-T][-z] [{-A [-P] [-q|Q]} | {-d|-D [-R]} ] [-m] [Drives] [Interval [Count]] The iostat command is another good first-impression type of tool, which is more commonly used for I/O information. When run with the –t flag, it provides only tty/cpu information. I also like to use the –T flag to obtain the timestamp: # iostat -tT 1 System configuration: lcpu=4 ent=0.40 tty: tin 0.0 0.0 0.0 0.0 0.0 tout 41.0 182.0 92.0 92.0 92.0 avg-cpu: % user % sys % idle % iowait physc % entc time 0.0 1.1 98.8 0.0 0.0 2.2 10:51:13 0.0 0.9 99.0 0.0 0.0 1.8 10:51:14 0.0 0.9 99.1 0.0 0.0 1.7 10:51:15 0.1 1.1 98.8 0.0 0.0 2.1 10:51:16 0.0 1.4 98.6 0.0 0.0 2.7 10:51:17 w (Unix-generic) /usr/bin/w64 [ -hlsuwX ] [ user ] The w command prints a summary of all current activity on the system. I like this command — always have and always will. Sometimes I run it even before vmstat. I appreciate the clear, concise way in which w provides important information, such as load average. You can tell a lot about your system from the load average. If my load average commonly varies between 2 and 5 but is 37 when I run this command, I’m about ready to say, “Houston we have a problem.” In the following case, we’re okay. # w 08:29AM up 1 day, User tty u0004773 pts/0 u0004773 pts/1 23:44, login@ 06:40AM 08:28AM 2 users, idle 0 0 load average: 1.00, 1.00, 1.01 JCPU 0 0 PCPU what 0 -ks 0 –ksh
32 Chapter 5: CPU: Monitoring lparstat (AIX-specific) lparstat { -i | [-H|-h] [Interval [Count]] } The purpose of the lparstat command is to report logical partition (LPAR) information statistics. This command also displays hypervisor statistical data about many POWER Hypervisor calls. Introduced in AIX 5.2, lparstat is commonly used to assist in shared-processor partitioned environments. In the following command output, you should recognize the entries up until entitled capacity (entc). # lparstat 2 5 System configuration: type=Shared mode=Uncapped smt=On lcpu=4 mem=3072 psize=16 ent=0.40 %user ----0.1 0.0 0.0 0.0 0.1 %sys ---1.4 1.4 1.3 1.5 1.1 %wait ----0.0 0.0 0.0 0.0 0.0 %idle physc %entc lbusy ----- ----- ----- -----98.5 0.01 2.6 0.0 98.6 0.01 2.6 0.0 98.7 0.01 2.4 0.0 98.5 0.01 2.8 1.2 98.8 0.01 2.1 0.0 vcsw phint ---- ----582 0 635 0 593 0 685 0 458 1 On shared partitions, lparstat provides the following information: ● ● ● lbusy — The percentage of logical processor utilization (executing at the user and system level) vcsw — The number of virtual context switches that are virtual processor hardware preemptions phint — The number of phantom interrupts (redirected to other partitions in the shared pool) An important flag worth a mention is the –h flag, which shows the POWER Hypervisor statistics:
mpstat (AIX-specific) 33 # lparstat -H 2 5 System configuration: type=Shared mode=Uncapped smt=On lcpu=4 mem=3072 psize=16 ent=0.40 Detailed information on Hypervisor Call Hypervisor Call remove read nclear_mod page_init clear_ref protect put_tce xirr Number of Calls 0 0 0 265 0 0 0 565 %Total Time Spent 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 %Hypervisor Time Spent 0.0 0.0 0.0 0.9 0.0 0.0 0.0 2.4 Avg Call Time(ns) Max Call Time(ns) 1 1 1 604 1 1 1 758 656 0 0 6593 0 0 0 1406 Hypervisor information includes: ● Number of calls — The number of Hypervisor calls ● %Total Time Spent — Percentage of total time spent on call ● ● ● %Hypervisor Time Spent — Percentage of Hypervisor time spent on call Avg Call Time — Average call time for this type of call; the percentage of logical processor utilization executing at the user and system level (in nanoseconds) Max Call Time — Maximum call time for this type of call (in nanoseconds) For partitions running AIX 5.2 or AIX 5.3, either in a dedicated environment or in shared and capped mode, the overall CPU utilization is based on the user, sys, wait, and idle values. In AIX 5.3 partitions running in uncapped mode, the utilization is based on the entitled capacity percentage. mpstat (AIX-specific) mpstat [ { -a | -d | -i | -s | -h } ] [ -w ] [ interval [ count ] ]
34 Chapter 5: CPU: Monitoring The mpstat command (part of the bos.acct fileset) was introduced in AIX 5.3. This tool displays overall performance numbers for all logical CPUs on your partitioned system. When you run the command, two sections of statistics are displayed. The first section shows system configuration information, which is displayed when the command starts and whenever a change in the system configuration occurs; the second section, which is displayed at user-specified intervals, shows utilization statistics: # mpstat 1 2 System configuration: lcpu=4 ent=0.4 mode=Uncapped cpu min maj mpc int cs ics rq mig lpa sysc us sy wa id pc %ec lcs 0 18 0 0 524 125 56 1 0 100 100 8 58 0 34 0.01 2.1 465 1 0 0 0 108 0 0 0 0 0 0 36 0 64 0.00 0.5 108 2 0 0 0 10 0 0 0 0 0 0 32 0 68 0.00 0.0 10 3 0 0 0 10 0 0 0 0 0 0 29 0 71 0.00 0.0 10 U - - - 0 97 0.39 97.3 ALL 18 0 0 652 125 56 1 0 100 100 0 1 0 98 0.01 2.7 593 ------------------------------------------------------------------------------0 1 2 3 U ALL 3 0 0 0 3 0 0 0 0 0 0 0 0 0 0 392 70 10 10 482 127 0 0 0 127 58 0 0 0 58 1 0 0 0 1 0 100 0 0 0 0 100 67 0 0 0 67 5 0 0 0 0 56 34 32 29 1 0 0 0 0 0 0 38 66 68 71 98 99 Information given includes: ● cpu — Logical CPU processor ID ● min — Minor page faults ● ma — Major page faults ● mpc — Total number of interprocessor calls ● int — Total number of interrupts ● cs — Total number of voluntary context switches ● ics — Total number of involuntary context switches 0.01 1.4 0.00 0.4 0.00 0.0 0.00 0.0 0.39 98.2 0.01 1.8 331 70 10 10 421
topas (AIX-specific) ● rq — Total run queues ● mig — Total number of thread migrations ● lpa — Logical processor affinity ● sysc — Total number of system calls ● us — CPU time spent on user activity ● sy — CPU time spent on system activity ● wa — CPU time spent waiting on I/O ● id — CPU time idle ● pc — Fraction of processor consumed ● %ec — Percentage of entitled capacity consumed ● lcs — Total number of logical context switches 35 The mpstat command is a very useful command because it reports collection information for each logical CPU on your partition in a format that is clearly illustrated. You can even view SMT utilization by specifying the –s flag: # mpstat -s 1 System configuration: lcpu=4 ent=0.4 mode=Uncapped Proc0 Proc1 1.01% 0.02% cpu0 cpu1 cpu2 cpu3 0.85% 0.16% 0.01% 0.01% -----------------------------------------------------------------Proc0 Proc1 0.74% 0.02% cpu0 cpu1 cpu2 cpu3 0.56% 0.18% 0.01% 0.01% topas (AIX-specific) IBM has improved the topas command (part of the bos.perf.tools fileset) substantially in AIX 5.3. Before these changes, topas did not have the
36 Chapter 5: CPU: Monitoring ability to capture historical data, nor was it enhanced for use in shared partitioned environments. (The command’s –L flag now reports partitioned information.) By incorporating these changes to let you collect performance data from multiple partitions, IBM has really simplified the capability of topas as a performance management and capacity planning tool. The command’s look and feel is quite similar to top and monitor (used in other Unix variants). The topas utility displays all kinds of information on your screen in a textbased, graphical type of format. In its default mode, it provides a myriad of CPU, memory, and I/O information. Some recent changes: ● ● As of TL_4 of AIX 5.3, topas uses a daemon named xmwlm, which is automatically started from the inittab. As of TL_5 of AIX 5.3, the system keeps seven days of data as a default and records almost all the topas data that is displayed interactively, except for process and Workload Manager (WLM) information. You can use the topasout command to generate text-based reports. By specifying the –C flag, you can actually view monitoring information across all partitions in an IBM POWER system. nmon My favorite of all performance monitoring tools is nmon, which until recently was not an “officially” supported IBM tool; if you were going to send data to IBM for analysis, this was not the tool you would use. nmon is almost the perfect AIX analysis tool (it’s also available now for Linux on POWER). The data it collects is available either from your screen or through reports, which you can run from cron. In the words of nmon’s creator, Nigel Griffiths, “Why use five or six tools when one free tool can give you everything you need?” What attracts most people to nmon is that not only does it have a very efficient front-end monitor, but it also provides the ability (unlike topas) to capture data to a text file for graphing reports because the output is in a .csv (spreadsheet) format. In fact, moments after running an nmon session, you can actually view the nicely rendered charts in a Microsoft Excel spreadsheet, which you can hand off to senior management or other techni-
Using nmon for Historical Analysis 37 cal teams for further analysis. Further, in contrast to topas, I’ve never seen any performance-type overhead with this utility. Using nmon for Historical Analysis First, we’ll tell nmon to create a file, name the run, and do data collection every 30 seconds for one hour (120 intervals): # ./nmon -f -t -r test3 -s 30 -c 120 AIX version 5.3.0.0 and starting up nmon nmon_aix5 When monitoring is completed, we’ll sort the file: # sort -A p682e_pub_071224_1411.nmon > lpar30p682e_pub_071224_411.csv Now, we can FTP the spreadsheet to a PC and open it up. Start the nmon analyzer, and click on Analyze nmon data. Enter the location of the file, wait about 20 seconds, and you’ll see your nmon data in all its glory! Figure 5.1 shows some sample output from the nmon analyzer. Figure 5.1: Sample nmon analyzer output The nmon analyzer is an awesome tool, written by Stephen Atkins, that graphically presents data (CPU, memory, network, or I/O) from an Excel
38 Chapter 5: CPU: Monitoring spreadsheet. Perhaps the only drawback that prevents it from being perceived as an enterprise type of tool is that it lacks the ability to gather statistics about large numbers of LPARs at once (although it now has a partition-viewing capability similar to that of topas). The analyzer is not a database, nor was it meant to be. That is where a tool such as Ganglia helps; this utility has actually received the blessing of Nigel Griffiths as the tool that can integrate nmon analysis. You can download the nmon analyzer for free from http://www.ibm. com/developerworks/aix/library/au-nmon_analyser. For more information about Ganglia, see http://ganglia.info. ps (Unix-generic) ps [-ANPaedfklmMZ] [-n namelist] [-F Format] [-o specifier[=header],...] [-p proclist][-G|-g grouplist] [-t termlist] [-U|-u userlist] [-c classlist] [ -T pid] [ -L pidlist] ps [aceglnsuvwxU] [t tty] [processnumber] The ps command shows the current status of processes. Upon viewing the syntaxes shown above, the first question you may have is, why the two sets of usage parameters? To make a long story short, the answer has to do with the basic history of Unix — the old Berkeley versus System V (now referred to as X/Open Standards) wars. As we discussed in Chapter 2, AIX is a hybrid of sorts, and it contains both flavors of Unix. Most of you are probably more familiar with the X/Open Standards usage of ps (e.g., ps –ef), which is the first usage shown above. How can you best use ps in CPU systems monitoring? In other words, how can you identify processes that are taking an inordinate amount of CPU time? If you can find these processes, you can take action on them. I like using the Berkeley syntax better here; the information it provides is in a nicer, more presentable format. Let’s look at ps ux, which displays the CPU execution time of processes: # ps ux | more USER root PID %CPU %MEM 8196 0.1 0.0 SZ 384 RSS 384 TTY STAT A STIME 08:45:25 TIME COMMAND 1:02 wait
tprof root root root root root root root root root root root 53274 86118 299158 69666 0 57372 61470 286880 258190 151642 233606 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 384 384 0.0 504 512 0.0 472 500 0.0 960 960 0.0 384 384 0.0 384 384 0.0 384 384 0.0 900 928 0.0 1216 1216 0.0 512 512 0.0 840 956 - A A A A A A A A A A A 08:45:25 08:45:27 08:45:44 08:45:25 08:45:25 08:45:25 08:45:25 08:45:44 08:45:35 08:45:27 08:45:44 0:30 0:08 0:06 0:04 0:04 0:02 0:02 0:01 0:01 0:01 0:00 39 wait /usr/sbin/syncd /usr/sbin/getty gi swappe wait wait /usr/bin/xmwlmrpc.lock rtcmd /usr/sbin/sshd This ps command uses two key parameters: ● ● — Displays user-oriented output about each process: the USER (user), PID (process ID), %CPU (CPU time used), %MEM (memory used), SZ (size of process core image), RSS (resident set size), TTY (controlling terminal name), STAT (process state), STIME (start time), TIME (total run time), and COMMAND (executed command) fields. u x — Displays processes without a controlling terminal in addition to processes with a controlling terminal. To see processes that don’t include daemons, substitute a for x. For our purposes, the most important field of the ps output is %CPU. This field reports the percentage of CPU time that the process has used since it started. Tracing Tools Tracing tools come in handy when you want to drill down further to analyze processes that are causing bottlenecks. Among these tools are curt, splat, tprof, trace, and trcrpt. We’ll use the tprof and trace tools here. tprof tprof [ -c ] [ -C { all | cpuidslist } ] [ -d ] [ -D ] [ -e ] { [ -E { ALIGNMENT | EMULATION | ISLBMISS | DSLBMISS | PM_<event> } ] [ -f interval ] } [ -F ] [ -j ] [ -J profilehook ] [ -k ] [ -l ]
40 Chapter 5: CPU: Monitoring [ -L objectslist ] [ -m objectslist ] [ -M sourcepathlist ] [ -p processlist ] [ -P { all | pidslist } ] [ -s ] [ -S searchpathlist ] [ -t ] [ -T buffersize ] [ -u ] [ -v ] [ -V verbosefilename ] [ -I ] [ -N ] { [-z] [-Z] | -R } { { -r rootstring } [ -X { xmloptions } ] | { { [ -A { all | cpuidslist } ] [-n] } [ -r rootstring ] -x command } } The tprof command reports CPU usage for both individual programs and the system as a whole. The output provides an estimate of the amount of CPU time spent for each process that was executing while tprof was running. It also contains an estimate of the amount of CPU time spent in each of the kernel address spaces: the kernel address space, the user address space, and shared library address spaces. You can use tprof to view a basic global program and thread-level summary by running the command in the following fashion: # tprof -x sleep 20 Mon Dec 24 18:55:54 2 System: AIX 5.3 Node: lpar30p682e_pub Machine: 00CED82E4C0 Starting Command sleep 2 stopping trace collection. Generating sleep.prof root@lpar30p682e_pub[/] Let’s view the file (sleep.prof) that we just created: # more sleep.prof Configuration information ========================= System: AIX 5.3 Node: lpar30p682e_pub Machine: 00CED82E4C00 Next, let’s use the trace command to run a manual trace:
time 41 /usr/bin/trace -ad -M -L 109113753 -T 500000 -j 000,00A,001,002,003,38F,005,006,134,139,5A2,5A5,465,234, -o Total Samples = 1088 Traced Time = 20.02s (out of a total execution time of 20.02s) <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Process Freq Total Kernel User Shared Other ======= ==== ===== ====== ==== ====== ===== wait 4 99.82 99.82 0.00 0.00 0.00 swapper 1 0.09 0.09 0.00 0.00 0.00 /usr/bin/tprof 1 0.09 0.00 0.00 0.09 0.00 Total 6 100.00 99.91 0.00 0.09 0.00 Process PID TID Total Kernel User Shared Other ======= === === ===== ====== ==== ====== ===== wait 8196 8197 44.58 44.58 0.00 0.00 0.00 swapper 0 3 0.09 0.09 0.00 0.00 0.00 /usr/bin/tprof 418000 688307 0.09 0.00 0.00 0.09 0.00 ======= Total === === ===== ====== 100.00 99.91 ==== ====== 0.00 0.09 ===== 0.00 The tprof command is an excellent tool for identifying runaway processes because these processes appear at the top of the output list. Timing Tools Two tools, time and timex, provide access to information about command execution time. time time [ -p ] Command [ Argument ... ] The time command returns the total execution time of your program, including real time, user time, and system time. This information can be useful when you’re trying to figure out the amount of time it takes for commands to execute. time works by counting the CPU ticks from the time the command was first started until the time it ends: # time find ./ -depth 1>/dev/null real user sys 0m23.30s 0m0.22s 0m2.10s
42 Chapter 5: CPU: Monitoring timex timex [ -s ][ -o ][ -p [ -fhkmrt ] ] cmd Without any flags, the timex command provides the same type of information as time, but with a prettier view. Used with the –s flag, it summarizes all system activity while the command is being executed. This spares you the task of starting up a sar or vmstat process while running a timing. For this reason alone, I like to use timex, and I’ve found it a very useful tool through the years. # timex -s find ./ -depth 1>/dev/null real 21.69 user 0.20 sys 2 AIX lpar30p682e_pub 3 5 00CED82E4C00 12/26/07 System configuration: lcpu=4 ent=0.40 mode=Uncapped 08:40:08 %usr %sys %wio %idle physc %entc 08:40:30 5 33 0 62 0.17 43.2 System configuration: lcpu=4 ent=0.40 mode=Uncapped 08:40:08 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s 08:40:30 0 0 0 0 0 0 0 0 System configuration: lcpu=4 mem=3072MB ent=0.40 mode=Uncapped 08:40:08 slots cycle/s fault/s odio/s 08:40:30 392358 0.00 18.11 0.00 System configuration: lcpu=4 ent=0.40 mode=Uncapped 08:40:08 rawch/s canch/s outch/s rcvin/s xmtin/s mdmin/s 08:40:30 0 0 0 0 0 0 System configuration: lcpu=4 ent=0.40 mode=Uncapped 08:40:08 scall/s sread/s swrit/s fork/s exec/s rchar/s wchar/s 08:40:30 19659 8 5522 0.14 0.18 12407 308149 System configuration: lcpu=4 ent=0.40 mode=Uncapped 08:40:08 cswch/s 08:40:30 5617 System configuration: lcpu=4 ent=0.40 mode=Uncapped 08:40:08 iget/s lookuppn/s dirblk/s 08:40:30 0 8513 0
timex System configuration: lcpu=4 ent=0.40 mode=Uncapped 08:40:08 runq-sz %runocc swpq-sz %swpocc 08:40:30 1.3 95 System configuration: 08:40:08 proc-sz 08:40:30 68/262144 mode=Uncapped inod-sz file-sz 0/170 387/1124 thrd-sz 219/524288 System configuration: lcpu=4 ent=0.40 mode=Uncapped 08:40:08 msg/s sema/s 08:40:30 0.00 0.00 43

C h a p t e r 6 CPU: Tuning This chapter identifies the AIX tools you’ll use to help resolve CPU system bottlenecks and improve performance. Notice that I didn’t use the word “tune” in the preceding sentence. You’ll find that improving CPU performance is less about tuning and more about improving workload utilization and managing processes and threads more efficiently. Process and Thread Management A junior administrator might consider process management as little more than monitoring active processes and killing zombie and/or runaway processes. In reality, there is a lot more to it than that. Let’s start by addressing a fundamental question: how do processes relate to threads? The answer is simple. While the process is the entity that AIX uses to control the use of system resources, the threads control the actual time consumption because each kernel thread is a single sequential flow of control. Each process is made up of one or more threads. Controlling thread use is where you can make a difference. To do this, you need to understand the tools that let you work with threads to improve CPU performance. Although AIX Version 4 introduced the use of threads to control processor time consumption, it was in AIX 5L that system management tools really evolved to help monitor and analyze thread usage.
46 Chapter 6: CPU: Tuning nice nice [-n Increment] Command [Argument...] nice [-Increment] Command [Argument...] The nice command lets you adjust the priority of a given process. The default value for processes is 20, except for korn shell (ksh) background processes, which are set to 24. With nice, the larger the increment number you specify, the lower the priority. You can use the ps command with the –l (lowercase L) flag to view your information. The NI column shows the nice value for each process: nice # ps -l | more F 240001 200005 200001 S UID PID PPID A 20004773 90156 164038 A 0 376960 90156 A 0 409730 376960 C PRI 0 60 2 20 0 60 NI ADDR SZ WCHAN TTY TIME CMD 20 30448400 724 pts/0 0:00 ksh 60 45c400 736 pts/0 0:00 ksh 20 446400 724 pts/0 0:00 ps Let’s start a new ksh with nice: # nice -10 ksh # ps -l | more F 240001 200005 200001 S UID PID PPID A 20004773 90156 164038 A 0 311534 376960 A 0 376960 90156 C PRI NI ADDR SZ WCHAN TTY TIME CMD 0 60 20 30448400 724 pts/0 0:00 ksh 1 80 30 48376400 688 pts/0 0:00 ksh 0 60 20 6045c400 736 pts/0 0:00 ksh The preceding output shows that the priority of the new process (PID 311534) has been added and changed from its default. The child process that was forked from the process is also shown. Watch the nice syntax — it can be a little confusing. The minus sign (–) identifies the increment value, which is assumed to be positive. To specify a negative increment, you must use two minus signs, with no spaces in between. When you use the renice command (covered next), the parameter following the command name is the value, whether it is positive or negative.
renice 47 renice renice [ -n Increment ] [ -g|-p|-u ] ID . . . The renice command dynamically reassigns a priority to a running process. Using renice can cause the system to assign either a higher or a lower priority to a given process. When you use renice, you actually change the value of the priority of a thread (default value of 40) by changing the nice value of its process. Assume that the following processes are currently running: # ps -l F 240001 200005 200001 200001 S UID PID PPID A 20004773 90156 164038 A 0 311534 376960 A 0 376960 90156 A 0 417842 311534 C PRI NI ADDR SZ WCHAN TTY TIME CMD 0 60 20 30448400 724 pts/0 0:00 ksh 0 80 30 48376400 688 pts/0 0:00 ksh 0 60 20 6045c400 736 pts/0 0:00 ksh 3 81 30 30468400 732 pts/0 0:00 ps Let’s increase the priority of a thread by changing the nice value for the process that contains it: # renice -10 376960 # ps -l F 240001 200005 200001 200001 S UID PID PPID A 20004773 90156 164038 A 0 311534 376960 A 0 376960 90156 A 0 417842 311534 C PRI NI ADDR 0 60 20 30448400 0 80 30 48376400 0 50 10 6045c400 3 81 30 30468400 SZ WCHAN TTY TIME CMD 724 pts/0 0:00 ksh 688 pts/0 0:00 ksh 736 pts/0 0:00 ksh 732 pts/0 0:00 ps It’s important to note that when not run as root, renice has some limitations. Without the protection of root, only processes by the current user ID can be changed. In addition, you cannot increase nice values after making a prior one less favorable.
48 Chapter 6: CPU: Tuning ps In the preceding chapter, we looked at the ps command and how you can use it to monitor CPUs. You’ll find that ps is one of the most versatile commands in Unix. Specifying it with the –mo flag gives you a granular look at your threads: # ps -mo THREAD USER u0004773 PID PPID TID ST CP PRI SC WCHAN F TT BND COMMAND - A 0 60 1 - 240001 pts/0 - 933995 S 0 60 1 - 10400 - root 311534 376960 - A 0 80 1 - 200005 pts/0 - 90156 164038 - - 823311 S 0 80 1 - 10400 - root 376960 90156 - A 0 50 1 - 200001 pts/0 - 835591 S 0 50 1 - 10400 - root 409778 311534 - A 3 81 1 - 200001 pts/0 880775 R 3 81 1 - 400010 - - - - - -ksh - - ksh - - -ksh - - ps -mo THREAD - - The TID column lists the thread ID, while the BND column shows the processes and threads bound to a processor. Why do you need to know this information? Because you can actually change the priority of threads, globally. To do so, you modify the CPU scheduling parameters (using the schedo command) that calculate the priority for each thread. schedo schedo -h [tunable] | {-L [tunable]} | {-x [tunable]} schedo [-p|-r] (-a | {-o tunable} schedo [-p|-r] (-D | ({-d tunable} {-o tunable=value}) The schedo command manages the CPU scheduler tunable parameters; it can be used only with root. Similar to other tunable commands (e.g., vmo), schedo can make immediate changes or can defer the changes until the next reboot, depending on the flags you use. Use of the –p flag causes the changes to take effect at the next reboot. First, let’s display the existing scheduling parameters by using schedo with the –L flag:
schedo 49 # schedo -L NAME CUR DEF BOOT MIN MAX %usDelta 100 100 100 0 100 UNIT TYPE DEPENDENCIES affinity_lim 7 7 7 0 100 dispatches D allowMCMmigrate 0 0 0 0 1 boolean D big_tick_size 1 1 1 1 100 10 ms D ded_cpu_donate_thresh 80 80 80 0 100 % busy fixed_pri_global 1 0 0 0 1 boolean force_grq 0 0 0 0 1 boolean hotlocks_enable 0 0 0 0 1 boolean idle_migration_barrier 4 4 4 0 100 sixteenth D krlock_confer2self 1 1 1 0 1 boolean D krlock_conferb4alloc 0 0 0 0 1 boolean D krlock_enable 1 1 1 0 1 boolean krlock_spinb4alloc 1 1 1 1 2G-1 krlock_spinb4confer 1K 1K 1K 0 2G-1 maxspin 16K 16K 16K 1 4G-1 n_idle_loop_vlopri 100 100 100 0 976K D pacefork 10 10 10 10 2G-1 clock ticks D sched_D 16 16 16 0 32 sched_R 16 16 16 0 32 search_globalrq_mload 256 256 256 0 4095M search_smtrunq_mload 256 256 256 0 4095M setnewrq_sidle_mload 384 384 384 0 4095M shed_primrunq_mload 64 64 64 0 4095M sidle_S1runq_mload 64 64 64 0 4095M 134 134 134 0 4095M 134 134 0 4095M 4095M 4095M 4095M 0 4095M D D D spins D D D sidle_S2runq_mlo sidle_S2runq_mload D sidle_S1runq_mload sidle_S3runq_mloa sidle_S3runq_mload 134 sidle_S2runq_mloa sidle_S4runq_mload sidle_S4runq_mload D sidle_S3runq_mload slock_spinb4confer 1K 1K 1K 0 2G-1 smt_snooze_delay 0 0 0 -1 97656K microsecs D smtrunq_load_diff 2 2 2 1 4095M D tb_balance_S0 0 0 0 0 2 ticks D
50 Chapter 6: CPU: Tuning tb_balance_S1 2 2 2 0 2 ticks tb_threshold 100 100 100 10 1000 ticks D timeslice 1 1 1 0 2G-1 clock ticks D unboost_inflih 1 1 1 0 1 boolean D v_exempt_secs 2 2 2 0 2G-1 seconds D v_min_process 2 2 2 0 2G-1 processes D v_repage_hi 0 0 0 0 2G-1 v_repage_proc 4 4 4 0 2G-1 v_sec_wait 1 1 1 0 2G-1 seconds vpm_xvcpus 0 0 0 -1 2G-1 processors D ----------------------------------------------------------------------------n//a means parameter not supported by the current platform or kernel Parameter types: S = Static: cannot be changed D = Dynamic: can be freely change B = Bosboot: can only be changed using bosboot and reboot R = Reboot: can only be changed during reboot C = Connect: changes are only effective for future socket connection M = Mount: changes are only effective for future mountings I = Incremental: can only be incremented d = deprecated: deprecated and cannot be changed Value conventions: K = Kilo: 2^10 G = Giga: 2^30 P = Peta: 2^5 M = Mega: 2^20 T = Tera: 2^40 E = Exa: 2^60 You can also display these parameters using the –a flag, although the information given is far less meaningful. sched_R and sched_D The sched_R and sched_D scheduling parameters relate to process priority calculations. The scheduler’s priority calculations are based on sched_R and sched_D, values that are expressed in thirty-seconds (1/32). I won’t bore you here with the complex algorithms associated with these parameters. The net of it is that lowering sched_R has the effect of helping the scheduler distinguish between background processes and processes running as interactive foreground processes, thereby enabling it to assign a greater priority to foreground processes. The following example lowers sched_R from its default value of 16 to 5: # schedo -o sched_R=5 Setting sched_R to 5
timeslice 51 fixed_pri_global When a CPU is ready to dispatch a thread, the system checks the global run queue before any of the others. When the thread completes its running slice on the CPU, it gets put back on the queue, which helps maintain something called processor affinity. Processor affinity is defined as the probability of dispatching a thread to a processor that previously executed it. To improve overall thread performance, you can enable an environment variable called RT_GRQ, which is set to off by default. Turning on RT_GRQ automatically places the thread on the global run queue. All fixed priority threads will be placed on the run queue if you change the default from 0 to 1. Let’s use schedo to change the default value of fixed_pri_global: # schedo -o fixed_pri_global=1 # schedo -a | grep fixed_pri_global fixed_pri_global = 0 # schedo -o fixed_pri_global=1 Setting fixed_pri_global to 1 # schedo -a | grep fixed_pri_global fixed_pri_global = 1 The actual priority of the user processes varies over time, depending on the amount of overall CPU time that a process has used most recently. Please note that in some instances, this variable should be turned of because it can impact SMT performance. Make sure that you test this in your environment to determine what works best for your application. timeslice Perhaps the most important schedo parameter is timeslice. This setting represents the largest number of clock ticks that a thread can be in control of before facing the possibility of being replaced by another thread. In some cases, increasing the timeslice can improve system throughput by reducing context switching. Before changing the timeslice setting, make sure you run vmstat (or sar) enough to determine whether there really is a considerable amount of
52 Chapter 6: CPU: Tuning context switching going on. If there is, the overhead of dispatching threads is more costly than letting them run for longer slices. The following example increases the timeslice from 1 to 2: # schedo -p -o timeslice=2 Setting timeslice to 2 in nextboot file Setting timeslice to 2 In this case, we’ve also used the –p flag, which saves the parameter on a reboot. bindprocessor bindprocessor { -q|-u ProcessID|-s SmtSetID|-b BindId|ProcessID [ProcessorNum] } CPU binding lets processes run on a specific processor, a capability that relates to the processor affinity concept I defined earlier. Binding threads to specific processors has many purposes; for example, you might bind threads to a given processor to find the root cause of a hanging program. More commonly, the technique is used when you’re trying to spread around the wealth of a system — in a symmetric multiprocessing (SMP) box, for example. To display the available (logical) processors on your box, you would use the –q flag: # bindprocessor -q The available processors are: # CPU binding 0 1 2 Assuming that symmetric multithreading (SMT) is enabled (it is by default), each and every hardware thread of the physical processor is listed as a separate processor when you run the bindprocessor command. On POWER5 chips, two hardware threads exist on each processor. With shared processor logical partitions (LPARs), using this command binds to virtual CPUs, so you must be careful because problems can result for
smtctl 53 applications that are predisposed to run on a specific CPU. If you want to bind a process to a particular CPU, it’s as simple as running this command: # bindprocessor 12769 3 This example assigns process ID (PID) 12769 to logical CPU 3. smtctl smtctl [ -m off|on [ -w boot|now ] ] The smtctl command (introduced in AIX 5.3) displays SMT information. To determine whether SMT is enabled, you simply run the command without any flags: # smtctl This system is SMT capable SMT is currently enabled SMT boot mode is not set SMT threads are bound to the same virtual processor. proc0 has 2 SMT thread Bind processor 0 is bound with proc Bind processor 1 is bound with proc proc2 has 2 SMT threads. Bind processor 2 is bound with proc2 Bind processor 3 is bound with proc2 System performance usually increases about 30 percent when SMT is enabled, so you almost always want to activate this functionality. Processor affinity also occurs naturally. When a thread is running on a CPU and is interrupted, it usually is placed back on the same CPU because the processor’s cache might still have lines belonging to the thread. If the thread were to be dispatched to a different CPU, it might have to obtain information from RAM, which would slow processing time dramatically. You can also bind threads using subroutines, although I advise caution if you attempt to do so. This technique binds all kernel threads in a process to a processor, which has the effect of forcing these threads to be run on that specific processor until they are unbound.
54 Chapter 6: CPU: Tuning gprof /usr/ccs/bin/gprof [-b] [ -c [filename] ] [-e Name] [-E Name] [-f Name] [-g filename] [-i filename] [-p filename] [-F Name] [-PathName] [-s] [-x [filename]] [-z] [a.out [gmon.out ...]] The gprof command, used during programming, produces an execution profile of your compiled programs in C, Fortran, Pascal, or even Cobol. The command reports on flow control through all the subroutines of your program and tells you the amount of CPU time each subroutine consumed. This information is useful when you’re troubleshooting how processes use CPU resources. You can use gprof to profile your program and determine which functions are using the CPU. The profile data is taken from the call graph profile file (gmon.out by default). AIX 5.3 lets you assign a user-specified name to the profiling output files by setting special environment variables. Version 5.3 also provides additional profiling support for threads and new options that affect the type of profiling data collected.
Section II Summary, Tips, and Quiz Summary ● ● ● ● CPU monitoring tools you can use include iostat, lparstat, mpstat, nmon, sar, topas, vmstat, and w. Tracing tools include curt, splat, tprof, trace, and trcrp. The nice and renice commands are important utilities that can help you prioritize your processes and treads. is the command used to manage the CPU scheduler’s tunable parameters. schedo ● The smctl command is used for symmetric multithreading (SMT). ● ps ● is an extremely versatile command that can help you identify process hogs, thread utilization, and nice priorities. and mpstat are important performance tools you should use in a partitioned environment. lparstat Tips ● ● Identifying workload is of paramount importance to improving CPU utilization. Running jobs and processes during off-peak hours, using cron and/or other third-party types of scheduling tools (e.g., IBM’s Workload Manager, CA’s AutoSys) can make a big difference in performance. Users usually will assume that your systems bottleneck is with the CPU, but more often than not the problem is either memory- or I/O-related. Tune a subsystem only when you’re certain of the diagnosis.
56 Section II: Summary, Tips, and Quiz ● ● ● ● ● ● Before making any changes to production systems, make the changes to either your test or development environment first so you can analyze their effect. This advice is particularly important when using the schedo command for AIX CPU tuning. Once you’ve determined that you’re experiencing a CPU bottleneck, adding CPUs is always an option. With dynamic logical partitioning (DLPAR), this solution is much easier to accomplish than it used to be because you can just add or subtract CPUs dynamically. Tools such as the DLPAR toolset and Partition Load Manager (PLM) can automate the process, letting you add or subtract CPUs to or from your partition based on variables you’ve already identified. Uncapping partitions in a virtual environment can also alleviate CPU bottlenecks. Using nmon, ps, tprof, or any number of other tools, you might have identified processes that are hogging CPU time. If you question whether these processes are necessary, try contacting the process owners (if possible). You may find out that you can kill the processes. If you’re told you can do so, be sure to kill them using kill -1 and not kill -9. Also, be careful about zombie processes that can be created when you kill parent processes and leave their children alone. It may not sound proper, but make sure the children are dead, too; otherwise, you’re at risk for runaway and/or zombie processes. Starting with the POWER5, SMT is built into the POWER architecture. This capability provides two independent threads of instruction execution for each processor. Enabling SMT makes one processor appear as two processors on the partition. Always make sure SMT is enabled (by running the smtctl command), except where an ISV explicitly states that it is not recommended. SMT’s performance gain depends on many variables, each of which you should analyze carefully. SMT is bestsuited for multithreaded, I/O-intensive applications. It is not a good fit for numerically intensive workloads. Use tools such as nmon (and the nmon analyzer) or topas to store historical performance data for trending and analysis. Don’t wait to use these tools until you have a problem. You should be using them when you are first in production with your system. Other IBM utilities are available that don’t come standard with AIX and have a cost associated with them, including:
Multiple Choice ❍ 57 Performance Toolbox (PTX), which as of AIX 5.3 includes the procmon utility (used for process management) ❍ IBM Tivoli Monitoring System Edition for System p for AIX 5L V5 ❍ PM for System p, an IBM Global Technology Services offering Quiz Multiple Choice 1. Which iostat flag reports AIO information? a. –A b. –a c. –v d. –e 2. Which topas flag reports all partitioned information in your managed system? a. –c b. –L c. –j d. –p 3. Which tool is best used to monitor performance numbers for all logical CPUs on a partitioned system? a. iostat b. lparstatc c. mpstat d. lvms
58 Section II: Summary, Tips, and Quiz 4. Which ps flag reports thread information? a. –a b. –u c. –ef d. –mo 5. Which of the following is not an example of a trace tool? a. lprof b. tprof c. curt d. splat 6. timex –s reports the total execution time of the program as well as the a. Number of referenced inodes b. Percentage of blocked processes c. Number of threads d. Summary of systems activity 7. Given the following results of the vmstat command, are you experiencing a CPU bottleneck? # vmstat 2 System configuration: lcpu=4 mem=3072MB ent=0. kthr memory ----- -----------avm fre page ---------------------- faults cpu ------------ ---------------------- r b re pi po fr sr cy in pc ec 1 4 128826 641397 0 0 0 0 0 0 448 sy 87 138 cs us sy id wa 24 1 40 35 0.01 2.8 1 7 128826 641397 0 0 0 0 0 0 385 10 136 35 14 20 31 0.01 2.2 2 7 128826 641397 0 0 0 0 0 0 381 13 138 35 4 20 41 0.01 2.2 3 4 128826 641397 0 0 0 0 0 0 364 40 138 40 17 16 27 0.01 2.
Fill in the Blank(s) a. Yes b. No c. Maybe d. Not enough information True or False 8. With nice, the larger the number, the lower the priority. 9. Lowering the schedo parameter sched_R has the effect of giving a higher preference to foreground processes than to background processes. Fill in the Blank(s) 10. Define processor affinity: __________________________________________ 59

Section III Memory This section provides an overview of the AIX Virtual Memory Manager and other important memory-related concepts, including how to monitor and tune your virtual memory. We also discuss best practices for virtual memory monitoring, analysis, and tuning, given the various considerations that can impact performance.

C h a p t e r 7 Memory: Introduction What, exactly, is involved in memory performance tuning? As a systems administrator, you’re probably already familiar with the basics of memory, such as the differences between physical and virtual memory. What we’ll be discussing here is how the Virtual Memory Manager (VMM) works in AIX and how it relates to overall systems performance. We’ll also review some of the more important recent enhancements. Let me reiterate that regardless of which subsystem you want to tune, you should always think of the process as an ongoing one. Start monitoring your system as soon as you put it into production and have it running well, rather than when users are screaming about slow performance. Review Chapter 1 on tuning methodology. I’m not saying that you must follow that specific methodology, but without a plan, you won’t succeed in optimizing the performance of your environments. Further, be sure to make only one change at a time unless otherwise noted (as when changing related parameters, such as minperm% and maxperm%). In addition, capture and analyze data as quickly as possible after making a change to determine what difference, if any, the change has really made. Virtual Memory Manager AIX newbies are sometimes surprised to hear that the Virtual Memory Manager (VMM) services all memory requests from the system, not just virtual memory. When the system accesses random access memory (RAM), the VMM needs to allocate space, even when plenty of physical
64 Chapter 7: Memory: Introduction memory is left on the system. It implements a process of early allocation of paging space. Using this method, the VMM plays a vital role in helping manage real memory, not just virtual memory. In AIX, all virtual memory segments are partitioned into pages, with a default page size of 4K. Because virtual memory consists of real memory and paging space, allocated virtual memory segments can be either RAM or paging space (virtual memory stored on disk). This is an important concept to understand, so read that last paragraph at least twice. VMM also maintains what is referred to as a free list, which is defined as unallocated page frames. These are used to satisfy page faults. There are usually a very few unallocated pages (which you configure) that the VMM uses to free up space and reassign the page frames to. The VMM then selects the virtual memory pages (whose page frames are to be reassigned) using its page replacement algorithm. The paging algorithm determines which virtual memory pages currently in RAM ultimately have their page frames brought back to the free list. AIX uses all available memory, except that which is configured to be unallocated — the free list. To reiterate, the purpose of VMM is to manage the allocation of both RAM and virtual pages. VMM’s objectives are to help minimize both the response time of page faults and the use of virtual memory where it can. Given the choice between RAM and paging space, the preference is to use physical memory — if the RAM is available. VMM also classifies virtual memory segments into two distinct categories, which are critical for you to understand. This concept is the most important to grasp, and I’ll admit that when I first started working with AIX, it took me a while to fully understand the concept and the tuning recommendations (which we’ll discuss later) behind it. The two categories are working segments using computational memory and persistent segments using file memory. Simply put, without fully grasping these concepts, you won’t be able to tune your systems to their optimum capabilities.
Paging and Swapping 65 Computational Memory Computational memory is used while your processes are actually working on computing information. These working segments are temporary (transitory) and exist only up until the time a process terminates or the page is stolen. They have no real permanent disk storage location. When a process terminates, both the physical and paging spaces are released in many cases. When a large spike occurs in available pages, you can actually see this happening while monitoring your system. In the world of virtual memory, when free physical memory starts getting low, programs that have not been used recently are moved from RAM to paging space to help release physical memory for more real work. Remember, virtual memory consists of real and paging space; it is not just paging space. The most important point to remember about computational memory is that when the system pages, you do not want it to page out computational memory; your preference is file memory. File Memory File memory (unlike computational memory) uses persistent segments (not working segments), and it has a permanent storage location on the disk. Data files or executable programs are mapped to persistent segments rather than to working segments. The data files can relate to file systems, such as the Journaled File System (JFS), Enhanced Journaled File System (JFS2), or Network File System (NFS). These files remain in memory until the time when a file is unmounted, a page is stolen, or a file is unlinked. After a data file is copied into RAM, VMM controls when these pages are overwritten or used to store other data. Given the alternative, you would much rather have file memory paged to disk than computational memory. Paging and Swapping When a process references a page on disk, the page must be paged in, which could cause other pages to page out again. VMM is constantly working in the background, stealing frames that have not been recently referenced using the page replacement algorithm. It also helps detect thrashing, which can occur when memory is extremely low and pages are constantly being paged in and out to support processing. VMM actually
66 Chapter 7: Memory: Introduction has a memory load control algorithm, which can detect whether the system is thrashing and actually tries to remedy the situation. Unabashed thrashing can literally cause a system to come to a standstill, as the kernel becomes so concerned with making room for pages that it just can’t do anything productive. What about swapping? Although the terms are often used interchangeably, there is a subtle difference between paging and swapping. As we’ve discussed, with paging, only parts of the process are moved back and forth between disk and RAM. When swapping occurs, you are moving entire processes back and forth. For this to happen, AIX would need to suspend the entire process before moving it to paging space. It could then only continue to process when the process was swapped back into RAM at a later time. The difference that is not subtle is this: while paging is often okay, swapping is a very bad thing. VMM Tuning Evolution Before AIX 5L, you would have used the vmtune command to tune your VMM system. Although the vmo command came around in AIX 5.2, vmtune actually hung around until AIX 5.3. With AIX 5.3, vmtune is no more. Although most of the actual parameters are the same (and remain the same in AIX 6.1), there are some fundamental changes in the recommended tuning parameters. (AIX 5.3 also does away with the schedtune command, whose function is now performed by schedo.) One important change in AIX 5L relates to page frames. Starting with the POWER4 processor, AIX supported up to 16MB page sizes. The POWER5 chip supports four virtual memory page sizes: 4K, 64K, 16 MB, and 16 GB. With a simple vmo change here that reflects these sizes, you can actually tune the system to provide for large page usage, which can improve system performance substantially in very memory-intensive application. The recommendations for the minperm and maxperm settings have also changed substantially. Furthermore, starting with AIX 5.2, we no longer save our tunables in rc.tune but in /etc/tunables.
C h a p t e r 8 Memory: Monitoring As with CPU monitoring, the AIX systems administrator has a myriad tools at his or her disposal when tuning the Virtual Memory Manager (VMM). Some of the tools are Unix-generic, while others are AIX-specific. We’ll discuss these tools in the context of real performance issues and what you can do to address them. IBM enhanced the following tools in AIX 5.3 to allow more accurate statistics on shared partitions using APV: sar, topas, and vmstat. Suppose that while you’re surfing the Internet and enjoying your coffee, one of the DBAs knocks on the side of your cubicle (why Unix administrators never get an office, I’ll never know) and informs you that “We have a real memory problem.” Although your first reaction might be to dismiss the suggestion entirely (do you tell the DBA that the indexes need rebuilding?), the first thing I would do is ask why the person came to this conclusion. The more information you have at your disposal, the more effective you’ll be in your efforts to resolve the alleged bottleneck. More than likely, the DBA used a graphical tool such as nmon or topas that indicated that real memory was low. This is a common event. However, one of the biggest incorrect assumptions is concluding that you have a memory problem because real memory is low. On the contrary, we want real memory to be low — because that means we’ve sized the system properly. So, where do we begin to troubleshoot this issue? If you’ve read the CPU monitoring chapter, you’ll know that I like to start with vmstat.
68 Chapter 8: Memory: Monitoring vmstat (Unix-generic) vmstat [-fsviItlw] [[-p|-P] pagesize|ALL] [Drives] [Interval [Count]] In Chapter 6, we used vmstat to monitor CPU. In this chapter, we’ll look at how to use this command for virtual memory analysis, which was actually the intended purpose of the tool (remember the “vm” in vmstat). Here’s a summary of the relevant output fields: ● ● ● ● ● ● ● r — Average number of runnable kernel threads over a sampling interval, which you specify when running the command. “Runnable” includes threads that are ready but are waiting to run as well as those that are already running. I start to become concerned when this number is three or four times greater than the number of processors on the system. b — Average number of kernel threads placed in the VMM wait queue that are waiting on I/O. This is an extremely important field; if these numbers are higher than r (runnable processes), that is usually symptomatic of I/O problems. Watch this field very carefully! avm — Contrary to what most people think, this field does not report the average memory. It shows the number of active virtual pages — the sum of virtual and real memory pages (remember this concept). Each page is 4,096 bytes. fre — Size of the free list. It’s important to note that you shouldn’t concern yourself too much if these numbers look really low, because a large part of RAM is used as a cache for file system data. Applications people will always point out this field to you and say, “There is no more memory left on the system.” If no bottlenecks are occurring, it is just not a problem. re — Pager input/output list pi — Number of pages paged in from paging space. This field becomes populated when there are lots of processes starting up, which can occur during a CPU or memory bottleneck. po — Number of pages paged out to paging space. If the numbers are high here, paging is occurring, which can certainly signify a memory bottleneck.
69 vmstat (Unix-generic) ● fr — Number of pages freed (page replacement) ● sr —Number of pages scanned by the page replacement algorithm ● cy — Number of clock cycles executed by the page replacement algorithm ● in — Device interrupt ● sy — System calls ● cs — Kernel thread context switches The following output is a snapshot of a very well-behaved system. This system is easily handling the number of runnable processes; there are no blocked processes, no paging going on, nor any waiting on I/O. I love this system. # vmstat 2 5 System configuration: lcpu=4 mem=3072MB ent=0.40 kthr memory ----- page faults ------------- ---------------------avm fre cpu ---------in sy cs ---------------------- r b re pi po fr sr cy pc ec 3 0 173838 576044 0 0 0 0 0 0 365 87 144 us sy id wa 0 1 99 0 0.01 2.4 3 0 173837 576045 0 0 0 0 0 0 297 13 149 0 1 99 0 0.01 1.9 3 0 173838 576044 0 0 0 0 0 0 329 37 143 0 1 99 0 0.01 2.2 3 0 173838 576044 0 0 0 0 0 0 337 10 143 0 1 99 0 0.01 2.0 3 0 173838 576044 0 0 0 0 0 0 364 13 143 0 1 99 0 0.01 2.1 Here is a snapshot of the system that the DBA was looking at: # vmstat 2 5 System configuration: lcpu=4 mem=3072MB ent=0.40 kthr ----r b memory ---------avm fre page faults --------------------re pi po fr sr cy cpu -------------in sy cs -------------------us sy id wa pc ec 4 19 173838 123 0 9 92 104 208 417 365 12001 6004 12 4 30 54 .4 2.4 9 36 173837 567 0 7 45 53 109 229 297 17002 8124 21 9 12 58 .8 1.9 2 19 173838 191 0 22 127 140 287 567 329 18229 9374 41 23 .5 2.1 2 34
70 Chapter 8: Memory: Monitoring At first glance, it appears that there are memory problems. In this case, we’re looking at the free (fre) list because there is paging going on. If there were not, I wouldn’t even give the low numbers a second glance. Oftentimes, one bottleneck will be the cause of another. In this case, it appears that significant I/O problems are causing other bottlenecks to occur. There are many blocked processes (b); and the wait time (wa) in the CPU section is also extremely high. Preliminary analysis shows us that the system just cannot keep up with the workload. The CPU can’t work hard because of the I/O problems. The paging is occurring because of the excessive I/O, which appears to have also caused a memory bottleneck. Let’s change just a few numbers around. What do we see now? kthr memory ----- ---------avm fre page faults --------------------re pi po fr sr cy cpu -------------in sy cs -------------------- r b us sy id wa pc ec 4 2 173838 123 0 9 92 104 208 417 365 12001 6004 81 19 0 0 .4 2.4 9 1 173837 567 0 7 45 53 109 229 297 17002 8124 74 25 1 0 .8 1.9 9 1 173838 191 0 22 127 140 287 567 329 18229 9374 41 23 2 0 .5 2.1 Clearly, there are no I/O problems to speak of; the wait times and blocked processes are not there. The CPU obviously is running hot and heavy. Do we have a CPU bottleneck? Sure we do, because the CPU is running at almost 100 percent busy. But what is causing that bottleneck? Because excessive paging is going on and the numbers in fre list are low, I would guess that in this case a memory bottleneck is causing the CPU bottleneck, not the reverse. In fact, this snapshot could have been taken after we fixed the I/O bottleneck in the previous snapshot. Remember, fixing one bottleneck often causes others, but that’s okay; it’s just part of the circle of tuning. In any case, the system is at least processing data here, where in the prior example, it was just stuck in the mud. If we can tune the memory accordingly here, the CPU bottleneck may break through, or it may continue. In the latter event, we might have to throw more iron at the box or manage our workload more efficiently.
sar (Unix-generic) 71 Virtual Memory Summary You’ll be interested to know that AIX 5.3 introduced a new vmstat flag, –v, that summarizes overall virtual memory statistics: # vmstat -v 786432 748478 574790 5 110858 80.0 20.0 80.0 4.4 33524 0.0 0 4.4 80.0 33524 0 memory pages lruable page free page memory pool pinned pages maxpin percentag minperm percentage maxperm percentage numperm percentage file pages compressed percentage compressed percentage numclient percentage maxclient percentag client page remote pageouts schedule 0 pending disk I/Os blocked with no pbuf 0 paging space I/Os blocked with no psbuf 2484 filesystem I/Os blocked with no fsbbuf 0 client filesystem I/Os blocked with no fsbuf 0 external pager filesystem I/Os blocked with no fsbuf 0 Virtualized Partition Memory Page Faults 0.00 Time resolving virtualized partition memory page faults sar (Unix-generic) sar { -A [-M] | [-a][-b][-c][-d][-k][-m][-q][-r][-u][-v][-w][-y][-M] [-s hh[:mm[:ss]]] [-e hh[:mm[:ss][-P processor_id[,...] | ALL] -f file] [-i seconds] [o file] [interval [number]] [-X file] [-i seconds] [o file] [interval [number]] Let’s turn our attention now to the sar command and try using it to examine data that can impact virtual memory performance. In the following
72 Chapter 8: Memory: Monitoring view, we’ll use the –rm flag, which enables us to view paging statistic (-r) and semaphore information (–m). The output reports the following fields: ● cycle/s — Number of page replacement cycles per second ● fault/s — Number of page faults per second ● slots — Number of free pages on the paging spaces ● odio/s — Number of non-paging disk I/Os per second ● ● msg/s — Number of Interprocess Communication (IPC) message primitives sema/s — Number of IPC semaphore primitives # sar -rm 1 5 AIX lpar30p682e_pub 3 5 00CED82E4C00 12/30/07 System configuration: lcpu=4 mem=3072MB ent=0.40 mode=Uncapped 15:21:14 slots cycle/s fault/s msg/s sema/s odio/s 15:21:15 392354 0.00 0.00 0.00 44.00 0.00 15:21:16 392354 0.00 0.00 0.00 3.00 0.00 15:21:17 392354 0.00 0.00 0.00 0.00 0.00 15:21:18 392354 0.00 0.00 0.00 0.00 0.00 15:21:19 392354 0.00 0.00 0.00 0.00 0.00 Average Average 392354 0.00 0 0.00 9 0
73 ps (Unix-generic) The preceding example shows a lot of page faults per second, but not much else. We can also see that there are 392,354 4K pages available on the paging space, which comes out to about 1.5 GB of available paging space. We can validate this number by running the lsps command, which reports the same result: # lsps -a Page Space hd6 Physical Volume hdisk0 Volume Group rootvg Size 1536MB %Used Active 1 yes Auto yes Type lv lsps (AIX-specific) lsps {-s | [-c | -l] {-a | Psname | -t {lv|nfs} } } The lsps command provides the paging space statistics. This very important command should definitely be part of your repertoire. One additional view, besides the –a view illustrated above, is used: the –s flag. It’s important to note that the –a flag reports only paging space that is being used, while –s provides a summary of all paging space allocated, including early page space allocation. We’ll discuss the various page space allocation polices in Chapter 11, when we dig into page space tuning. We’ve looked at various snapshots and seen some semblance of memory problems. Where do we go from here? Let’s try to identify some memory hog processes. If you recall, we previously looked at using ps commands to identify CPU hogs. It just takes another flag to identify the memory hog. (As I noted earlier, ps is one versatile command.) ps (Unix-generic) ps [-ANPaedfklmMZ] [-n namelist] [-F Format] [-o specifier] [=header],... [-p proclist][-G|-g grouplist] [-t termlist] [-U|-u userlist] [-c classlist] [ -T pid] [-L pidlist] ps [aceglnsuvwxU] [t tty] [processnumber] For our purposes here, we’ll use ps gv, the second usage shown above, which is based on the Berkeley method:
74 Chapter 8: Memory: Monitoring # ps gv | head -n 1; ps gv | egrep -v “RSS” | sort +6b -7 -n -r PID TIME PGIN SIZE 319648 TTY STAT - A 0:00 119 6576 401612 - A 0:06 86 336046 - A 0:03 106552 - A 286880 188568 RSS LIM TSIZ TRS %CPU %MEM COMMAND 6784 xx 201 208 0.0 1.0 /usr/sbin/rsct/bin/IBM.ERr 2284 2664 xx 828 380 0.0 0.0 /usr/sbin/IBM.CSMAgentRM 374 1672 2160 xx 522 488 0.0 0.0 /usr/sbin/rmcd -a 0:00 0 2048 2048 xx 0.0 0.0 j2p - A 0:41 22 1956 32768 xx 60 0.0 0.0 /usr/bin - A 0:00 25 1772 xx 116 148 0.0 0.0 /usr/sb IBM.LPCommands -r 1920 0 33 Let me briefly identify what some of this information means: ● ● ● ● SIZE — The amount of paging space allocated for the process (text and data). RSS — The amount of RAM used for the text and data segments of the process (in kilobytes). Note that PID 286880 is using 32,768K. TRS — The amount of RAM used for the text segment of the process (in kilobytes). %MEM — The actual amount of memory used per total RAM. Watch for processes whose %MEM value is 40 to 70 percent. The ps command provides a lot of useful information, but I don’t usually start with it unless one of my trusted administrators has already diagnosed that a memory issue of some kind exists on the system. Although ps has helped us identify some of the processes, it’s really time to call in our cleanup hitter, svmon. svmon (AIX-specific) svmon [-G [-i Intvl [NumIntvl] ][-z] ] svmon [-P [pid1...pidn] [-r] [-u|-p|-g|-v] [-ns] [-wfc] [-q [s|m|L|S]] [-t Count] [ -i Intvl [NumIntvl] ] [-l] [-j] [-z] [-m] ] svmon [-S [sid1...sidn] [-r] [-u|-p|-g|-v] [-ns] [-wfc] [-q [s|m|L|S]] [-t Count] [ -i Intvl [NumIntvl] ] [-l] [-j] [-z] [-m] ] svmon [-D sid1...sidn [-b] [-q [s|m|L|S]] [-i Intvl [NumIntvl] ][-z]]
svmon (AIX-specific) 75 svmon [-F [fr1...frn] [-q [s|m|L|S]] [-i Intvl [NumIntvl] ][-z] ] svmon [-C cmd1...cmdn [-r] [-u|-p|-g|-v] [-ns] [-wfc] [-q [s|m|L|S]] [-t Count] [ -i Intvl [NumIntvl] ] [-d] [-l] [-j] [-z] [-m] ]svmon [-U [lognm1...lognmn] [-r] [-u|-p|-g|-v] [-ns] [-wfc] [-t Count] [ -i Intvl [NumIntvl] ] [-d] [-l] [-j] [-z] [-m] ] svmon [-W [class1...classn] [-e] [-r] [-u|-p|-g|-v] [-ns] [-wfc] [-q [s|m|L|S]] [-t Count] [ -i Intvl [NumIntvl] ] [-l] [-j] [-z] [-m] svmon [-T [tier1...tiern] [-a superclass] [-x] [-e] [-r] [-u|-p|-g|-v] [-ns] [-wfc] [-q [s|m|L|S]] [-t Count] [ -i Intvl [NumIntvl] ] [-l] [-z] [-m] From the usage alone, it’s clear how much you can do with the svmon utility. You use svmon specifically for VMM. It provides a potpourri of information about the current state of memory and really helps you drill down and determine which processes, users, programs, and segments consume the most virtual (real and paging) memory. The statistics themselves are based on 4K pages, including real, virtual, and paging space memory used. The –G flag gives you a global view of memory utilization on your host: # svmon -G memory pg space size 786432 393216 inuse 211735 863 free 574697 pin in use work 110862 174420 pers 0 0 clnt 0 3731 PoolSize - inuse 183863 1742 pgsp 863 0 PageSize s 4 KB m 64 KB pin 110862 virtual 17442 pin 97342 845 virtual 146548 174
76 Chapter 8: Memory: Monitoring Let’s look at the first part of the data: ● ● ● ● ● size — The size of real memory frames, or simply real memory (including any frames that may have been reduced by using the rmss command, which we discuss in the next chapter) inuse — The number of frames containing actual pages; pages in RAM in use by processes plus persistent pages that belonged to a terminated process and remain resident in RAM free — The number of pages on the free list pin — The number of pages pinned in physical memory (RAM), which cannot be paged out virtual — The number of pages allocated in the virtual space The next section of the output provides statistics about pin and inuse memory. The pin entry here specifies statistics about the subset of real memory containing pinned pages, while inuse provides statistics about the subset of all real memory in use. The information includes: ● work — Number of frames containing working segment pages ● pers — Number of frames containing persistent segment pages ● clnt — Number of frames containing client segment pages The third and final section provides individual statistics per page size (where alternative page sizes are available): ● PageSize — Page size ● PoolSize — Number of pages in pool ● inuse — Number of pages of this size that are used ● pgsp — Number of pages allocated to paging space ● pin — Number of pinned pages of this size ● virtual — Number of pages of this size that are allocated in the system virtual space
Memory Leak 77 To gain a better understanding of what is going on, you can correlate some of the svmon fields to vmstat. In this case, the svmon –free field matches up with the vmstat –fre, and the svmon –virtual matches the vmstat –avm. The net is that while svmon provides more overall information about memory, vmstat gives you more overall systems information. Let’s look at both: # svmon memory size inuse free pin virtual 786432 211735 574697 110862 174420 clnt pg space work pers pin 110862 0 0 in use 174420 0 3731 PageSize PoolSize inuse pgsp pin virtual s 4 KB - 183863 863 97342 14654 m 64 KB - 1742 0 845 1742 # vmstat System configuration: lcpu=4 mem=3072MB ent=0.4 kthr ----- memory page ------------- ----------------------avm fre faults cpu ----------- ---------------------- r b re pi po fr sr cy pc ec 1 0 174418 574699 0 0 0 0 0 0 439 333 159 1 2 97 0 0.02 4.0 1 0 0 0 0 0 0 0 452 0 2 98 0 0.01 2.7 174418 574699 in sy cs 20 146 us sy id wa In addition to the global view, you can create eight other types of reports using svmon: user, command, class, tier, process, segment, detailed segment, and frame. I won’t review each one here, but I strongly recommend that you check out all these views to see how each one can assist you. Memory Leak Let’s look at one more way to use svmon. Memory leaks can be a big problem on a system. A memory leak is any program or process that keeps on allocating more memory and does not release it. This situation can cause real memory to be used up extremely quickly and, in a worst-case
78 Chapter 8: Memory: Monitoring scenario, can even precipitate a system crash by causing the system to run out of paging space. I’m not ashamed to admit that this has happened to me. In fact, before I knew about svmon, I saw it happening before my eyes and couldn’t stop it because I wasn’t certain what was causing it! To identify the cause of memory leaks, you first need to identify the processes that are using up the most memory. Here is one way to do this: # svmon -uP -t 5 | grep -p Pi ----------------------------------------------------------------------------Inuse Pin Pgsp 286880 xmwlm Pid Command 21074 7802 0 Virtual 64-bit Mthrd 20859 N N 319648 IBM.ERrmd 20666 7815 0 20532 N Y 336046 rmcd 19919 7805 0 19276 N Y 413902 IBM.ServiceRM 19680 7818 0 19242 N Y 401612 IBM.CSMAgentR 19623 7816 0 19462 N Y 16MB N For this purpose, we’ve used the following flags with svmon: — Specifies that the displayed information be sorted in decreasing order, thereby displaying the top offender first ● –u ● –P — Displays process information ● –t — Indicates the number of processes to display After identifying the process you’re most concerned about (let’s assume it’s the top offending process), you can track it further to make sure that neither the working nor the kernel segments are increasing rapidly. You can use svmon similarly to vmstat, with a counter. To illustrate, we’ll set the counter to run for two intervals, with five-second iterations. The resulting output confirms that we are not having problems:
79 Memory Leak # svmon – P 286880 –i 5 2 -------------------------------------------------------------------------------Pid Command 286880 xmwlm Inuse Pin Pgsp 20769 7802 0 Virtual 64-bit Mthrd 20559 PageSize Inuse Pin Pgsp Virtual s 4 KB 13809 7802 0 1359 m 64 KB 435 0 0 4 Vsid Esid Type Description 0 PSize N Inuse 16MB N Pin Pgsp Virtual 0 work kernel segment s 11584 7799 0 11584 330ad d work shared library text m 435 0 0 435 6425d c work shared memory segment s 1480 0 0 14 541f1 2 work process private s 444 3 0 44 50250 - clnt /dev/hd4:921 s 194 0 - 6825e f work shared library data s 91 0 0 2426d 1 clnt code,/dev/hd2:152455 s 15 0 - 6025c - clnt /dev/hd2:41407 s 1 0 - 91 -------------------------------------------------------------------------------Pid Command 286880 xmwlm Inuse Pin Pgsp 20769 7802 0 Virtual 64-bit Mthrd 20559 PageSize Inuse Pin Pgsp Virtual s 4 KB 13809 7802 0 1359 m 64 KB 435 0 0 43 Vsid 16MB N PSize Inuse 0 work kernel segment s 11584 7799 0 330ad d work shared library text m 435 0 0 43 6425d c work shared memory segment s 1480 0 0 148 541f1 2 work process private s 444 3 0 444 50250 - clnt /dev/hd4:921 s 194 0 - - 6825e f work shared library data s 91 0 0 2426d 1 clnt code,/dev/hd2:152455 s 15 0 - 6025c - clnt /dev/hd2:41407 s 1 0 - 0 Esid Type Description N Pin Pgsp Virtual 11584 -

C h a p t e r 9 Memory: Tuning In this chapter, I identify and show you how to tune your virtual memory subsystem. In contrast to other subsystems, there is a lot you can do to improve performance from a virtual memory perspective. Before we get started, let me again state that, unless instructed otherwise, you should change only one parameter at a time. If you make multiple changes, you won’t know precisely what caused the impact on performance. This point is particularly relevant to virtual memory. vmo vmo -h [tunable] | {-L [tunable]} | {-x [tunable]} vmo [-p|-r] (-a | {-o tunable}) vmo [-p|-r] (-D | ({-d tunable} {-o tunable=value})) Let us assume that we’re running an Oracle online transaction processing (OLTP) application and we’ve determined from some vmstat output that the system is paging. We’ve also looked at nmon data, which helped us reach the same conclusion. What can we do to improve the situation? This is where the vmo command comes into play. You will probably use vmo more than any other tunable command because it is with virtual memory that you have the greatest ability to positively affect performance by changing parameters. The vmo command provides
82 Chapter 9: Memory: Tuning a staggering 61 tunables in AIX 5.3. (The situation changes a bit in AIX 6.1 with the introduction of restricted parameters, which permit changes but make it a little more difficult to get into trouble.) I won’t describe each vmo parameter here, but I will go through the key ones as we try to tune our memory subsystem. minperm, maxperm, maxclient, and lru_file_repage Perhaps the most important concepts that relate to tuning revolve around our prior discussions about working and persistent storage. We definitely want the Virtual Memory Manager (VMM) to favor working storage, meaning that we don’t want AIX to page working storage. What we really want is for the system to favor the caching that the database (Oracle in this case) uses. The way to do this is to set the vmo command’s maxperm parameter to a high enough value while also making certain that the lru_file_repage parameter is set correctly. Here’s a description of the involved parameters: ● ● — The point below which the page stealer algorithm will steal file or computational pages, regardless of repaging rates minperm% maxperm% — The point above which the page stealer will steal only file pages ● ● — The minimum percentage of RAM that can be used to cache client pages maxclient% — Setting this value to 0 (off) allows AIX to free only file cache memory (provided numperm is greater than minperm and VMM can steal enough memory to satisfy demand), virtually guaranteeing that working storage remains in memory lru_file_repage Background Arguably, the most important vmo settings are minperm% and maxperm%. Setting these parameters appropriately will ensure that your system is tuned to favor either computational memory or file memory. In most cases, you don’t want to page working segments, because doing so will cause your system to page unnecessarily and will decrease performance.
minperm, maxperm, maxclient, and lru_file_repage 83 First, some background and history. The way things used to work was actually much simpler. If the number of file pages specified in vmo parameter numperm% was greater than the actual number of pages (maxperm), the page replacement would steal only file pages. When the number of file pages fell below minperm, both file and computational pages could be stolen. If the number fell between the minimum and maximum values, the page replacement would steal only file pages — unless the number of file repages was greater than the number of computational pages. In other words, if your numperm was greater than maxperm, you would start to steal from persistent storage. Based on this methodology, the old approach to tuning minperm and maxperm was to set maxperm to a low number — much lower than the default value (20) — and set minperm to less than or equal to 10. This is how we normally would have tuned our database server. Don’t do this anymore! Starting with AIX 5.2 Maintenance Level 5 (ML5) and AIX 5.3 ML2, the rules have changed. A New Approach The new approach is to set maxperm to a very high value — higher than its default (80) — and to make sure lru_file_repage is set to 0. IBM introduced the lru_file_repage parameter in AIX 5.2 with ML4 and in AIX 5.3 with ML1. The lru_file_repage value indicates whether the VMM repage counts should be considered and what type of memory should be stolen. The default setting is 1 (it becomes 0 in AIX 6.1), so we need to change it to 0 to have the VMM steal file pages rather than computational pages. This technique solves the old problem of having to limit JFS2 file cache to guarantee memory for applications such as Oracle. Let’s not lose sight of the fact that the primary reason you need to tune lru_file_repage is because you want to protect the computational memory — that is, process memory, kernel memory, and shared memory, which includes Oracle’s System Global Area (SGA). Because Oracle uses its own cache, using AIX file caching for this purpose only causes confusion, so we want to stop it. In this scenario, if you were to reduce maxperm, you’d be making the mistake of stopping the application caching programs that are running. You’d also be permitting lrud, the kernel process responsible for stealing memory when required, to do more work than necessary.
84 Chapter 9: Memory: Tuning You should always be tracking your numperm, something you can do using nmon or topas or from the command line using vmstat (with the –v flag). If you leave the lru_file_repage default of 1, VMM will continue to use the computational and noncomputational repage counts (defined at the top) in determining whether to steal computational or file memory. Here are the recommendations for configuring the other parameters we’ve discussed: vmo -p -o minperm%=5 vmo –p –o maxperm%=90 vmo –p –o maxclient%=90 In AIX 6.1, IBM has changed the default parameter values to reflect these common defaults, so you’ll have less to do in that release. You should also leave strict_maxperm and strict_maxclient at their default numbers. We used to change these settings, but we don’t need to anymore. Changing them to 1 (the old approach) places a hard limit on the amount of memory that can be used for persistent file cache. This is done by making the maxperm value the upper limit for the cache. These days, this step is unnecessary because changing lru_file_repage is a far more effective way of tuning because we prefer that AIX file caching not be used at all. minfree and maxfree Two other important vmo parameters worth noting here are minfree and maxfree. These values set the lower and upper limits of the free list, which keeps track of the real memory frames released: — Specifies the minimum number of frames on the free list, at which point the VMM will start to steal pages to replenish ● minfree ● maxfree — Specifies the number of frames on the free list at which page stealing is to stop If the number of pages on your free list falls below the minfree value, the VMM starts to steal pages (just to add to the free list), which is not good. It will continue to do this until the free list contains at least the number of pages specified in the maxfree parameter.
Page Space Allocation 85 While you want to keep your free list higher (because you don’t want your processes to be killed if the minfree value is reached, you want the VMM to always get the page frames it needs from the free list). I remember when the defaults used to be 120 and if I hadn’t raised the values, users would nag me, saying no memory was left on the system. You also don’t want the system to experience excessive I/O because it’s always stealing paging to expand the free list. The default values now depend on the physical memory of the system. maxfree equals the lesser of the number of memory pages divided by 128, or 128. These values are the sum of all memory pools. The maxfree value should also be greater than or equal to maxpgahead. Page Space Allocation AIX provides three different modes of paging space allocation: deferred page space allocation (DPSA), late page space allocation (LPSA), and early page space allocation (EPSA). The default policy is deferred page space allocation. DPSA works by delaying the allocation of paging space until the time when it is necessary to page out the page. This approach ensures that there is no wasted paging space, an important component of demand paging. In fact, when you have a large amount of RAM, you may actually never even use any of your paging space. Here is an example: # lsps -a Page Space Physical Volume Volume Group hd6 hdisk0 rootvg Size %Used Active 1536MB 1 Auto Type yes Only 1 percent of paging space is used here. Let’s view how AIX is presently handling paging space allocation: # vmo -a | grep defps defps = 1 The preceding output shows that the default method, DPSA, is being used. To disable this policy, you would set the defps parameter to 0. This value would cause the LPSA policy to be used. LPSA causes paging disk blocks
86 Chapter 9: Memory: Tuning not to be allocated until the corresponding pages in RAM are touched. This method is usually intended for environments where optimum performance is more important than reliability, because in this scenario it’s possible for a program to fail due to lack of memory. The EPSA policy is usually used when you want to make sure that processes won’t be killed because of low paging conditions. EPSA ensures this by preallocating paging space. This is the opposite end of the spectrum from LPSA. EPSA is used in environments where reliability rules. To turn on EPSA, you set the PSALLOC environment variable to early (PSALLOC=early). You should also be aware of the garbage collection feature introduced in AIX 5.3. Garbage collection frees up paging-space disk blocks, letting you configure less paging space than you ordinarily would need to. This feature is available only for the default DPSA policy. How Much Paging Space? How much paging space do you need on your system? What is the rule of thumb? To determine the answer, start with the folks who own your applications. For example, your DB2 or Oracle teams should be able to tell you how much paging space needs to be allocated on the system from a database perspective. If yours is a small shop, you’ll have to do the research on your own. Be careful, though. Database administrators usually like to request the highest number of everything and might instruct you to double the amount of paging space versus your RAM (an older rule of thumb). Generally speaking, if a system has less than 4 GB of RAM, I usually like to create a one-to-one ratio of paging space versus RAM. If it has 8 GB or higher, I set my paging space to as little as half the size of RAM. Monitor the system frequently after going live. If you see that you’re never really approaching 50 percent of paging space utilization, don’t add the space. A quick look at the recent Oracle for AIX documentation confirms this principle; it recommends that the initial setting for paging space be half the size of RAM plus 4 GB, with an upper limit of 32 GB. The documentation further suggests monitoring space with the lsps –a
Thrashing and Load Control 87 command and not worrying unless the utilization is more than 25 percent on the system. Adding space that you won’t use gives you absolutely nothing extra. I’m often asked how one can tell whether a process is using paging space Let’s go back to the svmon command for a moment. Here is how you do it. First, use the ps command to identify a process you want to view. Then, use svmon as follows: # svmon -P | grep -p 286880 --------------------------------------------------------------------------Pid Command Inuse Pin Pgsp Virtual 64-bit Mthrd 16MB 286880 xmwlm 21009 7802 0 20925 N N N Paging Space Tuning When your free list is really low and you’re paging incessantly, your system will start to release processes to avoid thrashing. It will even kill processes if sufficient paging space is not available. To prevent this from happening, you can tune these three vmo values: ● ● ● — This parameter specifies the number of free paging space pages at which AIX starts killing (SIGKILL) processes. npskill — This parameter specifies the number of free paging space pages at which AIX starts sending warnings (SIGDANGER) to processes. npswarn — Setting this parameter to 1 prevents processes owned by root from being killed when parameter npskill has started to take effect. nokillroot Thrashing and Load Control Thrashing is what occurs when memory resources are so overloaded that the system is in a state of utter exhaustion. To be more specific, the system is constantly paging in and out whole processes in a futile attempt to
88 Chapter 9: Memory: Tuning process data, which it can’t properly do because of the excessive paging operations. Using the CPU tuning command schedo, you can affect the criteria used to determine thrashing by tuning the VMM load control facility, which further helps protect an overloaded system from thrashing. More than anything, load control is meant to help straighten out infrequent spikes in load. Let’s look at some schedo parameters you can adjust to specify the thresholds for the algorithm that controls memory load control: ● ● ● ● ● — Defines the period of time (in elapsed seconds) that a reactivated suspended process is exempt from suspension. v_exempt_secs v_min_process — Defines the number of active processes that can be run and waiting for page I/O. — Controls the threshold for memory over commitment. If this threshold is exceeded, load control will try to suspend processes. v_repage_hi v_repage_proc — Determines whether the process is eligible for suspension. This value is further used to set a threshold for the number of repages and the number of page faults that the process has accumulated in the past second. — Defines the number of intervals for which the po/fr fraction — the number of pages written to paging space in the last second (po) divided by the number of page steals occurring during that time (fr) — can remain below v_repage_hi before suspended processes are reactivated. v_sec_wait After tuning these values and playing around with some of these settings, you can always reset them to their defaults using the schedo –D command. Memory Scanning and lrubucket The vmo command’s lrubucket parameter indicates the number of memory frames per bucket. On systems with multiple memory pools, the parameter’s setting is per memory pool. Tuning this value can help you reduce scanning overhead on systems that have a large amount of memory.
rmss 89 This point has to do with how the page replacement algorithm works. The algorithm’s role is to scan and look for free frames — to be used for new pages or for page replacement. With larger systems, because there are so many frames to scan, memory is divvied into buckets of frames. The larger the bucket, the fewer the frames that must be scanned. The following example increases the bucket to 2 GB (you specify the value in 4K pages): # vmo -o lrubucket=524288 Setting lrubucket to 524288 rmss rmss [-s startmemsize] [-f finalmemsize] [-d deltamemsiz] [-n numiterations] [-o outputfile] command rmss -c memsize rmss -r rmss -p Before the advent of the POWER4’s Hypervisor gave folks access to dynamic logical partitioning (DLPAR) memory, the rmss command was the only tool you could use for capacity planning as it related to memory. It is still the only tool that lets you reduce available memory without either physically removing RAM from your box or performing a DLPAR operation to reduce RAM. Although rmss isn’t a performance-tuning tool in the strict sense of the word, it is an invaluable aid that you should use when sizing systems. Most administrators are just throwing RAM in the garbage because they choose not to care whether their systems require it — often for fear they’ll be blamed by application folks for not providing the excessive amount of memory requested. Using rmss, you can quickly subtract memory (and just as quickly add it) to determine how your application reacts.
90 Chapter 9: Memory: Tuning First, let’s see how much memory we have on the box: # lsattr -El mem0 goodsize 3072 Amount of usable physical memory in Mbytes False size 3072 Total amount of physical memory in Mbytes False Now, let’s use rmss to view the current memory size: # rmss –p simulated memory size is 3072 Mb. Let’s change it: # rmss -c 2048 Simulated memory size changed to 2048 Mb. The system still sees 3 GB of physical memory. When you’re ready, you can restore the real memory size: # rmss –r
Section III Summary, Tips, and Quiz Summary ● ● ● ● ● ● ● ● The Virtual Memory Manager (VMM) services all memory requests on the system, not just virtual memory. Working segments use computational memory, and persistent segments use file memory. When paging, you prefer that the system does not page out working/computational memory because this is the working storage for processes that are currently executing. File memory uses persistent storage and has a permanent location on the disk. Paging is a normal condition of AIX, due to its tight integration with the VMM and AIX’s implementation of demand paging. Data is constantly shuffled back and forth between paging space and RAM because the kernel loads only a few pages at a time into memory. The vmtune and schedtune commands are no more, replaced in AIX 5L by vmo and schedo and eliminated completely in AIX 5.3. Starting with AIX 5.2, tunables are saved in /etc/tunables. Before this release, they were saved in rc.tune. vmstat –v (–v is a new flag) provides a summary of all virtual memory statistics. ● Thrashing is a condition that occurs when virtual memory resources are overloaded and the free list is abnormally low. This condition can cause entire processes to be swapped out to disk and can even cause a system to crash, if the paging space fills up.
92 Section III: Summary, Tips, and Quiz ● ● Memory leaks occur when a process keeps on allocating more memory without releasing it. The svmon command can help find these leaks. Memory monitoring tools you should use include lsps, nmon, ps, sar, and vmstat. svmon, topas, ● ● is the primary tuning tool used to manage the virtual memory tunable parameters. You use schedo to tune the VMM load control facility, which helps protect an overloaded system from thrashing. vmo AIX provides three different modes of paging space allocation: deferred page space allocation (DPSA), late page space allocation (LPSA), and early page space allocation (EPSA). The default policy is DPSA. Tips ● ● ● ● With systems having so much more memory than back in the day, the ratios for paging space recommendations are much lower than ever. Just because your DBA tells you he or she needs a 1:1 (or greater) ratio of physical to paging space doesn’t mean you have to provide it. Even Oracle now recommends that the initial paging space setting be half the size of RAM plus 4 GB with an upper limit of 32 GB. It is much easier to add paging space than delete it, and it’s easy enough to determine whether your system uses a lot of paging space. If you do your job properly, you won’t have to over-architect your paging space. Having said that, you should always check with your ISV for recommendations before deploying your paging strategy. Remember the new minperm% and maxperm% tuning recommendations (starting with AIX 5.3 ML2) to favor computational memory over working persistent storage. IBM kernel engineers came out with these new recommendations for a reason. Don’t forget to also set lru_file_repage to 0; otherwise, you’ll defeat the purpose of the new recommendations, and your system will be slower, not faster! If you want to save your tuning changes on a reboot, make sure you save them to /etc/tunables (there is no more rc.tune). The –p flag on the vmo command will take care of this.
Tips ● ● ● ● ● ● ● 93 You can tune an extraordinary number of parameters with vmo, more than for any other subsystem. However, “don’t touch that dial” unless you fully understand what the parameters mean. And when you do tune parameters, test your changes in a staging or development environment before rolling them out in production, and remember to implement only one change at a time. Learn the svmon command. Most system administrators are stubborn mules and will use only the same tools they’ve been using for decades. svmon is easily the best memory analysis tool out there today; take the time to learn how to use it. Don’t wait for your system to start thrashing before you look at the free list using vmstat or other utilities. A thrashing system can lead to a crashing system and is one of the worst things that can happen to you as a systems administrator. Tuning the lrubucket parameter can help you reduce scanning overhead on systems that have a large amount of memory. In most cases, you’ll do fine if you at least double the default. The rmss command can help with memory capacity planning. It lets you temporarily reduce the amount of RAM without having to either physically reduce memory or run a DLPAR operation. Just as identifying workload is of paramount importance to improving CPU utilization, it can also be important when managing batch jobs that may accumulate a lot of virtual memory. Don’t be afraid to use your 24hour day, particularly if you see excessive paging. Similar to CPUs, adding RAM is always an option if you’ve determined that you’re experiencing a memory bottleneck. All it takes is a simple DLPAR operation. The task is much easier than it used to be; you can just add or subtract RAM dynamically. Tools such as the DLPAR toolset and the Partition Load Manager (PLM) can automate this process. One caveat here: As of AIX 6.1, PLM is no more. If you’re looking at using uncapped partitions to do this, sorry — that solution works only with CPUs, not RAM.
94 Section III: Summary, Tips, and Quiz Quiz Multiple Choice 1. Which vmstat flag reports summary information? a. –c b. –t c. –a d. –v 2. Which sar flag reports paging statistics? a. –a b. –g c. –c d. –d 3. What command summarizes the amount of paging space on your system? a. lsps -a b. svmon c. vmstat d. stat 4. Which ps flag reports memory information? a. gv b. ux c. –e d. –m
Multiple Choice 95 Use the following output to answer Questions 5 through 7: # vmstat 2 5 System configuration: lcpu=4 mem=3072MB ent=0.40 kthr ----r b 4 3 9 6 9 3 memory page faults ---------- ------------------------ ------------avm fre re pi po fr sr cy in sy cs 173838 123 0 9 92 104 208 417 365 12001 6004 173837 567 0 7 45 59 109 229 297 17002 8124 173838 191 0 22 127 187 287 567 329 18229 9374 cpu -----------------us sy id wa pc ec 69 4 20 7 .8 2.4 51 9 12 28 1 2.9 71 23 2 4 .5 2.1 5. Given the preceding information, are you experiencing a RAM bottleneck? a. Yes b. No c. Maybe d. Not enough information 6. Which of the following would be an acceptable next action after you come up with your analysis? a. Notify the DBA team. b. Do a vmstat –v to look at a summary of your memory and paging statistics. c. Tune the vmo command’s minperm and maxperm setting. d. Run a trace. 7. It is not unusual to see multiple bottlenecks on a system. Does it appear that you are having either a CPU or an I/O problem? a. Yes b. No c. Maybe d. Not enough information
96 Section III: Summary, Tips, and Quiz True or False 8. Computational memory is made up of working segments and is transitory. 9. The Virtual Memory Manager (VMM) manages all memory requests, including physical RAM, not just virtual memory. Fill in the Blank(s) 10. Define a memory leak: _________________________________________________
Section IV Disk I/O This section gives you an overview of disk management on AIX, including how to monitor and tune your disk I/O subsystem. We also discuss best practices for disk placement, file system management, optimum hardware configuration, and concepts such as direct and concurrent I/O, and asynchronous I/O (AIO).

C h a p t e r 10 Disk I/O: Introduction What, exactly, is involved in tuning your disk subsystem? Tuning disk is a little trickier that tuning your CPU or virtual memory subsystem. One important reason is because you can do more to optimize throughput during the initial configuration of your I/O devices than you can ever do with tuning. It’s simply much easier to move things around during the initial build-out of your environment than to re-architect production. Furthermore, understand that the slowest operation for running programs is the time spent on actually retrieving your data from disk. This activity involves the physical disk as well as its logical components, such as the Logical Volume Manager (LVM). All the tuning in the world will do little if you have a poorly architected subsystem. Let’s look at the I/O stack, which is depicted in Figure 10.1.
100 Chapter 10: Disk I/O: Introduction Application open( ) close( ) read( ) write( ) Async, sync, and other options for both open and R/W I/O to a file mount unmount I/O to blocks in a filesystem Filesystem Mount options affect the IO dio cio rbr rbrw rbw VMM LVM crfs chfs mkfs logform mount -a cio dio mknfsexp chnfs exp showmoung cfsadmin ioo vmo vmtune File I/O to filesystem cache mkvg extendvg mklv inportvg chvg importvg exportvg cplv mklvcopy mirrorvg migratepv varyonvg Device drivers sattr chdev mkdev rmdev Disk subsytem Software varies Block I/O to logical disk Block I/O to a physical disk Disk Disk I/O flows from top to bottom Physical Disk/Layout flows from bottom to top Figure 10.1: I/O stack The figure clearly shows the tight integration between physical components as they relate to both the logical disk and its application I/O. When you configure your disk, you should work from the ground up. Start with the physical system and then move to the device layers, logical volumes, file systems, files, and applications. The physical component is crucial. Configuring this component involves determining the amount of disk, type (speed), size, and throughput. One important challenge to note with storage technology is that although the storage capabilities of disk are increasing dramatically, disk rotational speed increases more slowly. Disk I/O is clearly the weakest link on a system: while RAM access takes about 540 CPU cycles, disk access can take 20 million CPU cycles. To reiterate, poor layout of your data affects I/O performance much more than any tunable I/O parameter. Returning to the I/O stack, you can clearly see the truth in this statement just by looking at where the tunables are on
Concurrent I/O 101 the stack. They are much closer to the top than disk placement and logical volumes. Direct I/O First introduced in AIX 4.3, direct I/O bypasses the Virtual Memory Manager (VMM), enabling the transfer of data directly to disk from the user’s buffer. Direct I/O is not for everyone, because although it is possible to improve performance using this technique, it is also possible to degrade performance if you turn on direct I/O where you shouldn’t. Implementing direct I/O can provide near raw logical volume performance while at the same time maintaining the flexibility and manageability of file systems. What are a good candidates for direct I/O? Applications that have files with poor cache utilization are one example. Another is applications that use synchronous writes, because these writes must go to disk. Direct I/O goes directly to disk, so CPU usage drops because the dual data copy (bypassing the cache) is dropped. What are not good candidates for direct I/O? Applications that have smaller requests with persistent segments (which translate into permanent locations). Concurrent I/O Introduced in AIX 5.2, concurrent I/O (CIO) is nearly identical to direct I/O, but one better. With direct I/O, inodes (data structures that are associated with files) are locked to prevent a condition in which multiple threads might try to change the contents of a file at the same time. CIO actually bypasses this inode lock, letting multiple threads read and write data concurrently to the same file. This capability is enabled due to the way in which JFS2 is implemented with a write-exclusive inode lock, which lets multiples users read the same file simultaneously. This design has the effect of increasing performance dramatically when multiple users read from the same data file. Direct I/O can cause major problems with databases that continuously read from the same file. Concurrent I/O solves this problem, making it the preferred method of running databases. You turn on CIO either by mounting
102 Chapter 10: Disk I/O: Introduction the file system or through open systems calls. It’s as simple as running the mount command with the cio option: # mount -o cio /u01 When you mount the file system using this method, all files in the file system will use CIO. Unlike direct I/O, you can use CIO only with JFS2. As with direct I/O, some environments won’t benefit from turning on CIO. For example, applications that could benefit from a file system read-ahead or high buffer cache might actually experience decreased performance. Test, test, test, and then test some more! Asynchronous I/O Asynchronous I/O (AIO) conceptually relates to whether applications are waiting for I/O to complete before processing additional data. In other words, AIO lets applications continue to process while I/O runs in the background. This approach improves performance because processing can occur simultaneously. An AIX 6.1 note: virtually everything AIO-related has changed with the implementation of AIX 6.1. For information about these changes, see Chapter 16. Logical Volumes and Disk Placement: Intra- and Inter-Policy Figure 10.2 depicts the relationship between the logical volumes and the physical disk.
Logical Volumes and Disk Placement: Intra- and Inter-Policy Application Layer Logical Layer Raw Logical Volume JFS/JFS2 Volume Group Logical Volume Manager 103 Logical Volume Logical Volume Logical Volume Device Driver Physical Volume Physical Volume Device Driver Physical Volume Device Driver Physical Layer Physical Disk Physical Disk Physical Array Figure 10.2: System layers The logical volume layer sits between the application and physical layers. In other words, the application layer correlates to the file system or raw logical volume. The physical layer consists of the actual disk. Logical Volume Manager is the AIX disk management system that maps data between logical and physical storage. LVM also lets data reside on multiple physical platters and be managed and analyzed using specialized LVM commands. LVM controls all the physical disk resources on your system while providing a logical view of the storage subsystem. Knowing that the logical layer sits directly between the application layer and the physical layer should help you understand why the logical layer is probably the most important of all the layers. Even your physical volumes themselves are part of the logical layer because the physical layer encompasses only the actual physical components.
104 Chapter 10: Disk I/O: Introduction What about the other elements that make up the preceding illustration? From the bottom up, each of the drives is named as a physical volume. Multiple physical volumes make up the volume group. The logical volumes are defined within the volume group, and LVM enables the data to be on multiple physical drives, although they might be configured to be on a single volume group. The logical volumes can be either one or multiple logical partitions. Each logical partition has a physical partition that correlates to it. This is where you actually mirror your system, by having multiple copies of the physical partitions. How does logical volume creation correlate with physical volumes? Figure 10.3 illustrates the storage position on the physical disk platter. Center Inner Middle Inner Edge Middle Edge Figure 10.3: Physical disk platter layout As a general rule, data written toward the center of the platter has faster seek times than data written on the outer edge. This has to do with the concept of data density. Because data is more dense as it moves toward the center, there will be less movement of the head. Because the inner edge will usually have the slowest seek times, more intensive I/O applications should be brought closer to the center of the physical volumes. Is this always the case? There are exceptions. For example, disks hold more data per track on the edges of the disk than on the center. For this reason, logical volumes being accessed sequentially should actually be placed on the edge for better performance. The same holds true for logical volumes that have Mirror Write Consistency Check (MWCC) turned on. This is because the MWCC sector is on the edge of the disk (not at the center), which relates to the intra-disk policy of logical volumes.
File Systems 105 Inter-Disk Policy The inter-disk policy defines the number of actual disks on which the physical partitions of a logical volume reside. The general rule is that the minimum policy provides the greatest reliability and availability, while the maximum policy improves performance. Simply put, the more drives your data is spread on, the better the performance. Some other best practices include the following: ● Allocating intensive logical volumes to separate physical volumes ● Defining the logical volumes to the maximum size you need ● Placing frequently used logical volumes close together These are all reasons to understand your data before configuring your systems so that you can create policies that make sense from the start. You can define your policies when creating the logical volumes themselves using the System Management Interface Tool (SMIT) fastpath command: # smitty mklv File Systems Two types of kernels exist in AIX: a 32-bit kernel and a 64-bit kernel. (AIX 6.1 has only a 64-bit kernel.) Although both types of kernels share some common libraries and most commands and utilities, you should understand their differences and how the kernel relates to overall performance tuning. JFS2 is optimized for the 64-bit kernel, while JFS is optimized for the 32-bit kernel. Always use JFS2 if you can. Both JFS and JFS2 are journaling file systems, which have been associated with performance overheads. In fact, with JFS, where availability was not an issue and peak performance was necessary, you could disable metadata logging in an effort to increase performance. With JFS2 (AIX 5.3 only), that technique is no longer possible (or necessary) because the file system is tuned to handle metadata-intensive types of applications more efficiently. With AIX 6.1 you can now mount file systems without logging. The most important advantage of JFS2 lies in its ability to scale. With JFS2, you can have files up to 16 TB; JFS imposes a file size limit of 64 GB. JFS2 also includes changes in the directory organization. It uses a binary tree representation while performing inode searches, rather than the linear method used by JFS.

C h a p t e r 11 Disk I/O: Monitoring This chapter provides an overview of the AIX-specific tools (sar, nmon, and topas) available to monitor disk I/O activity. These commands let you quickly troubleshoot a performance problem and capture data for historical trending and analysis. Don’t expect to see iostat here. That Unix utility lets you quickly determine whether there is an imbalanced I/O load between your physical disks and adapters. But unless you decide to write your own scripting tools using iostat, it won’t help you with long-term trending and capturing data. sar The sar command, whose syntax is given in Chapter 8, is one of those older, generic Unix tools that have been improved over the years. Although I generally prefer to use more specific AIX tools, such as nmon and topas, sar provides strong information with respect to disk I/O. Let’s run a typical sar command to examine I/O activity: # sar -d 1 2 AIX newdev 3 5 06/04/ System Configuration: lcpu=4 disk=5 07:11:16 device %busy avque 07:11:17 hdisk1 0 0.0 hdisk0 29 0.0 hdisk3 0 0.0 r+w/s 0 129 0 Here’s a breakdown of the column headings: blks/s 0 85 0 avwait 0.0 0.0 0.0 avser 0.0 0.0 0.0
108 Chapter 11: Disk I/O: Monitoring ● ● ● %busy — Portion of time the device was busy servicing transfer requests avque —Number of requests waiting to be sent to disk (as of AIX 5.3) r+w/s — Number of read or write transfers to or from a device (in 512-byte units) ● avwait — Average wait time per request (in milliseconds) ● avserv — Average service time per request (in milliseconds) You want to be wary of any disk that approaches 100 percent utilization or shows a large number of queue requests waiting for disk. Although the sample output shows some activity, we have no real I/O problems because no waiting for I/O is occurring. We should continue to monitor this system to ensure other disks in addition to hdisk0 are being used. Where sar differs from iostat is in its ability to capture data for long-term analysis and trending using its system activity data collector (sadc) utility. Usually turned off in cron, the sadc utility lets you capture data for historical trending and analysis. Here’s how this works. As delivered by default on AIX systems, two shell scripts, /usr/lib/sa/sa1 and /usr/lib/sa/sa2, which are normally commented out provide daily reports on the activity of the system. The sar command actually calls the sadc routine to access system data. The following example shows how the shell scripts are usually kicked off from cron: # crontab -l | grep sa1 0 8-17 * * 1-5 /usr/lib/sa/sa1 1200 3 & 0 * * * 0,6 /usr/lib/sa/sa1 & 0 18-7 * * 1-5 /usr/lib/sa/sa1 & topas What about something a little more user-friendly? Did you say topas? The topas command is a nice performance-monitoring tool that you can use for a number of purposes, including monitoring your disk subsystem.
109 topas Let’s take a look at the topas output from a disk perspective: Topas output for host – Testhost Mon May 7 07:33:38 2007 Interval: 2 Events/Queues FILE/TTY Cswitch 500 Readch 487 Syscall 1298 Writech 943 Kernel 0.5 |# } Reads 2 Rawin User 0.5 |# | Writes 1 Ttyout Wait 0.0 | | Forks 0 Igets 0 Idle 99.0 |###########################| Execs 0 Namei 25 Dirblk 0 Network KBPS I-Pack O-Pack KB-In en1 0.6 1.0 1.0 0.1 0.5 lo0 0.1 1.0 1.0 0.0 0.0 KB-Out Runqueue 0.0 Waitqueue 0.0 PAGING 0 459 MEMORY Faults 1 Real,MB 4095 TPS KB-Read KB-Writ Steals 0 % Comp 13.8 Busy% KBPS hdisk0 0.0 0.0 0.0 0.0 0.0 PgspIn 0 % Noncomp 87.1 hdisk1 0.0 0.0 0.0 0.0 0.0 PgspOut 0 % Client 0.5 hdisk3 0.0 0.0 0.0 0.0 0.0 PageIn 0 cd0 0.0 0.0 0.0 0.0 0.0 PageOut 0 PAGING SPACE hdisk2 0.0 0.0 0.0 0.0 0.0 Sios 0 Size,MB Disk Name PID CPU% PgSp Owner NFS (calls/sec) 4096 % Used 0.5 % Free 99.4 X 15256 0.8 2.5 root ServerV2 0 topas 22320 0.2 1.5 root ClientV2 0 Press: syncd 15016 0.0 0.6 root ServerV3 0 “h” for help lrud 9030 0.0 0.0 root ClientV3 0 “q” to quit gil 10320 0.0 0.1 root i4llmd 12434 0.0 1.1 root prngd 19154 0.0 0.2 root rpc.lock 26878 0.0 0.0 root nfsd 28238 0.0 0.0 root tcl 17906 0.0 0.8 root i4lmd 25352 0.0 1.3 root dtwm 22752 0.0 1.9 rds xmgc 9804 0.0 0.0 root 20700 0.0 1.8 rds 1 0.0 0.7 root vmstat 37288 0.0 0.2 root dtfile 20444 0.0 1.7 rds cron 27720 0.0 0.4 root rshell 33334 0.0 0.8 user netm 10062 0.0 0.0 root dtsessio init No I/O activity at all is going on here. Besides the physical disk, pay close attention to the “Wait” information (in the CPU section up top), which
110 Chapter 11: Disk I/O: Monitoring also helps you determine whether the system is I/O-bound. If you see high numbers here, you can then use other tools, such as filemon, fileplace, lslv, or lsof, to help you figure out which processes, adapters, or file systems are causing your bottlenecks. The topas command is useful for quickly troubleshooting an issue when you want a little more than iostat can provide. In a sense, topas is a graphical mix of iostat and vmstat, although recent improvements now provide the ability to capture data for historical analysis. These improvements, introduced in AIX 5.3, no doubt were made because of the popularity of nmon. While nmon provides a front end similar to topas, it is much more useful in terms of long-term trending and analysis. Further, as you learned in Chapter 5, nmon gives system administrators the ability to output data to an Excel spreadsheet for presentation in graphical charts (tailor-made for senior management and functional teams) that clearly illustrate bottlenecks. The nmon analyzer tool provides the hooks into nmon. (Figure 5.1 in Chapter 5 shows some sample output from the nmon analyzer.) With respect to disk I/O, nmon reports the following data: disk I/O rates, data transfers, read/write ratios, and disk adapter statistics. Here is one small example of where nmon really shines. Let’s say you want to know which processes are hogging most of the disk I/O, and you want to be able to correlate that activity with the actual disk to clearly illustrate I/O per process. nmon usage helps you here more than any other tool. To perform this task with nmon, use the –t option; set your timing and then sort by I/O channel. How do you use nmon to capture data and import it into the analyzer? Use the open-source sudo command and run nmon for three hours, taking a snapshot every 30 seconds: # sudo nmon -f -t -r test1 -s 30 -c 180 Next, sort the created output file: # sort -A testsystem_yymmdd.nmon > testsystem_yymmdd.csv
111 Logical Volume Monitoring Then FTP the .csv file to your PC, start the nmon analyzer spreadsheet (enabling macros), and click on Analyze nmon data. The nmon command also helps track the configuration of asynchronous I/O servers. Logical Volume Monitoring Say that a ticket has just been opened up with the service desk that relates to slow performance on some database server. You suspect there might be an I/O issue, so you start with iostat. iostat, the equivalent of using vmstat for virtual memory, is arguably the most effective way to get a first glance at what is happening with your I/O subsystem. Let’s run iostat, in this case once a second: # iostat 1 System configuration: lcpu=4 disk=4 tty: tin tout 0.0 392.0 avg-cpu: % user % sys % idle % iowait 5.2 5.5 88.3 1.1 Disks: % tm_act Kbps tps Kb_read Kb_wrtn hdisk1 0.5 19.5 1.4 53437739 21482563 hdisk0 0.7 29.7 3.0 93086751 21482563 hdisk4 1.7 278.2 6.2 238584732 832883320 hdisk3 2.1 294.3 8.0 300653060 832883320 The command reports the following information: ● ● ● ● ● % tm_act — Percentage of time that the physical disk was active, or the total time of disk request Kbps — Amount of data (in kilobytes per second) transferred to the drive tps — Number of transfers per second issued to the physical disk Kb_read — Total data (in kilobytes) from the measured interval that is read from the physical volumes Kb_wrtn — Amount of data (kilobytes) from the measured interval that is written to the physical volumes
112 Chapter 11: Disk I/O: Monitoring You need to watch % tm_act very carefully because if this utilization exceeds roughly 60 to 70 percent, that usually indicates that processes are starting to wait for I/O. This might be your first clue of impending I/O problems. Moving data to less busy drives can obviously help ease this burden. Generally speaking, the more drives your data hits, the better. Just like anything else, too much of a good thing can also be bad, and you also have to make sure you don’t have too many drives hitting any one adapter. One way to determine whether an adapter is saturated is to sum the Kbps amounts for all disks attached to one adapter. The total should be below the disk adapter’s throughput rating, usually less than 70 percent. Using the –a flag with iostat helps you drill down further to examine adapter utilization. In the following output, there clearly are no bottlenecks: # iostat -a Adapter: scsi0 Paths/Disk: hdisk1_Path0 hdisk0_Path0 hdisk4_Path0 hdisk3_Path0 Kbps 0.0 % tm_act 37.0 67.0 0.0 0.0 Adapter: ide0 Paths/Disk: cd0 tps 0.0 Kbps 89.0 47.0 0.0 0.0 Kbps 0.0 % tm_act 0.0 Kb_read 0 tps 0.0 0.0 0.0 0.0 tps 0.0 Kbps 0.0 Kb_read 0 0 0 0 Kb_wrtn 0 Kb_read 0 Kb_wrtn 0 0 0 0 tps 0.0 Kb_read 0 Kb_wrtn 0 Kb_wrtn 0 AIX LVM Commands We examined disk placement earlier, and I stressed the importance of architecting your systems correctly from the beginning. Unfortunately, you don’t always have that option. As system administrators, we sometimes inherit systems that must be fixed. Let’s look at a sample layout of the logical volumes on disks to determine whether we need to change definitions or rearrange data. We’ll examine a volume group and find the logical volumes that are a part of it.
AIX LVM Commands 113 The lsvg command provides volume group information: # lsvg -l data2 Data2vg LV NAME data2lv loglv00 appdatalv TYPE jfs jfslog jfs LPs 128 1 128 PPs 256 2 256 PVs 2 2 2 LV STATE open/syncd open/syncd open/syncd MOUNT POINT /data N/ /appdata Now, let’s use lslv, which provides information about logical volumes: # lslv data2lv LOGICAL VOLUME: data2lv VOLUME GROUP: data2vg LV IDENTIFIER: 0003a0ec00004c00000000fb076f3f41.1 PERMISSION: read/write VG STATE: active/complete LV STATE: opened/syncd TYPE: jfs WRITE VERIFY: off MAX LPs: 512 PP SIZE: COPIES: 2 SCHED POLICY: parallel LPs: 128 PPs: 256 STALE PPs: 0 BB POLICY: relocatable INTER-POLICY: minimum RELOCATABLE: yes INTRA-POLICY: center UPPER BOUND: 32 MOUNT POINT: /data LABEL: /data 64 megabyte(s) This view provides a detailed description of the logical volume attributes. What do we have here? The intra-policy is at the center, which normally is the best policy for I/O-intensive logical volumes. As you recall from an earlier discussion, there are exceptions to this rule. Unfortunately, you’ve just hit one of them. Because Mirror Write Consistency Check (MWCC) is on, the volume would have been better served if it were placed on the edge. Let’s look at its inter-policy. The inter-policy is minimum, which is usually the best policy if availability matters more than performance. Further, there are twice as many physical partitions as logical partitions, which signifies that you are mirroring your systems. In this case, let’s assume you were told that raw performance was the most important objective, so the logical volume wasn’t configured to reflect the reality of how the volume is being
114 Chapter 11: Disk I/O: Monitoring used. Further, if you are mirroring the system and using an external storage array, the situation would even be worse, because you’re already providing mirroring at the hardware layer, which is actually more effective than using AIX mirroring. The lslv command’s –l (lowercase L) flag lists all the physical volumes associated with the logical volumes and shows the distribution for each logical volume: # lslv -l data2lv data2lv:/data2 PV hdisk2 hdisk3 COPIES 128:000:000 128:000:000 IN BAND 100% 100% DISTRIBUTION 000:108:020:000:000 000:108:020:000:000 With this detail, you can determine that 100 percent of the physical partitions on the disk are allocated to this logical volume. The distribution section of the output shows the actual number of physical partitions within each physical volume. From here, you can detail the volume’s intra-disk policy. Let’s drill down even further, using the -p flag: # lspv -p hdisk2 hdisk2: PP RANGE 1-108 109-109 110-217 218-237 238-325 326-365 366-433 434-542 STATE free used used used used used free free REGION outer edge outer edge outer middle center center inner middle inner middle inner edge LV ID TYPE MOUNT POINT loglv00 data2lv appdatalv testdatalv stagingdatalv jfslog jfs jfs jfs jfs N/A /data2 /appdata /testdata /staging The preceding view shows you what is free on the physical volume, what has been used, and which partitions are used where. The order of the fields
AIX LVM Commands 115 is as follows: edge, middle, center, inner-middle, inner-edge. The sample report shows that most of the data is in the middle and some is at the center. This is a nice view. You can do a lot with lsvg and lslv; run a man on these commands to find out more about them. One of the best tools for looking at LVM use is lvmstat. Because the lvmstat view is not enabled by default, you need to enable it before running the tool: # lvmstat -v data2vg -e The following command takes a snapshot of Logical Volume Manager information every second for 10 intervals: # lvmstat -v data2vg 1 10 The resulting output shows the most utilized logical volumes on your system since you started the data collection tool: # lvmstat -v data2vg Logical Volume appdatalv loglv00 data2lv % iocnt 306653 34 453 Kb_read 47493022 0 234543 Kb_wrtn 383822 3340 234343 Kbps 103.2 2.8 89.3 This detail is very helpful when drilling down to the logical volume layer in tuning your systems: ● ● ● ● % iocnt — Number of read and write requests Kb_read — Total data (in kilobytes) from your measured interval that is read Kb_wrtn — Total data (in kilobytes) from your measured interval that is written Kbps — Amount of data transferred (in kilobytes per second)
116 Chapter 11: Disk I/O: Monitoring Be sure to review the documentation for all the commands discussed here before adding them to your repertoire. filemon and fileplace This section introduces two important I/O tools, filemon and fileplace, and discusses how you can use them during systems administration each day. filemon filemon [-d] [-i Trace_File -n Gennames_File] [-o File] [-O Levels] [-P] [-T n] [-u] [-v] The filemon command uses a trace facility to report on the I/O activity of physical and logical storage, including your actual files. The I/O activity monitored is based on the time interval specified when running the trace. The command reports on all layers of file system utilization, including the LVM, virtual memory, and physical disk layers. Run without any flags, filemon executes in the background while application programs or system commands are being run and monitored. The trace starts automatically and runs until it is stopped. At that time, the command generates an I/O activity report and exits. It can also process a trace file that has been recorded by the trace facility. You can then generate reports from this file. Because reports generated to standard output usually scroll past your screen, I advise using the –o option to write the output to a file: # f ilemon -o dbmon.out -O all Run trcstop command to signal end of trace. Sun Aug 19 17:47:34 200 System: AIX 5.3 Node: lpar29p682e_pub Machine: 00CED82E4C00 # trcstop [f ilemon: Reporting started] # [f ilemon: Reporting completed] [f ilemon: 73.906 secs in measured interval]
fileplace 117 When we check out the file, here is what we see: Sun Aug 19 17:50:45 2007 System: AIX 5.3 Node: lpar29p682e_pub Machine: 00CED82E4C00 Cpu utilization: 68.2% Cpu allocation: 77.1% 130582780 events were lost. Reported data may have inconsistencies or errors. Most Active Files -----------------------------------------------------------------------#MBs #opns #rds #wrs file volume:inode . . . Look for long seek times because they can result in decreased application performance. By examining the read and write sequence counts in detail, you can further determine whether the access is sequential or random. This information helps you when it is time to do I/O tuning. The sample output clearly illustrates that there is no I/O bottleneck to speak of in this case. The filemon command provides a tremendous amount of detail; to be honest, I’ve found it gives too much information at times. Further, using filemon can impose a large performance hit. I don’t typically like to recommend performance tools that impose such a substantial overhead, so I’ll reiterate that although filemon certainly has a purpose, you need to be very careful when using it. fileplace f ileplace [ {-l|-p} [-i] [-v] ] File | [-m LogicalVolumeName] The fileplace command reports the placement of a file’s blocks within a file system. The command is commonly used to examine and assess the efficiency of a file’s placement on disk. For what purposes do you use it? One reason would be to help determine whether some of your heavily used files are substantially fragmented. The fileplace command can also help you identify the physical volume with the highest utilization and determine whether the drive or I/O adapter is causing the bottleneck. Let’s look at an example of a frequently accessed file:
118 Chapter 11: Disk I/O: Monitoring # fileplace -pv dbfile File: dbfile Size: 5374622 bytes Blk Size: 4096 Inode: 21 Frag Size: 4096 Mode: -rw-r--r-- Vol: /dev/hd4 Nfrags: 1313 Owner: root Group: system Physical Addresses (mirror copy 1) Logical Extent ---------------------------------- ----------------- 02134816-02134943 hdisk0 128 frags 524288 Bytes, 9.7% 00004352-00004479 02135680-02136864 hdisk0 1185 frags 4853760 Bytes, 90.3% 00005216-00006400 1313 frags over space of 2049 frags: 2 extents out of 1313 possible: space efficiency = 64.1% sequentiality = 99.9% You should be interested in space efficiency and sequentiality here. Higher space efficiency means files are less fragmented and provide better sequential file access. A higher sequentiality tells you that the files are more contiguously allocated, which is also better for sequential file access. In the example, space efficiency could be better, while sequentiality is quite high. If space and sequentiality are too low, you might want to consider file system reorganization. You would do this with the reorgvg command, which can improve logical volume utilization and efficiency.
C h a p t e r 12 Disk I/O: Tuning The best way to tune your I/O is to configure it properly before deploying your systems. In this way, I/O tuning is different from memory or CPU subsystem tuning. Of course, nine times out of ten, you will have inherited an existing system, so you need to be aware of all the areas where you can tune your disk I/O subsystems. lvmo lvmo -v Name -o Tunable [=NewValue] lvmo -a [-v vgname] You use the lvmo command to set and display pinned memory buffer, or pbuf, tuning parameters. The Logical Volume Manager uses pbufs to control pending disk I/O operations. The lvmo command is also used to display blocked I/O statistics. lvmo is one of those new commands introduced in AIX 5.3. It’s important to note that its usage permits changes only for LVM pbuf tunables that are dedicated to specific volume groups. The ioo utility (described next) remains the only way to manage pbufs on a systemwide basis. That’s because before Version 5.3, the pbuf pool parameter was a systemwide resource. With the introduction of AIX Version 5.3, LVM manages one pbuf pool for each volume group.
120 Chapter 12: Disk I/O: Tuning Let’s display the lvmo tunables for the data2vg volume group: # lvmo -v data2vg -a vgname = data2vg pv_pbuf_count = 1024 total_vg_pbubs = 1024 max_vg_pbuf_count = 8192 perv_blocked_io_count = 7455 global_pbuf_count = 1024 global_blocked_io_count = 7455 The following parameters are available for tuning: ● ● ● — Number of pbufs that can be added when a physical volume is added to the volume group pv_pbuf_count — Maximum number of pbufs that can be allocated for a volume group max_vg_pbuf_count — Number of pbufs that can be added when a physical volume is added to any volume group global_pbuf_count Let’s increase the pbuf count for this volume group: # lvmo -v redvg -o pv_pbuf_count=2048 It’s important to note that if you increase the pbuf value too much, performance may actually degrade. Truthfully, I usually stay away from lvmo and use ioo instead. I’m more used to tuning the global parameters, and it’s also safer this way. ioo ioo [-p|-r] { -o Tunable [=NewValue] } ioo [-p|-r] { -d Tunable } ioo [-p|-r] -D ioo [-p|-r] -a ioo -?
ioo 121 ioo -h [Tunable] ioo -L [Tunable] ioo -x [Tunable] The ioo command is used for virtually all I/O-related tuning parameters. As with vmo, you need to be extremely careful when changing this command’s parameters because doing so on the fly can severely degrade performance. Table 12.1 details specific tuning parameters used often for JFS file systems. As you can see, most of the tuning commands for I/O use the ioo utility. Table 12.1: JFS tuning parameters Function JFS tuning parameter JFS2 tuning parameter Set the maximum amount of memory for caching files vmo -o maxperm=value vmo -o maxclient=value (less than or equal to maxperm) Set the minimum amount of vmo -o minperm=value memory for caching n/a Set a (hard) limit on memory for caching vmo -o strict_maxperm vmo -o maxclient (hard limit) Set the maximum number of pages used for sequential read-ahead ioo -o maxpgahead=value ioo -o j2_maxPageReadAhead=value Set the minimum number of pages used for sequential read-ahead ioo -o minpgahead=value ioo -o j2_minPageReadAhead=value Set the maximum number of pending write I/Os to a file chdev -l sys0 -a maxpout maxpout chdev -1 sys0 -a maxpout maxpout Set the minimum number of pending write I/Os to a file at which programs blocked by maxpout might proceed chdev -l sys0 -a minpout minpout chdev -1 sys0 -a minpout minpout Set the size of modified data cache for a file with random writes ioo -o maxrandwrt=value ioo -o j2_maxRandomWrite ioo -o j2_nRandomCluster Control the gathering of I/Os for sequential writebehind ioo -o numclust=value ioo -o j2_PagesPerWriteBehindCluste r=value Set the number of file system bufstructs ioo -o numfsbufs=value ioo -o j2_nBufferPerPagerDevice=value
122 Chapter 12: Disk I/O: Tuning There are several ways to determine the existing ioo values on your system. The long display listing for ioo gives you the most information. It lists the values for current, reboot value, range, unit, type, and dependencies of all tunable parameters managed by ioo. Here is a sample of some of the parameters: # ioo -L NAME CUR DEF BOOT MIN MAX UNIT TYPE j2_atimeUpdateSymlink 0 0 0 0 1 boolean D j2_dynamicBufferPreallo 16 16 16 0 256 16K slabs D j2_inodeCacheSize 400 400 400 1 1000 j2_maxPageReadAhead 128 128 128 0 64K 4KB pages D j2_maxRandomWrite 0 0 0 0 64K 4KB pages D DEPENDENCIES D Let’s change a tunable: # ioo -o maxpgahead=32 Setting maxpgahead to 32 JFS2 Tuning Options Some important JFS2-specific file system performance enhancements include sequential page read-ahead and sequential and random write-behind. The AIX Virtual Memory Manager anticipates page requirements for observing the patterns of files that are accessed. When a program accesses two pages of a file, the VMM assumes that the program will keep trying to access the file in a sequential method. You can set VMM thresholds to configure the number of pages to be read ahead. With JFS2, make note of two important parameters: ● ● J2_minPageReadAhead — Determines the number of pages to read ahead when VMM initially detects a sequential pattern J2_maxPageReadAhead — Determines the maximum number of pages VMM can read in a sequential file
JFS2 Tuning Options 123 Sequential and random write-behind relates to writing modified pages in memory to disk after a certain threshold is reached. In this way, it does not wait for the syncd daemon to flush out pages to disk. The purpose of this functionality is to limit the amount of dirty pages in memory, thereby further reducing I/O overhead and disk fragmentation. With sequential writebehind, pages do not stay in memory until the syncd daemon runs, which can cause real bottlenecks. With random write-behind, when the number of pages in memory exceeds a specified amount, all subsequent pages are written to disk. Another important area worth mentioning is large sequential I/O processing. When too much simultaneous I/O is occurring to your file systems, the I/O can bottleneck at the file system level. In this case, you should increase the ioo command’s j2_nBufferPerPagerDevice parameter (numfsbus with JFS). If you use raw I/O as opposed to file systems, the same type of bottleneck can occur through LVM. In this case, you might want to tune the lvm_bufcnt parameter.

Section IV Summary, Tips, and Quiz Summary ● ● ● ● ● ● Direct I/O, introduced in AIX 4.3, bypasses the Virtual Memory Manager and transfers data directly to the disk from the user’s buffer. Turning on this feature may increase your performance, depending on your application. Direct I/O benefits applications that use synchronous writes, because the writes have to go to disk. Concurrent I/O (CIO) has all the performance benefits of direct I/O while also bypassing inode lock. This action lets multiple threads read and write data concurrently to the same file. Concurrent I/O benefits from the implementation of JFS2 with a write-exclusive inode lock, which lets multiple users read the same file simultaneously. Appropriate use of asynchronous I/O (AIO) can significantly improve the performance of writes on the I/O subsystem. AIO lets an application continue processing while its I/O completes in the background; I/O and application processing can thus run concurrently. The logical volume sits between the application and physical layers. The Logical Volume Manager (LVM) disk management system maps the data between logical and physical storage. This architecture lets data reside on multiple physical platters and be managed using LVM commands. As a general rule, data written toward the center of the physical disk platter has faster seek times than data written on the outer edge. This advantage has to do with the density of the data Inter-policy defines the number of disks on which the physical partitions of a logical volume reside.
126 Section IV: Summary, Tips, and Quiz ● ● ● ● ● ● ● Intra-policy defines the place on the disk where the logical volume actually resides. You use the lslv, lvm, lvmstat, and lvpv commands to monitor logical volumes. Commands ioo and lvmo work to tune disk I/O. Most tuning commands for I/O use the ioo utility. The filemon command uses a trace facility to report on the I/O activity of physical and logical storage, including your actual files. The I/O activity monitored is based on the time interval you specify when running the trace. The utility reports on all layers of file system utilization, including the LVM, virtual memory, and physical disk layers. The fileplace command reports the placement of a file’s blocks within a file system. It commonly is used to examine and assess the efficiency of a file’s placement on disk. Journaling file systems, although much more secure than nonjournaling systems, have historically been associated with performance overheads. In a “Performance Rules!” shop (at the expense of availability), you would disable metadata logging in an effort to increase performance with the JFS file system. With JFS2, that option is no longer possible, or even necessary, because JFS2 is tuned to handle metadata-intensive types of applications much more efficiently. JFS imposes a limit of 64 GB for a file; with JFS2, you can have a file supporting 16 TB. Tips ● ● Make sure your data is spread evenly across all spindles. If you have a storage area network (SAN) or an external storage array, verify that your storage administrator understands how he or she needs to configure this system — which includes trying to create arrays of equal size and type if possible. Try to create one logical unit (LUN) for each array and then spread the logical volumes across all physical volumes in the volume group. Make certain your mirrors are on separate disks and adapters.
Tips ● ● ● ● ● ● ● ● 127 If you’re running a relational database management system (RDBMS), make sure your indexes, temporary tablespaces, and redo logs reside on separate physical disks or LUNs. Regarding adapters, spread them across multiple buses, and don’t attach too many physical disks or LUNs to any one adapter. Remember, the more adapters you have, the better your performance will be. Be sure your device drivers support multipath I/O or your storage equivalent of that (e.g., PowerPath for EMC) to allow for further load balancing of the I/O subsystem. Be careful when using the filemon command, because you incur a performance overhead when using this tracing tool. The rule of thumb when configuring AIO servers in AIX 5.3 is to set the maximum number of servers (MaxServers) equal to 10 times the amount of disk or 10 times the number of processors. You would set MinServers at one half of this amount. Other than having some more kernel processes hanging out that don’t get used (consuming a small amount of kernel memory), there really is little risk in oversizing the number of MaxServers, so don’t be afraid to bump it up. Note that in AIX 6.1, this issue is no longer a concern. Consider employing concurrent I/O when using databases such as Oracle. CIO permits multiple threads to read and write data concurrently to the same file. This advantage accrues from the way in which JFS2 is implemented with write-exclusive inode locks, which let multiple users read the same file simultaneously. Performance increases dramatically when multiple users read from the same data file. Never lose sight of the fact that while RAM access takes about 540 CPU cycles, disk access can take 20 million CPU cycles. Clearly, the weakest link on a system is the disk I/O storage system. It’s your job as the system administrator to make sure it doesn’t become even more of a bottleneck. In terms of intra-disk policy, as a best practice, the more intensive I/O applications should be brought closer to the center of the physical volumes. Note, though, that this rule has exceptions. Disks hold more data per track on the edges, not on the center. That being said, logical volumes being accessed sequentially should actually be placed on the edge
128 Section IV: Summary, Tips, and Quiz for better performance. The same advice holds true for logical volumes that have Mirror Write Consistency Check (MWCC) turned on, because the MWCC sector is on the edge of the disk and not at the center of it, which relates to the intra-disk policy of logical volumes. ● Examine parameters J2_minPageReadAhead and J2_maxPageReadAhead in an effort to increase performance when sequential I/O is encountered. Quiz Multiple Choice 1. What is the weakest link on your system? a. RAM b. CPU c. Disk d. CPU cache 2. What sits between the application layer and the physical layer of the system? a. Physical volumes b. Logical volumes c. File systems d. Inodes 3. With JFS2, you can have a file that supports a. 32 TB b. 16 GB c. 16 TB d. 72 TB
True or False 129 4. For better performance, where on the disk platter should you place logical volumes that are being accessed sequentially? a. On the edge b. In the middle c. Inside d. Outside 5. Which command do you use to set and display your pbuf tuning parameter? a. no b. nfsm c. lvmo d. lsattr 6. Which command is used most often to tune disk I/O? a. vmo b. ioo c. iostat d. lsmo 7. What defines the place on the disk where the logical volume actually reside? a. Intra-policy b. Inter-policy c. Inode policy d. LVM policy True or False 8. The rule of thumb when configuring AIO servers in AIX 5.3 is to set the maximum number of servers equal to 10 times the amount of disk or 10 times the number of processors.
130 Section IV: Summary, Tips, and Quiz 9. filemon reports the placement of a file’s blocks within a file system. Fill in the Blank 10. Which parameter determines the number of pages to read ahead when VMM initially detects a sequential pattern? __________________________________________
Section V Network I/O This section provides an overview of network management on AIX, including how to monitor and tune the network subsystem. It also discusses tools you can use to monitor your hardware and the Network File System (NFS). Unlike other subsystems, the network subsystem has many things to monitor, so we’ll spend quite a bit of time on this topic. You’ll learn how to monitor network packets using the netstat command. We’ll also review best practices for tuning your network and discuss various networking concepts as they relate to systems performance.

C h a p t e r 13 Network I/O: Introduction The first thing that usually comes to mind when a system administrator hears that there might be some network contention issues is to run netstat. The netstat command — the “net” equivalent of using vmstat or iostat — provides a quick-and-dirty way to get an overview of how your network is configured. Unlike vmstat or iostat, however, the command defaults usually don’t give you as much information as you’d probably like. You need to understand the correct usage of netstat and how best to use it when monitoring your system. The netstat facility isn’t really a monitoring tool in the sense that vmstat and iostat are. Other, more suitable tools (which we’ll get to later) are available to help you monitor your network subsystem. At the same time, you can’t really start to monitor until you have a thorough understanding of the various components related to network performance. These components include your network adapters, your switches and routers, and how you are using virtualization on your host logical partitions. If you determine that you indeed are experiencing a network bottleneck, the solution to the problem might actually lie outside your immediate host machine. If the network switch is improperly configured on the other end, there is little you can do. Of course, you might be able to point the network team in the right direction. You should also spend time gathering overall information about your network.
134 Chapter 13: Network I/O: Introduction How are you going to be able to understand how to troubleshoot your network devices unless you really understand the network? In the next few chapters, we’ll look at specific AIX network tracing tools, such as netpmon, to see how they can help you isolate your bottlenecks. No matter which subsystem you want to tune, remember that systems tuning is an ongoing process. As I’ve stated before, the best time to start monitoring your systems is at the beginning, before you have any problems and when users aren’t screaming. You need a baseline of network performance so that you know what the system looks like when it’s behaving normally. And remember: be careful to make changes one at a time so you can assess the actual impact of each change. Network I/O Overview Understanding the network subsystem as it relates to AIX is not an easy undertaking. From a hardware and software aspect, there are far fewer areas you need to investigate when you examine CPU and memory bottlenecks. Tuning disk I/O is more complex than other tuning activities because many more issues affect performance, particularly during the architecting and build-out of systems. In this respect, tuning the network is probably most similar to tuning disk I/O — a fact that’s actually not too surprising, given that both relate to I/O. Let’s start by examining the AIX Transmission Control Protocol/Internet Protocol (TCP/IP) layers, which are depicted in Figure 13.1.
Network I/O Overview 135 Figure 13.1: AIX TCP/IP layers From this illustration, you can clearly see that there is more to network monitoring than simply running netstat and looking for collisions. From the application layer through the media layer, areas need to be configured, monitored, and tuned. At this point, you should notice some similarities between this illustration and the Open Systems Interconnection (OSI) model, which divides network architecture into seven layers (from top to bottom): ● Application ● Presentation ● Session ● Transport ● Network ● Data link ● Physical Perhaps the most important concept to understand is that each layer on the host machine communicates with the corresponding layer on the remote machine. The actual application programs transmit data using either the User Datagram Protocol (UDP) or the TCP transport layer protocols. They
136 Chapter 13: Network I/O: Introduction receive the data from whatever application you are using and divide that data into packets. The packets themselves differ depending on whether a packet is a UDP packet or a TCP packet. In general, UDP is faster, while TCP is more secure. There are many tunable parameters to look at, and we’ll get to these later. To begin, you might want to start to familiarize yourself with the no command, which is the utility designed to make most network changes. From a hardware perspective, it is critical for you to understand the components that must be configured appropriately to optimize performance. Although you might work together with the network teams that manage your switches and routers, you probably won’t be configuring those devices unless you’re a small shop or a one-person IT department. The most important component you’ll work with is the network adapter. Most of your adapters will probably be some version that supports Gigabit Ethernet, such as a 10/100/1000 Mbps Ethernet card. Let’s review the important concepts you’ll need to work with here. NFS Introduced by Sun Microsystems in 1984, the Network File System (NFS) lets clients access files over a network as if the files were locally attached as disks. Version 2 of NFS, introduced in 1989, operated exclusively on UDP. Version 3, which debuted in 1995, added TCP support, which helped NFS thrive over a wide area network (WAN). Version 4, introduced in 2000, was the first version developed by the Internet Engineering Task Force (after Sun relinquished control of NFS development).NFS V4 was also the first version to provide stateful support, whereby both the client and the server maintain current information about both open files and file locks. NFS was further enhanced in 2003 under RFC3530, and it is this standard that AIX supports. AIX 5.3 supports three versions of NFS: Versions 2, 3, and 4. The default version is Version 3. (For Red Hat Linux, the default NFS version is Version 4.) You can choose the NFS version type during the actual mounting of the file system, and you can run different NFS versions on the same server.
NFS 137 NFS now supports both TCP and UDP. Because UDP is faster (it does less), some environments that favor optimum performance (on a LAN) over reliability might perform better with UDP. TCP is more reliable (because it establishes connections) and provides better performance over a WAN (because its flow control helps minimize network latency). A benefit of NFS is that it acts independently of machine types and operating systems. It achieves this independence through the use of remote procedure calls (RPCs), as depicted in Figure 13.2. Server Z Client A Thread m biod i nfsd a biod j nfsd b biod k nfsd c LAN Client B Thread n biod a nfsd x biod b nfsd y biod c nfsd z Figure 13.2: Interaction between client and server The figure illustrates how NFS clients A and B access the data on NFS server Z. The client computers first request access to the exported data by mounting the file system. Then, when a client thread tries to process data within the NFS mounted file system, the data is redirected to the biod daemon, which takes the data through the LAN to the NFS server and its nfsd daemon. The server uses nfsd to export the directories that are available to its clients. As you can see, you’ll need to tune the network and I/O parameters. If Server Z is performing poorly, that obviously affects all of its NFS clients. If possible, tune the server specifically to function as an NFS server (more about this later).
138 Chapter 13: Network I/O: Introduction What about the biod daemon? This daemon is required to perform both read-ahead and write-behind requests. biod improves overall NFS performance as it either empties or fills up the buffer cache, acting as a liaison to the client applications. As shown in the figure, the biod daemon sends the requests to the server. On the other side, nfsd is the liaison that provides NFS services to clients. When the server receives biod communications from the client, it uses the nfsd daemon until the request is completed. How is it that NFS was not stateful until Version 4, even though it could use TCP as early as Version 2? Figure 13.3 illustrates where NFS resides in relation to the TCP/IP stack and the OSI model. Figure 13.3: NFS relationship to OSI and TCP/IP Because NFS uses remote procedure calls, it does not reside on the transport stack. RPCs are a library of procedures that enable the client and server processes to execute system calls as if they were executed in their own address spaces. In a typical UDP NFS Version 2 or 3 implementation, the NFS server sends its client a type of cookie after the clients are authorized to share the volume. This approach helps minimize network traffic. The problem is that if the server goes down, clients will continue to inundate the network with requests. That is why there is a preference for
Media Speed 139 using TCP. Only Version 4 can use stateful connections, and only Version 4 uses TCP as its transport protocol. NFS 4 has no interaction with portmap or other daemons such as lockd and statd, because these functions are rolled into the kernel. In versions other than Version 4, the portmapper is used to register RPC services and to provide the port numbers for the communications between clients and servers. External Data Representation (XDR) provides the mechanism that RPC and NFS use to ensure reliable data exchange between client and server. This interaction takes place in a way that is platform-independent for the exchange of binary data, thus addressing the possibility of systems representing data in different ways. Using XDR, data can be interpreted correctly, even on platforms that are not alike. Media Speed Network adapters communicate with other devices based on how the media speed is configured. Although other choices are available, you should configure your card for either 100 Mbps full duplex or auto-negotiation. With auto-negotiation, both adapters try to communicate using the highest possible speed. The documentation might tell you that you need to configure the card this way (IBM even defaults to auto-negotiation on the system), but most senior AIX administrators I know prefer to set it to full duplex to ensure they receive the fastest possible adapter speed. If this setting doesn’t function properly, you should work with the appropriate network teams to resolve the problem before deployment. I prefer to take more time initially rather than set the adapter to an option that might cause slower speeds as a result of poorly configured switches. The lsattr command gives you the information you need. Used with the en prefix, it displays your driver parameters; the ent prefix displays your hardware parameters. In the following case, the interface is set to autonegotiate. # lsattr -El ent0 alt_addr busintr busmem 0x000000000000 166 0xc8030000 Alternate Ethernet Address Bus interrupt level Bus memory address True False False
140 Chapter 13: Network I/O: Introduction chksum_offload intr_priority ipsec_offload large_send media_speed poll_link poll_link_timer rom_mem rx_hog rxbuf_pool_sz rxdesc_que_sz slih_hog tx_preload tx_que_sz txdesc_que_sz use_alt_addr yes 3 no no Auto_Negotiation no 500 0xc8000000 1000 1024 1024 10 1520 8192 512 no Enable RX Checksum Offload Interrupt priority IPsec Offload Enable TCP Large Send Offload Media Speed Enable Link Polling Time interval for Link Polling ROM memory address RX Descriptors per RX Interrupt Receive Buffer Pool Size RX Descriptor Queue Size Interrupt Events per Interrupt TX Preload Value Software TX Queue Size TX Descriptor Queue Size Enable Alternate Ethernet Address True False True True True True True False True True True True True True True True You should also check your adapter firmware levels to make sure they’re up-to-date. I’ve seen many network problems fixed by updating to the latest levels of firmware. The lscfg command reports firmware information: # lscfg -vp | grep -p ROM 10/100 Mbps Ethernet PCI Adapter II: Part Number.................09P5023 FRU Number..................09P5023 EC Level....................H10971A Manufacture ID..............YL1021 Network Address.............0002556FC98B ROM Level.(alterable).......SCU015 Product Specific.(Z0).......A5204207 Device Specific.(YL)........U0.1-P1-I1/E1 10/100/1000 Base-TX PCI-X Adapter: Part Number.................00P3056 FRU Number..................00P3056 EC Level....................H11635A Manufacture ID..............YL1021 Network Address.............00096B2E31BD ROM Level.(alterable).......GOL002 Device Specific.(YL)........U0.1-P1/E2
Virtual and Shared Ethernet 141 Network Subsystem Memory Management You should also start to familiarize yourself with the memory management facility of network subsystems. This facility makes use of data structures called mbufs that are used to store kernel data for incoming and outbound traffic. The buffer sizes themselves can range from 32 bytes to 16,384 bytes. The buffer pools are created by making allocation requests to the Virtual Memory Manager. In a symmetric multiprocessing box, each memory pool is split evenly for every processor. An important point to note is that a processor cannot borrow from the memory pool outside of its own processor. Virtual and Shared Ethernet Two other concepts to be familiar with are virtual Ethernet and shared Ethernet. First supported on AIX 5.3 on POWER5, virtual Ethernet allows for interpartition- and IP-based communications between logical partitions on the same frame. This functionality is achieved through the use of a virtual I/O switch. The Ethernet adapters themselves are created and configured using the Hardware Management Console (HMC). Shared Ethernet is one of the features of Advanced Power Virtualization (APV) or PowerVM. It enables the use of virtual I/O servers (VIOs), whereby several host machines can actually share one physical network adapter. Shared Ethernet is typically used in environments that don’t require substantial network bandwidth. Although an in-depth discussion of virtualization is beyond the scope of this book, you should understand that if you are using virtualization, there might be other reasons for your bottleneck outside of what you’re doing on the host machine. Virtualization is a wonderful thing, but you need to be careful not to share too many adapters from your VIO server, or you might pay a large network I/O penalty. Use of the appropriate monitoring tools should inform you whether you have a problem. Further, you might want to familiarize yourself with concepts such as Address Resolution Protocol (ARP) and Domain Name Server (DNS), which can also affect network performance and reliability in different ways.

14 C h a p t e r Network I/O: Monitoring Let’s begin our discussion of network I/O monitoring by revisiting our old standby, netstat, which displays overall network statistics. Probably one of the most common commands you will type is netstat –in: # netstat -in Name en1 en1 en0 en0 lo0 lo0 lo0 Mtu 1500 1500 1500 1500 16896 16896 16896 Network link#2 10.153 link#3 172.29.128 link#1 127 ::1 Address 2a.21.70.0.90.6 10.153.3.7 2a.21.70.0.90.5 172.29.137.205 127.0.0.1 Ipkts 21005666 21005666 328241182 328241182 62223 62223 62223 Ierrs 0 0 0 0 0 0 0 Opkts Oerrs Coll 175389 0 0 175389 0 0 1189 0 0 1189 0 0 62234 0 0 62234 0 0 62234 0 0 Here is a key to the output fields: ● Name — Interface name ● Mtu — Interface Maximum Transfer Unit (MTU) size ● Network — The actual network address to which the interface connects ● Address — Media Access Control (MAC) or IP address ● Ipkts — Total number of packets received by the interface
144 Chapter 14: Network I/O: Monitoring ● Ierrs — Number of errors reported back from the interface ● Opkts — Number of packets transmitted from the interface ● Oerrs — Number of error packets transmitted from the interface ● Coll — Number of collisions on the adapter (if you’re using Ethernet, you won’t see anything here) Another handy netstat flag is –m. This option lets you view the kernel memory allocation statistics, including mbuf memory requests (and buffer size), amount of memory in use, and failures by CPU: # netstat -m Kernel malloc statistics: ******* CPU 0 ******* By size inuse calls failed 32 194 5203 0 64 484 3926 0 128 309 14913 0 256 392 214494 0 512 2060 26183179 0 1024 31 2714 0 2048 587 1237 0 4096 9 8367 0 8192 2 12 0 16384 224 354 0 32768 48 183 0 65536 84 142 0 131072 3 4 0 ******* CPU 1 ******* By size inuse calls failed 32 17 96 0 64 295 1214 0 128 151 93806 0 256 83 273 0 512 1577 86936634 0 1024 4 18 0 2048 515 516 0 4096 1 707 0 8192 1 1 0 16384 32 32 0 32768 52 193 0 65536 34 34 0 131072 0 0 0 delayed 2 7 8 22 261 8 292 2 2 29 13 42 0 free 62 28 875 136 60 25 5 2 1 2 3 0 51 hiwat 2620 2620 1310 2620 3275 1310 1965 655 327 163 81 81 102 freed 0 0 0 0 0 0 0 0 0 0 0 0 0 delayed 0 5 5 5 199 2 257 0 1 4 15 17 0 free 111 25 713 29 23 4 1 1 4 0 5 0 44 hiwat 2620 2620 1310 2620 3275 1310 1965 655 327 163 90 81 88 freed 0 0 0 0 0 0 0 0 0 0 0 0 0
netpmon 145 If you’re using Ethernet, you can also use the entstat command to display device driver statistics: # entstat -d en1 ------------------------------------------------------------ETHERNET STATISTICS (en1) : Device Type: 10/100 Mbps Ethernet PCI Adapter II (1410ff01) Hardware Address: 00:02:55:6f:c9:9b Elapsed Time: 5 days 12 hours 14 minutes 46 seconds Transmit Statistics: -------------------Packets: 803536 Bytes: 511099654 Interrupts: 520 Transmit Errors: 0 Packets Dropped: 0 Receive Statistics: ------------------Packets: 2095253 Bytes: 1099945394 Interrupts: 2074913 Receive Errors: 0 Packets Dropped: 0 Bad Packets: 0 Max Packets on S/W Transmit Queue: 38 S/W Transmit Queue Overflow: 0 Current S/W+H/W Transmit Queue Length: 1 Broadcast Packets: 535 Broadcast Packets: 997476 The entstat output provides a potpourri of information. You won’t see many collisions because you’ll probably be working in a switched environment. Look for transmit errors, and make sure they’re not increasing too fast. You need to learn to troubleshoot collision and error problems before you even begin to think about tuning. As an alternative, you can use netstat –v, which provides similar information. netpmon netpmon [-o File] [-d] [-T n] [-P] [-t] [-v] [-O ReportType ...] [-i Trace_File -n Gennames_File] The netpmon command reports information about CPU usage as it relates to the network. It also provides data about network device driver I/O, Internet socket calls, and various other statistics.
146 Chapter 14: Network I/O: Monitoring Similar to its other trace brethren, tprof and filemon, netpmon starts a trace and runs in the background until you stop it with the trcstop command. I like netpmon because it really gives you a detailed overview of network activity and also captures data for trending and analysis (although it’s not as useful as nmon for the latter purpose). In the following example, we’ll use a trace buffer size of 2 million bytes: # netpmon -T 2000000 -o /tmp/net.out Wed Sep 5 05:30:27 2007 System: AIX 5.3 Node: lpar7ml162f_pub Machine: 00C22F2F4C00 Run trcstop to signal the end of the trace: # trcstop # [netpmon: Reporting started] [netpmon: Reporting completed] [ 4 traced cpus [ 245.464 secs total preempt time ] ] [netpmon: 164.813 secs in measured interval] Let’s look at the data. Here is just a small sampling of the output: # more net.out Process CPU Usage Statistics: ----------------------------Network Process (top 20) PID CPU Time CPU % CPU % ---------------------------------------------------------UNKNOWN 15920 151.2735 36.558 0.000 UNKNOWN 7794 104.8801 25.346 0.000 UNKNOWN 6876 73.8785 17.854 0.000 UNKNOWN 5402 50.6225 12.234 0.000 xmwlm 13934 15.0469 3.636 0.000 -ksh 5040 0.0371 0.009 0.000 getty 18688 0.0280 0.007 0.000 sshd: 28514 0.0224 0.005 0.000 syncd 10068 0.0212 0.005 0.000
netpmon gil swapper spray send-mail rmcd ping ksh trcstop 3870 0 5400 18654 15026 5036 26642 5404 0.0163 0.0135 0.0085 0.0084 0.0081 0.0068 0.0062 0.0057 0.004 0.003 0.002 0.002 0.002 0.002 0.002 0.001 147 0.004 0.000 0.000 0.000 0.000 0.000 0.000 0.000 As you can see, little overall network I/O activity was going on during this time. The top section of the output is most important. It helps you gain an understanding of which processes are eating up network I/O time. The lsattr command, which we used in Chapter 13 to view hardware parameters, is another tool you’ll use frequently to display statistics about your interfaces. The attributes reported by this command are configured using either the chdev or the no command. Let’s display the driver parameters using lsattr: # lsattr -El en0 alias4 alias6 arp authority broadcast mtu netaddr netaddr6 netmask prefixlen remmtu rfc1323 security state tcp_mssdflt tcp_nodelay tcp_recvspace tcp_sendspace IPv4 Alias including Subnet Mask IPv6 Alias including Prefix Length on Address Resolution Protocol (ARP) Authorized Users Broadcast Address 1500 Maximum IP Packet Size for This Device Internet Address IPv6 Internet Address Subnet Mask Prefix Length for IPv6 Internet Address 576 Maximum IP Packet Size for REMOTE Networks Enable/Disable TCP RFC 1323 Window Scaling none Security Level detach Current Interface Status Set TCP Maximum Segment Size Enable/Disable TCP_NODELAY Option Set Socket Buffer Space for Receiving Set Socket Buffer Space for Sending True True True True True True True True True True True True True True True True True True
148 Chapter 14: Network I/O: Monitoring Sometimes, I also like to use the spray command to troubleshoot possible problems (although oftentimes this command is blocked because it’s not very secure). The spray command sends a one-way stream of packets from your host to the remote host machines and reports the number of packets dropped as well as the number of packets transferred: # /usr/etc/spray lpar8test -c 2000 -l 1400 -d 1 sending 2000 packets of length 1402 to lpar8test ... 34 packets (1.700%) dropped by lpar8test 23667 packets/second, 33181234 bytes/second In the preceding example, 2,000 packets were sent to the lpar8test host, with a delay of one microsecond. Each packet consisted of 1,400 bytes. Before using spray, make sure the sprayd daemon isn’t commented out of the inetd daemon (the default configuration in AIX), and don’t forget to refresh inetd. If you’re seeing a substantial number of dropped packets, that obviously is not good. Monitoring NFS This section covers the use of the nmon, topas, nfsstat, nfs, nfs4cl, and netpmon commands to monitor the Network File System (NFS). For NFS tuning, you could use a tool such as topas or nmon initially because these commands provide a nice dashboard view of what is happening in your system. Remember that NFS performance problems might not be related to your NFS subsystem at all; your bottleneck could be on the network or, from a server perspective, related to CPU or disk I/O. Running a tool such as topas or nmon can quickly help you get a sense of what the real issues are. Consider a system that has two CPUs and is running AIX 5.3 TL_6. The report in Figure 14.1 shows nmon output from an NFS perspective.
nfsstat 149 Figure 14.1: NFS nmon output Look at all the information that is available to you from an NFS (client and server) perspective using nmon! There are no current bottlenecks at all on this system. Although topas has improved recently with its ability to capture data, nmon might still be a better first choice. While topas provides a front end similar to nmon, nmon is more useful in terms of long-term trending and analysis. nfsstat The nfsstat tool is arguably the most important tool you’ll work with as you monitor your network. This command displays all types of information about NFS and remote procedure calls (RPCs). You can use nfsstat as
150 Chapter 14: Network I/O: Monitoring a monitoring tool to troubleshoot problems and also employ it for performance tuning. Depending on the flags you use, you can have nfsstat display NFS client or server information. The command can also show the actual usage count of file system operations. This detail helps you understand exactly how each file system is utilized, so that you can know how to best tune your system. Look at the client flag (c) first. The r flag generates the RPC information: # nfsstat -cr Client rpc: Connection oriented calls badcalls 14348 1 nomem cantconn 0 0 Connectionless calls badcalls 23 0 timers 0 nomem 0 badxids 0 interrupt 0 timeouts 0 newcreds 0 badverfs 0 timers 0 retrans 0 badxids 0 timeouts 0 newcreds 0 badverfs 0 cantsend 0 Here’s a rundown of the connection-oriented parameters: ● calls — Number of RPC calls received ● badcalls — Number of calls rejected by the RPC layers ● ● ● ● badxids — Number of times a server reply was received that did not correspond to any outstanding call timeouts — Number of times calls timed out while waiting for replies from the server newcreds — Number of times authentication information was refreshed badverfs — Number of times a call failed due to a bad verifier in the response
nfs4cl 151 If you notice a large number of timeouts or badxids, you could benefit by increasing the timeo parameter with the mount command (details to come). Next, look at the NFS information by using the n flag: # nfsstat -cn Client nfs: calls badcalls clgets 14348 1 0 Version 2: (0 calls) null getattr setattr 0 0% 0 0% 0 0% wrcache write create 0 0% 0 0% 0 0% mkdir rmdir readdir 0 0% 0 0% 0 0% Version 3: (14348 calls) null getattr setattr 0 0% 3480 24% 5 0% write create mkdir 44 0% 3 0% 0 0% rename link readdir 0 0% 2 0% 3 0% cltoomany 0 root 0 0% remove 0 0% statfs 0 0 lookup 0 0% rename 0 0% readlink 0 0% link 0 0% read 0 0 symlink 0 0% lookup 1790 12% symlink 0 0% readdir+ 3195 22% access 5742 40% mknod 0 0% fsstat 5 0% readlink 0 0% remove 3 0% fsinfo 2 0% read 30 0% rmdir 0 0% pathconf 0 0% In NFS Version 3, the output fields include: ● calls — Number of received NFS calls ● badcalls — Number of calls rejected by the NFS layer ● clgets — Number of times a client handle was received ● cltoomany — Number of times the client handle had no unused entries nfs4cl If you’re running NFS Version 4, you might be using the nfs4cl command more often. This command displays NFS 4 statistics and properties:
152 Chapter 14: Network I/O: Monitoring # nfs4cl showfs Server -------- Remote Path --------------- fsid --------------- Local Path --------------- If after running this command, you see that there is no output, run the mount command to obtain more detail: # mount node mounted ------------ --------------/dev/hd4 192.168.1.12 /stage/middleware mounted over vfs --------------- ---- ------------ --------------- date options / jfs Sep 25 13:18 rw,log=/dev/hd8 /stage/middleware nfs3 Sep 25 13:22 ro,bg,soft,intr, sec=sys 192.168.1.12 /userdata/20004773 /home/u0004773 nfs3 Sep 25 13:29 bg,hard,int As you can tell, in this example no file systems are mounted using NFS Version 4, only NFS Version 3. Unlike the vast majority of performance tuning commands, nfs4cl can also be used to tune your system. You do this by using the setfsoptions subcommand to tune NFS Version 4. Another parameter you can tune is the previously mentioned timeo, which specifies the timeout value for the RPC calls to the server. netpmon and NFS The netpmon command can also help you troubleshoot NFS bottlenecks. In addition to monitoring many other types of network statistics, netpmon monitors for clients — both read and write subroutines and NFS RPC requests. For servers, netpmon monitors read and write requests. The command starts a trace and runs in the background until you stop it. First, let’s kick off the trace: # netpmon -T 3000000 -o /tmp/nfrss.out You run the trcstop command to signal the end of the trace, as the following message informs you:
netpmon and NFS # Sun Oct 153 7 07:06:14 2007 System: AIX 5.3 Node: lpar24ml162f_pub Machine: 00C22F2F4C00 Run trcstop command to signal end of trace. Let’s stop our trace: # trcstop # [netpmon: Reporting started] [netpmon: Reporting completed [ 2 traced cpus ] [ 245.464 secs total preempt time ] [netpmon: 164.813 secs in measured interval Now, we can check out the NFS-specific information provided in the output file: NFSv3 Client RPC Statistics (by Server): ---------------------------------------Server Calls/s ---------------------------------p650 126.68 -----------------------------------------------------------------------Total (all servers) 126.68 Detailed NFSv3 Client RPC Statistics (by Server): ------------------------------------------------SERVER: p650 calls: 5602 call times (msec): COMBINED (All Servers) calls: call times (msec): avg 1.408 min 0.274 max 979.611 sdev 21.310 5602 avg 1.408 min 0.274 max 979.611 sdev 21.310 In this case, you can see the NFS Version 3 client statistics by server. Although netpmon is a useful trace utility, its performance overhead can sometimes outweigh its benefits, particularly when you have other ways to obtain similar information. So be aware of this consideration when using this utility.
154 Chapter 14: Network I/O: Monitoring Monitoring Network Packets Earlier, I addressed some of the very basic flags, such as –in, that you typically use with the netstat command. Using netstat, you can also monitor more detailed information about the packets themselves. For example, the –D option reports the overall number of packets received, transmitted, and dropped in your communications subsystem. The command output sorts the results by device, driver, and protocol: # netstat -D Source Ipkts Opkts Idrops Odrops ------------------------------------------------------------------------------ent_dev0 238122150 1805 0 0 ent_dev1 17583646 301547 0 0 --------------------------------------------------------------Devices Total 255705796 303352 0 0 . . . There are actually so many different ways to use netstat that the best place to start is to look at the man page for netstat and go from there. Don’t be afraid to run these commands, because they won’t eat up disk space or affect performance. iptrace, ipreport, and ipfilter The tracing tools provided within AIX are used to record detailed information about packets. Use these commands with more caution. The tools are extremely helpful when you’re trying to determine the root cause of network performance problems. Check out iptrace and ipreport first. The iptrace command records all packets received from the network interfaces. The ipreport command formats the data generated from iptrace into a readable trace report. You can also use the ipfilter command to sort the output file created from ipreport. Let’s try starting the trace and running it for one minute:
iptrace, ipreport, and ipfilter 155 # /usr/sbin/iptrace -a -i en0 iptrace [1] 7375 # [774252 [1] + Done /usr/sbin/iptrace -a -i en0 iptrace.out Here, you can see the trace running: # ps -ef | grep iptrace root 205030 749602 0 10:57:32 pts/0 0:00 grep iptrace root 774252 2 10:57:25 1 - 0:00 /usr/sbin/iptrace -a -i en0 iptrace.out When we’re done with the trace, we need to kill the process: # kill -1 77425 # iptrace: unload success! Next, let’s sort the file: # ipreport -r -s iptrace.out >/ipreport.network Now, we can examine the output, which shows the captured information about each packet, including packet size and IP address information: # more ipreport.network IPTRACE version: 2.0 ETH: ====( 114 bytes transmitted on interface en0 )==== 10:57:25.698790226 ETH: [ da:bb:b8:b5:26:14 -> 6e:87:76:59:6e:cd ] type 800 (IP) IP: < SRC = 172.29.135.44 > (lpar37p682e) IP: < DST = 172.29.131.16 > IP: ip_v=4, ip_hl=20, ip_tos=16, ip_len=100, ip_id=18349, ip_off=0 DF IP: ip_ttl=60, ip_sum=945f, ip_p = 6 (TCP) TCP: <source port=22(ssh), destination port=53643 > TCP: th_seq=337783617, th_ack=1783353394 TCP: th_off=8, flags<PUSH | ACK> TCP: th_win=65522, th_sum=0, th_urp=0 TCP: nop TCP: nop TCP: timestamps TSVal: 0x47414604 TSEcho: 0x47826117 TCP: 00000000 520bea13 dfaefa7b e1c517d6 ce86f960 |R......{.......’| TCP: 00000010 fdb24d69 947c8d48 fa7b6379 235d1a63 |..Mi.|.H.{cy#].c| TCP: 00000020 840adfc2 e1b4b916 e1002983 f96fc1fb |..........)..o..|
156 Chapter 14: Network I/O: Monitoring As you can imagine, the trace file can become very large fairly quickly. The file for this example grew to 40 MB in less than a minute! Be very careful when running these traces because you’ll run out of disk space really fast if you don’t have the disk bandwidth for these files. You can also start the trace using the System Resource Controller (SRC). tcpdump What about tcpdump? This command prints the headers of the packets that are captured for each network interface card (NIC). One important difference with tcpdump is that, unlike iptrace, it can look at only one network interface at a time. And because iptrace examines the entire packet from the kernel space, its results can include lots of dropped packets. With tcpdump, you can limit the amount of data to be traced. Also, you don’t need to use an ipreport type of command to format the binary data because tcpdump performs both the trace and the output. Let’s run tcpdump: # tcpdump -w tcp.out tcpdump: listening on en0, link-type 1, capture size 96 bytes The utility continues to capture packets until you press Ctrl+C. If any packets were dropped due to a lack of buffer space, tcpdump reports that, too: 14755 packets received by filter 0 packets dropped by kernel 13:40:28.001711 IP lpar37p682e.ssh > 172.29.131.16.53736: P 374368029:374368077(48) The preceding output shows that the kernel dropped no packets, which is a good thing.
C h a p t e r 15 Network I/O: Tuning The most important command for tuning AIX network parameters is the no command. First, take a look at the first few parameters, using the –a flag: root@lpar37p682e[/] > no -a arpqsize = 12 arpt_killc = 20 arptab_bsiz = 7 arptab_nb = 149 bcastping = 0 clean_partial_conns = 0 delayack = 0 delayackports = {} As an alternative, you can use the –L flag, which provides much more detailed information. The no command provides more than 100 parameters you can tune. In older versions of AIX, thewall was an important tunable whose defaults you needed to change; this parameter defined the upper limit for network kernel buffers. Today, this size is defined at installation time depending on the amount of RAM and the kernel type. For example, if you are running AIX 5.3 on a 64-bit kernel, the parameter is set at half the size of real
158 Chapter 15: Network I/O: Tuning memory. (I actually used to enjoy playing around with thewall, so I’m not sure I like the new approach.) You can use netstat –m to detect shortages or failures of network memory requests. In the following example, there are no shortages (failures): root@lpar37p682e[/etc/tunables] > netstat -m Kernel malloc statistics ******* CPU 0 ******* By size inuse calls failed delayed free hiwat free 32 64 117 109 217 6523 0 0 0 1 11 83 5240 5240 0 0 128 256 975 520 15951 67637 0 0 29 30 785 1016 2620 5240 0 0 Streams mblk statistic failures 0 high priority mblk failures 0 medium priority mblk failures 0 low priority mblk failures Although you can change many parameters using the no utility, most of them are better left alone. The most important parameters are those that relate to TCP streaming workload tuning: ● ● ● — This parameter controls how much buffer space in the kernel is used to buffer application data. You really want to bump this value up from the default because if its limit is reached, the sending application suspends data transfer until TCP sends the data to the buffer. tcp_sendspace tcp_recvspace — In addition to controlling the amount buffer space to be consumed by receive buffers, this value helps AIX determine the size to make its transmit window. — When using UDP, you can set this value no higher than 65536 because IP has an upper limit of 65,536 bytes per packet. udp_sendspace
tcpdump ● 159 udp_recvspace — This value should be greater than udp_sendspace because it needs to handle as many simultaneous UDP packets per socket as it can. You can easily set this parameter to 10 times the value of udp_sendspace. Let’s use no make a few changes. First, increase the size of udp_sendspace: root@lpar37p682e[/] > no -p -o udp_sendspace=65536 Setting udp_sendspace to 65536 Setting udp_sendspace to 65536 in nextboot file Next, change udp_recvspace to the recommended configuration of 10 times udp_sendspace: root@lpar37p682e[/] > no -p -o udp_recvspace=655360 Setting udp_recvspace to 655360 Setting udp_recvspace to 655360 in nextboot file Change to tunable udp_recvspace, will only be effective for future connections Note that the –p flag retains the entries, even after a reboot. It appends the updated values in the etc/tunables/nextboot stanza file. Regarding the TCP parameters for higher-speed adapters, there is no problem setting tcp_sendspace to twice the value of tcp_recvspace. These are good settings. Two other important workload parameters of the no command are rfc1323 and sb_max. The rfc1323 tunable enables the TCP window scaling option, which lets TCP use a larger window size. Turning on this parameter enables the best TCP performance. The sb_max tunable sets an upper limit on the number of socket buffers queued to an individual socket, controlling the amount of buffer space consumed by buffers (queued to either a sender or receiver socket). This number should usually be less than thewall and approximately four times the size of the largest value of the TCP or UDP
160 Chapter 15: Network I/O: Tuning send and receive settings. For example, if your udp_recvspace value is 655360, you can’t go wrong by doubling this to 1310720. Another useful no tunable, tcp_nodelayack, prompts TCP to send an immediate rather than a delayed acknowledgment. Although sending an immediate acknowledgment can add more overhead in some environments, it can greatly improve network performance in others. If changing this parameter does not improve performance in your environment, you can quickly change it back. Let’s also review ipqmalen. This tunable controls the length of the IP input queue. If you see an overflow counter (using netstat –s), setting a maximum length for this queue can help fix the overflow. What about Address Resolution Protocol (ARP)? When many clients are connected to the system, you might want to tune the ARP cache. You can examine the relevant statistics using netstat: root@lpar37p682e[/etc/tunables] > netstat -p arp arp: 10 packets sent 0 packets purged If you see a high purge count, increase the size of the ARP table. In the preceding example, no increase is needed. Here are the no parameters that relate to arp: root@lpar37p682e[/etc/tunables] > no -a | grep arp arpqsize = 12 arpt_killc = 20 arptab_bsiz = 7 arptab_nb = 149
Name Resolution 161 You can tune these buffers either systemwide or according to specific interfaces. To tune by interface, set the no command’s use_isno option to 1 (this option is enabled by default in AIX 5.3): root@lpar37p682e[/etc/tunables] > no -a | grep use use_isno = 1 Disabling the use_isno parameter (by setting it to 0) can serve as a diagnostic tool of sorts by setting the buffer values across the board to help isolate performance problems. When these values are set for the specific interfaces, they actually override the default value in the no view, which can sometimes confuse system administrators. You can view specific interface settings using either ifconfig or lsattr: # ifconfig en0 en0: flags=1e080863,480<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST, GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),CHAIN> inet 172.29.135.44 netmask 0xffffc000 broadcast 172.29.191.255 tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1 In this example, look at the settings using ifconfig (see the last line, which references a couple of the tunables mentioned earlier).You can change these options (by interface) using SMIT or the chdev or ifconfig command. Note that ifconfig will not update the Object Data Manager (ODM), so on reboot, the settings will revert to their previous values. For this reason, you should use SMIT. Use the smit tcpip fastpath, and go to Further configuration > Network interfaces > Change/Show characteristics of an interface. Name Resolution Name resolution is another area that can impact performance. If you know how you want to resolve names (using either DNS or the hosts file), make sure name resolution is set up correctly in the /etc/netsvc.conf file. If you’re using DNS, take out the local if you are not using a hosts file at all, or leave it in if you are using it as a backup to DNS (but make it the second entry). If you’re not using DNS, remove the bind because it will slow
162 Chapter 15: Network I/O: Tuning performance by first trying (if it is the first entry in the record) to resolve using a name server that doesn’t exist. Maximum Transfer Unit The maximum transfer unit (MTU) is defined as the largest packet that can be sent over a network. The size depends on the type of network. For example, 16-bit token-ring has a default MTU size of 17,914, while Fiber Distributed Data Interface (FDDI) has a default size of 4,352. Ethernet’s default size is 1,500 (or 9,000 with jumbo frames enabled). Larger packets mean fewer packet transfers, which results in higher bandwidth utilization on your system. An exception to this rule is if your application prefers smaller packets. If you’re using a Gigabit Ethernet, you can use a jumbo frames option. To support the use of jumbo frames, your switch must be configured accordingly. To change to jumbo frames, use the smit device fastpath and go to Communication > Ethernet > Adapter > Change > Show characteristics of an Ethernet adapter. You can make the change from there. Tuning: Client The biod daemon plays an important role in connectivity. While biod self-tunes the number of threads (the daemon process creates and kills threads as needed), you can adjust the maximum number of biod threads, depending on the overall load. An important concept to understand here is that increasing the number of threads alone will not alleviate performance problems caused by CPU, I/O, or memory bottlenecks. For example, if your CPU is near 100 percent utilization, increasing the number of threads won’t help you at all. Increasing the number of threads can help when multiple application threads access the same files and you don’t find any other types of bottlenecks. Using the lsof command can help you further determine which threads are accessing which files. From earlier tuning sections, you might remember the Virtual Memory Manager parameters minperm and maxperm. Unlike when you tune database servers, with NFS you want to let the VMM use as much RAM as possible for NFS data caching. Most NFS clients have little need for working segment pages. To ensure that all
Tuning: Client 163 memory is used for file caching, set both maxperm and maxclient to 100 percent: root@lpar24ml162f_pub[/tmp] > vmo -o maxperm%=100 Setting maxperm% to 100 root@lpar24ml162f_pub[/tmp] > vmo -o maxclient%=100 Setting maxclient% to 100 Note that in the event that your application uses databases and could benefit from performing its own file data caching, you should not set maxperm and maxclient to 100 percent. In this situation, set these numbers low and mount your file systems using concurrent I/O over NFS. NFS maintains caches on each client system that contain attributes of the most recently accessed files and directories. The mount command controls the length of time that these entries are kept in cache. The mount parameters you can change include the following: acdirmin, acdirmax, acregmin, acregmax, and actime. For example, the acregmin parameter specifies the minimum length of time after an actual update that file entries will be retained. When a file is updated, its removal from cache depends on this parameter’s value. Using the mount command, you can also specify whether you want a hard or soft mount. With a soft mount, if an error occurs, it is reported immediately to the requested program; with a hard mount, NFS keeps retrying. These retries themselves could lead to performance problems. From a reliability standpoint, hard mounting read and write directories is recommended to prevent possible data corruption. Mount parameters rsize and wsize define the maximum sizes of RPC packets for read and write directories, respectively. The default value is 32,768 bytes. With NFS 3 and 4, if your NFS volumes are mounted on high-speed networks, you should increase this setting to 65,536. On the other hand, if your network is extremely slow, you might think about decreasing the default to reduce the amount of packet fragmentation by sending shorter packets. However, if you do decrease the default, more packets will need to be sent, which could increase overall network utilization.
164 Chapter 15: Network I/O: Tuning Understand your network, and tune it accordingly! Tuning: Server Before examining specific NFS parameters, always try to decrease the load on the network while also looking at your CPU and I/O subsystems. CPU bottlenecks often contribute to what appears to be an NFS-specific problem. For example, NFS can use either TCP or UDP, depending on the version and your preference. Make sure your tcp_sendspace and tcp_recvspace tunables are set to values higher than the defaults because this can have an impact on your server by increasing network performance. You tune these values with the no command: root@lpar24ml162f_pub[/tmp] > no -a | grep send ipsendredirects = 1 ipsrcroutesend = 1 send_file_duration = 300 tcp_sendspace = 1638 udp_sendspace = 9216 root@lpar24ml162f_pub[/] > no -o tcp_sendspace=524288 Setting tcp_sendspace to 524288 Change to tunable tcp_sendspace, will only be effective for future connection If you are running Version 4 of NFS, make sure you turn on nfs_rfc1323. Doing so allows for TCP window sizes greater than 64K. Set this value on the client as well. root@lpar24ml162f_pub[/] > no -o rfc1323 Setting rfc1323 to 1 As an alternative, you can set the rfc1323 tunable using the nfso command, which manages the NFS tuning parameters: root@lpar24ml162f_pub[/] > nfso -o nfs_rfc1323=1 Setting nfs_rfc1323 to 1
165 Tuning: Server Setting rfc1323 with nfso configures the TCP window to affect only NFS (as opposed to no, which applies this setting across the board). If you have already set this option with no, you don’t need to change it, although you might want to in case some other Unix administrator decides to play around with the no command. Similar to the client, if the server is a dedicated NFS server, make sure you tune your VMM parameters accordingly. Modify maxperm and maxclient to 100 percent to make sure the VMM controls the caching of the page files, using as much memory as possible in the process. On the server, tune nfsd, which is multithreaded, the same way you tuned biod. (Other daemons you can tune include rpc.mountd and rpc.lockd.) Like biod, nfsd self-tunes, depending on the load. Increase the number of threads using the nfso command. One parameter to check is nfs_max_ read_size, which sets the maximum size of RPCs for read replies. Look at what nfs_max_read_size is set to below: root@lpar24ml162f_pub[/tmp] > nfso -L nfs_max_read_size NAME CUR DEF BOOT MIN MAX UNIT TYPE DEPENDENCIES --------------------------------------------------------------------------nfs_max_read_size 32K 32K 32K 512 64K Bytes D Let’s increase it to 64K (using bytes): root@lpar24ml162f_pub[/tmp] > nfso -o nfs_max_read_size=65536 root@lpar24ml162f_pub[/tmp] > nfso -L nfs_max_read_size NAME CUR DEF BOOT MIN MAX UNIT TYPE DEPENDENCIES --------------------------------------------------------------------------nfs_max_read_size 64K 32K 32K 512 64K Bytes D We just changed nfs_max_read_size to the maximum value allowed. If you want to keep the new values, add your changes to the /etc/tunables/ nextboot file so that the settings will remain changed after a reboot. The nfso offers additional parameters you can modify. To list them all, use the –a or –L flag.

Section V Summary, Tips, and Quiz Summary ● ● ● ● ● ● ● The OSI model consists of the following layers: physical, data link, network, transport, session, presentation, and application. The AIX TCP/IP layers correlate to the layers in the OSI. The maximum transfer unit (MTU) is the largest packet that can be sent over a network. Ethernet has a default size of 1,500 (or 9,000 with jumbo frames enabled). The lscfg command should be used to obtain information about firmware. The netstat command is one of the most common commands you will use to monitor your system. The entstat command is very similar; you use it to display device driver statistics. The netpmon command provides information about CPU usage as it relates to the network. The command starts a trace and runs in the background, providing an overview of network activity and capturing data for trending and analysis. It can also provide information to monitor read and write subroutines for Network File System (NFS) clients and servers. is the monitoring tool you will use to display information about NFS and remote procedure calls (RPCs). Other commands for NFS include netpmon, nfs, nfs4cl, nfsstat, nmon, and topas. nfstat The nfs4cl command provides NFS 4 statistics and properties. You can also tune using this command. One option is to set the timeout value for RPC calls to the server.
168 Section V: Summary, Tips, and Quiz ● ● ● ● ● ● ● You use the mount and nfsmo commands to tune NFS parameters. Use mount to tune server-based resources. The netstat command lets you monitor and troubleshoot network packet issues. You use the no command to tune the network subsystem. tcp_sendspace and udp_sendspace are important no parameters you should examine. Setting up DNS improperly can cause performance problems because you may not be resolving names correctly. Virtual Ethernet, supported on AIX 5.3 on POWER5, supports interpartition- and IP-based communications between logical partitions on the same frame. This functionality is accomplished through the use of a virtual I/O switch. Shared Ethernet, one of the features of Advanced Power Virtualization (APV) or PowerVM, enables the use of virtual I/O Servers (VIOs), letting several host machines share a single physical network adapter. The iptrace command records all the packets received from the network interfaces. The ipreport command formats the data generated from iptrace into a readable trace report. You can also use ipfilter to sort the output file created from ipreport. The tcpdump command prints the headers of packets captured for each network interface card (NIC). One important difference with tcpdump is that, unlike iptrace, it can look at only one network interface at a time. And because iptrace examines the entire packet from the kernel space, its results can offer lots of dropped packets. With tcpdump, you can also limit the amount of data to be traced. Tips ● ● Although netstat is very useful, it is really not a monitoring tool in the sense that vmstat and iostat are. You can use other, more suitable tools to help monitor your network subsystem. Ethernet has a default size of 1,500 (9,000 with jumbo frames enabled). Larger packets require fewer packet transfers, resulting in higher bandwidth utilization on your system. An exception to this rule is if an
Tips 169 application prefers smaller packets. If you’re using a Gigabit Ethernet, you can use the jumbo frames option. To support the use of jumbo frames, your switch must be configured accordingly. ● ● ● ● ● ● Virtualization is a wonderful thing, but be careful not to share too many adapters from your VIO server, or you might pay a large network I/O penalty. Use the appropriate monitoring tools so you’ll know whether you have a problem. From a client perspective, NFS file systems use disks that are remotely attached. Anything that affects the performance of the mounted disk will affect the performance of the NFS clients. The maximum number of biod threads that can be tuned depends on the overall load. Increasing the number of threads alone won’t alleviate performance problems caused by CPU, I/O, or memory bottlenecks. For example, if your CPU is near 100 percent utilization, increasing the number of threads won’t help at all. Increasing the threads can help when multiple application threads access the same files and you don’t find any other types of bottlenecks. The lsof command can help you determine which threads are accessing which file. With NFS, you want to let the Virtual Memory Manager use as much RAM as possible for NFS data caching. Most NFS clients have little need for working segment pages. To ensure that all memory is used for file caching, set both maxperm and maxclient to 100 percent. The rfc1323 tunable enables the TCP window scaling option, which lets TCP use a larger window size. Turn on this option to enable the best TCP performance. If you’re using DNS, take out the local if you are not using a hosts file at all, unless you are using it as a backup to DNS (in that case, make it the second entry). If you’re not using DNA at all, take out the bind because it will only slow your performance by trying (if it is the first entry in the record) to resolve using a name server that doesn’t exist.
170 Section V: Summary, Tips, and Quiz Quiz Multiple Choice 1. entstat is used to a. Tune the Ethernet controller b. Display device driver statistics c. Provide ARP information d. There is no such command. 2. What does the rfc1323 tunable do? a. It provides information about mbufs. b. It’s a generic text file. c. It enables the TCP window scaling option, letting TCP use a larger window. d. It relates to UDP and increasing the packet size dynamically. 3. What is the packet size with jumbo frames enabled? a. 1,500 b. 9,000 c. 90,000 d. 450 4. netpmon is used for what purpose? a. To provide information about CPU usage as it relates to the network b. To provide information about RAM usage as it relates to the network c. To trace zombie processes d. To waste time
True or False 171 5. If you want to keep network values after a reboot, which file must you update? a. /etc/config b. /bin/tunablesc c. /etc/tunables/nextboot d. /tmp/configuration/tunable 6. Which of the following tunables controls how much buffer space in the kernel is used to buffer application data? a. tcp_send b. tcp_sendtrace c. tcpsendtcpip d. tcp_sendspace 7. Which command lets you tune udp_sendspace? a. ioo b. no c. nfso d. netstat True or False 8. Shared Ethernet, supported on AIX 5.3 on POWER5, allows for interpartition- and IP-based communications between logical partitions on the same frame. This is done by the use of a virtual I/O switch. 9. NFS can use either TCP or UDP, depending on the version and your preference.
172 Section V: Summary, Tips, and Quiz Fill in the Blank 10. Which command is not used for NFS: nmon, nfsstat, nfs, nfs4cl, netpmon, or nfsfr? ___________________________________________
Section VI Bonus Topics Just when you thought you understood performance tuning on AIX, here comes AIX 6.1 to throw you a curveball! In this section, we discuss upto-date information about the recent changes to performance monitoring and tuning in AIX 6.1, including CPU, virtual memory, and I/O (disk and network). We also review AIX performance tuning as it relates to Oracle. The last chapter provides an overview of systems performance when running Linux on Power (LoP).

C h a p t e r 16 AIX 6.1 Many of the changes in AIX 6.1 are really less about kernel innovations and more about ancillary features, such as improving default parameters to more accurately reflect real-world data processing. Other enhancements include restricted tunables, unique tunable documentation (a useful feature that provides help messages via the new –h option for the tunable commands, including ioo, nfso, no, raso, schedo, and vmo), and various other improvements to certain subsystems. Introduction AIX 6.1 provides many important innovations and improvements, including enhancements in the following categories: ● ● ● ● Virtualization — Features such as workload partitioning and Live Application Mobility Security — Features such as encrypted file systems, trusted AIX, and role-based access control (RBAC) Availability — Features such as AIX concurrent updates and dynamic tracing Manageability — Features such as the new Systems Director Console for AIX and the Workload Partition Manager
176 Chapter 16: AIX 6.1 The 6.1 release also provides support for POWER6 performance innovations, such as advanced simultaneous multithreading, shared dedicated processors, and variable page size. It’s important to fully understand which innovations and enhancements are a reflection of POWER6, AIX 6.1, or a combination of both. For example, from a purely operating system perspective, AIX improves on the older tunable defaults for the aio, ioo, nfso, no, schedo, and vmo commands. Although AIX 6.1 includes some real performance enhancements, such as improvements in I/O pacing and AIX’s implementation of asynchronous I/O (AIO) servers, I must say that there is nothing breathtakingly different. In fact, IBM made more performance changes from AIX 5.1 to AIX 5.2 and from 5.2 to 5.3 (including new monitoring and tuning tools, new tunables that changed how you set Virtual Memory Manager settings, and concurrent I/O improvements) than you will see in moving from AIX 5.3 to AIX 6.1. In AIX 6.1, all the tuning commands remain the same (except for those that have been taken away, such as aioo), and there are no new monitoring tools. Other changes reflect updates made to the utilities to reflect support for other workload partitioning innovations; the updated utilities include curt, filemon, iostat, netpmon, pprof, procmon, proctree, svmon, topas, tprof, and vmstat. Workload partitions (WPARs) enable the use of separate virtual partitions within one AIX image. This feature is more of a complement to logical partitions (LPARs) than a replacement for them. WPARs actually run inside LPARs and are similar in concept to Solaris containers. I’ve built WPARs in less than 15 minutes. In fact, we’ll do some of our analysis inside WPARs so that you can actually view some of the updated tools that now support WPARs. Note that WPARs are possible only in AIX 6.1, and a POWER6 is not necessary. Some commands also run differently or don’t run at all within WPARs; we’ll discuss a few of these where applicable. Memory Through the years, many have complained about the default VMM parameters of AIX. The complaints have been that the default parameters just haven’t reflected the reality of what most users run on top of AIX — for
Memory 177 example, mission-critical database applications such as Oracle. Because of this, systems administrators have had to change the default settings on many subsystems — most notably those related to virtual memory (i.e., minperm and maxperm). IBM engineering has listened and in AIX 6.1 has changed the parameters to reflect that reality. Note that you shouldn’t rely exclusively on these settings. Further, always check with your ISV to verify its recommended settings for AIX 6.1; then make changes accordingly. The most important changes to default settings were made to address paging issues, where database servers frequently page out computational pages even though the system has enough free memory. In the AIX 5.3 memory tuning discussion, I recommended changing the relevant parameters to defaults fairly close to what was indicated on the table. The changes are indicated in the AIX 5.3 tuning recommendation column. In AIX 6.1, IBM now classifies many tunables as “restricted” in an attempt to discourage junior administrators from changing certain parameters deemed critical enough to be classified as restricted. The net is that you can change only 29 vmo tunables without receiving a firm warning; 30 others are now deemed restricted tunables, which IBM officially states should not be modified unless instructed by “IBM support professionals.” A new vmo flag, –F, lets you view all the parameters, including the restricted ones. The following snippet of content includes an example from the restricted section. # vmo -F -a force_relalias_lite = 0 vmm_default_pspa = -1 ##Restricted tunables maxperm% = 90 Even restricted tunables can be changed. If you make such a change, you just receive a stern warning: # vmo -o maxperm%=99 Setting maxperm% to 99 Warning: a restricted tunable has been modified
178 Chapter 16: AIX 6.1 When a restricted parameter is changed after a reboot, you’ll receive a further rebuke and be asked to confirm whether you really want to make the change. You’ll have to physically type in “yes” to reply. The most important out-of-the-box performance changes related to memory include new values for minperm, maxperm, maxclient, and strict_maxclient. This update is a continuation of changes that first appeared in AIX 5.3, when you no longer had to turn off strict_maxclient, increase minfree and maxfree, or reduce minperm, maxperm, and maxclient. The new recommendation (now incorporated as the default value in AIX 6.1) is to turn off the repage ratio check (lru_file_repage) to ensure that working storage is not paged and to consider only file paging. In AIX 6.1, the VMM replacement default is changed to use up to 90 percent of its real memory for file caching, favoring computational pages over file pages. Unless the amount of active virtual memory exceeds 97 percent of the size of real memory, minperm is reduced to 3 percent to ensure that computational pages will not be stolen. Let’s try changing it: # vmo -o minperm%=97 Value of the tunable minperm% cannot be changed in a WPAR As you can see from the error message, some changes will not work in WPARs. WPARs are a subset of an LPAR, but they are still part of the single operating system image. Another important change includes VMM dynamic variable page size support (VPSS). Pages are defined as fixed-length data blocks held in virtual memory. In AIX 6.1 (on POWER6 processors only), VMM can now dynamically use the larger page size based on application memory usage, which should substantially improve performance. This feature is completely transparent to applications. AIX uses the larger page size only if doing so does not result in increasing process memory usage. The use of larger pages improves performance because fewer hardware address translations need to be made. This feature is supported only for working storage memory, not persistent storage. The new parameter is vmm_default_pspa (it works in conjunction with the existing vmm_mpsize_support tunable).
iSCSI 179 Let’s view the tunable setting for VPSS: # vmo -a | grep pspa vmm_default_pspa = -1 CPU In AIX 6.1, only 27 of the schedo command’s 42 CPU-related tunables are restricted, leaving 15 parameters that you can modify without explicit warnings. Although some defaults have changed, no substantial changes have been made with respect to CPU monitoring and tuning in AIX 6.1. Disk I/O Of the 48 tunables you can control with the ioo I/O tuning command, 27 are now restricted, leaving 21 that you can modify without explicit warnings. The most important changes affect I/O pacing and AIO dynamic tunables. JFS2 AIX 6.1 brings changes to the Enhanced Journaled File System (JFS2) that let you mount a JFS2 file system without logging. Although this capability can improve performance substantially, I don’t recommend implementing it. If you do so and then at some point need to recover your data, you’ll have to use the dreaded fsck command, which has been pretty much banished from memory since the advent of journaling file systems. Circumstances in which the capability might come in handy include restoring data from backups and saving time during an activity where you might have a very small window and availability is not a concern. iSCSI The target software driver can now be used over a Gigabit Ethernet adapter, which should improve performance in this type of environment. The target driver exports local disks or logical volumes to Internet Small Computer System Interface (iSCSI) initiators that connect to AIX using the iSCSI protocol. The proliferation of iSCSI represents a viable alternative to fiber-based storage, making this an important enhancement.
180 Chapter 16: AIX 6.1 I/O Pacing Disk I/O pacing is a mechanism that lets you limit the number of pending I/O requests to a file, thereby preventing disk I/O-intensive processes (usually in the form of large sequential writes) from exhausting the CPU. AIX 6.1 enables I/O pacing by default. In AIX 5.3, you must explicitly enable this feature. The new defaults set the sys0 settings of the minpout and maxput parameters to 4096 and 8193, respectively. Asynchronous I/O AIO is an AIX software subsystem that permits processes to issue I/O operations without waiting for I/O to finish. Because I/O operations and application processing operate concurrently, they essentially run in the background and improve performance. This advantage is particularly important in a database environment. There are two types of AIX subsystems: Legacy AIO and POSIX AIO. The differences between them involve different parameter passing at the application layer. In other words, the developers pick the implementation that the application uses. Regardless of which subsystem is chosen, both run concurrently on AIX. In AIX 5L, if applications use AIO, the subsystem must be explicitly activated in the autoconfig parameter. The system also requires a reboot because the kernel extensions must be loaded. In fact, any release before AIX 5.3 TL_5 requires reboots if any changes are made to the following tunables: maxreqs, maxservers, and minservers. AIX 5.3 provided the aioo command, which lets you make these changes dynamically without a reboot (decreasing required reboots). This command does not change the Object Data Manager (ODM) attributes, meaning that changes will not persist across a reboot. In AIX 6.1, the tunables fastpath and fsfastpath are now restricted and are set to 1 by default. The new setting has the following effect on the tunables: ● ● — AIO requests that raw logical volumes be passed directly to the disk layer. fastpath — AIO requests that files opened with concurrent I/O on JFS2 be passed directly to the Logical Volume Manager or to disk. fsfastpath
Asynchronous I/O 181 ##Restricted tunables aio_fastpath = 1 aio_fsfastpath = 1 Further, AIO subsystems are now loaded by default and not activated. They are started automatically when the application initiates the AIO I/O requests. AIX 6.1 no longer provides the aioo command (what a short life span), and these tunables are now used only with ioo. The old method (AIX 5.3): # # aioo -a minservers = 1 maxservers = 1 maxreqs= 4096 fsfastpath = 0 The new method with AIX 6.1: # ioo -a | grep active aio_active = 0 posix_aio_active = It’s worth noting that there are no more AIO devices in the ODM. Two new parameters have also been added to ioo: aio_active and posix_aix_active. These settings can be changed only by AIX, and they are set to 1 only when AIO kernel extensors are used and pinned. If you do a grep, you won’t find any more AIO servers. You’ll now see aioLpools and aioPpools, the kernel processes that manage the AIO subsystems for Legacy and POSIX. As a result of this change, there is less pinned memory and fewer processes running on the system; both have positive effects on overall systems performance. Here’s a look at the new AIO kernel processes: # pstat -a | grep aio 39 a 2704e 1 2704e 40 a 28050 1 28050 0 0 0 0 1 1 aioLpool aioPpool
182 Chapter 16: AIX 6.1 The minserver and maxserver parameters, as they relate to AIO servers, are now tuned per each CPU tunable. Changing these values will not result in changes to the number of available servers on the system; that number depends on the number of concurrent I/O requests. The following shows the new default values for minservers and maxservers: # ioo -a | grep minservers aio_minservers = 3 posix_aio_minservers = 3 # ioo -a | grep maxservers aio_maxservers = 30 posix_aio_maxservers = 30 Network Of the 133 no command tunables, IBM has classified only these five as restricted: #no -F -a ##Restricted tunables extendednetstats inet_stack_size net_malloc_police pseintrstack use_isno = = = = = 0 16 16384 24576 1 A new network caching daemon has also been introduced to improve performance when resolving names using Domain Name Server (DNS). You can start this daemon from the System Resource Controller (SRC). Its main configuration file is /etc/netcd.conf, and you can copy the one in /usr/samples/tcpip to /etc and use that as a template. The command used to manage the daemon is netcdctrl. With this command, you can dump the cache contents to a file, display cache usage statistics, flush the cache table, and change the logging level of the daemon.
NFS 183 Regarding the /etc/netsvc.conf file, nothing has changed; this file is still necessary in determining the order of resolving. NFS Of the 24 Network File System (NFS) tunables, IBM has classified 21 as restricted. The only noteworthy change here is that RFC1323 (on the TCP/ IP stack) is now enabled by default, letting TCP connections use the TCP scaling window for any NFS connections. The default number of biod daemons has also increased to 32 for each NFS Version 3 mount point.

Section VI Chapter 16 Quiz Multiple Choice 1. Which version of AIX started restricting tunables? a. 4 b. 5 c. 5 d. 6.1 2. What has replaced AIO servers? a. I/O pacing b. Lpools and Ppools c. aioLpools and aioPpools d. mbuf 3. Which command has been taken away in AIX 6? a. vmtune b. ioo c. aioo d. aix
186 Section VI: Chapter 16 Quiz 4. Why did IBM institute restricted tunables? a. To decrease cache that was taking up space. b. No reason. c. To make things harder for people. d. To discourage junior administrators from changing parameters. 5. In AIX 6.1 (on POWER6 processors only), VMM can now dynamically use the larger page size based on application memory usage. Why is this enhancement important? a. It will increase availability. b. It should improve performance. c. It will decrease performance. d. It will help in DLPAR operations. 6. AIX 6.1 introduces changes to the Enhanced Journaling File System (JFS2) that let you mount a JFS2 file system without logging. What is the effect of doing so? a. It increases performance while possibly decreasing availability. b. It decreases performance while increasing availability. c. It increases performance and reliability. d. No change. 7. Which is the vmo flag that provides all parameters, including restricted ones? a. –l b. –v c. –o d. –F
Fill in the Blank 187 True or False 8. AIX 6.1 enables I/O pacing by default. In AIX 5.3, you need to explicitly enable this feature. 9. Disk I/O pacing is a mechanism that lets you limit the number of pending I/O requests to a file. Fill in the Blank 10. Explain the purpose of netcdctrl: _______________________________________

C h a p t e r 17 Tuning AIX for Oracle This chapter provides an overview of running Oracle on AIX. We’ll drill down into the many aspects of tuning AIX to run Oracle, examining memory, CPU, and I/O (both disk and network). We’ll discuss in detail the Virtual Memory Manager and the tuning commands used to tune memory for Oracle. I’ll go over some of the tools you can use to analyze bottlenecks and make changes to the system. Last, we’ll look at a couple of Oracle tools that can help you with performance tuning. Because many of the AIX tuning commands and parameters have changed in recent years, Oracle has changed, too. Changes have also been made to tools such as the Oracle Enterprise Manager (OEM). As you’ll see, this important utility is one you should definitely add to your repertoire and take the time to learn. Memory As we discussed in earlier chapters, the AIX Virtual Memory Manager services all memory requests from the system, not just virtual memory. When RAM is accessed, the VMM must allocate space even when plenty of physical memory remains on the box. This point confuses both DBAs and systems administrators at times. The VMM works using a process called early allocation of paging space by partitioning segments into pages. These pages can be either RAM or paging space (virtual memory stored on disk). At the same time, it
190 Chapter 17: Tuning AIX for Oracle maintains a free list of unallocated page frames, which are used to satisfy page faults. The VMM’s page-replacement algorithm assigns page frames and determines exactly which virtual memory pages currently in RAM will have their page frames brought back to the free list. The AIX operating system will use all available memory, other than memory that is configured to be unallocated — in other words, the free list. Obviously, administrators prefer to use physical memory rather than paging space when the physical memory is available. VMM classifies memory segments into two categories: persistent segments and working segments. Persistent segments use file memory, and working segments use computational memory. What does this mean to us? It’s the computational memory that is used when your SQL queries access the Oracle database. These are working segments. They have no real permanent location and will terminate when the process is completed. On the other hand, file memory uses persistent segments that do have permanent locations on the disks. Persistent segments remain in memory, usually until the pages are stolen or the database is recycled. Again, you want the file memory paged to disk but not the computational memory. How do you tune the system? One critical parameter is the Translation Lookaside Buffer (TLB). Applications such as Oracle exploit a tremendous amount of virtual memory, so by using large pages you can increase performance substantially. Increasing the size of the TLB lets the system map more virtual memory, resulting in a lower miss rate for applications, such as Oracle, that use a lot of virtual memory. This category includes both online transaction processing and data warehouse applications. Oracle employs large pages for its System Global Area (SGA) because it is the SGA that really dominates virtual memory. To reiterate, in AIX 5.3 and later releases, you use vmo to tune; earlier releases used vmtune. The following vmo command uses the lgpg_size and lgpg_regions parameters to allocate 16,777,216 bytes to provide large pages, with 256 actual large pages: # vmo -r -o lgpg_size=16777216 lgpg_regions=256
Memory 191 At the same time, with Oracle Database 10g, make sure the LOCK_SGA Oracle initialization parameter is set to TRUE so that Oracle requests large pages when allocating shared memory. By far, the two most important vmo settings are minperm and maxperm. These parameters determine whether the system favors computational memory or file memory. The first thing to do here is make sure the lru_file_ repage parameter is set to 0. This parameter, which was introduced in ML1 of AIX 5.3, determines whether the page-stealing algorithm should consider VMM repage counts and dictates the type of memory it should steal. The default value for lru_file_repage is 1, so we need to change this setting using vmo: # vmo -o lru_file_repage=0 Setting lru_file_repage to 0 Setting lru_file_repage to 0 tells the VMM that you want to steal only file pages and not computational pages. Because this behavior will change if numperm is less than minperm or greater than maxperm, we should also set maxperm high and minperm very low. (Years ago, before the introduction of the lru_file_repage parameter, we used to make maxperm low. If you did this now, you would stop the application caching programs that are currently running.) Let’s change the relevant parameters: # vmo -p -o minperm%=5 # vmo -p -o maxperm%=90 # vmo -p -o maxclient%=90 You also want to take a look at minfree and maxfree. When the pages on the free list fall below minfree, the VMM will start to steal pages, something you don’t want to have happen until you’ve beefed up the free list by upping the number in maxfree. Use these values: vmo -p -o minfree=960 vmo -p -o maxfree=1088
192 Chapter 17: Tuning AIX for Oracle CPU Let’s start our discussion of CPU performance and Oracle with symmetric multithreading (SMT). This important POWER5 innovation provides the ability for a single physical processor to concurrently dispatch instructions from several hardware threads. In AIX 5L Version 5.3, a dedicated partition created with one physical processor is configured as a logical two-way by turning on SMT. With Oracle, you should always have SMT on: # smtctl This system is SMT capable. SMT is currently enabled. SMT boot mode is not set. SMT threads are bound to the same virtual processor. proc0 has 2 SMT threads. Bind processor 0 is bound with proc0 Bind processor 1 is bound with proc0 A couple other important concepts to keep in mind: ● ● Processor affinity lets processes run on specific processors. You can actually correlate specific processes with running processes. The nice and renice commands change the priority of running processes. It is not recommended to renice Oracle processes. Asynchronous I/O Servers Asynchronous I/O (AIO) determines whether Oracle waits for I/O to complete before starting new processing. What AIO does is let the system continue processing while I/O completes in the background. Performance improves significantly because processes can run at the same time that I/O is going on. However, if tuned improperly, AIO can significantly degrade the overall performance of writes on the I/O subsystem. You can use the iostat or nmon command to monitor the AIO subsystem. Let’s fire up iostat:
Concurrent I/O 193 # iostat -A 1 5 System configuration: lcpu=2 drives=2 ent=0.25 paths=2 vdisks=2 aio: avgc avfc maxgc maxfc maxreqs avg-cpu: %user %sys %idle %iowait %physc %entc 0 0 312 0 4096 3.1 7.1 89.8 Disks: %tm_act Kbps tps Kb_read Kb_wrtn hdisk1 0.0 0.0 0.0 0 0 hdisk0 0.0 0.0 0.0 0 0 0.0 0.0 16.7 The following parameters are used to monitor the AIO subsystem for the specified interval: ● avgc — Average global AIO request count per second ● avfc — Average fastpath request count per second ● ● ● maxgc — Maximum global AIO request count since the last time this value was fetched maxfc — Maximum fastpath request count since the last time this value was fetched maxreqs — Maximum number of AIO requests allowed In the preceding example, AIO servers are not a system bottleneck. Concurrent I/O Concurrent I/O (CIO), introduced in AIX 5.2, is an important system capability that you should use in your Oracle environment. Similar to its predecessor, direct I/O, CIO lets file system I/O bypass the VMM and transfer data directly to disk from the user’s buffer. CIO permits multiple threads to read and write data concurrently to the same file, letting users read and write simultaneously. To turn on CIO, mount your file systems using the cio flag: # mount -o cio /orafilesystem Elements to consider when using CIO include:
194 Chapter 17: Tuning AIX for Oracle ● ● ● Raw devices — Although some Oracle DBAs like to create raw logical volumes for their data (and there is little argument about the performance benefit of doing so), in most cases this functionality is too difficult to administer, and I’ve found that the Unix administrators can talk the Oracle DBAs out of this one. With the advent of CIO, I would not use raw logical volumes unless performance is the driving factor behind everything you’re doing and you have a staff that can manage the complexities inherent in this type of environment. Spreading the wealth — The more spindles you have, the more you should spread your wealth around. The more adapters you have, the more your performance will also increase. In addition, try to keep indexes and redo logs off the same volumes as your data. Storage area network (SAN) — Make sure you spend time looking at your SAN. Optimizing the hardware will help you more than anything you can do at the operating system level. Oracle Tools Let’s look now at two Oracle-specific tools that can help you with your AIX administration. Statspack Statspack is an Oracle performance diagnosis tool that I highly recommend Unix administrators learn to use. Once you have it set up and configured, which you do using SQL after Oracle is installed, it’s not that complicated to use. Statspack provides two basic collection options: level and threshold. The level parameter controls the type of data collected from Oracle. The threshold parameter acts as a filter for the collection of SQL statements into the status summary tables. To install Statspack, simply log on to the system as Oracle, start up sqlplus, and then follow the steps as instructed:
Oracle Enterprise Manager SQL*Plus: Release 10.1.0.2.0 - Production on Sun May 18 Copyright (c) 1982, 2004, Oracle. 19:21:21 195 2008 All rights reserved. Enter user-name: system as sysdba Enter password: Connected to: Oracle Database 10g Enterprise Edition Release 10.1.0.2.0 - 64bit Production With the Partitioning, OLAP and Data Mining options SQL> execute SQL> @?/rdbms/admin/spcreate Choose the PERFSTAT user’s password ----------------------------------Not specifying a password will result in the installation FAILING Oracle Enterprise Manager choose the Temporary tablespace for the PERFSTAT user ----------------------------------------------------Below is the list of online tablespaces in this database which can store temporary data (e.g., for sort work areas). Specifying the SYSTEM tablespace for the user’s temporary tablespace will result in the installation FAILING, as using SYSTEM for workareas is not supported. Choose the PERFSTAT user’s temporary tablespace. Oracle Enterprise Manager The Oracle Enterprise Manager (OEM) is a very useful and productive tool that I’ve used for years. To use this Web-based utility, you need to make sure you let it run when installing Oracle or creating a database using the Oracle dbca utility. After the database is created, turn on OEM with this command: $ emctl start dbconsole Then enter the following in your browser to access the tool: http://lpar21ml16ed_pub:5505/em
196 Chapter 17: Tuning AIX for Oracle There is so much you can monitor and tune within OEM that whole books exist on this utility. If you are working in an Oracle environment, this is a must-use system tool. Figure 17.1 shows the graphical OEM display. Figure 17.1: Oracle Enterprise Manager
Section VI Chapter 17 Quiz Multiple Choice 1. What does the following command do: emctl start dbconsole? a. Starts VMM b. Starts OEM c. Shuts off the dbservice d. Brings up kernel tuning mode 2. Which of the following was introduced in ML1 of AIX 5.3 and determines whether the VMM repage counts are considered? a. lru_file_repage b. vmm c. LOCK_SGA d. Translation Lookaside Buffer 3. You can monitor the AIO subsystem by using either iostat or which of the following? a. vmstat b. svm c. sar d. nmon
198 Section VI: Chapter 17 Quiz 4. Processor affinity enables processes to run a. On specific processors. b. Within the SGA. c. In the hypervisor. d. There is no such term. 5. Increasing the size of which of the following buffers lets the system map more virtual memory, resulting in a lower miss rate for applications, such as Oracle, that use a lot of virtual memory? a. Inode b. Memory Buffer c. SGA Buffer d. Translation Lookaside Buffer True or False 6. Statspack is an Oracle performance diagnosis tool. 7. AIO servers let the system continue processing while I/O completes in the background. 8. The LOCK_SGA Oracle initialization parameter should be set to TRUE so that Oracle requests large pages when allocating shared memory. 9. Direct I/O is more recent than concurrent I/O. Fill in the Blank 10. What is the command to turn on CIO? ___________________________________________
C h a p t e r 18 Linux on Power This chapter provides an overview of systems performance when running Linux on Power (LoP). Monitoring AIX administrators will be happy to know that the nmon command works great with Linux. SystemTap, which conducts performance analysis by analyzing a running kernel, also runs on the platform. Two other popular tools, iostat and sar, are also available on Linux systems. While tools aren’t the focus here, it’s nice to know these options are available. For our first monitoring example, let’s inspect some basic Linux configuration files. The /proc file system is one you should use frequently, much more so than on AIX boxes because the information is simply more valuable in Linux. A lot of kernel and process information resides here, in the form of configuration files. One such file is cpuinfo: [root@172_29_140_173 proc]# more cpuinfo processor : 0cpu : POWER5 (gr)clock : 1654.344000MHzrevision : 2.1 (pvr 003a 02001) It’s easy to see from this file that you’re using a POWER5 system. Next, we can look at the release text file to determine the operating system level:
200 Chapter 18: Linux on Power [root@172_29_140_173 etc]# more redhat-release Red Hat Enterprise Linux Server release 5.2 (Tikanga) This box is running Red Hat Enterprise Linux 5 (RHEL5) on a POWER5 partition. Handy Linux Commands One of my favorite Linux commands is top. This command provides realtime information quickly, in a character-based display, including the processes that are consuming the most CPU time. Another useful command is free. It reports the total amount of free and used physical and swap memory: [root@172_29_140_173 etc]# free total used Mem: 2073856 2057536 -/+ buffers/cache: 226944 Swap: 0 0 free shared 16320 0 1846912 0 buffers cached 440832 1389760 Still, my favorite of all Unix/Linux performance commands is vmstat. I love this old standby because, unlike other tools, vmstat provides quickand-dirty information about all subsystems. Nothing fancy here: [root@172_29_140_173 etc]# vmstat 1 procs -----------memory---------- --swap-- ---io--- --system-- -----cpu-----r b buff cache si so bi bo in cs 0 0 16256 440832 1389760 swpd free 0 0 1 1 23 11 0 us sy id wa st 1 98 0 0 0 0 0 16384 440832 1389760 0 0 0 0 524 30 0 1 99 0 0 0 0 0 16384 440832 1389760 0 0 0 0 536 16 0 1 99 0 0 0 0 0 16320 440832 1389760 0 0 0 0 588 25 0 1 99 0 0 0 0 0 16320 440832 1389760 0 0 0 0 628 12 0 1 99 0 0 0 0 0 16320 440832 1389760 0 0 0 24 633 29 0 1 99 0 0 0 0 0 16320 440832 1389760 0 0 0 0 578 18 0 1 99 0 0 0 You will find that vmstat output on Linux differs a bit from what you see on AIX systems. Here’s a quick description of what each field means:
Virtualization ● swpd — Amount of virtual memory being used ● free — Amount of idle memory ● buff — Amount of memory used as buffers ● cache — Amount of memory used as cache 201 Virtualization Some administrators don’t take full advantage of the PowerVM capabilities of Linux. If you administer LoP the same way you do on x86 boxes, you’re doing yourself and your organization a major disservice. Some capabilities available on POWER systems include the following. ● Symmetric multithreading (SMT) ● Shared processor pool and uncapped partitions From a CPU perspective, SMT is an important feature. It lets you maximize the use of instruction sets and in some cases increase CPU performance by 30 percent. SMT enables these improvements by supporting multithreading, a capability that’s part of PowerVM and the POWER architecture. Multithreading enables two separate instruction streams to run concurrently on the same physical processor, with each thread appearing to run on its own independent logical processor. This feature is enabled by default. Through the POWER architecture, you can create Linux and AIX partitions. When creating Linux partitions, you can “uncap” them, which means that the partitions will receive unused CPU cycles from the shared processor pool over and above their entitled capacity. Other than the number of cycles left in that shared processor pool, the only limitation is the number of virtual processors configured for the profile. I recommend uncapping partitions whenever possible to maximize all available CPU resources and increase performance. From a CPU perspective, I’d also take advantage of the capability to add CPU horsepower on the fly through a dynamic LPAR (DLPAR) operation.
202 Chapter 18: Linux on Power Tuning In Linux, the sysctl command changes kernel parameters. Be advised that the method you use to change parameters may depend on your distribution; for example, you can use the Powertweak tool with Novell SUSE Linux, but it isn’t available with Red Hat. Because we’re using Red Hat here, sysctl is the choice. Let’s change some parameters. One parameter that’s changed frequently is SHMMAX, which is used to define the maximum size (in bytes) for a shared memory segment. In Oracle, you should set this value large enough for the largest System Global Area (SGA) size. Let’s examine the default parameter: # sysctl kernel.shmmax kernel.shmmax = 268435456 In this case, the limit is set to 256 MB. Let’s change this to 1 GB. To do so, use the vi command to display the /etc/sysctl.conf file. This is where you edit the value: # Controls the maximum shared segment size, in bytes kernel.shmmax = 107374182 When you view the file again using sysctl, you can see the change: # sysctl kernel.shmmax kernel.shmmax = 107374182 To make the parameter work without a reboot, issue the sysctl command with the –p parameter. On the memory side, parameters worth examining include SEMMSL, which controls the maximum number of semaphores per semaphore set; SEMMNI, which controls the maximum number of semaphore sets on the entire Linux system; and SEMMNS, which controls the maximum number of semaphores (no semaphore sets) on the entire Linux system. Another important parameter is vm.nr.hugepages. The background here is that the POWER architecture supports page sizes of 4 KB and 16 MB. The
Tuning 203 default vm.nr.hugepages setting for LoP is 4 KB, which is too small for larger databases. To enable large pages, you need to change this parameter. Let’s first view the hugepage parameter — in this case, by looking at the proc/meminfo file. # grep -i hugepages /proc/meminfo HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 Hugepagesize: 16384 kB Now, let’s allocate 130 large pages to support an SGA of approximately 2 GB: # sysctl -w vm.nr_hugepages=130 Then view the meminfo file again: [root@172_29_140_173 ~]# grep -i hugepages /proc/meminfo HugePages_Total: 41 HugePages_Free: 41 HugePages_Rsvd: 0 Hugepagesize: 16384 kB You can see that the system is already starting to use the hugepage parameter. Entire books are dedicated to Linux performance tuning. Check your application recommendations to learn how kernel parameters should be configured in your environment. Linux starts many daemons that usually aren’t needed, including autofs, cups, nfslock, sendmail, and xfs. You should turn off anything that isn’t explicitly required. You can accomplish this in several ways, but the chkconfig command is probably the best method. As an example, let’s shut down cups:
204 Chapter 18: Linux on Power <[root@172_29_140_173 ~]# chkconfig --del cups Now, let’s make sure it’s not around anymore: # chkconfig --list service cups supports chkconfig, but is not referenced in any runlevel (run ‘chkconfig --add cups’)
Section VI Chapter 18 Quiz Multiple Choice 1. What is SHMMAX used for? a. Memory b. I/O c. CPU d. Networking 2. What is the default setting for vm.nr.hugepages? a. 4 KB b. 16 KB c. 32 TB d. 1 MB True or False 3. The nmon command is available on Linux. 4. sysctl is available only on Red Hat. 5. SMT is not available for Linux. 6. The /proc file system is more useful on AIX than Linux.
206 Section VI: Chapter 18 Quiz 7. SEMMSL controls the maximum number of semaphores per semaphore set. 8. You use chkconfig to shut off services. Fill in the Blank(s) 9. SEMMNS controls the maximum number of ____________ on the system. 10. In vmstat output, _____ references the amount of virtual memory being used.
Quiz Answers Section I: Introduction Answers: 1 – b, 2 – b, 3 – c, 4 – a, 5 – c, 6 – False, 7 – True, 8 – False, 9 – True, 10 – 1. Establish a baseline, 3. Identify bottleneck, 4. Tune, 5. Repeat (starting with step 2). Section II: CPU Answers: 1 – a, 2 – b, 3 – c, 4 – d, 5 – a, 6 – d, 7 – b, 8 – True, 9 – True, 10 – Processor affinity is the probability of dispatching a thread to a processor that previously executed it. Section III: Memory Answers: 1 – d, 2 – b, 3 – a, 4 – a, 5 – a, 6 – b, 7 – a, 8 – True, 9 – True, 10 – A memory leak occurs when a process keeps on allocating more memory without releasing it. Section IV: Disk I/O Answers: 1 – c, 2 – b, 3 – c, 4 – a, 5 – c, 6 – b, 7 – a, 8 – True, 9 – False, 10 – J2_minPageReadAhead
208 Quiz Answers Section V: Network I/O Answers: 1 – b, 2 – c, 3 – b, 4 – a, 5 – c, 6 – d, 7 – b, 8 – False, 9 – True, 10 – nfsfr Section VI / Chapter 16: AIX 6.1 Answers: 1 – d, 2 – c, 3 – c, 4 – d, 5 – b, 6 – a, 7 – d, 8 – True, 9 – True, 10 – The netcdctrl command is used to manage the new network caching daemon, letting you dump the cache contents to a file, display cache usage statistics, flush the cache table, and change the logging level of the daemon. Section VI / Chapter 17: Tuning AIX for Oracle Answers: 1 – b, 2 – a, 3 – d, 4 – a, 5 – d, 6 – True, 7 – True, 8 – True, 9 – False, 10 – # mount -o cio /orafilesystem Section VI / Chapter 18: Linux on Power Answers: 1 – a, 2 – a, 3 – True, 4 – False, 5 – False, 6 – False, 7 – True, 8 – True, 9 – semaphores, 10 – swpd
Index A access time/speed disk vs. CPU, 100, 127 network I/O and, 139–140 acdirmin, acdirmax, acregmin, acregmax, and actime parameters to tune network I/O, 163 Address Resolution Protocol (ARP), 141, 160 Advanced Interactive eXecutive. See AIX, 7 Advanced Power Virtualization (APV), 8, 14, 17, 25, 141, 168 aio, AIX 6.1 and, 176 aio_active parameter, AIX 6.1 and, 181 aioLpools and aioPpools, 181 aioo, AIX 6.1 and, 176, 181 AIX, 7–9, 11, 14, 17, 18, 173 AIX 6.1, 175–187 aio_active parameter in, 181 aioLpools and aioPpools in, 181 asynchronous I/O (AIO) and, 180–182 autoconfig and, 180 availability in, 175 CPU tuning in, 179 disk I/O tuning in, 179, 180 Enhanced Journaled File System (JFS2) in, 179 fastpath parameter tuning in, 180, 180, 181 fsfastpath parameter tuning in, 180, 180, 181 Internet Small Computer System Interface (iSCSI) in, 179 ioo vs. aioo to tune, 181 logical partitions (LPARs) and, 176 manageability of, 175 maxfree tuning in, 178 maxreqs parameter tuning in, 180 maxservers parameter tuning in, 180, 182 memory tuning in, 176–179, 177, 178, 179 minperm/maxperm tuning in, 177, 178 minservers parameter tuning in, 180, 182 netcdctrl daemon and, 182 Network File System (NFS) and, 183 network I/O and, 182 no to tune network I/O in, 182 POSIX and, 181 posix_aix_active parameter in, 181 POWER6 and, 176 role based access control (RBAC) in, 175 security in, 175 strict_maxclient tuning in, 178 tunable defaults in, 175, 176 variable page size support (VPSS) in, 178 virtualization in, 175 NOTE: Boldface indicates illustrations and code; t indicates a table. 209
Index Linux on Power (LoP) and, 201 lparmon for monitoring, 25 lparstat for monitoring, 32–33, 32, 33 monitoring, 25–43 mpstat for monitoring, 25, 33–35, 34, 35 netpmon for monitoring, 25 nice to tune, 46, 46, 192 nmon for historical analysis of, 37–38, 37 nmon for monitoring, 36–37 Oracle and, tuning for, 192 pprof for monitoring, 25 process management and, 45 ps for monitoring, 38–39, 39 ps for tuning, 48, 48 renice tuning tool for, 47, 47, 192 B sar for monitoring, 25, 28–30, 29, 30 balancing system workload, 5 sched_R and sched_D tuning tools for, 50, 50 baseline establishment, 3–4 schedo tuning tool for, 48–50, 49–50, 88, 179 Bell Labs, 7, 17 smtctl tuning tool for, 53, 53 Berkeley Software Distribution (BSD), 7 splat for monitoring, 25 bindprocessor, CPU tuning using, 52–53, 52, 53 thread management and, 45 biod daemon, network I/O tuning using, 138, time for timing of, 41, 41 162, 169, 183 timeslice tuning tool for, 51–52, 52 bottlenecks, 4, 5 timex for timing of, 42, 42–43 CPU, 23–24 timing tools for, 41–43 CPU–bound, 24 topas for monitoring, 25, 35–36 memory– vs. CPU–bound, 5 tprof for tracing of, 39–41, 40, 41 tracing tools for, 39–41 tuning, 23–24, 45–54 C vmstat for monitoring, 25–29, 26, 28, 29, 30 chdev, network I/O tuning using, 161 w for monitoring, 31, 31 client tuning, network I/O and, 162–164. See cpuinfo file, 199 also network I/O, tuning of cron, 36, 55 computational memory, 64, 65, 91 disk I/O monitoring using, 108 concurrent I/O (CIO), 97, 101–102, 125, 127 curt, 5 Oracle and, 193–194 AIX 6.1 and, 176 CPU, 173 CPU monitoring using, 25 Advanced Power Virtualization (APV) and, 25 CPU tracing using, 39 AIX 6.1 and, tuning of, 179 bindprocessor tuning tool for, 52–53, 52, 53 cpuinfo file in, 199 D curt for monitoring, 25 daemons, turning off, in LoP, 203–204, 204 filemon for monitoring, 25 data placement on disk, inner vs. outer areas, gprof tuning tool for, 54 104, 104, 125 historical analysis, 56 Decimal Floating Point, 15, 18 Deep Blue supercomputer, 12 iostat for monitoring, 31, 31 AIX 6.1, continued vm_default_pspa parameter in, 178–179, 179 vmo to tune memory in, 177–179, 177, 178, 179 workload partitions (WPARs) in, 176 analyzing performance data, 18 Apple, 11 asynchronous I/O (AIO), 97, 102, 125, 127, 180–182 Oracle and, 192–193, 192 AT&T, 7, 17 Atkins, Stephen, 37 autoconfig, AIX 6.1 and, 180 AutoSys, 55 210
Index deferred page space allocation (DPSA), 85–86, 92 device drivers and multipath I/O, 127 Digital Equipment Corporation (DEC), 7, 17 direct I/O, 97, 101, 125 disk I/O, 97–130, 173 access time in, 100, 127 AIX 6.1 and, tuning of, 179, 180 AIX LVM commands in monitoring, 112–118 asynchronous (AIO), 97, 102, 125, 127, 180–182, 192–193 capacity of, 100, 127 concurrent (CIO), 97, 101–102, 125, 127, 193–194 cron and, 108 data placement on, inner vs. outer areas, 104, 104, 125 device drivers and multipath, 127 direct, 97, 101, 125 Enhanced Journaled File System (JFS2) tuning and, 105, 122–123, 126 file systems and, 105 filemon to monitor, 110, 116–117, 116, 117, 126 fileplace to monitor, 110, 116, 117–118, 118, 126 inodes and, 101 inter-disk policy for, 105 inter-policy and, 113–114, 125 intra-policy and, 126, 127–128 introduction to, 99–105 ioo to tune, 120–122, 121t, 122 iostat to monitor, 108, 111–112, 111, 112, 192 JFS file system tuning parameters, using ioo, 121, 121t journaling file systems and, 121, 126 logical units (LUNs) in, 126, 127 Logical Volume Manager (LVM) and, 99, 103–104, 112–118, 125 logical volume monitoring in, 111–112, 111, 112 logical volume placement in, 125 logical volumes and placement, intra/inter-policy for, 102–104, 103 lslv to monitor, 110, 113–114, 113, 114, 126 lsof to monitor, 110, 169 lsvg to monitor, 113, 113 lvmo to tune, 119–120, 120, 126 lvmstat to monitor, 115–116, 115, 126 Mirror Write Consistency Check (MWCC) and, 104, 113, 128 mirroring and, 126 monitoring of, 107–118 mount command for, 102, 102 multipath, with device drivers, 127 nmon to monitor, 107, 110–111, 110, 192 Oracle and, 127, 192–193 pacing of, 180 relational database management systems (RDBMS) and, 127 sadc utility and, 108 sar to monitor, 107–108, 107, 108 sequential, 128 server minimum/maximum numbers and, 127 stack in, 99–100, 100 storage area networks (SANs) and, 126 syncd daemon and, 123 system layers and, 103–104, 103 System Management Interface Tool (SMIT) and, 105 topas to monitor, 107, 108–111, 109, 110 tuning of, 119–123 Virtual Memory Manager (VMM) and, 101, 122–123 DNS. See Domain Name Server (DNS) Domain Name Server (DNS), 141, 161–162, 168, 169, 182 Dynamic Energy Management, 15, 18 dynamic logic partitioning (DLPAR), 8, 93 CPU tuning and, 56 Linux on Power (LoP) and, 201 E early page space allocation (EPSA), 85–86, 92, 189 Enhanced Journaled File System (JFS2) AIX 6.1 and, 179 disk I/O and, 105, 122–123, 126 memory and, 65 entstat, network I/O monitoring using, 145, 145, 167 environments for testing, 18 NOTE: Boldface indicates illustrations and code; t indicates a table. 211
Index Ethernet Internet Small Computer System Interface (iSCSI) in, 179 jumbo frames in, 162, 167, 168–169 network I/O and, 136, 141, 168 External Data Representation (XDR), 139 F fastpath parameter tuning, AIX 6.1 and, 180, 180, 181 Fiber Distributed Data Interface (FDDI), network I/O and, 162 file memory, 64, 65, 91 file systems, disk I/O and, 105 filemon, 146 AIX 6.1 and, 176 CPU monitoring using, 25 disk I/O monitoring using, 110, 116–117, 116, 117, 126 fileplace, disk I/O monitoring using, 110, 116, 117–118, 118, 126 free, Linux on Power (LoP) monitoring using, 200 fsck, 8, 179 fsfastpath parameter tuning, AIX 6.1 and, 180, 180, 181 I I/O. See disk I/O; network I/O IBM, iii–iv, 8, 9, 11, 12, 17, 57 ifconfig monitoring network I/O using, 161 Object Data Manager (ODM) and, 161 inetd daemon, network I/O monitoring using, 148 inodes, 101 input/output. See disk I/O; network I/O inter-policy, disk I/O, 105, 113–114, 125 Internet Protocol. See TCP/IP Internet Small Computer System Interface (iSCSI), 179 intra-policy, disk I/O, 126, 127–128 ioo AIX 6.1 and, 175, 176, 181 disk I/O tuning using, 120–122, 121t, 122 iostat, 55, 133 AIX 6.1 and, 176 CPU monitoring using, 31, 31 disk I/O monitoring using, 108, 111–112, 111, 112, 192 ipqmalen parameter in tuning network I/O, 160 iptrace, ipreport, and ipfilter, network I/O monitoring using, 154–155, 155, 168 J G General Electric, 7 Global Technology Services, 57 gprof, CPU tuning using, 54 Griffiths, Nigel, 36 H Hardware Management Console (HMC), 141 historical analysis CPU, 56 nmon for, 37–38, 37 HMC. See Hardware Management Console (HMC) hugepages parameter, Linux on Power (LoP) tuning using, 203, 203 Hypervisor, 12, 13–14, 14, 24 Hypervisor Decrementer (HDEC), 14 212 Jann, Joefon, 12 JFS2. See Enhanced Journaled File System Journaled File System (JFS) memory and, 65 tuning parameters, using ioo, 121, 121t journaling file systems, disk I/O and, 121, 126 jumbo frame Ethernet, 162, 167, 168–169 K Kasparov, Garry, 12 kill commands, 56 L late page space allocation in (LPSA), 85–86, 92 layers, system, 103–104, 103 leaks in memory, 77–79, 78, 79, 92
Index lgpg_size and lgpg_regions parameters, Oracle and, 190 libraries, 8 Linux, 9, 14, 136 Linux on Power (LoP), iii, 173, 199–206 commands for, 200–201 CPU performance in, 201 cpuinfo file in, 199 daemons automatically started in, turning off, 203–204, 204 dynamic logical partitioning (DLPAR) in, 201 free command to monitor, 200, 200 hugepages parameter in, 203, 203 meminfo file in, 203, 203 monitoring, 199–200 nmon to monitor, 199 SEMMSL parameter tuning in, 202 SHMMAX parameter tuning in, 202 symmetric multithreading (SMT) in, 201 sysctl to tune, 202 SystemTap to monitor, 199 top command to monitor, 200 tuning, 202–204 virtualization in, 201 vm.nr.hugepages parameter tuning in, 202–203 vmstat to monitor, 200–201, 200 Live Application Mobility, 9 Live Partition Mobility, 15, 18, 19 load control and memory, 87–88, 93 local area networks (LANs), 137. See also network I/O lockd, 139 logical partitions (LPARs), 6, 176 logical units (LUNs), 126, 127 logical volume, disk I/O and, 102–104, 103, 125 Logical Volume Manager (LVM), 17, 99, 103–104, 111–118, 125 developmental history of, 8 lvmo to tune, 119–120, 120, 126 lvmstat to monitor, 115–116, 115 logical volume monitoring, 111–112, 111, 112 lpamon, CPU monitoring using, 25 lparstat, 55 CPU monitoring using, 32–33, 32, 33 lru_file_repage, memory tuning using, 82–84, 84, 92 lru_file_repage parameter, Oracle and, 191, 191 lrubucket, memory tuning and, 88–89, 89, 93 lsattr, network I/O monitoring using, 139–140, 139, 140, 147, 147, 161 lscfg to monitor network I/O, 140, 140, 167 lslv, disk I/O monitoring using, 110, 113–114, 113, 114, 126 lsof, disk I/O monitoring using, 110, 169 lsps, memory monitoring using, 73, 92 lsvg, disk I/O monitoring using, 113, 113 lvmo, disk I/O tuning using, 119–120, 120, 126 lvmstat, disk I/O monitoring using, 115–116, 115, 126 M market share of AIX, 9 Mars Pathfinder, 12 Massachusetts Institute of Technology (MIT), 7 maxclient memory tuning using, 82–84, 84 network I/O tuning using, 163, 165 maxfree, 191, 191 AIX 6.1 and tuning in, 178 memory tuning using, 84, 85 maximum transfer unit (MTU), 162, 167 maxperm/minperm, 63, 66, 82–84, 84, 177, 178, 191, 191, 191 AIX 6.1 and, 177, 178 memory tuning using, 63, 66, 82–84, 84, 92 network I/O tuning using, 162–163, 163, 165, 169 Oracle and, 191, 191 maxpgahead, 85 maxreqs parameter tuning, AIX 6.1 and, 180 maxservers parameter tuning, AIX 6.1 and, 180, 182 meminfo file, Linux on Power (LoP) tuning using, 203, 203 memory, 61–96, 173 AIX 6.1 and, tuning, 176–179, 177, 178, 179 computational, 64, 65, 91 deferred page space allocation (DPSA) in, 85–86, 92 NOTE: Boldface indicates illustrations and code; t indicates a table. 213
Index memory, continued dynamic logic partitioning (DLPAR) and, 93 early page space allocation (EPSA) in, 85–86, 92 Enhanced Journaled File System (JFS2) and, 65 file memory in, 64, 65, 91 free list in VMM and, 64 introduction to, 63–66, 63 Journaled File System (JFS) and, 65 late page space allocation in (LPSA) in, 85–86, 92 leaks, 77–79, 78, 79, 92 load control and, 87–88, 93 lru_file_repage tuning using, 82–84, 84, 92, 191, 191 lrubucket to tune, 88–89, 89, 93 lsps to monitor, 73, 92 maxclient to tune, 82–84, 84 maxperm/minperm to tune, 63, 66, 82–84, 84, 177, 178, 191, 191, 191 maxpgahead to tune, 85 minfree and maxfree parameters in,84–85, 191, 191 monitoring of, 67–79 Network File System (NFS) and, 65 network subsystem, management of, 141 nmon to monitor, 81, 92 Oracle and, 189–191, 189 page space allocation in, 85–87, 92, 189–190 paging in, 65–66, 91 Partition Load Manager (PLM) and, 93 persistent segments in, 64, 91 ps page space allocation for, 87, 87 ps to monitor, 73–74, 74, 92 RAM added to, 93 rmss to tune, 89–90, 90, 93 sar to monitor, 71–73, 72, 92 scanning and, 88–89, 89, 93 schedo CPU tuning and, 88, 91, 92 svmon to monitor, 74–77, 75, 77, 92, 93 swapping in, 65–66 thrashing and, 65–66, 87–88, 91 topas to monitor, 67, 92 Translation Lookaside Buffer (TLB) and, 190 tuning of, 81–90 214 variable page size support (VPSS) in, 178 Virtual Memory Manager (VMM) and, 61, 63–64, 91. See also Virtual Memory Manager (VMM) vm_default_pspa parameter in, 178–179, 179 VMM statistic summary using vsmstat in, 71, 71 vmm_mpsize_support parameter in, 178 vmo to tune, 66, 81–82, 91, 92, 93, 190–191, 190, 191 vmstat to monitor, 67–70, 69, 70, 91, 92, 93 working segments in, 64, 91 workload balancing and, 87–88, 93 methodology of power tuning, 3–6, 17 minfree, memory tuning using, 84, 85, 191, 191 minperm. See maxperm/minperm minservers parameter tuning, AIX 6.1 and, 180, 182 Mirror Write Consistency Check (MWCC), 104 disk I/O monitoring using, 113, 128 mirroring, disk I/O and, 126 monitoring system performance, 4–5, 18 CPU, 25–43 disk I/O and, 107–118 Linux on Power (LoP) and, 199 memory, 67–79 network I/O and, 143–156 Motorola, 11 mount, 102, 102 hard vs. soft, 163 network I/O tuning using, 163–164, 168 mpstat, 55 CPU monitoring using, 25, 33–35, 34, 35 Multics, 7 multipath I/O and device drivers, 127 Multiplexed Information and Computer Service. See Multics N name resolution, network I/O and, 161–162, 168, 169, 182 netcdctrl daemon, 182 netpmon AIX 6.1 and, 176 CPU monitoring using, 25 network I/O monitoring using, 134, 145–145, 146, 148, 152–153, 152, 153, 167
Index netstat, network I/O and, 131, 133, 135, 143–145, 143, 144, 154, 154, 158, 158, 167, 168 Network File System (NFS), 136–139, 137, 138 memory and, 65 monitoring, 148–153, 167, 169, 183 network I/O and, 131 network I/O, 131, 172, 173 acdirmin, acdirmax, acregmin, acregmax, and actime parameters to tune, 163 Address Resolution Protocol (ARP) and, 141, 160 Advanced Power Virtualization (APV) and, 141, 168 AIX 6.1 and, 182 biod daemon to tune, 138 biod to tune client in, 162, 169, 183 chdev and, 161 client tuning in, 162–164 Domain Name Server (DNS) and, 141, 161–162, 168, 169, 182 entstat to monitor, 145, 145, 167 Ethernet and, virtual and shared, 136, 141, 168 External Data Representation (XDR) and, 139 Fiber Distributed Data Interface (FDDI) and, 162 Hardware Management Console (HMC) and, 141 ifconfig to monitor, 161 inetd daemon to monitor, 148 Internet Small Computer System Interface (iSCSI) in, 179 introduction to, 133–141 ipqmalen parameter in tuning, 160 iptrace, ipreport, and ipfilter to monitor, 154–155, 155, 168 jumbo frame Ethernet and, 162, 167, 168–169 lockd and, 139 lsattr to monitor, 139–140, 139, 140, 147, 147, 161 lscfg to monitor, 140, 140, 167 maxclient to tune, 163, 165, 169 maximum transfer unit (MTU) in, 162, 167 maxperm/minperm in tuning, 162–163, 163, 165, 169 memory management in network subsystems and, 141 monitoring of, 143–156 mount parameters to tune, 163–164, 168 name resolution in, 161–162, 168, 169, 182 netcdctrl daemon and, 182 netpmon to monitor, 134, 145–146, 146, 148, 152–153, 152, 153, 167 netstat to monitor, 131, 133, 143–145, 143, 144, 154, 154, 158, 158, 167, 168 Network File System (NFS), 131, 136–139, 137, 138, 148–153, 167, 169, 183 nfds in, 137–138 nfs to monitor, 148 nfs_rfc1323 in tuning, 164, 164 nfs4cl to monitor, 148, 151–152, 152, 167 nfsd to tune, 165 nfso to tune, 164–165, 164, 165, 168 nfsstat to monitor, 148, 149–151, 150, 151, 167 nmon to monitor, 148–149, 148, 167 no to tune, 157–161, 157, 159, 160, 161, 164, 164, 168, 182 Object Data Manager (ODM) and ifconfig, 161 Open Systems Interconnection (OSI) model for networks and, 135, 138, 167 packets in, 136, 154–156, 168–169 portmap and, 139 protocols in, 141 protocols used in network and, 135–136, 164 remote procedure calls (RPCs) in, 137, 149, 163, 167 rfc1323 parameter in tuning, 159, 164–165, 164, 164 rsize and wsize parameters to tune, 163–164 sb_max parameter in tuning, 159 server tuning in, 164–165 speed of, 139–140 spray to monitor, 148, 148 System Management Interface Tool (SMIT), 161 TCP/IP layers and, 134, 135, 138 NOTE: Boldface indicates illustrations and code; t indicates a table. 215
Index network I/O, continued tcp_nodedelyack parameter in tuning, 160 tcp_recvspace/tcp_sendspace parameters in tuning, 158, 164, 164, 168 tcpdump to monitor, 156, 156, 168 thewall to tune, 157–158, 159–160 threads in, 162 topas to monitor, 148, 149, 167 Transmission Control Protocol (TCP) and, 135–136, 164 trcstop to stop trace in, 146, 146, 152 tuning of, 157–165, 157 udp_recevspace/udp_sendspace parameters in tuning, 158, 159 use_isno parameter in tuning, 161, 161 User Datagram Protocol (UDP) and, 135–136, 164 virtual I/O servers (VIOs), 141, 168, 169 Virtual Memory Manager (VMM) and, 141, 162, 169 nfs, network I/O monitoring using, 148 nfs_rfc1323 in tuning network I/O, 164, 164 nfs4cl, network I/O monitoring using, 148, 151–152, 152, 167 network I/O monitoring using, 151–152, 152, 151 nfsd, network I/O tuning using, 137–138, 165 nfso AIX 6.1 and, 175, 176 network I/O tuning using 164–165, 164, 165, 168 nfsstat, network I/O monitoring using, 148, 149–151, 150, 151, 167 nice, 5, 55 CPU tuning using, 46, 46, 192 Oracle and, 192 nmon, 4, 55, 56, 67 CPU monitoring using, 36–37 disk I/O monitoring using, 107, 110–111, 110, 192 historical analysis using, 37–38, 37 Linux on Power (LoP) monitoring using, 199 memory monitoring using, 81, 92 network I/O monitoring using, 148–149, 148, 167 216 no AIX 6.1 and, 175, 176 network I/O tuning using, 157–161, 157, 159, 160, 161, 164, 164, 168, 182 O Object Data Manager (ODM), 180 ifconfig and, 161 Open Firmware, 14 Open Systems Interconnection (OSI) model for networks, 135, 138, 167 Oracle, 173, 189–198 asynchronous I/O (AIO) and, 192–193 concurrent I/O (CIO) in, 193–194 CPU tuning for, 192 disk I/O and, 127 early allocation of paging space and, 189 iostat to monitor AIO in, 192–193, 193 lgpg_size and lgpg_regions parameters for, 190 lru_file_repage parameter for, 191, 191 memory tuning for, 189–191 minfree and maxfree parameters in, 191, 191 minperm and maxperm parameters in, 191, 191 nice and renice to tune CPU for, 192 nmon to monitor AIO in, 192 Oracle Enterprise Manager (OEM) and, 189, 195–196, 195, 196 page space allocation and, 86–87, 92, 189–190 Statspack for, 194, 195 storage area networks (SANs) and, 194 symmetric multithreading (SMT) for, 192 System Global Area (SGA) in, 83, 190 Translation Lookaside Buffer (TLB) and, 190 Virtual Memory Manager (VMM) and, 189–190 vmo to tune memory for, 190–191, 190, 191 Oracle Enterprise Manager (OEM), 189, 195–196, 195, 196 P pacing disk I/O, 180 packets, in network communication, 136, 154–156, 168–169 page space allocation, 85–87, 92 Oracle and, 189–190
Index paging in memory, 65–66, 91 Partition Load Manager (PLM), 93 CPU tuning and, 56 PDP-7 computers, 7 Performance Toolbox (PTX), 57 persistent segments of memory, 64, 91 PM, 57 portmap, 139 POSIX, 7, 181 posix_aix_active parameter, AIX 6.1 and, 181 POWER, 24 Power Optimization with Enhanced RISC. See POWER servers POWER servers, iii, 8, 9, 11–15, 17, 18–19, 201 power tuning methodology, 3–6, 17 POWER5, 13–14 POWER6, 14–15, 18 AIX 6.1 and, 176 PowerVM, 14, 15, 18, 141, 201 pprof AIX 6.1 and, 176 CPU monitoring using, 25 process management, CPU, 45 procmon, 57 AIX 6.1 and, 176 proctree, AIX 6.1 and, 176 protocols, network, 135–136, 141, 164 ps, 5, 55, 56 CPU monitoring using, 39–39, 39 CPU tuning using, 48, 48 memory monitoring using, 73–74, 74, 92 memory page space allocation using, 87, 87 R RAM, 93 raso, AIX 6.1 and, 175 Red Hat Linux, 136, 200 Regatta architecture, 12, 13 relational database management systems (RDBMS), 127 remote procedure calls (RPCs), 137, 149, 163, 167 renice, 5, 55 CPU tuning using, 47, 47, 192 Oracle and, 192 repeating the tuning process, 6 resource increases, 6 rfc1323 parameter in tuning network I/O, 159, 164–165, 164, 169 RISC architecture and POWER, 11–12 Ritchie, Dennis, 7, 17 rmss, memory tuning using, 89–90, 90, 93 role based access control (RBAC), 175 RS/6000, 8, 17 rsize, network I/O tuning using, 163–164 Run-Time Abstraction Services (RTAS), 14 S sadc, disk I/O monitoring using, 108 sar, 55, 67 CPU monitoring using, 25, 28–30, 29, 30 disk I/O monitoring and, 107–108, 107, 108 memory monitoring using, 71–73, 72, 92 sb_max parameter in tuning network I/O, 159 scanning memory, lrubucket and, 88–89, 89, 93 sched_R and sched_D, CPU tuning using, 50, 50 schedo, 55, 56, 66 AIX 6.1 and, 175, 176 CPU tuning using, 48–50, 49–50, 88, 179 memory tuning using, 91, 92 schedtune, 66, 91 scheduler tuning, 5–6 security, AIX 6.1 and, 175 SEMMSL parameter, Linux on Power (LoP) tuning using, 202 sequential I/O, 128 servers minimum/maximum numbers and I/O, 127 network I/O and, tuning, 164–165 virtual I/O (VIOs), 141, 168, 169 shared partitions, 67 SHMMAX parameter, Linux on Power (LoP) tuning using, 202 simultaneous multithreading (SMT), 13, 24 smctl, 55 SMIT. See System Management Interface Tool (SMIT) smtctl, 56 CPU tuning using, 53, 53, 53 Solaris, 9 NOTE: Boldface indicates illustrations and code; t indicates a table. 217
Index splat, 5 CPU monitoring using, 25 CPU tracing using, 39 spray, network I/O monitoring using, 148, 148 stack, I/O, 99–100, 100 Statspack, 194, 195 storage area networks (SANs) disk I/O and, 126 Oracle and, 194 stress testing, 4–5, 18 strict_maxclient, AIX 6.1 and tuning in, 178 Sun Microsystems, 136 svmon AIX 6.1 and, 176 leaks in memory monitored with, 77–79, 78, 79, 92, 93 memory monitoring, using, 74–77, 75, 77 swapping in memory, 65–66 symmetric multiprocessing (SMP), 8 symmetric multithreading (SMT), 8, 13, 24, 55, 56, 192 Linux on Power (LoP) and, 201 syncd daemon, disk I/O tuning using, 123 sysctl, Linux on Power (LoP) tuning using, 202 System Global Area (SGA), Oracle, 83, 190 system layers, 103–104, 103, 103 System Management Interface Tool (SMIT), 105 network I/O and, 161 System p, 57 SystemTap, Linux on Power (LoP) and, 199 T TCP/IP, layers of, 134, 135, 138 tcp_nodedelyack parameter in tuning network I/O, 160 tcp_sendspace/tcp_recvspace parameter in tuning network I/O, 158, 164, 164, 168 tcpdump, network I/O monitoring using, 156, 156, 168 testing/test environments, 56 thewall, network I/O tuning using, 157–158, 159–160 Thompson, Ken, 7, 17 thrashing, 87–88, 91 memory and paging and swapping and, 65–66 218 memory tuning and, 87–88 page space allocation/tuning and, 87 thread management CPU, 45 network I/O tuning and, 162 time, CPU timing using, 41, 41 timeslice, CPU tuning using, 51–52, 52 timex, CPU timing using, 42, 42–43 timing tools, CPU, 41–43 Tivoli Monitoring System, 57 top, Linux on Power (LoP) monitoring using, 200 topas, 4, 55, 56, 67 AIX 6.1 and, 176 CPU monitoring using, 25, 35–36 CPU tuning with, 24 disk I/O monitoring using, 107, 108–111, 109, 110 memory monitoring using, 67, 92 network I/O monitoring using, 148, 149, 167 tprof, 5, 56, 146 CPU tracing using, 39–41, 40, 41 trace, 5 CPU tracing using, 39 network I/O and, 146, 152 tracing tools, CPU, 39–41 Translation Lookaside Buffer (TLB), 190 Transmission Control Protocol (TCP), 135–136, 164. See also TCP/IP trcrpt, CPU tracing using, 39 trcstop, 146, 146, 152 trpof, AIX 6.1 and, 176 tuning, 5–6 CPU, 45–54 disk I/O and, 119–123 Linux on Power (LoP) and, 202–204 memory, 81–90 network I/O and, 157–165 U udp_sendspace/udp_recevspace parameter in tuning network I/O, 158 Uniplexed Information and Computing Service. See Unix Unix, iii–iv, 7–8, 9, 11, 17 upgrading, 18
Index use_isno parameter in tuning network I/O, 161, 161 User Datagram Protocol (UDP), 135–136, 164 V variable page size support (VPSS), 178 virtual I/O servers (VIOs), 141, 168, 169 virtual memory, 173 Virtual Memory Manager (VMM), 61, 63–64, 67, 91 direct I/O and, 101 disk I/O tuning using, 122–123 early allocation of paging space and, 189 free list in, 64 network I/O and, subsystem memory and, 141 network I/O tuning using, 162, 169 Oracle and, 189–190 page space allocation for, 189–190 paging in, 65–66, 91 summary of statistics for, using vmstat, 71, 71 thrashing and, 65–66 Translation Lookaside Buffer (TLB) and, 190 tuning of, 66 variable page size support (VPSS) in, 178 vm_default_pspa parameter in, 178–179, 179 vmm_mpsize_support parameter in, 178 vmo for, 66 vmtune for, 66 virtualization, 6, 8, 14, 17, 18, 141, 169, 175 Linux on Power (LoP) and, 201 vm.nr.hugepages parameter, Linux on Power (LoP) tuning using, 202–203 vm_default_pspa parameter, 178–179, 179 vmm_mpsize_support parameter, 178 vmo, 66 AIX 6.1 and, 175, 176 AIX 6.1 and, memory tuning using, 177–179, 177, 178, 179 memory tuning using, 81–82, 91, 92, 93, 190–191, 190, 191 Oracle, memory tuning using, 190–191, 190, 191 vmstat, 4, 55, 67, 78, 81, 133 AIX 6.1 and, 176 CPU monitoring using, 25–29, 26, 28, 29, 30 CPU tuning with, 24 Linux on Power (LoP) monitoring using, 200 memory monitoring using, 67–70, 69, 70, 92, 93 memory tuning and, 91 vmtune, 66, 91 W w, 55 CPU monitoring using, 31, 31 wide area networks (WANs), 136. See also network I/O working segments of memory, 64, 91 workload analysis, 55 workload balancing, 5, 87–88, 93 Workload Manager, 55 workload partitions (WPARs), 8–9 AIX 6.1 and, 176 wsize, network I/O tuning using, 163–164 X X/OPEN, 7 XDR. See External Data Representation (XDR) Z zombie processes, 56 NOTE: Boldface indicates illustrations and code; t indicates a table. 219

Your Source for Everything IT =Technical and Thought-leadership Articles =Weekly/semi-monthly newsletters =Industry-leading columnists =Forums and blogs =Industry News =Resourse Directory =Industry Event Directory =White Papers, Webcasts, Trial Software Visit us at www.mcpressonline.com today
See Our Full Line of IT Books and Training Materials at MC-Store.com Choose from a wide variety of topics, including • Security • IT Management • DB2 • IBM System i • IBM WebSphere • RPG • Java and JavaScript • SOA ...and many more. MCPressOnline.com ~ MC-Store.com