Skip to main content
Testing & Assurance · Practitioner Guide

Penetration
testing:
what most
reports won’t
tell you.

Most organisations treat penetration testing as an annual checkbox. The numbers confirm it: only 8% of organisations test continuously, and 13% do not test at all. Of the findings that do surface, only 48% ever get remediated. The test is not the problem. The operating model around it is.

APRA CPS 234 mandates control testing for approximately 680 financial institutions. The Essential Eight expects application-level testing. The SOCI Act covers critical infrastructure. ISM-1163 calls for annual pen testing of internet-facing systems. Australian organisations are not short on mandates. They are short on useful testing.

Written by CREST-accredited penetration testers who deliver testing engagements across Australian industries, from financial services and government to energy and critical infrastructure.

01 / Why most pen tests fail to improve security

The test works. The operating model around it does not.

The penetration test itself is rarely the problem. The methodology is sound. The testers are competent. The report is thorough. And then nothing happens. Only 48% of all vulnerabilities identified in penetration tests are ever resolved. The median time to resolve findings is 67 days -- nearly five times the 14-day SLA most organisations set for themselves. Meanwhile, the average eCrime breakout time -- from initial access to lateral movement -- is 29 minutes.

The mismatch is structural. A standard penetration test runs for one to three weeks. It captures a snapshot of your environment at a single point in time. By the time you receive the report, your environment has changed: new code has been deployed, configurations have shifted, staff have rotated. NIST registered 40,000 CVEs in 2024, up 43% from the prior year. Time-to-exploit for new vulnerabilities has collapsed from 32 days in 2022 to just 5 days in 2023--2024.

This is not an argument against penetration testing. It is an argument against treating a pen test as a complete security programme. An annual pen test without continuous vulnerability management, without a functioning remediation pipeline, without executive accountability for findings -- that is compliance theatre, not security assurance. As our assessment sequencing guide explains, the right assessment at the wrong time delivers the wrong outcome.

The uncomfortable truth: A penetration test tells you what a skilled attacker could achieve in a constrained window. It does not tell you what a persistent threat actor will do over months with unlimited patience. The value is in acting on findings, not in receiving the report.

Organisations that extract genuine value from pen testing share three characteristics: they have a functioning vulnerability management process that tracks findings to closure, executive-level accountability for remediation timelines, and they treat the pen test as one input into a continuous security programme rather than the programme itself. A Lighthouse Assessment can help determine whether your organisation is ready to extract real value from a pen test or whether foundational controls should come first.

02 / What pen tests actually find

The findings are predictable. The remediation gaps are where the risk lives.

After a decade of conducting penetration tests, the most common finding categories remain stubbornly consistent. Security misconfigurations account for 20--30% of all issues. Broken access control is OWASP's number one category in 2025. Cross-site scripting represents 18.4% of web vulnerabilities. Weak or default passwords remain the most common internal pen test finding. And outdated or unpatched software appears in 60% of organisations.

The proportion of serious findings has actually declined from 20% to 11% over the past decade -- a real improvement. But that improvement has plateaued, and it masks a critical gap: only 69% of serious (high and critical) findings are addressed, while 31% remain open. Large enterprises leave roughly 45% of discovered vulnerabilities unresolved after 12 months. Small firms with fewer than 200 employees face the greatest concentration of risk, accounting for 87% of all critical and high findings.

Business logic flaws -- the findings scanners will never catch

The most valuable penetration test findings are often the ones that automated tools cannot detect. Business logic vulnerabilities exploit an application's intended functionality in unintended ways -- working code doing something it should not from a business perspective. OWASP explicitly states they "cannot be detected by a vulnerability scanner and rely upon the skills and creativity of the penetration tester."

Real-world examples illustrate the impact. Negative quantity manipulation has reduced product prices from hundreds of dollars to cents in e-commerce systems. Race conditions in gift card transfer systems have enabled unlimited credit generation. Booking system APIs have exposed passenger data modification capabilities using only a six-character reference -- no authentication required. These vulnerabilities are invisible to automated scanning but represent enormous business risk. Our third-party security risk guide covers how business logic flaws in vendor systems create additional attack surface.

If your pen test report reads like a vulnerability scan export -- a list of CVEs with CVSS scores and no business context -- you are paying pen test prices for scanner results. A genuine penetration test demonstrates exploitation, chains vulnerabilities together, and communicates business impact.

03 / Penetration test types

Not all pen tests are the same. Choosing the wrong type wastes budget and creates blind spots.

Penetration tests are classified by knowledge level (black box, grey box, or white box) and by target domain. The right choice depends on your threat model, regulatory obligations, and what you are trying to learn. Most organisations should start with external network and web application testing, then expand scope based on findings and maturity. See our essential assessments guide for sequencing advice.

01
External Network Penetration Test
Targets your public-facing infrastructure from an external attacker's perspective -- firewalls, routers, DNS, email servers, VPN gateways, and exposed services. Required by PCI DSS, expected under CPS 234, and the minimum baseline for any organisation with internet-facing systems.
Will not find
Internal lateral movement paths, insider threat scenarios, business logic flaws in applications, or social engineering weaknesses.
02
Internal Network Penetration Test
Simulates an assumed-breach scenario, testing lateral movement, privilege escalation, and data exfiltration from within your network. Validates Active Directory security, network segmentation, and zero-trust implementations. Internal testers breach network perimeters in 93% of engagements and gain full infrastructure control at all tested companies.
Will not find
External-facing vulnerabilities, web application logic flaws, or physical security weaknesses.
03
Web Application Penetration Test
Assesses your web applications against the OWASP Web Security Testing Guide -- authentication, session management, access control, input validation, business logic, and API security. The average company maintains over 200 active API endpoints, doubling every 18--24 months, making this increasingly critical.
Will not find
Network-level vulnerabilities, infrastructure misconfigurations, or client-side mobile application issues.
04
Mobile Application Penetration Test
Evaluates both client-side (app binary, local storage, certificate pinning) and server-side (APIs, backend) security across iOS and Android using the OWASP Mobile Application Security Testing Guide. Critical for any organisation with customer-facing mobile apps.
Will not find
Server infrastructure vulnerabilities, network segmentation issues, or web-only application flaws.
05
API Penetration Test
Focused assessment of API endpoints for Broken Object Level Authorisation (BOLA), which accounts for over 40% of API vulnerabilities, along with authentication flaws, injection, excessive data exposure, and rate limiting issues. Increasingly critical as organisations adopt microservices and headless architectures.
Will not find
Front-end UI vulnerabilities, network infrastructure issues, or business logic flaws outside the API layer.
06
Covers WiFi, Bluetooth, and proprietary radio protocols. Targets rogue access points, weak encryption, man-in-the-middle attacks, and segmentation between guest, corporate, and operational networks. Particularly important for organisations with physical premises and OT environments.
Will not find
Application-layer vulnerabilities, remote access weaknesses, or cloud configuration issues.
07
Tests human susceptibility through phishing, vishing, smishing, pretexting, and physical intrusion attempts. Social engineering is simultaneously the most common attack vector and the most commonly excluded pen test scope. This gap means organisations test their technology but not the people who use it.
Will not find
Technical vulnerabilities in systems, applications, or infrastructure.
08
Cloud Infrastructure Review
Targets IAM misconfigurations, exposed storage buckets, overprivileged service accounts, container escape, and Kubernetes RBAC issues across AWS, Azure, and GCP. Each provider has specific pre-authorisation requirements for testing. In 2024, 68% of breaches in cloud-native environments came from security oversights preventable earlier in the lifecycle.
Will not find
On-premises network vulnerabilities, application-layer business logic flaws, or physical security gaps.

Black box, grey box, white box -- which approach?

Black box testing (zero prior knowledge) simulates an external attacker with no inside information. White box (full access to source code, architecture docs, and credentials) maximises coverage. Grey box (partial knowledge, typically valid credentials and basic architecture understanding) sits between them.

Our position: grey box is almost always the right default. It reflects the most realistic threat scenario -- an attacker with initial access, stolen credentials, or insider knowledge -- and delivers the best value for testing budget. Black box wastes time on reconnaissance that duplicates what a real attacker would acquire quickly through OSINT or credential theft. White box is appropriate for specific scenarios like source code review or pre-deployment testing.

04 / Automated scanning versus manual testing

Scanners find known vulnerabilities. Testers find what matters.

The distinction between automated vulnerability scanning and manual penetration testing is foundational, yet frequently blurred by providers who deliver scanner output repackaged as manual testing. Both are necessary. Neither is sufficient alone. Manual pen testing has uncovered nearly 2,000 times more unique security vulnerabilities than automated scans alone.

Dimension Automated scanning Manual penetration testing
Scope Broad infrastructure coverage; checks against databases of known CVEs Targeted, risk-based; focuses on high-value assets and attack paths
Depth Surface-level; identifies individual vulnerabilities in isolation Chains vulnerabilities across layers to demonstrate real exploitation paths
Business logic Cannot detect; scanners do not understand business context Core strength; testers identify flaws in intended application behaviour
False positives 3--48% for SAST tools; legacy DAST tools up to 82%. Triaging a single finding takes ~10 minutes Very low; testers validate every finding through exploitation
Frequency Continuous or scheduled; suitable for daily/weekly/monthly execution Periodic; typically annual or after significant changes
Cost Lower per scan; scales efficiently across large environments Higher per engagement; cost reflects skilled human effort
AI/LLM testing Emerging tools but limited effectiveness for novel attack patterns Essential; prompt injection and jailbreaking require creative human testing

You need both. Automated scanning provides the broad, continuous baseline: catch known CVEs, monitor for configuration drift, flag newly disclosed vulnerabilities. Manual penetration testing provides the depth: validate what is actually exploitable, chain findings into real attack paths, test business logic, and demonstrate business impact. Replacing manual testing with scanning is like replacing a building inspector with a smoke detector -- useful, but fundamentally different.

The optimal model uses automated scanning for continuous coverage and schedules manual application testing for depth validation of high-risk areas. Organisations using this combined approach reduce alert noise and focus effort where it counts.

05 / The AI and LLM chatbot attack surface

If you have deployed an AI chatbot, you have a new attack surface most pen tests ignore.

AI-related vulnerability reports grew 210% in 2025. Prompt injection reports surged 540%. Yet most penetration test scopes do not include AI or LLM testing because the methodologies are new and the testers lack experience. If your organisation has deployed customer-facing AI, you have an attack surface that your standard pen test is not examining.

The OWASP Top 10 for LLM Applications 2025 defines the key risk categories. Prompt injection -- manipulating model responses via crafted inputs -- is ranked the number one risk, with OWASP acknowledging that fool-proof prevention methods may not exist. Twenty per cent of jailbreak attempts succeed, with the average attack taking just 42 seconds across five interactions. Ninety per cent of successful prompt injections result in leakage of sensitive data.

Real failures that demonstrate the stakes

A car dealership's AI chatbot was manipulated into agreeing to sell a vehicle for one dollar. A parcel delivery firm's chatbot was prompted to swear and recommend competitors, reaching 1.3 million views. Most consequentially, an airline's chatbot gave incorrect refund policy advice, and the tribunal ruled that companies remain liable for information provided by their AI chatbots. In Australia, a major retailer's AI shopping assistant began fabricating personal stories during customer interactions in early 2026.

Standard penetration testing methodologies were not designed for this attack surface. LLM-specific testing requires understanding of prompt injection techniques, multi-turn attack sequences, RAG pipeline poisoning, data exfiltration through conversation manipulation, and agentic AI abuse patterns. Our secure AI adoption guide covers the broader risk framework, and our secure AI services include dedicated AI system testing.

Questions to ask your pen test provider about AI testing: Do your testers have experience with prompt injection and jailbreaking techniques? Do you test against the OWASP Top 10 for LLM Applications? Can you test multi-turn attack sequences, not just single-prompt injections? Do you assess the full stack including RAG pipelines, tool-use capabilities, and system prompt leakage?

06 / The testing maturity progression

From vulnerability scanning to purple teaming -- match your testing to your maturity.

Security testing exists on a maturity curve. Each level builds on the one before it, and skipping levels wastes money. If done too soon, red teaming will expose problems your organisation already knows about but has not fixed. If done too late, penetration testing delivers diminishing returns because your basic vulnerabilities are already managed. The key is matching the test to your current maturity.

L1
Vulnerability Scanning
Automated, broad, baseline visibility. Uses tools to check against databases of known vulnerabilities (CVEs) and identify misconfigurations across your environment. Runs in minutes to hours and covers wide infrastructure. Suitable for continuous or scheduled execution. Provides the foundation for all subsequent testing levels.
Prerequisites
Asset inventory, basic patch management process, someone to triage and act on findings.
L2
Penetration Testing
Manual, targeted, exploitation-focused. Skilled testers simulate real attacks to prove what is actually exploitable, chain vulnerabilities together, and demonstrate business impact. This is where most Australian mid-market organisations should be. Typical engagements run 1--3 weeks and cost AUD 15,000--80,000 depending on scope.
Prerequisites
Vulnerability scanning programme in place, basic security controls implemented, remediation pipeline capable of acting on findings.
L3
Red Teaming
Objective-based, multi-vector adversary simulation. Unlike pen testing (which finds as many vulnerabilities as possible), red teaming asks: "Could an attacker achieve this specific goal without being stopped?" Tests detection, response, and resilience across people, process, and technology. Engagements range from AUD 50,000 to 120,000+ over 4--8 weeks. Australia's CORIE framework defines red teaming standards for financial institutions.
Prerequisites
Operational SOC or blue team, pen tests no longer revealing major findings, senior management buy-in, documented incident response plans.
L4
Purple Teaming
Collaborative, knowledge-transfer focused. Offensive findings are shared with defenders in real time so detection and response capabilities improve during the engagement, not just after it. The red team's purpose becomes improving the effectiveness of the blue team. MITRE ATT&CK provides the common language for mapping attacks to detection coverage and systematically closing gaps.
Prerequisites
Mature SOC with detection engineering capability, established red team programme, leadership commitment to continuous improvement.

Our position: most Australian mid-market organisations should be at Level 2, progressing toward Level 3. Jumping to red teaming before your penetration test findings are largely resolved is paying for advanced testing while basic problems persist. Conversely, if your pen tests consistently return clean results, you have outgrown Level 2 and should be testing your detection and response capabilities at Level 3.

07 / Choosing and rotating providers

Rotate testers, not necessarily vendors. And know what accreditations actually mean.

No major regulatory framework -- including APRA CPS 234, PCI DSS v4.0, or ISO 27001 -- mandates the rotation of penetration testing providers. They mandate independence, competence, regularity, and systematic approaches, but leave provider selection to organisational discretion.

The case for rotation centres on fresh perspectives: different testers with different backgrounds find different things. New testers approach assessments without the preconceived notions that develop over time. The counter-argument is equally compelling: rotation causes loss of institutional knowledge, imposes learning curves that can reduce effectiveness, and sacrifices historical trend data.

Our position: rotate the individual testers every two to three engagements, not necessarily the vendor. What matters is fresh eyes on the scope, not a new logo on the report. The ideal model has the previous tester perform quality assurance on the new tester's findings, providing fresh perspective while retaining institutional knowledge. This delivers the benefits of rotation without the costs of starting from scratch.

Accreditations that matter in Australia

CREST accreditation is the primary quality assurance mechanism for penetration testing in Australia. CREST requires dual-factor recognition: both the organisation and the individual testers must meet accreditation and certification standards. Member companies undergo rigorous assessment covering operating procedures, personnel security, testing approach, and data security, with full re-assessment every three years. PCI DSS v4.0 specifically references CREST as a recommended certification. Many Australian Government agencies and APRA-regulated entities require or prefer CREST-accredited providers.

Individual tester certifications to look for include OSCP (Offensive Security Certified Professional, demonstrating practical exploitation skills), OSCE/OSWE (advanced web and exploit development), and GPEN/GXPN (SANS-based penetration testing certifications). The certifications matter less than the practical experience behind them, but they provide a baseline indicator of competence.

Red flags in pen test proposals

Watch for these warning signs when evaluating proposals. A scope that is purely automated with no manual testing effort described. No named testers on the engagement. No methodology description or reference to standards like OWASP, PTES, or CREST. A deliverable that is a scanner export with a cover page. A price significantly below market that cannot be explained by reduced scope. And no provision for retesting of critical and high findings -- if your provider does not offer retesting, they are not invested in your remediation outcome.

08 / Australian regulatory mandates

What APRA, ASD, and the SOCI Act actually require -- and what they leave to your judgement.

Australian regulators are converging on the expectation that penetration testing is a minimum baseline, not an optional extra. The debate is no longer whether to test, but how often and how deeply. The following summarises what each framework actually requires for security testing, linking to our detailed guides where available.

Paragraph 27 mandates a systematic testing programme with frequency commensurate with threat changes, asset criticality, and incident consequences. Paragraph 30 requires testing by appropriately skilled and functionally independent specialists. CPG 234 explicitly identifies penetration testing and red team testing as expected approaches. APRA's tripartite assessment found "inadequate definition and execution of control testing programs" as a top-six gap across 300+ entities. See our full CPS 234 guide.
Essential Eight
Mandatory at Maturity Level 2 for all non-corporate Commonwealth entities. The Assessment Process Guide defines "excellent evidence" as testing a control with a simulated activity designed to confirm it is effective -- directly describing penetration testing methodology. Validates application control, patch effectiveness, administrative privilege restrictions, user application hardening, and MFA bypass resistance. Only 22% of Australian Government entities reached ML2 across all eight strategies in 2025. See our Essential Eight ML3 guide.
SOCI Act
Critical Infrastructure
Responsible entities for 13 prescribed critical infrastructure asset classes across 11 sectors must establish CIRMPs. For assets designated as Systems of National Significance, Enhanced Cyber Security Obligations include mandatory vulnerability assessments including penetration testing and incident response testing. Significant impact incidents must be reported within 12 hours. See our Cyber Security Act 2024 guide for the broader regulatory context.
ISM-1163
ASD · Core control
The core ASD control for continuous monitoring mandates vulnerability assessments and penetration tests prior to deployment (including before significant changes) and at least annually thereafter. All testing must be conducted by suitably skilled personnel independent of the system being assessed. Applicable from Non-Classified through TOP SECRET systems.
PCI DSS v4.0
Requirement 11.4
Mandates penetration testing at least annually and after significant changes, with documented methodology aligned with NIST SP 800-115, OSSTMM, OWASP, or PTES. Testing must cover the entire cardholder data environment from both inside and outside the network. Service providers must perform segmentation testing every six months. Multi-tenant service providers must now support customer external pen testing (new in v4.0).
ISO 27001
Annex A control A.8.8 (Management of Technical Vulnerabilities) explicitly advises periodic, documented penetration tests. A.8.29 requires security testing during and after development. While ISO 27001 does not prescribe specific testing frequency, certification auditors expect documented evidence of systematic security testing proportionate to organisational risk.

The regulatory intent is consistent: systematic, independent, risk-proportionate security testing at minimum annually. Organisations subject to multiple frameworks -- and most mid-to-large Australian organisations are -- can align a single well-scoped testing programme to satisfy overlapping requirements. A cybersecurity audit can map your testing programme to all applicable regulatory obligations.

09 / Shift-left economics

A vulnerability found in development costs 6x less to fix than one found in production.

The economic case for finding vulnerabilities earlier is well established. Fixing a bug found during implementation costs approximately 6 times more than one identified during design. During testing, 15 times more. Post-release, 30 times more. The global average data breach cost reached USD 4.88 million in 2024, a 10% increase from the prior year, while breaches contained within 30 days cost USD 1.76 million less than those that took longer.

Integrating security testing into the software development lifecycle -- SAST in code review, DAST in staging, software composition analysis in CI/CD pipelines -- catches the low-hanging vulnerabilities before they reach production. Seventy-four per cent of security professionals have shifted left or plan to. Over half of DevOps teams run SAST scans and approximately 50% scan containers and dependencies.

This does not eliminate the need for penetration testing. It changes what pen testing is for. When automated security checks catch the known vulnerabilities in development, penetration testers spend their time on what they are uniquely qualified to find: business logic flaws, complex attack chains, architectural weaknesses, and contextual risks that automated tools miss. Shift-left reduces the volume of low-hanging findings in your pen test report and increases the value of what testers spend their time on.

Our web application testing engagements are designed to complement DevSecOps pipelines, not duplicate them. We focus manual effort on the vulnerability classes that automation cannot reach, and our security architecture reviews can help embed testing into your SDLC from the start.

10 / How Cliffside approaches penetration testing

CREST-accredited testers. Named individuals. Manual testing. No scanner-only reports.

Every engagement starts with scoping. A Lighthouse Assessment determines the right type, depth, and timing of testing for your environment, threat profile, and regulatory obligations. We do not upsell testing you do not need, and we will tell you if your maturity level means foundational controls should come before a pen test.

Our testing methodology follows OWASP, PTES, and CREST standards. Manual testing is the core -- scanner results supplement human analysis, not replace it. Every engagement has named CREST-accredited testers assigned, not anonymous resources pulled from a bench. You know who is testing your systems and can speak to them directly about findings.

What you receive

Every penetration test delivers an executive summary with business impact context for board and leadership audiences, a detailed technical report with step-by-step reproduction instructions for every finding, risk-rated remediation recommendations prioritised by exploitability and business impact, and retesting of critical and high findings included as standard -- not an optional add-on.

We rotate testers across engagements to ensure fresh perspectives while maintaining institutional knowledge of your environment. And because we follow an assessment-first approach, our recommendations focus on what you actually need to fix, not on selling follow-on work.

Testing services

Our testing and assurance practice covers the full spectrum of security testing.

Penetration Testing

Know what an
attacker would
find, before
they do.

The Cliffside Lighthouse Assessment scopes the right penetration testing engagement for your environment -- type, depth, timing, and regulatory alignment -- so you test what matters, not just what is easy to scan. We tell you honestly whether a pen test is the right next step, or whether foundational controls should come first.

What you get from a Cliffside pen test
  • CREST-accredited penetration testers, named on every engagement
  • Manual testing with real exploitation, not repackaged scanner output
  • Executive summary with business impact and risk-rated findings
  • Detailed technical report with reproduction steps for every finding
  • Retest of critical and high findings included in every engagement
  • Tester rotation to ensure fresh perspectives across engagements