I built a cloud-storage security scanner in a month with an LLM. Here’s what it taught me about triage.
A field report on Portitor: what it is, how it’s built, what the first week of running it against the open internet taught me, and how I used an LLM as a pair programmer without letting it take the wheel on the decisions that mattered.
It was a Monday night, and the alarm email had just hit my inbox
The subject line read: ALARM: “portitor-critical-finding” in US East (Ohio). Twenty findings had just tripped a CloudWatch threshold I’d configured maybe six hours earlier. Twenty publicly-listable cloud storage buckets, classified CRITICAL, now sitting in a DynamoDB table waiting for me to decide what to do with them.
The tool that found them is called Portitor. I’d finished deploying it to AWS that same morning. This was its first real sweep against the open internet.
Two instincts fought each other. The first was excitement, because the thing worked on day one and it found something. The second was unease, because I suddenly remembered that the rule I’d written into my own documentation weeks earlier said: Targeted scanning requires authorization. Every finding requires manual verification before disclosure. Twenty findings meant twenty manual verifications. And if I got lazy and batch-routed them to a bug bounty platform, which the tool was perfectly capable of doing autonomously, I could embarrass myself on the public internet within the hour.
So I picked the most interesting-looking bucket from the list and queried it by hand.
Here is what I want to tell you about, because it turned out to be the most valuable lesson of the whole project: most “public forgotten buckets” on the internet are not forgotten. They are on purpose. And if you’re building a tool that finds them, the tool is the easy part. Telling signal from noise is the whole job.
This is a write-up of Portitor: what it is, how it’s put together, what the first week of running it taught me, and how I used an LLM as a full-time pair programmer without letting it take the wheel on the decisions that mattered. I’ll show you the architecture. I’ll show you the bugs I found in my own code during live operation. I won’t show you specific findings, because several of them are inside active responsible-disclosure windows with real vendors, and those vendors haven’t consented to being written about yet.
Fair warning: this is less a product demo than a field report. If you’re expecting a “look at the cool tool I built” post, you might find it disappointing. If you want to read about what actually happens when you point a scanner at the open internet and watch 459 findings come back from a single keyword sweep, this one’s for you.
What Portitor actually is
The job Portitor does breaks into four stages, in order: find buckets that might exist, check whether each one is exposed, decide what kind of exposure it is, and route the result to the correct place.
Most of the existing tools in this space do one of those four. Grayhat Warfare maintains an enormous index of buckets it has discovered. Trufflehog scans content for secrets. There are bucket-takeover scripts for the GHOST case. Each is good at its slice. None of them tie the slices together with the ethics of responsible disclosure baked into the wiring. That’s the gap I wanted to fill.
Here is the shape of it:
Portitor pipeline
A bucket name comes in from a discovery source. The ingestion layer normalizes it into the format the relevant cloud provider expects, drops it if it has been seen recently or if targeted scanning of it would be unauthorized, and queues it. A validator pulls it off the queue, does a small number of read-only HTTP probes against the cloud provider’s public endpoints, and returns a structured finding. The classifier looks at the finding and assigns a status. If the status is high enough to act on, a router decides who to tell.
The whole thing runs serverless on AWS. A Lambda function for ingestion and validation, a Lambda for the takeover workflow, an SQS queue between them, DynamoDB tables for the dedup cache and the findings store, an S3 bucket for evidence, an SNS topic for alerts. Terraform manages all of it. The monthly bill is somewhere between eight and twenty dollars depending on sweep volume, because almost everything is pay-per-request and idle costs nothing.
Two architectural choices are worth calling out, because they reflect the threat model rather than the convenience of the implementation.
First, holds on claimed buckets are event-driven, not time-driven. If Portitor reclaims an orphaned bucket name to prevent an attacker from taking it over, the hold persists until the original vendor confirms the reclamation in writing. The seemingly obvious alternative, releasing the hold after a fixed timer like 48 hours, fails badly: any attacker watching the bucket name can just wait out the timer and grab it the moment the hold expires. Vendor triage takes weeks. Timers are an attacker convenience.
Second, secret detection is read-and-report only. If the validator finds something that looks like a credential, it is recorded with the credential redacted and a disclosure is drafted. The credential is never used to authenticate against any service, even to “verify” it works. That’s where security research ends and unauthorized access begins, and the line is a hard one in the code, not a guideline in the docs.
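Here is a sketch of what read-and-report means in practice: record that a credential-shaped string exists, keep only a redacted fragment as evidence, and never hand the raw value to any client. The regex is illustrative (AWS access key IDs follow the publicly documented `AKIA` + 16 character pattern); Portitor’s real detectors cover more shapes than this:

```python
import re

# AWS access key IDs: "AKIA" followed by 16 uppercase alphanumerics.
ACCESS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

def report_secrets(text: str) -> list[str]:
    """Return redacted evidence strings; the raw credential never leaves here."""
    evidence = []
    for match in ACCESS_KEY_RE.finditer(text):
        key = match.group(0)
        evidence.append(key[:4] + "…" + key[-4:])  # e.g. "AKIA…MNOP"
        # Deliberately no boto3 call and no validity check: "verifying"
        # a found credential is itself unauthorized access.
    return evidence
```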
Most of the rest of the system follows from those two stances. The classification severity bands, the routing table, the role of human review in every disclosure, the choice not to scrape and ingest content beyond what’s needed to confirm exposure: each one is downstream of “the tool’s job is to make exposures fixable, not to make them worse.”
The triage reality check
The first real sweep ran at 4:44 UTC on a Tuesday morning. Keyword: backup. Four discovery sources active. Six hundred twenty-one candidate bucket names queued for validation. Forty-five minutes later the pipeline had settled and the findings table held 459 records.
Of those, 437 were LOW, meaning buckets that exist but return 403 on unauthenticated access — basically “something is here but I’m not allowed to see it.” Those get logged and forgotten. Two were HIGH. The remaining twenty were CRITICAL, meaning the bucket is public, listable, and tagged with at least one signal that suggests orphaned or abandoned infrastructure. Twenty CRITICAL findings on the first day is a lot. A reasonable person would be pleased.
I was pleased for about five minutes.
Then I picked the first CRITICAL bucket off the list, a GCP bucket with a 14-character alphanumeric name, and made one anonymous listing request against the Google Cloud Storage JSON API. The response came back with HTTP 200 and five sample filenames. Four of the five were PDFs with titles like:
<plausible-looking author name> <plausible-looking book title> rack space library <4-random-chars>.pdf
The fifth file in the sample was called google0086990799c2b2b2.html. That is the format Google Search Console uses for domain-ownership verification files. Somebody had registered this bucket as a Search Console property. Somebody wanted this bucket indexed.
What I was looking at was not a misconfigured corporate bucket. It was one shard of a deliberately-public PDF piracy operation. The author names and book titles were SEO keyword stuffing. The gibberish suffix was to defeat exact-match takedowns. The Search Console file was so that Google would crawl the bucket and rank the PDFs in search results for anyone looking to download pirated ebooks. This was not a mistake. This was the product.
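For the curious: the probe itself is a single unauthenticated GET against Google Cloud Storage’s public JSON API, and the fingerprint is a filename check. A sketch, with the caveat that the helper names are mine and the 16-hex-character verification filename matches the file I saw; I wouldn’t assume it’s the only shape Google emits:

```python
import json
import re
import urllib.request

# Search Console domain-verification files look like google<hex>.html.
SEARCH_CONSOLE_RE = re.compile(r"google[0-9a-f]{16}\.html")

def list_sample(bucket: str, n: int = 5) -> list[str]:
    """Anonymous, read-only listing of up to n object names. HTTP 200 => listable."""
    url = f"https://storage.googleapis.com/storage/v1/b/{bucket}/o?maxResults={n}"
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    return [item["name"] for item in body.get("items", [])]

def has_search_console_fingerprint(names: list[str]) -> bool:
    """A verification file means someone *wants* this bucket crawled."""
    return any(SEARCH_CONSOLE_RE.fullmatch(n) for n in names)
```

One request, five filenames, and the classification flips from “forgotten” to “on purpose.”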
I went back to the CRITICAL list and started clustering. The 14-character alphanumeric naming pattern appeared on fourteen buckets. All of them were GCS. All were created within a two-week window in March 2017. All had the Search Console fingerprint. I probed one more to confirm, same pattern, same filename structure, and stopped. I had enough to know what the whole cluster was.
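The clustering was nothing fancier than grouping names by a coarse shape signature, along these lines (signature scheme is illustrative):

```python
import re
from collections import defaultdict

def shape(name: str) -> str:
    """Coarse signature: all 14-char alphanumeric names collapse to one key."""
    if re.fullmatch(r"[a-z0-9]{14}", name):
        return "alnum-14"
    return f"other-{len(name)}"

def cluster(names: list[str]) -> dict[str, list[str]]:
    groups = defaultdict(list)
    for n in names:
        groups[shape(n)].append(n)
    return dict(groups)
```

Fourteen names landing in one bucket of the signature map is what let me stop probing after the second confirmation instead of the fourteenth.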
Fourteen of my twenty CRITICAL findings were one piracy operator’s sharded content delivery network, abandoned years ago, still serving files to anyone who lands on them from a search engine. If I had auto-routed those to HackerOne, I would have been sending takedown requests for someone else’s spam operation to the wrong address. I would have wasted my own time, wasted the triage time of whichever bug bounty platform I submitted to, and lit my credibility on fire on day one.
The other six CRITICAL findings were genuinely interesting.
One was an S3 bucket with a naming pattern suggesting a backup destination. I probed it and found multiple gzipped SQL dumps, some with LastModified timestamps from earlier that same day. An active, automated backup pipeline was writing database dumps to a publicly-listable S3 bucket every six to eight hours. Whoever it belonged to was still using it. That was a real exposure, time-sensitive, and I had a responsible-disclosure email drafted and sent within about an hour of finding it.
Another was a GCS bucket whose name was a subdomain of a real SaaS company’s primary domain. The objects inside were video files with UUID filenames, the kind of thing you’d expect a user-generated-content platform to produce. The bucket was anonymously listable, which meant anyone who found the bucket name could enumerate and download user uploads. That one is also in the disclosure pipeline.
The remaining four were older, smaller, and either abandoned or low-impact. One of them had already been disclosed the week before through a parallel channel.
So the real day-one scorecard was different from what the classifier said:
| Classifier said CRITICAL | Actually was |
|---|---|
| 14 buckets | Intentional piracy hosting, not a disclosure case at all |
| 1 bucket | Active data exfiltration risk, disclosed same day |
| 1 bucket | Legitimate accidental exposure, disclosure drafted |
| 4 buckets | Older, lower-stakes, deferred or already handled |
Signal-to-noise: about 25%. Actionable disclosures from a single keyword sweep: two, maybe three. And the only reason I got to those two was that I spent ninety minutes manually probing buckets before I trusted anything the tool told me.
This is the part of the project that I wish I’d understood before I started, because it reframes what the tool is for. The classifier is not the product. The classifier is a filter that reduces six hundred twenty-one candidates down to twenty things a human needs to look at. The human is still the product.
The bugs I found by running it
The piracy-cluster problem from the last section is the interesting failure. It was a classification question, not a correctness question. The classifier did exactly what it was asked to do. The request was wrong.
The rest of what the tool did wrong in its first week was more mundane, and I found most of it by watching it run against real data rather than by looking at my tests.
- The dedup layer, whose entire job is to prevent the same bucket from being validated twice, was silently letting duplicates through across sweeps. Two days of operation would have grown the findings table by a factor of seven for no reason.
- A GCP validator rule that was supposed to auto-escalate any bucket with an allUsers IAM binding to CRITICAL was guarded by a condition that never fired in practice, meaning the flagship signal of the flagship provider was being downgraded. A test in the repo had been catching this bug for weeks. Nobody had run the suite.
- Seventy-nine messages wound up in the dead-letter queue after the first sweep because the ingestion layer was accepting bucket names from cloud providers I don’t actually support, then wasting fifteen minutes per message trying to validate them.
- The classifier’s “stale content” signal, which escalates findings whose objects haven’t been touched in six months, was quietly de-escalating the single most time-sensitive finding of the whole week: an active backup pipeline writing fresh database dumps to a public bucket every six hours. Freshness killed the stale signal. The bucket stayed HIGH when it should have been the loudest thing in the table.
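The stale-signal fix, to pick one, was essentially a precedence change: freshness on a public, listable bucket should be its own escalation path, not merely the absence of staleness. Roughly, with thresholds and names illustrative rather than the committed diff:

```python
from datetime import datetime, timedelta, timezone

def escalate(public_listable: bool, newest_object: datetime) -> str:
    """Severity for a validated bucket, after the fix: both tails escalate."""
    if not public_listable:
        return "LOW"
    age = datetime.now(timezone.utc) - newest_object
    if age > timedelta(days=180):
        return "CRITICAL"  # stale: nobody is watching this bucket
    if age < timedelta(days=1):
        return "CRITICAL"  # fresh: an automated pipeline is leaking right now
    return "HIGH"
```

Before the fix, only the stale branch existed, so a bucket receiving dumps every six hours could never reach the top band.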
None of these are complicated. All of them are fixable. One of them I fixed with a one-line change the same night I found it.
The lesson is not that my tool has bugs. Every tool has bugs. The lesson is that the only test harness that matters for a security tool is the open internet, and the only sampling strategy that works is running it and looking at what comes back. I had green unit tests and a broken production system at the same time. This is the normal state of things and pretending otherwise is the first step toward shipping something that’s confidently wrong.
Calibration matters more than coverage. A tool that surfaces seven hundred candidates and is right about only a hundred of them is less useful than a tool that surfaces fifty candidates and is right about forty-five.
You find out which one you’ve built by running it.
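Using the day-one numbers, the calibration cost is easy to make concrete: twenty CRITICAL flags, roughly ninety minutes of manual probing (about 4.5 minutes per flag), six findings worth acting on.

```python
def minutes_per_actionable_finding(surfaced: int, actionable: int,
                                   review_minutes_each: float) -> float:
    """Human review cost per finding that actually leads to a disclosure."""
    return surfaced * review_minutes_each / actionable

# Day one: 20 flags * 4.5 min / 6 real exposures = 15 minutes per real finding.
```

That number, not the raw finding count, is what a better classifier would actually drive down.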
About the AI
I built Portitor with an LLM as a full-time pair programmer. I want to be straightforward about what that meant, because the temptation in a piece like this is to either hide it or oversell it, and both are misleading.
I made every decision that mattered. The threat model is mine. The hard rules, the ones that say things like “never authenticate against any service with a credential found during scanning” and “claimed bucket holds persist until the vendor confirms reclamation,” are mine. The call to manually verify every CRITICAL finding before routing it anywhere is mine. The call, on the night the tool first surfaced the piracy cluster, to stop and ask “wait, is this what I think it is?” is mine. None of those were the LLM’s instinct. An LLM, left to its own defaults, would have happily drafted fourteen polished bug bounty reports for the piracy operator.
The LLM helped me write most of the code. It is very good at this. Given a clear spec, a defined interface, and the context of the surrounding system, it produces working Python faster than I can type. It catches edge cases in Terraform that I would have missed on a first pass. It suggests patterns I hadn’t considered, some of which I adopt and some of which I reject. When I describe a classification problem in prose, it translates the prose into a test suite that usually covers cases I didn’t think to specify.
It also got things wrong. More than once, it inferred behavior from log output that didn’t match the actual code. It recommended deleting infrastructure that was fixable. It confidently described architectural choices that existed only in my notes, not in the deployed system. Each of those was caught because I was reading what it produced, running it against reality, and pushing back when the output didn’t match what I knew the system was doing. A reviewer without that context would not have caught them.
The LLM is a spec amplifier. On a well-specified task, I ship five to ten times faster than I would have alone. On a poorly-specified task, I ship garbage five to ten times faster than I would have alone.
The difference lives entirely in how carefully I’m thinking about what I’m asking for, and in my willingness to read every line of what comes back before I trust it.
If you are hiring engineers in 2026, the skill to look for is not “can use an LLM.” Most candidates can. The skill to look for is “knows when the LLM is wrong and has the engineering judgment to override it.” That skill is far more discernible in a portfolio piece than in a résumé.
What I’m not publishing, and what comes next
Several of the findings I described earlier are in active responsible-disclosure windows with real vendors. I have not named those vendors, and I will not until either the disclosure windows close or the vendors consent. The same applies to the specific bucket names, the content patterns I used to fingerprint the piracy operator, and the particular signals in the secret-detection pipeline that I’d rather not hand to anyone building the opposite kind of tool. The instinct for a portfolio piece is to show everything. The instinct for a responsible disclosure program is to show almost nothing. This article is trying to sit in the middle of those, leaning toward the second.
Portitor is still early. The classifier needs to distinguish abuse hosting from accidental exposure as a first-class concept rather than a manual-review step. The dedup layer needs a real fix. The scheduled sweep pipeline needs to actually run discovery instead of no-op. I have a list of about a dozen more items, each filed, each understood, each fixable. I’ll work through them over the next few weeks as the disclosure queue gives me time.
I am not planning to release the source code publicly at this point. Some of the machinery, particularly the discovery layer and the fingerprinting patterns, is more useful to an attacker than to a defender in its current form. A sanitized version may come later. If you are a security team inside a company that thinks you might be in the findings table and want to compare notes, get in touch. If you are a researcher building something similar and want to trade war stories, also get in touch. If you are hiring security engineers or are building a team where the work looks like this, especially get in touch.
The name Portitor is Latin for “ferryman”: the title Roman poets gave Charon, who carried the dead across the river Styx. The tool’s job is to find resources that should have been retired, take custody of them safely when I can, and route them to the places where the right people can fix them. Day one taught me that the ferryman’s first skill is telling the living from the dead. I’m still learning.