I built this because I couldn't find honest numbers on how well VLA models [1] actually work on commercial tasks. I come from search ranking at Google, where you measure everything, and in robotics nobody seemed to know.

PhAIL runs four models (OpenPI/pi0.5, GR00T, ACT, SmolVLA) on bin-to-bin order picking – one of the most common warehouse operations. Same robot (Franka FR3), same objects, hundreds of blind runs. The operator doesn't know which model is running.

Best model: 64 UPH (units per hour). Human teleoperating the same robot: 330. Human working by hand: 1,300+.

Everything is public – every run with synced video and telemetry, the fine-tuning dataset, training scripts. The leaderboard is open for submissions.

Happy to answer questions about methodology, the models, or what we observed.

[1] Vision-Language-Action: https://en.wikipedia.org/wiki/Vision-language-action_model


Comments URL: https://news.ycombinator.com/item?id=47589797

Points: 12

# Comments: 8



from Hacker News: Front Page https://phail.ai

Hey HN, we're Jon and Kristiane, and we're building Orloj (https://orloj.dev), an open-source (Apache 2.0) orchestration runtime for multi-agent AI systems. You define agents, tools, policies, and workflows in declarative YAML manifests, and Orloj handles scheduling, execution, governance, and reliability.

We built this because running AI agents in production today looks a lot like running containers before Kubernetes: ad-hoc scripts, no governance, no observability, no standard way to manage the lifecycle of an agent fleet. Everyone we talked to was writing the same messy glue code to wire agents together, and nobody had a good answer for "which agent called which tool, and was it supposed to?"

Orloj treats agents the way infrastructure-as-code treats cloud resources. You write a manifest that declares an agent's model, tools, permissions, and execution limits. You compose agents into directed graphs — pipelines, hierarchies, or swarm loops.
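To make the manifest idea concrete, here is a hypothetical sketch of what such a declaration could look like. The field names and layout are invented for illustration (loosely following common infrastructure-as-code conventions), not Orloj's actual schema:

```yaml
# Hypothetical sketch only -- illustrative field names, not Orloj's real schema.
kind: Agent
metadata:
  name: research-assistant
spec:
  model: gpt-4o            # model the agent is allowed to call
  tools:
    - web-search
    - summarize
  permissions:
    allowNetwork: true
  limits:
    maxTokensPerRun: 50000
    maxTurns: 12
```

The appeal of this shape is the same as with cloud resources: the manifest is diffable, reviewable, and versioned alongside the rest of your infrastructure.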

The part we're most excited about is governance. AgentPolicy, AgentRole, and ToolPermission are evaluated inline during execution, before every agent turn and tool call. Instead of prompt instructions that the model might ignore, these policies are a runtime gate. Unauthorized actions fail closed with structured errors and full audit trails. You can set token budgets per run, whitelist models, block specific tools, and scope policies to individual agent systems.
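As a rough illustration of what an inline-enforced policy could express (the resource kinds are from the description above, but every field name here is invented for the sketch):

```yaml
# Hypothetical sketch -- AgentPolicy is named in the post; the fields are invented.
kind: AgentPolicy
metadata:
  name: restrict-billing-agents
spec:
  allowedModels: [gpt-4o-mini, claude-sonnet]   # model whitelist
  blockedTools: [shell-exec]                    # fail closed on these
  tokenBudgetPerRun: 20000
  scope:
    agentSystem: billing-pipeline               # applies to one system only
```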

For reliability, we built lease-based task ownership (so crashed workers don't leave orphan tasks), capped exponential retry with jitter, idempotent replay, and dead-letter handling. The scheduler supports cron triggers and webhook-driven task creation.
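Capped exponential retry with jitter is a standard reliability pattern; here is a minimal self-contained Python sketch of it (a generic illustration, not Orloj's actual implementation):

```python
import random
import time

def retry_with_jitter(fn, max_attempts=5, base=0.5, cap=30.0):
    """Capped exponential backoff with full jitter.

    Sleeps a random duration in [0, min(cap, base * 2**attempt)] between
    attempts, which keeps many workers from retrying in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the error (e.g. to a dead-letter queue)
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)

# Example: a flaky task that succeeds on the third call.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(retry_with_jitter(flaky, base=0.01))  # prints "ok" after two retried failures
```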

The architecture is a server/worker split. orlojd hosts the API, resource store (in-memory for dev, Postgres for production), and task scheduler. orlojworker instances claim and execute tasks, route model requests through a gateway (OpenAI, Anthropic, Ollama, etc.), and run tools in configurable isolation — direct, sandboxed, container, or WASM. For local development, you can run everything in a single process with orlojd --embedded-worker --storage-backend=memory.

Tool isolation was important to us. A web search tool probably doesn't need sandboxing, but a code execution tool should run in a container with no network, a read-only filesystem, and a memory cap. You configure this per tool based on risk level, and the runtime enforces it.
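The per-tool risk configuration described above might be declared along these lines (again a hypothetical sketch: the four isolation modes are from the post, the field names are invented):

```yaml
# Hypothetical sketch -- isolation modes from the post; field names invented.
kind: Tool
metadata:
  name: code-exec
spec:
  isolation: container     # one of: direct, sandboxed, container, wasm
  container:
    network: none          # no network access
    filesystem: read-only
    memoryLimit: 256Mi
---
kind: Tool
metadata:
  name: web-search
spec:
  isolation: direct        # low-risk tool, no sandboxing
```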

We also added native MCP support. You register an MCP server (stdio or HTTP), Orloj auto-discovers its tools, and they become first-class resources with governance applied. So you can connect something like the GitHub MCP server and still have policy enforcement over what agents are allowed to do with it.

Three starter blueprints are included (pipeline, hierarchical, swarm-loop).

Docs: https://docs.orloj.dev

We're also building out starter templates for operational workflows where governance really matters. First on the roadmap: 1. Incident response triage, 2. Compliance evidence collector, 3. CVE investigation pipeline, and 4. Secret rotation auditor. We have 20 templates in mind and community contributions are welcome.

We're a small team and this is v0.1.0, so there's a lot still on the roadmap — hosted cloud, compliance packaging, and more. But the full runtime is open source today and we'd love feedback on what we've built so far. What would you use this for? What's missing?


Comments URL: https://news.ycombinator.com/item?id=47526813

Points: 6

# Comments: 1



from Hacker News: Front Page https://ift.tt/XZnqo10

About an hour ago, new litellm versions were deployed to PyPI.

I was just setting up a new project, and things behaved weirdly. My laptop ran out of RAM; it looked like a fork bomb was running.

I investigated and found that a base64-encoded blob had been added to proxy_server.py.

It decodes and writes out another file, which it then runs.
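For anyone who wants to check their own installs, here is a quick and deliberately crude Python sketch that flags long base64-looking literals in a package's source files. This is a generic heuristic, not how the issue was found, and real malware can of course evade such a simple pattern:

```python
import base64
import re
import tempfile
from pathlib import Path

# Heuristic: flag base64-looking literals of 200+ chars in Python source.
B64_RE = re.compile(r"[A-Za-z0-9+/]{200,}={0,2}")

def suspicious_blobs(root):
    """Yield (file, preview) for long base64-ish strings that decode cleanly."""
    for py in Path(root).rglob("*.py"):
        text = py.read_text(errors="ignore")
        for m in B64_RE.finditer(text):
            try:
                base64.b64decode(m.group(0))
            except Exception:
                continue  # not valid base64 after all
            yield str(py), m.group(0)[:40] + "..."

# Demo on a throwaway directory with one planted blob.
demo = Path(tempfile.mkdtemp())
(demo / "clean.py").write_text("print('hello')\n")
(demo / "planted.py").write_text(
    "BLOB = '" + base64.b64encode(b"x" * 300).decode() + "'\n"
)
for path, preview in suspicious_blobs(demo):
    print(path, preview)  # flags planted.py only
```

In practice you would point this at your environment's site-packages directory.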

I'm in the process of reporting this upstream, but wanted to give everyone here a heads-up.

It is also reported in this issue: https://github.com/BerriAI/litellm/issues/24512


Comments URL: https://news.ycombinator.com/item?id=47501426

Points: 109

# Comments: 292



from Hacker News: Front Page https://ift.tt/QuWrjMz

I know that HN isn't a customer support forum and it might not be right to post this here, but we are absolutely desperate and hoping someone in this community can point us in the right direction.

We are a small software company in Africa. For over two years, we've built and maintained an app. It has become a vital economic engine for our local community, employing a whole fleet of delivery agents and serving as a lifeline for local stores and restaurants.

Recently, we discovered that a single employee used a shared company machine to engage in unauthorized activities that violated Apple's Developer Terms of Service.

We took immediate action: we fired the employee on the spot and completely overhauled our security. We revoked all individual access and implemented mandatory, peer-reviewed, supervised sessions for any Apple Developer portal access.

The problem is the collateral damage. Apple terminated our entire organization's account. We submitted an appeal through App Store Connect, but we feel completely stuck behind automated walls. We have also emailed Apple executives, but are waiting in the dark.

Because of this one employee's actions, our app is facing total removal, and families in our community are quite literally losing their daily income. We aren't asking for special treatment, just a chance for a real human at App Review to look at the security steps we've taken and consider a second chance.

If anyone here has been through this, has advice, or knows how to get a human at Apple to actually read our appeal, our entire community would be forever grateful. Thank you so much for your time.

(For reference if any Apple folks are reading: our Apple Team ID is T35TM9SW45)


Comments URL: https://news.ycombinator.com/item?id=47479115

Points: 26

# Comments: 0



from Hacker News: Front Page https://ift.tt/4zqIy31

I run a building design consultancy. I got tired of paying Wix $40/month for a brochure site that couldn’t answer simple service questions, and of wasting hours myself on the same FAQs.

So I killed it all and spent 4 months building a 'talker': https://axoworks.com

The stack is completely duct-taped. Netlify’s 10-second serverless timeout forced me to split the agent into three pieces: Brain (Edge), Hands (Browser), and Voice (Edge). I haven’t coded in 30 years; this was 3 steps forward, 2 steps back, heavily guided by AI.

The fight that proved it worked: 2 weeks ago, a licensed architect attacked the bot, trying to prove my business model harms the profession. The AI (DeepSeek-R3) completely dismantled his arguments. It was hilariously caustic.

Log: https://logs.axoworks.com/chat-architect-vs-concierge-v147.h...

A few battle scars:

* Web Speech API works fine right up until someone speaks Chinese without toggling the language mode; then it spits out English phonetic gibberish. Still a headache.

* Liability is the killer. Hallucinate a building code clause? We’re dead. Insurance won’t touch us.

* We publish the audit logs to keep ourselves honest and make sure the system stays hardened.

Audit: https://logs.axoworks.com/audit-2026-02-19-v148.html

The hardest part was getting the intent right: making one LLM pivot seamlessly from a warm principal’s tone with a homeowner to a defensive bulldog when attacked by a peer. That took 2.5 months of tuning.

We burn through tokens with an 'Eager RAG' hack (pre-fetching guesses) just to improve responsiveness. I also ripped out the “essential” persistent DBs—less than 5% of visitors ever return, so why bother? If a client drops mid-query, their session vanishes. No server-side queues.
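The "Eager RAG" idea, speculatively prefetching retrievals for guessed queries so the answer is already warm, can be sketched generically. Everything below is hypothetical illustration, not the author's code:

```python
from concurrent.futures import ThreadPoolExecutor

def retrieve(query):
    # Stand-in for a real vector-store or document lookup.
    return f"docs for: {query}"

class EagerRAG:
    """Speculatively retrieve for guessed queries, trading tokens/work for latency."""

    def __init__(self):
        self.pool = ThreadPoolExecutor(max_workers=4)
        self.inflight = {}

    def prefetch(self, guesses):
        """Kick off retrieval for likely queries while the user is still typing."""
        for q in guesses:
            if q not in self.inflight:
                self.inflight[q] = self.pool.submit(retrieve, q)

    def answer(self, query):
        """Use the warm result if we guessed right, else fetch on demand."""
        fut = self.inflight.pop(query, None)
        return fut.result() if fut else retrieve(query)

rag = EagerRAG()
rag.prefetch(["planning permission cost", "planning permission timeline"])
print(rag.answer("planning permission cost"))  # served from the prefetch
```

Wrong guesses are simply wasted retrievals, which matches the "burn through tokens" trade-off described above.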

The point: To let me operate with a network of seasoned pros, and trim the fat.

Try to break it. I’ll be in the comments. Kee


Comments URL: https://news.ycombinator.com/item?id=47441587

Points: 7

# Comments: 9



from Hacker News: Front Page https://ift.tt/Xq41rGN

Why I built Skir: https://medium.com/@gepheum/i-spent-15-years-with-protobuf-t...

Quick start: npx skir init

All the config lives in one YAML file.

Website: https://skir.build

GitHub: https://github.com/gepheum/skir

Would love feedback, especially from teams running mixed-language stacks.


Comments URL: https://news.ycombinator.com/item?id=47299022

Points: 10

# Comments: 7



from Hacker News: Front Page https://skir.build/