Ask HN: Is there a market for agentic scraping tools?

3 points by mxfeinberg a day ago

As a longtime data scientist and engineer, I've had to write a couple of quick-and-dirty scrapers and bots over the years using Selenium and, more recently, Playwright. I haven't been tracking it closely, but I've also been reading about the crawl4ai project.

With the explosion of AI agents, I've been playing around with building agentic scrapers that can simply be given a prompt and a target site and return structured data in a specified format. I've also been experimenting with an extra step in which a separate model attempts to define the structured format dynamically.

However, as with most AI projects, the token consumption can scale pretty aggressively.

Has anyone else been working on similar projects? Would people realistically pay $0.025 to $0.03 per request?
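
To make the pricing question concrete, here is a rough back-of-the-envelope sketch. Only the $0.025 to $0.03 price range comes from the post; the token counts and per-million-token prices are placeholder assumptions, not any provider's real rates, and `request_cost` and `break_even_input_tokens` are hypothetical helper names.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float) -> float:
    """LLM cost in USD for one scrape request, given per-million-token prices."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


def break_even_input_tokens(price_usd: float, output_tokens: int,
                            usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Input-token budget at which the LLM cost equals the price charged."""
    output_cost = output_tokens * usd_per_m_output / 1_000_000
    return (price_usd - output_cost) * 1_000_000 / usd_per_m_input


# Placeholder assumptions: a 20k-token page, a 1k-token structured answer,
# $1 per million input tokens and $4 per million output tokens.
cost = request_cost(20_000, 1_000, usd_per_m_input=1.0, usd_per_m_output=4.0)
margin_low = 0.025 - cost   # margin at the low end of the proposed price
margin_high = 0.03 - cost   # margin at the high end
```

Under these made-up numbers a single large page already eats most of the low-end price, and any retry or multi-step agent loop multiplies the token count, so it is worth plugging in real measurements before settling on a per-request price.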

PaulHoule a day ago

I've been building those since 1999.

One of the weird anomalies I've been following is that people consistently overestimate how hard scraping is. In fact, the horrible difficulties that people have developing GUI applications work in favor of scraping: even though your boss is afraid that the target site is going to change and you're going to have to maintain your scraper, the boss of the guy who maintains that site is afraid that projects to change it will get bogged down, and besides, if they change anything it will tank their SEO.

I am amazed at the poor judgement that scraper developers seem to have. At work I work on a project where you can go to

   https://our.site/item/39349109
and get really nice structured JSON and you can even add one to that number and get another valid URL. Instead people use something that downloads our React application without a cache and probably scrapes the DOM produced by it to make something like the JSON they could get just by asking.
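
For illustration, the "just ask for the JSON" approach might look roughly like this. It is only a sketch: the URL pattern is the one from the example above, while `item_url`, `neighbouring_urls`, `fetch_item`, and the assumption that the response body is a UTF-8 JSON object are hypothetical.

```python
import json
from urllib.request import urlopen

# URL pattern from the example above; everything else here is assumed.
BASE_URL = "https://our.site/item/{item_id}"


def item_url(item_id: int) -> str:
    """Build the URL for one item ID."""
    return BASE_URL.format(item_id=item_id)


def neighbouring_urls(start_id: int, count: int) -> list[str]:
    """URLs for start_id, start_id + 1, ... (the 'add one to that number' trick)."""
    return [item_url(start_id + offset) for offset in range(count)]


def fetch_item(item_id: int, opener=urlopen, timeout: float = 10.0) -> dict:
    """Fetch one item and parse the body as JSON.

    `opener` is injectable so the function can be exercised without network access.
    """
    with opener(item_url(item_id), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

No headless browser and no DOM parsing are involved, which is the whole point: one small HTTP request per item instead of a full React page load.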

Now... Web crawlers are totally vibe codable and people aren't intimidated anymore and are discovering just how easy it is. To be fair, in 2025 it should be possible to extract facts out of unstructured text with LLMs (at great expense), but you can get structured data out of most web sites with CSS selectors.
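
As a rough illustration of selector-based extraction, here is a stdlib-only sketch. It is a deliberately tiny stand-in that understands only a single `tag.class` selector such as `span.price`; a real scraper would more likely use a library with full CSS selector support (Beautiful Soup's `select()`, for example). The class name `SimpleSelector`, the helper `select_text`, and the sample HTML are all made up for the example.

```python
from html.parser import HTMLParser


class SimpleSelector(HTMLParser):
    """Collect the text of every element matching one `tag.class` selector."""

    def __init__(self, selector: str):
        super().__init__()
        self.tag, self.cls = selector.lower().split(".", 1)
        self.depth = 0       # > 0 while inside a matching element
        self.results = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Already inside a match: only track nested tags of the same name
            # so we know which end tag closes the matched element.
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.cls in classes:
            self.depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self.depth:
            self._buf.append(data)


def select_text(html: str, selector: str) -> list[str]:
    """Return the stripped text of each element matching `selector`."""
    parser = SimpleSelector(selector)
    parser.feed(html)
    parser.close()
    return parser.results


SAMPLE = ('<ul><li class="item"><span class="price">$10</span></li>'
          '<li class="item sale"><span class="price"> $7 </span></li></ul>')
prices = select_text(SAMPLE, "span.price")
```

The extraction costs no tokens at all, which is why selectors remain the cheap default and an LLM is better reserved for pages where no stable selector exists.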

I've frequently had the experience of, "I could use an API that gives me 80% of what I want with a really low rate limit if I debug their buggy OAuth implementation" vs "I can change three lines from my Flickr scraper I wrote in 2009 and it just works"

What's bothering me though is that those Cloudflare nag screens that used to be performative are really starting to screw up my crawlers [1]... and the people who are slow on the draw are waking up to the dangers of web crawlers 25 years after the cool kids did. So it is getting a lot harder, which is too bad, because Cloudflare is really locking in the Google monopoly and slamming the door in front of those trying to escape the enshittification economy.

[1] could tell you what I am doing about it but then I'd have to kill you

heldrida a day ago

If you're concerned about the costs, you could provide the process/service but require clients to provide their own LLM token. With that being said, you'd have to rethink your service charge.
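
One way to picture the bring-your-own-token idea is a service that never holds an LLM key itself and instead receives a callable the client has built around their own account. This is a sketch under stated assumptions: `LLMCall`, `ScrapeJob`, `run_job`, and the flat fee value are hypothetical names and placeholders, not part of any real product or API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any function that sends a prompt to the client's own
# LLM account and returns the completion text.
LLMCall = Callable[[str], str]


@dataclass
class ScrapeJob:
    url: str
    instructions: str
    llm: LLMCall                    # built by the client around their own key
    service_fee_usd: float = 0.005  # placeholder flat fee, not a recommendation


def run_job(job: ScrapeJob, page_text: str) -> dict:
    """Run extraction on already-fetched page text.

    Token usage is billed to whoever owns the key behind `job.llm`,
    so the service only charges its flat fee.
    """
    prompt = f"{job.instructions}\n\nPAGE:\n{page_text}"
    answer = job.llm(prompt)
    return {"url": job.url, "result": answer, "fee_usd": job.service_fee_usd}
```

With this split the service's price no longer has to cover unpredictable token spend, which is the rethink of the service charge mentioned above.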