Closed Bug 1536147 Opened 5 years ago Closed 5 years ago

make archivescraper faster

Categories

(Socorro :: General, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

Details

Attachments

(1 file)

I wrote archivescraper to scrape version information for betaversion lookups. It's based on ftpscraper and when I wrote it, I wanted to stay as close to ftpscraper as I could so as to make the jump from one to the other as small as possible.

archivescraper takes a while to run. In a fresh local dev environment, it can take 20+ minutes. That's irritating and a time sink, since I run it at least once a week.

Relatedly, I wrote a verifyprocessed job that uses multiprocessing to significantly reduce the time it takes to run.

archivescraper has similar properties: the bulk of its run time goes to traversing links on a website, which is predominantly slow HTTP conversations. That's pretty ideal for multiprocessing with lots of workers.

This bug covers taking what I did with verifyprocessed and applying it to archivescraper.
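For illustration, here is a rough sketch of the general idea: fan the slow HTTP fetches out to a pool of workers and collect the links each page yields. It is not the actual archivescraper or verifyprocessed code. The names `LinkParser`, `extract_links`, `fetch_links`, and `scrape_level`, the 20-worker default, and the 30-second timeout are all assumptions made up for this example.

```python
import multiprocessing
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect href values from <a> tags on a directory-listing page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for every link found in html."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def fetch_links(url):
    """Worker (hypothetical): download one page and return (url, links)."""
    with urlopen(url, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return url, extract_links(url, html)


def scrape_level(urls, worker=fetch_links, num_workers=20,
                 pool_factory=multiprocessing.Pool):
    """Fetch all urls concurrently and return {url: links}.

    The worker must be a module-level function so it can be pickled
    and sent to the worker processes.
    """
    with pool_factory(num_workers) as pool:
        # imap_unordered yields results as workers finish, so one slow
        # page does not hold up the pages that have already completed.
        return dict(pool.imap_unordered(worker, urls))
```

Since the work is mostly waiting on the network, the worker count can be well above the number of CPU cores. On platforms that start workers with spawn, the call to `scrape_level` would need to sit under an `if __name__ == "__main__":` guard. Passing `multiprocessing.dummy.Pool` as `pool_factory` swaps in a thread-backed pool with the same interface, which is handy for trying the sketch out without starting processes.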

The last fresh run I did took 40 minutes.

Grabbing this to tinker with today. I think it's straightforward except for error handling and reporting. That's a bit trickier.
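One possible way to handle the error handling and reporting side, sketched under assumptions: have each worker catch its own exceptions and return them as data, so a single bad page does not abort the whole run and the parent process can report all failures in one place. `Result`, `safe_call`, `summarize`, and the logger name are hypothetical names for this example, not anything from the actual patch.

```python
import logging
from collections import namedtuple

logger = logging.getLogger("archivescraper_sketch")  # hypothetical logger name

# One record per URL: either value is set or error is set.
Result = namedtuple("Result", ["url", "value", "error"])


def safe_call(func, url):
    """Run func(url) and capture any exception instead of raising.

    The error is stored as a string because not every exception
    object pickles cleanly on its way back to the parent process.
    """
    try:
        return Result(url, func(url), None)
    except Exception as exc:
        return Result(url, None, f"{type(exc).__name__}: {exc}")


def summarize(results):
    """Split results into successes and failures and log each failure."""
    ok = {r.url: r.value for r in results if r.error is None}
    failed = {r.url: r.error for r in results if r.error is not None}
    for url, err in sorted(failed.items()):
        logger.error("failed to scrape %s: %s", url, err)
    return ok, failed
```

With a process pool, something like `functools.partial(safe_call, some_worker)` can be passed as the mapped function, provided `some_worker` is defined at module level. The parent then calls `summarize` on the collected results and decides whether any failures should make the whole job fail.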

Assignee: nobody → willkg
Status: NEW → ASSIGNED
Priority: -- → P2

This has been running on stage for a while and it's significantly faster. Yay!

We just pushed this to prod. Marking as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED