Build a company-logos-as-a-service startup
While working on a project, I had to write a program to crawl the logo images of various companies. I automated this with a Puppeteer script written in TypeScript. The script did the job for me, extracting logos for 100-odd companies, but I have to warn you that I haven’t tested it beyond that.
You can install the required packages with the following commands,
npm i puppeteer puppeteer-extra-plugin-stealth --save
npm i reflect-metadata node-fetch puppeteer-extra --save
Then,
import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";
import fetch from "node-fetch";
import fs from "fs";
export async function startLogoDownloader() {
  const companyName = "Google";
  puppeteer.use(StealthPlugin());
  const browser = await puppeteer.launch({ headless: false });
  const browserContext = await browser.createIncognitoBrowserContext();
  const page = await browserContext.newPage();
  const jobUrl = "https://google.com";
  try {
    // Search Google for the company's Facebook page
    await page.goto(jobUrl, { waitUntil: "networkidle2" });
    await page.type("input[type=text]", `facebook page of ${companyName} company`, { delay: 40 });
    await page.keyboard.press("Enter");
    await page.waitForSelector("div#search", { timeout: 10000 });
    await page.waitForSelector('a[href^="https://www.facebook.com"]', { timeout: 10000 });
    // Collect every Facebook link in the search results
    const urls = await page.evaluate(() => {
      const links = document.querySelectorAll<HTMLAnchorElement>('#search a[href^="https://www.facebook.com"]');
      return Array.from(links).map((a) => a.href);
    });
    if (urls.length === 0) {
      return;
    }
    // Take the first result
    await page.goto(urls[0], { waitUntil: "networkidle2" });
    await page.waitForSelector('a[aria-label="Profile picture"]', { timeout: 10000 });
    const imageUrl = await page.evaluate(() => {
      const element = document.querySelector('a[aria-label="Profile picture"]').querySelector("img");
      return element.src;
    });
    console.log(`The image url is ${imageUrl}`);
    // Download the image and write it to disk in one call
    const resp = await fetch(imageUrl);
    const buffer = await resp.buffer();
    const path = `./${companyName}.png`;
    fs.writeFileSync(path, buffer);
  } catch (e) {
    console.log(`Error(s) occurred for ${companyName} : ${e.toString()}`);
  } finally {
    // Closing the browser also closes its pages and contexts,
    // and runs on an early return or an error as well
    await browser.close();
  }
}
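The script above saves every image with a .png extension, but the profile picture may not actually be a PNG. As a minimal sketch, the helper below picks the extension from the response's Content-Type header instead. The name extensionFromContentType and the list of MIME types are my own assumptions for illustration, not part of the original script.

```typescript
// Hypothetical helper: maps an image Content-Type header value to a file extension.
// Falls back to "png", which is what the script always uses.
function extensionFromContentType(contentType: string | null): string {
  const known: Record<string, string> = {
    "image/png": "png",
    "image/jpeg": "jpg",
    "image/gif": "gif",
    "image/webp": "webp",
    "image/svg+xml": "svg",
  };
  // Drop parameters such as "; charset=utf-8" and normalise the case.
  const mimeType = (contentType ?? "").split(";")[0].trim().toLowerCase();
  return known[mimeType] ?? "png";
}
```

Inside the script, this could replace the hard-coded extension by passing in resp.headers.get("content-type") when building the path.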
In the file logodownloader.ts, you can invoke this script as follows:
import "reflect-metadata";
import { startLogoDownloader } from "../common/logo_downloader";
(async () => {
  await startLogoDownloader();
})();
TODO
- You can either store an image in an S3 bucket and keep its path in the database, or store it in the database itself as a blob. Each approach has its pros and cons. Every few years this discussion comes up on DB mailing lists as “To blob or Not to blob”. You can google that expression. :)
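To make the two options concrete, here is a rough sketch of what a stored record might look like under each approach. All names here (LogoRecord, objectKeyFor, toRecord, the logos/ key prefix) are hypothetical, and the actual upload to S3 or insert into the database is deliberately left out.

```typescript
// Option 1: keep the image in object storage and store only its key in the DB.
interface LogoByReference {
  kind: "reference";
  company: string;
  objectKey: string; // e.g. a key inside an S3 bucket
}

// Option 2: store the image bytes directly in a BLOB column.
interface LogoAsBlob {
  kind: "blob";
  company: string;
  data: Uint8Array;
}

type LogoRecord = LogoByReference | LogoAsBlob;

// Builds a URL-safe object key from the company name (hypothetical key layout).
function objectKeyFor(company: string, extension = "png"): string {
  const slug = company
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything non-alphanumeric into "-"
    .replace(/^-+|-+$/g, ""); // strip leading and trailing "-"
  return `logos/${slug}.${extension}`;
}

// Builds the record to persist under the chosen strategy.
function toRecord(company: string, image: Uint8Array, useBlob: boolean): LogoRecord {
  return useBlob
    ? { kind: "blob", company, data: image }
    : { kind: "reference", company, objectKey: objectKeyFor(company) };
}
```

With the reference option the database row stays small and the image lives elsewhere; with the blob option everything sits in one place, at the cost of larger rows.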
Key takeaways from the script:
- I have hard-coded the company name, “Google”, to demonstrate the example. You can change the function to accept an array of companies and loop over it.
- Currently, there is no mechanism in the code to retry failed companies.
- You can attach this to npm run logo_download by invoking the file logodownloader.ts from a script entry in package.json.
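The first two points could be addressed together. The sketch below loops over an array of companies and retries each failed one a few times. It assumes a variant of startLogoDownloader that takes the company name as a parameter and lets errors propagate instead of only logging them. The names LogoDownloader, downloadAll, maxAttempts and pauseMs are made up for this example.

```typescript
// Hypothetical signature: downloads one company's logo and rejects on failure.
type LogoDownloader = (companyName: string) => Promise<void>;

interface DownloadReport {
  succeeded: string[];
  failed: string[];
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs the downloader for each company in turn, retrying a failed company
// up to maxAttempts times with a fixed pause between attempts.
async function downloadAll(
  companies: string[],
  download: LogoDownloader,
  maxAttempts = 3,
  pauseMs = 1000
): Promise<DownloadReport> {
  const report: DownloadReport = { succeeded: [], failed: [] };
  for (const company of companies) {
    let done = false;
    for (let attempt = 1; attempt <= maxAttempts && !done; attempt++) {
      try {
        await download(company);
        done = true;
      } catch (e) {
        console.log(`Attempt ${attempt} failed for ${company}: ${String(e)}`);
        if (attempt < maxAttempts) {
          await sleep(pauseMs);
        }
      }
    }
    (done ? report.succeeded : report.failed).push(company);
  }
  return report;
}
```

The companies are processed one at a time on purpose, since each run opens a real browser window. The returned report lists the companies that still failed after all attempts, so they can be rerun later.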