Vehicle Suggestions

October 1, 2021 · 14 min read

Principal Developer

/img/content-blog-raw-blog-vehicle-suggestions-untitled.png

Introduction

The customer owns a franchise store that sells Tesla automobiles. The objective is to predict a user's vehicle preferences from social media data.

Task 1 - Suggest the best vehicle for the given description

Task 2 - Suggest the best vehicle for the given social media id of the user

Customer queries

// car or truck or no mention of vehicle type means Cyber Truck
// SUV mention means Model X
const one = "I'm looking for a fast suv that I can go camping without worrying about recharging";
const two = "cheap red car that is able to go long distances";
const three = "i am looking for a daily driver that i can charge everyday, do not need any extras";
const four = "i like to go offroading a lot on my jeep and i want to do the same with the truck";
const five = "i want the most basic suv possible";
const six = "I want all of the addons";
// mentions of large family or many people means model x
const seven = "I have a big family and want to be able to take them around town and run errands without worrying about charging";
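
The comments in the queries above describe simple rules for picking the vehicle. A minimal Python sketch of those rules follows; the keyword patterns and the function name `suggest_vehicle` are assumptions made for illustration, not the project's actual matching logic.

```python
import re

# Keyword patterns are assumptions for illustration; extend them as needed.
SUV_PATTERN = re.compile(r"\bsuv\b", re.IGNORECASE)
FAMILY_PATTERN = re.compile(r"\b(big|large) family\b|\bmany people\b", re.IGNORECASE)


def suggest_vehicle(query: str) -> str:
    """Return 'Model X' for SUV or large-family queries, else 'Cyber Truck'."""
    if SUV_PATTERN.search(query) or FAMILY_PATTERN.search(query):
        return "Model X"
    # Car, truck, or no vehicle type mentioned
    return "Cyber Truck"


print(suggest_vehicle("i want the most basic suv possible"))
print(suggest_vehicle("I want all of the addons"))
```

Rules like these only decide the vehicle; trim, range and packages still need the similarity-based matching described later.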

Expected output

const oneJSON = {
vehicle: 'Model X',
trim : 'adventure',
exteriorColor: 'whiteExterior',
wheels: "22Performance",
tonneau: "powerTonneau",
packages: "",
interiorAddons: "",
interiorColor: "blackInterior",
range: "extendedRange",
software: "",
}

const twoJSON = {
vehicle: 'Cyber Truck',
trim : 'base',
exteriorColor: 'whiteExterior',
wheels: "21AllSeason",
tonneau: "powerTonneau",
packages: "",
interiorAddons: "",
interiorColor: "blackInterior",
range: "extendedRange",
software: "",
}

const threeJSON = {
vehicle: 'Cyber Truck',
trim : 'base',
exteriorColor: 'whiteExterior',
wheels: "21AllSeason",
tonneau: "powerTonneau",
packages: "",
interiorAddons: "",
interiorColor: "blackInterior",
range: "standardRange",
software: "",
}

const fourJSON = {
vehicle: 'Cyber Truck',
trim : 'adventure',
exteriorColor: 'whiteExterior',
wheels: "20AllTerrain",
tonneau: "powerTonneau",
packages: "offroadPackage,matchingSpareTire",
interiorAddons: "",
interiorColor: "blackInterior",
range: "extendedRange",
software: "",
}

const fiveJSON = {
vehicle: 'Model X',
trim : 'base',
exteriorColor: 'whiteExterior',
wheels: "20AllTerrain",
tonneau: "manualTonneau",
packages: "",
interiorAddons: "",
interiorColor: "blackInterior",
range: "standardRange",
software: "",
}

const sixJSON = {
vehicle: 'Cyber Truck',
trim : 'adventure',
exteriorColor: 'whiteExterior',
wheels: "20AllTerrain",
tonneau: "powerTonneau",
packages: "offroadPackage,matchingSpareTire",
interiorAddons: "wirelessCharger",
interiorColor: "blackInterior",
range: "extendedRange",
software: "selfDrivingPackage",
}

const sevenJSON = {
vehicle: 'Model X',
trim : 'base',
exteriorColor: 'whiteExterior',
wheels: "21AllSeason",
tonneau: "powerTonneau",
packages: "",
interiorAddons: "",
interiorColor: "blackInterior",
range: "mediumRange",
software: "",
}

Vehicle model configurations

const configuration = {
meta: {
configurationId: '???',
storeId: 'US_SALES',
country: 'US',
version: '1.0',
effectiveDate: '???',
currency: 'USD',
locale: 'en-US',
availableLocales: ['en-US'],
},

defaults: {
basePrice: 50000,
deposit: 1000,
initialSelection: [
'adventure',
'whiteExterior',
'21AllSeason',
'powerTonneau',
'blackInterior',
'mediumRange',
],
},

groups: {
trim: {
name: { 'en-US': 'Choose trim' },
multiselect: false,
required: true,
options: ['base', 'adventure'],
},
exteriorColor: {
name: { 'en-US': 'Choose paint' },
multiselect: false,
required: true,
options: [
'whiteExterior',
'blueExterior',
'silverExterior',
'greyExterior',
'blackExterior',
'redExterior',
'greenExterior',
],
},
wheels: {
name: { 'en-US': 'Choose wheels' },
multiselect: false,
required: true,
options: ['21AllSeason', '20AllTerrain', '22Performance'],
},
tonneau: {
name: { 'en-US': 'Choose tonneau cover' },
multiselect: false,
required: true,
options: ['manualTonneau', 'powerTonneau'],
},
packages: {
name: { 'en-US': 'Choose upgrades' },
multiselect: true,
required: false,
options: ['offroadPackage', 'matchingSpareTire'],
},
interiorColor: {
name: { 'en-US': 'Choose interior' },
multiselect: false,
required: true,
options: ['greyInterior', 'blackInterior', 'greenInterior'],
},
interiorAddons: {
name: { 'en-US': 'Choose upgrade' },
multiselect: true,
required: false,
options: ['wirelessCharger'],
},
range: {
name: { 'en-US': 'Choose range' },
multiselect: false,
required: true,
options: ['standardRange', 'mediumRange', 'extendedRange'],
},
software: {
name: { 'en-US': 'Choose upgrade' },
multiselect: true,
required: false,
options: ['selfDrivingPackage'],
},
specs: {
name: { 'en-US': 'Specs overview *' },
attrs: {
description: {
'en-US':
"* Options, specs and pricing may change as we approach production. We'll contact you to review any updates to your preferred build.",
},
},
multiselect: false,
required: false,
options: ['acceleration', 'power', 'towing', 'range'],
},
},

options: {
base: {
name: { 'en-US': 'Base' },
attrs: {
description: { 'en-US': 'Production begins 2022' },
},
visual: true,
price: 0,
},
adventure: {
name: { 'en-US': 'Adventure' },
attrs: {
description: { 'en-US': 'Production begins 2021' },
},
visual: true,
price: 10000,
},

standardRange: {
name: { 'en-US': 'Standard' },
attrs: {
description: { 'en-US': '230+ miles' },
},
price: 0,
},
mediumRange: {
name: { 'en-US': 'Medium' },
attrs: {
description: { 'en-US': '300+ miles' },
},
price: 3000,
},
extendedRange: {
name: { 'en-US': 'Extended' },
attrs: {
description: { 'en-US': '400+ miles' },
},
price: 8000,
},

greenExterior: {
name: { 'en-US': 'Adirondack Green' },
attrs: {
imageUrl: '/public/images/configurationOptions/exteriorcolors/green.svg',
},
visual: true,
price: 2000,
},
blueExterior: {
name: { 'en-US': 'Trestles Blue' },
attrs: {
imageUrl: '/public/images/configurationOptions/exteriorcolors/blue.svg',
},
visual: true,
price: 1000,
},
whiteExterior: {
name: { 'en-US': 'Arctic White' },
attrs: {
imageUrl: '/public/images/configurationOptions/exteriorcolors/white.svg',
},
visual: true,
price: 0,
},
silverExterior: {
name: { 'en-US': 'Silver Gracier' },
attrs: {
imageUrl: '/public/images/configurationOptions/exteriorcolors/silver.svg',
},
visual: true,
price: 1000,
},
blackExterior: {
name: { 'en-US': 'Cosmic Black' },
attrs: {
imageUrl: '/public/images/configurationOptions/exteriorcolors/black.svg',
},
visual: true,
price: 1000,
},
redExterior: {
name: { 'en-US': 'Red Rocks' },
attrs: {
imageUrl: '/public/images/configurationOptions/exteriorcolors/red.svg',
},
visual: true,
price: 2000,
},
greyExterior: {
name: { 'en-US': 'Antracite Grey' },
attrs: {
imageUrl: '/public/images/configurationOptions/exteriorcolors/grey.svg',
},
visual: true,
price: 1000,
},

'21AllSeason': {
name: { 'en-US': '21" Cast Wheel - All Season' },
attrs: {
imageUrl: '/public/images/configurationOptions/wheels/twentyone.svg',
},
visual: true,
price: 0,
},
'20AllTerrain': {
name: { 'en-US': '20" Forged Wheel - All Terrain' },
attrs: {
imageUrl: '/public/images/configurationOptions/wheels/twenty.svg',
},
visual: true,
price: 0,
},
'22Performance': {
name: { 'en-US': '22" Cast Wheel - Performance' },
attrs: {
imageUrl: '/public/images/configurationOptions/wheels/twentytwo.svg',
},
visual: true,
price: 2000,
},

manualTonneau: {
name: { 'en-US': 'Manual' },
attrs: {
description: { 'en-US': 'Description here' },
},
price: 0,
},
powerTonneau: {
name: { 'en-US': 'Powered' },
attrs: {
description: { 'en-US': 'Description here' },
},
price: 0,
},

blackInterior: {
name: { 'en-US': 'Black' },
attrs: {
imageUrl: '/public/images/configurationOptions/interiorcolors/black.svg',
},
visual: true,
price: 0,
},
greyInterior: {
name: { 'en-US': 'Grey' },
attrs: {
imageUrl: '/public/images/configurationOptions/interiorcolors/grey.svg',
},
visual: true,
price: 1000,
},
greenInterior: {
name: { 'en-US': 'Green' },
attrs: {
imageUrl: '/public/images/configurationOptions/interiorcolors/green.svg',
},
visual: true,
price: 2000,
},

offroadPackage: {
name: { 'en-US': 'Off-Road' },
attrs: {
description: { 'en-US': 'Lorem ipsum dolor sit amet.' },
imageUrl: '/public/images/configurationOptions/packages/offroad.png',
},
visual: true,
price: 5000,
},
matchingSpareTire: {
name: { 'en-US': 'Matching Spare Tire' },
attrs: {
description: { 'en-US': 'Full sized tire' },
imageUrl: '/public/images/configurationOptions/packages/spare.png',
},
price: 500,
},

wirelessCharger: {
name: { 'en-US': 'Wireless charger' },
attrs: {
description: { 'en-US': 'Lorem ipsum dolor sit amet.' },
imageUrl: '/public/images/configurationOptions/packages/wireless.png',
},
price: 100,
},
selfDrivingPackage: {
name: { 'en-US': 'Autonomy' },
attrs: {
description: { 'en-US': 'Lorem ipsum dolor sit amet.' },
imageUrl: '/public/images/configurationOptions/packages/autonomy.png',
},
price: 7000,
},

acceleration: {
name: { 'en-US': '0 - 60 mph' },
attrs: {
units: { 'en-US': 'sec' },
decimals: 1,
},
value: 3.4,
},
power: {
name: { 'en-US': 'Horsepower' },
attrs: {
units: { 'en-US': 'hp' },
},
value: 750,
},
towing: {
name: { 'en-US': 'Towing' },
attrs: {
units: { 'en-US': 'lbs' },
},
value: 10000,
},
range: {
name: { 'en-US': 'Range' },
attrs: {
units: { 'en-US': 'mi' },
},
value: 400,
},
}
};
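
To show how the configuration might be used, here is a small Python sketch that prices a suggested build. It assumes the total is `basePrice` plus the `price` of every selected option, which the configuration implies but does not state. Only the prices needed for the example are copied over, and `build_price` is a hypothetical helper.

```python
# Prices copied from the configuration above; the table is deliberately
# incomplete and lists only the options used in the example.
BASE_PRICE = 50000
OPTION_PRICES = {
    "adventure": 10000,
    "whiteExterior": 0,
    "22Performance": 2000,
    "powerTonneau": 0,
    "blackInterior": 0,
    "extendedRange": 8000,
    "offroadPackage": 5000,
    "matchingSpareTire": 500,
}


def build_price(selection: dict) -> int:
    """Sum the base price and the price of each selected option (assumed rule)."""
    total = BASE_PRICE
    for group, value in selection.items():
        if group == "vehicle":
            continue  # the vehicle name is not a priced option
        # Multiselect groups are stored as comma-separated strings; "" means none
        for option in filter(None, value.split(",")):
            total += OPTION_PRICES[option]
    return total


one = {
    "vehicle": "Model X",
    "trim": "adventure",
    "exteriorColor": "whiteExterior",
    "wheels": "22Performance",
    "tonneau": "powerTonneau",
    "packages": "",
    "interiorAddons": "",
    "interiorColor": "blackInterior",
    "range": "extendedRange",
    "software": "",
}
print(build_price(one))
```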

Public datasets

Instagram: 16539 images from 972 Instagram influencers (link)
TechCrunchPosts: (link)
Tweets: (link)

Primary (available for academic use only; access requires a university affiliation)

A Dataset and Benchmarks for Multimedia Social Analysis

Secondary (low-quality data; it is unclear whether any of it can be used)

Hacker News Posts
TechCrunch Posts Compilation
Instagram image data HowTo
Flickr Large with likes and comments
The Images of Groups Dataset
http://www.multimediaeval.org/datasets/
The InstaCities1M Dataset
Multimodal Meme Classification: Identifying Offensive Content in Image and Text
Understanding Police Social Media Usage Through Posts and Tweets

Topic clusters text

- Model X
  - I like model X
  - I want to buy model X
  - Model X is my favorite car
  - Tesla Modelx is my dream
  - modelx tesla love
- Cyber Truck
  - I like Cyber Truck
  - I want to buy Cyber Truck
  - Cyber Truck is my favorite car
  - Tesla Cyber Truck is my dream
  - CyberTruck tesla love
- Adventure
  - I like adventure
  - sports i play
  - i went on trip
  - I travels a lot
  - car adventure
- Exterior Color White
  - I like white color
  - White is my fav
  - white car love
  - I like white exterior
- Exterior Color Black
  - I like Black color
  - Black is my fav
  - Black car love
  - I like Black exterior
- Exterior Color Blue
  - I like Blue color
  - Blue is my fav
  - Blue car love
  - I like Blue exterior
- Exterior Color Green
  - I like Green color
  - Green is my fav
  - Green car love
  - I like Green exterior
- Exterior Color Red
  - I like Red color
  - Red is my fav
  - Red car love
  - I like Red exterior
- Exterior Color Grey
  - I like Grey color
  - Grey is my fav
  - Grey car love
  - I like Grey exterior
- Exterior Color Silver
  - I like Silver color
  - Silver is my fav
  - Silver car love
  - I like Silver exterior
- Self driving
  - I like self driving technology
  - selfDrivingPackage
  - selfDrivingtech love
  - self drive is my fav
  - self driving car is amazing
Celebs

Logical Reasoning

If I implicitly rate pictures of blue car, that means I might prefer a blue car.
If I like posts of self-driving, that means I might prefer a self-driving option.

Scope

Scope 1

/img/content-blog-raw-blog-vehicle-suggestions-untitled-2.png

Scope 2

media content categories: text and images

platforms: facebook, twitter and instagram

implicit rating categories: like, comment, share

columns: userid, timestamp, platform, type, content, rating
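
As a rough illustration of the record layout and of the reasoning that rated content hints at a preference, the sketch below stores interactions with the six columns listed above and counts how often each keyword appears in the content a user rated. The sample records, the keyword list and the helper `preference_counts` are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Interaction:
    userid: str
    timestamp: str
    platform: str  # facebook, twitter or instagram
    type: str      # text or image
    content: str   # post text, or a caption/description for an image
    rating: str    # like, comment or share


def preference_counts(interactions, keywords):
    """Count, per user, how many rated items mention each keyword."""
    counts = {}
    for item in interactions:
        text = item.content.lower()
        user_counts = counts.setdefault(item.userid, Counter())
        for keyword in keywords:
            if keyword in text:
                user_counts[keyword] += 1
    return counts


# Hypothetical sample records
log = [
    Interaction("u1", "2021-10-01T09:00:00", "instagram", "image", "Blue car at sunset", "like"),
    Interaction("u1", "2021-10-02T18:30:00", "twitter", "text", "Self-driving demo was amazing", "share"),
    Interaction("u1", "2021-10-03T12:00:00", "facebook", "text", "Another blue paint job", "comment"),
]
counts = preference_counts(log, ["blue", "self-driving"])
print(counts["u1"])
```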

Model Framework

Model framework 1

Convert user's natural language query into vector using Universal Sentence Embedding model
Create a product specs binary matrix based on different categories
Find TopK similar query vectors using cosine distance
For each TopK vector, Find TopM product specs using interaction table weights
For each TopM specification, find TopN similar specs using binary matrix
Show all the qualified product specifications

Model framework 2

Seed data: 10 users with ground-truth persona, media content and implicit ratings
Inflated data: 10 users with media content and implicit ratings
media content → Implicit rating (A)
media content → feature vector (B) + (A) → weighted pooling → similar users (C)
media content → QA model → slot filling → global pooling → item associations (D)
(C) → content-based filtering → item recommendations → (D) → top-k recommendations
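
The weighted pooling step can be sketched as follows. The rating weights are an assumption (a share is taken to signal more interest than a comment or a like), and the two-dimensional vectors stand in for real content embeddings.

```python
# Assumed weights for the three implicit rating categories
RATING_WEIGHTS = {"like": 1.0, "comment": 2.0, "share": 3.0}


def weighted_pool(vectors, ratings):
    """Average content vectors, weighting each by its implicit rating."""
    weights = [RATING_WEIGHTS[r] for r in ratings]
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, vectors)) / total
        for i in range(dim)
    ]


# Two toy content vectors: the shared item pulls the user vector toward itself
user_vector = weighted_pool([[1.0, 0.0], [0.0, 1.0]], ["like", "share"])
print(user_vector)
```

The resulting user vector can then be compared with other users' vectors to find similar users.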

User selection

People who are connected to social media community of electric vehicles
Seed users are those who already have an electric vehicle
Inflated users are those who do not own an EV but are inclined to purchase one
Users with a presence on all three sites, or at least two
List of common users
https://www.facebook.com/gossman
https://www.facebook.com/ryanm06
https://www.facebook.com/chad.turner.7146
https://www.facebook.com/cjacobs05
https://www.facebook.com/MafiaAllen
https://www.facebook.com/rahul.mii.33
https://www.facebook.com/francisco.chavira.547
https://www.facebook.com/JayTheillest74
https://www.facebook.com/michael.creighton20
https://www.facebook.com/darryl.grigggardening
https://www.facebook.com/4X4Aus/
https://www.instagram.com/minnyrc/
https://www.instagram.com/warnerbu7lt/
List of celebs
1. https://en.wikipedia.org/wiki/List_of_most-followed_Instagram_accounts
2. https://en.wikipedia.org/wiki/List_of_most-followed_Twitter_accounts
3. https://en.wikipedia.org/wiki/List_of_most-followed_Facebook_pages
  ['Jennifer Lopez', 'Virat Kohli', 'Ariana Grande', 'Dwayne Johnson', 'Kylie Jenner', 'Lionel Messi', 'LeBron James', 'Beyoncé', 'Justin Bieber', 'Akshay Kumar', 'Demi Lovato', 'Kendall Jenner', 'Nicki Minaj', 'Khloé Kardashian', 'Kim Kardashian', 'Gigi Hadid', 'Ellen DeGeneres', 'Deepika Padukone', 'Rihanna', 'Shakira', 'Cardi B', 'Eminem', 'Drake', 'Chris Brown', 'Maluma', 'Vin Diesel', 'Ronaldinho', 'Kevin Hart', 'Emma Watson', 'Shawn Mendes', 'Neymar', 'Justin Timberlake', 'Katy Perry', 'Donald Trump', 'Lady Gaga', 'Amitabh Bachchan', 'Selena Gomez', 'Lil Wayne', 'Elon Musk', 'Britney Spears', 'Jimmy Fallon', 'Bill Gates', 'Ariana Grande', 'Miley Cyrus', 'Oprah Winfrey', 'Cristiano Ronaldo', 'Salman Khan', 'Shah Rukh Khan', 'Niall Horan']

Model framework 3

User-User Similarity (clustering)

User → Media content → Embedding → Average pooling
Cosine Similarity of user's social vector with other user's social vector

User-Item Similarity (reranking)

User → Implicit Rating on media content M → M's correlation with item features
Item features: familySize
Cosine Similarity of user's social vector with item's feature vector
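
The reranking step could look like the sketch below, which orders candidate vehicles by cosine similarity between a user's social vector and each item's feature vector. The two feature dimensions (family size and off-road interest) and all numeric values are made up for illustration; the source names only `familySize` as an item feature.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(user_vec, item_vecs):
    """Return item names sorted from most to least similar to the user."""
    return sorted(item_vecs, key=lambda name: cosine(user_vec, item_vecs[name]), reverse=True)


# Hypothetical feature space: [familySize, offroad]
items = {"Model X": [1.0, 0.0], "Cyber Truck": [0.0, 1.0]}
family_user = [1.0, 0.2]
print(rerank(family_user, items))
```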

Model framework 4

/img/content-blog-raw-blog-vehicle-suggestions-untitled-3.png

Text → Prepare → Vectorize → Average → Similar Users

Image → Prepare → Vectorize → Average → Similar Users

Text → Prepare → QA → Slot filling

Image → Prepare → VQA → Slot filling

Image → Similar Image from users → Detailed enquiry

Model framework 5

Topic Clusters Text
Topic Clusters Image
Fetch raw text and images
Combine, Clean and Store text in text dataframe
Vectorize Texts
Cosine similarities of texts with topic clusters
Vectorize Images
Cosine similarities of images with topic clusters
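
A toy version of the text half of this pipeline is sketched below. It swaps the sentence embeddings for plain bag-of-words counts so that it runs with the standard library only; this is a stand-in, not the vectorizer used in the experiments. The seed phrases are taken from the topic clusters listed earlier, and `best_topic` is a hypothetical helper.

```python
import math
from collections import Counter

# Seed phrases copied from two of the topic clusters listed earlier
TOPIC_CLUSTERS = {
    "Exterior Color Blue": [
        "I like Blue color", "Blue is my fav", "Blue car love", "I like Blue exterior",
    ],
    "Self driving": [
        "I like self driving technology", "self drive is my fav", "self driving car is amazing",
    ],
}


def bow(text):
    """Bag-of-words vector: lowercase token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def topic_vector(phrases):
    """Pool all seed phrases of a cluster into one count vector."""
    total = Counter()
    for phrase in phrases:
        total.update(bow(phrase))
    return total


def best_topic(text):
    """Return the cluster whose pooled vector is closest to the text."""
    vec = bow(text)
    scores = {name: cosine(vec, topic_vector(p)) for name, p in TOPIC_CLUSTERS.items()}
    return max(scores, key=scores.get)


print(best_topic("my new blue car"))
```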

Experimental Setup

Experiment 1

import numpy as np
import pandas as pd
import tensorflow_hub as hub
from itertools import product
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics.pairwise import cosine_similarity

vehicle = ['modelX', 'cyberTruck']
trim = ['adventure', 'base']
exteriorColor = ['whiteExterior', 'blueExterior', 'silverExterior', 'greyExterior', 'blackExterior', 'redExterior', 'greenExterior']
wheels = ['20AllTerrain', '21AllSeason', '22Performance']
tonneau = ['powerTonneau', 'manualTonneau']
interiorColor = ['blackInterior', 'greyInterior', 'greenInterior']
range_options = ['standardRange', 'mediumRange', 'extendedRange']  # renamed to avoid shadowing the built-in range
packages = ['offroadPackage', 'matchingSpareTire', 'offroadPackage,matchingSpareTire', 'None']
interiorAddons = ['wirelessCharger', 'None']
software = ['selfDrivingPackage', 'None']

specs_cols = ['vehicle', 'trim', 'exteriorColor', 'wheels', 'tonneau', 'interiorColor', 'range', 'packages', 'interiorAddons', 'software']
# Every possible configuration, one row per combination
specs_frame = pd.DataFrame(list(product(vehicle, trim, exteriorColor, wheels, tonneau, interiorColor, range_options, packages, interiorAddons, software)),
                           columns=specs_cols)

# One-hot encode the catalogue; .toarray() densifies the encoder's default sparse output
enc = OneHotEncoder(handle_unknown='error')
specs = pd.DataFrame(enc.fit_transform(specs_frame).toarray())

specs_ids = specs.index.tolist()

query_list = ["I'm looking for a fast suv that I can go camping without worrying about recharging",
              "cheap red car that is able to go long distances",
              "i am looking for a daily driver that i can charge everyday, do not need any extras",
              "i like to go offroading a lot on my jeep and i want to do the same with the truck",
              "i want the most basic suv possible",
              "I want all of the addons", 
              "I have a big family and want to be able to take them around town and run errands without worrying about charging"]

queries = pd.DataFrame(query_list, columns=['query'])
query_ids = queries.index.tolist()

const_oneJSON = {
'vehicle': 'modelX',
'trim' : 'adventure',
'exteriorColor': 'whiteExterior',
'wheels': "22Performance",
'tonneau': "powerTonneau",
'packages': "None",
'interiorAddons': "None",
'interiorColor': "blackInterior",
'range': "extendedRange",
'software': "None",
}

const_twoJSON = {
'vehicle': 'cyberTruck',
'trim' : 'base',
'exteriorColor': 'whiteExterior',
'wheels': "21AllSeason",
'tonneau': "powerTonneau",
'packages': "None",
'interiorAddons': "None",
'interiorColor': "blackInterior",
'range': "extendedRange",
'software': "None",
}

const_threeJSON = {
'vehicle': 'cyberTruck',
'trim' : 'base',
'exteriorColor': 'whiteExterior',
'wheels': "21AllSeason",
'tonneau': "powerTonneau",
'packages': "None",
'interiorAddons': "None",
'interiorColor': "blackInterior",
'range': "standardRange",
'software': "None",
}

const_fourJSON = {
'vehicle': 'cyberTruck',
'trim' : 'adventure',
'exteriorColor': 'whiteExterior',
'wheels': "20AllTerrain",
'tonneau': "powerTonneau",
'packages': "offroadPackage,matchingSpareTire",
'interiorAddons': "None",
'interiorColor': "blackInterior",
'range': "extendedRange",
'software': "None",
}

const_fiveJSON = {
'vehicle': 'modelX',
'trim' : 'base',
'exteriorColor': 'whiteExterior',
'wheels': "20AllTerrain",
'tonneau': "manualTonneau",
'packages': "None",
'interiorAddons': "None",
'interiorColor': "blackInterior",
'range': "standardRange",
'software': "None",
}

const_sixJSON = {
'vehicle': 'cyberTruck',
'trim' : 'adventure',
'exteriorColor': 'whiteExterior',
'wheels': "20AllTerrain",
'tonneau': "powerTonneau",
'packages': "offroadPackage,matchingSpareTire",
'interiorAddons': "wirelessCharger",
'interiorColor': "blackInterior",
'range': "extendedRange",
'software': "selfDrivingPackage",
}

const_sevenJSON = {
'vehicle': 'modelX',
'trim' : 'base',
'exteriorColor': 'whiteExterior',
'wheels': "21AllSeason",
'tonneau': "powerTonneau",
'packages': "None",
'interiorAddons': "None",
'interiorColor': "blackInterior",
'range': "mediumRange",
'software': "None",
}

historical_data = pd.DataFrame([const_oneJSON, const_twoJSON, const_threeJSON, const_fourJSON, const_fiveJSON, const_sixJSON, const_sevenJSON])

# Encode the expected configurations, with columns in the order the encoder was fitted on
input_vecs = enc.transform(historical_data[specs_cols])

# Nearest catalogue row for the first expected configuration
idx = np.argsort(-cosine_similarity(input_vecs[:1], specs.values))[0, :1]
rslt = enc.inverse_transform(specs.iloc[idx].values)

# Map every query to the catalogue row closest to its expected configuration
interactions = pd.DataFrame({'query_id': queries.index.tolist()})
interactions['specs_id'] = np.argsort(-cosine_similarity(input_vecs, specs.values))[:, 0]

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" 
embed_model = hub.load(module_url)
def embed(texts):
  # texts: list of strings; the parameter is named to avoid shadowing the built-in input
  return embed_model(texts)
query_vecs = embed(queries['query'].tolist()).numpy()

_query = input('Please enter query: ') or 'i want the most basic suv possible'
_query_vec = embed([_query]).numpy()
# Closest historical query by cosine similarity
_match_qid = np.argsort(-cosine_similarity(_query_vec, query_vecs))[0, 0]
_match_sid = interactions.loc[interactions['query_id'] == _match_qid, 'specs_id'].values[0]
# Five catalogue configurations closest to the matched one
idx = np.argsort(-cosine_similarity([specs.iloc[_match_sid].values], specs.values))[0, :5]
results = enc.inverse_transform(specs.iloc[idx].values)
_temp = pd.DataFrame(results, columns=specs_cols)
print(_temp)

Experiment 2

Celeb Scraping

Facebook Scraping

/img/content-blog-raw-blog-vehicle-suggestions-untitled-4.png

Twitter Scraping

/img/content-blog-raw-blog-vehicle-suggestions-untitled-5.png

Dataframe

/img/content-blog-raw-blog-vehicle-suggestions-untitled-6.png

Insta Image Grid

/img/content-blog-raw-blog-vehicle-suggestions-untitled-7.png

User Text NER

/img/content-blog-raw-blog-vehicle-suggestions-untitled-8.png

Experiment 3

Topic model

Topic scores

/img/content-blog-raw-blog-vehicle-suggestions-untitled-9.png

JSON rules

/img/content-blog-raw-blog-vehicle-suggestions-untitled-10.png

Results and Discussion

API with 3 input fields - Facebook username, Twitter handle & Instagram username
The system automatically scrapes the user's publicly available text and images from these three social media platforms and returns a list of recommendations ordered from most to least preferred product

Web Scraping using Scrapy, BS4, and Selenium

October 1, 2021 · 4 min read

Sparsh Agarwal

Principal Developer

Handling single request & response by extracting a city’s weather from a weather site using Scrapy
Handling multiple request & response by extracting book details from a dummy online book store using Scrapy
Scrape the cover images of all the books from the website books.toscrape.com using Scrapy
Logging into Facebook using Selenium
Extract PM2.5 data from openaq.org using Selenium
Extract PM2.5 data from openaq.org using Selenium with Scrapy

Scrapy vs. Selenium

Selenium is an automation tool for testing web applications. It uses a WebDriver as an interface to control web pages from a programming language, which lets it handle dynamic web pages effectively. Selenium can extract data on its own, but with caveats: because it drives a full browser, it does not cope well with large volumes of data and is much slower than Scrapy, which handles large crawls with ease. The smart choice for dynamic pages that contain a lot of data is therefore to use the two together. Combining them is simple: let Selenium render the page, then pass the page source to a Scrapy Selector object. From there, Scrapy can parse the page and extract a large amount of data efficiently.

# SKELETON FOR COMBINING SELENIUM WITH SCRAPY
from scrapy import Selector
from selenium import webdriver
# Other Selenium and Scrapy imports
...
driver = webdriver.Chrome()
# Selenium tasks and actions to render the webpage with required content
selenium_response_text = driver.page_source
driver.quit()  # release the browser once the page source has been captured
new_selector = Selector(text=selenium_response_text)
# Scrapy tasks to extract data from Selector

Project tree

.
├── airQuality
│   ├── countries_list.json
│   ├── get_countries.py
│   ├── get_pm_data.py
│   ├── get_urls.py
│   ├── openaq_data.json
│   ├── openaq_scraper.py
│   ├── README.md
│   └── urls.json
├── airQualityScrapy
│   ├── LICENSE
│   ├── openaq
│   │   ├── countries_list.json
│   │   ├── openaq
│   │   │   ├── __init__.py
│   │   │   ├── items.py
│   │   │   ├── middlewares.py
│   │   │   ├── pipelines.py
│   │   │   ├── settings.py
│   │   │   └── spiders
│   │   ├── output.json
│   │   ├── README.md
│   │   ├── scrapy.cfg
│   │   └── urls.json
│   ├── performance_comparison
│   │   ├── performance_comparison
│   │   │   ├── __init__.py
│   │   │   ├── items.py
│   │   │   ├── middlewares.py
│   │   │   ├── pipelines.py
│   │   │   ├── settings.py
│   │   │   └── spiders
│   │   ├── README.md
│   │   ├── scrapy.cfg
│   │   ├── scrapy_output.json
│   │   └── selenium_scraper
│   │       ├── bts_scraper.py
│   │       ├── selenium_output.json
│   │       └── urls.json
│   └── README.md
├── books
│   ├── books
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── book_spider.py
│   │       ├── crawl_spider.py
│   │       └── __init__.py
│   ├── crawl_spider_output.json
│   ├── README.md
│   └── scrapy.cfg
├── booksCoverImage
│   ├── booksCoverImage
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── image_crawl_spider.py
│   │       └── __init__.py
│   ├── output.json
│   ├── path
│   │   └── to
│   │       └── store
│   ├── README.md
│   └── scrapy.cfg
├── etc
│   └── Selenium
│       ├── chromedriver.exe
│       ├── chromedriver_v87.exe
│       └── install.sh
├── facebook
│   └── login.py
├── gazpacho1
│   ├── data
│   │   ├── media.html
│   │   ├── ocr.html
│   │   ├── page.html
│   │   ├── static
│   │   │   └── stheno.mp4
│   │   └── table.html
│   ├── media
│   │   ├── euryale.png
│   │   ├── medusa.mp3
│   │   ├── medusa.png
│   │   ├── stheno.mp4
│   │   └── test.png
│   ├── scrap_login.py
│   ├── scrap_media.py
│   ├── scrap_ocr.py
│   ├── scrap_page.py
│   └── scrap_table.py
├── houzzdotcom
│   ├── houzzdotcom
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── crawl_spider.py
│   │       └── __init__.py
│   └── scrapy.cfg
├── media
│   └── test.png
├── README.md
├── scrapyPractice
│   ├── scrapy.cfg
│   └── scrapyPractice
│       ├── __init__.py
│       ├── items.py
│       ├── middlewares.py
│       ├── pipelines.py
│       ├── settings.py
│       └── spiders
│           └── __init__.py
└── weather
    ├── output.json
    ├── README.md
    ├── scrapy.cfg
    └── weather
        ├── __init__.py
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders
            ├── __init__.py
            └── weather_spider.py

35 directories, 98 files

For code, drop me a message on mail or LinkedIn.

Web Scraping with Gazpacho

October 1, 2021 · One min read

Sparsh Agarwal

Principal Developer

Using gazpacho to Download and Parse the Contents of a Website. Scrape the names of the three "Gorgons".

/img/content-blog-raw-blog-web-scraping-with-gazpacho-untitled.png

Using gazpacho and pandas to Retrieve the Contents of an HTML Table. Scrape the creature and habitat columns.

/img/content-blog-raw-blog-web-scraping-with-gazpacho-untitled-1.png

/img/content-blog-raw-blog-web-scraping-with-gazpacho-untitled-2.png

Using gazpacho and pytesseract to Parse the Contents of “Non-Text” Text Data. Extract the embedded text.

/img/content-blog-raw-blog-web-scraping-with-gazpacho-untitled-3.png

Using gazpacho and urllib to Retrieve and Download Images, Videos, and Audio Clips. Download the image, audio and video data.

/img/content-blog-raw-blog-web-scraping-with-gazpacho-untitled-4.png

Wellness tracker chatbot

October 1, 2021 · One min read

Sparsh Agarwal

Principal Developer

Problem Statement

A bot that logs daily wellness data to a spreadsheet (using the Airtable API) to help users keep track of their health goals. The assistant is connected to a messaging channel, Twilio, so users can talk to it via text message and WhatsApp.

Proposed Solution

RASA chatbot with Forms and Custom actions
Connect with the Airtable API to log records in a table
Connect with WhatsApp for user interaction
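
As a hedged sketch of the logging step, the snippet below builds (but does not send) a create-record request for Airtable's REST API, roughly what a Rasa custom action would do after the form is filled. The base ID, table name, API key and field names are placeholders, and `build_airtable_request` is a hypothetical helper rather than code from the repository.

```python
import json
import urllib.parse
import urllib.request


def build_airtable_request(base_id, table_name, api_key, fields):
    """Build a POST request that would create one record in an Airtable table."""
    url = f"https://api.airtable.com/v0/{base_id}/{urllib.parse.quote(table_name)}"
    body = json.dumps({"fields": fields}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Placeholder identifiers and hypothetical wellness fields
request = build_airtable_request(
    "appXXXXXXXX", "Wellness Log", "YOUR_API_KEY",
    {"sleep": "8 hours", "exercise": "30 min walk", "stress": "low"},
)
print(request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```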

Modeling

Delivery

https://github.com/sparsh-ai/chatbots/tree/master/wellnessTracker

Reference

https://www.udemy.com/course/rasa-for-beginners/learn/lecture/20746878#overview

What is Livestream Ecommerce

October 1, 2021 · 4 min read

Sparsh Agarwal

Principal Developer

/img/content-blog-raw-blog-what-is-livestream-ecommerce-untitled.png

Recent years have witnessed the rise of online live streaming. With the development of mobile phones, cameras, and high-speed internet, more and more users are able to broadcast their experiences in live streams on various social platforms, such as Facebook Live and YouTube Live. Live streaming has a variety of applications, including knowledge sharing, video gaming, and outdoor travel.

One of the most important scenarios is live streaming commerce, an increasingly popular form of online shopping that combines live streaming with e-commerce. Streamers introduce products and interact with their audiences, which greatly improves sales performance.

/img/content-blog-raw-blog-what-is-livestream-ecommerce-untitled-1.png

Livestream ecommerce is a business model in which retailers, influencers, or celebrities sell products and services via online video streaming where the presenter demonstrates and discusses the offering and answers audience questions in real-time.

/img/content-blog-raw-blog-what-is-livestream-ecommerce-untitled-2.png

Examples

https://media.nngroup.com/media/editor/2021/02/16/tiktok_livestream_compressed.mp4

During a livestream event hosted by Walmart on TikTok, users watched an influencer presenting various products such as a pair of jeans. Those interested in the jeans could tap the product listing shown at the bottom of the screen. They could also browse the list of products promoted during the livestream and purchase them without leaving the TikTok app. Viewers’ real-time comments appeared along the left-hand side of the livestream feed.

Advantages

Livestreams allow users to see products in detail and get their questions answered in real time
During livestream sessions, the hosts can show product details in close-up (left), give instructions of use for products like essential oils and cosmetic face masks (middle), or even show how a particular product, like the tea they’re selling, is made (right)
Livestreams greatly shorten consumers' decision-making time and boost sales volume
Expert streamers introduce and promote the products live, which makes the shopping process more interesting and convincing
Rich, real-time interaction between streamers and their audiences makes live streaming a new medium and a powerful marketing tool for e-commerce
Viewers can not only watch demonstrations of a product’s look and functions, but also ask the streamers to show different or specific views of the product in real time

Market

Livestream ecommerce has been surging dramatically in China. According to Forbes, the industry is estimated to earn $60 billion annually. In 2019, about 37 percent of the online shoppers in China (265 million people) made livestream purchases. During Taobao’s 2020 Singles’ Day Global Shopping Festival (November 11th), livestreams accounted for $6 billion in sales (twice the amount from the prior year).

Amazon has also launched its live platform, where influencers promote items and chat with potential customers. And Facebook and Instagram are exploring the integration between ecommerce and social media. For instance, the new Shop feature on Instagram allows users to browse products and place orders directly within Instagram — a form of social commerce.

Total GMV driven by live streaming reached USD 6 billion. Some quantitative research suggests that adopting live streaming in sales can increase online sales volume by 21.8%.

/img/content-blog-raw-blog-what-is-livestream-ecommerce-untitled-4.png

The Anatomy of a Livestream Session

/img/content-blog-raw-blog-what-is-livestream-ecommerce-untitled-5.png

A typical livestream session has the following basic components:

The video stream, where the host shows the products, talks about them, and answers questions from the audience. On Amazon Live, the stream occupies most of the screen space.
The list of products being promoted, with the product currently being shown highlighted. This list appears at the bottom of the Amazon video stream.
A chat area, where viewers can type questions and comments to interact with the host and other viewers. On Amazon Live, the chat area is to the right of the live stream.
A reaction button that users can tap to send reactions, displayed as animated emojis. On Amazon, the reaction button appears as a small star icon at the bottom right of the video stream.

Object detection with YOLO3

January 23, 2021 · 2 min read

Sparsh Agarwal

Principal Developer

Live app

This app can detect the 80 COCO classes using three different models: Caffe MobileNet SSD, Yolo3-tiny, and Yolo3. It can also detect faces using two different models: SSD Res10 and the OpenCV face detector. Yolo3-tiny can also detect fires.

/img/content-blog-raw-blog-object-detection-with-yolo3-untitled.png

/img/content-blog-raw-blog-object-detection-with-yolo3-untitled-1.png

Code

import streamlit as st
import cv2
from PIL import Image
import numpy as np
import os

from tempfile import NamedTemporaryFile
from tensorflow.keras.preprocessing.image import img_to_array, load_img

temp_file = NamedTemporaryFile(delete=False)

DEFAULT_CONFIDENCE_THRESHOLD = 0.5
DEMO_IMAGE = "test_images/demo.jpg"
MODEL = "model/MobileNetSSD_deploy.caffemodel"
PROTOTXT = "model/MobileNetSSD_deploy.prototxt.txt"

CLASSES = [
    "background",
    "aeroplane",
    "bicycle",
    "bird",
    "boat",
    "bottle",
    "bus",
    "car",
    "cat",
    "chair",
    "cow",
    "diningtable",
    "dog",
    "horse",
    "motorbike",
    "person",
    "pottedplant",
    "sheep",
    "sofa",
    "train",
    "tvmonitor",
]
COLORS = np.random.uniform(0, 255, size=(len(CLASSES), 3))

@st.cache
def process_image(image):
    blob = cv2.dnn.blobFromImage(
        cv2.resize(image, (300, 300)), 0.007843, (300, 300), 127.5
    )
    net = cv2.dnn.readNetFromCaffe(PROTOTXT, MODEL)
    net.setInput(blob)
    detections = net.forward()
    return detections

@st.cache
def annotate_image(
    image, detections, confidence_threshold=DEFAULT_CONFIDENCE_THRESHOLD
):
    # loop over the detections
    (h, w) = image.shape[:2]
    labels = []
    for i in np.arange(0, detections.shape[2]):
        confidence = detections[0, 0, i, 2]

        if confidence > confidence_threshold:
            # extract the index of the class label from the `detections`,
            # then compute the (x, y)-coordinates of the bounding box for
            # the object
            idx = int(detections[0, 0, i, 1])
            box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
            (startX, startY, endX, endY) = box.astype("int")

            # display the prediction
            label = f"{CLASSES[idx]}: {round(confidence * 100, 2)}%"
            labels.append(label)
            cv2.rectangle(image, (startX, startY), (endX, endY), COLORS[idx], 2)
            y = startY - 15 if startY - 15 > 15 else startY + 15
            cv2.putText(
                image, label, (startX, y), cv2.FONT_HERSHEY_SIMPLEX, 0.5, COLORS[idx], 2
            )
    return image, labels

def main():
  selected_box = st.sidebar.selectbox(
    'Choose one of the following',
    ('Welcome', 'Object Detection')
    )
    
  if selected_box == 'Welcome':
      welcome()
  if selected_box == 'Object Detection':
      object_detection() 

def welcome():
  st.title('Object Detection using Streamlit')
  st.subheader('A simple app for object detection')
  st.image('test_images/demo.jpg',use_column_width=True)

def object_detection():
  
  st.title("Object detection with MobileNet SSD")

  confidence_threshold = st.sidebar.slider(
    "Confidence threshold", 0.0, 1.0, DEFAULT_CONFIDENCE_THRESHOLD, 0.05)

  st.sidebar.multiselect("Select object classes to include",
  options=CLASSES,
  default=CLASSES
  )

  img_file_buffer = st.file_uploader("Upload an image", type=["png", "jpg", "jpeg"])

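  if img_file_buffer is not None:
      # Persist the upload so load_img can read it back from disk
      temp_file.write(img_file_buffer.getvalue())
      temp_file.flush()
      image = load_img(temp_file.name)
      # Keep pixel values in 0-255: blobFromImage already applies the
      # mean (127.5) and scale (0.007843), so dividing by 255 here would
      # push every input to roughly -1
      image = img_to_array(image).astype("uint8")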

  else:
      demo_image = DEMO_IMAGE
      image = np.array(Image.open(demo_image))

  detections = process_image(image)
  image, labels = annotate_image(image, detections, confidence_threshold)

  st.image(
      image, caption=f"Processed image", use_column_width=True,
  )

  st.write(labels)

main()

You can play with the live app here. Source code is available here on GitHub.
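The two numeric steps in the code above, input normalization and box decoding, can be checked without loading a model. The following is a minimal NumPy-only sketch, not the app's own code: `normalize_pixels`, `decode_detections`, and the `fake` array are hypothetical names and data introduced for illustration. It assumes the same `(1, 1, N, 7)` output layout that `annotate_image` indexes into, with box corners given as fractions of the image size.

```python
import numpy as np

# Scale and mean taken from the app's blobFromImage call.
SCALE = 0.007843
MEAN = 127.5


def normalize_pixels(pixels):
    """Mimic blobFromImage's arithmetic: subtract the mean, then scale."""
    return (np.asarray(pixels, dtype=np.float32) - MEAN) * SCALE


def decode_detections(detections, width, height, threshold=0.5):
    """Turn a (1, 1, N, 7) SSD output into (class_id, confidence, box) tuples.

    Each row is [image_id, class_id, confidence, x1, y1, x2, y2], where the
    box corners are fractions of the image width and height.
    """
    results = []
    for row in detections[0, 0]:
        confidence = float(row[2])
        if confidence <= threshold:
            continue  # drop weak detections, as annotate_image does
        # Scale fractional corners to pixel coordinates
        box = (row[3:7] * np.array([width, height, width, height])).astype(int)
        results.append((int(row[1]), confidence, tuple(int(v) for v in box)))
    return results


# Hypothetical network output: one confident hit (class 7) and one weak hit.
fake = np.array([[[
    [0, 7, 0.9, 0.1, 0.2, 0.5, 0.6],
    [0, 15, 0.3, 0.0, 0.0, 1.0, 1.0],
]]], dtype=np.float32)

print(normalize_pixels([0, 127.5, 255]))
print(decode_detections(fake, width=300, height=200))
```

With the scale and mean above, pixel values in 0-255 land in roughly -1 to 1, which is why the uploaded image should stay in the 0-255 range before it reaches `process_image`.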

MobileNet SSD Caffe Pre-trained model

January 19, 2020 · One min read

Sparsh Agarwal

Principal Developer

You can play with the live app here. Source code is available here on GitHub.

Live app

/img/content-blog-raw-mobilenet-ssd-caffe-pre-trained-model-untitled.png

Code

#------------------------------------------------------#
# Import libraries
#------------------------------------------------------#

import datetime
import urllib
import time
import cv2 as cv
import streamlit as st

from plugins import Motion_Detection
from utils import GUI, AppManager, DataManager

#------------------------------------------------------#
#------------------------------------------------------#

def imageWebApp(guiParam):
    """Load the selected image, run the chosen plugin on it, and display the result."""
    # Load the image according to the selected option
    conf = DataManager(guiParam)
    image = conf.load_image_or_video()
    
    # GUI
    switchProcessing = st.button('* Start Processing *')

    # Apply the selected plugin on the image
    bboxed_frame, output = AppManager(guiParam).process(image, True)

    # Display results
    st.image(bboxed_frame, channels="BGR",  use_column_width=True)

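def main():
    """Read the GUI parameters and run the selected image application."""
    # Get the parameters entered by the user from the GUI
    guiParam = GUI().getGuiParameters()

    # Run the image app only when a non-empty application is selected
    if guiParam['appType'] == 'Image Applications':
        # Compare strings with != ; `is not` tests identity, not equality
        if guiParam["selectedApp"] != 'Empty':
            imageWebApp(guiParam)

    else:
        # st.stop() halts the script run in current Streamlit releases
        # (older versions raised st.ScriptRunner.StopException instead)
        st.stop()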

#------------------------------------------------------#
#------------------------------------------------------#

if __name__ == "__main__":
    main()
