***This is an unpaid internship at this time and is suitable for recent Master's graduates who want to become Senior Data Scientists.***
Company Overview:
Flow Global Software Technologies, LLC, operating in the Information Technology (IT) sector, is a cutting-edge enterprise AI company that engages in the design, engineering, marketing, sales, and 5-star support of cloud-based SaaS AI sales platforms, built on patent-pending artificial intelligence, deep learning, and other proprietary technologies. The company's first product, Flow Turbo™, is a future-generation SaaS AI sales prospecting platform designed to maximize the day-to-day productivity of B2B sales reps within B2B outbound, inbound, and inside sales organizations. The company also provides world-class, award-winning customer support, professional services, and advisory services. The company is headquartered in Austin, Texas and is registered in Delaware.
Position Overview:
Flow is seeking exceptionally advanced, experienced, dedicated, and committed Senior Data Scientist Interns to engage in the end-to-end development, implementation, delivery, optimization, and large-scale cloud infrastructure deployment of NLP-driven named entity recognition and resolution, infinite lead generation pipelines, and high-dimensional knowledge graph construction. This role is designed for individuals who possess an expert-level understanding of context-aware Named Entity Recognition (NER), probabilistic graph-based data fusion, transformer architectures, real-time data extraction, large-scale data associativity resolution, and automated entity validation pipelines. Senior Data Scientist Interns will architect, develop, and optimize multi-tiered NLP and graph-based entity disambiguation frameworks, ensuring precise, scalable, ultra-high-volume data ingestion and contact enrichment for lead generation pipelines. The role involves engineering large-scale, dynamically evolving, self-healing entity relationship graphs with direct applications in high-precision bulk phone number and contact information collection, real-time entity clustering, and probabilistic entity matching across all-source datasets. Candidates must exhibit deep expertise in Java and Python development, leveraging cost-minimized, high-efficiency, multi-threaded, and GPU-accelerated NLP workflows to automate, refine, and scale lead generation pipelines to transformative dimensions, which is vital to the success of salespeople and Flow's AI platforms.
The ideal candidate will be responsible for architecting, training, fine-tuning, and deploying transformer-based NLP models with an emphasis on multi-modal Named Entity Recognition (NER), knowledge graph embeddings, and cross-context entity validation for lead generation pipelines. This includes implementing hierarchical, context-sensitive NER pipelines utilizing pre-trained SpaCy models, custom entity linking algorithms, and transformer architectures such as BERT, RoBERTa, XLNet, DistilBERT, and Sentence-BERT. The candidate will be expected to engineer multi-phase entity resolution frameworks, leveraging fuzzy matching, probabilistic distance functions, and rule-based entity refinement to normalize and enrich structured and unstructured datasets. You will deploy context-aware approximate nearest neighbor (ANN) search algorithms for rapid similarity computation across high-dimensional entity representations, optimizing for zero-error entity classification and deduplication across multi-source lead generation pipelines. Additionally, you will be responsible for constructing domain-specific, dynamically evolving NLP models tailored for high-precision extraction of full names, phone numbers, email addresses, and organization-to-contact relationships from unstructured and semi-structured data sources.
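By way of illustration, the sketch below pairs a pre-trained SpaCy NER model with fuzzy matching from the rapidfuzz library to normalize extracted organization mentions against a reference list. The model choice, reference list, and matching threshold are illustrative assumptions, not Flow's production configuration.

```python
# Minimal NER + fuzzy normalization sketch (illustrative only).
# Assumes: pip install spacy rapidfuzz
#          python -m spacy download en_core_web_sm
import spacy
from rapidfuzz import fuzz, process

nlp = spacy.load("en_core_web_sm")

# Hypothetical canonical organization list for fuzzy normalization.
KNOWN_ORGS = ["Acme Corporation", "Globex Inc", "Initech LLC"]

def extract_and_normalize(text: str, threshold: int = 85):
    """Extract PERSON/ORG entities, fuzzy-matching ORGs to canonical names."""
    records = []
    for ent in nlp(text).ents:
        if ent.label_ == "ORG":
            match = process.extractOne(ent.text, KNOWN_ORGS,
                                       scorer=fuzz.token_sort_ratio)
            canonical = match[0] if match and match[1] >= threshold else ent.text
            records.append(("ORG", ent.text, canonical))
        elif ent.label_ == "PERSON":
            records.append(("PERSON", ent.text, ent.text))
    return records

print(extract_and_normalize("Jane Doe is VP of Sales at Acme Corp."))
```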
Candidates must possess advanced distributed web scraping, data extraction, and real-time data ingestion expertise, employing multi-threaded, distributed browser automation frameworks such as Puppeteer, Playwright, and Selenium, together with parsing libraries such as BeautifulSoup4, optimized for high-throughput, multi-region dynamic content extraction from JavaScript-heavy, AJAX-rendered, and obfuscated DOM structures. You will design XPath-based heuristics and regex-powered parsers engineered for real-time DOM navigation, adaptive field extraction, and context-aware entity association. You must implement dynamic proxy rotation strategies, browser fingerprinting mitigation, and stealth-mode scraping methodologies to circumvent detection mechanisms and maximize high-value entity retrieval at scale. The candidate will be responsible for developing low-latency, serverless distributed web crawling and scraping architectures with Kafka-backed real-time streaming pipelines to ingest, validate, and enrich petabyte-scale datasets in real time. Additionally, the role demands mastery in multi-source entity disambiguation, applying fuzzy logic-based associative clustering, graph-based entity fusion, and probabilistic deduplication techniques to eliminate low-confidence records and erroneous entity matches.
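As a simplified illustration, the Playwright sketch below navigates a JavaScript-rendered page through an optional proxy and pulls contact names via an XPath heuristic. The URL, selector, and proxy address are hypothetical placeholders.

```python
# Minimal dynamic-content scraping sketch with Playwright (illustrative only).
# Assumes: pip install playwright && playwright install chromium
from typing import Optional
from playwright.sync_api import sync_playwright

def scrape_contact_names(url: str, proxy: Optional[str] = None) -> list[str]:
    with sync_playwright() as p:
        launch_args = {"headless": True}
        if proxy:
            launch_args["proxy"] = {"server": proxy}  # hook for proxy rotation
        browser = p.chromium.launch(**launch_args)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait out AJAX rendering
        # XPath heuristic over hypothetical contact-card markup.
        names = page.locator(
            "xpath=//div[contains(@class, 'contact')]//h3"
        ).all_text_contents()
        browser.close()
        return names

if __name__ == "__main__":
    print(scrape_contact_names("https://example.com/team"))
```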
This role requires deep expertise in large-scale graph-based data structures, knowledge graph embeddings, and Graph Neural Networks (GNNs) for high-precision entity matching and contact enrichment. The candidate will engineer scalable, distributed graph databases using Neo4j, JanusGraph, and ArangoDB, optimized for real-time, high-fanout entity traversal and recursive associative relationship resolution. You will implement heterogeneous knowledge graph construction methodologies, leveraging Graph Attention Networks (GATs), Relational Graph Convolutional Networks (R-GCNs), and GraphSAGE to construct automated, self-healing, and continuously evolving entity relationship models. The candidate must design probabilistic entity fusion scripts to resolve ambiguities across all-source datasets, utilizing contextual similarity measures such as Levenshtein distance, Jaccard similarity, cosine similarity, and MinHash locality-sensitive hashing. You will apply open-source, pre-trained GNN-based entity disambiguation frameworks to eliminate duplicate contact records, resolve conflicting entity relationships, and validate high-confidence associations between disparate datasets. Additionally, you will be responsible for optimizing knowledge graph traversal algorithms, ensuring sub-millisecond latency for entity resolution queries at petabyte scale.
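To make the similarity side of entity fusion concrete, here is a minimal pairwise scoring sketch that blends normalized Levenshtein similarity (via rapidfuzz) with token-level Jaccard similarity. The weights and the merge threshold are illustrative assumptions.

```python
# Minimal probabilistic entity-fusion scoring sketch (illustrative only).
# Assumes: pip install rapidfuzz
import re
from rapidfuzz.distance import Levenshtein

def tokens(s: str) -> set:
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"\w+", s.lower()))

def jaccard(a: set, b: set) -> float:
    """Token-level Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def fusion_score(name_a: str, name_b: str,
                 w_lev: float = 0.6, w_jac: float = 0.4) -> float:
    """Weighted blend of normalized Levenshtein and token Jaccard similarity."""
    lev = Levenshtein.normalized_similarity(name_a.lower(), name_b.lower())
    jac = jaccard(tokens(name_a), tokens(name_b))
    return w_lev * lev + w_jac * jac

pair = ("John A. Smith", "Smith, John")
score = fusion_score(*pair)
print(pair, round(score, 3), "merge" if score >= 0.8 else "keep separate")
```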
Senior Data Scientist Interns will construct high-performance, distributed ETL pipelines using Apache Spark, Apache Flink, and Apache Kafka Streams, engineered for low-latency, high-volume data ingestion, transformation, and anomaly detection in infinite-scale lead generation workflows. You will develop optimized Approximate Nearest Neighbor (ANN) retrieval mechanisms, leveraging HNSW, FAISS, and ScaNN for real-time, billion-scale entity similarity search and matching. You must implement Luhn’s Algorithm and heuristic-based probabilistic weighting models, ensuring precise textual summarization, deduplication, and high-confidence relevance scoring for extracted lead data. You will also be responsible for developing real-time, event-driven entity validation and enrichment frameworks, enabling multi-region, horizontally scalable lead generation pipelines.
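For the ANN retrieval piece, the scaled-down sketch below builds a FAISS HNSW index over random stand-in embeddings and runs a top-k similarity search. The dimensionality, HNSW graph parameter, and data are assumptions for demonstration, not production settings.

```python
# Scaled-down ANN entity-similarity search sketch with FAISS (illustrative only).
# Assumes: pip install faiss-cpu numpy
import numpy as np
import faiss

dim = 128
rng = np.random.default_rng(0)
# Random stand-in for real entity embeddings.
entity_vectors = rng.random((10_000, dim)).astype("float32")

index = faiss.IndexHNSWFlat(dim, 32)  # HNSW index, 32 links per node
index.add(entity_vectors)

queries = entity_vectors[:5]               # reuse a few rows as queries
distances, ids = index.search(queries, 3)  # top-3 nearest entities each
print(ids)
```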
Candidates must demonstrate expertise in end-to-end cloud-based data pipeline automation, leveraging event-driven serverless architectures utilizing AWS Lambda, Google Cloud Functions, Azure Functions, and open-source serverless technologies to automate entity ingestion, classification, and validation. Senior Data Scientist Interns must be able to develop high-performance REST, GraphQL, and gRPC-based microservices, interfacing with distributed knowledge graphs and entity resolution engines to support real-time lead generation workflows. The candidate will be responsible for multi-cloud Kubernetes orchestration with Helm-based package management, ensuring horizontally scalable deployment of NLP and GNN-based entity resolution frameworks. Senior Data Scientist Interns must also implement Terraform- and Ansible-driven Infrastructure as Code (IaC), enforcing fully declarative, reproducible, and auto-scaling infrastructure deployment models across AWS, Azure, GCP, Oracle Cloud, and self-hosted cloud environments.
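For a flavor of the event-driven serverless layer, below is a minimal AWS Lambda-style handler that validates an ingested contact record before it enters the pipeline. The event shape and the validation rules are hypothetical, not Flow's schema.

```python
# Minimal event-driven contact-validation handler sketch (illustrative only).
# Follows the AWS Lambda Python handler convention; event shape is hypothetical.
import json
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # coarse email shape check

def lambda_handler(event, context):
    """Validate an ingested contact record and return a confidence flag."""
    record = json.loads(event.get("body", "{}"))
    email_ok = bool(EMAIL_RE.match(record.get("email", "")))
    digits = re.sub(r"\D", "", record.get("phone", ""))
    phone_ok = len(digits) >= 10  # assumes 10+ digit phone numbers
    return {
        "statusCode": 200,
        "body": json.dumps({"record": record, "valid": email_ok and phone_ok}),
    }
```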
This role requires an ultra-elite level of technical expertise, encompassing full-stack NLP engineering, multi-modal entity resolution, large-scale graph database optimization, and real-time event-driven data extraction. Candidates should already hold a Master's degree in Computer Science, Data Science, NLP, or Graph Theory, with a proven professional track record of developing high-performance data science solutions. Candidates must commit to a minimum of 30 hours per week and to staying with the company for at least 6 full months.
***MUST BE ABLE TO COMMIT TO STAYING AT THE COMPANY FOR A MINIMUM OF 6 MONTHS.***
Key Responsibilities:
Qualifications:
Benefits:
Note:
This internship offers an exciting opportunity to gain hands-on experience in advanced data science within a high-pressure, innovative environment. Candidates must be self-motivated, proactive, and capable of delivering high-quality results independently. The internship provides valuable exposure to cutting-edge technologies and real-world software development practices, making it an ideal fit for aspiring senior data scientists.
***This is an unpaid internship at this time and is suitable for recent Master's graduates who want to become Senior Data Scientists.***
Please send resumes to services_admin@flowai.tech