Great thanks to Darshil Parmar for putting together such an easy-to-follow project! I made adjustments to the Kafka consumer and the Glue crawler parts to fit my use case.
https://www.youtube.com/watch?v=KerNf0NANMo
The EC2 instance should run Linux and use the t2.micro (free-tier) instance type to make life easier
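If you prefer to launch the instance from code instead of the console, a minimal boto3 sketch looks like this (the AMI ID, key pair name, and region are placeholders, not values from the original setup):
import boto3

# Placeholder values: swap in your own AMI ID, key pair, and region
ec2 = boto3.resource("ec2", region_name="us-east-1")
instances = ec2.create_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",  # an Amazon Linux AMI in your region
    InstanceType="t2.micro",          # free-tier eligible
    KeyName="my-key-pair",            # existing key pair for SSH access
    MinCount=1,
    MaxCount=1,
)
print(instances[0].id)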
-
Install Kafka
wget https://archive.apache.org/dist/kafka/3.3.1/kafka_2.13-3.3.1.tgz
tar -xvf kafka_2.13-3.3.1.tgz
-
Install Java (Kafka runs on JVM)
sudo yum install java-1.8.0-amazon-corretto.x86_64
-
Start Zookeeper
./kafka_2.13-3.3.1/bin/zookeeper-server-start.sh ./kafka_2.13-3.3.1/config/zookeeper.properties
-
Limit the Kafka heap size so it fits in the t2.micro's 1 GB of RAM
export KAFKA_HEAP_OPTS="-Xmx256M -Xms128M"
-
Start Kafka
./kafka_2.13-3.3.1/bin/kafka-server-start.sh ./kafka_2.13-3.3.1/config/server.properties
-
Make Kafka reachable through the public IP
vim ./kafka_2.13-3.3.1/config/server.properties
# Change this part:
# Listener name, hostname and port the broker will advertise to clients.
# If not set, it uses the value for "listeners".
advertised.listeners=PLAINTEXT://ec2-54-237-239-178.compute-1.amazonaws.com:9092
-
Restart ZooKeeper and Kafka
-
Allow inbound traffic in the AWS console
(Instance Details → Security → Inbound Rules → {security group name} (e.g. launch-wizard-1) → Edit inbound rules)
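The same rule can also be added with boto3 if you prefer scripting it; this is only a sketch, and the security group ID and CIDR below are placeholders:
import boto3

ec2 = boto3.client("ec2")
# Open the Kafka port (9092) to your own IP only; sg-xxxxxxxx and the CIDR are placeholders
ec2.authorize_security_group_ingress(
    GroupId="sg-xxxxxxxx",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 9092,
        "ToPort": 9092,
        "IpRanges": [{"CidrIp": "203.0.113.10/32"}],
    }],
)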
-
Create a topic
cd kafka_2.13-3.3.1
bin/kafka-topics.sh --create --topic demo_test --bootstrap-server 54.237.239.178:9092 --replication-factor 1 --partitions 1
-
Start a console producer (in one session)
./kafka_2.13-3.3.1/bin/kafka-console-producer.sh --topic demo_test --bootstrap-server ec2-44-206-255-156.compute-1.amazonaws.com:9092
-
Start a console consumer (in another session)
./kafka_2.13-3.3.1/bin/kafka-console-consumer.sh --topic demo_test --bootstrap-server ec2-44-206-255-156.compute-1.amazonaws.com:9092
from json import dumps
from time import sleep
import random

from kafka import KafkaProducer

# Define Producer
producer = KafkaProducer(bootstrap_servers=["ec2-44-206-255-156.compute-1.amazonaws.com:9092"],
                         value_serializer=lambda x: dumps(x).encode('utf-8'))

# df: the stock market dataset loaded into a pandas DataFrame beforehand
while True:
    # Wait for a random time between 0 and 3 seconds
    sleep(random.uniform(0, 3))
    # Sample a random row as a dict to simulate streaming data
    generated_data = df.sample(1).to_dict(orient='records')[0]
    # Send it to the Kafka topic
    producer.send("demo_test", value=generated_data)
    # Flush the buffered messages out to the broker
    producer.flush()
============On a separate File============
from json import loads
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'demo_test',
    bootstrap_servers=["ec2-44-206-255-156.compute-1.amazonaws.com:9092"],
    value_deserializer=lambda x: loads(x.decode('utf-8')))

for c in consumer:
    print(c.value)
-
IAM → Users → Add user
-
Check Access Key - Programmatic access
-
Attach existing policies directly → add AmazonS3FullAccess
-
Create User
-
IAM → Users → {user_name} → Access Keys → Create Access Keys → Command Line Interface (CLI)
-
Download .csv: keep the secret key & the access key
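For reference, the same user setup can be scripted with boto3; this is a sketch (the user name is a placeholder, and the returned secret should be stored carefully):
import boto3

iam = boto3.client("iam")
user_name = "kafka-s3-writer"  # placeholder user name

iam.create_user(UserName=user_name)
# Attach the managed S3 policy used in the console steps above
iam.attach_user_policy(
    UserName=user_name,
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
)
# Create the programmatic access keys; keep the secret key safe
keys = iam.create_access_key(UserName=user_name)["AccessKey"]
print(keys["AccessKeyId"], keys["SecretAccessKey"])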
-
Setting up the AWS CLI allows the s3fs library to access the S3 directories as a local directory
-
aws configure
→ add the keys from the previous step
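To confirm the credentials work before writing the consumer-to-S3 code below, a quick s3fs check (the bucket name is the one used later in this post; create it in the S3 console first):
from s3fs import S3FileSystem

s3 = S3FileSystem()
# Should list the bucket contents without a permissions error
print(s3.ls("kafka-stock-data-trial"))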
from datetime import datetime

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from s3fs import S3FileSystem

# Create S3 filesystem object (it picks up the credentials set via `aws configure`)
s3 = S3FileSystem()

# Initialize a list to hold the incoming messages
tmp_data = []

# `consumer` is the KafkaConsumer defined above
for count, message in enumerate(consumer):
    # Append the message payload to the list
    tmp_data.append(message.value)
    # Every 10 messages, save the accumulated data to a parquet file
    if count != 0 and count % 10 == 0:
        # Create a pandas DataFrame from the accumulated data
        df = pd.DataFrame(tmp_data)
        # Convert the DataFrame to a PyArrow table
        table = pa.Table.from_pandas(df)
        # Record the current time to use in the file name
        current_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S%Z")
        with s3.open(f"s3://kafka-stock-data-trial/stock_market_data_{current_timestamp}.parquet", "wb") as file:
            print(table)
            pq.write_table(table, file)
        # Reset the list
        tmp_data = []
parquet tables look like this:
and the files are saved as below:
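To spot-check one of the saved files, pandas can read it straight from S3 (a sketch; the file name below is a placeholder, and s3fs handles the s3:// path):
import pandas as pd

# Placeholder key: use one of the file names listed in your bucket
check_df = pd.read_parquet("s3://kafka-stock-data-trial/stock_market_data_20230101_120000.parquet")
print(check_df.head())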
-
Glue → Data Catalog → Crawlers → Create Crawler
-
Enter Crawler Name → Next
-
Select Not yet for "Is your data already mapped to Glue tables?"
→ Add a data source
→ select your bucket for S3 Path
→ Add an S3 data source
→ Next
-
Choose IAM role → (if none) click Create new IAM role
→ Next
-
Add database → select the created database → click Advanced options
-
Crawler Schedule: On demand
-
Create Crawler
-
Click Run Crawler and wait until the crawler finishes
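The crawler can also be started and polled from code; here is a boto3 sketch (the crawler name is a placeholder for whatever you entered in the console):
import time
import boto3

glue = boto3.client("glue")
crawler_name = "stock-market-crawler"  # placeholder: the name entered in the console

glue.start_crawler(Name=crawler_name)
# Poll until the crawler returns to the READY state
while True:
    state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
    print(state)
    if state == "READY":
        break
    time.sleep(30)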