Tuesday, May 30, 2006

Scan to PDF in Ubuntu, with Beagle Searchable Meta Data

This bash script lets you scan directly to a PDF and then find your scanned PDFs with Beagle, not just by file name but also by the information (metadata) that you save with each PDF.

I developed this script because I really want a paperless desk, and I could not find an easy way to scan documents to PDF (and find them again!).

A picture's worth a thousand words...

I have a "launcher" on my desktop called "scan". I just click it and...

Select colour or grey (color or gray, for the Americans!)





Then press OK for each page of the PDF you want to make. (Cancel to finish)



Then give the PDF a file name... (jam recipe!)



Then add some metadata so Beagle can find it... jam, recipe, grannys, yum.



That's all!

Now search for "grannys jam" in Beagle.....



And there it is!!



Here is the script. It was put together pretty quickly, so please feel free to mess about with it, and post improvements back here!

Programs you'll need:
zenity
ps2pdf (part of Ghostscript)
pdftk
scanimage (in the sane-utils package)
convert (part of ImageMagick)
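
If you want to see what's missing before you start, a rough sketch like this should do it. The function name missing_tools is made up for this example, and the list includes convert (from ImageMagick) because the script calls it too.

```bash
#!/bin/bash
# missing_tools is a hypothetical helper, not part of the scan script.
# It takes program names as arguments and prints each one that is
# not found on the PATH, one name per line.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check everything the scan script calls.
missing=$(missing_tools zenity ps2pdf pdftk scanimage convert)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Please install:" $missing >&2
fi
```

If nothing is printed, all five programs were found.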

Copy the script into its own directory (it writes its temporary files there) and make it executable with chmod u+rwx. That's all!

-----------------------------------

#!/bin/bash

#scan to pdf with metadata, by Mac Jones, New Zealand
#http://maconstuff.blogspot.com/

#scan a batch

#decide grey or colour (gray or color for the Americans!)
colour=`zenity --list --title "Color or Gray?" --radiolist --column "-" --column "Scan" TRUE Gray FALSE Color`

a=0 #page counter
cont=1 #should we continue?

until [ "$cont" -eq 0 ] #keep scanning until the cont variable becomes zero.
do
echo -n "$a "
let "a+=1"
if zenity --question --text "OK to scan a page, Cancel to finish, Page=$a" --title "Scanning pages"
then
cont=1
scanimage --format pnm --resolution 150 --mode $colour > "$a.pnm"
else
cont=0
fi

done # No surprises, so far.

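#convert the raw files to postscript
#(ls -v sorts the page numbers numerically, so 10.pnm comes after 9.pnm, not after 1.pnm)
convert -density 150 $(ls -v *.pnm) out.ps | zenity --progress --auto-close --title "Converting to Postscript"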

#convert the postscript to pdf
ps2pdf out.ps out.pdf | zenity --progress --auto-close --title "Converting Postscript to PDF"

#remove raw scan files
rm *.pnm

#remove old ps files
rm out.ps

#beep to get attention after processing
echo -e "\a"

#add the metadata and file name.
#this meta data can be searched from Beagle in Ubuntu.

#echo "Please enter a name for the PDF file (** no .pdf on end)"
nm=`zenity --entry --text "Enter file name, (no .pdf on the end)" --title "File Name?"`

#echo "Please enter Metadata for searching"
meta=`zenity --entry --text "Meta data for searching" --entry-text="$nm" --title "Meta Data for Searching"`

echo "InfoKey: Producer" > tmp
echo "InfoValue: $meta" >> tmp
echo "InfoKey: Keywords" >> tmp
echo "InfoValue: $meta" >> tmp
echo "InfoKey: Title" >> tmp
echo "InfoValue: $nm" >> tmp

#update the metadata
pdftk out.pdf update_info tmp output "$nm.pdf"

#rm metadata file and pdf
rm tmp
rm out.pdf

zenity --info --text="All done, $nm.pdf is ready!" --title "Thanks!"

--------------------
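
If you'd like to play with the metadata step on its own, here is a minimal sketch of it. The helper name write_pdf_info is invented for this example; it writes the same InfoKey/InfoValue pairs that the script hands to pdftk, minus the Producer entry. The pdftk lines are left as comments because they assume pdftk is installed and that a scanned out.pdf already exists.

```bash
#!/bin/bash
# write_pdf_info is a hypothetical helper, not part of the script above.
# Usage: write_pdf_info INFO_FILE TITLE KEYWORDS
# Writes Title and Keywords entries in the InfoKey/InfoValue format
# that "pdftk update_info" reads.
write_pdf_info() {
    local info_file=$1 title=$2 keywords=$3
    # printf repeats its format string for each key/value pair.
    printf 'InfoKey: %s\nInfoValue: %s\n' \
        Title "$title" \
        Keywords "$keywords" > "$info_file"
}

info_file=$(mktemp)
write_pdf_info "$info_file" "jam recipe" "jam, recipe, grannys, yum"
cat "$info_file"

# With pdftk installed and a scanned out.pdf in the current directory:
#   pdftk out.pdf update_info "$info_file" output "jam recipe.pdf"
#   pdftk "jam recipe.pdf" dump_data | grep -A1 InfoKey

rm -f "$info_file"
```

The dump_data line is a handy way to confirm the keywords really made it into the finished PDF.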

9 Comments:

Blogger Rajiv Vyas said...

Now I have to find a driver for my Epson scanner. Thanks,

9:03 pm  
Blogger Jason said...

I have been looking to be able to do this for a LONG time, and I found this script. It works great. I've even added a few things:

- Producer option for your name
- Output folder, with temp files in /tmp
- Device selection (since scanimage picks up my TV tuner as the default scanner), if more than one device exists
- Resolution options
- Page size options (since my default was like post card size on my scanner), including a "custom size"
- Progress bar for scanning (however, does not show "progress" at the moment)
- And an option to restart with a new document

I am not a bash script expert, and I just whipped this together. There may be better ways to do things... just let me know

Jason Greb
jgreb@electronerdz.com


#!/bin/bash

#scan to pdf with metadata, by Mac Jones, New Zealand
#http://maconstuff.blogspot.com/

producer="Your Name" # Written to the PDF's Producer field; comment this line out to use the search metadata instead
outputfolder="/home/user/ScannedDocuments/" # Folder for the finished PDFs; keep the trailing slash

#scan a batch

#select a scanner if more than one exists, or just uncomment the next line (with DEVICE replaced by a name from scanimage -L) and comment out the detection block below it
#scannerdevice="DEVICE"

devices=`scanimage -L | awk '{print $2}' | cut -c 2-50 | tr -d =\'=`
devicenum=`echo $devices | wc -w`
if [ "$devicenum" -gt 1 ]; then
devices=`echo $devices | sed -e 's/ / FALSE /g'`
scannerdevice=`zenity --list --title "Device?" --radiolist --column "-" --column "Device" TRUE $devices`
else
scannerdevice="$devices"
fi

restart=1
until [ "$restart" -eq 0 ]; do

#ask for the resolution
resolution=`zenity --list --title "Resolution?" --radiolist --column "-" --column "Resolution" FALSE 100 TRUE 150 FALSE 300 FALSE 350 FALSE 600`

#ask for the page size, mm on most scanners, yours may differ, run scanimage --help -d DEVICE
qpagesize=`zenity --list --title "Page Size?" --radiolist --column "-" --column "Page Size" TRUE Letter FALSE Legal FALSE A4 FALSE Custom`
if [ "$qpagesize" == "Letter" ]; then
pagesize="-x 215 -y 280"
elif [ "$qpagesize" == "Legal" ]; then
pagesize="-x 215 -y 355"
elif [ "$qpagesize" == "A4" ]; then
pagesize="-x 210 -y 297"
elif [ "$qpagesize" == "Custom" ]; then
pagex=`zenity --entry --text "Enter the page width (mm)" --title "Width"`
pagey=`zenity --entry --text "Enter the page height (mm)" --title "Height"`
pagesize="-x $pagex -y $pagey"
fi

#decide grey or colour (gray or color for the Americans!)
colour=`zenity --list --title "Color or Gray?" --radiolist --column "-" --column "Scan Mode" TRUE Gray FALSE Color`

a=0 #page counter
cont=1 #should we continue?

until [ "$cont" -eq 0 ] #keep scanning until the cont variable becomes zero.
do
echo -n "$a "
let "a+=1"
if zenity --question --text "OK to scan a page, Cancel to finish, Page=$a" --title "Scanning pages"
then
cont=1
scanimage -d "$scannerdevice" --format pnm --resolution "$resolution" $pagesize --mode "$colour" > "/tmp/$a.pnm" | zenity --progress --auto-close --title "Scanning page..."
else
cont=0
fi

done # No surprises, so far.

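#convert the raw files to postscript
#(ls -v sorts the page numbers numerically, so 10.pnm comes after 9.pnm, not after 1.pnm)
convert -density $resolution $(ls -v /tmp/*.pnm) /tmp/out.ps | zenity --progress --auto-close --title "Converting to Postscript..."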

#convert the postscript to pdf
ps2pdf /tmp/out.ps /tmp/out.pdf | zenity --progress --auto-close --title "Converting Postscript to PDF..."

#remove raw scan files
rm /tmp/*.pnm

#remove old ps files
rm /tmp/out.ps

#beep to get attention after processing
echo -e "\a"

#add the metadata and file name.
#this meta data can be searched from Beagle in Ubuntu.

#echo "Please enter a name for the PDF file (** no .pdf on end)"
nm=`zenity --entry --text "Enter file name, (no .pdf on the end)" --title "File Name?"`

#echo "Please enter Metadata for searching"
meta=`zenity --entry --text "Meta data for searching" --entry-text="$nm" --title "Meta Data for Searching"`

echo "InfoKey: Producer" > /tmp/pdfdata.tmp
if [ "$producer" ]; then
echo "InfoValue: $producer" >> /tmp/pdfdata.tmp
else
echo "InfoValue: $meta" >> /tmp/pdfdata.tmp
fi
echo "InfoKey: Keywords" >> /tmp/pdfdata.tmp
echo "InfoValue: $meta" >> /tmp/pdfdata.tmp
echo "InfoKey: Title" >> /tmp/pdfdata.tmp
echo "InfoValue: $nm" >> /tmp/pdfdata.tmp

#update the metadata
pdftk /tmp/out.pdf update_info /tmp/pdfdata.tmp output "$outputfolder$nm.pdf"

#rm metadata file and pdf
rm /tmp/pdfdata.tmp
rm /tmp/out.pdf

if zenity --question --text "$nm.pdf has been created! Would you like to scan another document?" --title "Thanks!"; then
restart=1
else
restart=0
fi

done
exit 0

4:18 am  
Blogger ktraglin said...

Where can I get "scanimage"? I'm using Dapper.

2:16 am  
Blogger Peter_D said...

You can find scanimage for Dapper in the sane-utils package:

http://packages.ubuntulinux.org/dapper/graphics/sane-utils

I used Synaptic, but you could use apt-get I guess. I also had to install ImageMagick. Thanks for a very cool script Mac.

12:55 pm  
Blogger Scott said...

When I use the script everything seems to work, but I end up with a blank .pdf document. Could someone help since I'd really like to use this process. I'm using an HP PSC 1350 All-in-One scanner. Thanks.

5:44 am  
Blogger Dr Lawrence M Fox said...

This script works GREAT with flatbed scanners. What adaptation do I need to make to use it with an ADF (Automatic Document Feeder)?

Thanks a lot!

Larry

6:56 am  
Blogger Jeff said...

Try gscan2pdf here:

https://sourceforge.net/project/showfiles.php?group_id=174140

Lets you preview your scan before you produce the PDF, reorder pages, all in a nice GTK-based GUI.

Jeff

8:23 pm  
Blogger hasuf said...

Thanks for the GREAT script (including Jason's mods).

I've modified the script further to scan to TIFF and to have tesseract do an OCR pass on the image. After a verification window, this text is then saved as keyword metadata.

The system works so well that I ordered a document scanner to get rid of all my office clutter.

Through all this, I noticed that beagle doesn't actually index the keyword metadata... so I'm wondering how the original demo worked. I submitted a patch to index pdf keywords here, and apparently it got accepted. Hopefully it'll make it in 0.2.18...

http://bugzilla.gnome.org/show_bug.cgi?id=463003

I've got the patch available (for 0.2.13 and 0.2.16... the filter api changed somewhere in there) in case anyone wants it...

10:30 am  
